# Item Analysis

A basic principle of all item analysis is to see if the individual questions in your composite scale “hang together”. We often do item analysis when building a new composite scale.

https://www.r-bloggers.com/2016/08/five-ways-to-calculate-internal-consistency/



### Import, Clean, Reverse Data

For this post, we’ll use data from a Big 5 personality measure that is freely available from Personality Tests. You can download the data yourself from the URL used in the code below, or simply run the code, which handles the download and saves the data as a pandas DataFrame called df:



In [None]:
#importing dataset

import requests
import zipfile
import io
import pandas as pd

# Download the zip file
url = "http://personality-testing.info/_rawdata/BIG5.zip"
response = requests.get(url)
response.raise_for_status()  # Raise an exception for bad status codes

# Extract the data file from the zip archive
with zipfile.ZipFile(io.BytesIO(response.content)) as zf:
    with zf.open("BIG5/data.csv") as data_file:
        df = pd.read_csv(data_file, sep="\t")

print(df.head(10))

# Now you have the data in a pandas DataFrame called 'df'
# You can proceed with your analysis using pandas functionalities

   race  age  engnat  gender  hand  source country  E1  E2  E3  ...  O1  O2  \
0     3   53       1       1     1       1      US   4   2   5  ...   4   1   
1    13   46       1       2     1       1      US   2   2   3  ...   3   3   
2     1   14       2       2     1       1      PK   5   1   1  ...   4   5   
3     3   19       2       2     1       1      RO   2   5   2  ...   4   3   
4    11   25       2       2     1       2      US   3   1   3  ...   3   1   
5    13   31       1       2     1       2      US   1   5   2  ...   4   2   
6     5   20       1       2     1       5      US   5   1   5  ...   3   1   
7     4   23       2       1     1       2      IN   4   3   5  ...   3   1   
8     5   39       1       2     3       4      US   3   1   5  ...   3   3   
9     3   18       1       2     1       5      US   1   4   2  ...   4   2   

   O3  O4  O5  O6  O7  O8  O9  O10  
0   3   1   5   1   4   2   5    5  
1   3   3   2   3   3   1   3    2  
2   5   1   5   1  

In [None]:
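#keep only the first 500 participants
sample = df.head(500)

#keep only the extraversion items (E1 to E10)
#.copy() gives an independent DataFrame, so later column assignments do not trigger a SettingWithCopyWarning
data = sample[['E1', 'E2', 'E3', 'E4', 'E5', 'E6', 'E7', 'E8', 'E9', 'E10']].copy()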

print(data.head(10))


   E1  E2  E3  E4  E5  E6  E7  E8  E9  E10
0   4   2   5   2   5   1   4   3   5    1
1   2   2   3   3   3   3   1   5   1    5
2   5   1   1   4   5   1   1   5   5    1
3   2   5   2   4   3   4   3   4   4    5
4   3   1   3   3   3   1   3   1   3    5
5   1   5   2   4   1   3   2   4   1    5
6   5   1   5   1   5   1   5   4   4    1
7   4   3   5   3   5   1   4   3   4    3
8   3   1   5   1   5   1   5   2   5    3
9   1   4   2   5   2   4   1   4   1    5


Here is a list of the extraversion items that people are rating from 1 = Disagree to 5 = Agree:

E1 I am the life of the party.

E2 I don’t talk a lot.

E3 I feel comfortable around people.

E4 I keep in the background.

E5 I start conversations.

E6 I have little to say.

E7 I talk to a lot of different people at parties.

E8 I don’t like to draw attention to myself.

E9 I don’t mind being the center of attention.

E10 I am quiet around strangers.
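Items E2, E4, E6, E8 and E10 are worded in the introverted direction, so agreeing with them indicates lower extraversion. On a 1 to 5 scale, reverse scoring amounts to subtracting each response from 6. The sketch below illustrates this on made-up responses; the `toy` data frame and the `SCALE_MIN` / `SCALE_MAX` names are hypothetical and not part of the dataset.

```python
import pandas as pd

# Hypothetical responses on a 1-5 scale (not taken from the Big 5 dataset).
toy = pd.DataFrame({"E1": [4, 2, 5], "E2": [2, 2, 1]})

SCALE_MIN, SCALE_MAX = 1, 5

# Reverse-score E2: 1 <-> 5, 2 <-> 4, 3 stays 3.
toy["E2_reversed"] = (SCALE_MIN + SCALE_MAX) - toy["E2"]

print(toy)
```

Note that this arithmetic shortcut assumes every response lies between 1 and 5. Any out-of-range code would be shifted rather than left alone, so an explicit lookup is the safer choice when such codes may be present.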

In [None]:
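#reverse coding (1=disagree to 5=agree)
#E2, E4, E6, E8 and E10 are worded in the introverted direction: agreeing with them indicates low extraversion
#so we reverse code those five items (1<->5, 2<->4, 3 unchanged) so that a higher score always means more extraverted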

def reverse_score(x):
    if x == 1:
        return 5
    elif x == 2:
        return 4
    elif x == 3:
        return 3
    elif x == 4:
        return 2
    elif x == 5:
        return 1
    else:
        return x  # Handle unexpected values (e.g., NaN)

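cols_to_reverse = ['E2', 'E4', 'E6', 'E8', 'E10']
# Series.map applies reverse_score element by element (DataFrame.applymap is deprecated in recent pandas)
data[cols_to_reverse] = data[cols_to_reverse].apply(lambda col: col.map(reverse_score))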

print(data.head(10))


   E1  E2  E3  E4  E5  E6  E7  E8  E9  E10
0   4   4   5   4   5   5   4   3   5    5
1   2   4   3   3   3   3   1   1   1    1
2   5   5   1   2   5   5   1   1   5    5
3   2   1   2   2   3   2   3   2   4    1
4   3   5   3   3   3   5   3   5   3    1
5   1   1   2   2   1   3   2   2   1    1
6   5   5   5   5   5   5   5   2   4    5
7   4   3   5   3   5   5   4   3   4    3
8   3   5   5   5   5   5   5   4   5    3
9   1   2   2   1   2   2   1   2   1    1




We’ve now got a data frame of responses with each column being an item (scored in the correct direction) and each row being a participant. Let’s get started!



## Average inter-item correlation


1. Run a correlation matrix on all items in your dataset.
2. Take the mean of the correlations between each item and the others (excluding each item’s correlation with itself). This gives you the average inter-item correlation.
3. Evaluate which items are more strongly correlated with the other items.
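As a rough illustration of these steps, the sketch below computes the average inter-item correlation on a small made-up data set in two ways: per item (mean of each item's correlations with the others) and overall (mean of the unique off-diagonal correlations). The `toy` data and all variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical responses for three items (not from the Big 5 dataset).
toy = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [2, 1, 4, 3, 5],
    "C": [1, 3, 2, 5, 4],
})

corr = toy.corr().to_numpy()

# Overall average: mean of the unique item pairs (upper triangle, diagonal excluded).
upper = corr[np.triu_indices_from(corr, k=1)]
overall_mean = upper.mean()

# Per-item average: blank out the diagonal, then average each column.
off_diag = np.where(np.eye(len(corr), dtype=bool), np.nan, corr)
per_item_mean = np.nanmean(off_diag, axis=0)

print(per_item_mean)
print(overall_mean)
```

Because every item is paired with the same number of other items, the mean of the per-item averages equals the overall average.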


In [None]:
#step 1: run a correlation matrix on all items in the dataset

corr_matrix = data.corr()

print(corr_matrix)

           E1        E2        E3        E4        E5        E6        E7  \
E1   1.000000  0.452889  0.500233  0.523752  0.537856  0.365783  0.636062   
E2   0.452889  1.000000  0.479203  0.554911  0.591700  0.569459  0.473167   
E3   0.500233  0.479203  1.000000  0.493029  0.616185  0.329619  0.568452   
E4   0.523752  0.554911  0.493029  1.000000  0.512350  0.471474  0.499934   
E5   0.537856  0.591700  0.616185  0.512350  1.000000  0.499664  0.620543   
E6   0.365783  0.569459  0.329619  0.471474  0.499664  1.000000  0.372577   
E7   0.636062  0.473167  0.568452  0.499934  0.620543  0.372577  1.000000   
E8   0.449812  0.379180  0.417710  0.450576  0.385078  0.331025  0.402892   
E9   0.528037  0.395857  0.475309  0.463138  0.485178  0.328002  0.528329   
E10  0.490886  0.448787  0.500074  0.523423  0.552519  0.413774  0.517827   

           E8        E9       E10  
E1   0.449812  0.528037  0.490886  
E2   0.379180  0.395857  0.448787  
E3   0.417710  0.475309  0.500074  
E4   0.4

In [None]:
# step 2: obtain the average correlation of each item with all others by computing the means for each column


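# but first, the diagonal (each item's correlation with itself) is 1.00, so we need to make it NA

import numpy as np

# Replace the diagonal with NaN
# mask() returns a new DataFrame, which avoids writing into corr_matrix.values (read-only under pandas copy-on-write)
corr_matrix = corr_matrix.mask(np.eye(len(corr_matrix), dtype=bool))

# Calculate the average correlation for each item
average_correlations = corr_matrix.mean(skipna=True)  # Skip NaN values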

# Display the results
print(average_correlations)


# overall average inter-item correlation
print(average_correlations.mean())

E1     0.498368
E2     0.482795
E3     0.486646
E4     0.499176
E5     0.533452
E6     0.409042
E7     0.513309
E8     0.427019
E9     0.473190
E10    0.481450
dtype: float64
0.4804446310889225


In [None]:
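# step 3: evaluate which items are more strongly correlated with the other items

# check whether the average inter-item correlation is > .2 (here it is about .48)
# E5 has the highest average correlation with the other items (about .53); E6 has the lowest (about .41)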

## Cronbach's Alpha

Cronbach's alpha is another way to evaluate how well the items in a scale hang together.

Assessment: .6+ is good, .8+ is excellent. A high enough alpha (more than .6) tells us that the scores on our items tend to covary.
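For intuition, alpha can also be computed by hand from the textbook formula: alpha = k / (k - 1) * (1 - sum of item variances / variance of the total score), where k is the number of items. The sketch below is a minimal version of that formula; the function name `cronbach_alpha_manual` and the `toy` data are hypothetical, and it is not meant to replace a library implementation.

```python
import pandas as pd

def cronbach_alpha_manual(items: pd.DataFrame) -> float:
    """Raw Cronbach's alpha for a DataFrame with one column per item."""
    items = items.dropna()                           # use complete cases only
    k = items.shape[1]                               # number of items
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of the summed score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical data: three identical items, so they covary perfectly.
toy = pd.DataFrame({
    "q1": [1, 2, 3, 4, 5],
    "q2": [1, 2, 3, 4, 5],
    "q3": [1, 2, 3, 4, 5],
})

alpha_toy = cronbach_alpha_manual(toy)
print(alpha_toy)
```

With perfectly redundant items, as in this toy example, alpha reaches its maximum of 1 (up to floating-point rounding).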

In [None]:
# install package

!pip install pingouin==0.5.3

Collecting pingouin==0.5.3
  Downloading pingouin-0.5.3-py3-none-any.whl.metadata (1.2 kB)
Collecting pandas-flavor>=0.2.0 (from pingouin==0.5.3)
  Downloading pandas_flavor-0.6.0-py3-none-any.whl.metadata (6.3 kB)
Collecting outdated (from pingouin==0.5.3)
  Downloading outdated-0.2.2-py2.py3-none-any.whl.metadata (4.7 kB)
Collecting littleutils (from outdated->pingouin==0.5.3)
  Downloading littleutils-0.2.4-py3-none-any.whl.metadata (679 bytes)
Downloading pingouin-0.5.3-py3-none-any.whl (198 kB)
Downloading pandas_flavor-0.6.0-py3-none-any.whl (7.2 kB)
Downloading outdated-0.2.2-py2.py3-none-any.whl (7.5 kB)
Downloading littleutils-0.2.4-py3-none-any.whl (8.1 kB)
Installing collected packages: littleutils, outdated, pandas-flavor, pingouin
Successfully installed littleutils-0.2.4 outdated-0.2.2 pandas-flavor-0.6.0 pingouin-0.5.3


In [None]:
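#pg.cronbach_alpha is called on df below, so point df at the reverse-scored extraversion items
#(df previously held the full downloaded dataset)

df = data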

In [None]:
import pingouin as pg

# Calculate Cronbach's alpha
alpha = pg.cronbach_alpha(data=df)

# Display the result
print(alpha)

(0.9022248126779574, array([0.889, 0.915]))
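The two measures in this post are closely related. The standardized form of alpha can be obtained from the average inter-item correlation alone: alpha_std = k * r / (1 + (k - 1) * r), where k is the number of items and r is the average inter-item correlation. The sketch below plugs in the values printed earlier (k = 10 items, r of about .48); the variable names are hypothetical.

```python
# Values taken from the outputs above.
k = 10                          # number of extraversion items
mean_r = 0.4804446310889225     # overall average inter-item correlation

# Standardized alpha from the average inter-item correlation.
standardized_alpha = (k * mean_r) / (1 + (k - 1) * mean_r)

print(round(standardized_alpha, 3))  # prints 0.902
```

The standardized value treats all items as having equal variance, so it need not match the raw alpha exactly, but here the two agree to about three decimal places.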
