# Useful links

* https://medium.com/uxness/survey-research-ways-to-clean-survey-data-before-analysis-52c1c6309754
* https://www.statology.org/cronbachs-alpha-in-python/

# Data info

This data was collected through an interactive on-line version of the *Generic Conspiracist Beliefs Scale* in 2016.
See Brotherton, Robert, Christopher C. French, and Alan D. Pickering. *"Measuring belief in conspiracy theories: the generic conspiracist beliefs scale."* Frontiers in psychology 4 (2013).

Visitors completed the test primarily for personal amusement.
At the end of the test, but before the results were displayed, users were asked if they would be willing to complete an additional survey and allow their responses to be saved for research. Only users who agreed are included in this dataset.

Individuals with age < 13 were not recorded.

The responses to the GCBS are stored in the following variables (question numbers match the items in Table A1 of Brotherton et al., 2013):

*Q1 - Q15*

The time spent (in ms) answering each question was also recorded and is stored in the variables

*E1 - E15*

The following elapsed times were also recorded:

* *introelapse*		The time spent on the introduction/landing page (in seconds)
* *testelapse*		The time spent on the GCBS questions
* *surveyelapse*	The time spent answering the rest of the demographic and survey questions

The *Ten Item Personality Inventory* was administered (see Gosling, S. D., Rentfrow, P. J., & Swann, W. B., Jr. (2003). *A Very Brief Measure of the Big Five Personality Domains.* Journal of Research in Personality, 37, 504-528.):

* *TIPI1*	Extraverted, enthusiastic.
* *TIPI2*	Critical, quarrelsome.
* *TIPI3*	Dependable, self-disciplined.
* *TIPI4*	Anxious, easily upset.
* *TIPI5*	Open to new experiences, complex.
* *TIPI6*	Reserved, quiet.
* *TIPI7*	Sympathetic, warm.
* *TIPI8*	Disorganized, careless.
* *TIPI9*	Calm, emotionally stable.
* *TIPI10*	Conventional, uncreative.

The TIPI items were rated "I see myself as:" _____ such that

1 = Disagree strongly
2 = Disagree moderately
3 = Disagree a little
4 = Neither agree nor disagree
5 = Agree a little
6 = Agree moderately
7 = Agree strongly


The following items were presented as a check-list and subjects were instructed "In the grid below, check all the words whose definitions you are sure you know":

* *VCL1*	boat
* *VCL2*	incoherent
* *VCL3*	pallid
* *VCL4*	robot
* *VCL5*	audible
* *VCL6*	cuivocal
* *VCL7*	paucity
* *VCL8*	epistemology
* *VCL9*	florted
* *VCL10*	decide
* *VCL11*	pastiche
* *VCL12*	verdid
* *VCL13*	abysmal
* *VCL14*	lucid
* *VCL15*	betray
* *VCL16*	funny

A value of 1 is checked, 0 means unchecked. The words at VCL6, VCL9, and VCL12 are not real words and can be used as a validity check.

A bunch more questions were then asked:


* *education*			"How much education have you completed?", 1=Less than high school, 2=High school, 3=University degree, 4=Graduate degree
* *urban*				"What type of area did you live when you were a child?", 1=Rural (country side), 2=Suburban, 3=Urban (town, city)
* *gender*				"What is your gender?", 1=Male, 2=Female, 3=Other
* *engnat*				"Is English your native language?", 1=Yes, 2=No
* *age*					"How many years old are you?"
* *hand*				"What hand do you use to write with?", 1=Right, 2=Left, 3=Both
* *religion*			"What is your religion?", 1=Agnostic, 2=Atheist, 3=Buddhist, 4=Christian (Catholic), 5=Christian (Mormon), 6=Christian (Protestant), 7=Christian (Other), 8=Hindu, 9=Jewish, 10=Muslim, 11=Sikh, 12=Other
* *orientation*			"What is your sexual orientation?", 1=Heterosexual, 2=Bisexual, 3=Homosexual, 4=Asexual, 5=Other
* *race*				"What is your race?", 1=Asian, 2=Arab, 3=Black, 4=Indigenous Australian, Native American or White***, 5=Other
* *voted*				"Have you voted in a national election in the past year?", 1=Yes, 2=No
* *married*				"What is your marital status?", 1=Never married, 2=Currently married, 3=Previously married
* *familysize*			"Including you, how many children did your mother have?"		
* *major*				"If you attended a university, what was your major (e.g. "psychology", "English", "civil engineering")?"

All missing data from the last block (demographics) were recorded as 0.

There was an error in the coding of the survey: the race options *'Native American'* and *'White'* were presented separately, but both were given the same numerical value.

# Dataset loading

In [1]:
import pandas as pd
import numpy as np
import json
import requests
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 100)

In [2]:
gcbs_df = pd.read_csv('https://raw.githubusercontent.com/bolps/conspiracy/main/openpsychometrics_gcbs_data.csv')

In [3]:
gcbs_df.head(5)

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,engnat,age,hand,religion,orientation,race,voted,married,familysize,major
0,5,5,3,5,5,5,5,3,4,5,...,2,28,1,2,1,5,2,1,1,ACTING
1,5,5,5,5,5,3,5,5,1,4,...,1,14,1,1,2,4,2,1,1,
2,2,4,1,2,2,2,4,2,2,4,...,2,26,1,1,1,4,1,1,2,philosophy
3,5,4,1,2,4,5,4,1,4,5,...,1,25,1,12,1,4,1,1,3,history
4,5,4,1,4,4,5,4,3,1,5,...,1,37,1,2,2,4,2,2,2,Psychology


In [4]:
gcbs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2495 entries, 0 to 2494
Data columns (total 72 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Q1            2495 non-null   int64 
 1   Q2            2495 non-null   int64 
 2   Q3            2495 non-null   int64 
 3   Q4            2495 non-null   int64 
 4   Q5            2495 non-null   int64 
 5   Q6            2495 non-null   int64 
 6   Q7            2495 non-null   int64 
 7   Q8            2495 non-null   int64 
 8   Q9            2495 non-null   int64 
 9   Q10           2495 non-null   int64 
 10  Q11           2495 non-null   int64 
 11  Q12           2495 non-null   int64 
 12  Q13           2495 non-null   int64 
 13  Q14           2495 non-null   int64 
 14  Q15           2495 non-null   int64 
 15  E1            2495 non-null   int64 
 16  E2            2495 non-null   int64 
 17  E3            2495 non-null   int64 
 18  E4            2495 non-null   int64 
 19  E5    

In [5]:
gcbs_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Q1,2495.0,3.472545,1.455552,0.0,2.0,4.0,5.0,5.0
Q2,2495.0,2.963527,1.494669,0.0,2.0,3.0,4.0,5.0
Q3,2495.0,2.046894,1.387236,0.0,1.0,1.0,3.0,5.0
Q4,2495.0,2.636072,1.451371,0.0,1.0,2.0,4.0,5.0
Q5,2495.0,3.254108,1.471855,0.0,2.0,4.0,5.0,5.0
Q6,2495.0,3.108617,1.506676,0.0,2.0,3.0,4.0,5.0
Q7,2495.0,2.666934,1.509954,0.0,1.0,2.0,4.0,5.0
Q8,2495.0,2.450501,1.569256,0.0,1.0,2.0,4.0,5.0
Q9,2495.0,2.232866,1.419266,0.0,1.0,2.0,3.0,5.0
Q10,2495.0,3.502204,1.388713,1.0,2.0,4.0,5.0,5.0


# Data munging

Before starting the whole process of data cleaning and analysis, the raw data must be transformed into the most convenient form for the upcoming steps.

The following munging will be applied:
* dropping useless columns
* assigning the correct data type
* labelling categorical data (according to the codebook)
* identifying missing values (according to the codebook)
* grouping values from text fields
* renaming columns (with prefixes for better organization)
* computing variables for survey cleaning (straightliners, speed responses, outliers, validity check errors)
* computing *GCBS* and *TIPI* scores (according to literature). *This step is crucial as it allows us to transform raw data (indicators) into a value (score) that reflects an individual's position on an underlying construct (such as personality or beliefs)*

In [6]:
# dropping 'introelapse','testelapse','surveyelapse' (overall time spent in every block) columns as I'm not going to use them
gcbs_df = gcbs_df.drop(['introelapse','testelapse','surveyelapse'], axis=1)

In [7]:
# download/parsing json labels
survey_labels_json = requests.get("https://raw.githubusercontent.com/bolps/conspiracy/main/survey_labels.json")
survey_labels_dict = json.loads(survey_labels_json.text)

In [8]:
# mapping labels on dataframe, converting 'Missing' label to NaN, casting to 'category' (or ordered category)
for column in ['education','urban','gender','engnat','hand','religion','orientation','race','voted','married']:
    gcbs_df[column] = gcbs_df[column].astype(str).map(survey_labels_dict[column])
    gcbs_df[column] = gcbs_df[column].replace('Missing', np.nan)
    gcbs_df[column] = gcbs_df[column].astype("category")
gcbs_df['education'] = gcbs_df['education'].cat.reorder_categories(['Less than high school', 'High school', 'University degree', 'Graduate degree'], ordered=True)
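The labelling step above can be illustrated on a toy column. This is only a sketch: the structure of `hypothetical_labels` (string codes mapped to labels, with `'0'` mapped to `'Missing'`) is an assumption inferred from the code above, not the actual content of `survey_labels.json`.

```python
import numpy as np
import pandas as pd

# Assumed shape of one entry of the labels file: string codes mapped to labels
hypothetical_labels = {'gender': {'0': 'Missing', '1': 'Male', '2': 'Female', '3': 'Other'}}

toy = pd.DataFrame({'gender': [1, 2, 0, 3, 2]})

# Integer codes are cast to str so they match the (string) keys of the mapping
labelled = toy['gender'].astype(str).map(hypothetical_labels['gender'])
# The 'Missing' label becomes NaN before casting, so it is not kept as a category
labelled = labelled.replace('Missing', np.nan).astype('category')

print(labelled.tolist())
print(list(labelled.cat.categories))
```

Replacing `'Missing'` before the cast keeps it out of the category list, so later counts and plots only show real answers.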

In [9]:
# assigning NaN to values which make no sense (e.g. implausible ages or family sizes, majors reported without a university degree)

gcbs_df['familysize'] = gcbs_df['familysize'].replace(0, np.nan)
gcbs_df.loc[gcbs_df.familysize > 10, 'familysize'] = np.nan

gcbs_df.loc[gcbs_df.age > 90, 'age'] = np.nan
gcbs_df.loc[gcbs_df.age < 10, 'age'] = np.nan

gcbs_df.loc[gcbs_df.education < 'University degree', 'major'] = np.nan
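The last line of the cell above works because `education` is an ordered categorical, so `<` compares the rank of the categories rather than the strings. A minimal sketch on made-up data (the toy values are assumptions, not taken from the dataset):

```python
import pandas as pd

levels = ['Less than high school', 'High school', 'University degree', 'Graduate degree']
toy = pd.DataFrame({
    'education': pd.Categorical(
        ['High school', 'Graduate degree', None, 'University degree'],
        categories=levels, ordered=True),
    'major': ['n/a', 'Physics', 'History', 'Biology'],
})

# The comparison follows the declared category order, not alphabetical order.
# A missing education compares as False, so that row's major is left untouched.
below_university = toy['education'] < 'University degree'
toy.loc[below_university, 'major'] = None

print(toy)
```

Respondents with a missing education level keep whatever major they typed, which may or may not be what is wanted; the sketch only makes that behaviour visible.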

In [10]:
# Normalizing text for the field major (free-form). Example: '           aCTING' = 'Acting', 'ACTING' = 'Acting'
gcbs_df['major'] = gcbs_df['major'].map(lambda x : x.strip().capitalize() if isinstance(x, str) else np.nan)

In [11]:
# download/parsing json labels. Major labels have been manually grouped in macro-categories
major_labels_json = requests.get("https://raw.githubusercontent.com/bolps/conspiracy/main/major_labels.json")
major_labels_dict = json.loads(major_labels_json.text)

In [12]:
# mapping labels on dataframe in order to create a new column with the 'major cluster'. Example: Graphic design (Major) - Arts (Cluster)
gcbs_df['major_cluster'] = gcbs_df['major'].map(major_labels_dict['Cluster'])
gcbs_df['major_cluster'] = gcbs_df['major_cluster'].replace('Missing', np.nan)
gcbs_df['major_cluster'] = gcbs_df['major_cluster'].astype("category")

In [13]:
# renaming GCBS items for consistency and readability (columns Q1-Q15 belong to the GCBS scale). Example: Q1 becomes GCBS1
gcbs_df.columns = [column.replace('Q', 'GCBS') for column in gcbs_df.columns]

In [14]:
# computing scores for GCBS scale according to the codebook. Note: GCBS scale doesn't use reverse items.
gcbs_cols = list(gcbs_df.filter(like='GCBS',axis=1).columns)
gcbs_df['GCBS_Overall'] = (gcbs_df[gcbs_cols].sum(axis=1))/len(gcbs_cols)

In [15]:
# computing scores for TIPI (personality) scale and subscales according to the codebook. Note: Some of the items are reversed (formula for reversing: reverse = (number_of_levels + 1) - raw_value).
reverse = lambda x : (8-x)
gcbs_df['TIPI_Extraversion'] = (gcbs_df['TIPI1'] + reverse(gcbs_df['TIPI6']))/2
gcbs_df['TIPI_Agreeableness'] = (reverse(gcbs_df['TIPI2']) + gcbs_df['TIPI7'])/2
gcbs_df['TIPI_Conscientiousness'] = (gcbs_df['TIPI3'] + reverse(gcbs_df['TIPI8']))/2
gcbs_df['TIPI_Emotional_Stability'] = (reverse(gcbs_df['TIPI4']) + gcbs_df['TIPI9'])/2
gcbs_df['TIPI_Openness'] = (gcbs_df['TIPI5'] + reverse(gcbs_df['TIPI10']))/2
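The reversing formula used above, `reverse = (number_of_levels + 1) - raw_value`, maps 1 to 7, 2 to 6 and leaves the midpoint 4 unchanged on a 7-point scale. A short worked sketch with hypothetical answers (`reverse_item` and the toy values are illustrative names, not part of the notebook):

```python
import pandas as pd

N_LEVELS = 7  # TIPI items are rated on a 7-point scale

def reverse_item(raw, n_levels=N_LEVELS):
    """Reverse-key a Likert response: 1 <-> 7, 2 <-> 6, ..., 4 stays 4."""
    return (n_levels + 1) - raw

# Hypothetical answers of three respondents to TIPI1 (Extraverted) and TIPI6 (Reserved)
toy = pd.DataFrame({'TIPI1': [7, 2, 4], 'TIPI6': [1, 6, 4]})
toy['TIPI6_reversed'] = reverse_item(toy['TIPI6'])
# The subscale score is the mean of the regular item and the reversed item
toy['Extraversion'] = (toy['TIPI1'] + toy['TIPI6_reversed']) / 2

print(toy)
```

After reversing, a respondent who strongly agrees with "Extraverted" and strongly disagrees with "Reserved" gets the maximum Extraversion score of 7.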

In [16]:
# computing variable for counting validity check errors (validity check items are simple questions used to probe user attention and engagement)
gcbs_df['SURV_ValidityCheck_Errors'] = gcbs_df['VCL6'] + gcbs_df['VCL9'] + gcbs_df['VCL12']
gcbs_df = gcbs_df.drop(['VCL1','VCL2','VCL3','VCL4','VCL5','VCL6','VCL7','VCL8','VCL9','VCL10','VCL11','VCL12','VCL13','VCL14','VCL15','VCL16'], axis=1)

In [17]:
# computing variable to spot straightliners for GCBS (respondents who selected the same answer to all items of the scale).
# the respondent is a straightliner if the std deviation of the scale is 0 (always the same answer)
gcbs_df['SURV_CheckStraightliners_GCBS'] = gcbs_df[['GCBS1','GCBS2','GCBS3','GCBS4','GCBS5','GCBS6','GCBS7','GCBS8','GCBS9','GCBS10','GCBS11','GCBS12','GCBS13','GCBS14','GCBS15']].std(axis=1)
gcbs_df['SURV_CheckStraightliners_GCBS'] = gcbs_df['SURV_CheckStraightliners_GCBS'] == 0

In [18]:
# computing variable to spot straightliners for TIPI (respondents who selected the same answer to all items of the scale).
# the respondent is a straightliner if the std deviation of the scale is 0 (always the same answer)
gcbs_df['SURV_CheckStraightliners_TIPI'] = gcbs_df[['TIPI1','TIPI2','TIPI3','TIPI4','TIPI5','TIPI6','TIPI7','TIPI8','TIPI9','TIPI10']].std(axis=1)
gcbs_df['SURV_CheckStraightliners_TIPI'] = gcbs_df['SURV_CheckStraightliners_TIPI'] == 0
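The straightliner checks above can be tried on a toy table. The sketch below uses hypothetical items and also shows an equivalent test based on the number of distinct answers, which avoids any reliance on floating point arithmetic:

```python
import pandas as pd

toy = pd.DataFrame(
    {'item1': [3, 1, 5], 'item2': [3, 2, 5], 'item3': [3, 4, 5]},
    index=['all_threes', 'varied', 'all_fives'])

# The row-wise standard deviation is 0 only when every answer is identical
by_std = toy.std(axis=1) == 0
# Equivalent check: the row contains a single distinct value
by_nunique = toy.nunique(axis=1) == 1

print(by_std.tolist())
print(by_nunique.tolist())
```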

In [19]:
# computing variable to spot speed responses for GCBS (the core construct measured). Note: Unfortunately we have no data on TIPI response times.
# I flagged a respondent as a speed responder if the response time of at least one item of the scale is below the 5th percentile (0.05 quantile) of that item. Note: I dropped response times as I don't need them anymore
for item in ['E1','E2','E3','E4','E5','E6','E7','E8','E9','E10','E11','E12','E13','E14','E15']:
    gcbs_df[item] = gcbs_df[item] < gcbs_df[item].quantile(.05)
gcbs_df['SURV_QuickResponse_GCBS'] = gcbs_df[['E1','E2','E3','E4','E5','E6','E7','E8','E9','E10','E11','E12','E13','E14','E15']].any(axis=1)
gcbs_df = gcbs_df.drop(['E1','E2','E3','E4','E5','E6','E7','E8','E9','E10','E11','E12','E13','E14','E15'], axis=1)
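A small sketch of the same flagging rule on hypothetical response times (two items, twenty respondents; all values are made up for illustration):

```python
import pandas as pd

# Hypothetical response times in milliseconds
times = pd.DataFrame({
    'E1': [300] + [2000 + 100 * i for i in range(19)],
    'E2': [2500 + 100 * i for i in range(19)] + [250],
})

# Per-item threshold: the 5th percentile of that item's response times
thresholds = times.quantile(0.05)
# Compare each column with its own threshold
too_fast = times.lt(thresholds, axis=1)
# A respondent is flagged if at least one item was answered too quickly
quick_response = too_fast.any(axis=1)

print(thresholds.to_dict())
print(quick_response[quick_response].index.tolist())
```

By construction roughly 5% of the answers to each item fall below its threshold, so the share of respondents flagged on at least one of 15 items can be much larger than 5%.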

In [20]:
# renaming columns for better organization.
# I logically divided data in blocks: RAW (raw GCBS and TIPI data), DEMO (demographics), GCBS (gcbs scores), TIPI (tipi scores) and SURV (survey related info)
gcbs_df.columns = [('RAW_'+column) for column in gcbs_df.columns[:gcbs_df.columns.get_loc('TIPI10')+1]] + [('DEMO_'+column) for column in gcbs_df.columns[gcbs_df.columns.get_loc('TIPI10')+1:gcbs_df.columns.get_loc('major_cluster')+1]] + list(gcbs_df.columns[gcbs_df.columns.get_loc('major_cluster')+1:])

In [21]:
gcbs_df.head(5).T

Unnamed: 0,0,1,2,3,4
RAW_GCBS1,5,5,2,5,5
RAW_GCBS2,5,5,4,4,4
RAW_GCBS3,3,5,1,1,1
RAW_GCBS4,5,5,2,2,4
RAW_GCBS5,5,5,2,4,4
RAW_GCBS6,5,3,2,5,5
RAW_GCBS7,5,5,4,4,4
RAW_GCBS8,3,5,2,1,3
RAW_GCBS9,4,1,2,4,1
RAW_GCBS10,5,4,4,5,5


In [22]:
gcbs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2495 entries, 0 to 2494
Data columns (total 49 columns):
 #   Column                         Non-Null Count  Dtype   
---  ------                         --------------  -----   
 0   RAW_GCBS1                      2495 non-null   int64   
 1   RAW_GCBS2                      2495 non-null   int64   
 2   RAW_GCBS3                      2495 non-null   int64   
 3   RAW_GCBS4                      2495 non-null   int64   
 4   RAW_GCBS5                      2495 non-null   int64   
 5   RAW_GCBS6                      2495 non-null   int64   
 6   RAW_GCBS7                      2495 non-null   int64   
 7   RAW_GCBS8                      2495 non-null   int64   
 8   RAW_GCBS9                      2495 non-null   int64   
 9   RAW_GCBS10                     2495 non-null   int64   
 10  RAW_GCBS11                     2495 non-null   int64   
 11  RAW_GCBS12                     2495 non-null   int64   
 12  RAW_GCBS13                     249

In [23]:
gcbs_df.isnull().sum()[gcbs_df.isnull().sum() > 0]

DEMO_education          38
DEMO_urban              39
DEMO_gender              6
DEMO_engnat              8
DEMO_age                 2
DEMO_hand               13
DEMO_religion           31
DEMO_orientation        39
DEMO_race               25
DEMO_voted              23
DEMO_married            16
DEMO_familysize         53
DEMO_major            1561
DEMO_major_cluster    1578
dtype: int64

In [24]:
gcbs_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
RAW_GCBS1,2495.0,3.472545,1.455552,0.0,2.0,4.0,5.0,5.0
RAW_GCBS2,2495.0,2.963527,1.494669,0.0,2.0,3.0,4.0,5.0
RAW_GCBS3,2495.0,2.046894,1.387236,0.0,1.0,1.0,3.0,5.0
RAW_GCBS4,2495.0,2.636072,1.451371,0.0,1.0,2.0,4.0,5.0
RAW_GCBS5,2495.0,3.254108,1.471855,0.0,2.0,4.0,5.0,5.0
RAW_GCBS6,2495.0,3.108617,1.506676,0.0,2.0,3.0,4.0,5.0
RAW_GCBS7,2495.0,2.666934,1.509954,0.0,1.0,2.0,4.0,5.0
RAW_GCBS8,2495.0,2.450501,1.569256,0.0,1.0,2.0,4.0,5.0
RAW_GCBS9,2495.0,2.232866,1.419266,0.0,1.0,2.0,3.0,5.0
RAW_GCBS10,2495.0,3.502204,1.388713,1.0,2.0,4.0,5.0,5.0


# Data cleaning

To remove noise and invalid responses, I removed the data of participants who gave quick answers, failed validity checks, or provided straightline responses. I also excluded all individuals with age<18, as the *GCBS* scale has so far only been assessed on adults (age>=18). These choices are meant to protect both the reliability (how consistently a method measures something) and the validity (how accurately a method measures what it is intended to measure) of the measure.

In [25]:
# boolean masks: True marks respondents who pass each check
age_mask = gcbs_df['DEMO_age'] >= 18
validity_check_errors_mask = gcbs_df['SURV_ValidityCheck_Errors'] == 0
check_straightliners_gcbs_mask = ~gcbs_df['SURV_CheckStraightliners_GCBS']
check_straightliners_tipi_mask = ~gcbs_df['SURV_CheckStraightliners_TIPI']
quick_response_gcbs_mask = ~gcbs_df['SURV_QuickResponse_GCBS']

In [26]:
gcbs_clean_df = gcbs_df[age_mask & validity_check_errors_mask & check_straightliners_gcbs_mask & check_straightliners_tipi_mask & quick_response_gcbs_mask]
gcbs_clean_df = gcbs_clean_df.reset_index(drop=True)
gcbs_clean_df = gcbs_clean_df.drop(['SURV_ValidityCheck_Errors','SURV_CheckStraightliners_GCBS','SURV_CheckStraightliners_TIPI','SURV_QuickResponse_GCBS'], axis=1)

In [27]:
gcbs_clean_df.head(5).T

Unnamed: 0,0,1,2,3,4
RAW_GCBS1,5,5,5,5,4
RAW_GCBS2,5,4,4,4,4
RAW_GCBS3,3,1,3,2,1
RAW_GCBS4,5,4,3,4,3
RAW_GCBS5,5,4,4,5,4
RAW_GCBS6,5,5,5,5,2
RAW_GCBS7,5,4,5,3,4
RAW_GCBS8,3,3,5,3,1
RAW_GCBS9,4,1,5,5,1
RAW_GCBS10,5,5,5,3,4


In [28]:
gcbs_clean_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 962 entries, 0 to 961
Data columns (total 45 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   RAW_GCBS1                 962 non-null    int64   
 1   RAW_GCBS2                 962 non-null    int64   
 2   RAW_GCBS3                 962 non-null    int64   
 3   RAW_GCBS4                 962 non-null    int64   
 4   RAW_GCBS5                 962 non-null    int64   
 5   RAW_GCBS6                 962 non-null    int64   
 6   RAW_GCBS7                 962 non-null    int64   
 7   RAW_GCBS8                 962 non-null    int64   
 8   RAW_GCBS9                 962 non-null    int64   
 9   RAW_GCBS10                962 non-null    int64   
 10  RAW_GCBS11                962 non-null    int64   
 11  RAW_GCBS12                962 non-null    int64   
 12  RAW_GCBS13                962 non-null    int64   
 13  RAW_GCBS14                962 non-null    int64   

In [29]:
gcbs_clean_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
RAW_GCBS1,962.0,3.598753,1.366737,1.0,3.0,4.0,5.0,5.0
RAW_GCBS2,962.0,3.099792,1.444938,1.0,2.0,3.0,4.0,5.0
RAW_GCBS3,962.0,1.918919,1.281289,1.0,1.0,1.0,3.0,5.0
RAW_GCBS4,962.0,2.663202,1.414536,1.0,1.0,3.0,4.0,5.0
RAW_GCBS5,962.0,3.330561,1.424069,1.0,2.0,4.0,5.0,5.0
RAW_GCBS6,962.0,3.185031,1.440121,1.0,2.0,3.0,4.0,5.0
RAW_GCBS7,962.0,2.794179,1.501732,1.0,1.0,3.0,4.0,5.0
RAW_GCBS8,962.0,2.386694,1.517895,1.0,1.0,2.0,4.0,5.0
RAW_GCBS9,962.0,2.286902,1.406398,1.0,1.0,2.0,3.0,5.0
RAW_GCBS10,962.0,3.576923,1.316195,1.0,3.0,4.0,5.0,5.0


# Data quality


When it comes to psychological data, data quality is serious business, as the quality of the output strongly depends on the ability of the scales to actually measure the psychological dimension they are intended to measure. So I computed Cronbach’s alpha for both *GCBS* (Conspiracist Belief) and *TIPI* (Personality). Cronbach’s alpha is a measure used to assess the reliability, or internal consistency, of a set of scale or test items.

Cronbach’s alpha compares the sum of the variances of the individual items with the variance of the total score across observations (usually individual survey respondents or test takers). The more the items covary, the larger the total-score variance is relative to the item variances, and the closer ${\alpha}$ gets to 1:

${\alpha} = \frac{k}{k-1} \left(1- \frac{\sum \limits _{i=1} ^{k} {\sigma}_{y_{i}}^2}{{\sigma}_{x}^2}\right) $

where:

* ${k}$ refers to the number of scale items
* ${\sigma}_{y_{i}}^2$ refers to the variance associated with item ${i}$
* ${\sigma}_{x}^2$ refers to the variance associated with the observed total scores

Note: reverse items must be recoded before computing ${\alpha}$
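The formula can be written out directly with NumPy. This is a minimal sketch on hypothetical responses, assuming complete data (no missing values) and sample variances (`ddof=1`); the function name `cronbach_alpha` is illustrative and the analysis itself relies on `pingouin`.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a (respondents x items) array with no missing values."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                                # number of scale items
    item_variances = items.var(axis=0, ddof=1)        # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)    # variance of the summed score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses: 5 respondents x 3 items
toy = np.array([
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 4, 5],
    [5, 5, 5],
])

print(round(cronbach_alpha(toy), 3))
```

In the toy data the three items rise and fall together, so the variance of the summed score is much larger than the sum of the item variances and alpha is high.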

Internal consistency should be evaluated according to the following table:

| Cronbach’s alpha | Internal consistency |
| ----------- | ----------- |
| ${\alpha}>= 0.9$      | Excellent       |
| $0.9>{\alpha}>=0.8$      | Good       |
| $0.8>{\alpha}>=0.7$      | Acceptable       |
| $0.7>{\alpha}>=0.6$      | Questionable       |
| $0.6>{\alpha}>=0.5$      | Poor       |
| $0.5>{\alpha}$      | Unacceptable       |


As pointed out by Eisinga and colleagues (2013), when ${items}=2$ coefficient alpha almost always underestimates true reliability.

Eisinga, Rob; Grotenhuis, Manfred te; Pelzer, Ben (2013). *The reliability of a two-item scale: Pearson, Cronbach, or Spearman-Brown?. International Journal of Public Health, 58(4), 637–642.* [doi:10.1007/s00038-012-0416-3](https://link.springer.com/article/10.1007%2Fs00038-012-0416-3)
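The paper cited above argues for the Spearman-Brown coefficient as the more appropriate reliability estimate for two-item scales. For two items it reduces to $2r/(1+r)$, where $r$ is the Pearson correlation between the items. The sketch below is not part of the original analysis; the function name and the toy answers are assumptions for illustration.

```python
import numpy as np

def spearman_brown_two_items(item_a, item_b):
    """Spearman-Brown reliability of a two-item scale: 2r / (1 + r)."""
    r = np.corrcoef(item_a, item_b)[0, 1]  # Pearson correlation between the two items
    return 2 * r / (1 + r)

# Hypothetical answers to two items of the same subscale (one already reverse-keyed)
item_a = [1, 2, 3, 4, 5, 6, 7]
item_b = [2, 1, 4, 3, 6, 5, 7]

print(round(spearman_brown_two_items(item_a, item_b), 3))
```

For the TIPI subscales this could be applied to each item pair (after reversing the reverse-keyed item) and reported next to alpha.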

In [31]:
import pingouin as pg

In [32]:
def getCronbachDict(scale_name,df, ci=.95):
    # pass ci through so that the reported confidence level matches the computed interval
    cron = pg.cronbach_alpha(data=df, ci=ci)
    
    if cron[0] >= 0.9:
        internal_consistency = 'Excellent'
    elif cron[0] >= 0.8:
        internal_consistency = 'Good'
    elif cron[0] >= 0.7:
        internal_consistency = 'Acceptable'
    elif cron[0] >= 0.6:
        internal_consistency = 'Questionable'
    elif cron[0] >= 0.5:
        internal_consistency = 'Poor'
    else:
        internal_consistency = 'Unacceptable'

    if len(df.columns) == 2:
        note = 'When items = 2, coefficient alpha almost always underestimates true reliability'
    else:
        note = ''
    
    return {
        'scale':scale_name,
        'scale_items':len(df.columns),
        'sample_size':len(df),
        'cronbach_alpha':round(cron[0],3),
        'ci_lower':cron[1][0],
        'ci_upper':cron[1][1],
        'ci':ci,
        'internal_consistency':internal_consistency,
        'note':note
  }

In [33]:
cronbach_dict_list = []

In [34]:
cronbach_dict_list.append(getCronbachDict('GCBS_Overall',gcbs_clean_df[['RAW_GCBS1','RAW_GCBS2','RAW_GCBS3','RAW_GCBS4','RAW_GCBS5','RAW_GCBS6','RAW_GCBS7','RAW_GCBS8','RAW_GCBS9','RAW_GCBS10','RAW_GCBS11','RAW_GCBS12','RAW_GCBS13','RAW_GCBS14','RAW_GCBS15']]))

In [35]:
# TODO: explain why the items are reversed (reverse items must be recoded before computing alpha)
for column in ['RAW_TIPI2','RAW_TIPI4','RAW_TIPI6','RAW_TIPI8','RAW_TIPI10']:
    gcbs_clean_df['{}_REV'.format(column)] = reverse(gcbs_clean_df[column])

cronbach_dict_list.append(getCronbachDict('TIPI_Overall',gcbs_clean_df[['RAW_TIPI1','RAW_TIPI2_REV','RAW_TIPI3','RAW_TIPI4_REV','RAW_TIPI5','RAW_TIPI6_REV','RAW_TIPI7','RAW_TIPI8_REV','RAW_TIPI9','RAW_TIPI10_REV']]))
cronbach_dict_list.append(getCronbachDict('TIPI_Extraversion',gcbs_clean_df[['RAW_TIPI1','RAW_TIPI6_REV']]))
cronbach_dict_list.append(getCronbachDict('TIPI_Agreeableness',gcbs_clean_df[['RAW_TIPI2_REV','RAW_TIPI7']]))
cronbach_dict_list.append(getCronbachDict('TIPI_Conscientiousness',gcbs_clean_df[['RAW_TIPI3','RAW_TIPI8_REV']]))
cronbach_dict_list.append(getCronbachDict('TIPI_Emotional_Stability',gcbs_clean_df[['RAW_TIPI4_REV','RAW_TIPI9']]))
cronbach_dict_list.append(getCronbachDict('TIPI_Openness',gcbs_clean_df[['RAW_TIPI5','RAW_TIPI10_REV']]))

gcbs_clean_df = gcbs_clean_df.drop(list(gcbs_clean_df.filter(like='_REV',axis=1).columns), axis=1)

In [36]:
cronbach_df  = pd.DataFrame(cronbach_dict_list)
cronbach_df

Unnamed: 0,scale,scale_items,sample_size,cronbach_alpha,ci_lower,ci_upper,ci,internal_consistency,note
0,GCBS_Overall,15,962,0.924,0.916,0.931,0.95,Excellent,
1,TIPI_Overall,10,962,0.636,0.6,0.669,0.95,Questionable,
2,TIPI_Extraversion,2,962,0.752,0.718,0.781,0.95,Acceptable,"When items = 2, coefficient alpha almost always underestimates true reliability"
3,TIPI_Agreeableness,2,962,0.374,0.29,0.449,0.95,Unacceptable,"When items = 2, coefficient alpha almost always underestimates true reliability"
4,TIPI_Conscientiousness,2,962,0.623,0.572,0.668,0.95,Questionable,"When items = 2, coefficient alpha almost always underestimates true reliability"
5,TIPI_Emotional_Stability,2,962,0.748,0.714,0.778,0.95,Acceptable,"When items = 2, coefficient alpha almost always underestimates true reliability"
6,TIPI_Openness,2,962,0.367,0.282,0.442,0.95,Unacceptable,"When items = 2, coefficient alpha almost always underestimates true reliability"


Cronbach’s alphas show excellent internal consistency for the *Generic Conspiracist Beliefs Scale (GCBS)*, while the results for the personality assessment seem questionable. With just 10 items (two per dimension), the *Personality Scale (TIPI)* struggles to reach good internal consistency for its subscales. However, a composite reliability of ${\alpha}>=0.6$ is considered satisfactory for exploratory research (Nunnally & Bernstein, 1994).

Nunnally, J.C. and Bernstein, I.H. (1994) *The Assessment of Reliability.* Psychometric Theory, 3, 248-292.

# Data Exploration

In [None]:
import matplotlib.pyplot as plt

#importing plotly and cufflinks in offline mode
import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

### Sample description

In [None]:
total_responses = sum(gcbs_clean_df['DEMO_age'].value_counts())
gcbs_clean_df['DEMO_age'].iplot(kind='hist', opacity=0.75, color='rgb(12, 128, 128)', title='Age Distribution (N={})'.format(total_responses), yTitle='Count', xTitle='Age', bargap = 0, theme='white')

In [None]:
total_responses = sum(gcbs_clean_df['DEMO_education'].value_counts())
gcbs_clean_df['DEMO_education'].value_counts().sort_index(ascending=True).iplot(kind='bar', title='Education (N={})'.format(total_responses), theme='white')

In [None]:
family_distribution = gcbs_clean_df['DEMO_familysize'].value_counts().sort_index(ascending=True)
family_distribution.index = [int(i) for i in list(family_distribution.index)]
total_responses = sum(family_distribution)
family_distribution.iplot(kind='bar', title='Family size (N={})'.format(total_responses), theme='white')

In [None]:
def pieChart(df, col, title='', subtitle=''):
    total_responses = sum(df[col].value_counts()[df[col].value_counts()> 0])
    count_df = df[col].value_counts()[df[col].value_counts()> 0].rename_axis(col).reset_index(name='counts')
    fig = count_df.iplot(kind='pie', labels=col, values='counts', hoverinfo="label+percent+name",hole=0.3, theme='white', asFigure=True)
    fig.update_traces(texttemplate='%{percent:.2%}')
    fig.update_layout(title_text='{} (N={})'.format(title,total_responses), title_x=0.1, legend=dict(orientation="h", xanchor = "center",  x = 0.5))
    return fig

In [None]:
pieChart(gcbs_clean_df, col='DEMO_urban', title='Area')

In [None]:
pieChart(gcbs_clean_df, col='DEMO_gender', title='Gender')

In [None]:
pieChart(gcbs_clean_df, col='DEMO_engnat', title='Language')

In [None]:
pieChart(gcbs_clean_df, col='DEMO_hand', title='Hand preference')

In [None]:
pieChart(gcbs_clean_df, col='DEMO_religion', title='Religion')

In [None]:
pieChart(gcbs_clean_df, col='DEMO_orientation', title='Sexual orientation')

In [None]:
pieChart(gcbs_clean_df, col='DEMO_race', title='Race')

In [None]:
pieChart(gcbs_clean_df, col='DEMO_voted', title='Voted')

In [None]:
pieChart(gcbs_clean_df, col='DEMO_married', title='Married')

In [None]:
pieChart(gcbs_clean_df, col='DEMO_major_cluster', title='Major')

### Scales

In [None]:
scales_list = ['GCBS_Overall','TIPI_Extraversion','TIPI_Agreeableness','TIPI_Conscientiousness','TIPI_Emotional_Stability','TIPI_Openness']

In [None]:
round(gcbs_clean_df[scales_list].describe(),2)

In [None]:
gcbs_clean_df['GCBS_Overall'].iplot(kind='hist', opacity=0.75, color='rgb(12, 128, 128)', title='Generic Conspiracist Beliefs Scale Distribution', yTitle='Count', xTitle='GCBS score (overall)', bargap = 0, theme='white')

In [None]:
gcbs_clean_df['TIPI_Extraversion'].iplot(kind='hist', opacity=0.75, color='rgb(93, 173, 226)', title='Personality - Extraversion Distribution', yTitle='Count', xTitle='TIPI score (extraversion)', bargap = 0, theme='white')

In [None]:
gcbs_clean_df['TIPI_Agreeableness'].iplot(kind='hist', opacity=0.75, color='rgb(72, 201, 176)', title='Personality - Agreeableness Distribution', yTitle='Count', xTitle='TIPI score (agreeableness)', bargap = 0, theme='white')

In [None]:
gcbs_clean_df['TIPI_Conscientiousness'].iplot(kind='hist', opacity=0.75, color='rgb(175, 122, 197)', title='Personality - Conscientiousness Distribution', yTitle='Count', xTitle='TIPI score (conscientiousness)', bargap = 0, theme='white')

In [None]:
gcbs_clean_df['TIPI_Emotional_Stability'].iplot(kind='hist', opacity=0.75, color='rgb(247, 220, 111)', title='Personality - Emotional Stability Distribution', yTitle='Count', xTitle='TIPI score (emotional stability)', bargap = 0, theme='white')

In [None]:
gcbs_clean_df['TIPI_Openness'].iplot(kind='hist', opacity=0.75, color='rgb(236, 112, 99)', title='Personality - Openness Distribution', yTitle='Count', xTitle='TIPI score (openness)', bargap = 0, theme='white')

In [None]:
# Checking normality with statistical tests
# Jarque-Bera ## H0: Data is normally distributed.
# Shapiro-Wilk ## H0: Data is normally distributed.
# D’Agostino-Pearson ## H0: Data is normally distributed.
# + Skewness + Kurtosis

In [None]:
from scipy.stats import shapiro, jarque_bera, normaltest, skew, kurtosis

In [None]:
def testNormality(x):
    #Shapiro-Wilk
    w, p_w = shapiro(x)
    #Jarque-Bera
    jb, p_jb = jarque_bera(x)
    #D’Agostino-Pearson
    k2, p_k2 = normaltest(x)
    
    #Skewness
    s = skew(x)
    #Kurtosis
    k = kurtosis(x)
    
    return {
        'Shapiro (w)':w,
        'p-value (Shapiro)':p_w,
        'Jarque-Bera (jb)':jb,
        'p-value (Jarque-Bera)':p_jb,
        'D’Agostino-Pearson (k2)':k2,
        'p-value (D’Agostino-Pearson)':p_k2,
        'skewness':s,
        'kurtosis':k
    }

In [None]:
normality_check_list = []
for column in scales_list:
    distribution_info = testNormality(gcbs_clean_df[column])
    normality_check_list.append(distribution_info)
normality_tests_df = pd.DataFrame(normality_check_list)
normality_tests_df.index = scales_list
normality_tests_df

The scale scores are not normally distributed, so non-parametric tests are needed.

# Research questions

* *Does belonging to particular socio-demographic groups influence the level of conspiracist belief?*
* *Is there a relationship between the five personality dimensions and the level of conspiracist belief?*
* *Are there personality configurations that, taken together, influence the level of conspiracist belief?*

### Does belonging to particular socio-demographic groups influence the level of conspiracist belief?

The samples have unequal variances, which violates the assumption of equal variances made by tests like ANOVA. Having both unequal sample sizes and unequal variances dramatically affects statistical power and Type I error rates (Rusticus & Lovato, 2014).

Rusticus, S. & Lovato, C. (2014). *Impact of Sample Size and Variability on the Power and Type I Error Rates of Equivalence Tests: A Simulation Study.* Practical Assessment, Research & Evaluation, Vol. 19, No. 11, August.

Shingala, C. et al. (2015). International Journal of New Technologies in Science and Engineering, Vol. 2, Issue 5, Nov 2015, ISSN 2349-0780.

The Kruskal-Wallis H-test tests the null hypothesis that the population medians of all of the groups are equal. It is a non-parametric version of ANOVA. The test works on 2 or more independent samples, which may have different sizes.

Due to the assumption that H has a chi square distribution, the number of samples in each group must not be too small. A typical rule is that each sample must have at least 5 measurements.

It is important to realize that the Kruskal-Wallis H test is an omnibus test statistic and cannot tell you which specific groups of your independent variable are statistically significantly different from each other; it only tells you that at least two groups were different. You can perform pairwise comparisons using a post hoc test.
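One possible way to follow up a significant omnibus result is sketched below on hypothetical groups: pairwise Mann-Whitney U tests with a Bonferroni correction. This is only one of several post hoc options (Dunn's test is another common choice) and is not part of the original analysis; the group values are made up.

```python
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

# Hypothetical scores for three groups (not taken from the dataset)
groups = {
    'A': [1.2, 1.5, 1.7, 2.0, 2.1, 2.3],
    'B': [1.4, 1.6, 1.9, 2.2, 2.4, 2.5],
    'C': [3.1, 3.3, 3.6, 3.8, 4.0, 4.4],
}

# Omnibus test: is at least one group different from the others?
h_stat, p_omnibus = kruskal(*groups.values())

# Post hoc: pairwise Mann-Whitney U tests with a Bonferroni correction
pairs = list(combinations(groups, 2))
posthoc = {}
for a, b in pairs:
    _, p = mannwhitneyu(groups[a], groups[b], alternative='two-sided')
    posthoc[(a, b)] = min(p * len(pairs), 1.0)  # Bonferroni-adjusted p-value

print('H={:.2f}, p={:.4f}'.format(h_stat, p_omnibus))
for pair, p_adj in posthoc.items():
    print(pair, round(p_adj, 4))
```

In the toy data group C is shifted upwards, so the omnibus test is significant and only the comparisons involving C survive the correction.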

In [None]:
import pingouin as pg

In [None]:
# Run a Kruskal-Wallis H-test of GCBS_Overall against each demographic factor
demo_factors = ['DEMO_education','DEMO_urban','DEMO_gender','DEMO_engnat','DEMO_hand',
                'DEMO_religion','DEMO_orientation','DEMO_race','DEMO_voted','DEMO_married']

kruskallwallis_demo_gcbs_list = []
for demo in demo_factors:
    result = pg.kruskal(data=gcbs_clean_df, dv='GCBS_Overall', between=demo, detailed=False)
    row = result.loc['Kruskal']  # single result row returned by pingouin

    # Flag results significant at the 5% level (p-values are not corrected for multiple tests)
    sign = 'Statistically significant result' if row['p-unc'] < .05 else ''

    kruskallwallis_demo_gcbs_list.append((row['Source'], 'GCBS_Overall', row['H'], row['p-unc'], row['ddof1'], sign))

In [None]:
pd.DataFrame(kruskallwallis_demo_gcbs_list, columns = ['Factor', 'Scale', 'H', 'p-value', 'DoF','Note'])

**Question:** *should I run post hoc tests, or show the differences between groups with boxplots?*

**Question:** *add a plot for age and family size (does a scatterplot make sense, or is it better to recode the values into classes, following a clear logic of course)?*

### Is there a relationship between the five personality dimensions and the level of conspiracist belief?

A linear (Pearson) correlation between TIPI and GCBS is not appropriate here because its assumptions about the distributions are not met. A rank-based correlation coefficient is non-parametric, since it does not rely on any assumptions about the distribution of X, Y, or (X, Y).
 
The Spearman rank-order correlation coefficient is a nonparametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact monotonic relationship.
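To make the difference from Pearson concrete, here is a small sketch on made-up data: a strictly increasing but non-linear relationship yields a Spearman coefficient of 1, while Pearson stays well below 1. It also shows that Spearman's coefficient is the Pearson correlation computed on the ranks. The arrays `x` and `y` are illustrative assumptions.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Strictly increasing but strongly non-linear relationship (synthetic data)
x = np.arange(1, 11, dtype=float)
y = np.exp(x)

r_pearson, _ = pearsonr(x, y)        # measures linear association
rho_spearman, _ = spearmanr(x, y)    # measures monotonic association

# Spearman's rho is Pearson's r applied to the ranks of the data
r_on_ranks, _ = pearsonr(rankdata(x), rankdata(y))

print(f"Pearson r = {r_pearson:.3f}")
print(f"Spearman rho = {rho_spearman:.3f}")
print(f"Pearson on ranks = {r_on_ranks:.3f}")
```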

In [None]:
from scipy.stats import pearsonr, spearmanr

In [None]:
correlation_list = []
# Spearman correlation between each TIPI dimension and the overall GCBS score
for personality_dimension in gcbs_clean_df.filter(like='TIPI_', axis=1).columns:
    r_spear, p_spear = spearmanr(gcbs_clean_df[personality_dimension], gcbs_clean_df['GCBS_Overall'])
    correlation_list.append(('{} - GCBS_Overall'.format(personality_dimension), r_spear, p_spear))

In [None]:
personality_conspiracy_corr_df = pd.DataFrame(correlation_list, columns=['Variables', 'r_spearman', 'p_value'])
personality_conspiracy_corr_df

**Question: does it also make sense to include graphical evidence (a scatterplot) of the absence of a relationship?**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
sns.set_theme()
sns.scatterplot(data=gcbs_clean_df, x="GCBS_Overall", y="TIPI_Extraversion")
plt.show()

### Are there personality configurations that, taken together, influence the level of conspiracist belief?

#### Finding Personality types with unsupervised learning

**Question:** *to build the clustering model, can I also use data from other studies (that use the same scale) available on openpsychometrics?*

**Question:** *Is k-means appropriate? The literature suggests it as the algorithm to use, but the studies are few and recent.*

**Question:** *Is it acceptable to fit the model and then apply the labels to the same data used for training?*

In [None]:
personality_df = gcbs_clean_df[gcbs_clean_df.filter(like='TIPI_',axis=1).columns]
personality_df

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

In [None]:
scaler = StandardScaler()

In [None]:
scaled_array = scaler.fit_transform(personality_df)

In [None]:
scaled_dataframe = pd.DataFrame( scaled_array, columns = personality_df.columns )
scaled_dataframe.head(5)

In [None]:
# finding the optimal number of clusters (k) with elbow and silhouette

In [None]:
K = range(2,26)
sum_of_squared_distances = []
for num_clusters in K :
    kmeans = KMeans(n_clusters=num_clusters, random_state=42)
    kmeans.fit(scaled_dataframe)
    sum_of_squared_distances.append(kmeans.inertia_)

sns.set_theme()
plt.figure(figsize=(16, 6))
plt.plot(K,sum_of_squared_distances,'bx-')
plt.xticks(K)
plt.xlabel('Values of K') 
plt.ylabel('Sum of squared distances/Inertia')
plt.title('Elbow Method For Optimal k')
plt.show()

In [None]:
K = range(2,26)
silhouette_avg = []
for num_clusters in K:
    # initialise kmeans
    kmeans = KMeans(n_clusters=num_clusters, random_state=42)
    kmeans.fit(scaled_dataframe)
    cluster_labels = kmeans.labels_
    # silhouette score
    silhouette_avg.append(silhouette_score(scaled_dataframe, cluster_labels))
sns.set_theme()
plt.figure(figsize=(16, 6))
plt.plot(K,silhouette_avg,'bx-')
plt.xticks(K)
plt.xlabel('Values of K') 
plt.ylabel('Silhouette score') 
plt.title('Silhouette analysis For Optimal k')
plt.show()
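Besides reading the two plots by eye, the best silhouette score can be picked programmatically. The sketch below does this on synthetic, well-separated blobs generated with `make_blobs`; the data, the range of k and the variable names are assumptions for illustration, and real questionnaire data will rarely show such a clear maximum.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated clusters (not the TIPI data)
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=42,
)

# Mean silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels_k)

# The candidate with the highest mean silhouette score
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The silhouette maximum is a heuristic: when the curve is flat, interpretability of the resulting clusters is a reasonable tie-breaker.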

In [None]:
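# Fixed seed, as in the search above, so that cluster labels are reproducible across runs
kmeans_model = KMeans(n_clusters=5, random_state=42)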
kmeans_model.fit(scaled_dataframe)

In [None]:
centroids = kmeans_model.cluster_centers_
labels = kmeans_model.labels_

In [None]:
scaled_dataframe["personality_cluster"] = labels
scaled_dataframe.head(5)

In [None]:
scaled_dataframe.groupby(['personality_cluster']).mean()

**Question:** *Should I make a plot to compare the characteristics of the clusters?*

#### Do personality clusters influence beliefs in conspiracy theories?

In [None]:
gcbs_clean_df['TIPI_Personality_Cluster'] = labels

In [None]:
gcbs_clean_df.head(5)

In [None]:
pg.kruskal(data=gcbs_clean_df, dv='GCBS_Overall', between='TIPI_Personality_Cluster', detailed=False)