## Capstone Project: Initial EDA

**Author: Elliot Carter**

**Date Created: July 21, 2023**

**Contact: elliot.carter@gmail.com**

---

### Table of Contents

[Introduction](#introduction)

[Part 1 - Basic Information About the Data](#part1)

[Part 2 - Exploring the Missing Data](#part2)

[Part 3 - Basic Analysis](#part3)

[Conclusion and Next Steps](#conclusion)

<a id = 'introduction'></a>
### Introduction

This notebook contains my initial Exploratory Data Analysis (EDA) for my capstone project at the BrainStation Data Science bootcamp.

My project is aimed at using the tools of data science to better understand the phenomenon of conspiracy theory belief. The dataset I am working with comes from a 2022 paper by Roland Imhoff and 39 coauthors at various European universities (Imhoff et al. 2022). The paper reports the results of two surveys studying the relationship between political orientation and susceptibility to conspiracy belief, with respondents from 26 countries, most of them European. The datasets include two measures of political orientation: self-reported location on a left-right political spectrum (in both surveys) and reported voting behaviour in the previous election (in the second survey only). In both surveys, susceptibility to conspiracy belief is measured via responses to a standard questionnaire, the 'Conspiracy Mentality Questionnaire' or 'CMQ' (originating in Bruder et al. 2013). The datasets also include personal-level demographic information (age, sex, country, etc.) as well as some country-level information about political and economic climate.

The ultimate aim is to use machine learning to model the data and predict scores on the CMQ (serving as a measure of conspiracy-susceptibility, or 'conspiracy mindset') based on other variables. In this notebook, I will begin the project by exploring the data and beginning to assess what kinds of cleaning and preprocessing will be required. I will also see if there are any initial insights that we can draw from the data.

#### The Conspiracy Mentality Questionnaire (CMQ)

I will construct a full data dictionary after doing some exploratory analysis, but it will be helpful to begin with a description of the CMQ, which is important for understanding the target variables we will be interested in. The CMQ consists of five statements for which respondents are asked to rate their level of confidence (from 0% for 'certainly not' to 100% for 'certain', in increments of 10%). The statements are as follows:

1. *I think that many very important things happen in the world, which the public is never informed about.*
2. *I think that politicians usually do not tell us the true motives for their decisions.*
3. *I think that government agencies closely monitor all citizens.*
4. *I think that events which superficially seem to lack a connection are often the result of secret activities.*
5. *I think that there are secret organizations that greatly influence political decisions.*

#### References

- Imhoff, R., Zimmer, F., Klein, O. et al. (2022) Conspiracy mentality and political orientation across 26 countries. Nat Hum Behav 6, 392–403. https://doi.org/10.1038/s41562-021-01258-7
- Bruder, M., Haffke, P., Neave N., Nouripanah, N. and Imhoff, R. (2013) Measuring individual differences in generic beliefs in conspiracy theories across cultures: Conspiracy Mentality Questionnaire. Front. Psychol. 4: 225. https://doi.org/10.3389/fpsyg.2013.00225

<a id = 'part1'></a>
### Part 1: Basic Information About the Data

We will start by loading in the Python libraries we'll be using in this notebook.

In [80]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import plotly.express as px
from scipy import stats

Next, we'll load in the data and look at the first few rows as well as the shape (i.e., the number of rows and columns).

In [81]:
# Loading the data for study 1 and looking at head
df1 = pd.read_csv('data/data_study1.csv')

# Setting option to display unlimited number of columns
pd.set_option('display.max_columns', None)

df1.head()

Unnamed: 0.1,Unnamed: 0,UID,Country,Sex,Age,Edu_high,Edu_low,Pol_Ori,ZmeanPO,ZPO,CMQ_1,CMQ_2,CMQ_3,CMQ_4,CMQ_5,CM4x,CM5x,CT_left,CT_right,CT_neutral,Winner_state,CHES_version,lrgen,lrecon,galtan,zlrgen,zlrecon,zgaltan,mean_lrgen,mean_lrecon,mean_galtan,zmean_lrgen,zmean_lrecon,zmean_galtan,lrgengov,lrecongov,galtangov,CPO
0,iceland.1,Iceland::1,Iceland,male,33.0,1.0,0.0,4.2,1.064809,-0.453397,11.0,10.0,1.0,7.0,1.0,5.0,6.0,7.0,4.0,1.0,,,,,,,,,,,,,,,,,,-0.827639
1,iceland.2,Iceland::2,Iceland,male,28.0,1.0,0.0,4.2,1.064809,-0.453397,7.0,7.0,4.0,3.0,5.0,4.75,5.2,1.0,1.0,2.5,0.0,,,,,,,,,,,,,,,,,-0.827639
2,iceland.3,Iceland::3,Iceland,female,35.0,0.0,1.0,,1.064809,,8.0,7.0,4.0,9.0,3.0,6.0,6.2,5.5,5.5,4.0,,,,,,,,,,,,,,,,,,
3,iceland.4,Iceland::4,Iceland,male,55.0,1.0,0.0,5.0,1.064809,-0.015141,5.0,8.0,6.0,4.0,3.0,4.5,5.2,4.0,1.0,1.0,1.0,,,,,,,,,,,,,,,,,-0.027639
4,iceland.5,Iceland::5,Iceland,male,52.0,0.0,1.0,5.8,1.064809,0.423114,8.0,6.0,5.0,7.0,6.0,6.5,6.4,5.5,4.0,4.0,0.0,,,,,,,,,,,,,,,,,0.772361


In [82]:
df1.shape

(40954, 38)

We see that the dataset from the first study contains ~41,000 rows across 38 columns. The first column is an unnamed column, which appears to be an old index. We'll remove this shortly. Let's also look at the information on the columns, including the data types.

In [83]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40954 entries, 0 to 40953
Data columns (total 38 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    40954 non-null  object 
 1   UID           40954 non-null  object 
 2   Country       40954 non-null  object 
 3   Sex           37629 non-null  object 
 4   Age           37555 non-null  float64
 5   Edu_high      31414 non-null  float64
 6   Edu_low       31414 non-null  float64
 7   Pol_Ori       38145 non-null  float64
 8   ZmeanPO       40954 non-null  float64
 9   ZPO           38145 non-null  float64
 10  CMQ_1         33670 non-null  float64
 11  CMQ_2         33641 non-null  float64
 12  CMQ_3         33592 non-null  float64
 13  CMQ_4         33447 non-null  float64
 14  CMQ_5         33505 non-null  float64
 15  CM4x          33195 non-null  float64
 16  CM5x          33152 non-null  float64
 17  CT_left       29555 non-null  float64
 18  CT_right      29639 non-nu

We see a mix of object data type and numeric (float64) columns. Some of these numeric columns are storing categorical information (e.g., `Edu_high` and `Edu_low` store binary information about education status). We can also see that there is a considerable amount of missing data, an issue we'll return to in the next section.
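
The binary columns are stored as `float64` only because they contain missing values. Here is a minimal sketch, on a toy frame rather than the real data, of converting such columns to pandas' nullable integer type so that they read as 0/1 flags while keeping the gaps. The values below are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the binary education columns (illustrative values only)
toy = pd.DataFrame({
    "Edu_high": [1.0, 0.0, np.nan, 1.0],
    "Edu_low": [0.0, 1.0, np.nan, 0.0],
})

# 'Int8' (capital I) is pandas' nullable integer dtype: NaN becomes <NA>
toy = toy.astype({"Edu_high": "Int8", "Edu_low": "Int8"})

print(toy.dtypes)
print(toy)
```

Whether this conversion is worth doing before modelling depends on the estimator, since some libraries expect plain NumPy floats.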

Because we are looking at data from two studies, we now need to repeat these steps for the dataset from the second study.

In [84]:
# Loading the data from the second study

df2 = pd.read_csv('data/data_study2.csv')

df2.head()

Unnamed: 0.1,Unnamed: 0,Weights,Sex,Age,Country,Pol_Ori,CMQ_1,CMQ_2,CMQ_3,CMQ_4,CMQ_5,Edu_low,Edu_high,Winner_state,CHES_version,lrgen,lrecon,galtan,zlrgen,zlrecon,zgaltan,ZPO,ZmeanPO,CM4x,CM5x,CPO
0,1,1.532339,Male,18.916667,Sweden,5.0,8.0,8.0,4.0,6.0,7.0,0.0,0.0,1.0,CHES17,3.888889,3.470588,4.411765,-0.79985,-0.780204,-0.336897,-0.449278,1.062604,6.25,6.6,-1.304476
1,2,0.997387,Male,25.75,Sweden,11.0,,,,,,0.0,1.0,1.0,CHES17,7.944445,8.411765,5.888889,0.837842,1.306571,0.191618,1.617198,1.062604,,,4.695524
2,3,6.050884,Female,19.916667,Sweden,11.0,11.0,11.0,11.0,9.0,11.0,0.0,0.0,0.0,CHES17,8.0,5.941176,8.944445,0.860276,0.263184,1.284895,1.617198,1.062604,10.5,10.6,4.695524
3,4,1.838405,Male,62.75,Sweden,8.0,,,,,,0.0,0.0,1.0,CHES17,7.944445,8.411765,5.888889,0.837842,1.306571,0.191618,0.58396,1.062604,,,1.695524
4,5,0.997387,Male,30.916667,Sweden,5.0,11.0,7.0,9.0,6.0,11.0,0.0,1.0,1.0,CHES17,3.888889,3.470588,4.411765,-0.79985,-0.780204,-0.336897,-0.449278,1.062604,8.25,8.8,-1.304476


In [85]:
df2.shape

(89576, 26)

In [86]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89576 entries, 0 to 89575
Data columns (total 26 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    89576 non-null  int64  
 1   Weights       66596 non-null  float64
 2   Sex           71913 non-null  object 
 3   Age           72988 non-null  float64
 4   Country       89576 non-null  object 
 5   Pol_Ori       73636 non-null  float64
 6   CMQ_1         53437 non-null  float64
 7   CMQ_2         52895 non-null  float64
 8   CMQ_3         52890 non-null  float64
 9   CMQ_4         52718 non-null  float64
 10  CMQ_5         52716 non-null  float64
 11  Edu_low       89507 non-null  float64
 12  Edu_high      89507 non-null  float64
 13  Winner_state  65087 non-null  float64
 14  CHES_version  89576 non-null  object 
 15  lrgen         55414 non-null  float64
 16  lrecon        55414 non-null  float64
 17  galtan        55414 non-null  float64
 18  zlrgen        55414 non-nu

The dataset from the second study contains ~90,000 rows and 26 columns. Again, we see that the first column appears to be an old index, which we can remove. We still see a mix of object and numeric (float64) datatypes, and a mix of numeric and categorical information, and again we see that there is a considerable amount of missing data, which we'll examine shortly.

Let's now remove the old indexes from both dataframes.

In [87]:
# 'columns=' already identifies the axis, so no 'axis' argument is needed
df1.drop(columns='Unnamed: 0', inplace=True)

# Sanity check
df1.head()

Unnamed: 0,UID,Country,Sex,Age,Edu_high,Edu_low,Pol_Ori,ZmeanPO,ZPO,CMQ_1,CMQ_2,CMQ_3,CMQ_4,CMQ_5,CM4x,CM5x,CT_left,CT_right,CT_neutral,Winner_state,CHES_version,lrgen,lrecon,galtan,zlrgen,zlrecon,zgaltan,mean_lrgen,mean_lrecon,mean_galtan,zmean_lrgen,zmean_lrecon,zmean_galtan,lrgengov,lrecongov,galtangov,CPO
0,Iceland::1,Iceland,male,33.0,1.0,0.0,4.2,1.064809,-0.453397,11.0,10.0,1.0,7.0,1.0,5.0,6.0,7.0,4.0,1.0,,,,,,,,,,,,,,,,,,-0.827639
1,Iceland::2,Iceland,male,28.0,1.0,0.0,4.2,1.064809,-0.453397,7.0,7.0,4.0,3.0,5.0,4.75,5.2,1.0,1.0,2.5,0.0,,,,,,,,,,,,,,,,,-0.827639
2,Iceland::3,Iceland,female,35.0,0.0,1.0,,1.064809,,8.0,7.0,4.0,9.0,3.0,6.0,6.2,5.5,5.5,4.0,,,,,,,,,,,,,,,,,,
3,Iceland::4,Iceland,male,55.0,1.0,0.0,5.0,1.064809,-0.015141,5.0,8.0,6.0,4.0,3.0,4.5,5.2,4.0,1.0,1.0,1.0,,,,,,,,,,,,,,,,,-0.027639
4,Iceland::5,Iceland,male,52.0,0.0,1.0,5.8,1.064809,0.423114,8.0,6.0,5.0,7.0,6.0,6.5,6.4,5.5,4.0,4.0,0.0,,,,,,,,,,,,,,,,,0.772361


In [88]:
# 'columns=' already identifies the axis, so no 'axis' argument is needed
df2.drop(columns='Unnamed: 0', inplace=True)

# Sanity check
df2.head()

Unnamed: 0,Weights,Sex,Age,Country,Pol_Ori,CMQ_1,CMQ_2,CMQ_3,CMQ_4,CMQ_5,Edu_low,Edu_high,Winner_state,CHES_version,lrgen,lrecon,galtan,zlrgen,zlrecon,zgaltan,ZPO,ZmeanPO,CM4x,CM5x,CPO
0,1.532339,Male,18.916667,Sweden,5.0,8.0,8.0,4.0,6.0,7.0,0.0,0.0,1.0,CHES17,3.888889,3.470588,4.411765,-0.79985,-0.780204,-0.336897,-0.449278,1.062604,6.25,6.6,-1.304476
1,0.997387,Male,25.75,Sweden,11.0,,,,,,0.0,1.0,1.0,CHES17,7.944445,8.411765,5.888889,0.837842,1.306571,0.191618,1.617198,1.062604,,,4.695524
2,6.050884,Female,19.916667,Sweden,11.0,11.0,11.0,11.0,9.0,11.0,0.0,0.0,0.0,CHES17,8.0,5.941176,8.944445,0.860276,0.263184,1.284895,1.617198,1.062604,10.5,10.6,4.695524
3,1.838405,Male,62.75,Sweden,8.0,,,,,,0.0,0.0,1.0,CHES17,7.944445,8.411765,5.888889,0.837842,1.306571,0.191618,0.58396,1.062604,,,1.695524
4,0.997387,Male,30.916667,Sweden,5.0,11.0,7.0,9.0,6.0,11.0,0.0,1.0,1.0,CHES17,3.888889,3.470588,4.411765,-0.79985,-0.780204,-0.336897,-0.449278,1.062604,8.25,8.8,-1.304476


We see that the old indexes have been removed. We'll now move on to looking at the null and missing values in the data.
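
An alternative to dropping the old index after loading is to tell `read_csv` to use the first column as the index. The sketch below uses a tiny made-up CSV string (not the project files) to show the difference.

```python
import io
import pandas as pd

# Hypothetical CSV text mimicking a file that was saved together with its index
csv_text = ",UID,Country\n0,Iceland::1,Iceland\n1,Iceland::2,Iceland\n"

# Default read: the saved index comes back as a column named 'Unnamed: 0'
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the row index instead
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

For the first dataset the old index holds labels like `iceland.1` rather than integers, so dropping it, as done above, may still be the simpler choice.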

<a id = 'part2'></a>
### Part 2: Exploring the Missing Data

Let's begin exploring the missing data by looking at the percentage of data missing for each column in each of the datasets.

#### 2.1 - Missing Values in the First Dataset

In [89]:
# Creating dataframe to show missing values in df1
missing_values_df1 = pd.DataFrame({'percent_missing': round(df1.isnull().mean(),4)*100})


# Displaying missing_values_df1 sorted by highest-to-lowest % missing
missing_values_df1.sort_values(by='percent_missing', ascending=False).T

Unnamed: 0,galtan,zlrgen,lrecon,lrgen,zlrecon,zgaltan,CT_neutral,Winner_state,CT_left,CT_right,Edu_high,Edu_low,CM5x,CM4x,CMQ_4,CMQ_5,CMQ_3,CMQ_2,CMQ_1,Age,Sex,lrgengov,galtangov,lrecongov,CPO,ZPO,Pol_Ori,CHES_version,mean_lrgen,mean_lrecon,mean_galtan,zmean_lrgen,zmean_lrecon,zmean_galtan,Country,ZmeanPO,UID
percent_missing,33.17,33.17,33.17,33.17,33.17,33.17,28.84,28.74,27.83,27.63,23.29,23.29,19.05,18.95,18.33,18.19,17.98,17.86,17.79,8.3,8.12,7.14,7.14,7.14,6.86,6.86,6.86,6.27,6.27,6.27,6.27,6.27,6.27,6.27,0.0,0.0,0.0


Here are some initial observations about the missing data:
- We see that we are missing relatively high percentages of the data (between 27.6% and 33.2%) for the following columns:

    - `galtan` (rating of political party preference on a *social* left-right scale)
    - `zlrgen` (*centered* rating of political party preference on a *general* left-right scale)
    - `lrecon` (rating of political party preference on an *economic* left-right scale)
    - `lrgen` (rating of political party preference on a *general* left-right scale)
    - `zlrecon` (*centered* rating of political party preference on an *economic* left-right scale)
    - `zgaltan` (*centered* rating of political party preference on a *social* left-right scale)
    - `CT_neutral` (endorsement of country-specific conspiracy theories categorized as politically 'neutral')
    - `Winner_state` (dummy variable representing whether respondent's preferred party was in power at time of survey)
    - `CT_left` (endorsement of country-specific conspiracy theories categorized as politically 'left')
    - `CT_right` (endorsement of country-specific conspiracy theories categorized as politically 'right')

    The general picture here is that for the first survey, we are missing a large portion of the data having to do with participants' political preferences and views.

- We are also missing about 23% of the data for `Edu_high` (a binary variable, where '1' represents that the respondent has a university degree) and `Edu_low` (a binary variable, where '1' represents that the respondent did not finish high school).

- Next, let's describe the missing values for the columns having to do with CMQ (remember: ultimately, our target variable will be one of, or a function of some of, these columns). We are missing about 19% of the data for `CM5x` (the CMQ score for all five items) as well as for `CM4x` (the CMQ score for four items, where the second question is excluded). Around 18% of the data is missing for responses to the individual CMQ questions.

- We're missing about 8% of the data for `Age` and `Sex`.

- We're missing about 7% of data for `lrgengov`, `galtangov`, and `lrecongov` (country-level variables representing the ruling party's location on three different left-right spectra: general, social, and economic, respectively).

- We're also missing about 7% of data from `Pol_Ori`, `CPO`, and `ZPO` (columns representing respondents' self-ratings on a left-right spectrum of political orientation, including variations involving a centered scale and Z-scores).

- About 6% of the data is missing from the following columns having to do with mean scores for the left-right political spectrum questions: `mean_lrgen`, `mean_lrecon`, `mean_galtan`, `zmean_lrgen`, `zmean_lrecon`, and `zmean_galtan`.

- Finally, 6% of the data is missing from `CHES_version`. This column represents the version of the 'Chapel Hill Expert Survey' used, a survey of experts which produces a codebook for estimating the political ideology of political parties. The CHES codebook was used to translate participants' proclaimed voting intentions or party preferences into comparable numerical ratings.
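
The descriptions of `CM5x` and `CM4x` above can be checked against the five rows displayed by `df1.head()`. The sketch below retypes those rows by hand and recomputes both summary scores as row means. It only confirms the relationship for these five rows; how the full dataset treats respondents with partially missing items is not something this check can tell us.

```python
import numpy as np
import pandas as pd

# CMQ columns for the first five rows, copied from the df1.head() output above
head = pd.DataFrame({
    "CMQ_1": [11.0, 7.0, 8.0, 5.0, 8.0],
    "CMQ_2": [10.0, 7.0, 7.0, 8.0, 6.0],
    "CMQ_3": [1.0, 4.0, 4.0, 6.0, 5.0],
    "CMQ_4": [7.0, 3.0, 9.0, 4.0, 7.0],
    "CMQ_5": [1.0, 5.0, 3.0, 3.0, 6.0],
    "CM4x": [5.0, 4.75, 6.0, 4.5, 6.5],
    "CM5x": [6.0, 5.2, 6.2, 5.2, 6.4],
})

items = ["CMQ_1", "CMQ_2", "CMQ_3", "CMQ_4", "CMQ_5"]
four_items = [col for col in items if col != "CMQ_2"]  # second item excluded

# Recompute the summary scores as simple row means
recomputed_5 = head[items].mean(axis=1)
recomputed_4 = head[four_items].mean(axis=1)

matches_5 = np.allclose(recomputed_5, head["CM5x"])
matches_4 = np.allclose(recomputed_4, head["CM4x"])
print(matches_5, matches_4)  # True True for these five rows
```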



Since the first survey covers 24 country-level samples (Belgium and Switzerland appear as language-specific samples), it could be helpful to see how much of the data, and which parts of it, are missing for each one.

In [90]:
# Creating dataframe for percent missing per column per country
missing_dict = {}
for country in df1['Country'].unique():
    missing_dict[country] = df1[df1['Country'] == country].isna().mean()*100

missing_by_country = pd.DataFrame(missing_dict)

# Displaying the 'percent missing by country' dataframe
display(missing_by_country)

Unnamed: 0,Iceland,"Belgium, FR",Bosnia,Brazil,"Switzerland, FR","Switzerland, GE",Croatia,Czech Republic,France,Greece,Hungary,Italy,Macedonia,Netherlands,Poland,Serbia,Spain,Turkey,Norway,Germany,Israel,Portugal,Romania,UK
UID,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Country,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Sex,0.0,0.0,0.0,0.0,0.0,0.0,1.123596,0.0,0.0,0.0,0.0,1.288245,0.0,15.618412,0.0,0.167504,0.0,39.674267,0.0,0.017042,6.590909,0.194932,0.0,0.066711
Age,0.0,0.0,0.0,0.0,0.0,0.0,1.605136,0.0,0.0,0.0,0.0,2.254428,0.0,15.735535,0.0,0.083752,1.178604,41.824104,0.0,0.017042,6.590909,0.389864,0.0,0.0
Edu_high,3.513909,0.0,0.0,0.0,3.4,0.0,1.605136,0.0,0.0,0.203252,0.0,1.288245,0.325203,15.794097,0.0,0.083752,0.0,39.543974,5.110173,100.0,8.409091,0.194932,1.408451,10.206805
Edu_low,3.513909,0.0,0.0,0.0,3.4,0.0,1.605136,0.0,0.0,0.203252,0.0,1.288245,0.325203,15.794097,0.0,0.083752,0.0,39.543974,5.110173,100.0,8.409091,0.194932,1.408451,10.206805
Pol_Ori,23.718887,0.0,0.0,0.0,0.2,0.0,0.321027,0.0,0.0,9.95935,0.0,5.152979,0.0,7.730148,23.748773,8.542714,3.898459,34.527687,1.687764,4.192229,6.590909,1.364522,2.253521,0.0
ZmeanPO,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ZPO,23.718887,0.0,0.0,0.0,0.2,0.0,0.321027,0.0,0.0,9.95935,0.0,5.152979,0.0,7.730148,23.748773,8.542714,3.898459,34.527687,1.687764,4.192229,6.590909,1.364522,2.253521,0.0
CMQ_1,0.732064,0.0,0.0,0.0,0.0,0.0,2.889246,0.0,0.0,0.0,0.0,5.152979,0.0,12.344811,22.178606,4.355109,0.99728,33.811075,4.125645,71.608725,5.0,0.194932,0.0,0.0


Let's visualize this information using a heatmap to get a sense of how the missing values are distributed across the countries and the columns.

In [91]:
# Displaying with interactive heatmap
fig = px.imshow(missing_by_country.T,
                title='Percentage of Data Missing for Each Column by Country in Study 1',
                labels={'color':'% Missing'},
                aspect="auto")
fig.update_xaxes(title='Column')
fig.update_layout(width=1000, height=650)
fig.show()

![missing_data.png](attachment:missing_data.png)

We can see some patterns in the missing data broken down by column and country:

- The percentage of missing values in certain columns for a country seems to be correlated with the percentage missing in other columns. For example, the columns having to do with political party preferences on the left-right spectrum all seem to have the same percentage of data missing within each country.
- Iceland, Brazil, and Israel are missing 100% of the data for the above-mentioned columns having to do with political party preferences. These countries, along with Romania, are also missing 100% of data for columns representing the ruling party's location on the left-right spectrum.
- Many other countries are missing relatively high proportions of the data for these political party preference columns (the highest being Croatia, which is missing ~85%).
- Some countries are missing a high percentage of the data for our target variables (having to do with the CMQ). Germany is missing ~72% of data for all CMQ-related columns, and Turkey is missing ~34% for those columns as well. Poland is missing ~35% of the data in the columns for the CMQ summary scores, `CM5x` and `CM4x`.
- Germany is also missing 100% of the data on education status as well as all of the data from the 'CT' columns, representing the respondent's endorsement of country-specific conspiracy theories. Israel is missing just `CT_neutral`, the version of this variable for 'politically neutral' conspiracy theories.
- Many countries are missing a high proportion of `Winner_state` (the highest being Croatia, with ~85% missing), which represents whether the respondent's preferred political party was currently in power.
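
The first observation, that some columns tend to be missing together, can be quantified by correlating missingness indicators. The sketch below uses a small made-up frame with column names borrowed from the dataset; the values and the resulting correlations are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame: lrgen and lrecon are always missing in the same rows
toy = pd.DataFrame({
    "lrgen": [np.nan, 3.9, np.nan, 7.9, 8.0, np.nan],
    "lrecon": [np.nan, 3.5, np.nan, 8.4, 5.9, np.nan],
    "Age": [33.0, np.nan, 35.0, 55.0, np.nan, 52.0],
})

# 1 where a value is missing, 0 otherwise
missing_flags = toy.isna().astype(int)

# Pearson correlation between the indicators:
# a value of 1 means two columns are always missing together
co_missing = missing_flags.corr()
print(co_missing.round(2))
```

Columns that are never missing have a constant indicator, so their correlations come out as `NaN`; it may be easiest to exclude them before calling `corr()`.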


Before proceeding with modelling the data, we will need to decide how to deal with the missing values. For certain modelling purposes, we might be able to exclude certain columns and/or rows; for others, we might need to impute the missing values.

#### 2.2 - Missing Values in the Second Dataset

Now we will repeat this process for the second study.

In [92]:
# Creating dataframe to show missing values in df2
missing_values_df2 = pd.DataFrame({'percent_missing': round(df2.isnull().mean(),4)*100})


# Displaying missing_values_df2 sorted by highest-to-lowest % missing
missing_values_df2.sort_values(by='percent_missing', ascending=False).T

Unnamed: 0,CM5x,CM4x,CMQ_4,CMQ_5,CMQ_3,CMQ_2,CMQ_1,lrgen,zgaltan,zlrecon,zlrgen,galtan,lrecon,Winner_state,Weights,Sex,Age,Pol_Ori,ZPO,CPO,Edu_high,Edu_low,CHES_version,ZmeanPO,Country
percent_missing,42.29,42.22,41.15,41.15,40.96,40.95,40.34,38.14,38.14,38.14,38.14,38.14,38.14,27.34,25.65,19.72,18.52,17.79,17.79,17.79,0.08,0.08,0.0,0.0,0.0


Again, we will take note of some of the patterns in the missing data:

- The second dataset is missing a higher proportion of the values for the columns related to the CMQ (between ~40-42%) compared to the first dataset.
- It is missing a relatively high proportion (~38%) of the values for `zgaltan`, `zlrecon`, `zlrgen`, `galtan`, `lrecon`, and `lrgen`. These columns all encode information about respondents' political party preferences on various versions of a left-right spectrum (economic, general, and social) and the related Z-scores.
- `Winner_state` (which again represents whether the respondent's preferred party was currently in power) is missing ~27% of the data.
- Basic demographic information in `Age` and `Sex` is missing for more rows in dataset 2 (~18.5 to 20%) than dataset 1.
- There is a `Weights` column for dataset 2 which did not exist in dataset 1. It is missing data in about 26% of rows.
- A much smaller amount of data is missing from `Edu_high` and `Edu_low` in this dataset (just under 0.1%).
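
If the `Weights` column turns out to hold survey weights (an assumption on my part, since I have not yet confirmed this against the paper's documentation), averages could be computed with those weights rather than treating every row equally. A minimal sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

# Toy scores and weights (illustrative values only)
toy = pd.DataFrame({
    "CM5x": [6.6, np.nan, 10.6, 8.8],
    "Weights": [1.5, 1.0, 6.0, np.nan],
})

# np.average does not skip NaN, so keep only rows with both values present
valid = toy.dropna(subset=["CM5x", "Weights"])

unweighted_mean = toy["CM5x"].mean()
weighted_mean = np.average(valid["CM5x"], weights=valid["Weights"])
print(round(unweighted_mean, 3), round(weighted_mean, 3))
```

Note that rows without a weight drop out of the weighted figure entirely, which matters here given how many rows lack one.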

Again, we would like to see how the missing data looks when broken down by country.

In [93]:
# Creating dataframe for percent missing per column per country
missing_dict2 = {}
for country in df2['Country'].unique():
    missing_dict2[country] = df2[df2['Country'] == country].isna().mean()*100

missing_by_country2 = pd.DataFrame(missing_dict2)

# Displaying the 'percent missing by country' dataframe
display(missing_by_country2)

Unnamed: 0,Sweden,Spain,Romania,Portugal,Poland,Italy,Hungary,Germany,France,Denmark,Belgium-Wallonia,Belgium-Flanders,Austria,Netherlands,UK
Weights,21.408046,11.85935,11.915297,37.976854,35.976419,47.531826,36.238045,23.032326,12.085952,48.071895,51.480836,40.459706,28.749271,4.30829,100.0
Sex,19.458128,9.770911,9.702908,34.869267,32.173913,17.79683,33.772582,19.624034,9.818244,44.248366,49.883856,39.241207,24.528302,2.957535,5.238971
Age,17.631363,7.863612,9.276233,34.483498,32.011791,14.679137,33.368757,17.550949,9.062341,43.954248,48.83856,38.631958,22.25248,1.977576,6.158088
Country,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Pol_Ori,16.543514,7.192328,8.40708,34.569224,31.215917,14.211484,33.241233,16.303584,8.603703,44.150327,46.806039,36.001108,20.404591,0.273682,23.345588
CMQ_1,38.526273,45.221097,33.86536,34.290613,45.23213,43.907508,33.113709,44.606465,42.585358,44.150327,46.893148,36.63805,54.055631,29.345811,29.6875
CMQ_2,38.875205,45.551412,35.603666,34.590656,46.779661,44.28423,35.090329,45.274069,43.027009,44.542484,46.864111,36.610357,54.716981,29.416439,29.779412
CMQ_3,38.834154,45.530101,35.176991,34.633519,47.25129,44.453105,34.941552,45.02811,42.899609,44.901961,47.0964,36.748823,54.464112,29.557694,29.595588
CMQ_4,39.018883,45.604688,35.824905,34.954994,47.605011,44.375162,35.557917,45.291637,43.035502,44.901961,47.0964,36.63805,54.83369,29.610665,29.6875
CMQ_5,39.08046,45.562067,35.935525,34.933562,47.840825,44.427124,35.621679,45.344343,43.010022,45.03268,46.922184,36.63805,54.755884,29.434096,29.6875


In [94]:
# Displaying missing data from dataset 2 with interactive heatmap
fig = px.imshow(missing_by_country2.T,
                title='Percentage of Data Missing for Each Column by Country in Study 2',
                labels={'color':'% Missing'},
                aspect="auto")
fig.update_xaxes(title='Column')
fig.update_layout(width=1000, height=650)
fig.show()

![missing_data2.png](attachment:missing_data2.png)

In our second dataset, we are seeing fewer country-specific patterns in the missing data. We still see correlations between proportions of data missing between certain columns per country. Some things to note:
- The UK is the only country with no value for `Weights`. It is also the only country with missing data in the `Edu_high` and `Edu_low` columns (about 6% missing).
- Romania is missing a high percentage (75.5%) of data for the columns recording party preference on the left-right spectrum (and their related Z-scores). Many countries are missing relatively high proportions of this data (~25-65%), with the Netherlands having the most complete data for these columns (about 6% missing).
- Austria is missing the highest percentage of data related to our target (the `CMQ_` item columns, each about 54-55% missing).

Let's just take note of how the volume of data would be affected if we were to simply drop the missing rows.

In [95]:
# Percentage of rows that remain after dropping every row with at least one null
df1_retained = round((df1.dropna().shape[0] / len(df1))*100, 2)
df2_retained = round((df2.dropna().shape[0] / len(df2))*100, 2)

print('First dataset before dropping null values:', len(df1), 'rows \nAfter:', df1.dropna().shape[0], 'rows')
print(f'Rows retained: {df1_retained}%\n')

print('Second dataset before dropping null values:', len(df2), 'rows \nAfter:', df2.dropna().shape[0], 'rows')
print(f'Rows retained: {df2_retained}%')

First dataset before dropping null values: 40954 rows 
After: 20114 rows
Rows retained: 49.11%

Second dataset before dropping null values: 89576 rows 
After: 35298 rows
Rows retained: 39.41%


We see that simply dropping all rows with missing data would leave us with only about 49% of the first dataset and 39% of the second, i.e., reductions of roughly 51% and 61%. In the future, we will investigate strategies for imputing the missing data where possible, or dropping columns with missing data in cases where we can ignore the information those columns store.
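
To make the trade-off concrete, here is a small sketch on a made-up frame comparing three dropping strategies: dropping any row with a null, dropping only rows where the target is null, and first discarding a sparse column we might not need. The column choices are hypothetical and only meant to illustrate the idea.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "CM5x": [6.0, 5.2, np.nan, 5.2, 6.4],
    "Age": [33.0, 28.0, 35.0, np.nan, 52.0],
    "galtan": [np.nan, np.nan, 4.4, 5.9, 8.9],
})

# Strategy 1: drop any row containing a null
rows_all = len(toy.dropna())

# Strategy 2: drop only rows where the target is null
rows_target = len(toy.dropna(subset=["CM5x"]))

# Strategy 3: discard a sparse column first, then drop remaining nulls
rows_model = len(toy.drop(columns="galtan").dropna())

print(rows_all, rows_target, rows_model)
```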

Next, we will begin some basic analysis of the dataset.

<a id = 'part3'></a>
### Part 3: Basic Analysis

While our substantive data analysis will require cleaning the data, we can nonetheless get a sense of some general patterns here.

Let's begin by looking at the relationship between conspiracy mentality and country. We can look at the average `CM5x` score for each country, and then plot those averages on a map. We'll repeat the process for each dataset.

In [96]:
# Group by country and average CM5x for df1

country_averages1 = pd.DataFrame({'Average CM Score Out of 10': df1.groupby('Country')['CM5x'].mean()-1})
country_averages1.sort_values(by='Average CM Score Out of 10', ascending=False)

Unnamed: 0_level_0,Average CM Score Out of 10
Country,Unnamed: 1_level_1
Turkey,7.243731
Portugal,7.243529
Hungary,7.018
Serbia,6.873796
"Belgium, FR",6.831086
Spain,6.802502
Israel,6.758852
Romania,6.756313
Bosnia,6.664106
Macedonia,6.466667


In [97]:
# Plotting average conspiracy score by country on map for df1

fig = px.choropleth(country_averages1,
                    locationmode='country names',
                    locations=country_averages1.index,
                    color='Average CM Score Out of 10',
                    color_continuous_scale='reds',
                    fitbounds="locations",
                    title='''

Average Conspiracy Mentality Score by Country <br>
<sup>This interactive map displays the average conspiracy mentality score for each country
in the first study from Imhoff et al. 2022.

                    '''
                    )
fig.update_traces(marker_line_width=.3)


fig.show()

![avg_CM_by_country_1-2.png](attachment:avg_CM_by_country_1-2.png)

The results here are interesting, although the patterns will not be clear until we investigate relationships with the other variables. We can note that Turkey is the country with the highest score, followed closely by Portugal (both around 7.2 out of 10). The country with the lowest score is the Netherlands (4.6). In fact, the Netherlands is the only country with an average score lower than 5, meaning that it is the only country where participants lean towards finding the statements in the CMQ to be unlikely to be true on average (a topic we'll return to shortly).
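
Some labels in the first dataset, such as 'Belgium, FR' and 'Switzerland, GE', name language-specific samples rather than countries, so a map keyed on country names may not be able to place them. Here is a minimal sketch, with a hypothetical mapping and made-up scores, of collapsing such labels before averaging:

```python
import pandas as pd

# Toy per-respondent scores using sample labels like those in the first dataset
toy = pd.DataFrame({
    "Country": ["Belgium, FR", "Switzerland, FR", "Switzerland, GE", "Iceland"],
    "CM5x": [7.8, 6.0, 5.0, 5.5],
})

# Hypothetical mapping from sample labels to plain country names
label_to_country = {
    "Belgium, FR": "Belgium",
    "Switzerland, FR": "Switzerland",
    "Switzerland, GE": "Switzerland",
}

# Labels not in the mapping (e.g., 'Iceland') are left unchanged
toy["Country_name"] = toy["Country"].replace(label_to_country)

# Same 'minus 1' rescaling as in the cell above
by_country = toy.groupby("Country_name")["CM5x"].mean() - 1
print(by_country)
```

Pooling respondents this way weights each language sample by its size, and a French-speaking Belgian sample alone may not represent the whole country, so the merged figures should be read with care.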

We'll now repeat the process for the second dataset.

In [98]:
# Group by country and average CM5x

country_averages2 = pd.DataFrame({'Average CM Score Out of 10': df2.groupby('Country')['CM5x'].mean()-1})
country_averages2.sort_values(by='Average CM Score Out of 10', ascending=False)

Unnamed: 0_level_0,Average CM Score Out of 10
Country,Unnamed: 1_level_1
Portugal,7.037971
Spain,6.961841
Romania,6.733859
Hungary,6.551568
Poland,6.409459
Belgium-Wallonia,6.385857
Belgium-Flanders,6.213375
France,5.931092
UK,5.852048
Denmark,5.768523


In [99]:
# Plotting average conspiracy score by country on map for df2

fig = px.choropleth(country_averages2,
                    locationmode='country names',
                    locations=country_averages2.index,
                    color='Average CM Score Out of 10',
                    color_continuous_scale='reds',
                    fitbounds="locations",
                    scope='europe',
                    title='''

Average Conspiracy Mentality Score by Country <br>
<sup>This interactive map displays the average conspiracy mentality score for each country
in the second study from Imhoff et al. 2022.

                    '''
                    )
fig.update_traces(marker_line_width=.3)

fig.show()

![avg_CM_by_country2.png](attachment:avg_CM_by_country2.png)

The second survey involves fewer countries than the first, so we do not see results for initially high-scoring countries like Turkey and Serbia here. Portugal again rates very high on the CMQ, followed closely by Spain. The Netherlands again rates the lowest, and is again the only country with an average score below 5.

Let's see what the mean and median `CM5x` score is for both datasets.

In [100]:
# Average CM5x for each dataset (note: subtracting 1 to put on a 10 point scale)

print('Dataset 1 mean CMQ score:', df1['CM5x'].mean()-1)
print('Dataset 1 median CMQ score:', df1['CM5x'].median()-1,'\n')
print('Dataset 2 mean CMQ score:', df2['CM5x'].mean()-1)
print('Dataset 2 median CMQ score:', df2['CM5x'].median()-1,'\n')

Dataset 1 mean CMQ score: 5.5293038797726295
Dataset 1 median CMQ score: 5.6 

Dataset 2 mean CMQ score: 5.94661353040181
Dataset 2 median CMQ score: 6.0 



Notably, the mean and median CMQ scores for both datasets are above 5, and close to 6 for the second dataset. This means that on average, respondents are more willing to say they find the statements somewhat likely to be true than that they are unlikely to be true.
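
The 'subtract 1' step deserves a short illustration. Judging from values such as 11.0 in the displayed rows, the responses appear to be coded from 1 to 11 (an inference from the data, not something I have confirmed in a codebook). Under that assumption, the codes map onto the 0-10 scale used here and onto the 0%-100% wording of the questionnaire as follows:

```python
import pandas as pd

# Example raw codes (assumed range: 1 to 11); 6.6 stands in for an averaged score
codes = pd.Series([1.0, 6.0, 11.0, 6.6])

ten_point = codes - 1        # 0 to 10, the scale reported in this notebook
percent = (codes - 1) * 10   # 0% to 100%, as worded in the questionnaire

print(pd.DataFrame({"code": codes, "ten_point": ten_point, "percent": percent}))
```

On this reading, a rescaled score of 5 corresponds to the 50% midpoint, which is why scores above 5 are interpreted as leaning towards 'likely'.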

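Let's also see how these values break down by sex. Note that the cell below reports `CM5x` on its original coding, without the subtraction of 1 used above, so these figures run one point higher than the 'out of 10' scores.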

In [101]:
print('Mean scores by sex in df1')
print(df1.groupby('Sex')['CM5x'].mean())

print('\nMedian scores by sex in df1')
print(df1.groupby('Sex')['CM5x'].median())

print('\nMean scores by sex in df2')
print(df2.groupby('Sex')['CM5x'].mean())

print('\nMedian scores by sex in df2')
print(df2.groupby('Sex')['CM5x'].median())


Mean scores by sex in df1
Sex
female    6.422864
male      6.701631
other     6.104651
Name: CM5x, dtype: float64

Median scores by sex in df1
Sex
female    6.6
male      6.8
other     6.3
Name: CM5x, dtype: float64

Mean scores by sex in df2
Sex
Female    7.177107
Male      6.831201
Name: CM5x, dtype: float64

Median scores by sex in df2
Sex
Female    7.2
Male      6.8
Name: CM5x, dtype: float64


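Interestingly, in the first dataset male respondents scored slightly higher than female respondents (who in turn scored higher than respondents who selected 'other'). In the second survey, the pattern is reversed: female respondents scored higher (and 'other' was no longer an option). Note that the by-sex values above are on the raw scale (1 has not been subtracted here), so they sit about one point higher than the overall averages reported earlier.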
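One way to report the by-sex figures on the same shifted scale as the overall averages is to shift the column once before grouping. Below is a minimal sketch of that idea. The small DataFrame `toy` is made up and stands in for `df1`, and the helper name `summarize_by_group` is my own, not something defined earlier in the notebook.

```python
import pandas as pd

# Made-up stand-in for df1; the real data has many more rows and columns.
toy = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "other", None],
    "CM5x": [6.0, 7.0, 8.0, 5.0, 6.5, 9.0],
})

def summarize_by_group(df, group_col="Sex", score_col="CM5x", shift=-1.0):
    """Mean, median and count of the shifted score for each group."""
    shifted = df[score_col] + shift
    # groupby drops rows whose group label is missing by default
    return shifted.groupby(df[group_col]).agg(["mean", "median", "count"])

summary = summarize_by_group(toy)
print(summary)
```

Including the count alongside the mean and median also makes it easy to see how small a category such as 'other' is before reading much into its average.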

Let's quickly see if these patterns are statistically significant.

$H_0$: the mean conspiracy mentalities among the different sex categories in the population are equal.

$H_1$: the mean conspiracy mentalities among the sex categories in the population are not equal.

In [102]:
# Preparing data for ANOVA test

anova_data = {}
sex_types = df1["Sex"].unique()

for sex in sex_types:
    anova_data[sex] = df1.loc[df1["Sex"] == sex, "CM5x"].dropna()

# Running the test

stats.f_oneway(anova_data["male"], 
               anova_data["female"], 
               anova_data["other"])

F_onewayResult(statistic=62.92405385698298, pvalue=5.311636281739257e-28)

We see that the result here is statistically significant at the .05 level (and at stricter significance levels as well). Our test suggests that we can reject the null hypothesis that the population means are the same across sexes. We can do the same for the second dataset.

In [103]:
# Preparing data for ANOVA test

anova_data2 = {}
sex_types = df2["Sex"].unique()

for sex in sex_types:
    anova_data2[sex] = df2.loc[df2["Sex"] == sex, "CM5x"].dropna()

# Running the test

stats.f_oneway(anova_data2["Male"], 
               anova_data2["Female"])

F_onewayResult(statistic=305.8878414921635, pvalue=2.7401679735184845e-68)

Again, our p-value is below the .05 threshold as well as far stricter thresholds. We can tentatively reject the null hypothesis that the means are the same in the population. The reversal in direction between the two surveys could reflect a change in male and female attitudes. But it could also reflect the inclusion of a third category ('other') in survey 1 but not survey 2, which is perhaps affecting the data.
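Two points may help when interpreting a two-group result like this one. First, a one-way ANOVA on exactly two groups is equivalent to a pooled-variance two-sample t-test (the F statistic equals the squared t statistic). Second, with samples this large, a very small p-value says little about how big the difference is, so an effect size such as eta squared is worth reporting alongside it. The sketch below illustrates both on synthetic data. The group means, spreads and sizes are invented, and `eta_squared` is a helper I wrote for illustration, not a SciPy function.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic scores for two groups; these are not the survey data.
group_a = rng.normal(loc=7.2, scale=2.0, size=500)
group_b = rng.normal(loc=6.8, scale=2.0, size=500)

def eta_squared(*groups):
    """Between-group sum of squares divided by total sum of squares."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total

f_stat, f_p = stats.f_oneway(group_a, group_b)
# ttest_ind assumes equal variances by default, matching the ANOVA assumption
t_stat, t_p = stats.ttest_ind(group_a, group_b)

print(f"F = {f_stat:.3f}, t^2 = {t_stat ** 2:.3f}")
print(f"p (ANOVA) = {f_p:.4g}, p (t-test) = {t_p:.4g}")
print(f"eta squared = {eta_squared(group_a, group_b):.4f}")
```

Eta squared is the share of total variance in the score that is associated with group membership, so a value near zero would indicate that the difference, while detectable, is small in practical terms.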

Let's also look at the average and median score for each statement in the CMQ.

In [104]:
# Average CMQ scores by question

for i in range(1,6):
    print(f'Dataset 1 mean CMQ_{i} score:', df1['CMQ_'+str(i)].mean()-1)
    print(f'Dataset 1 median CMQ_{i} score:', df1['CMQ_'+str(i)].median()-1,'\n')
    print(f'Dataset 2 mean CMQ_{i} score:', df2['CMQ_'+str(i)].mean()-1)
    print(f'Dataset 2 median CMQ_{i} score:', df2['CMQ_'+str(i)].median()-1,'\n')

Dataset 1 mean CMQ_1 score: 6.831420321420321
Dataset 1 median CMQ_1 score: 7.0 

Dataset 2 mean CMQ_1 score: 6.952635814136273
Dataset 2 median CMQ_1 score: 7.0 

Dataset 1 mean CMQ_2 score: 6.812729506653588
Dataset 1 median CMQ_2 score: 7.0 

Dataset 2 mean CMQ_2 score: 7.164571320540693
Dataset 2 median CMQ_2 score: 7.0 

Dataset 1 mean CMQ_3 score: 4.674315313169801
Dataset 1 median CMQ_3 score: 5.0 

Dataset 2 mean CMQ_3 score: 5.307487237663074
Dataset 2 median CMQ_3 score: 6.0 

Dataset 1 mean CMQ_4 score: 4.287526202316766
Dataset 1 median CMQ_4 score: 4.0 

Dataset 2 mean CMQ_4 score: 4.891422284608672
Dataset 2 median CMQ_4 score: 5.0 

Dataset 1 mean CMQ_5 score: 5.068313850337429
Dataset 1 median CMQ_5 score: 5.0 

Dataset 2 mean CMQ_5 score: 5.476970938614462
Dataset 2 median CMQ_5 score: 6.0 



For reference, recall that the statements are as follows:
1. *I think that many very important things happen in the world, which the public is never informed about.*
2. *I think that politicians usually do not tell us the true motives for their decisions.*
3. *I think that government agencies closely monitor all citizens.*
4. *I think that events which superficially seem to lack a connection are often the result of secret activities.*
5. *I think that there are secret organizations that greatly influence political decisions.*

We see quite high levels of mean and median confidence in statements 1 and 2 (around 7). However, there is a case to be made that these statements have interpretations which we would not ordinarily understand as reflecting high conspiracy mentality. For example, someone who believes that politicians make the policy choices they do because they think it will help them to get elected or stay in power (rather than because they truly believe them to be optimal) might express high confidence in statement 2, but it seems odd to characterize such a response as high in conspiracy mentality.

In future data analysis, we should investigate further whether we can correlate these columns with the independent measures of conspiracy mentality (i.e., with the `CT_` columns, reflecting endorsement of country-specific conspiracy theories). Perhaps we can make a case that the best version of our target variable is not a straightforward average of the CMQ responses.
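As a rough illustration of what that check could look like, the sketch below correlates each CMQ item with a single composite of conspiracy-theory endorsement. Everything here is synthetic: the column `CT_mean` is a hypothetical composite (the actual `CT_` column names are not assumed), and the item loadings are invented purely so the output has something to show. It says nothing about what the real data will reveal.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
latent = rng.normal(size=n)  # synthetic underlying trait

# Invented loadings: how strongly each synthetic item tracks the latent trait.
loadings = {"CMQ_1": 0.2, "CMQ_2": 0.2, "CMQ_3": 0.8, "CMQ_4": 0.8, "CMQ_5": 0.8}
toy = pd.DataFrame(
    {col: w * latent + rng.normal(size=n) for col, w in loadings.items()}
)
# Hypothetical composite standing in for an average of the CT_ columns
toy["CT_mean"] = latent + rng.normal(scale=0.5, size=n)

item_cols = list(loadings)
# Spearman rank correlation, since survey responses are ordinal
item_vs_ct = toy[item_cols].corrwith(toy["CT_mean"], method="spearman")
print(item_vs_ct.round(2))
```

If some items turned out to correlate much more weakly with the independent measure than others, that would be one piece of evidence for weighting or dropping them when constructing the target.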

<a id = 'conclusion'></a>
### Conclusion and Next Steps

The main conclusions of our initial EDA are as follows:

- There is a large amount of missing data in various columns, and there are patterns in the missing data by country (especially in the first dataset).
- We can see some interesting differences in the average CMQ scores across countries and between sexes.
- The average scores differ across the CMQ items, and this perhaps reflects variability in the validity of each statement as a measure of conspiracy mentality.

The next steps we plan to take:

- Further investigate and decide how to handle missing data (what can be imputed and how, and what cannot).
- Determine initial modeling approaches.
- Decide how to encode the target variable (should it be a straight average, or an average of certain questions? Are there further patterns in the CMQ data which would support, e.g., turning it into a binary variable of 'low/high' score?)
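To make the target-encoding options concrete, here is a minimal sketch of three candidates: a straight average of all five items, an average of a subset, and a binary 'low/high' split at the sample median. The data are made up, the choice of items 3 to 5 as the subset is hypothetical, and the median is only one possible cut point.

```python
import pandas as pd

# Made-up responses for four respondents; not the survey data.
toy = pd.DataFrame({
    "CMQ_1": [8, 9, 7, 10],
    "CMQ_2": [9, 8, 6, 10],
    "CMQ_3": [2, 7, 4, 9],
    "CMQ_4": [1, 6, 5, 8],
    "CMQ_5": [3, 8, 3, 9],
})

all_items = [f"CMQ_{i}" for i in range(1, 6)]
subset_items = ["CMQ_3", "CMQ_4", "CMQ_5"]  # hypothetical choice of subset

targets = pd.DataFrame({
    "mean_all": toy[all_items].mean(axis=1),
    "mean_subset": toy[subset_items].mean(axis=1),
})
# Binary encoding: 1 if the respondent's overall mean is above the sample median
targets["high_cm"] = (
    targets["mean_all"] > targets["mean_all"].median()
).astype(int)
print(targets)
```

A continuous target keeps more information and suits regression models, while a binary target turns the problem into classification at the cost of treating respondents just above and just below the cut point as very different.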