<br><br><font color="gray">DOING COMPUTATIONAL SOCIAL SCIENCE<br>MODULE 5 <strong>PROBLEM SETS</strong></font>

# <font color="#49699E" size=40>MODULE 5 </font>


# What You Need to Know Before Getting Started

- **Every notebook assignment has an accompanying quiz**. Your work in each notebook assignment will serve as the basis for your quiz answers.
- **You can consult any resources you want when completing these exercises and problems**. Just as in the "real world," if you can't figure out how to do something, look it up. My recommendation is that you check the relevant parts of the assigned reading or search for inspiration on [https://stackoverflow.com](https://stackoverflow.com).
- **Each problem is worth 1 point**. All problems are equally weighted.
- **The information you need for each problem set is provided in the blue and green cells.** General instructions / the problem set preamble are in the blue cells, and instructions for specific problems are in the green cells. **You have to execute all of the code in the problem set, but you are only responsible for entering code into the code cells that immediately follow a green cell**. You will also recognize those cells because they will be incomplete. You need to replace each blank `▰▰#▰▰` with the code that will make the cell execute properly (where # is a sequentially-increasing integer, one for each blank).
- Most modules will contain at least one question that requires you to load data from disk; **it is up to you to locate the data, place it in an appropriate directory on your local machine, and replace any instances of the `PATH_TO_DATA` variable with a path to the directory containing the relevant data**.
- **The comments in the problem cells contain clues indicating what the following line of code is supposed to do.** Use these comments as a guide when filling in the blanks. 
- **You can ask for help**. 

Finally, remember that you do not need to "master" this content before moving on to other course materials, as what is introduced here is reinforced throughout the rest of the course. You will have plenty of time to practice and cement your new knowledge and skills.
<div class='alert alert-block alert-danger'>As you complete this assignment, you may encounter variables that can be assigned a wide variety of different names. Rather than forcing you to employ a particular convention, we leave the naming of these variables up to you. During the quiz, submit an answer of 'USER_DEFINED' (without the quotation marks) to fill in any blank that you assigned an arbitrary name to. In most circumstances, this will occur due to the presence of a local iterator in a for-loop.</div>
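
The `PATH_TO_DATA` variable mentioned above is not defined anywhere in this notebook, so you will need to create it yourself. Here is a minimal sketch, assuming the data files sit in a folder called `data` next to the notebook (that folder name is a placeholder, not a course requirement). Because later cells join the path with the `/` operator, the sketch uses a `pathlib.Path` rather than a plain string.

```python
from pathlib import Path

# Hypothetical location: replace "data" with the folder where you saved the files
PATH_TO_DATA = Path("data")

# The / operator joins a Path with a file name, as the later cells do
vdem_path = PATH_TO_DATA / "vdem_subset.csv"
print(vdem_path)
```

If `PATH_TO_DATA` were a plain string, the `/` operator would raise a `TypeError`, which is why a `Path` object is the safer choice here.
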

## Package Imports

In [1]:
import pandas as pd
import numpy as np
from pprint import pprint

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples



import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%config Completer.use_jedi = False

## Defaults

In [2]:
seed = 7

## Problem 1:
<div class="alert alert-block alert-info">  
Let's dig into our subsetted data. Notice that many of the variables have a range (min to max) of 0 to 1. In the next exercises we'll be making some comparisons between variables, so let's simplify those comparisons by selecting only the meta-data columns and columns with a range of 0 to 1.
</div>
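
To make the idea of "columns with a range of 0 to 1" concrete, here is a small sketch on made-up data. The dataframe `toy` and its column names are invented for illustration, and the vectorized approach shown is just one way to express the check; it is not the loop you are asked to complete below.

```python
import pandas as pd

# Made-up data: columns "a" and "c" stay within [0, 1], column "b" does not
toy = pd.DataFrame({
    "a": [0.0, 0.5, 1.0],
    "b": [-1.2, 0.3, 2.4],
    "c": [0.1, 0.2, 0.9],
})

# One boolean per column: True when every value lies between 0 and 1
in_unit_range = (toy.min() >= 0) & (toy.max() <= 1)

# Keep only the names of the columns that passed the check
unit_columns = in_unit_range[in_unit_range].index.tolist()
print(unit_columns)  # ['a', 'c']
```
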

<div class="alert alert-block alert-success">
Create a new list to filter the dataframe columns, initializing it with the names of the three meta-data columns. Add to that list the names of columns from vd_index_vars, but only if the data in those columns falls within the required range. Subset the dataframe using this updated list. Fill in the blanks to continue.
</div>

In [None]:
df = pd.read_csv(PATH_TO_DATA/'vdem_subset.csv', low_memory=False, index_col=0)

vd_meta_vars = ['country_name', 'year', 'e_regiongeo']
vd_index_vars = ['v2x_freexp_altinf', 'v2x_frassoc_thick', 'v2x_suffr', 'v2xel_frefair', 'v2x_elecoff',    # electoral democracy index
              'v2xcl_rol', 'v2x_jucon', 'v2xlg_legcon',                                                 # liberal democracy index
              'v2x_cspart', 'v2xdd_dd', 'v2xel_locelec', 'v2xel_regelec', 'v2x_polyarchy',              # participatory democracy index
              'v2dlreason', 'v2dlcommon', 'v2dlcountr', 'v2dlconslt', 'v2dlengage',                     # deliberative democracy index
              'v2xeg_eqprotec', 'v2xeg_eqaccess', 'v2xeg_eqdr']   
sdf = df[vd_meta_vars + vd_index_vars]


sub_vd_indices = []
# iterate over the columns of vd_index_vars
▰▰1▰▰ column in vd_index_vars:
    # filter out columns that have values greater than 1 or less than 0
    ▰▰2▰▰ sdf[column].min() >= 0 ▰▰3▰▰ sdf[column].max() <= 1:
        # add columns that pass the filter to a list of such columns
        sub_vd_indices.append(▰▰4▰▰)

# create a new dataframe consisting of only those columns saved in sub_vd_indices and vd_meta_vars
fsdf = sdf[vd_meta_vars ▰▰5▰▰ sub_vd_indices]

fsdf.describe()

## Problem 2:

<div class="alert alert-block alert-info">
In this problem, we will continue to compute some descriptive statistics for our subsetted data and create some visualizations. We need a list that has the column names for just the variables, so we can re-use the 'sub_vd_indices' list from the previous problem. This code block will reshape our dataframe so that it's easy to generate a single plot with all of the variables.
</div>
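
If the reshaping step feels opaque, the sketch below shows what `melt` does to a tiny made-up dataframe, and what an empirical cumulative distribution actually is. The names `wide`, `long_df`, and `ecdf` are invented for this illustration; seaborn computes the curve for you, so the helper is only there to show the arithmetic.

```python
import numpy as np
import pandas as pd

# Made-up wide data: one column per variable
wide = pd.DataFrame({"index_a": [0.2, 0.8, 0.5], "index_b": [0.1, 0.1, 0.9]})

# Long format: one row per (variable, value) pair
long_df = wide.melt(value_vars=["index_a", "index_b"],
                    var_name="vd_index", value_name="score")

def ecdf(values):
    """Return sorted values and the proportion of observations at or below each."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# ECDF for a single variable, selected from the long-format frame
x_a, y_a = ecdf(long_df.loc[long_df["vd_index"] == "index_a", "score"])
print(long_df)
print(x_a, y_a)
```
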

<div class="alert alert-block alert-success">
Use the <code>fsdf_ecdf</code> dataframe to create an empirical cumulative distribution plot of the indicator variables, using hue to differentiate them.
</div>

In [None]:
fsdf_ecdf = fsdf[sub_vd_indices]
fsdf_ecdf = fsdf_ecdf.melt(value_vars = sub_vd_indices, var_name = 'vd_index', value_name = 'score')

# create a new matplotlib figure
figure = ▰▰1▰▰.figure(figsize=(10,6))
# use seaborn to create the plot
ax = sns.▰▰2▰▰(fsdf_ecdf, x = '▰▰3▰▰', hue = '▰▰4▰▰', kind = "ecdf")
ax.set(xlabel='Score', ylabel='Proportion', title="ECDF for VDEM Indicator Variables")
figure.show()

## Problem 3:
<div class="alert alert-block alert-info">
We can see that most of the variables follow a fairly smooth curve, while a few see dramatic proportion increases in a step-like way. This indicates that although we might consider these to be continuous variables, the measurement that produced them had some discrete (interval-like) qualities.
</div>
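
One rough way to back up a visual impression of "step-like" versus "smooth" is to count how many distinct values each variable takes. The sketch below uses simulated data, not the V-Dem variables, and the threshold you would apply is a judgment call.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated data: "smooth" is continuous, "stepped" is the same data snapped to a coarse grid
toy = pd.DataFrame({"smooth": rng.random(1000)})
toy["stepped"] = (toy["smooth"] * 4).round() / 4

# A variable with very few distinct values will produce a staircase-shaped ECDF
distinct = toy.nunique()
print(distinct)
```
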

<div class="alert alert-block alert-success">
Select one variable that seems to have a relatively smooth distribution and another that has a distinctly step-like distribution. Fill their names into the underscore blanks (<code>__A__</code> and <code>__B__</code>) for the <code>x</code> and <code>y</code> arguments of the seaborn plotting call in the code block below. There are a number of options to choose from for each, so the distinction between the two is what matters. Produce a bivariate kernel density estimation plot, with a rug, for the two selected variables. Fill in the blanks to continue.
</div>

In [None]:
# create a new matplotlib figure
figure = ▰▰1▰▰.figure()
# use seaborn to create the plot
ax = sns.▰▰2▰▰(fsdf, x=__A__, y=__B__, kind="▰▰3▰▰", rug = ▰▰4▰▰, rug_kws = {"alpha": 0.01})
sns.despine()
ax.set(xlabel='frassoc_thick', ylabel='polyarchy')
figure.show()


## Problem 4:

<div class="alert alert-block alert-info">
Now we'll create a correlation matrix (2D array) of all the variables and plot it in a heatmap to see which pairs of variables are most and least correlated. 
<br><br>
We'll use a boolean mask to clean up the heatmap. Remember that a boolean mask is an array of "True" and "False" values, the same size and shape as the data array. In a seaborn heatmap, a value of "True" indicates that the corresponding cell of the data array should be hidden. In this case, the mask will hide all values on the top-left -> bottom-right diagonal and above that diagonal.
</div>
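
To see what such a mask looks like, here is a small sketch that builds one by comparing row and column positions on a made-up 3x3 correlation matrix. This is a deliberately different construction from the one in the exercise below, and the names `toy`, `corr`, and `visible` are invented for illustration. In seaborn's heatmap, cells where the mask is `True` are the ones that are not drawn.

```python
import numpy as np
import pandas as pd

# Made-up data with three variables
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 1.0, 4.0, 3.0],
    "c": [4.0, 3.0, 2.0, 1.0],
})
corr = toy.corr()

# Row and column position of every cell in the matrix
rows, cols = np.indices(corr.shape)

# True on the diagonal and above it: these are the cells to hide
mask = cols >= rows

# DataFrame.mask replaces cells where the condition is True with NaN,
# which previews the cells that would remain visible
visible = corr.mask(mask)
print(mask)
print(visible)
```

Only the three cells below the diagonal survive, which is all we need, because a correlation matrix is symmetric and its diagonal is always 1.
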

<div class="alert alert-block alert-success">
Create a heatmap of the correlation matrix. Use the heatmap to select a few correlations to print. 
Looking at the heatmap above, select 2 pairs of variables (4 variables total) that appear highly correlated with each other. Then, select another 2 pairs of variables (4 variables total) that appear minimally correlated with each other. Create two lists, one for the first element of each variable pair and one for the second element of each pair. These lists should be aligned. Replace the underscore blanks (<code>__A__</code> and <code>__B__</code>) with your aligned lists of variables. Jointly iterate over the two lists and print the resulting Pearson correlations between each variable pair. 
</div>

In [None]:
fsdf_corr = fsdf[sub_vd_indices].corr()
# create the upper triangular mask
mask = np.▰▰1▰▰(np.▰▰2▰▰(fsdf_corr, dtype = bool))
figure = plt.figure()
# create the masked heatmap
ax = sns.▰▰3▰▰(data = fsdf_corr, ▰▰4▰▰ = ▰▰5▰▰)
figure.show()

var_1_list = __A__
var_2_list = __B__

for v1, v2 in ▰▰6▰▰(var_1_list, var_2_list):
    result = ▰▰7▰▰[v1].▰▰8▰▰(▰▰9▰▰[v2])
    print('Correlation of ' + v1 + ' and ' + v2 + ' : ' + str(▰▰10▰▰))


## Problem 5:
<div class="alert alert-block alert-info">  
For the remainder of the assignment, we're going to be working with data from the European Values Study (EVS). The data were collected by interviewers in most European nations in 2017. We'll start by loading a subset of the EVS and then standardizing each of the variables. Standardized data is a <i>sine qua non</i> when working with latent variables!
</div>
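
Standardizing a variable means subtracting its mean and dividing by its standard deviation, so that every column ends up with mean 0 and standard deviation 1. The sketch below does this by hand with Numpy on made-up numbers, purely to show the arithmetic; it assumes the population standard deviation (Numpy's default, `ddof=0`), and it is not the scikit-learn approach the exercise asks for.

```python
import numpy as np

# Made-up data: two columns on very different scales
values = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 60.0],
])

# z-score each column: subtract the column mean, divide by the column standard deviation
z = (values - values.mean(axis=0)) / values.std(axis=0)

print(z.mean(axis=0))  # approximately 0 for each column
print(z.std(axis=0))   # approximately 1 for each column
```
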
<div class="alert alert-block alert-success">
Standardize the data present in the columns of the <code>evs_df</code> dataframe. To accomplish this, use the <code>StandardScaler</code> class from the scikit-learn package. 
</div>

In [None]:
evs_df = pd.read_csv(PATH_TO_DATA/"evs_subset.csv")

country_index = evs_df['country'].to_numpy()

# Drop the country column, as it cannot be standardized
evs_df = evs_df.▰▰1▰▰("country", axis=▰▰2▰▰)

# Standardize the data
X = ▰▰3▰▰().▰▰4▰▰(evs_df)

X

## Problem 6:
<div class="alert alert-block alert-info">
Even though our input to the StandardScaler in the previous question was a dataframe, you may have noticed that the resulting output was a Numpy array. Fortunately, Pandas and Numpy have been built to interoperate with one another smoothly. Pandas Dataframes and multidimensional Numpy arrays are interoperable (although each offers a different set of features); the same goes for Pandas Series and unidimensional Numpy arrays. Let's return to the 'evs_df' dataframe we made in the previous problem; it will be useful for exploring how Numpy handles multidimensional arrays.
</div>
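
As a quick illustration of that interoperability, the sketch below moves a made-up dataframe into Numpy and back again. The column names `q1` and `q2` are invented, and `to_numpy()` is only one of several ways to do the conversion.

```python
import numpy as np
import pandas as pd

# Made-up survey-style data
toy = pd.DataFrame({"q1": [1, 3, 5], "q2": [2, 4, 4]})

as_array = toy.to_numpy()         # DataFrame -> 2D array, shape (3, 2)
as_vector = toy["q1"].to_numpy()  # Series -> 1D array, shape (3,)

# Arithmetic on an array is applied to every element at once
shifted = as_array - 1

# Wrap the result back into a dataframe, reusing the original column names
round_trip = pd.DataFrame(shifted, columns=toy.columns)
print(round_trip)
```
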
<div class="alert alert-block alert-success">
In this exercise, we're going to turn a subset of our Pandas dataframe (including only variables v145, v146, v147, and v148) into a multidimensional Numpy array, and convert every number it contains into a whole-number percentage (which we can accomplish by dividing by 5, multiplying by 100, and rounding to the nearest whole number). Fill in the blanks to continue.
</div>

In [None]:
# Convert the numerical columns of our dataframe to array format 
evs_array = ▰▰1▰▰(evs_df[['v145', 'v146', 'v147', 'v148']])

# Divide every value in the array by 5, then multiply by 100
evs_array_percent = (evs_array ▰▰2▰▰ 5) ▰▰3▰▰ 100 

# Round each value in the array to the nearest whole number
evs_array_percent_r = np.▰▰4▰▰(evs_array_percent)

evs_array_percent_r

## Problem 7:
<div class="alert alert-block alert-info">
We can also use Numpy to rapidly and simply perform linear algebra calculations. If, for example, we wanted to see how 'v145' (which measures respondents' preference for a 'strong leader') and 'v148' (which measures respondents' preference for democratic norms) covary with one another, we can produce a covariance matrix. Rather than using the percentage-valued array we created in the previous problem, we'll use a standardized array, which will make the covariance matrix a little easier to read.
</div>
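
If you have not read a covariance matrix before, the sketch below may help. It uses two made-up vectors rather than the EVS variables. For two inputs, `np.cov` returns a 2x2 matrix: the diagonal holds each variable's variance, and the off-diagonal cells hold their covariance. By default it divides by n - 1 (the sample estimate).

```python
import numpy as np

# Made-up vectors that move in opposite directions
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([4.0, 3.0, 2.0, 1.0])

cov_matrix = np.cov(a, b)

# The same covariance computed by hand, dividing by n - 1
manual = ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - 1)

print(cov_matrix)
print(manual)
```

A negative off-diagonal value, as in this toy case, means that one variable tends to fall as the other rises.
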
<div class="alert alert-block alert-success">
In the code cell below, use Numpy to extract the 'v145' and 'v148' variables. Then, use <code>np.cov</code> to create a covariance matrix between the <code>strong_leader</code> and <code>democratic</code> variables. Fill in the blanks to continue. 
</div>

In [None]:
# Isolate the 'Strong Leader' column from the array (first column)
strong_leader = evs_array[:,▰▰1▰▰]
# Isolate the 'Democratic' column from the array (last column)
democratic = evs_array[:,▰▰2▰▰]

# Compute the covariance of the two
np.▰▰3▰▰(strong_leader, democratic)



## Problem 8:
<div class="alert alert-block alert-info">  
We'll proceed by using our now-standardized data as the basis for a principal components analysis. Unlike in the chapter, however, we're only going to have our PCA return the top 10 components.
</div>
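
To build some intuition for what an "explained variance ratio" is, the sketch below computes one by hand with plain Numpy on simulated data: each eigenvalue of the covariance matrix is divided by the sum of all the eigenvalues. This is an illustration of the idea under those assumptions, not the scikit-learn workflow the exercise asks for, and all of the names are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: the first two columns share a common signal, the third is pure noise
base = rng.normal(size=(200, 1))
X_toy = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Standardize each column
X_toy = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Eigenvalues of the covariance matrix, reversed so the largest comes first
eigvals = np.linalg.eigvalsh(np.cov(X_toy, rowvar=False))[::-1]

# Share of the total variance captured by each component
ratios = eigvals / eigvals.sum()
print(ratios)
```

Because two of the three columns carry nearly the same information, the first component captures well over half of the total variance.
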
<div class="alert alert-block alert-success">
Perform a principal components analysis on the standardized EVS data. Only return the top 10 components (sorted in order of explained variance ratio). Submit a numpy array containing the explained variance ratios of the 10 principal components you found. 
</div>

In [None]:
# Create PCA with 10 components
pca = ▰▰1▰▰(▰▰2▰▰, random_state=42)

# Fit the PCA
pca.▰▰3▰▰(▰▰4▰▰)

# Extract explained variance ratio
evr = pca.▰▰5▰▰

print(evr)

## Problem 9:
<div class="alert alert-block alert-info">  
Although interpreting screeplots is generally a subjective exercise, we're fortunate in that the screeplot from our PCA of the EVS data has a clear inflection point. It's time to flex your interpretation muscles! 
</div>
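
If you would like a numerical sanity check on your visual reading, one rough heuristic is to look for the point where the curve bends most sharply, using second differences of the eigenvalues. The sketch below applies it to made-up eigenvalues; it is an assumption-laden rule of thumb, not a substitute for looking at the plot.

```python
import numpy as np

# Made-up eigenvalues: a steep drop followed by a long flat tail
toy_eigenvalues = np.array([5.0, 3.0, 1.0, 0.8, 0.7, 0.6])

# Second differences measure how sharply the slope changes at each interior point
second_diff = np.diff(toy_eigenvalues, n=2)

# The i-th second difference is centered on position i + 1 (zero-based)
elbow_position = int(np.argmax(second_diff)) + 1
print(elbow_position)  # 2
```

Note that the position is zero-based, matching the index of a default Pandas Series.
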
<div class="alert alert-block alert-success">
Produce a screeplot of the 10 principal components you produced in Problem 8. Submit an integer corresponding to the principal component ID that corresponds with the inflection point.
</div>

In [None]:

# Extract explained variances
eigenvalues = pd.Series(pca.explained_variance_)

# Create screeplot
fig, ax = plt.subplots()
sns.lineplot(x=eigenvalues.index, y=eigenvalues)
plt.scatter(x=eigenvalues.index, y=eigenvalues)
ax.set(xlabel='Principal component ID', ylabel='Eigenvalue')
sns.despine()
plt.show()