<br><br><font color="gray">DOING COMPUTATIONAL SOCIAL SCIENCE<br>MODULE 3 <strong>PROBLEM SETS</strong></font>

# <font color="#49699E" size=40>MODULE 3 </font>


# What You Need to Know Before Getting Started

- **Every notebook assignment has an accompanying quiz**. Your work in each notebook assignment will serve as the basis for your quiz answers.
- **You can consult any resources you want when completing these exercises and problems**. Just as in the "real world", if you can't figure out how to do something, look it up. My recommendation is that you check the relevant parts of the assigned reading or search for inspiration on [https://stackoverflow.com](https://stackoverflow.com).
- **Each problem is worth 1 point**. All problems are equally weighted.
- **The information you need for each problem set is provided in the blue and green cells.** General instructions / the problem set preamble are in the blue cells, and instructions for specific problems are in the green cells. **You have to execute all of the code in the problem set, but you are only responsible for entering code into the code cells that immediately follow a green cell**. You will also recognize those cells because they will be incomplete. You need to replace each blank `▰▰#▰▰` with the code that will make the cell execute properly (where # is a sequentially-increasing integer, one for each blank).
- Most modules will contain at least one question that requires you to load data from disk; **it is up to you to locate the data, place it in an appropriate directory on your local machine, and replace any instances of the `PATH_TO_DATA` variable with a path to the directory containing the relevant data**.
- **The comments in the problem cells contain clues indicating what the following line of code is supposed to do.** Use these comments as a guide when filling in the blanks. 
- **You can ask for help**. If you run into problems, you can reach out to John (john.mclevey@uwaterloo.ca) or Pierson (pbrowne@uwaterloo.ca) for help. You can ask a friend for help if you like, regardless of whether they are enrolled in the course.

Finally, remember that you do not need to "master" this content before moving on to other course materials, as what is introduced here is reinforced throughout the rest of the course. You will have plenty of time to practice and cement your new knowledge and skills.
<div class='alert alert-block alert-danger'>As you complete this assignment, you may encounter variables that can be assigned a wide variety of different names. Rather than forcing you to employ a particular convention, we leave the naming of these variables up to you. During the quiz, use the 'USER_DEFINED' option to fill in any blank that you assigned an arbitrary name to.</div>
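
The code cells in this module build file paths with expressions like `PATH_TO_DATA/'vdem_subset.csv'`, which only works if `PATH_TO_DATA` is a `pathlib.Path` object rather than a plain string. Here is a minimal sketch of one way to define it. The directory name `data` is a placeholder we made up for illustration; replace it with wherever you actually saved the files.

```python
from pathlib import Path

# Hypothetical location: swap in the directory that holds your copy of the data
PATH_TO_DATA = Path("data")

# The / operator joins a Path with a file name to produce a new Path
vdem_path = PATH_TO_DATA / "vdem_subset.csv"

print(vdem_path.name)
```

Because the result is a `Path`, you can pass it directly to Pandas when loading the file.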

## Package Imports

In [32]:
import pandas as pd
import numpy as np
from pprint import pprint

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples



import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%config Completer.use_jedi = False

## Defaults

In [2]:
seed = 7

## Question 1:

<div class="alert alert-block alert-info">  
In this exercise, we're going to ask you to supply the names of the Pandas methods you'll need to (1) load the .csv from disk and (2) preview a random sample of 5 rows.
</div>

<div class="alert alert-block alert-success">
In the code block below, fill in the blanks to insert the functions, methods, or variable names needed to load the .csv and draw a random sample of 5 rows.
</div>

In [None]:
# Load vdem_subset.csv as a dataframe
df = pd.▰▰0▰▰(PATH_TO_DATA/'vdem_subset.csv', low_memory=False, index_col=0)

# Draw random sample of 5 rows from vdem dataframe
df.▰▰1▰▰(▰▰2▰▰, random_state = 7)

## Question 2:
<div class="alert alert-block alert-info">  
You may have noticed that many of the cells in the dataframe we created have 'NaN' values. It's useful for us to know just how many values in our dataset are missing or not defined. Let's do that now:
</div>
<div class="alert alert-block alert-success">
In the code block below, fill in the blanks to insert the functions, methods, or variable names needed to create a Pandas series of the missing values for each column and then sort it.
</div>

In [None]:
# Sum together all NaN values to produce series with numerical values indicating number of missing entries
missing = df.▰▰0▰▰().▰▰1▰▰()

# Sort the `missing` series
missing = missing.▰▰2▰▰()

print(missing)
print("Total missing values: " + str(sum(missing)))


## Question 3:
<div class="alert alert-block alert-info">
The list below contains a number of variables, including mid-level indicators that go into the 5 high-level democracy indexes that were used in the assigned readings. In this problem, we'll subset our data in two ways - first conceptually by selecting only the mid-level indicators, and then empirically by selecting the indicators that have the least missing data.
</div>

<div class="alert alert-block alert-success">
Use the list of column names we've provided to filter the large dataframe into a subset. Fill in the blanks to insert the functions, methods, or variable names needed.
</div>

In [None]:
vd_meta_vars = ['country_name', 'year', 'e_regiongeo']
vd_index_vars = ['v2x_freexp_altinf', 'v2x_frassoc_thick', 'v2x_suffr', 'v2xel_frefair', 'v2x_elecoff',    # electoral democracy index
              'v2xcl_rol', 'v2x_jucon', 'v2xlg_legcon',                                                 # liberal democracy index
              'v2x_cspart', 'v2xdd_dd', 'v2xel_locelec', 'v2xel_regelec', 'v2x_polyarchy',              # participatory democracy index
              'v2dlreason', 'v2dlcommon', 'v2dlcountr', 'v2dlconslt', 'v2dlengage',                     # deliberative democracy index
              'v2xeg_eqprotec', 'v2xeg_eqaccess', 'v2xeg_eqdr']                                         # egalitarian democracy index

# filter `df` so that it only includes columns from the two lists above
sdf = df[▰▰0▰▰ ▰▰1▰▰ vd_index_vars]

sdf.describe()

## Question 4:
<div class="alert alert-block alert-info">  
Let's dig into our subsetted data. Notice that many of the variables have a range (min to max) of 0 to 1. In the next exercises we'll be making some comparisons between variables, so let's simplify those comparisons by selecting only the meta-data columns and columns with a range of 0 to 1.
</div>

<div class="alert alert-block alert-success">
Create a new list of column names, adding each column from vd_index_vars only if the data in that column falls within the required range. Then subset the dataframe using the three meta-data columns together with this new list. Fill in the blanks to continue.
</div>

In [None]:
sub_vd_indices = []
# iterate over the columns of vd_index_vars
▰▰0▰▰ column in vd_index_vars:
    # filter out columns that have values greater than 1 or less than 0
    ▰▰1▰▰ sdf[column].min() >= 0 ▰▰2▰▰ sdf[column].max() <= 1:
        # add columns that pass the filter to a list of such columns
        sub_vd_indices.append(▰▰3▰▰)

# create a new dataframe consisting of only those columns saved in sub_vd_indices and vd_meta_vars
fsdf = sdf[vd_meta_vars ▰▰4▰▰ ▰▰5▰▰]

fsdf.describe()

## Question 5:

<div class="alert alert-block alert-info">
In this problem set, we will continue to compute some descriptive statistics for our subsetted data and create some visualizations. We need a list that has the column names for just the variables, so we can re-use the 'sub_vd_indices' list from the previous problem. This code block will modify our dataframe so that it's easy to generate a single plot with all of the variables.
</div>

<div class="alert alert-block alert-success">
Use the <code>fsdf_ecdf</code> dataframe to create an empirical cumulative distribution plot of the indicator variables, using hue to differentiate them.
</div>

In [None]:
fsdf_ecdf = fsdf[sub_vd_indices]
fsdf_ecdf = fsdf_ecdf.melt(value_vars = sub_vd_indices, var_name = 'vd_index', value_name = 'score')

# create a new matplotlib figure
figure = ▰▰0▰▰.figure(figsize=(10,6))
# use seaborn to create the plot
ax = sns.▰▰1▰▰(fsdf_ecdf, x = ▰▰2▰▰, hue = ▰▰3▰▰, kind = ▰▰4▰▰)
ax.set(xlabel='Score', ylabel='Proportion', title="ECDF for VDEM Indicator Variables")
figure.show()

## Question 6:
<div class="alert alert-block alert-info">
We can see that most of the variables follow a fairly smooth curve, while a few see dramatic proportion increases in a step-like way. This indicates that although we might consider these to be continuous variables, the measurement that produced them had some discrete (interval-like) qualities.
</div>

<div class="alert alert-block alert-success">
Select one variable that seems to have a relatively smooth distribution and another that has a distinctly step-like distribution. Fill their names into the underscore blanks (<code>__A__</code> and <code>__B__</code>) in the 'x' and 'y' variables in the <code>sns.displot</code> call in the code block below. There are a number of options to choose from for each, so the distinction between the two is what matters. Produce a bivariate kernel density estimation rug plot for the two selected variables. Fill in the blanks to continue.
</div>

In [None]:
# create a new matplotlib figure
figure = ▰▰0▰▰.figure()
# use seaborn to create the plot
ax = sns.▰▰1▰▰(fsdf, x=__A__, y=__B__, kind=▰▰2▰▰, rug = ▰▰3▰▰, rug_kws = {"alpha": 0.01})
sns.despine()
ax.set(xlabel='frassoc_thick', ylabel='polyarchy')
figure.show()
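
To see why some curves in an ECDF plot look smooth while others climb in steps, it can help to compute an ECDF by hand. The sketch below uses plain NumPy and two small made-up arrays (not VDEM data): one with evenly spread values and one that takes only three distinct values. The helper name `ecdf` is our own, not a library function.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the proportion of observations <= each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Made-up scores: ten distinct values versus only three distinct values
smooth = np.linspace(0.05, 0.95, 10)
stepped = np.array([0.0, 0.0, 0.0, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0])

x_smooth, y_smooth = ecdf(smooth)
x_step, y_step = ecdf(stepped)

# Proportion of the stepped scores that are 0.5 or lower
prop_at_half = np.mean(stepped <= 0.5)

print(np.unique(x_smooth).size, "distinct x positions in the smooth ECDF")
print(np.unique(x_step).size, "distinct x positions in the stepped ECDF")
print(prop_at_half)
```

When many observations share the same value, the proportion jumps all at once at that value, which is what produces the staircase shape.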


## Question 7:

<div class="alert alert-block alert-info">
Now we'll create a correlation matrix (2D array) of all the variables and plot it in a heatmap to see which pairs of variables are most and least correlated. 
<br><br>
We'll use a boolean mask to clean up the heatmap. Remember that a boolean mask is an array of "True" and "False" values, the same size and shape as the data array. In Seaborn's heatmap, a value of "True" indicates that the corresponding value in the data array should be hidden. In this case, the mask will remove all values on the top-left -> bottom-right diagonal and above that diagonal.
</div>

<div class="alert alert-block alert-success">
Create a heatmap of the correlation matrix. Use the heatmap to select a few correlations to print. 
Looking at the heatmap above, select 2 pairs of variables (4 variables total) that appear highly correlated with each other. Then, select another 2 pairs of variables (4 variables total) that appear minimally correlated with each other. Create two lists, one for the first element of each variable pair and one for the second element of each pair. These lists should be aligned. Replace the underscore blanks (<code>__A__</code> and <code>__B__</code>) with your aligned lists of variables. Jointly iterate over the two lists and print the resulting Pearson correlations between each variable pair. 
</div>

In [None]:
fsdf_corr = fsdf[sub_vd_indices].corr()
# create the upper triangular mask
mask = ▰▰0▰▰(np.▰▰1▰▰(fsdf_corr, dtype = bool))
figure = plt.figure()
# create the masked heatmap
ax = sns.▰▰2▰▰(data = fsdf_corr, ▰▰3▰▰ = ▰▰4▰▰)
figure.show()

var_1_list = __A__
var_2_list = __B__

for v1, v2 in ▰▰5▰▰(var_1_list, var_2_list):
    result = ▰▰6▰▰[v1].▰▰7▰▰(▰▰8▰▰[v2])
    print('Correlation of ' + v1 + ' and ' + v2 + ' : ' + str(▰▰9▰▰))
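
If the way a mask works feels abstract, here is a small sketch that builds one by comparing row and column positions, then uses it to blank out part of a made-up 3-by-3 correlation matrix. This is only an illustration of the idea: it constructs the mask differently from the exercise above, and the numbers are invented. Seaborn's heatmap hides cells where the mask is `True`, so that is the convention used here.

```python
import numpy as np

# Made-up symmetric correlation matrix for three variables
corr = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])

# Row and column position of every cell
rows, cols = np.indices(corr.shape)

# True on the diagonal and above it: these are the cells to hide
mask = cols >= rows

# Replace hidden cells with NaN to see what would remain visible
visible = np.where(mask, np.nan, corr)

print(mask)
print(visible)
```

Only the three cells below the diagonal survive, which is all the information a symmetric correlation matrix really contains.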


## Question 8:

<div class="alert alert-block alert-info">
One useful thing that using Pandas dataframes enables us to do is group data based on one or more of the columns and then work with the resulting grouped dataframe (in much the same way we would with an un-grouped dataframe). Using the VDEM data, we'll only import a subset of the data, using the 'columns_to_use' variable. At the same time, we're going to replace the numerical values in the 'e_regionpol_6C' variable with easy-to-read string representations. Finally, we'll filter the resulting dataset to include only those rows from the year 2015.<br><br>
</div>
<div class="alert alert-block alert-success">
In this next code block, we're going to load in a dataset, filtering our dataframe to include only those rows where the year is 2015. Fill in the blanks to continue.
</div>

In [None]:
columns_to_use = [
    'country_name',
    'country_id',
    'year',
    'e_area',
    'e_regionpol_6C',
    'v2x_polyarchy',
    'v2x_libdem',
    'v2x_partipdem',
    'v2x_delibdem',
    'v2x_egaldem'
]

# Load the dataset as a dataframe
df = pd.▰▰0▰▰(
    PATH_TO_DATA/"vdem_subset.csv",
    usecols = ▰▰1▰▰,
    low_memory = False
)

# Assign the result back to the column rather than replacing in place on a column selection
df['e_regionpol_6C'] = df['e_regionpol_6C'].replace({
    1.0: "East Europe and Central Asia",
    2.0: "Latin America and Caribbean",
    3.0: "Middle East and North Africa",
    4.0: "Sub-Saharan Africa",
    5.0: "West Europe and North America",
    6.0: 'Asia and Pacific'
})


# Subset the dataframe to include only those rows from 2015
df_2015 = df.▰▰2▰▰("▰▰3▰▰ ▰▰4▰▰ 2015")

df_2015

## Question 9:
<div class="alert alert-block alert-info">
Now, we're going to use the Pandas Dataframe's `groupby` method to combine each nation into the region it belongs to. As you would have read in the accompanying chapter, the Pandas groupby method only preserves columns that you give it instructions for; everything else is dropped in the resulting dataframe. 
<br><br>
In order to figure out how to aggregate each of our columns, let's think through them together. First up, we have 'country_name' and 'country_id'. Since we're going to be grouping our data into only 6 rows (one for each of the 6 politico-geographical regions), it doesn't make sense to keep either of these columns. The same goes for 'year', since we will have already filtered our dataset to only include rows that are from 2015. We're going to be using 'e_regionpol_6C' as the basis for our groupings, so it doesn't make sense to keep it as a data column any longer. 
<br><br>
That leaves us with 'e_area' and the 5 democracy indices. Since we're interested in knowing the total area of each region, it would make sense to <b>add</b> each country's area together. We could do something similar for the 5 democracy indices, but we'll leave them alone for now. In order to make things easier on ourselves, we're going to start by filtering out all of the columns we don't want in our final dataset, which will make aggregating what's left much easier.
</div>

<div class="alert alert-block alert-success">
In the following code cell, we're going to filter out most of the columns in `df_2015` so that only 'e_regionpol_6C' and 'e_area' remain, and store the resulting filtered dataframe as `df_area`. Then, we're going to run a `groupby` operation on the `e_regionpol_6C` column and sum the `e_area` column in the `df_area` dataframe. Fill in the blanks to continue. 
</div>

In [None]:
# Filter out all columns except 'e_regionpol_6C', 'e_area'
df_area = ▰▰0▰▰[[▰▰1▰▰, 'e_area']]

# group by political region and sum remaining columns
df_grouped_area = df_area.▰▰2▰▰(▰▰3▰▰).▰▰4▰▰()

df_grouped_area

## Question 10:


<div class="alert alert-block alert-info">
In the last question, we explored how we could use Pandas to group rows of a dataframe according to a variable's value, and to handle a subset of the remaining columns according to some kind of aggregation logic (such as adding the values or averaging over them). This time, rather than lumping countries together by region, we're going to drill deeper on how an individual nation has changed over time. For this exercise, we're going to look at how democratic norms in Costa Rica have developed in the decades since the Second World War. Since we already have the full dataframe stored in memory (as 'df'), we'll start by filtering our dataset to include only those rows pertaining to Costa Rica (across all years, not just 2015). 
<br><br>
If you examine the resulting dataframe, you might notice that Costa Rica does not have any scores for the 5 democratic indices in the earlier years for which it is present in the dataset. This should come as no surprise; even for a group as capable as the VDEM project, constructing a democratic index for the year 1839 would involve enough guesswork to render the result meaningless. As such, we're going to immediately filter our Costa Rica-only dataframe to weed out any rows that don't have scores for the 5 democratic indices. 
</div>
<div class="alert alert-block alert-success">
Find the first year for which we have a complete set of the democratic indices for Costa Rica. Fill in the blanks to continue.
</div>

In [None]:
# Filter the dataframe to include only rows pertaining to Costa Rica
df_cr = df.▰▰0▰▰("▰▰1▰▰ ▰▰2▰▰ 'Costa Rica'")

# Drop each row with one or more missing values
df_cr_filtered = df_cr.▰▰3▰▰(subset=[
    'v2x_polyarchy',
    'v2x_libdem',
    'v2x_partipdem',
    'v2x_delibdem',
    'v2x_egaldem'])

# Find first year for which VDEM has a complete set of indices for Costa Rica
first_year = ▰▰4▰▰(df_cr_filtered[▰▰5▰▰])


## Question 11:
<div class="alert alert-block alert-info">
Now our data is ready to be plotted! In this part of the exercise, we're going to plot one of Costa Rica's democratic indices (polyarchy) against the 'year' variable to see how its democratic norms have evolved over time. We'll accomplish this by using Seaborn and taking advantage of the fact that the columns in Pandas Dataframes can be individually 'pulled out' as a Series (which operate similarly to Numpy arrays, for most intents and purposes). In the following code cell, we'll create the first plot for you so you can see how it's done and what it should look like. It won't be graded, and there aren't any blanks to fill in.
<br><br>Despite being as simple as can be, that doesn't look half bad! It's always a good idea to label your axes and give the plot a title so that anyone encountering it for the first time can rapidly determine what the plot represents.
<br><br>A quick note; if you want to see what the first label-less plot looks like before adding labels to the second plot, you can comment out each of the lines below the first instance of <code>figure.show()</code>.
</div>

<div class="alert alert-block alert-success">
Add useful labels to the x-axis and y-axis of the second plot produced by the code cell below, along with a title describing what the plot is. Fill in the blanks to continue. 
</div>

In [None]:
cr_years = df_cr_filtered['year']
cr_polyarchy = df_cr_filtered['v2x_polyarchy']

figure = plt.figure(figsize=(10, 6))
sns.lineplot(x = cr_years, y = cr_polyarchy)
figure.show()

figure = plt.figure(figsize=(10, 6))
sns.lineplot(x = cr_years, y = cr_polyarchy)
# Label y-axis
plt.ylabel(▰▰0▰▰)
# Label x-axis
plt.▰▰1▰▰(▰▰2▰▰)
# Add title
▰▰3▰▰.▰▰4▰▰("Polyarchy over Time, Costa Rica")
figure.show()


## Question 12:

<div class="alert alert-block alert-info">
In this exercise, we're going to work through how to combine multiple pandas dataframes. This will come in handy whenever you want to explore the relationships between variables that come from different datasets, but which can be linked according to some underlying relationship. 
<br><br>
Earlier, we used addition to aggregate the land area of every nation in a politico-geographic region to give us a sense of how large each region was. In this exercise, we're going to turn our attention to the 5 democracy indices. Using addition (which is what we did with area) to aggregate the 5 democracy indices doesn't make as much sense, though: that might lead us to conclude that regions with more countries would be 'more democratic' than those with only a small number of nations. Instead, we'll *average* over these indicators, which will give us a sense of how democratic each region is, taken together. 
</div>
<div class="alert alert-block alert-success">
Create a dataframe that only includes the columns we care about (the region variable and the 5 democratic indices), group the result by region, and take the average across each score. 
</div>

In [None]:
df_democracy = df_2015[['v2x_polyarchy',
    'v2x_libdem',
    'v2x_partipdem',
    'v2x_delibdem',
    'v2x_egaldem',
     'e_regionpol_6C']]

# Group by region and take average of other variables
df_grouped_democracy = ▰▰0▰▰.▰▰1▰▰('▰▰2▰▰').▰▰3▰▰()

df_grouped_democracy 
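
The reasoning about summing versus averaging is easy to check on a toy example. The sketch below uses invented regions and scores (not VDEM data): region "A" has four countries with low scores, while region "B" has a single country with a high score. Asking for both aggregations at once with `.agg` is just one convenient way to put them side by side.

```python
import pandas as pd

# Invented data: four low-scoring countries in A, one high-scoring country in B
toy = pd.DataFrame({
    "region": ["A", "A", "A", "A", "B"],
    "score": [0.2, 0.2, 0.2, 0.2, 0.7],
})

# Compute the sum and the mean of the score within each region
summary = toy.groupby("region")["score"].agg(["sum", "mean"])

print(summary)
```

Summing makes region A look "more democratic" simply because it contains more countries, while the mean correctly ranks region B higher.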

## Question 13:
<div class="alert alert-block alert-info">
If you compare the 'df_grouped_democracy' dataframe and the 'df_grouped_area' dataframe, you might notice that the bolded columns on the left are identical. You may recall that the bold column on the left of a dataframe is the 'index', and we can take advantage of its special status to join the two dataframes together. The result will be one dataframe with the same number of rows, but with all 6 of the columns we aggregated: area and the 5 democratic indices.
</div>
<div class="alert alert-block alert-success">
In the following code block, we're going to concatenate `df_grouped_democracy` and `df_grouped_area`. Fill in the blanks to continue. 
</div>

In [None]:
# Concatenate df_grouped_democracy and df_grouped_area on rows
df_full_rows = pd.▰▰0▰▰([df_grouped_democracy, df_grouped_area], ▰▰1▰▰=1)

df_full_rows
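
To see what it means for Pandas to line rows up by their index rather than by their position, here is a small sketch using `DataFrame.join`, a related method that is not the one the exercise asks for. The two toy dataframes are invented and deliberately list their regions in a different order.

```python
import pandas as pd

# Two invented dataframes that share index labels but in a different order
area = pd.DataFrame(
    {"area": [10.0, 20.0]},
    index=pd.Index(["North", "South"], name="region"),
)
scores = pd.DataFrame(
    {"score": [0.4, 0.9]},
    index=pd.Index(["South", "North"], name="region"),
)

# join matches rows on their index labels, not on their position
combined = area.join(scores)

print(combined)
```

Even though "South" comes first in `scores`, its value ends up next to the "South" row of `area`, because the index labels drive the match.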

## Question 14:
<div class="alert alert-block alert-info">
In the above exercise, we combined dataframes along their rows, using the row index to guide how the data was combined. We can do much the same with columns. To demonstrate how, let's return to our Costa Rica dataframe and add another country to it. Since Costa Rica and Nicaragua are geographic neighbours, it makes sense to compare them directly. 
</div>
<div class="alert alert-block alert-success">
Time to give Nicaragua the same treatment as we did to Costa Rica! Once that's done, we're going to concatenate `df_nicaragua_filtered` and `df_cr_filtered`, column-wise, and store the result as `df_nicaragua_cr`. Fill in the blanks to continue.
</div>

In [None]:
# Create a dataframe only containing rows pertaining to Nicaragua
df_nicaragua = df.▰▰0▰▰("▰▰1▰▰ ▰▰2▰▰ 'Nicaragua'")

# Drop rows in the Nicaragua dataframe that contain NaNs in the 5 index columns 
df_nicaragua_filtered = df_nicaragua.▰▰3▰▰(▰▰4▰▰=[
    'v2x_polyarchy',
    'v2x_libdem',
    'v2x_partipdem',
    'v2x_delibdem',
    'v2x_egaldem'])

# Concatenate df_nicaragua_filtered and df_cr_filtered on columns
df_nicaragua_cr = pd.▰▰5▰▰([df_nicaragua_filtered, ▰▰6▰▰], ▰▰7▰▰ = ▰▰8▰▰)

df_nicaragua_cr

## Question 15:
<div class="alert alert-block alert-info">
Now that the data for these two countries has been combined into a single dataframe, we can easily create plots that allow us to compare them. Again, we'll be using the Seaborn package to do our plotting for us. Even though all of our data is lumped together, Seaborn allows us to use the 'hue' variable to differentiate the data we're plotting based on some categorical variable (which, in this case, is the country variable -- it's what differentiates between Costa Rica and Nicaragua). 
</div>
<div class="alert alert-block alert-success">
Create a line plot that contains separate lines for both Nicaragua's and Costa Rica's polyarchy score by year. We'll also include labels for the x-axis, y-axis, and plot title. Fill in the blanks to continue.
</div>

In [None]:
concat_years = df_nicaragua_cr['year']
concat_polyarchy = df_nicaragua_cr['v2x_polyarchy']
concat_country = df_nicaragua_cr['country_name']

figure = plt.figure(figsize=(10,6))
ax = sns.▰▰0▰▰(x=▰▰1▰▰,
             y=▰▰2▰▰,
             hue=▰▰3▰▰
            )
ax.set(xlabel='Year', ylabel='Polyarchy', title="Polyarchy over Time")

figure.show()

## Question 16:


<div class="alert alert-block alert-info">
Pandas and Numpy have been built to interoperate smoothly. A Pandas Dataframe can be converted to a multidimensional Numpy array and back (although each offers a different set of features); the same goes for Pandas Series and unidimensional Numpy arrays. Let's return to the 'df_grouped_democracy' dataframe we made earlier; it will be useful for exploring how Numpy handles multidimensional arrays.
</div>
<div class="alert alert-block alert-success">
In this exercise, we're going to turn our Pandas dataframe into a 6-by-5 Numpy array, and convert every number it contains into a whole-number percentage (which we can accomplish by multiplying by 100 and rounding to the nearest whole number). Fill in the blanks to continue.
</div>

In [None]:
# Convert the numerical columns of our dataframe to array format 
arr_dem = ▰▰0▰▰.▰▰1▰▰(df_grouped_democracy)

# Multiply every value in the array by 100
arr_dem_percent = arr_dem ▰▰2▰▰ 100

# Round each value in the array to the nearest whole number
arr_dem_percent_r = ▰▰3▰▰.▰▰4▰▰(arr_dem_percent)

arr_dem_percent_r
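
One way to get a feel for the "different set of features" is to apply the same Numpy function to a dataframe and to an array built from it. The sketch below uses an invented two-by-two dataframe and the `to_numpy` method, which is one of several ways to do the conversion and not necessarily the one the exercise has in mind.

```python
import numpy as np
import pandas as pd

# Invented dataframe with row and column labels
toy = pd.DataFrame(
    {"a": [1.0, 4.0], "b": [9.0, 16.0]},
    index=["r1", "r2"],
)

# Converting to an array keeps the numbers but drops the labels
arr = toy.to_numpy()

# NumPy functions accept both; the dataframe version keeps its labels
rooted_df = np.sqrt(toy)
rooted_arr = np.sqrt(arr)

print(type(rooted_df).__name__, list(rooted_df.columns))
print(type(rooted_arr).__name__, rooted_arr.shape)
```

The numbers are identical in both results; what differs is whether the row and column labels come along for the ride.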

## Question 17:
<div class="alert alert-block alert-info">
If you compare the results from the Numpy array we produced with the values in the Pandas dataframe, you'll notice that they're more-or-less a perfect match (differing only due to rounding). We can also use Numpy to rapidly and simply perform linear algebra calculations. If, for example, we wanted to see how the polyarchy and liberal democracy variables covary with one another across the regions, we can produce a covariance matrix.
</div>
<div class="alert alert-block alert-success">
In the code cell below, we're going to use Numpy to create a covariance matrix for polyarchy and liberal democracy. Fill in the blanks to continue. 
</div>

In [None]:
# Isolate the polyarchy column from the array
arr_polyarchy = arr_dem[▰▰0▰▰,▰▰1▰▰]
# Isolate the liberal democracy column from the array
arr_libdem = arr_dem[▰▰2▰▰,▰▰3▰▰]

# Compute the covariance of the two
▰▰4▰▰.▰▰5▰▰(arr_polyarchy, ▰▰6▰▰)
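
If you'd like to see what goes into a covariance matrix, the sketch below computes one by hand for two short, invented arrays. It assumes the usual sample covariance (dividing by n - 1); the variable names are our own.

```python
import numpy as np

# Two invented variables measured on the same four cases
x = np.array([0.2, 0.4, 0.6, 0.8])
y = np.array([0.1, 0.3, 0.2, 0.6])
n = len(x)

# Deviations of each observation from its variable's mean
dx = x - x.mean()
dy = y - y.mean()

# Sample variances and covariance (divide by n - 1)
var_x = (dx ** 2).sum() / (n - 1)
var_y = (dy ** 2).sum() / (n - 1)
cov_xy = (dx * dy).sum() / (n - 1)

# Variances go on the diagonal, the covariance goes off the diagonal
cov_matrix = np.array([
    [var_x, cov_xy],
    [cov_xy, var_y],
])

print(cov_matrix)
```

The matrix is symmetric because the covariance of x with y is the same as the covariance of y with x.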



## Question 18:
<div class="alert alert-block alert-info">  
For the remainder of the assignment, we're going to be working with data from the European Values Survey (EVS). The data were collected by interviewers and drawn from most European nations in 2017. We'll start by loading a subset of the EVS and then standardizing each of the variables. Standardized data is a <i>sine qua non</i> when working with latent variables!
</div>
<div class="alert alert-block alert-success">
Standardize the data present in the columns of the <code>evs_df</code> dataframe. To accomplish this, use the <code>StandardScaler</code> class from the scikit-learn package. 
</div>

In [None]:
evs_df = pd.read_csv(PATH_TO_DATA/"evs_subset.csv")

country_index = evs_df['country'].to_numpy()

# Drop the country column, as it cannot be standardized
evs_df = evs_df.▰▰0▰▰("country", axis=▰▰1▰▰)

# Standardize the data
X = ▰▰2▰▰().▰▰3▰▰(evs_df)

X
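
Standardizing a column means subtracting its mean and dividing by its standard deviation, so that every column ends up with a mean of 0 and a standard deviation of 1. Here is a hand-rolled NumPy sketch on an invented array whose two columns are on very different scales. It uses the population standard deviation (NumPy's default, `ddof=0`), which is also what scikit-learn's `StandardScaler` is documented to use.

```python
import numpy as np

# Invented data: the second column is on a much larger scale than the first
raw = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
])

# Column-wise mean and population standard deviation (ddof=0)
means = raw.mean(axis=0)
stds = raw.std(axis=0)

# Subtract the mean and divide by the standard deviation, column by column
z = (raw - means) / stds

print(z)
```

After standardizing, the two columns are directly comparable even though their original units differed by two orders of magnitude.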

## Question 19:
<div class="alert alert-block alert-info">  
We'll proceed by using our now-standardized data as the basis for a principal components analysis. Unlike in the chapter, however, we're only going to have our PCA return the top 10 components.
</div>
<div class="alert alert-block alert-success">
Perform a principal components analysis on the standardized EVS data. Only return the top 10 components (sorted in order of explained variance ratio). Submit a numpy array containing the explained variance ratios of the 10 principal components you found. 
</div>

In [None]:
# Create PCA with 10 components
pca = ▰▰0▰▰(▰▰1▰▰, random_state=42)

# Fit the PCA
pca.▰▰2▰▰(▰▰3▰▰)

# Extract explained variance ratio
evr = pca.▰▰4▰▰

print(evr)
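
An explained variance ratio is a component's eigenvalue divided by the sum of all the eigenvalues. The sketch below illustrates this with a plain NumPy eigendecomposition on randomly generated data, rather than with scikit-learn's own routine. The data are invented: two columns that are nearly copies of each other plus one unrelated column, so we'd expect one component to dominate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: columns 0 and 1 share a common signal, column 2 is pure noise
base = rng.normal(size=(200, 1))
data = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Standardize each column, then form the sample covariance matrix
z = (data - data.mean(axis=0)) / data.std(axis=0)
cov = z.T @ z / (len(z) - 1)

# Eigenvalues of a symmetric matrix, re-ordered from largest to smallest
eigenvalues = np.linalg.eigvalsh(cov)[::-1]

# Each component's share of the total variance
ratios = eigenvalues / eigenvalues.sum()

print(ratios)
```

Because the ratios are shares of a total, they always sum to 1 when every component is kept; keeping only the top few means the ratios you see will sum to less than 1.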

## Question 20:
<div class="alert alert-block alert-info">  
Although the process of interpreting screeplots is generally subjective and open to interpretation, we're fortunate in that the screeplot from our PCA of the EVS data has a clear inflection point. It's time to flex your interpretation muscles! 
</div>
<div class="alert alert-block alert-success">
Produce a screeplot of the 10 principal components you produced in Question 19. Submit an integer corresponding to the principal component ID that corresponds with the inflection point.
</div>

In [None]:

# Extract explained variances
eigenvalues = pd.Series(pca.explained_variance_)

# Create screeplot
fig, ax = plt.subplots()
sns.lineplot(x=eigenvalues.index, y=eigenvalues)
plt.scatter(x=eigenvalues.index, y=eigenvalues)
ax.set(xlabel='Principal component ID', ylabel='Eigenvalue')
sns.despine()
plt.show()

## Question 21:
<div class="alert alert-block alert-info">  
In the following code cell, we're going to ask you to perform a K-means cluster with a K of 10, AND to write a function capable of telling us a bit more about the nationalities of the individuals from each cluster. That's a lot of code for you to write, but don't be intimidated: you can treat this as two separate problems that you need to solve sequentially (consider commenting out all of the hint code regarding the function while you're working on creating the K-means analysis).
</div>
<div class="alert alert-block alert-success">
Perform a K-means cluster analysis, where K = 10, on the EVS data. Store the labels from your K-means cluster analysis in the 'clusters' variable (provided for you). Then, write a function that zips together your list of cluster assignments and the list of countries from the EVS data (both should be the same length), and then iterates over this zipped list in order to create a dictionary where each key is a cluster number (0 through 9) and each value is a list of the 5 nationalities that most frequently appear in that cluster.
</div>

In [None]:
# Set number of clusters
num_clusters = 10

# Instantiate k-means 
km = ▰▰0▰▰(n_clusters=▰▰1▰▰, init='k-means++', random_state=42)

# Fit instantiated k-means to data
k_means_Fitted = km.fit(X)

# Extract clusters

clusters = k_means_Fitted.labels_.tolist() # Do not change this line

# Define function for summarizing results of k-means clustering
▰▰2▰▰ top_countries_by_cluster(
    km, 
    num_clusters, 
    country_index, 
    return_top = 5):

    # Create zipped list of k-means labels and the country column from original dataset
    cluster_countries = ▰▰3▰▰(▰▰4▰▰(km.labels_, country_index))
    
    # Initialize cluster dictionary
    cluster_count = {i:{} for i in range(num_clusters)}

    # Iterate over cluster-country pairs:
    for pair in cluster_countries:
        
        # Extract cluster and country
        cluster_num = pair[0]
        country = pair[1]

        # Retrieve current count of particular nationality in cluster
        current_count = cluster_count[cluster_num].get(country, 0)
        
        # Increment count by one 
        current_count ▰▰5▰▰ ▰▰6▰▰
        
        # Store incremented count as new value 
        cluster_count[cluster_num][country] = current_count
    
    # Sort the values of the cluster dictionary and filter down to 5 most common nationalities for each cluster
    for cluster_num, country_dict in cluster_count.items():
        cluster_count[cluster_num] = ▰▰7▰▰(country_dict, key=▰▰8▰▰ x: country_dict[x], reverse=▰▰9▰▰)[0:return_top]

    # Return final count dictionary
    return cluster_count
  
pprint(top_countries_by_cluster(k_means_Fitted, num_clusters, country_index))
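
Once your function works, you might be curious how the same tallying logic looks with Python's standard library doing the counting. The sketch below is an alternative approach, not the solution to the exercise: it uses `collections.Counter` on two short, invented lists of labels and country codes and keeps the top 2 countries per cluster.

```python
from collections import Counter, defaultdict

# Invented cluster labels and country codes, aligned position by position
labels = [0, 0, 0, 0, 1, 1, 1]
countries = ["FR", "FR", "FR", "DE", "IT", "IT", "FR"]

# One Counter per cluster, created automatically the first time it is needed
counts = defaultdict(Counter)
for label, country in zip(labels, countries):
    counts[label][country] += 1

# Keep the names of the 2 most frequent countries in each cluster
top = {
    label: [country for country, _ in counter.most_common(2)]
    for label, counter in counts.items()
}

print(top)
```

The design trade-off is readability versus transparency: `Counter` hides the increment-and-store steps that the exercise asks you to write out explicitly.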

## Question 22:
<div class="alert alert-block alert-info">  
In the previous question, you familiarized yourself with K-means clustering and built a function capable of summarizing some of your results. Of course, the number of clusters we used above was arbitrary, and might be nowhere close to the optimal number of clusters needed for the data. In this question, we're going to use silhouette analysis to determine a good number of centroids to cluster the data around. It's highly likely that our silhouette analysis will indicate that 2 or 3 clusters is optimal, but since we want to tease out regional variations, we'll insist on finding an optimal solution using more than 4 clusters.
</div>
<div class="alert alert-block alert-success">
Using silhouette analysis, find an optimal number of clusters for the EVS data that's greater than 4 and less than 11. Submit the K value you have decided to use (as an integer). If the silhouette analysis was conducted properly, there should be a clear best option.
</div>

In [None]:

# Iterate over appropriate range of clusters
for i in range(▰▰0▰▰, ▰▰1▰▰):
    
    # Extract value of iterator for consistency 
    num_clusters = i

    # Run k-means with iterated number of clusters (see Question 21 for detailed breakdown)
    km = KMeans(n_clusters=num_clusters, init='k-means++', random_state=42)
    k_means_Fitted = km.fit(X)
    clusters = k_means_Fitted.labels_.tolist() # Do not change this line
    
    # Print cluster number
    print(f"{i} clusters:")

    # Print silhouette score
    print(silhouette_score(X, clusters, metric='euclidean'))

    # Extract silhouette samples
    samples = silhouette_samples(X, clusters)

    # Plot extracted samples
    ax = sns.displot(samples, kind="ecdf")
    ax.set(xlabel='Silhouette Score')
    sns.despine()
    plt.xlim(-1, 1)
    plt.axvline(x=0, linewidth=2, color='darkgray')
    plt.show()
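
It can be easier to interpret silhouette scores once you've seen how a single one is calculated. For each observation, let a be its average distance to the other members of its own cluster and b be its average distance to the members of the nearest other cluster; the silhouette is (b - a) / max(a, b). The sketch below works this out with NumPy for five invented one-dimensional points. The helper `silhouette_for_point` is our own and does not handle clusters containing a single point.

```python
import numpy as np

# Invented 1-D points forming two well-separated clusters
points = np.array([0.0, 1.0, 2.0, 10.0, 11.0])
labels = np.array([0, 0, 0, 1, 1])

def silhouette_for_point(i, points, labels):
    """Silhouette of observation i (assumes its cluster has other members)."""
    dists = np.abs(points - points[i])

    # a: mean distance to the other members of the same cluster
    same = labels == labels[i]
    same[i] = False
    a = dists[same].mean()

    # b: smallest mean distance to the members of any other cluster
    b = min(
        dists[labels == other].mean()
        for other in np.unique(labels)
        if other != labels[i]
    )
    return (b - a) / max(a, b)

scores = np.array(
    [silhouette_for_point(i, points, labels) for i in range(len(points))]
)

print(scores)
print(scores.mean())
```

Values close to 1 mean an observation sits much closer to its own cluster than to any other, values near 0 mean it lies between clusters, and negative values suggest it may have been assigned to the wrong one.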