<br><br><font color="gray">DOING COMPUTATIONAL SOCIAL SCIENCE<br>MODULE 9 <strong>PROBLEM SETS</strong></font>

# <font color="#49699E" size=40>MODULE 9 </font>


# What You Need to Know Before Getting Started

- **Every notebook assignment has an accompanying quiz**. Your work in each notebook assignment will serve as the basis for your quiz answers.
- **You can consult any resources you want when completing these exercises and problems**. Just as in the "real world": if you can't figure out how to do something, look it up. My recommendation is that you check the relevant parts of the assigned reading or search for inspiration on [https://stackoverflow.com](https://stackoverflow.com).
- **Each problem is worth 1 point**. All problems are equally weighted.
- **The information you need for each problem set is provided in the blue and green cells.** General instructions / the problem set preamble are in the blue cells, and instructions for specific problems are in the green cells. **You have to execute all of the code in the problem set, but you are only responsible for entering code into the code cells that immediately follow a green cell**. You will also recognize those cells because they will be incomplete. You need to replace each blank `▰▰#▰▰` with the code that will make the cell execute properly (where # is a sequentially-increasing integer, one for each blank).
- Most modules will contain at least one question that requires you to load data from disk; **it is up to you to locate the data, place it in an appropriate directory on your local machine, and replace any instances of the `PATH_TO_DATA` variable with a path to the directory containing the relevant data**.
- **The comments in the problem cells contain clues indicating what the following line of code is supposed to do.** Use these comments as a guide when filling in the blanks. 
- **You can ask for help**. If you run into problems, you can reach out to John (john.mclevey@uwaterloo.ca) or Pierson (pbrowne@uwaterloo.ca) for help. You can ask a friend for help if you like, regardless of whether they are enrolled in the course.

Finally, remember that you do not need to "master" this content before moving on to other course materials, as what is introduced here is reinforced throughout the rest of the course. You will have plenty of time to practice and cement your new knowledge and skills.
<div class='alert alert-block alert-danger'>As you complete this assignment, you may encounter variables that can be assigned a wide variety of different names. Rather than forcing you to employ a particular convention, we leave the naming of these variables up to you. During the quiz, submit an answer of 'USER_DEFINED' (without the quotation marks) to fill in any blank that you assigned an arbitrary name to. In most circumstances, this will occur due to the presence of a local iterator in a for-loop.</div>
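
The `PATH_TO_DATA` variable mentioned in the list above is not defined anywhere in this notebook. Because the loading cell joins it to a file name with the `/` operator, it needs to be a `pathlib.Path` object rather than a plain string. A minimal sketch, assuming a hypothetical `data` folder in the current working directory (the folder name is an assumption; use whatever location you actually chose):

```python
from pathlib import Path

# Hypothetical location: a "data" folder in the current working directory.
# Replace this with the directory where you actually saved the course data.
PATH_TO_DATA = Path.cwd() / "data"

# The `/` operator joins a Path and a string into a new Path.
csv_path = PATH_TO_DATA / "evs_module_07.csv"

print(csv_path.name)      # file name only
print(csv_path.exists())  # False until the file has been placed there
```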

## Package Imports

In [1]:
import pandas as pd 

import numpy as np
from numpy.random import seed

from pyprojroot import here

import graphviz
from graphviz import Source

import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, ShuffleSplit
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder, LabelBinarizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

from pprint import pprint

import spacy 

from time import time

seed(42)



## Defaults

In [2]:
x_columns = [

    # Religion and Morality
    'v54', # Religious services? - 1=More than Once Per Week, 7=Never
    'v149', # Do you justify: claiming state benefits? - 1=Never, 10=Always
    'v150', # Do you justify: cheating on tax? - 1=Never, 10=Always 
    'v151', # Do you justify: taking soft drugs? - 1=Never, 10=Always 
    'v152', # Do you justify: taking a bribe? - 1=Never, 10=Always 
    'v153', # Do you justify: homosexuality? - 1=Never, 10=Always 
    'v154', # Do you justify: abortion? - 1=Never, 10=Always 
    'v155', # Do you justify: divorce? - 1=Never, 10=Always 
    'v156', # Do you justify: euthanasia? - 1=Never, 10=Always 
    'v157', # Do you justify: suicide? - 1=Never, 10=Always 
    'v158', # Do you justify: having casual sex? - 1=Never, 10=Always 
    'v159', # Do you justify: public transit fare evasion? - 1=Never, 10=Always 
    'v160', # Do you justify: prostitution? - 1=Never, 10=Always 
    'v161', # Do you justify: artificial insemination? - 1=Never, 10=Always 
    'v162', # Do you justify: political violence? - 1=Never, 10=Always 
    'v163', # Do you justify: death penalty? - 1=Never, 10=Always 

    # Politics and Society
    'v97', # Interested in Politics? - 1=Interested, 4=Not Interested
    'v121', # How much confidence in Parliament? - 1=High, 4=Low
    'v126', # How much confidence in Health Care System? - 1=High, 4=Low
    'v142', # Importance of Democracy - 1=Unimportant, 10=Important
    'v143', # Democracy in own country - 1=Undemocratic, 10=Democratic
    'v145', # Political System: Strong Leader - 1=Good, 4=Bad

    # National Identity
    'v170', # How proud are you of being a citizen? - 1=Proud, 4=Not Proud
    'v184', # Immigrants: impact on development of country - 1=Bad, 5=Good
    'v185', # Immigrants: take away jobs from Nation - 1=Take, 10=Do Not Take
    'v198', # European Union Enlargement - 1=Should Go Further, 10=Too Far Already
]

y_columns = [
    # Overview
    'country',
    
    # Socio-demographics
    'v226', # Year of Birth by respondent 
    'v261_ppp', # Household Monthly Net Income, PPP-Corrected
]

## Problem 1:
<div class="alert alert-block alert-info">  
For each of the parts in this assignment, we're going to closely mirror our progression through the various models in each of the corresponding chapters. It should come as no surprise, then, that we're going to start with some plain-old run-of-the-mill Ordinary Least Squares regression (approaching it from a Machine Learning perspective, of course). We'll start by seeing how effectively we can predict someone's year of birth using variables from the survey on the topics of Religion, Morality, Politics, Society, and National Identity.
</div>
<div class="alert alert-block alert-success">
Perform the following steps, in order:
<ol>
    <li> Load the EVS dataset (provided) 
    <li> Create a feature matrix containing all of the columns in <code>x_columns</code> (provided above) and a vector containing the outcome variable (year of birth)
    <li> Perform a shuffled train-test split of the feature matrix and outcome vector, with a random state of 42
    <li> Create an instance of the <code>ShuffleSplit</code> class, with 5 splits, a test size of 0.2, and a random state of 42
    <li> Using the <code>ShuffleSplit</code> instance, perform cross-validated linear regression on your training data 
    <li> Report the mean score from your cross-validated linear regression, rounded to 2 decimal places
</ol>
</div>

In [None]:
# Load 'evs_module_07.csv'
df = pd.read_csv(PATH_TO_DATA/"evs_module_07.csv")

# Create a feature matrix using all of the column names in `x_columns`
X = df[▰▰1▰▰] 

# Create the outcome variable
y = df[▰▰2▰▰]

# Perform a train-test split
X_train, X_test, y_train, y_test = ▰▰3▰▰(X, y, shuffle=▰▰4▰▰, random_state=▰▰5▰▰)

# Create an instance of the shufflesplit class
shuffsplit = ▰▰6▰▰(n_splits=▰▰7▰▰, test_size=▰▰8▰▰, random_state=▰▰9▰▰)

# Retrieve the scores from a cross-validated linear regression
ols_scores = ▰▰10▰▰(▰▰11▰▰(), X_train, y_train, cv=shuffsplit)
print(ols_scores)
print(f"Mean: {ols_scores.mean()}")

round(ols_scores.mean(), 2) 
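
To make the role of `ShuffleSplit` more concrete, here is a small sketch on a toy array. The array and the parameter values below are illustrative assumptions, not the ones the problem asks for. Each split is an independent random partition of the row indices, so the same row can land in the test portion of more than one split, unlike in k-fold cross-validation.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy data: 12 rows, 2 columns (illustrative only)
X_toy = np.arange(24).reshape(12, 2)

# Illustrative settings; not the values requested in Problem 1
toy_split = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

split_sizes = []
for train_idx, test_idx in toy_split.split(X_toy):
    # Each iteration yields two arrays of row indices
    split_sizes.append((len(train_idx), len(test_idx)))
    print("train rows:", sorted(train_idx.tolist()),
          "| test rows:", sorted(test_idx.tolist()))

print(split_sizes)
```

When an instance like this is passed as the `cv` argument of a cross-validation helper, the model is refit once per split on the train rows and scored on the test rows.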

## Problem 2:
<div class="alert alert-block alert-info">  
From the above results, it should be pretty clear that our OLS did <b>not</b> perform well. It's possible that this under-performance is the result of overfitting. If so, we can use either ridge or lasso regularization to help dig our model out of its overfitting problem. It's also possible, however, that our model just isn't all that good. In that case, regularization won't be of much use, and will -- at best -- produce results that are roughly the same as, or worse than, the unregularized model. 
</div>
<div class="alert alert-block alert-success">
Using 10 different alpha settings, perform 10 cross-validated ridge and lasso regressions (10 each). When performing the cross-validations, use the same instance of the <code>ShuffleSplit</code> class from the previous question. Plot the resulting cross-validated R2 scores. Examine the plot. Is the OLS model from Problem 1 significantly overfitting the data? Submit your answer as a <b>boolean</b> value.
</div>

In [None]:
# create an evenly-spaced set of 10 alpha scores ranging from near-0 to 2 
alphas = np.linspace(0.01, 2, 10)

ridge_r2s = []
lasso_r2s = []

olscv_score = cross_val_score(LinearRegression(), X_train, y_train, cv=shuffsplit)

# iterate over the alpha scores
for alpha in alphas:
    # Instantiate a new ridge regression with an alpha score
    new_ridge = ▰▰1▰▰(▰▰2▰▰)
    # Use cross validation to fit the regression and retrieve the average score
    ridge_r2s.append(cross_val_score(new_ridge, ▰▰3▰▰, ▰▰4▰▰, cv=▰▰5▰▰).mean())
    # Instantiate a new lasso regression with an alpha score
    new_lasso = ▰▰6▰▰(▰▰7▰▰)
    # Use cross validation to fit the regression and retrieve the average score
    lasso_r2s.append(cross_val_score(new_lasso, ▰▰8▰▰, ▰▰9▰▰, cv=▰▰10▰▰).mean())
    
r2s = pd.DataFrame(
    zip(alphas, ridge_r2s, lasso_r2s), 
    columns = ["alpha", "ridge", "lasso"])

fig, ax = plt.subplots()
sns.lineplot(x="alpha", y="ridge", data = r2s, label="Ridge", linestyle='solid')
sns.lineplot(x="alpha", y="lasso", data = r2s, label = "Lasso", linestyle='dashed')
ax.axhline(olscv_score.mean(), label="OLS", linestyle='dotted', color="darkgray")
ax.set(xlabel='alpha values for Ridge and Lasso Regressions', ylabel='R2')
sns.despine()
ax.legend()
plt.show()
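
When reading the plot, it can help to recall what overfitting looks like numerically: a model that scores far better on the data it was fit to than on held-out rows. The sketch below illustrates this on synthetic data using `cross_validate` with `return_train_score=True`. The data, sample sizes, and variable names are all assumptions made for illustration and have nothing to do with the EVS dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

rng = np.random.default_rng(0)

# Synthetic data: 40 rows, 20 features, only the first feature matters
X_syn = rng.normal(size=(40, 20))
y_syn = X_syn[:, 0] + rng.normal(size=40)

cv = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
results = cross_validate(
    LinearRegression(), X_syn, y_syn, cv=cv, return_train_score=True
)

train_r2 = results["train_score"].mean()
test_r2 = results["test_score"].mean()

# A large gap between the two means is the signature of overfitting
print(f"Mean train R2: {train_r2:.2f}")
print(f"Mean held-out R2: {test_r2:.2f}")
```

By contrast, when the train and held-out scores are both low and close together, there is little for regularization to fix.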

## Problem 3:
<div class="alert alert-block alert-info">  
Regardless of how Question 2 turned out, it's pretty clear that a garden-variety OLS model isn't going to be the best choice for predicting the survey respondents' years of birth. Let's see if we can do better with a Logistic Regression instead. To accomplish this, we're going to categorize everyone in the dataset as either being born after (1) or before (0) 1970 (those born in 1970 also count as 'before'). 
</div>
<div class="alert alert-block alert-success">
Create a vector containing an outcome variable indicating whether each respondent was born after (1), before (0), or during (0) the year 1970. Using this new outcome vector and a remade feature matrix (again, using all of the columns from <code>x_columns</code>), create a shuffled train-test split and a <code>ShuffleSplit</code> with 5 splits and a test size of 0.2. Both the train-test split and <code>ShuffleSplit</code> should use a random state of 42. <br><br>
    Once all of this is done, run a cross-validated Logistic Regression (with <code>max_iter=1000</code>, which will help it converge) using your new training data and <code>ShuffleSplit</code> instance. Report the mean score from your cross-validated regression, rounded to 2 decimal places.
</div>

In [None]:
# Create feature matrix
X = df[▰▰1▰▰] 
# Create binary target vector
y = np.▰▰2▰▰(df[▰▰3▰▰] ▰▰4▰▰ 1970, 1, 0)

# Create train-test split
X_train, X_test, y_train, y_test = ▰▰5▰▰(X, y, shuffle=▰▰6▰▰, random_state=▰▰7▰▰)
# Create shuffle split
shuffsplit = ▰▰8▰▰(n_splits=▰▰9▰▰, test_size=▰▰10▰▰, random_state=▰▰11▰▰)

# Retrieve the scores from a cross-validated logistic regression
log_reg_scores = cross_val_score(
    ▰▰12▰▰(max_iter=▰▰13▰▰), 
    X_train, 
    y_train, 
    cv=▰▰14▰▰)

print(log_reg_scores)
print(f"Mean: {log_reg_scores.mean()}")

round(log_reg_scores.mean(), 2) 

## Problem 4:
<div class="alert alert-block alert-success">
Find the most impactful predictor from your logistic regression, as measured by the magnitude of the coefficient.
</div>

In [None]:
log_reg = ▰▰1▰▰(max_iter=1000, random_state=42)
log_reg.▰▰2▰▰(▰▰3▰▰, ▰▰4▰▰)

## Problem 5:
<div class="alert alert-block alert-success">
Find the second-most impactful predictor from your logistic regression, as measured by the magnitude of the coefficient.
</div>

In [None]:
log_reg = ▰▰1▰▰(max_iter=1000, random_state=42)
log_reg.▰▰2▰▰(▰▰3▰▰, ▰▰4▰▰)

## Problem 6:
<div class="alert alert-block alert-success">
Find the third-most impactful predictor from your logistic regression, as measured by the magnitude of the coefficient.
</div>

In [None]:
log_reg = ▰▰1▰▰(max_iter=1000, random_state=42)
log_reg.▰▰2▰▰(▰▰3▰▰, ▰▰4▰▰)
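
For Problems 4 to 6, a fitted scikit-learn `LogisticRegression` exposes its coefficients in the `coef_` attribute, which for a binary outcome has shape `(1, n_features)`. One possible way to rank predictors by absolute coefficient is sketched below. The coefficient values and feature names are made up for illustration; they stand in for a fitted model's `coef_` and the feature matrix's column names.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for `model.coef_` and `X.columns`
fake_coef = np.array([[0.20, -1.50, 0.75]])
fake_names = ["feature_a", "feature_b", "feature_c"]

# Take the single row of coefficients, label it, and sort by absolute value
ranking = (
    pd.Series(fake_coef[0], index=fake_names)
    .abs()
    .sort_values(ascending=False)
)

print(ranking)
```

Note that raw coefficient magnitudes are only directly comparable when predictors share a scale. The survey items here use different ranges (1 to 4, 1 to 7, 1 to 10), so treat the ranking as the problems define it rather than as a general measure of importance.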

## Problem 7:
<div class="alert alert-block alert-info">  
Now that we've spent some time classifying people by year of birth, it's time to move on to a much trickier classification task! We're going to see how well various classification methods can identify EVS respondents' countries of origin. Since there are 33 countries in the dataset, and the difficulty of classification scales with the number of labels, we shouldn't expect much in the way of miraculous performances here. Even an accuracy score of .3 is notable, as it is about 10 times greater than the accuracy score we would expect if our model were guessing purely at random. <br><br>We're going to start with a basic decision tree and -- assuming that it isn't going to do a great job -- immediately move on to a regularized decision tree. Feel free to tweak the regularization parameters on your regularized decision tree in whatever manner you like; your primary goal for this question is to produce a regularized tree that performs better than its unregularized counterpart, as judged by accuracy score. 
</div>
<div class="alert alert-block alert-success">
Perform the following steps, in order:
<ol>
    <li> Create a feature matrix containing all of the columns in <code>x_columns</code> and a vector containing the outcome variable, <code>country</code>
    <li> Perform a stratified, shuffled train-test split of the feature matrix and outcome vector, with a random state of 1
    <li> Create an instance of the <code>ShuffleSplit</code> class, with 5 splits, a test size of 0.1, and a random state of 1
    <li> Using the <code>ShuffleSplit</code> instance, cross-validate a default decision tree on your training data 
    <li> Use at least one regularization parameter to create a regularized decision tree that outperforms the basic decision tree based on cross-validated accuracy score (using the same  <code>ShuffleSplit</code> as before)
</ol>
Submit the <b>name</b> of one parameter you used to create your regularized decision tree (the parameter name should appear exactly as it does in the function call; make sure to include underscores in the parameter name, but do not use <b>any</b> other punctuation, spaces, numbers, or special symbols -- for example, we often use the parameter <code>random_state=42</code>; the name of this parameter is <code>random_state</code>)
</div>

In [None]:
# Create feature matrix
X = df[▰▰1▰▰]
# Create target vector (countries)
y = np.array(df.▰▰2▰▰)

# Create *stratified* train-test split
X_train, X_test, y_train, y_test = ▰▰3▰▰(
    X, 
    y, 
    shuffle=▰▰4▰▰, 
    random_state=1,
    stratify=▰▰5▰▰
)
# Create ShuffleSplit
shuffsplit = ▰▰6▰▰(n_splits=5, test_size=0.1, random_state=1)

# Instantiate decision tree
dtclass = ▰▰7▰▰(random_state=42)
# get cross-validated scores from decision tree
dt_scores = ▰▰8▰▰(dtclass, X_train, y_train, cv=shuffsplit)

print(f"Mean: {dt_scores.mean()}")

# Instantiate a decision tree *using your choice of regularizing parameters* 
dtclass_pruned = DecisionTreeClassifier(▰▰9▰▰, ▰▰10▰▰, random_state=42)
dt_pruned_scores = cross_val_score(dtclass_pruned, X_train, y_train, cv=shuffsplit)
print(f"Mean: {dt_pruned_scores.mean()}")

dtclass_pruned.fit(X_train, y_train)
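
The chance baseline from the Problem 7 preamble can be checked with a little arithmetic. The sketch below assumes, for simplicity, that the 33 countries are equally represented (an assumption; the real sample sizes may differ), and uses scikit-learn's `DummyClassifier` on made-up labels to confirm the figure.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

n_countries = 33

# Expected accuracy when guessing uniformly among equally sized classes
chance_accuracy = 1 / n_countries
print(f"Chance accuracy: {chance_accuracy:.3f}")
print(f"0.3 is {0.3 / chance_accuracy:.1f} times chance")

# Made-up balanced labels: 10 rows per class; the features are irrelevant
y_fake = np.repeat(np.arange(n_countries), 10)
X_fake = np.zeros((len(y_fake), 1))

# A classifier that always predicts a single class gets 1/33 right here
baseline = DummyClassifier(strategy="most_frequent").fit(X_fake, y_fake)
baseline_accuracy = baseline.score(X_fake, y_fake)
print(f"Constant-guess accuracy: {baseline_accuracy:.3f}")
```

If the classes are unbalanced, always predicting the largest class will do better than 1/33, which makes it the fairer baseline to beat.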

## Problem 8:
<div class="alert alert-block alert-info">  
Let's now give the same treatment to two of the ensemble methods we learned about: Random Forest and Gradient Boosted Trees.
</div>
<div class="alert alert-block alert-success">
Run a cross-validated Random Forest Classifier with 100 estimators, 2 maximum features, and a random state of 42. Then, run a cross-validated Gradient Boosted Trees Classifier with 10 estimators and a random state of 42. According to the accuracy scores, does the Gradient Boosted Trees Classifier outperform the Random Forest Classifier? Submit your answer as a <b>boolean</b> value.
</div>

In [None]:

# Create and cross validate a Random Forest Classifier
rforest = ▰▰1▰▰(▰▰2▰▰=100,
                                ▰▰3▰▰=2,
                                random_state=42)

rforest_scores = cross_val_score(rforest, X_train, y_train, cv=shuffsplit)
print(rforest_scores)
print(f"Mean: {rforest_scores.mean()}")

# Create and cross validate a Gradient Boosted Trees Classifier
gboost = ▰▰4▰▰(▰▰5▰▰=10,
                                random_state=42)
gboost_scores = cross_val_score(gboost, X_train, y_train, cv=shuffsplit)
print(gboost_scores)
print(f"Mean: {gboost_scores.mean()}")

## Problem 9:
<div class="alert alert-block alert-info">  
Finally, let's add a K-Nearest Neighbours classifier to the mix.
</div>
<div class="alert alert-block alert-success">
Run a cross-validated K-Nearest Neighbour classifier, setting <code>K</code> to 20. Report the mean cross-validated accuracy score, rounded to two decimal places.
</div>

In [None]:
knn_model = ▰▰1▰▰(▰▰2▰▰=20)

knn_scores = cross_val_score(knn_model, X_train, y_train, cv=shuffsplit)
print(knn_scores)
print(f"Mean: {knn_scores.mean()}")

round(knn_scores.mean(), 2)

## Problem 10:
<div class="alert alert-block alert-info">  
Throughout this assignment, we've used cross-validation to assess how our models have performed. This means that we have an as-yet untouched set of test data that we can use to make a final model selection. In this question, we're going to ask you to indicate which of the five classifiers we explored achieved the highest cross-validated accuracy. Then, in the next question, we'll ask you to report which of our candidate models achieves the highest accuracy on the held-out test data. 
</div>
<div class="alert alert-block alert-success">
Submit the name of the model that achieved the highest cross-validated accuracy.
</div>

In [None]:
part_b_models = [
    dtclass,
    dtclass_pruned,
    rforest,
    gboost,
    knn_model]

# Write a few lines of code that will help you determine which model had the best cross-validated accuracy


## Problem 11:
<div class="alert alert-block alert-info">  
After fitting each classifier to the full training data and assessing their performance in light of the held-out test data, report the name of the model that achieves the highest accuracy. 
</div>
<div class="alert alert-block alert-success">
Perform the following steps, in order:
<ol>
    <li> Fit each of the five models on the entire training dataset
    <li> Assess each of the fitted models' scores on the held-out test data
    <li> Submit the name of the model that achieved the highest accuracy on the test data
</ol>
</div>

In [None]:
part_b_models = [
    dtclass,
    dtclass_pruned,
    rforest,
    gboost,
    knn_model]

# Write a few lines of code that will help you determine which model had the best accuracy on the test set
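
The workflow described in Problems 10 and 11 (compare candidates by cross-validation on the training data, then score them once on the held-out test data) can be sketched generically. Everything below is illustrative: the synthetic dataset from `make_classification`, the two stand-in models, and the dictionary names are assumptions, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score, train_test_split

# Synthetic three-class data standing in for the real feature matrix
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, stratify=y_syn, random_state=0
)
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# Stand-in candidates, keyed by a readable name
candidates = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "logistic": LogisticRegression(max_iter=1000),
}

# Step 1: mean cross-validated accuracy, using the training data only
cv_means = {
    name: cross_val_score(model, X_tr, y_tr, cv=cv).mean()
    for name, model in candidates.items()
}

# Step 2: refit on all training data, then score once on the test data
test_scores = {
    name: model.fit(X_tr, y_tr).score(X_te, y_te)
    for name, model in candidates.items()
}

print("Best by CV:  ", max(cv_means, key=cv_means.get))
print("Best by test:", max(test_scores, key=test_scores.get))
```

Keeping the test data out of step 1 is what makes the step 2 scores an honest estimate of how each model handles unseen data.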
