<br><br><font color="gray">DOING COMPUTATIONAL SOCIAL SCIENCE<br>MODULE 10 <strong>PROBLEM SETS</strong></font>

# <font color="#49699E" size=40>MODULE 10 </font>


# What You Need to Know Before Getting Started

- **Every notebook assignment has an accompanying quiz**. Your work in each notebook assignment will serve as the basis for your quiz answers.
- **You can consult any resources you want when completing these exercises and problems**. Just as in the "real world", if you can't figure out how to do something, look it up. My recommendation is that you check the relevant parts of the assigned reading or search for inspiration on [https://stackoverflow.com](https://stackoverflow.com).
- **Each problem is worth 1 point**. All problems are equally weighted.
- **The information you need for each problem set is provided in the blue and green cells.** General instructions / the problem set preamble are in the blue cells, and instructions for specific problems are in the green cells. **You have to execute all of the code in the problem set, but you are only responsible for entering code into the code cells that immediately follow a green cell**. You will also recognize those cells because they will be incomplete. You need to replace each blank `▰▰#▰▰` with the code that will make the cell execute properly (where # is a sequentially-increasing integer, one for each blank).
- Most modules will contain at least one question that requires you to load data from disk; **it is up to you to locate the data, place it in an appropriate directory on your local machine, and replace any instances of the `PATH_TO_DATA` variable with a path to the directory containing the relevant data**.
- **The comments in the problem cells contain clues indicating what the following line of code is supposed to do.** Use these comments as a guide when filling in the blanks. 
- **You can ask for help**. If you run into problems, you can reach out to John (john.mclevey@uwaterloo.ca) or Pierson (pbrowne@uwaterloo.ca) for help. You can ask a friend for help if you like, regardless of whether they are enrolled in the course.

Finally, remember that you do not need to "master" this content before moving on to other course materials, as what is introduced here is reinforced throughout the rest of the course. You will have plenty of time to practice and cement your new knowledge and skills.
<div class='alert alert-block alert-danger'>As you complete this assignment, you may encounter variables that can be assigned a wide variety of different names. Rather than forcing you to employ a particular convention, we leave the naming of these variables up to you. During the quiz, submit an answer of 'USER_DEFINED' (without the quotation marks) to fill in any blank that you assigned an arbitrary name to. In most circumstances, this will occur due to the presence of a local iterator in a for-loop.</div>

## Package Imports

In [1]:
import pandas as pd 

import numpy as np
from numpy.random import seed as np_seed


import graphviz
from graphviz import Source

from pyprojroot import here

import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, ShuffleSplit
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder, LabelBinarizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

import tensorflow as tf
from tensorflow import keras
from tensorflow.random import set_seed

import spacy 

from time import time

set_seed(42)
np_seed(42)



## Defaults

In [2]:
x_columns = [

    # Religion and Morale
    'v54', # Religious services? - 1=More than Once Per Week, 7=Never
    'v149', # Do you justify: claiming state benefits? - 1=Never, 10=Always
    'v150', # Do you justify: cheating on tax? - 1=Never, 10=Always 
    'v151', # Do you justify: taking soft drugs? - 1=Never, 10=Always 
    'v152', # Do you justify: taking a bribe? - 1=Never, 10=Always 
    'v153', # Do you justify: homosexuality? - 1=Never, 10=Always 
    'v154', # Do you justify: abortion? - 1=Never, 10=Always 
    'v155', # Do you justify: divorce? - 1=Never, 10=Always 
    'v156', # Do you justify: euthanasia? - 1=Never, 10=Always 
    'v157', # Do you justify: suicide? - 1=Never, 10=Always 
    'v158', # Do you justify: having casual sex? - 1=Never, 10=Always 
    'v159', # Do you justify: public transit fare evasion? - 1=Never, 10=Always 
    'v160', # Do you justify: prostitution? - 1=Never, 10=Always 
    'v161', # Do you justify: artificial insemination? - 1=Never, 10=Always 
    'v162', # Do you justify: political violence? - 1=Never, 10=Always 
    'v163', # Do you justify: death penalty? - 1=Never, 10=Always 

    # Politics and Society
    'v97', # Interested in Politics? - 1=Interested, 4=Not Interested
    'v121', # How much confidence in Parliament? - 1=High, 4=Low
    'v126', # How much confidence in Health Care System? - 1=High, 4=Low
    'v142', # Importance of Democracy - 1=Unimportant, 10=Important
    'v143', # Democracy in own country - 1=Undemocratic, 10=Democratic
    'v145', # Political System: Strong Leader - 1=Good, 4=Bad
#     'v208', # How often follow politics on TV? - 1=Daily, 5=Never
#     'v211', # How often follow politics on Social Media? - 1=Daily, 5=Never

    # National Identity
    'v170', # How proud are you of being a citizen? - 1=Proud, 4=Not Proud
    'v184', # Immigrants: impact on development of country - 1=Bad, 5=Good
    'v185', # Immigrants: take away jobs from Nation - 1=Take, 10=Do Not Take
    'v198', # European Union Enlargement - 1=Should Go Further, 10=Too Far Already
]

y_columns = [
    # Overview
    'country',
    
    # Socio-demographics
    'v226', # Year of Birth by respondent 
    'v261_ppp', # Household Monthly Net Income, PPP-Corrected

]


## Problem 1:
<div class="alert alert-block alert-info">  
In this assignment, we're going to continue our exploration of the European Values Survey dataset. By wielding the considerable power of Artificial Neural Networks, we'll aim to create a model capable of predicting an individual survey respondent's country of residence. As with all machine/deep learning projects, our first task will involve loading and preparing the data.
</div>
<div class="alert alert-block alert-success">
Load the EVS dataset and use it to create a feature matrix (using all columns from x_columns) and (with the assistance of Scikit Learn's LabelBinarizer) a target array (representing each respondent's country of residence).  
</div>

In [None]:
# Load EVS Dataset 
df = pd.read_csv(PATH_TO_DATA/"evs_module_08.csv")

# Create Feature Matrix (using all columns from x_columns)
X = df[x_columns] 

# Initialize LabelBinarizer
country_encoder = ▰▰1▰▰()

# Fit the LabelBinarizer instance to the data's 'country' column and store transformed array as target 
y = country_encoder.▰▰2▰▰(np.array(▰▰3▰▰))
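
For intuition about what a label binarizer produces, here is a minimal sketch on a made-up list of country codes. The codes and every `toy_` name are hypothetical and are not taken from the EVS file. With three or more classes, the result has one row per observation and one column per class, and `classes_` records the column order.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical country codes, for illustration only
toy_countries = np.array(["DE", "FR", "DE", "IT", "FR"])

toy_encoder = LabelBinarizer()
toy_onehot = toy_encoder.fit_transform(toy_countries)

print(toy_encoder.classes_)  # column order of the one-hot matrix
print(toy_onehot.shape)      # (number of rows, number of classes)

# inverse_transform maps one-hot rows back to the original labels
toy_roundtrip = toy_encoder.inverse_transform(toy_onehot)
```

Keeping the fitted encoder around is useful later, since its `classes_` attribute is what labels the axes of the confusion matrix heatmap.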

## Problem 2:

<div class="alert alert-block alert-info">  
As part of your work in the previous module, you were introduced to the concept of the train-validate-test split. Up until now, we have made extensive use of Scikit Learn's preprocessing and cross-validation suites in order to easily get the most out of our data. Since we're using TensorFlow for our Artificial Neural Networks, we're going to have to change course a little: we can still use the <code>train_test_split</code> function, but we must now use it twice: the first iteration will produce our test set and a 'temporary' dataset; the second iteration will split the 'temporary' data into training and validation sets. Throughout this process, we must take pains to ensure that each of the data splits is shuffled and stratified. 
</div>
<div class="alert alert-block alert-success">
Create shuffled, stratified splits for testing (10% of original dataset), validation (10% of data remaining from test split), and training (90% of data remaining from test split) sets. Submit the number of observations in the <code>X_valid</code> set, as an integer.
</div>

In [None]:

# Split into temporary and test sets
X_t, X_test, y_t, y_test = ▰▰1▰▰(
    ▰▰2▰▰,
    ▰▰3▰▰,
    test_size = ▰▰4▰▰,
    shuffle = ▰▰5▰▰,
    stratify = y,
    random_state = 42
)

# Split into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(
    ▰▰6▰▰,
    ▰▰7▰▰,
    test_size = ▰▰8▰▰,
    shuffle = ▰▰9▰▰,
    stratify = ▰▰10▰▰,
    random_state = 42,
)

len(X_valid)
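
The two-stage pattern can be sketched on synthetic data. Everything below is an illustrative assumption: the `toy_` names, the balanced four-class target, and the split fractions (20% and 25%), which are deliberately different from the ones the problem asks for.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 200 rows, 3 features, 4 balanced classes (illustrative only)
toy_X = np.arange(600).reshape(200, 3)
toy_y = np.repeat(np.arange(4), 50)

# Stage 1: hold back 20% of all rows as the test set
toy_X_tmp, toy_X_test, toy_y_tmp, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, shuffle=True, stratify=toy_y, random_state=0
)

# Stage 2: split the remaining rows into training and validation sets
toy_X_train, toy_X_valid, toy_y_train, toy_y_valid = train_test_split(
    toy_X_tmp, toy_y_tmp, test_size=0.25, shuffle=True, stratify=toy_y_tmp, random_state=0
)

print(len(toy_X_train), len(toy_X_valid), len(toy_X_test))
```

Note that the second `test_size` is a fraction of the temporary set, not of the original data, so the validation set here is 25% of 160 rows rather than 25% of 200.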

## Problem 3:

<div class="alert alert-block alert-info">  
As you work with Keras and TensorFlow, you'll rapidly discover that both packages are very picky about the 'shape' of the data you're using. What's more, you can't always rely on them to correctly infer your data's shape. As such, it's usually a good idea to store the two most important shapes -- number of variables in the feature matrix and number of unique categories in the target -- as explicit, named variables; doing so will save you the trouble of trying to retrieve them later (or as part of your model specification, which can get messy). We'll start with the number of variables in the feature matrix.
</div>
<div class="alert alert-block alert-success">
Store the number of variables in the feature matrix, as an integer, in the <code>num_vars</code> variable. Submit the resulting number as an integer.
</div>

In [None]:
# The code we've provided here is just a suggestion; feel free to use any approach you like
num_vars = np.▰▰1▰▰(▰▰2▰▰).▰▰3▰▰[1]

print(num_vars)

## Problem 4:

<div class="alert alert-block alert-info">  
Now, for the number of categories (a.k.a. labels) in the target.
</div>
<div class="alert alert-block alert-success">
Store the number of categories in the target, as an integer, in the <code>num_labels</code> variable. Submit the resulting number as an integer.
</div>

In [None]:
# The code we've provided here is just a suggestion; feel free to use any approach you like
num_labels = ▰▰1▰▰.▰▰2▰▰[1]

print(num_labels)
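
As a small illustration of reading both shapes, the sketch below uses hypothetical stand-ins (`toy_features`, `toy_target`) rather than the EVS data. Both pandas DataFrames and NumPy arrays expose a `shape` tuple of the form (rows, columns), so index 1 gives the column count.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for a feature matrix and a one-hot target
toy_features = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
toy_target = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])

# shape is (rows, columns); index 1 is the column count
toy_num_vars = toy_features.shape[1]
toy_num_labels = toy_target.shape[1]

print(toy_num_vars, toy_num_labels)
```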

## Problem 5:

<div class="alert alert-block alert-info">  
Everything is now ready for us to begin building an Artificial Neural Network! Aside from specifying that the ANN must be built using Keras's <code>Sequential</code> API, we're going to give you the freedom to tackle the creation of your ANN in whichever manner you like. Feel free to use the 'add' method to build each layer one at a time, or pass all of the layers to your model at instantiation as a list, or any other approach you may be familiar with. Kindly ensure that your model matches the specifications below <b>exactly</b>!
</div>
<div class="alert alert-block alert-success">
Using Keras's <code>Sequential</code> API, create a new ANN. Your ANN should have the following layers, in this order:
<ol>
<li> Input layer with one argument: number of variables in the feature matrix
<li> Dense layer with 400 neurons and the "relu" activation function
<li> Dense layer with 10 neurons and the "relu" activation function
<li> Dense layer with neurons equal to the number of labels in the target and the "softmax" activation function
</ol>
Submit the number of hidden layers in your model.
</div>

In [None]:
# Create your ANN!
nn_model = keras.models.Sequential()
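
The two construction styles mentioned above can be sketched on a deliberately tiny network. The layer sizes (4 inputs, 8 hidden units, 3 outputs) and the `toy_` names are hypothetical and do not match the assignment's specification, and `keras.Input(shape=...)` is only one of several ways to declare the input.

```python
from tensorflow import keras

# Style 1: add layers one at a time
toy_model_a = keras.models.Sequential()
toy_model_a.add(keras.Input(shape=(4,)))
toy_model_a.add(keras.layers.Dense(8, activation="relu"))
toy_model_a.add(keras.layers.Dense(3, activation="softmax"))

# Style 2: pass all of the layers as a list at instantiation
toy_model_b = keras.models.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])

# Both styles describe the same architecture
toy_model_a.summary()
```

Either style should give an equivalent model; calling `summary()` is a quick way to confirm the layer order and output shapes before compiling.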

## Problem 6:
<div class="alert alert-block alert-info">  
Even though we've specified all of the layers in our model, it isn't yet ready to go. We must first 'compile' the model, during which time we'll specify a number of high-level arguments. Just as in the textbook, we'll go with a fairly standard set of arguments: we'll use Stochastic Gradient Descent as our optimizer, and our only metric will be Accuracy (an imperfect but indispensably simple measure). It'll be up to you to figure out what loss function we should use: you might have to go digging in the textbook to find it!
</div>
<div class="alert alert-block alert-success">
Compile the model according to the specifications outlined in the blue text above. Submit the name of the loss function <b>exactly</b> as it appears in your code (you should only need to include a single underscore -- no other punctuation, numbers, or special characters). 
</div>

In [None]:
nn_model.▰▰1▰▰(
    loss=keras.losses.▰▰2▰▰,
    optimizer=▰▰3▰▰,
    metrics=[▰▰4▰▰]
)

## Problem 7:
<div class="alert alert-block alert-info">  
Everything is prepared. All that remains is to train the model! 
</div>
<div class="alert alert-block alert-success">
Train your neural network for 100 epochs. Be sure to include the validation data variables. 
</div>

In [None]:
np_seed(42)
tf.random.set_seed(42)

history = nn_model.▰▰1▰▰(▰▰2▰▰, ▰▰3▰▰, epochs=▰▰4▰▰, validation_data = (▰▰5▰▰, ▰▰6▰▰))

## Problem 8:
<div class="alert alert-block alert-info">  
For some Neural Networks, 100 epochs is more than ample time to reach a best solution. For others, 100 epochs isn't enough time for the learning process to even get underway. One good method for assessing the progress of your model at a glance involves visualizing how your loss scores and metric(s) -- for both your training and validation sets -- changed during training. 
</div>
<div class="alert alert-block alert-success">
After 100 epochs of training, is the model still appreciably improving? (If it is still improving, you shouldn't see much evidence of overfitting). Submit your answer as a boolean value (True = still improving, False = not still improving). 
</div>

In [None]:
pd.DataFrame(history.history).plot(figsize = (8, 8))
plt.grid(True)
plt.show()
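
Alongside the plot, the same curves can be inspected numerically. The sketch below uses made-up numbers arranged like the dictionary stored in `history.history`; the key names assume a model compiled with accuracy as its only metric, and the `toy_` names are hypothetical.

```python
import pandas as pd

# Made-up numbers shaped like the dict stored in `history.history`
toy_history = {
    "loss":         [2.0, 1.5, 1.2, 1.0, 0.9, 0.8],
    "accuracy":     [0.20, 0.35, 0.45, 0.52, 0.56, 0.60],
    "val_loss":     [2.1, 1.6, 1.3, 1.2, 1.2, 1.25],
    "val_accuracy": [0.18, 0.33, 0.42, 0.46, 0.46, 0.45],
}
toy_curves = pd.DataFrame(toy_history)

# Epoch (0-indexed) with the lowest validation loss
toy_best_epoch = int(toy_curves["val_loss"].idxmin())

# Gap between validation and training loss; a widening gap hints at overfitting
toy_gap = toy_curves["val_loss"] - toy_curves["loss"]

print(toy_best_epoch)
print(toy_gap.round(2).tolist())
```

In this invented example the training loss keeps falling while the validation loss flattens and then rises, which is the pattern to look for when judging whether further epochs are still helping.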

## Problem 9:
<div class="alert alert-block alert-info">  
Regardless of whether this model is done or not, it's time to dig into what our model has done. Here, we'll continue re-tracing the steps taken in the textbook, producing a (considerably more involved) confusion matrix, visualizing it as a heatmap, and peering into our model's soul. The first step in this process involves creating the confusion matrix.
</div>
<div class="alert alert-block alert-success">
Using the held-back test data, create a confusion matrix. 
</div>

In [None]:
y_pred = np.argmax(nn_model.predict(▰▰1▰▰), axis=1)

y_true = np.argmax(▰▰2▰▰, axis=1)

conf_mat = tf.math.confusion_matrix(▰▰3▰▰, ▰▰4▰▰)
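
To see what the `argmax` step and the resulting matrix look like, here is a small sketch that builds a confusion matrix by hand in NumPy instead of calling `tf.math.confusion_matrix`. The arrays are hypothetical. The layout follows the convention documented for the TensorFlow function: rows index the true class and columns index the predicted class.

```python
import numpy as np

# Hypothetical one-hot truths and predicted class probabilities (3 classes)
toy_true_onehot = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
toy_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.5, 0.3],
    [0.3, 0.4, 0.3],
])

# argmax along axis 1 turns each row into a single class index
toy_true = np.argmax(toy_true_onehot, axis=1)
toy_pred = np.argmax(toy_probs, axis=1)

# Count (true, predicted) pairs: rows are true classes, columns are predicted classes
toy_conf = np.zeros((3, 3), dtype=int)
np.add.at(toy_conf, (toy_true, toy_pred), 1)

print(toy_conf)
```

The diagonal holds the correct predictions; every off-diagonal cell counts observations of one class that were assigned to another.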

## Problem 10:
<div class="alert alert-block alert-info">  
Finally, we're ready to visualize the matrix we created above. Rather than asking you to recreate the baroque visualization code, we're going to skip straight to interpretation. 
</div>
<div class="alert alert-block alert-success">
Plot the confusion matrix heatmap and examine it. Based on what you know about the dataset, should the sum of the values in a column (representing the number of observations from a country) be the same for each country? If so, submit the integer that each column adds up to. If not, submit 0.
</div>

In [None]:
sns.set(rc={'figure.figsize':(12,12)})
plt.figure()
sns.heatmap(
    np.array(conf_mat).T,
    xticklabels=country_encoder.classes_,
    yticklabels=country_encoder.classes_,
    square=True,
    annot=True,
    fmt='g',
)
plt.xlabel("Observed")
plt.ylabel("Predicted")
plt.show()
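
When reading the heatmap, it can help to compute the row and column totals directly. The sketch below uses an invented 3-class matrix (not output from the assignment's model) and applies the same transpose as the plotting code, so that columns correspond to observed classes and rows to predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true classes, columns are predicted classes
toy_conf = np.array([
    [5, 1, 0],
    [2, 3, 1],
    [0, 2, 5],
])

# Transpose so rows are predictions and columns are observations, as in the heatmap
toy_plot = toy_conf.T

toy_observed_totals = toy_plot.sum(axis=0)   # column sums: test observations per class
toy_predicted_totals = toy_plot.sum(axis=1)  # row sums: predictions per class

print(toy_observed_totals, toy_predicted_totals)
```

Both sets of totals add up to the size of the test set, but only the column totals are fixed by the data; the row totals depend on how the model distributes its predictions.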

## Problem 11:
<div class="alert alert-block alert-success">
Based on what you know about the dataset, should the sum of the values in a row (representing the number of observations your model <b>predicted</b> as being from a country) be the same for each country? If so, submit the integer that each row adds up to. If not, submit 0.
</div>

## Problem 12:
<div class="alert alert-block alert-success">
If your model was built and run to the specifications outlined in the assignment, your results should include at least three countries whose observations the model struggled to identify (fewer than 7 accurate predictions each). Submit the name of one such country.<br><br>As a result of the randomness inherent to these models, it is possible that your interpretation will be correct, but will be graded as incorrect. If you feel that your interpretation was erroneously graded, please email a screenshot of your confusion matrix heatmap to Pierson along with an explanation of how you arrived at the answer you did. 
</div>