# Problem Set 03

## POLI 175 - Machine Learning for Social Scientists

### Due Date: 03/10/2023

In this problem set we will work with the 2016 wave of the American National Election Studies (ANES). You can find it [here](https://electionstudies.org).

The file you are going to use for this problem set is `anes2016.csv`. The variable names are self-explanatory: I pre-processed the data to drop unimportant variables and gave the remaining ones informative names.

We are going to do two things: first, we are going to predict the vote in the election. Second, we are going to predict two types of swing voters:

- Swing voters who said before the election that they would vote for one candidate and then voted for a different one.

- Swing voters who voted for one party in the 2012 election but shifted their choice in the 2016 election.

**Content required to solve this homework**: All class content up to Non-Linearity. Specifically:

1. Logistic Regression
2. KNN
3. LDA
4. QDA
5. Lasso (or L1 regularization)
6. Ridge (or L2 regularization)
7. GAMs
8. Splines
9. Cross-Validation: K-Fold and Split-sample CVs.

Please reach out if you have any questions.

## Loading Packages and pre-processing the dataset

In [None]:
## Pandas and Numpy
import pandas as pd
import numpy as np

# Seaborn and MatplotLib
import seaborn as sns
import matplotlib.pyplot as plt

# StatsModels
import statsmodels.api as sm
from statsmodels.gam.api import GLMGam, BSplines

# Scikit Learn
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge, Lasso
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, classification_report, precision_score, get_scorer_names
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score, ConfusionMatrixDisplay, f1_score
from sklearn.model_selection import train_test_split, LeaveOneOut, cross_val_score, KFold, GridSearchCV
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, SplineTransformer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import Pipeline

dat = pd.read_csv('https://raw.githubusercontent.com/umbertomig/POLI175public/main/data/anes2016.csv')

## Predictors
cont_predictors = ['pay_attn_pol_cont', 'feel_dem_cand_cont', 'feel_rep_cand_cont',
                   'how_many_live_hh_cont', 'better_1y_ago_cont', 'lib_con_scale_cont', 
                   'soc_spend_favor_cont', 'def_spend_favor_cont', 'private_hi_favor_cont',
                   'age_cont', 'schooling_cont']

other_predictors = ['int_follow_campg', 'anything_like_dem_cand', 'anything_like_rep_cand', 
                    'approve_congr', 'things_right_track', 'has_hinsur', 'favor_aca', 
                    'afraid_dem_cand', 'disgust_dem_cand', 'afraid_rep_cand', 'disgust_rep_cand',
                    'incgap_morethan_20y_ago', 'economy_improved', 'unempl_improved', 
                    'speaksmind_dem_cand', 'speaksmind_rep_cand', 'shoud_hard_buy_gun', 
                    'favor_affirmaction', 'govt_benefit_all', 'all_ingovt_corrup', 
                    'election_makegovt_payattn', 'global_warming_happen', 'favor_death_penalty',
                    'econ_better_since_2008', 'relig_important', 'married', 'latinx',
                    'white', 'black', 'both_parents_bornUS', 'any_grandparent_foreign',
                    'rent_home', 'has_unexp_passap', 'should_roughup_protestors', 
                    'justified_useviolence', 'consider_self_feminist', 'ppl_easily_offended_nowadays',
                    'soc_media_learn_pres', 'satisfied_life']
## Targets
targetvote = 'vote_pres_2016'
targetswing_2016 = 'swing_2016'
targetswing_2012_2016 = 'swing_2016_2012'
dat[targetvote] = dat[targetvote].map({'Johnson': 'Other', 'Stein': 'Other', 
                                       'Trump': 'Trump', 'Clinton': 'Clinton'})

## Level Target
levels_target3cand = ['Clinton', 'Other', 'Trump']
levels_target2cand = ['Clinton', 'Trump']
levels_target_swing = ['Non-Swingers', 'Swingers']

## Question 01

### Fit a KNN model that predicts vote in the 2016 election based on the feeling thermometer (`feel_rep_cand_cont`) and the age (`age_cont`).

Below, I prepared the data for you. Parameters:

1. Find the optimal K between 1 and 21, in steps of 2. Check the variable `bigK` that I prepared for you.
    + Use K Fold cross-validation to search for the best K.

2. Plot the optimal K

3. Fit the model with the optimal K and use split-sample cross-validation, holding out 30 percent of the data as your testing set (the data prep below uses `test_size = 0.30`).

4. Print the confusion matrix and the classification report.

5. Discuss your findings in terms of the prediction achieved.

Hints: you can use the code from class, or you can try [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). For grid search, check the following code:

```
# Assume bigK is our list of parameters, and has been defined.
# Then, you can run:
knn = KNeighborsClassifier()
parameters = {'n_neighbors': bigK}
model = GridSearchCV(knn, param_grid = parameters, cv = 5)
model.fit(X, y)
print('Best value of K is ', model.best_params_)
```

Note that `parameters` can hold many more things. For example, you could change the way distance is computed to check whether this affects anything. The parameters for KNN are listed [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).

One example of search that changes the K but also the weights is:

```
# Assume bigK is our list of parameters, and has been defined.
# Then, you can run:
knn = KNeighborsClassifier()
parameters = {'n_neighbors': bigK, 'weights': ['uniform', 'distance']}
model = GridSearchCV(knn, param_grid = parameters, cv = 5)
model.fit(X, y)
print('Best parameter combination is ', model.best_params_)
```
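After a grid search, the cross-validated score of every candidate is stored in `cv_results_`, which is what you need to plot accuracy against K. The sketch below illustrates this on synthetic data; `X_demo`, `y_demo`, and `search` are hypothetical names and the data are not the ANES file.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for two standardized predictors (an assumption, not the ANES data)
X_demo, y_demo = make_classification(n_samples=300, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)

bigK = list(range(1, 22, 2))
search = GridSearchCV(KNeighborsClassifier(), param_grid={'n_neighbors': bigK}, cv=5)
search.fit(X_demo, y_demo)

# cv_results_ holds one entry per candidate, in the same order as the grid
ks = [p['n_neighbors'] for p in search.cv_results_['params']]
mean_acc = search.cv_results_['mean_test_score']
best_k = search.best_params_['n_neighbors']

for k, acc in zip(ks, mean_acc):
    print(f'K = {k:2d}: mean CV accuracy = {acc:.3f}')
print('Best K:', best_k)
```

From here, something like `plt.plot(ks, mean_acc, marker = 'o')` followed by `plt.axvline(best_k, ls = 'dotted')` gives a quick picture of how accuracy changes with K.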

Another useful hint: You can [plot your Confusion Matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html#sklearn.metrics.ConfusionMatrixDisplay.from_estimator) using the following code:

```
## Plot confusion matrix
ConfusionMatrixDisplay.from_estimator(estimation_model, X_test, y_test,
        display_labels = ['level 1', 'level 2', ..., 'level n'], # How many levels in your target variable :)
        cmap = plt.cm.Blues, normalize = 'true') # Making it pretty
plt.show()
```
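If you prefer numbers to a plot, `confusion_matrix` and `classification_report` give the same information as text. Below is a minimal sketch on synthetic three-class data; the names `X_demo`, `y_demo`, and `knn_demo` are hypothetical, and K = 5 is an arbitrary choice rather than a tuned value.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data standing in for a three-level target (an assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=4, n_informative=3,
                                     n_redundant=0, n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30,
                                          random_state=12345)

knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
y_pred = knn_demo.predict(X_te)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_te, y_pred)

# normalize='true' divides each row by its total, so every row sums to 1
cm_norm = confusion_matrix(y_te, y_pred, normalize='true')

print(cm)
print(classification_report(y_te, y_pred))
```

With `normalize = 'true'`, the diagonal of the matrix is the recall for each class, which matches the `recall` column of the classification report.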


In [None]:
## Data Prep
bigK = list(range(1, 22, 2))
y = dat[targetvote]
X = StandardScaler().set_output(transform = 'pandas').fit_transform(dat[['feel_rep_cand_cont', 'age_cont']])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 12345)

# Recalling:
# Step 1: use the training set + GridSearchCV to find the best parameters
# Step 2: with the best parameters fitted on the training set, predict on the testing set and evaluate.

In [None]:
# your code here
raise NotImplementedError

## Question 02

### Fit a KNN model that predicts vote in the 2016 election using a spline on both these variables.

Tuning objective:

1. Find the optimal number of knots, from 5 to 15.

Below, I prepared the data for you.

**Hint:**

To run efficiently, we can build a pipeline. The way to do it is the following:

```
# Classifier: KNN
knn = KNeighborsClassifier()

# Spline
splines = SplineTransformer(degree = 3, extrapolation = 'constant')

# Parameters for pipeline (can be set using '__' separated parameter names):
param_grid = {
    'splines__n_knots': nknots,
    'splines__knots': ['quantile', 'uniform'],
    'splines__extrapolation': ['constant', 'linear'],
    'knn__n_neighbors': bigK, 
    'knn__weights': ['uniform', 'distance']
}

# Build the pipeline:
pipe = Pipeline(steps = [('splines', splines), ('knn', knn)])
search = GridSearchCV(pipe, param_grid, n_jobs = -1)
search.fit(X, y)

# Results
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
```

The pipeline can be as complicated as you wish it to be.

To learn more about that, please check this useful code [here](https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html).


**Question:** Are you doing better using a spline? Explain.

In [None]:
## Data Prep
bigK = list(range(1, 22, 2))
nknots = list(range(5, 16))
knotstypes = ['quantile', 'uniform']
extraptypes = ['constant', 'linear'] # regular x natural spline
y = dat[targetvote]
X = StandardScaler().set_output(transform = 'pandas').fit_transform(dat[['feel_rep_cand_cont', 'age_cont']])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 12345)

# Recalling:
# Step 1: use the training set + GridSearchCV to find the best parameters
# Step 2: with the best parameters fitted on the training set, predict on the testing set and evaluate.

In [None]:
# your code here
raise NotImplementedError

## Question 03

### Fit a KNN model that predicts vote in the 2016 election using all variables.

- Tuning objective: find optimal K.

**Question:** Are you doing better with more variables? Explain.

In [None]:
## Data Prep
bigK = list(range(1, 22, 2))
y = dat[targetvote]

# Continuous predictors: standardize
X_cont = StandardScaler().set_output(transform = 'pandas').fit_transform(dat[cont_predictors])

# Put together with other predictors
X = dat[other_predictors].join(X_cont)

# Save some portion for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 12345)

In [None]:
# your code here
raise NotImplementedError

## Question 04

### Fit a KNN model that predicts vote on Trump and Clinton in the 2016 election, using all variables

I decided to make your life easier (believe me, nothing good ever comes after this sentence!). You only need to predict the votes for Clinton and Trump.

- Tuning objective: find optimal K.

**Question:** How does the quality of the prediction compare to the model in Q3? Explain.

In [None]:
## Data Prep
bigK = list(range(1, 22, 2))
new_targets = ['Clinton', 'Trump']
dat2 = dat.loc[dat[targetvote].isin(new_targets)]
y = dat2[targetvote]

# 1. Process the continuous variables and make them standardized
X_cont = StandardScaler().set_output(transform = 'pandas').fit_transform(dat2[cont_predictors])

# 2. Join with all others and compose the full predictor dataset
X = dat2[other_predictors].join(X_cont)

# Save some portion for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 12345)

In [None]:
# your code here
raise NotImplementedError

## Question 05

### Can you do better in the Clinton x Trump classification?

Try:

1. Logistic Regression
2. LDA
3. QDA
4. Naïve Bayes

**Question:** Are we better off compared with KNN? Explain.
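One convenient way to compare several classifiers on equal footing is to loop over them with the same cross-validation setup. This is a sketch on synthetic binary data, with hypothetical names (`X_demo`, `y_demo`, `models`, `cv_accuracy`); it is not the solution for the ANES data.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data (an assumption, not the ANES data)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=4,
                                     n_redundant=0, random_state=0)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'LDA': LinearDiscriminantAnalysis(),
    'QDA': QuadraticDiscriminantAnalysis(),
    'Naive Bayes': GaussianNB(),
}

# Same 5-fold CV for every model, so the mean accuracies are comparable
cv_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_accuracy[name] = scores.mean()
    print(f'{name}: mean CV accuracy = {scores.mean():.3f}')
```

Keeping the folds and the scoring rule identical across models means that any difference in the printed numbers comes from the classifier, not from the evaluation.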

In [None]:
## Data Prep
new_targets = ['Clinton', 'Trump']
dat2 = dat.loc[dat[targetvote].isin(new_targets)]
y = dat2[targetvote]

# 1. Process the continuous variables and make them standardized
X_cont = StandardScaler().set_output(transform = 'pandas').fit_transform(dat2[cont_predictors])

# 2. Join with all others and compose the full predictor dataset
X = dat2[other_predictors].join(X_cont)

# Save some portion for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 12345)

In [None]:
# your code here
raise NotImplementedError

## Question 06

### Tuned Logistic Regression

In the ML world, they usually add two penalties (or [regularizations](https://en.wikipedia.org/wiki/Regularization_(mathematics)#Other_uses_of_regularization_in_statistics_and_machine_learning), if you prefer this word) to our optimization functions: L1 and L2.

**L1**: Works like the Lasso in Linear Regressions.

**L2**: Works like the Ridge in Linear Regressions.

The next one combines both penalties. It is called **elasticnet**.

In this question, your job is to fit a Logistic Regression, searching for the best L1 tuning parameter. Note that in the LR world, this parameter is called C. It is the inverse of the alpha that we studied ($\alpha = \dfrac{1}{C}$), so a smaller C means a stronger penalty.

**Question:** Are we better off now compared with the un-tuned LR? Explain.

Hint: You will need some extra parameters for your Logistic Regression to work within these specifications. Here is a good starting point:

```
logreg = LogisticRegression(penalty = 'l1', solver = 'saga', max_iter = 1000000)
```

Here, we set the penalty, the solver, and the maximum number of iterations.
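To build intuition for how C interacts with the L1 penalty, the sketch below counts nonzero coefficients at a few values of C on synthetic data. It is only an illustration: the names `X_demo`, `y_demo`, and `nonzero` are hypothetical, and the `liblinear` solver is swapped in (instead of `saga`) simply because it converges quickly on a toy problem.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 predictors, only 4 of them informative (an assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, n_informative=4,
                                     n_redundant=0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

nonzero = {}
for C in [0.01, 1.0, 100.0]:
    # Small C = large alpha = strong L1 penalty
    fit = LogisticRegression(penalty='l1', solver='liblinear', C=C).fit(X_demo, y_demo)
    nonzero[C] = int(np.sum(fit.coef_[0] != 0))
    print(f'C = {C}: alpha = {1 / C}, nonzero coefficients = {nonzero[C]}')
```

As with the Lasso in linear regression, a stronger penalty pushes more coefficients exactly to zero, so the L1 fit doubles as a variable-selection step.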

In [None]:
## Data Prep
Cs = np.logspace(-2, 2, 100)
new_targets = ['Clinton', 'Trump']
dat2 = dat.loc[dat[targetvote].isin(new_targets)]
y = dat2[targetvote]

# 1. Process the continuous variables and make them standardized
X_cont = StandardScaler().set_output(transform = 'pandas').fit_transform(dat2[cont_predictors])

# 2. Join with all others and compose the full predictor dataset
X = dat2[other_predictors].join(X_cont)

# Save some portion for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 12345)

In [None]:
# your code here
raise NotImplementedError

## Question 07

### Tuned Logistic Regression

Plot the coefficients. Do they make sense to you? Hint: use `sns.barplot`. Here is some code to get you started:

```
# Size of the plot #
f, ax = plt.subplots(figsize=(6, 15)) # Good for oversized plots (with lots of variables :)

# Nice colors
sns.set_color_codes("pastel")

# Note that I inverted the x and y axes! #
sns.barplot(y = my_testing_or_training_dataset.columns, x = my_regression_model.coef_[0], color="b")

# Add a beautiful line at zero! #
ax.axvline(0, color = 'black', ls = 'dotted', lw = 0.5)

# Declutter #
sns.despine(left=True, bottom=True)

# Done #
plt.show()
```

In [None]:
# your code here
raise NotImplementedError

## Question 08

### Tuned Logistic Regression

In this question, your job is to fit a Logistic Regression, searching for the best **L2** tuning parameter. Here, L2 is the same penalty as the Ridge in Linear Regression models. Again, the tuning parameter is called C, the inverse of the alpha that we studied ($\alpha = \dfrac{1}{C}$).

**Question:** Are we better off now compared with the un-tuned LR? How about the **L1** tuned model? Explain.

Hint: You will need some extra parameters for your Logistic Regression to work within these specifications. Here is a good starting point:

```
logreg = LogisticRegression(penalty = 'l2', solver = 'saga', max_iter = 1000000)
```

Here, we set the penalty, the solver, and the maximum number of iterations.
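The L2 penalty behaves differently from L1: it shrinks coefficients toward zero without setting them exactly to zero. The sketch below illustrates this on synthetic data, using scikit-learn's default solver for brevity; `X_demo`, `y_demo`, `coef_norm`, and `n_zero` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 predictors, only 4 of them informative (an assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, n_informative=4,
                                     n_redundant=0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

coef_norm, n_zero = {}, {}
for C in [0.01, 1.0, 100.0]:
    # The default penalty in LogisticRegression is L2
    fit = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    coef_norm[C] = float(np.linalg.norm(fit.coef_[0]))
    n_zero[C] = int(np.sum(fit.coef_[0] == 0))
    print(f'C = {C}: coefficient norm = {coef_norm[C]:.3f}, exact zeros = {n_zero[C]}')
```

The coefficient norm grows as C grows (weaker penalty), while the count of exact zeros stays at zero, which is the Ridge-like behavior you should expect when you plot your coefficients.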

In [None]:
## Data Prep
Cs = np.logspace(-2, 2, 100)
new_targets = ['Clinton', 'Trump']
dat2 = dat.loc[dat[targetvote].isin(new_targets)]
y = dat2[targetvote]

# 1. Process the continuous variables and make them standardized
X_cont = StandardScaler().set_output(transform = 'pandas').fit_transform(dat2[cont_predictors])

# 2. Join with all others and compose the full predictor dataset
X = dat2[other_predictors].join(X_cont)

# Save some portion for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 12345)

In [None]:
# your code here
raise NotImplementedError

## Question 09

### Tuned Logistic Regression

Plot the coefficients. Do they make sense to you? Hint: use the **Q7** code hint.

In [None]:
# your code here
raise NotImplementedError

## Question 10

### Predicting Swing Voters

Now your job just got considerably harder: you need to predict swing voters. You should start with the swing between the two waves of the survey and then move to the swing from the 2012 election.

#### Between the two waves

- The ANES runs the survey at two different points: before the election and after the election. The respondents are the same. There are, then, two opportunities to swing:
    - First, from one wave to the other. The idea is that between the two waves a voter could change her mind.
    - Second, between 2012 and 2016. We will look at the first and then at the second swing and try to predict it.
    
- Use Logistic Regression in here.

Objectives within each model:

1. Tuning objectives: Use L2 regularization, searching for the best parameter.

2. Plot and discuss the coefficients.

**Question:** Are you confident in your prediction capacity? Explain.
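One way to judge prediction capacity is to compare against a baseline that always predicts the most frequent class. The sketch below assumes, purely for illustration, a synthetic target where the positive class is rare (about 10 percent); the actual share of swing voters in the ANES data is not assumed here, and `X_demo`, `y_demo`, and `baseline` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data with a rare positive class (an assumption, not the ANES data)
X_demo, y_demo = make_classification(n_samples=1000, n_features=5,
                                     weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30,
                                          stratify=y_demo, random_state=12345)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
y_base = baseline.predict(X_te)

base_acc = accuracy_score(y_te, y_base)
base_bal = balanced_accuracy_score(y_te, y_base)
print(f'Baseline accuracy:          {base_acc:.3f}')
print(f'Baseline balanced accuracy: {base_bal:.3f}')
```

When one class is rare, plain accuracy can look high even for a model that never predicts it, so the per-class recall in the classification report (or the balanced accuracy) is often the more honest number to discuss.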

In [None]:
## Data Prep for model between two waves
Cs = np.logspace(-2, 2, 100)
level_swingers = ['Non-swingers', 'Swingers']
y = dat[targetswing_2016]
X_cont = StandardScaler().set_output(transform = 'pandas').fit_transform(dat[cont_predictors])
X = dat[other_predictors].join(X_cont)

## Save some portion for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 12345)

## Solution between waves below ##

In [None]:
# your code here
raise NotImplementedError

In [None]:
## Data Prep for model from 2012 to 2016
y = dat[targetswing_2012_2016]
X_cont = StandardScaler().set_output(transform = 'pandas').fit_transform(dat[cont_predictors])
X = dat[other_predictors].join(X_cont)

# Save some portion for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 12345)

# Solution between 2012 and 2016 #

In [None]:
# your code here
raise NotImplementedError