## Exercise

- In this exercise, we will work on a classification task: predicting the vote in the Brexit referendum
- The data originally come from the British Election Study Online Panel
  - codebook: https://www.britishelectionstudy.com/wp-content/uploads/2020/05/Bes_wave19Documentation_V2.pdf
- The outcome is `LeaveVote` (1: Leave, 0: otherwise)
- The inputs we use come from the following article:
  - Hobolt, Sara (2016) The Brexit vote: a divided nation, a divided continent. _Journal of European Public Policy_, 23 (9) (https://doi.org/10.1080/13501763.2016.1225785)

In [None]:
!wget https://www.dropbox.com/s/up1zpkozgscaty1/brexit_bes_sampled_data.csv

## Import packages

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

## Load data

In [None]:
df_bes = pd.read_csv("brexit_bes_sampled_data.csv")

In [None]:
df_bes.head()

## Model

- There are four models in the article. We will use the identity model (Model 2 in Table 2)
- List of input variables:
  gender, age, edlevel, hhincome, EuropeanIdentity, EnglishIdentity, BritishIdentity
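If the CSV carries more columns than the seven listed above, the inputs can be selected by name rather than by dropping the outcome. The sketch below assumes the column names match the list exactly, which should be checked against the file header; `toy` is a made-up stand-in for `df_bes` so that the snippet runs on its own.

```python
import pandas as pd

# Assumed column names, copied from the list above; adjust to the CSV header.
identity_vars = ["gender", "age", "edlevel", "hhincome",
                 "EuropeanIdentity", "EnglishIdentity", "BritishIdentity"]

# Toy stand-in for df_bes (values are invented purely for illustration).
toy = pd.DataFrame({
    "gender": [0, 1, 1],
    "age": [34, 58, 71],
    "edlevel": [3, 1, 2],
    "hhincome": [7, 4, 5],
    "EuropeanIdentity": [6, 2, 1],
    "EnglishIdentity": [3, 7, 6],
    "BritishIdentity": [5, 6, 7],
    "LeaveVote": [0, 1, 1],
    "extra_column": [9, 9, 9],
})

# Fail early if any expected input is absent from the data.
missing = [c for c in identity_vars if c not in toy.columns]
if missing:
    raise KeyError(f"Missing input columns: {missing}")

X_toy = toy[identity_vars]   # inputs only, in the listed order
y_toy = toy["LeaveVote"]     # binary outcome
```

With the real data, the same two selection lines would be applied to `df_bes` in place of `toy`.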

In [None]:
X = df_bes.drop('LeaveVote', axis = 1)
y = df_bes['LeaveVote']

## Train-test split

In [None]:
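from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)  # split size and seed are illustrative choices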

## Data wrangling

In [None]:
from sklearn.preprocessing import StandardScaler
st_scaler = StandardScaler()

In [None]:
X_train = st_scaler.fit_transform(X_train)
X_test = st_scaler.transform(X_test)

In [None]:
X_test[:3]

## Fit logistic model

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logitmod = LogisticRegression()

In [None]:
logitmod.fit(X_train, y_train)

In [None]:
pred_test = logitmod.predict(X_test)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
confusion_matrix(y_test, pred_test)

In [None]:
print(classification_report(y_test, pred_test))

## KNN classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier


In [None]:
knnmod = KNeighborsClassifier(n_neighbors=2)
knnmod.fit(X_train,y_train)


In [None]:
pred_knn = knnmod.predict(X_test)

In [None]:
confusion_matrix(y_test, pred_knn)

In [None]:
print(classification_report(y_test, pred_knn))

### Parameter tuning for KNN

- Parameter tuning will be done using cross-validation
- Re-estimate the model for different values of the tuning parameters
  - For KNN, try different values of _k_
- By default, the evaluation metric for classification tasks is accuracy. Here I use the F1 score for the positive class (`LeaveVote` = 1) instead.
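What a grid search automates can be sketched by hand: for each candidate _k_, compute the cross-validated F1 for the positive class and keep the _k_ with the highest mean. The snippet uses synthetic data from `make_classification` (not the BES sample), and the candidate values and fold count are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data so the snippet runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)

# F1 computed for the positive class (label 1) only.
f1_pos = make_scorer(f1_score, average="binary", pos_label=1)

# Mean 5-fold cross-validated F1 for a few candidate values of k.
cv_f1 = {}
for k in [1, 5, 15]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_demo, y_demo, cv=5, scoring=f1_pos)
    cv_f1[k] = scores.mean()

# The tuned value is the k with the highest mean score.
best_k = max(cv_f1, key=cv_f1.get)
print(cv_f1)
print(best_k)
```

`GridSearchCV` runs the same loop over a full grid and, by default, refits the best configuration on all of the training data.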


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer
f1 = make_scorer(f1_score, average = 'binary', pos_label = 1)

In [None]:
knn2 = KNeighborsClassifier()
param_grid = {'n_neighbors': np.concatenate((np.arange(1, 25), [30, 40, 50]))}
knn_cv = GridSearchCV(knn2, param_grid, cv=10, scoring=f1)
# fit the grid search on the training data
knn_cv.fit(X_train, y_train)

In [None]:
np.concatenate((np.arange(1, 25), [30, 40, 50]))

In [None]:
print(knn_cv.best_score_)
print(knn_cv.best_params_)

In [None]:
knn_cv.param_grid

In [None]:
knn_cv.cv_results_['mean_test_score']

In [None]:
sns.set_style('whitegrid')
plt.plot(knn_cv.param_grid['n_neighbors'], knn_cv.cv_results_['mean_test_score'])
plt.xlabel('k (n_neighbors)')
plt.ylabel('Mean cross-validated F1')

### Final model

In [None]:
pred_knn = knn_cv.predict(X_test)

In [None]:
print(confusion_matrix(y_test, pred_knn))
print(classification_report(y_test, pred_knn))

## Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

### Parameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV
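One possible way to fill in this step is a small grid search over the number of trees and the tree depth. This is only a sketch under assumptions: the grid values, the fold count, and the synthetic data are illustrative, and with the exercise data `X_train`, `y_train`, and `X_test` would take the place of the demo arrays.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data so the snippet runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

# Illustrative grid: number of trees and maximum tree depth.
rf_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# Score each setting by F1 for the positive class, using 5-fold CV.
rf_cv = GridSearchCV(RandomForestClassifier(random_state=0), rf_grid,
                     cv=5, scoring=make_scorer(f1_score, pos_label=1))
rf_cv.fit(X_tr, y_tr)

print(rf_cv.best_params_)
print(rf_cv.best_score_)

# The refitted best model predicts on the held-out data.
pred_rf = rf_cv.predict(X_te)
```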

## AdaBoost Classifier

In [None]:
from sklearn.ensemble import AdaBoostClassifier

### Parameter tuning
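A comparable sketch for AdaBoost tunes the number of boosting rounds and the learning rate. The grid values and the synthetic data are assumptions made for illustration; the exercise's own training and test sets would replace the demo arrays.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data so the snippet runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

# Illustrative grid: boosting rounds and shrinkage applied to each round.
ada_grid = {"n_estimators": [25, 50], "learning_rate": [0.1, 1.0]}

# Score each setting by F1 for the positive class, using 5-fold CV.
ada_cv = GridSearchCV(AdaBoostClassifier(random_state=0), ada_grid,
                      cv=5, scoring=make_scorer(f1_score, pos_label=1))
ada_cv.fit(X_tr, y_tr)

print(ada_cv.best_params_)
print(ada_cv.best_score_)

# The refitted best model predicts on the held-out data.
pred_ada = ada_cv.predict(X_te)
```

A lower learning rate usually needs more boosting rounds to reach the same fit, which is why the two parameters are tuned together here.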

## Support Vector Classifier

In [None]:
from sklearn.svm import SVC
svcmod = SVC(gamma='auto')

In [None]:
svcmod.fit(X_train, y_train)

In [None]:
pred_svc = svcmod.predict(X_test)

In [None]:
print(confusion_matrix(y_test, pred_svc))
print(classification_report(y_test, pred_svc))

In [None]:
param_grid = {'C':[1,10,100,1000], # penalty for misclassification
              'gamma':[1,0.1,0.001,0.0001], # RBF kernel coefficient; larger values give a more flexible fit
              'kernel':['rbf']}
svc_cv = GridSearchCV(SVC(),param_grid, refit = True, verbose=2)
svc_cv.fit(X_train,y_train)

In [None]:
print(svc_cv.best_score_)
print(svc_cv.best_params_)

In [None]:
pred_svc = svc_cv.predict(X_test)
print(classification_report(y_test, pred_svc))
print(confusion_matrix(y_test, pred_svc))