## Exercise

- In this exercise, we work on a classification task: predicting the vote in the Brexit referendum
- The data come originally from the British Election Study Online Panel
  - codebook: https://www.britishelectionstudy.com/wp-content/uploads/2020/05/Bes_wave19Documentation_V2.pdf
- The outcome is `LeaveVote` (1: Leave, 0: otherwise)
- The input variables are taken from the following article:
  - Hobolt, Sara (2016) The Brexit vote: a divided nation, a divided continent. _Journal of European Public Policy_, 23 (9) (https://doi.org/10.1080/13501763.2016.1225785)

In [1]:
!wget https://www.dropbox.com/s/up1zpkozgscaty1/brexit_bes_sampled_data.csv

--2020-12-14 21:34:04--  https://www.dropbox.com/s/up1zpkozgscaty1/brexit_bes_sampled_data.csv
Resolving www.dropbox.com (www.dropbox.com)... 162.125.1.18, 2620:100:6016:18::a27d:112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.1.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/up1zpkozgscaty1/brexit_bes_sampled_data.csv [following]
--2020-12-14 21:34:04--  https://www.dropbox.com/s/raw/up1zpkozgscaty1/brexit_bes_sampled_data.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucc1093ca4f199c00816edf9a2be.dl.dropboxusercontent.com/cd/0/inline/BFHqsU78ZXwkdHJ3QTQEoQvUf2P2LLk70-fpPRJ0Z-UcZoMSvVFP7LpTwblXd1tPpPmlNic-8mBBORE26cRnWg-t3wqZZhicb1kxU_bB2NdZsQpvCDaEn5077pmib9wseCI/file# [following]
--2020-12-14 21:34:04--  https://ucc1093ca4f199c00816edf9a2be.dl.dropboxusercontent.com/cd/0/inline/BFHqsU78ZXwkdHJ3QTQEoQvUf2P2LLk70-fpPRJ0Z-UcZoMSvVFP7LpTwblXd1tPpP

## Import packages

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

## Load data

In [3]:
df_bes = pd.read_csv("brexit_bes_sampled_data.csv")

# Model

- There are four models in the article. We will use the identity model (Model 2 in Table 2)
- List of input variables:
  gender, age, edlevel, hhincome, EuropeanIdentity, EnglishIdentity, BritishIdentity

In [4]:
df_bes_sub = df_bes[['gender', 'age', 'edlevel', 'hhincome', 'EuropeanIdentity', 'EnglishIdentity', 'BritishIdentity', 'LeaveVote']]

In [67]:
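# Inputs: the first seven columns of the subset (everything except LeaveVote)
x = df_bes_sub.iloc[:, 0:7]
# Outcome: taken from the same subset so x and y share one source
y = df_bes_sub['LeaveVote']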

# Train-test split

In [7]:
from sklearn.model_selection import train_test_split

In [68]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = .25, random_state = 42)

In [16]:
print(x_train.shape)
print(x_test.shape)

(1500, 7)
(500, 7)


# Data wrangling

In [18]:
from sklearn.preprocessing import StandardScaler
st_scaler = StandardScaler()

In [69]:
x_train = st_scaler.fit_transform(x_train)
x_test = st_scaler.transform(x_test)

In [20]:
x_test[:3]

array([[-1.01342342,  0.50775322,  0.71937721, -1.69515159,  0.69979419,
        -0.01998199,  0.89072259],
       [ 0.98675438,  1.09793518, -0.00726644, -0.52769182,  0.18245004,
         0.8882901 ,  0.89072259],
       [ 0.98675438,  1.39302616,  0.71937721, -1.1114217 ,  0.18245004,
         0.43415406, -0.36087824]])
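The scaler above is fit on the training set only and then reused on the test set, so the test data never influences the scaling parameters. The sketch below illustrates the consequence on synthetic data (the arrays `train_demo` and `test_demo` are made up for illustration and are not the BES data): the training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for x_train and x_test (assumption: not the BES data)
rng = np.random.default_rng(42)
train_demo = rng.normal(loc=50, scale=10, size=(1500, 7))
test_demo = rng.normal(loc=50, scale=10, size=(500, 7))

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train_demo)  # learns mean/std from train
test_scaled = demo_scaler.transform(test_demo)        # reuses the train mean/std

# Training columns are standardized by construction
print(train_scaled.mean(axis=0).round(3))
print(train_scaled.std(axis=0).round(3))

# Test columns are close to, but generally not exactly, mean 0 and std 1
print(test_scaled.mean(axis=0).round(3))
print(test_scaled.std(axis=0).round(3))
```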

## Fit logistic model

In [23]:
from sklearn.linear_model import LogisticRegression

In [24]:
from sklearn.metrics import classification_report, confusion_matrix

In [25]:
logitmod = LogisticRegression()

In [26]:
logitmod.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [49]:
pred_logit = logitmod.predict(x_test)
pred_logit

array([0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1,
       0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
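The notebook imports `classification_report` and `confusion_matrix` but does not yet apply them to the predictions. A minimal sketch of how the evaluation could look is given below. It runs on synthetic data from `make_classification` so that it is self-contained; the names `x_demo`, `y_demo`, and `demo_logit` are hypothetical stand-ins and not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the BES data (assumption): 2000 rows, 7 inputs, binary outcome
x_demo, y_demo = make_classification(
    n_samples=2000, n_features=7, n_informative=4, random_state=42
)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=42
)

# Same preprocessing as in the notebook: fit the scaler on the training set only
demo_scaler = StandardScaler()
x_tr = demo_scaler.fit_transform(x_tr)
x_te = demo_scaler.transform(x_te)

demo_logit = LogisticRegression().fit(x_tr, y_tr)
demo_pred = demo_logit.predict(x_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, demo_pred)
print(cm)

# Per-class precision, recall, and F1
print(classification_report(y_te, demo_pred))
```

In the notebook itself, the corresponding calls would be `confusion_matrix(y_test, pred_logit)` and `classification_report(y_test, pred_logit)`.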

## KNN classifier

In [36]:
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import Pipeline

In [48]:
x_test.shape

(500, 7)

In [51]:
nca = NeighborhoodComponentsAnalysis(random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
nca_pipe = Pipeline([('nca', nca), ('knn', knn)])
nca_pipe.fit(x_train, y_train)

Pipeline(memory=None,
         steps=[('nca',
                 NeighborhoodComponentsAnalysis(callback=None, init='auto',
                                                max_iter=50, n_components=None,
                                                random_state=42, tol=1e-05,
                                                verbose=0, warm_start=False)),
                ('knn',
                 KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=None, n_neighbors=3, p=2,
                                      weights='uniform'))],
         verbose=False)

### Parameter tuning for KNN



In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer
f1 = make_scorer(f1_score, average = 'binary', pos_label = 1)
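One possible way to use the F1 scorer defined above is a cross-validated grid search over the number of neighbors. The sketch below is only an illustration: it uses synthetic data and a plain `KNeighborsClassifier`, and the candidate grid `[3, 5, 7, 9, 11]` is an arbitrary assumption rather than a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training data (assumption)
x_demo, y_demo = make_classification(
    n_samples=400, n_features=7, n_informative=4, random_state=42
)

# Same scorer as in the cell above: F1 for the positive class
f1 = make_scorer(f1_score, average='binary', pos_label=1)

# Hypothetical grid of candidate values for k
param_grid = {'n_neighbors': [3, 5, 7, 9, 11]}

grid = GridSearchCV(KNeighborsClassifier(), param_grid, scoring=f1, cv=5)
grid.fit(x_demo, y_demo)

print(grid.best_params_)
print(round(grid.best_score_, 3))
```

When tuning the `nca_pipe` pipeline instead of a bare classifier, the parameter name would need the step prefix, for example `'knn__n_neighbors'`, and the data would be `x_train` and `y_train`.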

### Final model

## Support Vector Classifier

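- We try a support vector classifier (SVC) here
- With the default RBF kernel, this is a non-linear, non-parametric classifier
- It is much more flexible than logistic regression
- For more information, see James et al., _An Introduction to Statistical Learning_, Chapter 9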



In [None]:
from sklearn.svm import SVC
svcmod = SVC(gamma='auto')
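The cell above only creates the classifier. A minimal sketch of the remaining steps (fit, predict, score) could look like the following. It again relies on synthetic data, and the names `svc_demo` and `svc_pred` are hypothetical; accuracy is used here purely as an example metric.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the BES data (assumption)
x_demo, y_demo = make_classification(
    n_samples=1000, n_features=7, n_informative=4, random_state=42
)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=42
)

# SVC is sensitive to feature scale, so standardize as in the notebook
demo_scaler = StandardScaler()
x_tr = demo_scaler.fit_transform(x_tr)
x_te = demo_scaler.transform(x_te)

# gamma='auto' sets the RBF kernel coefficient to 1 / n_features
svc_demo = SVC(gamma='auto')
svc_demo.fit(x_tr, y_tr)

svc_pred = svc_demo.predict(x_te)
acc = accuracy_score(y_te, svc_pred)
print(round(acc, 3))
```

In the notebook, the equivalent would be `svcmod.fit(x_train, y_train)` followed by `svcmod.predict(x_test)`.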