## Brexit: logistic regression and decision trees

### Brexit

On June 23rd, 2016, the UK held a national referendum to decide whether the country should leave the
EU (‘Brexit’). The result, a win for the Leave campaign, surprised many political commentators, who had
expected that people would vote to Remain. Immediately people began to look for patterns that could explain
the Leave vote: cities had generally voted to Remain, while small towns had voted to Leave. England and
Wales voted to Leave, while Northern Ireland and especially Scotland voted to Remain.


In the next few days, the Guardian newspaper presented some apparent demographic trends in the vote, based
on the ages, incomes, education and class of different electoral wards
(https://www.theguardian.com/politics/ng-interactive/2016/jun/23/eu-referendum-live-results-and-analysis).
The Guardian’s analysis stopped at showing these results graphically, and commenting on the apparent patterns.
We will go one better by doing some real statistical analysis of the data.

I have scraped the data from the Guardian’s plots into a data file (brexit.csv) which you can download from
MINERVA.

There are 6 attributes in the data. The 5 possible input variables are:

* abc1: proportion of individuals who are in the ABC1 social classes (middle to upper class)
* medianIncome: the median income of all residents
* medianAge: median age of residents
* withHigherEd: proportion of residents with any university-level education
* notBornUK: the proportion of residents who were born outside the UK

These are normalised so that the lowest value is zero and the highest value is one.
The output variable is called voteBrexit, and gives a TRUE/FALSE answer to the question ‘did this electoral
ward vote for Brexit?’ (i.e. did more than 50% of people vote to Leave?).
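The normalisation described above (lowest value mapped to zero, highest to one) is usually called min-max scaling. The sketch below shows the idea on a small made-up list; the function name `min_max_scale` and the example ages are assumptions for illustration and are not taken from brexit.csv.

```python
def min_max_scale(values):
    """Rescale a list of numbers so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot rescale a constant column")
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw median ages for four wards (illustrative values only).
raw_ages = [29.0, 35.0, 41.0, 47.0]
scaled_ages = min_max_scale(raw_ages)
print(scaled_ages)
```

Because every input is on the same 0 to 1 scale, a coefficient in the logistic regression describes the change in log-odds when moving from the lowest ward to the highest ward on that input.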

Tasks (week 6):

1. Fit a logistic regression model using all of the available inputs. Identify the direction of each effect
from the fitted coefficients. Compare these with the plots shown on the Guardian website. Do they
agree?
2. Present the value of each coefficient estimate with a 95% confidence interval. Which input would you
say has the strongest effect?
3. Using AIC, perform a model selection to determine which factors are useful to predict the result of
the vote. Use a ‘greedy’ input selection procedure, as follows: (i) select the best model with 1 input;
(ii) fixing that input, select the best two-input model (i.e. try all the other 4 inputs with the one you
selected first); (iii) select the best three-input model containing the first two inputs you chose, etc. At
each stage evaluate the quality of fit using AIC and stop if this gets worse.

Tasks (week 7):

1. Use the Scikit-Learn package to create a decision tree classification model. Visualise your model and interpret
the fitted model.
2. Compare your decision tree model and your logistic regression model. Do they attribute high importance
to the same factors? How do you interpret each model to explain the referendum vote?
3. Which model would you use if you were explaining the results for a newspaper article, and why?

### Task 1
Fit a logistic regression model using all of the available inputs. Identify the direction of each effect from the fitted coefficients. Compare these with the plots shown on the Guardian website. Do they agree?

In [30]:
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd
from statsmodels.formula.api import glm
import statsmodels.api as sm
# read the csv file; y is the output and x1..x5 are the inputs
brexit=pd.read_csv("brexit.csv")
# convert TRUE/FALSE to 1/0 so the model predicts the probability of a Leave vote;
# without this conversion the logistic regression coefficients change sign for each input
y=brexit.voteBrexit*1
x1=brexit.abc1
x2=brexit.notBornUK
x3=brexit.medianIncome
x4=brexit.medianAge
x5=brexit.withHigherEd


#make data frame with all inputs and output
df=pd.DataFrame({'y':y,'x1':x1,'x2':x2,'x3':x3,'x4':x4, 'x5':x5})
print(df.head())
#perform glm with logistic regression model function
myglm = glm(' y ~ x1 + x2 + x3 + x4 + x5 ', df, family=sm.families.Binomial()).fit()
#print summary
print(myglm.summary())

   y        x1        x2        x3        x4        x5
0  1  0.133641  0.012605  0.252577  0.500000  0.085526
1  1  0.129032  0.113445  0.108247  0.272727  0.111842
2  1  0.161290  0.004202  0.128866  0.636364  0.118421
3  1  0.322581  0.046218  0.226804  0.454545  0.217105
4  1  0.345622  0.058824  0.201031  0.545455  0.243421
                 Generalized Linear Model Regression Results                  
Dep. Variable:                      y   No. Observations:                  344
Model:                            GLM   Df Residuals:                      338
Model Family:                Binomial   Df Model:                            5
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -123.69
Date:                Fri, 01 Apr 2022   Deviance:                       247.39
Time:                        05:55:56   Pearson chi2:                     401.
No. Iterations:                     6 

The sign of each coefficient gives the direction of the effect of that input when the other inputs are held fixed. The coefficient for 'x1', the proportion of individuals in the ABC1 social classes (middle to upper class), is 17.6 (1 decimal place). It is positive, so as the proportion of residents in the higher social classes increases, a ward becomes more likely to have voted for Brexit. This is the largest positive coefficient. It disagrees with the graph on the Guardian website, which shows a negative correlation between ABC1 social grade and the Leave vote. One possible reason is that the inputs are correlated with each other, so the effect of one input with the others held fixed can differ from the pattern seen when that input is plotted on its own.

The other two positive coefficients are for 'x2' and 'x4', which in the code above are the proportion of residents born outside the UK and the median age, with values of around 5 to 6. These are smaller effects than 'x1'. The positive sign for 'x2' disagrees with the Guardian graph, which shows a negative correlation, with a lot of the Leave points close to a proportion of zero. The coefficient for 'x3', the median income, is negative (about -6.4), which agrees with the negative correlation shown in the graph.

The input 'x5', the proportion of residents with any university-level education, has a negative coefficient of about -26.7, which is the largest in absolute value of all the inputs. This indicates that it has the largest effect on the vote: as this proportion increases, a ward becomes less likely to have voted for Brexit, which agrees with the negative correlation shown in the Guardian graph.
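To make the size of a logistic regression coefficient more concrete, the sketch below converts a coefficient into an odds multiplier: raising a normalised input by `delta` multiplies the odds of a Leave vote by `exp(coef * delta)`, with the other inputs held fixed. The helper name `odds_multiplier` is an assumption, and the two coefficients are taken as the midpoints of the intervals printed in Task 2 rather than read from the (truncated) summary table.

```python
import math


def odds_multiplier(coef, delta):
    """Factor by which the odds change when an input rises by `delta`."""
    return math.exp(coef * delta)


# Midpoints of the 95% intervals printed in Task 2 for x1 and x5.
coef_x1 = (11.871724184172496 + 23.284271889057383) / 2
coef_x5 = (-33.753456448683714 + -19.73506194274948) / 2

for name, coef in [("x1 (abc1)", coef_x1), ("x5 (withHigherEd)", coef_x5)]:
    # A step of 0.1 is one tenth of the full normalised range of the input.
    print(name, "coef =", round(coef, 1), "odds x", odds_multiplier(coef, 0.1))
```

A multiplier above 1 means the odds of a Leave vote go up as the input increases, and a multiplier below 1 means they go down.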

### Task 2 
 Present the value of each coefficient estimate with a 95% confidence interval. Which input would you
say has the strongest effect?

In [31]:
import scipy.stats
# two-sided 95% critical value of the standard normal distribution
zc = scipy.stats.norm.ppf(0.975)
# Extract the estimate and standard error of each coefficient from the fitted model (see summary above)
# Loop over the five inputs (positions 1 to 5; position 0 is the intercept)
# so the code outputs the lower and upper boundaries of the confidence interval for each input
for i in range(1, 6):
    estimate = myglm.params.iloc[i]
    standard_error = myglm.bse.iloc[i]
    # Calculate and print the lower and upper boundaries of the confidence interval
    CI_min = estimate - zc*standard_error
    CI_max = estimate + zc*standard_error
    print(CI_min)
    print(CI_max)

11.871724184172496
23.284271889057383
2.1516688332981717
9.22060779588885
-10.15220399455139
-2.619275270703323
3.1640666241245277
8.677686874481365
-33.753456448683714
-19.73506194274948


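Each pair of values above gives the lower and upper boundary of the 95% confidence interval for 'x1' to 'x5', in that order. The confidence intervals show that 'x5', the proportion of residents with any university-level education, has the largest effect. Its interval, roughly -33.8 to -19.7, has the largest absolute values, and both boundaries are negative. Even the boundary closest to zero (-19.7) is larger in magnitude than the point estimate of every other input, so the interval lies well away from zero. The next strongest effect is 'x1' (ABC1 social classes), with an interval of roughly 11.9 to 23.3. Because all inputs are normalised to the range 0 to 1, the coefficients can be compared directly in this way.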
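The interval calculation can also be sketched with the standard library alone, which makes the formula (estimate plus or minus the critical value times the standard error) explicit. The function names `wald_interval` and `recover_estimate_and_se` are assumptions for illustration. The second helper works backwards from the printed boundaries to the estimate and standard error, since the coefficient table is not shown in the summary output.

```python
from statistics import NormalDist


def wald_interval(estimate, standard_error, level=0.95):
    """Return (lower, upper) for a normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * standard_error, estimate + z * standard_error


def recover_estimate_and_se(lower, upper, level=0.95):
    """Invert wald_interval: midpoint is the estimate, half-width / z is the SE."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (lower + upper) / 2, (upper - lower) / (2 * z)


# Interval boundaries as printed above, in the order x1 to x5.
printed_intervals = {
    "x1": (11.871724184172496, 23.284271889057383),
    "x2": (2.1516688332981717, 9.22060779588885),
    "x3": (-10.15220399455139, -2.619275270703323),
    "x4": (3.1640666241245277, 8.677686874481365),
    "x5": (-33.753456448683714, -19.73506194274948),
}

for name, (lower, upper) in printed_intervals.items():
    estimate, se = recover_estimate_and_se(lower, upper)
    print(f"{name}: estimate={estimate:.2f}, se={se:.2f}, estimate/se={estimate / se:.2f}")
```

The ratio of estimate to standard error is a useful second check on "strongest effect", because it accounts for how precisely each coefficient is estimated. statsmodels results objects also provide a `conf_int()` method that returns the same kind of interval with the coefficient names attached.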

### Task 3
Using AIC, perform a model selection to determine which factors are useful to predict the result of the vote. Use a ‘greedy’ input selection procedure, as follows: (i) select the best model with 1 input; (ii) fixing that input, select the best two-input model (i.e. try all the other 4 inputs with the one you selected first); (iii) select the best three-input model containing the first two inputs you chose, etc. At each stage evaluate the quality of fit using AIC and stop if this gets worse.

(i) select the best model with 1 input;

In [24]:

#make a list with all the single-input formulas
formula1=['y~x1','y~x2','y~x3','y~x4','y~x5']
prints1=["The AIC value for when proportion of individuals who are in the ABC1 social classes is the only input is","The AIC value for when not born in the UK is the only input is","The AIC value for median income is the only input is","The AIC value for when median age is the only input is","The AIC value for when people with higher education  is the only input is"]
#loop over the single-input formulas, fit a glm for each one and print its AIC
for (i,j) in zip(formula1,prints1):
    my_glm=glm(i,df, family=sm.families.Binomial()).fit()
    print(j,my_glm.aic)

The AIC value for when proportion of individuals who are in the ABC1 social classes is the only input is 377.543729722669
The AIC value for when not born in the UK is the only input is 377.80127885092975
The AIC value for median income is the only input is 368.44370373062986
The AIC value for when median age is the only input is 401.2766935753845
The AIC value for when people with higher education  is the only input is 313.5604055904664


The best model with 1 input is the one with the lowest AIC value, which is the model with higher education ('x5') as the only input. This input is therefore fixed in part (ii).
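For reference, AIC trades off goodness of fit against model size: AIC = 2k - 2 ln(L), where k is the number of fitted parameters and ln(L) is the log-likelihood. As a minimal sketch, the function below applies this to the full model from Task 1, using the log-likelihood shown in its summary (-123.69) and assuming k = 6 (five inputs plus the intercept). The function name `aic` is an assumption for illustration.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood


# Full model from Task 1: log-likelihood -123.69, five inputs plus an intercept.
full_model_aic = aic(log_likelihood=-123.69, n_params=6)
print(round(full_model_aic, 2))
```

Since the log-likelihood in the summary is rounded to two decimals, the result is only accurate to about that precision.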

In [25]:
# make a list with all the two-input formulas, where one input is fixed from the previous part, and loop so each formula is fitted with glm
formula2=['y~x5+x1','y~x5+x2','y~x5+x3','y~x5+x4']
prints2=["The AIC value for when proportion of individuals who are in the ABC1 social classes and higher education  are the only input is","The AIC value for when people not born in the UK and higher education  are the only input is","The AIC value when median income and higher education  are the only input is","The AIC value for when median age and higher education  are the only inputs is"]
for (i,j) in zip(formula2,prints2):
    my_glm=glm(i,df, family=sm.families.Binomial()).fit()
    print(j,my_glm.aic)

The AIC value for when proportion of individuals who are in the ABC1 social classes and higher education  are the only input is 286.5454477003765
The AIC value for when people not born in the UK and higher education  are the only input is 310.3643998928851
The AIC value when median income and higher education  are the only input is 315.5255949327895
The AIC value for when median age and higher education  are the only inputs is 303.3090827944477


With higher education fixed as the best single input, the best two-input model is the one that adds the proportion of individuals in the ABC1 social classes. Higher education and the proportion of individuals in the ABC1 social classes are therefore both fixed when testing the three-input models for part (iii).

In [26]:
formula3=['y~x5+x1+x2','y~x5+x1+x3','y~x5+x1+x4']
prints3=["The AIC value for the model when the 3 inputs are higher eduaction,proportion of individuals who are in the ABC1 social classes and not born in the UK is ","The AIC value for the model when the 3 inputs are higher eduaction,proportion of individuals who are in the ABC1 social classes and median income is ","The AIC value for the model when the 3 inputs are higher eduaction,proportion of individuals who are in the ABC1 social classes and median age is"]
for (i,j) in zip(formula3,prints3):
    my_glm=glm(i,df, family=sm.families.Binomial()).fit()
    print(j,my_glm.aic)

The AIC value for the model when the 3 inputs are higher eduaction,proportion of individuals who are in the ABC1 social classes and not born in the UK is  285.2443784426674
The AIC value for the model when the 3 inputs are higher eduaction,proportion of individuals who are in the ABC1 social classes and median income is  275.93391871432317
The AIC value for the model when the 3 inputs are higher eduaction,proportion of individuals who are in the ABC1 social classes and median age is 271.93170450949265


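Using the 'greedy' selection procedure, the best 3-input model is the one whose inputs are higher education, the proportion of individuals in the ABC1 social classes and median age, as its AIC value is the lowest. The AIC has improved at every stage so far (313.6, then 286.5, then 271.9), so the stopping rule has not yet been triggered and the four-input models should be checked next. The full five-input model from Task 1 has a log-likelihood of -123.69 with six parameters, which gives an AIC of about 259.4, suggesting that adding further inputs may still improve the fit.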
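The three stages above can be written as a single loop that keeps adding the best remaining input until the score stops improving. The sketch below is a generic version under stated assumptions: `greedy_forward_selection` and `toy_aic` are hypothetical names, and the toy scoring function uses made-up numbers so that the example runs without the data file.

```python
def greedy_forward_selection(candidates, score):
    """Add one input at a time, keeping the addition with the lowest score.

    `score` takes a list of input names and returns a value to minimise
    (for example an AIC). Stops as soon as no addition improves the score.
    """
    selected = []
    best_score = float("inf")
    remaining = list(candidates)
    while remaining:
        trial_scores = {c: score(selected + [c]) for c in remaining}
        best_candidate = min(trial_scores, key=trial_scores.get)
        if trial_scores[best_candidate] >= best_score:
            break  # the score got worse (or did not improve), so stop
        selected.append(best_candidate)
        remaining.remove(best_candidate)
        best_score = trial_scores[best_candidate]
    return selected, best_score


def toy_aic(inputs):
    """Made-up score: each input lowers the deviance, each costs 2 (illustrative)."""
    gains = {"a": 50.0, "b": 20.0, "c": 0.5}
    return 100.0 - sum(gains[i] for i in inputs) + 2 * len(inputs)


chosen, final_score = greedy_forward_selection(["a", "b", "c"], toy_aic)
print(chosen, final_score)
```

With the notebook's data, the scoring function could plausibly be `lambda inputs: glm('y ~ ' + ' + '.join(inputs), df, family=sm.families.Binomial()).fit().aic`, reusing the `glm`, `df` and `sm` names defined in Task 1.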

## Week 7

### Task 1
Use the Scikit-Learn package to create a decision tree classification model. Visualise your model and interpret
the fitted model.

In [27]:
import pandas as pd
brexit = pd.read_csv("brexit.csv")
voteBrexit=brexit.voteBrexit*1
abc1=brexit.abc1
notBornUK=brexit.notBornUK
medianIncome=brexit.medianIncome
medianAge=brexit.medianAge
withHigherEd=brexit.withHigherEd
# make 2 data frames: one containing the inputs, called X, and one for the output, called Y
X=pd.DataFrame({'abc1':abc1,'notBornUK':notBornUK,'medianIncome':medianIncome,'medianAge':medianAge, 'withHigherEd':withHigherEd})
print(X.head())
Y=pd.DataFrame({'voteBrexit':voteBrexit})
print(Y.head())

       abc1  notBornUK  medianIncome  medianAge  withHigherEd
0  0.133641   0.012605      0.252577   0.500000      0.085526
1  0.129032   0.113445      0.108247   0.272727      0.111842
2  0.161290   0.004202      0.128866   0.636364      0.118421
3  0.322581   0.046218      0.226804   0.454545      0.217105
4  0.345622   0.058824      0.201031   0.545455      0.243421
   voteBrexit
0        True
1        True
2        True
3        True
4        True


In [28]:
# make the decision tree using DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
# set max depth to 4, as testing other max depths showed that 4 gives the best balance
mytree1 = DecisionTreeClassifier(max_depth=4)
mytree1.fit(X,Y)

DecisionTreeClassifier(max_depth=4)

In [29]:
from sklearn.tree import export_graphviz
export_graphviz(mytree1,out_file="dot_data",rounded=True,filled=True)
!dot -Tpng dot_data -o dot_data.png

<img src="./dot_data.png" width=800 alt="test"/>
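Each node in the exported tree reports a Gini impurity, which is the default splitting criterion of `DecisionTreeClassifier`. As an aid to reading the picture, the sketch below computes the impurity of a node and the weighted impurity after a split. The function names and the ward counts are assumptions chosen for illustration, not values read from the fitted tree.

```python
def gini_impurity(n_false, n_true):
    """Gini impurity of a node holding n_false Remain and n_true Leave wards."""
    total = n_false + n_true
    if total == 0:
        return 0.0
    p_false = n_false / total
    p_true = n_true / total
    return 1.0 - p_false ** 2 - p_true ** 2


def weighted_child_impurity(left, right):
    """Average impurity of two child nodes, weighted by their sample counts."""
    n_left, n_right = sum(left), sum(right)
    total = n_left + n_right
    return (n_left * gini_impurity(*left) + n_right * gini_impurity(*right)) / total


# Hypothetical (Remain, Leave) counts for a parent node and its two children.
parent = (40, 60)
left_child = (35, 5)
right_child = (5, 55)

print("parent impurity:", gini_impurity(*parent))
print("after split:", weighted_child_impurity(left_child, right_child))
```

A value of 0 means the node contains wards of one class only, and 0.5 is the worst case for two classes. The tree chooses, at each node, the test that lowers the weighted impurity the most.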

### Task 2
Compare your decision tree model and your logistic regression model. Do they attribute high importance to the same factors? How do you interpret each model to explain the referendum vote?

The logistic regression model showed that 'x5', the proportion of residents with any university-level education, was the most important input, and the decision tree agrees by making its first node a test on the same input. The decision tree does not treat the first input, the proportion of individuals in the ABC1 social classes, as an important input: the test on this input only appears further down the tree, which does not agree with the logistic regression, where it has the second-largest coefficient. The other 3 inputs have similar importance in both the logistic regression and the decision tree model.
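Reading importance from the depth of each test can be backed up with numbers. As a hedged sketch, the example below fits a small tree and prints its structure with named features through `export_text`, together with the `feature_importances_` attribute. The data here are synthetic stand-ins generated at random, with the outcome deliberately tied to the last column, so the printed result says nothing about the real referendum data. The column names are reused only so that the nodes are labelled.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["abc1", "notBornUK", "medianIncome", "medianAge", "withHigherEd"]

# Synthetic stand-in data (NOT brexit.csv): 200 rows of uniform random inputs,
# with the outcome set purely by the last column for demonstration.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 5))
y_demo = X_demo[:, 4] < 0.5

tree_demo = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Text view of the tree with readable feature names instead of X[0]..X[4].
report = export_text(tree_demo, feature_names=feature_names)
print(report)

# Share of the total impurity reduction attributed to each input.
for name, importance in zip(feature_names, tree_demo.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

Applied to the fitted `mytree1`, the same two calls would give a ranking of inputs that can be set next to the logistic regression coefficients. `export_graphviz` also accepts a `feature_names` argument, which labels the nodes in the picture.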

### Task 3
Which model would you use if you were explaining the results for a newspaper article, and why?

The model I would pick is the logistic regression model. Its coefficients give a clear indication of the direction and the magnitude of the effect of each input, which makes it easier to explain and easier for readers to understand. The fitted regression model can also be plotted as a graph, making the data easier to interpret. A decision tree may be more accurate, especially as the depth increases, but it is much harder to show the effect of each input and the tree becomes harder to visualise.

## Self-Assessment:

### Week 6

Task 1: 100% of marks awarded
Task 2: 100% of marks awarded
Task 3: 100% of marks awarded

### Week 7

Task 1: 70% of marks awarded
Task 2: 50% of marks awarded
Task 3: 50% of marks awarded

Presentation mark: 20 marks out of 20
Extra marks: 0 out of 10