<a href="https://colab.research.google.com/github/ellenwterry/PoliticalAnalysis/blob/main/Voter_Targeting_LogReg.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Load Libraries

In [None]:
# Install third-party packages first
!pip install nest-asyncio
!pip install pystan==3.7.0
!pip install corner
!pip install geopy
!pip install pygris

import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import patsy
from scipy import stats
from sklearn.linear_model import LogisticRegression

# PyStan 3 runs on asyncio; nest_asyncio lets it run inside the notebook's event loop
import nest_asyncio
nest_asyncio.apply()
import stan

import plotly.express as px
import plotly.graph_objects as go

from geopy.geocoders import Nominatim
from pygris import core_based_statistical_areas
from pygris import tracts

import geopandas as gpd
import folium

from google.colab import files


Get Data from Github Site:

In [2]:
# ---------- Get Data from Github ---------- #

url = 'https://raw.githubusercontent.com/ellenwterry/PoliticalAnalysis/main/VoteBase.csv'
VoteBase = pd.read_csv(url)

Tidy Data

In [3]:
# ---------- Clean up data ---------- #

from sklearn import preprocessing
le = preprocessing.LabelEncoder()

le.fit(VoteBase['Sex'])

codes = {'NR':0, 'M': 1, 'F': 2}
VoteBase['Sex'] = VoteBase['Sex'].map(codes)

VoteBase['Age']=VoteBase.Age.astype('int32')

#VoteBase['LastPrimary'] = le.transform(VoteBase['LastPrimary'])
codes = {'NR':0, 'R': 1, 'D':2}
VoteBase['LastPrimary'] = VoteBase['LastPrimary'].map(codes)

#VoteBase['Education'] = le.transform(VoteBase['Education'])
codes = {'NR':0, 'HS': 1, 'Some College':2, 'Bachelor':3, 'Masters':4, 'Doctorate':5}
VoteBase['Education'] = VoteBase['Education'].map(codes)

#VoteBase['HHIncome'] = le.transform(VoteBase['HHIncome'])
codes = {'NR':0, 'Under 50k': 1, '50k-100k':2, '100k-200k':3, '200k-300k':4, '300k-500k':5, 'Over 500k':6}
VoteBase['HHIncome'] = VoteBase['HHIncome'].map(codes)

#VoteBase['ReligiousAffil'] = le.transform(VoteBase['ReligiousAffil'])
codes = {'NR':0,'Protestant': 1, 'Catholic':2, 'Other':3, 'None':4}
VoteBase['ReligiousAffil'] = VoteBase['ReligiousAffil'].map(codes)

#VoteBase['Support24'] = le.transform(VoteBase['Support24'])
codes = {'R':0, 'D': 1}
VoteBase['Support24'] = VoteBase['Support24'].map(codes)
# NOTE: NAs were excluded from sample so that algorithms could score using logistic scale - 2nd pass will use imputed values

#VoteBase['TopIssue'] = le.transform(VoteBase['TopIssue'])
codes = {'NR':0, 'RFree':1, 'Parents':2, 'Crime':3, 'Economy':4, 'Womens':5, 'Education':6, 'Environment':7, 'Democracy':8}
VoteBase['TopIssue'] = VoteBase['TopIssue'].map(codes)

# This is for the second data source (later)
codes = {'NS':0, 'NR':1,'Signed':2}
VoteBase['RRPetition'] = VoteBase['RRPetition'].map(codes)
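One thing worth checking after a `.map(codes)` step like the ones above: any value that is not a key in the dictionary becomes NaN without raising an error. The snippet below is a minimal sketch of such a check on a small made-up Series (the value 'f' is a hypothetical dirty entry, not something known to be in VoteBase).

```python
import pandas as pd

# Hypothetical sample; 'f' stands in for an unexpected spelling in the raw data
sex = pd.Series(['M', 'F', 'NR', 'f'])
codes = {'NR': 0, 'M': 1, 'F': 2}

mapped = sex.map(codes)

# Values missing from the dictionary come back as NaN, so list them explicitly
unmapped = sex[mapped.isna()].unique().tolist()
print("unmapped values:", unmapped)
print("rows affected:", int(mapped.isna().sum()))
```

Running the same check on each mapped column before modeling would show whether any rows were silently turned into missing values.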

Create a model matrix without an intercept column (because the algorithm fits the intercept separately), then split it into train and test sets.

In [4]:
np.random.seed(316)
VoteMatrix = patsy.dmatrix('Age + Sex + Education + HHIncome+ ReligiousAffil + LastPrimary + TopIssue -1', VoteBase)
yArray = np.array(VoteBase['Support24'])
rows = VoteMatrix.shape[0]
tstInd = np.random.randint(0, rows, size=100)
tstMatrix = VoteMatrix[tstInd]
yTst = yArray[tstInd]
trnMatrix = np.delete(VoteMatrix, tstInd, axis=0)
yTrn = np.delete(yArray, tstInd, axis=0)
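A caveat on the split above: `np.random.randint` draws indices with replacement, so `tstInd` can contain the same row more than once, and the test set may then hold fewer than 100 distinct voters. The sketch below shows one alternative that samples without replacement. It uses synthetic arrays and an assumed row count purely for illustration, and it would produce different test rows than the ones shown in this notebook.

```python
import numpy as np

rng = np.random.default_rng(316)

rows = 2497                              # assumed row count, for illustration only
X = rng.normal(size=(rows, 7))           # synthetic stand-in for VoteMatrix
y = rng.integers(0, 2, size=rows)        # synthetic stand-in for yArray

# replace=False guarantees 100 distinct test indices
tst_idx = rng.choice(rows, size=100, replace=False)

X_tst, y_tst = X[tst_idx], y[tst_idx]
X_trn = np.delete(X, tst_idx, axis=0)
y_trn = np.delete(y, tst_idx, axis=0)

print(X_trn.shape, X_tst.shape)
```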

Train the model, produce predicted classes and probabilities, and store theta (the array of coefficients) and the intercept. These are the model predictions.

In [5]:
model = LogisticRegression(max_iter=1000)
model.fit(trnMatrix, yTrn)
Pred = model.predict(tstMatrix)
ModelProbs = pd.DataFrame(model.predict_proba(tstMatrix))
theta = np.matrix(model.coef_)
intercept = model.intercept_

View coefficients: (-0.00590519,  0.07741354,  0.28367684,  0.79278793,  0.99741254, 0.37823278,  0.12781814)

bias: -4.71900663

In [6]:
model.coef_

array([[-0.00590519,  0.07741354,  0.28367684,  0.79278793,  0.99741254,
         0.37823278,  0.12781814]])

In [7]:
model.intercept_

array([-4.71900663])
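To make the coefficients easier to read, the sketch below pairs each value printed above with a feature name and converts it to an odds ratio. It assumes the coefficient order follows the term order in the model matrix formula (Age, Sex, Education, HHIncome, ReligiousAffil, LastPrimary, TopIssue). Since the predictors are integer codes, "per unit" here means one step in the coding.

```python
import numpy as np

# Assumed order: same as the terms in the patsy formula
features = ['Age', 'Sex', 'Education', 'HHIncome',
            'ReligiousAffil', 'LastPrimary', 'TopIssue']
# Values copied from the model.coef_ output above
coef = np.array([-0.00590519, 0.07741354, 0.28367684, 0.79278793,
                 0.99741254, 0.37823278, 0.12781814])

# exp(coefficient) is the multiplicative change in the odds of D per unit step
odds_ratio = np.exp(coef)

for name, c, o in zip(features, coef, odds_ratio):
    print(f"{name:<15} coef={c:+.4f}  odds ratio per unit={o:.3f}")
```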

Confirm using Equation, comparing with model probs

In [8]:
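def stable_sigmoid(x):
  # np.where evaluates both branches, so compute exp only on -|x|, which can never overflow.
  x = np.asarray(x, dtype=float)
  e = np.exp(-np.abs(x))
  # For x >= 0 this is 1 / (1 + exp(-x)); for x < 0 it is exp(x) / (1 + exp(x)).
  return np.where(x >= 0, 1 / (1 + e), e / (1 + e))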

EQProbs = stable_sigmoid((np.dot(model.coef_,tstMatrix[:,].transpose())+ model.intercept_).transpose())
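For reference, the equation being confirmed here is the standard logistic model. In the notation below, θ is `model.coef_`, b is `model.intercept_`, and x is one row of the model matrix (the symbols x, b, and σ are introduced for this note only).

```latex
\hat{p}(y = 1 \mid x) = \sigma\left(\theta^{\top} x + b\right),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^{z}}{1 + e^{z}}
```

The two forms of σ are algebraically identical; the sigmoid function above selects whichever one keeps the exponent non-positive.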

Now let's take a look at the test data in a DataFrame, comparing model probs with equation probs:

In [9]:
tstDF = VoteBase.iloc[tstInd,:].copy()
tstDF['ModelProbs'] = ModelProbs[1].tolist()
tstDF['EQProbs'] = EQProbs[:].astype(float)
tstDF


Unnamed: 0,ID,Name,Sex,Age,LastPrimary,Latitude,Longitude,Education,HHIncome,ReligiousAffil,Support24,TopIssue,RRPetition,ModelProbs,EQProbs
829,830,Voter 830,2,32,0,41.09766,-73.61686,2,2,1,1,0,2,0.167590,0.167590
830,831,Voter 831,1,35,0,41.02575,-73.66299,2,2,1,0,0,0,0.154735,0.154735
1632,1633,Voter 1633,1,85,0,41.01395,-73.65489,3,3,3,1,5,0,0.847766,0.847766
920,921,Voter 921,2,51,0,41.05441,-73.56803,2,3,1,0,0,0,0.284507,0.284507
708,709,Voter 709,1,65,0,41.08071,-73.63603,2,2,1,1,0,0,0.132954,0.132954
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
671,672,Voter 672,2,33,0,41.09276,-73.64826,2,2,1,0,0,2,0.166768,0.166768
922,923,Voter 923,1,51,0,41.08903,-73.65623,0,3,1,0,0,0,0.172647,0.172647
1121,1122,Voter 1122,2,72,0,41.00559,-73.63517,3,3,2,0,0,0,0.558448,0.558448
1273,1274,Voter 1274,1,90,0,41.02769,-73.58716,3,3,2,0,0,0,0.512787,0.512787


OK, that looks good - just one last check - compare predicted class with the voter response in the initial survey

In [11]:
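# just a reality check on class assignment: of the voters who stated support for D (Support24 == 1),
# what share does the model place above a 0.5 probability?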
tstDF.loc[tstDF['Support24'] ==1].count()
SupportCnt = tstDF.query('Support24 == 1').shape[0]
ClassCnt = tstDF.query('Support24 == 1 & EQProbs > .5' ).shape[0]
ClassCnt/SupportCnt

0.8085106382978723
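The ratio above is the share of stated D supporters that the model scores above 0.5, which is usually called recall (or sensitivity) rather than overall accuracy. The sketch below shows how the two differ, using a small made-up set of labels and probabilities (none of these numbers come from VoteBase).

```python
import numpy as np

# Hypothetical survey responses (1 = D, 0 = R) and model probabilities
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p_hat = np.array([0.9, 0.2, 0.4, 0.7, 0.4, 0.1, 0.8, 0.3])

y_pred = (p_hat > 0.5).astype(int)

# Recall: among true D supporters, the share predicted D
recall = (y_pred[y_true == 1] == 1).mean()
# Specificity: among true R supporters, the share predicted R
specificity = (y_pred[y_true == 0] == 0).mean()
# Accuracy: the share of all voters classified correctly
accuracy = (y_pred == y_true).mean()

print(f"recall={recall:.3f} specificity={specificity:.3f} accuracy={accuracy:.3f}")
```

Reporting all three gives a fuller picture than recall alone, since a model that labels everyone D would score a recall of 1.0.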

Now let's pull a selected voter and change their data (an intervention) to observe the effect: we change TopIssue for voter 921 to 5 (women's healthcare) and see how much that changes EQProbs.

In [None]:
tstDF.loc[tstDF['ID'] == 921]

Unnamed: 0,ID,Name,Sex,Age,LastPrimary,Latitude,Longitude,Education,HHIncome,ReligiousAffil,Support24,TopIssue,RRPetition,ModelProbs,EQProbs
920,921,Voter 921,2,51,0,41.05441,-73.56803,2,3,1,0,0,0,0.284507,0.284507


In [None]:
# What happens if we can move Voter 921 to a TopIssue of 5?

# pull the model columns by name, in the same order as the model matrix formula
modelCols = ['Age', 'Sex', 'Education', 'HHIncome', 'ReligiousAffil', 'LastPrimary', 'TopIssue']
tstVoter = tstDF.loc[tstDF['ID'] == 921, modelCols].to_numpy().astype(float)
# change top issue from 0 to 5 (simulating a change in position after a canvassing conversation - intervention)
tstVoter[0, modelCols.index('TopIssue')] = 5
ProbNew = stable_sigmoid((np.dot(model.coef_,tstVoter[0].transpose())+ model.intercept_).transpose())
ProbNew


array([0.42968812])

We've estimated the effect of an intervention here, moving the probability of voting D from 28% to 43%, which is getting close to a D vote. Just based on this, it seems like it would make sense to target some voters and have conversations about women's healthcare.
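The size of that shift can be reproduced by hand from the printed coefficients. The sketch below is a standalone check, assuming the coefficient order matches the model matrix formula and using voter 921's values from the table above (Age 51, Sex 2, Education 2, HHIncome 3, ReligiousAffil 1, LastPrimary 0, TopIssue 0).

```python
import numpy as np

# Values copied from the model.coef_ and model.intercept_ outputs
coef = np.array([-0.00590519, 0.07741354, 0.28367684, 0.79278793,
                 0.99741254, 0.37823278, 0.12781814])
intercept = -4.71900663

# Voter 921 in formula order: Age, Sex, Education, HHIncome,
# ReligiousAffil, LastPrimary, TopIssue
x_before = np.array([51, 2, 2, 3, 1, 0, 0], dtype=float)
x_after = x_before.copy()
x_after[6] = 5  # the intervention: TopIssue 0 -> 5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z_before = coef @ x_before + intercept
z_after = coef @ x_after + intercept
p_before, p_after = sigmoid(z_before), sigmoid(z_after)

print(f"log-odds: {z_before:.4f} -> {z_after:.4f} (shift {z_after - z_before:.4f})")
print(f"prob:     {p_before:.4f} -> {p_after:.4f}")
```

The shift in log-odds is simply 5 times the TopIssue coefficient. Because TopIssue is an integer code, the model treats a move from 0 to 5 as five equal steps.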

So, let's step back and look at the voter universe:

In [12]:
AllProbs = stable_sigmoid((np.dot(model.coef_,VoteMatrix[:,].transpose())+ model.intercept_).transpose())
AllDF = VoteBase.copy()
AllDF['EQProbs'] = AllProbs[:].astype(float)
AllDF


Unnamed: 0,ID,Name,Sex,Age,LastPrimary,Latitude,Longitude,Education,HHIncome,ReligiousAffil,Support24,TopIssue,RRPetition,EQProbs
0,1,Voter 1,1,51,0,41.00544,-73.62954,2,3,1,0,0,0,0.269015
1,2,Voter 2,1,52,0,41.04027,-73.63011,2,2,1,0,0,0,0.142056
2,3,Voter 3,2,54,0,41.05734,-73.65044,3,6,1,1,4,0,0.903208
3,4,Voter 4,1,59,0,41.03255,-73.60403,2,3,1,0,0,0,0.259827
4,5,Voter 5,1,36,0,41.06684,-73.57837,2,3,4,0,0,0,0.889061
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2493,2494,Voter 2494,2,62,0,41.07108,-73.63842,4,3,4,1,5,1,0.961265
2494,2495,Voter 2495,2,62,2,41.03758,-73.57932,3,3,0,1,5,1,0.424250
2495,2496,Voter 2496,2,65,2,41.01605,-73.58949,3,6,1,1,0,1,0.917860
2496,2497,Voter 2497,1,67,2,40.99031,-73.65605,0,6,0,1,0,1,0.616792


OK, this should be the starting picture of our voter universe. Let's take a look:

In [13]:
# lets back up and look at the whole picture
fig = go.Figure()

fig.add_trace(go.Histogram(x=AllDF['EQProbs'], marker_color='#378796', opacity=0.05, name = "Vote Prob"))
fig.update_layout(barmode='overlay')
fig.update_traces(opacity=0.9, nbinsx=100)
fig.update_layout(
    autosize=False,
    width=800,
    height=400,
    plot_bgcolor = "white",
    xaxis=dict(title='Probability of Vote',),
)

fig.show()

So again, let's stress the idea of dealing with continuous outcomes. EQProbs is the probability of supporting D, so voters clustered around 20%-50% lean R but are possibly D votes, and voters around 50%-70% lean D but are possibly R votes. Where you draw the line is up to you, and you should be aware of the differences in the voter base.
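One way to put numbers on those bands is to count voters per probability range. The sketch below does this with `np.histogram` on a handful of made-up probabilities; the band edges (0.2, 0.5, 0.7) echo the ranges discussed here and are a judgment call, not fixed thresholds.

```python
import numpy as np

# Hypothetical probabilities of supporting D
probs = np.array([0.05, 0.15, 0.25, 0.45, 0.55, 0.65, 0.85, 0.95, 0.30])

# Band edges: likely R, lean R, lean D, likely D
edges = [0.0, 0.2, 0.5, 0.7, 1.0]
labels = ['likely R', 'lean R (possible D)', 'lean D (possible R)', 'likely D']

counts, _ = np.histogram(probs, bins=edges)

for label, n in zip(labels, counts):
    print(f"{label:<22} {n}")
```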

Now, let's look at this from a TopIssue perspective, highlighting voters who responded that "women's healthcare" is most important:

In [None]:
WmnVote = AllDF.query('TopIssue == 5' )

fig = go.Figure()
fig.add_trace(go.Histogram(x=AllDF['EQProbs'], marker_color='#DEDCDC', opacity=0.05, name = "AllVoters"))
fig.add_trace(go.Histogram(x=WmnVote['EQProbs'], marker_color='#94CDD7', opacity=0.1, name = "Women Healthcare Voters"))
fig.update_layout(barmode='overlay')
fig.update_traces(opacity=0.9, nbinsx=100)
fig.update_layout(
    autosize=False,
    width=800,
    height=400,
    plot_bgcolor = "white",
    xaxis=dict(title='Probability of Vote',),
)

fig.show()

What this tells us is that the WH voters are more likely to have stated support for D in the election, but there are WH voters who stated support for R and, overall, a lot of voters that we just don't know much about.

Maybe we should be targeting women voters with a probability greater than 20% but less than 60% who did not respond that women's healthcare is their top issue:

In [14]:
SegmentDF = AllDF.query('EQProbs > .2 & EQProbs <.6' )
Target = AllDF.query('TopIssue !=5 & EQProbs > .2 & EQProbs <.6 & Sex ==2' )

In [15]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=SegmentDF['EQProbs'], marker_color='#DEDCDC', opacity=0.05, name = "AllVoters"))
fig.add_trace(go.Histogram(x=Target['EQProbs'], marker_color='#94CDD7', opacity=0.1, name = "Women Not Responding WH"))
fig.update_layout(barmode='overlay')
fig.update_traces(opacity=0.9, nbinsx=100)
fig.update_layout(
    autosize=False,
    width=800,
    height=400,
    plot_bgcolor = "white",
    xaxis=dict(title='Probability of Vote',),
)

fig.show()

In [17]:
Target.shape[0]

220

So, there are 220 women voters in the 20%-60% probability band who did not respond that women's healthcare was the top concern in their voting choices. We can obviously target further, e.g., using correlations between other values and variables to improve effects. Seems like special interest events and canvassing could spur some interesting conversations, don't you think?
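As a rough next step, the single-voter intervention could be applied to every voter in the target segment to estimate the lift per voter. The sketch below is a minimal illustration under stated assumptions: the coefficients are copied from the outputs above, the first three rows reuse the feature values shown for voters 921, 672, and 1122, and the fourth row is hypothetical. It is not a substitute for running this on AllDF.

```python
import numpy as np
import pandas as pd

features = ['Age', 'Sex', 'Education', 'HHIncome',
            'ReligiousAffil', 'LastPrimary', 'TopIssue']
coef = np.array([-0.00590519, 0.07741354, 0.28367684, 0.79278793,
                 0.99741254, 0.37823278, 0.12781814])
intercept = -4.71900663

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Rows 0-2 mirror voters 921, 672, 1122; row 3 is a hypothetical voter
voters = pd.DataFrame({
    'Age': [51, 33, 72, 45],
    'Sex': [2, 2, 2, 1],
    'Education': [2, 2, 3, 2],
    'HHIncome': [3, 2, 3, 3],
    'ReligiousAffil': [1, 1, 2, 1],
    'LastPrimary': [0, 0, 0, 0],
    'TopIssue': [0, 0, 0, 5],
})
voters['EQProbs'] = sigmoid(voters[features].to_numpy() @ coef + intercept)

# Same filter as the Target segment
target = voters.query('TopIssue != 5 & EQProbs > 0.2 & EQProbs < 0.6 & Sex == 2').copy()

# Simulate the intervention for the whole segment at once
shifted = target[features].assign(TopIssue=5)
target['ProbNew'] = sigmoid(shifted.to_numpy() @ coef + intercept)
target['Lift'] = target['ProbNew'] - target['EQProbs']

print(target[['EQProbs', 'ProbNew', 'Lift']].round(4))
```

Sorting the segment by `Lift` would be one simple way to prioritize canvassing, with the caveat that the lift is a model estimate rather than an observed change.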

Summarizing:

We built a probabilistic, explanatory model of voters and targeted a group of likely persuadable voters based on gender ('sex' in this data) and survey responses. We chose that group because we tested an intervention and found that it could have a significant effect on the predicted voter outcome (btw these are strong hints of causality; an acyclic graph with distribution effects would be the confirmation).