# Logistic Regression

## Using Logistic Regression to Predict if a Voter is Active or Inactive

In [1]:
# Access voter_data from 
%store -r voter_data_pandas

In [2]:
voter_data_pandas

| Unnamed: 0 | HH_Income_Amount | Home_Value | County | Ethnic_Description | Voters_Gender | Voters_Active | Parties_Description |
|---|---|---|---|---|---|---|---|
| 0 | 47000.0 | 253332.0 | ALBANY | Irish | M | I | Libertarian |
| 1 | 46718.0 | 177015.0 | ALBANY | English/Welsh | M | I | Republican |
| 2 | 6000.0 | 287499.0 | ALBANY | Irish | F | I | Democratic |
| 3 | 20832.0 | 287499.0 | ALBANY | English/Welsh | M | I | Republican |
| 4 | 85000.0 | 241936.0 | ALBANY | Unknown | F | I | Non-Partisan |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 287031 | 61764.0 | 104629.0 | WESTON | English/Welsh | M | A | Republican |
| 287032 | 7000.0 | 53724.0 | WESTON | German | F | A | Republican |
| 287033 | 9000.0 | 101065.0 | WESTON | Unknown | F | A | Republican |
| 287034 | 61764.0 | 87500.0 | WESTON | English/Welsh | M | A | Republican |


First, encode the response variable `Voters_Active` as 1 if active and 0 if inactive. This makes the logistic regression easier to implement.

In [3]:
voter_data_pandas_encode = voter_data_pandas.copy()
voter_data_pandas_encode["Voters_Active"].replace({'A': 1, 'I': 0})

0         0
1         0
2         0
3         0
4         0
         ..
287031    1
287032    1
287033    1
287034    1
287035    1
Name: Voters_Active, Length: 287036, dtype: int64

In [4]:
voter_data_pandas_encode = voter_data_pandas.copy()
voter_data_pandas_encode["Voters_Active"] = voter_data_pandas_encode["Voters_Active"].replace({'A': 1, 'I': 0})
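
The only difference between In [3] and In [4] is the assignment: `replace` returns a new Series and leaves the DataFrame untouched unless the result is stored back. A minimal sketch of that behavior on a toy Series (made-up values, not the voter file; `map` is used here as an assumed stand-in for a plain label-to-number lookup, and it behaves the same way in this respect):

```python
import pandas as pd

# Toy stand-in for the Voters_Active column (hypothetical values)
status = pd.Series(["A", "I", "A", "A"], name="Voters_Active")

# map() builds a new Series; the original is not modified
encoded = status.map({"A": 1, "I": 0})

print(status.tolist())   # still the original letter codes
print(encoded.tolist())  # the 0/1 encoding
```

Without assigning the result back to the column, as In [4] does, the encoded values are only displayed and then discarded.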

For this logistic regression, we create dummy variables for County, Ethnic Description, and Gender. We use an 80/20 train/test split and report the accuracy and ROC AUC on the test set.
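
To make the two steps concrete, here is a small sketch on a five-row toy frame (hypothetical rows, reusing only the column names from the data above). It shows the indicator columns that `pd.get_dummies` produces and the row counts that result from `test_size=0.2`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy rows, not taken from the voter file
toy = pd.DataFrame({
    "County": ["ALBANY", "WESTON", "ALBANY", "WESTON", "ALBANY"],
    "Voters_Gender": ["M", "F", "F", "M", "F"],
    "Voters_Active": [0, 1, 1, 1, 1],
})

# One indicator column per category level
dummies = pd.get_dummies(toy[["County", "Voters_Gender"]])
print(list(dummies.columns))

# 80/20 split: 4 training rows and 1 test row out of 5
X_tr, X_te, y_tr, y_te = train_test_split(
    dummies, toy["Voters_Active"], test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))
```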

### Implementing the model

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

# create dummy variables
encoded_columns = pd.get_dummies(voter_data_pandas_encode[['County', 'Ethnic_Description', 'Voters_Gender']])
df_encoded = pd.concat([voter_data_pandas_encode.drop(columns=['County', 'Ethnic_Description', 'Voters_Gender', "Parties_Description"]), encoded_columns], axis=1)
X = df_encoded.drop(columns=["Voters_Active"])
y = df_encoded["Voters_Active"]

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on test data
predictions = model.predict(X_test)

# Calculating accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

# Calculate ROC AUC score (note: computed from hard 0/1 predictions, not predicted probabilities)
roc_auc = roc_auc_score(y_test, predictions)
print("ROC AUC Score:", roc_auc)

Accuracy: 0.9814485785953178
ROC AUC Score: 0.5


The accuracy of this model is 0.98, which looks high, but it mostly reflects how many of the voters in the data are active. As mentioned before, the proportion of active voters is close to 1 across all ethnicities, income brackets, counties, etc., so a model can reach this accuracy simply by predicting that every voter is active. The ROC AUC of 0.5 points the same way: an AUC of 1 would indicate a perfect classifier, while 0.5 is no better than random guessing. Note that the AUC here is computed from the hard 0/1 predictions rather than from predicted probabilities, and 0.5 is exactly what that calculation gives when every prediction is the same class.
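
The sketch below illustrates why ROC AUC depends on what is passed as the score. It uses made-up labels and made-up probabilities (`y_true`, `hard_labels`, and `scores` are hypothetical, not outputs of the model above), so it says nothing about what the real model's AUC would be. It only shows that thresholded predictions can hide a ranking that the probabilities still carry:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical imbalanced labels: 98 active (1), 2 inactive (0)
y_true = np.array([1] * 98 + [0] * 2)

# Hypothetical predicted probabilities of being active.
# Every score is above 0.5, so the hard predictions are all 1.
scores = np.where(y_true == 1, 0.99, 0.97)
hard_labels = (scores >= 0.5).astype(int)

# AUC from hard labels: a constant prediction always gives 0.5
auc_hard = roc_auc_score(y_true, hard_labels)

# AUC from the scores: uses the ranking of voters, not the threshold
auc_scores = roc_auc_score(y_true, scores)

print("AUC from hard labels:", auc_hard)
print("AUC from scores:", auc_scores)
```

With a fitted scikit-learn `LogisticRegression`, the scores would come from `model.predict_proba(X_test)[:, 1]` instead of `model.predict(X_test)`.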

### Examining only income and home value as predictors

In the EDA we noticed a correlation between income and voter activity. We therefore drop the other columns and test how well income and home value alone predict voter activity.

In [6]:
# Logistic regression using only income and home value to predict voter activity
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

# Drop the categorical columns, leaving the numeric columns and the target
df_2 = voter_data_pandas_encode.drop(columns=['County', 'Ethnic_Description', 'Voters_Gender', "Parties_Description"])
X_2 = df_2.drop(columns=["Voters_Active"])
# Select the target as a 1-D Series; a one-column DataFrame makes fit() emit a DataConversionWarning
y_2 = df_2["Voters_Active"]

# Split the data into train and test sets
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_2, y_2, test_size=0.2, random_state=42)

# Create and fit logistic regression model
model_2 = LogisticRegression()
model_2.fit(X_train_2, y_train_2)

# Make predictions on test data
predictions_2 = model_2.predict(X_test_2)

# Calculating accuracy
accuracy_2 = accuracy_score(y_test_2, predictions_2)
print("Accuracy:", accuracy_2)

# Calculate ROC AUC score for this model (from hard 0/1 predictions, as before)
roc_auc_2 = roc_auc_score(y_test_2, predictions_2)
print("ROC AUC Score:", roc_auc_2)

Accuracy: 0.9814485785953178
ROC AUC Score: 0.5


Even after dropping the other columns, the accuracy and ROC AUC are unchanged. This makes sense: the proportion of active voters is high at every level of the predictor variables, so removing predictors changes little about what the model can learn. Identical accuracy on the same test split is also what we would expect if both models predict the majority class for every voter.
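
One way to check that reading is to compare against a majority-class baseline and look at the confusion matrix. The sketch below does this on synthetic data (`X_demo` and `y_demo` are hypothetical, with a 98% active rate chosen to resemble the accuracy above). scikit-learn's `DummyClassifier` ignores the features and always predicts the most frequent class:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in data: 196 active (1) and 4 inactive (0) voters
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))      # features are ignored by the baseline
y_demo = np.array([1] * 196 + [0] * 4)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
pred = baseline.predict(X_demo)

baseline_acc = accuracy_score(y_demo, pred)
# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_demo, pred, labels=[0, 1])

print("Baseline accuracy:", baseline_acc)
print(cm)
```

If a fitted model's accuracy matches this kind of baseline and the first column of its confusion matrix is all zeros, the model never predicts an inactive voter, regardless of which predictors it was given.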