## Initial Model
I'm going to try to build a logistic regression model to predict voter turnout from variables such as land ownership, party, age, and gender. In our initial EDA we saw that Michigan had a significant difference in voter turnout between political parties in the 2016 election, so the first model uses only Michigan data.

### Loading the Data

In [1]:
gcs_path = 'gs://pstat135-voter-file/VM2Uniform'

# Read the Michigan voter file snapshot from GCS
mi = spark.read.parquet(gcs_path + "/VM2Uniform--MI--2021-01-30")


                                                                                

In [2]:
mi.printSchema()


root
 |-- SEQUENCE: string (nullable = true)
 |-- LALVOTERID: string (nullable = true)
 |-- Voters_Active: string (nullable = true)
 |-- Voters_StateVoterID: string (nullable = true)
 |-- Voters_CountyVoterID: string (nullable = true)
 |-- VoterTelephones_LandlineAreaCode: string (nullable = true)
 |-- VoterTelephones_Landline7Digit: string (nullable = true)
 |-- VoterTelephones_LandlineFormatted: string (nullable = true)
 |-- VoterTelephones_LandlineUnformatted: string (nullable = true)
 |-- VoterTelephones_LandlineConfidenceCode: string (nullable = true)
 |-- VoterTelephones_CellPhoneOnly: string (nullable = true)
 |-- VoterTelephones_CellPhoneFormatted: string (nullable = true)
 |-- VoterTelephones_CellPhoneUnformatted: string (nullable = true)
 |-- VoterTelephones_CellConfidenceCode: string (nullable = true)
 |-- Voters_FirstName: string (nullable = true)
 |-- Voters_MiddleName: string (nullable = true)
 |-- Voters_LastName: string (nullable = true)
 |-- Voters_NameSuffix: string (

In [51]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.sql.functions import *
from pyspark.sql.types import IntegerType

In [52]:
mod_df = mi.select(
    "Residence_Addresses_Property_LandSq_Footage",
    "Parties_Description",
    "Voters_Age",
    "General_2020",
)

In [53]:
mod_df.count()

                                                                                

7593651

In [54]:
mod_df = mod_df.withColumn("label", when(mod_df.General_2020 == "Y", 1)
                                            .otherwise(0))
mod_df = mod_df.drop("General_2020")
mod_df = mod_df.withColumn("Residence_Addresses_Property_LandSq_Footage",
                           mod_df["Residence_Addresses_Property_LandSq_Footage"].cast(IntegerType()))
mod_df = mod_df.withColumn("Voters_Age",
                           mod_df["Voters_Age"].cast(IntegerType()))
mod_df.show(5)
mod_df.count()

+-------------------------------------------+-------------------+----------+-----+
|Residence_Addresses_Property_LandSq_Footage|Parties_Description|Voters_Age|label|
+-------------------------------------------+-------------------+----------+-----+
|                                      26000|         Democratic|        58|    1|
|                                      17000|         Republican|        29|    1|
|                                      17000|         Republican|        57|    1|
|                                      17000|         Republican|        56|    1|
|                                      36000|         Republican|        62|    1|
+-------------------------------------------+-------------------+----------+-----+
only showing top 5 rows



                                                                                

7593651

In [55]:
mod_df = mod_df.dropna()
mod_df.count()

                                                                                

5981473

### Preparing Data for the Model
Now I need to one-hot encode the `Parties_Description` column and assemble all the features into a single vector column.

In [56]:
# Create an indexer
indexer = StringIndexer(inputCol="Parties_Description", outputCol='Parties_Description_idx')

# Indexer identifies categories in the data
indexer_model = indexer.fit(mod_df)

# Indexer creates a new column with numeric index values
df_indexed = indexer_model.transform(mod_df)
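
# One-hot encode the index column (the last category is dropped by default)
onehot = OneHotEncoder(inputCols=["Parties_Description_idx"], outputCols=["party_encoded"])
# Fitting returns the encoder model; the name `onehot` is reused for it
onehot = onehot.fit(df_indexed)

# Encoder model adds the sparse one-hot vector column
df_onehot = onehot.transform(df_indexed)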

                                                                                

In [57]:
df_onehot.show(5)

+-------------------------------------------+-------------------+----------+-----+-----------------------+-------------+
|Residence_Addresses_Property_LandSq_Footage|Parties_Description|Voters_Age|label|Parties_Description_idx|party_encoded|
+-------------------------------------------+-------------------+----------+-----+-----------------------+-------------+
|                                      26000|         Democratic|        58|    1|                    0.0|(2,[0],[1.0])|
|                                      17000|         Republican|        29|    1|                    1.0|(2,[1],[1.0])|
|                                      17000|         Republican|        57|    1|                    1.0|(2,[1],[1.0])|
|                                      17000|         Republican|        56|    1|                    1.0|(2,[1],[1.0])|
|                                      36000|         Republican|        62|    1|                    1.0|(2,[1],[1.0])|
+-------------------------------

In [58]:
# Create an assembler object
assembler = VectorAssembler(inputCols=[
    "Residence_Addresses_Property_LandSq_Footage",
    "Voters_Age",
    "party_encoded"
], outputCol='features')

# Consolidate predictor columns
df_assembled = assembler.transform(df_onehot)

# Check the resulting column
df_assembled.select('features', 'label').show(5, truncate=False)

+----------------------+-----+
|features              |label|
+----------------------+-----+
|[26000.0,58.0,1.0,0.0]|1    |
|[17000.0,29.0,0.0,1.0]|1    |
|[17000.0,57.0,0.0,1.0]|1    |
|[17000.0,56.0,0.0,1.0]|1    |
|[36000.0,62.0,0.0,1.0]|1    |
+----------------------+-----+
only showing top 5 rows
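To make the indexing, encoding, and assembly steps above concrete, here is a minimal plain-Python sketch of what I understand `StringIndexer`, `OneHotEncoder`, and `VectorAssembler` to be doing with their default settings. It is not Spark code, and the helper names (`fit_string_index`, `one_hot_drop_last`, `assemble`) are hypothetical. The toy party list and the `"Other"` category are also assumptions: the size-2 vectors in the output suggest three party categories with the last one dropped, but the third category's name does not appear in the rows shown.

```python
from collections import Counter


def fit_string_index(values):
    """Map each category to an index, most frequent category first."""
    counts = Counter(values)
    # Descending frequency, ties broken alphabetically (assumed default order)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {category: idx for idx, category in enumerate(ordered)}


def one_hot_drop_last(idx, num_categories):
    """Encode an index in num_categories - 1 slots; the last category is all zeros."""
    vec = [0.0] * (num_categories - 1)
    if idx < num_categories - 1:
        vec[idx] = 1.0
    return vec


def assemble(land_sq_ft, age, party_vec):
    """Concatenate the numeric columns and the encoded party into one feature list."""
    return [float(land_sq_ft), float(age)] + party_vec


# Toy data: "Other" is a hypothetical stand-in for the third party category
parties = ["Democratic"] * 3 + ["Republican"] * 2 + ["Other"]
index = fit_string_index(parties)

row = assemble(26000, 58, one_hot_drop_last(index["Democratic"], len(index)))
print(index)
print(row)
```

With this toy data the first assembled row matches the first row of the `features` column above. Dropping the last category keeps the encoded columns from being perfectly collinear with the model's intercept.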



                                                                                

#### Train/Test Split

In [59]:
df_train, df_test = df_assembled.randomSplit([0.8, 0.2], seed=43)

### Logistic Regression Model

In [60]:
logistic = LogisticRegression().fit(df_train)

                                                                                

In [61]:
prediction = logistic.transform(df_test)
prediction.groupBy("label", "prediction").count().show()

                                                                                

+-----+----------+------+
|label|prediction| count|
+-----+----------+------+
|    1|       0.0|  5544|
|    0|       0.0|  5802|
|    1|       1.0|879837|
|    0|       1.0|304938|
+-----+----------+------+
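To read the confusion matrix above more easily, the following sketch turns the four counts into the usual summary metrics. The counts are copied by hand from the table; the variable names are my own, and the "always predict 1" baseline is an extra comparison I am adding, not something computed earlier in the notebook.

```python
# Counts copied from the (label, prediction) table above
tp = 879837  # label 1, prediction 1.0
fn = 5544    # label 1, prediction 0.0
tn = 5802    # label 0, prediction 0.0
fp = 304938  # label 0, prediction 1.0

total = tp + fn + tn + fp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)      # of predicted voters, how many voted
recall = tp / (tp + fn)         # of actual voters, how many were found
specificity = tn / (tn + fp)    # of actual non-voters, how many were found

# Baseline for comparison: predict "voted" for every row
baseline_accuracy = (tp + fn) / total
share_predicted_positive = (tp + fp) / total

print(f"accuracy:    {accuracy:.4f}")
print(f"baseline:    {baseline_accuracy:.4f}")
print(f"precision:   {precision:.4f}")
print(f"recall:      {recall:.4f}")
print(f"specificity: {specificity:.4f}")
print(f"predicted 1: {share_predicted_positive:.4f}")
```

If I have copied the counts correctly, accuracy comes out at about 74%, which is almost the same as the baseline of always predicting that someone voted. The model predicts 1 for roughly 99% of test rows and correctly identifies under 2% of non-voters, so these features are adding very little beyond the overall turnout rate.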

