# Random Forests: Presidential Contributions

Let's look at a random forest model for the presidential contributions dataset.

This dataset contains presidential campaign contribution amounts drawn from publicly available information.

The goal is to classify which candidate each contributor gave to.

Here are the feature columns we will use:
1. Last Name (converted from Contributor Name)
2. First Name (converted from Contributor Name)
3. State 
4. Latitude (converted from ZIP code)
5. Longitude (converted from ZIP code)
6. Employer
7. Occupation

### Notes

This is a very difficult dataset on which to get high accuracy, because none of the features is highly correlated with the outcome. Part of our analysis is to see which features prove most useful.

One might suspect that a feature like State would be very predictive, because presumably New Yorkers might contribute to Hillary Clinton and Texans to Donald Trump. However, it turns out that State is only weakly correlated with the outcome.

One nice thing about random forests is that, because each tree is trained on a bootstrap ("bagged") sample and considers a random subset of features at each split, we can empirically see which variables have the most predictive power. This is helpful for analysis.



In [None]:
%matplotlib inline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorIndexer, IndexToString
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql.functions import isnan, when, count, col, split, trim, countDistinct, abs 
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.types import IntegerType

import pyspark.sql.functions

print('Spark UI running on http://YOURIPADDRESS:' + sc.uiWebUrl.split(':')[2])

## Step 1: Load the data

In [None]:
dataset = spark.read.csv("/data/presidential_election_contribs/2016/2016-medium-clean.csv", header=True, inferSchema=True)


In [None]:
prediction_column = ['CAND_NM']
numeric_columns = ['LAT', 'LNG']
feature_columns = ['LASTNAME', 'FIRSTNAME', 'CONTBR_ST', 'LAT', 'LNG', 'CONTBR_EMPLOYER', "CONTBR_OCCUPATION"]
categorical_columns = ['LASTNAME', 'FIRSTNAME', 'CONTBR_ST', 'CONTBR_EMPLOYER', "CONTBR_OCCUPATION"]
categorical_index = ['FIRSTNAME_index', 'LASTNAME_index', 'CONTBR_ST_index', 'CONTBR_EMPLOYER_index', 
                     "CONTBR_OCCUPATION_index"]

**=> TODO: Print out a count broken down by candidate**

In [None]:
# Print out a grouping by candidate
# TODO: Print Breakdown.

**=> Which candidates got the most donations (in terms of number of donors)?**

## Step 2: Build Indexers and feature vector

Let's index all the categorical columns and build a label index.

In [None]:
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index", handleInvalid="keep").fit(dataset) for column in categorical_columns ]
pipeline = Pipeline(stages=indexers)
df_r2 = pipeline.fit(dataset).transform(dataset)

**=> TODO: Build vectors from all the numeric and categorical_index columns**

**=> TODO: Make a new column called "label" which is the candidate name**



In [None]:
assembler2 = VectorAssembler(inputCols=??? + ???, outputCol="features")
fv2 = assembler2.transform(df_r2.na.drop())

fv2 = fv2.withColumn("???",???)   # TODO: create the "label" column from the candidate name



In [None]:

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(fv2)


# maxCategories is not set here, so the default (20) applies:
# features with more than 20 distinct values are treated as continuous.
featureIndexer2 =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures").fit(fv2)



## Step 3: Define the Random Forest Model and Pipeline

Define the classifier and the pipeline here. The pipeline is fit in Step 5.

**=> TODO: Create pipeline with labelIndexer, featureIndexer2, rf2**



In [None]:

# Train a RandomForest model.
rf2 = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=20, maxBins=10000)

# Chain the indexers and the forest in a Pipeline
pipeline2 = Pipeline(stages=[???, ???, ???])



## Step 4: Split data into training and test

**=> TODO: Build training and test datasets (70%/30%)**


In [None]:

# Split the data into training and test sets (30% held out for testing)
(trainingData2, testData2) = ??? # do a random split 70%/30%


## Step 5: Train the Model

**=> TODO: Fit pipeline2 on trainingData2 to get model2, then get predictions2 by transforming testData2**



In [None]:
predictions2.select(feature_columns + ['probability', 'prediction']).show()

## Step 6: Evaluate the model

In [None]:

# Select example rows to display.
predictions2.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator2 = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator2.evaluate(predictions2)
print("Test Error = %g" % (1.0 - accuracy))

rfModel2 = model2.stages[2]
print(rfModel2)  # summary only

### Decoding the Label

0. Hillary Clinton
1. Bernie Sanders
2. Donald Trump
3. Ted Cruz
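
The numbering above follows from how `StringIndexer` assigns indices: by default it orders labels by descending frequency, so index 0 is the candidate with the most rows. The pure-Python sketch below imitates that rule on a made-up list of labels. The helper name `frequency_index` and the alphabetical tie-break are assumptions for illustration, not Spark's own code.

```python
from collections import Counter

def frequency_index(labels):
    """Map each label to an index, most frequent label first (hypothetical helper)."""
    counts = Counter(labels)
    # Sort by descending count; break ties alphabetically (an assumption).
    ordered = sorted(counts, key=lambda name: (-counts[name], name))
    return {name: idx for idx, name in enumerate(ordered)}

# Toy data, not taken from the contributions dataset.
toy_labels = ["Clinton", "Sanders", "Clinton", "Trump", "Clinton", "Sanders"]
label_to_index = frequency_index(toy_labels)
print(label_to_index)
```

In the notebook itself, `labelIndexer.labels` holds the actual ordering, so it is worth checking that list rather than relying on a hard-coded one.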

## Step 7: Print out the Confusion Matrix


In [None]:
predictions2.groupBy('label').pivot('prediction', range(0,22)).count().na.fill(0).orderBy('label').show()

Use the list above to interpret the label.  

**=> What can you conclude from the confusion matrix?**

Is our model better at predicting candidates with many donations (Clinton, Sanders), or few donations?

What can you say about our model's performance?
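
One way to put a number on these questions is per-class recall: for each true candidate, the fraction of rows the model predicted correctly. The sketch below computes it in plain Python from (true, predicted) pairs. The helper `per_class_recall` and the toy pairs are made up for illustration and are not results from this model.

```python
from collections import defaultdict

def per_class_recall(pairs):
    """Return {true_label: fraction of its rows predicted correctly}.

    `pairs` is an iterable of (true_label, predicted_label) tuples.
    """
    total = defaultdict(int)
    correct = defaultdict(int)
    for true_label, predicted_label in pairs:
        total[true_label] += 1
        if true_label == predicted_label:
            correct[true_label] += 1
    return {label: correct[label] / total[label] for label in total}

# Made-up (true, predicted) index pairs for illustration only.
toy_pairs = [(0, 0), (0, 0), (0, 1), (1, 0), (1, 1), (2, 0)]
recall = per_class_recall(toy_pairs)
print(recall)
```

In the notebook, the pairs would presumably come from the `indexedLabel` and `prediction` columns of `predictions2`, since both use the same index space. A class with many rows but low recall tells a different story from a rare class that is never predicted at all.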

## Step 8: Print the feature importances

In [None]:
rfModel2.featureImportances

In [None]:
# For reference: this should match the inputCols order given to the VectorAssembler
print(numeric_columns + categorical_index)

**=> TODO: Compare the relative weights of the feature importances**
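
The importance vector is positional, so comparing weights is easier once each score is paired with its column name and sorted. The sketch below does this in plain Python. `rank_importances` is a hypothetical helper, and the names and scores are placeholders rather than output from the model.

```python
def rank_importances(names, importances):
    """Pair feature names with importance scores, largest first (hypothetical helper)."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    return sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)

# Placeholder names and scores, not results from the model.
toy_names = ["feature_a", "feature_b", "feature_c"]
toy_scores = [0.2, 0.5, 0.3]
ranked = rank_importances(toy_names, toy_scores)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

In the notebook, the names would be the list passed as `inputCols` to the `VectorAssembler`, in the same order, and the scores could be taken from `rfModel2.featureImportances.toArray()`. If the two lists are in different orders, the ranking will silently attribute weights to the wrong columns.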



## Conclusion: Most Important Fields

1. Employer
2. Occupation
3. Last Name
4. State

The other fields were not significant.


Why do you think that the lat/long and other fields did not contribute?

**=> BONUS: Build a Pearson correlation matrix of the variables against the outcome to see the correlations**

