# Random Forests: Presidential Contributions

Let's build a random forest model for the presidential contributions dataset.

This dataset contains presidential campaign contribution records taken from publicly available information.

The purpose here is to try to classify the candidate to whom the contributor contributes.  

Here are the feature columns we will use:
1. Last Name (converted from Contributor Name)
2. First Name (converted from Contributor Name)
3. State
4. Latitude (converted from ZIP code)
5. Longitude (converted from ZIP code)
6. Employer
7. Occupation

### Notes

It is difficult to achieve high accuracy on this dataset, because none of the features is highly correlated with the outcome. Part of our analysis is to see which features prove to be the most useful.

One might suspect that a feature like State would be very predictive, because presumably New Yorkers might contribute to Hillary Clinton and Texans to Donald Trump. However, it turns out that State is only weakly correlated with the outcome.

One nice thing about random forests is that, because each tree is trained on a bootstrap ("bagged") sample of the data and considers a random subset of features at each split, we can empirically see which variables have the most predictive power. This is helpful for analysis.
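The sampling behind a random forest can be sketched in plain Python. This is only an illustration of the idea, not Spark's implementation: each tree gets a bootstrap sample of the rows, and only a random subset of the features is considered. Spark chooses the subset at each tree node (see its `featureSubsetStrategy` parameter); the sketch simplifies this to one subset per tree. The helper `sample_for_tree` and the row count are made up for the example.

```python
import math
import random

feature_names = ["LASTNAME", "FIRSTNAME", "CONTBR_ST", "LAT", "LNG",
                 "CONTBR_EMPLOYER", "CONTBR_OCCUPATION"]

def sample_for_tree(n_rows, features, rng):
    """Return (row indices, feature subset) for one hypothetical tree."""
    # Bootstrap: draw n_rows row indices with replacement.
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Feature subset: sqrt(number of features), a common choice for classification.
    k = max(1, int(math.sqrt(len(features))))
    subset = rng.sample(features, k)
    return rows, subset

rng = random.Random(42)
samples = [sample_for_tree(10, feature_names, rng) for _ in range(3)]
for rows, subset in samples:
    print(len(set(rows)), "distinct rows, features:", subset)
```

Because every tree sees different rows and features, comparing how much each feature helps across all trees gives an empirical importance ranking.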



In [None]:
%matplotlib inline
import time

print('Spark UI running on http://YOURIPADDRESS:' + sc.uiWebUrl.split(':')[2])

## Step 1: Load the data

In [None]:
t1 = time.perf_counter()
dataset = spark.read.csv("/data/presidential_election_contribs/2016/2016-medium-clean.csv", \
                         header=True, inferSchema=True)
t2 = time.perf_counter()

print("read {:,} records in {:,.2f} ms".format(dataset.count(), (t2-t1)*1000))

In [None]:
prediction_column = ['CAND_NM']
numeric_columns = ['LAT', 'LNG']
feature_columns = ['LASTNAME', 'FIRSTNAME', 'CONTBR_ST', 'LAT', 'LNG', 'CONTBR_EMPLOYER', "CONTBR_OCCUPATION"]
categorical_columns = ['LASTNAME', 'FIRSTNAME', 'CONTBR_ST', 'CONTBR_EMPLOYER', "CONTBR_OCCUPATION"]
# Derive the index column names in the same order as categorical_columns,
# so that feature importances line up with the column names later on.
categorical_index = [column + '_index' for column in categorical_columns]

### Print out a contribution count broken down by candidate
**=> Q: Which candidates got the most donations (in terms of number of donors)?**

In [None]:
## TODO : print out per candidate breakdown
## Hint : What column represents Candidate name
dataset.groupBy('???').count().show()


In [None]:
## TODO : sort the output by number of contributions
dataset.groupBy('???').count().orderBy('???', ascending=False).show()

## Step 2: Build Indexers and feature vector

Let's index all the categorical columns and build a label index.

In [None]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline

indexers = [StringIndexer(inputCol=column, outputCol=column+"_index", handleInvalid="keep").fit(dataset) for column in categorical_columns ]
pipeline = Pipeline(stages=indexers)
df_r2 = pipeline.fit(dataset).transform(dataset)
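To make the indexing step concrete, here is a plain-Python sketch of the idea behind `StringIndexer` with its default frequency-based ordering: the most frequent value gets index 0, the next gets 1, and so on. With `handleInvalid="keep"`, values not seen during fitting map to one extra index. The function names and sample values are made up for illustration; this is not Spark's code.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent first."""
    counts = Counter(values)
    # Sort by descending count; break ties alphabetically so the result is stable.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

def transform_keep(mapping, value):
    """Look up a value; unseen values get the extra index len(mapping)."""
    return mapping.get(value, float(len(mapping)))

states = ["CA", "NY", "CA", "TX", "CA", "NY"]  # toy data, not from the dataset
mapping = fit_string_index(states)
print(mapping)
print(transform_keep(mapping, "WA"))
```

Note that the resulting numbers carry no natural order: index 0 only means "most common", which matters when a tree later splits on these values.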

In [None]:
from pyspark.ml.feature import VectorAssembler

assembler2 = VectorAssembler(inputCols=numeric_columns + categorical_index, outputCol="features")
fv2 = assembler2.transform(df_r2.na.drop())


In [None]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorIndexer

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="CAND_NM", outputCol="indexedLabel").fit(fv2)


# maxCategories is left at its default (20), so features with more than
# 20 distinct values are treated as continuous.
featureIndexer2 =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures").fit(fv2)
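`VectorIndexer` decides per feature whether to treat it as categorical, based on its number of distinct values and `maxCategories`. A rough plain-Python sketch of that decision rule, using a made-up helper `categorical_features` and toy rows:

```python
def categorical_features(rows, max_categories):
    """Return the column positions with at most max_categories distinct values."""
    n_cols = len(rows[0])
    result = []
    for col in range(n_cols):
        distinct = {row[col] for row in rows}
        if len(distinct) <= max_categories:
            result.append(col)
    return result

# Toy feature vectors: column 0 looks continuous, column 1 has only two values.
rows = [(40.7, 0.0), (34.1, 1.0), (29.8, 0.0), (41.9, 1.0), (47.6, 0.0)]
print(categorical_features(rows, max_categories=4))
```

With the default `maxCategories` of 20, indexed columns with more than 20 distinct values, which is likely the case for employer and occupation, are left as continuous features.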



## Step 3: Create Random Forest Model with Pipeline

Create the model here and add it to the pipeline.

In [None]:
from pyspark.ml.classification import RandomForestClassifier

## TODO : Create a RandomForest Model with  numTrees=20 and  maxBins=10000
rf2 = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=???, maxBins=???)

# Chain indexer and forest in a Pipeline
pipeline2 = Pipeline(stages=[labelIndexer, featureIndexer2, rf2])


## Step 4: Split data into training and test

**=> TODO: build training and test datasets 70%/30%**


In [None]:
## TODO : split 70% training and 30% testing
## Hint : 0.7 ,  0.3
(trainingData2, testData2) = fv2.randomSplit([???,  ???])

print("training set = " , trainingData2.count())
print("testing set = " , testData2.count())
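`randomSplit` assigns each row to a split at random, so the resulting sizes are close to 70%/30% but not exact, and they change between runs unless a seed is passed. A plain-Python sketch of that behavior (the function `random_split` is a stand-in, not Spark's implementation):

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row independently to train or test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train, test = random_split(rows, 0.7, seed=123)
print(len(train), len(test))  # roughly 700 and 300, not exactly
```

In the notebook, passing a `seed` argument to `randomSplit` makes the split reproducible in the same way.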

## Step 5: Train the Model


In [None]:
print("Starting model training....this will take some time")
t1 = time.perf_counter()
## TODO : train the model with our training set
## Hint : trainingData2
model2 = pipeline2.fit(???)
t2 = time.perf_counter()
print("trained on {:,} records  in {:,.2f} ms".\
      format(trainingData2.count(),  (t2-t1)*1000))

In [None]:
## TODO : predict with our test data
## Hint : testData2

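t1 = time.perf_counter()
# Note: transform() is lazy, so this timing mostly covers building the plan;
# the actual scoring happens when an action such as show() or count() runs.
predictions2 = model2.transform(???)
t2 = time.perf_counter()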
print("prediction on {:,} records  in {:,.2f} ms".\
      format(testData2.count(),  (t2-t1)*1000))

predictions2.select(feature_columns + ['probability', 'prediction']).show()

## Step 6: Evaluate the model

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Select example rows to display.
predictions2.select("prediction", "CAND_NM", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator2 = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator2.evaluate(predictions2)
print("Test Error = %g" % (1.0 - accuracy))

rfModel2 = model2.stages[2]
print(rfModel2)  # summary only

**=> Q: Think about the test error here. Does it seem high? What does that say about our model?**

**=> Q: How do we define model success?**
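One way to judge the test error is to compare it against a trivial baseline that always predicts the most common candidate. With many classes and an unbalanced label distribution, a model is only useful if it beats that baseline. A small sketch with made-up labels (the numbers are illustrative, not results from this dataset):

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of rows where the prediction matches the true label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# Made-up true labels: class 0 dominates.
actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
# Made-up model output.
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Baseline: always predict the most frequent class.
majority_class = Counter(actual).most_common(1)[0][0]
baseline = [majority_class] * len(actual)

model_error = 1.0 - accuracy(actual, predicted)
baseline_error = 1.0 - accuracy(actual, baseline)
print("model error    = %g" % model_error)
print("baseline error = %g" % baseline_error)
```

If the model's error is not clearly below the baseline error, the features are adding little beyond the class frequencies.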

### Decoding the Label

0. Hillary Clinton
1. Bernie Sanders
2. Donald Trump
3. Ted Cruz
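
The numbering follows the label indexer: with the default ordering, index 0 is the most frequent candidate. In the notebook, the fitted `labelIndexer` exposes this list as `labelIndexer.labels`. The sketch below uses a hard-coded stand-in list, taken from the four entries above, to show how a numeric prediction is decoded; the actual strings in `CAND_NM` may be formatted differently.

```python
# Stand-in for labelIndexer.labels, limited to the four candidates listed above.
labels = ["Hillary Clinton", "Bernie Sanders", "Donald Trump", "Ted Cruz"]

def decode(prediction, labels):
    """Turn a numeric prediction such as 2.0 into a candidate name."""
    index = int(prediction)
    if 0 <= index < len(labels):
        return labels[index]
    return "unknown label %d" % index

print(decode(0.0, labels))
print(decode(2.0, labels))
print(decode(7.0, labels))
```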

## Step 7: Print out the Confusion Matrix


In [None]:
predictions2.groupBy('CAND_NM').pivot('prediction', range(0,22)).count().na.fill(0).orderBy('CAND_NM').show()

Use the list above to interpret the labels.

**=> What can you conclude from the confusion matrix?**

Is our model better at predicting candidates with many donations (Clinton, Sanders) or those with few donations?

What can you say about our model's performance?
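
A confusion matrix is just a count of (actual, predicted) pairs. The sketch below builds one in plain Python from made-up labels and derives per-class recall, which is one way to compare frequent and rare candidates. The data is illustrative only.

```python
from collections import Counter

actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]      # made-up true labels
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]   # made-up predictions

# Count every (actual, predicted) pair.
matrix = Counter(zip(actual, predicted))

classes = sorted(set(actual) | set(predicted))
recalls = {}
for a in classes:
    row = [matrix[(a, p)] for p in classes]
    total = sum(row)
    # Recall: share of rows of this class that were predicted correctly.
    recalls[a] = matrix[(a, a)] / total if total else 0.0
    print("actual %d: %s  recall=%.2f" % (a, row, recalls[a]))
```

In this toy data the rare class 2 is never predicted, a pattern worth checking for in the real matrix.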

## Step 8: Print the feature importances

In [None]:
import pandas as pd

imp = rfModel2.featureImportances.toArray()
print(imp)
cols = numeric_columns + categorical_columns
print(cols)
df = pd.DataFrame({'cols': cols, 'importance':imp})
print(df)
df.sort_values(by=['importance'], ascending=False)
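
The importance values come back in the order of the columns passed to `VectorAssembler`, so the names used for labeling must be in exactly that order. A minimal sketch of pairing and ranking them, with a hypothetical column order and made-up importance values:

```python
# Hypothetical column order; in the notebook, use the exact list passed to VectorAssembler.
assembled_cols = ["LAT", "LNG", "LASTNAME_index", "FIRSTNAME_index",
                  "CONTBR_ST_index", "CONTBR_EMPLOYER_index",
                  "CONTBR_OCCUPATION_index"]

# Made-up importances for illustration; real values come from featureImportances.
importances = [0.05, 0.05, 0.15, 0.05, 0.10, 0.35, 0.25]

# A length mismatch would mean names and values are not aligned.
assert len(assembled_cols) == len(importances)

ranked = sorted(zip(assembled_cols, importances),
                key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print("%-26s %.2f" % (name, value))
```

Random forest importances are normalized to sum to 1, so each value can be read as a share of the total.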

**=> TODO: Compare the relative weights of the feature importances.**



## Conclusion: Most important Fields

1. Employer
2. Occupation
3. LastName
4. State

The other fields are not significant.

Why do you think that the lat/long and other fields did not contribute?

**=> BONUS: Compute a Pearson correlation matrix of the variables against the outcome to see how they correlate.**
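
Pearson correlation between two numeric columns is the covariance divided by the product of the standard deviations. A plain-Python sketch on toy numbers follows. Note that correlation against an index-coded label or categorical column depends on the arbitrary index order, so treat such values with caution.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors cancel, so raw sums of products and squares suffice.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

lat = [40.7, 34.1, 29.8, 41.9, 47.6]   # toy feature values
label = [0.0, 1.0, 2.0, 0.0, 1.0]      # toy indexed labels
print(round(pearson(lat, label), 3))
print(pearson([1, 2, 3], [2, 4, 6]))
```

Spark ML also provides `Correlation.corr` in `pyspark.ml.stat`, which computes a correlation matrix over a vector column.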



## BONUS : Running on full dataset

**Use the download script**

```bash
$ cd   ~/data/presidential_election_contribs
$ ./download-data.sh
```

This will download the full dataset.

As we run on a larger dataset, execution will take longer and the Jupyter notebook might time out. So let's run this from the command line in script mode.

Download the Jupyter notebook as a Python file (File --> Download as --> Python).

```bash
# run the downloaded python script as follows
$    time  ~/spark/bin/spark-submit    --master local[*]  random-forest-2-election-classification.py 2> logs

```

Watch the output.
