# Random Forests: Presidential Contributions

Let's build a random forest model for the presidential contributions dataset.

This dataset contains presidential campaign contribution amounts compiled from publicly available information.

**The goal is to predict the candidate to whom a contributor is most likely to donate.**

Here are the feature columns we will use:
1. State 
2. Employer
3. Occupation

### Notes

It will be very difficult to get high accuracy on this dataset, because we don't have any features that are highly correlated with the outcome. Part of our analysis is to see which features prove to be the most useful.

One might suspect that a feature like State would be very predictive, because presumably New Yorkers might contribute to Hillary Clinton and Texans might contribute to Donald Trump. However, it turns out that State is only weakly correlated with the outcome.

One nice thing about random forests is that, because we "bag" features in different trees, we can empirically see which variables have the most predictive power. This is helpful for analysis.



In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


print('Spark UI running on http://YOURIPADDRESS:' + sc.uiWebUrl.split(':')[2])

## Step 1: Load the data

In [None]:
%%time

# 100k samples
data_file = '/data/presidential_election_contribs/2016/2016-100k.csv.gz'


data = spark.read.csv(data_file, header=True, inferSchema=True)

In [None]:
print("read {:,} records".format(data.count()))

In [None]:
data.printSchema()

In [None]:
## data.show() is hard to read
## use Pandas to pretty print

## vertical
## TODO : 'toPandas'
data.limit(3).???().T

# horizontal
# data.limit(5).toPandas()

### 1.5 - Sample Data
Start with a small sample of the data. Once the algorithm is working, process the full dataset.


In [None]:
## TODO : set sample rate, start with 0.1
# sample size :  10% --> 0.1,   100%  -> 1.0
sample_size = ???

data = data.sample(withReplacement=False, fraction=sample_size)
print("sample size {:,} records".format(data.count()))
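
For intuition, sampling without replacement by `fraction` keeps each row independently with that probability, so the resulting count is only approximately `fraction` times the original size. The plain-Python sketch below mimics that behavior on a toy list; `bernoulli_sample` is a hypothetical helper written for illustration, not a Spark API.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(100_000))  # stand-in for 100k records
sample = bernoulli_sample(rows, fraction=0.1, seed=42)

# The size lands near 10,000 but is not guaranteed to be exactly 10,000.
print("sample size {:,} records".format(len(sample)))
```

Fixing the seed makes the sample repeatable, which helps when comparing runs.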

## Step 2 : Clean Data

### 2.1 - extract only a few columns

In [None]:
## TODO : Select these columns 
## Hint : 'CAND_NM', 'CONTBR_ST', 'CONTBR_EMPLOYER', 'CONTBR_OCCUPATION', 'CONTB_RECEIPT_AMT'

columns = ['???', '???']

In [None]:
data2 = data.select(columns)
data2.printSchema()

data2.limit(5).toPandas()

### 2.2 - Clean data (drop null values)

In [None]:
## TODO : drop any null values
## Hint : na
data_clean = data2.???.???()

print("original data size = {:,}".format(data2.count()) )
print("clean data size = {:,}".format(data_clean.count()) )
print("dropped records = {:,}".format(data2.count() - data_clean.count()) )
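
As a mental model, dropping nulls removes every row in which at least one of the selected columns is missing. The sketch below shows that rule in plain Python on made-up rows; `drop_rows_with_nulls` is a hypothetical helper, and the values are illustrative only.

```python
def drop_rows_with_nulls(rows):
    """Keep only rows where every column has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]

# Made-up rows; None plays the role of a null value.
rows = [
    {"CAND_NM": "Candidate A", "CONTBR_ST": "NY", "CONTBR_EMPLOYER": "ACME"},
    {"CAND_NM": "Candidate B", "CONTBR_ST": None, "CONTBR_EMPLOYER": "ACME"},
    {"CAND_NM": "Candidate A", "CONTBR_ST": "TX", "CONTBR_EMPLOYER": None},
]

clean = drop_rows_with_nulls(rows)
print("original data size =", len(rows))
print("clean data size =", len(clean))
print("dropped records =", len(rows) - len(clean))
```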

## Step 2 : Basic Exploration

### 2.1 - Print out the contribution count broken down by candidate

**=> Which candidates got the most donations? (in terms of number of contributions)**

In [None]:
## TODO : print out per candidate breakdown
## Hint : group by 'CAND_NM'  and order by 'count'
data_clean.groupBy('???').count().orderBy('???', ascending=False).show(20, False)

### 2.2 - find min/max/average contribution per candidate

In [None]:
from pyspark.sql.functions import min,max,mean

## TODO : what column represents contribution amount?
data_clean.groupBy('CAND_NM').\
        agg(min('???'), mean('???'), max('???')).\
        orderBy('CAND_NM').\
        show(40, False)

### 2.3 - Whoa!  Negative Contributions!
We see some negative contributions!   

**Q==> Can you guys figure out why?**


In [None]:
## TODO Filter out only positive contribs
## Hint : use filter(condition)
## Hint : condition :   
pos_contribs = data_clean.???("??? > 0")

print("original data size = {:,}".format(data_clean.count()) )
print("positive contributions data size = {:,}".format(pos_contribs.count()) )

### 2.4 - now find min/max/mean in positive contributions

In [None]:
from pyspark.sql.functions import min,max,mean

print ("sorted by CAND_NM")

pos_contribs.groupBy('CAND_NM').\
        agg(min('CONTB_RECEIPT_AMT'), mean('CONTB_RECEIPT_AMT'), max('CONTB_RECEIPT_AMT')).\
        orderBy('CAND_NM').\
        show(40, False)

In [None]:
from pyspark.sql.functions import min,max,mean

print("sorted by AVG contribution")

pos_contribs.groupBy('CAND_NM').\
        agg(min('CONTB_RECEIPT_AMT'), mean('CONTB_RECEIPT_AMT'), max('CONTB_RECEIPT_AMT')).\
        orderBy('avg(CONTB_RECEIPT_AMT)', ascending=False).\
        show(40, False)

### 2.5 -- Find total contribution amount per candidate

In [None]:
from pyspark.sql.functions import min,max,mean

print("sorted by total contribution")

pos_contribs.groupBy('CAND_NM').\
        sum('CONTB_RECEIPT_AMT').\
        orderBy('sum(CONTB_RECEIPT_AMT)', ascending=False).\
        show(40, False)
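
The group-by aggregations in this step can be pictured as bucketing amounts per candidate and then summarizing each bucket. Here is a minimal plain-Python sketch of that idea; the candidate names and amounts are made up and are not taken from the dataset.

```python
from collections import defaultdict

# Made-up (candidate, amount) pairs for illustration.
contribs = [
    ("Candidate A", 25.0),
    ("Candidate B", 100.0),
    ("Candidate A", 75.0),
    ("Candidate B", 50.0),
    ("Candidate B", 300.0),
]

# Bucket the amounts by candidate (the "group by" part).
amounts = defaultdict(list)
for candidate, amount in contribs:
    amounts[candidate].append(amount)

# Summarize each bucket (the "agg" part).
stats = {
    cand: {"min": min(v), "mean": sum(v) / len(v), "max": max(v), "sum": sum(v)}
    for cand, v in amounts.items()
}

# Order by total contribution, largest first.
for cand, s in sorted(stats.items(), key=lambda kv: kv[1]["sum"], reverse=True):
    print(cand, s)
```

Note how the ranking can change depending on whether you sort by count, mean, or total.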

## Step 3: Build Indexers

In [None]:
from pyspark.ml.feature import StringIndexer

## TODO build indexers for following categorical columns
## CAND_NM,   CONTBR_ST,  CONTBR_EMPLOYER,  CONTBR_OCCUPATION

indexer1 = StringIndexer(inputCol='CAND_NM', outputCol = "CAND_NM_index", handleInvalid="keep")
indexer2 = StringIndexer(inputCol='???', outputCol = "???_index", handleInvalid="keep")
indexer3 = StringIndexer(inputCol='???', outputCol = "???_index", handleInvalid="keep")
indexer4 = StringIndexer(inputCol='???', outputCol = "???_index", handleInvalid="keep")


In [None]:
## Stash indexers into a pipeline
from pyspark.ml import Pipeline

## TODO : add all indexers into stages
pipeline = Pipeline(stages=[indexer1, ???, ???, ???])
print(pipeline)

In [None]:
%%time
## TODO : fit and transform 'pos_contribs'  through pipeline
indexed_data = pipeline.fit(???).transform(???)

In [None]:
indexed_data.printSchema()
# indexed_data.show()

### 3.1  Understand indexed values

In [None]:
# state
indexed_data.groupBy(['CONTBR_ST', 'CONTBR_ST_index']).count()\
            .orderBy('CONTBR_ST_index', ascending=False).show()

In [None]:
# employer
indexed_data.groupBy(['CONTBR_EMPLOYER', 'CONTBR_EMPLOYER_index']).count()\
            .orderBy('CONTBR_EMPLOYER_index', ascending=False).show()

In [None]:
# occupation
indexed_data.groupBy(['CONTBR_OCCUPATION', 'CONTBR_OCCUPATION_index']).count()\
            .orderBy('CONTBR_OCCUPATION_index', ascending=False).show(10, False)
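
To make sense of the index values: by default `StringIndexer` orders labels by descending frequency, so the most common value gets index 0.0, and with `handleInvalid="keep"` any value not seen during fitting goes into one extra bucket after the known labels. The sketch below imitates that in plain Python; the helper names are hypothetical, and breaking ties alphabetically is an assumption made here for determinism.

```python
from collections import Counter

def fit_string_indexer(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Assumption: ties are broken alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_keep(mapping, value):
    """Unseen labels share one extra index, mimicking handleInvalid='keep'."""
    return mapping.get(value, float(len(mapping)))

states = ["CA", "NY", "CA", "TX", "CA", "NY"]
mapping = fit_string_indexer(states)

print(mapping)                        # {'CA': 0.0, 'NY': 1.0, 'TX': 2.0}
print(transform_keep(mapping, "ZZ"))  # 3.0
```

This is why sorting by the index column in descending order surfaces the rarest values first.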

## Step 4 -  Feature Vectors

In [None]:
from pyspark.ml.feature import VectorAssembler

## Create a feature vector using 'index' columns
feature_columns = ['CONTBR_ST_index', '???_index', '???_index' ]

assembler = VectorAssembler(inputCols= feature_columns,  outputCol="features")
feature_vector = assembler.transform(indexed_data)
feature_vector.printSchema()

feature_vector.limit(5).toPandas()
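
Conceptually, the assembler just concatenates the chosen columns, in the order given, into one numeric vector per row. A tiny plain-Python sketch of that step follows; the `assemble` helper and the index values are made up for illustration.

```python
def assemble(row, input_cols):
    """Build one feature vector from the named columns, preserving order."""
    return [float(row[col]) for col in input_cols]

# Made-up indexed row; column names follow the index columns shown in 3.1.
row = {
    "CAND_NM": "Candidate A",
    "CONTBR_ST_index": 4.0,
    "CONTBR_EMPLOYER_index": 17.0,
    "CONTBR_OCCUPATION_index": 2.0,
}
cols = ["CONTBR_ST_index", "CONTBR_EMPLOYER_index", "CONTBR_OCCUPATION_index"]

features = assemble(row, cols)
print(features)
```

The column order matters later, because feature importances are reported by position in this vector.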

## Step 5: Split data into training and test


In [None]:
# TODO : Split the data into training and test sets (30% held out for testing)
(training, test) = feature_vector.randomSplit([??? , ???])

print("training set = " , training.count())
print("testing set = " , test.count())
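
A random split assigns each row to one of the outputs at random according to the weights, so the resulting sizes are close to the requested proportions but not exact. The sketch below illustrates this in plain Python; `random_split` is a hypothetical helper, not the Spark method, and it normalizes the weights so they need not sum to 1.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)          # cumulative upper bound of each split
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, bound in enumerate(bounds):
            if x < bound:
                splits[i].append(row)
                break
        else:
            splits[-1].append(row)  # guard against floating-point rounding
    return splits

rows = list(range(10_000))
training, test = random_split(rows, [0.7, 0.3], seed=7)
print("training set = ", len(training))
print("testing set = ", len(test))
```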

## Step 6: Create Random Forest Model

In [None]:
from pyspark.ml.classification import RandomForestClassifier

## TODO : Create a random forest model
##        what is the 'labelCol' ?
rf = ???(labelCol="???", featuresCol="features", numTrees=20, maxBins=50000)
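
The large `maxBins` value is there because indexed string columns are typically treated as categorical features, and Spark's tree learners require `maxBins` to be at least the number of distinct categories in any such feature. The sketch below shows one way to estimate that lower bound; the rows and the `min_max_bins` helper are made up for illustration.

```python
def min_max_bins(rows, categorical_cols):
    """Largest number of distinct values across the categorical columns."""
    return max(len({row[col] for row in rows}) for col in categorical_cols)

# Made-up rows for illustration.
rows = [
    {"state": "NY", "employer": "ACME", "occupation": "ENGINEER"},
    {"state": "NY", "employer": "SELF", "occupation": "LAWYER"},
    {"state": "TX", "employer": "ACME", "occupation": "TEACHER"},
    {"state": "TX", "employer": "NONE", "occupation": "RETIRED"},
]

required = min_max_bins(rows, ["state", "employer", "occupation"])
print("maxBins must be at least", required)
```

On the real data, employer and occupation are free-text fields, which is why the bound is so much higher than the default.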


## Step 7: Train

In [None]:
%%time
print("training starting...")

## TODO : start training, using 'fit' method  on training data
model = rf.???(???)
print("training done")
print (model)

In [None]:
print("trained on {:,} records".format(training.count()))

## Step 8: Prediction

In [None]:
%%time

## TODO : predict on the 'test' data
##        use 'transform' method
predictions = model.???(???)

In [None]:
print("predicted on {:,} records".format(test.count()))

In [None]:
# Select example rows to display.
predictions.sample(False, 0.1).select("prediction", 'CAND_NM_index', "CAND_NM").show()

## Step 9: Evaluate

### 9.1 - Accuracy
**=> TODO: Think about the test error here. Does it seem high? What does that say about our model?**

**=> How do we define model success?**

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="CAND_NM_index", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Accuracy = %g" % accuracy)
print("Test Error = %g" % (1.0 - accuracy))
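
One way to put an accuracy number in context is to compare it with a majority-class baseline, that is, the accuracy of always predicting the most common label. The sketch below computes both in plain Python; the label and prediction lists are made up and do not come from the model above.

```python
from collections import Counter

# Made-up true labels and predictions for illustration.
labels = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
preds  = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 2.0]

# Accuracy: fraction of rows where the prediction equals the label.
accuracy = sum(l == p for l, p in zip(labels, preds)) / len(labels)

# Baseline: always predict the most frequent label.
majority_label, majority_count = Counter(labels).most_common(1)[0]
baseline = majority_count / len(labels)

print("Accuracy = %g" % accuracy)
print("Test Error = %g" % (1.0 - accuracy))
print("Majority-class baseline = %g" % baseline)
```

A model is only adding value to the extent that it beats this baseline, which matters when a few candidates receive most of the contributions.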


### 9.2 - Confusion Matrix

####  Figure Out Candidate Mapping

In [None]:
## Candidate Mapping
candidate_mapping = indexed_data.groupBy(['CAND_NM', 'CAND_NM_index']).count()
# candidate_mapping.orderBy('CAND_NM').show()
candidate_mapping.orderBy('CAND_NM_index').show()

#### Confusion Matrix

**=>What can you conclude from the confusion matrix?**

Use the list above to interpret the label.  

Is our model better at predicting candidates with many donations (Clinton, Sanders), or few donations?

What can you say about our model's performance?

In [None]:
predictions.groupBy('CAND_NM').pivot('prediction', range(0,22)).count().na.fill(0).orderBy('CAND_NM').show()
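
To read the pivot table: each row is an actual candidate, each column is a predicted index, and each cell counts how often that pair occurred. The plain-Python sketch below builds the same kind of table and derives per-class recall from it; the labels and predictions are made up for illustration.

```python
from collections import Counter

# Made-up actual labels and predictions.
labels = ["A", "A", "A", "B", "B", "C"]
preds  = ["A", "A", "B", "A", "B", "A"]

pairs = Counter(zip(labels, preds))          # (actual, predicted) -> count
classes = sorted(set(labels) | set(preds))

# Rows are actual classes, columns are predicted classes.
matrix = {actual: [pairs[(actual, p)] for p in classes] for actual in classes}

# Recall: of all rows that truly belong to a class, how many were predicted as it.
recall = {
    actual: pairs[(actual, actual)] / sum(matrix[actual])
    for actual in classes
    if sum(matrix[actual]) > 0
}

print("predicted ->", classes)
for actual in classes:
    print(actual, matrix[actual], "recall = %.2f" % recall[actual])
```

In this toy example the rare class "C" is never predicted correctly, which is the kind of pattern to look for in the real matrix.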

## Step 10: Print the feature importances

**=> TODO: Compare the relative weights of the feature importances.**

In [None]:
import pandas as pd

imp = model.featureImportances.toArray()
print(imp)
df = pd.DataFrame({'cols': feature_columns, 'importance':imp})
print(df)
df.sort_values(by=['importance'], ascending=False)
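
The importance vector is relative: each tree's importances are scaled to sum to 1, averaged across trees, and the result is scaled to sum to 1 again. The sketch below reproduces that normalization in plain Python as a rough model of the idea; the per-tree gain numbers and helper names are made up and are not output from the model above.

```python
def normalize(values):
    """Scale values so they sum to 1 (left unchanged if the total is 0)."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else list(values)

def forest_importances(per_tree_gains):
    """Combine normalized per-tree importances, then normalize the result."""
    n_features = len(per_tree_gains[0])
    summed = [0.0] * n_features
    for gains in per_tree_gains:
        for i, share in enumerate(normalize(gains)):
            summed[i] += share
    return normalize(summed)

# Made-up raw gains for 2 trees and 3 hypothetical features.
per_tree_gains = [
    [1.0, 3.0, 0.0],
    [2.0, 2.0, 4.0],
]

importances = forest_importances(per_tree_gains)
print(importances)  # [0.25, 0.5, 0.25]
```

Because the values sum to 1, they rank features against each other but say nothing about how accurate the model is overall.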

## Conclusion: Most important Fields

1. Employer
2. Occupation
3. State

Other fields not significant

**=> BONUS: Do a Pearson Correlation Matrix of the variables to the outcome, to see correlation**
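
As a starting point for the bonus, here is a minimal plain-Python sketch of the Pearson correlation between two numeric columns. The `pearson` helper and the toy data are made up for illustration. Keep in mind that the index values of a categorical column reflect frequency order rather than a meaningful scale, so a Pearson coefficient computed on them should be read with caution.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Toy columns: a feature index and an outcome index.
feature = [0.0, 1.0, 2.0, 3.0, 4.0]
outcome = [0.0, 0.0, 1.0, 1.0, 2.0]

print("correlation = %.3f" % pearson(feature, outcome))
```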



## BONUS : Running on full dataset

**Use the download script**

```bash
$ cd   ~/data/presidential_election_contribs
$ ./download-data.sh
```

This will download the full dataset.

As we run on a larger dataset, execution will take longer and the Jupyter notebook might time out. So let's run this in command-line / script mode.

Download the Jupyter notebook as a Python file (File --> Download as --> Python)

```bash
# run the downloaded python script as follows
$ time ~/spark/bin/spark-submit --master local[*] random-forest-2-election-classification.py 2> logs

```

Watch the output
