# Random Forests: Presidential Contributions

Let's look at a random forest model for the presidential contributions dataset.

This dataset contains presidential campaign contribution amounts drawn from publicly available information.

**The purpose here is to classify the candidate to whom a contributor is likely to contribute.**

Here are the feature columns we will use:
1. State 
2. Employer
3. Occupation

### Notes

It will be very difficult to get high accuracy on this dataset, because none of our features is highly correlated with the outcome. Part of our analysis is to see which features prove to be the most useful.

One might suspect that information like State would be very predictive, because presumably New Yorkers might contribute to Hillary Clinton and Texans might contribute to Donald Trump. However, it turns out that State is only weakly correlated with the outcome.

One nice thing about random forests is that, since each tree is trained on a "bagged" (bootstrap) sample and considers random subsets of the features, we can empirically see which variables have the most predictive power. This is helpful for analytical reasons.
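
To make the bagging idea concrete, here is a minimal pure-Python sketch of how each tree in a forest could receive its own bootstrap sample of rows and its own random subset of features. It is not Spark's implementation: the helper name `draw_tree_inputs`, the fixed seed, and the choice to draw the feature subset once per tree (real implementations typically redraw it at each split) are all assumptions made for illustration.

```python
import math
import random

def draw_tree_inputs(n_rows, feature_names, rng):
    """Return (row_indices, candidate_features) for one hypothetical tree."""
    # Bootstrap: sample row indices WITH replacement, same size as the data.
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Random feature subset of size ceil(sqrt(#features)); real implementations
    # typically redraw this subset at every split, not once per tree.
    k = math.ceil(math.sqrt(len(feature_names)))
    feats = rng.sample(feature_names, k)
    return rows, feats

rng = random.Random(42)  # fixed seed so the sketch is repeatable
features = ["State", "Employer", "Occupation"]
for tree_id in range(3):
    rows, feats = draw_tree_inputs(10, features, rng)
    print(tree_id, sorted(set(rows)), feats)
```

Because every tree sees different rows and different features, comparing how much each feature helps across the whole forest gives an empirical ranking of the features.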



In [2]:
# initialize Spark Session
import os
import sys
top_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
if top_dir not in sys.path:
    sys.path.append(top_dir)

from init_spark import init_spark
spark = init_spark()
spark

Initializing Spark...
Spark found in :  /home/ubuntu/spark
Spark config:
	 spark.app.name=TestApp
	spark.master=local[*]
	executor.memory=2g
	spark.sql.warehouse.dir=/tmp/tmpg7_jozi0
	some_property=some_value
Spark UI running on port 4040


## Step 1: Load the data

In [3]:
%%time

# 100k samples
data_file = '/data/presidential_election_contribs/2016/2016-100k.csv.gz'


data = spark.read.csv(data_file,
                      header=True, inferSchema=True)

CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 4.53 s


In [4]:
print("read {:,} records".format(data.count()))

read 100,000 records


In [5]:
data.printSchema()

root
 |-- CMTE_ID: string (nullable = true)
 |-- CAND_ID: string (nullable = true)
 |-- CAND_NM: string (nullable = true)
 |-- CONTBR_NM: string (nullable = true)
 |-- CONTBR_CITY: string (nullable = true)
 |-- CONTBR_ST: string (nullable = true)
 |-- CONTBR_ZIP: string (nullable = true)
 |-- CONTBR_EMPLOYER: string (nullable = true)
 |-- CONTBR_OCCUPATION: string (nullable = true)
 |-- CONTB_RECEIPT_AMT: double (nullable = true)
 |-- CONTB_RECEIPT_DT: string (nullable = true)
 |-- RECEIPT_DESC: string (nullable = true)
 |-- MEMO_CD: string (nullable = true)
 |-- MEMO_TEXT: string (nullable = true)
 |-- FORM_TP: string (nullable = true)
 |-- FILE_NUM: integer (nullable = true)
 |-- TRAN_ID: string (nullable = true)
 |-- ELECTION_TP: string (nullable = true)



In [6]:
## data.show() is hard to read
## use Pandas to pretty print

## vertical
## TODO : 'toPandas'
data.limit(3).toPandas().T

# horizontal
# data.limit(5).toPandas()

Unnamed: 0,0,1,2
CMTE_ID,C00605568,C00574624,C00580100
CAND_ID,P20002671,P60006111,P80001571
CAND_NM,"Johnson, Gary","Cruz, Rafael Edward 'Ted'","Trump, Donald J."
CONTBR_NM,"SMITH, PAUL","BROWNE, THOMAS JOHN","RISENHOOVER, LINDSEY"
CONTBR_CITY,SAN DIEGO,WHITESBORO,TULSA
CONTBR_ST,CA,NY,OK
CONTBR_ZIP,92117,134921106,74133
CONTBR_EMPLOYER,SELF,RETIRED,INFORMATION REQUESTED
CONTBR_OCCUPATION,RETIRED,RETIRED,INFORMATION REQUESTED
CONTB_RECEIPT_AMT,150,35,73.59


### 1.5 - Sample Data
Start with a small sample of the data. Once the algorithm is working, process the full dataset.


In [40]:
## TODO : set sample rate, start with 0.1
# sample size :  10% --> 0.1,   100%  -> 1.0
sample_size = 0.1

data = data.sample(withReplacement=False, fraction=sample_size)
print("sample size {:,} records".format(data.count()))

sample size 10,056 records
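
A count of 10,056 out of 100,000 records is what a sampling fraction of about 0.1 would produce. Sampling without replacement in Spark keeps each row independently with the given probability, so the result is close to, but not exactly, the requested fraction. The sketch below imitates that behavior in plain Python; the helper `bernoulli_sample` and the seed are assumptions for illustration, not Spark code.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(100_000))          # stand-in for the 100k records
sampled = bernoulli_sample(rows, 0.1, seed=1)
# The size lands close to, but rarely exactly on, 10,000.
print("sample size {:,} records".format(len(sampled)))
```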


## Step 2 : Clean Data

### 2.1 - extract only a few columns

In [41]:
## TODO : Select these columns 
## Hint : 'CAND_NM', 'CONTBR_ST', 'CONTBR_EMPLOYER', 'CONTBR_OCCUPATION', 'CONTB_RECEIPT_AMT'

columns = ['CAND_NM', 'CONTBR_ST', 'CONTBR_EMPLOYER', 'CONTBR_OCCUPATION', 'CONTB_RECEIPT_AMT']

In [43]:
data2 = data.select(columns)
data2.printSchema()

data2.limit(5).toPandas()

root
 |-- CAND_NM: string (nullable = true)
 |-- CONTBR_ST: string (nullable = true)
 |-- CONTBR_EMPLOYER: string (nullable = true)
 |-- CONTBR_OCCUPATION: string (nullable = true)
 |-- CONTB_RECEIPT_AMT: double (nullable = true)



Unnamed: 0,CAND_NM,CONTBR_ST,CONTBR_EMPLOYER,CONTBR_OCCUPATION,CONTB_RECEIPT_AMT
0,"Johnson, Gary",CA,SELF,RETIRED,150.0
1,"Clinton, Hillary Rodham",NY,"SAP, INC",PARTNER,38.0
2,"Clinton, Hillary Rodham",PA,REGIONAL LEARNING ALLIANCE,CEO,25.0
3,"Clinton, Hillary Rodham",MD,SPACE TELESCOPE SCIENCE INSTITUTE,ASTROPHYSICIST,50.0
4,"Sanders, Bernard",FL,"DOYOUREMEMBER, INC.",DIRECTOR OF OPERATIONS,15.0


### 2.2 - Clean data (drop null values)

In [44]:
## TODO : drop any null values
## Hint : na
data_clean = data2.na.drop()

print("original data size = {:,}".format(data2.count()) )
print("clean data size = {:,}".format(data_clean.count()) )
print("dropped records = {:,}".format(data2.count() - data_clean.count()) )

original data size = 10,056
clean data size = 9,964
dropped records = 92


## Step 2 : Basic Exploration

### 2.1 - Print out a contribution count broken down by candidate

**=> Which candidates got the most donations? (in terms of number of contributions)**

In [11]:
## TODO : print out per candidate breakdown
## Hint : group by 'CAND_NM'  and order by 'count'
data_clean.groupBy('CAND_NM').count().orderBy('count', ascending=False).show(20, False)

+-------------------------+-----+
|CAND_NM                  |count|
+-------------------------+-----+
|Clinton, Hillary Rodham  |4686 |
|Sanders, Bernard         |2766 |
|Trump, Donald J.         |1072 |
|Cruz, Rafael Edward 'Ted'|735  |
|Carson, Benjamin S.      |348  |
|Rubio, Marco             |138  |
|Bush, Jeb                |40   |
|Paul, Rand               |38   |
|Kasich, John R.          |33   |
|Fiorina, Carly           |31   |
|Johnson, Gary            |24   |
|Walker, Scott            |16   |
|O'Malley, Martin Joseph  |11   |
|Stein, Jill              |10   |
|Huckabee, Mike           |6    |
|McMullin, Evan           |4    |
|Christie, Christopher J. |3    |
|Graham, Lindsey O.       |2    |
|Jindal, Bobby            |1    |
+-------------------------+-----+



### 2.2 - find min/max/average contribution per candidate

In [45]:
from pyspark.sql.functions import min,max,mean

## TODO : what column represents contribution amount?
data_clean.groupBy('CAND_NM').\
        agg(min('CONTB_RECEIPT_AMT'), mean('CONTB_RECEIPT_AMT'), max('CONTB_RECEIPT_AMT')).\
        orderBy('CAND_NM').\
        show(40, False)

+-------------------------+----------------------+----------------------+----------------------+
|CAND_NM                  |min(CONTB_RECEIPT_AMT)|avg(CONTB_RECEIPT_AMT)|max(CONTB_RECEIPT_AMT)|
+-------------------------+----------------------+----------------------+----------------------+
|Bush, Jeb                |-350.0                |1514.5                |5400.0                |
|Carson, Benjamin S.      |-488.0                |143.26902298850575    |5400.0                |
|Christie, Christopher J. |400.0                 |1933.3333333333333    |2700.0                |
|Clinton, Hillary Rodham  |-2700.0               |112.35318822023045    |2700.0                |
|Cruz, Rafael Edward 'Ted'|-5400.0               |88.00919727891157     |10800.0               |
|Fiorina, Carly           |5.0                   |207.06451612903226    |2700.0                |
|Graham, Lindsey O.       |-2700.0               |-1100.0               |500.0                 |
|Huckabee, Mike           |1.0

### 2.3 - Whoa!  Negative Contributions!
We see some negative contributions!   

**Q==> Can you guys figure out why?**


In [46]:
## TODO : Filter out only positive contribs
## Hint : use filter(condition)
## Hint : condition : "CONTB_RECEIPT_AMT > 0"
pos_contribs = data_clean.filter("CONTB_RECEIPT_AMT > 0")

print("original data size = {:,}".format(data_clean.count()) )
print("positive contributions data size = {:,}".format(pos_contribs.count()) )

original data size = 9,964
positive contributions data size = 9,891


### 2.4 - now find min/mean/max in positive contributions

In [47]:
from pyspark.sql.functions import min,max,mean

print ("sorted by CAND_NM")

pos_contribs.groupBy('CAND_NM').\
        agg(min('CONTB_RECEIPT_AMT'), mean('CONTB_RECEIPT_AMT'), max('CONTB_RECEIPT_AMT')).\
        orderBy('CAND_NM').\
        show(40, False)

sorted by CAND_NM
+-------------------------+----------------------+----------------------+----------------------+
|CAND_NM                  |min(CONTB_RECEIPT_AMT)|avg(CONTB_RECEIPT_AMT)|max(CONTB_RECEIPT_AMT)|
+-------------------------+----------------------+----------------------+----------------------+
|Bush, Jeb                |5.0                   |1562.3076923076924    |5400.0                |
|Carson, Benjamin S.      |2.0                   |151.35182352941177    |5400.0                |
|Christie, Christopher J. |400.0                 |1933.3333333333333    |2700.0                |
|Clinton, Hillary Rodham  |0.64                  |113.43495512820517    |2700.0                |
|Cruz, Rafael Edward 'Ted'|1.0                   |120.09557122708041    |10800.0               |
|Fiorina, Carly           |5.0                   |207.06451612903226    |2700.0                |
|Graham, Lindsey O.       |500.0                 |500.0                 |500.0                 |
|Huckabee, M

In [48]:
from pyspark.sql.functions import min,max,mean

print("sorted by AVG contribution")

pos_contribs.groupBy('CAND_NM').\
        agg(min('CONTB_RECEIPT_AMT'), mean('CONTB_RECEIPT_AMT'), max('CONTB_RECEIPT_AMT')).\
        orderBy('avg(CONTB_RECEIPT_AMT)', ascending=False).\
        show(40, False)

sorted by AVG contribution
+-------------------------+----------------------+----------------------+----------------------+
|CAND_NM                  |min(CONTB_RECEIPT_AMT)|avg(CONTB_RECEIPT_AMT)|max(CONTB_RECEIPT_AMT)|
+-------------------------+----------------------+----------------------+----------------------+
|Christie, Christopher J. |400.0                 |1933.3333333333333    |2700.0                |
|Bush, Jeb                |5.0                   |1562.3076923076924    |5400.0                |
|O'Malley, Martin Joseph  |5.0                   |1192.7272727272727    |5400.0                |
|Walker, Scott            |25.0                  |607.8125              |5400.0                |
|Kasich, John R.          |25.0                  |576.1212121212121     |2700.0                |
|Graham, Lindsey O.       |500.0                 |500.0                 |500.0                 |
|Rubio, Marco             |5.0                   |410.6287878787879     |2700.0                |
|Ji

### 2.5 -- Find total contribution amount per candidate

In [16]:
from pyspark.sql.functions import min,max,mean

print("sorted by total contribution")

pos_contribs.groupBy('CAND_NM').\
        sum('CONTB_RECEIPT_AMT').\
        orderBy('sum(CONTB_RECEIPT_AMT)', ascending=False).\
        show(40, False)

sorted by total contribution
+-------------------------+----------------------+
|CAND_NM                  |sum(CONTB_RECEIPT_AMT)|
+-------------------------+----------------------+
|Clinton, Hillary Rodham  |530875.5900000002     |
|Trump, Donald J.         |173825.92999999993    |
|Sanders, Bernard         |134479.5299999999     |
|Cruz, Rafael Edward 'Ted'|85147.76000000001     |
|Bush, Jeb                |60930.0               |
|Rubio, Marco             |54203.0               |
|Carson, Benjamin S.      |51459.62              |
|Kasich, John R.          |19012.0               |
|O'Malley, Martin Joseph  |13120.0               |
|Walker, Scott            |9725.0                |
|Fiorina, Carly           |6419.0                |
|Paul, Rand               |6077.279999999999     |
|Christie, Christopher J. |5800.0                |
|Johnson, Gary            |5147.75               |
|Stein, Jill              |1192.0                |
|McMullin, Evan           |655.0                 |
|G

## Step 3: Build Indexers

In [49]:
from pyspark.ml.feature import StringIndexer

## TODO build indexers for following categorical columns
## CAND_NM,   CONTBR_ST,  CONTBR_EMPLOYER,  CONTBR_OCCUPATION

indexer1 = StringIndexer(inputCol='CAND_NM', outputCol = "CAND_NM_index", handleInvalid="keep")
indexer2 = StringIndexer(inputCol='CONTBR_ST', outputCol = "CONTBR_ST_index", handleInvalid="keep")
indexer3 = StringIndexer(inputCol='CONTBR_EMPLOYER', outputCol = "CONTBR_EMPLOYER_index", handleInvalid="keep")
indexer4 = StringIndexer(inputCol='CONTBR_OCCUPATION', outputCol = "CONTBR_OCCUPATION_index", handleInvalid="keep")


In [50]:
## Stash indexers into a pipeline
from pyspark.ml import Pipeline

## TODO : add all indexers into stages
pipeline = Pipeline(stages=[indexer1, indexer2, indexer3, indexer4])
print(pipeline)

Pipeline_97f56de8a381


In [51]:
%%time
## TODO : fit and transform 'pos_contribs'  through pipeline
indexed_data = pipeline.fit(pos_contribs).transform(pos_contribs)

CPU times: user 36 ms, sys: 12 ms, total: 48 ms
Wall time: 2.22 s


In [52]:
indexed_data.printSchema()
# indexed_data.show()

root
 |-- CAND_NM: string (nullable = true)
 |-- CONTBR_ST: string (nullable = true)
 |-- CONTBR_EMPLOYER: string (nullable = true)
 |-- CONTBR_OCCUPATION: string (nullable = true)
 |-- CONTB_RECEIPT_AMT: double (nullable = true)
 |-- CAND_NM_index: double (nullable = false)
 |-- CONTBR_ST_index: double (nullable = false)
 |-- CONTBR_EMPLOYER_index: double (nullable = false)
 |-- CONTBR_OCCUPATION_index: double (nullable = false)



### 3.1  Understand indexed values

In [22]:
# state
indexed_data.groupBy(['CONTBR_ST', 'CONTBR_ST_index']).count()\
            .orderBy('CONTBR_ST_index', ascending=False).show()

+---------+---------------+-----+
|CONTBR_ST|CONTBR_ST_index|count|
+---------+---------------+-----+
|       VI|           59.0|    1|
|       AS|           58.0|    1|
|       AA|           57.0|    1|
|       EN|           56.0|    1|
|       GU|           55.0|    2|
|       AP|           54.0|    3|
|       AE|           53.0|    3|
|       PR|           52.0|   10|
|       ND|           51.0|   12|
|       ZZ|           50.0|   12|
|       SD|           49.0|   13|
|       WY|           48.0|   16|
|       DE|           47.0|   25|
|       AK|           46.0|   27|
|       MS|           45.0|   28|
|       RI|           44.0|   31|
|       NE|           43.0|   31|
|       WV|           42.0|   33|
|       MT|           41.0|   36|
|       UT|           40.0|   44|
+---------+---------------+-----+
only showing top 20 rows



In [23]:
# employer
indexed_data.groupBy(['CONTBR_EMPLOYER', 'CONTBR_EMPLOYER_index']).count()\
            .orderBy('CONTBR_EMPLOYER_index', ascending=False).show()

+--------------------+---------------------+-----+
|     CONTBR_EMPLOYER|CONTBR_EMPLOYER_index|count|
+--------------------+---------------------+-----+
|       NW FORWARDING|               4159.0|    1|
|       KATTEN MUCHIN|               4158.0|    1|
|                ALTL|               4157.0|    1|
|LIGHT N LIVELY PO...|               4156.0|    1|
|   PENTA CORPORATION|               4155.0|    1|
|              RE/MAX|               4154.0|    1|
|        MOORE CENTER|               4153.0|    1|
|          RITZ-CRAFT|               4152.0|    1|
|HANDS ON CHILDREN...|               4151.0|    1|
| OAKMONT CORPORATION|               4150.0|    1|
|INDIANA UNIVERSIT...|               4149.0|    1|
|     PETTIS BUILDERS|               4148.0|    1|
|ENTERTAINMENT PAR...|               4147.0|    1|
|                  MS|               4146.0|    1|
|     CATHOLIC SCHOOL|               4145.0|    1|
|CITY OF POMPANO B...|               4144.0|    1|
|                  AH|         

In [24]:
# occupation
indexed_data.groupBy(['CONTBR_OCCUPATION', 'CONTBR_OCCUPATION_index']).count()\
            .orderBy('CONTBR_OCCUPATION_index', ascending=False).show(10, False)

+-------------------------+-----------------------+-----+
|CONTBR_OCCUPATION        |CONTBR_OCCUPATION_index|count|
+-------------------------+-----------------------+-----+
|INVENTORY                |2459.0                 |1    |
|FARM LAND OWNER          |2458.0                 |1    |
|SENIOR AIDE              |2457.0                 |1    |
|ESTATE MANAGER           |2456.0                 |1    |
|PROFESSOR, LEGAL STUDIES |2455.0                 |1    |
|SENIOR DIRECTOR          |2454.0                 |1    |
|SCULPTURE TECHNICIAN     |2453.0                 |1    |
|DIETITIAN/HEALTH EDUCATOR|2452.0                 |1    |
|FULL TIM GRANDPA         |2451.0                 |1    |
|IT ADMINISTRATOR         |2450.0                 |1    |
+-------------------------+-----------------------+-----+
only showing top 10 rows
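
The tables above show the rarest values receiving the largest indices, which is consistent with `StringIndexer`'s default frequency-descending ordering (the most frequent label gets index 0). The following pure-Python sketch imitates that ordering and the `handleInvalid="keep"` behavior, where unseen labels go into one extra bucket. The helper names and the alphabetical tie-break are assumptions for illustration; this is not Spark's code.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index, most frequent label first (ties: alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_keep(mapping, value):
    """Mimic handleInvalid="keep": unseen labels share one extra index."""
    return mapping.get(value, float(len(mapping)))

states = ["CA", "NY", "CA", "TX", "CA", "NY", "VI"]
mapping = fit_string_index(states)
print(mapping)                        # {'CA': 0.0, 'NY': 1.0, 'TX': 2.0, 'VI': 3.0}
print(transform_keep(mapping, "ZZ"))  # 4.0
```

Note that these indices are only frequency ranks. The numeric distance between two indices says nothing about how similar the two employers or occupations are.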



## Step 4 -  Feature Vectors

In [53]:
from pyspark.ml.feature import VectorAssembler

## Create a feature vector using 'index' columns
feature_columns = ['CONTBR_ST_index', 'CONTBR_OCCUPATION_index', 'CONTBR_EMPLOYER_index' ]

assembler = VectorAssembler(inputCols= feature_columns,  outputCol="features")
feature_vector = assembler.transform(indexed_data)
feature_vector.printSchema()

feature_vector.limit(5).toPandas()

root
 |-- CAND_NM: string (nullable = true)
 |-- CONTBR_ST: string (nullable = true)
 |-- CONTBR_EMPLOYER: string (nullable = true)
 |-- CONTBR_OCCUPATION: string (nullable = true)
 |-- CONTB_RECEIPT_AMT: double (nullable = true)
 |-- CAND_NM_index: double (nullable = false)
 |-- CONTBR_ST_index: double (nullable = false)
 |-- CONTBR_EMPLOYER_index: double (nullable = false)
 |-- CONTBR_OCCUPATION_index: double (nullable = false)
 |-- features: vector (nullable = true)



Unnamed: 0,CAND_NM,CONTBR_ST,CONTBR_EMPLOYER,CONTBR_OCCUPATION,CONTB_RECEIPT_AMT,CAND_NM_index,CONTBR_ST_index,CONTBR_EMPLOYER_index,CONTBR_OCCUPATION_index,features
0,"Johnson, Gary",CA,SELF,RETIRED,150.0,10.0,0.0,7.0,0.0,"[0.0, 0.0, 7.0]"
1,"Clinton, Hillary Rodham",NY,"SAP, INC",PARTNER,38.0,0.0,1.0,1142.0,325.0,"[1.0, 325.0, 1142.0]"
2,"Clinton, Hillary Rodham",PA,REGIONAL LEARNING ALLIANCE,CEO,25.0,0.0,7.0,3124.0,22.0,"[7.0, 22.0, 3124.0]"
3,"Clinton, Hillary Rodham",MD,SPACE TELESCOPE SCIENCE INSTITUTE,ASTROPHYSICIST,50.0,0.0,8.0,3799.0,2402.0,"[8.0, 2402.0, 3799.0]"
4,"Sanders, Bernard",FL,"DOYOUREMEMBER, INC.",DIRECTOR OF OPERATIONS,15.0,1.0,3.0,685.0,146.0,"[3.0, 146.0, 685.0]"
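
Note that the `features` vector follows the order of `feature_columns` (state, occupation, employer), not the order of the columns in the table. As a rough illustration of what the assembler does to each row, here is a plain-Python sketch; the `assemble` helper is hypothetical and only mirrors the concatenation step.

```python
def assemble(row, input_cols):
    """Concatenate the named numeric columns into one list, in the given order."""
    return [float(row[col]) for col in input_cols]

feature_columns = ["CONTBR_ST_index", "CONTBR_OCCUPATION_index", "CONTBR_EMPLOYER_index"]

# Row 1 of the table above (Clinton, NY, "SAP, INC", PARTNER)
row = {"CONTBR_ST_index": 1.0, "CONTBR_EMPLOYER_index": 1142.0, "CONTBR_OCCUPATION_index": 325.0}
print(assemble(row, feature_columns))  # [1.0, 325.0, 1142.0]
```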


## Step 5: Split data into training and test


In [54]:
# TODO : Split the data into training and test sets (30% held out for testing)
(training, test) = feature_vector.randomSplit([70. , 30.])

print("training set = " , training.count())
print("testing set = " , test.count())

training set =  6940
testing set =  2951
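
The weights `[70., 30.]` do not sum to 1; `randomSplit` normalizes them, and because rows are assigned at random the split sizes are only approximately 70% / 30%. The sketch below shows the idea in plain Python. The `random_split` helper and the seed value are assumptions for illustration. In PySpark itself, passing the optional `seed` argument to `randomSplit` is the usual way to make a split repeatable.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split with probability proportional to its weight."""
    total = sum(weights)
    cumulative, running = [], 0.0
    for w in weights:
        running += w / total          # normalize so the bounds end at 1.0
        cumulative.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(cumulative):
            # The last split also catches any floating-point shortfall.
            if u < bound or i == len(weights) - 1:
                splits[i].append(row)
                break
    return splits

train, test_rows = random_split(range(9891), [70.0, 30.0], seed=7)
print(len(train), len(test_rows))  # roughly 70% / 30%; exact sizes vary with the seed
```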


## Step 6: Create Random Forest Model

In [55]:
from pyspark.ml.classification import RandomForestClassifier

## TODO : Create a random forest model
##        what is the 'labelCol' ?  (Hint : CAND_NM_index)
rf = RandomForestClassifier(labelCol="CAND_NM_index", featuresCol="features", numTrees=20, maxBins=50000)
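
A note on `maxBins=50000`: Spark's tree learners require `maxBins` to be at least as large as the number of categories in any categorical feature, and the default of 32 is far too small for thousands of distinct employers. The sketch below estimates the minimum value from category counts. The counts are read off the index tables in section 3.1 (largest index + 1) and will vary with the sample, and the helper `min_max_bins` is hypothetical.

```python
def min_max_bins(category_counts, default=32):
    """Smallest maxBins that covers every categorical feature (never below the default)."""
    return max(default, max(category_counts.values()))

# Distinct-value counts read off the index tables in section 3.1
# (largest index + 1); an assumption, since these vary with the sample.
category_counts = {"CONTBR_ST": 60, "CONTBR_EMPLOYER": 4160, "CONTBR_OCCUPATION": 2460}
print(min_max_bins(category_counts))  # 4160
```

With `handleInvalid="keep"` each indexer can add one extra bucket for unseen values, so it is reasonable to leave some headroom above this minimum, as the value used here does.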


## Step 7: Train

In [56]:
%%time
print("training starting...")

## TODO : start training, using 'fit' method  on training data
model = rf.fit(training)
print("training done")
print (model)

training starting...
training done
RandomForestClassificationModel (uid=RandomForestClassifier_878df6ba303e) with 20 trees
CPU times: user 20 ms, sys: 0 ns, total: 20 ms
Wall time: 39.9 s


In [31]:
print("trained on {:,} records".format(training.count()))

trained on 6,932 records


## Step 8: Prediction

In [57]:
%%time

## TODO : predict on 'test' data
##        use 'transform' method
predictions = model.transform(test)

CPU times: user 8 ms, sys: 4 ms, total: 12 ms
Wall time: 62.4 ms


In [33]:
print("predicted on {:,} records".format(test.count()))

predicted on 2,959 records


In [58]:
# Select example rows to display.
predictions.sample(False, 0.1).select("prediction", 'CAND_NM_index', "CAND_NM").show()

+----------+-------------+--------------------+
|prediction|CAND_NM_index|             CAND_NM|
+----------+-------------+--------------------+
|       2.0|          4.0| Carson, Benjamin S.|
|       2.0|          4.0| Carson, Benjamin S.|
|       2.0|          4.0| Carson, Benjamin S.|
|       2.0|          4.0| Carson, Benjamin S.|
|       2.0|          4.0| Carson, Benjamin S.|
|       2.0|          4.0| Carson, Benjamin S.|
|       2.0|          4.0| Carson, Benjamin S.|
|       2.0|          4.0| Carson, Benjamin S.|
|       2.0|          4.0| Carson, Benjamin S.|
|       3.0|          4.0| Carson, Benjamin S.|
|       0.0|          0.0|Clinton, Hillary ...|
|       0.0|          0.0|Clinton, Hillary ...|
|       1.0|          0.0|Clinton, Hillary ...|
|       2.0|          0.0|Clinton, Hillary ...|
|       0.0|          0.0|Clinton, Hillary ...|
|       1.0|          0.0|Clinton, Hillary ...|
|       2.0|          0.0|Clinton, Hillary ...|
|       2.0|          0.0|Clinton, Hilla

## Step 9: Evaluate

In [59]:
predictions_test = model.transform(test)
predictions_train = model.transform(training)

### 9.1 - Accuracy
**=> TODO: Think about the test error here. Does it seem high? What does that say about our model?**

**=> How do we define model success?**

In [60]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="CAND_NM_index", predictionCol="prediction",
                                              metricName="accuracy")

print("Training set accuracy = " , evaluator.evaluate(predictions_train))
print("Test set accuracy = " , evaluator.evaluate(predictions_test))

Training set accuracy =  0.6469740634005764
Test set accuracy =  0.5296509657743138


### 9.2 - Confusion Matrix

####  Figure Out Candidate Mapping

In [61]:
## Candidate Mapping
candidate_mapping = indexed_data.groupBy(['CAND_NM', 'CAND_NM_index']).count()
# candidate_mapping.orderBy('CAND_NM').show()
candidate_mapping.orderBy('CAND_NM_index').show()

+--------------------+-------------+-----+
|             CAND_NM|CAND_NM_index|count|
+--------------------+-------------+-----+
|Clinton, Hillary ...|          0.0| 4680|
|    Sanders, Bernard|          1.0| 2766|
|    Trump, Donald J.|          2.0| 1047|
|Cruz, Rafael Edwa...|          3.0|  709|
| Carson, Benjamin S.|          4.0|  340|
|        Rubio, Marco|          5.0|  132|
|           Bush, Jeb|          6.0|   39|
|          Paul, Rand|          7.0|   38|
|     Kasich, John R.|          8.0|   33|
|      Fiorina, Carly|          9.0|   31|
|       Johnson, Gary|         10.0|   24|
|       Walker, Scott|         11.0|   16|
|O'Malley, Martin ...|         12.0|   11|
|         Stein, Jill|         13.0|   10|
|      Huckabee, Mike|         14.0|    6|
|      McMullin, Evan|         15.0|    4|
|Christie, Christo...|         16.0|    3|
|       Jindal, Bobby|         17.0|    1|
|  Graham, Lindsey O.|         18.0|    1|
+--------------------+-------------+-----+
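
One way to put the test accuracy of about 0.53 in context is a majority-class baseline: always predict the most common candidate. The sketch below computes it from the counts in the mapping table above. These counts cover all positive contributions rather than the test split alone, so the baseline for the test set is only approximate.

```python
# Per-candidate counts from the mapping table above (top rows plus the remainder).
counts = {
    "Clinton, Hillary Rodham": 4680,
    "Sanders, Bernard": 2766,
    "Trump, Donald J.": 1047,
    "Cruz, Rafael Edward 'Ted'": 709,
    "all other candidates": 689,   # sum of the remaining 15 rows
}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total
print(majority_label, round(baseline_accuracy, 4))  # Clinton, Hillary Rodham 0.4732
```

Seen this way, the model improves on the trivial predictor by only a few percentage points.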



#### Confusion Matrix

**=> What can you conclude from the confusion matrix?**

Use the list above to interpret the labels.

Is our model better at predicting candidates with many donations (Clinton, Sanders), or few donations?

What can you say about our model performance?

In [62]:
cm = predictions.groupBy('CAND_NM').pivot('prediction', range(0,22)).count().na.fill(0).orderBy('CAND_NM')
cm.toPandas()

Unnamed: 0,CAND_NM,0,1,2,3,4,5,6,7,8,...,12,13,14,15,16,17,18,19,20,21
0,"Bush, Jeb",2,0,6,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Carson, Benjamin S.",10,9,67,10,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Christie, Christopher J.",0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Clinton, Hillary Rodham",857,118,389,17,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Cruz, Rafael Edward 'Ted'",37,33,148,16,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,"Fiorina, Carly",3,3,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,"Huckabee, Mike",0,0,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,"Johnson, Gary",3,1,3,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,"Kasich, John R.",2,1,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,"O'Malley, Martin Joseph",1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
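
One way to read the matrix is per-candidate recall: the share of a candidate's test rows that land in that candidate's own prediction column. The sketch below does this for three rows, using only the counts visible above. It assumes that the columns elided as `...` are zero, which the output does not confirm.

```python
# Visible counts from the confusion matrix above: true candidate -> {predicted index: count}.
# ASSUMPTION: the columns elided as "..." are treated as zero.
rows = {
    "Clinton, Hillary Rodham":   {0: 857, 1: 118, 2: 389, 3: 17},
    "Cruz, Rafael Edward 'Ted'": {0: 37, 1: 33, 2: 148, 3: 16, 4: 1},
    "Carson, Benjamin S.":       {0: 10, 1: 9, 2: 67, 3: 10},
}
# Label index of each candidate, from the candidate mapping table.
label_index = {
    "Clinton, Hillary Rodham": 0,
    "Cruz, Rafael Edward 'Ted'": 3,
    "Carson, Benjamin S.": 4,
}

def recall(name):
    """Fraction of a candidate's test rows that were predicted correctly."""
    predicted = rows[name]
    return predicted.get(label_index[name], 0) / sum(predicted.values())

for name in rows:
    print("{:<28} recall = {:.3f}".format(name, recall(name)))
```

Under that assumption, recall is about 0.621 for Clinton, 0.068 for Cruz, and 0.000 for Carson, which suggests the forest mostly predicts the few most frequent classes.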


## Step 10 -  Print the feature importances

**=> TODO: Compare the relative weights of the feature importances.**

In [63]:
import pandas as pd

imp = model.featureImportances.toArray()
print(imp)
df = pd.DataFrame({'cols': feature_columns, 'importance':imp})
print(df)
df.sort_values(by=['importance'], ascending=False)

[0.02851933 0.32975983 0.64172084]
                      cols  importance
0          CONTBR_ST_index    0.028519
1  CONTBR_OCCUPATION_index    0.329760
2    CONTBR_EMPLOYER_index    0.641721


Unnamed: 0,cols,importance
2,CONTBR_EMPLOYER_index,0.641721
1,CONTBR_OCCUPATION_index,0.32976
0,CONTBR_ST_index,0.028519
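
To compare the weights directly, the sketch below ranks the importances printed above and expresses each one relative to the strongest feature. The importances sum to 1 because Spark normalizes them. One caveat, offered as a general observation rather than a finding from this notebook: impurity-based importances tend to favor features with many distinct values, and employer has thousands of them, so the ranking should be read with some caution.

```python
feature_columns = ["CONTBR_ST_index", "CONTBR_OCCUPATION_index", "CONTBR_EMPLOYER_index"]
importances = [0.02851933, 0.32975983, 0.64172084]  # values printed above

ranked = sorted(zip(feature_columns, importances), key=lambda pair: pair[1], reverse=True)
top_name, top_value = ranked[0]
for name, value in ranked:
    # Express each importance relative to the strongest feature.
    print("{:<24} {:.3f}  ({:.0%} of {})".format(name, value, value / top_value, top_name))

print("sum =", round(sum(importances), 6))  # sum = 1.0
```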


## Conclusion: Most important Fields

1. Employer
2. Occupation
3. State

Other fields were not included as features in this model, so their significance was not measured here.

**=> TODO: Compare the relative weights of the feature importances.**

**=> BONUS: Do a Pearson Correlation Matrix of the variables to the outcome, to see correlation**
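
As a starting point for the bonus, here is a minimal pure-Python sketch of the Pearson coefficient for one feature against the label. The sample values are the five rows shown in Step 4, so the result is illustrative only. Keep in mind that the indexed columns are frequency ranks of nominal categories, so a Pearson coefficient on them is at best a rough signal. For the full dataset, Spark's `pyspark.ml.stat.Correlation` is likely the more practical tool.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# CONTBR_ST_index and CAND_NM_index for the five rows shown in Step 4.
state_index = [0.0, 1.0, 7.0, 8.0, 3.0]
cand_index = [10.0, 0.0, 0.0, 0.0, 1.0]
print(round(pearson(state_index, cand_index), 3))
```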



## BONUS : Running on full dataset

**Use the download script**

```bash
$ cd   ~/data/presidential_election_contribs
$ ./download-data.sh
```

This will download the full dataset.

As we run on a larger dataset, execution will take longer and the Jupyter notebook might time out. So let's run this in command-line / script mode.

Download the Jupyter notebook as a Python file (File --> Download as --> Python).

```bash
# run the downloaded python script as follows
$    time  ~/spark/bin/spark-submit    --master local[*]  random-forest-2-election-classification.py 2> logs

```

Watch the output. Note that `2> logs` sends Spark's stderr log messages to the `logs` file.
