# Random Forests: Presidential Contributions

Let's build a random forest model for the presidential contributions dataset.

This dataset lists presidential campaign contribution amounts, compiled from publicly available information.

**The purpose here is to classify the candidate to whom a contributor is likely to contribute.**  

Here are the feature columns we will use:
1. State 
2. Employer
3. Occupation

### Notes

It will be very difficult to get high accuracy on this dataset, because none of our features is highly correlated with the outcome. Part of our analysis is to see which features prove most useful.

One might suspect that a feature like State would be very predictive, since presumably New Yorkers might contribute to Hillary Clinton and Texans to Donald Trump. However, it turns out that State is only weakly correlated with the outcome.

One nice thing about random forests is that, because each tree is grown on a "bagged" (bootstrap) sample of the data and considers only a random subset of features when splitting, we can empirically see which variables have the most predictive power. This is helpful for analytical reasons.
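
To make the bagging idea concrete, here is a minimal pure-Python sketch of how a forest might prepare the inputs for each tree: a bootstrap sample of row indices plus a random subset of features. This is not Spark's implementation. The helper name `plan_trees`, the fixed seed, and drawing one feature subset per tree are assumptions made for illustration; real implementations typically choose the feature subset at each split rather than once per tree.

```python
import random

def plan_trees(n_rows, feature_names, n_trees, features_per_tree, seed=42):
    """Return, for each tree, its bootstrap row indices and feature subset."""
    rng = random.Random(seed)
    plans = []
    for _ in range(n_trees):
        # Bootstrap: draw n_rows indices with replacement.
        rows = [rng.randrange(n_rows) for _ in range(n_rows)]
        # Random feature subset, drawn without replacement.
        features = rng.sample(feature_names, features_per_tree)
        plans.append((rows, features))
    return plans

features = ["State", "Employer", "Occupation"]
plans = plan_trees(n_rows=10, feature_names=features, n_trees=3, features_per_tree=2)
for rows, feats in plans:
    print(sorted(rows), feats)
```

Because every tree sees a different mix of rows and features, the forest can compare how much each feature helps across many trees, which is where the feature importance scores come from.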



In [1]:
# initialize Spark Session
import os
import sys
top_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
if top_dir not in sys.path:
    sys.path.append(top_dir)

from init_spark import init_spark
spark = init_spark()
spark

Initializing Spark...
Spark found in :  /Users/sujee/spark
Spark config:
	 spark.app.name=TestApp
	spark.master=local[*]
	executor.memory=2g
	spark.sql.warehouse.dir=/var/folders/lp/qm_skljd2hl4xtps5vw0tdgm0000gn/T/tmphyk5403s
	some_property=some_value
Spark UI running on port 4040


## Step 1: Load the data

In [2]:
%%time

# 100k samples
data_file = '/data/presidential_election_contribs/2016/2016-100k.csv.gz'


data = spark.read.csv(data_file, header=True, inferSchema=True)

CPU times: user 1.95 ms, sys: 1.3 ms, total: 3.24 ms
Wall time: 3.38 s


In [3]:
print("read {:,} records".format(data.count()))

read 100,000 records


In [4]:
data.printSchema()


root
 |-- CMTE_ID: string (nullable = true)
 |-- CAND_ID: string (nullable = true)
 |-- CAND_NM: string (nullable = true)
 |-- CONTBR_NM: string (nullable = true)
 |-- CONTBR_CITY: string (nullable = true)
 |-- CONTBR_ST: string (nullable = true)
 |-- CONTBR_ZIP: string (nullable = true)
 |-- CONTBR_EMPLOYER: string (nullable = true)
 |-- CONTBR_OCCUPATION: string (nullable = true)
 |-- CONTB_RECEIPT_AMT: double (nullable = true)
 |-- CONTB_RECEIPT_DT: string (nullable = true)
 |-- RECEIPT_DESC: string (nullable = true)
 |-- MEMO_CD: string (nullable = true)
 |-- MEMO_TEXT: string (nullable = true)
 |-- FORM_TP: string (nullable = true)
 |-- FILE_NUM: integer (nullable = true)
 |-- TRAN_ID: string (nullable = true)
 |-- ELECTION_TP: string (nullable = true)



In [5]:
## data.show() is hard to read
## use Pandas to pretty print

## vertical
data.limit(3).toPandas().T

# horizontal
# data.limit(5).toPandas()

| | 0 | 1 | 2 |
|---|---|---|---|
| CMTE_ID | C00605568 | C00574624 | C00580100 |
| CAND_ID | P20002671 | P60006111 | P80001571 |
| CAND_NM | Johnson, Gary | Cruz, Rafael Edward 'Ted' | Trump, Donald J. |
| CONTBR_NM | SMITH, PAUL | BROWNE, THOMAS JOHN | RISENHOOVER, LINDSEY |
| CONTBR_CITY | SAN DIEGO | WHITESBORO | TULSA |
| CONTBR_ST | CA | NY | OK |
| CONTBR_ZIP | 92117 | 134921106 | 74133 |
| CONTBR_EMPLOYER | SELF | RETIRED | INFORMATION REQUESTED |
| CONTBR_OCCUPATION | RETIRED | RETIRED | INFORMATION REQUESTED |
| CONTB_RECEIPT_AMT | 150 | 35 | 73.59 |


### 1.5 - Sample Data
Start with a small sample of the data. Once the algorithm is working, process the full dataset.


In [6]:
# sample size :  10% --> 0.1,   100%  -> 1.0

sample_size = 0.1

data = data.sample(withReplacement=False, fraction=sample_size)
print("sample size {:,} records".format(data.count()))

sample size 9,909 records
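
Note that the sample contains 9,909 rows rather than exactly 10,000. The `fraction` argument is an expected proportion, not an exact count: each row is kept or dropped independently. The pure-Python sketch below imitates that behaviour. It is not Spark's code, and the helper `bernoulli_sample` and the seed value are made up for illustration.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = range(100_000)
sample = bernoulli_sample(rows, fraction=0.1, seed=7)
print("sample size {:,} records".format(len(sample)))
```

The result lands near 10,000 but rarely on it. In the notebook, passing a `seed` to `data.sample(...)` should make the sample repeatable between runs.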


## Step 2 : Clean Data

### 2.1 - extract only a few columns

In [7]:
columns = ['CAND_NM', 'CONTBR_ST', 'CONTBR_EMPLOYER', 'CONTBR_OCCUPATION', 'CONTB_RECEIPT_AMT']

In [8]:
data2 = data.select(columns)
data2.printSchema()
data2.limit(5).toPandas()

root
 |-- CAND_NM: string (nullable = true)
 |-- CONTBR_ST: string (nullable = true)
 |-- CONTBR_EMPLOYER: string (nullable = true)
 |-- CONTBR_OCCUPATION: string (nullable = true)
 |-- CONTB_RECEIPT_AMT: double (nullable = true)



| | CAND_NM | CONTBR_ST | CONTBR_EMPLOYER | CONTBR_OCCUPATION | CONTB_RECEIPT_AMT |
|---|---|---|---|---|---|
| 0 | Trump, Donald J. | OK | INFORMATION REQUESTED | INFORMATION REQUESTED | 73.59 |
| 1 | Sanders, Bernard | NE | NONE | NOT EMPLOYED | 50.0 |
| 2 | Clinton, Hillary Rodham | WA | AXON | SOFTWARE ENGINEER | 100.0 |
| 3 | Sanders, Bernard | FL | BROWN DAVIS | SELF | 27.0 |
| 4 | Sanders, Bernard | WI | SOUTHWEST WISCONSIN TECHNICAL COLLEGE | PROFESSOR | 25.0 |


### 2.2 - Clean data (drop null values)

In [9]:
data_clean = data2.na.drop()

print("original data size = {:,}".format(data2.count()) )
print("clean data size = {:,}".format(data_clean.count()) )
print("dropped records = {:,}".format(data2.count() - data_clean.count()) )

original data size = 9,909
clean data size = 9,831
dropped records = 78


## Step 2 : Basic Exploration

### 2.1 - Print out a contribution count broken down by candidate

**=> Which candidates got the most donations? (in terms of number of contributions)**

In [10]:
## Per-candidate breakdown, ordered by number of contributions
data_clean.groupBy('CAND_NM').count().orderBy('count', ascending=False).show(20, False)

+-------------------------+-----+
|CAND_NM                  |count|
+-------------------------+-----+
|Clinton, Hillary Rodham  |4610 |
|Sanders, Bernard         |2732 |
|Trump, Donald J.         |1039 |
|Cruz, Rafael Edward 'Ted'|749  |
|Carson, Benjamin S.      |350  |
|Rubio, Marco             |106  |
|Fiorina, Carly           |45   |
|Paul, Rand               |41   |
|Kasich, John R.          |36   |
|Bush, Jeb                |35   |
|Stein, Jill              |17   |
|Johnson, Gary            |15   |
|Christie, Christopher J. |11   |
|Walker, Scott            |10   |
|O'Malley, Martin Joseph  |10   |
|Huckabee, Mike           |9    |
|Graham, Lindsey O.       |4    |
|McMullin, Evan           |4    |
|Jindal, Bobby            |3    |
|Perry, James R. (Rick)   |2    |
+-------------------------+-----+
only showing top 20 rows



### 2.2 - find min/max/average contribution per candidate

In [11]:
# Import under an alias so Spark's min/max do not shadow Python's built-ins
from pyspark.sql import functions as F

(data_clean.groupBy('CAND_NM')
        .agg(F.min('CONTB_RECEIPT_AMT'), F.mean('CONTB_RECEIPT_AMT'), F.max('CONTB_RECEIPT_AMT'))
        .orderBy('CAND_NM')
        .show(40, False))

+-------------------------+----------------------+----------------------+----------------------+
|CAND_NM                  |min(CONTB_RECEIPT_AMT)|avg(CONTB_RECEIPT_AMT)|max(CONTB_RECEIPT_AMT)|
+-------------------------+----------------------+----------------------+----------------------+
|Bush, Jeb                |-350.0                |937.1142857142858     |5000.0                |
|Carson, Benjamin S.      |-2300.0               |84.16982857142857     |1000.0                |
|Christie, Christopher J. |25.0                  |1756.8181818181818    |2700.0                |
|Clinton, Hillary Rodham  |-450.0                |129.25781344902384    |2700.0                |
|Cruz, Rafael Edward 'Ted'|-2700.0               |81.89006675567424     |5400.0                |
|Fiorina, Carly           |3.0                   |368.24444444444447    |3000.0                |
|Graham, Lindsey O.       |-2700.0               |-312.5                |1000.0                |
|Huckabee, Mike           |25.

### 2.3 - Whoa! Negative Contributions!
We see some negative contributions!   

**Q==> Can you guys figure out why?**


In [12]:
## Keep only positive contributions
pos_contribs = data_clean.filter("CONTB_RECEIPT_AMT > 0")

print("original data size = {:,}".format(data_clean.count()) )
print("positive contributions data size = {:,}".format(pos_contribs.count()) )

original data size = 9,831
positive contributions data size = 9,763


### 2.4 - Now find min/max/mean in positive contributions

In [13]:
# Import under an alias so Spark's min/max do not shadow Python's built-ins
from pyspark.sql import functions as F

print("sorted by CAND_NM")

(pos_contribs.groupBy('CAND_NM')
        .agg(F.min('CONTB_RECEIPT_AMT'), F.mean('CONTB_RECEIPT_AMT'), F.max('CONTB_RECEIPT_AMT'))
        .orderBy('CAND_NM')
        .show(40, False))

sorted by CAND_NM
+-------------------------+----------------------+----------------------+----------------------+
|CAND_NM                  |min(CONTB_RECEIPT_AMT)|avg(CONTB_RECEIPT_AMT)|max(CONTB_RECEIPT_AMT)|
+-------------------------+----------------------+----------------------+----------------------+
|Bush, Jeb                |5.0                   |974.9705882352941     |5000.0                |
|Carson, Benjamin S.      |2.0                   |94.55170028818445     |1000.0                |
|Christie, Christopher J. |25.0                  |1756.8181818181818    |2700.0                |
|Clinton, Hillary Rodham  |1.0                   |129.6677389226759     |2700.0                |
|Cruz, Rafael Edward 'Ted'|1.0                   |104.65925             |5400.0                |
|Fiorina, Carly           |3.0                   |368.24444444444447    |3000.0                |
|Graham, Lindsey O.       |200.0                 |483.3333333333333     |1000.0                |
|Huckabee, M

In [14]:
# Import under an alias so Spark's min/max do not shadow Python's built-ins
from pyspark.sql import functions as F

print("sorted by AVG contribution")

(pos_contribs.groupBy('CAND_NM')
        .agg(F.min('CONTB_RECEIPT_AMT'), F.mean('CONTB_RECEIPT_AMT'), F.max('CONTB_RECEIPT_AMT'))
        .orderBy('avg(CONTB_RECEIPT_AMT)', ascending=False)
        .show(40, False))

sorted by AVG contribution
+-------------------------+----------------------+----------------------+----------------------+
|CAND_NM                  |min(CONTB_RECEIPT_AMT)|avg(CONTB_RECEIPT_AMT)|max(CONTB_RECEIPT_AMT)|
+-------------------------+----------------------+----------------------+----------------------+
|Christie, Christopher J. |25.0                  |1756.8181818181818    |2700.0                |
|O'Malley, Martin Joseph  |100.0                 |1477.5                |5400.0                |
|Perry, James R. (Rick)   |250.0                 |1475.0                |2700.0                |
|Bush, Jeb                |5.0                   |974.9705882352941     |5000.0                |
|Jindal, Bobby            |50.0                  |950.0                 |2700.0                |
|Kasich, John R.          |25.0                  |741.6666666666666     |2700.0                |
|Walker, Scott            |25.0                  |558.3333333333334     |2700.0                |
|Gr

### 2.5 -- Find total contribution amount per candidate

In [15]:

print("sorted by total contribution")

pos_contribs.groupBy('CAND_NM').\
        sum('CONTB_RECEIPT_AMT').\
        orderBy('sum(CONTB_RECEIPT_AMT)', ascending=False).\
        show(40, False)

sorted by total contribution
+-------------------------+----------------------+
|CAND_NM                  |sum(CONTB_RECEIPT_AMT)|
+-------------------------+----------------------+
|Clinton, Hillary Rodham  |596990.2699999999     |
|Trump, Donald J.         |167640.23999999993    |
|Sanders, Bernard         |116360.12999999999    |
|Cruz, Rafael Edward 'Ted'|75354.66              |
|Rubio, Marco             |47519.0               |
|Bush, Jeb                |33149.0               |
|Carson, Benjamin S.      |32809.44              |
|Kasich, John R.          |26700.0               |
|Christie, Christopher J. |19325.0               |
|Fiorina, Carly           |16571.0               |
|O'Malley, Martin Joseph  |14775.0               |
|Johnson, Gary            |7148.2                |
|Paul, Rand               |6447.0                |
|Stein, Jill              |5554.0                |
|Walker, Scott            |5025.0                |
|Perry, James R. (Rick)   |2950.0                |
|J

## Step 3: Build Indexers

In [16]:
from pyspark.ml.feature import StringIndexer

indexer1 = StringIndexer(inputCol='CAND_NM', outputCol = "CAND_NM_index", handleInvalid="keep")
indexer2 = StringIndexer(inputCol='CONTBR_ST', outputCol = "CONTBR_ST_index", handleInvalid="keep")
indexer3 = StringIndexer(inputCol='CONTBR_EMPLOYER', outputCol = "CONTBR_EMPLOYER_index", handleInvalid="keep")
indexer4 = StringIndexer(inputCol='CONTBR_OCCUPATION', outputCol = "CONTBR_OCCUPATION_index", handleInvalid="keep")
# indexed_data = indexer1.fit(pos_contribs).transform(pos_contribs)
# indexed_data.show()


In [17]:
## Stash indexers into a Pipeline
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[indexer1, indexer2, indexer3, indexer4])
print(pipeline)

Pipeline_46c99bf9ca311da8f399


In [18]:
%%time
indexed_data = pipeline.fit(pos_contribs).transform(pos_contribs)

CPU times: user 25 ms, sys: 5.89 ms, total: 30.9 ms
Wall time: 3.76 s


In [19]:
indexed_data.printSchema()
# indexed_data.show()

root
 |-- CAND_NM: string (nullable = true)
 |-- CONTBR_ST: string (nullable = true)
 |-- CONTBR_EMPLOYER: string (nullable = true)
 |-- CONTBR_OCCUPATION: string (nullable = true)
 |-- CONTB_RECEIPT_AMT: double (nullable = true)
 |-- CAND_NM_index: double (nullable = false)
 |-- CONTBR_ST_index: double (nullable = false)
 |-- CONTBR_EMPLOYER_index: double (nullable = false)
 |-- CONTBR_OCCUPATION_index: double (nullable = false)



### 3.1  Understand indexed values

In [20]:
# state
indexed_data.groupBy(['CONTBR_ST', 'CONTBR_ST_index']).count()\
            .orderBy('CONTBR_ST_index', ascending=False).show()

+---------+---------------+-----+
|CONTBR_ST|CONTBR_ST_index|count|
+---------+---------------+-----+
|       VI|           57.0|    1|
|       AA|           56.0|    1|
|       EN|           55.0|    1|
|       AE|           54.0|    2|
|       ON|           53.0|    2|
|       PR|           52.0|    7|
|       ZZ|           51.0|    9|
|       SD|           50.0|   13|
|       ND|           49.0|   14|
|       DE|           48.0|   17|
|       WY|           47.0|   18|
|       RI|           46.0|   21|
|       MS|           45.0|   33|
|       AK|           44.0|   36|
|       WV|           43.0|   37|
|       NE|           42.0|   38|
|       MT|           41.0|   39|
|       ID|           40.0|   44|
|       HI|           39.0|   45|
|       UT|           38.0|   50|
+---------+---------------+-----+
only showing top 20 rows
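
The table shows that the rarest states receive the highest indices. By default `StringIndexer` orders labels by descending frequency, so the most common value gets index 0, and with `handleInvalid="keep"` any value not seen during fitting goes into one extra bucket at the end. The pure-Python sketch below mimics that behaviour on toy data. The helpers `fit_indexer` and `transform_label` are hypothetical, the state list is made up, and the alphabetical tie-breaking is an assumption that may not match every Spark version.

```python
from collections import Counter

def fit_indexer(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Assumption: ties are broken alphabetically to keep the result deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_label(mapping, label):
    """Look up a label; unseen labels share one extra index ("keep" behaviour)."""
    return mapping.get(label, float(len(mapping)))

states = ["CA", "NY", "CA", "TX", "CA", "NY", "VI"]  # made-up toy data
mapping = fit_indexer(states)
print(mapping)
print(transform_label(mapping, "ZZ"))
```

The indices are arbitrary codes for categories, so the numeric distance between two indices carries no meaning on its own.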



In [21]:
# employer
indexed_data.groupBy(['CONTBR_EMPLOYER', 'CONTBR_EMPLOYER_index']).count()\
            .orderBy('CONTBR_EMPLOYER_index', ascending=False).show()

+--------------------+---------------------+-----+
|     CONTBR_EMPLOYER|CONTBR_EMPLOYER_index|count|
+--------------------+---------------------+-----+
|        BEEZY'S CAFE|               4135.0|    1|
|       FEDEX EXPRESS|               4134.0|    1|
|UNIVERSITY OF NOR...|               4133.0|    1|
|WIKIMEDIA FOUNDATION|               4132.0|    1|
|                TSFL|               4131.0|    1|
|EAST BAY LEADERSH...|               4130.0|    1|
| GREEN THUMB FLORIST|               4129.0|    1|
|           ACT, INC.|               4128.0|    1|
|PLANED INTERNATIO...|               4127.0|    1|
|PUEBLO HOMEOWNERS...|               4126.0|    1|
|  WISE HEALTH SYSTEM|               4125.0|    1|
| TIME EQUITIES, INC.|               4124.0|    1|
|     SUNSET PHARMACY|               4123.0|    1|
|CHRIST THE SERVAN...|               4122.0|    1|
|   L. DEAN WEAVER CO|               4121.0|    1|
|ARNALL GOLDEN GRE...|               4120.0|    1|
|DARTMOUTH HITCHCO...|         

In [22]:
# occupation
indexed_data.groupBy(['CONTBR_OCCUPATION', 'CONTBR_OCCUPATION_index']).count()\
            .orderBy('CONTBR_OCCUPATION_index', ascending=False).show(10, False)

+--------------------------------------+-----------------------+-----+
|CONTBR_OCCUPATION                     |CONTBR_OCCUPATION_index|count|
+--------------------------------------+-----------------------+-----+
|IT PROGRAM MANAGER  MANAGEMENT ANALYS |2512.0                 |1    |
|LAWYER, GOVERNMENT RELATIONS DIRECTOR |2511.0                 |1    |
|PROFESSOR, LEGAL STUDIES              |2510.0                 |1    |
|NON-PROFIT EXECUTIVE DIRECTOR AND ARTI|2509.0                 |1    |
|PARLIAMENTARIAN/MEETING CONSULTANT    |2508.0                 |1    |
|FULL TIM GRANDPA                      |2507.0                 |1    |
|DOG WALKER SUPERVISOR                 |2506.0                 |1    |
|EXECUTIVE PRODUCER                    |2505.0                 |1    |
|VP OF FINANCE                         |2504.0                 |1    |
|MS                                    |2503.0                 |1    |
+--------------------------------------+-----------------------+-----+
only s

## Step 4: Feature Vectors

In [23]:
from pyspark.ml.feature import VectorAssembler

feature_columns = ['CONTBR_ST_index', 'CONTBR_EMPLOYER_index', 'CONTBR_OCCUPATION_index' ]

assembler = VectorAssembler(inputCols= feature_columns,  outputCol="features")
feature_vector = assembler.transform(indexed_data)
feature_vector.printSchema()
# feature_vector.show()

root
 |-- CAND_NM: string (nullable = true)
 |-- CONTBR_ST: string (nullable = true)
 |-- CONTBR_EMPLOYER: string (nullable = true)
 |-- CONTBR_OCCUPATION: string (nullable = true)
 |-- CONTB_RECEIPT_AMT: double (nullable = true)
 |-- CAND_NM_index: double (nullable = false)
 |-- CONTBR_ST_index: double (nullable = false)
 |-- CONTBR_EMPLOYER_index: double (nullable = false)
 |-- CONTBR_OCCUPATION_index: double (nullable = false)
 |-- features: vector (nullable = true)



## Step 5: Split data into training and test


In [24]:
# Split the data into training and test sets (30% held out for testing)
(training, test) = feature_vector.randomSplit([.7,.3]) # do a random split 70%/30%

print("training set = " , training.count())
print("testing set = " , test.count())

training set =  6801
testing set =  2962


## Step 6: Create Random Forest Model

In [25]:
from pyspark.ml.classification import RandomForestClassifier

# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="CAND_NM_index", featuresCol="features", numTrees=20, maxBins=50000)


## Step 7: Train

In [26]:
%%time
print("training starting...")
model = rf.fit(training)
print("training done")
print (model)

training starting...
training done
RandomForestClassificationModel (uid=RandomForestClassifier_423d98adc43a8c0376a9) with 20 trees
CPU times: user 16.3 ms, sys: 5.71 ms, total: 22 ms
Wall time: 42.1 s


In [27]:
print("trained on {:,} records".format(training.count()))

trained on 6,801 records


## Step 8: Prediction

In [28]:
%%time
predictions = model.transform(test)

CPU times: user 8.41 ms, sys: 2.92 ms, total: 11.3 ms
Wall time: 66.8 ms


In [29]:
print("predicted on {:,} records".format(test.count()))

predicted on 2,962 records


In [30]:
# Select example rows to display.
predictions.sample(False, 0.1).select("prediction", 'CAND_NM_index', "CAND_NM").show()

+----------+-------------+--------------------+
|prediction|CAND_NM_index|             CAND_NM|
+----------+-------------+--------------------+
|       2.0|          4.0| Carson, Benjamin S.|
|       2.0|          4.0| Carson, Benjamin S.|
|       2.0|          4.0| Carson, Benjamin S.|
|       0.0|          4.0| Carson, Benjamin S.|
|       2.0|          4.0| Carson, Benjamin S.|
|       2.0|          4.0| Carson, Benjamin S.|
|       2.0|          4.0| Carson, Benjamin S.|
|       3.0|          4.0| Carson, Benjamin S.|
|       2.0|          4.0| Carson, Benjamin S.|
|       2.0|          4.0| Carson, Benjamin S.|
|       2.0|          4.0| Carson, Benjamin S.|
|       2.0|          4.0| Carson, Benjamin S.|
|       2.0|          4.0| Carson, Benjamin S.|
|       2.0|          4.0| Carson, Benjamin S.|
|       4.0|          4.0| Carson, Benjamin S.|
|       3.0|          4.0| Carson, Benjamin S.|
|       4.0|          4.0| Carson, Benjamin S.|
|       2.0|          4.0| Carson, Benja

## Step 9: Evaluate

In [31]:
predictions_test = model.transform(test)
predictions_train = model.transform(training)

### 9.1 - Accuracy
**=> TODO: Think about the test error here. Does it seem high? What does that say about our model?**

**=> How do we define model success?**

In [32]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="CAND_NM_index", predictionCol="prediction",
                                              metricName="accuracy")

print("Training set accuracy = " , evaluator.evaluate(predictions_train))
print("Test set accuracy = " , evaluator.evaluate(predictions_test))

Training set accuracy =  0.6475518306131451
Test set accuracy =  0.5182309250506415
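
One way to put these numbers in context is to compare them with a naive baseline that always predicts the most common candidate. The sketch below uses counts reported in this notebook (4,604 Clinton rows out of 9,763 positive contributions) as a rough stand-in for the class mix of the test set, which is an assumption since the test split's own counts are not shown. The helper `majority_baseline` is a made-up name.

```python
def majority_baseline(majority_count, total_count):
    """Accuracy obtained by always predicting the most frequent class."""
    return majority_count / total_count

# Counts from the notebook output: Clinton rows and all positive contributions.
baseline = majority_baseline(4604, 9763)
test_accuracy = 0.5182309250506415  # test set accuracy printed above

print("baseline accuracy  = {:.3f}".format(baseline))
print("lift over baseline = {:.3f}".format(test_accuracy - baseline))
```

If the lift over the baseline is small, the model is learning only a little beyond "guess the front-runner", which is one possible way to frame model success here.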


### 9.2 - Confusion Matrix

####  Figure Out Candidate Mapping

In [33]:
## Candidate Mapping
candidate_mapping = indexed_data.groupBy(['CAND_NM', 'CAND_NM_index']).count()
# candidate_mapping.orderBy('CAND_NM').show()
candidate_mapping.orderBy('CAND_NM_index').show()

+--------------------+-------------+-----+
|             CAND_NM|CAND_NM_index|count|
+--------------------+-------------+-----+
|Clinton, Hillary ...|          0.0| 4604|
|    Sanders, Bernard|          1.0| 2732|
|    Trump, Donald J.|          2.0| 1015|
|Cruz, Rafael Edwa...|          3.0|  720|
| Carson, Benjamin S.|          4.0|  347|
|        Rubio, Marco|          5.0|  103|
|      Fiorina, Carly|          6.0|   45|
|          Paul, Rand|          7.0|   41|
|     Kasich, John R.|          8.0|   36|
|           Bush, Jeb|          9.0|   34|
|         Stein, Jill|         10.0|   17|
|       Johnson, Gary|         11.0|   15|
|Christie, Christo...|         12.0|   11|
|O'Malley, Martin ...|         13.0|   10|
|       Walker, Scott|         14.0|    9|
|      Huckabee, Mike|         15.0|    9|
|      McMullin, Evan|         16.0|    4|
|       Jindal, Bobby|         17.0|    3|
|  Graham, Lindsey O.|         18.0|    3|
|Perry, James R. (...|         19.0|    2|
+----------

#### Confusion Matrix

**=>What can you conclude from the confusion matrix?**

Use the mapping above to interpret the label indices.

Is our model better at predicting candidates with many donations (Clinton, Sanders), or few donations?

What can you say about our model's performance?

In [34]:
cm = predictions.groupBy('CAND_NM').pivot('prediction', range(0,22)).count().na.fill(0).orderBy('CAND_NM')
cm.toPandas()

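| | CAND_NM | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bush, Jeb | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Carson, Benjamin S. | 8 | 9 | 65 | 10 | 13 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Christie, Christopher J. | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Clinton, Hillary Rodham | 790 | 168 | 371 | 98 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Cruz, Rafael Edward 'Ted' | 15 | 35 | 122 | 29 | 3 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | Fiorina, Carly | 3 | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | Graham, Lindsey O. | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | Huckabee, Mike | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | Jindal, Bobby | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | Johnson, Gary | 1 | 0 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |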
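
In this matrix each row is a true candidate and each column is a predicted label index, so a candidate's recall is the count on the diagonal divided by the row total. The sketch below shows that calculation in plain Python on a handful of made-up (true, predicted) pairs. The helpers `confusion_counts` and `recall` are hypothetical names, and none of the numbers come from the notebook's data.

```python
from collections import Counter

def confusion_counts(pairs):
    """Count how often each (true_label, predicted_label) pair occurs."""
    return Counter(pairs)

def recall(counts, label):
    """Fraction of rows whose true label is `label` that were predicted as `label`."""
    row_total = sum(n for (true, _), n in counts.items() if true == label)
    return counts[(label, label)] / row_total if row_total else 0.0

# Made-up (true, predicted) pairs, not taken from the notebook's data.
pairs = [
    ("Clinton", "Clinton"), ("Clinton", "Clinton"), ("Clinton", "Trump"),
    ("Sanders", "Clinton"), ("Sanders", "Sanders"),
    ("Carson", "Trump"), ("Carson", "Trump"),
]
counts = confusion_counts(pairs)
for label in ("Clinton", "Sanders", "Carson"):
    print(label, round(recall(counts, label), 2))
```

A candidate whose rows are almost always assigned to someone else, like the toy "Carson" here, has recall near zero even when overall accuracy looks acceptable.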


## Step 10: Print the feature importances

**=> TODO Compare the relative weight of the feature importances?**

In [35]:
import pandas as pd

imp = model.featureImportances.toArray()
print(imp)
df = pd.DataFrame({'cols': feature_columns, 'importance':imp})
print(df)
df.sort_values(by=['importance'], ascending=False)

[0.04677645 0.67797629 0.27524726]
                      cols  importance
0          CONTBR_ST_index    0.046776
1    CONTBR_EMPLOYER_index    0.677976
2  CONTBR_OCCUPATION_index    0.275247


| | cols | importance |
|---|---|---|
| 1 | CONTBR_EMPLOYER_index | 0.677976 |
| 2 | CONTBR_OCCUPATION_index | 0.275247 |
| 0 | CONTBR_ST_index | 0.046776 |
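
To compare the relative weights, it can help to express each importance as a share of the largest one. The sketch below does that in plain Python using the three values printed by the notebook. It needs no Spark session, and the variable names are illustrative only.

```python
# Importances as printed by the notebook, keyed by feature column.
importances = {
    "CONTBR_ST_index": 0.04677645,
    "CONTBR_EMPLOYER_index": 0.67797629,
    "CONTBR_OCCUPATION_index": 0.27524726,
}

ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
top_value = ranked[0][1]

for name, value in ranked:
    # Ratio of this feature's importance to the most important feature's.
    print("{:<25} {:.3f}  ({:.0%} of top)".format(name, value, value / top_value))
```

Keep in mind that impurity-based importances tend to favour features with many distinct values, so part of Employer's lead may reflect its thousands of categories rather than genuine predictive power.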


## Conclusion: Most important Fields

1. Employer
2. Occupation
3. State

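State carries little weight (under 5% of the total importance). No other fields were used as features in this model, so this run says nothing about their significance.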

**=> BONUS: Do a Pearson Correlation Matrix of the variables to the outcome, to see correlation**



## BONUS : Running on full dataset

**Use the download script**

```bash
$ cd   ~/data/presidential_election_contribs
$ ./download-data.sh
```

This will download the full dataset.

As we run on larger datasets, execution will take longer and the Jupyter notebook might time out. So let's run this in command-line / script mode.

Download the Jupyter notebook as a Python file (File --> Download as --> Python)

```bash
# run the downloaded python script as follows
$ time ~/spark/bin/spark-submit --master local[*] random-forest-2-election-classification.py 2> logs

```

Watch the output
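
One thing to watch for: `%%time` is a Jupyter cell magic rather than Python, so the exported script may contain `get_ipython()` calls that fail outside a notebook. A possible replacement is a small timing context manager, sketched below. The helper `timed` is hypothetical and not part of Spark or this project, and the workload inside the `with` block is only a stand-in.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print("{}: {:.2f} s".format(label, elapsed))

# Stand-in workload; in the exported script this would wrap a call
# such as rf.fit(training).
with timed("training"):
    total = sum(range(1_000_000))
```

Since the command above redirects stderr to `logs`, these timing lines are printed to stdout and remain visible on the console.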
