# Random Forest Model

This notebook loads the cleaned data produced by the data_cleaning.ipynb notebook and uses it to train and evaluate a random forest model.

This project is based on the BuzzFeed News article on identifying spy planes found [here](https://www.buzzfeednews.com/article/peteraldhous/hidden-spy-planes), using data and code adapted from their GitHub repository [here](https://github.com/BuzzFeedNews/2017-08-spy-plane-finder).

## Instructions

Follow the directions in any cell that does not contain code. If a cell does contain code, run it before moving on to the next cell.

In [1]:
%matplotlib inline
#import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

#scikit-learn is a library of machine learning algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

In [2]:
# This relies on output from a previous notebook!
# If this cell does not work, try using the pregenerated data instead
#planes_labeled = pd.read_csv("/mnt/data/spyplane-data/pregenerated_planes_labeled.csv")
planes_labeled = pd.read_csv("/mnt/data/spyplane-data/planes_labeled.csv")
planes_labeled.head()

| | adshex | duration1 | duration2 | duration3 | duration4 | duration5 | boxes1 | boxes2 | boxes3 | boxes4 | ... | steer5 | steer6 | steer7 | steer8 | flights | squawk_1 | observations | type | class | type_factorized |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A00002 | 0.517241 | 0.103448 | 0.103448 | 0.103448 | 0.172414 | 0.862069 | 0.137931 | 0.0 | 0.0 | ... | 0.03407 | 0.202578 | 0.021179 | 0.06814 | 29 | 0 | 1086 | SHIP | other | 0 |
| 1 | A00220 | 0.0 | 0.254902 | 0.176471 | 0.313725 | 0.254902 | 0.058824 | 0.372549 | 0.294118 | 0.215686 | ... | 0.13203 | 0.120011 | 0.008611 | 0.006906 | 51 | 0 | 11149 | RV10 | other | 1 |
| 2 | A0041E | 0.142857 | 0.285714 | 0.0 | 0.571429 | 0.0 | 0.285714 | 0.142857 | 0.285714 | 0.285714 | ... | 0.090498 | 0.078431 | 0.010558 | 0.019608 | 7 | 0 | 663 | SR22 | other | 2 |
| 3 | A00889 | 0.0 | 0.12 | 0.2 | 0.08 | 0.6 | 0.0 | 0.2 | 0.12 | 0.28 | ... | 0.065339 | 0.023907 | 0.001276 | 0.001702 | 25 | 7760 | 11754 | SR22 | other | 2 |
| 4 | A008BE | 0.0 | 0.3 | 0.2 | 0.2 | 0.3 | 0.0 | 0.3 | 0.2 | 0.3 | ... | 0.092958 | 0.14507 | 0.001408 | 0.009859 | 10 | 1200 | 710 | PA24 | other | 3 |


### Background 

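A random forest classifier builds many decision trees at training time, each on a bootstrap sample of the training data, and combines their outputs into a single prediction. In the classic formulation the predicted class is the mode (majority vote) of the trees' predictions; scikit-learn instead averages the trees' predicted class probabilities and picks the class with the highest average. You can find the scikit-learn documentation for the random forest classifier [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).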
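To make the averaging concrete, here is a minimal sketch on synthetic data (the `make_classification` data set and the names `X_demo`, `y_demo`, and `forest` are assumptions for illustration, not part of this project). It checks that the mean of the individual trees' probabilities matches what the forest reports.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data (an assumption for illustration, not the plane data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree produces its own class probabilities
per_tree = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Averaging over the trees reproduces the forest's probabilities
averaged = per_tree.mean(axis=0)
forest_proba = forest.predict_proba(X_demo)
print(np.allclose(averaged, forest_proba))

# The predicted class is the one with the highest averaged probability
predicted = forest.classes_[averaged.argmax(axis=1)]
agrees = np.array_equal(predicted, forest.predict(X_demo))
print(agrees)
```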

### Format data

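We'll first format our data for the model: the class labels become integers, and the identifier and label columns are dropped from the features.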

In [3]:
#factorize the classes
y = pd.factorize(planes_labeled['class'])[0]
y[0:5]

array([0, 0, 0, 0, 0])
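`pd.factorize` returns two things: the integer codes and the unique labels in order of first appearance. Keeping the second value tells us which integer stands for which class. The sketch below uses a small hypothetical series rather than the real `class` column.

```python
import pandas as pd

# Toy labels (hypothetical values in the style of the 'class' column)
labels = pd.Series(["other", "other", "surveillance", "other"])

# codes holds one integer per row; uniques lists labels in order of first appearance
codes, uniques = pd.factorize(labels)

# Map each integer code back to its label
label_map = dict(enumerate(uniques))
print(codes)
print(label_map)
```

Because the first row of our data is labeled `other`, that class receives code 0 in `y`.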

In [4]:
#create X and drop columns that won't be used for training
X = planes_labeled.drop(['adshex','class', 'type'], axis = 1)
X.head()

| | duration1 | duration2 | duration3 | duration4 | duration5 | boxes1 | boxes2 | boxes3 | boxes4 | boxes5 | ... | steer3 | steer4 | steer5 | steer6 | steer7 | steer8 | flights | squawk_1 | observations | type_factorized |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.517241 | 0.103448 | 0.103448 | 0.103448 | 0.172414 | 0.862069 | 0.137931 | 0.0 | 0.0 | 0.0 | ... | 0.174954 | 0.244015 | 0.03407 | 0.202578 | 0.021179 | 0.06814 | 29 | 0 | 1086 | 0 |
| 1 | 0.0 | 0.254902 | 0.176471 | 0.313725 | 0.254902 | 0.058824 | 0.372549 | 0.294118 | 0.215686 | 0.058824 | ... | 0.263342 | 0.375998 | 0.13203 | 0.120011 | 0.008611 | 0.006906 | 51 | 0 | 11149 | 1 |
| 2 | 0.142857 | 0.285714 | 0.0 | 0.571429 | 0.0 | 0.285714 | 0.142857 | 0.285714 | 0.285714 | 0.0 | ... | 0.108597 | 0.657617 | 0.090498 | 0.078431 | 0.010558 | 0.019608 | 7 | 0 | 663 | 2 |
| 3 | 0.0 | 0.12 | 0.2 | 0.08 | 0.6 | 0.0 | 0.2 | 0.12 | 0.28 | 0.4 | ... | 0.078782 | 0.814361 | 0.065339 | 0.023907 | 0.001276 | 0.001702 | 25 | 7760 | 11754 | 2 |
| 4 | 0.0 | 0.3 | 0.2 | 0.2 | 0.3 | 0.0 | 0.3 | 0.2 | 0.3 | 0.2 | ... | 0.250704 | 0.43662 | 0.092958 | 0.14507 | 0.001408 | 0.009859 | 10 | 1200 | 710 | 3 |


### Train and test data sets

We want to set aside some data to test the model so that we can determine its accuracy on planes it has not seen. Here, we'll use scikit-learn to split our data into a training set and a test set. We'll set a test size of 30% and use the remaining 70% for training.

In [5]:
#split into test and train sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [6]:
print(X_train.shape, X_test.shape)

(408, 32) (176, 32)
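The shapes follow from the 584 rows in the data (408 + 176): 30% of 584 is 175.2, which `train_test_split` rounds up to 176 test rows, leaving 408 for training. The sketch below reproduces this with placeholder arrays. It also passes `stratify`, an optional parameter this notebook does not use, which keeps the class ratio the same in both sets; the 438/146 class balance in `y_demo` is a made-up value for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as our data (584 rows, 32 features)
X_demo = np.zeros((584, 32))
# Hypothetical class balance (25% positives), not the real label counts
y_demo = np.array([0] * 438 + [1] * 146)

# stratify=y_demo keeps the class ratio equal in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Stratifying can be worth considering when one class is much rarer than the other, since a purely random split may leave the test set with too few examples of the rare class.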


### Train model

Now we'll train our model and make predictions. We'll seed NumPy's global random number generator first so that we get the same results each time the notebook is run.

In [7]:
#set a seed so that the results are reproducible
np.random.seed(415)

spy_model = RandomForestClassifier()
spy_model.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
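An alternative to seeding the global generator is to pass `random_state` directly to the classifier, which ties the seed to the model rather than to the order in which cells are run. This is a minimal sketch on synthetic data (the data set and the names `model_a` and `model_b` are assumptions for illustration); the notebook itself uses `np.random.seed`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration, not the plane data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)

# Two models built with the same random_state grow identical forests
model_a = RandomForestClassifier(n_estimators=50, random_state=415).fit(X_demo, y_demo)
model_b = RandomForestClassifier(n_estimators=50, random_state=415).fit(X_demo, y_demo)

same = np.allclose(model_a.feature_importances_, model_b.feature_importances_)
print(same)
```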

In [8]:
#predict the classes
predictions = spy_model.predict(X_test)
predictions[0:10]

array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

In [9]:
#tabulate true versus predicted classes (surveillance and other) for the test set

cmtx = pd.DataFrame(
    confusion_matrix(y_test, predictions), 
    index=['true:other', 'true:surveillance'], 
    columns=['predicted:other', 'predicted:surveillance']
)

cmtx

| | predicted:other | predicted:surveillance |
|---|---|---|
| true:other | 144 | 2 |
| true:surveillance | 10 | 20 |
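The confusion matrix also gives us precision and recall for the surveillance class, which say more than overall accuracy when one class is rare. The sketch below copies the four counts from the table above; the variable names are our own.

```python
import numpy as np

# Counts copied from the confusion matrix (rows: true class, columns: predicted class)
cm = np.array([[144, 2],
               [10, 20]])

# For a 2x2 matrix, ravel() gives true negatives, false positives,
# false negatives, and true positives, in that order
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # 20 / 22: share of flagged planes that are surveillance
recall = tp / (tp + fn)     # 20 / 30: share of surveillance planes that were flagged
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

So the model is rarely wrong when it flags a plane, but it misses 10 of the 30 surveillance planes in the test set.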


In [10]:
#look at the overall accuracy from the numbers in the table above
calculated_accuracy = (144+20)/(144+20+2+10)
calculated_accuracy

0.9318181818181818

In [11]:
#use scikit-learn's built-in scoring method, which reports accuracy for classifiers
spy_model.score(X_test, y_test)

0.9318181818181818
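To put this accuracy in context, it helps to compare it with a baseline that labels every plane as "other". The sketch below works only from the counts in the confusion matrix; the baseline is our own illustration and is not computed anywhere else in this notebook.

```python
# Test-set class counts, taken from the rows of the confusion matrix
n_other = 144 + 2
n_surveillance = 10 + 20
n_total = n_other + n_surveillance

# Accuracy of always predicting the majority class ("other")
baseline_accuracy = n_other / n_total

# Accuracy of the random forest, from the diagonal of the confusion matrix
model_accuracy = (144 + 20) / n_total

print(f"baseline = {baseline_accuracy:.3f}, model = {model_accuracy:.3f}")
```

The baseline already scores about 0.83, so the model's 0.93 is an improvement, though a smaller one than the raw number might suggest.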

In [12]:
#get the predicted probability of each class (column 0: other, column 1: surveillance)
predict_prob = spy_model.predict_proba(X_test)
predict_prob[0:10]

array([[0.93, 0.07],
       [0.87, 0.13],
       [1.  , 0.  ],
       [0.99, 0.01],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [0.95, 0.05],
       [0.39, 0.61],
       [0.18, 0.82],
       [0.99, 0.01]])
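These probabilities are what `predict` uses: a plane is labeled surveillance when its column 1 probability is the larger of the two. The sketch below copies the ten rows shown above and recovers the same labels that `predictions[0:10]` gave. It then applies a lower cutoff of 0.1, a hypothetical value chosen only to show how the threshold trades missed surveillance planes for more false alarms.

```python
import numpy as np

# First ten rows of predict_prob, copied from the output above
proba = np.array([[0.93, 0.07],
                  [0.87, 0.13],
                  [1.00, 0.00],
                  [0.99, 0.01],
                  [1.00, 0.00],
                  [1.00, 0.00],
                  [0.95, 0.05],
                  [0.39, 0.61],
                  [0.18, 0.82],
                  [0.99, 0.01]])

# Default rule: label 1 (surveillance) when its probability exceeds 0.5
default_labels = (proba[:, 1] > 0.5).astype(int)
print(default_labels)

# Hypothetical lower threshold: flag any plane with at least a 10% probability
lenient_labels = (proba[:, 1] >= 0.1).astype(int)
print(lenient_labels)
```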