# Finding surveillance planes using random forests

**The story:**

- https://www.buzzfeednews.com/article/peteraldhous/spies-in-the-skies
- https://www.buzzfeednews.com/article/peteraldhous/hidden-spy-planes
    
This story, done by Peter Aldhous at Buzzfeed News, involved training a machine learning algorithm to recognize government surveillance planes based on what their flight patterns look like.

**Topics:** Random Forests

**Datasets**

* **feds.csv:** Transponder codes of planes operated by the federal government
* **planes_features.csv:** various features describing each plane's flight patterns
* **train.csv:** a labeled dataset of transponder codes and whether each plane is a surveillance plane or not
    - The `label` column was originally `class`, but I renamed it because pandas freaks out a bit with a column named `class`
    - This was created by Buzzfeed `feds.csv`
* **data dictionary:** You can find the data dictionary published with their analysis [here](https://buzzfeednews.github.io/2016-04-federal-surveillance-planes/analysis.html)
* **a few other files**

## What's the goal?

The FBI and Department of Homeland Security operate many planes that are not directly labeled as belonging to the government. If we can uncover these planes, we have a better idea of the surveillance activities they are undertaking.

## Imports

Also set a large number of maximum columns.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

pd.set_option("display.max_columns", 200)
pd.set_option("display.max_colwidth", 200)

# Read in our data

Almost all classification problems start with a set of labeled features. In this case, the features are in one CSV file and the labels are in another. **Read both files in and merge them on `adshex`, the transpoder code.**

In [2]:
features = pd.read_csv('data/planes_features.csv')
features.head()

Unnamed: 0,adshex,duration1,duration2,duration3,duration4,duration5,boxes1,boxes2,boxes3,boxes4,boxes5,speed1,speed2,speed3,speed4,speed5,altitude1,altitude2,altitude3,altitude4,altitude5,steer1,steer2,steer3,steer4,steer5,steer6,steer7,steer8,flights,squawk_1,observations,type
0,A,0.120253,0.075949,0.183544,0.335443,0.28481,0.088608,0.044304,0.06962,0.120253,0.677215,0.021824,0.02055,0.06233,0.100713,0.794582,0.042374,0.060971,0.066831,0.106403,0.723421,0.020211,0.048913,0.27055,0.34409,0.097317,0.186651,0.011379,0.009426,158,0,11776,GRND
1,A00000,0.211735,0.155612,0.181122,0.19898,0.252551,0.204082,0.183673,0.168367,0.173469,0.267857,0.107348,0.14341,0.208139,0.177013,0.36409,0.177318,0.114457,0.129648,0.197694,0.380882,0.034976,0.048127,0.240732,0.356314,0.116116,0.159325,0.012828,0.013628,392,0,52465,TBM7
2,A00002,0.517241,0.103448,0.103448,0.103448,0.172414,0.862069,0.137931,0.0,0.0,0.0,0.990792,0.000921,0.0,0.0,0.008287,0.599448,0.400552,0.0,0.0,0.0,0.105893,0.090239,0.174954,0.244015,0.03407,0.202578,0.021179,0.06814,29,0,1086,SHIP
3,A00008,0.125,0.041667,0.208333,0.166667,0.458333,0.125,0.083333,0.125,0.166667,0.5,0.18796,0.278952,0.221048,0.190257,0.121783,0.014706,0.053309,0.149816,0.279871,0.502298,0.029871,0.044118,0.202665,0.380515,0.094669,0.182904,0.014706,0.020221,24,0,2176,PA46
4,A0001E,0.1,0.2,0.2,0.4,0.1,0.1,0.0,0.1,0.4,0.4,0.007937,0.026984,0.084127,0.179365,0.701587,0.04127,0.085714,0.039683,0.111111,0.722222,0.019048,0.049206,0.249206,0.326984,0.112698,0.206349,0.012698,0.011111,10,1135,630,C56X


In [3]:
labels = pd.read_csv('data/train.csv')
labels.head()

Unnamed: 0,adshex,label
0,A00C4B,surveil
1,A0AB21,surveil
2,A0AE77,surveil
3,A0AE7C,surveil
4,A0C462,surveil


In [4]:
adshex = pd.merge(features, labels, on='adshex', how='left')
adshex.head()

Unnamed: 0,adshex,duration1,duration2,duration3,duration4,duration5,boxes1,boxes2,boxes3,boxes4,boxes5,speed1,speed2,speed3,speed4,speed5,altitude1,altitude2,altitude3,altitude4,altitude5,steer1,steer2,steer3,steer4,steer5,steer6,steer7,steer8,flights,squawk_1,observations,type,label
0,A,0.120253,0.075949,0.183544,0.335443,0.28481,0.088608,0.044304,0.06962,0.120253,0.677215,0.021824,0.02055,0.06233,0.100713,0.794582,0.042374,0.060971,0.066831,0.106403,0.723421,0.020211,0.048913,0.27055,0.34409,0.097317,0.186651,0.011379,0.009426,158,0,11776,GRND,
1,A00000,0.211735,0.155612,0.181122,0.19898,0.252551,0.204082,0.183673,0.168367,0.173469,0.267857,0.107348,0.14341,0.208139,0.177013,0.36409,0.177318,0.114457,0.129648,0.197694,0.380882,0.034976,0.048127,0.240732,0.356314,0.116116,0.159325,0.012828,0.013628,392,0,52465,TBM7,
2,A00002,0.517241,0.103448,0.103448,0.103448,0.172414,0.862069,0.137931,0.0,0.0,0.0,0.990792,0.000921,0.0,0.0,0.008287,0.599448,0.400552,0.0,0.0,0.0,0.105893,0.090239,0.174954,0.244015,0.03407,0.202578,0.021179,0.06814,29,0,1086,SHIP,other
3,A00008,0.125,0.041667,0.208333,0.166667,0.458333,0.125,0.083333,0.125,0.166667,0.5,0.18796,0.278952,0.221048,0.190257,0.121783,0.014706,0.053309,0.149816,0.279871,0.502298,0.029871,0.044118,0.202665,0.380515,0.094669,0.182904,0.014706,0.020221,24,0,2176,PA46,
4,A0001E,0.1,0.2,0.2,0.4,0.1,0.1,0.0,0.1,0.4,0.4,0.007937,0.026984,0.084127,0.179365,0.701587,0.04127,0.085714,0.039683,0.111111,0.722222,0.019048,0.049206,0.249206,0.326984,0.112698,0.206349,0.012698,0.011111,10,1135,630,C56X,


### No wait, merge them again!

We have features for about 20,000 planes and labels for about 600 planes. When you merge, the planes you have features for but not labels for will disappear.

We want to keep those in the dataframe so we can play detective with them later, and try to find surveillance planes using the features. When you merge, you should use `how='left'` or `how='right'` to keep unmatched columns from the left (or right) dataframe.

Confirm you have 19,799 rows and 34 columns.

In [5]:
adshex.shape

(19799, 34)

# Cleaning up our data

## Number-izing our labels

Each row is a plane, and it's marked as either a surveillance plane or not. How many do we have in each category?

In [6]:
adshex.label.value_counts()

other      500
surveil     97
Name: label, dtype: int64

How do you feel about that split?

**Prepare this column for machine learning.** What's wrong with it as `"surveil"` and `"other"`? Add a new column that we can use for classification.

In [7]:
adshex['label'] = adshex.label.replace({'surveil': 1, 'other': 0})

## Categorical variables

Do we have any variables that count as categories? Yes, we do! ...but how many different categories does it have?

* **Tip:** You can use `.unique()` or `.value_counts()` to count unique items, depending on what you're looking for

In [8]:
adshex.type.unique()

array(['GRND', 'TBM7', 'SHIP', 'PA46', 'C56X', 'C82S', 'PC12', 'R66',
       'DA42', 'BE36', 'unknown', 'BE20', 'BE9T', 'B407', 'SR22', 'A139',
       'BE9L', 'B429', 'B350', 'TBM8', 'EPIC', 'GLF4', 'BE35', 'ALIG',
       'RV10', 'PAY1', 'C210', 'C172', 'C310', 'AEST', 'C182', 'RV6',
       'C208', 'BE55', 'B36T', 'AS55', 'BE58', 'C55B', 'BE60', 'C421',
       'MU2', 'PA31', 'ZZZZ', 'C340', 'BE33', 'C501', 'CRER', 'E50P',
       'BE10', 'B190', 'PA27', 'LNC2', 'LNC4', 'P210', 'M20P', 'C402',
       'SR20', 'PA24', 'C240', 'PA34', 'RV3', 'E45X', 'M20T', 'COL3',
       'AC11', 'C441', 'H500', 'LJ25', 'C185', 'COL4', 'P28B', 'T206',
       'C404', 'P28A', 'P32R', 'T210', 'BE80', 'AS50', 'WW24', 'B06',
       'GLF2', 'EC20', 'PA32', 'F2TH', 'P28R', 'H25B', 'C680', 'P180',
       'EA50', 'S22T', 'GLID', 'BE99', 'C180', 'P46T', 'LA4', 'DHC2',
       'DA40', 'KODI', 'PAY2', 'RV9', 'R44', 'C25A', 'RV7', 'A109',
       'SBR1', 'T33', 'C414', 'RV8', 'PA30', 'EVOT', 'EC30', 'SWAK',
       'EFOX',

Most of those types of plane only have one appearance, which means they wouldn't be very helpful identifiers in the final analysis. For example, if I only see one GLF5 and it's a surveillance plane, does that mean the next one I see is probably a surveillance plane? With such a small sample size, I have no idea!

We have a few options

1. Create a very large set of dummy variables out of all 133 types of planes
2. Create `0`/`1` columns for common plane types and ignore the less common ones -  C182, T206, SR22
3. Interview someone who knows something about planes and put these into a few broader categories
4. Keep them as one column, just turn them into numbers - it doesn't make sense in terms of order, but if one or two plane types are very indicative of a surveillance plane the forest might pick it up

Oddly enough, **the last one is a common approach.** Let's use it!

If you want to convert a list of categories into numbers, an easy way is to use the `Categorical` data type.

In [9]:
adshex.type = adshex.type.astype('category')
adshex.type.head()

0    GRND
1    TBM7
2    SHIP
3    PA46
4    C56X
Name: type, dtype: category
Categories (455, object): [208, A109, A119, A139, ..., WW24, XL2, ZZZZ, unknown]

It looks like a normal bunch of strings, but pandas is secretly using a number for each one! You can find the number with `.cat.codes`.

**Use `df.type.cat.codes` to make a new columns called `type_code`.** 

In [10]:
adshex['type_code'] = adshex.type.cat.codes

We'll use `type_code` for machine learning since sklearn needs a number, and `type` for reading since we like text.

# Building our classifier

When we're about to classify, we usually just drop our target column to build our inputs and outputs:

```python
X = train_df.drop(column='column_you_are_predicting')
y = train_df.column_you_are_predicting
```

This time is a little different. First, we have unlabeled data in there! Use `.dropna()` to filter your training data so we only have labeled data.

Confirm `train_df` has 597 rows and 35 columns.

In [11]:
train_df = adshex.dropna()

In [12]:
train_df.shape

(597, 35)

We also have a few extra columns that we aren't using for classification (like the text version of the type column and the transponder code). It's fine to drop multiple columns here that you aren't using, just a little bit messier. You also have to make sure you're dropping all the right ones.

Do a `.head()` to double-check all of the columns you need to drop when creating your `X`.

In [13]:
train_df = train_df.drop(['adshex', 'type'], axis=1)

In [14]:
train_df.head()

Unnamed: 0,duration1,duration2,duration3,duration4,duration5,boxes1,boxes2,boxes3,boxes4,boxes5,speed1,speed2,speed3,speed4,speed5,altitude1,altitude2,altitude3,altitude4,altitude5,steer1,steer2,steer3,steer4,steer5,steer6,steer7,steer8,flights,squawk_1,observations,label,type_code
2,0.517241,0.103448,0.103448,0.103448,0.172414,0.862069,0.137931,0.0,0.0,0.0,0.990792,0.000921,0.0,0.0,0.008287,0.599448,0.400552,0.0,0.0,0.0,0.105893,0.090239,0.174954,0.244015,0.03407,0.202578,0.021179,0.06814,29,0,1086,0.0,399
29,0.0,0.254902,0.176471,0.313725,0.254902,0.058824,0.372549,0.294118,0.215686,0.058824,0.05265,0.191676,0.419769,0.329805,0.006099,0.244327,0.235985,0.123329,0.307023,0.089335,0.034801,0.038389,0.263342,0.375998,0.13203,0.120011,0.008611,0.006906,51,0,11149,0.0,374
55,0.142857,0.285714,0.0,0.571429,0.0,0.285714,0.142857,0.285714,0.285714,0.0,0.00905,0.135747,0.426848,0.322775,0.105581,0.179487,0.067873,0.352941,0.399698,0.0,0.010558,0.00905,0.108597,0.657617,0.090498,0.078431,0.010558,0.019608,7,0,663,0.0,406
122,0.0,0.12,0.2,0.08,0.6,0.0,0.2,0.12,0.28,0.4,0.005785,0.059895,0.187936,0.628297,0.118087,0.058448,0.122426,0.068402,0.582865,0.167858,0.002978,0.009784,0.078782,0.814361,0.065339,0.023907,0.001276,0.001702,25,7760,11754,0.0,406
124,0.0,0.3,0.2,0.2,0.3,0.0,0.3,0.2,0.3,0.2,0.183099,0.380282,0.152113,0.21831,0.066197,0.108451,0.029577,0.314085,0.102817,0.44507,0.021127,0.01831,0.250704,0.43662,0.092958,0.14507,0.001408,0.009859,10,1200,710,0.0,342


### Create your `X` and `y`.

When you do `train_df.drop`, you'll want to remove more than just your `0`/`1` surveillance label. What other columns do you not want to use as input? Maybe some categories you converted into codes?

In [15]:
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std 

X = train_df.drop(columns='label')
y = train_df.label

Triple-check that `X` is a list of numeric features and and `y` is a numeric label.

In [16]:
X.dtypes

duration1       float64
duration2       float64
duration3       float64
duration4       float64
duration5       float64
boxes1          float64
boxes2          float64
boxes3          float64
boxes4          float64
boxes5          float64
speed1          float64
speed2          float64
speed3          float64
speed4          float64
speed5          float64
altitude1       float64
altitude2       float64
altitude3       float64
altitude4       float64
altitude5       float64
steer1          float64
steer2          float64
steer3          float64
steer4          float64
steer5          float64
steer6          float64
steer7          float64
steer8          float64
flights           int64
squawk_1          int64
observations      int64
type_code         int16
dtype: object

In [17]:
y.dtypes

dtype('float64')

### Split into test and train datasets

We could be nice and lazy and use all our data for training, but it just isn't right! Taking a test using the exact same questions you studied is just cheating. Split your data into test and train.

* **Tip:** Don't do this manually! There's a method for it in sklearn

In [18]:
from sklearn import tree
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

# Classify using a logistic classifier

## Train your classifier

Build a `LogisticRegression` and fit it to your data, making sure you're training using only `X_train` and `y_train`.

* **Tip:** You'll want to give `LogisticRegression` an extra argument of `max_iter=4000` - it means "work a little harder than you expect," because otherwise it won't find an answer (by default it only has a `max_iter` of 100)

In [19]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000)

clf.fit(X, y)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=4000, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Examine the coefficients

What does it mean? What features is the classifier using? Do you care about the odds ratio? **What is even the point of this `LogisticRegression` thing?**

In [20]:
import numpy as np

feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients)
}).sort_values(by='odds ratio', ascending=False)

Unnamed: 0,feature,coefficient (log odds ratio),odds ratio
10,speed1,0.625828,1.869794
21,steer2,0.520862,1.683478
17,altitude3,0.436915,1.547924
5,boxes1,0.386663,1.472061
20,steer1,0.329899,1.390827
6,boxes2,0.314007,1.368899
0,duration1,0.111638,1.118108
27,steer8,0.006517,1.006539
29,squawk_1,0.000702,1.000702
31,type_code,0.000409,1.000409


## How well does our classifier perform?

Let's take a look at the confusion matrix to see how well this classifier finds surveillance planes.

```python
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not surveil', 'surveil'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
```

Notice we're using `y_test` and `X_test`, not the full dataset.

In [21]:
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not surveil', 'surveil'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted not surveil,Predicted surveil
Is not surveil,119,7
Is surveil,8,16


Why do we use `y_test` and `X_test` instead of the full dataset?

In [22]:
# to make sure the training data is good for our model

# Classify using a decision tree

Now we'll use a decision tree. This is how you make one:

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
```

But it's up to you to teach it what spy planes look like using your training data.

If we use `max_depth=` to limit the depth of the tree, it will help us visualize it. For example, `max_depth=5` will only allow the tree to make five decisions.

Make a decision tree and fit it to your data. Use a `max_depth=` of something between 2 to 5.

In [23]:
from sklearn.tree import DecisionTreeClassifier

clf = tree.DecisionTreeClassifier(max_depth=5)
clf = clf.fit(X_train, y_train)
clf

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

## What are the important features?

This code is slightly different than feature importance for logistic regression. It looks like this:

```python
feature_names = X.columns
importances = clf.feature_importances_

pd.DataFrame({
    'feature': feature_names,
    'feature importance': importances,
}).sort_values(by='feature importance', ascending=False)
```

In [24]:
feature_names = X.columns
importances = clf.feature_importances_

pd.DataFrame({
    'feature': feature_names,
    'feature importance': importances,
}).sort_values(by='feature importance', ascending=False)

Unnamed: 0,feature,feature importance
21,steer2,0.713024
29,squawk_1,0.103076
6,boxes2,0.04495
0,duration1,0.037769
3,duration4,0.032595
26,steer7,0.024099
1,duration2,0.023992
20,steer1,0.012684
11,speed2,0.00781
22,steer3,0.0


### Understanding the output

**Why is the feature importance difference than for logistic regression?**

Also, if you don't specify a `max_depth`, that's a LOT of zeroes! It doesn't even use most of the features! **Why not?**

## How well does the tree perform?

Display another confusion matrix with your new classifier.

In [25]:
y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not surveil', 'surveil'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted not surveil,Predicted surveil
Is not surveil,121,5
Is surveil,5,19


## Visualize the tree

You can use this code to visualize the tree. You might need to `brew install graphviz` and `pip install graphviz`.

```python
from sklearn import tree
import graphviz

label_names = ['not surveillance', 'surveillance']
feature_names = X.columns

dot_data = tree.export_graphviz(clf,
                    feature_names=feature_names,  
                    filled=True,
                    class_names=label_names)  
graph = graphviz.Source(dot_data)  
graph
```

* **Tip:** You'll probably need to scroll sideways a bit

In [26]:
from sklearn import tree
import graphviz

label_names = ['not surveillance', 'surveillance']
feature_names = X.columns

dot_data = tree.export_graphviz(clf,
                    feature_names=feature_names,  
                    filled=True,
                    class_names=label_names)  
graph = graphviz.Source(dot_data)  
graph

ModuleNotFoundError: No module named 'graphviz'

# One more classifier: Random forest

## Build and train your classifier

We can build a random forest classifier like this:

```python
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
```

But you're in charge of fitting it to your training data!

* **Tip:** You can also set `max_depth` here, but you won't be able to visualize the result.
* **Tip:** Increase `n_estimators` to 100 to make a better classifier.

In [27]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 100, max_depth=5)
model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=5, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

## What are the important features?

In [28]:
feature_names = X.columns
importances = clf.feature_importances_

pd.DataFrame({
    'feature': feature_names,
    'feature importance': importances,
}).sort_values(by='feature importance', ascending=False)

Unnamed: 0,feature,feature importance
21,steer2,0.713024
29,squawk_1,0.103076
6,boxes2,0.04495
0,duration1,0.037769
3,duration4,0.032595
26,steer7,0.024099
1,duration2,0.023992
20,steer1,0.012684
11,speed2,0.00781
22,steer3,0.0


### Understanding the output

What is a random forest, and **why is the feature importance difference than for the decision tree?** Isn't a random forest just like a decision tree or something?

In [29]:
# a random forest is different trees which have their own results

## How well does it perform?

In [30]:
from sklearn import metrics

y_pred=model.predict(X_test)

print('Accuracy:', metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.94


### How confident do you feel in the model?

In [31]:
# not too confident, looks like it's overfitted

# Actually finding spy planes

Now let's try ot actually find our spy planes

## Retrain our model

When we did test/train split, we trained our model with only a subset of our data, so we could test with the rest. Now that we're working in the "real world" we want to re-train it using not just `_train` and `_test` data, but instead **everything we have labels for.**

In [32]:
clf = RandomForestClassifier(n_estimators = 100, max_depth=5)
clf = clf.fit(X, y)

## Filter for planes we want to predict

We have a dataframe of features that includes three types of planes:

* Those that are labeled as surveillance planes
* Those that are labeled as not surveillance
* Those that aren't labeled

Which do we want to predictions for? **Filter a new dataframe that's just those.**

* **Tip:** Scroll up to see where you created your `train_df`, it's the opposite!

In [33]:
df2 = adshex[adshex.label.isnull()]

How many planes do you have in that list? **Confirm it's about 19,200.**

In [34]:
df2.shape

(19202, 35)

## Predicting 

Build your `X` - remember you need to drop a few columns - and use that to make a prediction for each plane.

**Assign the prediction into the `predicted` column**.

* **Tip:** Scroll up to see where you created your features for training, it's similar
* **Tip:** pandas will yell at us about setting values on copies of a slice but it's fine

In [35]:
X = df2.drop(columns=['label', 'adshex', 'type'], axis=1)

## How many planes did it predict to be surveillance planes?

It should be roughly around 70-80 planes.

In [36]:
X['predicted'] = clf.predict(X)

In [37]:
X[X['predicted'] == 1].shape

(83, 33)

## But.. what about those other ones? The ones that are just below the threshold?

The cutoff for a prediction of `1` is 50%, but since we have a lot of time we're interested in investigating the top 150. To get the probability for each row, you will use `clf.predict_proba` instead of `clf.predict`. Also, to get the predicted probability for the `1` category, you'll need to add `[:,1]` to the end of the

```python
clf.predict_proba(***your features***)[:,1]
```

**Create a new column called `predicted_prob` that is the chance that the plane is a surveillance plane.**

* **Tip:** You dropped three columns when using `clf.predict`, but if you drop the same three you'll get an error now. There's now an extra column that you'll need to drop! What is it?

In [38]:
X = X.drop(columns=['predicted'], axis=1)

In [39]:
X['predicted_prob'] = clf.predict_proba(X)[:,1]

In [40]:
X.predicted_prob.value_counts(ascending=False)

0.003113    31
0.002867    19
0.002509    16
0.003059    16
0.002450    15
0.003172    12
0.002807    12
0.002701    11
0.002367    11
0.003030    11
0.002861    11
0.003835    10
0.003226    10
0.002813    10
0.002921    10
0.002396    10
0.002926    10
0.002975    10
0.003969    10
0.002784     9
0.002778     9
0.002936     8
0.003478     8
0.002563     7
0.002455     7
0.012269     7
0.003687     7
0.002743     6
0.003344     6
0.003435     6
            ..
0.011076     1
0.058339     1
0.009649     1
0.003983     1
0.005183     1
0.023140     1
0.003531     1
0.007706     1
0.018341     1
0.008745     1
0.008177     1
0.003220     1
0.002873     1
0.005570     1
0.021157     1
0.027441     1
0.116787     1
0.024743     1
0.006620     1
0.005325     1
0.008017     1
0.066953     1
0.008244     1
0.051763     1
0.020278     1
0.018994     1
0.079202     1
0.027115     1
0.186866     1
0.066304     1
Name: predicted_prob, Length: 18334, dtype: int64

### Get the top 200 predictions

Take a look at what the probabilities look like, showing the top 200 planes that are **most likely to be surveillance planes.**

Then save them to a file for later research.

In [41]:
df3 = X.sort_values(by='predicted_prob', ascending=False).head(200)

In [None]:
df3.to_csv('spy_planes', index=False)

# Questions

### Question 1

What kind of machine learning are we doing here, and why are we doing it?

In [None]:
# random forest to if the planes in the dataset are surveillance planes or they're not

### Question 2

What are a few different ways you can deal with categorical data? Think about how we dealt with race in the reveal regression compared to how we dealt with type in this dataset.

In [None]:
# logistic regression or decision tree 

### Question 3

Every time we ran a machine learning algorithm on our dataset, we looked at feature importance.

* When might it be important to explain what our model found important?
* When might it not be important?

In [None]:
# important if testing the values increases the model error - it means the model relied on the feature for the prediction
# not important if the model error remains unchanged - it means the model ignored the feature for the prediction

### Question 4

Using words and not column names, describe what the machine learning algorithm found to be important when identifying surveillance planes.

In [None]:
# the 4-digit code transmitted by the transponder, compass bearing, duration of flight

### Question 5

Why did we use test/train split when it would have been more effective to give our model all of the data from the start?

In [None]:
# when we don't have test data we can't be sure that our model is well doing (after it's created) - to test the model we need "unused" test data

### Question 6

Why did we use a random forest instead of a decision tree or logistic regression? Was there something about the data?

In [None]:
# categorical features = random forest performs better than logistic regression
# continuous variables = logistic regression 

### Question 7

Why did we use probability instead of just looking for planes with a predicted value of 1? It seems like we should have just trusted the algorithm, right?

In [None]:
# in the dataset only a small number of planes had a predicted value of 1. 
# probability helps find more (the rest)

### Question 8

What if our random forest or input dataset were flawed? What would be the repercussions?

In [None]:
# inaccurate... accuracy

### Question 9

The government could claim that we're threatening national security by publishing this paper as well as publishing this code - now anyone could look for planes that are surveilling them. What do you think?

In [None]:
# people need to know if they live in a surveillance state

### Question 10

We're using data from the past, but you can get real-time flight data from many services. Can you think of any uses for this algorithm using real-time instead of historical data?

### Question 11

This isn't a question, but if you look at `candidates.csv` and `candidates-annotates.csv` you can see how Buzzfeed did their research after finding a list of suspicious planes.