# Finding surveillance planes using classification

**The story:**

- https://www.buzzfeednews.com/article/peteraldhous/spies-in-the-skies
- https://www.buzzfeednews.com/article/peteraldhous/hidden-spy-planes
    
This story, done by Peter Aldhous at Buzzfeed News, involved training a machine learning algorithm to recognize government surveillance planes based on what their flight patterns look like.

**Datasets**

* **feds.csv:** Transponder codes of planes operated by the federal government
* **planes_features.csv:** various features describing each plane's flight patterns
* **train.csv:** a labeled dataset of transponder codes and whether each plane is a surveillance plane or not
    - The `label` column was originally `class`, but I renamed it because pandas freaks out a bit with a column named `class`
    - This was created by Buzzfeed using `feds.csv`
* **data dictionary:** You can find the data dictionary published with their analysis [here](https://buzzfeednews.github.io/2016-04-federal-surveillance-planes/analysis.html)
* **a few other files**

## What's the goal?

The FBI and Department of Homeland Security operate many planes that are not directly labeled as belonging to the government. If we can uncover these planes, we have a better idea of the surveillance activities they are undertaking.

## Imports

Also set a large number of maximum columns.

In [12]:
import pandas as pd

pd.set_option("display.max_columns", 100)

# Read in our data

Almost all classification problems start with a set of labeled features. In this case, the features are in one CSV file and the labels are in another. **Read both files in and merge them on `adshex`, the transponder code.**

In [13]:
# Read in your features
features = pd.read_csv("planes_features.csv")
features.head()

Unnamed: 0,adshex,duration1,duration2,duration3,duration4,duration5,boxes1,boxes2,boxes3,boxes4,boxes5,speed1,speed2,speed3,speed4,speed5,altitude1,altitude2,altitude3,altitude4,altitude5,steer1,steer2,steer3,steer4,steer5,steer6,steer7,steer8,flights,squawk_1,observations,type
0,A,0.120253,0.075949,0.183544,0.335443,0.28481,0.088608,0.044304,0.06962,0.120253,0.677215,0.021824,0.02055,0.06233,0.100713,0.794582,0.042374,0.060971,0.066831,0.106403,0.723421,0.020211,0.048913,0.27055,0.34409,0.097317,0.186651,0.011379,0.009426,158,0,11776,GRND
1,A00000,0.211735,0.155612,0.181122,0.19898,0.252551,0.204082,0.183673,0.168367,0.173469,0.267857,0.107348,0.14341,0.208139,0.177013,0.36409,0.177318,0.114457,0.129648,0.197694,0.380882,0.034976,0.048127,0.240732,0.356314,0.116116,0.159325,0.012828,0.013628,392,0,52465,TBM7
2,A00002,0.517241,0.103448,0.103448,0.103448,0.172414,0.862069,0.137931,0.0,0.0,0.0,0.990792,0.000921,0.0,0.0,0.008287,0.599448,0.400552,0.0,0.0,0.0,0.105893,0.090239,0.174954,0.244015,0.03407,0.202578,0.021179,0.06814,29,0,1086,SHIP
3,A00008,0.125,0.041667,0.208333,0.166667,0.458333,0.125,0.083333,0.125,0.166667,0.5,0.18796,0.278952,0.221048,0.190257,0.121783,0.014706,0.053309,0.149816,0.279871,0.502298,0.029871,0.044118,0.202665,0.380515,0.094669,0.182904,0.014706,0.020221,24,0,2176,PA46
4,A0001E,0.1,0.2,0.2,0.4,0.1,0.1,0.0,0.1,0.4,0.4,0.007937,0.026984,0.084127,0.179365,0.701587,0.04127,0.085714,0.039683,0.111111,0.722222,0.019048,0.049206,0.249206,0.326984,0.112698,0.206349,0.012698,0.011111,10,1135,630,C56X


In [14]:
# Read in your labels
labeled = pd.read_csv("train.csv").rename(columns={'class': 'label'})
labeled.head()

Unnamed: 0,adshex,label
0,A00C4B,surveil
1,A0AB21,surveil
2,A0AE77,surveil
3,A0AE7C,surveil
4,A0C462,surveil


In [15]:
# We're merging with how='right' to keep the rows that do NOT have a match in the training dataset
df = labeled.merge(features, on='adshex', how='right')
df.head()

Unnamed: 0,adshex,label,duration1,duration2,duration3,duration4,duration5,boxes1,boxes2,boxes3,boxes4,boxes5,speed1,speed2,speed3,speed4,speed5,altitude1,altitude2,altitude3,altitude4,altitude5,steer1,steer2,steer3,steer4,steer5,steer6,steer7,steer8,flights,squawk_1,observations,type
0,A,,0.120253,0.075949,0.183544,0.335443,0.28481,0.088608,0.044304,0.06962,0.120253,0.677215,0.021824,0.02055,0.06233,0.100713,0.794582,0.042374,0.060971,0.066831,0.106403,0.723421,0.020211,0.048913,0.27055,0.34409,0.097317,0.186651,0.011379,0.009426,158,0,11776,GRND
1,A00000,,0.211735,0.155612,0.181122,0.19898,0.252551,0.204082,0.183673,0.168367,0.173469,0.267857,0.107348,0.14341,0.208139,0.177013,0.36409,0.177318,0.114457,0.129648,0.197694,0.380882,0.034976,0.048127,0.240732,0.356314,0.116116,0.159325,0.012828,0.013628,392,0,52465,TBM7
2,A00002,other,0.517241,0.103448,0.103448,0.103448,0.172414,0.862069,0.137931,0.0,0.0,0.0,0.990792,0.000921,0.0,0.0,0.008287,0.599448,0.400552,0.0,0.0,0.0,0.105893,0.090239,0.174954,0.244015,0.03407,0.202578,0.021179,0.06814,29,0,1086,SHIP
3,A00008,,0.125,0.041667,0.208333,0.166667,0.458333,0.125,0.083333,0.125,0.166667,0.5,0.18796,0.278952,0.221048,0.190257,0.121783,0.014706,0.053309,0.149816,0.279871,0.502298,0.029871,0.044118,0.202665,0.380515,0.094669,0.182904,0.014706,0.020221,24,0,2176,PA46
4,A0001E,,0.1,0.2,0.2,0.4,0.1,0.1,0.0,0.1,0.4,0.4,0.007937,0.026984,0.084127,0.179365,0.701587,0.04127,0.085714,0.039683,0.111111,0.722222,0.019048,0.049206,0.249206,0.326984,0.112698,0.206349,0.012698,0.011111,10,1135,630,C56X


Confirm you have 19,799 rows and 34 columns.

In [16]:
df.shape

(19799, 34)
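
To see why `how='right'` matters, here is a minimal sketch on made-up data. The transponder codes and the single `flights` column below are invented for illustration and are not taken from the real files. A right merge keeps every row of the features table and leaves `label` as `NaN` wherever there's no match in the labels.

```python
import pandas as pd

# Toy stand-ins for train.csv and planes_features.csv (all values are invented)
toy_labeled = pd.DataFrame({
    "adshex": ["AAA111", "BBB222"],
    "label": ["surveil", "other"],
})
toy_features = pd.DataFrame({
    "adshex": ["AAA111", "BBB222", "CCC333"],
    "flights": [10, 25, 7],
})

# how='right' keeps all three feature rows; CCC333 has no label, so it gets NaN
toy_merged = toy_labeled.merge(toy_features, on="adshex", how="right")
print(toy_merged)

# An inner merge would silently drop the unlabeled plane
toy_inner = toy_labeled.merge(toy_features, on="adshex", how="inner")
print(len(toy_merged), len(toy_inner))
```

The unlabeled rows are exactly the ones we want to make predictions for later, which is why we can't afford to lose them here.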

# Cleaning up our data

## Number-izing our labels

Each row is a plane, and it's marked as either a surveillance plane, not a surveillance plane, or missing a label. How many do we have in each category?

In [17]:
df.label.value_counts(dropna=False)

NaN        19202
other        500
surveil       97
Name: label, dtype: int64

How do you feel about that split?

What is the difference between `other` and `NaN`?

**Prepare this column for machine learning.** What's wrong with it as `"surveil"` and `"other"`? Add a new column that we can use for classification.

In [18]:
df['is_surveil'] = (df.label.dropna() == 'surveil').astype(int)
df.head()

Unnamed: 0,adshex,label,duration1,duration2,duration3,duration4,duration5,boxes1,boxes2,boxes3,boxes4,boxes5,speed1,speed2,speed3,speed4,speed5,altitude1,altitude2,altitude3,altitude4,altitude5,steer1,steer2,steer3,steer4,steer5,steer6,steer7,steer8,flights,squawk_1,observations,type,is_surveil
0,A,,0.120253,0.075949,0.183544,0.335443,0.28481,0.088608,0.044304,0.06962,0.120253,0.677215,0.021824,0.02055,0.06233,0.100713,0.794582,0.042374,0.060971,0.066831,0.106403,0.723421,0.020211,0.048913,0.27055,0.34409,0.097317,0.186651,0.011379,0.009426,158,0,11776,GRND,
1,A00000,,0.211735,0.155612,0.181122,0.19898,0.252551,0.204082,0.183673,0.168367,0.173469,0.267857,0.107348,0.14341,0.208139,0.177013,0.36409,0.177318,0.114457,0.129648,0.197694,0.380882,0.034976,0.048127,0.240732,0.356314,0.116116,0.159325,0.012828,0.013628,392,0,52465,TBM7,
2,A00002,other,0.517241,0.103448,0.103448,0.103448,0.172414,0.862069,0.137931,0.0,0.0,0.0,0.990792,0.000921,0.0,0.0,0.008287,0.599448,0.400552,0.0,0.0,0.0,0.105893,0.090239,0.174954,0.244015,0.03407,0.202578,0.021179,0.06814,29,0,1086,SHIP,0.0
3,A00008,,0.125,0.041667,0.208333,0.166667,0.458333,0.125,0.083333,0.125,0.166667,0.5,0.18796,0.278952,0.221048,0.190257,0.121783,0.014706,0.053309,0.149816,0.279871,0.502298,0.029871,0.044118,0.202665,0.380515,0.094669,0.182904,0.014706,0.020221,24,0,2176,PA46,
4,A0001E,,0.1,0.2,0.2,0.4,0.1,0.1,0.0,0.1,0.4,0.4,0.007937,0.026984,0.084127,0.179365,0.701587,0.04127,0.085714,0.039683,0.111111,0.722222,0.019048,0.049206,0.249206,0.326984,0.112698,0.206349,0.012698,0.011111,10,1135,630,C56X,
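
The cell above leans on pandas index alignment to keep unlabeled planes as `NaN` instead of turning them into `0`. A small sketch on an invented four-row frame (the `toy` and `naive` names are just for illustration) shows the difference:

```python
import pandas as pd

toy = pd.DataFrame({"label": ["surveil", None, "other", None]})

# Compare only the non-missing labels, then assign back. pandas lines the
# result up by index, so the rows removed by dropna() stay NaN.
toy["is_surveil"] = (toy.label.dropna() == "surveil").astype(int)
print(toy)

# Without dropna(), a missing label compares as False and becomes 0,
# which would wrongly mark unlabeled planes as "not surveillance".
naive = (toy.label == "surveil").astype(int)
print(naive.tolist())
```

The `NaN` values are also why the new column shows up as `0.0` and not `0`: a column with missing values gets stored as floats.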


# Building our classifier

When we're about to classify, we usually just drop our target column to build our inputs and outputs:

```python
X = train_df.drop(columns='column_you_are_predicting')
y = train_df.column_you_are_predicting
```

This time is a little different. First, we have unlabeled data in there! Use `.dropna()` to filter your training data so we only have labeled data.

In [19]:
train_df = df.dropna()
train_df.shape

(597, 35)
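
A plain `.dropna()` removes a row if *any* column is missing. That works here, since the 597 rows match the 500 + 97 labeled planes we counted earlier, but on messier data it could throw away labeled rows too. A slightly more defensive sketch, using an invented toy frame, only looks at the label column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "adshex": ["AAA111", "BBB222", "CCC333", "DDD444"],
    "label": ["surveil", None, "other", "other"],
    "speed1": [0.2, 0.4, np.nan, 0.1],  # one labeled row is missing a feature
})

# Plain dropna() removes every row with a NaN anywhere
all_columns = toy.dropna()
print(all_columns.shape)

# subset= only checks the label column, so CCC333 survives
label_only = toy.dropna(subset=["label"])
print(label_only.shape)
```

Whether you'd want to keep a labeled row with a missing feature is a separate decision, since most scikit-learn classifiers can't train on `NaN` inputs.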

We also have a few extra columns that we aren't using for classification (like the text `type` column and the transponder code). **We don't need to remove them now:** it's fine to drop several unused columns at once when you build your `X`, it's just a little bit messier. You also have to make sure you're dropping all the right ones.

Do a `.head()` to double-check all of the columns you need to drop when creating your `X`.

In [20]:
df.head(2)

Unnamed: 0,adshex,label,duration1,duration2,duration3,duration4,duration5,boxes1,boxes2,boxes3,boxes4,boxes5,speed1,speed2,speed3,speed4,speed5,altitude1,altitude2,altitude3,altitude4,altitude5,steer1,steer2,steer3,steer4,steer5,steer6,steer7,steer8,flights,squawk_1,observations,type,is_surveil
0,A,,0.120253,0.075949,0.183544,0.335443,0.28481,0.088608,0.044304,0.06962,0.120253,0.677215,0.021824,0.02055,0.06233,0.100713,0.794582,0.042374,0.060971,0.066831,0.106403,0.723421,0.020211,0.048913,0.27055,0.34409,0.097317,0.186651,0.011379,0.009426,158,0,11776,GRND,
1,A00000,,0.211735,0.155612,0.181122,0.19898,0.252551,0.204082,0.183673,0.168367,0.173469,0.267857,0.107348,0.14341,0.208139,0.177013,0.36409,0.177318,0.114457,0.129648,0.197694,0.380882,0.034976,0.048127,0.240732,0.356314,0.116116,0.159325,0.012828,0.013628,392,0,52465,TBM7,


### Create your `X` and `y`.

When you do `train_df.drop`, you'll want to remove more than just your `0`/`1` surveillance label. What other columns do you not want to use as input? Maybe some categories you converted into codes?

In [21]:
X = train_df.drop(columns=['adshex', 'label', 'type', 'is_surveil'])
y = train_df.is_surveil

Triple-check that `X` is a list of numeric features and `y` is a numeric label.

In [22]:
X.head(2)

Unnamed: 0,duration1,duration2,duration3,duration4,duration5,boxes1,boxes2,boxes3,boxes4,boxes5,speed1,speed2,speed3,speed4,speed5,altitude1,altitude2,altitude3,altitude4,altitude5,steer1,steer2,steer3,steer4,steer5,steer6,steer7,steer8,flights,squawk_1,observations
2,0.517241,0.103448,0.103448,0.103448,0.172414,0.862069,0.137931,0.0,0.0,0.0,0.990792,0.000921,0.0,0.0,0.008287,0.599448,0.400552,0.0,0.0,0.0,0.105893,0.090239,0.174954,0.244015,0.03407,0.202578,0.021179,0.06814,29,0,1086
29,0.0,0.254902,0.176471,0.313725,0.254902,0.058824,0.372549,0.294118,0.215686,0.058824,0.05265,0.191676,0.419769,0.329805,0.006099,0.244327,0.235985,0.123329,0.307023,0.089335,0.034801,0.038389,0.263342,0.375998,0.13203,0.120011,0.008611,0.006906,51,0,11149


In [23]:
y.head(2)

2     0.0
29    0.0
Name: is_surveil, dtype: float64

### Split into test and train datasets

We could be nice and lazy and use all our data for training, but it just isn't right! Taking a test using the exact same questions you studied is just cheating. Split your data into test and train.

In [25]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
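
Only 97 of our 597 labeled planes are surveillance planes, so a purely random split can hand the test set noticeably more or fewer of them than the training set. One option, sketched below on synthetic data with the same class balance, is to pass `stratify=` so both pieces keep roughly the same share, and `random_state=` so the split is repeatable. The feature values here are random noise and the `toy_` names are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class balance as the labeled planes:
# 500 "other" (0) and 97 "surveil" (1)
rng = np.random.default_rng(0)
toy_X = rng.random((597, 3))
toy_y = np.array([0] * 500 + [1] * 97)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, stratify=toy_y, random_state=42
)

# The share of surveillance planes in each piece should be close to 97 / 597
print(toy_y.mean(), toy_y_train.mean(), toy_y_test.mean())
print(len(toy_y_train), len(toy_y_test))
```

The scores and confusion matrices in this notebook came from the unstratified split above, so your numbers will shift a little if you switch.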

# Classify using a logistic classifier

## Train your classifier

Build a `LogisticRegression` and fit it to your data, making sure you're training using only `X_train` and `y_train`.

* **Tip:** You'll want to give `LogisticRegression` an extra argument of `max_iter=4000` - it means "work a little harder than you expect," because otherwise it won't find an answer (by default it only has a `max_iter` of 100)

In [47]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000)
clf.fit(X_train, y_train)

LogisticRegression(C=1000000000.0, max_iter=4000)

In [48]:
clf.score(X_test, y_test)

0.9466666666666667

In [49]:
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not surveil', 'surveil'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted not surveil,Predicted surveil
Is not surveil,125,6
Is surveil,2,17
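
The score of about 0.947 hides how the mistakes are distributed. Working directly from the four numbers in the matrix above, we can compute how many real surveillance planes were found (recall) and how many "surveil" predictions were right (precision). This is just arithmetic on the printed table, not a re-run of the model:

```python
import numpy as np

# Rows are the true class, columns are the predicted class, as in the table above
logit_matrix = np.array([[125, 6],
                         [2, 17]])
tn, fp, fn, tp = logit_matrix.ravel()

accuracy = (tp + tn) / logit_matrix.sum()  # share of all planes classified correctly
recall = tp / (tp + fn)                    # share of real surveillance planes we found
precision = tp / (tp + fp)                 # share of "surveil" predictions that were right

print(round(accuracy, 3), round(recall, 3), round(precision, 3))
```

So this classifier catches 17 of the 19 surveillance planes in the test set, but 6 of its 23 "surveil" predictions are false alarms.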


## Examine the coefficients

What does it mean? What features is the classifier using? Do you care about the odds ratio? **What is even the point of this `LogisticRegression` thing?**

In [50]:
import numpy as np

feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients)
}).sort_values(by='odds ratio', ascending=False)

Unnamed: 0,feature,coefficient (log odds ratio),odds ratio
21,steer2,3.387038,29.578218
17,altitude3,2.841694,17.144791
5,boxes1,2.253505,9.521045
19,altitude5,2.077599,7.985275
20,steer1,1.937424,6.94085
0,duration1,1.759796,5.811253
6,boxes2,1.465808,4.33104
18,altitude4,1.347521,3.847876
4,duration5,0.769442,2.158561
11,speed2,0.550424,1.733989


If we don't care about the odds ratio, using the `eli5` package can shrink our code by a lot (and give us color!)

## How well does our classifier perform?

Let's take a look at the confusion matrix to see how well this classifier finds surveillance planes. Make sure you're using `y_test` and `X_test`, not the full dataset.

# Classify using a decision tree

Now we'll use a decision tree. This is how you make one:

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
```

But it's up to you to teach it what spy planes look like using your training data.

If we use `max_depth=` to limit the depth of the tree, it will help us visualize it. For example, `max_depth=5` means no path through the tree can make more than five decisions in a row.

Make a decision tree and fit it to your data. Use a `max_depth=` of something between 2 to 5.
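
If you'd like to see what `max_depth` actually does before using it on the planes, here is a small sketch on synthetic data from `make_classification` (a stand-in, not the plane features). It compares a capped tree against one that's allowed to grow as far as it wants:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the plane features
toy_X, toy_y = make_classification(n_samples=500, n_features=10, random_state=0)

shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(toy_X, toy_y)
unlimited = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# get_depth() is the longest chain of decisions from the root to a leaf,
# get_n_leaves() is how many final buckets the tree ends up with
print(shallow.get_depth(), shallow.get_n_leaves())
print(unlimited.get_depth(), unlimited.get_n_leaves())
```

A tree capped at depth 3 can have at most 2 × 2 × 2 = 8 leaves, which is about as much as fits comfortably in a picture.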

In [58]:
# from sklearn.linear_model import LogisticRegression

# clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000)
# clf.fit(X, y)

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=5)
clf.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=5)

In [59]:
clf.score(X_test, y_test)

0.9733333333333334

In [60]:
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not surveil', 'surveil'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted not surveil,Predicted surveil
Is not surveil,127,4
Is surveil,0,19


In [64]:
!pip install eli5
!brew install graphviz
import eli5


In [65]:
feature_names=list(X.columns)
label_names = ['not surveillance', 'surveillance']
eli5.show_weights(clf, feature_names=feature_names, target_names=label_names, show=['decision_tree'])


In [55]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=5)
clf.fit(X_train, y_train)

RandomForestClassifier(max_depth=5)

In [56]:
clf.score(X_test, y_test)

0.98

In [57]:
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not surveil', 'surveil'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted not surveil,Predicted surveil
Is not surveil,130,1
Is surveil,2,17


## What are the important features?

We'll use slightly different code for a decision tree, as it likes to draw big pictures if we don't stop it. The code looks like this:

```python
import eli5

feature_names=list(X.columns)
eli5.show_weights(clf, feature_names=feature_names, show=['description', 'feature_importances'])
```

### Understanding the output

**Why is the feature importance different from what we saw with logistic regression?**

Also, if you don't specify a `max_depth`, that's a LOT of zeroes! It doesn't even use most of the features! **Why not?**
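
If `eli5` gives you trouble, the same numbers live on the fitted tree as `feature_importances_`. The sketch below uses synthetic data and made-up column names (`feature_0` and so on) and caps the tree at depth 2 so the reason for the zeroes is easy to see: a feature the tree never splits on gets an importance of exactly 0.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

toy_X, toy_y = make_classification(
    n_samples=300, n_features=8, n_informative=2, n_redundant=0, random_state=0
)
toy_names = [f"feature_{i}" for i in range(8)]  # hypothetical column names

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(toy_X, toy_y)

# One importance per feature; they add up to 1 across the whole tree
importances = pd.Series(toy_tree.feature_importances_, index=toy_names)
print(importances.sort_values(ascending=False))
```

A depth-2 tree makes at most three splits, so at most three of the eight features can have a non-zero importance.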

## How well does the tree perform?

Display another confusion matrix with your new classifier.

## Visualize the tree

You can use `eli5` to visualize the decision tree itself! It usually takes up too much space, but since it's a special occasion we'll let it go.

If you'd like your graph to have colors, or to not use eli5, you can do it the old-fashioned way. You might need to `brew install graphviz` and `pip install graphviz`.

```python
from sklearn import tree
import graphviz

label_names = ['not surveillance', 'surveillance']
feature_names = list(X.columns)

dot_data = tree.export_graphviz(clf,
                    feature_names=feature_names,  
                    filled=True,
                    class_names=label_names)  
graph = graphviz.Source(dot_data)  
graph
```

* **Tip:** You'll probably need to scroll sideways a bit

# One more classifier: Random forest

## Build and train your classifier

We can build a random forest classifier like this:

```python
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
```

But you're in charge of fitting it to your training data!

* **Tip:** You can also set `max_depth` here, but you won't be able to visualize the result.
* **Tip:** Increase `n_estimators` to 100 to make a better classifier.

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=5)
clf.fit(X_train, y_train)

## What are the important features?

### Understanding the output

What is a random forest, and **why is the feature importance difference than for the decision tree?** Isn't a random forest just like a decision tree or something?
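
One way to poke at that question: a fitted forest keeps its individual trees in `estimators_`, and each one has its own feature importances. The sketch below, on synthetic data and not the planes, checks that the forest's importances match the average across its trees. That matches how current scikit-learn versions compute it, as far as I know, but treat it as an observation about this example, not a guarantee.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

toy_X, toy_y = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
forest.fit(toy_X, toy_y)

# Each tree in the forest has its own opinion about which features matter
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
print(per_tree.shape)

# Averaging those opinions reproduces the forest's feature importances
matches = np.allclose(per_tree.mean(axis=0), forest.feature_importances_)
print(matches)

# Count the zero-importance features in the first five trees vs. the forest
print((per_tree == 0).sum(axis=1)[:5], (forest.feature_importances_ == 0).sum())
```

Because every tree sees a different random sample of rows and considers a random subset of features at each split, a feature ignored by one tree usually gets picked up by another.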

## How well does it perform?

### How confident do you feel in the model?

# Actually finding spy planes

Now let's try to actually find our spy planes.

## Retrain our model

When we did test/train split, we trained our model with only a subset of our data, so we could test with the rest. Now that we're working in the "real world" we want to re-train it using not just `_train` and `_test` data, but instead **everything we have labels for.**

In [None]:
clf.fit(X, y)

## Filter for planes we want to predict

We have a dataframe of features that includes three types of planes:

* Those that are labeled as surveillance planes
* Those that are labeled as not surveillance
* Those that aren't labeled

Which do we want predictions for? **Filter a new dataframe that's just those.**

* **Tip:** Scroll up to see where you created your `train_df`, it's the opposite!

How many planes do you have in that list? **Confirm it's about 19,200.**
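
As a hint on the mechanics, here's a sketch on an invented four-row frame (not the real data) showing a labeled/unlabeled split using `.notna()` and `.isna()` on the label column:

```python
import pandas as pd

toy = pd.DataFrame({
    "adshex": ["AAA111", "BBB222", "CCC333", "DDD444"],
    "label": ["surveil", None, "other", None],
})

toy_train = toy[toy.label.notna()]     # rows with a label, like our train_df
toy_unlabeled = toy[toy.label.isna()]  # the opposite: planes nobody has labeled yet

print(len(toy_train), len(toy_unlabeled))
```

The two pieces should always add up to the full dataframe, which is a handy sanity check on the real data too.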

## Predicting 

Build your `X` - remember you need to drop a few columns - and use that to make a prediction for each plane.

**Assign the prediction into the `predicted` column**.

* **Tip:** Scroll up to see where you created your features for training, it's similar
* **Tip:** pandas will yell at us about setting values on copies of a slice but it's fine

## How many planes did it predict to be surveillance planes?

It should be roughly 70-80 planes.

## But... what about those other ones? The ones that are just below the threshold?

The cutoff for a prediction of `1` is 50%, but since we have a lot of time we're interested in investigating the top 150. To get the probability for each row, you will use `clf.predict_proba` instead of `clf.predict`. Also, to get the predicted probability for the `1` category, you'll need to add `[:,1]` to the end of the call:

```python
clf.predict_proba(***your features***)[:,1]
```

**Create a new column called `predicted_prob` that is the chance that the plane is a surveillance plane.**

* **Tip:** You dropped three columns when using `clf.predict`, but if you drop the same three you'll get an error now. There's now an extra column that you'll need to drop! What is it?
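
To see what `predict_proba` hands back and why the `[:,1]` is needed, here's a sketch with a random forest trained on synthetic two-class data. The `toy_` names are made up, and this is not the plane model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)
toy_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(toy_X, toy_y)

proba = toy_clf.predict_proba(toy_X)
print(toy_clf.classes_)  # the column order of predict_proba
print(proba.shape)       # one row per sample, one column per class

# Keep only the second column: the probability of class 1
prob_of_1 = proba[:, 1]

# With two classes, predict() is the same as asking "is that probability above 50%?"
same_as_predict = np.array_equal(toy_clf.predict(toy_X), (prob_of_1 > 0.5).astype(int))
print(same_as_predict)
```

Keeping the probability, and not just the 0/1 answer, is what lets us rank planes that fall a little short of the cutoff.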

### Get the top 200 predictions

Take a look at what the probabilities look like, showing the top 200 planes that are **most likely to be surveillance planes.**

Then save them to a file for later research.
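
The sort-then-save step might look something like the sketch below. The data is invented, the top 2 stands in for the top 200, and the output filename is a placeholder written to a temporary folder so the example doesn't clutter your project.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "adshex": ["AAA111", "BBB222", "CCC333", "DDD444"],
    "predicted_prob": [0.12, 0.91, 0.48, 0.77],
})

# Highest probabilities first, then keep the top N (2 here, 200 in the assignment)
top = toy.sort_values(by="predicted_prob", ascending=False).head(2)
print(top)

# Save for later research; index=False skips the row numbers
out_path = os.path.join(tempfile.gettempdir(), "toy_top_predictions.csv")
top.to_csv(out_path, index=False)
```

On the real data you'd want to keep `adshex` in the saved file, since the transponder code is what you'll use to look each plane up.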

# Questions

### Question 1

What kind of machine learning are we doing here, and why are we doing it?

### Question 2

What are a few different ways you can deal with categorical data? Think about how we dealt with race in the reveal regression compared to how we dealt with type in this dataset.

### Question 3

Every time we ran a machine learning algorithm on our dataset, we looked at feature importance.

* When might it be important to explain what our model found important?
* When might it not be important?

### Question 4

Using words and not column names, describe what the machine learning algorithm found to be important when identifying surveillance planes.

### Question 5

Why did we use test/train split when it would have been more effective to give our model all of the data from the start?

### Question 6

Why did we use a random forest instead of a decision tree or logistic regression? Was there something about the data?

### Question 7

Why did we use probability instead of just looking for planes with a predicted value of 1? It seems like we should have just trusted the algorithm, right?

### Question 8

What if our random forest or input dataset were flawed? What would be the repercussions?

### Question 9

The government could claim that we're threatening national security by publishing this paper as well as publishing this code - now anyone could look for planes that are surveilling them. What do you think?

### Question 10

We're using data from the past, but you can get real-time flight data from many services. Can you think of any uses for this algorithm using real-time instead of historical data?

### Question 11

This isn't a question, but if you look at `candidates.csv` and `candidates-annotates.csv` you can see how Buzzfeed did their research after finding a list of suspicious planes.