# Supervised Machine Learning Examples
A few examples of supervised machine learning in Python.
First, load up a ton of modules...

In [None]:
import pandas as pd
import itertools
import matplotlib.pyplot as plt
import numpy as np
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, classification_report
from yellowbrick.classifier import ClassificationReport, ConfusionMatrix
from sklearn import metrics
pd.options.mode.chained_assignment = None

##  Load the data
Next, we have to load the data into a dataframe. In order to have a balanced dataset, we will use 10,000 records from Alexa to represent the non-malicious domains, and 10,000 records from `gameoverdga` to represent the malicious domains.

You can see that at the end we have 10000 of each.

In [None]:
df = pd.read_csv( '../../data/dga-full.csv' )
# Filter to the alexa and gameoverdga sources
df = df[df['dsrc'].isin(['alexa','gameoverdga'])]
df.dsrc.value_counts()

## Add a Target Column
For our datasets, we need a numeric column to represent the classes.  In our case we are going to call the column `isMalicious` and assign it a value of `0` if it is not malicious and `1` if it is.
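
As a minimal sketch of the idea, here is one way to derive such a column from a boolean comparison. The three-row `demo` frame is a hypothetical stand-in for the real data, not part of the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe
demo = pd.DataFrame({'dsrc': ['alexa', 'gameoverdga', 'alexa']})

# True where the source is not alexa, then cast True/False to 1/0
demo['isMalicious'] = (demo['dsrc'] != 'alexa').astype(int)

print(demo['isMalicious'].tolist())  # [0, 1, 0]
```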

In [None]:
df['isMalicious'] = df['dsrc'].apply( lambda x: 0 if x == "alexa" else 1 )

In [None]:
df['isMalicious'].value_counts()

## Perform the Train/Test Split
For this, let’s create a rather small training data set, as it will reduce the time needed to train a model.
Feel free to try a 15%, 20% or even a 30% portion for the training data (lower percentages for slower machines).

In this example, we will split 30% for train and 70% for test.

Normally you would want most of the data in the training set, but more training data can considerably extend the time needed to train a model.

We're also going to need a list of column names for the feature columns as well as the target column. 
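
One possible shape for the split, sketched on synthetic data so that it runs on its own. The random `demo_df` frame, the `stratify` argument and the fixed `random_state` are assumptions made for illustration; the variable names on the left match the ones used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataframe: 200 rows of random features
rng = np.random.default_rng(42)
features = ['length', 'dicts', 'entropy', 'numbers', 'ngram']
demo_df = pd.DataFrame(rng.random((200, len(features))), columns=features)
demo_df['isMalicious'] = np.repeat([0, 1], 100)

feature_matrix = demo_df[features]
target_vector = demo_df['isMalicious']

# test_size=0.7 keeps 30% of the rows for training and holds out 70% for testing.
# stratify keeps the class balance the same in both parts (an optional choice).
feature_matrix_train, feature_matrix_test, target_train, target_test = train_test_split(
    feature_matrix, target_vector, test_size=0.7,
    stratify=target_vector, random_state=42
)

print(feature_matrix_train.shape, feature_matrix_test.shape)
```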

In [None]:
#train, test = train_test_split(df, test_size = 0.7)
features = ['length', 'dicts', 'entropy','numbers', 'ngram']
target = 'isMalicious'

feature_matrix = df[ features ]
target_vector = df[ target ]

#Your code here...


## Create the Classifiers
The next step is to create the classifiers. What you'll see is that scikit-learn maintains a consistent interface across its machine learning algorithms. For a supervised model, the steps are:
1.  Create the classifier object
2.  Call the `.fit()` method with the training data set and the target 
3.  To make a prediction, call the `.predict()` method
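
The three steps above can be sketched as follows. The toy two-cluster data and the hyperparameters (`n_estimators=50`, default `SVC`) are assumptions chosen only to keep the example small; they are not tuned for the DGA data.

```python
import numpy as np
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: two well-separated clusters of 2-D points
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
                    rng.normal(3.0, 0.5, (50, 2))])
y_demo = np.repeat([0, 1], 50)

# Step 1: create the classifier objects
demo_models = {
    'random forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'svm': svm.SVC(),
}

demo_predictions = {}
for name, model in demo_models.items():
    model.fit(X_demo, y_demo)                       # Step 2: train
    demo_predictions[name] = model.predict(X_demo)  # Step 3: predict
    accuracy = (demo_predictions[name] == y_demo).mean()
    print(name, 'training accuracy:', accuracy)
```

Because both models expose the same `.fit()` and `.predict()` methods, the loop body does not need to know which algorithm it is handling.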

In [None]:
#Create the Random Forest Classifier
random_forest_clf = # Your code here ...

In [None]:
#Next, create the SVM classifier
svm_classifier = # Your code here ...

## Comparing the Classifiers
Now that we have two different classifiers, let's compare them and see how they perform. Fortunately, scikit-learn has a series of functions to generate metrics for you. The first is the cross-validation score.
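
A minimal sketch of `cross_val_score`, assuming 5 folds and a small synthetic dataset (both are illustrative choices, not values taken from this notebook):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical toy data: two clusters of 2-D points
rng = np.random.default_rng(1)
X_cv = np.vstack([rng.normal(0.0, 1.0, (60, 2)),
                  rng.normal(2.0, 1.0, (60, 2))])
y_cv = np.repeat([0, 1], 60)

# cv=5 fits and scores the model five times, each time holding out a different fifth
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=25, random_state=1), X_cv, y_cv, cv=5
)

print(cv_scores)
print('mean accuracy: %.3f (std %.3f)' % (cv_scores.mean(), cv_scores.std()))
```

Reporting the mean together with the spread gives a quick sense of how stable the score is across folds.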

In [None]:
scores = # Your code here ...

In [None]:
scores = # Your code here ...

We'll need to get the predictions from both classifiers for the test and training sets, so we store each set of predictions in its own variable.

In [None]:
predictions_test = random_forest_clf.predict( feature_matrix_test )
predictions_train = random_forest_clf.predict( feature_matrix_train )
svm_predictions_test = svm_classifier.predict( feature_matrix_test)
svm_predictions_train = svm_classifier.predict( feature_matrix_train)

## Confusion Matrix
These are a little confusing (yuk yuk), but are a very valuable tool in evaluating your models.  Scikit-learn has a function to generate a confusion matrix as shown below.  
``` python
confusion_matrix( target_test, predictions_test)
```
Try this yourself to see what the confusion matrices look like for various models.
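
To help read the output, here is a small worked example on hand-made labels (the `demo_true` and `demo_pred` lists are hypothetical). Rows are the true classes and columns are the predicted classes, both in sorted label order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = not malicious, 1 = malicious
demo_true = [0, 0, 0, 0, 1, 1, 1, 1]
demo_pred = [0, 0, 0, 1, 1, 1, 0, 0]

demo_cm = confusion_matrix(demo_true, demo_pred)
print(demo_cm)
# [[3 1]
#  [2 2]]

# For a binary problem, ravel() unpacks the four cells in this order
tn, fp, fn, tp = demo_cm.ravel()
print(tn, fp, fn, tp)  # 3 1 2 2
```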

In [None]:
# Your code here ...

Next, try using YellowBrick to produce nicer, color-coded confusion matrices. Remember, the syntax is:

```python
viz = ConfusionMatrix(RandomForestClassifier(), classes=[0,1])

viz.fit(feature_matrix_train, target_train)  
viz.score(feature_matrix_test, target_test)  
g = viz.poof()  # newer Yellowbrick releases rename this to viz.show()
```

In [None]:
# Your code here ...

And again for the SVM classifier.

In [None]:
# Your code here ...

### Calculate precision and recall for both models
Next, you are going to want to compare the models' performance metrics. While scikit-learn does provide all the scores as part of the metrics package, it is easier to calculate all the metrics at once using the classification report functionality. The basic syntax is:

```python
classification_report(y_true, y_pred, target_names=target_names)
```
YellowBrick also has a nice classification report visualizer.  The basic syntax is below:
```python
visualizer = ClassificationReport(model, classes=classes)

visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
g = visualizer.poof()         
```

Do this for both models.
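
As a rough illustration of the plain scikit-learn version, the sketch below runs `classification_report` on hypothetical hand-made labels. Note that `target_names` follows the sorted label order, so the name for class `0` comes first. The `output_dict=True` call is an optional extra that returns the same numbers as a dictionary.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: 0 = not malicious, 1 = malicious
demo_true = [0, 0, 0, 0, 1, 1, 1, 1]
demo_pred = [0, 0, 0, 1, 1, 1, 0, 0]
demo_names = ['Not Malicious', 'Malicious']  # order matches labels 0, 1

# Human-readable table
print(classification_report(demo_true, demo_pred, target_names=demo_names))

# Same metrics as a nested dict, convenient for comparing two models in code
demo_report = classification_report(
    demo_true, demo_pred, target_names=demo_names, output_dict=True
)
print(demo_report['Malicious']['recall'])  # 0.5 (2 of the 4 malicious rows were caught)
```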

In [None]:
# Your code here ...

## Feature Importance
A fitted random forest can report how important each feature was in building the forest. This is available through the `random_forest_clf.feature_importances_` attribute.
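
A small sketch of what the attribute looks like, using synthetic data in which only the first column determines the class. The data, the column names in `demo_feature_names` and the hyperparameters are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: three random columns, label depends only on column 0
rng = np.random.default_rng(7)
X_imp = rng.random((300, 3))
y_imp = (X_imp[:, 0] > 0.5).astype(int)
demo_feature_names = ['signal', 'noise_a', 'noise_b']

forest = RandomForestClassifier(n_estimators=50, random_state=7)
forest.fit(X_imp, y_imp)

# One score per feature; the scores sum to 1
demo_importances = forest.feature_importances_

# Print the features from most to least important
ranked = sorted(zip(demo_feature_names, demo_importances),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print('%-8s %.3f' % (name, score))
```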

In [None]:
# Your code here ...

You can also visualize this with the following code, adapted from http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html

In [None]:
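importances = random_forest_clf.feature_importances_
# Spread of each feature's importance across the individual trees
std = np.std([tree.feature_importances_ for tree in random_forest_clf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]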

# Print the feature ranking
print("Feature ranking:")

for f in range(feature_matrix_test.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(feature_matrix_test.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(feature_matrix_test.shape[1]), indices)
plt.xlim([-1, feature_matrix_test.shape[1]])
plt.show()

You can calculate the accuracy with the `metrics.accuracy_score()` function, and finally, there is `metrics.classification_report()`, which reports precision, recall and F1 score for each class at once.

In [None]:
pscore = metrics.accuracy_score(target_test, predictions_test)
pscore_train = metrics.accuracy_score(target_train, predictions_train)

In [None]:
# target_names follows the sorted label order: 0 = Not Malicious, 1 = Malicious
print( metrics.classification_report(target_test, predictions_test, target_names=['Not Malicious', 'Malicious'] ) )

In [None]:
svm_pscore = metrics.accuracy_score(target_test, svm_predictions_test)
svm_pscore_train = metrics.accuracy_score(target_train, svm_predictions_train)
# target_names follows the sorted label order: 0 = Not Malicious, 1 = Malicious
print( metrics.classification_report(target_test, svm_predictions_test, target_names=['Not Malicious', 'Malicious'] ) )

In [None]:
print( svm_pscore, svm_pscore_train)

In [None]:
print( pscore, pscore_train)