Will there be a tornado today? Classification with Scikit-Learn
===

### Alyssa Batula


Jupyter Notebooks: https://github.com/abatula/MachineLearning-PBJ
 
8/14/2018<br>
IndyPy - Machine Learning

The Problem
===

* Dataset: [NOAA Weather Data](https://www.kaggle.com/noaa/gsod) from Kaggle 
    - Daily weather recordings from ~9,000 weather stations
    - Temperature, precipitation, fog, etc.
* Goal: predict whether a tornado will form, based on the other weather data
    - Classification: yes or no
* Compare 4 classifier configurations (2 SVM, 2 KNN)

Assumptions
===

* Dataset has been cleaned/scrubbed
    - No missing data
    - All features numeric
* Unneeded features removed
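These assumptions can be checked before any training. The following is a minimal sketch, assuming pandas is available. The small `frame` is made-up stand-in data rather than the NOAA file, and `check_clean` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd


def check_clean(frame):
    """Return (number of missing values, list of non-numeric columns)."""
    n_missing = int(frame.isna().sum().sum())
    non_numeric = list(frame.select_dtypes(exclude='number').columns)
    return n_missing, non_numeric


# Made-up stand-in data; column names borrowed from the dataset for readability
frame = pd.DataFrame({'temp': [58.1, 64.0, 79.1],
                      'thunder': [0, 1, 0],
                      'tornado_funnel_cloud': [0, 1, 0]})

n_missing, non_numeric = check_clean(frame)
print(f'Missing values: {n_missing}')
print(f'Non-numeric columns: {non_numeric}')
```

If either result is non-empty on the real data, the cleaning step needs another pass before the classifiers below will accept the features.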

In [1]:
# Imports
from IPython.display import display

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

The Data
===

In [2]:
data = pd.read_csv('data/NOAA.csv')
data[['temp', 'thunder', 'tornado_funnel_cloud']].describe()

| | temp | thunder | tornado_funnel_cloud |
|---|---|---|---|
| count | 4700.0 | 4700.0 | 4700.0 |
| mean | 58.785043 | 0.146596 | 0.255319 |
| std | 24.130119 | 0.35374 | 0.436087 |
| min | -50.9 | 0.0 | 0.0 |
| 25% | 44.6 | 0.0 | 0.0 |
| 50% | 64.0 | 0.0 | 0.0 |
| 75% | 79.1 | 0.0 | 1.0 |
| max | 101.9 | 1.0 | 1.0 |


Determine Target Variable and Features
---

* Classification: Tornado (1) or No Tornado (0)
* Target value in `y`
* Features in `X`

In [3]:
y = data['tornado_funnel_cloud']
X = data.drop('tornado_funnel_cloud', axis=1)

Training, Testing, and Validation
===

![Train/Test/Validation Split](img/TrainTestValSplit.png)

- **Training Set** - Portion of the data used to train a machine learning algorithm.
- **Validation Set** - (Optional) Portion of data (usually 10-30%) used for testing during parameter tuning or classifier selection.
- **Testing Set** - Portion of the data (usually 10-30%) not used in training, used to evaluate performance.

Training, Testing, and Validation (Continued)
===

* **Training Set** - Homework
    - Many exercises, start knowing nothing
    - Use the answers to learn

* **Validation Set** - Practice Exam
    - Few exercises, use knowledge from training
    - Use answers to decide if ready for testing

* **Test Set** - Final Exam
    - Few exercises, use knowledge from training/validation
    - Use answers to give your algorithm final grade

Create a Test Set
===

In [4]:
from sklearn.model_selection import train_test_split

(X_trainval, 
 X_test, 
 y_trainval, 
 y_test) = train_test_split(X, y, 
                            test_size=0.25, # Fraction of the data held out for the test set
                            stratify=y, # Keep label distribution when splitting
                            random_state=42 # Set the random seed for repeatability
                           )
print(f'Original dataset size: {X.shape}')
print(f'Training & Validation dataset size: {X_trainval.shape}')
print(f'Test dataset size: {X_test.shape}')

Original dataset size: (4700, 28)
Training & Validation dataset size: (3525, 28)
Test dataset size: (1175, 28)
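To make the `stratify=y` option concrete, here is a minimal pure-Python sketch of the idea: split each class separately so both parts keep the same label ratio. This is an illustration only, not scikit-learn's implementation. `stratified_split` is a hypothetical helper and the toy labels are made up.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=42):
    """Return (train_idx, test_idx), sampling test_fraction from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # random order within one class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 25% positive, roughly like the tornado column
labels = [1] * 25 + [0] * 75
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
print(f'Positive rate in test: {test_rate:.2f}, in train: {train_rate:.2f}')
```

Without stratification, a random split of an imbalanced dataset can leave one part with noticeably fewer tornado examples than the other.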




Create Validation and Training Sets
---

In [5]:
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, 
                                                  test_size=0.33, 
                                                  stratify=y_trainval, 
                                                  random_state=42 
                                                  )
print(f'Original dataset size: {X.shape}')
print(f'Training dataset size: {X_train.shape}')
print(f'Validation dataset size: {X_val.shape}')
print(f'Test dataset size: {X_test.shape}')

Original dataset size: (4700, 28)
Training dataset size: (2361, 28)
Validation dataset size: (1164, 28)
Test dataset size: (1175, 28)


Notes
---

* `test_size` increased from 0.25 to 0.33 because this second split draws from only the remaining 75% of the original data (0.33 × 0.75 ≈ 0.25)
* This keeps the test and validation sets approximately the same size (1,175 vs. 1,164 rows)
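The split sizes printed above can be reproduced with a few lines of arithmetic. This is a sketch: rounding the held-out count up with `math.ceil` is an assumption chosen because it matches the printed shapes.

```python
import math

n_total = 4700                           # rows in the original dataset
n_test = math.ceil(n_total * 0.25)       # first split: test_size=0.25
n_trainval = n_total - n_test

n_val = math.ceil(n_trainval * 0.33)     # second split: test_size=0.33
n_train = n_trainval - n_val

print(n_test, n_val, n_train)
print(f'Validation share of original data: {n_val / n_total:.3f}')
```

The validation share comes out just under a quarter of the original data, which is why 0.33 is the right value for the second split.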

Classifiers
===

* K Nearest Neighbors (KNN)
* Support Vector Machine (SVM)

In [6]:
# # Uncomment to re-create our KNN plot for display
# # (note: this reassigns X and y, so re-run the data cells afterwards if you need the originals)

# from sklearn import neighbors
# from matplotlib.colors import ListedColormap

# X = np.array([[0,1], [1,2], [0,3.5], [0.5,1], [1,4], [0.5, 3], [2,1], [3,4], [0.5, 2], [1.5, 3], [2.2, 2.2], [2.5, 1.8]])
# y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1 ,1]

# clf = neighbors.KNeighborsClassifier(3)
# clf.fit(X,y)

# h = .02  # step size in the mesh

# # Plot the decision boundary. For that, we will assign a color to each
# # point in the mesh [x_min, x_max]x[y_min, y_max].
# x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
# y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
# xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
#                      np.arange(y_min, y_max, h))

# Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point 
#                                                # in the mesh in order to find the 
#                                                # classification areas for each label
        
# # Create the color maps
# cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
# cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])

# # Put the result into a color plot
# Z = Z.reshape(xx.shape)
# plt.figure(figsize=(8, 6))
# plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# # Plot the training points
# plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, s=100)
# plt.xlim(xx.min(), xx.max())
# plt.ylim(yy.min(), yy.max())
# plt.title("KNN with K=3", fontsize=22)
# plt.xlabel('Feature 1', fontsize=18)
# plt.ylabel('Feature 2', fontsize=18)

# plt.scatter(0.3, 1.7, c='k', marker='^', s=300)
# plt.scatter(0.3, 1.7, facecolors='none', edgecolors='k', s=14000)
# plt.savefig('img/knn_example.png', dpi=500)
# plt.show()

K Nearest Neighbors (KNN)
---

* Assigns each point the label held by the majority of its nearest neighbors
* Parameters:
    - `n_neighbors` - Number of neighbors to consider
    - `weights` - Whether closer neighbors count for more in the vote
        * `uniform` (every neighbor counts equally) or `distance` (closer neighbors count more)

![KNN Example](img/knn_example.png)

* Label chosen based on vote of nearest labeled examples
* In the plot above, the red and blue circles are the two classes and the black triangle is the unknown point
* The triangle will be labeled red because K=3 and 2 of its 3 closest points are red
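The vote can be sketched in a few lines of plain Python. The points, labels, and query location are taken from the commented-out plotting cell. `knn_predict` is a hypothetical helper written for illustration, not scikit-learn's implementation. Label 0 is assumed to be red and label 1 blue, which matches the color map in that cell.

```python
import math
from collections import defaultdict

# Toy points and labels from the plotting cell (0 = red, 1 = blue)
points = [(0, 1), (1, 2), (0, 3.5), (0.5, 1), (1, 4), (0.5, 3),
          (2, 1), (3, 4), (0.5, 2), (1.5, 3), (2.2, 2.2), (2.5, 1.8)]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]


def knn_predict(query, points, labels, k=3, weights='uniform'):
    """Label `query` by a vote among its k nearest points.

    Assumes `query` does not coincide exactly with a training point
    (a zero distance would break the 1 / distance weighting).
    """
    nearest = sorted(
        (math.dist(query, p), label) for p, label in zip(points, labels)
    )[:k]
    votes = defaultdict(float)
    for distance, label in nearest:
        # 'uniform': one vote each; 'distance': vote weighted by 1 / distance
        votes[label] += 1.0 if weights == 'uniform' else 1.0 / distance
    return max(votes, key=votes.get)


triangle = (0.3, 1.7)
print(knn_predict(triangle, points, labels, k=3, weights='uniform'))
print(knn_predict(triangle, points, labels, k=3, weights='distance'))
```

With uniform weights the triangle gets label 0 (red), as in the plot. On this toy data, distance weighting gives label 1 instead, because the single blue neighbor is much closer than the two red ones. This shows how the `weights` parameter can change a prediction.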

Training the KNN Classifier
---

In [7]:
from sklearn import neighbors
from sklearn.metrics import accuracy_score

In [8]:
n_neighbors = 15
clf = neighbors.KNeighborsClassifier(n_neighbors, weights='uniform')

In [9]:
clf.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=15, p=2,
           weights='uniform')

In [10]:
predictions = clf.predict(X_val)

In [11]:
acc = accuracy_score(y_val, predictions) * 100

print(f'Accuracy is {acc:.2f}%')

Accuracy is 80.93%


Support Vector Machine (SVM)
---

* Finds the line or plane that separates the classes with the widest margin
* Parameters:
    - `C` - Penalty for misclassified training examples
        * High C - high penalty - complex boundary
    - `kernel` - Type of boundary
        * `linear`, `rbf`, `poly`, `sigmoid`

![SVM](img/SVMBoundary.png)



The solid line is the decision boundary, dividing the red and blue classes. Notice that on either side of the boundary, there is a dotted line that passes through the closest datapoints. The distance between the solid boundary line and this dotted line is what an SVM tries to maximize. 

The points that touch the dotted lines are called "support vectors". These points are the only ones that matter when determining boundary locations. All other datapoints can be added, moved, or removed from the dataset without changing the classification boundary, as long as they do not cross that dotted line.
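For reference, the margin idea is usually written as the standard soft-margin SVM objective for a linear boundary. The symbols here are textbook notation rather than names from this notebook: weight vector $w$, offset $b$, slack variables $\xi_i$, and labels $y_i \in \{-1, +1\}$. $C$ is the same penalty parameter listed above.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\,(w^{\top} x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0
```

In this notation the solid boundary is $w^{\top} x + b = 0$ and the dotted lines are $w^{\top} x + b = \pm 1$, each at distance $1 / \lVert w \rVert$ from the boundary. Minimizing $\lVert w \rVert$ therefore widens the margin. A larger $C$ makes each margin violation $\xi_i$ more costly, which pushes the boundary to fit the training points more tightly.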

Training the SVM Classifier
---

In [12]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [13]:
clf = SVC(C=1.0, kernel='rbf')

In [14]:
clf.fit(X_train, y_train)
predictions = clf.predict(X_val)

In [15]:
acc = accuracy_score(y_val, predictions) * 100

print(f'Accuracy is {acc:.2f}%')

Accuracy is 74.91%


Choose the Best Classifier
===

In [16]:
clf_svm1 = SVC(C=1.0, kernel='rbf')
clf_svm2 = SVC(C=0.1, kernel='sigmoid')
clf_knn1 = neighbors.KNeighborsClassifier(5, weights='uniform')
clf_knn2 = neighbors.KNeighborsClassifier(10, weights='distance')

In [17]:
clf_svm1.fit(X_train, y_train)
clf_svm2.fit(X_train, y_train)
clf_knn1.fit(X_train, y_train)
clf_knn2.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='distance')

In [18]:
pred_svm1 = clf_svm1.predict(X_val)
pred_svm2 = clf_svm2.predict(X_val)
pred_knn1 = clf_knn1.predict(X_val)
pred_knn2 = clf_knn2.predict(X_val)

Choose the Best Classifier (Continued)
===

In [19]:
valAcc_svm1 = accuracy_score(y_val, pred_svm1) * 100
valAcc_svm2 = accuracy_score(y_val, pred_svm2) * 100
valAcc_knn1 = accuracy_score(y_val, pred_knn1) * 100
valAcc_knn2 = accuracy_score(y_val, pred_knn2) * 100

In [20]:
print(f'SVM1 Accuracy: {valAcc_svm1:.2f}%')
print(f'SVM2 Accuracy: {valAcc_svm2:.2f}%')
print(f'KNN1 Accuracy: {valAcc_knn1:.2f}%')
print(f'KNN2 Accuracy: {valAcc_knn2:.2f}%')

SVM1 Accuracy: 74.91%
SVM2 Accuracy: 74.48%
KNN1 Accuracy: 81.87%
KNN2 Accuracy: 82.39%
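Accuracy is easier to interpret next to a baseline. The summary table earlier reports a mean of 0.255319 for `tornado_funnel_cloud`, so a classifier that always answers "no tornado" would already be right on roughly three quarters of the rows. The sketch below does that arithmetic. It uses the whole-dataset rate as an approximation for the validation set, which the stratified split should keep close but not necessarily identical.

```python
# Mean of the 0/1 target from the describe() table = fraction of tornado rows
positive_rate = 0.255319

# A classifier that always predicts "no tornado" is right on every 0 row
baseline_acc = (1 - positive_rate) * 100

# Validation accuracies as reported above
reported = {'SVM1': 74.91, 'SVM2': 74.48, 'KNN1': 81.87, 'KNN2': 82.39}

print(f'Majority-class baseline: {baseline_acc:.2f}%')
for name, acc in reported.items():
    print(f'{name}: {acc - baseline_acc:+.2f} points vs. baseline')
```

Both SVM configurations land within half a point of this baseline, while both KNN configurations clear it by more than seven points. That supports picking a KNN model as the winner.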


Evaluate the Winner
---

* Retrain winning classifier on training + validation data
* Evaluate on test set

In [21]:
clf_knn2.fit(X_trainval, y_trainval)

predictions = clf_knn2.predict(X_test)

acc = accuracy_score(y_test, predictions)
print(f'Final Accuracy: {acc*100:.2f}%')

Final Accuracy: 83.74%
