# Snow Day Classifier

New York City schools have closed for snow 15 times in the last 40 years. When they do, city operations are disrupted; parents must seek other accommodations, students miss out on educational opportunities and access to basic health services and food, and many other challenges arise. As such, a useful predictor for snow days would be critical to municipal preparedness. However, the decision to close schools is not simply a function of snowfall, the climate is changing, and different mayoral administrations seem to handle the issue differently, not to mention control of schools has shifted over the past several decades. Below, I attempt to build a classifier using NOAA weather data, recognizing that success will be measured not in pounds or ounces, but in grams.

In [22]:
import numpy as np
import pandas as pd
import datetime as dt
import pandas_profiling
import matplotlib as plt
import pickle

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, fbeta_score, make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

### Data Pre-processing

Weather data was retrieved from NOAA. Data was collected near JFK and represents daily weather beginning in 1977. A full description of the attributes can be found in GHCND_documentation.pdf, but a brief description is available in the attributes dictionary. Comprehensive data regarding school schedules is not readily available, so it was not possible to systematically omit days off from the data. Instead, February 17, 2003 was excluded, as that was a historically notable snow storm that occurred during the winter break. After pre-processing, the data consist of 5355 observations with 20 features.

To combat the obvious imbalance in the data SMOTE is used to generate synthetic positive instances in the data, and Tomek is used to exclude negative class instances in the boundary zone between classes. As the data are imbalanced an F score is used to evaluate the models rather than accuracy. An F1 score would weight precision and recall equally, but the decision was made to weight precision more highly, as cancelling school was considered more serious than not cancelling school in poor weather conditions.

In [23]:
# NOAA weather data
weather_data = pd.read_csv('1652746.csv', parse_dates=['DATE'], low_memory=False)

# List of snow days in NYC
snow_days = pd.read_csv('snow_days.csv', header=None, parse_dates=True)
snow_days = snow_days[0].tolist()

#Attributes in the weather data relevent to snow
attributes = {
    'DATE' : 'Date',
    'AWND' : 'Average daily wind speed',
    'FMTM' : 'Time of fastest mile',
    'PRCP' : 'Precipitation',
    'SNOW' : 'Snowfall',
    'SNWD' : 'Snow depth',
    'TAVG' : 'Average temperature',
    'TMIN' : 'Minimum temperature',
    'TSUN' : 'Total daily sunshine',
    'WESD' : 'Water equivalent of snow on the ground',
    'WSFG' : 'Peak guest wind speed',
    'WV01' : 'Fog, ice fog, or freezing fog in the vicinity',
    'WT04' : 'Ice pellets, sleet, snow pellets, or small hail',
    'WT05' : 'Hail (may include small hail)',
    'WT06' : 'Glaze or rime',
    'WT09' : 'Blowing or drifting snow',
    'WT11' : 'High or damaging winds',
    'WT15' : 'Freezing drizzle',
    'WT17' : 'Freezing rain',
    'WT18' : 'Snow, snow pellets, snow grains, or ice crystals',
    'WT22' : 'Ice fog or freezing fog'
}

# Dataframe with relevent weather data
df = weather_data[[i for i in attributes]]

# Filering out weekends
df = df[df.DATE.dt.weekday.isin(range(5))]

# Filtering out months with no snow, mostly as a ram saver
df = df[df.DATE.dt.month.isin([11, 12, 1, 2, 3, 4])]

# Filter out a snow storm that occurred during winter break (2/17/2003)
df.drop([9178])

#Adding boolean snow day column
df['snowday'] = df.DATE.isin(snow_days)

# Get rid of date data, remove outlier data, and set NaN to 0
df_dateless = df.drop('DATE', axis=1)
# df_dateless = df_dateless[df_dateless.apply(lambda x: np.abs(x - x.mean()) / x.std() < 4, axis=1)]
df_dateless = df_dateless.fillna(0)

# Split data into observations and labels
X = df_dateless.iloc[:,:-1]
y = df_dateless.iloc[:,-1]

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=56493)

sm = SMOTE(sampling_strategy = .05, random_state=56493)

# Oversampling with SMOTE to increase representation of minority class
X_train_SMOTE, y_train_SMOTE = sm.fit_sample(X_train, y_train.ravel())

# Undersampling with Tomek to remove instances of majority class near snow days
X_train_SMOTE_Tomek, y_train_SMOTE_Tomek = TomekLinks().fit_sample(X_train_SMOTE, y_train_SMOTE)

In [25]:
# Define f0.5 scorer to weight precision higher than recall.
f_beta = make_scorer(fbeta_score, beta=0.5)

In [17]:
X.describe()

Unnamed: 0,AWND,FMTM,PRCP,SNOW,SNWD,TAVG,TMIN,TSUN,WESD,WSFG,WV01,WT04,WT05,WT06,WT09,WT11,WT15,WT17,WT18,WT22
count,5355.0,5355.0,5355.0,5355.0,5355.0,5355.0,5355.0,5355.0,5355.0,5355.0,5355.0,5355.0,5355.0,5355.0,5355.0,5355.0,5355.0,5355.0,5355.0,5355.0
mean,10.606773,938.567134,0.113892,0.139122,0.467171,12.935574,34.061811,0.528852,0.039552,12.240131,0.0,0.038095,0.021662,0.018861,0.011204,0.001307,0.005229,0.00859,0.127544,0.005415
std,7.32643,965.39783,0.296688,0.831902,1.928148,20.017684,10.131859,38.700227,0.245072,14.780051,0.0,0.191444,0.145591,0.136046,0.105266,0.036135,0.072128,0.092292,0.333613,0.073397
min,0.0,0.0,0.0,0.0,0.0,0.0,-2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7.38,0.0,0.0,0.0,0.0,0.0,28.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,10.74,950.0,0.0,0.0,0.0,0.0,35.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,14.54,1706.0,0.05,0.0,0.0,32.0,41.0,0.0,0.0,25.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,308.03,32767.0,4.68,21.6,28.0,75.0,64.0,2832.0,5.0,60.8,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


### Support Vector Machine

The first model attempted is an SVM, as per Professor Soon Chun's suggestions that SVMs do well with sparse data. The model has a score of 18% which decreases with parameter tuning, suggesting overfitting.

In [5]:
basic_svm = SVC(gamma='scale')
basic_svm.fit(X_train_SMOTE_Tomek, y_train_SMOTE_Tomek)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [6]:
y_true = y_test
y_pred = basic_svm.predict(X_test)

In [7]:
pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Predicted,False,True,All
True,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
False,1598,5,1603
True,3,1,4
All,1601,6,1607


In [8]:
f_beta(basic_svm, X_test, y_test)

0.17857142857142855

In [9]:
# parameter tuning

parameters = {'gamma': [0.001, 0.01, 0.1],
            'C': [1, 10, 100],
            'kernel':['poly', 'rbf'],
            'degree': [1, 2, 3]}
svm_search = GridSearchCV(SVC(), parameters, cv=3, scoring=f_beta, verbose=3)
svm_search.fit(X_train_SMOTE_Tomek, y_train_SMOTE_Tomek.ravel())
svm_search.best_params_

Fitting 3 folds for each of 54 candidates, totalling 162 fits
[CV] C=1, degree=1, gamma=0.001, kernel=poly .........................
[CV]  C=1, degree=1, gamma=0.001, kernel=poly, score=0.9124087591240876, total=   0.1s
[CV] C=1, degree=1, gamma=0.001, kernel=poly .........................
[CV]  C=1, degree=1, gamma=0.001, kernel=poly, score=0.9271523178807948, total=   0.1s
[CV] C=1, degree=1, gamma=0.001, kernel=poly .........................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.2s remaining:    0.0s


[CV]  C=1, degree=1, gamma=0.001, kernel=poly, score=0.8270676691729324, total=   0.1s
[CV] C=1, degree=1, gamma=0.001, kernel=rbf ..........................
[CV]  C=1, degree=1, gamma=0.001, kernel=rbf, score=0.8715596330275229, total=   0.1s
[CV] C=1, degree=1, gamma=0.001, kernel=rbf ..........................
[CV]  C=1, degree=1, gamma=0.001, kernel=rbf, score=0.8527131782945736, total=   0.1s
[CV] C=1, degree=1, gamma=0.001, kernel=rbf ..........................
[CV]  C=1, degree=1, gamma=0.001, kernel=rbf, score=0.7758620689655172, total=   0.1s
[CV] C=1, degree=1, gamma=0.01, kernel=poly ..........................
[CV]  C=1, degree=1, gamma=0.01, kernel=poly, score=0.9124087591240876, total=   0.2s
[CV] C=1, degree=1, gamma=0.01, kernel=poly ..........................
[CV]  C=1, degree=1, gamma=0.01, kernel=poly, score=0.8962264150943395, total=   0.2s
[CV] C=1, degree=1, gamma=0.01, kernel=poly ..........................
[CV]  C=1, degree=1, gamma=0.01, kernel=poly, score=0.804

[Parallel(n_jobs=1)]: Done 162 out of 162 | elapsed: 20.3min finished


{'C': 1, 'degree': 3, 'gamma': 0.1, 'kernel': 'poly'}

In [10]:
best_svm = SVC(C = 1, degree = 3, gamma = 0.1, kernel = 'poly')
best_svm.fit(X_train_SMOTE_Tomek, y_train_SMOTE_Tomek)

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.1, kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [11]:
y_true = y_test
y_pred = best_svm.predict(X_test)
pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Predicted,False,True,All
True,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
False,1597,6,1603
True,3,1,4
All,1600,7,1607


In [12]:
f_beta(best_svm, X_test, y_test)

0.15625

### Logistic Regression

Logistic regression performs similarly and also seems to overfit with parameter tuning. As the logistic model scores comparably to the SVM and has similar computational performance, there is no clear winner here.

In [13]:
basic_logit = LogisticRegression()
basic_logit.fit(X_train_SMOTE_Tomek, y_train_SMOTE_Tomek)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [14]:
y_true = y_test
y_pred = basic_logit.predict(X_test)
pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Predicted,False,True,All
True,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
False,1592,11,1603
True,2,2,4
All,1594,13,1607


In [15]:
f_beta(basic_logit, X_test, y_test)

0.1785714285714286

In [17]:
# parameter tuning

parameters = {'C': [1, 10, 100],
             'class_weight': ['balanced', None],
             'solver': ['newton-cg', 'lbfgs', 'liblinear', 'saga']}
svm_search = GridSearchCV(LogisticRegression(), parameters, cv=3, scoring=f_beta, verbose=3)
svm_search.fit(X_train_SMOTE_Tomek, y_train_SMOTE_Tomek.ravel())
svm_search.best_params_

Fitting 3 folds for each of 24 candidates, totalling 72 fits
[CV] C=1, class_weight=balanced, solver=newton-cg ....................
[CV]  C=1, class_weight=balanced, solver=newton-cg, score=0.8288770053475936, total=   0.1s
[CV] C=1, class_weight=balanced, solver=newton-cg ....................
[CV]  C=1, class_weight=balanced, solver=newton-cg, score=0.8115183246073298, total=   0.1s
[CV] C=1, class_weight=balanced, solver=newton-cg ....................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.2s remaining:    0.0s


[CV]  C=1, class_weight=balanced, solver=newton-cg, score=0.7635467980295567, total=   0.1s
[CV] C=1, class_weight=balanced, solver=lbfgs ........................
[CV]  C=1, class_weight=balanced, solver=lbfgs, score=0.7868020304568527, total=   0.0s
[CV] C=1, class_weight=balanced, solver=lbfgs ........................
[CV]  C=1, class_weight=balanced, solver=lbfgs, score=0.682819383259912, total=   0.0s
[CV] C=1, class_weight=balanced, solver=lbfgs ........................
[CV]  C=1, class_weight=balanced, solver=lbfgs, score=0.6601731601731601, total=   0.0s
[CV] C=1, class_weight=balanced, solver=liblinear ....................
[CV]  C=1, class_weight=balanced, solver=liblinear, score=0.7948717948717948, total=   0.0s
[CV] C=1, class_weight=balanced, solver=liblinear ....................
[CV]  C=1, class_weight=balanced, solver=liblinear, score=0.8115183246073298, total=   0.0s
[CV] C=1, class_weight=balanced, solver=liblinear ....................
[CV]  C=1, class_weight=balanced, s



[CV]  C=1, class_weight=balanced, solver=saga, score=0.11303890641430074, total=   0.1s
[CV] C=1, class_weight=balanced, solver=saga .........................
[CV]  C=1, class_weight=balanced, solver=saga, score=0.10520487264673313, total=   0.1s
[CV] C=1, class_weight=balanced, solver=saga .........................
[CV]  C=1, class_weight=balanced, solver=saga, score=0.43442622950819676, total=   0.1s
[CV] C=1, class_weight=None, solver=newton-cg ........................




[CV]  C=1, class_weight=None, solver=newton-cg, score=0.8673469387755102, total=   0.1s
[CV] C=1, class_weight=None, solver=newton-cg ........................
[CV]  C=1, class_weight=None, solver=newton-cg, score=0.9090909090909092, total=   0.1s
[CV] C=1, class_weight=None, solver=newton-cg ........................
[CV]  C=1, class_weight=None, solver=newton-cg, score=0.8540372670807453, total=   0.1s
[CV] C=1, class_weight=None, solver=lbfgs ............................


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


[CV]  C=1, class_weight=None, solver=lbfgs, score=0.8448275862068965, total=   0.0s
[CV] C=1, class_weight=None, solver=lbfgs ............................
[CV]  C=1, class_weight=None, solver=lbfgs, score=0.911949685534591, total=   0.0s
[CV] C=1, class_weight=None, solver=lbfgs ............................
[CV]  C=1, class_weight=None, solver=lbfgs, score=0.7934131736526946, total=   0.0s
[CV] C=1, class_weight=None, solver=liblinear ........................
[CV]  C=1, class_weight=None, solver=liblinear, score=0.8391608391608392, total=   0.0s
[CV] C=1, class_weight=None, solver=liblinear ........................
[CV]  C=1, class_weight=None, solver=liblinear, score=0.920245398773006, total=   0.0s
[CV] C=1, class_weight=None, solver=liblinear ........................
[CV]  C=1, class_weight=None, solver=liblinear, score=0.8333333333333335, total=   0.0s
[CV] C=1, class_weight=None, solver=saga .............................
[CV] ... C=1, class_weight=None, solver=saga, score=0.0, tot

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


[CV] ... C=1, class_weight=None, solver=saga, score=0.0, total=   0.1s
[CV] C=1, class_weight=None, solver=saga .............................
[CV] ... C=1, class_weight=None, solver=saga, score=0.0, total=   0.1s
[CV] C=10, class_weight=balanced, solver=newton-cg ...................
[CV]  C=10, class_weight=balanced, solver=newton-cg, score=0.8288770053475936, total=   0.1s
[CV] C=10, class_weight=balanced, solver=newton-cg ...................
[CV]  C=10, class_weight=balanced, solver=newton-cg, score=0.856353591160221, total=   0.1s
[CV] C=10, class_weight=balanced, solver=newton-cg ...................
[CV]  C=10, class_weight=balanced, solver=newton-cg, score=0.8115183246073298, total=   0.1s
[CV] C=10, class_weight=balanced, solver=lbfgs .......................
[CV]  C=10, class_weight=balanced, solver=lbfgs, score=0.6674208144796381, total=   0.0s
[CV] C=10, class_weight=balanced, solver=lbfgs .......................
[CV]  C=10, class_weight=balanced, solver=lbfgs, score=0.67099567



[CV]  C=10, class_weight=balanced, solver=liblinear, score=0.7948717948717948, total=   0.0s
[CV] C=10, class_weight=balanced, solver=saga ........................
[CV]  C=10, class_weight=balanced, solver=saga, score=0.11303890641430074, total=   0.1s
[CV] C=10, class_weight=balanced, solver=saga ........................
[CV]  C=10, class_weight=balanced, solver=saga, score=0.10832383124287345, total=   0.1s
[CV] C=10, class_weight=balanced, solver=saga ........................




[CV]  C=10, class_weight=balanced, solver=saga, score=0.4372937293729373, total=   0.1s
[CV] C=10, class_weight=None, solver=newton-cg .......................
[CV]  C=10, class_weight=None, solver=newton-cg, score=0.8892617449664431, total=   0.1s
[CV] C=10, class_weight=None, solver=newton-cg .......................
[CV]  C=10, class_weight=None, solver=newton-cg, score=0.9090909090909092, total=   0.1s
[CV] C=10, class_weight=None, solver=newton-cg .......................
[CV]  C=10, class_weight=None, solver=newton-cg, score=0.8787878787878788, total=   0.1s
[CV] C=10, class_weight=None, solver=lbfgs ...........................
[CV]  C=10, class_weight=None, solver=lbfgs, score=0.8275862068965517, total=   0.0s
[CV] C=10, class_weight=None, solver=lbfgs ...........................
[CV]  C=10, class_weight=None, solver=lbfgs, score=0.8962264150943395, total=   0.0s
[CV] C=10, class_weight=None, solver=lbfgs ...........................
[CV]  C=10, class_weight=None, solver=lbfgs, scor

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


[CV]  C=10, class_weight=None, solver=liblinear, score=0.8892617449664431, total=   0.0s
[CV] C=10, class_weight=None, solver=liblinear .......................
[CV]  C=10, class_weight=None, solver=liblinear, score=0.9090909090909092, total=   0.0s
[CV] C=10, class_weight=None, solver=liblinear .......................
[CV]  C=10, class_weight=None, solver=liblinear, score=0.8787878787878788, total=   0.0s
[CV] C=10, class_weight=None, solver=saga ............................
[CV] .. C=10, class_weight=None, solver=saga, score=0.0, total=   0.1s
[CV] C=10, class_weight=None, solver=saga ............................
[CV] .. C=10, class_weight=None, solver=saga, score=0.0, total=   0.1s
[CV] C=10, class_weight=None, solver=saga ............................


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


[CV] .. C=10, class_weight=None, solver=saga, score=0.0, total=   0.1s
[CV] C=100, class_weight=balanced, solver=newton-cg ..................
[CV]  C=100, class_weight=balanced, solver=newton-cg, score=0.8469945355191256, total=   0.2s
[CV] C=100, class_weight=balanced, solver=newton-cg ..................
[CV]  C=100, class_weight=balanced, solver=newton-cg, score=0.8757062146892656, total=   0.2s
[CV] C=100, class_weight=balanced, solver=newton-cg ..................
[CV]  C=100, class_weight=balanced, solver=newton-cg, score=0.8469945355191256, total=   0.2s
[CV] C=100, class_weight=balanced, solver=lbfgs ......................




[CV]  C=100, class_weight=balanced, solver=lbfgs, score=0.6888888888888889, total=   0.0s
[CV] C=100, class_weight=balanced, solver=lbfgs ......................
[CV]  C=100, class_weight=balanced, solver=lbfgs, score=0.6378600823045268, total=   0.0s
[CV] C=100, class_weight=balanced, solver=lbfgs ......................
[CV]  C=100, class_weight=balanced, solver=lbfgs, score=0.6378600823045268, total=   0.0s
[CV] C=100, class_weight=balanced, solver=liblinear ..................
[CV]  C=100, class_weight=balanced, solver=liblinear, score=0.8378378378378378, total=   0.0s
[CV] C=100, class_weight=balanced, solver=liblinear ..................
[CV]  C=100, class_weight=balanced, solver=liblinear, score=0.8378378378378378, total=   0.0s
[CV] C=100, class_weight=balanced, solver=liblinear ..................
[CV]  C=100, class_weight=balanced, solver=liblinear, score=0.8378378378378378, total=   0.0s
[CV] C=100, class_weight=balanced, solver=saga .......................




[CV]  C=100, class_weight=balanced, solver=saga, score=0.1123301985370951, total=   0.1s
[CV] C=100, class_weight=balanced, solver=saga .......................
[CV]  C=100, class_weight=balanced, solver=saga, score=0.1075877689694224, total=   0.1s
[CV] C=100, class_weight=balanced, solver=saga .......................
[CV]  C=100, class_weight=balanced, solver=saga, score=0.43442622950819676, total=   0.1s
[CV] C=100, class_weight=None, solver=newton-cg ......................
[CV]  C=100, class_weight=None, solver=newton-cg, score=0.8892617449664431, total=   0.2s
[CV] C=100, class_weight=None, solver=newton-cg ......................
[CV]  C=100, class_weight=None, solver=newton-cg, score=0.9090909090909092, total=   0.2s
[CV] C=100, class_weight=None, solver=newton-cg ......................




[CV]  C=100, class_weight=None, solver=newton-cg, score=0.8832335329341318, total=   0.2s
[CV] C=100, class_weight=None, solver=lbfgs ..........................
[CV]  C=100, class_weight=None, solver=lbfgs, score=0.8793103448275862, total=   0.0s
[CV] C=100, class_weight=None, solver=lbfgs ..........................
[CV]  C=100, class_weight=None, solver=lbfgs, score=0.8962264150943395, total=   0.0s
[CV] C=100, class_weight=None, solver=lbfgs ..........................
[CV]  C=100, class_weight=None, solver=lbfgs, score=0.8284023668639052, total=   0.0s
[CV] C=100, class_weight=None, solver=liblinear ......................
[CV]  C=100, class_weight=None, solver=liblinear, score=0.8940397350993378, total=   0.0s
[CV] C=100, class_weight=None, solver=liblinear ......................
[CV]  C=100, class_weight=None, solver=liblinear, score=0.9090909090909092, total=   0.0s
[CV] C=100, class_weight=None, solver=liblinear ......................
[CV]  C=100, class_weight=None, solver=libline

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


[CV] . C=100, class_weight=None, solver=saga, score=0.0, total=   0.1s
[CV] C=100, class_weight=None, solver=saga ...........................
[CV] . C=100, class_weight=None, solver=saga, score=0.0, total=   0.1s
[CV] C=100, class_weight=None, solver=saga ...........................
[CV] . C=100, class_weight=None, solver=saga, score=0.0, total=   0.1s


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    4.8s finished


{'C': 100, 'class_weight': None, 'solver': 'newton-cg'}

In [18]:
best_logit = LogisticRegression(C = 100, solver = 'newton-cg')
best_logit.fit(X_train_SMOTE_Tomek, y_train_SMOTE_Tomek)

LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='newton-cg',
          tol=0.0001, verbose=0, warm_start=False)

In [19]:
y_true = y_test
y_pred = best_logit.predict(X_test)
pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Predicted,False,True,All
True,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
False,1591,12,1603
True,3,1,4
All,1594,13,1607


In [20]:
f_beta(best_logit, X_test, y_test)

0.0892857142857143

### Naive Bayes

Naive Bayes does not perform well at all. The conditional probabilities for the positive class are likely not highly representative of snow day conditions, and there may be lots of noise in the majority class.

In [21]:
basic_nb = GaussianNB()
basic_nb.fit(X_train_SMOTE_Tomek, y_train_SMOTE_Tomek)

GaussianNB(priors=None, var_smoothing=1e-09)

In [22]:
y_true = y_test
y_pred = basic_nb.predict(X_test)
pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Predicted,False,True,All
True,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
False,1534,69,1603
True,1,3,4
All,1535,72,1607


In [23]:
f_beta(basic_nb, X_test, y_test)

0.051369863013698634

### KNN

Finally, KNN performs decently with factory settings, but parameter tuning gives a model with far superior performance. Weighting nodes by their distance is an intuitive improvement, but it is not immediately obvious how the manhattan distance is an improvement on the generic Euclidean distance. Further investigation is needed.

In [24]:
basic_knn = KNeighborsClassifier()
basic_knn.fit(X_train_SMOTE_Tomek, y_train_SMOTE_Tomek)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

In [25]:
y_true = y_test
y_pred = basic_knn.predict(X_test)
pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Predicted,False,True,All
True,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
False,1594,9,1603
True,3,1,4
All,1597,10,1607


In [26]:
f_beta(basic_knn, X_test, y_test)

0.11363636363636363

In [28]:
# parameter tuning

parameters = {'n_neighbors': [3, 5, 7],
             'weights': ['uniform', 'distance'],
             'p': [1, 2]}
svm_search = GridSearchCV(KNeighborsClassifier(), parameters, cv=3, scoring=f_beta, verbose=3)
svm_search.fit(X_train_SMOTE_Tomek, y_train_SMOTE_Tomek.ravel())
svm_search.best_params_

Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV] n_neighbors=3, p=1, weights=uniform .............................
[CV]  n_neighbors=3, p=1, weights=uniform, score=0.9615384615384615, total=   0.1s
[CV] n_neighbors=3, p=1, weights=uniform .............................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s


[CV]  n_neighbors=3, p=1, weights=uniform, score=0.9183673469387754, total=   0.0s
[CV] n_neighbors=3, p=1, weights=uniform .............................
[CV]  n_neighbors=3, p=1, weights=uniform, score=0.9302325581395349, total=   0.0s
[CV] n_neighbors=3, p=1, weights=distance ............................
[CV]  n_neighbors=3, p=1, weights=distance, score=0.9693877551020408, total=   0.0s
[CV] n_neighbors=3, p=1, weights=distance ............................
[CV]  n_neighbors=3, p=1, weights=distance, score=0.9354838709677419, total=   0.0s
[CV] n_neighbors=3, p=1, weights=distance ............................
[CV]  n_neighbors=3, p=1, weights=distance, score=0.9489051094890509, total=   0.0s
[CV] n_neighbors=3, p=2, weights=uniform .............................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.2s remaining:    0.0s


[CV]  n_neighbors=3, p=2, weights=uniform, score=0.9172661870503598, total=   0.0s
[CV] n_neighbors=3, p=2, weights=uniform .............................
[CV]  n_neighbors=3, p=2, weights=uniform, score=0.896551724137931, total=   0.0s
[CV] n_neighbors=3, p=2, weights=uniform .............................
[CV]  n_neighbors=3, p=2, weights=uniform, score=0.847107438016529, total=   0.0s
[CV] n_neighbors=3, p=2, weights=distance ............................
[CV]  n_neighbors=3, p=2, weights=distance, score=0.9265734265734265, total=   0.0s
[CV] n_neighbors=3, p=2, weights=distance ............................
[CV]  n_neighbors=3, p=2, weights=distance, score=0.906040268456376, total=   0.0s
[CV] n_neighbors=3, p=2, weights=distance ............................
[CV]  n_neighbors=3, p=2, weights=distance, score=0.8720930232558141, total=   0.0s
[CV] n_neighbors=5, p=1, weights=uniform .............................
[CV]  n_neighbors=5, p=1, weights=uniform, score=0.9302325581395349, total= 

[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:    2.8s finished


{'n_neighbors': 3, 'p': 1, 'weights': 'distance'}

In [26]:
best_knn = KNeighborsClassifier(n_neighbors = 3, p = 1, weights ='distance')
best_knn.fit(X_train_SMOTE_Tomek, y_train_SMOTE_Tomek)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=1,
           weights='distance')

In [27]:
y_true = y_test
y_pred = best_knn.predict(X_test)
pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Predicted,False,True,All
True,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
False,1603,0,1603
True,3,1,4
All,1606,1,1607


In [28]:
f_beta(best_knn, X_test, y_test)

0.625

### Save file out

The tuned KNN is the obvious choice. The next step in this project will be to design a web interface that inputs weather features and uses the saved KNN parameters to classify the day as a snow day or not.

In [32]:
model = best_knn

with open('model.pickle', 'wb') as file:
    pickle.dump(model, file)

In [32]:
X_test

Unnamed: 0,AWND,FMTM,PRCP,SNOW,SNWD,TAVG,TMIN,TSUN,WESD,WSFG,WV01,WT04,WT05,WT06,WT09,WT11,WT15,WT17,WT18,WT22
4454,8.28,1550.0,0.00,0.0,0.0,0.0,41.0,0.0,0.0,19.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12899,8.50,0.0,0.00,0.0,0.0,54.0,43.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3014,12.08,1450.0,0.00,0.0,0.0,0.0,45.0,0.0,0.0,26.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14211,8.50,0.0,0.00,0.0,0.0,45.0,37.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1200,0.00,0.0,0.00,0.0,0.0,0.0,40.0,0.0,0.0,41.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10657,6.71,35.0,0.05,0.8,0.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3729,11.63,1750.0,0.00,0.0,0.0,0.0,34.0,0.0,0.0,26.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10217,13.87,2335.0,0.00,0.0,0.0,0.0,28.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9080,16.11,113.0,0.09,0.0,0.0,61.0,58.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5150,10.96,1850.0,0.00,0.0,0.0,0.0,24.0,0.0,0.0,23.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [34]:
best_knn.predict(([0, 0, 1, 10, 5, 25, 19, 0, 2, 20, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1]))

AttributeError: 'Series' object has no attribute 'reshape'