## Benchmarking Results ##

***Description of Notebook:***

This notebook fits four untuned ("naive") models to establish a prediction benchmark for our classification problem. The models used are:

1. Logistic Regression
2. Decision Tree Classifier
3. K Nearest Neighbors
4. Support Vector Classifier

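The Decision Tree Classifier performed best, with a ROC AUC score of 0.754 and a log loss of 8.502. Both metrics were computed from hard class predictions rather than predicted probabilities, which is why the log loss values are so large.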

### Results

**Logistic Regression:**

*ROC AUC Score:* 0.520

*Log Loss:* 16.605

**Decision Tree Classifier:**

*ROC AUC Score:* 0.754

*Log Loss:* 8.502

**K Nearest Neighbors:**

*ROC AUC Score:* 0.696

*Log Loss:* 10.495

**Support Vector Classifier:**

*ROC AUC Score:* 0.500

*Log Loss:* 17.469
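
Because the log loss values above come from hard class labels, each one is essentially the error rate multiplied by a fixed penalty. The sketch below illustrates that arithmetic using the accuracy scores reported later in this notebook. It assumes predictions are clipped to `[eps, 1 - eps]` with `eps = 1e-15`, a value chosen here because it reproduces the reported numbers; the helper name `hard_label_log_loss` is hypothetical.

```python
import math


def hard_label_log_loss(error_rate, eps=1e-15):
    """Log loss when every prediction is a hard 0/1 label clipped to [eps, 1 - eps].

    eps = 1e-15 is an assumption that matches the numbers reported in this notebook.
    """
    wrong = -math.log(eps)      # penalty for each confidently wrong prediction
    right = -math.log(1 - eps)  # penalty for each confidently right prediction (about 0)
    return error_rate * wrong + (1 - error_rate) * right


# Accuracy scores as reported in the benchmarking cells of this notebook.
reported_accuracy = {
    "Logistic Regression": 0.519,
    "Decision Tree Classifier": 0.754,
    "K Nearest Neighbors": 0.696,
    "Support Vector Classifier": 0.494,
}

for name, accuracy in reported_accuracy.items():
    print(f"{name}: {hard_label_log_loss(1 - accuracy):.2f}")
```

The printed values land close to the log loss figures listed above (small differences come from the accuracies being rounded to three decimals), so in this benchmark log loss ranks the models the same way accuracy does.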

## Benchmarking Code ##

**Library Imports**

In [2]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, classification_report, accuracy_score, log_loss

**Data Import**

In [3]:
# pandas parses an extra column (index 500) from these files; drop it to keep the 500 features
train_data = pd.read_csv('./data/madelon_train.data',
                         sep = ' ', header = None).drop(500, axis = 1)
train_labels = pd.read_csv('./data/madelon_train.labels', sep = ' ', header = None)

In [4]:
train_data.shape, train_labels.shape

((2000, 500), (2000, 1))

In [5]:
val_data = pd.read_csv('./data/madelon_valid.data',
                       sep = ' ', header = None).drop(500, axis = 1)
val_labels = pd.read_csv('./data/madelon_valid.labels', sep = ' ', header = None)

In [6]:
val_data.shape, val_labels.shape

((600, 500), (600, 1))

In [7]:
test_data = pd.read_csv('./data/madelon_test.data', 
                         sep = ' ', header = None).drop(500, axis = 1)

In [8]:
train_data['target'] = train_labels
val_data['target'] = val_labels

In [28]:
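# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows and keeps the original indices
uci_df = pd.concat([train_data, val_data])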

In [29]:
train_data.shape, val_data.shape

((2000, 501), (600, 501))

In [30]:
uci_df.shape

(2600, 501)

In [31]:
uci_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2600 entries, 0 to 599
Columns: 501 entries, 0 to target
dtypes: int64(501)
memory usage: 10.0 MB


In [32]:
# get_dtype_counts() was removed in pandas 1.0; dtypes.value_counts() gives the same tally
uci_df.dtypes.value_counts()

int64    501
dtype: int64

**Select the data set to use in the workflow here.**

In [21]:
df = uci_df.copy()

# Step 1 - Benchmarking #

In [None]:
predictors = df[df.columns[0:500]]
target = df[df.columns[500]]

In [None]:
predictors.shape

In [None]:
target.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size = .2, random_state = 42)

### Logistic Regression

In [None]:
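# The L1 penalty needs a solver that supports it; liblinear was the default in older scikit-learn releases
log_reg = LogisticRegression(penalty = 'l1', C = 10000, solver = 'liblinear')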

In [None]:
log_reg.fit(X_train, y_train)

In [None]:
print("Accuracy Score:", accuracy_score(y_test, log_reg.predict(X_test)))

In [None]:
print("ROC AUC Score:", roc_auc_score(y_test, log_reg.predict(X_test)))

In [None]:
print("Log Loss:", log_loss(y_test, log_reg.predict(X_test)))

**Accuracy Score:** 0.519

**ROC AUC Score:** 0.520

**Log Loss:** 16.605

In [None]:
print(classification_report(y_test, log_reg.predict(X_test)))
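
The cells above pass hard class labels from `predict` to `roc_auc_score` and `log_loss`. Both metrics are usually more informative when given predicted probabilities instead. The following is a minimal sketch of that alternative, not a rerun of the benchmark: it assumes synthetic data from `make_classification` as a stand-in for the Madelon features and a default `LogisticRegression` rather than the settings used above, so its numbers are not comparable to the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: this is NOT the Madelon data set).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Column 1 of predict_proba is the probability of the class model.classes_[1].
proba = model.predict_proba(X_test)[:, 1]
hard = model.predict(X_test)

auc_from_proba = roc_auc_score(y_test, proba)   # ranks samples by probability
auc_from_labels = roc_auc_score(y_test, hard)   # only one threshold available
loss_from_proba = log_loss(y_test, proba)       # penalizes by confidence

print("ROC AUC (probabilities):", round(auc_from_proba, 3))
print("ROC AUC (hard labels):  ", round(auc_from_labels, 3))
print("Log Loss (probabilities):", round(loss_from_proba, 3))
```

Applying the same pattern to the notebook's models would require `predict_proba` support; `SVC`, for example, only exposes it when constructed with `probability=True`.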

### Decision Tree Classifier

In [None]:
dt_clf = DecisionTreeClassifier()

In [None]:
dt_clf.fit(X_train, y_train)

In [None]:
print("Accuracy Score:", accuracy_score(y_test, dt_clf.predict(X_test)))

In [None]:
print("ROC AUC Score:", roc_auc_score(y_test, dt_clf.predict(X_test)))

In [None]:
print("Log Loss:", log_loss(y_test, dt_clf.predict(X_test)))

**Accuracy Score:** 0.754

**ROC AUC Score:** 0.754

**Log Loss:** 8.502

In [None]:
print(classification_report(y_test, dt_clf.predict(X_test)))

### K Nearest Neighbors

In [None]:
knn = KNeighborsClassifier(n_neighbors = 5)

In [None]:
knn.fit(X_train, y_train)

In [None]:
print("Accuracy Score:", accuracy_score(y_test, knn.predict(X_test)))

In [None]:
print("ROC AUC Score:", roc_auc_score(y_test, knn.predict(X_test)))

In [None]:
print("Log Loss:", log_loss(y_test, knn.predict(X_test)))

**Accuracy Score:** 0.696

**ROC AUC Score:** 0.696

**Log Loss:** 10.495

In [None]:
print(classification_report(y_test, knn.predict(X_test)))

### Support Vector Classifier

In [None]:
svc = SVC(C = 10000)
svc.fit(X_train, y_train)

In [None]:
print("Accuracy Score:", accuracy_score(y_test, svc.predict(X_test)))

In [None]:
print("ROC AUC Score:", roc_auc_score(y_test, svc.predict(X_test)))

In [None]:
print("Log Loss:", log_loss(y_test, svc.predict(X_test)))

**Accuracy Score:** 0.494

**ROC AUC Score:** 0.500

**Log Loss:** 17.469
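
The four model sections repeat the same fit-and-score steps, so they could be folded into one loop. Below is a minimal sketch of such a loop. The `benchmark` helper is hypothetical, the data is a synthetic stand-in from `make_classification` rather than Madelon, and the logistic regression uses default settings, so the scores will not match the results reported here. Log loss is left out because it adds little when computed on hard labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def benchmark(models, X_train, X_test, y_train, y_test):
    """Fit each model and score its hard predictions on the held-out split."""
    results = {}
    for name, model in models.items():
        preds = model.fit(X_train, y_train).predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, preds),
            "roc_auc": roc_auc_score(y_test, preds),
        }
    return results


# Synthetic stand-in data (assumption: this is NOT the Madelon data set).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=42),
    "K Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Support Vector Classifier": SVC(C=10000),
}

results = benchmark(models, X_train, X_test, y_train, y_test)
for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, roc_auc={scores['roc_auc']:.3f}")
```

Keeping the models in a dictionary makes it straightforward to add or swap a classifier without copying the scoring cells.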