## Imbalanced Data Experiments

**Amélie Buc** / *August 2018* / Classification on an imbalanced data set.

## Import Packages

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import tree, metrics
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

## Read Data

In [2]:
# To print the current working directory, uncomment:
# import os
# print(os.getcwd())

spreadsheet_file_path = "/Users/ameliebuc/Documents/byond_internship/ImBlanced-Classification.csv"
data = pd.read_csv(spreadsheet_file_path, encoding='utf-8')
data.describe()

| | Label | b | c | d | e | f | g | h | i | j | k | l | m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 24844.0 | 24844.0 | 24844.0 | 24844.0 | 24844.0 | 24844.0 | 24844.0 | 24844.0 | 24844.0 | 24844.0 | 24844.0 | 24844.0 | 24844.0 |
| mean | 0.040211 | 36.845979 | 1.288118 | 1.564845 | 15.765939 | 7.881702 | 72.928192 | 3.200935 | 736.620633 | 951.564281 | 2253.277 | 1.95351 | 0.339961 |
| std | 0.196458 | 13.241031 | 0.453162 | 0.496761 | 26.337659 | 18.785623 | 40.728075 | 6.440581 | 292.545306 | 749.563452 | 14034.04 | 0.642311 | 0.473705 |
| min | 0.0 | 18.676712 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 25% | 0.0 | 25.920548 | 1.0 | 1.0 | 0.0 | 2.0 | 27.0 | 0.0 | 743.0 | 0.0 | 850.0 | 2.0 | 0.0 |
| 50% | 0.0 | 34.178082 | 1.0 | 2.0 | 8.0 | 2.0 | 99.0 | 0.0 | 826.0 | 1271.0 | 1200.0 | 2.0 | 0.0 |
| 75% | 0.0 | 45.328767 | 2.0 | 2.0 | 8.0 | 8.0 | 99.0 | 3.317808 | 904.0 | 1624.0 | 2050.0 | 2.0 | 1.0 |
| max | 1.0 | 95.476712 | 2.0 | 2.0 | 81.0 | 99.0 | 99.0 | 56.356164 | 1138.0 | 1900.0 | 2000000.0 | 3.0 | 1.0 |


## Prepare Data

In [12]:
# Set your prediction target and the features you'll need to predict instances
y = data.Label
# take out b, c and m
data_features = ['d', 'e', 'f', 'g','h', 'i', 'j', 'k', 'l']

# Select the columns listed in data_features from the DataFrame data and store them in X
X = data[data_features]

# How imbalanced is the data?
imb_count = data.Label.value_counts()
print("The data is imbalanced in the following way: \n{}".format(imb_count))

# Split the data into a training set and a validation set so that the metrics
#   are measured on rows the model has not seen
# The data is ordered (most 1s are at the end), so the split is randomized
#   to make sure some 1s go into each set
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.2, random_state=90)

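# Oversample (res = resampled) the training data only, using the SMOTE algorithm;
#   the validation set keeps its original class distribution
# fit_resample replaces fit_sample, which newer imbalanced-learn releases removed
sm = SMOTE(random_state=12)
x_train_res, y_train_res = sm.fit_resample(train_X, train_y)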


The data is imbalanced in the following way: 
0    23845
1      999
Name: Label, dtype: int64
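
With about 4% positives, a randomized split usually places some 1s in each set, but it does not guarantee the same class ratio in both. A minimal sketch, assuming synthetic labels rather than the notebook's data (`X_demo` and `y_demo` are made-up names), of how the `stratify` argument of `train_test_split` keeps the ratio equal:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data: 1000 rows, 4% positives.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 40 + [0] * 960)

# stratify=y_demo asks for the same class proportions in both splits.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=90, stratify=y_demo
)

print("positive share, train:", y_tr.mean())
print("positive share, validation:", y_va.mean())
```

Whether stratifying changes the results on the real data has not been checked here; it mainly makes the validation metrics less dependent on the random seed.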


## Build Model

In [15]:
# Create a random forest classifier
model = RandomForestClassifier(n_estimators=200, criterion="gini", random_state=3, class_weight="balanced")

# Train the model on the oversampled training set
model.fit(x_train_res, y_train_res)

# Predict labels for the validation set
predicted = model.predict(val_X)
print(predicted)


[0 0 0 ... 0 0 0]


## Performance Metrics

In [16]:
# Compute accuracy, precision and recall on the validation set
accuracy_measure = metrics.accuracy_score(val_y, predicted)
print('Accuracy: {0:0.8f}'.format(accuracy_measure))

precision_measure = metrics.precision_score(val_y, predicted)
print('Precision: {0:0.8f}   What proportion of positive identifications was actually correct?'.format(precision_measure))

recall_measure = metrics.recall_score(val_y, predicted)
print('Recall: {0:0.8f}      What proportion of actual positives was identified correctly?'.format(recall_measure))

Accuracy: 0.92090964
Precision: 0.18000000   What proportion of positive identifications was actually correct?
Recall: 0.26865672      What proportion of actual positives was identified correctly?
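
Accuracy alone can mislead on data this imbalanced. As a rough check that uses only numbers printed in this notebook, the sketch below computes the accuracy of a trivial classifier that always predicts 0, and the F1 score implied by the reported precision and recall. It assumes the class counts of the full data set, because the validation-set counts are not shown.

```python
# Class counts printed earlier for the full data set (not the validation set).
n_negative = 23845
n_positive = 999

# Accuracy of a classifier that predicts 0 for every row.
baseline_accuracy = n_negative / (n_negative + n_positive)

# Precision and recall as reported above.
precision = 0.18
recall = 0.26865672

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print('Always-0 baseline accuracy: {0:0.4f}'.format(baseline_accuracy))
print('F1 from reported precision and recall: {0:0.4f}'.format(f1))
```

Under that assumption the always-0 baseline comes out near 0.96, above the reported accuracy of 0.92, which is why precision, recall and F1 (about 0.22) are the more informative numbers here.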


## Further Task Ideas
- Improve precision and recall.
- Run a grid search over the model's hyperparameters.
- Compare how the 0 and 1 rows are distributed for each feature, to find features that do not help separate the two classes and can be dropped.
- Try other models (neural network: 10, 3 layers each).
- Plot a learning curve.
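
The grid search idea could be sketched as follows. This is only an illustration: it uses synthetic data from `make_classification` instead of the notebook's CSV, and the parameter grid values are arbitrary assumptions, not tuned choices. Scoring by F1 rather than accuracy keeps the search focused on the minority class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (about 10% positives) standing in for the real set.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=9, weights=[0.9, 0.1], random_state=0
)

# Hypothetical grid; the values are placeholders for illustration.
param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}

# 3-fold cross-validated search, selecting the combination with the best mean F1.
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=3),
    param_grid,
    scoring="f1",
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print('Best mean F1: {0:0.4f}'.format(search.best_score_))
```

If SMOTE is kept, the oversampling would need to happen inside each cross-validation fold (for example through an imbalanced-learn pipeline) so that synthetic samples do not leak into the held-out fold; that step is left out of this sketch.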