# Part 3: Balancing Classes
---

## Class Distribution
Up until this point, I had mostly just been working with 5% of my total data. After playing around with Tableau for a while, I came up with some plots similar to these:  
<img src='pics/bars.svg'> <img src='pics/pie.svg'>

Clearly the classes are badly imbalanced, and I needed to correct this somehow. Up until now, I had mostly been seeing what the models could do with what I gave them, but moving forward I wanted something a little better.
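
The same picture can be read straight off the labels without leaving the notebook. A minimal sketch, assuming a pandas Series of class labels; the toy `labels` below is made up for illustration and only stands in for the real `SUBCLASS` column:

```python
import pandas as pd

# Hypothetical toy labels standing in for the real SUBCLASS column.
labels = pd.Series(['F'] * 50 + ['K'] * 30 + ['M'] * 15 + ['CV'] * 5, name='SUBCLASS')

counts = labels.value_counts()                  # rows per class, largest first
shares = labels.value_counts(normalize=True)    # fraction of the data per class
imbalance_ratio = counts.max() / counts.min()   # majority size / minority size

print(pd.DataFrame({'count': counts, 'share': shares.round(3)}))
print(f'imbalance ratio: {imbalance_ratio:.1f}')
```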

In [None]:
from SciServer import SkyQuery, SciDrive
import pandas as pd
import numpy as np
import model_processes as mp
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, balanced_accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.ensemble import BalancedRandomForestClassifier
from imblearn.utils import check_sampling_strategy
from xgboost import XGBClassifier

Load the table containing all 500k+ rows of my data:

In [None]:
table_source = 'webuser.AllStars'
database = 'MyDB'

df = SkyQuery.getTable(table_source, datasetName=database)
df = df.set_index('#SPECOBJID')

In [None]:
df.shape
#output: (524354, 341)

In [None]:
df_grouped = mp.grouped_frame(df)
X, y = df_grouped.iloc[:,1:].sort_index(axis=1), df_grouped.iloc[:,0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=25, stratify=y)

## Random Undersampling
I started by randomly undersampling every class except the minority, cutting each one down to the minority class's size.

In [None]:
# fit_resample replaces the older fit_sample, which newer imblearn releases no longer provide
X_sampled, y_sampled = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)
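
Under the hood, random undersampling with the default strategy just throws away rows from every class until each one matches the rarest. A rough sketch of that idea in plain numpy/pandas, not imblearn's actual implementation; `random_undersample` and the toy data are made up for illustration:

```python
import numpy as np
import pandas as pd

def random_undersample(X, y, seed=42):
    """Keep n_min randomly chosen rows per class, where n_min is the rarest class size."""
    rng = np.random.default_rng(seed)
    n_min = y.value_counts().min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero((y == cls).to_numpy()), size=n_min, replace=False)
        for cls in sorted(y.unique())
    ])
    return X.iloc[keep].reset_index(drop=True), y.iloc[keep].reset_index(drop=True)

# Hypothetical toy data: 3 classes with 60, 30 and 10 rows.
y_toy = pd.Series(['F'] * 60 + ['K'] * 30 + ['CV'] * 10, name='SUBCLASS')
X_toy = pd.DataFrame({'feature': np.arange(len(y_toy))})

X_small, y_small = random_undersample(X_toy, y_toy)
print(y_small.value_counts())
```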

In [None]:
params = {'objective':'multi:softmax', 'tree_method':'hist', 'n_estimators':500,
          'num_class':9, 'seed':42}

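# Unpack the dict into keyword arguments; XGBClassifier(params) would pass
# the whole dict as the first positional parameter instead
xbg_model = XGBClassifier(**params).fit(X_sampled, y_sampled)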
y_pred = xbg_model.predict(X_test)

cf = (confusion_matrix(y_true=y_test, y_pred=y_pred, normalize='true') * 100).astype(int)
ba = balanced_accuracy_score(y_test, y_pred)

Output: ba = 0.8432023516694044 <- already a jump of around 0.07!
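
Balanced accuracy is the average of the per-class recalls, so a rare class counts as much as a common one. A small sketch with made-up predictions (the arrays below are hypothetical, not project data) showing how it differs from plain accuracy:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

# Hypothetical toy predictions for a 3-class problem.
y_true = np.array(['F', 'F', 'F', 'F', 'F', 'F', 'K', 'K', 'K', 'CV'])
y_hat = np.array(['F', 'F', 'F', 'F', 'F', 'K', 'K', 'K', 'F', 'F'])

cm = confusion_matrix(y_true, y_hat)             # rows = true class, columns = predicted
recall_per_class = cm.diagonal() / cm.sum(axis=1)

manual_ba = recall_per_class.mean()              # mean of per-class recall
plain_accuracy = (y_true == y_hat).mean()        # dominated by the big class

print('per-class recall:', recall_per_class)
print('balanced accuracy:', manual_ba, 'vs sklearn:', balanced_accuracy_score(y_true, y_hat))
print('plain accuracy:', plain_accuracy)
```

Here the single `CV` star is missed entirely, which barely dents plain accuracy but pulls balanced accuracy down by a full third of its range.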

In [None]:
class_names = sorted(list(y_test.unique()))

In [None]:
fig = mp.print_confusion_matrices({'XGBoost':cf.astype(int)}, class_names, figsz=(8,5))
fig.show()

<img src='pics/xgb_sampled.png'>

The confusion matrix also looks significantly better for the minority classes.
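
One caveat when reading these matrices: `normalize='true'` divides each row by the number of true samples in that class, and `.astype(int)` then truncates rather than rounds, so a 99.7% recall shows up as 99 and a row may not add up to 100. A small sketch with a made-up matrix, using `np.rint` as one possible alternative:

```python
import numpy as np

# Hypothetical raw confusion matrix: rows = true class, columns = predicted class.
raw = np.array([[997, 2, 1],
                [10, 85, 5],
                [1, 2, 3]])

# Same idea as normalize='true': divide each row by its number of true samples.
row_pct = raw / raw.sum(axis=1, keepdims=True) * 100

truncated = row_pct.astype(int)          # drops the decimals
rounded = np.rint(row_pct).astype(int)   # rounds to the nearest whole percent

print(truncated)
print(rounded)
```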

## Balanced Random Forest
Next I tried out imblearn's Balanced Random Forest.

In [None]:
model = RandomForestClassifier()
balanced_model = BalancedRandomForestClassifier()

In [None]:
model.fit(X_train, y_train)
y_pred_rf = model.predict(X_test)
cf_rf = (confusion_matrix(y_true=y_test, y_pred=y_pred_rf, normalize='true') * 100).astype(int)
ba_rf = balanced_accuracy_score(y_test, y_pred_rf)

In [None]:
balanced_model.fit(X_train, y_train)
y_pred_brf = balanced_model.predict(X_test)
cf_brf = (confusion_matrix(y_true=y_test, y_pred=y_pred_brf, normalize='true') * 100).astype(int)
ba_brf = balanced_accuracy_score(y_test, y_pred_brf)

Balanced accuracy scores:  
Undersampling from above: 0.8432023516694044  
Random Forest: 0.7664527226475953  
Balanced Random Forest: 0.8412801777174002

In [None]:
fig = mp.print_confusion_matrices({'Random Forest':cf_rf, 
                                  'Balanced Random Forest':cf_brf,
                                  'XGBoost Random Undersample':cf}, 
                                  class_names, figsz=(8,5))
fig.show()

<img src='pics/brf_vs_rf_vs_under.png'>

So far, looks like both are a significant improvement, although balanced random forest took ages to train, so undersampling is winning thus far.

## A Combined Strategy

In [None]:
over = check_sampling_strategy('auto', y_train, 'over-sampling')
under = check_sampling_strategy('auto', y_train, 'under-sampling')

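Output, where `over` is the number of rows to generate per class and `under` is the number of rows to keep per class:

| Class | over | under |
|---|---|---|
| A | 124118 | 1735 |
| CV | not listed (minority class) | not listed (minority class) |
| CarbonWD | 161148 | 1735 |
| F | not listed (majority class) | 1735 |
| G | 138178 | 1735 |
| K | 85275 | 1735 |
| LT | 167068 | 1735 |
| M | 100463 | 1735 |
| OB | 168474 | 1735 |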

Ok, so I definitely don't want to oversample everything: not only would I end up with stupid amounts of data, but for the rarest classes I'd have far more than 10 times as much synthetic data as real data (for CV, 168,574 generated rows against 1,735 real ones).  
But undersampling everything gives me a pretty small sample to work with: about 3% of my original data.  
So let's try something in between:  
- Under-sample: M, F, A, K, G
- Over-sample: CarbonWD, LT, OB, CV
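
A quick sanity check on what a 10,000-per-class target means for the rare classes. This sketch assumes the `over` values from `check_sampling_strategy` are the number of rows to generate per class, and that the class missing from `under` (CV) is the minority; the real class sizes are then derived from those counts:

```python
# Counts copied from the check_sampling_strategy output above.
to_generate_auto = {'A': 124118, 'CV': 168574, 'CarbonWD': 161148, 'G': 138178,
                    'K': 85275, 'LT': 167068, 'M': 100463, 'OB': 168474}
n_minority = 1735                        # size every class is cut to when under-sampling

# 'auto' over-sampling tops every class up to the majority size, so
# majority size = minority size + rows generated for the minority class (CV).
n_majority = n_minority + to_generate_auto['CV']
real_counts = {cls: n_majority - n for cls, n in to_generate_auto.items()}
real_counts['F'] = n_majority            # F is absent from `over`, so it is the majority

target = 10_000                          # per-class size used in the combined strategy
for cls in ['CarbonWD', 'LT', 'OB', 'CV']:
    added = target - real_counts[cls]
    print(f'{cls:9s} real={real_counts[cls]:6d} added={added:5d} '
          f'({added / target:.0%} of the resampled class)')

print('training rows before resampling:', sum(real_counts.values()))
print('training rows after resampling: ', target * len(real_counts))
```

The derived class sizes add up to the 80% training split of the 524,354 rows, which suggests the reading of the two dictionaries is right.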

In [None]:
under = {'M':10000, 'F':10000, 'A':10000, 'K':10000, 'G':10000}
over = {'CarbonWD':10000, 'LT':10000, 'OB':10000, 'CV':10000}

undersample = RandomUnderSampler(sampling_strategy=under, random_state=42)
oversample_rand = RandomOverSampler(sampling_strategy=over, random_state=42)
oversample_smote = SMOTE(sampling_strategy=over, random_state=42)

In [None]:
X_under, y_under = undersample.fit_resample(X_train, y_train)

I decided to try out both random over sampling and SMOTE, and see which does better.
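
The difference between the two: random over-sampling duplicates existing rows, while SMOTE creates new rows by interpolating between a minority-class row and one of its nearest minority-class neighbours. A simplified numpy sketch of that idea, not imblearn's implementation; `smote_like` and the toy array are made up for illustration:

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class (fine for a small toy array).
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # indices of the k nearest neighbours

    base = rng.integers(0, len(X_min), size=n_new)              # random starting rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour for each
    gap = rng.random((n_new, 1))                                # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical minority class: 20 points with 3 features.
X_minority = np.random.default_rng(0).normal(size=(20, 3))

synthetic = smote_like(X_minority, n_new=50)
# Random over-sampling for comparison: exact copies of existing rows.
duplicates = X_minority[np.random.default_rng(1).integers(0, 20, size=50)]

print(synthetic.shape, duplicates.shape)
```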

In [None]:
X_over_rand, y_over_rand = oversample_rand.fit_resample(X_under, y_under)
X_over_smote, y_over_smote = oversample_smote.fit_resample(X_under, y_under)

In [None]:
xbg_model_rand = XGBClassifier(**params).fit(X_over_rand, y_over_rand)  # unpack the dict as keyword arguments
y_pred_rand = xbg_model_rand.predict(X_test)

In [None]:
xbg_model_smote = XGBClassifier(**params).fit(X_over_smote, y_over_smote)  # unpack the dict as keyword arguments
y_pred_smote = xbg_model_smote.predict(X_test)

In [None]:
cf_rand = (confusion_matrix(y_true=y_test, y_pred=y_pred_rand, normalize='true') * 100).astype(int)
cf_smote = (confusion_matrix(y_true=y_test, y_pred=y_pred_smote, normalize='true') * 100).astype(int)
ba_rand = balanced_accuracy_score(y_test, y_pred_rand)
ba_smote = balanced_accuracy_score(y_test, y_pred_smote)

Balanced Accuracy Scores:

Balanced Random Forest: 0.8413  
XGBoost Random Undersample: 0.8432  
XGBoost Under/Oversample Random: 0.8454  
XGBoost Under/Oversample SMOTE: 0.8446  
<img src='pics/resample_all.png'>

All these methods seem pretty even from these results! I was struggling to decide, so in a moment of insanity, I ran `cross_val_score` on all these models (it took sooooo long).

In [None]:
c = cross_val_score(xbg_model, X_sampled, y_sampled, scoring='balanced_accuracy')
c_brf = cross_val_score(balanced_model, X_train, y_train, scoring='balanced_accuracy')
c_rand = cross_val_score(XGBClassifier(**params), X_over_rand, y_over_rand, scoring='balanced_accuracy')
c_smote = cross_val_score(XGBClassifier(**params), X_over_smote, y_over_smote, scoring='balanced_accuracy')

Cross Val Scores:

Balanced Random Forest: [0.8452 0.8402 0.8393 0.8458 0.8402]  
XGBoost Random Undersample: [0.8473 0.845  0.8444 0.8405 0.8479]  
XGBoost Under/Oversample Random: [0.9348 0.937  0.9397 0.9346 0.9408]  
XGBoost Under/Oversample SMOTE: [0.8752 0.9219 0.9301 0.9253 0.9296]

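Ok, so I realize these scores aren't as reflective of real model performance, since the validation folds come from data that has already been resampled: with random oversampling, copies of the same row can land in both the training and validation folds, which inflates the score. But! Either it means something that the Under/Oversample Random model did so well or it doesn't, and if it doesn't, all of these were about the same by the other metrics, so let's go with this one.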
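
One way to cross-validate without scoring on altered data is to resample inside each fold, so only the training part is touched and the validation part keeps its real class distribution. A minimal sketch of that idea, with a plain numpy over-sampler and a decision tree swapped in for the imblearn samplers and XGBoost; the toy dataset and `oversample_to_largest` are made up for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def oversample_to_largest(X, y, rng):
    """Keep every row and add random duplicates until each class matches the largest."""
    classes, counts = np.unique(y, return_counts=True)
    extra = [rng.choice(np.flatnonzero(y == cls), size=counts.max() - n, replace=True)
             for cls, n in zip(classes, counts)]
    idx = np.concatenate([np.arange(len(y)), *extra])
    return X[idx], y[idx]

# Hypothetical imbalanced toy problem standing in for the star data.
X_toy, y_toy = make_classification(n_samples=600, n_features=10, n_informative=5,
                                   n_classes=3, weights=[0.8, 0.15, 0.05],
                                   random_state=0)

rng = np.random.default_rng(42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in cv.split(X_toy, y_toy):
    # Resample the training fold only; the validation fold stays untouched.
    X_res, y_res = oversample_to_largest(X_toy[train_idx], y_toy[train_idx], rng)
    clf = DecisionTreeClassifier(random_state=42).fit(X_res, y_res)
    scores.append(balanced_accuracy_score(y_toy[val_idx], clf.predict(X_toy[val_idx])))

print(np.round(scores, 4))
```

imblearn also ships a `Pipeline` in `imblearn.pipeline` that applies samplers only during fitting, which would likely be the tidier way to do this with the real models.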

In [None]:
# Re-attach the labels to the chosen resampled training set (random under/over-sampling)
df_sampled = X_over_rand.merge(y_over_rand, left_index=True, right_index=True)
df_sampled = df_sampled[['SUBCLASS', *df_sampled.columns[:-1]]]  # move the label to the first column
# The test set is not resampled, so it keeps the real class distribution
df_test = X_test.merge(y_test, left_index=True, right_index=True)
df_test = df_test[['SUBCLASS', *df_test.columns[:-1]]]

In [None]:
def upload_results():
    # Upload the resampled training set and the untouched test set to SciDrive
    SciDrive.upload(path='metis_project_3/AllStarsSampled.csv', data=df_sampled.to_csv())
    SciDrive.upload(path='metis_project_3/AllStarsSampledTest.csv', data=df_test.to_csv())

try:
    upload_results()
except Exception:
    # Assumed to be an expired session: log in again and retry once
    sp.login()  # NOTE: `sp` is never imported in this notebook; presumably a login helper module
    upload_results()