# Classification - Final assignment

https://archive.ics.uci.edu/ml/datasets/Statlog+%28Shuttle%29

1. You'll need to download the following files from the link: [statlog shuttle](https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/shuttle/)
    * shuttle.trn.Z
    * shuttle.tst
2. These two files are the training and test sets, respectively. You'll have to uncompress the training file. On *nix-based environments this would be:
    
```$ uncompress shuttle.trn.Z ```

Assumptions:
1. We don't have a description of each of the features, so it's not possible to know what the correct range of values is for each feature.
2. I'm assuming that all of the values for each feature are within a valid range.

In [None]:
import numpy as np
feature_names = [f"f{i}" for i in range(1, 10)]
cols_names = np.append(feature_names, ["target"])

In [None]:
import pandas as pd
train_df = pd.read_table('shuttle.trn', sep=r'\s+', names=cols_names)  # raw string avoids an invalid-escape warning
train_df.shape

In [None]:
train_df.head()

In [None]:
test_df = pd.read_table('shuttle.tst', sep=r'\s+', names=cols_names)  # raw string avoids an invalid-escape warning
test_df.shape

In [None]:
test_df.head()

## Some exploration of data

*Inspect frequencies of different labels*

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x=train_df['target'])
plt.show()

In [None]:
train_df['target'].value_counts()

In [None]:
train_df['target'].value_counts(normalize=True).apply(lambda x: "{:.3f}".format(x))

In [None]:
test_df['target'].value_counts()

In [None]:
test_df['target'].value_counts(normalize=True).apply(lambda x: "{:.3f}".format(x))

*The dataset is very imbalanced: labels 2, 3, 6 and 7 are rare relative to labels 1, 4 and 5.
The training and test sets have similar label frequencies.*

*Note: ~78% of samples have label=1, so a model that always predicts label=1 would reach 78% accuracy. When evaluating our model we need precision/recall, or another metric that handles imbalanced datasets well.*
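*A minimal sketch of that point on synthetic labels. The 78/15/7 split and the helper name per_class_recall are illustrative assumptions, not the shuttle data: a constant predictor reaches 78% accuracy while its recall on every other class is zero.*

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall per class: correctly predicted samples / actual samples of that class."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / support[label] for label in sorted(support)}


# Synthetic labels (assumed proportions): 78% class 1, 15% class 4, 7% class 5.
y_true = [1] * 78 + [4] * 15 + [5] * 7
y_pred = [1] * len(y_true)  # a "model" that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = per_class_recall(y_true, y_pred)

print(f"accuracy: {accuracy:.2f}")
print(f"recall per class: {recalls}")
```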

### Looking at data in training set

In [None]:
import warnings

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    _ = train_df.hist(figsize=(20,15), grid=False, bins=50)

In [None]:
import seaborn as sns

_ = sns.histplot(train_df['f2'], kde=True, stat="density", linewidth=0 )

In [None]:
train_df['f2'].describe()

*That's interesting: the IQR (75th percentile - 25th percentile) is 0. Most values are 0, but a few extreme minimum and maximum values pull the mean to the left.*
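*A small sketch of how this can happen, using made-up values rather than the real f2 column: when the bulk of the data sits at 0, both quartiles are 0, yet a handful of extremes still moves the mean. The values assume the negative extremes are larger in magnitude than the positive ones.*

```python
import statistics

# Synthetic stand-in for a feature like f2 (values invented for illustration):
# 96 zeros plus four extreme values.
values = [0] * 96 + [-100, -50, 20, 30]

# quantiles(n=4) returns the three quartile cut points.
q1, median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
mean = statistics.mean(values)

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}, mean={mean}")
```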

In [None]:
_ = sns.boxplot(y=train_df['f2'], x=train_df["target"].values).set_title("Boxplot of f2 against target")

In [None]:
_ = sns.histplot(train_df['f4'], kde=True, stat="density", linewidth=0 ).set_title("Distribution of f4 in training set")

In [None]:
train_df[['f4']].describe()

*Similar to f2, values for f4 are mostly 0, except for some extremes*

In [None]:
print(train_df[train_df['f4'] == 0].shape)
print(train_df[train_df['f4'] == 1].shape)

In [None]:
test_df['target'].value_counts(normalize=True).apply(lambda x: "{:.3f}".format(x))

In [None]:
f4_zero_mask = train_df['f4'] == 0

f4_zeros = train_df[f4_zero_mask]['target'].value_counts(normalize=False)
f4_zeros

In [None]:
f4_non_zeros = train_df[~f4_zero_mask]['target'].value_counts(normalize=False)
f4_non_zeros

*Looking at ratios between targets with 0s and non 0s...*

In [None]:
f4_non_zeros.divide(f4_zeros, fill_value=0)

*For f4, every sample with label=2 has a zero value. Samples with label=6 are equally likely to have a zero or a non-zero value. Labels 1, 3 and 4 are less likely to have a non-zero value than a zero, while label=7 is much more likely to have a non-zero value.*
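*A small sketch of how to read this ratio, using invented (f4, target) pairs rather than the shuttle data: a ratio of 0 means the label only ever has zeros, 1 means zero and non-zero values are equally common, and values above 1 mean non-zero values dominate. It assumes every label has at least one zero row; otherwise the division would need a guard.*

```python
from collections import Counter

# Hypothetical (f4, target) pairs, invented for illustration only.
rows = [
    (0, 1), (0, 1), (0, 1), (5, 1),
    (0, 2), (0, 2),
    (0, 6), (3, 6),
    (9, 7), (4, 7), (0, 7),
]

zeros = Counter(target for f4, target in rows if f4 == 0)
non_zeros = Counter(target for f4, target in rows if f4 != 0)

# Ratio of non-zero rows to zero rows per label; a missing count is treated as 0.
ratios = {label: non_zeros[label] / zeros[label] for label in sorted(zeros)}

for label, ratio in ratios.items():
    print(f"label={label}: non-zero/zero ratio = {ratio:.2f}")
```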

In [None]:
_ = sns.histplot(train_df['f6'], kde=True, stat="density", linewidth=0 ).set_title("Distribution of f6 in training set")

In [None]:
_ = sns.boxplot(y=train_df['f6'], x=train_df["target"].values).set_title("Boxplot of f6 against target")

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 10))
_ = sns.heatmap(train_df.corr(), vmin=-1, vmax=1, cmap='coolwarm', annot=True)

*f2, f4, f6 (the features examined above) are showing little correlation with the target*

*f1, f5, f7, f8, f9 show medium to high correlation with target*

*f1 & f7 have strong negative correlation. f5 has strong negative correlation with f8 & f9. f8 and f9 are positively correlated with one another.*

In [None]:
_ = sns.boxplot(y=train_df['f1'], x=train_df["target"].values).set_title("Boxplot of f1 against target")

In [None]:
_ = sns.boxplot(y=train_df['f5'], x=train_df["target"].values).set_title("Boxplot of f5 against target")

*Possible outliers in f5, for labels=[1,5]*

In [None]:
_ = sns.boxplot(y=train_df['f7'], x=train_df["target"].values).set_title("Boxplot of f7 against target")

In [None]:
_ = sns.boxplot(y=train_df['f8'], x=train_df["target"].values).set_title("Boxplot of f8 against target")

*Possible outliers in f8, for labels=[1,5]*

In [None]:
_ = sns.boxplot(y=train_df['f9'], x=train_df["target"].values).set_title("Boxplot of f9 against target")

*Possible outliers in f9, for labels=[1,5]*

### Using zscores to measure possible outliers...

In [None]:
from scipy import stats
# to_numpy() guarantees an ndarray, so the positional slicing z[:, 0] used later works
z = np.abs(stats.zscore(train_df[['f5', 'f8', 'f9']].to_numpy()))
z

In [None]:
scores = train_df[['f5', 'f8', 'f9','target']].copy()
scores['zscore_f5'] = z[:, 0]
scores['zscore_f8'] = z[:, 1]
scores['zscore_f9'] = z[:, 2]
threshold = 4
mask_outlier_candidates = (scores['zscore_f5'] > threshold) | (scores['zscore_f8'] > threshold)
scores[mask_outlier_candidates]

In [None]:
train_df[mask_outlier_candidates].index.values

In [None]:
_ = sns.countplot(x=scores[mask_outlier_candidates]['target'])

In [None]:
train_df.drop(train_df[mask_outlier_candidates].index.values, axis=0, inplace=True)
train_df = train_df.reset_index(drop=True)  # reindex() without arguments is a no-op

*The outlier values for f5 and f8 are spread similarly across targets 1, 4 and 5. Without knowing the correct range of values for each feature, it's difficult to know whether these values are true outliers, and whether it's appropriate to replace or drop them. There are 11 of these rows. For the moment I will drop them.*
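*The z-score filter used above can be sketched without scipy. This is a minimal illustration on made-up values, assuming the population standard deviation (which matches the default ddof=0 of scipy.stats.zscore); the helper name abs_zscores is hypothetical.*

```python
import statistics


def abs_zscores(values):
    """Absolute z-scores using the population standard deviation (ddof=0)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [abs(v - mean) / std for v in values]


# Made-up feature: 99 values between 0 and 4, plus one extreme value at the end.
values = [float(i % 5) for i in range(99)] + [100.0]

threshold = 4
z = abs_zscores(values)

# Indices of rows whose absolute z-score exceeds the threshold.
outlier_idx = [i for i, score in enumerate(z) if score > threshold]
print(outlier_idx)
```

*Note that a single extreme value inflates the standard deviation itself, so a z-score threshold is a blunt tool; it flags candidates for inspection rather than proving a value is wrong.*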

In [None]:
scores[(scores['zscore_f9'] > threshold)]

In [None]:
_ = sns.countplot(x=scores[(scores['zscore_f9'] > threshold) & (~mask_outlier_candidates)]['target'])

*Every row with an extreme f9 value has target label=5. This is interesting, and a potentially useful piece of information for the model.*

In [None]:
plt.figure(figsize=(14, 10))
_ = sns.scatterplot(x="f8", y="f9", 
                    hue='target', 
                    style='target',
                    palette="Set2",
                    legend='full',                    
                    data=train_df).set_title("Scatter f9 against f8 with target")

In [None]:
plt.figure(figsize=(14, 10))
_ = sns.scatterplot(x="f1", y="f7", 
                    hue='target', 
                    style='target',
                    palette="Set2",
                    legend='full',                    
                    data=train_df).set_title("Scatter f7 against f1 with target")

## Initial Evaluation on Random Forest model

*Going to try a random forest model.*

In [None]:
from sklearn.model_selection import train_test_split

X = train_df[feature_names]
y = train_df['target']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.40, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=42)
_ = clf.fit(X_train, y_train)

print("Training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Validation set: {:.3f}".format(clf.score(X_val, y_val)))

In [None]:
from sklearn.metrics import classification_report
clf = RandomForestClassifier(random_state=42)
_ = clf.fit(X_train, y_train)

print("Training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Validation set: {:.3f}".format(clf.score(X_val, y_val)))
print(classification_report(y_val, clf.predict(X_val), digits=3))

*Very good scores, except for class=6, which has poor recall.* 
*Going to add some under- and over-sampling with SMOTETomek.*
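*For context: SMOTE oversamples a minority class by creating synthetic points between a minority sample and one of its nearest minority-class neighbours, and Tomek-link removal then drops pairs of nearest neighbours that belong to different classes. Below is a simplified sketch of the interpolation step only. It is not imblearn's implementation, and the helper name and the two points are hypothetical.*

```python
import random


def interpolate_sample(sample, neighbour, rng):
    """Return a synthetic point on the line segment between sample and neighbour."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]


rng = random.Random(42)

# Two hypothetical minority-class samples with two features each.
minority_a = [1.0, 10.0]
minority_b = [3.0, 14.0]

# Each synthetic sample lies somewhere between the two real ones.
synthetic = [interpolate_sample(minority_a, minority_b, rng) for _ in range(5)]
for point in synthetic:
    print([round(v, 3) for v in point])
```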

In [None]:
y.value_counts().to_dict()

In [None]:
y_val.value_counts().to_dict()

In [None]:
from imblearn.combine import SMOTETomek
from imblearn.over_sampling import SMOTE

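# random_state added so that the resampling is reproducible across runs
smote = SMOTETomek(random_state=42,
                   smote=SMOTE(k_neighbors=2, random_state=42, sampling_strategy={
                                            2: 50, 
                                            7: 50, 
                                            6: 50}))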

X_smote, y_smote = smote.fit_resample(X_train, y_train)
y_smote.value_counts()

In [None]:
clf = RandomForestClassifier(n_estimators=10, random_state=42)
_ = clf.fit(X_smote, y_smote)

print(classification_report(y_val, clf.predict(X_val), digits=3))

*The results look much better now for labels 6 and 7. It could be that we've just been lucky with the training/validation split, so I'm going to run it again using CV splits.*

## Cross Validation
Let's run the training set through cross-validation.

In [None]:
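#List the available scorer names (sklearn.metrics.SCORERS is gone from recent scikit-learn releases)
#from sklearn.metrics import get_scorer_names
#sorted(get_scorer_names())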

In [None]:
from sklearn.model_selection import cross_validate
from imblearn.pipeline import Pipeline

pipeline = Pipeline([
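   # random_state added so that the resampling is reproducible across runs
   ('sampling', SMOTETomek(random_state=42,
                           smote=SMOTE(k_neighbors=3, random_state=42, sampling_strategy={
                                            2: 50, 
                                            7: 50, 
                                            6: 50}))),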
   ('model', RandomForestClassifier(n_estimators=10, random_state=42))
])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    #Using recall_macro & precision_macro to give equal weight to each class.
    scores = cross_validate(pipeline, X, y, scoring=['recall_macro', 'precision_macro'], cv=5) 

print(f'Mean recall_macro score: {np.mean(scores["test_recall_macro"]):.3f}')
print(f'Mean precision_macro score: {np.mean(scores["test_precision_macro"]):.3f}')

*These macro recall and precision scores are similar to what we got from the classification_report. 
"Macro" recall gives equal weight to every label, regardless of how many samples each label has, so on an imbalanced dataset it will expose classification misses on rare labels.*
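*A small sketch of the difference between a macro average and a support-weighted average. The per-class recall values and supports are invented for illustration, and the two helper names are hypothetical: a poor recall on a rare label barely moves the weighted average but clearly lowers the macro average.*

```python
def macro_average(per_class_scores):
    """Unweighted mean: every class counts the same."""
    return sum(per_class_scores.values()) / len(per_class_scores)


def weighted_average(per_class_scores, support):
    """Mean weighted by the number of samples in each class."""
    total = sum(support.values())
    return sum(per_class_scores[label] * support[label] for label in per_class_scores) / total


# Invented example: label 6 is rare and only half of its samples are recovered.
recall = {1: 1.0, 4: 1.0, 6: 0.5}
support = {1: 900, 4: 96, 6: 4}

macro = macro_average(recall)
weighted = weighted_average(recall, support)

print(f"macro recall:    {macro:.3f}")
print(f"weighted recall: {weighted:.3f}")
```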

## Final Score: Test Set

In [None]:
y_actual = test_df['target']
# Fit on all of X, y now that the model is built.
y_hat = pipeline.fit(X, y).predict(test_df[feature_names])
print(classification_report(y_actual, y_hat, digits=3))
# Score the fitted pipeline; `clf` was only fit on the earlier train/validation split.
pipeline.score(test_df[feature_names], y_actual)

*The RF model also performed well on the test set, achieving an overall accuracy of 0.9997.*