# MSDS 7331 - Case Study 3 - Classification of e-mail as ham or spam
Daniel Crouthamel

Sophia Wu

Fabio Savorgnan

Bo Yun

# Introduction

In this study, we build a classifier to predict which businesses will go bankrupt.

# Business Understanding

You should always state the objective at the beginning of every case (a guideline you should follow in real life as well) and provide some initial "Business Understanding" statements (i.e., what is trying to be solved for and why might it be important)

In [None]:
#importing libraries and reading in file
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from time import time
from scipy.stats import randint as sp_randint

#general sklearn libraries
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
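try:
    from sklearn.metrics import plot_precision_recall_curve
except ImportError:
    # plot_precision_recall_curve was removed in scikit-learn 1.2; fall back to the display API
    from sklearn.metrics import PrecisionRecallDisplay
    plot_precision_recall_curve = PrecisionRecallDisplay.from_estimator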
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import RocCurveDisplay
from sklearn.metrics import ConfusionMatrixDisplay

# Yellowbrick
from yellowbrick.model_selection import FeatureImportances

#Pipeline
from sklearn.pipeline import make_pipeline

# Files
from os import listdir, getcwd, chdir
from os.path import isfile, join, dirname, realpath
from scipy.io import arff



# Data engineering

Summarize the data being used in the case using appropriate mediums (charts, graphs, tables); address questions such as: Are there missing values? Which variables are needed (which ones are not)? What assumptions or conclusions are you drawing that need to be relayed to your audience?

## Load the data and EDA

In [None]:
files = ['data/1year.arff', 'data/2year.arff', 'data/3year.arff', 'data/4year.arff', 'data/5year.arff']


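# Load each yearly ARFF file and stack the rows into a single DataFrame.
# pd.concat appends rows; an outer merge would instead join on every shared column.
frames = []
for f in files:
    df_temp = pd.DataFrame(arff.loadarff(f)[0])
    print(df_temp.shape)
    frames.append(df_temp)

df = pd.concat(frames, ignore_index=True)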

In [None]:
print(df.shape)

df.head()

In [None]:
df.describe()

In [None]:
# Count missing values per column
nan_df = df.isnull().sum().rename_axis('Attributes').reset_index(name='Nan Count')

# Horizontal bar chart with one bar per attribute, labelled with its NaN count
ax = nan_df.plot(kind='barh', x='Attributes', y='Nan Count', figsize=(20, 20), rot=0,
                 title='NaNs per attribute', legend=False)
ax.set_xlabel('Count')
ax.set_ylabel('Attribute')
for c in ax.containers:
    ax.bar_label(c, label_type='edge', fontsize=14)

In [None]:
# Fill missing values by linear interpolation down each column (df.interpolate), not by the column mean
df = df.where(pd.notna(df), df.interpolate(), axis="columns")
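
The fill above relies on `df.interpolate()`, which fills a gap from the neighbouring rows rather than from the column mean. A minimal sketch of the difference, assuming a made-up `toy` column that is not part of the bankruptcy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, used only to contrast the two fill strategies
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0, 10.0]})

# Linear interpolation fills the gap from the rows directly above and below it
interpolated = toy.interpolate()

# Mean fill replaces the gap with the mean of the observed values in the column
mean_filled = toy.fillna(toy.mean())

print(interpolated["x"].tolist())
print(mean_filled["x"].tolist())
```

Interpolation depends on row order, so it is worth checking whether neighbouring rows are actually related before relying on it.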

In [None]:
# After filling the missing values, check again.
missing = df.isnull().any(axis=1).sum()
len_before = df.shape[0]
print(f"Total records missing data: {missing}\n"
      f"Total percent of incomplete records: {missing/len_before*100:.2f}%"
     )

## Evaluate the target and transform it to binary 0 or 1

In [None]:
df["class"].unique()


In [None]:
# This shows that the target is very imbalanced
df['class'].value_counts(normalize=False)

In [None]:
# Plot to better show the imbalanced target
plt.hist(df['class'])

In [None]:
# Convert the target to 0 and 1

# classes = []

# for index, row in df.iterrows():
#     class_val = row['class']
#     if class_val not in classes:
#         classes.append(class_val)

# class_dict = {}

# for index, i in enumerate(classes):
#     class_dict.update({i:str(index)})
    
# df['class'] = df['class'].map(class_dict)

# All of the code above can be replaced with this
df['class'] = df['class'].replace([b'0',b'1'],[0,1])

df['class'].unique()

In [None]:
%%time
#Check all correlations
cmap = sns.diverging_palette(220, 10, as_cmap=True) # one of the many color mappings

# show the heatmap
sns.set(style="darkgrid") # one of the many styles to plot using
f, ax = plt.subplots(figsize=(18, 9))
chart=sns.heatmap(df.corr(), cmap=cmap, annot=False)
chart.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
f.tight_layout()

In [None]:
# List the top absolute correlations (the first row is the self-correlation of 1.0)
sort_corr = pd.DataFrame(df.corr().abs().unstack().sort_values(ascending=False).drop_duplicates())
sort_corr.rename(columns={0:'Top Abs Corr'}, inplace=True)
sort_corr.head(15)

In [None]:
# Dataframe
df.head()

## Explore the different columns of the data with pandas profiling

In [None]:
from pandas_profiling import ProfileReport
profile = ProfileReport(df, minimal=True)
profile.to_file(output_file="output.html")

We decided to keep variables with less than 10% missing values, because we would impute their missing values with the mean using the simple imputer.

We therefore removed Attr21, "Sales (n) / sales (n-1)", because it has 13.5% missing values, and Attr37, "Profit on operating activities / financial expenses", because it has 43.7% missing values. For Attr37 in particular, we believe that this many missing values cannot be replaced in a meaningful way by imputation.

We also plan to normalize the data using the robust scaler.

Please see the attached pandas profiles.
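
The 10% rule described above can be sketched as follows. This is an illustration on a hypothetical `toy` frame (the column names and values are made up), not the exact procedure used to pick Attr21 and Attr37:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names and values are illustrative only
toy = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "B": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})

# Percentage of missing values per column
missing_pct = toy.isnull().mean() * 100

# Columns above the 10% threshold are candidates for removal
to_drop = missing_pct[missing_pct > 10].index.tolist()

print(missing_pct.round(1).to_dict())
print(to_drop)
```

On the real data, the same two lines applied to `df` would give the per-attribute percentages to compare against the profiling report.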
  


In [None]:
# Sales (n) / sales (n-1)
plt.hist(df["Attr21"])

In [None]:
# Profit on operating activities / financial expenses
plt.hist(df["Attr37"])

In [None]:
# Final dataframe
df= df.drop(["Attr21"], axis = 1)
df= df.drop(["Attr37"], axis = 1)
df.head()

# Model preparation

Which methods are you proposing to utilize to solve the problem?  Why is this method appropriate given the business objective? How will you determine if your approach is useful (or how will you differentiate which approach is more useful than another)?  More specifically, what evaluation metrics are most useful given that the problem is a binary-classification one (ex., Accuracy, F1-score, Precision, Recall, AUC, etc.)?
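
Because the target is heavily imbalanced, accuracy alone can look good for a useless model. A minimal sketch with made-up labels (the 95/5 split is an assumption for illustration, not the real class ratio):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 95 healthy companies (0) and 5 bankruptcies (1)
y_true = np.array([0] * 95 + [1] * 5)

# A trivial "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, pos_label=1)
# zero_division=0 avoids a warning when no positive is ever predicted
prec = precision_score(y_true, y_pred, pos_label=1, zero_division=0)
f1 = f1_score(y_true, y_pred, pos_label=1, zero_division=0)

print("Accuracy:", acc)
print("Recall:", rec)
print("Precision:", prec)
print("F1:", f1)
```

The always-negative model scores high on accuracy while missing every bankruptcy, which is why recall, precision, and F1 on the positive class are reported later in this study.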

## Random Forest

In [None]:
# prepare test and train data

X = df.loc[:, df.columns != 'class'].values
y = df['class'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [None]:
# Impute

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit(X_train)
X_train = imp_mean.transform(X_train)
X_test = imp_mean.transform(X_test)

# Normalize the data: fit the scaler on the training set only, then apply it to both sets
transformer = RobustScaler().fit(X_train)
X_train = transformer.transform(X_train)
X_test = transformer.transform(X_test)
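
One possible way to keep the imputer and scaler fitted on training rows only is to chain them with the model using `make_pipeline`, which is already imported above. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` are made up), not the bankruptcy set:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data (not the bankruptcy set): 200 rows, 5 features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] > 0.5).astype(int)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # inject roughly 5% missing values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# Imputer and scaler statistics are learned from the training rows only
pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    RobustScaler(),
    RandomForestClassifier(n_estimators=20, random_state=0),
)
pipe.fit(X_tr, y_tr)

test_score = pipe.score(X_te, y_te)
print("Test accuracy:", test_score)
```

A pipeline like this can also be passed directly to `GridSearchCV`, so that preprocessing is refitted inside each cross-validation fold.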

# Model building and Evaluation

In this case, your primary task is to build both a Random Forest and XGBoost model to accurately predict bankruptcy and will involve the following steps:

- Specify your sampling methodology
- Setup your models - highlighting any important parameters
- Analyze each model's performance - referencing your chosen evaluation metric (including supplemental visuals and analysis where appropriate)
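
For the sampling step, one option with an imbalanced target is a stratified split, which keeps the class ratio similar in the train and test sets. A minimal sketch on made-up labels (the 5% positive rate is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 5% positives
y_demo = np.array([0] * 950 + [1] * 50)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class ratio approximately equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```

Without `stratify`, a random split of a rare class can leave the test set with noticeably more or fewer positives than the training set.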

## Set up the Grid Search for Random Forest

In [None]:
# build a classifier
clf = RandomForestClassifier(n_estimators=20)


# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")


# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)

start = time()
random_search.fit(X_train, y_train)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)

# use a full grid over all parameters
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X_train, y_train)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.cv_results_['params'])))
report(grid_search.cv_results_)


## Evaluation of the model

In [None]:
y_hat_rf_train = grid_search.predict(X_train)
accuracy_score(y_train, y_hat_rf_train)

In [None]:
# Confusion matrix train
confusion_matrix(y_train, y_hat_rf_train)

In [None]:
y_hat_rf_test = grid_search.predict(X_test)
accuracy_score(y_test, y_hat_rf_test)

In [None]:
# Confusion matrix test
confusion_matrix(y_test, y_hat_rf_test)
disp = ConfusionMatrixDisplay.from_estimator(grid_search, X_test, y_test)

In [None]:
# Precision and recall (the target was converted to integers, so the positive label is 1, not "1")
print("Recall:", recall_score(y_test, y_hat_rf_test, pos_label=1, average='binary'))
print("Precision:", precision_score(y_test, y_hat_rf_test, pos_label=1, average='binary'))

## Plot evaluation

In [None]:
disp = plot_precision_recall_curve(grid_search, X_test, y_test,)
disp.ax_.set_title('Precision-Recall Curve')

## ROC

In [None]:
disp = RocCurveDisplay.from_estimator(grid_search, X_test, y_test)

## GBoost model

In [None]:
clf = GradientBoostingClassifier(n_estimators=100, learning_rate= 0.1,
    max_depth=10, random_state=0).fit(X_train, y_train)

## Evaluation of the model for comparison to random forest, in order to see if we can improve the Random forest best model

In [None]:
clf.score(X_train, y_train)

In [None]:
clf.score(X_test, y_test)

In [None]:
clf.feature_importances_

In [None]:
y_hat_G = clf.predict(X_test)

In [None]:
# Precision and recall (the target was converted to integers, so the positive label is 1, not "1")
print("Recall:", recall_score(y_test, y_hat_G, pos_label=1, average='binary'))
print("Precision:", precision_score(y_test, y_hat_G, pos_label=1, average='binary'))

## Plot precision and recall

In [None]:
disp = plot_precision_recall_curve(clf, X_test, y_test,)
disp.ax_.set_title('Precision-Recall Curve')

In [None]:
# Confusion matrix
confusion_matrix(y_test, y_hat_G)
disp = ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)

## ROC 

In [None]:
disp = RocCurveDisplay.from_estimator(clf, X_test, y_test)

## Try the grid search for GBoost in order to see if there is any improvement

In [None]:
param_test1 = {
    'n_estimators': range(100, 500, 1000),  # the step exceeds the span, so only 100 is tried
    'max_depth': range(10, 16),
    'min_samples_split': range(2, 10),  # an integer min_samples_split must be at least 2
}
gsearch1 = GridSearchCV(
    estimator=GradientBoostingClassifier(learning_rate=0.1, min_samples_leaf=1, random_state=0),
    param_grid=param_test1, n_jobs=4).fit(X_train, y_train)
gsearch1.score(X_test, y_test)

In [None]:
y_hat_GG = gsearch1.predict(X_test)
# Precision and recall (the target was converted to integers, so the positive label is 1, not "1")
print("Recall:", recall_score(y_test, y_hat_GG, pos_label=1, average='binary'))
print("Precision:", precision_score(y_test, y_hat_GG, pos_label=1, average='binary'))

# Model Interpretability & Explainability

Using at least one of your models above (if multiple were trained):

- Which variable(s) was (were) "most important" and why?  How did you come to the conclusion and how should your audience interpret this?
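
As one possible way to approach this question, impurity-based importances from a tree ensemble can be cross-checked with permutation importance on held-out data. The sketch below uses synthetic stand-in data in which only feature 0 drives the label; `X_demo`, `y_demo`, and `model` are hypothetical names, not objects from this notebook:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: only feature 0 determines the label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances come from the fitted trees themselves
print("Impurity-based:", model.feature_importances_.round(3))

# Permutation importance: drop in held-out score when one feature is shuffled
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
print("Permutation:", result.importances_mean.round(3))
```

When both measures rank the same variables at the top, the ranking is easier to defend to a non-technical audience.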

In [None]:
# yellow brick

# Set xyz to the appropriate variable
viz = FeatureImportances(xyz, topn=6, relative=False)
viz.fit(X_test, y_test)
viz.show()

# Conclusion

After all of your technical analysis and modeling; what are you proposing to your audience and why?  How should they view your results and what should they consider when moving forward?  Are there other approaches you'd recommend exploring?  This is where you "bring it all home" in language they understand.