***Creator: Changhee Kang***

**Part 3 - Developing predictive model with sampling techniques applied**

**Note that:** the data manipulation and exploration techniques illustrated in this demonstration are only one possible approach; the audience does not have to follow the same steps.

# Device Failure and Maintenance Prediction Model

The aim is to build a predictive model that uses telemetry attributes to classify whether maintenance should be performed on a device. The target column is named "failure" and holds a binary value: 0 for "non-failure" and 1 for "failure". The goal is to minimize false positives and false negatives.

# Assumptions

As there is no metadata describing the dataset, some assumptions have to be made about it. The dataset consists of telemetry attributes, so it is reasonable to assume that some variables hold *categorical nominal values* while others hold *continuous values*.

# Roadmap

The *'Oversampling'* method will be applied to the dataset to inflate the number of data points in the minority class.

The reason for applying *'Oversampling'* will be explained in detail. Model development with and without cross-validation will be compared to see the impact of cross-validation on the predictive model. The right way of combining sampling techniques with cross-validation will also be discussed. *'Random Oversampling'* is the sampling technique adopted to inflate the number of data points in the minority class.
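
As a rough illustration of what random oversampling does, the sketch below balances a small toy dataset by drawing minority rows with replacement. The toy frame and the `reading` column are hypothetical stand-ins, not the device dataset used in this notebook.

```python
import pandas as pd

# Toy data (hypothetical): 95 non-failure rows and 5 failure rows.
toy = pd.DataFrame({
    "reading": range(100),
    "failure": [0] * 95 + [1] * 5,
})

majority = toy[toy["failure"] == 0]
minority = toy[toy["failure"] == 1]

# Draw minority rows with replacement until both classes have the same size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, minority_upsampled], axis=0)

counts_before = toy["failure"].value_counts().to_dict()
counts_after = balanced["failure"].value_counts().to_dict()
print("before:", counts_before)
print("after: ", counts_after)
```

No new information is created: the extra minority rows are exact copies of existing ones, which is why the order of sampling and splitting matters later on.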

# Data Loading

Import the necessary Python modules to load the dataset into memory.

In [2]:
import pandas as pd
import numpy as np

Load the dataset into memory.

In [3]:
datafile = r'/home/thomas/Downloads/device_failure.csv'
dataset = pd.read_csv(datafile, sep=',', engine='python')

# Build the predictive model with *'Random Oversampling'*

Import the necessary modules for model development.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from matplotlib.colors import ListedColormap
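from imblearn import over_sampling
from imblearn.under_sampling import RandomUnderSampler  # Needed by the 'UNDER' branch of run_model.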

from sklearn import metrics
from sklearn.metrics import auc, accuracy_score, average_precision_score, precision_recall_curve, roc_auc_score, roc_curve, f1_score, precision_score, recall_score, classification_report
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import recall_score

from IPython.display import HTML, display
import tabulate

import warnings
warnings.filterwarnings('ignore')  # "error", "ignore", "always", "default", "module" or "once"

Define the model development procedure. *'Random Oversampling'* is adopted for this binary classification model because, without any sampling technique, the predictive model did not find any failed devices at all. *'Oversampling'* increases the chance that failed devices are selected by the *'Random Forest'* algorithm by inflating the number of data points in the minority class until it equals the number in the majority class. The predictive model thus learns the patterns of both failure and non-failure. Note that *'Oversampling'* is applied only to the training dataset, and it is applied inside the cross-validation loop.
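
To see why the sampling step belongs inside the cross-validation loop, the following sketch counts how many original rows end up in both the training and the test part of a fold under the two possible orderings. It is a minimal illustration on hypothetical toy labels; `oversample_ids` is a helper introduced here for the example and is not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Toy labels (hypothetical): 90 non-failure rows and 10 failure rows.
y = np.array([0] * 90 + [1] * 10)
row_ids = np.arange(len(y))  # each original row gets a unique id


def oversample_ids(ids, labels, rng):
    """Return row ids with the minority class resampled up to the majority size."""
    majority = ids[labels == 0]
    minority = ids[labels == 1]
    extra = rng.choice(minority, size=len(majority), replace=True)
    return np.concatenate([majority, extra])


skf = StratifiedKFold(n_splits=5)

# Leaky order: oversample the whole dataset first, then split into folds.
all_ids = oversample_ids(row_ids, y, rng)
all_y = y[all_ids]
shared_rows_leaky = 0
for train_idx, test_idx in skf.split(all_ids.reshape(-1, 1), all_y):
    shared_rows_leaky += len(set(all_ids[train_idx]) & set(all_ids[test_idx]))

# Safe order: split first, then oversample only the training part of each fold.
shared_rows_safe = 0
for train_idx, test_idx in skf.split(row_ids.reshape(-1, 1), y):
    train_ids = oversample_ids(row_ids[train_idx], y[train_idx], rng)
    shared_rows_safe += len(set(train_ids) & set(row_ids[test_idx]))

print("rows shared between train and test (oversample, then split):", shared_rows_leaky)
print("rows shared between train and test (split, then oversample):", shared_rows_safe)
```

When copies of the same failure row sit on both sides of the split, the test score measures memorization rather than generalization, so the second ordering is the one to use.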

In [5]:
'''
    Models are built with stratified 10-fold cross-validation. Splitting the data into
    folds this way ensures that each fold has the same proportion of observations for
    each class outcome value.
    
    This will produce a predictive model according to a set of function parameters:
    - SUMMARY_TABLE: list of rows that collects the summary of each run.
    - X: holds independent variables.
    - Y: holds the target variable.
    - DEPTH: the maximum depth of each estimator.
    - NEIGHBORS: the number of neighbors to use for data augmentation
        (only recorded in the summary; the sampling branches below do not use it).
    - ESTIMATORS: the number of estimators.
    - WHICH_MODEL: the model to build ('RF' for random forest).
    - WHICH_SAMPLING: the sampling approach.
        'OVER'  - Random oversampling of the minority class.
        'UNDER' - Random undersampling of the majority class.
        any other value - No sampling technique applied.
    
    At the end of the function execution, statistical results are displayed.
'''
def run_model(SUMMARY_TABLE, X, Y, DEPTH, NEIGHBORS, ESTIMATORS, WHICH_MODEL, WHICH_SAMPLING):
    predicted = None
    
    # Convert the dataframes to NumPy arrays so they can be split for the cross-validation runs.
    numpy_X = X.to_numpy()
    numpy_Y = Y.to_numpy()

    # Create 10-stratified-folds.
    skf = StratifiedKFold(n_splits=10)
    rnd_cnt = 1

    total_fpr = 0
    total_acc = 0
    total_err = 0
    total_precision = 0
    total_recall = 0
    total_f1 = 0
    total_tp = 0
    total_tn = 0
    total_train_acc = 0
    total_fp = 0
    
    # Run 10-stratified-cross-validations.
    for train_index, test_index in skf.split(numpy_X, numpy_Y):
        # Slice the NumPy arrays for this fold.
        x_train, x_test = numpy_X[train_index], numpy_X[test_index]
        y_train, y_test = numpy_Y[train_index], numpy_Y[test_index]

        # Convert the NumPy arrays back to dataframes.
        x_train = pd.DataFrame(x_train.reshape(-1, len(X.columns)),columns=X.columns)    
        x_test = pd.DataFrame(x_test.reshape(-1, len(X.columns)),columns=X.columns)
        y_train = pd.DataFrame(y_train.reshape(-1, 1))
        y_test = pd.DataFrame(y_test.reshape(-1, 1))    
        
        '''
            Do sampling accordingly.
        '''
        if WHICH_SAMPLING == 'OVER': 
            # Attach y_train to x_train so that both are over-sampled together.
            train_dataset = x_train
            train_dataset['failure'] = y_train

            # Split the training data into the negative (0) and positive (1) classes.
            train_target_0 = train_dataset[train_dataset.iloc[:,-1] == 0]
            train_target_1 = train_dataset[train_dataset.iloc[:,-1] == 1]

            # Over-sample the minority class with replacement until it matches the majority class.
            train_target_1_oversample = train_target_1.sample(len(train_target_0), replace=True)
            train_oversample = pd.concat([train_target_1_oversample, train_target_0], axis=0)

            x_train = train_oversample # Assign over-sampled train dataset to x_train.
            y_train = x_train.failure # Copy y values from x_train.
            x_train.drop('failure', axis=1, inplace=True) # Drop y values from x_train.
        elif WHICH_SAMPLING == 'UNDER':   
            # Down-sample the majority class of the training data.
            # fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide.
            x_train_res, y_train_res = RandomUnderSampler(random_state=0).fit_resample(x_train, y_train)
            x_train = x_train_res
            y_train = y_train_res
        else: 
            pass
        
        '''
            Select and run the model accordingly.
        '''
        train_acc = 0
        
        if WHICH_MODEL == 'RF': # Random Forest.
            rf = RandomForestClassifier(random_state=0, n_estimators=ESTIMATORS, max_depth=DEPTH)
            rf.fit(x_train,y_train) # Fit the model to training dataset.
            train_acc = rf.score(x_train, y_train) # Training-set accuracy.
            y_pred = rf.predict(x_test) # Predicted classes for the held-out fold.
            
        
        # Show statistics for each round.
        #print("Fold - {}".format(rnd_cnt))
        '''
            Confusion Matrix
        '''
        #print("-------------  Confusion Matrix  ----------------")
        cm = confusion_matrix(y_test, y_pred)
        print("Fold - {}".format(rnd_cnt))
        print(cm)
        print()
        '''
            Get the individual counts from the confusion matrix for further statistics:
            true positives, true negatives, false positives, and false negatives.
        '''
        TN = cm[0][0]
        FP = cm[0][1]
        FN = cm[1][0]
        TP = cm[1][1]
        
        # Compute totals of true positives, true negatives, and false positives.
        total_tp += TP
        total_tn += TN
        total_fp += FP
        
        '''
            Statistics for accuracy, precision, recall, overall error, etc.
        '''
        #print("-------------     Statistics     ----------------")
        acc = (TN+TP)/(TN+TP+FN+FP)
        precision = TP/(FP+TP)
        recall = TP/(FN+TP)
        f1 = 2*(precision*recall/(precision+recall))
        overall_error = (FP+FN)/(TP+TN+FP+FN)     
        fpr = FP/(TN+FP)
        
        # Compute totals of other statistics.
        total_err += overall_error
        total_acc += acc
        total_precision += precision
        total_recall += recall
        total_f1 += f1
        total_fpr += fpr
        total_train_acc += train_acc
        
        # Increment round by 1.
        rnd_cnt += 1
    
    # Compute averages.
    avg_err = round((total_err/10)*100, 2)
    avg_acc = round((total_acc/10)*100, 2)
    avg_pre = round((total_precision/10)*100, 2)
    avg_rec = round((total_recall/10)*100, 2)
    avg_f1 = round((total_f1/10)*100, 2)
    avg_tp = round((total_tp/10), 2)
    avg_tn = round((total_tn/10), 2)
    avg_fpr = round((total_fpr/10)*100, 2)
    avg_train_acc = round((total_train_acc/10)*100, 2)
    avg_fp = round((total_fp/10)*100, 2)  # NOTE: FP is a count, so the *100 makes the reported FP column 100x the per-fold average.

    show_classifier_summary(SUMMARY_TABLE, WHICH_MODEL, WHICH_SAMPLING, ESTIMATORS, DEPTH, NEIGHBORS, len(X.columns), avg_train_acc, avg_acc, avg_pre, avg_rec, avg_fpr, avg_err, avg_f1, avg_tp, avg_tn, avg_fp, X)
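
The per-fold statistics in `run_model` divide raw counts directly, which yields an undefined value when a fold has no predicted positives. A minimal sketch of the same formulas with a guard is shown below. `metrics_from_counts` is a hypothetical helper, the counts are made-up toy values, and returning 0.0 for an undefined ratio is one possible convention rather than the notebook's behavior.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Compute the summary statistics used above from confusion-matrix counts."""

    def safe_div(num, den):
        # Assumption: report 0.0 when the ratio is undefined (zero denominator).
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "fpr": safe_div(fp, fp + tn),
    }


# Toy counts (hypothetical): 95 actual negatives, 5 actual positives.
toy_metrics = metrics_from_counts(tn=90, fp=5, fn=2, tp=3)

# A fold in which the model predicts no positives at all.
no_positive_predictions = metrics_from_counts(tn=95, fp=0, fn=5, tp=0)

print(toy_metrics)
print(no_positive_predictions)
```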

Define a model performance summary.

In [6]:
def show_classifier_summary(summary_table, 
                            which_model, 
                            which_sample_tech, 
                            estimators,
                            depth,
                            neighbor,
                            var_len,
                            avg_train_acc,
                            avg_acc, 
                            avg_pre, 
                            avg_rec, 
                            avg_fpr,
                            avg_err, 
                            avg_f1,
                            avg_tp, 
                            avg_tn, 
                            avg_fp,
                            variables):
    
    summary_table.append(
            [which_model, 
            which_sample_tech, 
            estimators,
            depth,
            neighbor,
            var_len,
            avg_train_acc,
            avg_acc, 
            avg_pre,
            avg_rec,
            avg_fpr,
            avg_f1,
            avg_err, 
            avg_tp, 
            avg_tn, 
            avg_fp,
            variables.columns.tolist()]
        )
    display(HTML(tabulate.tabulate(summary_table, tablefmt='html')))

In [7]:
rf_summary_table = [["MODEL", "SAMPLING", "ESTIMATORS", "DEPTH", "NEIGHBORS", "VARS", "T_ACC", "ACC.", "PRE.", "REC.", "FPR.", "F1", "ERR.", "TP", "TN", "FP", "VARS."]]

# Select the variables to include in the model.
X = dataset[['attribute1', 'attribute2', 'attribute3', 'attribute4','attribute5', 'attribute6', 'attribute7', 'attribute9']]
Y = dataset.failure
DEPTH = 7 # Set the max depth to 7.
ESTIMATORS = 100 # Set the number of decision trees that take part in voting.
NEIGHBORS = 0
run_model(rf_summary_table, X, Y, DEPTH, NEIGHBORS, ESTIMATORS, 'RF', 'OVER') # Random forest with random oversampling.

Fold - 1
[[11869   570]
 [    6     5]]

Fold - 2
[[11902   537]
 [    2     9]]

Fold - 3
[[11899   540]
 [    4     7]]

Fold - 4
[[12017   422]
 [    7     4]]

Fold - 5
[[11963   476]
 [    4     7]]

Fold - 6
[[12021   418]
 [    8     3]]

Fold - 7
[[12053   386]
 [    7     3]]

Fold - 8
[[11993   446]
 [    6     4]]

Fold - 9
[[12006   432]
 [    5     5]]

Fold - 10
[[11933   505]
 [    5     5]]



| MODEL | SAMPLING | ESTIMATORS | DEPTH | NEIGHBORS | VARS | T_ACC | ACC. | PRE. | REC. | FPR. | F1 | ERR. | TP | TN | FP | VARS. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RF | OVER | 100 | 7 | 0 | 8 | 89.63 | 96.16 | 1.07 | 48.82 | 3.8 | 2.09 | 3.84 | 5.2 | 11965.6 | 47320.0 | ['attribute1', 'attribute2', 'attribute3', 'attribute4', 'attribute5', 'attribute6', 'attribute7', 'attribute9'] |
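
The summary row averages each statistic over the folds. As a cross-check, the ten confusion matrices printed above can also be pooled into a single matrix. The sketch below does this with the numbers copied from that output. Pooling is an alternative way of aggregating, not what `run_model` does, so the pooled precision and recall differ slightly from the fold-averaged values in the table.

```python
import numpy as np

# Per-fold confusion matrices copied from the output above: [[TN, FP], [FN, TP]].
folds = np.array([
    [[11869, 570], [6, 5]],
    [[11902, 537], [2, 9]],
    [[11899, 540], [4, 7]],
    [[12017, 422], [7, 4]],
    [[11963, 476], [4, 7]],
    [[12021, 418], [8, 3]],
    [[12053, 386], [7, 3]],
    [[11993, 446], [6, 4]],
    [[12006, 432], [5, 5]],
    [[11933, 505], [5, 5]],
])

# Sum the ten 2x2 matrices element-wise.
pooled = folds.sum(axis=0)
(tn, fp), (fn, tp) = pooled

pooled_precision = tp / (tp + fp)
pooled_recall = tp / (tp + fn)
mean_fp_per_fold = fp / len(folds)

print(pooled)
print("pooled precision (%):", round(pooled_precision * 100, 2))
print("pooled recall (%):   ", round(pooled_recall * 100, 2))
print("mean FP per fold:    ", mean_fp_per_fold)
```

The mean number of false positives per fold comes out at 473.2. The FP column in the table shows 47320.0 because `run_model` multiplies that average by 100.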


The predictive model now seems to recognize both failure and non-failure. Although no preferred combination of variables has been applied yet, the model with all variables can identify some of the failed devices. Let's try the categorical variables only.

In [10]:
# Select the variables to include in the model.
X = dataset[['attribute3', 'attribute4','attribute5', 'attribute7', 'attribute9']]
Y = dataset.failure
DEPTH = 4 # Set the max depth to 4.
ESTIMATORS = 100 # Set the number of decision trees that take part in voting.
NEIGHBORS = 0
run_model(rf_summary_table, X, Y, DEPTH, NEIGHBORS, ESTIMATORS, 'RF', 'OVER') # Random forest with random oversampling.

Fold - 1
[[11196  1243]
 [    2     9]]

Fold - 2
[[11558   881]
 [    2     9]]

Fold - 3
[[11501   938]
 [    4     7]]

Fold - 4
[[11534   905]
 [    5     6]]

Fold - 5
[[11543   896]
 [    3     8]]

Fold - 6
[[11595   844]
 [    4     7]]

Fold - 7
[[11554   885]
 [    4     6]]

Fold - 8
[[11584   855]
 [    5     5]]

Fold - 9
[[11635   803]
 [    3     7]]

Fold - 10
[[11341  1097]
 [    4     6]]



| MODEL | SAMPLING | ESTIMATORS | DEPTH | NEIGHBORS | VARS | T_ACC | ACC. | PRE. | REC. | FPR. | F1 | ERR. | TP | TN | FP | VARS. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RF | OVER | 100 | 7 | 0 | 8 | 89.63 | 96.16 | 1.07 | 48.82 | 3.8 | 2.09 | 3.84 | 5.2 | 11965.6 | 47320.0 | ['attribute1', 'attribute2', 'attribute3', 'attribute4', 'attribute5', 'attribute6', 'attribute7', 'attribute9'] |
| RF | OVER | 100 | 4 | 0 | 5 | 82.04 | 92.46 | 0.75 | 65.82 | 7.51 | 1.48 | 7.54 | 7.0 | 11504.1 | 93470.0 | ['attribute3', 'attribute4', 'attribute5', 'attribute7', 'attribute9'] |


Next, try another combination of variables: *'attribute4'* and *'attribute7'* only.

In [11]:
# Select the variables to include in the model.
X = dataset[['attribute4', 'attribute7']]
Y = dataset.failure
DEPTH = 2 # Set the max depth to 2.
ESTIMATORS = 100 # Set the number of decision trees that take part in voting.
NEIGHBORS = 0
run_model(rf_summary_table, X, Y, DEPTH, NEIGHBORS, ESTIMATORS, 'RF', 'OVER') # Random forest with random oversampling.

Fold - 1
[[11113  1326]
 [    2     9]]

Fold - 2
[[11493   946]
 [    0    11]]

Fold - 3
[[11507   932]
 [    4     7]]

Fold - 4
[[11516   923]
 [    5     6]]

Fold - 5
[[11516   923]
 [    3     8]]

Fold - 6
[[11543   896]
 [    4     7]]

Fold - 7
[[11535   904]
 [    3     7]]

Fold - 8
[[11476   963]
 [    5     5]]

Fold - 9
[[11445   993]
 [    2     8]]

Fold - 10
[[11197  1241]
 [    4     6]]



| MODEL | SAMPLING | ESTIMATORS | DEPTH | NEIGHBORS | VARS | T_ACC | ACC. | PRE. | REC. | FPR. | F1 | ERR. | TP | TN | FP | VARS. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RF | OVER | 100 | 7 | 0 | 8 | 89.63 | 96.16 | 1.07 | 48.82 | 3.8 | 2.09 | 3.84 | 5.2 | 11965.6 | 47320.0 | ['attribute1', 'attribute2', 'attribute3', 'attribute4', 'attribute5', 'attribute6', 'attribute7', 'attribute9'] |
| RF | OVER | 100 | 4 | 0 | 5 | 82.04 | 92.46 | 0.75 | 65.82 | 7.51 | 1.48 | 7.54 | 7.0 | 11504.1 | 93470.0 | ['attribute3', 'attribute4', 'attribute5', 'attribute7', 'attribute9'] |
| RF | OVER | 100 | 2 | 0 | 2 | 80.88 | 91.9 | 0.74 | 69.64 | 8.08 | 1.47 | 8.1 | 7.4 | 11434.1 | 100470.0 | ['attribute4', 'attribute7'] |


The result is reasonably satisfactory, but the model does not know which variables it should use for the best performance. So far, variables have been added to the model manually, relying on heuristic human effort to select them. The problem with this process is that variable importance was calculated in isolation from the other variables: each variable's importance score was tested only for its relationship with the target variable. It is almost impossible for people to think of the combination of variables that gives the best model performance, because the number of possible combinations grows quickly. For example, if there are 5 variables in a dataset, there are $$ 2^5-1 = 31 $$ non-empty combinations to test. With more variables, it becomes impractical to reason by hand about how the variables affect predictive performance.
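
The count of candidate variable sets can be checked by enumeration. The sketch below lists every non-empty subset of the five variables used in the second run, which gives 2^5 - 1 = 31 subsets, and shows how fast the count grows with more variables.

```python
from itertools import combinations

variables = ["attribute3", "attribute4", "attribute5", "attribute7", "attribute9"]

# Every non-empty subset: all combinations of size 1, 2, ..., len(variables).
subsets = [
    combo
    for size in range(1, len(variables) + 1)
    for combo in combinations(variables, size)
]
print(len(subsets))

# The count doubles (plus one) with each added variable.
growth = {n: 2 ** n - 1 for n in (5, 10, 20)}
print(growth)
```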

In the next parts, model development will adopt the *'filter method'* and the *'wrapper method'* to choose the best variables for the model. With the *'filter method'*, each variable is evaluated in isolation from the other variables; it only shows that variable's significance towards the target variable. The *'filter method'* is similar to adding variables to the model manually, but it automates the variable selection.
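
As a preview of the idea, a filter-style selection can be sketched with scikit-learn's `SelectKBest`, which scores each variable against the target on its own. The synthetic data and the choice of the ANOVA F-test (`f_classif`) as the scoring function are assumptions made for this illustration; part 4 may use a different score.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 500

# Synthetic data (hypothetical): only column 0 depends on the target.
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    y * 2.0 + rng.normal(size=n),  # informative variable
    rng.normal(size=n),            # noise
    rng.normal(size=n),            # noise
])

# Score every variable independently and keep the single best one.
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
selected = selector.get_support(indices=True).tolist()

print("F-scores:", selector.scores_.round(2))
print("selected column indices:", selected)
```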

However, when the relationship between a combination of variables and the target variable matters, the *'filter method'* is not appropriate for choosing the best variables. The *'wrapper method'* is therefore introduced: it uses machine learning algorithms to evaluate the relationship between multiple variables and the target variable and to choose the combination of variables with the best prediction performance. The *'filter method'* is introduced in *'part 4'* and the *'wrapper method'* in *'part 5'*.
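
A wrapper-style search can be sketched as follows: every subset of variables is scored by cross-validating a model trained on just that subset. The synthetic target depends on two variables jointly (an XOR-like pattern), so neither variable looks useful on its own. The data, the k-nearest-neighbors model, and the exhaustive search are all assumptions chosen to keep the example small; part 5 may use a different algorithm.

```python
from itertools import combinations

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 400

# Synthetic data (hypothetical): the target depends on columns 0 and 1 together.
X = rng.normal(size=(n, 3))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

# Score every non-empty subset of columns with 5-fold cross-validated accuracy.
scores = {}
for size in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), size):
        model = KNeighborsClassifier(n_neighbors=5)
        scores[subset] = cross_val_score(model, X[:, list(subset)], y, cv=5).mean()

best_subset = max(scores, key=scores.get)

for subset, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(subset, round(score, 3))
print("best subset:", best_subset)
```

Each single column scores close to chance here, while the subsets containing both column 0 and column 1 score far higher, which is the kind of joint relationship a per-variable filter cannot detect.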