# Example Evaluation Code

This notebook will be very __similar__ to the code I use to evaluate your results - it is provided for __your convenience__ so that you can use it to evaluate your preprocessing results at any time before your __final submission__.

Please note that the results here will __NOT__ be the same as my evaluation results.

Let's start with loading the required packages.

In [2]:
# import required package for data handling
import pandas as pd
import numpy as np

# import required packages for splitting data
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

# import required packages for evaluating models
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support

# import `logistic regression` model
from sklearn.linear_model import LogisticRegression

Next you should load __your__ data. In this case, I am using a sample dataset (`GroupX.csv`) which contains 6 predictors (`X1 - X6`) and two target variables (`Y1, Y2`).

Please make sure you change the data to your __OWN__ dataset when using this code.

__NOTE__:
1. Your dataset may be very different from the sample dataset.
2. Please follow this structure when submitting your dataset.

In [3]:
data = pd.read_csv('GroupX.csv', header=0)
data.head()

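| | X1 | X2 | X3 | X4 | X5 | X6 | Y1 | Y2 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 63.008 | 60.735 | 52.482 | 45.474 | 48.495 | 0 | 1 |
| 1 | 0 | 38.417 | 52.756 | 66.68 | 45.059 | 0.0 | 1 | 0 |
| 2 | 0 | 58.163 | 61.107 | 30.678 | 49.188 | 0.0 | 1 | 0 |
| 3 | 0 | 57.762 | 67.167 | 37.176 | 64.204 | 0.0 | 1 | 1 |
| 4 | 1 | 25.93 | 44.696 | 58.801 | 48.079 | 21.823 | 0 | 1 |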


In [4]:
'''data = pd.read_csv('test1.csv',usecols =[ 'C1' , 'C2' , 'C3_p' ,'C4' , 'C5_p' , 'C6_p' , 
                                        'C7' , 'T1' ,'T2' ,'T3','T4' , 'T5', 'S1','S2','S3' ,'div_manufacturing' ,'div_other' ,'div_services' ,'Y1', 'Y2' ])
'''

In [5]:
data = pd.read_csv('data_skew_imputed_outliers_fixed_min_max.csv',usecols = ['C3_cube_root_Imput_p_min_max', 'Y1','Y2', 
                                                               'C1_Imput_outliner_fix_min_max',
                                                                'C2_Imput',
                                                                'C4_Imput_min_max',
                                                                'C5_log_Imput_p_outliner_fix_min_max',
                                                                'C6_log_Imput_p_outliner_fix_min_max',
                                                                'C7_log_Imput_outliner_fix_min_max',
                                                                'div_manufacturing','div_other','div_services',
                                                                'T3_Imput_ratio_outliner_fix_min_max',
                                                                'T4_Imput_ratio_outliner_fix_min_max',
                                                                'T5_Imput_ratio_outliner_fix_min_max',
                                                                'S1_Imput_ratio_outliner_fix_min_max',
                                                                'S2_Imput_ratio_outliner_fix_min_max',
                                                                'S3_Imput_ratio_outliner_fix_min_max'])
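
`pd.read_csv` raises an error when a name listed in `usecols` is missing from the file, so it can help to check the header first. The following is a minimal sketch, assuming a small in-memory CSV in place of your file; `missing_columns` is a hypothetical helper (not part of pandas) and the column names are illustrative only.

```python
import io
import pandas as pd

def missing_columns(csv_source, wanted):
    """Return the names in `wanted` that are absent from the CSV header."""
    # nrows=0 reads only the header row, so this stays cheap for large files
    header = pd.read_csv(csv_source, nrows=0).columns
    return [name for name in wanted if name not in header]

# Toy CSV standing in for a real file (assumed column names)
toy_csv = "C1,C2,Y1,Y2\n0.1,0.2,0,1\n0.3,0.4,1,0\n"

wanted = ["C1", "C3", "Y1"]
absent = missing_columns(io.StringIO(toy_csv), wanted)
print(absent)

# Load only the columns that are actually present
present = [name for name in wanted if name not in absent]
data_subset = pd.read_csv(io.StringIO(toy_csv), usecols=present)
print(list(data_subset.columns))
```

With a real file you would pass the file path instead of the `io.StringIO` object.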

Checking your data types and making sure they follow the data dictionary is an important step. You can do that using the `.dtypes` attribute.

__NOTE__: all __continuous__ features will be in `float64` data type, and all __categorical__ features will be in `int64` data type (given you already coded them, per __suggested task \#6__ in the competition document).

In [6]:
data.dtypes

div_manufacturing                        int64
div_other                                int64
div_services                             int64
Y1                                       int64
Y2                                       int64
C2_Imput                               float64
C1_Imput_outliner_fix_min_max          float64
C3_cube_root_Imput_p_min_max           float64
C4_Imput_min_max                       float64
C5_log_Imput_p_outliner_fix_min_max    float64
C6_log_Imput_p_outliner_fix_min_max    float64
C7_log_Imput_outliner_fix_min_max      float64
T3_Imput_ratio_outliner_fix_min_max    float64
T4_Imput_ratio_outliner_fix_min_max    float64
T5_Imput_ratio_outliner_fix_min_max    float64
S1_Imput_ratio_outliner_fix_min_max    float64
S2_Imput_ratio_outliner_fix_min_max    float64
S3_Imput_ratio_outliner_fix_min_max    float64
dtype: object
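
The dtype convention in the note above can also be checked programmatically. This is a minimal sketch, assuming you keep your own lists of continuous and categorical column names; the toy frame and `dtype_violations` are hypothetical and only illustrate the idea.

```python
import pandas as pd

def dtype_violations(df, continuous, categorical):
    """Return {column: actual dtype} for columns that break the convention."""
    problems = {}
    for col in continuous:
        if df[col].dtype != "float64":
            problems[col] = str(df[col].dtype)
    for col in categorical:
        if df[col].dtype != "int64":
            problems[col] = str(df[col].dtype)
    return problems

toy = pd.DataFrame({
    "C1": [0.5, 0.7],      # continuous, stored as float64
    "C2": [1, 2],          # continuous, but stored as an integer
    "div_other": [0, 1],   # categorical, stored as int64
})
# Make the dtypes explicit so the example behaves the same on every platform
toy = toy.astype({"C1": "float64", "C2": "int64", "div_other": "int64"})

violations = dtype_violations(toy, continuous=["C1", "C2"], categorical=["div_other"])
print(violations)
```

An empty dictionary means every listed column matches the expected type.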

In [7]:
data.Y2

0      1
1      0
2      0
3      1
4      1
      ..
655    1
656    1
657    0
658    0
659    0
Name: Y2, Length: 660, dtype: int64

Now you need to specify your targets and predictors. __NOTE__ we have two targets here (`Y1, Y2`).

In [8]:
y1 = data.Y1
y2 = data.Y2

Check the shape of the data.

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660 entries, 0 to 659
Data columns (total 18 columns):
div_manufacturing                      660 non-null int64
div_other                              660 non-null int64
div_services                           660 non-null int64
Y1                                     660 non-null int64
Y2                                     660 non-null int64
C2_Imput                               660 non-null float64
C1_Imput_outliner_fix_min_max          660 non-null float64
C3_cube_root_Imput_p_min_max           660 non-null float64
C4_Imput_min_max                       660 non-null float64
C5_log_Imput_p_outliner_fix_min_max    660 non-null float64
C6_log_Imput_p_outliner_fix_min_max    660 non-null float64
C7_log_Imput_outliner_fix_min_max      660 non-null float64
T3_Imput_ratio_outliner_fix_min_max    660 non-null float64
T4_Imput_ratio_outliner_fix_min_max    660 non-null float64
T5_Imput_ratio_outliner_fix_min_max    660 non-null float64
S1_Im

It is very possible that you will use different sets of the predictors for `Y1` and `Y2`. Now let's define them.

First, let's define predictors for `Y1`. The cell below lists the available columns and previews the first six features after the two targets.

In [10]:

cols = [
    'Y1', 'Y2',
    'C6_log_Imput_p_outliner_fix_min_max',
    'C3_cube_root_Imput_p_min_max',
    'C1_Imput_outliner_fix_min_max',
    'C2_Imput',
    'C4_Imput_min_max',
    'C5_log_Imput_p_outliner_fix_min_max',
    'C7_log_Imput_outliner_fix_min_max',
    'div_manufacturing', 'div_other', 'div_services',
    'T3_Imput_ratio_outliner_fix_min_max',
    'T4_Imput_ratio_outliner_fix_min_max',
    'T5_Imput_ratio_outliner_fix_min_max',
    'S1_Imput_ratio_outliner_fix_min_max',
    'S2_Imput_ratio_outliner_fix_min_max',
    'S3_Imput_ratio_outliner_fix_min_max',
]
# first six features after the two targets
cols[2:-10]

['C6_log_Imput_p_outliner_fix_min_max',
 'C3_cube_root_Imput_p_min_max',
 'C1_Imput_outliner_fix_min_max',
 'C2_Imput',
 'C4_Imput_min_max',
 'C5_log_Imput_p_outliner_fix_min_max']

In [11]:
'''cols =  [ 'Y1','Y2','C6_Imput_p_outliner_fix_min_max_sqrt', 'C3_Imput_p',
        
                                                                'C1_Imput_outliner_fix_min_max',
                                                                'C2_Imput_min_max',
                                                                'C4_Imput_min_max_sqrt',
                                                                'C5_Imput_p_outliner_fix_min_max_sqrt',
                                                                
                                                                'C7_Imput_outliner_fix_min_max_sqrt',
                                                                'div_manufacturing','div_other','div_services',
                                                                'T3_Imput_ratio_outliner_fix_min_max_sqrt',
                                                                'T4_Imput_ratio_outliner_fix_min_max',
                                                                'T5_Imput_ratio_outliner_fix_min_max',
                                                                'S1_Imput_ratio_outliner_fix_min_max_sqrt',
                                                                'S2_Imput_ratio_outliner_fix_min_max_sqrt',
                                                                'S3_Imput_ratio_outliner_fix_min_max']'''

In [48]:

cols_y1 = [
    'Y1', 'Y2',
    'C1_Imput_outliner_fix_min_max',
    'C4_Imput_min_max',
    'C5_log_Imput_p_outliner_fix_min_max',
    'T3_Imput_ratio_outliner_fix_min_max',
    'T4_Imput_ratio_outliner_fix_min_max',
    'T5_Imput_ratio_outliner_fix_min_max',
    'S2_Imput_ratio_outliner_fix_min_max',
]

In [49]:
cols_y1_2 = [
    'Y1', 'Y2',
    'C3_cube_root_Imput_p_min_max',
    'C1_Imput_outliner_fix_min_max',
    'C4_Imput_min_max',
    'C5_log_Imput_p_outliner_fix_min_max',
    'S2_Imput_ratio_outliner_fix_min_max',
]


In [50]:
cols_y1_3 = [
    'Y1', 'Y2',
    'C4_Imput_min_max',
    'C1_Imput_outliner_fix_min_max',
    'C5_log_Imput_p_outliner_fix_min_max',
    'C7_log_Imput_outliner_fix_min_max',
    'S2_Imput_ratio_outliner_fix_min_max',
]

Use the code below to select the predictors for `Y1`. The slice `cols_y1[3:]` skips the two targets and the first listed feature, which leaves six predictors.

In [51]:
predictors_y1 = data[cols_y1[3:]]
predictors_y1.head()
#predictors_y1.pop('Y1')
#predictors_y1.pop('Y2')
predictors_y1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660 entries, 0 to 659
Data columns (total 6 columns):
C4_Imput_min_max                       660 non-null float64
C5_log_Imput_p_outliner_fix_min_max    660 non-null float64
T3_Imput_ratio_outliner_fix_min_max    660 non-null float64
T4_Imput_ratio_outliner_fix_min_max    660 non-null float64
T5_Imput_ratio_outliner_fix_min_max    660 non-null float64
S2_Imput_ratio_outliner_fix_min_max    660 non-null float64
dtypes: float64(6)
memory usage: 31.1 KB
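
Positional slices such as `cols_y1[3:]` depend on the order of the list, so the set of predictors can change silently when the list is edited. Below is a sketch of selecting predictors by name instead, assuming the targets are always called `Y1` and `Y2`; `select_predictors` is a hypothetical helper and the toy frame stands in for `data`.

```python
import pandas as pd

TARGETS = ["Y1", "Y2"]

def select_predictors(df, columns):
    """Return df restricted to `columns`, with any target columns removed."""
    keep = [c for c in columns if c not in TARGETS]
    return df[keep]

# Toy frame with made-up values
toy = pd.DataFrame({
    "Y1": [0, 1], "Y2": [1, 0],
    "C1": [0.1, 0.2], "C4": [0.3, 0.4], "S2": [0.5, 0.6],
})

predictors = select_predictors(toy, ["Y1", "Y2", "C1", "C4", "S2"])
print(list(predictors.columns))
```

This keeps the targets out of the predictors regardless of where they appear in the list.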


In [52]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
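# recent scikit-learn versions require n_features_to_select as a keyword argument
rfe = RFE(model, n_features_to_select=5)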
fit = rfe.fit(predictors_y1, y1)
print("Num Features: %s" % (fit.n_features_))
print("Selected Features: %s" % (fit.support_))
print("Feature Ranking: %s" % (fit.ranking_))

Num Features: 5
Selected Features: [ True False  True  True  True  True]
Feature Ranking: [1 2 1 1 1 1]
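
The boolean mask in `support_` lines up with the columns of the frame passed to `fit`, so it can be mapped back to feature names. This is a minimal sketch on synthetic data; the data and the column names `f0` to `f5` are made up for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the predictors and the target
X_arr, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                               random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

rfe = RFE(LogisticRegression(), n_features_to_select=5)
rfe.fit(X, y)

# support_ is a boolean array aligned with X.columns
selected = list(X.columns[rfe.support_])
dropped = list(X.columns[~rfe.support_])
print("Selected:", selected)
print("Dropped:", dropped)
```

Reading names rather than `True`/`False` positions makes it harder to drop the wrong column by hand.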


In [35]:


cols_y2 = [
    'Y1', 'Y2',
    'C6_log_Imput_p_outliner_fix_min_max',
    'C4_Imput_min_max',
    'div_other',
    'T3_Imput_ratio_outliner_fix_min_max',
    'S2_Imput_ratio_outliner_fix_min_max',
    'S3_Imput_ratio_outliner_fix_min_max',
]
                                                            

In [36]:

cols_y2_1 = [
    'Y1', 'Y2',
    'C6_log_Imput_p_outliner_fix_min_max',
    'C3_cube_root_Imput_p_min_max',
    'div_other',
    'T3_Imput_ratio_outliner_fix_min_max',
    'S2_Imput_ratio_outliner_fix_min_max',
    'S3_Imput_ratio_outliner_fix_min_max',
]
                                                            

Upon investigation of the data, we know we have __six__ features (`X1 - X6`) predicting `Y2`. Use similar code (as below) to select them.

In [40]:
predictors_y2 = data[cols_y2_1[2:]]
#predictors_y2.pop('Y1')
#predictors_y2.pop('Y2')
predictors_y2.head()

| | C3_cube_root_Imput_p_min_max | div_other | T3_Imput_ratio_outliner_fix_min_max | S2_Imput_ratio_outliner_fix_min_max | S3_Imput_ratio_outliner_fix_min_max |
|---|---|---|---|---|---|
| 0 | 1.0 | 0 | 0.638012 | 0.515574 | 0.453823 |
| 1 | 0.0 | 0 | 0.537617 | 0.655044 | 0.449531 |
| 2 | 0.0 | 0 | 0.642696 | 0.301377 | 0.492216 |
| 3 | 0.0 | 0 | 0.718941 | 0.365204 | 0.647441 |
| 4 | 1.0 | 1 | 0.436206 | 0.577647 | 0.48075 |


In [38]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
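# recent scikit-learn versions require n_features_to_select as a keyword argument
rfe = RFE(model, n_features_to_select=4)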
fit = rfe.fit(predictors_y2, y2)
print("Num Features: %s" % (fit.n_features_))
print("Selected Features: %s" % (fit.support_))
print("Feature Ranking: %s" % (fit.ranking_))

Num Features: 4
Selected Features: [ True False  True False  True  True]
Feature Ranking: [1 3 1 2 1 1]


Below is the key part of this notebook - which generates a `logistic regression` model to predict `Y1`/`Y2`.

The code works this way:

1. We generate two lists `f1_score_lst` and `auc_lst` to store f1_score and AUC from each of the `10` runs of the model;
2. Define model:
    1. We define a `LogisticRegression()` model;
    
    2. We split predictors (`predictors_y1`) and target `y1` to training (80%) and testing (20%);
    
    3. We fit the model `clf` to the training data, then use it to predict on the testing data;
    
    4. We also define a `10-fold cross validation` to make sure our model does not overfit - see [here](https://scikit-learn.org/stable/modules/cross_validation.html) for more info;
    
    5. We append the f1_score and AUC of current model to the lists (`f1_score_lst` and `auc_lst`) we defined earlier.
  
3. Print out average f1_score and AUC for all 10 runs;
4. Print out average accuracy from cross validation;
5. Print out confusion matrix and classification report for the __last__ model.

__NOTE__: Step 3 provides the evaluation results we need; step 4 - 5 can be used to verify the results from step 3.
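
The steps above can be sketched as a small function. This is an illustration under assumptions, not the evaluation code itself: it uses synthetic data from `make_classification`, `repeated_evaluation` is a hypothetical name, and it passes a different `random_state` to `train_test_split` on each run, because a fixed seed reproduces the same split every time and the average then equals a single run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

def repeated_evaluation(X, y, n_runs=10):
    """Average weighted F1 and AUC over n_runs random 80/20 splits."""
    f1_scores, auc_scores = [], []
    for seed in range(n_runs):
        # a new seed per run gives a new train/test split per run
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        f1_scores.append(f1_score(y_test, y_pred, average="weighted"))
        # AUC computed from hard 0/1 predictions
        auc_scores.append(roc_auc_score(y_test, y_pred))
    return float(np.mean(f1_scores)), float(np.mean(auc_scores))

# Synthetic stand-in for the predictors and target
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
mean_f1, mean_auc = repeated_evaluation(X, y)
print("F1 {:.4f}; AUC {:.4f}".format(mean_f1, mean_auc))
```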

In [53]:
# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []


#loop to calculate f1 and auc scores and present averages
#note: range(1, 30) gives 29 runs; use range(10) for the 10 runs described above
for count in range(1, 30):
    #Model building
    clf = LogisticRegression()
    X1_train, X1_test, y1_train, y1_test = train_test_split(predictors_y1, y1, test_size=0.2, random_state=123)
    clf.fit(X1_train, y1_train)

    y1_pred = clf.predict(X1_test)

    
    #10-fold cross validation
    kfold = model_selection.KFold(n_splits=10, random_state=123, shuffle=True)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(clf, X1_train, y1_train, cv=kfold, scoring=scoring)

    

    
    #calculate f1-score and AUC
    
    clf_roc_auc = roc_auc_score(y1_test, y1_pred)
    f1_score_lst.append(precision_recall_fscore_support(y1_test, y1_pred, average='weighted')[2])
    auc_lst.append(clf_roc_auc)


print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))

#result=logit_model.fit()
confusion_matrix_y1 = confusion_matrix(y1_test, y1_pred)


#print(result.summary())
print('Accuracy of classifier on test set: {:.2f}'.format(clf.score(X1_test, y1_test)))

print("10-fold cross validation average accuracy of classifier: %.3f" % (results.mean()))

print('Confusion Matrix for Logistic Regression Classifier:')
print(confusion_matrix_y1)

print('Classification Report for Logistic Regression Classifier:')
print(classification_report(y1_test, y1_pred))


F1 0.6301; AUC 0.6460 
Accuracy of classifier on test set: 0.63
10-fold cross validation average accuracy of classifier: 0.599
Confusion Matrix for Logistic Regression Classifier:
[[43 35]
 [14 40]]
Classification Report for Logistic Regression Classifier:
              precision    recall  f1-score   support

           0       0.75      0.55      0.64        78
           1       0.53      0.74      0.62        54

    accuracy                           0.63       132
   macro avg       0.64      0.65      0.63       132
weighted avg       0.66      0.63      0.63       132
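
The summary line can be reproduced by hand from the confusion matrix, which is a useful sanity check. The sketch below uses only the four counts printed above. Note that `roc_auc_score` applied to hard 0/1 predictions reduces to the mean of the true positive rate and the true negative rate.

```python
# Confusion matrix for Y1: rows are true classes, columns are predictions
tn, fp = 43, 35   # true class 0
fn, tp = 14, 40   # true class 1

def f1(true_pos, false_pos, false_neg):
    """F1 expressed directly in counts: 2TP / (2TP + FP + FN)."""
    return 2 * true_pos / (2 * true_pos + false_pos + false_neg)

f1_class1 = f1(tp, fp, fn)
# For class 0 the roles swap: its "positives" are the true negatives
f1_class0 = f1(tn, fn, fp)

# Weighted F1 averages the per-class scores by class support
support0, support1 = tn + fp, fn + tp
weighted_f1 = (support0 * f1_class0 + support1 * f1_class1) / (support0 + support1)

tpr = tp / (tp + fn)   # recall of class 1
tnr = tn / (tn + fp)   # recall of class 0
auc_from_labels = (tpr + tnr) / 2

print("F1 {:.4f}; AUC {:.4f}".format(weighted_f1, auc_from_labels))
# prints: F1 0.6301; AUC 0.6460
```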



The code below evaluates the model for `Y2`. It is very similar to the code above; the key difference is that `Y2` is imbalanced, so I wrote some code (under `# Begin oversampling`) to deal with that.

In [29]:
# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []


#loop to calculate f1 and auc scores and present averages
#note: range(1, 30) gives 29 runs; use range(10) for the 10 runs described above
for count in range(1, 30):
    #Model building
    clf1 = LogisticRegression()

    
    
    # Splitting data into testing and training
    X2_train, X2_test, y2_train, y2_test = train_test_split(predictors_y2, y2, test_size=0.2, random_state=123)
    
    # Begin oversampling
    oversample = pd.concat([X2_train, y2_train], axis=1)
    max_size = oversample['Y2'].value_counts().max()
    lst = [oversample]
    for class_index, group in oversample.groupby('Y2'):
        # resample each class with replacement up to the size of the largest class
        lst.append(group.sample(max_size - len(group), replace=True))
    oversampled = pd.concat(lst)
    y2_train = oversampled['Y2'].copy()
    X2_train = oversampled.drop(columns='Y2')
    
    # fitting model on oversampled data
    clf1.fit(X2_train, y2_train)
    
    y2_pred = clf1.predict(X2_test)
    
    
    #10-fold cross validation
    kfold = model_selection.KFold(n_splits=10, random_state=123, shuffle=True)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(clf1, X2_train, y2_train, cv=kfold, scoring=scoring)
    
    #calculate f1-score and AUC
    
    clf1_roc_auc = roc_auc_score(y2_test, y2_pred)
    
    
    #calculate average f1-score and AUC
    f1_score_lst.append(precision_recall_fscore_support(y2_test, y2_pred, average='weighted')[2])
    auc_lst.append(clf1_roc_auc)
    
    
print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))

confusion_matrix_y2 = confusion_matrix(y2_test, y2_pred)


print('Accuracy of classifier on test set: {:.3f}'.format(clf1.score(X2_test, y2_test)))

print("10-fold cross validation average accuracy of clf1: %.3f" % (results.mean()))

print('Confusion Matrix for Classifier:')
print(confusion_matrix_y2)

print('Classification Report for Classifier:')
print(classification_report(y2_test, y2_pred))


F1 0.6640; AUC 0.6817 
Accuracy of classifier on test set: 0.652
10-fold cross validation average accuracy of clf1: 0.619
Confusion Matrix for Classifier:
[[27  9]
 [37 59]]
Classification Report for Classifier:
              precision    recall  f1-score   support

           0       0.42      0.75      0.54        36
           1       0.87      0.61      0.72        96

    accuracy                           0.65       132
   macro avg       0.64      0.68      0.63       132
weighted avg       0.75      0.65      0.67       132
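
The oversampling step can be isolated into a function, which also makes it easy to seed. This is a sketch under assumptions: `oversample_to_balance` is a hypothetical helper, the toy data is made up, and passing `random_state` to `sample` is an addition for reproducibility that the cell above does not use.

```python
import pandas as pd

def oversample_to_balance(X_train, y_train, random_state=None):
    """Resample every class with replacement up to the size of the largest class."""
    target = y_train.name
    combined = pd.concat([X_train, y_train], axis=1)
    max_size = combined[target].value_counts().max()
    parts = [combined]
    for _, group in combined.groupby(target):
        extra = max_size - len(group)
        if extra > 0:
            # draw the missing rows from this class, with replacement
            parts.append(group.sample(extra, replace=True, random_state=random_state))
    balanced = pd.concat(parts)
    return balanced.drop(columns=target), balanced[target]

# Toy training data: four rows of class 1, one row of class 0
X_toy = pd.DataFrame({"x": [0.1, 0.2, 0.3, 0.4, 0.5]})
y_toy = pd.Series([1, 1, 1, 1, 0], name="Y2")

X_bal, y_bal = oversample_to_balance(X_toy, y_toy, random_state=0)
print(y_bal.value_counts().to_dict())
```

As in the cell above, only the training split is resampled, so the test set keeps its original class distribution.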



In [29]:
data['Y2'].value_counts().max()

461
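
The count above is enough to quantify the imbalance. Here is a small worked calculation, using the 660 rows reported by `data.info()` and the majority count of 461, and assuming `Y2` is binary as the outputs suggest:

```python
total_rows = 660        # from data.info()
majority_count = 461    # from data['Y2'].value_counts().max()

# With two classes, the remaining rows all belong to the minority class
minority_count = total_rows - majority_count
majority_share = majority_count / total_rows

print(minority_count)                      # prints: 199
print("{:.1%}".format(majority_share))     # prints: 69.8%
```

A classifier that always predicts the majority class would therefore score about 69.8% accuracy on the full dataset, which is why F1 and AUC are reported alongside accuracy.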