# Example Evaluation Code

This notebook is very __similar__ to the code I use to evaluate your results. It is provided for __your convenience__, so that you can evaluate your preprocessing results at any time before your __final submission__.

Please note that the results here will __NOT__ be the same as my evaluation results.

Let's start with loading the required packages.

In [1]:
# import required package for data handling
import pandas as pd
import numpy as np

# import required packages for splitting data
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

# import required packages for evaluating models
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support

# import `logistic regression` model
from sklearn.linear_model import LogisticRegression

Next you should load __your__ data. In this case, I am using a sample dataset (`GroupX.csv`) which contains 6 predictors (`X1 - X6`) and two target variables (`Y1, Y2`).

Please make sure you change the data to your __OWN__ dataset when using this code.

__NOTE__:
1. Your dataset may be very different from the sample dataset.
2. Please follow this structure when submitting your dataset.

In [2]:
# what to include (continuous)
#data = data[['C1', 'C2', 'C4', 'C5', 'C6', 'T1', 'T2', 'T3', 'T4', 'T5', 'C3_impute', 'C7_impute', 'C3`',
       #'C5`', 'C6`', 'T1prime', 'T3prime', 'T4prime', 'S1prime',
       #'S2prime', 'S3prime','Y1', 'Y2',]]

In [10]:
data = pd.read_csv('omg2.csv', header=0)
data = data[['C1', 'C4', 'C5', 'C6', 'T1', 'T2', 'T3', 'T4', 'T5', 'S1', 'S2',
       'S3', 'C3_impute', 'C7_impute', 'C5`', 'C6`',
       'T1prime', 'T3prime', 'T4prime', 'S1prime', 'S2prime', 'S3prime', 'Y1',
       'Y2']]
data.head()

Unnamed: 0,C1,C4,C5,C6,T1,T2,T3,T4,T5,S1,...,C5`,C6`,T1prime,T3prime,T4prime,S1prime,S2prime,S3prime,Y1,Y2
0,122.0,0.029074,40962052.0,10600000.0,470.0,12719.0,11560.0,301.0,690.0,62.0,...,3.864345,0.111111,0.640426,0.908876,0.05425,0.004875,0.009199,0.009199,0,1
1,259.0,-0.013352,28869196.0,2400000.0,791.0,21792.0,19585.0,510.0,1120.0,71.0,...,12.028832,0.0,0.644753,0.898724,0.051395,0.003258,0.011105,0.011105,1,0
2,90.0,0.020715,16845668.0,5000000.0,201.0,5262.0,4785.0,128.0,325.0,61.0,...,3.369134,0.0,0.636816,0.90935,0.061764,0.011593,0.006271,0.006271,1,0
3,209.0,0.020023,14848637.0,4500000.0,328.0,8259.0,7574.0,177.0,509.0,80.0,...,3.299697,0.0,0.539634,0.91706,0.06163,0.009686,0.007144,0.007144,1,0
4,80.0,-0.034895,30741716.0,8250000.0,572.0,14830.0,13176.0,336.0,720.0,67.0,...,3.726269,0.05,0.587413,0.888469,0.04855,0.004518,0.010047,0.010047,0,1


In [11]:
data.columns

Index(['C1', 'C4', 'C5', 'C6', 'T1', 'T2', 'T3', 'T4', 'T5', 'S1', 'S2', 'S3',
       'C3_impute', 'C7_impute', 'C5`', 'C6`', 'T1prime', 'T3prime', 'T4prime',
       'S1prime', 'S2prime', 'S3prime', 'Y1', 'Y2'],
      dtype='object')

In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660 entries, 0 to 659
Data columns (total 24 columns):
C1           660 non-null float64
C4           660 non-null float64
C5           660 non-null float64
C6           660 non-null float64
T1           660 non-null float64
T2           660 non-null float64
T3           660 non-null float64
T4           660 non-null float64
T5           660 non-null float64
S1           660 non-null float64
S2           660 non-null float64
S3           660 non-null float64
C3_impute    660 non-null float64
C7_impute    660 non-null float64
C5`          660 non-null float64
C6`          660 non-null float64
T1prime      660 non-null float64
T3prime      660 non-null float64
T4prime      660 non-null float64
S1prime      660 non-null float64
S2prime      660 non-null float64
S3prime      660 non-null float64
Y1           660 non-null int64
Y2           660 non-null int64
dtypes: float64(22), int64(2)
memory usage: 123.9 KB


In [13]:
data.isna().sum()

C1           0
C4           0
C5           0
C6           0
T1           0
T2           0
T3           0
T4           0
T5           0
S1           0
S2           0
S3           0
C3_impute    0
C7_impute    0
C5`          0
C6`          0
T1prime      0
T3prime      0
T4prime      0
S1prime      0
S2prime      0
S3prime      0
Y1           0
Y2           0
dtype: int64

Checking that your data types follow the data dictionary is an important step. You can do that using the `.dtypes` attribute.

__NOTE__: all __continuous__ features should be in the `float64` data type, and all __categorical__ features should be in the `int64` data type (provided you have already coded them, per __suggested task \#6__ in the competition document).
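The dtype rule above can also be checked programmatically instead of by eye. The following is a minimal sketch on a toy frame; the helper `check_dtypes` and the `expected` mapping are hypothetical names introduced for illustration, and you would build `expected` from your own data dictionary.

```python
import pandas as pd

def check_dtypes(df, expected):
    """Return {column: (actual, expected)} for every column whose dtype differs."""
    mismatches = {}
    for col, want in expected.items():
        got = str(df[col].dtype)
        if got != want:
            mismatches[col] = (got, want)
    return mismatches

# Toy data standing in for your own dataset (hypothetical values).
toy = pd.DataFrame({
    "C1": [122.0, 259.0, 90.0],  # continuous -> float64
    "Y1": [0, 1, 1],             # categorical, already coded -> int64
    "Y2": [1.0, 0.0, 0.0],       # categorical, but wrongly stored as float
})
expected = {"C1": "float64", "Y1": "int64", "Y2": "int64"}

problems = check_dtypes(toy, expected)
print(problems)  # {'Y2': ('float64', 'int64')}

# One possible fix: cast the offending column to the expected type.
toy["Y2"] = toy["Y2"].astype("int64")
```

An empty dictionary from `check_dtypes` means every listed column already has the expected type.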

In [14]:
data.dtypes

C1           float64
C4           float64
C5           float64
C6           float64
T1           float64
T2           float64
T3           float64
T4           float64
T5           float64
S1           float64
S2           float64
S3           float64
C3_impute    float64
C7_impute    float64
C5`          float64
C6`          float64
T1prime      float64
T3prime      float64
T4prime      float64
S1prime      float64
S2prime      float64
S3prime      float64
Y1             int64
Y2             int64
dtype: object

Now you need to specify your targets and predictors. __NOTE__ we have two targets here (`Y1, Y2`).

In [15]:
y1 = data.Y1
y2 = data.Y2

Check the shape of the data.

In [16]:
data.shape

(660, 24)

It is very possible that you will use different sets of the predictors for `Y1` and `Y2`. Now let's define them.

First, let's define predictors for `Y1`. In this run they are all columns in `data` except the last three (`S3prime`, `Y1`, `Y2`).

In [17]:
data.Y1.value_counts()

1    330
0    330
Name: Y1, dtype: int64

In [18]:
cols = list(data.columns)
# all columns except the last three (S3prime, Y1, Y2)
cols[:-3]

['C1',
 'C4',
 'C5',
 'C6',
 'T1',
 'T2',
 'T3',
 'T4',
 'T5',
 'S1',
 'S2',
 'S3',
 'C3_impute',
 'C7_impute',
 'C5`',
 'C6`',
 'T1prime',
 'T3prime',
 'T4prime',
 'S1prime',
 'S2prime']

Use the code below to select these columns as predictors for `Y1`.

In [19]:
# drop the last three columns (S3prime, Y1, Y2)
predictors_y1 = data[cols[:-3]]
predictors_y1.head()

Unnamed: 0,C1,C4,C5,C6,T1,T2,T3,T4,T5,S1,...,S3,C3_impute,C7_impute,C5`,C6`,T1prime,T3prime,T4prime,S1prime,S2prime
0,122.0,0.029074,40962052.0,10600000.0,470.0,12719.0,11560.0,301.0,690.0,62.0,...,139.0,3.43,51.345,3.864345,0.111111,0.640426,0.908876,0.05425,0.004875,0.009199
1,259.0,-0.013352,28869196.0,2400000.0,791.0,21792.0,19585.0,510.0,1120.0,71.0,...,237.0,-1.62,25.936,12.028832,0.0,0.644753,0.898724,0.051395,0.003258,0.011105
2,90.0,0.020715,16845668.0,5000000.0,201.0,5262.0,4785.0,128.0,325.0,61.0,...,60.0,-1.24,7.378,3.369134,0.0,0.636816,0.90935,0.061764,0.011593,0.006271
3,209.0,0.020023,14848637.0,4500000.0,328.0,8259.0,7574.0,177.0,509.0,80.0,...,110.0,-0.91,8.526,3.299697,0.0,0.539634,0.91706,0.06163,0.009686,0.007144
4,80.0,-0.034895,30741716.0,8250000.0,572.0,14830.0,13176.0,336.0,720.0,67.0,...,167.0,0.07,632.298,3.726269,0.05,0.587413,0.888469,0.04855,0.004518,0.010047


For `Y2`, this run uses every column in `data` except the two targets (`Y1`, `Y2`), which gives 22 predictors. Use similar code (as below) to select them.

In [20]:
# drop the two target columns (Y1, Y2)
predictors_y2 = data[cols[:-2]]
predictors_y2.head()

Unnamed: 0,C1,C4,C5,C6,T1,T2,T3,T4,T5,S1,...,C3_impute,C7_impute,C5`,C6`,T1prime,T3prime,T4prime,S1prime,S2prime,S3prime
0,122.0,0.029074,40962052.0,10600000.0,470.0,12719.0,11560.0,301.0,690.0,62.0,...,3.43,51.345,3.864345,0.111111,0.640426,0.908876,0.05425,0.004875,0.009199,0.009199
1,259.0,-0.013352,28869196.0,2400000.0,791.0,21792.0,19585.0,510.0,1120.0,71.0,...,-1.62,25.936,12.028832,0.0,0.644753,0.898724,0.051395,0.003258,0.011105,0.011105
2,90.0,0.020715,16845668.0,5000000.0,201.0,5262.0,4785.0,128.0,325.0,61.0,...,-1.24,7.378,3.369134,0.0,0.636816,0.90935,0.061764,0.011593,0.006271,0.006271
3,209.0,0.020023,14848637.0,4500000.0,328.0,8259.0,7574.0,177.0,509.0,80.0,...,-0.91,8.526,3.299697,0.0,0.539634,0.91706,0.06163,0.009686,0.007144,0.007144
4,80.0,-0.034895,30741716.0,8250000.0,572.0,14830.0,13176.0,336.0,720.0,67.0,...,0.07,632.298,3.726269,0.05,0.587413,0.888469,0.04855,0.004518,0.010047,0.010047
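Positional slices such as `cols[:-3]` depend on the column order, so the predictor sets can also be chosen by name. The sketch below shows one way to do this on a toy frame; the `toy_*` names and values are hypothetical, and only `DataFrame.drop` is assumed.

```python
import pandas as pd

# Toy frame with the same kind of layout: predictors first, targets last.
toy = pd.DataFrame({
    "C1": [122.0, 259.0],
    "C4": [0.029, -0.013],
    "S3prime": [0.0092, 0.0111],
    "Y1": [0, 1],
    "Y2": [1, 0],
})

targets = ["Y1", "Y2"]

# Predictors for Y2: every column that is not a target.
toy_predictors_y2 = toy.drop(columns=targets)

# Predictors for Y1: additionally leave out S3prime.
toy_predictors_y1 = toy.drop(columns=targets + ["S3prime"])

print(list(toy_predictors_y1.columns))  # ['C1', 'C4']
print(list(toy_predictors_y2.columns))  # ['C1', 'C4', 'S3prime']
```

Selecting by name keeps the predictor sets correct even if columns are later reordered or new ones are appended.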


Below is the key part of this notebook: it builds a `logistic regression` model to predict `Y1`/`Y2`.

The code works this way:

1. We generate two lists, `f1_score_lst` and `auc_lst`, to store the f1_score and AUC from each of the `10` runs of the model;
2. Define model:
    1. We define a `LogisticRegression()` model;
    
    2. We split the predictors (`predictors_y1`) and the target (`y1`) into training (80%) and testing (20%) sets;
    
    3. We fit the model `clf` to the training data, then use it to predict on the testing data;
    
    4. We also run a `10-fold cross validation` on the training data to check that our model does not overfit - see [here](https://scikit-learn.org/stable/modules/cross_validation.html) for more info;
    
    5. We append the f1_score and AUC of the current model to the lists (`f1_score_lst` and `auc_lst`) we defined earlier.
  
3. Print out the average f1_score and AUC for all 10 runs;
4. Print out the average accuracy from cross validation;
5. Print out the confusion matrix and classification report for the __last__ model.

__NOTE__: Step 3 provides the evaluation results we need; steps 4 and 5 can be used to verify the results from step 3.
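The steps above can be sketched as a small reusable function. This is only an illustration on synthetic data from `make_classification`, not the evaluation code itself: the function name `evaluate`, the `X_demo`/`y_demo` data, `max_iter=1000`, and the choice to use a different `random_state` for each run (so that the ten splits actually differ) are all assumptions made here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split

def evaluate(X, y, n_runs=10):
    """Run the split/fit/score procedure n_runs times and collect the scores."""
    f1_lst, auc_lst, cv_lst = [], [], []
    for seed in range(n_runs):
        # Assumption: vary the seed per run so each run sees a different split.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_tr, y_tr)
        y_pred = clf.predict(X_te)

        # 10-fold cross validation accuracy on the training portion.
        cv_scores = cross_val_score(
            clf, X_tr, y_tr, cv=KFold(n_splits=10), scoring="accuracy")

        f1_lst.append(f1_score(y_te, y_pred, average="weighted"))
        auc_lst.append(roc_auc_score(y_te, y_pred))
        cv_lst.append(cv_scores.mean())
    return f1_lst, auc_lst, cv_lst

# Synthetic stand-in for a real dataset (6 features, binary target).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
f1_lst, auc_lst, cv_lst = evaluate(X_demo, y_demo)
print("F1 {:.4f}; AUC {:.4f}; CV accuracy {:.4f}".format(
    np.mean(f1_lst), np.mean(auc_lst), np.mean(cv_lst)))
```

With a fixed `random_state` in `train_test_split`, every run would produce the same split and the same scores, so averaging would not change anything; varying the seed is one way to make the average meaningful.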

In [21]:
# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []


# loop to calculate F1 and AUC scores; averages are printed after 10 runs
for count in range(10):
    # Model building
    clf = LogisticRegression()
    X1_train, X1_test, y1_train, y1_test = train_test_split(predictors_y1, y1, test_size=0.2, random_state=123)
    clf.fit(X1_train, y1_train)

    y1_pred = clf.predict(X1_test)

    # 10-fold cross validation (no shuffling, so no random_state is needed)
    kfold = model_selection.KFold(n_splits=10)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(clf, X1_train, y1_train, cv=kfold, scoring=scoring)

    

    
    #calculate f1-score and AUC
    
    clf_roc_auc = roc_auc_score(y1_test, y1_pred)
    f1_score_lst.append(precision_recall_fscore_support(y1_test, y1_pred, average='weighted')[2])
    auc_lst.append(clf_roc_auc)


print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))

#result=logit_model.fit()
confusion_matrix_y1 = confusion_matrix(y1_test, y1_pred)


#print(result.summary())
print('Accuracy of classifier on test set: {:.2f}'.format(clf.score(X1_test, y1_test)))

print("10-fold cross validation average accuracy of classifier: %.3f" % (results.mean()))

print('Confusion Matrix for Logistic Regression Classfier:')
print(confusion_matrix_y1)

print('Classification Report for Logistic Regression Classfier:')
print(classification_report(y1_test, y1_pred))




F1 0.5123; AUC 0.5199 
Accuracy of classifier on test set: 0.59
10-fold cross validation average accuracy of classifier: 0.491
Confusion Matrix for Logistic Regression Classfier:
[[71  7]
 [47  7]]
Classification Report for Logistic Regression Classfier:
              precision    recall  f1-score   support

           0       0.60      0.91      0.72        78
           1       0.50      0.13      0.21        54

    accuracy                           0.59       132
   macro avg       0.55      0.52      0.47       132
weighted avg       0.56      0.59      0.51       132
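The reported numbers can be cross-checked by hand from the confusion matrix above. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and recomputes the accuracy, the AUC, and the weighted F1. Because the AUC here is computed from hard 0/1 predictions rather than probabilities, it reduces to the mean of the two per-class recalls.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = np.array([[71, 7],
               [47, 7]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()

recall_0 = tn / (tn + fp)  # share of true 0s predicted as 0
recall_1 = tp / (tp + fn)  # share of true 1s predicted as 1

# AUC of a single hard-label operating point = mean of the per-class recalls.
auc_hard = (recall_0 + recall_1) / 2

# Per-class F1, then weighted by class support (number of true samples).
f1_0 = 2 * tn / (2 * tn + fn + fp)
f1_1 = 2 * tp / (2 * tp + fp + fn)
support = cm.sum(axis=1)
f1_weighted = (support[0] * f1_0 + support[1] * f1_1) / support.sum()

print("accuracy {:.4f}; AUC {:.4f}; weighted F1 {:.4f}".format(
    accuracy, auc_hard, f1_weighted))
# accuracy 0.5909; AUC 0.5199; weighted F1 0.5123
```

These values agree with the `F1 0.5123; AUC 0.5199` line and the `0.59` accuracy in the output above.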





The code below evaluates the model for `Y2`. It is very similar to the code above; the key difference is that `Y2` is imbalanced, so I wrote some code (under `# Begin oversampling`) to deal with that.
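The oversampling idea is easier to see in isolation. The following is a minimal sketch on a toy training frame: the column `X1`, the values, and the fixed `random_state=0` are hypothetical and only there to make the example reproducible. Each class is topped up, by sampling with replacement, to the size of the largest class.

```python
import pandas as pd

# Toy training data: four rows of class 0, two rows of class 1.
train = pd.DataFrame({
    "X1": [0.1, 0.4, 0.35, 0.8, 0.9, 0.2],
    "Y2": [0, 0, 0, 0, 1, 1],
})

max_size = train["Y2"].value_counts().max()  # size of the largest class

parts = [train]
for _, group in train.groupby("Y2"):
    # Draw the missing rows with replacement (zero rows for the largest class).
    extra = group.sample(max_size - len(group), replace=True, random_state=0)
    parts.append(extra)
balanced = pd.concat(parts)

print(balanced["Y2"].value_counts().sort_index().tolist())  # [4, 4]
```

Only the training portion is resampled here, which matches the cell below, where the test split is left untouched.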

In [22]:
# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []


# loop to calculate F1 and AUC scores; averages are printed after 10 runs
for count in range(10):
    # Model building
    clf1 = LogisticRegression()

    # Splitting data into testing and training
    X2_train, X2_test, y2_train, y2_test = train_test_split(predictors_y2, y2, test_size=0.2, random_state=123)

    # Begin oversampling: top up each class (with replacement) to the largest class size
    oversample = pd.concat([X2_train, y2_train], axis=1)
    max_size = oversample['Y2'].value_counts().max()
    lst = [oversample]
    for class_index, group in oversample.groupby('Y2'):
        lst.append(group.sample(max_size - len(group), replace=True))
    X2_train = pd.concat(lst)
    y2_train = X2_train['Y2'].copy()
    del X2_train['Y2']

    # fitting model on oversampled data
    clf1.fit(X2_train, y2_train)

    y2_pred = clf1.predict(X2_test)

    # 10-fold cross validation (no shuffling, so no random_state is needed)
    kfold = model_selection.KFold(n_splits=10)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(clf1, X2_train, y2_train, cv=kfold, scoring=scoring)

    # calculate AUC and F1-score for this run and store them
    clf1_roc_auc = roc_auc_score(y2_test, y2_pred)
    f1_score_lst.append(precision_recall_fscore_support(y2_test, y2_pred, average='weighted')[2])
    auc_lst.append(clf1_roc_auc)
    
    
print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))

confusion_matrix_y2 = confusion_matrix(y2_test, y2_pred)


print('Accuracy of classifier on test set: {:.3f}'.format(clf1.score(X2_test, y2_test)))

print("10-fold cross validation average accuracy of clf1: %.3f" % (results.mean()))

print('Confusion Matrix for Classfier:')
print(confusion_matrix_y2)

print('Classification Report for Classfier:')
print(classification_report(y2_test, y2_pred))

  _warn_prf(average, modifier, msg_start, len(result))


F1 0.3795; AUC 0.5245 
Accuracy of classifier on test set: 0.644
10-fold cross validation average accuracy of clf1: 0.426
Confusion Matrix for Classfier:
[[59 12]
 [35 26]]
Classification Report for Classfier:
              precision    recall  f1-score   support

           0       0.63      0.83      0.72        71
           1       0.68      0.43      0.53        61

    accuracy                           0.64       132
   macro avg       0.66      0.63      0.62       132
weighted avg       0.65      0.64      0.63       132



