# DATA 6545 Project 1 Evaluation Code
- ver. 1.1
Developed by: Dr. Jie Tao

This is the sample evaluation code provided for your project 1. 
- You should evaluate your processed data using this code whenever possible, and record the results;
- Do not modify this code directly - create a __copy__ if you need to change it.
- Note that although I will use the same code to evaluate your final submissions, the results might differ slightly due to randomness.

In [1]:
# import required packages for data ingestion
import pandas as pd
import numpy as np

# import required packages for splitting data
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

# import required packages for evaluating models
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support

# import `logistic regression` model
from sklearn.linear_model import LogisticRegression

# balance the data
from imblearn.over_sampling import SMOTE

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Read in the Data

In [4]:
#### you should change data_path to point to your OWN data file
process = '1B'
features = '8'
target = 'Y2'

data_path = '/content/drive/MyDrive/Colab Notebooks/data/group_5-Y1.csv'
data_df = pd.read_csv(data_path, index_col=0)
data_df.head()

| | C1 | C4 | overhang | C7 | long_s/total_s | real_w/total_w | long_w/real_w | positive_w/real_w | Y1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.083991 | 0.749962 | 0.051659 | -0.69096 | 0.568494 | 0.468776 | 0.510265 | -0.182013 | 0 |
| 1 | 1.447178 | 0.583748 | 2.417699 | -0.840195 | 0.578539 | 0.177598 | 0.45696 | -1.388723 | 1 |
| 2 | -0.482855 | 0.717213 | -0.270246 | -0.953815 | 0.560116 | 0.482105 | 0.685648 | 2.544377 | 1 |
| 3 | 1.094958 | 0.714502 | -0.318259 | -0.94667 | 0.334567 | 0.696035 | 0.670373 | 1.988448 | 1 |
| 4 | -0.689631 | 0.499347 | -0.034593 | 1.335853 | 0.445456 | -0.127378 | 0.402806 | -0.352216 | 0 |


In [5]:
# get a list of feature names
data_df.columns

Index(['C1', 'C4', 'overhang', 'C7', 'long_s/total_s', 'real_w/total_w',
       'long_w/real_w', 'positive_w/real_w', 'Y1'],
      dtype='object')

### NOTE:

1. This code includes only 1 target - you can evaluate only 1 target at a time. If you want to evaluate another target, define another `y`.
2. The norm is to arrange your columns as *continuous* features, then *categorical* features, then the *target*. If your data is not arranged this way, select the columns by name, similar to below:
```python
y = data_df['Y']
X = data_df.drop(columns=['Y'])
```

In [6]:
# define features and target
X = data_df.iloc[:,:-1].values
y = data_df.iloc[:,-1].values
# if you want a secondary target
### y1 = ...

In [7]:
### y should be binary
assert len(np.unique(y)) == 2

In [8]:
X.shape, y.shape

((660, 8), (660,))

In [9]:
# resample/balance the data
# note: although we would not normally balance data this way,
# it works best for this project
sm = SMOTE(random_state = 2022) 
X_res, y_res = sm.fit_resample(X, y) 

In [10]:
X_res.shape, y_res.shape

((660, 8), (660,))
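
The shapes are the same before and after resampling, which suggests the two classes were already balanced in this file: with its default settings, SMOTE only adds minority-class samples until the class counts match. A minimal sketch for checking the class counts directly, using only the standard library, is shown below. `class_counts`, `is_balanced`, and the sample labels are hypothetical names and data for illustration, not part of the provided evaluation code.

```python
from collections import Counter

def class_counts(labels):
    """Return a dict mapping each class label to how often it occurs."""
    return dict(Counter(labels))

def is_balanced(labels):
    """True when every class occurs the same number of times."""
    return len(set(Counter(labels).values())) == 1

# made-up labels for illustration; with the notebook data you could pass y.tolist()
example_labels = [0, 1, 1, 0, 1, 0]
print(class_counts(example_labels))
print(is_balanced(example_labels))
```

If the counts differ before resampling and match afterwards, SMOTE has added synthetic rows and `X_res` will have more rows than `X`.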

In [11]:
# define the model
clf = LogisticRegression(max_iter=2000)

# Evaluation

In [12]:
def my_eval(X, y, classifier = clf, k=10, scoring = 'f1'):
  '''
  Return evaluation results (based on f1-score or ROC_AUC).
  Built-in k-fold cross validation.
  INPUTS:
  ----
  - X: features; DataFrame or Numpy ndarray;
  - y: target; DataFrame or Numpy ndarray;
  - classifier: any sklearn (or its add-on) based classifier
  - k: number of folds in cross validation
  - scoring: evaluation metric ('f1' default or 'roc_auc')
  OUTPUT:
  ----
  bias/variance score of the selected metric. For both, lower is better.
  - bias: 1 - mean of the metric over cross validation; measures the accuracy
  - variance: std. dev. of the metric; measures the consistency.
  '''
  scores = []
  for i in range(100):
    #### generate a random number to shuffle the data for training and test
    #### note: reseeding on every pass gives the same random_int,
    #### and therefore the same folds, in all 100 repeats
    np.random.seed(2021)
    random_int = np.random.randint(0,3000)
    #### create cross validation folds
    kfold = model_selection.KFold(n_splits=k, random_state=random_int, shuffle=True)
    #### record the score, using the classifier passed in (not the global clf)
    score = model_selection.cross_val_score(classifier, X=X, y=y, cv=kfold, scoring=scoring)
    scores.append(score)
  scores = np.array(scores)
  #### calculate the bias (1 - average score) and variance (std)
  bias, variance = 1 - round(scores.mean(),4), round(scores.std(),4)
  return bias, variance
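
The last two lines of `my_eval` reduce all collected fold scores to two numbers. A minimal standard-library sketch of that arithmetic follows, assuming the same conventions as NumPy's defaults (mean over every score, population standard deviation). `summarize_scores` and the score values are hypothetical and only illustrate the calculation.

```python
from statistics import mean, pstdev

def summarize_scores(scores):
    """Reduce a list of per-repeat fold scores to (bias, variance).

    bias     = 1 - mean score (the mean is rounded to 4 places first)
    variance = population standard deviation of all scores
    """
    # flatten repeats x folds into one list, as scores.mean() / scores.std() do
    flat = [s for fold_scores in scores for s in fold_scores]
    bias = 1 - round(mean(flat), 4)
    variance = round(pstdev(flat), 4)
    return bias, variance

# made-up scores: 2 repeats x 2 folds
example_scores = [[0.5, 0.7], [0.6, 0.6]]
bias, variance = summarize_scores(example_scores)
print(bias, variance)
```

Because the mean is rounded before it is subtracted from 1, the returned bias can carry floating-point noise (for example `0.41879999999999995`); rounding again when reporting avoids this.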

In [13]:
# getting bias (1 - averaged f1_score) and variance from 10-fold CV (default)
my_eval(X_res, y_res, clf, 10)

(0.41879999999999995, 0.0474)

In [14]:
# getting bias (1 - averaged ROC_AUC) and variance from 10-fold CV
my_eval(X_res, y_res, clf, 10, 'roc_auc')

(0.37939999999999996, 0.0516)
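
Since the returned bias is 1 minus the mean score, the outputs above correspond to a mean f1-score of about 0.5812 and a mean ROC_AUC of about 0.6206. A small sketch for recording such results for later comparison follows. `record_result`, `to_csv`, the column names, and the labels passed in are hypothetical choices, not part of the provided evaluation code.

```python
import csv
import io

FIELDS = ["process", "scoring", "bias", "variance", "mean_score"]

def record_result(log, process, scoring, result):
    """Append one my_eval-style (bias, variance) result to a list of rows."""
    bias, variance = result
    log.append({
        "process": process,
        "scoring": scoring,
        "bias": round(bias, 4),
        "variance": round(variance, 4),
        "mean_score": round(1 - bias, 4),  # undo the 1 - mean transformation
    })
    return log

def to_csv(log):
    """Render the recorded rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(log)
    return buffer.getvalue()

results_log = []
# values copied from the two outputs above
record_result(results_log, "1B", "f1", (0.41879999999999995, 0.0474))
record_result(results_log, "1B", "roc_auc", (0.37939999999999996, 0.0516))
print(to_csv(results_log))
```

Keeping one row per run makes it easier to compare different preprocessing choices against the same metric.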