
# Lab 3 Exercise 3
## Binary Classification

We are given a dataset that consists of biological features of a person and whether or not they are a smoker.

There are *22 descriptive features* and *1 target column* in the dataset, as described in the table below. You are tasked with creating a predictive model for the column "SMK_stat_type_cd", which specifies the smoker/non-smoker status of the person. You are expected to use the classification methods that were shown in the previous exercises.



|      Column      |                                 Description                                 |
|:----------------:|:---------------------------------------------------------------------------:|
| identified_gender | Male, Female                                                              |
| age              | rounded up to 5 years                                                       |
| height           | rounded up to 5 [cm]                                                        |
| weight           | [kg]                                                                        |
| waistline        | waist circumference                                                         |
| sight_left       | eyesight(left)                                                              |
| sight_right      | eyesight(right)                                                             |
| hear_left        | hearing left, 1(normal), 2(abnormal)                                        |
| hear_right       | hearing right, 1(normal), 2(abnormal)                                       |
| SBP              | Systolic blood pressure[mmHg]                                               |
| DBP              | Diastolic blood pressure[mmHg]                                              |
| BLDS             | BLDS or FSG(fasting blood glucose)[mg/dL]                                   |
| tot_chole        | total cholesterol[mg/dL]                                                    |
| HDL_chole        | HDL cholesterol[mg/dL]                                                      |
| LDL_chole        | LDL cholesterol[mg/dL]                                                      |
| triglyceride     | triglyceride[mg/dL]                                                         |
| hemoglobin       | hemoglobin[g/dL]                                                            |
| urine_protein    | protein in urine, 1(-), 2(+/-), 3(+1), 4(+2), 5(+3), 6(+4)                  |
| serum_creatinine | serum(blood) creatinine[mg/dL]                                              |
| SGOT_AST         | SGOT(Glutamate-oxaloacetate transaminase) AST(Aspartate transaminase)[IU/L] |
| SGOT_ALT         | ALT(Alanine transaminase)[IU/L]                                             |
| gamma_GTP        | γ-glutamyl transpeptidase[IU/L]                                             |
| SMK_stat_type_cd **(Target)** | Smoking state, 0(never), 1(active smoker)                      |


### The stages of this problem can be decomposed as follows:
1. Data Preparation
* Ensure data is in correct format (numerical and not string)
* Normalize the data for better convergence (optional)
* Split the data into train/test subsets

2. Model Selection
* Instantiate models and fit on training data
* Evaluate model performance on testing data
* Select model with best performance

3. Submit model
* Your model will be evaluated on data that is kept separate from training/testing data
* The predictions from your model will be uploaded to the course server where it will be evaluated
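
The data preparation stage above can be sketched on a tiny made-up table. This is only an illustration under stated assumptions: the rows are invented, the column subset is arbitrary, and `StandardScaler` is used here as one common choice for the optional normalization step, not necessarily the one this lab expects.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the lab data; every value here is made up for illustration
toy = pd.DataFrame({
    'identified_gender': ['Male', 'Female', 'Female', 'Male',
                          'Female', 'Male', 'Male', 'Female'],
    'age': [45, 40, 80, 50, 60, 20, 35, 55],
    'weight': [70, 75, 50, 90, 55, 50, 80, 60],
    'SMK_stat_type_cd': [1, 0, 0, 1, 0, 0, 1, 0],
})

# Ensure every column is numeric: map the string column to integers
toy['identified_gender'] = toy['identified_gender'].map({'Female': 0, 'Male': 1})

X = toy.drop(columns='SMK_stat_type_cd')
y = toy['SMK_stat_type_cd']

# Split first, then fit the scaler on the training rows only,
# so that no statistics from the test rows leak into training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training subset and reusing it for the test subset keeps the two subsets on the same scale.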

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
%cd /content/gdrive/MyDrive/4c16-labs/code/lab-03/
!unzip -o data.zip
!echo 'data/*' > .gitignore

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
/content/gdrive/MyDrive/4c16-labs/code/lab-03
Archive:  data.zip
  inflating: data/data.csv           
  inflating: data/validation.csv     


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sklearn.preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix, ConfusionMatrixDisplay

In [None]:
# Data Preparation
# The data is supplied as ".csv" format
# There are 23 columns in csv file, with 22 columns being features and 1 column being the target

# The csv file can be read into a dataframe using the pandas library
DataFrame = pd.read_csv('data/data.csv')

# Shuffle the data
DataFrame = DataFrame.sample(frac=1)

# Visualize the first 6 rows of the imported dataframe
display(DataFrame.head(6))

Unnamed: 0,identified_gender,age,height,weight,waistline,sight_left,sight_right,hear_left,hear_right,SBP,...,HDL_chole,LDL_chole,triglyceride,hemoglobin,urine_protein,serum_creatinine,SGOT_AST,SGOT_ALT,gamma_GTP,SMK_stat_type_cd
943,Male,45,175,70,82.7,1.2,1.0,1.0,1.0,118.0,...,54.0,124.0,62.0,15.8,1.0,0.9,18.0,18.0,21.0,1.0
35430,Male,40,175,75,86.0,0.9,0.9,1.0,1.0,113.0,...,47.0,82.0,106.0,15.4,1.0,0.9,21.0,28.0,20.0,0.0
42782,Female,80,150,50,76.0,0.2,0.3,1.0,1.0,110.0,...,29.0,112.0,172.0,11.2,1.0,1.0,37.0,30.0,29.0,0.0
27406,Male,50,175,90,95.0,1.2,1.0,1.0,1.0,124.0,...,49.0,82.0,71.0,16.1,1.0,0.8,27.0,49.0,29.0,1.0
9337,Female,60,155,55,73.0,1.2,1.0,1.0,1.0,106.0,...,76.0,80.0,77.0,14.9,1.0,0.9,24.0,20.0,19.0,0.0
3365,Female,20,150,50,65.0,0.6,1.0,1.0,1.0,114.0,...,61.0,121.0,34.0,12.6,1.0,0.7,16.0,12.0,13.0,0.0


In [None]:
# The column "identified_gender" contains non-numeric data. It cannot be used as is and needs to be converted to a numerical representation
# We can encode the identified gender numerically by mapping "Female" to 0 and "Male" to 1.
# This can be done manually by using conditional statements on the dataframe, but luckily for us there is a simpler method
# We can use the built-in LabelEncoder class of sklearn to do this

LabelEncoder = sklearn.preprocessing.LabelEncoder()
LabelEncoder.fit(DataFrame['identified_gender'])
DataFrame['identified_gender'] = LabelEncoder.transform(DataFrame['identified_gender'])

# LabelEncoder sorts the unique entries of the input column in alphabetical order, and then assigns an integer to each of them
# This results in 'Female' being mapped to 0 and 'Male' being mapped to 1
display(DataFrame.head(6))

Unnamed: 0,identified_gender,age,height,weight,waistline,sight_left,sight_right,hear_left,hear_right,SBP,...,HDL_chole,LDL_chole,triglyceride,hemoglobin,urine_protein,serum_creatinine,SGOT_AST,SGOT_ALT,gamma_GTP,SMK_stat_type_cd
943,1,45,175,70,82.7,1.2,1.0,1.0,1.0,118.0,...,54.0,124.0,62.0,15.8,1.0,0.9,18.0,18.0,21.0,1.0
35430,1,40,175,75,86.0,0.9,0.9,1.0,1.0,113.0,...,47.0,82.0,106.0,15.4,1.0,0.9,21.0,28.0,20.0,0.0
42782,0,80,150,50,76.0,0.2,0.3,1.0,1.0,110.0,...,29.0,112.0,172.0,11.2,1.0,1.0,37.0,30.0,29.0,0.0
27406,1,50,175,90,95.0,1.2,1.0,1.0,1.0,124.0,...,49.0,82.0,71.0,16.1,1.0,0.8,27.0,49.0,29.0,1.0
9337,0,60,155,55,73.0,1.2,1.0,1.0,1.0,106.0,...,76.0,80.0,77.0,14.9,1.0,0.9,24.0,20.0,19.0,0.0
3365,0,20,150,50,65.0,0.6,1.0,1.0,1.0,114.0,...,61.0,121.0,34.0,12.6,1.0,0.7,16.0,12.0,13.0,0.0
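
The mapping produced by the encoder can be inspected through its `classes_` attribute. The snippet below is a small standalone sketch on a made-up list of labels; the variable name `encoder` is chosen for illustration and is not the name used in the cell above.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels, only to show how the encoding is assigned
encoder = LabelEncoder()
codes = encoder.fit_transform(['Male', 'Female', 'Female', 'Male'])

# classes_ holds the unique labels in sorted order;
# the position of a label in classes_ is its integer code
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps integer codes back to the original labels
print(list(encoder.inverse_transform([0, 1])))
```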


In [None]:
# We should now separate the input features from the target feature and store them as different variables
# We do this by slicing what columns of data we want from the dataframe
# Let's first see the columns available to us by printing DataFrame.columns
print(DataFrame.columns)

Index(['identified_gender', 'age', 'height', 'weight', 'waistline',
       'sight_left', 'sight_right', 'hear_left', 'hear_right', 'SBP', 'DBP',
       'BLDS', 'tot_chole', 'HDL_chole', 'LDL_chole', 'triglyceride',
       'hemoglobin', 'urine_protein', 'serum_creatinine', 'SGOT_AST',
       'SGOT_ALT', 'gamma_GTP', 'SMK_stat_type_cd'],
      dtype='object')


In [None]:
# The column titled "SMK_stat_type_cd" is our target, and all remaining variables are our input features.
features_columns = DataFrame.columns[:-1]
target_column = DataFrame.columns[-1:]

DataFrame_X = DataFrame[features_columns]   # Selects only Input features
DataFrame_Y = DataFrame[target_column]      # Selects only Target variable
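
Note that slicing with `columns[-1:]` selects a list of one column, so the target is a one-column dataframe and not a 1-D series. The sketch below shows the difference on a made-up three-row table, and why a call such as `.values.ravel()` is a common way to hand the target to sklearn as a flat array.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    'age': [45, 40, 80],
    'weight': [70, 75, 50],
    'SMK_stat_type_cd': [1, 0, 0],
})

target_as_frame = toy[toy.columns[-1:]]      # list-like selection -> DataFrame, shape (3, 1)
target_as_series = toy[toy.columns[-1]]      # single label -> Series, shape (3,)
flattened = target_as_frame.values.ravel()   # 1-D NumPy array, shape (3,)

print(target_as_frame.shape, target_as_series.shape, flattened.shape)
```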

In [None]:
# Split Data into Train/Test split
# Change the variable "test_frac" to reflect the percentage/fraction of test data in the resulting split
test_frac = 0.15

Train_X, Test_X, Train_Y, Test_Y = train_test_split(DataFrame_X, DataFrame_Y, test_size=test_frac)
print(f"Number of rows for Training:\t{Train_X.shape[0]}\nNumber of rows for Testing:\t{Test_X.shape[0]}")
print(Train_X.shape)
print(Train_Y.shape)

Number of rows for Training:	42500
Number of rows for Testing:	7500
(42500, 22)
(42500, 1)
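
As an optional variation, `train_test_split` accepts `random_state` (for a reproducible split) and `stratify` (to keep the class proportions the same in both subsets). The sketch below uses synthetic arrays, and the seed value 42 is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 20% of them in the positive class
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# random_state fixes the shuffle; stratify=y preserves the 80/20 class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```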


In [None]:
# EDIT THIS CELL
# Construct a dictionary of prediction models to compare
# Uncomment the below dictionary and insert as many prediction models as you like.
# You may have used binary classification models in previous exercises
# You may also have to import these modules/libraries to be able to use them
import sklearn.linear_model as skl_lm

from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize


# Each model is wrapped in a pipeline that starts with a Normalizer, so the same
# row-wise L2 scaling is applied when fitting and when predicting. This matters for
# the final cell, which calls predict() on the raw validation features.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

PredictionModels = {
    # Alternatives tried earlier, kept for reference:
    # 'Logistic Regression': LogisticRegression(),
    # 'Logistic Regression': LogisticRegression(C=50.0 / 5000, multi_class='auto',
    #                                           solver='saga', tol=0.002),
    # 'K-NN Classification': KNeighborsClassifier(3),
    # 'Linear SVM': SVC(kernel="linear", C=0.025, probability=True),
    # 'RBF SVM': SVC(C=5, gamma=0.05, probability=True),
    # 'RBF SVM': SVC(),
    'Linear SVM': make_pipeline(Normalizer(), SVC(kernel="linear")),
    'Nearest Neighbors7': make_pipeline(Normalizer(), KNeighborsClassifier(n_neighbors=7)),
    'Nearest Neighbors5': make_pipeline(Normalizer(), KNeighborsClassifier(n_neighbors=5)),
    'Logistic': make_pipeline(Normalizer(), LogisticRegression(solver='newton-cg')),
    'Decision Tree7': make_pipeline(Normalizer(), DecisionTreeClassifier(max_depth=7)),
    'Decision Tree5': make_pipeline(Normalizer(), DecisionTreeClassifier(max_depth=5)),
}

models = list(PredictionModels.keys())

print("Fitting models, this may take a while")
print(Train_X.shape)
print(Train_Y.shape)

# This loop goes through the models in the variable "PredictionModels"
for _model in models:
    print(_model)
    # Fit the model to the training data "Train_X, Train_Y"
    PredictionModels[_model].fit(Train_X, Train_Y.values.ravel())
    # Perform predictions with the fitted model on the test set
    y_pred = PredictionModels[_model].predict(Test_X)
    # Compute the macro-averaged f1 score
    f1 = f1_score(Test_Y, y_pred, average='macro')
    print(f1)


Fitting models, this may take a while
(42500, 22)
(42500, 1)
Linear SVM
0.7226738541921531
Nearest Neighbors7
0.7056251877758861
Nearest Neighbors5
0.6899050814220657
Logistic
0.7149562273597035
Decision Tree7
0.7979034896915851
Decision Tree5
0.7921177795014328
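
To make the reported numbers easier to interpret, the sketch below computes a confusion matrix and the macro-averaged f1 score by hand-checkable means on a made-up set of eight labels. The labels and predictions are invented for illustration and are unrelated to the scores above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Made-up ground truth and predictions
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class f1 scores, then their unweighted mean (average='macro')
per_class_f1 = f1_score(y_true, y_pred, average=None)
macro_f1 = f1_score(y_true, y_pred, average='macro')
print(per_class_f1, macro_f1)
```

With `average='macro'`, each class contributes equally to the final score regardless of how many rows it has, which is why it is often preferred when the two classes are not balanced.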


In [None]:
import time
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Collect several metrics, plus the prediction time, for every fitted model
metrics = {}
for name, classifier in PredictionModels.items():
    print('predictions for classifier: {}'.format(name))
    start_time = time.time()
    y_pred = classifier.predict(Test_X)
    end_time = time.time()

    m = {}
    m['f1'] = f1_score(Test_Y, y_pred, average='macro')
    m['accuracy'] = accuracy_score(Test_Y, y_pred)
    m['precision'] = precision_score(Test_Y, y_pred, average='macro')
    m['recall'] = recall_score(Test_Y, y_pred, average='macro')
    m['prediction_time'] = end_time - start_time
    metrics[name] = m
print("done")


In [None]:
# Select the model with best f1 score and write down the
# corresponding 'key' value from the 'PredictionModels' variable
# E.g. if Logistic Regression is chosen, then: chosen_model = 'Logistic Regression'
chosen_model = 'Decision Tree7'
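
The same choice can also be made programmatically. The sketch below assumes the f1 scores have been collected into a dictionary; the name `scores` is hypothetical, and the values are the scores printed above, rounded to four decimals.

```python
# f1 scores from the run above, rounded to four decimals;
# the dictionary name "scores" is an assumption for this sketch
scores = {
    'Linear SVM': 0.7227,
    'Nearest Neighbors7': 0.7056,
    'Nearest Neighbors5': 0.6899,
    'Logistic': 0.7150,
    'Decision Tree7': 0.7979,
    'Decision Tree5': 0.7921,
}

# Pick the key whose score is largest
best_model_name = max(scores, key=scores.get)
print(best_model_name)
```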

In [None]:
# DO NOT EDIT.
# Generate predictions using "chosen_model" and save to file

backend_features = pd.read_csv('data/validation.csv')
backend_features['identified_gender'] = LabelEncoder.transform(backend_features['identified_gender'])
backend_preds = PredictionModels[chosen_model].predict(backend_features)
np.savez_compressed('lab3_ex3_preds', lab3_model=backend_preds)

# Remember to push your changes to the git server for marking!
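
As a sanity check before pushing, a saved `.npz` archive can be read back with `np.load`. The sketch below writes a made-up array to a temporary directory so that it does not touch the real `lab3_ex3_preds` file; only the key name `lab3_model` is taken from the cell above.

```python
import os
import tempfile

import numpy as np

# Made-up predictions, for illustration only
preds = np.array([0.0, 1.0, 1.0, 0.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_preds')
    # NumPy appends the ".npz" extension when it is missing
    np.savez_compressed(path, lab3_model=preds)
    # Read the archive back and pull the array out by its key
    with np.load(path + '.npz') as archive:
        restored = archive['lab3_model']

print(restored.shape, restored)
```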

