# <font color=blue>Assignment</font>

Using the [dataset](https://sci2s.ugr.es/keel/dataset/data/imbalanced/cleveland-0_vs_4.zip) for the [risk of heart attack](https://sci2s.ugr.es/keel/dataset.php?cod=980) with class imbalance:

1. Create a logistic regression model and measure its performance.
2. Experiment with different resampling methods and class ratios to overcome the class imbalance, and determine the best-performing method and class ratio.

In [248]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,accuracy_score,recall_score,precision_score,f1_score,confusion_matrix,roc_curve, roc_auc_score,precision_recall_curve
from sklearn.utils import resample


* Data cleaning and conversion of object columns to numeric values

In [199]:
ha_df=pd.read_csv('the_risk_heart_attack.dat', header='infer',sep=',', engine='python')

In [201]:
ha_df.loc[[44],['thal']]='3.0'
ha_df.loc[[85,146],['ca']]='0.0'
ha_df.loc[[142],['ca']]='1.0'

In [209]:
ha_df.ca=ha_df.ca.astype('float')
ha_df.thal=ha_df.thal.astype('float')
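
The manual fixes above target specific rows. As a hedged sketch of how such rows could be located in the first place, the snippet below runs `pd.to_numeric(errors='coerce')` on a small made-up frame. The `'?'` placeholder and the toy values are assumptions for illustration, not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for ha_df; the '?' placeholder is an assumption.
toy = pd.DataFrame({'ca': ['0.0', '?', '1.0', '3.0'],
                    'thal': ['3.0', '7.0', '?', '6.0']})

bad_rows = {}
for col in ['ca', 'thal']:
    # Entries that cannot be parsed as numbers become NaN
    parsed = pd.to_numeric(toy[col], errors='coerce')
    bad_rows[col] = toy.index[parsed.isna()].tolist()

print(bad_rows)
```

Once the offending rows are known, they can be overwritten by hand as in the cell above, and the columns then cast with `astype('float')`.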

In [216]:
ha_df['is_risk']=pd.get_dummies(data=ha_df.num,drop_first=True)

* Logistic Regression

In [220]:
Y= ha_df['is_risk']

X= ha_df.iloc[:,0:13]

In [238]:
print('The number of having no risk of heart attack is {} '.format(ha_df[ha_df.is_risk==0].shape[0]))
print('The number of having risk of heart attack is {}'.format(ha_df[ha_df.is_risk==1].shape[0]))


The number of having no risk of heart attack is 164 
The number of having risk of heart attack is 13


The output above shows that the dataset is clearly imbalanced: only 13 patients are at risk of a heart attack, while 164 are not. I will first fit a logistic regression on this dataset and pay particular attention to the model's recall.
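
To see why accuracy alone can mislead here, the short calculation below uses the two class counts printed above to work out the accuracy of a trivial model that always predicts "no risk". The variable names are illustrative only.

```python
n_no_risk = 164  # majority class count from the output above
n_risk = 13      # minority class count from the output above

total = n_no_risk + n_risk
imbalance_ratio = n_no_risk / n_risk
# Accuracy of a trivial model that always predicts "no risk"
baseline_accuracy = n_no_risk / total

print(f"Imbalance ratio: {imbalance_ratio:.1f} : 1")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Such a model would score about 0.93 accuracy while detecting none of the risky patients, which is why recall on the risk class is the metric to watch.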

In [272]:
def c_values(X,Y):

    C_values = [0.001,0.01, 0.1,1,10,100, 1000]
    X_train, X_test, y_train, y_test =  train_test_split(X, Y, test_size=0.20, random_state=111, stratify = Y)
    rows = []

    for c in C_values:
        # Fit an L2-regularized logistic regression for each value of C
        lr = LogisticRegression(penalty = 'l2', C = c, random_state = 0, solver='lbfgs', multi_class='ovr')
        lr.fit(X_train, y_train)
        # Collect rows in a list; DataFrame.append was removed in pandas 2.0
        rows.append({'C_values': c,
                     'Train Accuracy': lr.score(X_train, y_train),
                     'Test Accuracy': lr.score(X_test, y_test)
                     })
    accuracy_values = pd.DataFrame(rows, columns=['C_values', 'Train Accuracy', 'Test Accuracy'])
    display(accuracy_values)

    return None

Among the values tried, C=1 (C is the inverse of the regularization strength) fits the data better than the others, so it is used for the baseline model below.
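
The helper above compares C values by accuracy on a single split. As an alternative sketch, the already imported `GridSearchCV` could compare the same grid with stratified cross-validation scored on recall. The synthetic data from `make_classification` and the names `X_demo`, `y_demo` are assumptions standing in for `X` and `Y`; this is not the procedure the notebook actually ran.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for X and Y (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=13, n_informative=5,
                                     weights=[0.9, 0.1], random_state=0)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
# Stratified folds keep some minority samples in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(LogisticRegression(solver='lbfgs', max_iter=1000),
                      param_grid, scoring='recall', cv=cv)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Scoring on recall rather than accuracy keeps the selection focused on the minority class, which matters when the majority class dominates the accuracy figure.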

In [221]:
X_train,X_test,y_train,y_test= train_test_split(X,Y,test_size=0.2,stratify=Y,random_state=1111)

In [241]:
log_reg=LogisticRegression(C=1.0,solver='lbfgs',multi_class='ovr')

In [242]:
log_reg.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='ovr', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [243]:
y_preds_train=log_reg.predict(X_train)

In [244]:
y_preds=log_reg.predict(X_test)

In [245]:
train_accuracy= accuracy_score(y_train,y_preds_train)
test_accuracy= accuracy_score(y_test,y_preds)
print('The accuracy score of Train is {}'.format(train_accuracy))
print('The accuracy score of Test is {}'.format(test_accuracy))

The accuracy score of Train is 0.9858156028368794
The accuracy score of Test is 0.9444444444444444


In [247]:
print(classification_report(y_test,y_preds))
print(classification_report(y_train,y_preds_train))

              precision    recall  f1-score   support

           0       0.97      0.97      0.97        33
           1       0.67      0.67      0.67         3

    accuracy                           0.94        36
   macro avg       0.82      0.82      0.82        36
weighted avg       0.94      0.94      0.94        36

              precision    recall  f1-score   support

           0       0.98      1.00      0.99       131
           1       1.00      0.80      0.89        10

    accuracy                           0.99       141
   macro avg       0.99      0.90      0.94       141
weighted avg       0.99      0.99      0.99       141



Because the dataset is imbalanced, precision and recall for the risk class differ sharply between the training set (precision 1.00, recall 0.80) and the test set (precision 0.67, recall 0.67). I am going to address the imbalance first, so that applying the model to new data is meaningful.
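
The test-set figures can be reproduced by hand. The counts below are inferred from the test report above (3 risky and 33 non-risky patients, recall and precision of 0.67 for class 1), so treat them as a reconstruction rather than a printed confusion matrix.

```python
# Test-set counts inferred from the report above (class 1 = risk)
tp, fn = 2, 1    # 3 risky patients, 2 of them detected
fp, tn = 1, 32   # 33 non-risky patients, 1 flagged by mistake

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
# prints: precision=0.67 recall=0.67 f1=0.67 accuracy=0.9444
```

With only 3 risky patients in the test set, a single missed case moves recall by a third, so these scores are very sensitive to the split.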

In [275]:
def create_model(X, y,c):
    X_train, X_test, y_train, y_test =  train_test_split(X, y, test_size=0.20, random_state=111, stratify = y)
    
    logreg_model = LogisticRegression(C=c)
    logreg_model.fit(X_train, y_train)

    pred_train = logreg_model.predict(X_train)
    pred_test = logreg_model.predict(X_test)
    
    conf_mtx_train = confusion_matrix(y_train, pred_train)
    conf_mtx_test = confusion_matrix(y_test, pred_test)
    
    print("Accuracy : {}\n".format(logreg_model.score(X_test, y_test)))
    
    print("Train Dataset")
    print(classification_report(y_train, pred_train))
    
    print("Test Dataset")
    print(classification_report(y_test, pred_test))
    
    return  None

I am going to try the up-sampling method first, for comparison with SMOTE and ADASYN.

In [267]:
no_risk_data=ha_df[ha_df.is_risk==0]
risky_data=ha_df[ha_df.is_risk==1]

In [268]:
risky_data_upsampling=resample(risky_data,replace=True,n_samples=len(no_risk_data),random_state=1111)

In [269]:
up_sampled_ha=pd.concat([no_risk_data,risky_data_upsampling])

In [270]:
Y=up_sampled_ha['is_risk']

X=up_sampled_ha.iloc[:,:13]
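
One caveat with the cells above: the minority rows are duplicated before the train/test split that happens inside `c_values` and `create_model`, so copies of the same risky patient can end up in both sets, which can make the test scores optimistic. A hedged sketch of the alternative order (split first, then up-sample only the training part) follows. The toy frame and the `ratio` knob are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
# Toy frame (assumption) with the same class counts as the notebook: 164 vs 13
toy = pd.DataFrame({'feature': rng.normal(size=177),
                    'is_risk': [0] * 164 + [1] * 13})

# 1) Split first so the test set keeps only original, unduplicated rows
train_df, test_df = train_test_split(toy, test_size=0.2, stratify=toy['is_risk'],
                                     random_state=111)

# 2) Up-sample the minority class inside the training split only
ratio = 1.0  # hypothetical knob: minority/majority ratio after resampling
majority = train_df[train_df.is_risk == 0]
minority = train_df[train_df.is_risk == 1]
minority_up = resample(minority, replace=True,
                       n_samples=int(len(majority) * ratio), random_state=1111)
train_up = pd.concat([majority, minority_up])

print(train_up.is_risk.value_counts().to_dict())
print(test_df.is_risk.value_counts().to_dict())
```

Lowering `ratio` (for example to 0.5) gives a simple way to try other class ratios while the test set stays untouched.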

In [273]:
c_values(X,Y)

| | C_values | Train Accuracy | Test Accuracy |
|---|---|---|---|
| 0 | 0.001 | 0.805344 | 0.772727 |
| 1 | 0.01 | 0.896947 | 0.893939 |
| 2 | 0.1 | 0.958015 | 0.939394 |
| 3 | 1.0 | 0.969466 | 0.954545 |
| 4 | 10.0 | 0.980916 | 0.969697 |
| 5 | 100.0 | 0.984733 | 0.969697 |
| 6 | 1000.0 | 0.984733 | 0.969697 |


In [284]:
create_model(X,Y,c=100)

Accuracy : 0.9696969696969697

Train Dataset
              precision    recall  f1-score   support

           0       1.00      0.97      0.98       131
           1       0.97      1.00      0.98       131

    accuracy                           0.98       262
   macro avg       0.99      0.98      0.98       262
weighted avg       0.99      0.98      0.98       262

Test Dataset
              precision    recall  f1-score   support

           0       1.00      0.94      0.97        33
           1       0.94      1.00      0.97        33

    accuracy                           0.97        66
   macro avg       0.97      0.97      0.97        66
weighted avg       0.97      0.97      0.97        66

