# <font color=blue>Assignments for "Imbalanced Data"</font>

Using the [dataset](https://sci2s.ugr.es/keel/dataset/data/imbalanced/cleveland-0_vs_4.zip) for the [risk of heart attack](https://sci2s.ugr.es/keel/dataset.php?cod=980) with class imbalance:

1. Create a logistic regression model and measure its performance.
2. Experiment with different methods and class ratios to overcome the class imbalance, and determine the best-performing method and class ratio.

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

In [64]:
# error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' (pandas >= 1.3) replaces it
df = pd.read_csv('heartAttack.dat', skiprows=18, on_bad_lines='skip')
df.columns =['age', 'sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num'] 
df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,negative
1,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,negative
2,56.0,1.0,2.0,120.0,236.0,0.0,0.0,178.0,0.0,0.8,1.0,0.0,3.0,negative
3,57.0,0.0,4.0,120.0,354.0,0.0,0.0,163.0,1.0,0.6,1.0,0.0,3.0,negative
4,57.0,1.0,4.0,140.0,192.0,0.0,0.0,148.0,0.0,0.4,2.0,0.0,6.0,negative
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
171,64.0,1.0,4.0,145.0,212.0,0.0,2.0,132.0,0.0,2.0,2.0,2.0,6.0,positive
172,38.0,1.0,1.0,120.0,231.0,0.0,0.0,182.0,1.0,3.8,2.0,0.0,7.0,positive
173,61.0,1.0,4.0,138.0,166.0,0.0,2.0,125.0,1.0,3.6,2.0,1.0,3.0,positive
174,58.0,1.0,4.0,114.0,318.0,0.0,1.0,140.0,0.0,4.4,3.0,3.0,6.0,positive


In [31]:
def objColumn(row):
    if row == "negative":
        value = 0
    else:
        value = 1
    return value

In [65]:
df2 = df.num.apply(objColumn)
df.num = df2

In [78]:
df.ca.unique()
df.thal.unique()

array(['3.0', '6.0', '7.0', '<null>'], dtype=object)

In [110]:
df = df[df.ca != "<null>"]
df = df[df.thal != "<null>"]
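
The filter above removes the `<null>` rows but leaves `ca` and `thal` as strings (note `dtype=object` in the output). A possible alternative, sketched here on a few made-up rows rather than the real file, is to let `read_csv` treat `<null>` as missing via `na_values` and then drop those rows, so the columns come out numeric. The inline sample and the names `raw` and `sample` are illustrative assumptions.

```python
import io
import pandas as pd

# Made-up rows shaped like the last three columns (assumption, not real data).
raw = io.StringIO(
    "0.0,3.0,negative\n"
    "<null>,7.0,negative\n"
    "2.0,<null>,positive\n"
    "3.0,6.0,positive\n"
)

# na_values turns the '<null>' marker into NaN while parsing.
sample = pd.read_csv(raw, header=None, names=["ca", "thal", "num"], na_values=["<null>"])

# Drop rows where either column is missing; both columns are now float, not object.
sample = sample.dropna(subset=["ca", "thal"])
print(sample.dtypes)
print(len(sample))
```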

### Creating a logistic regression model for dataset

In [170]:
from sklearn.model_selection import train_test_split
import warnings
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
warnings.filterwarnings('ignore')

In [197]:
X = df.iloc[:,:-1]
y = df.num

In [198]:
X_train, X_test, y_train, y_test =  train_test_split(X, y, test_size=0.20, random_state=111)

In [199]:
logreg = LogisticRegression()

In [200]:
logreg.fit(X_train, y_train)

LogisticRegression()

#### Analyzing results

In [201]:
preds_train = logreg.predict(X_train)
preds_test = logreg.predict(X_test)

In [202]:
print("Train Dataset")
print("Accuracy score: {:.3f}" .format(accuracy_score(y_train, preds_train)))
print(classification_report(y_train, preds_train))

Train Dataset
Accuracy score: 0.985
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       128
           1       1.00      0.78      0.88         9

    accuracy                           0.99       137
   macro avg       0.99      0.89      0.93       137
weighted avg       0.99      0.99      0.98       137



In [203]:
print("Test Dataset")
print("Accuracy score: {:.3f}" .format(accuracy_score(y_test, preds_test)))
print(classification_report(y_test, preds_test))

Test Dataset
Accuracy score: 0.914
              precision    recall  f1-score   support

           0       0.91      1.00      0.95        31
           1       1.00      0.25      0.40         4

    accuracy                           0.91        35
   macro avg       0.96      0.62      0.68        35
weighted avg       0.92      0.91      0.89        35



**Results:** On the train dataset, recall for the positive class is 0.78, which may be acceptable. On the test dataset, however, recall is 0.25, which means the model catches only 1 out of 4 heart patients. That is very low, so the model has to be improved. The following sections try different resampling methods to get a better model and better results.
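
To see how a 0.914 accuracy and a 0.25 recall can coexist, the counts implied by the test report can be plugged into the metric definitions. The variable names are illustrative; the counts are inferred from the supports and recalls printed in the report.

```python
# Counts read off the test report above: 31 negatives, 4 positives.
tn, fp = 31, 0   # recall 1.00 on class 0 -> every negative predicted correctly
tp, fn = 1, 3    # recall 0.25 on class 1 -> 1 of 4 positives found

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_pos = tp / (tp + fn)
precision_neg = tn / (tn + fn)

print(f"accuracy      = {accuracy:.3f}")
print(f"recall (1)    = {recall_pos:.2f}")
print(f"precision (0) = {precision_neg:.2f}")
```

Because 31 of the 35 test rows are negative, predicting "negative" almost everywhere still yields high accuracy, which is why recall on the positive class is the more informative number here.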

### Upscaling 

In [219]:
heartAttack = df[df.num==1]
noHeartAttack = df[df.num==0]
up_ = resample(heartAttack, n_samples= len(noHeartAttack), random_state=10)

In [220]:
upscaled_df = pd.concat([up_, noHeartAttack])

In [221]:
upscaled_df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
172,38.0,1.0,1.0,120.0,231.0,0.0,0.0,182.0,1.0,3.8,2.0,0.0,7.0,1
167,63.0,0.0,4.0,150.0,407.0,0.0,2.0,154.0,0.0,4.0,2.0,3.0,7.0,1
163,60.0,1.0,4.0,130.0,206.0,0.0,2.0,132.0,1.0,2.4,2.0,2.0,7.0,1
164,65.0,0.0,4.0,150.0,225.0,0.0,2.0,114.0,0.0,1.0,2.0,3.0,7.0,1
174,58.0,1.0,4.0,114.0,318.0,0.0,1.0,140.0,0.0,4.4,3.0,3.0,6.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
158,71.0,0.0,4.0,112.0,149.0,0.0,0.0,125.0,0.0,1.6,2.0,0.0,3.0,0
159,66.0,0.0,3.0,146.0,278.0,0.0,2.0,152.0,0.0,0.0,2.0,1.0,3.0,0
160,58.0,0.0,4.0,130.0,197.0,0.0,0.0,131.0,0.0,0.6,2.0,0.0,3.0,0
161,35.0,1.0,2.0,122.0,192.0,0.0,0.0,174.0,0.0,0.0,1.0,0.0,3.0,0


In [232]:
X1 = upscaled_df.iloc[:,:-1]
y1 = upscaled_df.num

In [233]:
X_train1, X_test1, y_train1, y_test1 =  train_test_split(X1, y1, test_size=0.20, random_state=111)

In [224]:
logreg2 = LogisticRegression()

In [225]:
logreg2.fit(X_train1, y_train1)

LogisticRegression()

In [226]:
preds_train1 = logreg2.predict(X_train1)
preds_test1 = logreg2.predict(X_test1)

In [227]:
print("Train Dataset")
print("Accuracy score: {:.3f}" .format(accuracy_score(y_train1, preds_train1)))
print(classification_report(y_train1, preds_train1))

Train Dataset
Accuracy score: 0.988
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       129
           1       0.98      1.00      0.99       125

    accuracy                           0.99       254
   macro avg       0.99      0.99      0.99       254
weighted avg       0.99      0.99      0.99       254



In [228]:
print("Test Dataset")
print("Accuracy score: {:.3f}" .format(accuracy_score(y_test1, preds_test1)))
print(classification_report(y_test1, preds_test1))

Test Dataset
Accuracy score: 0.938
              precision    recall  f1-score   support

           0       1.00      0.87      0.93        30
           1       0.89      1.00      0.94        34

    accuracy                           0.94        64
   macro avg       0.95      0.93      0.94        64
weighted avg       0.94      0.94      0.94        64



**Results:** Accuracy and positive-class recall increased on both the train and test data (test accuracy from 0.914 to 0.938, test recall from 0.25 to 1.00). The model now performs much better than the original.
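
One caveat with the cells above: `resample` runs before `train_test_split`, so copies of the same positive row can end up in both the train and test parts, which tends to inflate test scores. A minimal sketch of the alternative order (split first, then upsample only the training part) follows. It uses a synthetic frame named `toy` as a stand-in, which is an assumption; the real `df` would take its place.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the heart data (assumption): 160 negatives, 13 positives.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=173),
    "num": [0] * 160 + [1] * 13,
})

# 1) Split first, stratifying so both parts keep some positives.
train_df, test_df = train_test_split(
    toy, test_size=0.20, random_state=111, stratify=toy["num"]
)

# 2) Upsample the minority class inside the training part only.
train_pos = train_df[train_df["num"] == 1]
train_neg = train_df[train_df["num"] == 0]
train_pos_up = resample(train_pos, replace=True, n_samples=len(train_neg), random_state=10)
train_balanced = pd.concat([train_neg, train_pos_up])

# The test part keeps the original imbalance and shares no rows with training.
shared_rows = set(train_balanced.index) & set(test_df.index)
print(train_balanced["num"].value_counts().to_dict())
print(len(shared_rows))
```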

### Downscaling

In [229]:
heartAttack = df[df.num==1]
noHeartAttack = df[df.num==0]
# resample() defaults to replace=True, so the same negative row can be drawn more than once
down_ = resample(noHeartAttack, n_samples= len(heartAttack), random_state=10)

In [230]:
downscaled_df = pd.concat([down_, heartAttack])

In [231]:
downscaled_df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
9,48.0,0.0,3.0,130.0,275.0,0.0,0.0,139.0,0.0,0.2,1.0,0.0,3.0,0
127,51.0,0.0,3.0,120.0,295.0,0.0,2.0,157.0,0.0,0.6,1.0,0.0,3.0,0
15,66.0,0.0,1.0,150.0,226.0,0.0,0.0,114.0,0.0,2.6,3.0,0.0,3.0,0
65,51.0,1.0,4.0,140.0,261.0,0.0,2.0,186.0,1.0,0.0,1.0,0.0,3.0,0
115,47.0,1.0,4.0,112.0,204.0,0.0,0.0,143.0,0.0,0.1,1.0,0.0,3.0,0
125,57.0,1.0,4.0,110.0,201.0,0.0,0.0,126.0,1.0,1.5,2.0,0.0,6.0,0
160,58.0,0.0,4.0,130.0,197.0,0.0,0.0,131.0,0.0,0.6,2.0,0.0,3.0,0
161,35.0,1.0,2.0,122.0,192.0,0.0,0.0,174.0,0.0,0.0,1.0,0.0,3.0,0
115,47.0,1.0,4.0,112.0,204.0,0.0,0.0,143.0,0.0,0.1,1.0,0.0,3.0,0
8,54.0,1.0,4.0,140.0,239.0,0.0,0.0,160.0,0.0,1.2,1.0,0.0,3.0,0
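
Row 115 appears twice in the sample above. When downsampling, drawing without replacement is usually what is wanted, and `resample` supports that through its `replace` parameter. The small frame `majority` below is a toy example, not the real data.

```python
import pandas as pd
from sklearn.utils import resample

# Toy majority class with 10 distinct rows (illustrative only).
majority = pd.DataFrame({"value": range(10)})

# replace=False samples without replacement, so no row is picked twice.
down_unique = resample(majority, replace=False, n_samples=4, random_state=10)

print(down_unique.index.is_unique)
```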


In [234]:
X2 = downscaled_df.iloc[:,:-1]
y2 = downscaled_df.num

X_train2, X_test2, y_train2, y_test2 =  train_test_split(X2, y2, test_size=0.20, random_state=111)

In [237]:
logreg3 = LogisticRegression()

logreg3.fit(X_train2, y_train2)

preds_train2 = logreg3.predict(X_train2)
preds_test2 = logreg3.predict(X_test2)

In [238]:
print("Train Dataset")
print("Accuracy score: {:.3f}" .format(accuracy_score(y_train2, preds_train2)))
print(classification_report(y_train2, preds_train2))

Train Dataset
Accuracy score: 1.000
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00        10

    accuracy                           1.00        20
   macro avg       1.00      1.00      1.00        20
weighted avg       1.00      1.00      1.00        20



In [240]:
print("Test Dataset")
print("Accuracy score: {:.3f}" .format(accuracy_score(y_test2, preds_test2)))
print(classification_report(y_test2, preds_test2))

Test Dataset
Accuracy score: 1.000
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         3
           1       1.00      1.00      1.00         3

    accuracy                           1.00         6
   macro avg       1.00      1.00      1.00         6
weighted avg       1.00      1.00      1.00         6



### SMOTE

In [245]:
from imblearn.over_sampling import SMOTE, ADASYN

In [246]:
sm = SMOTE()

In [247]:
# fit_sample was removed in newer imbalanced-learn releases; fit_resample is the current name
X_smote, y_smote = sm.fit_resample(X, y)

In [248]:
X_train3, X_test3, y_train3, y_test3 =  train_test_split(X_smote, y_smote, test_size=0.20, random_state=111)

logreg4 = LogisticRegression()

logreg4.fit(X_train3, y_train3)

preds_train3 = logreg4.predict(X_train3)
preds_test3 = logreg4.predict(X_test3)

In [261]:
print("Train Dataset")
print("Accuracy score: {:.3f}" .format(accuracy_score(y_train3, preds_train3)))
print(classification_report(y_train3, preds_train3))

Train Dataset
Accuracy score: 0.988
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       129
           1       0.98      1.00      0.99       125

    accuracy                           0.99       254
   macro avg       0.99      0.99      0.99       254
weighted avg       0.99      0.99      0.99       254



In [260]:
print("Test Dataset")
print("Accuracy score: {:.3f}" .format(accuracy_score(y_test3, preds_test3)))
print(classification_report(y_test3, preds_test3))

Test Dataset
Accuracy score: 0.938
              precision    recall  f1-score   support

           0       1.00      0.87      0.93        30
           1       0.89      1.00      0.94        34

    accuracy                           0.94        64
   macro avg       0.95      0.93      0.94        64
weighted avg       0.94      0.94      0.94        64
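
SMOTE differs from the plain upsampling earlier in that it does not copy rows; it builds new minority points on the line segment between a minority sample and one of its nearest minority neighbours. The core interpolation step can be sketched in a few lines of NumPy. This illustrates the idea with made-up values and is not imblearn's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two minority-class points (made-up values): a sample and one of its nearest neighbours.
x = np.array([60.0, 130.0, 2.4])
x_nn = np.array([64.0, 145.0, 2.0])

# SMOTE-style synthetic point: x + lam * (x_nn - x), with lam drawn uniformly from [0, 1).
lam = rng.random()
x_new = x + lam * (x_nn - x)

print(lam, x_new)
```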



### ADASYN

In [253]:
ad = ADASYN()

In [254]:
# fit_resample replaces the older fit_sample name in newer imbalanced-learn releases
X_ad, y_ad = ad.fit_resample(X, y)

In [255]:
X_train4, X_test4, y_train4, y_test4 =  train_test_split(X_ad, y_ad, test_size=0.20, random_state=111)

logreg5 = LogisticRegression()

logreg5.fit(X_train4, y_train4)

preds_train4 = logreg5.predict(X_train4)
preds_test4 = logreg5.predict(X_test4)

In [258]:
print("Train Dataset")
print("Accuracy score: {:.3f}" .format(accuracy_score(y_train4, preds_train4)))
print(classification_report(y_train4, preds_train4))

Train Dataset
Accuracy score: 0.988
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       129
           1       0.98      1.00      0.99       125

    accuracy                           0.99       254
   macro avg       0.99      0.99      0.99       254
weighted avg       0.99      0.99      0.99       254



In [259]:
print("Test Dataset")
print("Accuracy score: {:.3f}" .format(accuracy_score(y_test4, preds_test4)))
print(classification_report(y_test4, preds_test4))

Test Dataset
Accuracy score: 0.938
              precision    recall  f1-score   support

           0       1.00      0.87      0.93        30
           1       0.89      1.00      0.94        34

    accuracy                           0.94        64
   macro avg       0.95      0.93      0.94        64
weighted avg       0.94      0.94      0.94        64



**Final Comment:** Downsampling the negative class to the size of the positive class gives the best scores on both the train and test data (accuracy and recall of 1.00), although its test set contains only 6 rows. Downsampling can be chosen for this data.
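
The runs above all balance the classes 1:1, while the assignment also asks about different class ratios. A hedged sketch of how several ratios could be compared follows. It uses synthetic data from `make_classification` and illustrative names (`recall_by_ratio`, `ratio`), and it resamples only the training part so that test recall is measured on untouched data. Results on the real dataset may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data as a stand-in for the heart dataset (assumption).
X, y = make_classification(n_samples=600, n_features=8, weights=[0.92, 0.08], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=111, stratify=y)

X_neg, X_pos = X_tr[y_tr == 0], X_tr[y_tr == 1]

recall_by_ratio = {}
for ratio in (0.25, 0.5, 1.0):
    # Target number of positives: ratio * negatives, but never fewer than the originals.
    n_pos = max(len(X_pos), int(ratio * len(X_neg)))
    X_pos_up = resample(X_pos, replace=True, n_samples=n_pos, random_state=10)
    X_bal = np.vstack([X_neg, X_pos_up])
    y_bal = np.concatenate([np.zeros(len(X_neg), dtype=int), np.ones(n_pos, dtype=int)])

    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    # Score on the untouched, still-imbalanced test part.
    recall_by_ratio[ratio] = recall_score(y_te, model.predict(X_te))

for ratio, rec in recall_by_ratio.items():
    print(f"positive:negative = {ratio:.2f} -> test recall {rec:.2f}")
```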