## Question 1 SMOTE

SMOTE is a version of oversampling but instead of duplicating random samples it's creating new samples sort of between the existing samples using K-nearest neighbors.

In [1]:
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE 

In [3]:
diabetes_df = pd.read_csv('../diabetes.csv')
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [4]:
diabetes_df['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from collections import Counter

X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']
# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42, stratify=y)

#Standardize
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)

print('Original dataset shape %s' % Counter(y))

Original dataset shape Counter({0: 500, 1: 268})


In [12]:
sm = SMOTE(random_state=42)
X_resampled, y_resampled= sm.fit_resample(X_train, y_train)
print('Resampled dataset shape %s' % Counter(y_resampled))

Resampled dataset shape Counter({1: 350, 0: 350})


In [13]:
model = LogisticRegression(random_state=42)
model.fit(X_resampled, y_resampled)

LogisticRegression(random_state=42)

In [14]:
# calculate accuracy

from sklearn.metrics import balanced_accuracy_score
y_pred = model.predict(X_test)
balanced_accuracy_score(y_test, y_pred)

0.7541975308641975

Accuracy score from SMOTE oversampling: 0.7541975308641975

Accuracy score from random oversampling: 0.7575308641975309

Accuracy score from original dataset: 0.7359307359307359

## Question 2 Preprocessing

1. Clean up column names for use, and make sure datatypes are consistent
2. Dealing with null values. Deleting columns with too many missing values. Replacing/deleting null values in rows.
3. Deleting columns that are redundant or highly correlated.
4. Deal with large outliers.
5. Standardize numeric columns.
6. Recode non-numeric data (one-hot encoding or label encoding).