1. Look up SMOTE oversampling
https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html .
a. Describe what it is in your own words in markdown.
b. Use this technique with the diabetes dataset. Comment on the model
performance compared to other methods.

## SMOTE
SMOTE stands for "Synthetic Minority Oversampling TEchnique" (that "E" feels kinda forced, but I guess they didn't want it to rhyme with "snot").  It's a type of oversampling where a data point in the minority class is chosen and a certain number (default = 5) of nearest neighbors to the data point are identified.  One of these nearest neighbors is randomly selected, and then a new synthetic point is created based on a point randomly selected between the original data point and the nearest neighbor.  SMOTE can be used to "synthesize" as many points as you need to help correct the imbalances between minority and majority classes for your data modeling purposes.  Since the points you are generating fall between actual minority data points and their nearest neighbors, you're creating "fake" data that is believable without using exact copies of the "real" data points.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

diabetes_df = pd.read_csv("diabetes.csv")
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [2]:
from sklearn.datasets import make_classification
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [3]:
diabetes_df["Outcome"].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

In [4]:
X=diabetes_df.drop("Outcome", axis=1)
y=diabetes_df[["Outcome"]]
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [5]:
#Standardization
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_test)

knn = KNeighborsClassifier(n_neighbors=14)

knn.fit(X_train,y_train)
y_predicted = knn.predict(X_test)
print(y_predicted)

[1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1
 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0
 1 1 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0
 0 0 0 0 1 0]


  return self._fit(X, y)


In [8]:
#calculate the accuracy score
from sklearn.metrics import balanced_accuracy_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
balanced_accuracy_score(y_test, y_pred)

  return f(*args, **kwargs)


0.6599999999999999

In [13]:
X1=diabetes_df.drop("Outcome", axis=1)
y1=diabetes_df[["Outcome"]]

#Resample the training data with RandomOversampler
from imblearn.over_sampling import RandomOverSampler

X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2, random_state=21, stratify=y1)

ros = RandomOverSampler(random_state=21)
X1_resampled, y1_resampled = ros.fit_resample(X1_train, y1_train)

In [14]:
#Standardization
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X1_train = sc.fit_transform(X1_train)
X1_test = sc.fit_transform(X1_test)

knn = KNeighborsClassifier(n_neighbors=14)

knn.fit(X1_train,y1_train)
y1_predicted = knn.predict(X1_test)
print(y1_predicted)

[0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0
 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 1
 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 1
 1 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 0]


  return self._fit(X, y)


In [15]:
#calculate the accuracy score
from sklearn.metrics import balanced_accuracy_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=42)
model.fit(X1_train, y1_train)

y1_pred = model.predict(X1_test)
balanced_accuracy_score(y1_test, y1_pred)

  return f(*args, **kwargs)


0.6892592592592592

With accuracy scores of 0.659 vs. 0.689, the SMOTE and RandomOverSampler methods are comparable in their ability to correctly predict outcomes.

2. Create a list of preprocessing steps you should try when working to build a model. Work
with your group to come up with the most comprehensive list you can.

- Cleaning up column names for use and make sure data types are consistent
- Dealing with null values. Deleting columns with too many missing values
- Deleting columns that are redundant and/or highly correlated 
- Dealing with outliers
- Standardize all numeric columns.
- Perform one-hot encoding or label encoding to turn categorical columns into numeric values (recoding the data)
- Integer encoding
- Correlation matrix
- Log normal scaling
- Standard Scaler
- Principal Component Analysis (PCA)
- Looking at the stats generated by .describe() - mean, min, max, etc.
- Splitting dates/times into easier-to-manipulate features (month, year, etc.)
- Value counts
- Creating new features based on existing features
- Histograms
- Clustering data
- Transforming data types as needed (str to int, etc.)