### 1. Look up SMOTE oversampling
https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html.

a. Describe what it is in your own words in markdown.

SMOTE oversampling is used to oversample the minority class. It does this by creating new samples from existing ones. It does not just duplicate existing data points. It finds samples in the minority class that are near each other (using KNN) and draws lines between them. The new data points are found on these lines.

Note: These articles explain it much better than the one that was provided: 
* https://towardsdatascience.com/how-to-effortlessly-handle-class-imbalance-with-python-and-smote-9b715ca8e5a7
* https://towardsdatascience.com/applying-smote-for-class-imbalance-with-just-a-few-lines-of-code-python-cdf603e58688

b. Use this technique with the diabetes datatset. Comment on the model
performance compared to other methods.

In [29]:
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE

diabetes_df = pd.read_csv("diabetes.csv")
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [30]:
#our previous model had a recall of .5
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = diabetes_df.drop('Outcome',axis = 1)
y = diabetes_df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)

#Standardize
sc = StandardScaler()
X_train_scaler = sc.fit_transform(X_train)
X_test_scaler = sc.fit_transform(X_test)

In [31]:
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train_scaler, y_train)

In [32]:
#train using the resampled data
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=42)
model.fit(X_resampled, y_resampled)

LogisticRegression(random_state=42)

In [33]:
#calculate the accuracy score
from sklearn.metrics import balanced_accuracy_score
y_pred = model.predict(X_test_scaler)
balanced_accuracy_score(y_test, y_pred)

0.7268518518518519

### 2. Create a list of preprocessing steps you should try when working to build a model. Work with your group to come up with the most comprehensive list you can.

* standardscaler

* log normal scaling

* one-hot encoding (or integer encoding)

* outlier detection

* run basic statistics

* get value counts

* fillna

* drop columns/rows with a lot of missing data (or that are redundant or too highly correlated with other columns)

* cluster data (e.g., create bins)

* create new features out of certain criteria

* find duplicates

* format column types

* principal component analysis (PCA)