1. Look up SMOTE oversampling
https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html .
    - a. Describe what it is in your own words in markdown.
    - b. Use this technique with the diabetes dataset. Comment on the model performance compared to other methods. Make sure you are clear about why you chose the performance metric you did.


**What is SMOTE?**

SMOTE is an abbreviation for "Synthetic Minority Over-sampling Technique." It is a way for us to oversample our minority data by creating synthetic minority samples so that we can have a more balanced data set. This can allow for improved recall in our model, for accurately predicting any minority cases that we have. SMOTE utilizes k-nearest neighbor algorithms to create the synthetic data. This is a key way that SMOTE differs from some other resampling techniques. For example, the RandomOverSample method duplicates some of the minority data. SMOTE, however, does not duplicate. It basically creates "new" minority data, using the k-nearest neighbor algorithm. This is repeated until your minority class is large enough to have an appropriately balanced data set.

**Example of This Technique**

In [22]:
# Load in modules
import pandas as pd
import numpy as np

# Read in the diabetes file
diabetes = pd.read_csv('diabetes.csv')

In [23]:
# View the file
diabetes.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


First, set up the data and split into training and testing sets. Next, standardize before moving onto to resampling.

In [24]:
# Import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Drop outcome, so that we only have predictors
X = diabetes.drop('Outcome', axis=1)
# Keep outcome as the target variable
y = diabetes['Outcome']

# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=42, stratify=y)

# standardize
sc = StandardScaler()

# Fit the standardized classifier to training and testing data
X_train_scaler = sc.fit_transform(X_train)
X_test_scaler = sc.fit_transform(X_test)

Utilize the SMOTE resampling technique

In [25]:
# Import module to use SMOTE
from imblearn.over_sampling import SMOTE

# Instantiate the classifier
sm = SMOTE(random_state=33)

# Complete the classification on our training data
X_res, y_res = sm.fit_resample(X_train_scaler, y_train)

Build and fit our logistic regression model to the resampled data

In [26]:
# Build the logistic Regression model
model = LogisticRegression(random_state=33)
# Fit the model to the resampled data
model.fit(X_res, y_res)

LogisticRegression(random_state=33)

View the classification report to see how well our model is performing

In [28]:
from imblearn.metrics import classification_report_imbalanced

# Print the report from the oversampled model
print(classification_report_imbalanced(y_test, y_pred)) 

                   pre       rec       spe        f1       geo       iba       sup

          0       0.83      0.77      0.70      0.80      0.74      0.55       150
          1       0.63      0.70      0.77      0.66      0.74      0.54        81

avg / total       0.76      0.75      0.73      0.75      0.74      0.55       231



**Model Performance**

I have selected *recall* as my metric for measuring performance across models. I selected this metric because I am most interested in identifying true positives. In other words, it is very important to me that someone who has diabetes knows accurately that they have diabetes. This will maximize the number of patients able to get the necessary treatment as quickly as possible.

*SMOTE*

When using SMOTE, the recall was 70%, meaning that of all the people in our sample who truly have diabetes, we accurately predicted 70% of them (our sensitivity). 

*Random Over Sampler*

During the class lecture, we used the RandomOverSampler technique, which yielded a recall of 73% for people who have diabetes. 

*Original - No resampling methods*

Contrastingly, the original logistic regression model, that did not use any oversampling techniques, had a recall of 62%, which understandably was the worst of the 3 methods. This is understandable because our minority class was inherently smaller. This means we know less about them, and thus, are not as good at making predictions about them.

*Best result*

After examining all the results, the example from class, utilizing the random over sampler yielded the best recall results. This indicates that this model is the best at predicting whether a person truly has diabetes.

2. Create a function called rec_digit_sum that takes in an integer. This function is the recursive sum of all the digits in a number. Given n, take the sum of all the digits in n. If the resulting value has more than one digit, continue calling the function in this way until a single-digit number is produced. The input will be a non-negative integer, and this should work for extremely large values as well as for single-digit inputs.

**Examples:**

16 --> 1 + 6 = 7

942 --> 9 + 4 + 2 = 15 --> 1 + 5 = 6

132189 --> 1 + 3 + 2 + 1 + 8 + 9 = 24 --> 2 + 4 = 6

493193 --> 4 + 9 + 3 + 1 + 9 + 3 = 29 --> 2 + 9 = 11

**Create the Function**

In [93]:
# Create a function that takes in an integer
def rec_digit_sum(n):
    # Try except handling to ensure we only use integers
    try:
        # Convert the argument integer into a string
        # Place the string into a list, so that we can iterate over it
        num_list = list(str(n))
        
        # Iterate through the list
        num = [int(n) for n in num_list]
        # Add up all the elements of the number list
        sum_num = sum(num)
        # Check to see if the length of the list is greater than 1
        if len(str(sum_num)) >1:
            # If it is, call the function again
            rec_digit_sum(sum_num)
        # If it's not, just print the value
        else:
            print(sum_num)
    # Error message if someone puts in the wrong argument
    except ValueError:
        print('Please input an integer')

**Test the Function on Sample Output**

In [94]:
rec_digit_sum(5.5)

Please input an integer


In [95]:
rec_digit_sum(16)

7


In [96]:
rec_digit_sum(942)

6


In [97]:
rec_digit_sum(132189)

6


In [98]:
rec_digit_sum(493193)

2
