1. Look up SMOTE oversampling
https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOT E.html 

a. Describe what it is in your own words in markdown.

SMOTE is an oversampling method (resampling the minority class proportion to match the majority) where samples are generated for the minority class. In some oversampling techniques, the data from the minority class is simply duplicated. This does not produce any new information. SMOTE helps solve this issue because it generates synthetic minority data. 

How is new data generated?

    SMOTE or Synthetic Minority Oversampling Technique uses KNN to generate samples. Specifically, first, a random sample is taken. Then the kNN for that random example is found. Typically the neighbors used is 5. A neighbor is selected at random and the synthetic example is created. 

b. Use this technique with the diabetes dataset. Comment on the model
performance compared to other methods. Make sure you are clear about why
you chose the performance metric you did.

In [16]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [17]:
diabetes_df = pd.read_csv("diabetes copy2.csv")
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [18]:
diabetes_df['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

In [19]:
from imblearn.over_sampling import SMOTE

In [20]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=42, stratify=y)

#Standardize (values get centered around the average [mean] )
sc = StandardScaler()
X_train_scaler = sc.fit_transform(X_train)
X_test_scaler = sc.fit_transform(X_test)

In [21]:
#Oversampling the data
smote = SMOTE(random_state = 42)
X_resampled, y_resampled = smote.fit_resample(X_train_scaler, y_train)

In [22]:
from sklearn.linear_model import LogisticRegression

In [23]:
#train using resampled data
model = LogisticRegression(random_state=42)
model.fit(X_resampled, y_resampled)

LogisticRegression(random_state=42)

In [24]:
y_pred = model.predict(X_test_scaler)

In [25]:
from imblearn.metrics import classification_report_imbalanced
y_pred = model.predict(X_test_scaler)
print(classification_report_imbalanced(y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.84      0.78      0.73      0.81      0.75      0.57       150
          1       0.64      0.73      0.78      0.68      0.75      0.57        81

avg / total       0.77      0.76      0.75      0.76      0.75      0.57       231



The model performance I would look at is recall since resampling gives us a better understanding of the positives. 

The recall percent for 1 is, [true positives (diabetic)] 73% that means this model is good in predicting the positive cases.

2. Create a function called rec_digit_sum that takes in an integer. This function is the recursive sum of all the digits in a number.


Given n, take the sum of all the digits in n. If the resulting value has more than one digit, continue calling the function in this way until a single-digit number is produced. The input will be a non-negative integer, and this should work for extremely large values as well as for single-digit inputs.


Examples:
- 16 --> 1+6=7
- 942 --> 9+4+2=15 --> 1+5=6
- 132189 --> 1+3+2+1+8+9=24 --> 2+4=6 
- 493193 --> 4+9+3+1+9+3=29 --> 2+9=11-->1+1 =2

In [60]:
def rec_digit_sum (i): 
    if i == 0:
        return 0
    else: 
        s = (i % 10) + rec_digit_sum(i // 10)
        
        if s > 9:
            return rec_digit_sum (s)
    return s


In [61]:
rec_digit_sum (16)

7

In [62]:
rec_digit_sum (942)

6

In [63]:
rec_digit_sum (132189)

6

In [64]:
rec_digit_sum (493193)

2

# Why it works

If user inputs a 0, 0 is returned. 

Otherwise, the i modulus 10 is taken. Which divides i by 10 but returns the remainder. 

If i is greater than 10, the remainder of this operation will always be the last digit of i. If i is less than 10, then the remainder is 0. 

The result is added to the function, 'rec_digit_sum', consisting of the following argument: 

    i 'floor' divided by 10 which only returns the whole number consisting of all the digits excluding the 
    last (which is taken care of by the modulus).

    This argument, as mentioned before, "goes back" into the function and can enter the conditional statement 
    (if not equal to 0).  Each time the "new i" is passed into the function, the modulus and floor division 
    operators seperate the number into the last digit and preceding digits respectively until the i modulus 10 
    is equal to 0.


        If the sum of the digits 's', is greater than 9 (or had more than 1 digit), the function is called on 
        yet again and will continue to be called on until s is less than 9