1. Look up SMOTE oversampling
https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html .
- a. Describe what it is in your own words in markdown.
- b. Use this technique with the diabetes dataset. Comment on the model performance compared to other methods. Make sure you are clear about why you chose the performance metric you did.

- SMOTE stands for Synthetic Minority Oversampling Technique. It balances the dataset by oversampling the minority class. Instead of duplicating existing rows, it creates synthetic examples: for each minority class sample, it picks one or more of that sample's k nearest minority class neighbors and places a new point at a random position on the line segment joining the two.
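
The interpolation step described above can be sketched with plain NumPy. This is a simplified illustration, not the imbalanced-learn implementation: the function name `smote_sketch`, its parameters, and the toy data are assumptions made for the example, and the neighbor search is a brute-force distance matrix.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows from minority samples X_min (illustration only)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other samples
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per sample
    base = rng.integers(0, n, size=n_new)  # starting sample for each new point
    picked = neighbors[base, rng.integers(0, k, size=n_new)]  # one random neighbor each
    lam = rng.random((n_new, 1))  # random position along the segment, in [0, 1)
    return X_min[base] + lam * (X_min[picked] - X_min[base])

# Hypothetical toy minority class: the four corners of the unit square.
X_min = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
synthetic = smote_sketch(X_min, n_new=10, k=2)
print(synthetic.shape)
```

Every synthetic row lies between two real minority samples, which is why SMOTE tends to fill in the minority region rather than stack copies on top of existing points.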


In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn import metrics
from sklearn.feature_selection import RFE
from imblearn.over_sampling import RandomOverSampler
from imblearn.metrics import classification_report_imbalanced

### Load Data

In [2]:
diabetes = pd.read_csv("diabetes.csv")
diabetes.head(2)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0


In [3]:
diabetes["Outcome"].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

### Feature Selection

In [4]:
X = diabetes.drop('Outcome',axis=1)
y = diabetes['Outcome']

estimator = LogisticRegression()
selector = RFE(estimator, n_features_to_select=4)
selector = selector.fit(X, y)

features = X.columns[selector.support_]
features

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Index(['Pregnancies', 'Glucose', 'BMI', 'DiabetesPedigreeFunction'], dtype='object')

### Evaluation Metric 

In [5]:
# Print accuracy, recall, F1, and precision for the positive (diabetic) class
def evaluation(y_test, y_pred):
    print('Accuracy: '  + str(metrics.accuracy_score(y_test, y_pred)))
    print('Recall: ' + str(metrics.recall_score(y_test, y_pred)))
    print('F1 Score: ' + str(metrics.f1_score(y_test, y_pred)))
    print('Precision: ' + str(metrics.precision_score(y_test, y_pred)))

### Split Training/Testing Datasets

In [6]:
X = diabetes[features]
y = diabetes['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=42)

### Standardization of Features

In [7]:
scaler = StandardScaler()  
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [8]:
lr = LogisticRegression()
lr.fit(X_train,y_train)
y_pred = lr.predict(X_test)
evaluation(y_test, y_pred)

Accuracy: 0.7619047619047619
Recall: 0.6
F1 Score: 0.6357615894039734
Precision: 0.676056338028169


### RandomOverSampler

In [9]:
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

In [10]:
lr = LogisticRegression()
lr.fit(X_resampled,y_resampled)
y_pred = lr.predict(X_test)
evaluation(y_test, y_pred)

Accuracy: 0.7186147186147186
Recall: 0.6875
F1 Score: 0.6285714285714286
Precision: 0.5789473684210527


In [11]:
print(classification_report_imbalanced(y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.82      0.74      0.69      0.77      0.71      0.51       151
          1       0.58      0.69      0.74      0.63      0.71      0.50        80

avg / total       0.73      0.72      0.70      0.72      0.71      0.51       231



### SMOTE

In [12]:
# Use SMOTE to oversample the minority class in the training set only
sm = SMOTE(random_state=42)
smX_train, smy_train = sm.fit_resample(X_train, y_train)

In [13]:
lr = LogisticRegression()
lr.fit(smX_train,smy_train)
y_pred = lr.predict(X_test)
evaluation(y_test, y_pred)

Accuracy: 0.7229437229437229
Recall: 0.6875
F1 Score: 0.6321839080459769
Precision: 0.5851063829787234


In [14]:
print(classification_report_imbalanced(y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.82      0.74      0.69      0.78      0.71      0.51       151
          1       0.59      0.69      0.74      0.63      0.71      0.51        80

avg / total       0.74      0.72      0.71      0.73      0.71      0.51       231



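Recall is the fraction of actual diabetic patients that the model correctly identifies as diabetic. Recall is the primary metric here because a false negative (a diabetic patient the model misses) is costlier than a false positive: a model with high recall lets doctors reach most of the patients who need treatment. SMOTE improved recall from 0.60 for the baseline logistic regression to 0.69, at the cost of lower precision (0.68 to 0.59) and lower accuracy (0.76 to 0.72). RandomOverSampler reached the same recall (0.6875), so SMOTE was only marginally better, on precision (0.585 vs. 0.579) and F1 (0.632 vs. 0.629).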
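
To make the trade-off concrete, the sketch below recomputes the four scores from confusion-matrix counts. The counts are not notebook output: they are back-calculated from the printed metrics (231 test rows, 80 of them diabetic), and the helper name `summarize` is made up for this example.

```python
def summarize(tp, fp, fn, tn):
    """Return accuracy, recall, precision, and F1 for the positive class."""
    recall = tp / (tp + fn)          # share of diabetic patients that were caught
    precision = tp / (tp + fp)       # share of positive predictions that were right
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}

# Counts inferred from the reported scores (an assumption, not copied output).
baseline = summarize(tp=48, fp=23, fn=32, tn=128)
smote = summarize(tp=55, fp=39, fn=25, tn=112)

for name, scores in (("baseline", baseline), ("SMOTE", smote)):
    print(name, {k: round(v, 4) for k, v in scores.items()})
```

Under these inferred counts, SMOTE catches 7 more diabetic patients (48 to 55) while adding 16 false positives (23 to 39), which is the recall-for-precision exchange described above.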

2. Create a function called rec_digit_sum that takes in an integer. This function is the
recursive sum of all the digits in a number.
Given n, take the sum of all the digits in n. If the resulting value has more than one digit,
continue calling the function in this way until a single-digit number is produced. The input
will be a non-negative integer, and this should work for extremely large values as well as
for single-digit inputs.

In [3]:
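def rec_digit_sum(num):
    # Base case: a single-digit number (0-9) is already the answer.
    if num < 10:
        return num
    # Otherwise sum the decimal digits and recurse on the result.
    new_number = sum(int(i) for i in str(num))
    return rec_digit_sum(new_number)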
rec_digit_sum(16)

7

In [4]:
rec_digit_sum(942)

6

In [5]:
rec_digit_sum(132189)

6

In [6]:
rec_digit_sum(493193)

2
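
As a cross-check, the repeated digit sum can also be computed without recursion. A number and its digit sum leave the same remainder when divided by 9, so the result is 0 for an input of 0 and otherwise the value in 1..9 congruent to the input modulo 9. The names `digital_root` and `digit_sum_loop` below are introduced only for this sketch and are not part of the assignment.

```python
def digital_root(n):
    """Closed-form repeated digit sum of a non-negative integer n."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # n and its digit sum are congruent mod 9, so no digit loop is needed.
    return 0 if n == 0 else 1 + (n - 1) % 9

def digit_sum_loop(n):
    """Iterative reference version, used here only to check the closed form."""
    while n >= 10:
        n = sum(int(d) for d in str(n))
    return n

for n in (0, 9, 16, 942, 132189, 493193):
    print(n, digital_root(n), digit_sum_loop(n))
```

Because the closed form never converts the number to a string, it may be the safer choice for extremely large inputs, where recent Python versions limit integer-to-string conversion by default.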