### 1. Look up SMOTE oversampling

#### a. Describe what it is in your own words in markdown

Synthetic Minority Oversampling Technique (SMOTE) is useful when building predictive models on classification datasets with a pronounced class imbalance. Such an imbalance is a problem because many learning algorithms end up favoring the majority class and perform poorly on the minority class. One remedy is to oversample the minority class, either by duplicating existing examples or by synthesizing new ones, which is what SMOTE does.

First, a random example from the minority class is chosen. Then its k nearest neighbors within the minority class are found. One of those neighbors is selected at random, and a synthetic example is created at a random point on the line segment connecting the two in feature space.
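The steps above can be sketched in a few lines of NumPy. This is only an illustration of the idea, not imblearn's actual implementation; the function name `smote_sample` and its parameters are made up for this example, and Euclidean distance is assumed for the neighbor search.

```python
import numpy as np

def smote_sample(X_minority, k=5, rng=None):
    """Create one synthetic minority example (illustrative sketch only)."""
    rng = np.random.default_rng(rng)
    X_minority = np.asarray(X_minority, dtype=float)
    k = min(k, len(X_minority) - 1)  # cannot have more neighbors than other points

    # 1. choose a random example from the minority class
    i = rng.integers(len(X_minority))
    x = X_minority[i]

    # 2. find its k nearest minority neighbors (Euclidean), excluding itself
    dists = np.linalg.norm(X_minority - x, axis=1)
    dists[i] = np.inf
    neighbors = np.argsort(dists)[:k]

    # 3. pick one of those neighbors at random
    x_nn = X_minority[rng.choice(neighbors)]

    # 4. place the synthetic point at a random position on the segment x -> x_nn
    gap = rng.random()
    return x + gap * (x_nn - x)

# toy minority class: the four corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(smote_sample(X_min, k=2, rng=0))
```

Because the new point is an interpolation between two real minority examples, it always stays inside the region already covered by the minority class.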

Usually it is employed together with undersampling of the majority class.



#### b. Use this technique with the diabetes dataset. Comment on the model performance compared to other methods. Make sure you are clear about why you chose the performance metric you did.

In [1]:
import imblearn
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


diabetes_path = 'C:/Users/balazs.varga/Documents/BALAZS/USE/REPOS/HW/WEEK13/diabetes.csv'
diabetes = pd.read_csv(diabetes_path)

diabetes.head(2)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0


In [2]:
y = diabetes['Outcome']
X = diabetes.drop(columns = ['Outcome'])

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=42, stratify=y)

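# Standardize: fit the scaler on the training set only, then reuse it on the test set
# (refitting on the test set would scale the two sets with different statistics)
sc = StandardScaler()
X_train_scaler = sc.fit_transform(X_train)
X_test_scaler = sc.transform(X_test)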


In [6]:
oversample = imblearn.over_sampling.SMOTE()
X_resampled, y_resampled = oversample.fit_resample(X_train_scaler, y_train)

# train on the resampled data
model = LogisticRegression(random_state=42)
model.fit(X_resampled, y_resampled)

# calculate balanced accuracy on the untouched test set
from sklearn.metrics import balanced_accuracy_score
y_pred = model.predict(X_test_scaler)
balanced_accuracy_score(y_test, y_pred)

0.7508641975308642

In [7]:
from imblearn.metrics import classification_report_imbalanced
print(classification_report_imbalanced(y_test, y_pred))
# Resampling is one way to improve recall: it gives the model more positive
# examples to learn from, which tends to raise the true positive rate.

                   pre       rec       spe        f1       geo       iba       sup

          0       0.84      0.77      0.73      0.81      0.75      0.57       150
          1       0.63      0.73      0.77      0.68      0.75      0.56        81

avg / total       0.77      0.76      0.74      0.76      0.75      0.56       231



In [8]:
# the same without resampling:

model = LogisticRegression(random_state=42)
model.fit(X_train_scaler, y_train)

# calculate balanced accuracy
from sklearn.metrics import balanced_accuracy_score
y_pred = model.predict(X_test_scaler)
balanced_accuracy_score(y_test, y_pred)

0.6859259259259259

In [9]:
print(classification_report_imbalanced(y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.77      0.85      0.52      0.81      0.67      0.46       150
          1       0.66      0.52      0.85      0.58      0.67      0.43        81

avg / total       0.73      0.74      0.64      0.73      0.67      0.45       231



The most important metric for us in this case is recall of the positive class (has diabetes), because the whole point of resampling is to help the model detect the under-represented positive outcome. Recall of positives improves substantially after resampling (from 0.52 to 0.73), while recall of negatives drops (from 0.85 to 0.77). Balanced accuracy, which weights both classes equally, still rises from 0.686 to 0.751.
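As a quick sanity check on how these numbers relate: for a binary problem, balanced accuracy is the mean of the per-class recalls. The sketch below plugs in the rounded recalls from the two reports above; the helper name `balanced_accuracy_from_recalls` is made up for this example.

```python
def balanced_accuracy_from_recalls(recalls):
    """Balanced accuracy is the unweighted mean of the per-class recalls."""
    return sum(recalls) / len(recalls)

# rounded per-class recalls taken from the classification reports above
with_smote = balanced_accuracy_from_recalls([0.77, 0.73])
without_smote = balanced_accuracy_from_recalls([0.85, 0.52])

print(round(with_smote, 3), round(without_smote, 3))
```

The results come out at roughly 0.75 and 0.685, in line with the 0.7509 and 0.6859 returned by `balanced_accuracy_score`; the small differences are due to the rounding in the reports.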

### 2. Create a function called rec_digit_sum that takes in an integer. This function is the recursive sum of all the digits in a number. Given n, take the sum of all the digits in n. If the resulting value has more than one digit, continue calling the function in this way until a single-digit number is produced. The input will be a non-negative integer, and this should work for extremely large values as well as for single-digit inputs.

In [10]:
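def rec_digit_sum(n):
    # Reject invalid input up front instead of printing a message and returning None
    if not isinstance(n, int) or n < 0:
        raise ValueError('This function works only with non-negative integers')
    # Sum the decimal digits of n
    digit_sum = sum(int(digit) for digit in str(n))
    # A single-digit sum is the answer; otherwise repeat on the sum
    return digit_sum if digit_sum < 10 else rec_digit_sum(digit_sum)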

In [11]:
rec_digit_sum(1098765945)

9
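As a cross-check, the repeated digit sum of a non-negative integer is also known as its digital root, which has a closed form: 0 for n = 0, and 1 + (n - 1) % 9 otherwise. The sketch below is not part of the assignment; the function name `digital_root` is chosen here for illustration, and it avoids both recursion and string conversion.

```python
def digital_root(n):
    """Closed-form repeated digit sum for a non-negative integer."""
    if not isinstance(n, int) or n < 0:
        raise ValueError('This function works only with non-negative integers')
    # 0 maps to 0; every other value maps into the range 1..9
    return 0 if n == 0 else 1 + (n - 1) % 9

print(digital_root(1098765945))
```

For the input used above this also gives 9, since the digits of 1098765945 add up to 54 and 5 + 4 = 9.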