## Class imbalance
- Classification for predicting fraudulent bank transactions
    - 99% of transactions are legitimate; 1% are fraudulent
- Could build a classifier that predicts NONE of the transactions are fraudulent
    - 99% accurate!
    - But terrible at actually predicting fraudulent transactions (fails original purpose)
- Class imbalance: Uneven frequency of classes
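
The fraud example can be reproduced in a few lines of scikit-learn. This is a minimal sketch on made-up labels (1,000 hypothetical transactions, 10 of them fraudulent), not real transaction data:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 990 legitimate (0) and 10 fraudulent (1) transactions
y_true = np.array([0] * 990 + [1] * 10)

# A "classifier" that predicts that no transaction is fraudulent
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)  # share of correct predictions
recall = recall_score(y_true, y_pred)      # share of fraud cases that were caught

print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.99
print(f"recall:   {recall:.2f}")    # recall:   0.00
```

Accuracy rewards the model for the majority class alone, which is why the metrics below are needed.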

Assessing classification performance with a confusion matrix

Precision = True positives/(true positives + false positives)
- High precision = lower false positive rate
- High precision = Not many legitimate transactions are predicted to be fraudulent

Recall = True positives/(true positives + false negatives)
- High recall = lower false negative rate
- High recall: Predicted most fraudulent transactions correctly

F1 score = 2 * (precision * recall)/(precision + recall)
- Harmonic mean of precision and recall; factors in both the number of errors and the type of errors
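
Substituting the definitions of precision and recall into the F1 formula shows how both error types enter the score. This is a standard algebraic rewrite; TP, FP, and FN are shorthand introduced here for true positives, false positives, and false negatives:

```latex
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
```

Every false positive or false negative enlarges the denominator, so the score drops whichever kind of error the model makes.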

In [2]:
import numpy as np
import pandas as pd

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

churn_df = pd.read_csv('../resources/telecom_churn_clean.csv')

In [5]:
y = churn_df["churn"].values
X = churn_df[["account_length", "customer_service_calls"]].values

knn = KNeighborsClassifier(n_neighbors=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[1129    9]
 [ 189    7]]
              precision    recall  f1-score   support

           0       0.86      0.99      0.92      1138
           1       0.44      0.04      0.07       196

    accuracy                           0.85      1334
   macro avg       0.65      0.51      0.49      1334
weighted avg       0.80      0.85      0.79      1334
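
The class 1 (churn) row of the report above can be checked by hand from the confusion matrix. A minimal sketch, relying on scikit-learn's layout where rows are actual classes and columns are predicted classes:

```python
# Counts copied from the confusion matrix printed above
tn, fp = 1129, 9   # actual 0: predicted 0, predicted 1
fn, tp = 189, 7    # actual 1: predicted 0, predicted 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.44 recall=0.04 f1=0.07 accuracy=0.85
```

Accuracy looks respectable at 0.85, yet recall shows that the model catches only 7 of the 196 churners, which is the class imbalance problem described at the top of this section.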



In [25]:
diabetes_df = pd.read_csv('../resources/diabetes_clean.csv')
X = diabetes_df[['bmi', 'age']]
y = diabetes_df['diabetes'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [26]:
# Import confusion_matrix and classification_report
from sklearn.metrics import confusion_matrix, classification_report

knn = KNeighborsClassifier(n_neighbors=6)

# Fit the model to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[117  34]
 [ 47  33]]
              precision    recall  f1-score   support

           0       0.71      0.77      0.74       151
           1       0.49      0.41      0.45        80

    accuracy                           0.65       231
   macro avg       0.60      0.59      0.60       231
weighted avg       0.64      0.65      0.64       231
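
The `macro avg` and `weighted avg` rows of the report above can be reconstructed from the confusion matrix. The sketch below does this for precision only, using NumPy; variable names such as `cm` and `support` are illustrative:

```python
import numpy as np

# Confusion matrix printed above (rows = actual class, columns = predicted class)
cm = np.array([[117, 34],
               [47, 33]])

correct = np.diag(cm)                 # correctly predicted samples per class
precision = correct / cm.sum(axis=0)  # divide by column totals (predicted counts)
recall = correct / cm.sum(axis=1)     # divide by row totals (actual counts)
support = cm.sum(axis=1)              # number of actual samples per class

macro_precision = precision.mean()                           # unweighted mean over classes
weighted_precision = np.average(precision, weights=support)  # mean weighted by support

print(f"macro avg precision:    {macro_precision:.2f}")     # macro avg precision:    0.60
print(f"weighted avg precision: {weighted_precision:.2f}")  # weighted avg precision: 0.64
```

The macro average treats both classes equally, while the weighted average leans toward the larger class (151 samples of class 0 versus 80 of class 1).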



## Logistic regression for binary classification
- Logistic regression is used for classification problems
- Logistic regression outputs probability
- If probability, p > 0.5
    - data is labeled 1
- If probability, p < 0.5
    - data is labeled 0
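
The labeling rule above can be checked against scikit-learn directly. This is a sketch on synthetic data from `make_classification` (an assumption for illustration, not one of the datasets used in these notes); barring a probability of exactly 0.5, the comparison should print `True`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, used here only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=42)

logreg_demo = LogisticRegression().fit(X_demo, y_demo)

# Column 1 of predict_proba holds the predicted probability of class 1
p = logreg_demo.predict_proba(X_demo)[:, 1]

# Label by hand with the 0.5 cutoff
manual_labels = (p > 0.5).astype(int)

# Compare with the labels returned by predict()
print(np.array_equal(manual_labels, logreg_demo.predict(X_demo)))
```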

### Probability thresholds
- By default, logistic regression threshold = 0.5
- Not specific to logistic regression
- KNN classifiers also have thresholds
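- What happens if we vary the threshold?
- **Use a ROC curve to see how the true positive and false positive rates change with the threshold; the area under it (AUC) ranges from 0 to 1, with 1 being ideal**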
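
To illustrate that thresholds are not specific to logistic regression: with the default uniform weights, a KNN classifier's `predict_proba` returns the fraction of the k nearest neighbors that belong to each class, and that fraction can be cut at any threshold. A sketch on synthetic data (the `_demo` names and the cutoffs 0.3 and 0.7 are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

knn_demo = KNeighborsClassifier(n_neighbors=7).fit(X_tr, y_tr)

# Fraction of the 7 nearest neighbors that belong to class 1
knn_probs = knn_demo.predict_proba(X_te)[:, 1]
print(np.unique(np.round(knn_probs, 3)))  # values are multiples of 1/7

# Raising the cutoff can only reduce the number of samples labeled 1
for cutoff in (0.3, 0.5, 0.7):
    print(cutoff, int((knn_probs > cutoff).sum()))
```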

In [30]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

logreg = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

y_pred_probs = logreg.predict_proba(X_test)[:, 1]

fpr, tpr, threshold = roc_curve(y_test, y_pred_probs)
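
The arrays returned by `roc_curve` can be summarized as a single number, the area under the curve (AUC). A hedged sketch on synthetic data (the `_demo` names and the `make_classification` data are illustrative, not the diabetes data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data for illustration only
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Predicted probability of class 1 for each test sample
probs_demo = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# One (fpr, tpr) point per candidate threshold
fpr_demo, tpr_demo, thresholds_demo = roc_curve(y_te, probs_demo)

# Area under the ROC curve, computed two equivalent ways
auc_from_curve = auc(fpr_demo, tpr_demo)
auc_direct = roc_auc_score(y_te, probs_demo)
print(f"{auc_from_curve:.3f} {auc_direct:.3f}")
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking. Plotting `fpr_demo` against `tpr_demo` (for example with matplotlib) draws the curve itself.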

In [35]:
diabetes_df = pd.read_csv('../resources/diabetes_clean.csv')

X = diabetes_df.drop("diabetes", axis=1)
y = diabetes_df["diabetes"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [39]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate the model
logreg = LogisticRegression(max_iter=1000)

# Fit the model
logreg.fit(X_train, y_train)

# Predict probabilities
y_pred_probs = logreg.predict_proba(X_test)[:, 1]

print(y_pred_probs[:10])

[0.26653165 0.19014682 0.12360024 0.14565342 0.50456534 0.45328168
 0.01327948 0.59612298 0.56324867 0.79991141]
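
Turning these probabilities into labels is a matter of choosing a cutoff. The sketch below applies the default 0.5 and a hypothetical lower cutoff of 0.3 to the ten values printed above:

```python
import numpy as np

# The ten probabilities printed above
y_pred_probs_head = np.array([0.26653165, 0.19014682, 0.12360024, 0.14565342, 0.50456534,
                              0.45328168, 0.01327948, 0.59612298, 0.56324867, 0.79991141])

# Default cutoff of 0.5
labels_default = (y_pred_probs_head > 0.5).astype(int)
# A lower, hypothetical cutoff of 0.3 flags more patients as positive
labels_low = (y_pred_probs_head > 0.3).astype(int)

print(labels_default)  # [0 0 0 0 1 0 0 1 1 1]
print(labels_low)      # [0 0 0 0 1 1 0 1 1 1]
```

Lowering the cutoff trades precision for recall: more true cases are caught, at the cost of more false alarms.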


In [38]:
diabetes_df

Unnamed: 0,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1
