## Intuition: Disease Screening Example
- Let's say your client is a leading research hospital, and they've asked you to train a model for detecting a disease based on biological inputs collected from patients.
- But here's the catch... the disease is relatively rare; it occurs in only 8% of patients who are screened.
- What if you just wrote a single line of code that always predicts 'No Disease'?

In [1]:
def disease_screen(patient_data):
    # Ignore patient_data
    return 'No Disease.'

- **Your "solution" would have 92% accuracy!**
- That accuracy is misleading:
    - For patients who do not have the disease, you'd have 100% accuracy.
    - For patients who do have the disease, you'd have 0% accuracy.
    - Your overall accuracy would be high simply because most patients do not have the disease (not because your model is any good).
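To make the arithmetic concrete, here is a minimal sketch on a made-up cohort. The cohort size of 1,000 patients, the label strings, and the helper name `always_negative` are assumptions for illustration; only the 8% prevalence comes from the example above.

```python
# Hypothetical cohort: 1,000 screened patients, 8% of whom have the disease.
n_patients = 1000
n_sick = 80
y_true = ['Disease'] * n_sick + ['No Disease'] * (n_patients - n_sick)

def always_negative(patient_data):
    # Ignores its input, just like disease_screen above
    return 'No Disease'

y_pred = [always_negative(None) for _ in y_true]

# Overall accuracy: fraction of all patients labeled correctly
overall_acc = sum(t == p for t, p in zip(y_true, y_pred)) / n_patients

# Accuracy restricted to each true class
healthy_acc = sum(p == t for t, p in zip(y_true, y_pred) if t == 'No Disease') / (n_patients - n_sick)
sick_acc = sum(p == t for t, p in zip(y_true, y_pred) if t == 'Disease') / n_sick

print(overall_acc, healthy_acc, sick_acc)  # 0.92 1.0 0.0
```

The 92% figure comes entirely from the healthy patients; not a single sick patient is detected.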

In [2]:
import numpy as np
import pandas as pd
data = pd.read_csv('balance-data.csv', 
                  names=['balance', 'var1', 'var2', 'var3', 'var4'])

In [3]:
data.head()

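|   | balance | var1 | var2 | var3 | var4 |
|---|---------|------|------|------|------|
| 0 | B | 1 | 1 | 1 | 1 |
| 1 | R | 1 | 1 | 1 | 2 |
| 2 | R | 1 | 1 | 1 | 3 |
| 3 | R | 1 | 1 | 1 | 4 |
| 4 | R | 1 | 1 | 1 | 5 |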


- It has 1 target variable, which we've labeled balance.
- It has 4 input features, which we've labeled var1 through var4.

In [4]:
data.balance.value_counts()

L    288
R    288
B     49
Name: balance, dtype: int64

In [6]:
# For this tutorial, we're going to turn this into a binary classification problem.

# We're going to label each observation as 1 (positive class) 
# if the scale is balanced or 0 (negative class) if the scale is not balanced:

In [5]:
data.balance = [1 if b=='B' else 0 for b in data.balance]

In [6]:
data.balance.value_counts()

0    576
1     49
Name: balance, dtype: int64

In [12]:
neg_label = data.balance.value_counts()[0]
pos_label = data.balance.value_counts()[1]

total_obs = len(data)

In [16]:
print("""If we had predicted 0 for every observation, our accuracy would have 
been the fraction of observations whose true label is 0: {0:.2f}""".format(neg_label/total_obs))

If we had predicted 0 for every observation, our accuracy would have 
been the fraction of observations whose true label is 0: 0.92


- **As you can see, only about 8% of the observations were balanced. Therefore, if we were to always predict 0, we'd achieve an accuracy of 92%.**

# 1. Up-sample Minority Class

- Up-sampling is the process of randomly duplicating observations from the minority class in order to reinforce its signal.

In [17]:
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# First, we'll separate observations from each class into different DataFrames
# Then, we'll resample the minority class with replacement, 
# setting the number of samples to match that of the majority class.
# Finally, we'll combine the up-sampled minority class DataFrame with the original majority class DataFrame.

data_minority = data[data.balance == 1]
data_majority = data[data.balance == 0]

In [18]:
# Upsample minority class
df_minority_upsampled = resample(data_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=576,    # to match majority class
                                 random_state=123) # reproducible results

In [28]:
df_minority_upsampled.head()

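|     | balance | var1 | var2 | var3 | var4 |
|-----|---------|------|------|------|------|
| 572 | 1 | 5 | 3 | 5 | 3 |
| 30  | 1 | 1 | 2 | 2 | 1 |
| 364 | 1 | 3 | 5 | 3 | 5 |
| 416 | 1 | 4 | 2 | 4 | 2 |
| 494 | 1 | 4 | 5 | 4 | 5 |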


In [29]:
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([data_majority, df_minority_upsampled])

# Display new class counts
df_upsampled.balance.value_counts()

1    576
0    576
Name: balance, dtype: int64

- **The two classes are now exactly balanced, with 576 observations each, so the majority class can no longer dominate training. Next, we'll train a logistic regression model on the up-sampled data.**
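The up-sampling steps hard-code `n_samples=576`. As a sketch of how the target size could be derived from the data instead, the helper below wraps the same `resample` call. The function name `upsample_minority` and the toy DataFrame are hypothetical and only serve to illustrate the idea on a binary target.

```python
import pandas as pd
from sklearn.utils import resample

def upsample_minority(df, target, random_state=123):
    """Resample the smaller class of a binary target with replacement
    until it has as many rows as the larger class."""
    counts = df[target].value_counts()
    majority_label, minority_label = counts.idxmax(), counts.idxmin()
    majority = df[df[target] == majority_label]
    minority = df[df[target] == minority_label]
    minority_up = resample(minority,
                           replace=True,              # sample with replacement
                           n_samples=len(majority),   # match the majority class
                           random_state=random_state)
    return pd.concat([majority, minority_up])

# Toy data (an assumption): 8 negatives and 2 positives
toy = pd.DataFrame({'balance': [0] * 8 + [1] * 2, 'var1': range(10)})
balanced = upsample_minority(toy, 'balance')
print(balanced['balance'].value_counts())
```

The helper assumes the two classes really are unequal in size; if they are tied, there is nothing to rebalance.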


In [57]:
y = df_upsampled.balance
Xs = df_upsampled.drop('balance', axis=1)

# Train model
logfit = LogisticRegression()
model = logfit.fit(Xs,y)

# Predict on training set
predict = model.predict(Xs)

# Is our model still predicting just one class?
print("The unique predicted classes are {0}".format(np.unique(predict)))

The unique predicted classes are [0 1]


In [58]:
 # How's our accuracy?
print("The accuracy is: {0:.4f}".format(accuracy_score(y, predict)))

The accuracy is: 0.5139


# 2. Down-sample Majority Class
- Down-sampling involves randomly removing observations from the majority class to prevent its signal from dominating the learning algorithm.
- The most common heuristic for doing so is resampling without replacement.

In [49]:
# Separate majority and minority classes
df_majority = data[data.balance==0]
df_minority = data[data.balance==1]

# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,    # sample without replacement
                                 n_samples=49,     # to match minority class
                                 random_state=123) # reproducible results


In [50]:
df_majority_downsampled.head()

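|     | balance | var1 | var2 | var3 | var4 |
|-----|---------|------|------|------|------|
| 360 | 0 | 3 | 5 | 3 | 1 |
| 486 | 0 | 4 | 5 | 3 | 2 |
| 421 | 0 | 4 | 2 | 5 | 2 |
| 166 | 0 | 2 | 2 | 4 | 2 |
| 194 | 0 | 2 | 3 | 4 | 5 |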


In [51]:
# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
 
# Display new class counts
df_downsampled.balance.value_counts()

1    49
0    49
Name: balance, dtype: int64

- **The two classes are again exactly balanced, but the dataset is now much smaller: only 49 observations per class. Next, we'll train a logistic regression model on the down-sampled data.**

In [54]:
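# Use the down-sampled data here. The results printed below came from a run
# that used df_upsampled, which is why they repeat the up-sampling numbers;
# rerunning this cell will change them.
y = df_downsampled.balance
Xs = df_downsampled.drop('balance', axis=1)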

# Train model
logfit = LogisticRegression()
model_2 = logfit.fit(Xs,y)

# Predict on training set
predict = model_2.predict(Xs)

# Is our model still predicting just one class?
print("The unique predicted classes are {0}".format(np.unique(predict)))

The unique predicted classes are [0 1]


In [56]:
# How's our accuracy?
print("The accuracy is: {0:.4f}".format(accuracy_score(y, predict)))

The accuracy is: 0.5139


# 3. Change Your Performance Metric
- We'll look at using other performance metrics for evaluating the models.
- **Area Under ROC Curve (AUROC).**
    - Intuitively, AUROC represents the likelihood of your model distinguishing observations from two classes.
    - In other words, if you randomly select one observation from each class, what's the probability that your model will be able to "rank" them correctly?
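The ranking interpretation can be checked directly. The sketch below, using made-up labels and scores, compares every (positive, negative) pair by hand and then compares the result with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and model scores, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.3, 0.7, 0.9])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Score difference for every (positive, negative) pair
diff = pos[:, None] - neg[None, :]

# A pair is ranked correctly when the positive scores higher; ties count as half
pairwise = (diff > 0).mean() + 0.5 * (diff == 0).mean()

print(pairwise, roc_auc_score(y_true, scores))
```

Here 8 of the 12 pairs are ranked correctly, and both numbers agree at about 0.667.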
    
    
## ROC Curve and Area Under the Curve (AUC)
- ROC Curve:
    - This curve plots the True Positive Rate (TPR) on the y-axis against the False Positive Rate (FPR) on the x-axis, with one point per decision threshold.
    - In other words, it visualizes the trade-off between how often the model predicts positive when the observation really is positive (TPR) and how often it predicts positive when the observation is actually negative (FPR).
    - You want a high True Positive Rate and a low False Positive Rate.
- AUC:
    - The AUC is the area under the ROC curve, expressed as a fraction of the whole plot (the unit square).
    - <img src="ROC-AUC.png">
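As a small illustration of how the curve is built, the sketch below uses `roc_curve` and `auc` from `sklearn.metrics` on made-up labels and scores (the data are assumptions, not taken from the balance dataset). Each row printed is one point on the ROC curve.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical labels and model scores, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.3, 0.7, 0.9])

# One (FPR, TPR) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)
for f, t, th in zip(fpr, tpr, thresholds):
    print("threshold={0:.2f}  FPR={1:.2f}  TPR={2:.2f}".format(th, f, t))

# Area under the curve via the trapezoidal rule
roc_auc = auc(fpr, tpr)
print("AUC: {0:.4f}".format(roc_auc))
```

The curve always starts at (0, 0) and ends at (1, 1); a model that ranks at random stays close to the diagonal, giving an AUC near 0.5.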

In [69]:
from sklearn.metrics import roc_auc_score

# Predict class probabilities
prob = model_2.predict_proba(Xs)
 
# Keep only the positive class
prob = [p[1] for p in prob]
 
# AUROC of model trained on downsampled dataset
print("The AUROC is {0:.4f}".format(roc_auc_score(y, prob)))

The AUROC is 0.5166


# 4. Penalize Algorithms (Cost-Sensitive Training)
- Another tactic is to use penalized learning algorithms, which increase the cost of classification mistakes on the minority class.
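To get a feel for the size of the penalty, the sketch below computes the weights that scikit-learn's 'balanced' heuristic would assign for the original class counts of 576 negatives and 49 positives. It uses `compute_class_weight` from `sklearn.utils.class_weight`; the array `y_demo` is a stand-in built only from those counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Stand-in labels with the original class counts: 576 negatives, 49 positives
y_demo = np.array([0] * 576 + [1] * 49)
classes = np.array([0, 1])

# 'balanced' weights: n_samples / (n_classes * count of each class)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

print(dict(zip(classes.tolist(), weights.round(4).tolist())))
```

With these counts, a mistake on the minority class costs roughly 12 times as much as a mistake on the majority class (576 / 49). On data that is already balanced, both weights are 1 and the setting has no effect.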

In [73]:
from sklearn.svm import SVC

# During training, we can use the argument class_weight='balanced'  
# to penalize mistakes on the minority class by an amount proportional to how under-represented it is.

# We also want to include the argument probability=True 
# if we want to enable probability estimates for SVM algorithms.

# Using the same Xs and y variables as previously created
 
# Train model (named svc so the SVC class itself is not overwritten)
svc = SVC(kernel='linear', 
          class_weight='balanced', # penalize
          probability=True)
 
model_3 = svc.fit(Xs, y)
 
# Predict on training set
predict = model_3.predict(Xs)
 
# Is our model still predicting just one class?
print(np.unique(predict)) # No!

[0 1]


In [74]:
# How's our accuracy?
print("The accuracy is: {0:.4f}".format(accuracy_score(y, predict)))

The accuracy is: 0.5191


In [75]:
prob = model_3.predict_proba(Xs)
prob = [p[1] for p in prob]
print("The AUROC is {0:.4f}".format(roc_auc_score(y, prob)))

The AUROC is 0.4853


# 5. Use Tree-Based Algorithms
- The final tactic we'll consider is using tree-based algorithms. Decision trees often perform well on imbalanced datasets because their hierarchical structure allows them to learn signals from both classes.

In [16]:
from sklearn.ensemble import RandomForestClassifier

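# X is not defined earlier in this notebook. The accuracy printed below
# (0.9796 = 96/98) is consistent with the 98-row down-sampled data,
# so we assume that dataset here.
X = df_downsampled.drop('balance', axis=1)
y = df_downsampled.balance

# Train model
clf_4 = RandomForestClassifier()
clf_4.fit(X, y)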
 
# Predict on training set
pred_y_4 = clf_4.predict(X)
 
# Is our model still predicting just one class?
print(np.unique(pred_y_4))
 
# How's our accuracy?
print(accuracy_score(y, pred_y_4))
 
# What about AUROC?
prob_y_4 = clf_4.predict_proba(X)
prob_y_4 = [p[1] for p in prob_y_4]
print(roc_auc_score(y, prob_y_4))

[0 1]
0.979591836735
0.998334027489
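All of the scores above are measured on the same rows the model was trained on, which tends to flatter a random forest in particular. As a rough sketch of how a held-out check might look, the example below uses synthetic stand-in data from `make_classification` (an assumption, not the balance dataset) with roughly 8% positives, and reports accuracy and AUROC on a stratified test split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): about 92% negatives, 8% positives
X_demo, y_demo = make_classification(n_samples=1000, n_features=4,
                                     n_informative=3, n_redundant=0,
                                     weights=[0.92, 0.08], random_state=123)

# Stratify so both splits keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=123)

clf = RandomForestClassifier(random_state=123)
clf.fit(X_train, y_train)

# Compare the training score with scores on rows the model has never seen
train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
test_auroc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

print("train accuracy: {0:.4f}".format(train_acc))
print("test accuracy:  {0:.4f}".format(test_acc))
print("test AUROC:     {0:.4f}".format(test_auroc))
```

If resampling is combined with a split like this, it is usually applied to the training portion only, so that duplicated rows do not leak into the test set.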
