<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/Cost_Sensitive_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Cost-Sensitive Machine Learning**

Not all machine learning errors cost the same. <br>
**For example:** <br>
>Predicting that someone has cancer when they don't (a false positive) is far less costly than predicting that someone does not have cancer when they do (a false negative). <br><br>

With imbalanced datasets, different errors can have vastly different costs, because the rare class is usually the one that matters most.
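
To make the idea of unequal error costs concrete, the sketch below multiplies a confusion matrix by a cost matrix. The labels, predictions, and the cost values (1 for a false positive, 10 for a false negative) are made-up assumptions for illustration, not values from a real study.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = has cancer, 0 = does not have cancer
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0])

# Assumed cost matrix: rows = actual class, columns = predicted class.
# A false negative (actual 1, predicted 0) is assumed to cost 10x a false positive.
cost_matrix = np.array([[0, 1],
                        [10, 0]])

# Count each kind of outcome, then weight each count by its cost
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
total_cost = int((cm * cost_matrix).sum())

print(cm)
print("Total cost:", total_cost)
```

Here the single false negative contributes ten times as much to the total as the single false positive, even though both count as one error in plain accuracy.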

Types of cost:<br>
- cost of misclassification errors
- cost of tests or evaluation
- cost of labeling
- cost of intervention
- cost of unwanted achievements or outcomes
- cost of computation
- cost of data collection
- cost of human-computer interaction
- cost of instability


Cost-sensitive techniques can be broken into three types:<br>
- data sampling
- algorithm modifications
- ensemble methods
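
As an illustration of the first category, data sampling, here is a minimal sketch of random oversampling: minority-class rows are resampled with replacement until both classes have the same count. The names `X_demo`, `y_demo`, `X_bal`, and `y_bal` and the 95/5 class split are assumptions chosen for this example. In practice, resampling should be applied only to the training folds so that duplicated rows do not leak into the evaluation data.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

# Small demo dataset (assumed 95/5 split), kept separate from the notebook's X and y
X_demo, y_demo = make_classification(n_classes=2, n_samples=1000, n_features=2,
                                     n_redundant=0, n_clusters_per_class=1,
                                     weights=[0.95, 0.05], flip_y=0, random_state=2)

# Split rows by class: class 0 is the majority, class 1 the minority
X_maj = X_demo[y_demo == 0]
X_min = X_demo[y_demo == 1]

# Draw minority rows with replacement until they match the majority count
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=1)

# Rebuild a balanced dataset
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([np.zeros(len(X_maj), dtype=int),
                        np.ones(len(X_min_up), dtype=int)])

print("Before:", Counter(y_demo))
print("After: ", Counter(y_bal))
```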

**Import libraries**

In [None]:
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot
from numpy import where
from numpy import mean
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight


**Create an imbalanced dataset**

In [None]:
X, y = make_classification(n_classes = 2,n_samples=10000, n_features=2, n_redundant=0,
      n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=2)
# summarize class distribution
counter = Counter(y)
print(counter)

In [None]:
weight_of_classes=[0.99,0.01]
X, y = make_classification(n_classes = 2,n_samples=10000, n_features=2, n_redundant=0,
      n_clusters_per_class=1, weights=weight_of_classes, flip_y=0, random_state=2)
# summarize class distribution
counter = Counter(y)
print(counter)

In [None]:
# scatter plot of examples by class label 
for label, _ in counter.items():
  row_ix = where(y == label)[0]
  pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label)) 
pyplot.legend()
pyplot.show()

**Create a logistic regression model**<br>
**Use Cross Validation to evaluate the model**

In [None]:
# define model
model = LogisticRegression(solver='lbfgs')
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1) 

# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))

# **Weighted Logistic Regression**

Use weighted logistic regression on an imbalanced dataset. <br>
Each class label is given a weight for calculating cost. 
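
One way to picture how class weights enter the cost is a weighted log loss, where each example's loss is scaled by the weight of its true class. The function below is a simplified sketch written for this notebook (the name `weighted_log_loss` and the toy probabilities are assumptions); it is not scikit-learn's exact objective, which also includes regularization.

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, w0=1.0, w1=1.0):
    """Mean log loss with per-class weights (w0 for class 0, w1 for class 1)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities to avoid log(0)
    p = np.clip(np.asarray(p_pred, dtype=float), 1e-15, 1 - 1e-15)
    per_example = -(w1 * y_true * np.log(p) + w0 * (1 - y_true) * np.log(1 - p))
    return per_example.mean()

# Toy example: three majority examples and one poorly predicted minority example
y_true = [0, 0, 0, 1]
p_pred = [0.1, 0.2, 0.1, 0.3]  # predicted probability of class 1

plain = weighted_log_loss(y_true, p_pred)
weighted = weighted_log_loss(y_true, p_pred, w0=0.01, w1=1.0)

print("Unweighted loss:", plain)
print("Weighted loss:  ", weighted)
```

With `w0=0.01`, the majority-class terms shrink to almost nothing, so the loss is dominated by the minority-class error and the optimizer is pushed to fix that error first.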

Weights are a hyperparameter that can be set by:<br>
- running a hyperparameter search
- asking a subject-matter expert (SME) to set the costs
- following a best practice
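
The first option, a hyperparameter search, could look like the sketch below, which uses `GridSearchCV` to compare a few candidate class weightings by ROC AUC. The candidate weights in `param_grid` and the names `X_demo` and `y_demo` are assumptions for illustration; a real search would cover whatever range makes sense for the problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Demo dataset with the same settings as the imbalanced dataset in this notebook
X_demo, y_demo = make_classification(n_classes=2, n_samples=10000, n_features=2,
                                     n_redundant=0, n_clusters_per_class=1,
                                     weights=[0.99, 0.01], flip_y=0, random_state=2)

# Candidate weightings to compare (assumed values)
param_grid = {
    "class_weight": [{0: 1, 1: 1}, {0: 1, 1: 10}, {0: 1, 1: 100}, "balanced"]
}

# Stratified folds keep the class ratio the same in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(LogisticRegression(solver="lbfgs"), param_grid,
                      scoring="roc_auc", cv=cv)
search.fit(X_demo, y_demo)

print("Best weighting:", search.best_params_)
print("Best mean ROC AUC: %.3f" % search.best_score_)
```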

Best practice:<br>
Use the inverse of the class distribution for the weights.<br>
In this example, the ratio of class 0 to class 1 is about 100 to 1 (99% vs. 1%), so we weight class 0 to class 1 as 1 to 100 (for example, 0.01 and 1.0).
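
The inverse-distribution heuristic can be worked out by hand. The sketch below applies the formula that scikit-learn documents for its `'balanced'` mode, `n_samples / (n_classes * count_per_class)`, to an assumed label array of 9,900 zeros and 100 ones (the name `y_demo` is hypothetical).

```python
import numpy as np

# Assumed label array with a 99% / 1% class split
y_demo = np.array([0] * 9900 + [1] * 100)

# Count examples per class, then apply n_samples / (n_classes * count)
counts = np.bincount(y_demo)
n_classes = len(counts)
balanced = len(y_demo) / (n_classes * counts)

print("Class counts:    ", counts)
print("Balanced weights:", balanced)
print("Weight ratio (class 1 / class 0):", balanced[1] / balanced[0])
```

Only the ratio between the two weights matters for how the classes are traded off, which is why weights such as 0.01 and 1.0 express nearly the same preference as the values computed here.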

In [None]:
# define model
weights = {0:0.01, 1:1.0}
model = LogisticRegression(solver='lbfgs', class_weight=weights)
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1) 
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))
print(scores)

**Use compute_class_weight to get the weights**

In [None]:
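# calculate class weighting
# (recent scikit-learn versions require classes and y to be passed as keyword arguments)
from numpy import unique
weighting = compute_class_weight(class_weight='balanced', classes=unique(y), y=y)
print(weighting)
# roughly 0.5 to 50, which is about the same ratio as 1 to 100 or 0.01 to 1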

**Use class_weight='balanced' to have scikit-learn compute the balanced weights**

In [None]:
# define model
model = LogisticRegression(solver='lbfgs', class_weight='balanced')
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))

**Assignment**:<br>
1. Change the size of the dataset. What happens to the scores?

**Assignment**: <br>
Below is code for multi-class imbalanced datasets.<br>
Substitute this code for the code above and rerun. What is the difference with multiclass classification? Try different ratios. What happens when the classes are balanced?<br>


In [None]:
weight_of_classes=[0.98,0.01,0.01]
X, y = make_classification(n_classes = 3,n_samples=10000, n_features=2, n_redundant=0,
      n_clusters_per_class=1, weights=weight_of_classes, flip_y=0, random_state=2)
# summarize class distribution
counter = Counter(y)
print(counter)

In [None]:
# define model
model = LogisticRegression(solver='lbfgs')
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model (no scoring argument is given, so cross_val_score reports accuracy)
scores = cross_val_score(model, X, y, cv=cv, n_jobs=-1)

# summarize performance
print('Mean accuracy: %.3f' % mean(scores))
print("Scores:", scores)
