<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/Cost_Sensitive_1b.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Applying cost sensitivity to a dataset**

Number of Instances: 214

Number of Attributes: 10 (including an Id#) plus the class attribute<br>
   -- all attributes are continuously valued<br>

 Attribute Information:<br>
   1. Id number: 1 to 214
   2. RI: refractive index
   3. Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)
   4. Mg: Magnesium
   5. Al: Aluminum
   6. Si: Silicon
   7. K: Potassium
   8. Ca: Calcium
   9. Ba: Barium<br>
   10. Fe: Iron<br>
   11. Type of glass: (class attribute)<br>
      -- 1 building_windows_float_processed<br>
      -- 2 building_windows_non_float_processed<br>
      -- 3 vehicle_windows_float_processed<br>
      -- 4 vehicle_windows_non_float_processed (none in this database)<br>
      -- 5 containers<br>
      -- 6 tableware<br>
      -- 7 headlamps<br>

Missing Attribute Values: None

Class Distribution (out of 214 total instances):

- **163 window glass (building windows and vehicle windows)**
  - 87 float processed
    - 70 building windows
    - 17 vehicle windows
  - 76 non-float processed
    - 76 building windows
    - 0 vehicle windows
- **51 non-window glass**
  - 13 containers
  - 9 tableware
  - 29 headlamps
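
To make the imbalance concrete, the short sketch below turns the class counts listed above into shares of the dataset and ratios against the largest class. The dictionary `class_counts` is typed in by hand from the listing; it is not read from `glass.csv`.

```python
# Class counts copied by hand from the class distribution listing
# (class 4 has no samples, so it is left out).
class_counts = {1: 70, 2: 76, 3: 17, 5: 13, 6: 9, 7: 29}

total = sum(class_counts.values())
largest = max(class_counts.values())

for label, n in sorted(class_counts.items()):
    share = n / total      # fraction of all instances in this class
    ratio = largest / n    # size of the largest class relative to this one
    print(f"class {label}: {n:3d} samples, {share:6.1%} of the data, "
          f"largest class is {ratio:4.1f}x this size")
```

The smallest class (tableware, 9 samples) is more than eight times smaller than the largest one, which is why a model trained on the raw data can largely ignore it.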


In [None]:
# Clone the entire repo.
!git clone -s https://github.com/cagBRT/Data.git cloned-repo
%cd cloned-repo

In [None]:
!pip install scikit-plot

In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from scikitplot.metrics import plot_roc
from scikitplot.metrics import plot_precision_recall
from imblearn.under_sampling import ClusterCentroids
from imblearn.over_sampling import SMOTE

**Load the dataset** 

In [None]:
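# skiprows=1 drops the first line of the file. With the default header handling,
# the next line is then read as the header row and replaced by the names below.
df = pd.read_csv('glass.csv', skiprows=1)
df.columns = ['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe', 'Type']
df.head()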

In [None]:
# Every column except the class label is a feature.
features = [col for col in df.columns if col != 'Type']
X = df[features]
y = df['Type']

In [None]:
X

**Split into training and test sets**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)
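
With classes as small as 9 samples, a purely random 80/20 split can leave very few examples of a rare class in the test set. One option, which this notebook does not use, is the `stratify` argument of `train_test_split`. The sketch below is only an illustration on toy labels (`X_toy` and `y_toy` are made up to mirror the class sizes listed in the dataset description).

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels with the same class sizes as the glass data (assumption:
# built by hand, not loaded from glass.csv).
y_toy = [1] * 70 + [2] * 76 + [3] * 17 + [5] * 13 + [6] * 9 + [7] * 29
X_toy = [[i] for i in range(len(y_toy))]  # placeholder features

# stratify=y_toy keeps the class proportions roughly equal in both splits.
_, _, _, y_test_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=100, stratify=y_toy
)

test_counts = Counter(y_test_strat)
print(sorted(test_counts.items()))
```

Stratifying changes which rows land in each split, so results would not be directly comparable with the unstratified split used in the rest of the notebook.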

**Note the imbalanced dataset**

In [None]:
count = y_train.value_counts()
count.plot.bar()
plt.ylabel('Number of records')
plt.xlabel('Glass Type')
plt.show()

**Create and train a model.** <br>
**Then test it** 

In [None]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_score = model.predict_proba(X_test)
y_pred = model.predict(X_test)

In [None]:
plot_roc(y_test, y_score)
plt.show()

In [None]:
plot_precision_recall(y_test, y_score)
plt.show()

**The median number of samples per class**

In [None]:
n_samples = int(count.median())
print(n_samples)
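
As a small worked example of this step, the sketch below computes the same quantity with the standard library on made-up labels (`y_example` is hypothetical). With an even number of classes the median is the average of the two middle counts, and `int()` truncates any fractional part.

```python
from collections import Counter
from statistics import median

# Hypothetical training labels, for illustration only.
y_example = [1] * 50 + [2] * 60 + [3] * 12 + [5] * 10 + [6] * 6 + [7] * 24

counts_example = Counter(y_example)
# Sorted counts are 6, 10, 12, 24, 50, 60, so the median is (12 + 24) / 2.
n_samples_example = int(median(counts_example.values()))
print(n_samples_example)
```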

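Define a function that builds the resampling dictionary for either undersampling or oversampling, depending on the `t` input parameter.<br>

If `t='under'`, the classes with more than `n_samples` records are selected for undersampling.<br>
If `t='over'`, the classes with fewer than `n_samples` records are selected for oversampling.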

In [None]:
def sampling_strategy(X, y, n_samples, t):
    """Return {class_label: n_samples} for the classes to resample.

    t='under' selects classes with more than n_samples records;
    t='over' selects classes with fewer than n_samples records.
    X is not used; it is kept so existing calls keep working.
    """
    counts = y.value_counts()
    if t == 'under':
        mask = counts > n_samples
    elif t == 'over':
        mask = counts < n_samples
    else:
        raise ValueError("t must be 'under' or 'over'")
    return {label: n_samples for label in counts[mask].index}
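
To see what kind of dictionary this produces, here is a plain-Python sketch of the same selection rule applied to made-up labels. The helper `strategy_sketch` and the list `labels_example` are hypothetical and exist only for this illustration; the notebook itself uses `sampling_strategy` on pandas objects.

```python
from collections import Counter

def strategy_sketch(labels, n_samples, t):
    """Plain-Python version of the selection rule, for illustration only."""
    counts = Counter(labels)
    if t == 'under':
        return {c: n_samples for c, n in counts.items() if n > n_samples}
    if t == 'over':
        return {c: n_samples for c, n in counts.items() if n < n_samples}
    raise ValueError("t must be 'under' or 'over'")

# Hypothetical labels whose per-class median count is 18.
labels_example = [1] * 50 + [2] * 60 + [3] * 12 + [5] * 10 + [6] * 6 + [7] * 24

under_example = strategy_sketch(labels_example, 18, 'under')
over_example = strategy_sketch(labels_example, 18, 'over')
print(under_example)  # classes that will be shrunk to 18 samples
print(over_example)   # classes that will be grown to 18 samples
```

A class whose count already equals `n_samples` appears in neither dictionary and is left untouched by both resampling steps.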

**Undersample the larger classes to get the median number of samples**<br>
Use ClusterCentroids to undersample the larger classes

In [None]:
columns = sampling_strategy(X_train,y_train,n_samples,t='under')
# sampling_strategy is keyword-only in recent imbalanced-learn releases.
under_sample = ClusterCentroids(sampling_strategy=columns)
X_under, y_under = under_sample.fit_resample(X_train, y_train)
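
The idea behind cluster-centroid undersampling can be sketched directly: cluster the samples of an over-represented class with k-means and keep the centroids in place of the original rows. The code below is a simplified illustration on random data (`X_major` and `n_keep` are made up), not a reimplementation of `ClusterCentroids`.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_major = rng.normal(size=(60, 2))  # hypothetical majority-class samples
n_keep = 18                         # target number of samples for this class

# One cluster per sample we want to keep; the centroids become the new samples.
kmeans = KMeans(n_clusters=n_keep, n_init=10, random_state=0).fit(X_major)
X_reduced = kmeans.cluster_centers_

print(X_major.shape, "->", X_reduced.shape)
```

The centroids are averaged points, so the undersampled rows are generally not rows of the original dataset.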

In [None]:
y_under_s = pd.Series(y_under)

In [None]:
count = y_under_s.value_counts()
count.plot.bar()
plt.ylabel('Number of records')
plt.xlabel('Glass Type')
plt.show()

**Oversample the smaller classes using SMOTE**

In [None]:
columns_under=sampling_strategy(X_under, y_under_s,n_samples, t='over')
# sampling_strategy is keyword-only in recent imbalanced-learn releases.
over_sampler = SMOTE(sampling_strategy=columns_under, k_neighbors=2)
X_bal, y_bal = over_sampler.fit_resample(X_under, y_under)
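
SMOTE creates each synthetic sample by picking a minority-class point, choosing one of its nearest neighbors from the same class, and interpolating between the two. The sketch below shows only that interpolation step on two made-up points (`smote_point`, `x`, and `neighbor` are hypothetical names). A class needs more samples than `k_neighbors`, which is presumably why a small value such as 2 is used here for the rarest glass types.

```python
import random

def smote_point(x, neighbor, rng):
    """Return a random point on the segment between x and its neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]

rng = random.Random(0)
x = [1.0, 2.0]         # a minority-class sample (made up)
neighbor = [3.0, 6.0]  # one of its same-class nearest neighbors (made up)

new_point = smote_point(x, neighbor, rng)
print(new_point)
```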

In [None]:
y_bal_s=pd.Series(y_bal)

**The dataset is now balanced, with 19 instances in each class**

In [None]:
count = y_bal_s.value_counts()
count.plot.bar()
plt.ylabel('Number of records')
plt.xlabel('Glass Type')
plt.show()

**Create and train a model on the balanced dataset**

In [None]:
model = KNeighborsClassifier()
model.fit(X_bal, y_bal_s)
y_score_balanced = model.predict_proba(X_test)
y_pred_balanced = model.predict(X_test)

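**Compare the imbalanced dataset performance to the balanced dataset performance**<br>
Did balancing the dataset improve the performance of the model?<br>
The ROC curves look like they improved, but the precision-recall curves did not.<br>
Keep in mind that the baseline above is a `DecisionTreeClassifier` while the balanced model is a `KNeighborsClassifier`, so the difference reflects the change of model as well as the resampling.

Note: the classes are 1, 2, 3, 5, 6, 7 (six types of glass).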

In [None]:
# Plot metrics 
plot_roc(y_test, y_score, title="Imbalanced Dataset ROC")
plot_roc(y_test, y_score_balanced, title="Balanced Dataset ROC")
plt.show()

In [None]:
plot_precision_recall(y_test, y_score,title="Imbalanced Dataset Precision-Recall")
plot_precision_recall(y_test, y_score_balanced, title="Balanced Dataset Precision-Recall")
plt.show()

In [None]:
from sklearn.metrics import f1_score
f1_metric=f1_score(y_test,y_pred,average=None)
print("F1 score for each imbalanced class:",f1_metric)

f1_metric=f1_score(y_test,y_pred_balanced,average=None)
print("F1 score for each balanced class  :",f1_metric)
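
With `average=None`, `f1_score` returns one value per class, ordered by sorted class label. The per-class F1 is `2*TP / (2*TP + FP + FN)`, treating that class as positive and all others as negative. The sketch below computes it by hand on a tiny made-up example (`per_class_f1`, `y_true_toy`, and `y_pred_toy` are hypothetical) to show how a rare class that is never predicted correctly ends up with a score of 0.

```python
def per_class_f1(y_true, y_pred):
    """Per-class F1 computed from true/false positive and negative counts."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

y_true_toy = [1, 1, 1, 2, 2, 6]
y_pred_toy = [1, 1, 2, 2, 2, 1]  # the single class-6 sample is misclassified

f1_toy = per_class_f1(y_true_toy, y_pred_toy)
print(f1_toy)
```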

**Assignment**<br>
Change the value of `n_samples` and rerun the resampling cells.<br>
Is the median of the class counts the best choice?

# **Use the class_weight parameter when creating the model**

Recall that the `class_weight` parameter gives more importance to the minority classes. The weights are computed here with `compute_class_weight`.

In [None]:
from sklearn.utils import class_weight
classes = np.unique(y_train)
cw = class_weight.compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
weights = dict(zip(classes,cw))
print(weights)
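
The `'balanced'` setting assigns each class the weight `n_samples / (n_classes * class_count)`, so rare classes get weights above 1 and common classes get weights below 1. The sketch below reproduces that formula in plain Python on made-up labels (`balanced_weights` and `labels_toy` are hypothetical names used only here).

```python
from collections import Counter

def balanced_weights(labels):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_total, n_classes = len(labels), len(counts)
    return {c: n_total / (n_classes * n) for c, n in sorted(counts.items())}

# Hypothetical labels: 162 samples spread over 6 classes.
labels_toy = [1] * 50 + [2] * 60 + [3] * 12 + [5] * 10 + [6] * 6 + [7] * 24

weights_toy = balanced_weights(labels_toy)
for label, w in weights_toy.items():
    print(f"class {label}: weight {w:.2f}")
```

In this toy example the rarest class (6 samples) is weighted ten times more heavily than the most common one (60 samples).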

In [None]:
model = DecisionTreeClassifier(class_weight=weights)
model.fit(X_train, y_train)
y_score_weight = model.predict_proba(X_test)
y_pred_weight = model.predict(X_test)

In [None]:
# Plot metrics 
plot_roc(y_test, y_score, title="Imbalanced Dataset ROC")
plot_roc(y_test, y_score_weight, title="Weighted Dataset ROC")
plt.show()

In [None]:
plot_precision_recall(y_test, y_score,title="Imbalanced Dataset Precision-Recall")
plot_precision_recall(y_test, y_score_weight,title="Weighted Dataset Precision-Recall")
plt.show()

In [None]:
f1_metric=f1_score(y_test,y_pred,average=None)
print("F1 score for each imbalanced class:",f1_metric)

f1_metric=f1_score(y_test,y_pred_weight,average=None)
print("F1 score for each weighted class  :",f1_metric)