## PSEUDO_CODE_MetaCost

https://www.udemy.com/course/machine-learning-with-imbalanced-data/learn/lecture/23765586#overview

Inputs:
- S : the training set
- L : the classification algorithm
- C : the Cost matrix
- m : number of resamples to generate
- n : number of examples in each resample
- p : True iff L produces class probabilities
- q : True iff all resamples are to be used for each example


Procedure MetaCost(S, L, C, m, n, p, q)

For i = 1 to m
    Let Si be a resample of S with n examples
    Let Mi = Model produced by applying L to Si

For each example x in S
    For each class j, Let $P(j|x) = \frac{1}{\sum\limits_{i} 1}\sum\limits_{i}P(j|x,M_i)$
    
Where:
If p then P(j|x,Mi) is produced by Mi
Else P(j|x,Mi) = 1 for the class predicted by Mi for x, and 0 for all others
If q then i ranges over all Mi
Else i ranges over all Mi such that $x \notin S_i$
Let x's class = $\arg\min_{i}\sum\limits_{j}P(j|x)C(i,j)$

Let M = Model produced by applying L to S (with the re-assigned classes)
Return M
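The procedure above can be sketched in a few lines of NumPy and scikit-learn. This is only an illustration under stated assumptions, not the course's `MetaCost` class: `metacost_relabel` is a hypothetical helper, the toy data is made up, `X` and `y` are NumPy arrays, and every resample is assumed to contain all classes so that the probability columns line up.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression


def metacost_relabel(X, y, estimator, cost_matrix, m=10, n=None,
                     p=True, q=True, seed=0):
    """Re-label y following the MetaCost pseudocode (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_obs = len(y)
    n = n_obs if n is None else n
    classes = np.unique(y)
    prob_sum = np.zeros((n_obs, len(classes)))
    n_models = np.zeros(n_obs)

    for _ in range(m):
        idx = rng.choice(n_obs, size=n, replace=True)   # resample S_i
        model = clone(estimator).fit(X[idx], y[idx])    # model M_i
        if p:   # M_i produces class probabilities
            proba = model.predict_proba(X)
        else:   # 1 for the predicted class, 0 for all others
            proba = (model.predict(X)[:, None] == classes).astype(float)
        # q=True: use every M_i; q=False: only the M_i with x not in S_i
        use = np.ones(n_obs, dtype=bool) if q else ~np.isin(np.arange(n_obs), idx)
        prob_sum[use] += proba[use]
        n_models[use] += 1

    P = prob_sum / np.maximum(n_models, 1)[:, None]     # averaged P(j|x)
    expected_cost = P @ cost_matrix.T   # [x, i] = sum_j P(j|x) * C(i, j)
    new_y = classes[expected_cost.argmin(axis=1)]
    new_y[n_models == 0] = y[n_models == 0]   # no usable model: keep the label
    return new_y


# made-up toy data, only to exercise the sketch
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 1.0).astype(int)

base = LogisticRegression()
uniform = metacost_relabel(X, y, base, np.array([[0, 1], [1, 0]]))
costly = metacost_relabel(X, y, base, np.array([[0, 100], [1, 0]]))

# last step of the procedure: train L on S with the new labels
final_model = clone(base).fit(X, costly)
```

Because both calls share the same seed, they see the same resamples, so every observation labelled 1 under the uniform matrix is also labelled 1 under the matrix that penalises false negatives.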

## Using MetaCost to find model performance using Logistic Regression

In [1]:
# import libraries

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression

from metacost import MetaCost

In [4]:
# import data

df = pd.read_csv('../kdd2004.csv').sample(10000, random_state=0)
df.head()

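| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 46233 | 51.02 | 22.08 | 0.92 | 31.5 | 10.5 | 1910.7 | -1.47 | -0.74 | -8.0 | -52.0 | ... | 879.5 | 1.58 | -0.45 | -5.0 | -30.0 | 291.7 | -0.12 | 0.47 | 0.96 | -1 |
| 58625 | 64.17 | 24.6 | -0.21 | -35.5 | 26.0 | 4585.3 | -1.1 | 1.17 | -27.5 | -121.5 | ... | 4815.7 | -1.09 | 5.09 | 25.0 | -220.0 | 475.4 | 2.32 | 0.42 | 0.46 | -1 |
| 5231 | 86.09 | 29.63 | 3.24 | 78.5 | -89.0 | 453.2 | 1.87 | 4.58 | 63.0 | -119.5 | ... | 144.9 | 1.25 | 2.5 | 3.0 | -24.0 | 64.8 | -0.85 | 0.59 | 0.94 | 1 |
| 58042 | 78.57 | 21.37 | 0.36 | -7.0 | 38.5 | 1779.1 | -0.25 | -0.03 | -3.5 | -62.5 | ... | 1471.3 | -0.12 | 1.48 | -5.0 | -62.0 | 406.9 | 0.18 | 0.41 | 0.68 | -1 |
| 128067 | 79.13 | 24.18 | 0.78 | -3.0 | -16.0 | 844.1 | 0.48 | -0.56 | -6.5 | -52.0 | ... | 633.8 | 0.43 | 1.3 | 5.0 | -29.0 | 165.0 | -0.1 | 0.09 | -0.41 | -1 |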


In [5]:
X = df.drop('target', axis = 1)
y = df['target']

y.value_counts()

-1    9904
 1      96
Name: target, dtype: int64

In [7]:
y = y.map({-1:0, 1:1})
y.value_counts()

0    9904
1      96
Name: target, dtype: int64

In [8]:
# split the data into train and test set

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3,random_state=0)

X_train.shape, X_test.shape

((7000, 74), (3000, 74))

## Set up Logistic Regression model

In [9]:
# set up the estimator that we would like to ensemble

log = LogisticRegression(penalty='l2',
                         max_iter=10,
                         solver='newton-cg',
                         random_state=0,
                         n_jobs=2)

## MetaCost

- With no cost (every misclassification costs 1)

In [10]:
cost_matrix = np.array([[0,1],[1,0]])
cost_matrix

array([[0, 1],
       [1, 0]])
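With this uniform matrix, choosing the class with the lowest expected cost is the same as choosing the most probable class, so MetaCost should add no cost-sensitivity. A minimal check of that claim, using made-up probabilities `P` (not output from the notebook) and assuming C(i, j) is indexed as in the pseudocode (i predicted, j true):

```python
import numpy as np

cost_matrix = np.array([[0, 1], [1, 0]])

# made-up averaged probabilities P(j|x) for three observations
P = np.array([[0.9, 0.1],
              [0.4, 0.6],
              [0.7, 0.3]])

# expected_cost[x, i] = sum_j P(j|x) * C(i, j)
expected_cost = P @ cost_matrix.T

labels_by_cost = expected_cost.argmin(axis=1)   # MetaCost decision rule
labels_by_prob = P.argmax(axis=1)               # plain "most probable class"

print(labels_by_cost, labels_by_prob)
```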

In [11]:
# see parameter description here
# https://www.udemy.com/course/machine-learning-with-imbalanced-data/learn/lecture/23765586#overview

metacost_ = MetaCost(estimator=log,
                     cost_matrix=cost_matrix,
                     n_estimators=50,
                     n_samples=None,
                     p=True,
                     q=True)

In [12]:
# fit() trains the ensemble, re-labels the training data, and then fits the final model on the re-labelled data
metacost_.fit(X_train, y_train)

resampling data and training ensemble
Finished training ensemble
evaluating optimal class per observation
Finished re-assigning labels
Training model on new data
Finished training model on data with new labels


In [14]:
metacost_.predict_proba(X_train)



array([[9.99888503e-01, 1.11497252e-04],
       [9.90811298e-01, 9.18870208e-03],
       [1.00000000e+00, 8.05627179e-11],
       ...,
       [9.99999965e-01, 3.45265000e-08],
       [9.96681354e-01, 3.31864637e-03],
       [1.00000000e+00, 8.44431700e-11]])

In [15]:
print('Train set')
pred = metacost_.predict_proba(X_train)
print(
    'MetaCost roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))

print('Test set')
pred = metacost_.predict_proba(X_test)
print(
    'MetaCost roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Train set
MetaCost roc-auc: 0.9023866090910255
Test set
MetaCost roc-auc: 0.9149888584330816




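## MetaCost

- With cost

Layout of the cost matrix (correct predictions on the diagonal cost 0; a false negative costs 100 and a false positive costs 1):

TP | FN

FP | TN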

In [17]:
cost_matrix = np.array([[0,100],[1,0]])
cost_matrix

array([[  0, 100],
       [  1,   0]])
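The effect of this matrix on the re-labelling step can be worked out by hand from the decision rule in the pseudocode. The derivation below assumes that the `MetaCost` class indexes the matrix as C(i, j), with i the predicted class and j the true class; that indexing is an assumption about the implementation, not something shown in this notebook.

$$
\begin{aligned}
\text{cost of predicting } 0 &= P(0|x)\,C(0,0) + P(1|x)\,C(0,1) = 100\,P(1|x) \\
\text{cost of predicting } 1 &= P(0|x)\,C(1,0) + P(1|x)\,C(1,1) = P(0|x) = 1 - P(1|x) \\
\text{label } x \text{ as } 1 &\iff 1 - P(1|x) < 100\,P(1|x) \iff P(1|x) > \tfrac{1}{101} \approx 0.0099
\end{aligned}
$$

Under that assumption, any observation whose averaged ensemble probability of class 1 exceeds roughly 1% is labelled 1, instead of the 50% threshold implied by the uniform matrix.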

In [19]:
metacost2 = MetaCost(estimator=log,
                     cost_matrix=cost_matrix,
                     n_estimators=50,
                     n_samples=None, # None means we will use all the observations to create the samples
                     p=True, # use predicted probabilities, not class labels, as the intermediate output
                     q=True) # use all the resamples for every observation

In [20]:
metacost2.fit(X_train,y_train)

resampling data and training ensemble
Finished training ensemble
evaluating optimal class per observation
Finished re-assigning labels
Training model on new data
Finished training model on data with new labels


In [21]:
metacost2.predict_proba(X_train)



array([[9.27143717e-01, 7.28562827e-02],
       [1.94790832e-01, 8.05209168e-01],
       [9.99994104e-01, 5.89550172e-06],
       ...,
       [9.98786173e-01, 1.21382728e-03],
       [7.49275305e-01, 2.50724695e-01],
       [9.99998374e-01, 1.62632331e-06]])

In [22]:
print('Train set')
pred = metacost2.predict_proba(X_train)
print(
    'MetaCost roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))

print('Test set')
pred = metacost2.predict_proba(X_test)
print(
    'MetaCost roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Train set
MetaCost roc-auc: 0.9342964633412528
Test set
MetaCost roc-auc: 0.928555934609848




In [23]:
y_train

0       0
1       0
2       0
3       0
4       0
       ..
6995    0
6996    0
6997    0
6998    0
6999    0
Name: target, Length: 7000, dtype: int64

In [24]:
y_train.reset_index(drop = True)

0       0
1       0
2       0
3       0
4       0
       ..
6995    0
6996    0
6997    0
6998    0
6999    0
Name: target, Length: 7000, dtype: int64

In [25]:
metacost2.y_

0       0
1       1
2       0
3       0
4       0
       ..
6995    0
6996    1
6997    0
6998    1
6999    0
Length: 7000, dtype: int64

In [26]:
tmp = pd.concat([metacost2.y_, y_train.reset_index(drop=True)], axis=1)

tmp.head()

| | 0 | target |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 0 |
| 2 | 0 | 0 |
| 3 | 0 | 0 |
| 4 | 0 | 0 |


In [27]:
tmp[tmp[0]!=tmp['target']][['target', 0]]

| | target | 0 |
|---|---|---|
| 1 | 0 | 1 |
| 9 | 0 | 1 |
| 10 | 0 | 1 |
| 13 | 0 | 1 |
| 14 | 0 | 1 |
| ... | ... | ... |
| 6975 | 0 | 1 |
| 6978 | 0 | 1 |
| 6979 | 0 | 1 |
| 6996 | 0 | 1 |


In theory, we should only be re-labelling observations from class 0 to class 1, but in practice some observations can also be re-labelled from class 1 to class 0.

In [28]:
np.sum( np.where(metacost2.y_ != y_train.reset_index(drop=True),1,0) )

1338

In [29]:
np.sum( np.where(metacost2.y_ == y_train.reset_index(drop=True),1,0) )

5662

- So MetaCost re-labelled 1338 of the 7000 training observations (about 19%); the other 5662 kept their original label.
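The counts above say how many labels changed but not in which direction. One way to see the direction is a cross-tabulation of original against re-assigned labels. The sketch below uses two short made-up Series (`original` and `relabelled` are illustrative names and values, not data from this notebook):

```python
import pandas as pd

# made-up labels, standing in for y_train and the MetaCost labels
original = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="original")
relabelled = pd.Series([0, 1, 0, 1, 1, 1, 0, 1], name="metacost")

# rows: original class, columns: class assigned by MetaCost
table = pd.crosstab(original, relabelled)
print(table)

n_0_to_1 = table.loc[0, 1]   # class 0 observations re-labelled as 1
n_1_to_0 = table.loc[1, 0]   # class 1 observations re-labelled as 0
```

In the notebook, the same call would presumably be `pd.crosstab(y_train.reset_index(drop=True), metacost2.y_)`, assuming `metacost2.y_` is a pandas Series aligned by position with `y_train`, as its printed output suggests.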

## Conclusion

We can wrap a model with MetaCost to make it cost-sensitive.

### Important

The code here does not give reproducible results, because at the moment the MetaCost class does not use a seed when re-sampling the data.
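For reference, seeded re-sampling could look roughly like the sketch below. It is not how the `MetaCost` class is written: `seeded_resamples` is a hypothetical helper and the arrays are made up. It relies on `sklearn.utils.resample`, which accepts a `random_state`.

```python
import numpy as np
from sklearn.utils import resample


def seeded_resamples(X, y, n_estimators, seed):
    """Draw bootstrap resamples that are repeatable for a given seed (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    samples = []
    for _ in range(n_estimators):
        # sharing one RandomState gives a different resample on each
        # iteration, but the same sequence on every run with this seed
        samples.append(resample(X, y, replace=True, random_state=rng))
    return samples


# made-up data, only to exercise the helper
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

run1 = seeded_resamples(X, y, n_estimators=3, seed=0)
run2 = seeded_resamples(X, y, n_estimators=3, seed=0)

same = all(np.array_equal(a[0], b[0]) and np.array_equal(a[1], b[1])
           for a, b in zip(run1, run2))
print(same)
```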

MetaCost might be incorporated into scikit-learn; there is an open PR:
https://github.com/scikit-learn/scikit-learn/pull/16525
