# Imbalanced Data
## Discrete Classification
### Issues with Vanilla Accuracy
- High class imbalance can hide poor classification performance behind a high metric value  
 *e.g. a classifier that always assigns the majority class to a new instance will achieve 99% accuracy where the majority class is 99% prevalent*
- Accuracy assumes all errors are equally costly, whereas in imbalanced classification misclassifying a minority-class instance is generally much costlier
- Accuracy assumes the class proportions are static
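
The first point can be checked with a few lines of arithmetic. This is a minimal sketch on made-up labels (990 majority, 10 minority instances), not on the dataset loaded later; the "classifier" is simply a constant prediction.

```python
# Hypothetical toy labels: 990 majority (0) and 10 minority (1) instances.
y_true = [0] * 990 + [1] * 10

# A "classifier" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: share of actual positives that were found.
n_pos = sum(y_true)
minority_recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / n_pos

print(f"accuracy = {accuracy:.2f}, minority recall = {minority_recall:.2f}")
# accuracy = 0.99, minority recall = 0.00
```

The accuracy looks excellent even though not a single minority instance is detected.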




### Cohen's Kappa
$$
Acc_e = \frac{\textbf{E}(TP)+\textbf{E}(TN)}{N}
$$
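where $\textbf{E}$ denotes the count expected under chance agreement, $POS$ and $NEG$ are the numbers of actual positives and negatives, and $P(POS)$ and $P(NEG)$ are read here as the numbers of instances predicted positive and negative: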

$$
\textbf{E}(TP) = \frac{POS\times P(POS)}{N}
$$

$$
\textbf{E}(TN) = \frac{NEG\times P(NEG)}{N}
$$

Cohen's kappa compares the observed accuracy $Acc_0$ with the expected accuracy $Acc_e$:
$$
\kappa = \frac{Acc_0-Acc_e}{1-Acc_e}
$$
Values less than zero indicate that the classifier performs worse than random guessing.
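
The formulas above can be sketched directly from the four confusion-matrix counts. The function name `cohens_kappa` and the example counts are illustrative assumptions, and the expected counts treat $P(POS)$ and $P(NEG)$ as predicted counts, which is one reading of the notation above.

```python
def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a binary confusion matrix given as raw counts."""
    n = tp + fp + fn + tn
    acc_0 = (tp + tn) / n                      # observed accuracy

    actual_pos, actual_neg = tp + fn, tn + fp  # POS, NEG
    pred_pos, pred_neg = tp + fp, tn + fn      # predicted counts

    e_tp = actual_pos * pred_pos / n           # expected TP under chance
    e_tn = actual_neg * pred_neg / n           # expected TN under chance
    acc_e = (e_tp + e_tn) / n                  # expected accuracy

    # Note: undefined (division by zero) when acc_e == 1.
    return (acc_0 - acc_e) / (1 - acc_e)


# Always predicting the majority class: 99% accuracy, but kappa is 0.
print(cohens_kappa(tp=0, fp=0, fn=10, tn=990))
# A hypothetical classifier that finds 8 of 10 positives with 5 false alarms.
print(cohens_kappa(tp=8, fp=5, fn=2, tn=985))
```

Unlike accuracy, kappa gives the always-majority classifier no credit, because its observed accuracy equals the accuracy expected by chance.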

### Weighted Loss Function
Since misclassifying the minority class may be significantly costlier than misclassifying the majority, we can apply selective weights to the loss function by weighting each outcome as follows:
$$
L = \frac{w_{+|+}TP + w_{+|-}FP + w_{-|+}FN + w_{-|-}TN}{N}
$$

#### Balanced Class Distribution
Setting the misclassification costs to the inverse of the class proportions, i.e. $w_{+|-} = N/NEG$ and $w_{-|+} = N/POS$, makes each class contribute equally to the loss regardless of its prevalence.
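
A minimal sketch of the weighted loss with these balanced costs follows. It assumes that correct outcomes cost nothing ($w_{+|+} = w_{-|-} = 0$), which the notes do not state; under that assumption the loss reduces to $FP/NEG + FN/POS$, the sum of the false positive and false negative rates. The function names are hypothetical.

```python
def weighted_loss(tp, fp, fn, tn, w_fp, w_fn, w_tp=0.0, w_tn=0.0):
    """Weighted loss L computed from confusion-matrix counts."""
    n = tp + fp + fn + tn
    return (w_tp * tp + w_fp * fp + w_fn * fn + w_tn * tn) / n


def balanced_loss(tp, fp, fn, tn):
    """Weighted loss with inverse-class-proportion costs.

    Assumption: correct outcomes (TP, TN) carry zero cost.
    """
    n = tp + fp + fn + tn
    pos, neg = tp + fn, tn + fp
    return weighted_loss(tp, fp, fn, tn, w_fp=n / neg, w_fn=n / pos)


# Always predicting the majority class misses every positive: FPR + FNR = 0 + 1.
print(balanced_loss(tp=0, fp=0, fn=10, tn=990))  # 1.0
```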

### F-Measure 
$$
F_\beta = (1+\beta^2)\frac{\text{precision}\times\text{recall}}{\beta^2\text{precision} + \text{recall}}
$$
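
As a rough illustration, the F-measure can be computed from confusion-matrix counts as below. The helper name `f_beta` and the example counts are assumptions, and returning 0.0 when the score is undefined is a convention chosen here rather than part of the definition.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score from confusion-matrix counts; 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Hypothetical counts: precision = 8/13, recall = 8/10.
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(tp=8, fp=5, fn=2, beta=beta), 3))
```

Values of $\beta > 1$ weight recall more heavily, while $\beta < 1$ favours precision; since recall is the higher of the two here, the score rises with $\beta$.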

## Continuous Output
With a fixed threshold, the discrete output cannot distinguish more likely from less likely predictions within a given class; continuous scores keep that information.

### ROC Curve
> TPR vs FPR

For each possible threshold value, a point on the ROC curve is plotted from the TPR and FPR at that threshold.
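
The threshold sweep can be sketched in plain Python. The labels and scores below are invented for illustration, and `roc_points` is a hypothetical helper that assumes both classes are present; each distinct score is used as a threshold, with a score at or above it counted as a positive prediction.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, highest first."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Threshold above every score: nothing is predicted positive.
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical labels (1 = positive) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
for fpr, tpr in roc_points(y_true, scores):
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

The curve always runs from (0, 0) to (1, 1); a better classifier bends further towards the top left corner.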

### PR Graph
> Precision vs Recall

Focuses ONLY on the positive class  
Good classifiers achieve a strong trade-off between precision and recall, so their curves lie towards the top right of the graph
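
The same kind of threshold sweep gives the points of a PR graph. This is a sketch on invented labels and scores with a hypothetical helper `pr_points`; note that true negatives never enter the computation, which is why the graph reflects only the positive class.

```python
def pr_points(y_true, scores):
    """Return (recall, precision) pairs, one per distinct threshold, highest first."""
    n_pos = sum(y_true)
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        # tp + fp >= 1 because the threshold is itself one of the scores.
        points.append((tp / n_pos, tp / (tp + fp)))
    return points


# Hypothetical labels (1 = positive) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
for recall, precision in pr_points(y_true, scores):
    print(f"recall = {recall:.2f}, precision = {precision:.2f}")
```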

In [1]:
# !conda install -y statsmodels
# !conda install -y plotly
## imports and data loading
import os
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import sklearn
import imblearn
import statsmodels.api as sm
import dask
import dask.dataframe as dd
from dask_ml.model_selection import train_test_split
# dask_ml's LogisticRegression is the one used below; importing sklearn's
# class under the same name beforehand would simply be shadowed by it
from dask_ml.linear_model import LogisticRegression
from dask_ml.preprocessing import LabelEncoder

DATA_DIR = 'data'

COLS = [
'duration', 
'protocol_type', 
'service', 
'flag', 
'src_bytes', 
'dst_bytes', 
'land', 
'wrong_fragment', 
'urgent', 
'hot', 
'num_failed_logins', 
'logged_in', 
'num_compromised', 
'root_shell', 
'su_attempted', 
'num_root', 
'num_file_creations', 
'num_shells', 
'num_access_files', 
'num_outbound_cmds', 
'is_host_login', 
'is_guest_login', 
'count', 
'srv_count', 
'serror_rate', 
'srv_serror_rate', 
'rerror_rate', 
'srv_rerror_rate', 
'same_srv_rate', 
'diff_srv_rate', 
'srv_diff_host_rate', 
'dst_host_count', 
'dst_host_srv_count', 
'dst_host_same_srv_rate', 
'dst_host_diff_srv_rate', 
'dst_host_same_src_port_rate', 
'dst_host_srv_diff_host_rate', 
'dst_host_serror_rate', 
'dst_host_srv_serror_rate', 
'dst_host_rerror_rate', 
'dst_host_srv_rerror_rate',
'normal'
]

df = dd.read_csv(os.path.join(DATA_DIR, 'kddcup.data'), sep=',', header=None, names=COLS)
df['protocol_type'] = df['protocol_type'].astype('category')
df['service'] = df['service'].astype('category')
df['flag'] = df['flag'].astype('category')

# encode the string labels in the 'normal' column as integers
le = LabelEncoder()

df['normal_enc'] = le.fit_transform(df.normal)

# one-hot encode the categorical columns (their categories must be known first)
protocol_type = dd.get_dummies(df.protocol_type.cat.as_known())
service = dd.get_dummies(df.service.cat.as_known())
flag = dd.get_dummies(df.flag.cat.as_known())
data = dd.concat([df, protocol_type, service, flag], axis=1)
# drop the raw label and the original string columns so that only numeric
# features remain; otherwise fitting fails on values such as 'icmp'
data = data.drop(['normal', 'protocol_type', 'service', 'flag'], axis=1)

df.head()

We're assuming that the indices of each dataframes are 
 aligned. This assumption is not generally safe.


| | duration | protocol_type | service | flag | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | ... | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | normal | normal_enc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | tcp | http | SF | 215 | 45076 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. | 11 |
| 1 | 0 | tcp | http | SF | 162 | 4528 | 0 | 0 | 0 | 0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. | 11 |
| 2 | 0 | tcp | http | SF | 236 | 1228 | 0 | 0 | 0 | 0 | ... | 1.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. | 11 |
| 3 | 0 | tcp | http | SF | 233 | 2032 | 0 | 0 | 0 | 0 | ... | 1.0 | 0.0 | 0.33 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. | 11 |
| 4 | 0 | tcp | http | SF | 239 | 486 | 0 | 0 | 0 | 0 | ... | 1.0 | 0.0 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. | 11 |


# Methods of Addressing Imbalanced Datasets
## Cost-Sensitive Learning
- Assigns a stronger penalty to errors on certain classes, forcing training to focus on instances from those classes
- The traditional 0-1 loss is easy to minimize on imbalanced data by focussing on the majority class
- Can be done in two ways: directly (building the misclassification costs into the training procedure) OR via meta-learning (modifying the training data or the model outputs)

<b>Potential Issue: Assignment of the cost matrix</b>

If an expert doesn't assign the costs, they can be learnt from data:

In [11]:
clf = LogisticRegression()
weights = {}

# classes = df.normal.unique().compute()
le.classes_.compute()

array(['back.', 'buffer_overflow.', 'ftp_write.', 'guess_passwd.',
       'imap.', 'ipsweep.', 'land.', 'loadmodule.', 'multihop.',
       'neptune.', 'nmap.', 'normal.', 'perl.', 'phf.', 'pod.',
       'portsweep.', 'rootkit.', 'satan.', 'smurf.', 'spy.', 'teardrop.',
       'warezclient.', 'warezmaster.'], dtype=object)

In [26]:
# map each class name to its encoded integer label
classes = le.classes_.compute()
dict(zip(classes, le.transform(pd.Series(classes, dtype='category'))))

{'back.': 0,
 'buffer_overflow.': 1,
 'ftp_write.': 2,
 'guess_passwd.': 3,
 'imap.': 4,
 'ipsweep.': 5,
 'land.': 6,
 'loadmodule.': 7,
 'multihop.': 8,
 'neptune.': 9,
 'nmap.': 10,
 'normal.': 11,
 'perl.': 12,
 'phf.': 13,
 'pod.': 14,
 'portsweep.': 15,
 'rootkit.': 16,
 'satan.': 17,
 'smurf.': 18,
 'spy.': 19,
 'teardrop.': 20,
 'warezclient.': 21,
 'warezmaster.': 22}

In [31]:
# using the mapping above, define a misclassification cost per class
## class 11 is 'normal.', so it is down-weighted relative to the attack classes

cost_per_class = {i: 10 for i in range(23)}
cost_per_class[11] = 1

clf = LogisticRegression(solver='lbfgs', class_weight=cost_per_class)
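
The costs above are set by hand. One simple way to derive them from the data instead is inverse class frequency, `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn applies for `class_weight='balanced'`. The sketch below uses a small invented label list rather than the KDD labels, and the helper name `inverse_frequency_weights` is an assumption.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Map each class to n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical encoded labels in which class 11 dominates.
labels = [11] * 90 + [9] * 8 + [18] * 2
weights = inverse_frequency_weights(labels)
for cls, w in sorted(weights.items()):
    print(cls, round(w, 3))
```

Rare classes receive the largest weights, and the resulting dictionary has the same shape as `cost_per_class`.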

## Data Preprocessing
Resampling techniques can be categorized into three groups or families:
- Undersampling methods, which create a subset of the original dataset by eliminating instances (usually majority class instances)
- Oversampling methods, which create a superset of the original dataset by replicating some instances or creating new instances from existing ones 
- Hybrid methods, which combine both sampling approaches.
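
The first two families can be sketched in their simplest random form using only the standard library. The function names and the toy data are illustrative assumptions; for real use, imbalanced-learn (imported above as `imblearn`) provides `RandomUnderSampler` and `RandomOverSampler`.

```python
import random
from collections import Counter


def _indices_by_class(y):
    """Group row indices by class label."""
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    return by_class


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    by_class = _indices_by_class(y)
    target = min(len(idx) for idx in by_class.values())
    keep = sorted(i for idx in by_class.values() for i in rng.sample(idx, target))
    return [X[i] for i in keep], [y[i] for i in keep]


def random_oversample(X, y, seed=0):
    """Replicate random rows of each class until it matches the largest class."""
    rng = random.Random(seed)
    by_class = _indices_by_class(y)
    target = max(len(idx) for idx in by_class.values())
    keep = []
    for idx in by_class.values():
        keep.extend(idx)                                    # all original rows
        keep.extend(rng.choices(idx, k=target - len(idx)))  # random replicas
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: 10 majority rows (class 0), 2 minority rows (class 1).
X = [[float(i)] for i in range(12)]
y = [0] * 10 + [1] * 2
print(Counter(random_undersample(X, y)[1]))
print(Counter(random_oversample(X, y)[1]))
```

Undersampling returns a subset of the original rows, while oversampling returns a superset that contains exact copies of minority rows.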

<b>Issues with Types</b>
- Random undersampling can lose important data, while random oversampling increases the likelihood of overfitting
- SMOTE generates synthetic data points by interpolating between each original minority sample and its minority-class nearest neighbours, with no consideration for nearby majority-class instances, which can lead to overlap between classes
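
To make the SMOTE point concrete, here is a simplified sketch of its interpolation step. It is not the full algorithm (it picks base samples at random rather than iterating over every minority sample), and the function name, parameters and points are all hypothetical. The majority class never appears in the code, which is exactly why synthetic points can land in regions occupied by majority instances.

```python
import math
import random


def smote_like(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points by interpolating towards minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical 2-D minority samples, one of them far from the rest.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [3.0, 3.0]]
new_points = smote_like(minority)
for point in new_points:
    print([round(v, 3) for v in point])
```

Every synthetic point lies on a line segment between two minority samples, so an isolated sample such as `[3.0, 3.0]` can pull new points across whatever lies between it and its neighbours.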

In [None]:
# RandomUndersampling


## Algorithm-Level

### Deep Learning

## Ensembles
