## Outlier detection.

In this task you are given a dataset of credit card transactions, a small proportion of which are labeled as fraudulent (see *data description.txt*). Because frauds are so rare compared with regular transactions, standard classification methods are not directly applicable. However, since frauds are atypical, you may use outlier detection methods to identify them.

Your goal is to find which of three outlier detection methods offered in sklearn, with its best parameters, detects frauds best. The quality metric is the area under the ROC curve (AUC).

In [None]:
%pylab inline
%precision 6

In [None]:
import sklearn
import sklearn as skl
import pandas as pd
from pdb import set_trace as bp

In [None]:
np.set_printoptions(linewidth=140,edgeitems=10)
rcParams['figure.figsize'] = (8.0, 5.0)

In [None]:
from common.classes.Struct import Struct
from common.visualize.colors import COLORS
from common.visualize.data import plot_corr
from common.visualize.distributions import cont_dist_classification, pca_2D, cross_distributions
from common.visualize.distributions import cross_distributions_classification, cross_distributions_regression

### Data preparation

In [None]:
Z=pd.read_csv('data.csv')

In [None]:
Z.head()

In [None]:
Z.describe()

In [None]:
random.seed(0)
inds = random.permutation(arange(len(Z)))

Z=Z.loc[inds]

Z.index = arange(len(Z))

In [None]:
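# matplotlib's find() was removed in matplotlib 3.1; np.flatnonzero is the equivalent.
inds0 = np.flatnonzero(Z['Class'].values==0)
inds1 = np.flatnonzero(Z['Class'].values==1)

# To keep computations cheap, use only the first third of each class.
inds0 = inds0[:len(inds0)//3]
inds1 = inds1[:len(inds1)//3]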

inds = hstack( [inds0, inds1] ) 
random.seed(0)
inds = random.permutation(inds)

Z=Z.loc[inds] # to simplify future computations

In [None]:
len(Z)

In [None]:
Z.Class.value_counts()

In [None]:
Z.index = arange(len(Z))

In [None]:
features = ['V%d'%i for i in arange(1,28+1)]

In [None]:
X = Z[features].values
Y = Z['Class'].values

In [None]:
len(X), len(Y)

In [None]:
time=Z.Time.values

#### Plot distribution of the total number of transactions. Does it have any day/night pattern?

#### Plot distribution of the number of fraudulent transactions. Does it have any day/night pattern?
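
One possible way to look for a day/night pattern is to count transactions per hour of day. The sketch below assumes that `Time` is measured in seconds (the 86400-second threshold used for the train/test split suggests this); `hourly_counts` is a hypothetical helper, not part of the `common` package.

```python
import numpy as np


def hourly_counts(time_seconds):
    """Count events per hour of day (0..23) from timestamps given in seconds."""
    hours = (np.asarray(time_seconds, dtype=np.int64) // 3600) % 24
    return np.bincount(hours, minlength=24)


# Toy timestamps (seconds): three fall in hour 0, one falls in hour 1.
demo_counts = hourly_counts([0, 3599, 3600, 86400 + 10])
print(demo_counts[:3])
```

With the notebook's arrays, `hourly_counts(time)` and `hourly_counts(time[Y == 1])` could then be drawn as bar charts. Whether hour 0 corresponds to midnight depends on how `Time` is defined in the data description.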

### Train, test sets

In [None]:
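# matplotlib's find() was removed in matplotlib 3.1; np.flatnonzero is the equivalent.
train_inds = np.flatnonzero(time<86400) # train - first day (86400 seconds)
test_inds = np.flatnonzero(time>=86400) # test - next day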

## Data visualizations

These should be performed on the **train set only**.

#### Estimate class proportions

#### Plot distributions $p(f|y=0), p(f|y=1)$ of all features f.

For each feature f, $p(f|y=0)$ and $p(f|y=1)$ should be drawn on the same graph; different features should be drawn on different graphs.

Useful function: common.visualize.distributions.cont_dist_classification

#### Redraw the graph above for some very discriminative feature. Title it with feature name.

#### Redraw the graph above for some least discriminative feature. Title it with feature name.

### Plot data in first 2 principal component space
Useful function: common.visualize.distributions.pca_2D
    
Are frauds separable?

#### Plot correlations between features.

Useful function: common.visualize.data.plot_corr

What regularities do you see? 

# Anomaly detection

Below you need to compare outlier detection methods using the following scheme:

1. Find optimal parameters with GridSearchCV, using the default number of folds, on the TRAIN SET. Set n_jobs=1 (GridSearchCV may not work otherwise).
2. Display the best parameters.
3. Apply the method with the best parameters to the TEST SET.
4. Show the ROC curve for the TEST SET with the title 'AUC=<value>', where <value> is the estimated AUC value.
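
The four steps of this scheme can be sketched on synthetic data as follows. Everything here is illustrative: `ScoringForest`, `auc_metric`, the toy data, and the parameter grid are assumptions made for the sketch, not the data or grids required by the task.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.metrics import auc, make_scorer, roc_curve
from sklearn.model_selection import GridSearchCV


class ScoringForest(IsolationForest):
    """Hypothetical helper: predict() returns outlier scores, not +1/-1 labels."""

    def predict(self, X):
        return -self.score_samples(X)  # higher = more anomalous


def auc_metric(y_true, y_score):
    fpr, tpr, _ = roc_curve(y_true, y_score, pos_label=1)
    return auc(fpr, tpr)


# Synthetic stand-in for the real data: a Gaussian cloud plus distant points.
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.normal(size=(600, 2)), rng.uniform(5, 8, size=(60, 2))])
y_demo = np.r_[np.zeros(600, dtype=int), np.ones(60, dtype=int)]
order = rng.permutation(len(X_demo))
X_demo, y_demo = X_demo[order], y_demo[order]
X_tr, y_tr, X_te, y_te = X_demo[:440], y_demo[:440], X_demo[440:], y_demo[440:]

# 1. Grid search on the train part. The estimator is not a classifier, so the
#    default cv is a plain (unstratified) KFold.
search = GridSearchCV(
    ScoringForest(random_state=0),
    param_grid={"n_estimators": [25, 50]},  # illustrative grid
    scoring=make_scorer(auc_metric, greater_is_better=True),
    n_jobs=1,
)
search.fit(X_tr, y_tr)

# 2. Best parameters.
print(search.best_params_)

# 3. Apply the refitted best estimator to the test part.
test_scores = search.best_estimator_.predict(X_te)

# 4. ROC curve titled with the test AUC (shown inline in a notebook).
fpr, tpr, _ = roc_curve(y_te, test_scores, pos_label=1)
test_auc = auc(fpr, tpr)
fig, ax = plt.subplots()
ax.plot(fpr, tpr)
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_title("AUC=%.4f" % test_auc)
```

Because the folds are not stratified, a fold with very few frauds could end up containing none of them, in which case the AUC for that fold is undefined. That is worth checking on the real, highly imbalanced data.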

In [None]:
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import GridSearchCV

from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EmpiricalCovariance
from sklearn.ensemble import IsolationForest

#### Every GridSearchCV call should use scoring=my_auc_score, defined below.

Y_hat is the **score**, showing how much an object looks like an outlier. So **predict** methods should **return scores, not classes.**

To obtain such a predict method, you will need to subclass the original scikit-learn estimators and override their predict methods.

Don't confuse an object's outlier score with its normality score.
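
As an illustration of overriding predict by inheritance, here is a minimal sketch for LocalOutlierFactor on toy data. The class name `LOFOutlierScore` and the data are hypothetical. The sketch relies on `score_samples`, which scikit-learn exposes for LocalOutlierFactor only when `novelty=True` and which is a normality score (larger means more normal), hence the sign flip.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor


class LOFOutlierScore(LocalOutlierFactor):
    """Hypothetical subclass whose predict() returns an outlier score."""

    def predict(self, X):
        # score_samples() is a normality score, so negate it to make
        # larger values mean "more anomalous".
        return -self.score_samples(X)


# Toy data: fit on a Gaussian cloud, score new points that include far-away ones.
rng = np.random.RandomState(0)
X_train_demo = rng.normal(size=(300, 2))
X_test_demo = np.vstack([rng.normal(size=(100, 2)), rng.uniform(5, 8, size=(10, 2))])
y_test_demo = np.r_[np.zeros(100, dtype=int), np.ones(10, dtype=int)]

lof = LOFOutlierScore(n_neighbors=7, novelty=True).fit(X_train_demo)
outlier_scores = lof.predict(X_test_demo)

auc_outlier = roc_auc_score(y_test_demo, outlier_scores)     # correct direction
auc_normality = roc_auc_score(y_test_demo, -outlier_scores)  # flipped direction
print(auc_outlier, auc_normality)
```

Passing a normality score instead of an outlier score turns an AUC of a into 1 - a, so an AUC well below 0.5 is a hint that the sign is reversed.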

In [None]:
from sklearn.metrics import make_scorer

def my_auc(Y,Y_hat):
    fpr,tpr,_ = roc_curve(Y, Y_hat, pos_label=1)
    return auc(fpr, tpr)

my_auc_score = make_scorer(my_auc, greater_is_better=True)

### LocalOutlierFactor method

#### Show best parameters on train set.

Consider grid {'n_neighbors':[1,3,7],'p':[1,2]}

#### Show quality on test set.

### EmpiricalCovariance method

#### Since this method does not have tunable parameters, just show its ROC curve and AUC on the test set.
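
One way to turn EmpiricalCovariance into an outlier detector is to use the squared Mahalanobis distance to the fitted distribution as the outlier score. The sketch below does this on toy data; the arrays are synthetic and the choice of this score is an assumption about the intended approach.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance
from sklearn.metrics import roc_auc_score

# Toy data: fit on a Gaussian cloud, then score new points with a few shifted ones.
rng = np.random.RandomState(0)
X_fit = rng.normal(size=(500, 3))
X_new = np.vstack([rng.normal(size=(100, 3)), rng.normal(loc=6.0, size=(10, 3))])
y_new = np.r_[np.zeros(100, dtype=int), np.ones(10, dtype=int)]

cov = EmpiricalCovariance().fit(X_fit)

# mahalanobis() returns squared Mahalanobis distances: larger = more anomalous,
# so it can be used directly as an outlier score.
maha_scores = cov.mahalanobis(X_new)
maha_auc = roc_auc_score(y_new, maha_scores)
print(maha_auc)
```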

### IsolationForest method

#### Show best parameters on train set

#### Show quality on test set.

Does the quality improve as **n_estimators** increases?