<a href="https://colab.research.google.com/github/faridelya/Outlier-Anomalies/blob/main/Adult_Mix_Data_Outlier_Detection_PyOD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## [**PyOD:**](https://pypi.org/project/pyod/)
**PyOD is a comprehensive and scalable Python toolkit for detecting outlying objects in multivariate data**. This exciting yet challenging field is commonly referred to as Outlier Detection or Anomaly Detection.

PyOD includes more than 30 detection algorithms, **from the classical LOF (SIGMOD 2000)** to the **latest SUOD (MLSys 2021)** and **ECOD (TKDE 2022)**. Since 2017, PyOD has been used in numerous academic research projects and commercial products, with more than 5 million downloads. It is also well acknowledged by the machine learning community through various dedicated posts and tutorials, **including Analytics Vidhya, KDnuggets, Towards Data Science, Computer Vision News, and awesome-machine-learning.**

# **We Are Using:**

*   Probabilistic <br>
    ECOD: Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions
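
To make the idea behind ECOD concrete, here is a minimal pure-Python sketch of an ECDF-based score. It is a simplified illustration under my own assumptions, not PyOD's implementation: the function name `ecod_scores` and the toy data are hypothetical. For every feature it estimates the left and right tail probabilities from the empirical CDF, sums their negative logs across features, and keeps the largest of the left-tail, right-tail, and skewness-selected totals.

```python
import math
from bisect import bisect_left, bisect_right


def ecod_scores(rows):
    """Simplified ECOD-style outlier scores for a list of numeric rows."""
    n = len(rows)
    d = len(rows[0])
    left_sum = [0.0] * n   # sum over features of -log(left-tail probability)
    right_sum = [0.0] * n  # same for the right tail
    skew_sum = [0.0] * n   # tail picked per feature from the sign of its skewness

    for j in range(d):
        col = [row[j] for row in rows]
        ordered = sorted(col)
        mean = sum(col) / n
        # The sign of the third central moment equals the sign of the skewness.
        third_moment = sum((v - mean) ** 3 for v in col) / n
        for i, v in enumerate(col):
            # Empirical tail probabilities; both are at least 1/n, so log is safe.
            p_left = bisect_right(ordered, v) / n        # share of values <= v
            p_right = (n - bisect_left(ordered, v)) / n  # share of values >= v
            u_left, u_right = -math.log(p_left), -math.log(p_right)
            left_sum[i] += u_left
            right_sum[i] += u_right
            skew_sum[i] += u_left if third_moment < 0 else u_right

    return [max(l, r, s) for l, r, s in zip(left_sum, right_sum, skew_sum)]


# Made-up toy data: 100 points on a scrambled grid plus one far-away point.
data = [[i, (37 * i + 50) % 100] for i in range(100)] + [[1000, 1000]]
scores = ecod_scores(data)
most_outlying = max(range(len(scores)), key=scores.__getitem__)
print(most_outlying)  # index of the row with the largest score
```

The far-away point sits in the extreme right tail of both features, so it receives the largest score. No distances or hyperparameters are involved, which is the main appeal of the method.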



# **Installing Dependencies**

In [14]:
!pip install pyod &> /dev/null

In [15]:
!pip install "combo>=0.0.8"

In [16]:
!pip install joblib &> /dev/null

In [17]:
!pip install "numba>=0.35" &> /dev/null

In [18]:
!pip install "scipy>=0.19.1" &> /dev/null

In [19]:
!pip install statsmodels &> /dev/null
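
Since most of the install cells discard pip's output, a quick check of what was actually installed can help. The sketch below uses only the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# The packages installed by the cells above.
report = installed_versions(
    ["pyod", "combo", "joblib", "numba", "scipy", "statsmodels"]
)
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```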

[PyOD documentation](https://pyod.readthedocs.io/en/latest/index.html)<br>
[Visit the COPOD module documentation](https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.copod) <br>

[Research paper: COPOD, Copula-Based Outlier Detection](https://arxiv.org/abs/2009.09463)




In [62]:
from __future__ import division
from __future__ import print_function
import pandas as pd
import io
import os
import sys
from pyod.models.ecod import ECOD
from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize


**Uploading Adult Dataset** 

In [21]:
from google.colab import files
uploaded = files.upload()

Saving adult.csv to adult.csv


# **Class for Multi-Column Label Encoder**

In [22]:
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline

# creating custom labelencoder for multiple features
class MultiColumnLabelEncoder:
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname, col in output.items():  # iteritems() was removed in pandas 2.0
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)
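
Because `fit` is a no-op in the class above and every call to `transform` fits fresh encoders, the same category can receive different integer codes in different DataFrames. If consistent codes across train and test data matter, one option is to store a fitted `LabelEncoder` per column. The sketch below is a hypothetical variant (`FittedMultiColumnLabelEncoder` is my own name, and the tiny frames are made up), not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


class FittedMultiColumnLabelEncoder:
    """Hypothetical variant that learns one LabelEncoder per column in fit()."""

    def __init__(self, columns):
        self.columns = columns
        self.encoders_ = {}

    def fit(self, X, y=None):
        # Remember the label-to-integer mapping of every requested column.
        self.encoders_ = {col: LabelEncoder().fit(X[col]) for col in self.columns}
        return self

    def transform(self, X):
        output = X.copy()
        for col, encoder in self.encoders_.items():
            # Reuse the mapping learned in fit(); unseen labels raise ValueError.
            output[col] = encoder.transform(output[col])
        return output


# Made-up frames for illustration only.
train = pd.DataFrame({"sex": ["Male", "Female", "Male"], "age": [39, 44, 52]})
test = pd.DataFrame({"sex": ["Male", "Male"], "age": [33, 28]})

encoder = FittedMultiColumnLabelEncoder(columns=["sex"]).fit(train)
encoded_test = encoder.transform(test)
print(encoded_test["sex"].tolist())  # "Male" keeps the code it was given on train
```

Refitting on `test` alone would instead map `"Male"` to 0, because it is the only label present there.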

In [59]:


def load_data_get_result(path, categorical_col= None, target= None, Threshold=0.1 ):

  '''
  Parameters:

  path: name of the uploaded file, given as a string (a key of the `uploaded` dict returned by files.upload())

  categorical_col: list of categorical column names to label-encode; defaults to None, in which case no column is encoded

  target: name of the dependent variable, given as a string

  Threshold (contamination): float in (0., 0.5), optional (default=0.1)
  The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.


  Global variables:
  These variables are available right after the function has been called:
  (1): concat_score_with_X_train ==> training DataFrame with the inlier/outlier label and the outlier score
  (2): X_test_with_label ==> test DataFrame with the predicted label and the outlier score
  (3): inlier_train_data ==> only the inliers from the training data
  (4): outlier_train_data ==> only the outliers from the training data
  (5): inliers_test_data ==> only the inliers from the test data
  (6): outlier_test_data ==> only the outliers from the test data

  Returns:
  Two values, so assign the result of the call to two variables:
  (1): train_report: precision, recall, F1 and accuracy of the predicted training labels against the ground-truth labels
  (2): test_report: precision, recall, F1 and accuracy of the predicted test labels against the ground-truth labels

  '''

  mix_data = pd.read_csv(io.BytesIO(uploaded[path]))
  #print(mix_data.head(3))

  # label-encode the categorical columns (skip if none were given)
  if categorical_col is not None:
    transform_data = MultiColumnLabelEncoder(columns = categorical_col).fit_transform(mix_data)
  else:
    transform_data = mix_data

  # predictor and target split
  if target is None:
    raise ValueError("target must be the name of the label column")
  X = transform_data.drop(columns=[target])
  y = transform_data[target]
  
  # train test split
  from sklearn.model_selection import train_test_split
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

  # BUILD ANOMALY MODEL
  # contamination = 0.1  # proportion of outliers
  # train ECOD detector
  clf_name = 'ECOD'
  clf = ECOD(contamination = Threshold )

  # you could try the parallel version as well.
  # clf = ECOD(contamination=Threshold, n_jobs=2)
  clf.fit(X_train)

  # get the prediction labels and outlier scores of the training data
  y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
  y_train_scores = clf.decision_scores_  # raw outlier scores

  # CONVERT TRAIN LABEL AND SCORE TO DATAFRAME
  y_train_prd = pd.DataFrame(y_train_pred, columns=['Label_0_1'])
  y_train_scre = pd.DataFrame(y_train_scores, columns=['decision_score'])
  result_score = pd.concat([y_train_prd, y_train_scre], axis=1)

  # reset index to avoid merge conflict on index
  X_train.reset_index(drop=True, inplace=True)
  result_score.reset_index(drop=True, inplace=True)

  # Train data with label and score in dataframe
  global concat_score_with_X_train 
  concat_score_with_X_train = pd.concat([X_train,result_score],axis=1)
    
  # get the prediction on the test data
  x_test_pred = clf.predict(X_test)  # outlier labels (0 or 1)
  x_test_scores = clf.decision_function(X_test)  # outlier scores

  # CONVERT TEST SCORE AND LABEL TO DATAFRAME
  X_test_prd = pd.DataFrame(x_test_pred, columns=['test_Label_0_1'])
  X_test_score = pd.DataFrame( x_test_scores, columns=['test_decision_score'])                        
  test_result_score = pd.concat([X_test_prd, X_test_score], axis=1)

  # reset index to avoid merge conflict on index
  X_test.reset_index(drop=True, inplace=True)
  test_result_score.reset_index(drop=True, inplace=True)

  #test dataframe with score and label
  global X_test_with_label
  X_test_with_label = pd.concat([X_test,test_result_score],axis=1)

  # evaluate and print the results
  print("\n On Training Data eval:")
  evaluate_print(clf_name, y_train, y_train_scores)
  
  print('\n On Test Data eval:')
  evaluate_print(clf_name, y_test, x_test_scores)

  from sklearn.metrics import classification_report
  target_names = ['0: inliers', '1: outliers']
  train_report = classification_report(y_train, y_train_pred, target_names=target_names)

  test_report = classification_report(y_test, x_test_pred, target_names=target_names)
  #inlier in train dataset
  global inlier_train_data 
  inlier_train_data = concat_score_with_X_train[concat_score_with_X_train['Label_0_1'] == 0]
  #outlier in train dataset
  global outlier_train_data 
  outlier_train_data = concat_score_with_X_train[concat_score_with_X_train['Label_0_1'] == 1]

  # inlier in test dataset
  global inliers_test_data
  inliers_test_data = X_test_with_label[X_test_with_label['test_Label_0_1'] == 0]
  # outlier in test dataset
  global outlier_test_data 
  outlier_test_data = X_test_with_label[X_test_with_label['test_Label_0_1'] == 1]
  
  return train_report, test_report
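
The docstring says that `Threshold` (the contamination) defines the cut-off on the decision function. As far as I understand PyOD's behaviour, the cut-off is the `(1 - contamination)` quantile of the training scores, and samples scoring above it are labeled 1. The sketch below reproduces that rule with NumPy; `labels_from_scores` is a hypothetical helper and the scores are a stand-in for `clf.decision_scores_`.

```python
import numpy as np


def labels_from_scores(scores, contamination=0.1):
    """Flag the highest-scoring fraction of samples as outliers."""
    scores = np.asarray(scores, dtype=float)
    # Cut-off: the (1 - contamination) quantile of the training scores.
    threshold = np.percentile(scores, 100 * (1 - contamination))
    labels = (scores > threshold).astype(int)  # 1: outlier, 0: inlier
    return labels, threshold


toy_scores = np.arange(100)  # made-up stand-in for clf.decision_scores_
labels, threshold = labels_from_scores(toy_scores, contamination=0.1)
print(threshold, labels.sum())
```

With `contamination=0.1`, roughly 10% of the training rows are labeled as outliers regardless of how anomalous they really are, so the parameter is best read as a prior guess about the outlier share.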


### **Call Function**

In [61]:
categ_col = [ 'workclass',  'education', 'education.num','marital.status', 'occupation', 'relationship', 'race', 'sex','native.country','income']
path = 'adult.csv'

train_report_, test_report_ = load_data_get_result(path=path,  categorical_col=categ_col, target='income')




 On Training Data eval:
ECOD ROC:0.503, precision @ rank n:0.2519

 On Test Data eval:
ECOD ROC:0.4958, precision @ rank n:0.2373
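
The two numbers printed by `evaluate_print` can be understood with a small pure-Python sketch. It is a simplified illustration, not PyOD's code: ROC is computed as the share of (outlier, inlier) pairs that the score ranks correctly, and precision @ rank n as the share of true outliers among the n highest-scored samples, where n is the number of true outliers. The function names and the four-sample data are hypothetical.

```python
def roc_auc(y_true, scores):
    """Probability that a random outlier outscores a random inlier (ties count 1/2)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def precision_at_rank_n(y_true, scores):
    """Share of true outliers among the n top-scored samples, n = number of true outliers."""
    n = sum(y_true)
    top = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)[:n]
    return sum(y for _, y in top) / n


# Made-up labels (1: outlier) and scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores), precision_at_rank_n(y_true, scores))
```

An ROC value near 0.5, as in the output above, means the scores rank the positive class no better than chance. The pairwise loop is quadratic and only meant for small illustrations.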


In [63]:
inlier_train_data.head(4)

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,Label_0_1,decision_score
0,39,4,160623,7,11,4,3,3,4,1,0,0,40,39,0,9.404474
1,44,4,92649,11,8,2,3,5,4,0,0,0,40,39,0,11.006102
2,52,4,72743,11,8,3,1,1,4,0,0,0,50,39,0,11.655851
3,33,4,118500,15,9,0,4,4,4,0,0,0,40,39,0,11.12684
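
The inlier and outlier frames are plain boolean filters on the label column. As an alternative to the global variables, the split can be returned from a helper. The sketch below is a hypothetical refactoring (`SplitByLabel` and `split_by_label` are my own names); the first two toy rows reuse values from the output above, while the third row is made up.

```python
from typing import NamedTuple

import pandas as pd


class SplitByLabel(NamedTuple):
    inliers: pd.DataFrame
    outliers: pd.DataFrame


def split_by_label(scored, label_col="Label_0_1"):
    """Split a scored DataFrame into inlier (0) and outlier (1) rows."""
    return SplitByLabel(
        inliers=scored[scored[label_col] == 0],
        outliers=scored[scored[label_col] == 1],
    )


scored = pd.DataFrame(
    {
        "age": [39, 44, 90],
        "Label_0_1": [0, 0, 1],  # the last row is a made-up outlier
        "decision_score": [9.404474, 11.006102, 30.0],
    }
)
split = split_by_label(scored)
print(len(split.inliers), len(split.outliers))
```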


In [57]:
print(train_report_)

              precision    recall  f1-score   support

  0: inliers       0.76      0.91      0.82      8150
 1: outliers       0.23      0.09      0.13      2596

    accuracy                           0.71     10746
   macro avg       0.49      0.50      0.48     10746
weighted avg       0.63      0.71      0.66     10746
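
The two average rows of the report follow directly from the per-class rows. The sketch below recomputes some of them from the rounded values printed above, so it agrees with the report only up to rounding.

```python
# Per-class rows copied from the training report above: (precision, recall, support).
rows = {
    "0: inliers": (0.76, 0.91, 8150),
    "1: outliers": (0.23, 0.09, 2596),
}

total = sum(support for _, _, support in rows.values())

# Macro average: plain mean over classes, ignoring class sizes.
macro_recall = sum(recall for _, recall, _ in rows.values()) / len(rows)

# Weighted average: mean over classes weighted by support.
weighted_recall = sum(recall * support for _, recall, support in rows.values()) / total
weighted_precision = sum(p * support for p, _, support in rows.values()) / total

print(round(macro_recall, 2), round(weighted_recall, 2), round(weighted_precision, 2))
```

The support-weighted recall is the same quantity as the accuracy, which is why both appear as 0.71 in the report. The macro average treats both classes equally, so the low recall on the outlier class pulls it down to 0.50.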

