
In [1]:
import numpy as np
import pandas as pd 
import os

from sklearn.neighbors import LocalOutlierFactor

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



/kaggle/input/playground-series-s3e25/sample_submission.csv
/kaggle/input/playground-series-s3e25/train.csv
/kaggle/input/playground-series-s3e25/test.csv


In [2]:
train_df = pd.read_csv('/kaggle/input/playground-series-s3e25/train.csv', index_col='id')

In [3]:
train_df.head()

| id | allelectrons_Total | density_Total | allelectrons_Average | val_e_Average | atomicweight_Average | ionenergy_Average | el_neg_chi_Average | R_vdw_element_Average | R_cov_element_Average | zaratio_Average | density_Average | Hardness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100.0 | 0.841611 | 10.0 | 4.8 | 20.612526 | 11.0881 | 2.766 | 1.732 | 0.86 | 0.49607 | 0.91457 | 6.0 |
| 1 | 100.0 | 7.558488 | 10.0 | 4.8 | 20.298893 | 12.04083 | 2.755 | 1.631 | 0.91 | 0.492719 | 0.7176 | 6.5 |
| 2 | 76.0 | 8.885992 | 15.6 | 5.6 | 33.739258 | 12.0863 | 2.828 | 1.788 | 0.864 | 0.481478 | 1.50633 | 2.5 |
| 3 | 100.0 | 8.795296 | 10.0 | 4.8 | 20.213349 | 10.9485 | 2.648 | 1.626 | 0.936 | 0.489272 | 0.78937 | 6.0 |
| 4 | 116.0 | 9.577996 | 11.6 | 4.8 | 24.988133 | 11.82448 | 2.766 | 1.682 | 0.896 | 0.492736 | 1.86481 | 6.0 |


In [4]:
X = train_df.drop('Hardness', axis=1)

In [5]:
print('Number of observations: %d' % len(X))

Number of observations: 10407


## Outlier identification with standard deviation

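* Assumption: the data should be Gaussian or approximately Gaussian, because the three-standard-deviation cutoff used below covers about 99.7% of a normal distribution.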
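To see why the Gaussian assumption matters, here is a small sketch on synthetic data (the sample, its mean of 50 and its standard deviation of 5 are made up for illustration and have nothing to do with the competition data). It measures what fraction of a normally distributed sample falls within three standard deviations of the mean.

```python
import numpy as np

# Synthetic Gaussian sample; the parameters are arbitrary illustration values
rng = np.random.default_rng(0)
sample = rng.normal(loc=50.0, scale=5.0, size=100_000)

mean, std = sample.mean(), sample.std()
lower, upper = mean - 3 * std, mean + 3 * std

# Fraction of points that a 3-sigma rule would keep
inside = (sample >= lower) & (sample <= upper)
frac_inside = inside.mean()
print(f"Fraction within 3 standard deviations: {frac_inside:.4f}")
```

For a heavily skewed or multi-modal feature this fraction can be quite different, so the rule may remove far more (or fewer) rows than expected.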

In [6]:
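X_ = X.copy()
for col in X_.columns:
    # Mean and standard deviation of the rows that survived the earlier columns
    col_mean, col_std = np.mean(X_[col]), np.std(X_[col])
    # Keep values within three standard deviations of the mean
    cut_off = 3 * col_std
    lower, upper = col_mean - cut_off, col_mean + cut_off
    mask1 = X_[col] <= upper
    mask2 = X_[col] >= lower
    # Drop rows outside the bounds before moving on to the next column
    X_ = X_[mask1 & mask2]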

In [7]:
print('Number of observations after outlier removal: %d' % len(X_))

Number of observations after outlier removal: 8931
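The loop above filters column by column, so the statistics for later columns are computed on rows that already passed the earlier checks. As an alternative, here is a minimal sketch that computes every bound from the full data at once and returns a flag per row. The helper name `std_outlier_mask`, the parameter `k` and the synthetic `demo` frame are hypothetical, and the row count it produces on the real data may differ from the number reported above.

```python
import numpy as np
import pandas as pd

def std_outlier_mask(df: pd.DataFrame, k: float = 3.0) -> pd.Series:
    """Flag rows with at least one value more than k std from its column mean."""
    mean = df.mean()
    std = df.std(ddof=0)  # population standard deviation
    outside = (df < mean - k * std) | (df > mean + k * std)
    return outside.any(axis=1)

# Synthetic demo data with one planted extreme value (assumption for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["a", "b", "c"])
demo.loc[0, "a"] = 25.0

flags = std_outlier_mask(demo)
demo_clean = demo[~flags]
print("Rows flagged:", int(flags.sum()))
```

Returning a boolean mask instead of a filtered frame makes it easy to inspect the flagged rows before deciding to drop them.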


## Outlier identification with interquartile range method

In [8]:
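X_ = X.copy()
for col in X_.columns:
    # First and third quartiles of the rows that survived the earlier columns
    q1, q3 = np.percentile(X_[col], 25), np.percentile(X_[col], 75)
    iqr = q3 - q1
    # Values more than 1.5 * IQR beyond the quartiles are treated as outliers
    cut_off = 1.5 * iqr

    lower, upper = q1 - cut_off, q3 + cut_off

    mask1 = X_[col] <= upper
    mask2 = X_[col] >= lower
    # Drop rows outside the bounds before moving on to the next column
    X_ = X_[mask1 & mask2]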

In [9]:
print('Number of observations after outlier removal: %d' % len(X_))

Number of observations after outlier removal: 5012
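To make the arithmetic behind the IQR bounds concrete, here is a small worked example on a made-up array of ten values (not taken from the dataset). It uses the same `np.percentile` calls and the same 1.5 multiplier as the cell above.

```python
import numpy as np

# Toy values with one obvious extreme (assumption for illustration)
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, q3 = np.percentile(values, [25, 75])   # 3.25 and 7.75
iqr = q3 - q1                              # 4.5
cut_off = 1.5 * iqr                        # 6.75
lower, upper = q1 - cut_off, q3 + cut_off  # -3.5 and 14.5

outliers = values[(values < lower) | (values > upper)]
print(outliers)  # [100.]
```

Because a row is dropped as soon as any one of its columns falls outside its bounds, applying this rule across many columns can remove a large share of the data.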


## Outlier identification with Local Outlier Factor (LOF)

* The local outlier factor (LOF) uses nearest neighbors for outlier detection. Each example receives a score that reflects how isolated it is, based on the density of its local neighborhood compared with the density around its neighbors. Examples with the largest scores are more likely to be outliers. The scikit-learn library implements this approach in the `LocalOutlierFactor` class.
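As a rough illustration of the scoring idea, the sketch below fits `LocalOutlierFactor` on synthetic 2D data with one planted isolated point (the data and the point at (12, 12) are assumptions made for this example). scikit-learn stores the scores negated in `negative_outlier_factor_`, so the most isolated example has the lowest value.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# One dense synthetic cluster plus a single far-away point (illustration only)
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
points = np.vstack([cluster, [[12.0, 12.0]]])

lof_demo = LocalOutlierFactor(n_neighbors=20)
labels = lof_demo.fit_predict(points)        # -1 for outliers, 1 for inliers
scores = lof_demo.negative_outlier_factor_   # lower means more isolated

most_isolated = int(np.argmin(scores))
print("Index of most isolated point:", most_isolated)
print("Label of planted point:", labels[-1])
```

Inspecting the scores directly, rather than only the -1/1 labels, can help when choosing how aggressive the removal should be.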

In [10]:
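# Fit LOF with default settings; fit_predict returns -1 for outliers and 1 for inliers
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X)

# Keep only the rows labelled as inliers
mask = yhat != -1

X_ = X[mask]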

In [11]:
print('Number of observations after outlier removal: %d' % len(X_))

Number of observations after outlier removal: 8481


That's all for now! If you find this useful, please feel free to upvote. Thanks in advance.

Reference:
> https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/    