## Outlier Identification and Removal
Outliers are observations that differ markedly from the other observations in a dataset. They can be identified with statistical methods.
Before we look for outliers, let's define a dataset that we can use to test the methods.

We will generate a population of 10,000 random numbers drawn from a Gaussian distribution with a mean of 50 and a standard deviation of 5.
Because a Gaussian distribution has unbounded tails, a sample of this size will contain a few values far from the mean, which we can treat as outliers.

In [2]:
# We will use the numpy library.
import numpy as np
# Seed the random number generator so the results are reproducible
np.random.seed(1)
data = 5 * np.random.randn(10000) + 50
Mean = np.mean(data)
Standard_deviation = np.std(data)
print(f"The mean is {Mean:.3f} and the standard deviation is {Standard_deviation:.3f}")

The mean is 50.049 and the standard deviation is 4.994


### Standard Deviation Method
Given the mean $\mu$ and the standard deviation $\sigma$, a simple way to identify outliers is to compute a z-score for every observation $x_i$, defined as the number of standard deviations that $x_i$ lies from the mean: $z_i = (x_i - \mu) / \sigma$. Observations whose absolute z-score $|z_i|$ is greater than a threshold, for example three, are declared to be outliers.

In [4]:
# Define the outlier cut-off as three standard deviations from the mean
cut_off = Standard_deviation * 3
lower = Mean - cut_off
upper = Mean + cut_off
# Identify the outliers
outliers = [x for x in data if x < lower or x > upper]
length_of_outliers = len(outliers)
print(f"Identified outliers {length_of_outliers}")
# Keep the observations that fall inside the cut-off
non_outliers = [x for x in data if lower <= x <= upper]
length_of_non_outliers = len(non_outliers)
print(f"Non-Outlier observations {length_of_non_outliers}")

Identified outliers 29
Non-Outlier observations 9971
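
The same rule can also be written directly in terms of z-scores with a NumPy boolean mask, which avoids looping over the data in Python. The following is a minimal sketch; the helper name `zscore_outlier_mask` and the variable names are illustrative and do not come from the cells above.

```python
import numpy as np

def zscore_outlier_mask(values, threshold=3.0):
    """Return a boolean mask that is True where |z-score| exceeds threshold."""
    values = np.asarray(values, dtype=float)
    z_scores = (values - values.mean()) / values.std()
    return np.abs(z_scores) > threshold

# Recreate the same sample as above (same seed, mean and standard deviation)
np.random.seed(1)
sample = 5 * np.random.randn(10000) + 50

mask = zscore_outlier_mask(sample)
zs_outliers = sample[mask]    # observations more than 3 standard deviations away
zs_inliers = sample[~mask]    # everything else
print(f"Flagged {zs_outliers.size} of {sample.size} observations")
```

Keeping the mask separate from the filtering step makes it easy to apply the same selection to other arrays aligned with the data, such as a target vector.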


### Interquartile Range Method
Not all data is close enough to normal to be modeled with a Gaussian distribution. A good statistic for summarizing a non-Gaussian sample of data is the interquartile range, or IQR for short.
The IQR is calculated as the difference between the 75th and 25th percentiles of the data, and it defines the box in a box-and-whisker plot.
The 50th percentile is the middle value, or the average of the two middle values for an even number of examples. If we had 10,000 samples, then the 50th percentile would be the average of the 5,000th and 5,001st values in sorted order.
Statistics-based outlier detection techniques assume that normal data points appear in the high-probability regions of a stochastic model, while outliers occur in its low-probability regions. With the IQR, a common rule is to flag values that fall more than 1.5 times the IQR below the 25th percentile or above the 75th percentile.
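
The percentile definitions above can be checked on a small sample with an even number of values. This is only an illustrative sketch: the eight numbers are made up, and it relies on the default linear interpolation of `np.percentile`.

```python
import numpy as np

# Eight sorted values, so the median is the average of the 4th and 5th values
small_sample = np.array([2, 4, 4, 5, 7, 9, 10, 12])

median = np.percentile(small_sample, 50)
q25_small, q75_small = np.percentile(small_sample, [25, 75])
iqr_small = q75_small - q25_small

print(f"median={median}, Q25={q25_small}, Q75={q75_small}, IQR={iqr_small}")
```

The median comes out as (5 + 7) / 2 = 6, and the IQR is the width of the box that a box-and-whisker plot would draw for these values.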


In [6]:
np.random.seed(1)
data = 5 * np.random.randn(10000) + 50
# Calculate the interquartile range
q25 = np.percentile(data, 25)
q75 = np.percentile(data, 75)
iqr = q75 - q25
print(f"percentiles of Q25 is {q25:.3f} and Q75 is {q75:.3f}. The Interquartile range is {iqr:.3f} ")
# Define the outlier cut-off as 1.5 times the IQR beyond each quartile
cut_off = iqr * 1.5
lower = q25 - cut_off
upper = q75 + cut_off
# Identify the outliers
outliers = [x for x in data if x < lower or x > upper]
num_outliers = len(outliers)
print(f"Identified outliers {num_outliers}")
# Keep the observations that fall inside the cut-off
non_outliers = [x for x in data if lower <= x <= upper]
num_non_outliers = len(non_outliers)
print(f"Non Outliers are {num_non_outliers}")

percentiles of Q25 is 46.685 and Q75 is 53.359. The Interquartile range is 6.674 
Identified outliers 81
Non Outliers are 9919
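
The IQR rule flags more observations here (81) than the three-standard-deviation rule did (29). One way to see why is to work out where the 1.5 × IQR fences fall for an ideal Gaussian. The sketch below uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative, and the expected counts are theoretical values for a perfect Gaussian, not results from the sample above.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# For a standard normal, Q75 sits at about +0.674 and Q25 at about -0.674
z_q75 = std_normal.inv_cdf(0.75)
iqr_z = 2 * z_q75

# The upper fence (Q75 + 1.5 * IQR) expressed in standard deviations
upper_fence_z = z_q75 + 1.5 * iqr_z

def two_sided_tail(z):
    """Probability that a standard normal value lies beyond +/- z."""
    return 2 * (1 - std_normal.cdf(z))

n = 10_000
expected_iqr = n * two_sided_tail(upper_fence_z)
expected_3sigma = n * two_sided_tail(3.0)

print(f"IQR fences sit at about +/-{upper_fence_z:.3f} standard deviations")
print(f"Expected flags out of {n}: IQR rule {expected_iqr:.1f}, 3-sigma rule {expected_3sigma:.1f}")
```

For Gaussian data the IQR fences sit at roughly 2.7 standard deviations from the mean, so the rule is somewhat stricter than a three-standard-deviation cut-off. The theoretical counts of about 70 and 27 are in line with the 81 and 29 observed in this sample.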


## Automatic Outlier Detection 
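Another approach is to treat outlier detection as a one-class classification problem, in which a model learns what normal data looks like and flags the observations that do not fit.

For this, we will use the Boston housing dataset and scikit-learn's `LocalOutlierFactor`, which scores each observation by how isolated it is relative to its nearest neighbors. We remove the flagged rows from the training set and then fit a linear regression model on what remains.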

In [8]:
# First we will import the necessary libraries
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load the Boston housing dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
housing_df = pd.read_csv(url, header=None)

# Convert the dataframe into a numpy array
data = housing_df.values

# Split into features and target
X, y = data[:, :-1], data[:, -1]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print("Before outlier removal:", X_train.shape, y_train.shape)

# Apply Local Outlier Factor
lof = LocalOutlierFactor()
lof_X = lof.fit_predict(X_train)

# Create a mask for non-outliers
mask = lof_X != -1
X_train_clean = X_train[mask, :]
y_train_clean = y_train[mask]
print("After outlier removal:", X_train_clean.shape, y_train_clean.shape)

# Train Linear Regression model on clean data
model = LinearRegression()
model.fit(X_train_clean, y_train_clean)

# Predict on test data
y_pred = model.predict(X_test)

# Evaluate using Mean Absolute Error
mae = mean_absolute_error(y_test, y_pred)
print(f"The mean absolute error is {mae:.2f}")


Before outlier removal: (339, 13) (339,)
After outlier removal: (305, 13) (305,)
The mean absolute error is 3.36
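
To see what `fit_predict` returns in isolation, here is a small sketch on synthetic data rather than the housing data. The two-dimensional cluster, the single far-away point, and all variable names are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# 100 points from a tight 2-D Gaussian cluster plus one point far away from it
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
far_point = np.array([[10.0, 10.0]])
X_toy = np.vstack([cluster, far_point])

# fit_predict labels each row: 1 for an inlier, -1 for an outlier
toy_lof = LocalOutlierFactor()
toy_labels = toy_lof.fit_predict(X_toy)

# More negative scores mean a point is more isolated than its neighbors
toy_scores = toy_lof.negative_outlier_factor_

print("Label of the far point:", toy_labels[-1])
print("Rows flagged as outliers:", int((toy_labels == -1).sum()))
```

The `-1` label is what the mask `lof_X != -1` in the cell above filters on, and `negative_outlier_factor_` exposes the underlying scores if a custom threshold is preferred.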
