<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/Anomalies_3_Local_Outlier_Factor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

As the number of features increases, you may want to consider using the Local Outlier Factor (LOF) method. <br>


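The LOF method uses nearest neighbors to detect outliers. <br>
Each data example is scored by comparing its local density with the local density of its nearest neighbors. A score close to 1 means the example is about as dense as its neighbors; the larger the score, the more likely the example is an outlier. scikit-learn stores the negated score in `negative_outlier_factor_`, so in that attribute the more negative the value, the more likely the example is an outlier.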
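
As a reference for how the score is built, the standard LOF definition can be written as below. Here $k$ corresponds to `n_neighbors`, $N_k(p)$ is the set of the $k$ nearest neighbors of $p$, and $d$ is the distance metric. The symbols are notation chosen for this sketch, not names from the library.

$$
\begin{aligned}
\text{reach-dist}_k(p, o) &= \max\{\, k\text{-distance}(o),\; d(p, o) \,\} \\
\text{lrd}_k(p) &= \left( \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \text{reach-dist}_k(p, o) \right)^{-1} \\
\text{LOF}_k(p) &= \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\text{lrd}_k(o)}{\text{lrd}_k(p)}
\end{aligned}
$$

A point whose local reachability density is much lower than that of its neighbors gets a LOF well above 1.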

**Import Libraries**

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from numpy import vstack
from sklearn.datasets import make_classification 
from sklearn.model_selection import train_test_split 
from sklearn.metrics import f1_score
from sklearn.neighbors import LocalOutlierFactor
from matplotlib import pyplot
from sklearn.neighbors import NearestNeighbors

**Use the LocalOutlierFactor model with an estimate of the expected percentage of outliers** 

The `contamination` parameter is the expected proportion of outliers in the data set. During fitting, it is used to set the threshold on the sample scores.<br>

>if ‘auto’, the threshold is determined as in the original paper,<br>

>if a float, the contamination should be in the range (0, 0.5].<br>

*The default contamination is 'auto'*
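
To make the effect of `contamination` concrete, here is a minimal sketch on a made-up data set (`X_demo` and the chosen values are assumptions for illustration only). It fits the model with three settings and records the fitted threshold, which scikit-learn exposes as `offset_`, along with how many examples are flagged.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical demo data: 95 clustered points plus 5 widely scattered ones
rng = np.random.RandomState(0)
X_demo = np.r_[rng.randn(95, 2), rng.uniform(low=-8, high=8, size=(5, 2))]

results = {}
for contamination in ("auto", 0.05, 0.2):
    lof = LocalOutlierFactor(n_neighbors=20, contamination=contamination)
    labels = lof.fit_predict(X_demo)
    # offset_ is the threshold applied to negative_outlier_factor_
    results[contamination] = (lof.offset_, int((labels == -1).sum()))
    print(contamination, "threshold:", lof.offset_, "flagged:", results[contamination][1])
```

With `'auto'` the threshold is fixed, while a float makes the threshold a percentile of the scores, so a larger value flags more examples.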

**Small example of LOF**

In [None]:
sample_x = [[-1.1], [0.2], [101.1], [0.3]]

In [None]:
pyplot.scatter(sample_x,y=[0,1,2,3], c='orange',s=60)
pyplot.show()

Use LOF to score the data examples as outliers

In [None]:
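clf = LocalOutlierFactor(n_neighbors=2)
# fit_predict returns 1 for inliers and -1 for outliers
print(clf.fit_predict(sample_x))

# negated LOF scores: the more negative, the more anomalous
print(clf.negative_outlier_factor_)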
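
To see where these scores come from, the following sketch recomputes LOF by hand for the same four values and compares the result with scikit-learn. It assumes the default Euclidean metric, and the variable names (`X_small`, `lof_manual`, and so on) are introduced here for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

X_small = np.array([[-1.1], [0.2], [101.1], [0.3]])
k = 2

# kneighbors() with no argument returns, for each training point,
# its k nearest neighbors excluding the point itself.
nn = NearestNeighbors(n_neighbors=k).fit(X_small)
dist, idx = nn.kneighbors()

# k-distance: distance from each point to its k-th nearest neighbor
k_dist = dist[:, -1]

# Reachability distance from p to neighbor o: max(k-distance(o), d(p, o))
reach = np.maximum(dist, k_dist[idx])

# Local reachability density: inverse of the mean reachability distance
lrd = 1.0 / reach.mean(axis=1)

# LOF: mean density of the neighbors divided by the point's own density
lof_manual = lrd[idx].mean(axis=1) / lrd

clf_check = LocalOutlierFactor(n_neighbors=k)
clf_check.fit(X_small)
lof_sklearn = -clf_check.negative_outlier_factor_

print("manual :", lof_manual)
print("sklearn:", lof_sklearn)
```

The value 101.1 sits far from the other three, so its density is much lower than that of its neighbors and its LOF is far above 1.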

**Small example of NearestNeighbors**

In [None]:
samples = [[0., 1., 2.], [3., 4, 5], [6, 7, 8]]
samples_x=[samples[0][0],samples[1][0],samples[2][0]]
print(samples_x)
samples_y=[samples[0][1],samples[1][1],samples[2][1]]
print(samples_y)

In [None]:
pyplot.scatter(samples_x,y=samples_y, c='orange')
pyplot.scatter(1,6)
pyplot.show()

We can use the `NearestNeighbors` class to find the distance from a query point to its nearest neighbor, and the index of the sample that is the nearest neighbor.

In [None]:
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(samples)
print(neigh.kneighbors([[1., 6., 1.]]))
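
`kneighbors` returns a pair of arrays, distances first and indices second. As a small sketch, the result can be unpacked and cross-checked against a direct Euclidean distance computation (the names `query`, `nearest_index`, and `manual` are introduced here for illustration).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

samples = [[0., 1., 2.], [3., 4., 5.], [6., 7., 8.]]
query = [[1., 6., 1.]]

neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(samples)

# Both arrays have shape (n_queries, n_neighbors)
distances, indices = neigh.kneighbors(query)
nearest_index = int(indices[0, 0])
nearest_distance = float(distances[0, 0])
print("nearest sample:", samples[nearest_index])
print("distance:", nearest_distance)

# Cross-check: Euclidean distance from the query to every sample
manual = np.linalg.norm(np.array(samples) - np.array(query), axis=1)
print("all distances:", manual)
```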





# **Try a larger example**




**Create a dataset that has outliers**

In [None]:
np.random.seed(42)

# Generate train data
X_inliers = 0.3 * np.random.randn(100, 2) #generates an array 100rows x 2columns
X_inliers = np.r_[X_inliers + 2, X_inliers - 2]

# Generate some outliers
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X_inliers, X_outliers]

**Plot the data**

In [None]:
pyplot.scatter(X[:, 0],X[:,1])
pyplot.show()

**Create the labels**<br>
outliers = -1<br>
inliers = 1

In [None]:
n_outliers = len(X_outliers)
ground_truth = np.ones(len(X), dtype=int)
ground_truth[-n_outliers:] = -1

In [None]:
ground_truth

**Plot the data based on inliers and outliers**

In [None]:
pyplot.scatter(X[:, 0],ground_truth)
pyplot.show()

**Use the LocalOutlierFactor estimator to predict the outliers**

In [None]:
# fit the model for outlier detection (default)
clf = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
# use fit_predict to compute the predicted labels of the training samples
# (when LOF is used for outlier detection, the estimator has no predict,
# decision_function and score_samples methods).
y_pred = clf.fit_predict(X)
n_errors = (y_pred != ground_truth).sum()
X_scores = clf.negative_outlier_factor_
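
The error count can be complemented with an F1 score for the outlier class. This is a sketch rather than part of the original workflow: it rebuilds the same data so that it runs on its own, and it treats `-1` as the positive label, which is an assumption about what we want to measure.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.neighbors import LocalOutlierFactor

# Rebuild the data set from the cells above
np.random.seed(42)
X_inliers = 0.3 * np.random.randn(100, 2)
X_inliers = np.r_[X_inliers + 2, X_inliers - 2]
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X_inliers, X_outliers]
ground_truth = np.ones(len(X), dtype=int)
ground_truth[-len(X_outliers):] = -1

clf = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
y_pred = clf.fit_predict(X)

n_errors = int((y_pred != ground_truth).sum())
n_flagged = int((y_pred == -1).sum())
# F1 for the outlier class (label -1 treated as the positive class)
outlier_f1 = f1_score(ground_truth, y_pred, pos_label=-1)

print("prediction errors:", n_errors)
print("examples flagged as outliers:", n_flagged)
print("F1 (outlier class):", round(outlier_f1, 3))
```

Because `contamination=0.1`, roughly 10% of the 220 examples are flagged regardless of how many true outliers exist, which is one source of prediction errors.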

**Plot the data and draw a circle around each point whose size grows with its outlier score**.<br>
 The more negative the `negative_outlier_factor_` value (the more anomalous the point), the larger the circle.

In [None]:
plt.title("Local Outlier Factor (LOF)")
plt.scatter(X[:, 0], X[:, 1], color="k", s=3.0, label="Data points")
# plot circles with radius proportional to the outlier scores
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())
plt.scatter(
    X[:, 0],
    X[:, 1],
    s=1000 * radius,
    edgecolors="r",
    facecolors="none",
    label="Outlier scores",
)
plt.axis("tight")
plt.xlim((-5, 5))
plt.ylim((-5, 5))
plt.xlabel("prediction errors: %d" % (n_errors))
legend = plt.legend(loc="upper left")
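# legendHandles was renamed to legend_handles in Matplotlib 3.7 and later removed
handles = getattr(legend, "legend_handles", None) or legend.legendHandles
handles[0]._sizes = [10]
handles[1]._sizes = [20]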
plt.show()

**Assignment**<br>
Change the dataset. <br>
Make it larger, add more outliers, etc. <br>
What effect do the changes have on the LOF?
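
One possible starting point for the assignment is sketched below. The helper `make_dataset` is hypothetical (it is not part of scikit-learn or of the cells above), and setting `contamination` to the true outlier fraction is an assumption that would not be available with real data.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def make_dataset(n_per_cluster, n_outliers, seed=42):
    """Hypothetical helper: two Gaussian clusters plus uniform outliers."""
    rng = np.random.RandomState(seed)
    base = 0.3 * rng.randn(n_per_cluster, 2)
    inliers = np.r_[base + 2, base - 2]
    outliers = rng.uniform(low=-4, high=4, size=(n_outliers, 2))
    X_new = np.r_[inliers, outliers]
    labels = np.ones(len(X_new), dtype=int)
    labels[-n_outliers:] = -1  # outliers are the last rows
    return X_new, labels

experiments = []
for n_per_cluster, n_outliers in [(100, 20), (500, 20), (100, 60)]:
    X_new, labels = make_dataset(n_per_cluster, n_outliers)
    # Assumption: we know the true fraction of outliers
    contamination = n_outliers / len(X_new)
    model = LocalOutlierFactor(n_neighbors=20, contamination=contamination)
    errors = int((model.fit_predict(X_new) != labels).sum())
    experiments.append((len(X_new), n_outliers, errors))
    print("size:", len(X_new), "outliers:", n_outliers, "errors:", errors)
```

Varying `n_neighbors` in the same loop is another way to explore how the scores change.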