In [3]:
# LocalOutlierFactor?
Docstring:     
Unsupervised Outlier Detection using Local Outlier Factor (LOF)

The anomaly score of each sample is called Local Outlier Factor.
It measures the local deviation of density of a given sample with
respect to its neighbors.
It is local in that the anomaly score depends on how isolated the object
is with respect to the surrounding neighborhood.
More precisely, locality is given by k-nearest neighbors, whose distance
is used to estimate the local density.
By comparing the local density of a sample to the local densities of
its neighbors, one can identify samples that have a substantially lower
density than their neighbors. These are considered outliers.

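Unsupervised outlier detection using LOF.
The anomaly score of each sample, called the Local Outlier Factor, measures how far the sample's local density deviates from that of its neighbors.
The score is local: it reflects how isolated an object is relative to its surrounding neighborhood.
Concretely, local density is estimated from the k-nearest neighbors,
and a sample whose density is substantially lower than that of its neighbors is considered an outlier.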
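
The description above can be sketched by hand. The following is a minimal illustration, not the library's code: it computes k-distances, reachability distances, local reachability densities and the LOF ratio with `NearestNeighbors`, then compares the result with `negative_outlier_factor_`. The names `X_toy`, `k` and `manual_lof` are made up for this sketch, and the small `1e-10` term is an assumption added to avoid division by zero.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

rng = np.random.RandomState(0)
# 50 points from a standard normal cluster plus one distant point (index 50)
X_toy = np.vstack([rng.normal(size=(50, 2)), [[6.0, 6.0]]])
k = 10

# Calling kneighbors() without arguments excludes each point from its own neighbors
nn = NearestNeighbors(n_neighbors=k).fit(X_toy)
dist, idx = nn.kneighbors()

k_dist = dist[:, -1]                         # distance to the k-th neighbor
reach = np.maximum(dist, k_dist[idx])        # reachability distance to each neighbor
lrd = 1.0 / (reach.mean(axis=1) + 1e-10)     # local reachability density
manual_lof = (lrd[idx] / lrd[:, None]).mean(axis=1)

sk_lof = LocalOutlierFactor(n_neighbors=k).fit(X_toy)
matches = np.allclose(manual_lof, -sk_lof.negative_outlier_factor_)
most_anomalous = int(manual_lof.argmax())
```

Here `manual_lof` stays near 1 for points inside the cluster and grows for the isolated point, which is the behavior the docstring describes.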


Parameters
----------
n_neighbors : int, optional (default=20)
    Number of neighbors to use by default for :meth:`kneighbors` queries.
    If n_neighbors is larger than the number of samples provided,
    all samples will be used.

Number of nearest neighbors. If this value is larger than the number of samples, all samples are used.
    
    
algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
    Algorithm used to compute the nearest neighbors:

    - 'ball_tree' will use :class:`BallTree`
    - 'kd_tree' will use :class:`KDTree`
    - 'brute' will use a brute-force search.
    - 'auto' will attempt to decide the most appropriate algorithm
      based on the values passed to :meth:`fit` method.

    Note: fitting on sparse input will override the setting of
    this parameter, using brute force.
    
Algorithm used for the nearest-neighbor search:
'ball_tree' → uses BallTree
'kd_tree' → uses KDTree
'brute' → uses brute-force search
'auto' → chosen automatically from the data passed to fit
(sparse input always uses brute force)


leaf_size : int, optional (default=30)
    Leaf size passed to :class:`BallTree` or :class:`KDTree`. This can
    affect the speed of the construction and query, as well as the memory
    required to store the tree. The optimal value depends on the
    nature of the problem.

Leaf size of the BallTree or KDTree. Affects construction and query speed as well as memory usage.
    
    
metric : string or callable, default 'minkowski'
    metric used for the distance computation. Any metric from scikit-learn
    or scipy.spatial.distance can be used.

    If 'precomputed', the training input X is expected to be a distance
    matrix.

    If metric is a callable function, it is called on each
    pair of instances (rows) and the resulting value recorded. The callable
    should take two arrays as input and return one value indicating the
    distance between them. This works for Scipy's metrics, but is less
    efficient than passing the metric name as a string.

    Valid values for metric are:

    - from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2',
      'manhattan']

    - from scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev',
      'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski',
      'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao',
      'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean',
      'yule']

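Metric used to measure distance. With 'precomputed', the input must be a distance matrix.
The metric can be passed as a string or as a function that takes two arrays and returns a single distance value.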

        
p : integer, optional (default=2)
    Parameter for the Minkowski metric from
    :func:`sklearn.metrics.pairwise.pairwise_distances`. When p = 1, this
    is equivalent to using manhattan_distance (l1), and euclidean_distance
    (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

Exponent of the Minkowski distance. p=1 is the Manhattan distance; p=2 is the Euclidean distance.
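
As a quick check of the p=1 case, the sketch below fits two detectors on the same made-up data (`X_demo` is a hypothetical array, not part of the notebook) and compares their scores. It assumes that `metric='minkowski', p=1` and `metric='manhattan'` resolve to the same distance, as the docstring states.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(60, 2))  # hypothetical data for illustration only

# Minkowski distance with p=1 ...
lof_p1 = LocalOutlierFactor(n_neighbors=10, metric='minkowski', p=1).fit(X_demo)
# ... versus the Manhattan metric requested by name
lof_l1 = LocalOutlierFactor(n_neighbors=10, metric='manhattan').fit(X_demo)

same_scores = np.allclose(
    lof_p1.negative_outlier_factor_, lof_l1.negative_outlier_factor_
)
```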

    
metric_params : dict, optional (default=None)
    Additional keyword arguments for the metric function.

Additional parameters passed to the distance function.

    
contamination : float in (0., 0.5), optional (default=0.1)
    The amount of contamination of the data set, i.e. the proportion
    of outliers in the data set. When fitting this is used to define the
    threshold on the decision function. If "auto", the decision function
    threshold is determined as in the original paper.

    .. versionchanged:: 0.20
       The default value of ``contamination`` will change from 0.1 in 0.20
       to ``'auto'`` in 0.22.

Proportion of outliers in the data set. Used during fit() to set the threshold on the decision function.
With 'auto', the threshold is determined as in the original paper.
        
    
novelty : boolean, default False
    By default, LocalOutlierFactor is only meant to be used for outlier
    detection (novelty=False). Set novelty to True if you want to use
    LocalOutlierFactor for novelty detection. In this case be aware that
    you should only use predict, decision_function and score_samples
    on new unseen data and not on the training set.

False means outlier detection; True means novelty detection.
With novelty=True, do not call predict or score_samples on the training data; apply them only to new data.
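
A minimal sketch of the novelty-detection workflow described above: fit on data assumed to be clean, then score only unseen points. `X_train` and `X_new` are hypothetical arrays made up for this example.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X_train = rng.normal(size=(200, 2))          # assumed to contain no outliers
X_new = np.array([[0.1, -0.2],               # a typical point near the cluster center
                  [6.0, 6.0]])               # a point far from the training data

nov = LocalOutlierFactor(n_neighbors=20, novelty=True, contamination='auto')
nov.fit(X_train)

new_labels = nov.predict(X_new)        # +1 for inliers, -1 for outliers
new_scores = nov.score_samples(X_new)  # negated LOF: lower means more abnormal
```

The training set itself is never passed to `predict` here, which is the usage the docstring recommends.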

    
n_jobs : int or None, optional (default=None)
    The number of parallel jobs to run for neighbors search.
    ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
    ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
    for more details.
    Affects only :meth:`kneighbors` and :meth:`kneighbors_graph` methods.
                    
Number of CPUs used for parallel processing. -1 uses all CPUs.
This setting affects only kneighbors and kneighbors_graph.
                    
                    

Attributes
----------
negative_outlier_factor_ : numpy array, shape (n_samples,)
    The opposite LOF of the training samples. The higher, the more normal.
    Inliers tend to have a LOF score close to 1 (``negative_outlier_factor_``
    close to -1), while outliers tend to have a larger LOF score.

    The local outlier factor (LOF) of a sample captures its
    supposed 'degree of abnormality'.
    It is the average of the ratio of the local reachability density of
    a sample and those of its k-nearest neighbors.

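The negated LOF of the training samples. The higher the value, the closer the sample is to normal (an inlier).
Typically LOF ≈ 1 indicates an inlier, and a larger LOF indicates an outlier.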

    
n_neighbors_ : integer
    The actual number of neighbors used for :meth:`kneighbors` queries.
            
The number of nearest neighbors actually used.


offset_ : float
    Offset used to obtain binary labels from the raw scores.
    Observations having a negative_outlier_factor smaller than `offset_`
    are detected as abnormal.
    The offset is set to -1.5 (inliers score around -1), except when a
    contamination parameter different than "auto" is provided. In that
    case, the offset is defined in such a way we obtain the expected
    number of outliers in training.
    
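Threshold used to turn raw scores into binary labels.
A sample is flagged as an outlier when its negative_outlier_factor_ is below this value.
It is -1.5 when contamination is 'auto'; when a numeric contamination is given, it is set so that the expected number of training samples is flagged.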
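
The two ways of setting the threshold can be compared directly. In this sketch `X_off` is hypothetical data, and the percentile relation in `expected` is an assumption about how the numeric case is implemented; the docstring only says the offset yields the expected number of outliers.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(1)
# 100 inliers plus 5 uniformly scattered points (hypothetical data)
X_off = np.vstack([rng.normal(size=(100, 2)),
                   rng.uniform(-6, 6, size=(5, 2))])

lof_auto = LocalOutlierFactor(n_neighbors=20, contamination='auto').fit(X_off)
lof_5pct = LocalOutlierFactor(n_neighbors=20, contamination=0.05).fit(X_off)

# Assumed relation: the offset is the 5th percentile of the training scores
expected = np.percentile(lof_5pct.negative_outlier_factor_, 5.0)

auto_offset = lof_auto.offset_
numeric_offset = lof_5pct.offset_
```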


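fit(X)
Fits the model on the training data X.


fit_predict(X)
Available only when novelty=False.
Labels each training sample as an outlier (-1) or an inlier (+1).


predict(X)
Available only when novelty=True.
Predicts outlier (-1) or inlier (+1) for new data.


decision_function(X)
Returns the negated LOF score shifted by offset_ so that the threshold sits at 0.
Positive: inlier
Negative: outlier

    
score_samples(X)
Returns the negated LOF score of the input samples
(the larger the value, the more normal; the smaller, the more abnormal).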
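
How these three methods relate can be checked on toy data. This sketch assumes that `decision_function` equals `score_samples` minus `offset_` and that `predict` returns -1 exactly where the decision value is negative; `X_fit` and `X_query` are hypothetical arrays.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(7)
X_fit = rng.normal(size=(150, 2))            # hypothetical training data
X_query = rng.uniform(-5, 5, size=(30, 2))   # hypothetical unseen data

model = LocalOutlierFactor(n_neighbors=20, novelty=True, contamination='auto')
model.fit(X_fit)

raw = model.score_samples(X_query)          # negated LOF
shifted = model.decision_function(X_query)  # threshold moved to 0
labels = model.predict(X_query)             # -1 / +1

shift_ok = np.allclose(shifted, raw - model.offset_)
sign_ok = np.array_equal(labels, np.where(shifted < 0, -1, 1))
```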

In [6]:
import pandas as pd
import numpy as np

np.random.seed(1234)
normal_data = np.random.normal(loc=0, scale=1, size=(100, 2))  # inliers
outliers = np.random.uniform(low=-6, high=6, size=(5, 2))      # outliers

normal_labels = np.zeros((normal_data.shape[0], 1))
outlier_labels = np.ones((outliers.shape[0], 1))

normal_data = np.hstack((normal_data, normal_labels))
outliers = np.hstack((outliers, outlier_labels))

data = np.vstack([normal_data, outliers])

df = pd.DataFrame(data, columns=['A', 'B', 'label'])

# data shuffle
df = df.sample(frac=1).reset_index(drop=True)

df.head()

          A         B  label
0  0.354020 -0.035513    0.0
1 -0.121728  2.365769    0.0
2  0.015696 -2.242685    0.0
3 -0.334077  0.002118    0.0
4 -1.735349  1.210384    0.0


In [10]:
from sklearn.neighbors import LocalOutlierFactor

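# novelty: whether to score new data that was not seen during training
# novelty=False: detect outliers in the training data itself with fit_predict()
# novelty=True: treat the training data as normal and judge new data with fit(normal) + predict(new)
# contamination: proportion of outliers in the data, between 0 and 0.5, or 'auto'
lof = LocalOutlierFactor(n_neighbors=20, novelty=True, contamination=0.05)
# fit() treats every sample as normal; predict() then re-scores the same samples.
# Caution: the docstring above advises against calling predict on the training set when novelty=True.
predict = lof.fit(df[['A', 'B']]).predict(df[['A', 'B']])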
predict

array([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1, -1,  1,  1,  1,
        1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1])
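
For comparison, the outlier-detection route (novelty=False, the default) labels the training data in a single call. This is a sketch on data generated in the same way as above but rebuilt here so the block is self-contained; `demo`, `lof_od` and the other names are hypothetical, and the exact labels may differ from the novelty=True output shown above.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(1234)
demo = pd.DataFrame(
    np.vstack([rng.normal(0, 1, size=(100, 2)),     # inliers
               rng.uniform(-6, 6, size=(5, 2))]),   # scattered points
    columns=['A', 'B'],
)

lof_od = LocalOutlierFactor(n_neighbors=20, contamination=0.05)  # novelty=False
od_labels = lof_od.fit_predict(demo[['A', 'B']])   # -1 for outliers, +1 for inliers
od_scores = lof_od.negative_outlier_factor_        # scores of the training samples
n_flagged = int((od_labels == -1).sum())
```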

In [19]:
lof.decision_function(df[['A', 'B']])

array([ 1.13740966,  0.48187754,  0.76776239,  1.11587022,  0.54780374,
        1.07034553,  1.03664275,  0.58552604,  1.00920317,  1.14809067,
        1.0675924 ,  0.75792055, -2.52317204,  0.72622604,  1.03282739,
        1.03366612,  1.03779385,  0.25290076,  1.03301526,  0.98411118,
        0.86685835,  0.97554723,  0.93840061,  1.08622172,  1.13899556,
        0.05479485,  1.15041046,  0.96974693,  0.94090476,  1.09855955,
        1.13996593,  1.04530363,  1.11777292,  0.82828847,  0.68094751,
        1.06532091,  1.13343091,  1.01751544,  0.62831043,  0.89939151,
        1.13443569,  1.07431507,  1.12383673,  0.95260007,  0.95667833,
        1.01715555,  1.10506452,  1.1123397 ,  1.07509798,  0.81458551,
        1.09126857,  0.18267066,  1.13722562,  1.15907265,  1.15380281,
        0.94234552,  1.08789603,  1.12039097,  0.55270804, -1.36091151,
        1.12945174,  1.1305037 ,  1.10338388,  0.86767713, -0.42468923,
        0.9273643 ,  0.43295087,  1.03206063,  1.10401523,  1.04

In [20]:
pd.DataFrame({'pred':predict, 'score':lof.decision_function(df[['A', 'B']])}).sort_values('score').head(10)

    pred     score
12    -1 -2.523172
74    -1 -1.585367
59    -1 -1.360912
80    -1 -1.036545
64    -1 -0.424689
25     1  0.054795
51     1  0.182671
17     1  0.252901
66     1  0.432951
1      1  0.481878


In [3]:
# check the true labels of the samples predicted as normal
df[predict != -1]['label'].unique()

array([0.])

In [10]:
df[predict != -1].shape

(100, 3)
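
The same check can be summarized in one table. This is a sketch that cross-tabulates true labels against predictions with `pd.crosstab`; the data is rebuilt here with hypothetical names (`frame`, `pred`, `table`), so the counts need not match the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(1234)
points = np.vstack([rng.normal(0, 1, size=(100, 2)),    # label 0: inliers
                    rng.uniform(-6, 6, size=(5, 2))])   # label 1: injected outliers
frame = pd.DataFrame(points, columns=['A', 'B'])
frame['label'] = np.r_[np.zeros(100), np.ones(5)]

pred = LocalOutlierFactor(n_neighbors=20, contamination=0.05).fit_predict(
    frame[['A', 'B']]
)

# Rows: true label, columns: predicted label (-1 outlier, +1 inlier)
table = pd.crosstab(frame['label'], pred, rownames=['label'], colnames=['pred'])
```

Reading `table` row by row shows how many injected outliers were flagged and how many inliers were flagged by mistake.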

In [6]:
df[predict != -1].reset_index(drop=True)

           A         B  label
0   0.354020 -0.035513    0.0
1  -0.121728  2.365769    0.0
2   0.015696 -2.242685    0.0
3  -0.334077  0.002118    0.0
4  -1.735349  1.210384    0.0
..       ...       ...    ...
95  0.680656 -1.818499    0.0
96 -1.027851 -0.584718    0.0
97  0.639633 -0.962029    0.0
98  1.104352 -0.431550    0.0
