# Anomaly Detection Using Unsupervised Learning

The goal of this project is to find out on which days the stock market data for Apple behaved strangely. To conduct this research I used data from the Yahoo Finance API via the yfinance package (https://pypi.org/project/yfinance/), specifically the market data for Apple during the time period 03/24/2021 to 03/24/2023. The columns chosen for analysis are Open (the price at which the stock first trades when the market opens), High (the highest price the stock trades at during the day), Low (the lowest price the stock trades at during the day), Close (the closing price of the stock), Adj Close (the closing price after adjustments for all applicable splits and dividend distributions), and Volume (the number of shares traded during a given period of time). Given my knowledge of the events of the past two years, my hypothesis is that the anomalies will occur during the early pandemic era of January to March 2021 or during the slight stock market crash in June/July 2022.

In [1]:
# Import yfinance and pandas
import yfinance as yf
import pandas as pd

# Download Apple's daily market data for the two-year window
# (yf.download already returns a pandas DataFrame, so no conversion is needed)
aapl_data = yf.download("AAPL", start="2021-03-24", end="2023-03-24")

# Display the data
aapl_data

[*********************100%***********************]  1 of 1 completed


Date,Open,High,Low,Close,Adj Close,Volume
2021-03-24,122.820000,122.900002,120.070000,120.089996,118.661629,88530500
2021-03-25,119.540001,121.660004,119.000000,120.589996,119.155693,98844700
2021-03-26,120.349998,121.480003,118.919998,121.209999,119.768333,94071200
2021-03-29,121.650002,122.580002,120.730003,121.389999,119.946190,80819200
2021-03-30,120.110001,120.400002,118.860001,119.900002,118.473907,85671900
...,...,...,...,...,...,...
2023-03-17,156.080002,156.740005,154.279999,155.000000,155.000000,98862500
2023-03-20,155.070007,157.820007,154.149994,157.399994,157.399994,73641400
2023-03-21,157.320007,159.399994,156.539993,159.279999,159.279999,73938300
2023-03-22,159.300003,162.139999,157.809998,157.830002,157.830002,75701800


In [11]:
aapl_data.duplicated().sum()

0

In [2]:
# exploratory analysis 
aapl_data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 504 entries, 2021-03-24 to 2023-03-23
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       504 non-null    float64
 1   High       504 non-null    float64
 2   Low        504 non-null    float64
 3   Close      504 non-null    float64
 4   Adj Close  504 non-null    float64
 5   Volume     504 non-null    int64  
dtypes: float64(5), int64(1)
memory usage: 27.6 KB


## Selecting Features

I selected five features - Open, High, Low, Adj Close, and Volume.

Open - This is the price at which the stock opens. It is an important measure because, together with the adjusted close value, it can be used to mark the change in the stock over a trading day.

High - The highest price the stock reached during the day. This can be an important signal if there are correlations between time and peak price.

Low - The lowest price the stock reached during the day. It can be used to measure discrepancies in value change.

Adj Close - Adjusted close is the closing price of the stock after adjustments for applicable splits and dividend distributions. I found this to be a more accurate value than the raw "Close".

Volume - This represents how many shares were traded during the day. It is often used as a sign of the "popularity" of the company as well as its overall health.

In [3]:
# Establishing features
features = aapl_data[["Open", "High", "Low", "Adj Close", "Volume"]]
features.head()

Date,Open,High,Low,Adj Close,Volume
2021-03-24,122.82,122.900002,120.07,118.661629,88530500
2021-03-25,119.540001,121.660004,119.0,119.155693,98844700
2021-03-26,120.349998,121.480003,118.919998,119.768333,94071200
2021-03-29,121.650002,122.580002,120.730003,119.94619,80819200
2021-03-30,120.110001,120.400002,118.860001,118.473907,85671900


In [4]:
# Drop rows with missing values and convert every column to float.
# astype returns a new DataFrame, which avoids assigning into a slice of aapl_data.
features = features.dropna().astype(float)

# Print the Open column as a list to inspect the values
print(features["Open"].tolist())

features.head()

[122.81999969482422, 119.54000091552734, 120.3499984741211, 121.6500015258789, 120.11000061035156, 121.6500015258789, 123.66000366210938, 123.87000274658203, 126.5, 125.83000183105469, 128.9499969482422, 129.8000030517578, 132.52000427246094, 132.44000244140625, 134.94000244140625, 133.82000732421875, 134.3000030517578, 133.50999450683594, 135.02000427246094, 132.36000061035156, 133.0399932861328, 132.16000366210938, 134.8300018310547, 135.00999450683594, 134.30999755859375, 136.47000122070312, 131.77999877929688, 132.0399932861328, 131.19000244140625, 129.1999969482422, 127.88999938964844, 130.85000610351562, 129.41000366210938, 123.5, 123.4000015258789, 124.58000183105469, 126.25, 126.81999969482422, 126.55999755859375, 123.16000366210938, 125.2300033569336, 127.81999969482422, 126.01000213623047, 127.81999969482422, 126.95999908447266, 126.44000244140625, 125.56999969482422, 125.08000183105469, 124.27999877929688, 124.68000030517578, 124.06999969482422, 126.16999816894531, 126.59999

Date,Open,High,Low,Adj Close,Volume
2021-03-24,122.82,122.900002,120.07,118.661629,88530500.0
2021-03-25,119.540001,121.660004,119.0,119.155693,98844700.0
2021-03-26,120.349998,121.480003,118.919998,119.768333,94071200.0
2021-03-29,121.650002,122.580002,120.730003,119.94619,80819200.0
2021-03-30,120.110001,120.400002,118.860001,118.473907,85671900.0


## Scale Features

Scaling features is essential in unsupervised machine learning, especially for algorithms that calculate distances between data points. If the data is not scaled, then the features with larger values (here Volume, which is in the tens of millions, compared with prices in the low hundreds) dominate the calculated distances. Scaling leads to results that are more interpretable and easier to understand. For this assignment I used the MinMaxScaler class because it maps each feature to the range 0 to 1 while preserving the original shape of the distribution, so outliers, which are what I am trying to find, remain outliers after scaling.

In [5]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Scale each feature to the range [0, 1]
feature_columns = ["Open", "High", "Low", "Adj Close", "Volume"]
scaler = MinMaxScaler()
features_transformed = pd.DataFrame(
    scaler.fit_transform(features[feature_columns]),
    columns=feature_columns,
)
features_transformed.head()

,Open,High,Low,Adj Close,Volume
0,0.051989,0.039974,0.02008,0.003018,0.332849
1,0.0,0.020147,0.002323,0.010959,0.397217
2,0.012839,0.017269,0.000996,0.020807,0.367427
3,0.033444,0.034858,0.031032,0.023666,0.284724
4,0.009035,0.0,0.0,0.0,0.315009
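
To make the effect of scaling concrete, here is a minimal sketch in plain Python rather than scikit-learn. The helper `min_max_scale` is a hypothetical name introduced for illustration; it applies the usual min-max formula `(x - min) / (max - min)` to one column. The numbers are the first five Open and Volume values from the table above, so the scaled values differ from the notebook's output, where the minimum and maximum are taken over all 504 rows.

```python
import math

def min_max_scale(values):
    """Rescale a list of numbers linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# First five Open and Volume values from the features table
open_prices = [122.82, 119.54, 120.35, 121.65, 120.11]
volumes = [88530500, 98844700, 94071200, 80819200, 85671900]

scaled_open = min_max_scale(open_prices)
scaled_volume = min_max_scale(volumes)

# Distance between the first two rows before scaling: Volume swamps Open
raw_gap = math.dist((open_prices[0], volumes[0]),
                    (open_prices[1], volumes[1]))

# Same two rows after scaling: both features contribute on a comparable scale
scaled_gap = math.dist((scaled_open[0], scaled_volume[0]),
                       (scaled_open[1], scaled_volume[1]))

print(raw_gap, scaled_gap)
```

Before scaling, the distance between the two rows is essentially just the difference in Volume; after scaling, the price difference and the volume difference both matter.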


## Implement Anomaly Detection

Because the data is unlabeled, the chosen algorithm is K-means (scikit-learn's KMeans class). This algorithm is used to find groups that have not been explicitly labeled in the data. The number of clusters to form, and therefore the number of centroids to generate, is set to 2, while the number of times the k-means algorithm is run with different centroid seeds (n_init) is set to 1.

In [6]:
# Import the KMeans class
from sklearn.cluster import KMeans

# Initialize a K-means clustering object
kmeans_model = KMeans(n_clusters = 2, random_state = 0, n_init = 1)

# Convert the scaled features to a NumPy array
features_transformed_numpy = features_transformed.to_numpy()

kmeans_model.fit(features_transformed_numpy)

center = kmeans_model.cluster_centers_
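
For intuition about what the fit step does, the following is a simplified sketch of the basic k-means loop (assign each point to its nearest center, then move each center to the mean of its points) in plain Python. It is not scikit-learn's implementation, which by default uses k-means++ initialization and, when n_init is greater than 1, keeps the run with the lowest inertia. The function names, the toy points, and the starting centers are all assumptions made for illustration.

```python
import math

def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def update(points, labels, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            # Keep the center of an empty cluster where it is
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centers

def k_means(points, centers, max_iter=100):
    """Alternate assignment and update steps until the centers stop moving."""
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# Toy data: two well-separated groups (hypothetical values)
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, labels = k_means(points, centers=[(0.0, 0.0), (10.0, 9.0)])
print(centers, labels)
```

With a single run (n_init = 1), the result depends on the starting centers, which is why the notebook fixes random_state for reproducibility.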

Euclidean distance measures the straight-line distance between two real-valued vectors, such as the rows of the Yahoo Finance data set. It is often used when calculating the distance between rows of data containing numerical values. Every feature in the data set is a floating-point number, which is why this function was chosen.

In [7]:
from scipy.spatial.distance import euclidean

# For every scaled row, keep the distance to the closer of the two cluster centers
distance_from_nearest_cluster = []

for row in features_transformed_numpy:
    dist_1 = euclidean(row, center[0])
    dist_2 = euclidean(row, center[1])
    min_dist = min([dist_1, dist_2])
    distance_from_nearest_cluster.append(min_dist)

# Attach the score to the unscaled features so each date keeps its original values
features["distance_from_nearest_cluster"] = distance_from_nearest_cluster

features.head()

Date,Open,High,Low,Adj Close,Volume,distance_from_nearest_cluster
2021-03-24,122.82,122.900002,120.07,118.661629,88530500.0,0.653855
2021-03-25,119.540001,121.660004,119.0,119.155693,98844700.0,0.700439
2021-03-26,120.349998,121.480003,118.919998,119.768333,94071200.0,0.687263
2021-03-29,121.650002,122.580002,120.730003,119.94619,80819200.0,0.647038
2021-03-30,120.110001,120.400002,118.860001,118.473907,85671900.0,0.704653
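
The anomaly score used above, the Euclidean distance from a row to its nearest cluster center, can be checked by hand on a small case. The sketch below uses the standard library's `math.dist` instead of SciPy; the centers and rows are hypothetical values chosen so the arithmetic is easy to verify, and the helper name is introduced only for illustration.

```python
import math

def distance_to_nearest_center(row, centers):
    """Euclidean distance from one row to the closest cluster center."""
    return min(math.dist(row, c) for c in centers)

# Hypothetical cluster centers and rows, not taken from the Apple data
centers = [(0.0, 0.0), (3.0, 4.0)]
rows = [(0.0, 0.0), (3.0, 0.0), (6.0, 8.0)]

scores = [distance_to_nearest_center(r, centers) for r in rows]
print(scores)  # [0.0, 3.0, 5.0]
```

The row (6, 8) is 10 units from the first center and 5 from the second, so its score is 5; the larger the score, the less a row resembles either cluster.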


## Sort Anomalies 

In this step I detected the top five anomalous points by sorting the rows by their distance from the nearest cluster center in descending order.

In [8]:
features_sorted = features.sort_values(['distance_from_nearest_cluster'],ascending=[False])

features_sorted.head(5)

Date,Open,High,Low,Adj Close,Volume,distance_from_nearest_cluster
2021-03-30,120.110001,120.400002,118.860001,118.473907,85671900.0,0.704653
2021-03-25,119.540001,121.660004,119.0,119.155693,98844700.0,0.700439
2021-03-26,120.349998,121.480003,118.919998,119.768333,94071200.0,0.687263
2021-03-31,121.650002,123.519997,121.150002,120.697144,118323800.0,0.671351
2021-12-17,169.929993,173.470001,169.690002,169.893066,195432700.0,0.668669


In [10]:
# Display the full ranking (sort_values already returns a DataFrame)
features_sorted

Date,Open,High,Low,Adj Close,Volume,distance_from_nearest_cluster
2021-03-30,120.110001,120.400002,118.860001,118.473907,85671900.0,0.704653
2021-03-25,119.540001,121.660004,119.000000,119.155693,98844700.0,0.700439
2021-03-26,120.349998,121.480003,118.919998,119.768333,94071200.0,0.687263
2021-03-31,121.650002,123.519997,121.150002,120.697144,118323800.0,0.671351
2021-12-17,169.929993,173.470001,169.690002,169.893066,195432700.0,0.668669
...,...,...,...,...,...,...
2022-01-20,166.979996,169.679993,164.179993,163.311386,91420500.0,0.043928
2022-10-07,142.539993,143.100006,139.449997,139.644775,85925600.0,0.039997
2021-10-05,139.490005,142.240005,139.360001,139.877716,80861100.0,0.037259
2021-10-13,141.240005,141.399994,139.199997,139.679474,78762700.0,0.037226
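
The ranking step can also be written without a full sort. As a small sketch, the standard library's `heapq.nlargest` picks the k largest scores directly; the dictionary below copies a few date and distance pairs from the sorted table above, and `top_anomalies` is a hypothetical helper name.

```python
import heapq

# Date -> distance_from_nearest_cluster, copied from the sorted table
distances = {
    "2022-01-20": 0.043928,
    "2021-03-25": 0.700439,
    "2021-12-17": 0.668669,
    "2021-03-30": 0.704653,
    "2022-10-07": 0.039997,
    "2021-03-31": 0.671351,
    "2021-03-26": 0.687263,
}

def top_anomalies(scores, k):
    """Return the k (date, distance) pairs with the largest distances."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

for date, dist in top_anomalies(distances, 5):
    print(date, dist)
```

The cutoff of five is a choice rather than a property of the data; a different k would flag a different number of days.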


## Discussion

The dates with the top five anomalies for Apple stock - March 25, March 26, March 30, and March 31, 2021, and December 17, 2021 - are mostly set within the same time period: four of the five fall in a single week at the end of March 2021. This is not unexpected, as the COVID pandemic was still weighing on the American economy in spring 2021. Lockdowns, job losses, and disruptions to global trade during the pandemic put pressure on the stock market, and these factors, as well as others, pulled Apple's stock down. However, the decline was temporary, as stimulus checks and the increased demand for video games, TVs, computers, etc. from people who were stuck at home allowed Apple to increase its profits. December 17, 2021 stands apart from the March dates: it has by far the highest trading volume of the five, at about 195 million shares.

The March anomalies aligned with my hypothesis about Apple's stock prices across the past two years, since they fall in the pandemic period when fear and stress heavily impacted the financial markets. It is worth noting that these dates are also the first days of the sample, when prices were near the bottom of the two-year range (their scaled values are close to 0), which places them far from both cluster centers.

In the future, I would incorporate other factors such as EV/EBITDA or CF that would allow for a more detailed analysis of the company's health. Stock prices are often temporary measures and, unlike variables found on the Balance Sheet or Income Statement, do not reveal what causes them to dip or increase. Moreover, I would try different clustering algorithms to see if they produce different results. This is a very interesting topic, and I plan on putting more time into experimenting with the exact formulas to find one that is even more accurate.