### Data downloading and exploration

In [12]:
import yfinance as yf
data = yf.download("IBM", start="2021-03-24", end="2023-03-24")

[*********************100%***********************]  1 of 1 completed


In [13]:
data.head()

Date,Open,High,Low,Close,Adj Close,Volume
2021-03-24,125.191208,126.300194,124.827919,124.875717,113.396706,4189230
2021-03-25,124.598473,127.380501,124.063095,127.217972,115.523651,5809484
2021-03-26,127.428299,130.478012,127.265778,130.382416,118.397217,5823710
2021-03-29,130.0,131.042068,129.550674,129.885284,117.945763,4835344
2021-03-30,129.885284,130.277252,128.12619,128.79541,116.956093,5010758


In [14]:
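data.info  # no parentheses, so this shows the bound method's repr; data.info() would print the column summary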

<bound method DataFrame.info of                   Open        High         Low       Close   Adj Close  \
Date                                                                     
2021-03-24  125.191208  126.300194  124.827919  124.875717  113.396706   
2021-03-25  124.598473  127.380501  124.063095  127.217972  115.523651   
2021-03-26  127.428299  130.478012  127.265778  130.382416  118.397217   
2021-03-29  130.000000  131.042068  129.550674  129.885284  117.945763   
2021-03-30  129.885284  130.277252  128.126190  128.795410  116.956093   
...                ...         ...         ...         ...         ...   
2023-03-17  124.080002  124.519997  122.930000  123.690002  123.690002   
2023-03-20  124.309998  126.160004  124.190002  125.940002  125.940002   
2023-03-21  126.900002  127.150002  125.660004  126.570000  126.570000   
2023-03-22  127.000000  127.220001  124.010002  124.050003  124.050003   
2023-03-23  123.809998  124.930000  122.599998  123.370003  123.370003   

     

In [15]:
data.dtypes

Open         float64
High         float64
Low          float64
Close        float64
Adj Close    float64
Volume         int64
dtype: object

#### Initial Insight:
The data seems to fluctuate within a normal range, since the values in each column look stable over the period I picked at random.

### Finding the company’s anomalous behavior

Let's select "High", "Low", "Adj Close" and "Volume" as our features. I ignore "Open" and "Close" because they are just prices recorded at two points in the day; the day's highest and lowest prices already bracket them and are more informative.

In [16]:
features = data[["High","Low","Adj Close","Volume"]]
features.head()

Date,High,Low,Adj Close,Volume
2021-03-24,126.300194,124.827919,113.396706,4189230
2021-03-25,127.380501,124.063095,115.523651,5809484
2021-03-26,130.478012,127.265778,118.397217,5823710
2021-03-29,131.042068,129.550674,117.945763,4835344
2021-03-30,130.277252,128.12619,116.956093,5010758


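Let's use MinMaxScaler to rescale every feature to the [0, 1] range, so that the much larger "Volume" numbers do not dominate the distance computation.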

In [17]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
features_transformed = pd.DataFrame()
features_transformed[["High","Low","Adj Close","Volume"]] = scaler.fit_transform(features[["High","Low","Adj Close","Volume"]])
features_transformed.head(10)

,High,Low,Adj Close,Volume
0,0.270144,0.290136,0.110651,0.061897
1,0.299444,0.268525,0.164167,0.107664
2,0.383456,0.359022,0.236469,0.108066
3,0.398754,0.423585,0.22511,0.080148
4,0.378011,0.383334,0.200209,0.085103
5,0.337561,0.347946,0.168318,0.089674
6,0.317335,0.33606,0.167662,0.063936
7,0.388901,0.366586,0.226639,0.105221
8,0.361675,0.385225,0.189287,0.050551
9,0.343524,0.376851,0.204796,0.031494
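
To see what the scaler does to a column, here is a small sketch that applies the min-max formula (x - min) / (max - min) by hand. The helper `min_max_scale` is a hypothetical name of mine, not part of scikit-learn. The sample uses four Volume values from `data.head()` plus the 2023-03-17 Volume, so the minimum and maximum come from this sample only and the numbers will not match the scaled table above.

```python
def min_max_scale(values):
    """Rescale values to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Four Volume values from data.head() plus the 2023-03-17 Volume.
# Assumption: this small sample stands in for the full column.
volumes = [4_189_230, 5_809_484, 5_823_710, 4_835_344, 37_399_800]
scaled = min_max_scale(volumes)

for raw, s in zip(volumes, scaled):
    print(f"{raw:>10,d} -> {s:.4f}")
```

The one very large value takes the top of the range and pushes the ordinary days close to 0, which is the skew discussed in the improvements at the end.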


Let's implement the anomaly detection algorithm: fit k-means with three clusters and score each day by its distance to the nearest centroid.

In [18]:
from sklearn.cluster import KMeans
from scipy.spatial import distance

# Initialize a k-means clustering object 
kmeans_model = KMeans(n_clusters=3, random_state=0, n_init=1)

# Fit the K-means algorithm with the features above.
kmeans_model.fit(features_transformed.to_numpy())


centers = kmeans_model.cluster_centers_
print(centers)

dist_data = []

for row in features_transformed.to_numpy():
    # Compute the Euclidean distance from this row to each cluster centroid
    # and keep the smallest one, i.e. the distance to the nearest centroid.
    min_dist = min(distance.euclidean(center, row) for center in centers)
    dist_data.append(min_dist)

# Work on an explicit copy so that adding a column does not trigger SettingWithCopyWarning
features = features.copy()
features["distance_from_nearest_cluster"] = dist_data

[[0.24926017 0.24280325 0.23176888 0.10526709]
 [0.75035433 0.76394648 0.76289123 0.06621933]
 [0.52557059 0.53183814 0.4605111  0.08330827]]
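
The same nearest-centroid distance can probably be computed without an explicit loop. This is a sketch on synthetic data, not the IBM prices: `KMeans.transform` returns the distance from every row to every centroid, so taking the row-wise minimum gives the score. The names `demo`, `demo_out` and `nearest` and the random data are my assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the four features (assumption, not real prices).
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.normal(size=(200, 4)),
    columns=["High", "Low", "Adj Close", "Volume"],
)

scaled = MinMaxScaler().fit_transform(demo)
model = KMeans(n_clusters=3, random_state=0, n_init=10).fit(scaled)

# transform() has shape (n_samples, n_clusters): one distance per centroid.
nearest = model.transform(scaled).min(axis=1)

# Add the score to an explicit copy, which avoids SettingWithCopyWarning.
demo_out = demo.copy()
demo_out["distance_from_nearest_cluster"] = nearest
print(demo_out.sort_values("distance_from_nearest_cluster", ascending=False).head())
```

Writing to a copy keeps the original frame untouched and makes it clear which object receives the new column.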


### Printing the anomalous data points and writing insights 

In [19]:
# Sort the DataFrame so the rows farthest from their nearest centroid come first
result = features.sort_values(["distance_from_nearest_cluster"], ascending=[False])

result.head(20)

Date,High,Low,Adj Close,Volume,distance_from_nearest_cluster
2023-03-17,124.519997,122.93,123.690002,37399800,0.90573
2021-10-21,127.839386,122.466537,113.950417,32913959,0.778231
2022-07-19,132.559998,127.720001,126.178757,29690500,0.722652
2022-01-25,137.339996,128.300003,128.095688,19715700,0.443827
2022-12-13,153.210007,149.949997,148.742966,8811500,0.436122
2021-11-26,116.339996,114.559998,108.998985,3322000,0.423564
2021-11-19,116.559998,115.269997,109.224869,5380200,0.400018
2022-04-20,139.559998,133.380005,131.753281,17859200,0.395514
2021-07-02,140.487579,133.326965,122.915436,17584515,0.395402
2023-01-26,138.270004,132.979996,132.818558,17548500,0.388446


#### Insights
Explain your initial hypotheses, your findings, whether the method confirmed your hypothesis, and what future improvements could be made.

The result shows that my initial hypothesis was not accurate. From this DataFrame, we can see that November 2021 is an anomalous month, since 7 of the top 20 anomalous data points of the whole period fall in that month. 2023-03-17, 2021-10-21 and 2022-07-19 look extremely abnormal: their distance from the nearest cluster is at least 0.7, while every other day's distance is below 0.5, so these three days are obvious outliers.
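
A per-month count like the one above can be checked with a short sketch. It uses only the ten rows displayed in the result table, so its counts will not match the figure quoted for the top 20. The variable names and the 0.7 threshold as a hard cut-off are my assumptions.

```python
import pandas as pd

# The ten (date, distance) pairs visible in the sorted result table.
dates = pd.to_datetime([
    "2023-03-17", "2021-10-21", "2022-07-19", "2022-01-25", "2022-12-13",
    "2021-11-26", "2021-11-19", "2022-04-20", "2021-07-02", "2023-01-26",
])
scores = pd.Series(
    [0.90573, 0.778231, 0.722652, 0.443827, 0.436122,
     0.423564, 0.400018, 0.395514, 0.395402, 0.388446],
    index=dates,
    name="distance_from_nearest_cluster",
)

# Count how many of the listed days fall in each calendar month.
per_month = (
    scores.groupby(scores.index.to_period("M"))
    .size()
    .sort_values(ascending=False)
)

# Days whose distance is at least 0.7 (assumed cut-off for "extreme").
extreme = scores[scores >= 0.7]

print(per_month.head())
print(extreme)
```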

Possible future improvements:
- Use a better normalization/scaling method, since a single outlier in a column can skew min-max scaling considerably. In this DataFrame the scaling of "Volume" is not good enough: most scaled values are very small, which reduces the influence of "Volume" on the distances.
- Use the elbow method to choose k, since it is hard to judge here whether k = 3 is a good choice.
- Increase n_init. Running the algorithm more times with different centroid seeds lowers the chance of getting stuck in a local optimum.
- Inspect more of the results and do some data engineering to analyze them, since 20 rows may not be enough for a good conclusion.
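
The first three suggestions could be tried together along these lines. This is only a sketch on synthetic data: RobustScaler (which scales by median and interquartile range) is one possible outlier-resistant scaler, and the data, the range of k and `n_init=10` are my assumptions rather than tuned choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the four features (assumption, not real prices).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X[:5, 3] += 25  # a few volume-like outliers in the last column

# Median/IQR scaling is less affected by the outliers than min-max scaling.
X_scaled = RobustScaler().fit_transform(X)

# Elbow method: record the within-cluster sum of squares (inertia) per k.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

The k to pick is roughly where the printed inertia stops dropping sharply; plotting the values usually makes that bend easier to see than reading the numbers.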