<a href="https://colab.research.google.com/github/ganeshbio/CDS-B1-G9/blob/main/M5_NB_MiniProject_4_Stock_Prices_Anomaly_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Program in Computational Data Science

## A program by IISc and TalentSprint

### Mini Project Notebook: Stock Prices Anomaly Detection

## Learning Objectives

At the end of the experiment, you will be able to:

* apply PCA-based analysis to data for various stocks
* analyze and create time series data
* implement LSTM auto-encoders
* detect the anomalies based on the loss


## Information

Autoencoder neural networks try to learn a compact representation of their input. Usually, we want an efficient encoding that uses fewer parameters and less memory, yet still allows the output to be reconstructed close to the original input. In a sense, we force the model to learn the most important features of the data using as few parameters as possible.

An LSTM autoencoder is an autoencoder built on an LSTM encoder-decoder architecture: the encoder compresses the input sequence, and the decoder reconstructs the original sequence from that compressed representation.

**Anomaly Detection**

Anomaly detection is the task of identifying rare events or data points. Applications include bank fraud detection, tumor detection in medical imaging, and finding errors in written text.

Many supervised and unsupervised approaches to anomaly detection have been proposed, including one-class SVMs, Bayesian networks, cluster analysis, and neural networks.

We will use an LSTM Autoencoder Neural Network to detect/predict anomalies (sudden price changes) in the S&P 500 index.

## Dataset



This mini-project consists of two parts and two different stock price datasets:

### PART A

Using the **S&P 500 stock prices data of different companies**, we will perform a PCA-based analysis.

### PART B

Using the **S&P 500 stock price index time series data**, we will perform anomaly detection on the stock prices across the years. The dataset chosen is the S&P 500 Daily Index, a .csv file with two columns: a daily timestamp and the raw, unadjusted closing price for each day. This long-term, granular time series gives researchers a good-sized, publicly available financial dataset for exploring time series trends or for use in a quantitative finance project.

## Problem Statement

Detect stock price anomalies by implementing an LSTM autoencoder.

## Grading = 10 Points

In [1]:
#@title Download dataset
!wget -qq https://cdn.iisc.talentsprint.com/CDS/MiniProjects/SPY.csv
!wget -qq https://cdn.iisc.talentsprint.com/CDS/MiniProjects/prices.csv

### Import required packages

In [67]:
import keras
from keras.layers import Activation, Dense, Dropout, Flatten
from keras.layers import LSTM, RepeatVector, TimeDistributed
from keras.layers import BatchNormalization  # the keras.layers.normalization path is not available in recent Keras releases
from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA
from keras.models import Sequential, Model
import tensorflow as tf
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## PCA Analysis (PART-A)

Principal Component Analysis (PCA) decomposes the data into a set of vectors called principal components. These components are linear combinations of the input features that try to explain as much of the variance in the data as possible. By convention, the principal components are ordered by the amount of variance they explain, with the first principal component explaining the most.
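
As a rough illustration of this ordering, the sketch below computes principal components with a plain SVD on a small synthetic array. The toy data and all variable names are assumptions made for this example; they are not part of the project datasets.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 samples, 3 features with very different spreads
X = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])

# Centre the features, then take the SVD; the rows of Vt are the principal directions
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance explained by each component, as a fraction of the total variance
explained_variance = S**2 / (X.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Projection of the data onto the principal components (the "scores")
scores = X_centered @ Vt.T

print(explained_variance_ratio)
```

The ratios come out in decreasing order and sum to 1, which mirrors what `pca.explained_variance_ratio_` reports in scikit-learn when all components are kept.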

Perform a PCA-based analysis of the stock prices data from different companies.

Hint: Refer to the article [here](https://towardsdatascience.com/stock-market-analytics-with-pca-d1c2318e3f0e).


### Load and pre-process the prices data

In [98]:

prices_path = "prices.csv"
sp_perform_path="SPY.csv"
# YOUR CODE HERE
prices=pd.read_csv(prices_path)
sp_perform=pd.read_csv(sp_perform_path)
prices.head()

Unnamed: 0,A,AAL,AAP,AAPL,ABBV,ABC,ABMD,ABT,ACN,ADBE,ADI,ADM,ADP,ADSK,AEE,AEP,AES,AFL,AIG,AIZ,AJG,AKAM,ALB,ALGN,ALK,ALL,ALLE,AMAT,AMCR,AMD,AME,AMGN,AMP,AMT,AMZN,ANET,ANSS,ANTM,AON,AOS,...,V,VFC,VIAC,VLO,VMC,VNO,VRSK,VRSN,VRTX,VTR,VTRS,VZ,WAB,WAT,WBA,WDC,WEC,WELL,WFC,WHR,WLTW,WM,WMB,WMT,WRB,WRK,WST,WU,WY,WYNN,XEL,XLNX,XOM,XRAY,XYL,YUM,ZBH,ZBRA,ZION,ZTS
0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,85.017082,28.982893,157.17189,74.207466,81.950539,82.938141,168.809998,84.622925,204.91095,334.429993,116.998917,44.048424,164.65007,187.830002,73.122215,88.918884,19.133558,51.241142,49.00322,128.423721,93.23159,87.639999,71.061058,283.679993,67.785583,108.96843,123.176743,61.148903,9.971735,49.099998,99.71846,230.298279,163.271606,222.032486,1898.01001,204.720001,259.589996,295.028076,205.982056,46.387165,...,189.381256,96.426582,40.090492,86.901367,140.557083,60.242638,149.663574,196.729996,219.449997,52.25222,20.504297,56.621544,80.132065,235.059998,55.568127,64.771545,86.955795,75.1642,51.397491,141.787598,200.223602,111.092049,20.954809,116.044952,67.74073,41.771225,151.029617,24.689568,28.591002,142.405029,59.934875,100.115349,63.633118,56.203197,78.396255,99.349007,147.579269,259.140015,49.903751,132.803421
2,83.652077,27.548195,157.181747,73.486023,81.172668,81.895088,166.820007,83.591301,204.569687,331.809998,114.939316,43.962448,164.302048,184.949997,73.151054,88.823746,18.913849,50.885765,48.624527,129.034988,93.036346,87.239998,70.043243,280.440002,66.542633,108.978081,121.654099,60.175625,9.840403,48.599998,99.856964,228.734818,161.235901,222.139359,1874.969971,200.850006,256.970001,291.11557,205.173096,45.979324,...,187.875092,95.82048,39.480896,83.60363,139.946487,60.711697,151.119431,200.880005,217.979996,52.882324,20.057474,56.0187,79.518326,231.009995,55.568127,63.774597,87.502457,76.486641,51.081936,139.37114,200.272614,112.171654,21.044018,115.020508,67.829132,41.030972,151.427734,24.604725,28.639641,140.292755,60.223114,97.810677,63.12154,55.581242,78.857689,99.037834,147.193512,256.049988,49.199547,132.823227
3,83.899353,27.21941,154.598541,74.071579,81.813271,83.094116,179.039993,84.029251,203.233841,333.709991,113.588921,43.618546,164.524399,187.119995,73.218369,89.11869,19.133558,50.741692,48.662395,129.151443,93.465851,87.550003,69.964951,285.880005,66.224434,109.296829,121.428162,58.877934,9.774739,48.389999,99.965782,230.490112,161.766525,222.08107,1902.880005,202.860001,254.589996,294.616272,205.76503,46.270645,...,187.468826,95.44529,40.06192,83.612755,139.700287,61.502666,151.466049,202.740005,224.029999,53.429035,20.722746,55.898129,79.191666,228.880005,56.04781,62.550629,87.56958,77.649673,50.77594,140.679672,200.517624,112.755219,21.32056,114.786362,67.730896,40.406082,151.507355,24.369047,28.581272,140.015091,60.136642,95.771927,63.606186,55.88728,78.347168,98.9795,146.342834,258.01001,48.60001,131.803482
4,84.156532,27.119778,152.764648,73.723213,81.34655,82.499466,180.350006,83.562103,198.846008,333.390015,116.173134,43.093136,162.532974,187.5,73.487579,89.137718,19.200424,50.261456,48.387844,128.278198,92.460426,90.199997,70.884911,283.059998,65.806808,108.359932,119.974274,60.578697,9.690312,48.25,100.539551,228.322357,159.817673,217.348892,1906.859985,204.850006,256.670013,293.723907,204.018814,45.959904,...,186.973373,94.973892,40.128597,84.708961,138.124527,60.757683,152.733719,203.210007,223.789993,52.984257,21.020628,55.276722,79.16198,231.979996,55.76564,66.785164,87.233925,77.143188,50.355194,140.727402,200.105988,112.833038,21.32056,113.722878,66.493172,39.954247,151.119171,24.482172,28.396437,140.679504,60.011738,97.958405,63.085632,56.183449,78.052628,99.154533,146.214264,256.470001,48.305,132.248978


In [99]:
print(prices.isna().sum(), "\n\n")
prices.shape


A       1
AAL     1
AAP     1
AAPL    1
ABBV    1
       ..
YUM     1
ZBH     1
ZBRA    1
ZION    1
ZTS     1
Length: 503, dtype: int64 




(394, 503)

In [100]:
prices= prices.drop(labels=0, axis=0)
prices.shape

(393, 503)

In [101]:
null_columns=prices.columns[prices.isnull().any()]
prices[null_columns].isnull().sum()


CARR     53
OGN     344
OTIS     53
dtype: int64

In [102]:
prices['CARR']=prices['CARR'].fillna(0)
prices['OGN']=prices['OGN'].fillna(0)
prices['OTIS']=prices['OTIS'].fillna(0)

prices.shape

(393, 503)

In [None]:
# Considering the COVID period from 6th November 2019, extract the dates from the S&P data
# and add them as a new column
#insert_date=sp_perform.tail(394).Date
#insert_date=insert_date.reset_index(drop=True)
#prices_new=pd.DataFrame(prices)
#prices_new=prices_new.insert(0, 'Date', list_date)


In [109]:
from sklearn.preprocessing import StandardScaler
features = prices.columns
x = prices.loc[:, features].values
x = StandardScaler().fit_transform(x) # standardizing the features (zero mean, unit variance)
x.shape

(393, 503)

In [171]:
# Find the smallest number of principal components that preserves a target fraction of the variance
def find_min_pca(X_train, percentage=0.8):
    """Return the fewest components whose cumulative explained variance ratio reaches `percentage`."""
    from sklearn.decomposition import PCA

    # Fit PCA once with all components; the ratios come out sorted in decreasing order
    pca_full = PCA().fit(X_train)
    cumulative_var = np.cumsum(pca_full.explained_variance_ratio_)

    # Index of the first cumulative ratio >= percentage, converted to a 1-based count
    comp = int(np.searchsorted(cumulative_var, percentage)) + 1

    # Guard against floating-point round-off when percentage is (almost) 1.0
    return min(comp, len(cumulative_var))
min_comps=find_min_pca(x,0.99)
min_comps

31

In [172]:
pca = PCA(n_components=min_comps)  # 31 components preserve 99% of the variance
components = pca.fit_transform(x)


In [173]:
# Explained variance ratio of each retained component
pca.explained_variance_ratio_

array([6.82559613e-01, 1.40548527e-01, 4.44721480e-02, 3.59782430e-02,
       2.05254105e-02, 1.23441231e-02, 9.34984878e-03, 6.92115146e-03,
       5.24667234e-03, 4.59993239e-03, 3.50290057e-03, 2.92957033e-03,
       2.79552721e-03, 2.31890802e-03, 1.87482427e-03, 1.61455085e-03,
       1.60806619e-03, 1.28221850e-03, 1.19963260e-03, 1.11635642e-03,
       1.02105633e-03, 8.88504807e-04, 8.06617370e-04, 7.56796810e-04,
       6.84893315e-04, 6.52358492e-04, 5.95131638e-04, 5.51217851e-04,
       5.04408170e-04, 4.58073397e-04, 4.16620496e-04])

### Apply PCA (3 points)

* Plot the explained variance ratio. Hint: `pca.explained_variance_ratio_`
* Project the data onto the components that preserve the most information and plot them to visualize
* Compute the daily returns of the 500 company stocks. Hint: See the following [reference](https://towardsdatascience.com/stock-market-analytics-with-pca-d1c2318e3f0e).
* Plot the stocks with the most negative and least negative PCA weights in the pandemic period (year 2020), using the same reference. Discuss which industrial sectors were least and most impacted, based on these stocks.
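
The daily-returns and PCA-weights steps listed above might look roughly like the sketch below. It runs on simulated prices for five hypothetical tickers (`AAA` to `EEE`), not on `prices.csv`, and the one-factor simulation is only an assumption chosen so that the first component has an obvious interpretation.

```python
import numpy as np

rng = np.random.default_rng(42)
tickers = ["AAA", "BBB", "CCC", "DDD", "EEE"]  # hypothetical tickers, not from prices.csv

# Simulate 250 days of prices driven by one shared market factor plus stock-specific noise
market = rng.normal(0.0, 0.01, size=(250, 1))
betas = np.array([0.2, 0.6, 1.0, 1.4, 1.8])
true_returns = market * betas + rng.normal(0.0, 0.002, size=(250, 5))
toy_prices = 100.0 * np.cumprod(1.0 + true_returns, axis=0)

# Daily return: r_t = p_t / p_{t-1} - 1 (the first day has no return)
daily_returns = toy_prices[1:] / toy_prices[:-1] - 1.0

# First principal component of the centred returns
centered = daily_returns - daily_returns.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
pc1_weights = Vt[0]

# The sign of a principal component is arbitrary; flip it so the average weight is positive
if pc1_weights.mean() < 0:
    pc1_weights = -pc1_weights

# Rank tickers from the smallest to the largest weight on the first component
order = np.argsort(pc1_weights)
for idx in order:
    print(f"{tickers[idx]}: {pc1_weights[idx]:+.3f}")
```

On the real data, the same ranking applied to the first component of the daily returns gives the stocks at the two extremes, which can then be plotted and mapped to their sectors.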

In [175]:
# YOUR CODE HERE
import plotly.express as px
exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)

px.area(
    x=range(1, exp_var_cumul.shape[0] + 1),
    y=exp_var_cumul,
    labels={"x": "# Components", "y": "Explained Variance"}
)

#### Apply t-SNE and visualize with a graph

## Anomaly Detection (PART-B)

### Load and Preprocess the data

* Inspect the S&P 500 Index Data

In [None]:
path = 'SPY.csv'

In [None]:
# YOUR CODE HERE

#### Data Preprocessing

In [None]:
# YOUR CODE HERE

### Create time series data (1 point)

Select a variable (column) from the data and split it into sequences using a fixed window size.

Refer [LSTM Autoencoder](https://medium.com/swlh/time-series-anomaly-detection-with-lstm-autoencoders-7bac1305e713)
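
One possible way to do the windowing is sketched below. The helper name `create_sequences` and the window length are assumptions for illustration; the input is a short stand-in array, not the actual closing-price column.

```python
import numpy as np

def create_sequences(values, time_steps=30):
    """Split a 1-D series into overlapping windows of length `time_steps`.

    Returns an array of shape (n_windows, time_steps, 1), the
    (samples, time steps, features) layout that LSTM layers expect.
    """
    values = np.asarray(values, dtype=float)
    if values.ndim != 1:
        raise ValueError("values must be one-dimensional")
    n_windows = len(values) - time_steps + 1
    if n_windows < 1:
        raise ValueError("series is shorter than the window size")
    # Window i covers values[i], ..., values[i + time_steps - 1]
    windows = np.stack([values[i:i + time_steps] for i in range(n_windows)])
    return windows[..., np.newaxis]

series = np.arange(10.0)  # hypothetical stand-in for a scaled closing-price column
X = create_sequences(series, time_steps=4)
print(X.shape)  # (7, 4, 1)
```

Consecutive windows overlap by all but one value, so a series of length `n` yields `n - time_steps + 1` windows.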

In [None]:
# YOUR CODE HERE

### Build an LSTM Autoencoder (2 points)

The autoencoder should take a sequence as input and output a sequence of the same shape.

Hint: [LSTM Autoencoder](https://medium.com/swlh/time-series-anomaly-detection-with-lstm-autoencoders-7bac1305e713)
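
A minimal sketch of such a model is shown below, assuming TensorFlow's bundled Keras is installed. The function name `build_lstm_autoencoder`, the layer size, the window length of 30 and the MAE loss are illustrative assumptions, not requirements of the project.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm_autoencoder(time_steps, n_features, latent_units=64):
    """Sequence-to-sequence autoencoder: input and output share the same shape."""
    model = keras.Sequential([
        keras.Input(shape=(time_steps, n_features)),
        # Encoder: compress the whole window into one vector of size latent_units
        layers.LSTM(latent_units),
        # Repeat that vector once per time step so the decoder can unroll it
        layers.RepeatVector(time_steps),
        # Decoder: emit one hidden state per time step
        layers.LSTM(latent_units, return_sequences=True),
        # Map each hidden state back to the original number of features
        layers.TimeDistributed(layers.Dense(n_features)),
    ])
    model.compile(optimizer="adam", loss="mae")
    return model

model = build_lstm_autoencoder(time_steps=30, n_features=1)
model.summary()
```

The `RepeatVector` layer is what bridges the encoder's single summary vector and the decoder's per-step output; dropout or stacked LSTM layers can be added once this baseline trains.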

In [None]:
# YOUR CODE HERE

### Train the Autoencoder (1 point)

* Compile and fit the model with required parameters

In [None]:
# YOUR CODE HERE

#### Plot metrics and evaluate the model (1 point)

In [None]:
# YOUR CODE HERE

### Detect Anomalies in the S&P 500 Index Data (2 points)

* Predict the data and calculate the loss
* Define threshold and detect the anomalies
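
The two steps above can be sketched as follows. Random arrays stand in for the input windows and for the output of `model.predict`, and the threshold rule (mean plus three standard deviations of the loss) is only one common choice, assumed here for illustration.

```python
import numpy as np

def reconstruction_loss(X, X_pred):
    """Mean absolute error per window, averaged over time steps and features."""
    return np.mean(np.abs(X_pred - X), axis=(1, 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 30, 1))                   # hypothetical input windows
X_pred = X + rng.normal(scale=0.05, size=X.shape)   # stand-in for model.predict(X)
X_pred[40] += 3.0                                   # one window that is reconstructed badly

loss = reconstruction_loss(X, X_pred)

# Assumed threshold rule: flag windows far above the typical reconstruction loss
threshold = loss.mean() + 3 * loss.std()
anomalies = np.flatnonzero(loss > threshold)

print("threshold:", threshold)
print("anomalous windows:", anomalies)
```

In practice the threshold is usually derived from the loss on the training windows, and each flagged window index is mapped back to its date for plotting.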

Discuss the impact of the COVID-19 pandemic on stock prices in terms of the anomalies detected during the pandemic period.

In [None]:
# YOUR CODE HERE

### Report Analysis

* Discuss the results of t-SNE and PCA
* Discuss the results of the LSTM autoencoder