## Pairs trading on different portfolios based on Machine Learning

Statistical arbitrage is a class of trading strategies that use quantitative and statistical models to generate trading signals. Pairs trading is one such technique. Here, it is investigated using an LSTM (Long Short-Term Memory) model, a machine learning model, to forecast stock prices.
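The trading rule itself is not stated in this section. As a rough illustration of how a pairs-trading signal can be generated, the sketch below uses a common textbook formulation: compute the spread between two price series, standardise it to a z-score, and trade when the z-score crosses a threshold. The function names, the hedge ratio of 1.0, the entry threshold of 2.0, and the prices are all assumptions made for illustration.

```python
import statistics

def spread_zscores(prices_a, prices_b, hedge_ratio=1.0):
    """Z-score of the spread a - hedge_ratio * b over the whole sample."""
    spread = [a - hedge_ratio * b for a, b in zip(prices_a, prices_b)]
    mean = statistics.fmean(spread)
    std = statistics.pstdev(spread)
    return [(s - mean) / std for s in spread]

def signal(z, entry=2.0):
    """Hypothetical rule: sell the spread when it is high, buy it when it is low."""
    if z > entry:
        return 'short A / long B'
    if z < -entry:
        return 'long A / short B'
    return 'no trade'

# Illustrative prices only, not market data
prices_a = [10.0, 10.2, 10.1, 10.3, 12.5, 10.2]
prices_b = [10.0, 10.1, 10.2, 10.2, 10.3, 10.1]

z = spread_zscores(prices_a, prices_b)
signals = [signal(v) for v in z]
print(signals)
```

The z-score here uses the mean and standard deviation of the full sample, which is enough to show the idea; a backtest would normally use a rolling window so that each signal depends only on past data.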

### Different steps in this project

1. Selecting 20 different stocks
2. Categorizing stocks into different types (aggressive or defensive types) 
3. Identifying pairs with a cointegration test
4. Constructing the portfolio
5. Forecasting stock prices with an LSTM model
6. Calculating Trading Profits 

### Why LSTM?

Deep learning models are built from neural networks, which can approximate a very wide class of non-linear functions. Because stock price series often behave non-linearly, a neural network can be a better choice for predicting them than a regular linear framework. LSTM is a recurrent neural network architecture designed for sequential data, which makes it a natural candidate for time series.

Merits of using neural networks:
* Neural networks are designed to detect non-linearities that exist within the data.
* Neural networks can be updated incrementally: when new data is added, training can continue from the existing weights rather than retraining the entire model from scratch, which reduces the computational load.

### Neural Networks

A neural network is composed of an input layer, several hidden layers, and an output layer. The input layer holds the input values, the hidden layers process these values through non-linear functions, and the final result is produced at the output layer. Artificial neurons are the building blocks of each layer.
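The layer structure described above can be sketched as a single forward pass. This is a minimal illustration in plain Python, assuming one hidden layer with a `tanh` activation and a linear output; the helper `dense` and all weights are hypothetical values chosen for the example, not trained parameters.

```python
import math

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each neuron."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs

def identity(z):
    return z

# Hypothetical weights chosen for illustration, not trained values
hidden_w = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]  # 3 neurons, 2 inputs each
hidden_b = [0.0, 0.1, -0.1]
output_w = [[1.0, -1.0, 0.5]]                      # 1 neuron, 3 inputs
output_b = [0.2]

x = [1.0, 2.0]                                     # input layer
hidden = dense(x, hidden_w, hidden_b, math.tanh)   # hidden layer (non-linear)
y = dense(hidden, output_w, output_b, identity)    # output layer
print(hidden, y)
```

Each neuron computes a weighted sum of its inputs plus a bias and passes it through an activation function; the non-linear `tanh` in the hidden layer is what lets the network represent non-linear relationships.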

## Importing Libraries

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
%matplotlib inline  
import yfinance as yf 

#### Selecting 20 different stocks

In [2]:
tickers = ['AMD','HPQ','CSCO','INTC','ORCL','MSFT','SNE','DIS','CEA','ERIC',
           'UN','WMT','PCG','JNJ','NEE','AEP','COKE','PEP','MCD','MRK']

# Only AMD is downloaded in this run; `tickers` above holds the full list of 20
t = ['AMD']
start_date = '2008-01-01'
end_date = '2018-01-01'  

In [3]:
data_frames = {}
for tick in t:
    # Download daily price and volume data for one ticker
    df = yf.download(tick, start=start_date, end=end_date)
    if df.empty:
        # Skip tickers for which no data was returned
        continue
    # Prefix every column with the ticker, e.g. 'Close' -> 'AMD_Close'
    data_frames[f'df_{tick}'] = df.add_prefix(f'{tick}_')

[*********************100%%**********************]  1 of 1 completed


In [4]:
len(data_frames) 

1

In [5]:
data_frames['df_AMD'].shape 

(2518, 6)

In [6]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler() 

#### Scaling each DataFrame and storing the scaled arrays and DataFrames in dictionaries

In [7]:
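scaled_data = {}
scaled_df = {}
for name, df in data_frames.items():
    # Standardise every column to zero mean and unit variance
    scaled_data[name] = sc.fit_transform(df)
    # Wrap the array in a DataFrame; the date index is replaced by 0..n-1
    scaled_df[name] = pd.DataFrame(data=scaled_data[name], columns=df.columns)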

In [8]:
scaled_df['df_AMD']  

| | AMD_Open | AMD_High | AMD_Low | AMD_Close | AMD_Adj Close | AMD_Volume |
|---|---|---|---|---|---|---|
| 0 | 0.542338 | 0.495612 | 0.472134 | 0.462193 | 0.462193 | 0.794383 |
| 1 | 0.462565 | 0.426578 | 0.384237 | 0.343988 | 0.343988 | 0.314601 |
| 2 | 0.299829 | 0.257132 | 0.172634 | 0.177861 | 0.177861 | 1.025084 |
| 3 | 0.213674 | 0.184961 | 0.146590 | 0.123550 | 0.123550 | 0.200683 |
| 4 | 0.153047 | 0.175547 | 0.130313 | 0.097992 | 0.097992 | 0.420358 |
| ... | ... | ... | ... | ... | ... | ... |
| 2513 | 1.611292 | 1.553080 | 1.507362 | 1.548409 | 1.548409 | 0.993926 |
| 2514 | 1.493228 | 1.493460 | 1.552939 | 1.522851 | 1.522851 | -0.296285 |
| 2515 | 1.515565 | 1.543666 | 1.572471 | 1.545214 | 1.545214 | -0.190540 |
| 2516 | 1.553856 | 1.512287 | 1.582237 | 1.551604 | 1.551604 | -0.374128 |


#### Finding the standard deviation of each column in each dataframe and storing it in a dictionary

In [9]:
standard_dev = {}
for tick, df in scaled_df.items():
    standard_dev[f'{tick}'] = df.std() 

In [10]:
standard_dev['df_AMD']

AMD_Open         1.000199
AMD_High         1.000199
AMD_Low          1.000199
AMD_Close        1.000199
AMD_Adj Close    1.000199
AMD_Volume       1.000199
dtype: float64
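
The values are 1.000199 rather than exactly 1 because the two steps use different conventions: `StandardScaler` divides by the population standard deviation (n in the denominator), while `DataFrame.std()` defaults to the sample standard deviation (n - 1 in the denominator). The ratio between the two is sqrt(n / (n - 1)), which for the 2518 rows here is about 1.000199. The sketch below reproduces the effect with the standard library on toy values; the data are illustrative only.

```python
import math
import statistics

# Toy data standing in for one price column (illustrative values only)
values = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
n = len(values)

# Standardise with the population standard deviation (denominator n)
mean = statistics.fmean(values)
pop_std = statistics.pstdev(values)
scaled = [(v - mean) / pop_std for v in values]

# Sample standard deviation (denominator n - 1) of the standardised values
sample_std = statistics.stdev(scaled)
expected_ratio = math.sqrt(n / (n - 1))
print(sample_std, expected_ratio)

# The same ratio for the 2518 rows of the AMD frame
ratio_amd = math.sqrt(2518 / 2517)
print(round(ratio_amd, 6))  # 1.000199
```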

### Calculating $\beta$-coefficient

Beta measures the systematic risk of a stock: how strongly its returns move relative to the overall market. It has to be computed against a market benchmark, and here the S&P 500 index (`^GSPC`) is used.
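
In symbols, writing $r_s$ for the daily returns of the stock and $r_m$ for the daily returns of the benchmark (notation introduced here for convenience), beta is the ratio of their covariance to the variance of the benchmark returns:

$$
\beta = \frac{\operatorname{Cov}(r_s, r_m)}{\operatorname{Var}(r_m)}
$$

This is the ratio computed in the cells that follow.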

In [12]:
df_bench = yf.download('^GSPC',start=start_date,end=end_date)['Adj Close'] 

[*********************100%%**********************]  1 of 1 completed


In [14]:
stock_data = data_frames['df_AMD']['AMD_Adj Close']

In [16]:
# Calculating Daily returns 
stock_returns = stock_data.pct_change().dropna()
benchmark_returns = df_bench.pct_change().dropna() 

In [17]:
# Covariance of stock and benchmark returns, and variance of benchmark returns
covariance = stock_returns.cov(benchmark_returns)
variance = benchmark_returns.var()

In [18]:
beta = covariance / variance

In [19]:
beta 

1.4296052820603145
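
A beta above 1 means the stock tends to move more than the market, and a beta below 1 means it tends to move less. One way to use this for the aggressive/defensive categorization step is sketched below. The cutoff of 1.0 is the conventional choice and is an assumption here, since the section does not state the rule; the function names and return series are hypothetical.

```python
def sample_covariance(x, y):
    """Sample covariance (divides by n - 1), the default in pandas."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def beta(stock_returns, market_returns):
    """Covariance with the market divided by the market's variance."""
    market_variance = sample_covariance(market_returns, market_returns)
    return sample_covariance(stock_returns, market_returns) / market_variance

def classify(beta_value, cutoff=1.0):
    """Assumed rule: beta above the cutoff is aggressive, otherwise defensive."""
    return 'aggressive' if beta_value > cutoff else 'defensive'

# Illustrative return series, not market data
market = [0.01, -0.02, 0.015, 0.005]
amplified = [2 * r for r in market]    # moves twice as much as the market
damped = [0.5 * r for r in market]     # moves half as much as the market

print(classify(beta(amplified, market)))
print(classify(beta(damped, market)))
print(classify(1.4296052820603145))    # AMD's beta from the cell above
```

Under this assumed cutoff, AMD's beta of about 1.43 places it in the aggressive group.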