In [None]:
import sys

!{sys.executable} -m pip install yfinance --upgrade --no-cache-dir

# Applying ML : Clustering stock market data

**Goal**: Learn about approximate nearest neighbor identification in high-dimensional spaces via:

1. Clustering time series based on their shape using [K-Shape: Time Series Clustering](https://aws.amazon.com/marketplace/pp/Spotad-LTD-K-Shape-Time-Series-Clustering/prodview-bjbovimwn5ajs).
2. Clustering high-dimensional data using Amazon SageMaker built-in [K-Means Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/k-means.html)

#### Task 1 description:
In this task, you will learn how to cluster time series data and identify stocks that behaved similarly over a given time span. You will download the stock market data at runtime, normalize the values for each stock, and then identify clusters of stocks whose price curves have a similar shape. You will then share your findings about which stocks seem to behave alike. You will also report which value of `k` gave you the minimum SSD (the sum of the squared distances between each data point and its cluster centroid).
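The SSD metric mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes Euclidean distance and made-up toy data; the K-Shape algorithm may compute the distance it reports differently, and `sum_squared_distance` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def sum_squared_distance(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # centroids[labels] looks up the assigned centroid for every point.
    diffs = points - centroids[labels]
    return float(np.sum(diffs ** 2))

# Toy data (made up): four 2-D points assigned to two clusters.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]]
centroids = [[0.0, 1.0], [10.0, 2.0]]
labels = [0, 0, 1, 1]

ssd = sum_squared_distance(points, centroids, labels)
print(ssd)
```

A lower SSD means the points sit closer to their centroids, but SSD generally keeps shrinking as `k` grows, so it is usually read together with the number of clusters rather than on its own.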

To help you ensure you have sufficient time for experimentation in Task 2, some starter code for task 1 has been provided in this notebook. 


#### *References:*

* https://aws.amazon.com/blogs/machine-learning/k-means-clustering-with-amazon-sagemaker/
* Accelerating ML projects with algorithms and models from AWS Marketplace (https://youtu.be/OrmHHVI1uPk?t=1682)
* Interesting graphs: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/aws_marketplace/using_model_packages/financial_transaction_processing/Extracting_insights_from_your_credit_card_statement.ipynb

#### Task 2 description:
In this task, you will learn how to identify approximate nearest neighbors in high-dimensional space by applying a clustering algorithm. As part of this task, you will first generate high-dimensional synthetic datasets containing trading portfolio tickers. You will then apply the K-Means clustering algorithm and identify clusters of traders that have similar portfolios.

**Notes**:

* To make this a fun project, add tickers you are especially interested in to the list.
* Extra time left? 
    Explore other algorithms you can use to solve problems identified in Task 1 and 2 and compare the results using appropriate metrics.


#### *References:*

* https://aws.amazon.com/blogs/machine-learning/k-means-clustering-with-amazon-sagemaker/
* [How K-Means algorithm works](https://docs.aws.amazon.com/sagemaker/latest/dg/algo-kmeans-tech-notes.html)

In [None]:
#For this experiment, you may use the following tickers.
tickers = ['FB','AAPL','MSFT','GOOGL','GOOG','JNJ','V','PG','JPM','UNH','HD','MA','INTC','NVDA','VZ','NFLX','ADBE','DIS','T','PYPL','PFE','MRK','CSCO','CMCSA','WMT','PEP','BAC','XOM','KO','CRM','ABBV','ABT','CVX','TMO','AMGN','COST','MCD','ACN','LLY','BMY','NEE','MDT','AVGO','LIN','TXN','DHR','UNP','NKE','AMT','ORCL','PM','IBM','LOW','HON','QCOM','C','GILD','BA','WFC','RTX','LMT','MMM','BLK','SBUX','FIS','SPGI','NOW','CHTR','CVS','UPS','VRTX','BDX','INTU','ISRG','MDLZ','MO','CAT','CCI','BKNG','PLD','ZTS','AMD','REGN','GS','ANTM','D','CI','EQIX','APD','ADP','CL','ATVI','MS','AXP','TJX','SYK','CB','TMUS','TGT']

In [None]:
#Standard library
import os
import json
import warnings
from collections import namedtuple
from functools import partial
from uuid import uuid4

#Third-party libraries (duplicate imports removed)
import boto3
import botocore
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import yfinance as yf
from matplotlib.pyplot import figure
from scipy.stats import zscore
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

#Amazon SageMaker SDK
import sagemaker
import sagemaker as sage
from sagemaker import AlgorithmEstimator
from sagemaker.predictor import csv_serializer
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

warnings.filterwarnings("ignore")
%matplotlib inline

#visualization variables
palette=sns.color_palette("RdBu", n_colors=7)

In [None]:
#Common variable declaration
region_name = boto3.Session().region_name
bucket=sage.Session().default_bucket()
role = sage.get_execution_role()
sagemaker_session = sage.Session()

Don't worry about the following warning. The cell has still executed successfully.

`Couldn't call 'get_role' to get Role ARN from role name Sagemaker_Studio_Role to get Role path.`

## Task 1:
In this task, you will find stocks whose price curves have a similar shape. The task is divided into the following three steps:

##### Step 1:
* Download stock market data for 95 days and write it to a file in a format accepted by the algorithm. The algorithm requires a CSV file of normalized time series data in which each row contains the time series for one stock.

##### Step 2:
* Perform shape-based time series clustering and identify clusters of stocks that are performing similarly. Remember, magnitude does not matter but shape does!

##### Step 3:
* Experiment and report findings.



For this task, you will use the [K-Shape Time Series Clustering algorithm](https://aws.amazon.com/marketplace/pp/Spotad-LTD-K-Shape-Time-Series-Clustering/prodview-bjbovimwn5ajs) from [AWS Marketplace](https://aws.amazon.com/marketplace/search/results?page=1&filters=fulfillment_options&fulfillment_options=SAGEMAKER&ref_=header_nav_dm_sagemaker). The K-Shape Time Series Clustering algorithm is based on [this research paper](http://web2.cs.columbia.edu/~gravano/Papers/2015/sigmod2015.pdf).

In [None]:
#Configure dates for which you would like to download the data
start_date = '2020-02-03'
end_date = '2020-06-18'
common_prefix = "k-shape-clustering"


In [None]:
#Let's download the stock data for all specified tickers.
data = yf.download(' '.join(tickers), start=start_date, end=end_date, group_by="ticker")

In [None]:
data.head()

In [None]:
#Extract all dates for which stock prices are available into a column.
dates=data[tickers[0]]['Close'].index

In [None]:
len(dates)
##df[dates].values

For this experimentation, we will only use closing price.

In [None]:
close_data=[]

for ticker in tickers:
    ticker_data=[ticker]
    ticker_data.extend(data[ticker]['Close'].values)
    close_data.append(ticker_data)

#print('Closing price data set for ',len(close_data),' tickers')
#print(close_data[0])

You can see that `close_data` contains ticker and stock price time-series. Let us insert this data into a dataframe.

In [None]:
columns=['Ticker']
columns.extend(dates)

df=pd.DataFrame(data=close_data,columns=columns)

In [None]:
df.head()

Data looks great! Now, we will normalize the data by row and save it to a file.

In [None]:
x = df[dates].values
len(x)

In [None]:
x[0]

In [None]:
#To normalize the data by row instead of by column, we transpose it first, apply MinMaxScaler,
# and then transpose the result back so that each row is again one stock.
minmax_scale = preprocessing.MinMaxScaler(feature_range=(-1, 1)).fit(x.T)
x_scaled=minmax_scale.transform(x.T).T

In [None]:
x_scaled[0]

In [None]:
file_name='train.csv'

#Let's write the scaled values to a dataframe, insert the `Ticker` column, and then save it to a file that will feed the training job.
df = pd.DataFrame(x_scaled)
df.insert(0,'Ticker',tickers)
df.to_csv(file_name,header=False,index=False)

In [None]:
df.head()

In [None]:
#Next, we will upload it to Amazon S3 so that we can specify the same as part of the training job in Step 2.
train_file = sagemaker_session.upload_data(file_name, bucket, common_prefix)

#### Step 2: Train an ML model

Third party algorithms from AWS Marketplace work with Amazon SageMaker and require a subscription. To subscribe:

1. Open the algorithm [AWS Marketplace listing page](https://aws.amazon.com/marketplace/pp/Spotad-LTD-K-Shape-Time-Series-Clustering/prodview-bjbovimwn5ajs)
1. Click the **Continue to subscribe** button.
1. If you are trying this notebook as part of a workshop conducted by AWS, a subscription has been created for you and the **Continue to configuration** button is active. However, if you are trying this notebook in your own AWS account, click the **"Accept Offer"** button on the ***Subscribe to this software*** page if you agree with the EULA, pricing, and support terms.
1. Click the **Continue to configuration** button and then choose the **region** that corresponds to the AWS Region in which you launched this notebook.
1. You will see a **Product Arn**. Copy the ARN and specify it in the following cell.

In [None]:
algo_arn='<Customer to specify algorithm ARN corresponding to their AWS region after subscription>'

#algo_arn='arn:aws:sagemaker:us-east-1:865070037744:algorithm/k-shape-cd639040558775d27d890f1479c92d7b'

In [None]:
#Review hyperparameters (k=11 for 11 clusters, label-size=1 since we have first column in the data as the ticker)
#Review instance-type, and train an ML model.

algo = AlgorithmEstimator(algorithm_arn=algo_arn, 
                          role=role, 
                          train_instance_count=1, 
                          train_instance_type='ml.m5.4xlarge', 
                          sagemaker_session=sagemaker_session, 
                          base_job_name=common_prefix,
                          hyperparameters={"k": "11", "label_size": "1"}) 

algo.fit({'train': train_file}) 


This algorithm allows us to download and inspect the generated ML model, which contains information about the centroids. A cluster centroid is the mean of the observations in the cluster. In this case, each centroid is itself a time series that represents the center of the time series assigned to that cluster.

To find the cluster to which a time series belongs, the algorithm computes the distance from that time series to every cluster center. It then assigns the observation to the cluster with the closest center.
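The linked K-Shape paper measures closeness with a shape-based distance (SBD) built on normalized cross-correlation rather than plain Euclidean distance. The sketch below follows that definition with NumPy on made-up series. How the Marketplace algorithm implements it internally is not shown in this notebook, so `shape_based_distance` and `nearest_centroid` are illustrative names only, not the algorithm's API.

```python
import numpy as np

def shape_based_distance(x, y):
    """SBD = 1 - max over all shifts of the normalized cross-correlation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0:
        # A flat all-zero series has no shape to compare; treat it as uncorrelated.
        return 1.0
    # mode="full" evaluates the cross-correlation at every possible shift.
    ncc = np.correlate(x, y, mode="full") / denom
    return float(1.0 - ncc.max())

def nearest_centroid(series, centroids):
    """Index of the centroid with the smallest shape-based distance."""
    distances = [shape_based_distance(series, c) for c in centroids]
    return int(np.argmin(distances))

# Made-up example: one rising and one falling centroid.
rising = [-1.0, 0.0, 1.0]
falling = [1.0, 0.0, -1.0]
series = [-2.0, 0.0, 2.0]  # rises like the first centroid, at twice the magnitude

assigned = nearest_centroid(series, [rising, falling])
print(assigned, shape_based_distance(series, rising))
```

Because the correlation is divided by both norms, doubling the magnitude of `series` does not change its distance to `rising`, which is the sense in which shape matters and magnitude does not.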

Let's download the model and plot the cluster centroids.

In [None]:
s3 = boto3.resource('s3')

try:
    s3.Bucket(bucket).download_file('{}/output/model.tar.gz'.format(algo._current_job_name), 'model.tar.gz')
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise

In [None]:
!mkdir -p model
!tar -zxvf model.tar.gz -C model

In [None]:
#Each line of the centroids file holds one centroid as comma-separated values.
centroids = []
with open('model/centroids', 'r') as f:
    for line in f:
        centroids.append(np.array(line.strip().split(','), dtype=float))
len(centroids)

Let's plot the centroid lines for the clusters identified.

Note that the centroids are Z-normalized, so their range does not match the original stock price range. Add code to the following cell to create a line chart of the centroids.
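Because the centroids are Z-normalized, a stock's price series needs the same treatment before it can be compared with or overlaid on a centroid. The following is a minimal sketch with made-up prices; `z_normalize` is a hypothetical helper, and `scipy.stats.zscore` (already imported in this notebook) computes the same quantity with its default settings.

```python
import numpy as np

def z_normalize(series):
    """Shift a series to mean 0 and scale it to standard deviation 1."""
    series = np.asarray(series, dtype=float)
    std = series.std()
    if std == 0:
        # A constant series has no variation; return zeros instead of dividing by 0.
        return np.zeros_like(series)
    return (series - series.mean()) / std

# Made-up closing prices for illustration.
prices = np.array([100.0, 102.0, 104.0, 106.0])
z = z_normalize(prices)
print(z)
```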

In [None]:
figure(num=None, figsize=(14, 6), dpi=150, facecolor='w', edgecolor='k')

#Display only month and day
formatter = mdates.DateFormatter("%m-%d")
ax = plt.gca()
ax.xaxis.set_major_formatter(formatter)


for index,centroid in enumerate(centroids):
    plt.plot( dates, centroid, linewidth=1, label='Centroid '+str(index))
plt.legend()


Next, deploy the ML model and perform an inference.

In [None]:
%%time
predictor = algo.deploy(1, 'ml.m5.4xlarge', serializer=csv_serializer)

In [None]:
single_result=df.head(1).values[0]
result=predictor.predict(np.array(single_result[1:])).decode('utf-8')
result

<font color='red'> Task for workshop attendees: Perform inference on the entire training dataset and identify the cluster ID for each row. Plot each cluster separately as a line chart.  </font>

<font color='red'> Report your findings in the next cell. </font>
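Once you have a cluster ID for every row, grouping the tickers by cluster makes both the plots and the findings easier to produce. The sketch below assumes you have already parsed the endpoint responses into a list of integer cluster IDs; the response format is not shown in this notebook, so the IDs used here are made up and `group_tickers_by_cluster` is a hypothetical helper.

```python
from collections import defaultdict

def group_tickers_by_cluster(tickers, cluster_ids):
    """Map each cluster ID to the list of tickers assigned to it."""
    clusters = defaultdict(list)
    for ticker, cluster_id in zip(tickers, cluster_ids):
        clusters[int(cluster_id)].append(ticker)
    return dict(clusters)

# Made-up cluster assignments for four tickers.
example_tickers = ["AAPL", "MSFT", "XOM", "CVX"]
example_cluster_ids = [0, 0, 1, 1]

groups = group_tickers_by_cluster(example_tickers, example_cluster_ids)
print(groups)
```

Each entry of `groups` can then drive one line chart, plotting the normalized series of every ticker in that cluster on the same axes.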

In [None]:
#predictor.delete_endpoint()

Congratulations! You have successfully performed K-shape based time series clustering. 

#### Experiment Summary

<font color='red'>Task: Next, we recommend that each member of the group replicate the working notebook and run one or more experiments with different values of K, from k=2 to k=20 (from Step 2: Train an ML model onwards), and report the `Sum Square Distance` in the following section.
    
For this experiment, do not use Automatic model tuning. The goal of this exercise is to ensure that every team member understands the experimentation process for the problem at hand so that your team can solve task 2 more efficiently.

For experimentation, you may choose another set of tickers or date range. But remember, you must provide:
1. At least 50 tickers
2. At least a 3-month date range.
</font>

##### Sample Experiment summary:
<font color='red'>Tickers =[]

Date range=[]


| K      | Sum Square Distance |
| ----------- | ----------- |
| Header      | Title       |
| Paragraph   | Text        |




Can you answer the following questions?
* What value of "K" gave you the best results?
* Do all tickers in the same sector have a similar shape?
* Note an interesting trend you discovered from the graphs.
</font>

Once each member has finished the task, work on Task 2 together as a team.


### Task 2 Description 
Despite stock markets being volatile, a large number of people have invested in stocks.  Each of us likes to think that we have a unique stock portfolio. While quantity and purchase date may vary, it is highly unlikely that the collection of tickers in your portfolio is unique. 

Your first task is simple: generate synthetic portfolios for 100,000 traders, with each trader holding stocks of at least 3 and at most 10 companies.

##### Step 1:
For the given collection of tickers (a subset of the SPDR S&P 500 ETF, SPY), create ticker portfolios for 20,000 traders. Give each trader a unique ID.

##### Step 2:
Perform K-Means clustering on the portfolio tickers by running the K-Means clustering algorithm; see the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/k-means.html).

K-means is an unsupervised learning algorithm. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups. You define the attributes that you want the algorithm to use to determine similarity.

Amazon SageMaker uses a modified version of the web-scale k-means clustering algorithm. Compared with the original version of the algorithm, the version used by Amazon SageMaker is more accurate. Like the original algorithm, it scales to massive datasets and delivers improvements in training time.


**Goal**: Develop a function that accepts a trader ID at runtime and identifies other traders whose portfolio is nearly identical to the chosen trader's (at least a 90% match; quantity does not matter).
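One way to approach this goal is to compare the chosen trader only against traders in the same cluster, which is what makes the search approximate: a close match that landed in a different cluster is missed. The sketch below is illustrative only. It assumes a 0/1 holdings matrix, cluster IDs that you have already obtained from a trained model, and Jaccard similarity as the definition of "match", which is an assumption because the task does not fix one. `find_similar_traders` is a hypothetical name.

```python
import numpy as np

def find_similar_traders(holdings, cluster_ids, trader_idx, threshold=0.9):
    """Return row indices of same-cluster traders whose Jaccard similarity
    with the chosen trader is at least `threshold`."""
    holdings = np.asarray(holdings).astype(bool)
    cluster_ids = np.asarray(cluster_ids)
    target = holdings[trader_idx]
    # Restrict the search to the chosen trader's cluster, excluding the trader.
    candidates = np.flatnonzero(cluster_ids == cluster_ids[trader_idx])
    candidates = candidates[candidates != trader_idx]
    shared = (holdings[candidates] & target).sum(axis=1)
    combined = (holdings[candidates] | target).sum(axis=1)
    # np.maximum guards against division by zero for empty portfolios.
    similarity = shared / np.maximum(combined, 1)
    return candidates[similarity >= threshold].tolist()

# Made-up data: four traders, four tickers, two clusters.
toy_holdings = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
toy_clusters = [0, 0, 0, 1]

matches = find_similar_traders(toy_holdings, toy_clusters, trader_idx=0)
print(matches)
```

With portfolios of 3 to 10 tickers, a 90% Jaccard threshold admits almost only identical portfolios, so it is worth deciding early which similarity definition your team reports.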

##### Step 3:
Experiment with different values for "K" and summarize your findings.

**Pro Tip**: To avoid delays, start development with a small dataset and then run your experiment on large data configurations.

### Step 1: Generate portfolios


`Proposed pandas dataframe columns`: ['TRADER_ID','Ticker1','Ticker2','Ticker3','Ticker4'...'TickerN']

In [None]:
num_traders=20000
min_stocks_in_portfolio=3
max_stocks_in_portfolio=10

In [None]:
df = pd.DataFrame(columns=tickers)

In [None]:
from random import randint, sample

#Each trader's portfolio must contain at least 3 and at most 10 companies.
for trader_number in range(num_traders):
    #randint includes both bounds, so a portfolio of exactly 10 stocks is possible.
    total_stocks = randint(min_stocks_in_portfolio, max_stocks_in_portfolio)
    #sample picks distinct tickers, so duplicates cannot shrink a portfolio below its size.
    for ticker in sample(tickers, total_stocks):
        df.loc[trader_number, ticker] = 1

**Pro Tip**: If this cell takes a long time to run, check whether you can run the code on a larger instance type.
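Another option, independent of instance size, is to fill a NumPy array and convert it to a dataframe once, because growing a dataframe cell by cell is slow. The sketch below builds a dense 0/1 matrix directly, so tickers a trader does not hold are 0 rather than missing. `generate_portfolios`, the seed, and the sizes in the example call are illustrative assumptions, and `float32` is chosen only because SageMaker built-in algorithm examples commonly cast training data to that type.

```python
import numpy as np

def generate_portfolios(num_traders, num_tickers, min_stocks, max_stocks, seed=0):
    """Return a (num_traders, num_tickers) 0/1 matrix of synthetic holdings."""
    rng = np.random.default_rng(seed)
    holdings = np.zeros((num_traders, num_tickers), dtype=np.float32)
    # integers() excludes the upper bound, hence max_stocks + 1.
    sizes = rng.integers(min_stocks, max_stocks + 1, size=num_traders)
    for row, size in enumerate(sizes):
        # replace=False guarantees distinct tickers within one portfolio.
        picked = rng.choice(num_tickers, size=size, replace=False)
        holdings[row, picked] = 1.0
    return holdings

# Made-up sizes for a quick trial run.
holdings = generate_portfolios(num_traders=1000, num_tickers=50, min_stocks=3, max_stocks=10)
print(holdings.shape, holdings.sum(axis=1).min(), holdings.sum(axis=1).max())
```

In the notebook, the result could be wrapped as `pd.DataFrame(holdings, columns=tickers)` after calling the function with `num_tickers=len(tickers)`.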

### Step 2:
<font color='red'>
In this section, write the code required to train an ML model that clusters the generated portfolios.</font>

### Step 3:
<font color='red'>
In this section, write the code required to select a trader and then find other traders whose portfolio tickers closely match the chosen trader's portfolio. </font>

### Step 4: Identify optimal value for K and report metrics (Optional)

Read the blog-post:
https://aws.amazon.com/blogs/machine-learning/k-means-clustering-with-amazon-sagemaker/ 
    
Perform an experiment similar to the one in the blog post, plot an elbow graph, and share your results.
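Before spending training time on SageMaker, the elbow idea can be rehearsed locally. The sketch below is a small NumPy stand-in for K-Means (plain Lloyd iterations with random initialization), not the SageMaker built-in algorithm and not the blog post's code. The data is synthetic, and `kmeans_ssd` is a hypothetical helper name.

```python
import numpy as np

def kmeans_ssd(points, k, iterations=20, seed=0):
    """Run a basic K-Means and return the final sum of squared distances."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Squared distance from every point to every centroid: shape (n, k).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(dists.min(axis=1).sum())

# Synthetic data: three well-separated 2-D blobs of 30 points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
data = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

ssd_by_k = {k: kmeans_ssd(data, k) for k in range(1, 7)}
for k, ssd in ssd_by_k.items():
    print(k, round(ssd, 2))
```

Plotting `ssd_by_k` with `plt.plot(list(ssd_by_k), list(ssd_by_k.values()), marker='o')` gives the elbow graph; the elbow is the value of `k` after which the curve flattens. A single random initialization can land in a poor local minimum, so repeated runs per `k` are a sensible refinement.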