# End-to-end Quantitative Trading 

## Employing Web Scraping, AWS, and ML to Find Good Trades

The idea of quantitative trading is to use algorithms to make more informed and less emotional decisions about trades. There are many different methodologies which can be employed for quantitative trading. However, in this post I'd like to focus on building an end-to-end pipeline which will attempt to predict future price changes.

Oftentimes, a novice data scientist will be concerned with building and optimizing a model, as well as building buy-in for their ideas through a convincing narrative and powerful visualizations. These steps are critical, but they are not exhaustive. A full, simple, end-to-end framework is concerned with these steps, but also with obtaining data, finding the right platform for the analysis to run on, and performing all these steps in a reproducible way (in case a more senior data scientist would like to reproduce them).

This set of posts will break down the workflow into three major parts:

1. Obtain the data, and prepare it for modeling
2. Run a set of models on Amazon Web Services (AWS) and collect the best one
3. Prepare a simple summary of all observations during step 2.

## Step 1. Obtain the data

Since we are interested in historical stock data, there are several options available. Quandl is a service which can provide stock data for each day (at minute-by-minute intervals), but it costs money. For this post I will opt for the free option and use Yahoo Finance.

Navigating to the Yahoo Finance website, you will be greeted with the following page, and a download button within it. The link behind the download button will look like this:<br>
https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=1420095600&period2=1577862000&interval=1d&events=history&crumb=X68TFb4PiMR

There are four important parameters embedded in this link: 

* `AAPL` - the stock we are interested in, in this case AAPL (Apple Inc.)
* `period1=1420095600` - a timestamp for the start of the period (this one is Jan 1st, 2015, 7am UTC)
* `period2=1577862000` - a timestamp for the end of the period (this one is Jan 1st, 2020, 7am UTC)
* `crumb=X68TFb4PiMR` - an authentication token which makes the link valid; no request will work without it
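
To make these four parameters concrete, here is a minimal sketch that pulls them out of the link with Python's standard library and converts the two timestamps to UTC. The helper name `describe_download_url` is hypothetical and exists only for this illustration.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse, parse_qs

URL = ("https://query1.finance.yahoo.com/v7/finance/download/AAPL"
       "?period1=1420095600&period2=1577862000"
       "&interval=1d&events=history&crumb=X68TFb4PiMR")


def describe_download_url(url):
    """Return the ticker, UTC start/end datetimes, and crumb found in a download link."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)  # every value comes back as a list of strings
    return {
        "ticker": parsed.path.rsplit("/", 1)[-1],  # last path segment
        "start": datetime.fromtimestamp(int(query["period1"][0]), tz=timezone.utc),
        "end": datetime.fromtimestamp(int(query["period2"][0]), tz=timezone.utc),
        "crumb": query["crumb"][0],
    }


info = describe_download_url(URL)
print(info["ticker"], info["start"].isoformat(), info["end"].isoformat(), info["crumb"])
# AAPL 2015-01-01T07:00:00+00:00 2020-01-01T07:00:00+00:00 X68TFb4PiMR
```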

The authentication token is quite important, since it is what allows us to query the address remotely. Let's assume for now that we have some way of obtaining that token (it expires after some time). We then need to query the address and read back the response.

If we use Python's requests library, we can run the following code:

In [26]:
import requests
from io import StringIO
import pandas as pd


STOCK_ID = "AAPL"
start_time = "1420095600"
end_time = "1577862000"
auth_token = "X68TFb4PiMR"

In [27]:
r = requests.post(f'https://query1.finance.yahoo.com/v7/finance/download/{STOCK_ID}\
?period1={start_time}\
&period2={end_time}&interval=1d&events=history\
&crumb={auth_token}')

In [34]:
r.status_code

200

In [35]:
if (r.status_code == 200):
    # the response content is a bytes type so we need to turn it into a string
    csv_content = r.content.decode("utf-8")
    the_data = StringIO(csv_content)
    stock_df = pd.read_csv(the_data)
    display(stock_df.head()) # this only works in a Jupyter notebook
else:
    print("Request failed")

| | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 0 | 2015-01-02 | 111.389999 | 111.440002 | 107.349998 | 109.330002 | 100.4543 | 53204600 |
| 1 | 2015-01-05 | 108.290001 | 108.650002 | 105.410004 | 106.25 | 97.624336 | 64285500 |
| 2 | 2015-01-06 | 106.540001 | 107.43 | 104.629997 | 106.260002 | 97.633545 | 65797100 |
| 3 | 2015-01-07 | 107.199997 | 108.199997 | 106.699997 | 107.75 | 99.002556 | 40105900 |
| 4 | 2015-01-08 | 109.230003 | 112.150002 | 108.699997 | 111.889999 | 102.80648 | 59364500 |


While this code gets the job done, so far it is specific to a single use case. It only gets the data for AAPL, only does so within a fixed date range, and only works for a limited time given the authentication token.

We should probably wrap this into a function so we can make it slightly more versatile.

In [68]:
import pandas as pd
import requests
from io import StringIO

def get_stock_data(STOCK_ID, start_time, end_time, auth_code):
    # use the auth_code parameter rather than relying on a global auth_token
    request_string = f'https://query1.finance.yahoo.com/v7/finance/download/{STOCK_ID}?period1={start_time}&period2={end_time}&interval=1d&events=history&crumb={auth_code}'
    r = requests.post(request_string)
    
    if (r.status_code == 200):
        # the response content is a bytes type so we need to turn it into a string
        csv_content = r.content.decode("utf-8")
        the_data = StringIO(csv_content)
        stock_df = pd.read_csv(the_data)
        return stock_df
    else:
        return None
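
As an alternative to the long f-string, the link can be assembled from its parts with the standard library. This is only a sketch: `build_download_url` and `BASE_URL` are hypothetical names, and the endpoint is simply the one shown in the link above.

```python
from urllib.parse import urlencode, quote

BASE_URL = "https://query1.finance.yahoo.com/v7/finance/download"


def build_download_url(stock_id, start_time, end_time, auth_code):
    """Assemble the download link from its parts (hypothetical helper)."""
    params = {
        "period1": start_time,
        "period2": end_time,
        "interval": "1d",
        "events": "history",
        "crumb": auth_code,
    }
    # quote() escapes symbols such as ^GSPC; urlencode() escapes the values
    return f"{BASE_URL}/{quote(stock_id)}?{urlencode(params)}"


print(build_download_url("AAPL", 1420095600, 1577862000, "X68TFb4PiMR"))
```

Building the query this way keeps each parameter on its own line and takes care of escaping, which matters for index symbols such as `^GSPC`.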

Let's start with the easy part, the start and end times. We can get today's date quite easily, and then find the date five years back. We can then easily convert both into epoch time.

In [69]:
from datetime import datetime
from dateutil.relativedelta import relativedelta


today = datetime.now() #.replace(hour=7, minute=0, second=0)
five_yrs_ago = datetime.now() - relativedelta(years=5)

today_timestamp = round(datetime.timestamp(today))
five_yrs_ago_timestamp = round(datetime.timestamp(five_yrs_ago))


print(today_timestamp)
print(five_yrs_ago_timestamp)

1577914905
1420147864
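
Note that `datetime.now()` returns local time including the time of day, so the resulting timestamps depend on when and where the code runs. A minimal sketch of a more reproducible variant, pinned to midnight UTC and using only the standard library, could look like this. The helper name `epoch_range` and the Feb 29 fallback to Feb 28 are assumptions of this sketch.

```python
from datetime import datetime, timezone


def epoch_range(years=5, now=None):
    """Return (start, end) epoch seconds at midnight UTC, `years` apart."""
    if now is None:
        now = datetime.now(timezone.utc)
    end = now.replace(hour=0, minute=0, second=0, microsecond=0)
    try:
        start = end.replace(year=end.year - years)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap year; fall back to Feb 28
        start = end.replace(year=end.year - years, day=28)
    return int(start.timestamp()), int(end.timestamp())


fixed_now = datetime(2020, 1, 1, 21, 41, 45, tzinfo=timezone.utc)
print(epoch_range(5, now=fixed_now))  # (1420070400, 1577836800)
```

Passing `now` explicitly makes the function easy to test; leaving it out uses the current UTC time.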


In [71]:
my_df = get_stock_data("AAPL", five_yrs_ago_timestamp, today_timestamp, auth_token)
type(my_df)

pandas.core.frame.DataFrame

Now the next part, obtaining the authentication token. The token appears as part of the link on the webpage of Yahoo Finance for a given stock. This means I can navigate to any given stock page and grab an authentication code from the HTML. Let's select an asset we're quite certain will always be available, perhaps the S&P500.

Searching for that asset on the Yahoo Finance page we see the URL is the following:
`https://ca.finance.yahoo.com/quote/%5EGSPC/history?p=%5EGSPC`

In [72]:
yahoo_request = requests.get("https://ca.finance.yahoo.com/quote/%5EGSPC/history?p=%5EGSPC")

In [75]:
yahoo_request.content.decode("utf-8")



`"CrumbStore":{"crumb":"Z4FrXH54dXl"}`

We get a response and, after searching through it using the technological wonder that is the find feature in text editors, we find our crumb as `"CrumbStore":{"crumb":"Z4FrXH54dXl"}`. We can check that the new crumb works.

In [76]:
new_auth_token = "Z4FrXH54dXl"
my_df = get_stock_data("AAPL", five_yrs_ago_timestamp, today_timestamp, new_auth_token)
type(my_df)

pandas.core.frame.DataFrame

And indeed it works. Now we can use some regex magic to find our crumb in the response.

In [77]:
response_string = yahoo_request.content.decode("utf-8")

In [78]:
import re

In [87]:
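# raw string so that \w is passed to the regex engine unchanged
all_crumbs = re.findall(r'"CrumbStore":{"crumb":"\w+"', response_string)
my_crumb_string = None
if (len(all_crumbs) > 0):
    # strip everything that isn't the crumb itself
    my_crumb_string = all_crumbs[0].replace("\"CrumbStore\":{\"crumb\":", "").replace("\"", "")
print(my_crumb_string)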

Z4FrXH54dXl
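
The chain of `replace` calls can also be avoided by letting the regex capture only the token. The sketch below assumes the same `"CrumbStore"` layout as the response above; `extract_crumb` is a hypothetical helper, and `sample` is a made-up fragment standing in for the real page source.

```python
import re

# Capture only the token itself; \w+ mirrors the pattern used above
CRUMB_PATTERN = re.compile(r'"CrumbStore":\{"crumb":"(\w+)"')


def extract_crumb(page_source):
    """Return the first crumb found in the page source, or None."""
    match = CRUMB_PATTERN.search(page_source)
    return match.group(1) if match else None


sample = '..."context":{},"CrumbStore":{"crumb":"Z4FrXH54dXl"},"other":1...'
print(extract_crumb(sample))            # Z4FrXH54dXl
print(extract_crumb("<html></html>"))   # None
```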


We can probably pack that into a function too:

In [88]:
import re
def get_crumb_token(asset="%5EGSPC"):
    yahoo_request = requests.get(f'https://ca.finance.yahoo.com/quote/{asset}/history?p={asset}')
    response_string = yahoo_request.content.decode("utf-8")
    
    # Search for our crumb store
    all_crumbs = re.findall(r'"CrumbStore":{"crumb":"\w+"', response_string)
    if (len(all_crumbs) > 0):
        my_crumb_string = all_crumbs[0].replace("\"CrumbStore\":{\"crumb\":", "").replace("\"", "")
    
        return my_crumb_string
    else:
        return None

And we have our basic workflow.

## Putting it all together

In [98]:
import re
import pandas as pd
import requests
from io import StringIO
from datetime import datetime
from dateutil.relativedelta import relativedelta


def get_crumb_token(asset="%5EGSPC"):
    yahoo_request = requests.get(f'https://ca.finance.yahoo.com/quote/{asset}/history?p={asset}')
    response_string = yahoo_request.content.decode("utf-8")
    
    # Search for our crumb store
    all_crumbs = re.findall(r'"CrumbStore":{"crumb":"\w+"', response_string)
    if (len(all_crumbs) > 0):
        my_crumb_string = all_crumbs[0].replace("\"CrumbStore\":{\"crumb\":", "").replace("\"", "")
    
        return my_crumb_string
    else:
        return None


def get_stock_data(STOCK_ID, start_time, end_time, auth_code):
    # use the auth_code parameter rather than relying on a global auth_token
    request_string = f'https://query1.finance.yahoo.com/v7/finance/download/{STOCK_ID}?period1={start_time}&period2={end_time}&interval=1d&events=history&crumb={auth_code}'
    r = requests.post(request_string)
    
    if (r.status_code == 200):
        # the response content is a bytes type so we need to turn it into a string
        csv_content = r.content.decode("utf-8")
        the_data = StringIO(csv_content)
        stock_df = pd.read_csv(the_data)
        return stock_df
    else:
        return None
    


def get_start_and_end_dates(difference=5):
    end_date = datetime.now()
    start_date = datetime.now() - relativedelta(years=difference)

    end_date = round(datetime.timestamp(end_date))
    start_date = round(datetime.timestamp(start_date))
    
    return start_date, end_date

start_date, end_date = get_start_and_end_dates()

auth_token = get_crumb_token()

STOCK_ID = "AAPL"

stock_df = get_stock_data(STOCK_ID, start_date, end_date, auth_token)


In [99]:
display(stock_df.head())

| | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 0 | 2015-01-02 | 111.389999 | 111.440002 | 107.349998 | 109.330002 | 100.4543 | 53204600 |
| 1 | 2015-01-05 | 108.290001 | 108.650002 | 105.410004 | 106.25 | 97.624336 | 64285500 |
| 2 | 2015-01-06 | 106.540001 | 107.43 | 104.629997 | 106.260002 | 97.633545 | 65797100 |
| 3 | 2015-01-07 | 107.199997 | 108.199997 | 106.699997 | 107.75 | 99.002556 | 40105900 |
| 4 | 2015-01-08 | 109.230003 | 112.150002 | 108.699997 | 111.889999 | 102.80648 | 59364500 |


The last piece for this portion is to put all the code we've written into a script which accepts our stock id as a command line argument, and saves the dataframe to a directory that is also provided on the command line.
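
The command-line side of such a script can be sketched on its own first. This minimal version uses `pathlib` to join the directory and the file name rather than appending a separator by hand. The names `parse_args` and `build_output_path`, and passing the argument list explicitly, are assumptions of this sketch.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse a stock symbol and an optional output directory."""
    parser = argparse.ArgumentParser(description="Obtain data from Yahoo Finance.")
    parser.add_argument("stock_id", help="The symbol of the asset to retrieve")
    parser.add_argument("--output_directory", default=".",
                        help="Directory for the .csv output (default: current directory)")
    return parser.parse_args(argv)  # argv=None falls back to sys.argv[1:]


def build_output_path(args):
    """Join the directory and '<symbol>.csv' without manual separators."""
    return Path(args.output_directory) / f"{args.stock_id}.csv"


args = parse_args(["AAPL", "--output_directory", "data"])
print(build_output_path(args).as_posix())  # data/AAPL.csv
```

Passing a list to `parse_args` makes the parsing easy to try out in a notebook, where `sys.argv` belongs to the kernel rather than to our script.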

In [97]:
import re
import os
import pandas as pd
import requests
from io import StringIO
from datetime import datetime
from dateutil.relativedelta import relativedelta
import argparse
import time

import logging

##################################################################
def attempt_getting_token(asset, attempt):
    '''
    A single attempt at getting a crumb authentication token.
    Sends a request to Yahoo finance and uses regex to find
    an instance of an authentication token.

    Parameters:
    ------------
    asset (str) -   the asset (stock code) whose page is visited on 
                    Yahoo Finance
    attempt (int) - the attempt number, how many times was this
                    action attempted

    Returns:
    -----------
    an authentication token (string) if successful, None otherwise
    '''
    logging.info(f'Attempt {attempt+1}: Getting token from Yahoo Finance')

    # Get the request and response
    yahoo_request = requests.get(f'https://ca.finance.yahoo.com/quote/{asset}/history?p={asset}')
    response_string = yahoo_request.content.decode("utf-8")

    # Search for our token (crumb store) with regex
    all_tokens = re.findall(r'"CrumbStore":{"crumb":"\w+"',response_string)
    if (len(all_tokens) > 0):
        # Replace everything that isn't the crumb with empty strings
        my_token_string = all_tokens[0].replace("\"CrumbStore\":{\"crumb\":", "").replace("\"", "")
    
        logging.info(f'Token found {my_token_string}')
        return my_token_string
    else:
        return None
##################################################################
def get_crumb_token(asset="%5EGSPC"):
    '''
    Attempts getting an authentication token up to 10 times using attempt_getting_token(asset, attempt)
    
    Parameters:
    ------------
    asset (string, default S&P500) -    the asset (stock code) whose page is visited on 
                                        Yahoo Finance

    Returns:
    -----------
    an authentication token (string) if successful, None otherwise
    '''
    logging.info('Attempting to get token')
    my_token_string = None
    attempts = 0
    while (attempts < 10) and (my_token_string is None):
        
        # Try and get the authentication token
        my_token_string = attempt_getting_token(asset, attempts)

        attempts += 1

        # If unsuccessful, sleep so you don't bombard the website
        if my_token_string is None:
            time.sleep(2)

    # Return the appropriate result and log the process
    if (my_token_string is None):
        logging.warning('Token not found')
        return None
    else:
        logging.info(f'Token found {my_token_string}')
        return my_token_string

##################################################################
def get_stock_data(STOCK_ID, start_time, end_time, auth_code):
    ''' 
    Gets the stock data for the requested stock within the range
    specified by [start_time] and [end_time]

    Parameters:
    ------------
    STOCK_ID (string) - The asset for which data is requested
                        (e.g. AAPL, MSFT)
    start_time (int) -  The start time as an epoch time (e.g. 1420229214)
    end_time (int) -    The end time as an epoch time
    auth_code (string) -The authentication code used to make 
                        the request valid for Yahoo Finance

    Returns:
    -----------
    a pandas.DataFrame of the stock's historical
    data if successful, None otherwise
    '''
    logging.info(f'Obtaining data for symbol {STOCK_ID}')

    # Grab the data
    # use the auth_code parameter rather than relying on a global auth_token
    request_string = f'https://query1.finance.yahoo.com/v7/finance/download/{STOCK_ID}?period1={start_time}&period2={end_time}&interval=1d&events=history&crumb={auth_code}'
    r = requests.post(request_string)
    
    # If the code is okay we can process the data
    if (r.status_code == 200):
        # the response content is a bytes type so we need to turn it into a string
        csv_content = r.content.decode("utf-8")

        # feed the data to pandas
        the_data = StringIO(csv_content)
        stock_df = pd.read_csv(the_data)

        logging.info('Stock data found')
        return stock_df
    else:
        logging.warning('Stock data not found')
        return None
    

##################################################################
def get_start_and_end_dates(difference=5):
    ''' 
    Calculates start and end dates for our requests.
    End date is the current day while start date 
    is [difference] years ago

    Parameters:
    ------------
    difference (int, default 5) -   The difference between start date 
                                    and end date in years.

    Returns:
    -----------
    The start and end dates, both as epoch timestamps (int)
    '''
    logging.info("Calculating start and end dates")
    end_date = datetime.now()
    start_date = datetime.now() - relativedelta(years=difference)

    # convert to epoch time and remove the fractional component
    end_date = round(datetime.timestamp(end_date))
    start_date = round(datetime.timestamp(start_date))
    
    logging.info(f'Start date is {start_date} and end date is {end_date}')
    return start_date, end_date

##################################################################
def parse_command_line_arguments():
    '''
    Parses the command line arguments that come in

    Parameters:
    ------------
    None

    Returns:
    -----------
    [args], a namespace collection which contains the 
    variables [.output_directory] and [.stock_id]
    
    [args.stock_id] is a string containing the stock symbol
    that is requested

    [args.output_directory] contains a path to a directory or
    an empty string (current directory) if not provided
    by the user
    '''
    logging.info("Parsing command line arguments")

    # Make a parser and parse the arguments
    parser = argparse.ArgumentParser(description='Obtain Data From Yahoo Finance.')
    parser.add_argument('stock_id', help='The symbol of the asset to retrieve')
    parser.add_argument('--output_directory', help='The output directory for the data in .csv format')
    args = parser.parse_args()

    # If our output directory is empty, make it an empty string.
    # Otherwise, append the appropriate separator to it
    if not (args.output_directory):
        args.output_directory = ""
    else:
        args.output_directory += os.path.sep
    
    logging.info("Command line arguments parsed")
    return args
##################################################################
def output_stock_df_to_csv(stock_df, output_directory):
    '''
    Outputs the dataframe containing the stock data
    to the specified output directory

    Parameters:
    ------------
    stock_df (pandas.DataFrame) -       The dataframe storing the stock
                                        data
    output_directory (string) -         The output directory where to
                                        store the data

    Returns:
    -----------
    None
    '''
    logging.info('Outputting dataframe to csv file')
    if (stock_df is not None):        
        stock_df.to_csv(output_directory)
        logging.info(f'data written to {output_directory}')
    else:
        logging.warning('data not written')
##################################################################
if __name__ == "__main__":

    # Initialize a log file
    logging.basicConfig(filename='pipeline.log', filemode="w", level=logging.DEBUG)

    # Get our command line arguments
    args = parse_command_line_arguments()

    # Get the start and end dates
    start_date, end_date = get_start_and_end_dates()

    # Get authentication token
    auth_token = get_crumb_token()

    # If a token was obtained we can get the stock data and write it to file
    if (auth_token is not None):
        STOCK_ID = args.stock_id

        stock_df = get_stock_data(STOCK_ID, start_date, end_date, auth_token)
        
        output_directory = args.output_directory+STOCK_ID+".csv"
        output_stock_df_to_csv(stock_df, output_directory)
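
The retry logic used for the token can also be expressed as a small generic helper. The sketch below is one possible shape for it, not part of the script above: `retry` is a hypothetical name, and the `sleep` parameter is injected so the example runs without actually waiting.

```python
import time


def retry(action, attempts=10, delay=2.0, sleep=time.sleep):
    """Call `action` until it returns something other than None.

    Sleeps `delay` seconds between failed attempts only, and gives up
    after `attempts` calls, returning None.
    """
    for attempt in range(attempts):
        result = action()
        if result is not None:
            return result
        if attempt < attempts - 1:
            sleep(delay)  # pause only when another attempt will follow
    return None


# Stand-in for a flaky request: fails twice, then yields a token
responses = iter([None, None, "Z4FrXH54dXl"])
pauses = []
token = retry(lambda: next(responses), attempts=5, delay=2.0, sleep=pauses.append)
print(token, pauses)  # Z4FrXH54dXl [2.0, 2.0]
```

Because the pause only happens after a failed attempt, a request that succeeds the first time returns immediately.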

