### Copyright 2020 Enea Dodi
https://www.apache.org/licenses/LICENSE-2.0

In [5]:
#@title Let's import everything and not worry for later :)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn
import requests
import time
import pickle
import os
import threading
import collections
from dateutil import rrule
from datetime import datetime, timedelta

# Machine Learning Project 1: Stock Market Predictors:

###### By *Enea Dodi* Summer 2020

## Introduction
I have designed a 'course' to learn the fundamentals of Machine Learning:

1. Complete Google Machine Learning Crash Course 
2. Simple/fun ML project to have hands-on experience
3. Andrew Trask's Grokking Deep Learning
4. More sophisticated ML project

   **option 1**  
   5. Pedro Domingos' college course for Machine Learning (found on YouTube)  
   6. Finish all HW assignments of this course

   **option 2**  
   5. Aurélien Géron's Hands-On Machine Learning, 2nd edition  
   6. Complete plenty of exercises in this book.

This project is the second on the list: *A simple/fun ML project to have hands-on experience*

The use of Artificial Intelligence for Stock Market Analysis/Prediction is a very big and competitive field. There are resources and websites which act as communities and APIs for developing your own algorithms and papers (such as [quandl](https://www.quandl.com/) and [quantopian](https://www.quantopian.com/)). However, **my goal for this project is not to develop an industry-standard algorithm, but rather to gain hands-on experience with Machine Learning and the tools involved**.

I will be using TensorFlow, Pandas, Numpy, and Matplotlib for the development of the algorithm, as well as BeautifulSoup to scrape valuable information from stock analysis websites (such as [finviz](https://finviz.com/)). 
**Big** thanks to Nicolas P. Rougier for [100 Numpy Exercises](https://github.com/rougier/numpy-100) and Alex Riley for [100 Pandas Exercises](https://github.com/ajcr/100-pandas-puzzles) which acted as tutorials for using Numpy and Pandas correctly for Machine Learning Projects.

## Preface
I would like to establish goals, framing, and overall training route before creation of the Machine Learning algorithms for this project. As a template, I will be using the *Deciding on Machine Learning* outline from Google's Introduction to ML Problem Framing.

**1) What should the ML model do?**
    - The overall goal of the ML models will be to develop a general classification, 'Going up next week' or 'Not going up next week', which works on any Stock Ticker regardless of sector, industry, price range, and volume range.
**2) What is the ideal outcome?**
    - The ideal outcome is to generate a list of stocks which have a 'high probability' of going up in price the 
    following week so an individual can assess and determine which of the high probability stocks to invest in.
**3) How will I know if the system is a success or a failure (Success / Failure Metrics)**
    - Measuring success is very simple with stock market predictors. During training and testing, the model will be evaluated on the precision of its classifications for the upcoming week. I will thus know the result immediately, or the following week (as I may decide to evaluate performance on real predictions rather than past price points).
    - I will be focusing on precision rather than recall or accuracy. The outcome of investing X amount in 5 companies which all go up 10% is the same as investing X amount in 500 companies which all go up 10%. Thus a **low false positive rate** is more important than a **low false negative rate**.
    Of course this premise can be challenged. For example, if we are investing X amount in 5 companies chosen through a precision-optimized metric, then each incorrect label, however unlikely, will lose a larger chunk of cash per error than investing X amount in 500 companies chosen through a recall-optimized metric. Thus I will not attempt to min-max for any strategy, but rather optimize for precision so long as the batch of positive classifications remains in the low dozens (out of the 1500+ Tickers evaluated). In other words, I will attempt to optimize the precision such that: 
$$(FPR) * [\frac{X}{PCC}]  \leq  [\frac{X}{TPC}] $$

     Where FPR = False Positive Rate ; X = Total Invested Money ; PCC = Positive Classified Cases ; TPC = True Positive Cases
    
    - I will also be displaying the Confusion Matrices, ROC curves, and AUC values on evaluations. 
    
    
    - Finally, stock prices can actually do three things: decrease in price, increase in price, or stay the same. The classifier, on the other hand, only distinguishes between 'Going up next week' and 'Not going up next week'. Thus I will decide, in a mostly subjective manner, how to divide up the three cases. 
        - 'Going up next week', the positive label, will be those stocks which demonstrate a 3% or more increase in price in the following week.
        - 'Not going up next week', the negative label, will be those stocks which do not demonstrate a 3% or more increase in price in the following week.
            - Upon further development of the models, I may replace the '3%' with the average variance in price of the
            specified stock.
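
As a rough illustration of the labeling rule and the precision-first evaluation described above, the sketch below labels weeks with the 3% threshold and then derives precision, recall, and false positive rate from the confusion-matrix counts. The function names, prices, and predictions are hypothetical and are not taken from the project's src directory.

```python
def label_next_week(weekly_closes, threshold=0.03):
    """Return 1 for each week whose following close is at least `threshold` higher, else 0.

    The last week has no following week, so it receives no label.
    """
    labels = []
    for current, following in zip(weekly_closes, weekly_closes[1:]):
        change = (following - current) / current
        labels.append(1 if change >= threshold else 0)
    return labels


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_ratio(numerator, denominator):
    """Avoid dividing by zero when a class never occurs."""
    return numerator / denominator if denominator else 0.0


# Hypothetical weekly closes for one ticker (illustrative values only).
closes = [100.0, 104.0, 105.0, 101.0, 110.0]
y_true = label_next_week(closes)      # [1, 0, 0, 1]

# Hypothetical model output for the same four weeks.
y_pred = [1, 1, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision = safe_ratio(tp, tp + fp)            # share of 'up' calls that were right
recall = safe_ratio(tp, tp + fn)               # share of real 'up' weeks that were found
false_positive_rate = safe_ratio(fp, fp + tn)  # share of 'not up' weeks flagged as 'up'

print(y_true, (tp, fp, tn, fn), precision, recall, false_positive_rate)
```

Optimizing for precision means pushing `fp` down even if `fn` grows, which matches the idea of handing a short, trustworthy list to a human trader.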
            
**4) Are the Metrics Measurable?**
    - All the specified metrics mentioned above are measurable through Stock Market and Data Analysis as they are objective. 
**5) What failure scenarios are not related to the success metrics?**
    - Of course, the features I will be feeding the algorithms are limited. The model must be retrained frequently as
    features such as 'Institutional Holders', 'Income', 'Sales', and 'Recommendations' are all volatile. Also, the model
    is not immune to concept drift, political influence, and unforeseen world circumstances. Thus, entire sectors, for
    example, may fall due to some influence not captured by the features, and these scenarios should not be counted
    against the algorithm.
    - As a third of the data used as features was recorded during a global pandemic, the model may learn certain 
    attributes that apply well to the current world circumstances but not to 'regular' or 'normal' world
    circumstances (a concept which may or may not exist, as the world is constantly in flux).
**6) How will the product use the predictions?**
     - The positive classifications will then ideally be filtered down by a human interpreter to a handful of 
    Tickers which the human interpreter can then decide to invest in.
**7) Where in the architecture should the code live?**
    - Every week I shall scrape the closing values on a Thursday and the model should be able to classify which Tickers 
    will 'Go up next week' before 4pm EST on Friday. There is ample time in this gap: the scraping will be
    automated and will only take a small fraction of the day, and the evaluation even less time. Thus there are no
    latency requirements.

### Finally,
all my code can be found in the repository. Besides the building of the ML algorithm (which will mostly happen here), all code for Data Extraction and Preparation can be found under the src directory. As there are over a thousand lines of code, it would make no sense to copy and paste it all into the presentation. Thus, I will simply be showcasing a few important methods, with **omitted comments**, in the Jupyter notebook, and the rest of my code can be viewed separately :)

# Data Extraction

Since cost was one of the restrictions I faced in developing this project, I sought out websites and APIs that were free. While Quandl/Quantopian provide important and consistent data on stocks, free users face many restrictions. Thus I decided to use two sources for data extraction:

**NOTE** I was very hyped for the Data Extraction and Preparation. The Google Machine Learning Crash Course states that the majority of time (some say 80%+) is spent on extraction and preparation of data for Machine Learning models. Thus I wanted to see how I handled that phase. TL;DR I didn't mind it at all!! 
### 1) Finviz
         Finviz is a stock screener and financial visualizer with an awesome free option. It includes a large host
         of analytical information (such as Quick Ratio, Average True Range, Short Float, Short Ratio, 
         Insider Own, etc) as well as a candlestick chart with some pattern recognition.


<img src="Finviz_Graph.png" width="800"/>

<img src="Finviz_table.png"  width="1000"/>

### 2) Yahoo! Finance
    Yahoo! Finance provides financial news, data, and commentary for the stock market. I primarily 
    interacted with Yahoo! Finance through [YFinance](https://pypi.org/project/yfinance/) as Yahoo! decommissioned
    their historical data API a couple of years back.

<img src="Yahoo_F_L.png" alt="drawing" width="600"/>

Finviz offers a large amount of information in the table that could be useful for the ML algorithm to make predictions. However, **A LOT** of this information will introduce bias because it is calculated from performance in the most recent time frame (one week, one month, etc). Thus I will be searching for mostly static values. On top of that, unlike the table demonstrated above, a lot of information is missing for the majority of stocks. As an example:

<img src="Avg_table.png"  width="1000"/>

Also, I face restrictions on the number of features that should be used in my model. Of course, the fewer the better. Finally, some really good potential features, such as RSI and ATR, are momentum indicators derived from the previous 14 days of values. If I had decided on rescraping/rebuilding my model every day, I would surely use these features. However, the purpose of this project is to get my hands dirty in Machine Learning, Data Scraping, and Data Preparation. Thus this will not be necessary for my model. Of course, my model is to be used as a 'filtration' system to recommend the best stocks for a trader to look at. Thus features such as RSI, ATR, and even Short Ratio **should** be used by the actual trader.

With all these restrictions, I had to choose features that prioritized staticity and were preferably found on every ticker listed on Finviz. I decided on:
   #####  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Sector, Industry, Country, Recommendations, Income, and Sales   

YFinance will be used to get the historical chart data for each of the stocks chosen from Finviz. Again, as I am no specialist, I decided not to rely on Opening, High, and Low values. Rather, I am interested in how the stock's price looks at **Closing**. However, there is a lot of volatility day to day, and this may serve to confuse the model. I am also not interested in short-term trades. 

My reasoning was that the larger the time interval per data point, the more consistent with overall trends the price and volume will be. Thus I decided to include in my 'features' (the quotations will make sense towards the end of the project):  
#####  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2-Year weekly Closing and Volume values



**NOTE** Initially, I had also decided to use Institutional Holders as a Multi-Hot-Encoded feature. This information was also available through YFinance. However, only a low percentage of stocks had these values, and the set of institutional holders contained over a thousand unique holders. Thus, it would not be a good feature.
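
To make the problem with this feature concrete, here is a minimal sketch of multi-hot encoding. The tickers and holder names are made up for illustration, and the helper `multi_hot` is hypothetical rather than part of the project code.

```python
def multi_hot(holders_per_ticker):
    """Encode each ticker's holders as a 0/1 vector over all unique holders."""
    vocabulary = sorted({h for holders in holders_per_ticker.values() for h in holders})
    position = {holder: i for i, holder in enumerate(vocabulary)}

    encoded = {}
    for ticker, holders in holders_per_ticker.items():
        row = [0] * len(vocabulary)
        for holder in holders:
            row[position[holder]] = 1
        encoded[ticker] = row
    return vocabulary, encoded


# Made-up example: 'CCC' stands for a ticker with no holder data available.
holders = {
    'AAA': ['Fund X', 'Fund Y'],
    'BBB': ['Fund Y', 'Fund Z'],
    'CCC': [],
}
vocabulary, encoded = multi_hot(holders)
print(vocabulary)
print(encoded)
```

Every unique holder adds one column, so over a thousand holders would mean over a thousand mostly-zero inputs per ticker, and tickers without holder data (like `CCC` here) would carry no signal at all.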

## &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Data Extraction Code

For Data Extraction from Finviz, I built a StockScraper class. It utilizes BeautifulSoup, lxml, and requests to scrape information from Finviz.com.

StockScraper, as one of its instance attributes, initializes StockScraperHelper(). Largely, this class was created to avoid overburdening the StockScraper class with methods and to separate out any general methods that can be reused in other stock scraper classes.

In [6]:
    def __init__(self):
        self.helper = StockScraperHelper()
        self.scraped_info = []
        self.scraped_tickers = []
        self.HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
        #Used to prevent authorization issues.

The main method in the class, get_all_stock_table_information, loads the initial screener page and then iteratively calls get_stock_table_information for each page available.

In [None]:
def JUPYTER_get_all_stock_table_information(self,url,sector='all',industry='all',country='all',market_cap='all',minPrice=2.50,maxPrice=2000,volume=200000,minimal = True):
    # The first screener page also reports the total number of matching tickers.
    soup = self.get_entire_HTML_page(url)
    total = int(soup.find('td',{'class':'count-text'},recursive=True).text.split(' ')[1])
    # Finviz shows 20 tickers per page and the first page is already loaded,
    # so only the remaining pages need to be requested.
    iterations = (total - 1) // 20

    url_extension = 'r='
    curr_tickers = 21

    stocks = self.get_stock_table_information(soup, sector, industry, country, market_cap, minPrice, maxPrice, volume)

    for i in range(iterations):
        next_url = url + url_extension + str(curr_tickers)
        print(next_url)
        next_soup = self.get_entire_HTML_page(next_url)
        curr_tickers += 20
        stocks = stocks + self.get_stock_table_information(next_soup, sector, industry, country, market_cap, minPrice, maxPrice, volume)

    if minimal:
        # Keep only the descriptive columns by dropping the price-related ones.
        keys = ('Market Cap', 'Price', 'Change', 'Volume')
        for stock in stocks:
            for k in keys:
                del stock[k]

    self.scraped_info = stocks
    self.scraped_tickers = self.extract_tickers()
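
The paging arithmetic in the method above can be checked in isolation. The sketch below builds the `r=` offsets for every page after the first, assuming 20 tickers per page as in the method. The helper name `screener_page_urls` is hypothetical, and rounding with `(total - 1) // page_size` is my own choice so that a total that is an exact multiple of 20 does not trigger a request for an empty trailing page.

```python
def screener_page_urls(base_url, total, page_size=20):
    """Return the URLs of every screener page after the first one.

    The first page (rows 1..page_size) is assumed to be loaded already.
    """
    extra_pages = (total - 1) // page_size if total > 0 else 0
    return [f"{base_url}r={1 + page_size * n}" for n in range(1, extra_pages + 1)]


base = 'https://finviz.com/screener.ashx?v=111&'
urls = screener_page_urls(base, total=45)
for u in urls:
    print(u)
```

With 45 tickers this yields the pages starting at rows 21 and 41, while exactly 40 tickers yields only the page starting at row 21.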

In [9]:
def JUPYTER_get_stock_table_information(self,soup,sector='all',industry='all',country='all',market_cap='all',minPrice=8,maxPrice=1000,volume=200000):
    # Screener rows alternate between a dark and a light CSS class.
    table_rows = soup.find_all('tr',{'class':'table-dark-row-cp'}) + soup.find_all('tr',{'class':'table-light-row-cp'}) 
    stock_list = []
    for r in table_rows:
        # Every cell value sits inside an <a> tag, so collect the link texts.
        info = []
        for child in r.descendants:
            if child.name == 'a':
                info.append(child.text)
        r_dict = self.helper.row_to_dict(info)
        stock_list.append(r_dict)

    # Apply each optional filter only when a specific value was requested.
    if sector != 'all':
        stock_list = [x for x in stock_list if x['Sector'] == sector]
    if industry != 'all':
        stock_list = [x for x in stock_list if x['Industry'] == industry]
    if country != 'all':
        stock_list = [x for x in stock_list if x['Country'] == country]
    if market_cap != 'all':
        stock_list = [x for x in stock_list if x['Market Cap'] > int(market_cap)]
    stock_list = [x for x in stock_list if minPrice < x['Price'] < maxPrice]
    return stock_list
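
The method above relies on `self.helper.row_to_dict`, which is not shown in this notebook. Purely as a guess at what such a helper needs to do, the sketch below maps a row's cell texts to named fields and converts the numeric ones so that the price and market cap comparisons work. The column order, the suffix handling, and the function names are all assumptions, not the actual StockScraperHelper code.

```python
# Assumed column order of the screener's overview table.
COLUMNS = ('No.', 'Ticker', 'Company', 'Sector', 'Industry', 'Country',
           'Market Cap', 'P/E', 'Price', 'Change', 'Volume')
NUMERIC_COLUMNS = ('Market Cap', 'P/E', 'Price', 'Change', 'Volume')
SUFFIXES = {'K': 1e3, 'M': 1e6, 'B': 1e9, 'T': 1e12}


def parse_number(text):
    """Turn strings like '1.50B', '12.34', '-1.20%' or '1,234,567' into floats.

    A bare '-' (missing value) becomes None.
    """
    text = text.strip().replace(',', '').rstrip('%')
    if text in ('', '-'):
        return None
    if text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


def row_to_dict(cells):
    """Pair each cell with its column name and convert the numeric columns."""
    row = dict(zip(COLUMNS, cells))
    for key in NUMERIC_COLUMNS:
        row[key] = parse_number(row[key])
    return row


# Made-up row for illustration.
cells = ['1', 'ABC', 'Example Corp', 'Technology', 'Software', 'USA',
         '1.50B', '-', '12.34', '-1.20%', '1,234,567']
row = row_to_dict(cells)
print(row)
```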

Recommendations, Income, and Sales are scraped in their own separate method: add_RIS(self)


On ML_P1_data_extraction, I use:

In [10]:
def JUPYTER_prepare_Stock_Scraper():
    ss = StockScraper()
    url = 'https://finviz.com/screener.ashx?v=111&'

    ss.get_all_stock_table_information(url,market_cap='100000000',minPrice =2,maxPrice = 2000,volume=100000)
    print('got table information')
    ss.add_RIS()
    print('added RIS')
    return ss

 The total execution time with the listed parameters to build the StockScraper() was:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3482.4135 seconds** or roughly 58 minutes for Low Criteria Tickers

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**1588.4088 seconds** or roughly 26.5 minutes for High Criteria Tickers

After saving the progress, it was time to scrape historical data with YFinance. The code for this is also found in ML_P1_data_extraction.
To showcase the most important method:

In [12]:

import yfinance as yf

def query_and_format_yfinance_data(allTickers,periodt = '1mo',intervalt = '1wk'):
    print('period = ', periodt)
    print('interval = ', intervalt)
    print(' '.join(allTickers))
    # One batched request for every ticker; the columns come back grouped per ticker.
    data = yf.download(tickers = allTickers, period = periodt, interval = intervalt, group_by='ticker', auto_adjust=True,threads= True)
    
    queried_data = {}
    for t in allTickers:
        print("Curr ticker: " + str(t))
        queried_data[t] = {}
        curr_stock_history = data[t]
        # Flatten each row into 'Date: YYYY-MM-DD Close' and '... Volume' keys.
        for r in range(len(curr_stock_history)):
            curr_date = "Date: " + str(curr_stock_history.index[r].date())
            queried_data[t][curr_date + ' Close'] = curr_stock_history['Close'].iloc[r]
            queried_data[t][curr_date + ' Volume'] = curr_stock_history['Volume'].iloc[r]
            
        time.sleep(0.1)
            
    print("finished gathering ticker info from yfinance")
    return queried_data


In the ML_P1_data_extraction file, you will find remnants of different versions or different applications of query_and_format_yfinance_data. This is due to experimentation. YFinance proved to be very unreliable in its data extraction from Yahoo! Finance. 

* Firstly, a lot of the methods in the YFinance library were not rigorous. They lacked error checking and exception handling. To make this code work for all stocks, I had to modify the library quite a bit.


* My first implementation featured additional threading and division of the ticker list to quicken the process. However, this led YFinance to skip roughly a third of the tickers, even though I introduced locking in my code!


* My second implementation is what I ended up using (the code above). However, this also featured its own problems! Specifically, it gave a lot of incorrect dates! As an example, this is a snippet demonstrating how many null values were at each specified date:

> Date: 2020-07-06 Close:      42

> Date: 2020-07-07 Close:    1858

> Date: 2020-07-13 Close:      37

> Date: 2020-07-14 Close:    1858

> Date: 2020-07-17 Close:    1859

> Date: 2020-07-20 Close:      36

> Date: 2020-07-21 Close:    1858

> Date: 2020-07-27 Close:      34

* As we can see, if a featured week is 2020-07-06 to 2020-07-13, then the values between these two dates
  are null for nearly all stocks (there were 1859 tickers on the 'High' Criteria list of Tickers).
    - This bug was fixed by creating a method that pushes all correct dates to a list and filters the bad ones out for
        both Closing and Volume values.
        
        
* My third implementation tried to use the yf.Ticker() method rather than the yf.download() method. While this did avoid getting the incorrect dates, it featured a lot of missing data (though not as much as the threading implementation).
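
The date-filtering fix mentioned for the second implementation could look roughly like the sketch below. It is not the method from ML_P1_data_extraction: the function name, the 50% cutoff, and the sample data are assumptions. The idea is simply to drop every 'Date: ... Close' or 'Date: ... Volume' key that is null for most tickers, which separates counts like 42 of 1859 from counts like 1858 of 1859.

```python
def keep_valid_columns(queried_data, max_null_fraction=0.5):
    """Drop keys whose value is missing (None or NaN) for most tickers."""
    if not queried_data:
        return {}

    null_counts = {}
    for values in queried_data.values():
        for key, value in values.items():
            is_null = value is None or value != value  # NaN is not equal to itself
            null_counts[key] = null_counts.get(key, 0) + (1 if is_null else 0)

    n_tickers = len(queried_data)
    valid_keys = {k for k, c in null_counts.items() if c / n_tickers <= max_null_fraction}
    return {
        ticker: {k: v for k, v in values.items() if k in valid_keys}
        for ticker, values in queried_data.items()
    }


# Made-up data: 2020-07-07 is a spurious date that is null for every ticker.
nan = float('nan')
raw = {
    'AAA': {'Date: 2020-07-06 Close': 10.0, 'Date: 2020-07-07 Close': nan},
    'BBB': {'Date: 2020-07-06 Close': 20.0, 'Date: 2020-07-07 Close': nan},
    'CCC': {'Date: 2020-07-06 Close': nan, 'Date: 2020-07-07 Close': nan},
}
cleaned = keep_valid_columns(raw)
print(sorted(cleaned['AAA']))
```

Genuinely missing values on a valid date (such as `CCC` on 2020-07-06) are kept, since handling those belongs to data preparation rather than to this filter.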

 The total execution time to query/scrape information with YFinance was:
 
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**745.8681 seconds** or roughly 12.4 minutes for Low Criteria Tickers
    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3895.1087 seconds** or roughly 65 minutes for High Criteria Tickers (featuring Institutional Holder querying and time.sleep of 0.2)
    

# Data Preparation