### Copyright 2020 Enea Dodi
https://www.apache.org/licenses/LICENSE-2.0

In [5]:
#@title Let's import everything and not worry about it later :)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn
import requests
import time
import pickle
import os
import threading
import collections
import yfinance as yf  # used by query_and_format_yfinance_data
from sklearn.preprocessing import MinMaxScaler, StandardScaler  # used by the scaling functions
from dateutil import rrule
from datetime import datetime, timedelta

# Machine Learning Project 1: Stock Market Predictors:

###### By *Enea Dodi* Summer 2020

## Introduction
I have designed a 'course' to learn the fundamentals of Machine Learning:

1. Complete Google Machine Learning Crash Course 
2. Simple/fun ML project to have hands-on experience
3. Andrew Trask's Grokking Deep Learning
4. More sophisticated ML project

   **option 1**  
   5. Pedro Domingos' college course on Machine Learning (found on YouTube)  
   6. Finish all HW assignments of this course

   **option 2**  
   5. Aurélien Géron's Hands-On Machine Learning, 2nd edition  
   6. Complete plenty of exercises in this book.

This project is the second on the list: *A simple/fun ML project to have hands-on experience*

The use of Artificial Intelligence for Stock Market Analysis/Prediction is a very big and competitive field. There are resources and websites which act as communities and APIs for developing your own algorithms and papers (such as [quandl](https://www.quandl.com/) and [quantopian](https://www.quantopian.com/)). However, **my goal for this project is not to develop an industry-standard algorithm, but rather to get hands-on experience with Machine Learning and the tools involved.**

I will be using TensorFlow, Pandas, Numpy, and Matplotlib for the development of the algorithm, as well as BeautifulSoup to scrape valuable information from stock analysis websites (such as [finviz](https://finviz.com/)). 
**Big** thanks to Nicolas P. Rougier for [100 Numpy Exercises](https://github.com/rougier/numpy-100) and Alex Riley for [100 Pandas Exercises](https://github.com/ajcr/100-pandas-puzzles) which acted as tutorials for using Numpy and Pandas correctly for Machine Learning Projects.

## Preface
I would like to establish goals, framing, and overall training route before creation of the Machine Learning algorithms for this project. As a template, I will be using the *Deciding on Machine Learning* outline from Google's Introduction to ML Problem Framing.

**1) What should the ML model do?**
    - The overall goal of the ML models will be to develop a general classification, 'Going up next week' or 'Not going up next week', which works on any stock ticker regardless of sector, industry, price range, and volume range.
**2) What is the ideal outcome?**
    - The ideal outcome is to generate a list of stocks which have a 'high probability' of going up in price the 
    following week so an individual can assess and determine which of the high probability stocks to invest in.
**3) How will I know if the system is a success or a failure (Success / Failure Metrics)**
    - Measuring success is very simple with stock market predictors. During training and testing, the model will be evaluated on the precision of its classifications for the upcoming week. I will thus know whether a prediction was right immediately (on historical data) or the following week (as I may decide to evaluate performance on real predictions rather than past price points).
    - I will be focusing on precision rather than recall or accuracy. The outcome of investing X amount in 5 companies which all go up 10% is the same as investing X amount in 500 companies which all go up 10%. Thus a **low false positive rate** is more important than a **low false negative rate**.
    Of course this premise can be challenged. For example, if we are investing X amount in 5 companies chosen through a precision-optimized metric, then an incorrect label, no matter how small its chance, will result in a larger chunk of cash lost per error compared to investing X amount in 500 companies chosen through a recall-optimized metric. Thus I will not attempt to min-max any strategy, but rather optimize for precision as long as the batch of positive classifications remains in the low dozens (out of the 1500+ tickers evaluated). In other words, I will attempt to optimize the precision such that: 
$$\text{FPR} \cdot \frac{X}{\text{PCC}} \leq \frac{X}{\text{TPC}}$$

     Where FPR = False Positive Rate ; X = Total Invested Money ; PCC = Positive Classified Cases ; TPC = True Positive Cases
    
    - I will also be displaying the Confusion Matrices, ROC curves, and AUC values on evaluations. 
    
    
    - Finally, stock prices are actually able to do three things: decrease in price, increase in price, or stay the same. The classifier, on the other hand, only distinguishes between 'Going up next week' and 'Not going up next week'. Thus I will decide, in a mostly subjective manner, how to divide up the three cases. 
        - 'Going up next week', or the positive label, will be those stocks which demonstrate a 3% or more increase in price in the following week.
        - 'Not going up next week', or the negative label, will be those stocks which do not demonstrate a 3% or more increase in price in the following week.
            - Upon further development of the models, I may replace the '3%' with the average variance in price of the
            specified stock.
            
**4) Are the Metrics Measurable?**
    - All the specified metrics mentioned above are measurable through Stock Market and Data Analysis as they are objective. 
**5) What failure scenarios are not related to the success metric?**
    - Of course, the features I will be feeding the algorithms are limited. The model must be refreshed frequently as
    features such as 'Institutional Holders', 'Income', 'Sales', and 'Recommendations' are all volatile. Also, the model
    is not immune to concept drift, political influence, and unforeseen world circumstances. Thus entire sectors, for
    example, may fall due to some influence not captured by the features, and these scenarios should not be counted
    against the algorithm.
    - As a third of the data pushed as features takes place during a global pandemic, the model may learn certain 
    attributes that apply well to the current world circumstances but not to 'regular' or 'normal' world
    circumstances (the concept of which may or may not exist, as the world is constantly in flux).
**6) How will the product use the predictions?**
     - The positive classifications will then ideally be filtered down by a human interpreter to a handful of 
    tickers which the human interpreter can then decide to invest in.
**7) Where in the architecture should the code live?**
    - Every week I shall scrape the closing values on a Thursday and the model should be able to classify which Tickers 
    will 'Go up next week' before 4pm EST on Friday. There is ample time in this gap, the scraping will be
    automated and will only take a small fraction of the day, and the evaluation even less time. Thus there are no
    latency requirements
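
The 3% labelling rule and the precision-first evaluation described under question 3 can be illustrated with a small sketch. The function names (`label_going_up`, `precision_recall_fpr`) and the sample numbers are hypothetical and are not taken from the project code; only the 3% threshold comes from the text above.

```python
def label_going_up(close_this_week, close_next_week, threshold=0.03):
    """Return 1 ('Going up next week') if the price rose by at least `threshold`."""
    change = (close_next_week - close_this_week) / close_this_week
    return 1 if change >= threshold else 0


def precision_recall_fpr(y_true, y_pred):
    """Compute precision, recall and false positive rate from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return precision, recall, fpr


# Made-up weekly closes for four tickers: (this week, next week).
closes = [(10.0, 10.5), (20.0, 20.2), (5.0, 4.8), (8.0, 8.4)]
y_true = [label_going_up(a, b) for a, b in closes]
y_pred = [1, 1, 0, 0]  # made-up model output
precision, recall, fpr = precision_recall_fpr(y_true, y_pred)
```

Precision only looks at the tickers the model flagged as positive, which is why it matches the "short list for a human to review" use case better than accuracy does.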

### Finally,
all my code can be found in the repository. Besides the building of the ML algorithm (which will largely happen here), all code for Data Extraction and Preparation can be found under the src directory. As there are over a thousand lines of code, it would make no sense to copy-paste it all into the presentation. Thus I will simply showcase a few important methods, with **omitted comments**, in the Jupyter notebook, and the rest of my code can be viewed separately :)

# Data Extraction

Cost was one of the restrictions I faced while developing this project, so I sought out websites and APIs that were free. While Quandl and Quantopian provide important and consistent data on stocks, free users face many restrictions. Thus I decided to use two sources for data extraction:

**NOTE** I was very hyped for the Data Extraction and Preparation. The Google Machine Learning Crash Course states that the majority of time (some say 80%+) is spent on extraction and preparation of data for Machine Learning models. Thus I wanted to see how I handled that phase. TL;DR I didn't mind it at all!! 
### 1) Finviz
         Finviz is a stock screener and financial visualizer with an awesome free option. It includes a large host
         of analytical information (such as Quick Ratio, Average True Range, Short Float, Short Ratio, 
         Insider Own, etc) as well as a candlestick chart with some pattern recognition.


<img src="img/Finviz_Graph.png" width="800"/>

<img src="img/Finviz_table.png"  width="1000"/>

### 2) Yahoo! Finance
    Yahoo! Finance provides financial news, data, and commentary for the stock market. I primarily 
    interacted with Yahoo! Finance through [YFinance](https://pypi.org/project/yfinance/) as Yahoo! decommissioned
    their historical data API a couple of years back.

<img src="img/Yahoo_F_L.png"  width="600"/>

Finviz offers a large amount of information in the table that could be useful for the ML algorithm to make predictions. However, **A LOT** of this information will introduce bias because it is calculated from performance in the most recent time frame (one week, one month, etc). Thus I will be searching for mostly static values. On top of that, unlike the table shown above, a lot of information is missing for the majority of stocks. As an example:

<img src="img/Avg_table.png"  width="1000"/>

Also, I face restrictions on the number of features that should be used in my model. Of course, the fewer the better. Finally, some really good potential features, such as RSI and ATR, are momentum indicators and derive from the previous 14 days of values. If I had decided on rescraping/rebuilding my model every day, I would surely use these features. However, the purpose of this project is to get my hands dirty in Machine Learning, Data Scraping, and Data Preparation. Thus this will not be necessary for my model. Of course, my model is to be used as a 'filtration' system to recommend the best stocks for a trader to look at. Thus features such as RSI, ATR, and even Short Ratio **should** be used by the actual trader.

With all these restrictions, I had to choose features that prioritized staticity and were preferably found on every ticker listed on Finviz. I decided on:
   #####  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Sector, Industry, Country, Recommendations, Income, and Sales   

YFinance will be used to get the historical chart data for each of the stocks chosen from Finviz. Again, as I am no specialist, I decided not to rely on Opening, High, and Low values. Rather, I am interested in how the stock's price looks at **Closing**. However, there is a lot of volatility day to day, and this may serve to confuse the model. I am also not interested in short-term trades. 

I thought the larger the time interval per data point, the more consistent to overall trends the price and volume will be. Thus I decided to include in my 'features' (the quotations will make sense towards the end of the project):  
#####  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2-Year weekly Closing and Volume values



**NOTE** Initially, I had also decided to use Institutional Holders as a Multi-Hot-Encoded feature. This information was also available through YFinance. However, only a low percentage of stocks had these values, and the set of institutional holders contained over a thousand unique holders. Thus it would not be a good feature.

## &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Data Extraction Code

For Data Extraction from Finviz, I built a StockScraper class. It utilizes BeautifulSoup, lxml, and requests to scrape information from Finviz.com.

StockScraper initializes a StockScraperHelper() as one of its instance variables. Largely, this helper class was created to avoid overburdening the StockScraper class with methods and to separate out any general methods that can be reused in other stock scraper classes.

In [6]:
    def __init__(self):
        self.helper = StockScraperHelper()
        self.scraped_info = []
        self.scraped_tickers = []
        self.HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
        #Used to prevent authorization issues.

The main method in the class, get_all_stock_table_information, loads the initial screener page and then iteratively calls get_stock_table_information for each available page.

In [None]:
def JUPYTER_get_all_stock_table_information(self,url,sector='all',industry='all',country='all',market_cap='all',minPrice=2.50,maxPrice=2000,volume=200000,minimal = True):
    
    soup = self.get_entire_HTML_page(url)
    total = int(soup.find('td',{'class':'count-text'},recursive=True).text.split(' ')[1])
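    # The first page is already loaded, so only the remaining pages need fetching.
    # (total - 1) // 20 avoids requesting an extra page when total is a multiple of 20.
    iterations = (total - 1) // 20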

    url_extension = 'r='
    curr_tickers = 21

    l = self.get_stock_table_information(soup, sector, industry, country, market_cap, minPrice, maxPrice, volume)

    for i in range(iterations):
        next_url = url + url_extension + str(curr_tickers)
        print(next_url)
        next_soup = self.get_entire_HTML_page(next_url)
        curr_tickers += 20
        l = l + self.get_stock_table_information(next_soup, sector, industry, country, market_cap, minPrice, maxPrice, volume)

    if minimal:
        
        keys = ('Market Cap', 'Price', 'Change', 'Volume')
        for i in l:
            for k in keys:
                del i[k]

    self.scraped_info = l
    self.scraped_tickers = self.extract_tickers()
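
The pagination performed by the method above can be sketched in isolation, without any network access. `page_offsets` and `page_urls` are hypothetical helper names that do not exist in the project; the page size of 20 and the `r=` URL parameter are taken from the code above.

```python
def page_offsets(total, per_page=20):
    """Return the `r=` offsets of every page after the first one."""
    # The first page shows rows 1..per_page, so the next page starts at per_page + 1.
    return list(range(per_page + 1, total + 1, per_page))


def page_urls(base_url, total, per_page=20):
    """Build the URL of each follow-up page by appending `r=<offset>`."""
    return [base_url + 'r=' + str(offset) for offset in page_offsets(total, per_page)]


# With 45 matching tickers there are two follow-up pages (rows 21-40 and 41-45).
urls = page_urls('https://finviz.com/screener.ashx?v=111&', total=45)
```

Deriving the offsets from the total up front makes it easy to check the edge cases, such as a total that is an exact multiple of the page size.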

In [9]:
def JUPYTER_get_stock_table_information(self,soup,sector='all',industry='all',country='all',market_cap='all',minPrice=8,maxPrice=1000,volume=200000):
    table_rows = soup.find_all('tr',{'class':'table-dark-row-cp'}) + soup.find_all('tr',{'class':'table-light-row-cp'}) 
    stock_list = []
    for r in table_rows:
        info = []
        for child in r.descendants:
            if child.name == 'a':
                info.append(child.text)
        r_dict = self.helper.row_to_dict(info)
        stock_list.append(r_dict)

    if sector != 'all':
        stock_list = list(filter(lambda x: x['Sector'] == sector,stock_list))
    if industry != 'all':
        stock_list = list(filter(lambda x: x['Industry'] == industry,stock_list))
    if country != 'all':
        stock_list = list(filter(lambda x: x['Country'] == country,stock_list))
    if market_cap != 'all':
        stock_list = list(filter(lambda x: x['Market Cap']  > int(market_cap),stock_list))
    stock_list = list(filter(lambda x: (x['Price'] > minPrice) & (x['Price'] < maxPrice),stock_list))
    return stock_list

Recommendations, Income, and Sales are scraped in their own separate method: add_RIS(self)


In ML_P1_data_extraction, I use:

In [1]:
def JUPYTER_prepare_Stock_Scraper():
    ss = StockScraper()
    url = 'https://finviz.com/screener.ashx?v=111&'

    ss.get_all_stock_table_information(url,market_cap='100000000',minPrice =2,maxPrice = 2000,volume=100000)
    print('got table information')
    ss.add_RIS()
    print('added RIS')
    return ss

The total execution time with the listed parameters to build the StockScraper() was:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3482.4135 seconds** or roughly 58 minutes for Low Criteria Tickers

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**1588.4088 seconds** or roughly 26.5 minutes for High Criteria Tickers

After saving the progress, it was time to scrape historical data with YFinance. The code for this is also found in ML_P1_data_extraction.
To showcase the most important method:

In [12]:

def query_and_format_yfinance_data(allTickers,periodt = '1mo',intervalt = '1wk'):
    print('period = ', periodt)
    print('interval = ', intervalt)
    s_tickers = ' '.join(allTickers)
    print(s_tickers)
    data = yf.download(tickers = allTickers, period = periodt, interval = intervalt, group_by='ticker', auto_adjust=True,threads= True)
    rows = data.shape[0]
    
    queried_data = {}
    for t in allTickers:
        print("Curr ticker: " + str(t))
        queried_data[t] = {}
        curr_stock_history = data[t]
        for r in range(rows):
            curr_date = "Date: " + str(curr_stock_history.index[r].date())
            queried_data[t][str(curr_date) + ' Close'] = curr_stock_history['Close'].iloc[r]
            queried_data[t][str(curr_date) + ' Volume'] = curr_stock_history['Volume'].iloc[r]         
            
        time.sleep(0.1)
            
    print("finished gathering ticker info from yfinance")
    return queried_data
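
To make the shape of the returned dictionary concrete, here is a small offline sketch. It replaces the `yf.download` call with a synthetic frame whose `(ticker, field)` column layout is assumed to mirror what `group_by='ticker'` produces; the tickers `AAA`/`BBB`, the values, and the helper name `flatten_history` are all made up for illustration.

```python
import pandas as pd

# Synthetic stand-in for the downloaded frame: weekly dates as the index
# and (ticker, field) pairs as columns. All values are made up.
dates = pd.to_datetime(['2020-07-06', '2020-07-13'])
columns = pd.MultiIndex.from_product([['AAA', 'BBB'], ['Close', 'Volume']])
data = pd.DataFrame(
    [[10.0, 1000, 20.0, 2000],
     [11.0, 1100, 21.0, 2100]],
    index=dates, columns=columns)


def flatten_history(data, tickers):
    """Turn the grouped frame into {ticker: {'Date: <day> Close': value, ...}}."""
    flat = {}
    for ticker in tickers:
        row = {}
        for day, values in data[ticker].iterrows():
            key = 'Date: ' + str(day.date())
            row[key + ' Close'] = values['Close']
            row[key + ' Volume'] = values['Volume']
        flat[ticker] = row
    return flat


flat = flatten_history(data, ['AAA', 'BBB'])
# One row per ticker, one column per (date, field) pair.
wide = pd.DataFrame.from_dict(flat, orient='index')
```

The resulting `wide` frame has the ticker as its index and one column per date and field, which is the layout the preparation steps later in the notebook work on.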


In the ML_P1_data_extraction file, you will find remnants of different versions and different applications of query_and_format_yfinance_data. This is due to experimentation. YFinance proved to be very unreliable in its data extraction from Yahoo! Finance. 

* Firstly, a lot of the methods under the YFinance library were not rigorous. They lacked error checking, and exception handling. To make this code work for all stocks, I had to modify the library quite a bit.


* My first implementation featured additional threading and division of the ticker list to quicken the process. However, this led to YFinance skipping roughly a third of the tickers, even though I introduced locking in my code!


* My second implementation is what I ended up using (the code above). However, this also came with its own problems! Specifically, it returned a lot of incorrect dates! As an example, this is a snippet showing how many null values there were at each date:

> Date: 2020-07-06 Close:      42

> Date: 2020-07-07 Close:    1858

> Date: 2020-07-13 Close:      37

> Date: 2020-07-14 Close:    1858

> Date: 2020-07-17 Close:    1859

> Date: 2020-07-20 Close:      36

> Date: 2020-07-21 Close:    1858

> Date: 2020-07-27 Close:      34

* As we can see, if a featured week runs from 2020-07-06 to 2020-07-13, then the values for dates between these two
  are null for essentially all stocks (there were 1859 tickers on the 'High' Criteria list of tickers).
    - This bug was fixed by creating a method that pushes all correct dates to a list and filters the bad ones out for
        both Closing and Volume values.
        
        
* My third implementation tried to use the yf.Ticker() method rather than the yf.download() method. While this did avoid the incorrect dates, it came with a lot of missing data (though not as much as the threading implementation).
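
The date filtering mentioned in the second bullet can be sketched as follows. The text does not say how the "correct" dates were identified, so this sketch assumes a simple rule: a date column is bad if more than half of the tickers are null in it. The helper name `drop_mostly_null_dates`, the 0.5 threshold, and the sample data are all hypothetical.

```python
import numpy as np
import pandas as pd


def drop_mostly_null_dates(df, max_null_fraction=0.5):
    """Drop date columns in which more than `max_null_fraction` of tickers are null."""
    null_fraction = df.isnull().mean(axis=0)
    good_columns = null_fraction[null_fraction <= max_null_fraction].index
    return df[good_columns]


# Made-up example: the 2020-07-07 column is entirely null, like the bad dates above.
df = pd.DataFrame({
    'Date: 2020-07-06 Close': [10.0, 20.0, 30.0],
    'Date: 2020-07-07 Close': [np.nan, np.nan, np.nan],
    'Date: 2020-07-13 Close': [11.0, np.nan, 31.0],
}, index=['AAA', 'BBB', 'CCC'])

clean = drop_mostly_null_dates(df)
```

A threshold works here because the null counts in the snippet above fall into two clearly separated groups (a few dozen versus nearly all 1859 tickers).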

 The total execution time to query/scrape information with YFinance was:
 
 
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**745.8681 seconds** or roughly 12.4 minutes for Low Criteria Tickers
    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3895.1087 seconds** or roughly 65 minutes for High Criteria Tickers (featuring Institutional Holder querying and time.sleep of 0.2)
    

# Data Preparation & Data Preparation Code

### Removing Dangerous Examples

I planned to only use stocks that have been on the stock exchange for at least the last two years. Sadly, a lot of the scraped tickers have been around for less than two years. Thus these tickers are missing valuable information and do not meet my criteria.

However, I wanted to keep tickers that were salvageable. Rather than removing every ticker with any missing information, I created a function that removes the tickers with the most missing information and writes down the tickers that are missing a substantial amount of information in volume columns, price columns, and categorical columns.

The following code did as such:



In [None]:
def get_null_information(filename,df,remove_worst = False):
    pd.set_option('display.max_rows', len(df))
    f = open(filename + '.txt','w')
    #Divide up the df into three sections: volume per week, price per week, Non time related values
    filter_vpw = [col for col in df if 'Volume' in col]
    filter_ppw = [col for col in df if 'Close' in col]
    time_rv = filter_vpw + filter_ppw
    filter_ntrv = [col for col in df if col not in time_rv]

    volume_per_week_df = df[filter_vpw]
    price_per_week_df = df[filter_ppw]
    categorical_values_df = df[filter_ntrv]
    
    serv = volume_per_week_df.isnull().sum(axis=1)
    serv_m = serv.mean()
    serv_std = serv.std()
    f.write('average # null per ticker for volume columns in dataframe is: ' + str(serv.mean()) + '\n')
    f.write('std # null per ticker for volume columns in dataframe is: ' + str(serv_std) + '\n')
    serp = price_per_week_df.isnull().sum(axis=1)
    serp_m = serp.mean()
    serp_std = serp.std()
    f.write('average # null per ticker for price columns in dataframe is: ' + str(serp.mean())+ '\n')
    f.write('std # null per ticker for price columns in dataframe is: ' + str(serp_std) + '\n')
    serc = categorical_values_df.isnull().sum(axis=1)
    serc_m = serc.mean()
    serc_std = serc.std()
    f.write('average # null per ticker for categorical columns in dataframe is: ' + str(serc.mean())+ '\n')
    f.write('std #null per ticker for categorical columns in dataframe is: ' + str(serc_std) + '\n')
    
    # Get the subset where the number of nulls is at least one standard deviation above the mean, then sort.
    # This gives me an idea of which stocks to drop.
    serv_outliers = serv[(serv >= serv_m + 1*serv_std)]
    serv_outliers.sort_values(ascending=False,inplace=True,na_position='first')
    
    serp_outliers = serp[(serp >= serp_m + 1*serp_std)]
    serp_outliers.sort_values(ascending=False,inplace=True,na_position='first')
    
    serc_outliers = serc[(serc >= serc_m + 1*serc_std)]
    serc_outliers.sort_values(ascending=False,inplace=True,na_position='first')

    size = str(len(df.index))
    
    f.write("VOLUME NULL OUTLIERS: \n")
    f.write(str(serv_outliers.shape[0]) + "/" + size + " are one standard deviation away from average null values per ticker\n")
    f.write(serv_outliers.to_string())
    f.write("\n\n\n")
    f.write("PRICE NULL OUTLIERS: \n")
    f.write(str(serp_outliers.shape[0]) +"/" + size + " are one standard deviation away from average null values per ticker\n")
    f.write(serp_outliers.to_string())
    f.write("\n\n\n")
    f.write("CATEGORICAL NULL OUTLIERS: \n")
    f.write(str(serc_outliers.shape[0]) + "/" + size + " are one standard deviation away from average null values per ticker\n")
    f.write(serc_outliers.to_string())
    f.write("\n\n\n")
    
    
    # Tickers that are null outliers in all three groups of columns.
    serv_i = serv_outliers.index
    serp_i = serp_outliers.index
    serc_i = serc_outliers.index
    f.write("BAD STOCKS\n")
    bad_stocks = set(serv_i).intersection(serp_i).intersection(serc_i)
    for b in bad_stocks:
        f.write(b)
        f.write('\n\n')

    # Clean up before returning so the report file is always closed.
    pd.reset_option('display.max_rows')
    f.close()

    if remove_worst:
        # Drop every ticker flagged as a price-null outlier.
        df.drop(serp_i, inplace=True)
        return df

Iteratively applying this function demonstrated a few things:
* If a 'Close' column was missing a value at a certain date, then the volume was also missing at the same date.
* If a stock was missing values, the affected dates were almost entirely before the debut of the ticker on the stock exchange.

Thus I learned that, in my case, I would get practically the same result if I simply created a function that dropped rows whose start-date Volume and Price columns were empty.
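
A minimal sketch of such a function, assuming the column naming used earlier in the notebook (`'Date: <YYYY-MM-DD> Close'` and `'Date: <YYYY-MM-DD> Volume'`). The name `drop_late_starters` and the sample tickers are hypothetical; the project's own implementation may differ.

```python
import numpy as np
import pandas as pd


def drop_late_starters(df):
    """Drop tickers whose earliest Close or Volume value is missing."""
    # ISO dates inside the column names sort chronologically as strings.
    close_cols = sorted(c for c in df.columns if 'Close' in c)
    volume_cols = sorted(c for c in df.columns if 'Volume' in c)
    return df.dropna(subset=[close_cols[0], volume_cols[0]])


# Made-up example: BBB debuted after the first week, so its first values are null.
df = pd.DataFrame({
    'Date: 2020-07-06 Close': [10.0, np.nan],
    'Date: 2020-07-06 Volume': [1000.0, np.nan],
    'Date: 2020-07-13 Close': [11.0, 21.0],
    'Date: 2020-07-13 Volume': [1100.0, 2100.0],
}, index=['AAA', 'BBB'])

kept = drop_late_starters(df)
```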


The Google Machine Learning Crash Course states that if a feature value appears fewer than 5 times, then it is not a good feature. I grouped all the tickers in the DataFrame by Sector and Industry and counted the occurrences of each.

` df_groupby_sector = df.groupby('Sector').count() `

` df_groupby_industry = df.groupby('Industry').count() ` 
    
Each sector had plenty of examples, thus I did not have to cut any sectors and their associated tickers.
However, there were industries with fewer than 5 examples. I decided to cut those industries and their associated tickers.

For countries, the majority of tickers belonged to the United States. However, there were tickers that belonged to other countries. Some of them, like Italy and Ukraine, had fewer than 5 occurrences. Thus I removed any country, alongside its tickers, that had fewer than 5 occurrences.

Both removals were done with the following code:

In [3]:
def JUPYTER_remove_bad_ticks_by_group(df,occurences,group_n):
    g = df.groupby(df[group_n]).count()
    
    bad_g = list(g[g['Sector'] < occurences].index.values)
    print(bad_g)
    df_i = df[[group_n]]
    for index,row in df_i.iterrows():
        if row.values[0] in bad_g:
            df.drop(index,inplace=True)

    return df
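
For comparison, the same filtering can be expressed without a row-by-row loop. This is only an alternative sketch, not the code used in the project; `keep_frequent_groups` and the sample data are hypothetical.

```python
import pandas as pd


def keep_frequent_groups(df, column, min_count=5):
    """Keep only rows whose value in `column` occurs at least `min_count` times."""
    counts = df[column].map(df[column].value_counts())
    return df[counts >= min_count]


# Made-up example: 'Italy' appears only twice, so its tickers are removed.
df = pd.DataFrame(
    {'Country': ['USA'] * 5 + ['Italy'] * 2},
    index=['T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T7'])

filtered = keep_frequent_groups(df, 'Country')
```

Unlike the function above, this version returns a new DataFrame and leaves the original untouched.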

### One Hot Encoding

Most features that were scraped from Finviz are features that will be one-hot encoded for the model. These features include:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; **Sector, Industry, Country, Recommendations**

Every stock has a sector, industry, and country value. The recommendations feature has null values. Thus for recommendation I will be adding another OHE feature that tracks 'Has Recommendation Value'

All this is handled by the same function:


In [2]:
def JUPYTER_one_hot_encode(df,c_name,contains_null = False,null_value = None,sparsev=False):
    r_df = pd.concat([df,pd.get_dummies(df[c_name],prefix=c_name,sparse=sparsev)],axis=1,sort=False)
    
    def check_for_null(x):
        
        if null_value is None:
            if pd.isnull(x):
                print_global_count() # To see if I miss a null or not.
                return 0
            else:
                return 1
        else:
            if x == null_value:  # compare by value; `is` can fail for equal strings
                return 0
            else:
                return 1
            
            
    if contains_null == True:
        r_df['Contain ' + c_name] = df[c_name].apply(check_for_null)
    
    r_df.drop([c_name],axis=1,inplace=True)
    print('Done one hot encoding: ' , c_name)

    reset_global_counter()
    return r_df
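
The core of the function above, stripped of the counting helpers, can be sketched on a tiny example. The recommendation labels (`'Buy'`, `'Hold'`) are made up for illustration and are not Finviz's actual values.

```python
import numpy as np
import pandas as pd

# Made-up data: BBB has no recommendation.
df = pd.DataFrame({'Recommendations': ['Buy', np.nan, 'Hold']},
                  index=['AAA', 'BBB', 'CCC'])

# One indicator column per category; a null value yields all zeros.
dummies = pd.get_dummies(df['Recommendations'], prefix='Recommendations')

# Extra indicator telling the model whether a value was present at all.
has_value = df['Recommendations'].notnull().astype(int).rename('Contain Recommendations')

encoded = pd.concat([dummies, has_value], axis=1)
```

Without the extra column, a ticker with a missing recommendation would be indistinguishable from a category that simply never fires.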


### Inferring Missing Values

While preparing categorical features was a simple task, preparing non-categorical features proved to be slightly trickier.

First I had to deal with missing 'Income' and 'Sales' values. On average, there were substantially more missing 'Income' values than 'Sales' values. I used this fact in my development of inferred 'Income' and 'Sales' values.

First I dealt with missing Income values. 

* A DataFrame of the tickers with a null income value was initialized.
* If the ticker's sales value was not null:
    * Get the group of tickers with the same sector and take the median of the sector's income-to-sales ratios.
        * `median_income_div_sales = (df_sect['Income'] / df_sect['Sales']).median()`
    * Multiply the sales value by median_income_div_sales.
* If the ticker's sales value was null:
    * Calculate and fill in the mean income value of the tickers in the same industry.
        * ` m_val_by_industry = in_df_indust.get_group(industry).mean()`
        
        


Similar logic was used to fill the missing 'Sales' values. By that point every ticker was guaranteed to have an 'Income' value, so the inferred 'Sales' value could always rely on it.
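
The income steps listed above can be sketched in a vectorized form. This is an illustration under stated assumptions, not the project's implementation: the function name `infer_income` and the sample numbers are hypothetical, and the DataFrame is assumed to have 'Sector', 'Industry', 'Income', and 'Sales' columns.

```python
import numpy as np
import pandas as pd


def infer_income(df):
    """Fill null 'Income' values from sales ratios, falling back to industry means."""
    df = df.copy()
    # Median income/sales ratio per sector (null ratios are skipped).
    ratio = (df['Income'] / df['Sales']).groupby(df['Sector']).transform('median')
    # Mean income per industry, used when Sales is missing as well.
    industry_mean = df.groupby('Industry')['Income'].transform('mean')
    from_sales = df['Sales'] * ratio
    df['Income'] = df['Income'].fillna(from_sales).fillna(industry_mean)
    return df


# Made-up example: C is missing Income only, D is missing both Income and Sales.
df = pd.DataFrame({
    'Sector': ['Tech'] * 4,
    'Industry': ['Software'] * 4,
    'Income': [25.0, 75.0, np.nan, np.nan],
    'Sales': [100.0, 100.0, 200.0, np.nan],
}, index=['A', 'B', 'C', 'D'])

filled = infer_income(df)
```

Here the sector's median ratio is 0.5, so C is filled with 200 * 0.5, while D falls back to the industry mean of the known incomes.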


### Scaling Values

My DataFrame is set up so each row represents a ticker where the ticker name is the index, and the columns include all the features.

There were features that had to be scaled row wise, and features that had to be scaled column wise.

#### Row-wise Scaling

The dated 'Volume' and 'Closing' values had to be scaled row wise. 
I was on the fence about whether stock closing values are heteroscedastic. A paper by [JL Sharma](https://www.tandfonline.com/doi/abs/10.1080/096031096334132) provides evidence that stock returns are heteroscedastic, but I found nothing online that addresses prices and volume.

Upon research online, many articles stated that the [price](https://www.investopedia.com/articles/investing/102014/lognormal-and-normal-distribution.asp) values of stocks are lognormal.

Simply put, 

If the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution.


Thus for the closing values, I will first transform the values to ln(X) and then apply standard scaling.

For volume I initially thought of using MinMax scaling. However, I feared (probably unreasonably, and this is where my inexperience in data analysis and ML shines) that using two different scaling mechanisms for likely correlated values would not be the best idea. Thus I decided to scale volume the same way I scaled the closing values. 

#### Standard Scaler

$$x_{new} =  \frac{x-\mu}{\sigma}  $$

The code used for both Volume and Price is as follows:

In [4]:
def JUPYTER_log_normal_scale_rows(df):
    
    for column in df:
        df.loc[:,column] = np.log(df[column])
    
    scaler = StandardScaler()
    df2 = pd.DataFrame(columns = df.columns,index=df.index,data=scaler.fit_transform(df.values.T).T)
    return df2
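
The same transformation can be written directly in NumPy, which makes the row-wise direction explicit. This is a sketch for illustration; `log_standard_scale_rows` is a hypothetical name and the prices are made up. It assumes all inputs are strictly positive, since ln(0) is undefined.

```python
import numpy as np


def log_standard_scale_rows(values):
    """Apply ln(x), then standardize each row to zero mean and unit variance."""
    logged = np.log(np.asarray(values, dtype=float))
    mean = logged.mean(axis=1, keepdims=True)
    std = logged.std(axis=1, keepdims=True)  # population std, as StandardScaler uses
    return (logged - mean) / std


# Made-up weekly closes for two tickers at very different price levels.
closes = np.array([[10.0, 11.0, 12.0, 13.0],
                   [100.0, 90.0, 95.0, 105.0]])
scaled = log_standard_scale_rows(closes)
```

Because every row is standardized on its own, a $10 stock and a $100 stock end up on the same scale and only the shape of each price history remains.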

#### Column-wise Scaling

'Income' and 'Sales' values had to be scaled column wise. Intuitively, I thought using a MinMax Scaler per sector would be better than using a single MinMax Scaler for the entire 'Income' and 'Sales' columns. This may very well have been an error on my part, but I believed a ticker's income and sales should be compared to its sector's incomes and sales. Dividing it up by Industry would stretch the groups too thin and would ruin this feature for the model.
#### MinMax Scaler

$$  x_{new} = \frac{x-x_{min}}{x_{max}-x_{min}} $$

The code used for both Income and Sales is: 

In [2]:
def JUPYTER_scale_MinMax_by_Sector(df,s_or_i='Sales'):
    g = df.groupby(df['Sector'])
    df_section = df[[s_or_i]]
    names = g.groups.keys()
    scaler = MinMaxScaler()
    for n in names:

        dfn = g.get_group(n)[[s_or_i]]

        dfn_scaled = pd.DataFrame(index=dfn.index,data=scaler.fit_transform(dfn.values))
        
        df_section[df_section.index.isin(list(dfn.index))] = dfn_scaled

        dfn_scaled.to_excel('minmax'+n+'.xlsx')
    return df_section
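
For illustration, per-sector MinMax scaling can also be expressed with a pandas `groupby` and `transform`. This is an alternative sketch rather than the project's code; `minmax_by_sector` and the sample values are hypothetical. Note one difference: if a sector contains a single ticker (or identical values), this sketch yields NaN from the 0/0 division, whereas scikit-learn's MinMaxScaler maps such values to 0.

```python
import pandas as pd


def minmax_by_sector(df, column):
    """Scale `column` to [0, 1] separately within each sector."""
    grouped = df.groupby('Sector')[column]
    lo = grouped.transform('min')
    hi = grouped.transform('max')
    return (df[column] - lo) / (hi - lo)


# Made-up sales figures for two sectors with very different magnitudes.
df = pd.DataFrame({
    'Sector': ['Tech', 'Tech', 'Tech', 'Energy', 'Energy'],
    'Sales': [100.0, 200.0, 300.0, 10.0, 30.0],
}, index=['A', 'B', 'C', 'D', 'E'])

scaled_sales = minmax_by_sector(df, 'Sales')
```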

# Creating Different Datasets

With the current code, I have generated a scaled and one-hot encoded DataFrame in which the columns are the features and the rows are the examples. 

The columns (features) consist of: 
* Sector, Industry, Recommendations, and Country, one-hot encoded
* Sales and Income, scaled by MinMax
* Volume and Closing values for each date, log-transformed and standard scaled

I have created two datasets:
* **2YStockDFHighCriteria** - Features fewer examples, but stocks starting at a higher market capitalization
* **2YStockDFLowCriteria** - Features more examples, but stocks starting at a lower market capitalization


Depending on the needs of the models I will be developing, I wrote a couple of methods to create different forms of these datasets. Initially I will be experimenting with the low criteria dataset, and the names below refer to that dataset. 
1. Dataset with Sector removed. (**2YStockDFSR**)
2. Dataset with Industry removed. (**2YStockDFIR**)
3. Dataset with Volume columns removed. (**2YStockDFVR**)
4. Dataset with a new feature, Price/Volume, lognormally scaled, which replaces the Price and Volume columns (**2YStockDFRatio**)
5. Dataset where the time frame is reduced to 6 months, resulting in 4 times as many examples and ~4 times fewer dated features
    * This will be applied to the original DataFrame as well as the DataFrame of item 4. (**2YStockDFSplit** , **2YStockDFRatioSplit**)
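
Item 5 can be pictured as reshaping each ticker's two-year series into four consecutive six-month windows. The sketch below is only one way to do it and is not taken from the project: `split_into_windows` is a hypothetical helper, and 104 weekly values per ticker is an assumed round number for two years of data.

```python
import numpy as np


def split_into_windows(values, n_windows=4):
    """Split each ticker's time series into `n_windows` equal consecutive windows."""
    values = np.asarray(values)
    n_tickers, n_weeks = values.shape
    window = n_weeks // n_windows
    # Drop any leftover weeks at the start so the most recent data is kept.
    trimmed = values[:, n_weeks - window * n_windows:]
    return trimmed.reshape(n_tickers * n_windows, window)


# Two made-up tickers with 104 weekly values each.
weekly = np.arange(2 * 104).reshape(2, 104)
windows = split_into_windows(weekly)
```

Each original row becomes four rows of 26 values, so the number of examples grows fourfold while the number of dated features shrinks to a quarter.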
    


# Machine Learning Algorithms to be Used

###### The main purpose of this project is to familiarize myself with the world of Machine Learning, and to have a little fun along the way. For me, fun means experimenting and comparing results from different algorithms on different datasets, alongside trying new ideas to see if they lead to an improvement.

In this project I will be implementing three different Machine Learning Algorithms:
* Long Short Term Memory (LSTM) Recurrent Neural Network
* Support Vector Machine (SVM)
* Logistic Regression

## Long Term Short Memory Brief Summary

Lets say we have a phrase:
> "I am a veterinarian. I perform ___________ on ________________."

If asked which words are missing, one would most likely say 'surgery' and 'animals'. But how did we reach this conclusion?

Well, we looked at the words that came before! We kept them in memory and we filled in the blank knowing what came before in the phrase. If we never kept the previous words in memory, we would never be able to guess what the blank would be! And after filling the first blank, we didn't regard the second blank as independent. We followed through and knew that the first blank had some correlation with the second blank.

Most Machine Learning algorithms cannot do this, because conventional algorithms treat all examples as independent of one another. Recurrent Neural Networks fix this problem by introducing loops in the NN. These loops can be thought of as a chain of copies of the same NN, each passing a message to the next NN on the chain.

Long Short Term Memory Neural Networks are a special kind of Recurrent Neural Network that allows for storing important past information and forgetting past information that is not important.

The main mechanism of LSTMs is the cell state, a part of a 'neuron' that realizes the 'memory' aspect of the NN. It acts like a conveyor belt from the previous NN repeating module in the sequence, to the current, to the next NN repeating module in the sequence. The cell state is updated by three gates:

* **Forget Gate** : Comprised of a layer (sigmoid) which outputs a value from 0 to 1 (0 meaning get rid of the information, 1 meaning keep all of it), deciding how much and what information should be forgotten. Note that the information is not dropped here; only the forget values are computed. 

* **Input Gate** : Comprised of 2 layers (sigmoid and tanh) deciding how much and what new information gets stored in the cell. The cell state is then updated by applying the forget values from the forget gate and adding this new information. 

* **Output Gate** : Comprised of 2 layers (sigmoid and tanh) determining how much of the current cell information gets pushed to the next repeating module.
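
To make the three gates concrete, here is a minimal NumPy sketch of a single LSTM time step in its standard textbook form. The function name `lstm_step`, the weight layout (one matrix `W` holding all four blocks), and the random toy inputs are assumptions for illustration; this is not the TensorFlow implementation used for the actual model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step.

    W maps the concatenated [h_prev, x] to four stacked pre-activations:
    forget gate, input gate, candidate values, output gate.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0 * hidden:1 * hidden])  # forget gate: 0 = drop, 1 = keep
    i = sigmoid(z[1 * hidden:2 * hidden])  # input gate: how much to store
    g = np.tanh(z[2 * hidden:3 * hidden])  # candidate values: what to store
    o = sigmoid(z[3 * hidden:4 * hidden])  # output gate: how much to expose
    c = f * c_prev + i * g                 # updated cell state (the "conveyor belt")
    h = o * np.tanh(c)                     # output passed to the next step
    return h, c

# Toy run over a sequence of 5 random inputs (hypothetical sizes).
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden + n_in))
b = np.zeros(4 * n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x, h, c, W, b)
print(h)
```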

For a more in depth explanation of Long Short Term Memory, please check out the following sources:

[Source 1](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) [Source 2](https://www.analyticsvidhya.com/blog/2018/10/predicting-stock-price-machine-learningnd-deep-learning-techniques-python/) [Source 3](https://www.asimovinstitute.org/neural-network-zoo/)

<img src="img/LTSMimg.png"  width="600"/>

## Support Vector Machine Brief Summary

We can think of the features of the examples in a training set as representing different dimensions. So if there are N features, there are N dimensions. Each example has its own configuration of values in each of these dimensions. Support Vector Machines are a supervised classification model which divides this N-dimensional space with an (N-1)-dimensional hyperplane.

As an example, let's say we have two features, x and y, which are the dimensions in the space S. We can think of these features as the axes x and y of the common 2D Cartesian Coordinate System. Each example is its own point in S:

* [ (2,5) , (13, -5) ,  ( -4 , -4) ]

Let's say there are two versions of the points: Green points and Red points as shown in my beautiful work in 'Paint':

<img src="img/SVMv1.png"  width="400"/>

S is a 2D space, thus the hyperplane that will be used is a line (1D). The goal of Support Vector Machines is to decide where to place this hyperplane, allowing a few misclassifications, so that the margin between the hyperplane and the nearest Green and Red points is maximized. This margin is called the Soft Margin, and the points that are 'most important' in deciding where to place and how to orient this hyperplane (the points from which the soft margin is calculated) are the support vectors.

Good placement of the hyperplane increases the chance of new data being classified correctly.
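
As a small numeric illustration of why placement matters, the sketch below measures the margin of two candidate lines on the three example points listed earlier. The Green/Red labels and both candidate lines are hypothetical, and the helper `geometric_margin` only evaluates a given line; it does not train an SVM.

```python
import numpy as np

points = np.array([[2.0, 5.0], [13.0, -5.0], [-4.0, -4.0]])
labels = np.array([1, -1, -1])  # hypothetical labels: +1 = Green, -1 = Red

def geometric_margin(w, b, X, y):
    """Smallest signed distance from the points to the line w.x + b = 0.

    The result is positive only if every point lies on its correct side.
    """
    distances = y * (X @ w + b) / np.linalg.norm(w)
    return distances.min()

# Candidate A: the horizontal line y = 0. Candidate B: the line y = 0.5.
margin_a = geometric_margin(np.array([0.0, 1.0]), 0.0, points, labels)
margin_b = geometric_margin(np.array([0.0, 1.0]), -0.5, points, labels)
print(margin_a, margin_b)  # 4.0 4.5
```

Both lines separate the points, but candidate B sits midway between the closest Green and Red points and therefore has the larger margin.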





<img src="img/SVMv2.png"  width="400"/>

The majority of the time, points are not linearly separable. A precise placement and orientation of a hyperplane is then not possible by simply finding good support vectors and drawing the hyperplane. Thus Support Vector Machines use kernels, functions which compare the training data as if they had been transformed into higher dimensions, where they can become linearly separable.

Two popular kernels are:
* Polynomial Kernels : create new variables which are polynomial transformations of old ones to fit a hyperplane between them.


<img src="img/polykern.png"  width="600"/>


* Radial Basis Function Kernels : behave as a weighted nearest neighbor model which gives greater weight to points closer to the newly observed point to help classify it. 

<img src="img/rbk.png"  width="400"/>
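
Both kernels can be written as short functions that score how similar two points are. This is a minimal sketch using the common textbook forms; the parameter names and default values (`degree`, `coef0`, `gamma`) are illustrative choices, not settings used in this project.

```python
import numpy as np

def polynomial_kernel(a, b, degree=2, coef0=1.0):
    """(a . b + coef0) ** degree: similarity in an implicit polynomial feature space."""
    return (np.dot(a, b) + coef0) ** degree

def rbf_kernel(a, b, gamma=0.5):
    """exp(-gamma * ||a - b||^2): close to 1 for nearby points, close to 0 for distant ones."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

new_point = np.array([0.0, 0.0])
near = np.array([1.0, 0.0])
far = np.array([3.0, 0.0])

weight_near = rbf_kernel(new_point, near)
weight_far = rbf_kernel(new_point, far)
poly_value = polynomial_kernel(np.array([1.0, 2.0]), np.array([3.0, 4.0]))
print(weight_near, weight_far, poly_value)
```

The nearby point receives a much larger RBF weight than the distant one, which is the "weighted nearest neighbor" behavior described above.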


For a more in depth explanation of Support Vector Machines, please check out the following sources:


[Source 1](https://www.kdnuggets.com/2016/07/support-vector-machines-simple-explanation.html)  [Source 2](https://towardsdatascience.com/support-vector-machine-simply-explained-fee28eba5496) [Source 3](https://www.youtube.com/watch?v=efR1C6CvhmE&vl=en)

## Logistic Regression Brief Summary

Logistic Regression is very closely related to Multiple Regression. A quick overview of Multiple Regression will give us better insight into Logistic Regression.

#### Multiple Regression Quick overview

Suppose we have features such as humidity, temperature, windspeed, longitude, and latitude ( $ x_{1} \rightarrow x_{5} $ respectively ) and want to predict how many inches of rainfall will occur later in the day ( $ y_{label} $ ).

Each example in our dataset thus corresponds to a point ( $ x_{1} , x_{2} , x_{3} , x_{4} , x_{5} , y_{label} $ ) in a 6-dimensional space.

To predict the label, we can fit a hyperplane through this space. The hyperplane is described by a plane equation:

$$ y^{'} = b + \sum_{i = 1}^{F} w_{i}x_{i} \\ $$

where:
* $ y^{'} $ is the predicted inches of rainfall
* $ b $ is the bias (y-intercept) 
* $ x_{i} $ is feature i 
* $ w_{i} $ is the weight (slope), which defines how much feature i contributes to $ y^{'} $
* $ F $ is the number of features per example.
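
A minimal NumPy sketch of this prediction, taking the bias as an added intercept and the rest as a weighted sum of the features. The feature values, weights, and bias below are made up for illustration.

```python
import numpy as np

def predict(x, w, b):
    """Bias plus the weighted sum of the features."""
    return b + np.dot(w, x)

# Hypothetical example: humidity, temperature, windspeed, longitude, latitude.
x = np.array([0.8, 21.0, 5.0, -71.0, 42.0])
w = np.array([1.5, 0.02, -0.1, 0.0, 0.0])  # made-up weights
b = 0.1                                    # made-up bias

y_pred = predict(x, w, b)
print(y_pred)  # 0.1 + 1.2 + 0.42 - 0.5, approximately 1.22
```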

Initially, multiple regression fits a random hyperplane. It then improves the fit using loss measurements such as (but not limited to)

**Least Squared Errors ( $ L_{2} $ )** ,


 $$ L_{2} = (y_{label} - prediction( x_{1} , x_{2} , ..... , x_{n} ) )^2 $$
 
 and **Mean Squared Error (MSE)** 
 
 $$ MSE = \frac{1}{N} \sum_{ (x_{1} , ..... , x_{n} , y_{label}) \in D} L_{2}\\ $$
 
 where:
 * $ N $ is the number of examples
 * $ D $ is the example set
 
 alongside **Gradient Descent** (modify the weights along the negative gradient of the loss with respect to the weights, $  - \nabla f = - \left(\frac{\partial f}{\partial w_1},\frac{\partial f}{\partial w_2},...,\frac{\partial f}{\partial w_n}\right)$ where $ f $ is the loss, to locate a local minimum) , **learning rate** (scalar by which the gradient is multiplied to decide the size of the steps toward the local minimum), and **epochs** (how many times our entire dataset is passed through the algorithm) to reach a best hyperplane for the training set. 
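
The pieces above (MSE loss, gradient descent, learning rate, epochs) fit together as in the following minimal NumPy sketch. The synthetic data, the "true" weights, and the hyperparameter values are assumptions chosen so that the toy problem converges quickly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, noise-free data generated from known weights (hypothetical values).
X = rng.normal(size=(200, 2))
true_w, true_b = np.array([2.0, -3.0]), 0.5
y = X @ true_w + true_b

def mse(X, y, w, b):
    """Mean of the squared errors over the whole example set."""
    return np.mean((y - (X @ w + b)) ** 2)

# Start from a random hyperplane.
w = rng.normal(size=2)
b = 0.0
learning_rate = 0.1
epochs = 200

initial_loss = mse(X, y, w, b)
for _ in range(epochs):
    error = (X @ w + b) - y
    grad_w = 2.0 * X.T @ error / len(y)  # d(MSE)/dw
    grad_b = 2.0 * error.mean()          # d(MSE)/db
    w -= learning_rate * grad_w          # step against the gradient
    b -= learning_rate * grad_b
final_loss = mse(X, y, w, b)

print(initial_loss, final_loss, w, b)
```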
 
We can also use techniques such as feature crossing (i.e. multiplying features together) and bucketing (turning continuous variables into categorical values) to create new features that help the model best fit the data.
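
Both techniques are one-liners in pandas. The sketch below uses made-up weather values and hypothetical bucket boundaries (20 and 26 degrees).

```python
import pandas as pd

# Hypothetical weather examples.
weather = pd.DataFrame({
    "humidity": [0.2, 0.9, 0.6],
    "temperature": [30.0, 18.0, 24.0],
})

# Feature cross: a new feature that is the product of two existing ones.
weather["humidity_x_temperature"] = weather["humidity"] * weather["temperature"]

# Bucketing: turn a continuous feature into categories.
# Intervals are right-inclusive: (-inf, 20], (20, 26], (26, inf).
weather["temperature_bucket"] = pd.cut(
    weather["temperature"],
    bins=[-float("inf"), 20.0, 26.0, float("inf")],
    labels=["cold", "mild", "hot"],
)

print(weather)
```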

#### Back to Logistic Regression

Rather than predicting a continuous label, logistic regression is a supervised binary classification model which outputs the __probability__ of a given example being _True_ on a scale from 0 to 1. This is done by using the sigmoid function :

$$ Sigmoid(z) = y^{'} = \frac{1}{1+e^{-(z)}} $$

where z is the output of the plane equation mentioned above. z is also known as the log-odds, because inverting the sigmoid gives:

$$ z = log(\frac{y^{'}}{1-y^{'}}) $$

where $ y^{'} $ is the predicted probability.
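
The sigmoid and the log-odds are inverses of each other, which the following minimal sketch checks numerically. The input values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(p):
    """Inverse of the sigmoid: log of p / (1 - p)."""
    return np.log(p / (1.0 - p))

z = np.array([-2.0, 0.0, 3.0])  # arbitrary linear outputs
p = sigmoid(z)                  # predicted probabilities
recovered = log_odds(p)         # should give back z

print(p, recovered)
```

A linear output of 0 maps to a probability of exactly 0.5, negative outputs map below 0.5, and positive outputs map above it.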

While the loss function for linear regression is squared loss, the loss function for logistic regression is **Log Loss**:

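$$ \text{Log Loss} = \sum_{(x_{1} , ..... , x_{n} , y_{label}) \in D} -y_{label}log(y^{'})-(1-y_{label})log(1-y^{'}) $$

where $ y_{label} $ is the true label (0 or 1), $ y^{'} $ is the predicted probability, and $ D $ is the example set.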
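
A minimal NumPy sketch of Log Loss, showing that confident wrong predictions are penalized far more than confident right ones. The labels and probabilities are made up.

```python
import numpy as np

def log_loss(y_true, y_prob):
    """Sum of -y*log(p) - (1 - y)*log(1 - p) over all examples."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return np.sum(-y_true * np.log(y_prob) - (1.0 - y_true) * np.log(1.0 - y_prob))

# Two examples with true labels 1 and 0.
confident_right = log_loss([1, 0], [0.9, 0.1])
confident_wrong = log_loss([1, 0], [0.1, 0.9])

print(confident_right, confident_wrong)
```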

Without regularization, the asymptotic nature of logistic regression would keep driving loss toward 0 in high dimensions, pushing the weights toward extreme values. Thus regularization (penalizing the complexity of a model) is necessary for logistic regression. 

Logistic regression models use strategies such as $ L_{1}$ regularization, $L_{2}$ regularization, and **Early Stopping** to dampen the complexity.
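
The effect of $L_{2}$ regularization can be seen on a tiny, linearly separable toy problem. This is a minimal sketch under assumptions: a single feature, no bias term, and hypothetical values for the penalty strength, learning rate, and epochs.

```python
import numpy as np

# One feature, perfectly separable: negatives left of 0, positives right of 0.
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(l2_strength, epochs=500, learning_rate=0.5):
    """Gradient descent on mean log loss plus (l2_strength / 2) * w**2."""
    w = 0.0
    for _ in range(epochs):
        p = sigmoid(w * X)
        grad = np.mean((p - y) * X) + l2_strength * w
        w -= learning_rate * grad
    return w

w_plain = train(l2_strength=0.0)  # weight keeps growing as loss is driven toward 0
w_l2 = train(l2_strength=0.1)     # penalty holds the weight at a moderate value

print(w_plain, w_l2)
```

Early Stopping would have a similar dampening effect here: cutting `epochs` short also keeps the unregularized weight from growing further.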
 
 For a more in depth explanation of Logistic Regression, please check out the following sources:


[Google MLCC](https://developers.google.com/machine-learning/crash-course) [Source 1](https://christophm.github.io/interpretable-ml-book/logistic.html)  [Source 2](https://www.kdnuggets.com/2020/03/linear-logistic-regression-explained.html) [Source 3](https://www.youtube.com/watch?v=OCwZyYH14uw)