<a href="https://colab.research.google.com/github/cmelende/CS5262-50_Project/blob/main/Assignment4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Assignment 3**

Cory Melendez

### Background:
One of the most coveted kinds of prediction, perhaps second only to knowing the lottery numbers, is predicting the stock market. Stock traders (that is, traders who are not HFT/algorithmic traders) typically rely on tools such as quantitative analysis (QA), which attempt to detect patterns that arise from human behavior and are believed to be unchanging. This project instead uses ML models to build a more scientifically sound method than QA for predicting how the market will move in the very short term.

One of the challenges of this project, which will require further research, is to identify not only whether our model reaches an acceptable accuracy on the data below, but why. Markets are ever-changing and are constantly affected by geopolitical as well as geoeconomic events. Identifying the circumstances in which the model *could* work, and *when* it could work, is just as important as building the model itself. We found paper 1, which describes how researchers used this dataset, and paper 2, a summary of 100 published articles, to be good starting points for understanding what we will call the 'reproducibility' problem and for learning how to build our models better.

The last challenge is the size of the data. I'll be trimming the number of columns, not only to stay under the column limit but also to narrow down and better understand the factors that contribute to the output.



### Project Description: 
This project's main aim is to output a binary classification that tells a theoretical day trader whether or not they should buy and sell a stock on a certain day. The project uses the data linked below to build our ML models. We will also refer to the paper below, both as a reference for how a model can be built and for understanding the domain.

Our main methodology will be to groom the data into something the model can digest. Since our aim is only to output a binary classification (yes/no), the question we will ask our model is 'should I buy the stock at close today and sell at close tomorrow?'. This simple strategy greatly simplifies the model, but it requires restructuring the data and creating a target column for the model to predict.
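As a minimal sketch of what such a target column could look like, the snippet below labels each day by whether the next day's return is positive. It assumes the stock column holds daily returns (as fractions) and uses made-up numbers; the series name and values are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical daily returns for one stock (fractions, e.g. 0.01 = +1%).
returns = pd.Series([0.010, -0.004, 0.000, 0.020, -0.015], name="MSFT")

# Row i is labeled 1 when the return on day i+1 is positive, else 0.
next_day_return = returns.shift(-1)
would_profit = (next_day_return > 0).astype("float")

# The last row has no "tomorrow", so its label is unknown rather than 0.
would_profit[next_day_return.isna()] = float("nan")

print(would_profit.tolist())
```

The final row has no label, so it would need to be dropped before training.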

The simple goals that define success are as follows: Can we beat not entering the market at all (not investing)? Can our model beat a trading algorithm that simply outputs yes/no at random? A goal beyond that would be to answer: What other non-ML algorithms, in the same time and circumstances, can our model beat?


Data: https://archive.ics.uci.edu/ml/datasets/CNNpred%3A+CNN-based+stock+market+prediction+using+a+diverse+set+of+variables

Paper: https://reader.elsevier.com/reader/sd/pii/S0957417419301915?token=746C33D1046F2DDA4EE614C2A4606AF2260493F0DC652081FF1F03968E01DC023369A293A638CEE35E24DB8BB7EE1259&originRegion=us-east-1&originCreation=20230119051158

Paper 2: https://reader.elsevier.com/reader/sd/pii/S0957417421009441?token=EEE8A8BF467B1370F99F73C46DC7AD74D88F0AB17AC08F5137A3B16E14E69D382DE8A689EF492C09E0CF67690D289807&originRegion=us-east-1&originCreation=20230119052216

### Performance Metric: 

Our metric is simple. Given any trade, or group of trades, advised by the model, we check whether:

(current price - bought price ) * X > X * epsilon, where X is the amount of money that was invested and epsilon is any arbitrary small positive number

 
If this statement is true, then our model has outperformed our money not entering the market at all by an arbitrary small number.

Furthermore, 

does:

Sum(MLProfit) > Sum(RandomProfit) * epsilon

Where Sum(MLProfit) is the profit made from an arbitrary number of trades using our ML model, Sum(RandomProfit) is the profit made from an arbitrary number of trades using a random-output algorithm, and epsilon is an arbitrary small number.

If so, then we have beaten randomness by a factor of epsilon and have met our main objective.
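The two checks above can be sketched in a few lines. This is only an illustration under stated assumptions: the returns and signals are made up, `strategy_profit` is a hypothetical helper, and epsilon is treated as a small additive margin, which is one reading of the inequalities rather than the definitive one.

```python
import numpy as np

def strategy_profit(signals, next_day_returns, stake=1.0):
    """Total profit from staking `stake` on every day whose signal is 1."""
    signals = np.asarray(signals, dtype=float)
    next_day_returns = np.asarray(next_day_returns, dtype=float)
    return float(np.sum(signals * next_day_returns * stake))

# Hypothetical next-day returns and model signals (1 = buy, 0 = stay out).
next_day_returns = [0.01, -0.02, 0.03, -0.01]
model_signals = [1, 0, 1, 0]

# Random yes/no baseline with a fixed seed so the run is repeatable.
rng = np.random.default_rng(0)
random_signals = rng.integers(0, 2, size=len(next_day_returns))

epsilon = 1e-6  # assumed additive margin
ml_profit = strategy_profit(model_signals, next_day_returns)
random_profit = strategy_profit(random_signals, next_day_returns)

beats_cash = ml_profit > epsilon                      # better than not investing
beats_random = ml_profit > random_profit + epsilon    # better than coin flips

print(ml_profit, random_profit, beats_cash, beats_random)
```

A single random run is noisy, so in practice the random baseline would probably be averaged over many seeds before drawing a conclusion.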





### Setup Notebook

In [268]:
#imports

import pandas as pd
import numpy as np
import seaborn as sns
import sklearn as skl
import matplotlib.pyplot as plt
from typing import List, Dict
from datetime import datetime
from numpy.random import randint
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn import config_context


In [269]:
dji_file = 'Processed_DJI.csv'
nasdaq_file = 'Processed_NASDAQ.csv'
nyse_file = 'Processed_NYSE.csv'
russel_file = 'Processed_RUSSELL.csv'
sp_file = 'Processed_S&P.csv'

dji_df = pd.read_csv(dji_file)
nasdaq_df= pd.read_csv(nasdaq_file)
nyse_df = pd.read_csv(nyse_file);
russel_df = pd.read_csv(russel_file);
sp_df = pd.read_csv(sp_file);


### Data Cleaning and Feature Transformers
Combine, remove, and create columns to give our model better context for each trading day and for how the market reacted on previous days.

In [270]:
#GLOBALS
OIL= "Oil"
GOLD="Gold"
XOM="XOM"
JPM="JPM"
GE="GE"
JNJ="JNJ"
WFC="WFC"
AMZN="AMZN"
MSFT = "MSFT"

STOCK_COLS: List[str] = [
    OIL, GOLD, XOM, JPM, GE, JNJ, WFC, AMZN, MSFT
]

DTB4WK = "DTB4WK"
DTB3 = "DTB3"
DTB6 = "DTB6"
DGS5 = "DGS5"
DGS10 = "DGS10"


TREASURY_COLS: List[str] = [
    DTB4WK, DTB3, DTB6, DGS5, DGS10
]

STOCKS_BY_INDUSTRY: Dict[str, str] = {
    OIL: 'oil & gas',
    GOLD: 'industrial metal',
    XOM: 'oil & gas',
    JPM: 'banking',
    GE: 'industrial machinery',
    JNJ: 'drugs',
    WFC: 'banking',
    AMZN: 'tech',
    MSFT: 'tech'
}
#['oil & gas', 'industrial metal', 'Banking', 'industrial machinery', 'drugs', ]
COMMON_COLS: List[str] = [
    'Date', 'Close', 'Volume', 
    #Add this for convenience
    'Name'
] + TREASURY_COLS + STOCK_COLS

#columns that need to be calculated
CALC_COLS = [
    #dow jones
    'DJI',
    # ny stock exchange
    'NYSE', 
    #russel
    'RUSSEL', 
    #s&p 500
    'GSPC', 
    #nasdaq
    'IXIC'
]

ALL_DFS: List[pd.DataFrame] = [dji_df, nasdaq_df, nyse_df, russel_df, sp_df]

In [271]:
#custom transformers and factories used to build the preprocessing pipeline

class DateTransformer(BaseEstimator, TransformerMixin):
  def fit(self, X, y=None):
    return self;
  
  def transform(self, X, y=None):
    X_=X.copy()
    X_['Date'] = pd.to_datetime(X_['Date'])
    X_['Year'] = X_['Date'].dt.year
    X_['Month'] = X_['Date'].dt.month;
    X_['DayOfMonth'] = X_['Date'].dt.day;
    X_['DayOfWeek'] = X_['Date'].dt.dayofweek

    X_.drop(columns=['Date'], inplace=True)
    return X_

class TrimUnusedColumnsTransformer(BaseEstimator, TransformerMixin):

  def __init__(self, keepColumns: List[str]):
    self.keepColumns: List[str] = keepColumns
  
  def fit(self, X, y=None):
    return self;

  def transform(self, X, y=None):
    X_ = X.copy()
    X_ = X_[self.keepColumns]
    return X_

class RenameCloseVolumeColumnsTransformer(BaseEstimator, TransformerMixin):
  def fit(self, X, y=None):
    return self;

  def transform(self, X, y=None):
    X_=X.copy()
    exchangeName = X.loc[0, 'Name']
    return X_.rename(columns={'Close': f'{exchangeName}_Close', 'Volume': f'{exchangeName}_Volume'})

class TreasuryRateTransformer(BaseEstimator, TransformerMixin):
  def __init__(self, targetTreasury: str, backNumberOfDays: int):
    self.targetTreasury: str = targetTreasury;
    self.backNumberOfDays: int = backNumberOfDays

  def fit(self, X, y=None):
    return self;

  def transform(self, X, y=None):
    X_= X.copy()

    for j in range(0, self.backNumberOfDays + 1):
        X_.loc[j, f'{self.targetTreasury}_Last_{self.backNumberOfDays}_Yield_Change'] = 1

    for i in range(1, len(X)):
      if (i - self.backNumberOfDays + 1) > 0 and X_.loc[i-self.backNumberOfDays, self.targetTreasury] != 0:
        X_.loc[i, f'{self.targetTreasury}_Last_{self.backNumberOfDays}_Yield_Change'] = 1 - (X_.loc[i, self.targetTreasury] / X_.loc[i-self.backNumberOfDays, self.targetTreasury])
      else:
        X_.loc[i, f'{self.targetTreasury}_Last_{self.backNumberOfDays}_Yield_Change'] = 1

    return X_   

class DaysBackRateTransformer(BaseEstimator, TransformerMixin): 

  def __init__(self, targetStock: str, numberOfDays: int):
    self.targetStock: str = targetStock
    self.numberOfDays: int = numberOfDays

  def fit(self, X, y=None):
    return self;

  def transform(self, X, y=None):
    X_ = X.copy()
    return self.addTargetStockDayRate(X_, self.targetStock)

  def addTargetStockDayRate(self, X_: pd.DataFrame, stock: str):
    for i in range(self.numberOfDays, len(X_)):
      X_.loc[i, f'{stock}_-{self.numberOfDays}_DayRate'] = X_.loc[i - self.numberOfDays, stock]
    return X_


class TotalChangeOverDaysTransformer(BaseEstimator, TransformerMixin):
  def __init__(self, overDays: int, targetStock: str):
    self.overDays = overDays
    self.targetStock = targetStock;

  def fit(self, X, y=None):
    return self;

  def transform(self, X, y=None):
    X_ = X.copy()
    return self.addTotalChange(X_, self.targetStock)
  
  def addTotalChange(self, X_: pd.DataFrame, stock: str):
    for i in range(self.overDays - 1, len(X_)):
      X_.loc[i, f'{stock}_TotalChange_Last_{self.overDays}_Days'] = self.phi(X_.loc[i-self.overDays + 1:i, stock].values.tolist())
    return X_;

  def phi(self, numbers):
    # compound the returns: product of (1 + r) over the window
    product = 1
    for n in numbers:
      product *= (1+n)
    return product

class WouldProfitTargetTransformer(BaseEstimator, TransformerMixin):
  def __init__(self, targetStock: str):
    self.targetStock = targetStock

  def fit(self, X, y=None):
    return self;

  def transform(self, X, y=None):
    X_=X.copy()
    for i in range(0, len(X)-1):
      if(X_.loc[i+1, f'{self.targetStock}'] > 0):
        X_.loc[i, f'{self.targetStock}_WouldProfit'] = 1
      else:
        X_.loc[i, f'{self.targetStock}_WouldProfit'] = 0
    return X_

class AddExchangeCloseVolumeTransformer(BaseEstimator, TransformerMixin):
  def __init__(self, source: pd.DataFrame):
    self.sourceDf = source

  def fit(self, X, y=None):
    return self;

  def transform(self, X, y=None):
    X_ = X.copy()
    exchangeName = self.sourceDf.loc[0,'Name']
    X_[f'{exchangeName}_Close'] = self.sourceDf[f'Close']
    X_[f'{exchangeName}_Volume'] = self.sourceDf[f'Volume']
    return X_


class SameIndustryDaysBackRateTransformer(DaysBackRateTransformer):
  def __init__(self, industryByStock: Dict[str, str], targetStock: str, numberOfDays: int):
    super().__init__(targetStock, numberOfDays)
    self.industryByStock = industryByStock

  def fit(self, X, y=None):
    return self

  def transform(self, X, y=None):
    X_ = X.copy()
    targetStockIndustry = self.industryByStock[self.targetStock]

    for key in self.industryByStock:
      if(self.industryByStock[key] == targetStockIndustry and key != self.targetStock):
        X_ = self.addTargetStockDayRate(X_, key)

    return X_

class SameIndustryTotalChangeOverDaysTransformer(TotalChangeOverDaysTransformer):
  def __init__(self, industryByStock: Dict[str,str], overDays: int, targetStock: str):
    super().__init__(overDays, targetStock)
    self.industryByStock = industryByStock

  def fit(self, X, y=None):
    return self;
  
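  def transform(self, X, y=None):
    X_ = X.copy()
    # look up the industry of the target stock (not the whole mapping)
    targetStockIndustry = self.industryByStock[self.targetStock]

    for key in self.industryByStock:
      if(self.industryByStock[key] == targetStockIndustry and key != self.targetStock):
        # add the total change for each same-industry peer, not the target itself
        X_ = self.addTotalChange(X_, key)

    return X_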

class ExchangeVolumeChangeOverTime(BaseEstimator, TransformerMixin):
  def __init__(self, exchangeName: str, overDays: int):
    self.exchangeName = exchangeName
    self.overDays = overDays

  def fit(self, X, y=None):
    return self

  def transform(self, X, y=None):
    X_ = X.copy()
    # operate on the copy so the caller's dataframe is not mutated
    return self.addTotalChange(X_)

  def addTotalChange(self, X_: pd.DataFrame):
    for i in range(self.overDays - 1, len(X_)):
      X_.loc[i, f'{self.exchangeName}_Volume_Change_Over_{self.overDays}_Days'] = self.phi(X_.loc[i-self.overDays + 1:i, f'{self.exchangeName}_Volume'].values.tolist())
    return X_;

  def phi(self, numbers):
    product = 1
    for n in numbers:
      product *= (1+n)
    return product

class ExchangeCloseRateChange(BaseEstimator, TransformerMixin):
  def __init__(self, exchangeName: str, overDays: int):
    self.exchangeName = exchangeName
    self.overDays = overDays

  def fit(self, X, y=None):
    return self

  def transform(self, X, y=None):
    X_ = X.copy()

    for i in range(self.overDays, len(X_)):
      X_.loc[i, f'{self.exchangeName}_Close_Rate_Change_Over_{self.overDays}_Days'] = 1 - (X_.loc[i, f'{self.exchangeName}_Close'] / X_.loc[i-self.overDays, f'{self.exchangeName}_Close'])

    return X_

class FinalColCleanup(BaseEstimator, TransformerMixin):

  def fit(self, X, y=None):
    return self
  def transform(self, X, y=None):
    # return a new dataframe instead of mutating the input in place
    return X.drop(columns=['Name'])

class PreProcessPipelineConstants():
  def __init__(self, stockSymbols: List[str], treasurySymbols: List[str], common_cols: List[str], industryByStock: Dict[str,str]):
    self.stockSymbols: List[str] = stockSymbols
    self.treasurySymbols: List[str] = treasurySymbols
    self.commonColumns: List[str] = common_cols
    self.industryByStock: Dict[str, str] = industryByStock

class PreProcessPipelineConfig():
  def __init__(self, dataframes: List[pd.DataFrame], constants: PreProcessPipelineConstants, daysBack: List[int], targetStock: str):
    self.constants: PreProcessPipelineConstants = constants

    self.dataframes: List[pd.DataFrame] = dataframes
    self.targetStock: str = targetStock
    self.daysBack = daysBack

class PreProcessPipelineFactory():
  def __init__(self, config: PreProcessPipelineConfig):
    self.config = config

  def __createAddExchangeCloseVolumeTransformers(self):
    l = []

    for i in range(0, len(self.config.dataframes)):
      l.append((f'use_add_df{i}_exchange_transformer', AddExchangeCloseVolumeTransformer(self.config.dataframes[i])))

    return l

  def __createDaysBackRateTransformers(self): 
    l = []

    for stock in self.config.constants.stockSymbols:
      for n in self.config.daysBack:
        l.append((f'use_{n}_days_back_rate_transformer_{stock}', DaysBackRateTransformer(stock, n)))

    return l

  def __createExchangeCloseRateChange(self):
    l = []

    for df in self.config.dataframes:
      exchangeName = df.loc[0, 'Name']
      for n in self.config.daysBack:
        l.append((f'use_exchange_{exchangeName}_close_rate_over_{n}_days_transformer', ExchangeCloseRateChange(exchangeName, n)))

    return l

  def __createExchangeVolumeChangeOverTimeTransformers(self):
    l = []

    for df in self.config.dataframes:
      exchangeName = df.loc[0, 'Name']
      for n in self.config.daysBack:
        l.append((f'use_exchange_{exchangeName}_volume_change_over_{n}_days', ExchangeVolumeChangeOverTime(exchangeName, n)))
    
    return l

  def __createTotalChangeOverDaysTransformers(self):
    l = []

    days = self.config.daysBack
    for stock in self.config.constants.stockSymbols:
      for n in days:
        l.append((f'use_total_change_over_{n}_day_transformer_{stock}', TotalChangeOverDaysTransformer(n, stock)))

    return l
  def __createTreasuryYieldRateChangeTransformers(self):
    l = []

    for treasury in self.config.constants.treasurySymbols:
      for n in self.config.daysBack: 
        l.append((f'use_treasury_{n}_yield_rate_change_{treasury}', TreasuryRateTransformer(treasury, n)))
    
    return l

  def create(self) -> skl.pipeline.Pipeline:
    cleanupSteps = [
      ("use_trim_unused_columns_transformer", TrimUnusedColumnsTransformer(self.config.constants.commonColumns)),
      ("use_date_transformer", DateTransformer()),
      ("use_rename_close_volume_columns_tranformer", RenameCloseVolumeColumnsTransformer()),
    ] 

    addExchangeCloseVolumeTransformers = self.__createAddExchangeCloseVolumeTransformers() 

    exchangeCloseRateChangeTransformers = self.__createExchangeCloseRateChange()

    exchangeVolumeChangeOverTimeTransformers = self.__createExchangeVolumeChangeOverTimeTransformers()

    daysBackRateTransformers = self.__createDaysBackRateTransformers()

    totalChangeOverDaysTransformers = self.__createTotalChangeOverDaysTransformers()

    treasuryRateTransformer = self.__createTreasuryYieldRateChangeTransformers()

    targetTransformer = [('use_would_profit_target_transformer', WouldProfitTargetTransformer(self.config.targetStock))]

    finalCleanup = [('use_final_col_cleanup_transformer', FinalColCleanup())]

    pipelineSteps = cleanupSteps + addExchangeCloseVolumeTransformers + exchangeCloseRateChangeTransformers + exchangeVolumeChangeOverTimeTransformers + daysBackRateTransformers + totalChangeOverDaysTransformers + treasuryRateTransformer + targetTransformer + finalCleanup

    return Pipeline(
        steps=pipelineSteps
    )

class PreProcessPipelineFactoryFactory():
  def __init__(self, constants: PreProcessPipelineConstants, additionalExchangeDfs: List[pd.DataFrame]):
    self.additionalExchangeDfs = additionalExchangeDfs
    self.constants = constants

  def create(self, targetStock: str, numberOfBackDays: List[int]) -> PreProcessPipelineFactory:
    # reuse the constants passed to this factory rather than rebuilding them from globals
    config = PreProcessPipelineConfig(self.additionalExchangeDfs, self.constants, numberOfBackDays, targetStock)

    return PreProcessPipelineFactory(config)
  def createMany(self, targetStocks: List[str], numberOfBackDayCombinations: List[List[int]]) -> Dict[str, PreProcessPipelineFactory]:
    ret: Dict[str, PreProcessPipelineFactory] = {}

    for stock in targetStocks:
      for n in numberOfBackDayCombinations:
        ret[stock] = self.create(stock, n)
    
    return ret
  

In [272]:
constants = PreProcessPipelineConstants(STOCK_COLS, TREASURY_COLS, COMMON_COLS, STOCKS_BY_INDUSTRY)
initial_exchange = nyse_df
additional_exchange_dfs: List[pd.DataFrame] = [dji_df, nasdaq_df, russel_df, sp_df]

targetStocks: List[str] = STOCK_COLS + TREASURY_COLS
numberOfBackDayCombinations: List[List[int]] = [[1,2,3,5,10]]

factoryFactory = PreProcessPipelineFactoryFactory(constants, additional_exchange_dfs)
factories: Dict[str, PreProcessPipelineFactory] = factoryFactory.createMany(targetStocks, numberOfBackDayCombinations)

pipelines: Dict[str, Pipeline] = {}
for stock in targetStocks:
  pipelines[stock] = factories[stock].create()

### Example of resulting transformed dataset

Now, from the cells above, we have a set of pipelines in 'pipelines' that we can fit on our initial dataframe.

I went ahead and created multiple pipelines and datasets, each with a different target stock. The one I will be using moving forward is Microsoft stock (MSFT).



In [None]:
msft_df = pipelines[MSFT].fit_transform(initial_exchange)

### Feature Engineering

From the last assignment, these were the features we were planning to engineer. However, after the extensive transformations above, some have changed or been removed completely.

* Research our stock symbols and decide what industry they belong to, add a column indicating the industry. This will be later used for a one hot encode
  * Removed:
    * In our dataset, we don't have a sufficient number of stocks to warrant this. Only a few industries had more than one stock.
    * We added transformers that calculate the rate of change of all stocks across several days. The model should be able to work out on its own which stocks are correlated, without added data signaling each stock's industry.

* Since this project would normally use a time series approach, we'll want to add additional columns to get around this. Each row has to be independent of the previous rows, so we will add the return % of one or several of the previous days to simulate a time series. Since in our business context we are trading on behalf of a day trader, we don't have to add many of these columns.
  * Added: Multiple columns signaling the % change or the total change (phi, the product of 1 + % change over the % changes that were in the dataset).

* We will drop all features that are not listed in our data_dictionary.csv as the first step in our pipeline

  * Removed some columns from our data_dictionary due to some inaccuracies in the paper supplied by the researchers. We are not sure why, but Gas, Silver, and Copper were described as being in the datasets but were absent.

* We'll have to massage our data a little bit. We know that some columns are repeated across the CSVs (like stock prices and commodities), so we can extract that information from just one of the CSVs. Then, we'll want to transform the rest of the columns (ex: close/volume) into multiple columns, so we have NYSE_Close, NYSE_Volume, NASDAQ_Close, NASDAQ_Volume, etc., and we don't have multiple rows representing the same day.
  * Added: The resulting transformer that we created in this notebook wasn't changed much from the original description.

* We'll also need to extract the return % of each stock exchange. The data is a little odd in that the % return for a stock exchange x is not in the CSV for x. Example: the NASDAQ CSV doesn't have the % return for the NASDAQ but has it for NYSE, S&P, etc. We'll want to make sure all % returns are in our dataset.
  * This one was a little odd, but our original analysis matched the transformer that we ended up needing in this notebook.

* Further investigation is also needed to combine and average: 
  1. The % return of each stock exchange
    * Modified: Instead of combining or averaging, we added columns to give each row more context as to what the change in the Close price was over various amounts of days
  2. Volume
    * Modified: Instead of combining or averaging, we added columns to give each row more context as to what the cumulative change over various numbers of days was
  3. Closing price
    * Modified: This was merged with #1
* Lastly, we'll want to transform the date column into something that we can sort by if needed. We can transform this column from a format 'mm-dd-yyyy' to a number yyyymmdd which will rank by year first, then month, then day
  * Modified: Instead of a date, we kept our index and split it across multiple columns (day, month, year, day of week). My thinking was that some traders would trade differently in different parts of the year, and even on different days of the week.

#### Final list of features

1. DateTransformer: transform the date from the dataset into year, month, day of month, and day of week

2. TreasuryRateTransformer: get the rate of change for treasury rates using today and x days ago

3. DaysBackRateTransformer: add the rate of change from x days ago for each of our stocks

4. TotalChangeOverDaysTransformer: get the total % of change over last x amount of days for all of our stocks

5. WouldProfitTargetTransformer: Add a target column by looking one day into the future to see if the stock went up

6. ExchangeVolumeChangeOverTime: calculate the % rate of change for volume for every exchange

7. ExchangeCloseRateChange: calculate the % rate of change for Close price for every exchange.
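The look-back features in items 3 and 4 are built row by row in the transformers above. As a hedged alternative sketch, the same values can be produced with vectorized pandas calls, which tends to be much faster on long frames. The toy frame and the window `n` below are hypothetical; the column names follow the naming used by the transformers.

```python
import numpy as np
import pandas as pd

# Hypothetical daily returns for one stock.
df = pd.DataFrame({"MSFT": [0.01, -0.02, 0.03, 0.00, 0.01]})
n = 2  # look-back window in days (hypothetical choice)

# Return from n days ago, as DaysBackRateTransformer computes row by row.
df[f"MSFT_-{n}_DayRate"] = df["MSFT"].shift(n)

# Compounded growth factor over the last n days (today included), as
# TotalChangeOverDaysTransformer computes: product of (1 + r).
df[f"MSFT_TotalChange_Last_{n}_Days"] = (
    (1 + df["MSFT"]).rolling(window=n).apply(np.prod, raw=True)
)

print(df)
```

As with the loop versions, the first rows of each new column are NaN because there is not enough history to fill the window.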

#### EDA post processing

There wasn't a whole lot to gather from EDA on our datasets before the feature engineering done in this notebook. But I'll include what we did, since some existing columns are correlated and can be removed along with their corresponding derived feature columns.

In [None]:
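# keep only numeric columns ('Date' and 'Name' are strings) before computing correlations
round(nasdaq_df[COMMON_COLS].select_dtypes('number').corr())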
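Reading a rounded correlation matrix by eye gets harder as columns are added. A minimal sketch of listing the highly correlated pairs programmatically is shown below; `highly_correlated_pairs` and the 0.95 threshold are hypothetical choices for illustration, and the toy frame stands in for the real data.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.95):
    """Return (col_a, col_b, corr) for numeric column pairs with |corr| >= threshold."""
    corr = df.select_dtypes("number").corr()
    # Keep only the strict upper triangle so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, float(c)) for (a, b), c in pairs.items()]

# Toy data: "b" is an exact linear function of "a"; "c" is only weakly related.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                    "b": [2.0, 4.0, 6.0, 8.0],
                    "c": [1.0, -1.0, 1.0, -1.0]})
print(highly_correlated_pairs(toy))
```

On the real data, one column from each reported pair (and its derived columns) would be a candidate for the `remove` list.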

In [None]:
# We can see multiple columns that are highly correlated with each other, so we'll tag them for removal
# We'll also tag the columns that were derived from them
remove: List[str] = [
    'DTB4WK', 'DTB3', 'DTB6'
    ] + [
        'DTB4WK_Last_1_Yield_Change','DTB4WK_Last_2_Yield_Change','DTB4WK_Last_3_Yield_Change','DTB4WK_Last_5_Yield_Change','DTB4WK_Last_10_Yield_Change','DTB3_Last_1_Yield_Change','DTB3_Last_2_Yield_Change','DTB3_Last_3_Yield_Change','DTB3_Last_5_Yield_Change','DTB3_Last_10_Yield_Change','DTB6_Last_1_Yield_Change','DTB6_Last_2_Yield_Change','DTB6_Last_3_Yield_Change','DTB6_Last_5_Yield_Change','DTB6_Last_10_Yield_Change'
        ]


In [None]:
# we can also see that we can remove one of these columns

corr = msft_df[[DGS5, DGS10]].corr()
display(round(corr))
remove.append('DGS5')


In [None]:
# and remove the corresponding derived columns
toRemove: List[str] = [
    'DGS5_Last_1_Yield_Change','DGS5_Last_2_Yield_Change','DGS5_Last_3_Yield_Change','DGS5_Last_5_Yield_Change','DGS5_Last_10_Yield_Change','DGS10_Last_1_Yield_Change','DGS10_Last_2_Yield_Change','DGS10_Last_3_Yield_Change','DGS10_Last_5_Yield_Change','DGS10_Last_10_Yield_Change'
  ]
for r in toRemove: 
  remove.append(r)

print(remove)

In [None]:
#remove columns and lets take a look at the correlated values again
msft_df.drop(columns=remove, inplace=True)

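#we'll also remove the nan rows that came from calculating over the previous days
msft_df = msft_df.iloc[15:,:]
#the last row has no next-day target, and other gaps may remain, so drop any rows that still contain nan
msft_df = msft_df.dropna()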
msft_df.to_csv('msft_df.csv')
corr = msft_df.corr()
corr.style.background_gradient(cmap='coolwarm')


In [None]:
msft_df['Year'].dtypes

### Training our model



In [None]:
from sklearn.linear_model import LinearRegression 
from sklearn.model_selection import train_test_split

target = 'MSFT_WouldProfit'
X = msft_df.drop(columns=[target])
y = msft_df[[target]]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.70)
lr = LinearRegression()
lr.fit(X_train, y_train)


print(lr.predict(X_test))

print(y_test)

# ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
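The ValueError above means at least one value in the inputs is NaN or infinite. A small diagnostic sketch for finding which columns are responsible is shown below; `non_finite_report` is a hypothetical helper and the toy frame stands in for `X_train`.

```python
import numpy as np
import pandas as pd

def non_finite_report(df: pd.DataFrame) -> pd.Series:
    """Count NaN or +/-inf values per numeric column, keeping only offenders."""
    numeric = df.select_dtypes("number")
    counts = (~np.isfinite(numeric)).sum()
    return counts[counts > 0]

# Toy frame with one clean column, one NaN, and one infinity.
toy = pd.DataFrame({
    "ok": [1.0, 2.0, 3.0],
    "has_nan": [1.0, np.nan, 3.0],
    "has_inf": [np.inf, 2.0, 3.0],
})
print(non_finite_report(toy))
```

Once the offending columns are known, one option is to convert infinities to NaN with `df.replace([np.inf, -np.inf], np.nan)` and then call `dropna()` before splitting, though whether dropping or imputing is appropriate depends on how many rows are affected.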
