## Purpose

This homework is designed to give you practice with scikit-learn.  Please note that this is **NOT** a machine learning course.  Using the library is the important part, not designing 'good' models.  The requirements for this assignment are fairly low.

## Requirements

This is a group assignment.  Take a data set (either one provided, or your group project data set) and use scikit-learn to train a model on some aspect of it.

Some data sets may appear to be something you wouldn't use ML to solve in a 'real life' situation, but this again is just for practice.  So the models may not come out useful, and that's okay.

Each student in the group should do two ML implementations using scikit-learn.  Since there are likely fewer applicable algorithms than required implementations, look at different slices of the data (see the help video).


## Required Hand-in

One notebook should be handed in, following the best practices I've outlined.  This homework is graded as a group homework.  The data set you pick for this practice can be either one I'm providing as part of the repo or your group project data set.

Please label each implementation with the original author (in code, comment above the implementation).

Do not use the .todo as your template.  Analysis of the model's performance should be minimal (see one example in block 10 of https://github.com/TheDarkTrumpet/BAIS-6040-0EXP-Sum2021/blob/master/Notebooks/02-Analysis/09.03.01-Classification.ipynb ).

I do recommend that you lean on whoever in your group has a bit more knowledge of ML concepts to pick the implementation that appears to yield the best results.  If you're using your group data set, this implementation can then be copied/pasted into the group project.

## Other notes

This homework will be graded as a group, meaning you will all get the same grade, regardless of whether a specific student's implementation is poorly done.  It will count for 75 points.  I strongly recommend you discuss as a group who will do what, then meet up at least a few days before the assignment is due to do a code review and merge the individual notebooks.

### Import

In [23]:
import yfinance as yf
import pandas as pd
import numpy as np
import os


from datetime import date
from datetime import timedelta
from pytrends.request import TrendReq

# Import modules using the from syntax
from sklearn.cluster import KMeans                      # k-means clustering
from sklearn.model_selection import train_test_split    # For generating test/train
from sklearn.linear_model import LinearRegression       # Linear regression



### Global Variables

In [24]:
dataDir = r"./Data Files/"  #Directory of all data

today = date.today()  # Today's date

### Global Functions

In [25]:
# Function gets stock data and trend data if needed

def get_data(ticker):
    if os.path.exists(f"{dataDir}{ticker}_{today}_year.csv"):
        
        #Get stored data
        stored_data = pd.read_csv(f"{dataDir}{ticker}_{today}_year.csv")

        # Get rid of index name
        stored_data.set_index('Unnamed: 0', inplace=True)
        stored_data.index.name = None

        return stored_data
    else:
        #Get new data

        # Connect to Google API
        pytrends = TrendReq(hl='en-US', tz=360)

        # Set Keyword
        kw_list = [ticker]

        # Google API only shows the last 90 days, so we need to iterate
        # Set the most recent boundary of the first interval
        date90front = date.today()
        # Initiate dataframe
        trend_data = pd.DataFrame()

        for x in range(4):
            # Set the earlier boundary of the interval
            date90back = date90front - timedelta(days=90)
            # Build Payload of 90 days
            pytrends.build_payload(kw_list,
                                   timeframe=f'{date90back} {date90front}',
                                   geo='')
            trend_90 = pytrends.interest_over_time()
            trend_data = pd.concat([trend_90, trend_data])
            date90front = date90back

        # Get Stock Data
        stock_data = yf.download(ticker,
                                 start=date.today() - timedelta(days=360),
                                 end=date.today(), interval="1d")

        # Combine Data
        new_data = stock_data.join(trend_data)

        # Export to data folder
        new_data.to_csv(f"{dataDir}{ticker}_{today}_year.csv")

        return new_data
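
The four 90-day windows that the loop in `get_data` walks through can be sketched without any network calls. This is only an illustration of the date arithmetic, not part of the notebook: the helper name `ninety_day_windows` and the fixed end date are assumptions made for the example.

```python
from datetime import date, timedelta


def ninety_day_windows(end, count=4, days=90):
    """Return (start, end) date pairs, newest window first (hypothetical helper)."""
    windows = []
    front = end
    for _ in range(count):
        back = front - timedelta(days=days)
        windows.append((back, front))
        front = back  # the next window ends where this one started
    return windows


# Fixed end date chosen only so the example is reproducible
windows = ninety_day_windows(date(2021, 7, 19))
for start, stop in windows:
    # Same "YYYY-MM-DD YYYY-MM-DD" shape that get_data passes as timeframe
    print(f"{start} {stop}")
```

Each window ends on the date the next one starts, so the concatenated trend data may contain the same boundary date twice. Checking `trend_data.index.duplicated()` before the join is one way to confirm this.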
    

### Data and Analysis

#### GameStop (GME)
Connor Moore

##### Get Data

In [26]:
GME_DF = get_data("GME")
GME_DF

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume,GME,isPartial
2020-07-27,4.020000,4.120000,3.950000,4.010000,4.010000,2472700,23.0,False
2020-07-28,3.960000,4.050000,3.920000,3.940000,3.940000,4555400,16.0,False
2020-07-29,3.940000,4.180000,3.920000,4.060000,4.060000,2879600,22.0,False
2020-07-30,4.000000,4.230000,3.970000,4.100000,4.100000,2398500,17.0,False
2020-07-31,4.060000,4.160000,3.990000,4.010000,4.010000,1879400,16.0,False
...,...,...,...,...,...,...,...,...
2021-07-13,187.679993,188.789993,179.000000,180.059998,180.059998,2397900,25.0,False
2021-07-14,180.490005,182.380005,165.070007,167.619995,167.619995,3913800,27.0,False
2021-07-15,160.000000,171.990005,158.009995,166.820007,166.820007,4298600,31.0,False
2021-07-16,170.149994,179.470001,166.300003,169.039993,169.039993,3278800,29.0,False


##### Prepare Data

In [27]:
# Rename search interest
GME_DF.rename(columns={"GME": "Search Interest"}, inplace=True)

# Add difference
GME_DF["Price Difference"] = GME_DF["Open"]-GME_DF["Close"]

# Add truth value that determines if we want to buy or not that day
GME_DF['Buy'] = np.where(GME_DF['Price Difference'] > 0, 1, 0)

# Delete isPartial

del GME_DF['isPartial']

In [28]:
# Check values - no nulls - int or float

GME_DF.info()

<class 'pandas.core.frame.DataFrame'>
Index: 250 entries, 2020-07-27 to 2021-07-19
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Open              250 non-null    float64
 1   High              250 non-null    float64
 2   Low               250 non-null    float64
 3   Close             250 non-null    float64
 4   Adj Close         250 non-null    float64
 5   Volume            250 non-null    int64  
 6   Search Interest   249 non-null    float64
 7   Price Difference  250 non-null    float64
 8   Buy               250 non-null    int64  
dtypes: float64(7), int64(2)
memory usage: 19.5+ KB


In [29]:
# Set features to target "Buy"

features = list(GME_DF.columns)
features.remove("Buy")
target = "Buy"

print(f"Feature categories: {features}")
print(f"Target feature: {target}")

Feature categories: ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', 'Search Interest', 'Price Difference']
Target feature: Buy


In [30]:
X = GME_DF[features]
X

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume,Search Interest,Price Difference
2020-07-27,4.020000,4.120000,3.950000,4.010000,4.010000,2472700,23.0,0.010000
2020-07-28,3.960000,4.050000,3.920000,3.940000,3.940000,4555400,16.0,0.020000
2020-07-29,3.940000,4.180000,3.920000,4.060000,4.060000,2879600,22.0,-0.120000
2020-07-30,4.000000,4.230000,3.970000,4.100000,4.100000,2398500,17.0,-0.100000
2020-07-31,4.060000,4.160000,3.990000,4.010000,4.010000,1879400,16.0,0.050000
...,...,...,...,...,...,...,...,...
2021-07-13,187.679993,188.789993,179.000000,180.059998,180.059998,2397900,25.0,7.619995
2021-07-14,180.490005,182.380005,165.070007,167.619995,167.619995,3913800,27.0,12.870010
2021-07-15,160.000000,171.990005,158.009995,166.820007,166.820007,4298600,31.0,-6.820007
2021-07-16,170.149994,179.470001,166.300003,169.039993,169.039993,3278800,29.0,1.110001


In [31]:
y = GME_DF[target]
y


2020-07-27    1
2020-07-28    1
2020-07-29    0
2020-07-30    0
2020-07-31    1
             ..
2021-07-13    1
2021-07-14    1
2021-07-15    0
2021-07-16    1
2021-07-19    0
Name: Buy, Length: 250, dtype: int64

In [32]:
# Set training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

print(f"Length of X_train (feature training set): {len(X_train)}")
print(f"Length of y_train (target training set): {len(y_train)}")
print(f"Length of X_test (feature test set): {len(X_test)}")
print(f"Length of y_test (target test set): {len(y_test)}")

Length of X_train (feature training set): 187
Length of y_train (target training set): 187
Length of X_test (feature test set): 63
Length of y_test (target test set): 63
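
Because `train_test_split` shuffles rows by default, the split above changes on every run. A minimal sketch of a reproducible split follows. It assumes random stand-in data (`X_demo`, `y_demo`) with the same 250 rows as `GME_DF`, and the seed value is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the notebook's X and y (250 rows, like GME_DF)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Open": rng.random(250),
    "Volume": rng.integers(1, 100, 250),
})
y_demo = pd.Series(rng.integers(0, 2, 250), name="Buy")

# Passing random_state fixes the shuffle, so repeated runs give the same rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

print(len(X_tr), len(X_te))
print(X_tr.index.equals(X_tr2.index))
```

For time-ordered data such as daily prices, passing `shuffle=False` keeps the test set as the most recent rows, which may be closer to how a model would actually be used.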


In [33]:
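# Linear Regression Analysis
# Note: the fit below fails, most likely because "Search Interest" has one
# missing value (249 non-null of 250 in GME_DF.info() above)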

lr = LinearRegression()
lr

lr.fit(X_train, y_train)
lr.score(X_train, y_train)
lr.score(X_test, y_test)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
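
The error is consistent with the `info()` output earlier, where `Search Interest` has 249 non-null values out of 250. One possible fix is to drop the incomplete row before splitting. The sketch below assumes random stand-in data (`demo`) rather than the real `GME_DF`, and which row is set to NaN is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for GME_DF with one missing "Search Interest" value
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Open": rng.random(250) * 100,
    "Search Interest": rng.integers(0, 100, 250).astype(float),
})
demo["Buy"] = (rng.random(250) > 0.5).astype(int)
demo.loc[demo.index[-1], "Search Interest"] = np.nan  # arbitrary row for the demo

# Drop rows with any missing value so the estimator only sees finite numbers
clean = demo.dropna()
X_clean = clean[["Open", "Search Interest"]]
y_clean = clean["Buy"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_clean, y_clean, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_tr, y_tr)
r2_test = model.score(X_te, y_te)  # R^2 on the held-out rows, not accuracy
print(f"R^2 on test set: {r2_test:.3f}")
```

In the notebook, the equivalent step would be calling `GME_DF.dropna()` before building `X` and `y`. Filling the gap instead (for example with `fillna`) is another option if losing a row matters.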