## Purpose

This homework is designed to give you practice with scikit-learn.  Please note that this is **NOT** a machine learning course.  Using the library is the important part, not designing 'good' models.  The requirements for this assignment are fairly low.

## Requirements

This is a group assignment.  Take a data set (either one provided, or your group project data set) and use scikit-learn to train models on some aspect of it.

Some data sets may appear to be something you wouldn't use ML to solve in a 'real life' situation, but this again is just for practice.  So the models may not come out useful, and that's okay.

Each student in the group should do 2 ML-type implementations using scikit-learn.  Since there are likely fewer applicable algorithms than there are required implementations, look at different slices of the data (see the help video).
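
As a rough illustration of what two implementations on different slices of the same data might look like, here is a minimal sketch. It assumes scikit-learn and NumPy are installed. The synthetic data and the author placeholders are made up for illustration and are not taken from the provided data sets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): two features and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Author: <student name> (placeholder)
# Implementation 1: regression on the continuous target.
reg = LinearRegression().fit(X_train, y_train)
reg_r2 = reg.score(X_test, y_test)  # R^2 on held-out data

# Author: <student name> (placeholder)
# Implementation 2: classification on a different slice of the same data,
# here whether the target is positive.
clf = LogisticRegression().fit(X_train, y_train > 0)
clf_accuracy = clf.score(X_test, y_test > 0)  # accuracy on held-out data

print(f"regression R^2: {reg_r2:.3f}")
print(f"classification accuracy: {clf_accuracy:.3f}")
```

Printing one score per model is about the level of analysis this kind of practice needs. The choice of estimators here is arbitrary.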


## Required Hand-in

One notebook should be handed in, following the best practices I've outlined.  This homework is graded as a group homework.  The data set you pick for this practice can be either one I'm providing as part of the repo or your group project's data set.

Please label each implementation with the original author (in code, comment above the implementation).

Do not use the .todo as your template.  Analysis of the model's performance should be minimal (see one example in block 10 of https://github.com/TheDarkTrumpet/BAIS-6040-0EXP-Sum2021/blob/master/Notebooks/02-Analysis/09.03.01-Classification.ipynb ).

I do recommend that you lean on whoever in your group has a bit more knowledge of ML concepts to pick the implementation that appears to yield the best results.  If you're using your group data set, this implementation can then be copied/pasted into the group project.

## Other notes

This homework will be graded as a group, meaning you will all get the same grade, regardless of whether a specific student's implementation is poorly done.  It will count for 75 points.  I strongly recommend you discuss as a group who will do what, then meet up at least a few days before the assignment is due to do a code review and merge the individual notebooks.

### Import

In [1]:
import yfinance as yf
import pandas as pd
import numpy as np
import os
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from datetime import date
from datetime import timedelta
from pytrends.request import TrendReq

%matplotlib inline

### Global Variables

In [2]:
dataDir = r"./Data Files/"  #Directory of all data

today = date.today()  # Today's date
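
These two globals are combined into a cache filename inside `get_data`. A minimal sketch of building that path in one place, so the existence check, the read, and the write all agree, could look like the following. The helper name `cache_path` and its parameters are hypothetical and not part of the notebook; the filename pattern mirrors the one `get_data` writes.

```python
from datetime import date
from pathlib import Path


def cache_path(ticker: str, day: date, data_dir: Path = Path("./Data Files")) -> Path:
    """Return the CSV path used to cache one ticker's data for one day."""
    # Formatting a date inside an f-string yields its ISO form, e.g. 2021-07-16.
    return data_dir / f"{ticker}_{day}_year.csv"


path = cache_path("GME", date(2021, 7, 16))
print(path)
```

Using `pathlib` also avoids depending on a trailing slash in the directory string.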

### Global Functions

In [21]:
# Returns today's cached stock and Google Trends data for a ticker,
# downloading and caching it first if no cache file exists yet

def get_data(ticker):
    if os.path.exists(f"{dataDir}{ticker}_{today}_year.csv"):
        
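        # Get stored data from the same file that the check above found
        # (first column holds the dates, so restore it as the index)
        stored_data = pd.read_csv(f"{dataDir}{ticker}_{today}_year.csv",
                                  index_col=0, parse_dates=True)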
        
        return stored_data
    else:
        #Get new data

        # Connect to Google API
        pytrends = TrendReq(hl='en-US', tz=360)

        # Set Keyword
        kw_list = [ticker]

        # Google API only shows the last 90 days, so we need to iterate
        # Set end of the first (most recent) interval
        date90front = date.today()
        # Initiate dataframe
        trend_data = pd.DataFrame()

        for x in range(4):
            # Set start of interval
            date90back = date90front - timedelta(days=90)
            # Build Payload of 90 days
            pytrends.build_payload(kw_list,
                                   timeframe=f'{date90back} {date90front}',
                                   geo='')
            trend_90 = pytrends.interest_over_time()
            trend_data = pd.concat([trend_90, trend_data])
            date90front = date90back

        # Get Stock Data
        stock_data = yf.download(ticker,
                                 start=date.today() - timedelta(days=360),
                                 end=date.today(), interval="1d")

        # Combine Data
        new_data = stock_data.join(trend_data)

        # Export to data folder
        new_data.to_csv(f"{dataDir}{ticker}_{today}_year.csv")

        return new_data
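
The windowing loop inside `get_data` can be looked at on its own. The sketch below reproduces only the date arithmetic, with no network calls; the helper name `trailing_windows` and the fixed example date are assumptions for illustration.

```python
from datetime import date, timedelta


def trailing_windows(end: date, count: int = 4, days: int = 90):
    """Return `count` back-to-back (start, end) windows, newest first."""
    windows = []
    for _ in range(count):
        start = end - timedelta(days=days)
        windows.append((start, end))
        end = start  # the next, older window ends where this one starts
    return windows


windows = trailing_windows(date(2021, 7, 16))
for start, end in windows:
    # Same "start end" string format that get_data passes as `timeframe`.
    print(f"{start} {end}")
```

Four windows of 90 days cover the same 360 days requested from `yf.download`. Because each window ends on the date the next one starts, a boundary date can fall in two windows; whether that yields duplicate rows depends on how the Trends API treats the endpoints, so it may be worth checking the combined index for duplicates before joining.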
    

### Data and Analysis

#### GameStop (GME)
Connor Moore

##### Get Data

In [22]:
GME_DF = get_data("GME")

# Name unnamed date column
#GME_DF.rename(columns = {"Unnamed: 0": "Date"},inplace = True)

GME_DF

[*********************100%***********************]  1 of 1 completed


| Unnamed: 0 | Open | High | Low | Close | Adj Close | Volume | GME | isPartial |
|---|---|---|---|---|---|---|---|---|
| 2020-07-23 | 4.09 | 4.31 | 4.06 | 4.11 | 4.11 | 3237200 | 19 | False |
| 2020-07-24 | 4.06 | 4.23 | 4.01 | 4.03 | 4.03 | 2215900 | 17 | False |
| 2020-07-27 | 4.02 | 4.12 | 3.95 | 4.01 | 4.01 | 2472700 | 24 | False |
| 2020-07-28 | 3.96 | 4.05 | 3.92 | 3.94 | 3.94 | 4555400 | 27 | False |
| 2020-07-29 | 3.94 | 4.18 | 3.92 | 4.06 | 4.06 | 2879600 | 17 | False |
| 2020-07-30 | 4.0 | 4.23 | 3.97 | 4.1 | 4.1 | 2398500 | 15 | False |
| 2020-07-31 | 4.06 | 4.16 | 3.99 | 4.01 | 4.01 | 1879400 | 20 | False |
| 2020-08-03 | 4.03 | 4.25 | 4.0 | 4.15 | 4.15 | 2517600 | 13 | False |
| 2020-08-04 | 4.13 | 4.74 | 4.13 | 4.43 | 4.43 | 10361400 | 21 | False |
| 2020-08-05 | 4.5 | 4.76 | 4.25 | 4.63 | 4.63 | 4919300 | 20 | False |


In [23]:
# Add the daily price difference (open minus close)

GME_DF["Price Difference"] = GME_DF["Open"]-GME_DF["Close"]


In [24]:
# Delete isPartial

del GME_DF['isPartial']

In [26]:
GME_DF.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 251 entries, 2020-07-23 to 2021-07-16
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Open              251 non-null    float64
 1   High              251 non-null    float64
 2   Low               251 non-null    float64
 3   Close             251 non-null    float64
 4   Adj Close         251 non-null    float64
 5   Volume            251 non-null    int64  
 6   GME               251 non-null    int64  
 7   Price Difference  251 non-null    float64
dtypes: float64(6), int64(2)
memory usage: 17.6 KB


DatetimeIndex(['2020-07-23', '2020-07-24', '2020-07-25', '2020-07-26',
               '2020-07-27', '2020-07-28', '2020-07-29', '2020-07-30',
               '2020-07-31', '2020-08-01',
               ...
               '2021-07-07', '2021-07-08', '2021-07-09', '2021-07-10',
               '2021-07-11', '2021-07-12', '2021-07-13', '2021-07-14',
               '2021-07-15', '2021-07-16'],
              dtype='datetime64[ns]', name='Date', length=362, freq=None)