# Homework 7 - Stock and Google Search Correlation Analysis 2
## Group 1
## 20 July 2021

### Introduction

This notebook imports daily stock prices and Google search interest over time for the following companies and analyzes them with machine learning:
* GameStop (GME)
* Apple (AAPL)
* Coke (KO)
* John Deere (DE)
* AMC (AMC)
* Tesla (TSLA)


### Import

In [None]:
import yfinance as yf
import pandas as pd
import numpy as np
import os
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from datetime import date
from pytrends.request import TrendReq

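import math
import random as rnd

# Model selection and the estimators used in the sections below
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Classification and regression metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, recall_score, precision_score, fbeta_score, classification_report
from sklearn.metrics import explained_variance_score, mean_absolute_error, mean_squared_error, r2_score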

%matplotlib inline

### Global Variables and Initialization

In [None]:
dataDir = r"./Data Files/"  # Directory of all data

today = date.today()  # Today's date

rnd.seed(1024)  # Fixed seed so the random samples are reproducible

### Global Functions

In [None]:
# Returns combined stock and Google Trends data for a ticker.
# Uses today's cached CSV if it exists; otherwise downloads and caches new data.

def get_data(ticker):
    csv_path = f"{dataDir}{ticker}_{today}.csv"

    if not os.path.exists(csv_path):
        # Get new data

        # Connect to Google Trends
        pytrends = TrendReq(hl='en-US', tz=360)

        # Set keyword
        kw_list = [ticker]

        # Build payload
        pytrends.build_payload(kw_list, timeframe='2021-04-01 2021-06-30', geo='')

        # Get trends data frame
        trend_data = pytrends.interest_over_time()
        trend_data.rename(columns={ticker: "Search Interest"}, inplace=True)

        # Get stock data (yfinance treats `end` as exclusive)
        stock_data = yf.download(ticker, start="2021-04-01", end="2021-06-30", interval="1d")

        # Combine data, keeping only dates present in both sources
        new_data = pd.concat([stock_data, trend_data], axis=1, join='inner')

        # Export to data folder, creating it if needed
        os.makedirs(dataDir, exist_ok=True)
        new_data.to_csv(csv_path)

    # Always read back from the CSV so fresh and cached calls return the same layout
    # (the callers below rename the resulting date column)
    return pd.read_csv(csv_path)
    

### Data and Analysis

#### GameStop (GME)
Connor Moore

##### Get Data

In [None]:
GME_DF = get_data("GME")

# Name unnamed date column
GME_DF.rename(columns = {"Unnamed: 0": "Date"},inplace = True)

GME_DF

In [None]:
# Add difference

GME_DF["Price Difference"] = GME_DF["Open"]-GME_DF["Close"]

In [None]:
# Set date as index
GME_DF.set_index('Date', inplace=True)
# Parse the index as datetime
GME_DF.index = pd.to_datetime(GME_DF.index)
GME_DF

##### Analysis 1

##### Analysis 2

#### Apple (AAPL)
Ken Cupples

##### Get Data

In [None]:
#Reads in Apple stock data from a csv file
Apple = pd.read_csv(f"{dataDir}AAPL.csv")  # dataDir already ends with "/"

In [None]:
AppleSecondQuarter = get_data('aapl')
AppleSecondQuarter

In [None]:
# Name unnamed date column
AppleSecondQuarter.rename(columns = {"Unnamed: 0": "Date"},inplace = True)
AppleSecondQuarter

##### Analysis 1

##### Analysis 2

### AMC (AMC)
Shawn Sonnack

#### Purpose

We are looking to see whether Google search interest correlates with stock price changes.  For this data set we pulled a daily search-interest number from Google between 0 and 100.  The value is relative to the busiest day in the selected period: 100 marks the peak search traffic, while 0 means there was little to no traffic by comparison.

First, I will dig in to see whether search interest and the opening price can predict the day's close price.
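
Before any modelling, a quick way to check for a linear relationship is a Pearson correlation between search interest and the daily price change. The sketch below uses a small made-up frame; the values and the `Price Change` column name are assumptions for illustration and are not taken from the AMC data.

```python
import pandas as pd

# Hypothetical toy values, only to illustrate the calculation
toy = pd.DataFrame({
    "Search Interest": [10, 20, 30, 40, 50],
    "Open": [10.0, 10.5, 11.0, 12.0, 14.0],
    "Close": [10.2, 11.0, 12.0, 13.5, 16.0],
})

# Daily price change (close minus open)
toy["Price Change"] = toy["Close"] - toy["Open"]

# Pearson correlation between search interest and the daily change
corr = toy["Search Interest"].corr(toy["Price Change"])
print(f"Pearson r: {corr:.2f}")
```

A value near +1 or -1 suggests a strong linear relationship, while a value near 0 suggests little or none. Correlation alone says nothing about which series moves first.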

#### Common methods

In [None]:
def printMetrics(test, predictions):
    print(f"Score: {explained_variance_score(test, predictions):.2f}")
    print(f"MAE: {mean_absolute_error(test, predictions):.2f}")
    print(f"RMSE: {math.sqrt(mean_squared_error(test, predictions)):.2f}")
    print(f"r2: {r2_score(test, predictions):.2f}")

def printClassificationMetrics(test, predictions):
    print("Confusion Matrix:")
    print(confusion_matrix(test, predictions))
    print("------------------")
    print(f"Accuracy: {accuracy_score(test, predictions):.2f}")
    print(f"Recall: {recall_score(test, predictions):.2f}")
    print(f"Precision: {precision_score(test, predictions):.2f}")
    print(f"f-measure: {fbeta_score(test, predictions, beta=1):.2f}")
    print("------------------")
    print(classification_report(test, predictions))


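def prepareDataForPredictions(amcDataFrame, columns=None, numElements=3):
    # Build random sample rows with one value per feature column.
    # By default the columns come from the notebook-level X defined before the call.
    if columns is None:
        columns = X.columns
    amcStockPreparedData = []
    for _ in range(numElements):
        sample = {}  # avoid shadowing the built-in dict
        for column in columns:
            minValue = 0  # assume min = 0
            maxValue = round(max(amcDataFrame[column].values))
            sample[column] = rnd.randint(minValue, maxValue)
        amcStockPreparedData.append(sample)
    return amcStockPreparedData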

#### Pull in prepared data for AMC stock: January 1 - June 30

In [None]:
amcMergedDataFrame = pd.read_csv('Data Files/AMCDataClean.zip')
amcMergedDataFrame

#### Linear Regression Setup

In [None]:
featureColumns=['Search Interest', 'Open']
targetColumn = 'Close'

X=amcMergedDataFrame[featureColumns]
y=amcMergedDataFrame[targetColumn]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

#### Create the regression

In [None]:
lr = LinearRegression()
lr

#### Fit Linear Model

In [None]:
lr.fit(X_train, y_train)

#### How confident are we in our model?

In [None]:
lr.score(X_train, y_train)

In [None]:
lr.score(X_test, y_test)
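
For `LinearRegression`, `score` returns the coefficient of determination (R²), so the two numbers above are R² on the training and test sets. The following sketch illustrates this on synthetic data; the data and the names `X_demo`, `y_demo`, and `model` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: y depends linearly on two features plus a little noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(50, 2))
y_demo = 0.1 * X_demo[:, 0] + 0.9 * X_demo[:, 1] + rng.normal(0, 1, size=50)

model = LinearRegression().fit(X_demo, y_demo)

# score() and r2_score() report the same quantity
score = model.score(X_demo, y_demo)
r2 = r2_score(y_demo, model.predict(X_demo))
print(f"score: {score:.4f}  r2_score: {r2:.4f}")
```

A training score much higher than the test score would hint at overfitting, which is why both are checked.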

#### Print the model's prediction metrics on the test set

In [None]:
predictions = lr.predict(X_test)
printMetrics(y_test, predictions)

#### Create new samples, to test our model

In [None]:
amcStockPreparedData = prepareDataForPredictions(amcMergedDataFrame)

#### Prepare the predictions for consumption

In [None]:
amcPreparedData = pd.DataFrame.from_dict(amcStockPreparedData)
amcPreparedData

#### Predict what the close price will be

In [None]:
predictions = lr.predict(amcPreparedData)
predictions

#### Make it pretty

In [None]:
amcPredictedPrice = amcPreparedData.copy()
amcPredictedPrice['Price Prediction'] = predictions
amcPredictedPrice

#### Classification - Logistic Regression

In [None]:
amcMergedDataFrame

#### Prepare the Data and logistic Columns

In [None]:
amcMergedDataFrame['Price Increase'] = amcMergedDataFrame['Price Increase'].astype(int)
amcMergedDataFrame['Search Interest Above Avg'] = amcMergedDataFrame['Search Interest Above Avg'].astype(int)
logisticFeatureColumns=['Open', 'Close']
logisticTargetColumn = 'Search Interest Above Avg'

X=amcMergedDataFrame[logisticFeatureColumns]
y=amcMergedDataFrame[logisticTargetColumn]


In [None]:
y

#### Split the data from above into training and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

#### Create the Logistic Regression

In [None]:
lr = LogisticRegression(solver="liblinear")
lr

#### Fit the data

In [None]:
lr.fit(X_train, y_train)

#### Score the model

In [None]:
lr.score(X_train, y_train)

In [None]:
lr.score(X_test, y_test)

#### Prepare the predictions

In [None]:
predictions = lr.predict(X_test)
printClassificationMetrics(y_test, predictions)
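
To make the printed metrics easier to read, the sketch below shows how precision and recall follow from the four cells of a binary confusion matrix. The labels are hypothetical and are not taken from the AMC data.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels the matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The hand-computed values match what `precision_score` and `recall_score` return for the same labels.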

#### Create the data set to use for the predictions

In [None]:
amcStockPreparedData = prepareDataForPredictions(amcMergedDataFrame)
amcPreparedData = pd.DataFrame.from_dict(amcStockPreparedData)
amcPreparedData


#### Use the dummy dataset to test our prediction

In [None]:
predictions = lr.predict(amcPreparedData)
predictions

In [None]:
pdPredictedStockTrend = amcPreparedData.copy()  # copy so the sample data is not modified
pdPredictedStockTrend["Search Interest Above Average"] = predictions.astype(bool)
pdPredictedStockTrend

#### Conclusion

The linear regression model worked very well.  It appears to predict a stock's close price with very high accuracy from the opening price and the day's search interest.  In practice this would be hard to use, because search interest and price changes happen at the same time, so the day's search interest is not known in advance.  I am less confident in the predictions of my second model, which uses logistic regression.  There I looked from the other side, to see whether search interest could be predicted from AMC's daily open and close prices.  It would have been interesting to find that search interest is high only when the stock is doing well.  That did not emerge as a trend, but then again the accuracy of the model was high.
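
One way to work around the timing problem is to use only information that is available before the trading day, for example the previous day's search interest. The following is a minimal sketch on made-up data. The `Prev Search Interest` column name and all values are assumptions for illustration, and the approach was not run against the AMC data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical daily values
df = pd.DataFrame({
    "Search Interest": [12, 30, 55, 40, 80, 65, 20, 35],
    "Open":  [10.0, 10.4, 11.5, 12.8, 12.1, 14.9, 15.2, 14.0],
    "Close": [10.3, 11.6, 12.9, 12.0, 15.0, 15.1, 13.8, 14.2],
})

# Shift search interest down one row so each day only sees the previous day's value
df["Prev Search Interest"] = df["Search Interest"].shift(1)

# The first row has no previous day, so drop it
lagged = df.dropna(subset=["Prev Search Interest"])

# Fit the same kind of linear model on the lagged feature and the opening price
model = LinearRegression()
model.fit(lagged[["Prev Search Interest", "Open"]], lagged["Close"])
print(model.coef_, model.intercept_)
```

Comparing the test score of such a lagged model with the original one would show how much of the accuracy depends on same-day information.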




### Coke (KO)
Arielle Swift

#### Get Data

In [None]:
CokeDataSetQtr2 = pd.read_csv(f"{dataDir}KO.csv",parse_dates=["Date"])
CokeDataSetQtr2

In [None]:
CokeDataSetQtr2['Date']= pd.to_datetime(CokeDataSetQtr2['Date'])
CokeDataSetQtr2

In [None]:
CokeDataSetQtr2['InvestToday'] = np.where(CokeDataSetQtr2.Close - CokeDataSetQtr2.Open>0, 1, 0)
CokeDataSetQtr2

In [None]:
Continuous_Cols=[ 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
Categorical_Cols=['InvestToday']

In [None]:
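Predictor_Cols = Categorical_Cols + Continuous_Cols

Target_Col = 'InvestToday'

# Note: InvestToday is derived from Close - Open, and both columns are among the
# features, so the label is a direct function of the inputs (target leakage).
X=CokeDataSetQtr2[Continuous_Cols]
y=CokeDataSetQtr2[Target_Col]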

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
print("Population:\n",y.value_counts(normalize=True)*100)
print("Train:\n", y_train.value_counts(normalize=True)*100)
print("Test:\n", y_test.value_counts(normalize=True)*100)

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)

In [None]:
knn.fit(X_train, y_train)

In [None]:
knn.score(X_train, y_train)

In [None]:
knn.score(X_test, y_test)
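
Because k-nearest neighbours relies on distances, a column with a much larger numeric range (such as `Volume` next to prices) can dominate the result. A common remedy is to standardise the features first. The sketch below shows the idea on synthetic data; the data and all variable names are assumptions for illustration, not the Coke data set.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 200

# Synthetic stand-ins: a price-like column and a volume-like column on very different scales
price_move = rng.normal(0, 1, size=n)
volume = rng.uniform(1e6, 5e7, size=n)
X_demo = np.column_stack([price_move, volume])
y_demo = (price_move > 0).astype(int)  # label depends only on the small-scale column

# Simple split: first 150 rows for training, the rest for testing
X_tr, X_te, y_tr, y_te = X_demo[:150], X_demo[150:], y_demo[:150], y_demo[150:]

# Same classifier, once on raw features and once behind a StandardScaler
raw_knn = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)).fit(X_tr, y_tr)

raw_acc = raw_knn.score(X_te, y_te)
scaled_acc = scaled_knn.score(X_te, y_te)
print(f"unscaled: {raw_acc:.2f}  scaled: {scaled_acc:.2f}")
```

In this setup the unscaled model mostly picks neighbours by volume, which carries no information about the label, so the scaled pipeline typically scores higher.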

#### Confusion Matrix

In [None]:
# Note: this redefines the regression printMetrics from the AMC section
def printMetrics(test, predictions):
    print("Confusion Matrix:")
    print(confusion_matrix(test, predictions))
    print("------------------")
    print(f"Accuracy: {accuracy_score(test, predictions):.2f}")
    print(f"Recall: {recall_score(test, predictions):.2f}")
    print(f"Precision: {precision_score(test, predictions):.2f}")
    print(f"f-measure: {fbeta_score(test, predictions, beta=1):.2f}")
    print("------------------")
    print(classification_report(test, predictions))

In [None]:
predictions = knn.predict(X_test)
printMetrics(y_test, predictions)

In [None]:
CokeDataSetQtr2

In [None]:
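# Build random sample rows, one value per feature column
numElements = 10
SampleStock = []
for _ in range(numElements):
    sample = {}  # avoid overwriting the built-in dict
    for column in X.columns:
        minValue = 0  # avoid overwriting the built-in min
        maxValue = round(max(CokeDataSetQtr2[column].values))
        sample[column] = rnd.randint(minValue, maxValue)
    SampleStock.append(sample)
SampleStock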

In [None]:
pdSampleStock = pd.DataFrame.from_dict(SampleStock)
predictions = knn.predict(pdSampleStock)
predictions

In [None]:
pdPredictedInvest = pdSampleStock.copy()  # copy so the sample data is not modified
pdPredictedInvest["InvestToday?"] = predictions.astype(bool)
pdPredictedInvest

In [None]:
lr = LogisticRegression(solver="liblinear")
lr

In [None]:
lr.fit(X_train, y_train)


In [None]:
lr.score(X_train, y_train)


In [None]:
lr.score(X_test, y_test)

In [None]:
predictions = lr.predict(X_test)

In [None]:
fig = px.scatter(CokeDataSetQtr2, x='Close', y='Open',color='InvestToday')
fig.show()

### Tesla (TSLA)
Andrew T.

The next few lines extract Google Trends data for the keyword TSLA over the period 04-01-2021 to 06-30-2021, along with the financial information for the Tesla stock.  You can see the open and close prices, the high and low for each day, and the trading volume.

In [None]:
tesladf = get_data("TSLA")

# Name unnamed date column
tesladf.rename(columns = {"Unnamed: 0": "Date"},inplace = True)

# Set date as index
tesladf.set_index('Date', inplace=True)

tesladf

The mean Search Interest for Tesla is 67.1 over the period 04-01-2021 to 06-30-2021. For each day, the mean is subtracted from that day's search interest, and the result is added to the `tesladf` DataFrame in a new column.

In [None]:
teslaTrendsMean = tesladf['Search Interest'].mean()
#print(f'The mean of the Search Interest for Tesla is {teslaTrendsMean}')
tesladf["Points Away From Mean"] = tesladf["Search Interest"] - teslaTrendsMean
tesladf



##### Analysis 1

##### Analysis 2

### John Deere (DE)
Dan Knobloch

#### Data Load:

In [None]:
JDStockTrend = get_data("DE")

# Name unnamed date column
JDStockTrend.rename(columns = {"Unnamed: 0": "Date"},inplace = True)

JDStockTrend

##### Analysis 1

##### Analysis 2