# Homework 7 - Stock and Google Search Correlation Analysis 2
## Group 1
## 20 July 2021

### Introduction

This notebook imports daily stock prices and Google search interest over time, then analyzes them with machine learning models for the following tickers:
* GameStop (GME) <br>
* Apple (AAPL) <br>
* Coke (KO)<br>
* John Deere (DE) <br>
* AMC (AMC) <br>


### Import

In [None]:
import yfinance as yf
import pandas as pd
import numpy as np
import os
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from datetime import date
from pytrends.request import TrendReq

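import math
import random as rnd

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, recall_score, precision_score, fbeta_score, classification_report
from sklearn.metrics import explained_variance_score, mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier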

%matplotlib inline

### Global Variables and Initialization

In [None]:
dataDir = r"./Data Files/"  #Directory of all data

today = date.today()  # Today's date, used in the cache file names

rnd.seed(1024)  # Fixed seed so the random samples are repeatable

### Global Functions

In [None]:
# Function gets stock data and trend data if needed

def get_data(ticker):
    if os.path.exists(f"{dataDir}{ticker}_{today}.csv"):
        #Get stored data
        stored_data = pd.read_csv(f"{dataDir}{ticker}_{today}.csv")
        
        return stored_data
    else:
        #Get new data

        # Connect to Google API
        pytrends = TrendReq(hl='en-US', tz=360)

        # Set Keyword
        kw_list = [ticker]

        # Build Payload
        pytrends.build_payload(kw_list, timeframe='2021-04-01 2021-06-30', geo='')

        # Get trends Data frame
        trend_data = pytrends.interest_over_time()
        trend_data.rename(columns = {ticker: "Search Interest"},inplace = True)

        # Get Stock Data
        stock_data = yf.download(ticker, start="2021-04-01", end="2021-06-30", interval="1d")

        # Combine Data
        new_data = pd.concat([stock_data, trend_data], axis = 1, join = 'inner')

        # Export to data folder
        new_data.to_csv(f"{dataDir}{ticker}_{today}.csv")

        return new_data
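
The cache-or-download pattern used by `get_data` can be tried without any network access. The following is a minimal sketch using only the standard library: `load_or_fetch` and `fake_fetch` are hypothetical names, the row values are made up, and `fake_fetch` stands in for the Google Trends and Yahoo Finance calls.

```python
import csv
import os
import tempfile
from datetime import date


def load_or_fetch(ticker, fetch, data_dir, day=None):
    """Return rows for `ticker`, using a per-day CSV cache in `data_dir`.

    `fetch` is a hypothetical callable standing in for the real downloads;
    it must return a non-empty list of dicts that share the same keys.
    """
    day = day or date.today()
    os.makedirs(data_dir, exist_ok=True)  # create the cache folder on first use
    path = os.path.join(data_dir, f"{ticker}_{day}.csv")

    if os.path.exists(path):
        # Cache hit: read the stored rows back (all values come back as strings)
        with open(path, newline="") as handle:
            return list(csv.DictReader(handle))

    # Cache miss: fetch once, then store the rows for later calls on the same day
    rows = fetch(ticker)
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return rows


calls = []  # records how often the (fake) download actually runs


def fake_fetch(ticker):
    calls.append(ticker)
    # Made-up row for illustration only
    return [{"Date": "2021-04-01", "Close": "191.45", "Search Interest": "12"}]


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_fetch("GME", fake_fetch, tmp)
    second = load_or_fetch("GME", fake_fetch, tmp)

print(len(calls))  # number of real fetches across the two calls
```

Because the file name contains the date, the cache refreshes once per day. Creating the folder up front avoids a failure when the data directory does not exist yet.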
    

### Data and Analysis

#### GameStop (GME)
Connor Moore

##### Get Data

In [None]:
GME_DF = get_data("GME")

# Name unnamed date column
GME_DF.rename(columns = {"Unnamed: 0": "Date"},inplace = True)

GME_DF

In [None]:
# Add difference

GME_DF["Price Difference"] = GME_DF["Open"]-GME_DF["Close"]

In [None]:
# Set date as index
GME_DF.set_index('Date', inplace=True)
# Convert the index values to datetime
GME_DF.index = pd.to_datetime(GME_DF.index)
GME_DF

##### Analysis 1

##### Analysis 2

#### Apple (AAPL)
Ken Cupples

##### Get Data

In [None]:
#Reads in Apple stock data from a csv file
Apple = pd.read_csv(f"{dataDir}AAPL.csv")  # dataDir already ends with a slash

In [None]:
AppleSecondQuarter = get_data('AAPL')
AppleSecondQuarter

In [None]:
# Name unnamed date column
AppleSecondQuarter.rename(columns = {"Unnamed: 0": "Date"},inplace = True)
AppleSecondQuarter

##### Analysis 1

##### Analysis 2

### AMC (AMC)
Shawn Sonnack

#### Purpose

We are looking to see whether there is any correlation between Google search interest and stock price changes.  In this data set we have pulled a number from Google between 0 and 100.  A value of 0 means that there was little to no traffic compared to normal operation.  A value of 100 means that the search traffic for that day was extremely high.

First I will dig in to see if search interest and the price it opens at can predict the close price of the day.
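
Before any modeling, the correlation itself can be measured directly. The sketch below uses made-up numbers (not real AMC data) and computes the Pearson correlation between search interest and the intraday price change with pandas; the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Made-up values for illustration only; these are not real AMC figures
frame = pd.DataFrame({
    "Search Interest": [10, 25, 40, 80, 100, 55],
    "Open": [10.0, 10.5, 11.0, 14.0, 20.0, 18.0],
    "Close": [10.4, 11.0, 13.5, 19.0, 18.5, 17.0],
})

# Intraday change: positive when the stock closed above its open
frame["Price Change"] = frame["Close"] - frame["Open"]

# Series.corr uses the Pearson method by default and returns a value in [-1, 1]
correlation = frame["Search Interest"].corr(frame["Price Change"])
print(f"Pearson r: {correlation:.2f}")
```

A value near 0 would suggest no linear relationship, while values near 1 or -1 would suggest a strong one. Correlation alone does not say which of the two drives the other.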

#### Common methods

In [None]:
def printMetrics(test, predictions):
    print(f"Score: {explained_variance_score(test, predictions):.2f}")
    print(f"MAE: {mean_absolute_error(test, predictions):.2f}")
    print(f"RMSE: {math.sqrt(mean_squared_error(test, predictions)):.2f}")
    print(f"r2: {r2_score(test, predictions):.2f}")

def printClassificationMetrics(test, predictions):
    print("Confusion Matrix:")
    print(confusion_matrix(test, predictions))
    print("------------------")
    print(f"Accuracy: {accuracy_score(test, predictions):.2f}")
    print(f"Recall: {recall_score(test, predictions):.2f}")
    print(f"Precision: {precision_score(test, predictions):.2f}")
    print(f"f-measure: {fbeta_score(test, predictions, beta=1):.2f}")
    print("------------------")
    print(classification_report(test, predictions))


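def prepareDataForPredictions(amcDataFrame):
    # Build a few random sample rows, one value per feature column.
    # Relies on the global feature frame X being defined before the call.
    numElements = 3
    amcStockPreparedData = []
    for _ in range(numElements):
        sample = {}  # avoid shadowing the built-in dict
        for column in X.columns:
            minValue = 0  # assume min = 0
            maxValue = round(max(amcDataFrame[column].values))
            sample[column] = rnd.randint(minValue, maxValue)
        amcStockPreparedData.append(sample)
    return amcStockPreparedData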

#### Pull in prepared data for AMC stock: January 1 - June 30

In [None]:
amcMergedDataFrame = pd.read_csv('Data Files/AMCDataClean.zip')
amcMergedDataFrame

#### Linear Regression Setup

In [None]:
featureColumns=['Search Interest', 'Open']
targetColumn = 'Close'

X=amcMergedDataFrame[featureColumns]
y=amcMergedDataFrame[targetColumn]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

#### Create the regression

In [None]:
lr = LinearRegression()
lr

#### Fit Linear Model

In [None]:
lr.fit(X_train, y_train)

#### How confident are we in our model?

In [None]:
lr.score(X_train, y_train)

In [None]:
lr.score(X_test, y_test)
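
For a `LinearRegression`, `score` returns the coefficient of determination (R²), not an accuracy percentage. A small sketch on synthetic data (the `_demo` names and the data are assumptions for illustration) shows that the value matches the textbook formula 1 - SS_res / SS_tot:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: two features and a noisy linear target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(40, 2))
y_demo = 0.1 * X_demo[:, 0] + 0.9 * X_demo[:, 1] + rng.normal(0, 1, size=40)

model = LinearRegression().fit(X_demo, y_demo)
predicted = model.predict(X_demo)

# R^2 by hand: one minus residual sum of squares over total sum of squares
ss_res = np.sum((y_demo - predicted) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(f"manual R^2: {manual_r2:.4f}")
print(f"model.score: {model.score(X_demo, y_demo):.4f}")
```

A score close to 1 means the model explains most of the variance in the target. Comparing the training and test scores, as the two cells above do, is a quick check for overfitting.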

#### Print regression metrics for the test-set predictions

In [None]:
predictions = lr.predict(X_test)
printMetrics(y_test, predictions)

#### Create new samples, to test our model

In [None]:
amcStockPreparedData = prepareDataForPredictions(amcMergedDataFrame)

#### Prepare the predictions for consumption

In [None]:
amcPreparedData = pd.DataFrame.from_dict(amcStockPreparedData)
amcPreparedData

#### Predict what the close price will be

In [None]:
predictions = lr.predict(amcPreparedData)
predictions

#### Make it pretty

In [None]:
amcPredictedPrice = amcPreparedData.copy()
amcPredictedPrice['Price Prediction'] = predictions
amcPredictedPrice

#### Classification - Logistic Regression

In [None]:
amcMergedDataFrame

#### Prepare the Data and logistic Columns

In [None]:
amcMergedDataFrame['Price Increase'] = amcMergedDataFrame['Price Increase'].astype(int)
amcMergedDataFrame['Search Interest Above Avg'] = amcMergedDataFrame['Search Interest Above Avg'].astype(int)
logisticFeatureColumns=['Open', 'Close']
logisticTargetColumn = 'Search Interest Above Avg'

X=amcMergedDataFrame[logisticFeatureColumns]
y=amcMergedDataFrame[logisticTargetColumn]


In [None]:
y

#### Train the model with my data from above

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

#### Create the Logistic Regression

In [None]:
lr = LogisticRegression(solver="liblinear")
lr

#### Fit the data

In [None]:
lr.fit(X_train, y_train)

#### Score the model

In [None]:
lr.score(X_train, y_train)

In [None]:
lr.score(X_test, y_test)

#### Prepare the predictions

In [None]:
predictions = lr.predict(X_test)
printClassificationMetrics(y_test, predictions)
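
The numbers printed by `printClassificationMetrics` all derive from the four cells of the confusion matrix. As a minimal sketch in plain Python (the `binary_metrics` helper and the labels are made up for illustration), they can be computed by hand:

```python
def binary_metrics(actual, predicted):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration only
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
metrics = binary_metrics(actual, predicted)
print(metrics)
```

Precision answers "of the days predicted as positive, how many were positive?", while recall answers "of the positive days, how many were found?".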

#### Create the data set to use for the predictions

In [None]:
amcStockPreparedData = prepareDataForPredictions(amcMergedDataFrame)
amcPreparedData = pd.DataFrame.from_dict(amcStockPreparedData)
amcPreparedData


#### Use the dummy dataset to test our prediction

In [None]:
predictions = lr.predict(amcPreparedData)
predictions

In [None]:
pdPredictedStockTrend = amcPreparedData.copy()  # copy so the sample frame is not modified
pdPredictedStockTrend["Search Interest Above Average"] = predictions.astype(bool)
pdPredictedStockTrend

#### Conclusion

The linear regression model worked very well.  It appears to predict the close price of the stock with extremely high accuracy from the opening price and the search interest for the day.  In the real world, however, this would be hard to use, because search interest and price changes happen at the same time.  I am less confident in the predictions of my second model, which uses logistic regression.  I tried to look from the other side to see whether I could predict the search interest from the daily open and close prices of AMC.  I felt it would be interesting if it showed that search interest is high only when the stock is doing well.  This did not prove to be a trend, but then again the accuracy of the model was high.
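
One possible reason a classifier can show high accuracy without revealing a real trend is class imbalance; this is an assumption, not something verified on the AMC data. The sketch below uses made-up labels and scikit-learn's `DummyClassifier` to show the accuracy a model earns by always predicting the most common class:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: 80 days below average search interest, 20 days above
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((100, 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_accuracy = accuracy_score(y_demo, baseline.predict(X_demo))
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

Comparing a model's accuracy with this kind of baseline helps judge whether the features contribute anything beyond the class frequencies.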




### Coke (KO)
Arielle Swift

#### Get Data

In [None]:
CokeDataSetQtr2 = pd.read_csv(f"{dataDir}KO.csv",parse_dates=["Date"])
CokeDataSetQtr2

In [None]:
CokeDataSetQtr2['Date']= pd.to_datetime(CokeDataSetQtr2['Date'])
CokeDataSetQtr2

In [None]:
CokeDataSetQtr2['InvestToday'] = np.where(CokeDataSetQtr2.Close - CokeDataSetQtr2.Open>0, 1, 0)
CokeDataSetQtr2

In [None]:
Continuous_Cols=[ 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
Categorical_Cols=['InvestToday']

In [None]:
Predictor_Cols = Categorical_Cols + Continuous_Cols

Target_Col = 'InvestToday'

X=CokeDataSetQtr2[Continuous_Cols]
y=CokeDataSetQtr2[Target_Col]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
print("Population:\n",y.value_counts(normalize=True)*100)
print("Train:\n", y_train.value_counts(normalize=True)*100)
print("Test:\n", y_test.value_counts(normalize=True)*100)

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)

In [None]:
knn.fit(X_train, y_train)

In [None]:
knn.score(X_train, y_train)

In [None]:
knn.score(X_test, y_test)
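
K-nearest neighbors relies on distances, so a column with very large values (such as `Volume`) can dominate columns measured in dollars. The sketch below is an illustration on synthetic data, not a change to the model above; it compares a plain KNN with one wrapped in a `StandardScaler` pipeline. All `_demo` names and values are assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
price_move = rng.normal(0, 1, size=200)    # small-scale feature that carries the signal
volume = rng.uniform(1e6, 2e7, size=200)   # large-scale feature with no signal
X_demo = np.column_stack([price_move, volume])
y_demo = (price_move > 0).astype(int)

# Simple ordered split: first 150 rows for training, the rest for testing
X_train_demo, X_test_demo = X_demo[:150], X_demo[150:]
y_train_demo, y_test_demo = y_demo[:150], y_demo[150:]

# Plain KNN: distances are driven almost entirely by the volume column
raw_knn = KNeighborsClassifier(n_neighbors=3).fit(X_train_demo, y_train_demo)

# Scaled KNN: every feature is standardized before distances are computed
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_knn.fit(X_train_demo, y_train_demo)

raw_score = raw_knn.score(X_test_demo, y_test_demo)
scaled_score = scaled_knn.score(X_test_demo, y_test_demo)
print(f"raw: {raw_score:.2f}  scaled: {scaled_score:.2f}")
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, which keeps test-set statistics out of training.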

#### Confusion Matrix

In [None]:
def printMetrics(test, predictions):
    print("Confusion Matrix:")
    print(confusion_matrix(test, predictions))
    print("------------------")
    print(f"Accuracy: {accuracy_score(test, predictions):.2f}")
    print(f"Recall: {recall_score(test, predictions):.2f}")
    print(f"Precision: {precision_score(test, predictions):.2f}")
    print(f"f-measure: {fbeta_score(test, predictions, beta=1):.2f}")
    print("------------------")
    print(classification_report(test, predictions))

In [None]:
predictions = knn.predict(X_test)
printMetrics(y_test, predictions)

In [None]:
CokeDataSetQtr2

In [None]:
numElements = 10
SampleStock = []
for _ in range(numElements):
    sample = {}  # avoid shadowing the built-in dict
    for column in X.columns:
        minValue = 0  # assume min = 0
        maxValue = round(max(CokeDataSetQtr2[column].values))
        sample[column] = rnd.randint(minValue, maxValue)
    SampleStock.append(sample)
SampleStock

In [None]:
pdSampleStock = pd.DataFrame.from_dict(SampleStock)
predictions = knn.predict(pdSampleStock)
predictions

In [None]:
pdPredictedInvest = pdSampleStock.copy()  # copy so the sample frame is not modified
pdPredictedInvest["InvestToday?"] = predictions.astype(bool)
pdPredictedInvest

In [None]:
lr = LogisticRegression(solver="liblinear")
lr

In [None]:
lr.fit(X_train, y_train)


In [None]:
lr.score(X_train, y_train)


In [None]:
lr.score(X_test, y_test)

In [None]:
predictions = lr.predict(X_test)

In [None]:
fig = px.scatter(CokeDataSetQtr2, x='Close', y='Open',color='InvestToday')
fig.show()

### Bitcoin (BTC-USD)
Andrew T.

In [None]:
pytrends = TrendReq(hl='en-US', tz=360)

# Build the list of keywords; in this case only "Bitcoin" is used
kw_list = ["Bitcoin"]

# Build the payload
pytrends.build_payload(kw_list, timeframe='2021-03-31 2021-06-29', geo='US')

# Store interest over time in a DataFrame and rename the Bitcoin column to Previous_Search_Interest
bitcoinTrendsdf = pytrends.interest_over_time()
bitcoinTrendsdf = bitcoinTrendsdf.rename(columns={'Bitcoin': 'Previous_Search_Interest'})
bitcoinTrendsdf.reset_index(inplace=True, drop=True)
bitcoinTrendsdf

In [None]:
bitcoinPricedf = pd.read_csv("https://raw.githubusercontent.com/atlas125gev/StockProject/main/Homework7/BTC-USD.csv")
bitcoinPreviousPricedf = pd.read_csv("https://raw.githubusercontent.com/atlas125gev/StockProject/main/Homework7/BTC-USD-Previous.csv")
mergedPrice = pd.concat([bitcoinPricedf, bitcoinPreviousPricedf], axis=1)
bitcoinPreviousPricedf
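
The `Previous_*` columns used later suggest that the second file holds the prior day's values; that is an assumption about its contents. If only one price file were available, the same kind of lagged features could be built with `DataFrame.shift`, as in this sketch with made-up prices:

```python
import pandas as pd

# Made-up prices for illustration only
prices = pd.DataFrame(
    {"Open": [100.0, 102.0, 101.0, 105.0], "Close": [102.0, 101.0, 105.0, 104.0]},
    index=pd.to_datetime(["2021-04-01", "2021-04-02", "2021-04-03", "2021-04-04"]),
)

# Shift every column down by one row so each day sees the previous day's values
previous = prices.shift(1).add_prefix("Previous_")

# The first day has no previous day, so it is dropped
lagged = pd.concat([prices, previous], axis=1).dropna()
print(lagged)
```

Using only previous-day values as features keeps information from the day being predicted out of the model inputs.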

In [None]:
mergedStockPrice = pd.concat([mergedPrice, bitcoinTrendsdf], axis=1)
#mergedStockPrice.reset_index(inplace=True, drop=True)
# Drop the Google Trends partial-data flag; it is not used in the analysis
mergedStockPrice = mergedStockPrice.drop(columns=['isPartial'])
#mergedStockPrice['Previous_Close'] = mergedStockPrice['Close']
#meanSearchInterest = mergedStockPrice['Search_Interest'].mean()
#mergedStockPrice["Interest Points Away From Mean"] = mergedStockPrice["Search Interest"] - meanSearchInterest
# A price increase means the close is above the open
mergedStockPrice["Price_Increase"] = mergedStockPrice["Close"] - mergedStockPrice["Open"] > 0.0
mergedStockPrice["Price_Increase"] = mergedStockPrice["Price_Increase"]*1


#pd.set_option('display.max_rows', len(mergedStockPrice))

mergedStockPrice

In [None]:
mergedStockPrice.info()

In [None]:
#columns = ["Open", "High", "Low", "Close", "Search Interest", "Interest Points Away From Mean", "Price Increase Points", "Price Increase"]
columns = ["Previous_Open", "Previous_High", "Previous_Low", "Previous_Close", "Previous_Volume", "Previous_Search_Interest", "Close"]
mergedStockPrice = mergedStockPrice[columns].copy()  # copy so later column assignments do not warn
mergedStockPrice

In [None]:
features = list(mergedStockPrice.columns)
features.remove("Close")
target = "Close"

X = mergedStockPrice[features]
y = mergedStockPrice[target]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

In [None]:
lr = LinearRegression()
lr

In [None]:
lr.fit(X_train, y_train)

In [None]:
lr.score(X_train, y_train)

In [None]:
lr.score(X_test, y_test)

In [None]:

def printMetrics(test, predictions):
    print(f"Score: {explained_variance_score(test, predictions):.2f}")
    print(f"MAE: {mean_absolute_error(test, predictions):.2f}")
    print(f"RMSE: {math.sqrt(mean_squared_error(test, predictions)):.2f}")
    print(f"r2: {r2_score(test, predictions):.2f}")

In [None]:
predictions = lr.predict(X_test)
printMetrics(y_test, predictions)

In [None]:
numElements = 3
samplePrice = []
for _ in range(numElements):
    sample = {}  # avoid shadowing the built-in dict
    for column in X.columns:
        minValue = 0  # lower bound for the random samples
        maxValue = round(max(mergedStockPrice[column].values))
        sample[column] = rnd.randint(minValue, maxValue)
    samplePrice.append(sample)
samplePrice


In [None]:
pdSamplePrice = pd.DataFrame.from_dict(samplePrice)
pdSamplePrice


In [None]:
predictions = lr.predict(pdSamplePrice)
predictions



In [None]:
pdSamplePrice = pdSamplePrice.copy()
pdSamplePrice['Predicted'] = predictions
pdSamplePrice



#### Logistic Regression

In [None]:
def printClassificationMetrics(test, predictions):
    print("Confusion Matrix:")
    print(confusion_matrix(test, predictions))
    print("------------------")
    print(f"Accuracy: {accuracy_score(test, predictions):.2f}")
    print(f"Recall: {recall_score(test, predictions):.2f}")
    print(f"Prediction: {precision_score(test, predictions):.2f}")
    print(f"f-measure: {fbeta_score(test, predictions, beta=1):.2f}")
    print("------------------")
    print(classification_report(test, predictions))

In [None]:
pd.set_option('display.max_rows', len(mergedStockPrice))

mergedStockPrice["Price_Increase"] = bitcoinPricedf["Close"] - bitcoinPricedf["Open"] > 0.0  # True when the close is above the open
mergedStockPrice['Price_Increase'] = mergedStockPrice.Price_Increase.astype(int)
mergedStockPrice


In [None]:

columns = ["Previous_Open", "Previous_Close", "Previous_Search_Interest", "Price_Increase"]
mergedStockPrice = mergedStockPrice[columns].copy()
mergedStockPrice

features = list(mergedStockPrice.columns)
features.remove("Price_Increase")
target = "Price_Increase"

X = mergedStockPrice[features]
y = mergedStockPrice[target]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [None]:


logReg = LogisticRegression(solver="liblinear")
logReg

In [None]:
logReg.fit(X_train, y_train)

In [None]:
logReg.score(X_train, y_train)

In [None]:
logReg.score(X_test, y_test)

In [None]:
# The predictions are class labels, so classification metrics apply here rather than the regression metrics
predictions = logReg.predict(X_test)
printClassificationMetrics(y_test, predictions)



In [None]:
# Reuse the samples generated above, keeping only the features this model was trained on
sampleFeatures = pdSamplePrice[features]

predictions = logReg.predict(sampleFeatures)

pdPredicted = sampleFeatures.copy()
pdPredicted["Price_Increase"] = predictions.astype(bool)
pdPredicted

### John Deere (DE)
Dan Knobloch

##### Helper methods

In [None]:
def createCategoricalDummies(dataFrame, categoryList):
    return pd.get_dummies(dataFrame[categoryList], prefix_sep = "::", drop_first = True)

In [None]:
def printRegressionMetrics(test, predictions):
    print(f"Score: {explained_variance_score(test, predictions):.2f}")
    print(f"MAE: {mean_absolute_error(test, predictions):.2f}")
    print(f"RMSE: {math.sqrt(mean_squared_error(test, predictions)):.2f}")
    print(f"r2: {r2_score(test, predictions):.2f}")

In [None]:
def printClassificationMetrics(test, predictions):
    print("Confusion Matrix:")
    print(confusion_matrix(test, predictions))
    print("------------------")
    print(f"Accuracy: {accuracy_score(test, predictions):.2f}")
    print(f"Recall: {recall_score(test, predictions):.2f}")
    print(f"Precision: {precision_score(test, predictions):.2f}")
    print(f"f-measure: {fbeta_score(test, predictions, beta=1):.2f}")
    print("------------------")
    print(classification_report(test, predictions))

#### Part 1: Regression - Linear Regression
##### Prepare the Data
Load the data, clean it up, prepare features and targets, and split into training and test data.

In [None]:
# read file, drop null values, convert binary values in trend and price to integers
JDStockTrend = pd.read_csv(r"C:\Users\dk12955\BAISsummer2021\ClassProject\DeereStockPrice.csv")
JDStockTrend.dropna(inplace=True)
JDStockTrend['trend_daily_increase'] = JDStockTrend.trend_daily_increase.astype(int)
JDStockTrend['price_daily_increase'] = JDStockTrend.price_daily_increase.astype(int)
JDStockTrend

In [None]:
# Determine feature vectors and target vectors. In this case we target a prediction of price based on the previous day's closing price, volume traded, and trends data
featureColumns = ["Trends","Previous_Close","Volume","trend_daily_increase"]
target = "Price"

X=JDStockTrend[featureColumns]
y=JDStockTrend[target]

X.info()

In [None]:
y

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

#### Modeling with Linear Regression

Fit a line to our data set that minimizes the sum of squared distances between the points and the line.

In [None]:
lr = LinearRegression()    # use this algorithm to fit the line between data points
lr

In [None]:
lr.fit(X_train, y_train)
trainScore = lr.score(X_train, y_train)
testScore = lr.score(X_test, y_test)

print(f'the score with the training data set = {trainScore}')
print(f'the score with the test data set = {testScore}')
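
The "line" that linear regression fits is the one that minimizes the sum of squared residuals (ordinary least squares). As a sketch on made-up numbers, the same coefficients can be recovered with NumPy's `lstsq` by adding a column of ones for the intercept; the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: five rows, two features
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y_demo = np.array([3.1, 2.9, 7.2, 6.8, 11.1])

# Design matrix with a leading column of ones for the intercept term
design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least-squares solution: coef[0] is the intercept, coef[1:] are the slopes
coef, *_ = np.linalg.lstsq(design, y_demo, rcond=None)

model = LinearRegression().fit(X_demo, y_demo)
print("lstsq:  ", coef)
print("sklearn:", model.intercept_, model.coef_)
```

Both approaches solve the same minimization, so the intercept and slopes agree up to floating-point error.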

#### Linear Regression Metrics Output

In [None]:
predictions = lr.predict(X_test)
printRegressionMetrics(y_test, predictions)

#### Predict some new samples

Define a few new samples.

In [None]:
numElements = 3
sampleStockTrend = []
for _ in range(numElements):
    sample = {}  # avoid shadowing the built-in dict
    for column in X.columns:
        minValue = 0  # assume min = 0
        maxValue = round(max(JDStockTrend[column].values))
        sample[column] = rnd.randint(minValue, maxValue)
    sampleStockTrend.append(sample)
sampleStockTrend

In [None]:
pdSampleStockTrend = pd.DataFrame.from_dict(sampleStockTrend)
pdSampleStockTrend

In [None]:
predictions = lr.predict(pdSampleStockTrend)
predictions

In [None]:
pdPredictedStockTrend = pdSampleStockTrend.copy()
pdPredictedStockTrend['Predicted'] = predictions
pdPredictedStockTrend

#### Part 2: Classification - Logistic Regression

##### Prepare the Data
Load the data, clean it up, prepare features and targets, and split into training and test data.

In [None]:
# Need to prepare the data separately for logistic regression because our feature and target vectors will be different
JDStockTrend

In [None]:
# Determine feature vectors and target vectors. In this case we target a classification of whether or not the price would increase based on the previous day's closing price, volume traded, and search trends data
featureColumns = ["Trends","Previous_Close","Volume","trend_daily_increase"]
target = "price_daily_increase"

X=JDStockTrend[featureColumns]
y=JDStockTrend[target]

X

In [None]:
y

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
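
Unlike the linear regression split earlier, this split does not pass `random_state`, so the rows in the training and test sets, and therefore the scores, can change on every run. A small sketch with made-up rows shows that fixing `random_state` makes the split repeatable:

```python
from sklearn.model_selection import train_test_split

# Made-up rows and labels for illustration only
rows = list(range(20))
labels = [i % 2 for i in rows]

# Two calls with the same random_state produce the same shuffle
split_a = train_test_split(rows, labels, test_size=0.25, random_state=1)
split_b = train_test_split(rows, labels, test_size=0.25, random_state=1)

# Each result is [rows_train, rows_test, labels_train, labels_test]
print(split_a[1])
print(split_b[1])
```

A fixed seed makes the reported scores easier to compare between runs and between team members.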

#### Modeling with Logistic Regression (classification)


In [None]:
lr = LogisticRegression(solver="liblinear")
lr

In [None]:
lr.fit(X_train, y_train)

In [None]:
lr.score(X_train, y_train)

In [None]:
lr.score(X_test, y_test)

#### Classification Metrics Output

In [None]:
predictions = lr.predict(X_test)
printClassificationMetrics(y_test, predictions)

##### Predict with the samples that were generated above

In [None]:
sampleStockTrend

In [None]:
pdSampleStockTrend = pd.DataFrame.from_dict(sampleStockTrend)

In [None]:
predictions = lr.predict(pdSampleStockTrend)
predictions

In [None]:
pdPredictedStockTrend = pdSampleStockTrend.copy()  # copy so the sample frame is not modified
pdPredictedStockTrend["price_daily_increase"] = predictions.astype(bool)
pdPredictedStockTrend

#### Conclusion

Comparing the results of both methods (linear regression and logistic regression for classification), the data suggests that linear regression is a strong candidate for predicting the price of John Deere. In each linear regression scenario the score is around 90%, and the generated sample data also produces reasonable predictions for the closing price. The logistic model (for classification) has an overall score of only 44%. Even so, a stronger classification model would likely provide more value to a day trader, because it would give them the confidence to buy shares of a company and make money in a short amount of time.
