## Purpose

We want to see whether Google search interest correlates with stock price changes.  The data set includes a daily search-interest number pulled from Google, ranging from 0 to 100.  A value of 0 means there was little to no search traffic compared to normal.  A value of 100 means the search traffic for that day was extremely high.

First I will check whether search interest and the opening price can predict the closing price for the day.
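
For intuition about the 0 - 100 scale, here is a minimal sketch of how a relative interest index of this kind can be computed. It assumes each day's raw count is divided by the peak count in the window and multiplied by 100. The raw counts and the function name `scale_to_index` are made up for illustration and are not taken from the data set.

```python
def scale_to_index(raw_counts):
    """Scale raw daily counts to integers in 0..100 relative to the peak day."""
    peak = max(raw_counts)
    if peak == 0:
        # No traffic at all in the window: every day scores 0.
        return [0 for _ in raw_counts]
    return [round(100 * count / peak) for count in raw_counts]


# Hypothetical raw search counts for five days (not real data).
raw = [120, 300, 6000, 1500, 60]
index = scale_to_index(raw)
print(index)  # [2, 5, 100, 25, 1]
```

Under this assumption a single very busy day pushes every other day toward the low end of the scale, which would be consistent with the small values in most rows of the data below.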


# Imports

In [1]:
import yfinance as yf
import pandas as pd
from pytrends.request import TrendReq
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
import random as rnd
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, recall_score, precision_score, fbeta_score, classification_report

import math 
from sklearn.metrics import explained_variance_score, mean_absolute_error, r2_score, mean_squared_error

rnd.seed(1024)

# Common methods

In [2]:
def printMetrics(test, predictions):
    """Print regression metrics for the true values (test) and the predictions."""
    # "Score" is the explained variance score.
    print(f"Score: {explained_variance_score(test, predictions):.2f}")
    print(f"MAE: {mean_absolute_error(test, predictions):.2f}")
    print(f"RMSE: {math.sqrt(mean_squared_error(test, predictions)):.2f}")
    print(f"r2: {r2_score(test, predictions):.2f}")

def printClassificationMetrics(test, predictions):
    """Print the confusion matrix and classification metrics for binary labels."""
    print("Confusion Matrix:")
    print(confusion_matrix(test, predictions))
    print("------------------")
    print(f"Accuracy: {accuracy_score(test, predictions):.2f}")
    print(f"Recall: {recall_score(test, predictions):.2f}")
    # precision_score is printed under the "Prediction" label.
    print(f"Prediction: {precision_score(test, predictions):.2f}")
    # fbeta_score with beta=1 is the F1 score.
    print(f"f-measure: {fbeta_score(test, predictions, beta=1):.2f}")
    print("------------------")
    print(classification_report(test, predictions))


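def prepareDataForPredictions(amcDataFrame):
    """Build three random rows, with one random integer per feature column.

    Relies on the global X defined later in the notebook to choose the columns.
    """
    numElements = 3
    amcStockPreparedData = []
    for _ in range(numElements):
        sample = {}  # named sample so the built-in dict is not shadowed
        for column in X.columns:
            minValue = 0  # assume min = 0
            maxValue = round(max(amcDataFrame[column].values))
            sample[column] = rnd.randint(minValue, maxValue)
        amcStockPreparedData.append(sample)
    return amcStockPreparedData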

# Pull in prepared data for AMC stock: January 1 - June 30

In [3]:
amcMergedDataFrame = pd.read_csv('Data Files/AMCDataClean.zip')  
amcMergedDataFrame

Unnamed: 0,Search Interest,Open,Close,Volume,Amount Changed,Days Spread,Price Increase,Search Interest Above Avg
0,2,2.200000,2.010000,29873800,0.190000,0.200000,1,0
1,3,1.990000,1.980000,28148300,0.010000,0.120000,1,0
2,2,2.030000,2.010000,67363300,0.020000,0.260000,1,0
3,2,2.080000,2.050000,26150500,0.030000,0.090000,1,0
4,3,2.090000,2.140000,39553300,-0.050000,0.140000,0,0
...,...,...,...,...,...,...,...,...
118,17,57.040001,58.299999,116291800,-1.259998,4.299999,0,1
119,16,57.980000,56.700001,80351200,1.279999,3.099998,1,1
120,19,55.750000,54.060001,77596900,1.689999,3.320000,1,1
121,16,55.099998,58.110001,99310200,-3.010002,5.029999,0,1
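
As a quick sanity check on the derived columns, the sketch below tests two relationships on the nine rows displayed above: that `Amount Changed` equals `Open` minus `Close`, and that `Price Increase` is 1 exactly when that difference is positive. Both relationships are my reading of the displayed rows, not something taken from the data-preparation step, and the variable names are only for illustration.

```python
# (Open, Close, Amount Changed, Price Increase) copied from the rows displayed above.
displayed_rows = [
    (2.200000, 2.010000, 0.190000, 1),
    (1.990000, 1.980000, 0.010000, 1),
    (2.030000, 2.010000, 0.020000, 1),
    (2.080000, 2.050000, 0.030000, 1),
    (2.090000, 2.140000, -0.050000, 0),
    (57.040001, 58.299999, -1.259998, 0),
    (57.980000, 56.700001, 1.279999, 1),
    (55.750000, 54.060001, 1.689999, 1),
    (55.099998, 58.110001, -3.010002, 0),
]

TOLERANCE = 1e-5  # the displayed values are rounded to six decimals

# Does Amount Changed equal Open - Close on every displayed row?
amount_matches = all(
    abs((open_ - close) - changed) <= TOLERANCE
    for open_, close, changed, _ in displayed_rows
)
# Is Price Increase set exactly when Amount Changed is positive?
flag_matches = all(
    flag == int(changed > 0) for _, _, changed, flag in displayed_rows
)
print(amount_matches, flag_matches)  # True True
```

If this reading holds for the full file, `Price Increase` is set on days where the stock closed below its open, so the column name may be worth double-checking against how the file was prepared.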


# Linear Regression Setup

In [4]:
featureColumns=['Search Interest', 'Open']
targetColumn = 'Close'

X=amcMergedDataFrame[featureColumns]
y=amcMergedDataFrame[targetColumn]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
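
To make the split sizes concrete, here is a small sketch of the arithmetic, followed by a chronological alternative. It assumes the test size is rounded up, which matches the 31 test rows reported in the classification report later in the notebook. `chronological_split` is a hypothetical helper, not part of scikit-learn. Because the rows are daily prices, a random split mixes earlier and later days in both sets, so holding out the most recent days may be a stricter check.

```python
import math

n_samples = 123   # length reported for the data set later in the notebook
test_size = 0.25

n_test = math.ceil(test_size * n_samples)   # 31 rows held out
n_train = n_samples - n_test                # 92 rows used for fitting


def chronological_split(rows, test_fraction):
    """Hold out the last `test_fraction` of rows instead of a random sample."""
    n_holdout = math.ceil(test_fraction * len(rows))
    if n_holdout == 0:
        return list(rows), []
    return list(rows[:-n_holdout]), list(rows[-n_holdout:])


days = list(range(n_samples))   # stand-in for the row index of the data frame
train_days, test_days = chronological_split(days, test_size)
print(n_train, n_test, train_days[-1], test_days[0])  # 92 31 91 92
```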

# Create the regression

In [5]:
lr = LinearRegression()
lr

LinearRegression()

# Fit Linear Model

In [6]:
lr.fit(X_train, y_train)

LinearRegression()
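
For readers who want to see what fitting a linear model amounts to, the sketch below solves the same kind of least-squares problem with NumPy on the nine rows displayed in the table earlier in the notebook. It is only an illustration: the names `design`, `target`, and `predict` are mine, and the coefficients it finds come from nine rows, so they will not match the model fitted on the full training set.

```python
import numpy as np

# (Search Interest, Open, Close) from the nine rows displayed earlier.
rows = np.array([
    (2, 2.200000, 2.010000),
    (3, 1.990000, 1.980000),
    (2, 2.030000, 2.010000),
    (2, 2.080000, 2.050000),
    (3, 2.090000, 2.140000),
    (17, 57.040001, 58.299999),
    (16, 57.980000, 56.700001),
    (19, 55.750000, 54.060001),
    (16, 55.099998, 58.110001),
])
features = rows[:, :2]
target = rows[:, 2]

# A leading column of ones lets the first coefficient act as the intercept.
design = np.column_stack([np.ones(len(features)), features])
coef, *_ = np.linalg.lstsq(design, target, rcond=None)
intercept, w_search, w_open = coef


def predict(search_interest, open_price):
    """Predicted close = intercept + weighted search interest + weighted open."""
    return intercept + w_search * search_interest + w_open * open_price


print(predict(17, 57.04))
```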

# How confident are we in our model?

In [7]:
lr.score(X_train, y_train) 

0.9691667374025036

In [8]:
lr.score(X_test, y_test) 

0.9709538923793088

# Print the regression metrics for the test set predictions

In [9]:
predictions = lr.predict(X_test)
printMetrics(y_test, predictions)

Score: 0.97
MAE: 1.91
RMSE: 3.13
r2: 0.97
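
The four numbers above can be reproduced from their definitions. The sketch below is a plain NumPy version using toy values, not the notebook's data. `regression_metrics` is a hypothetical helper written for this illustration.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute explained variance, MAE, RMSE and r2 from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                    # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation
    return {
        "explained_variance": 1.0 - residuals.var() / y_true.var(),
        "mae": np.mean(np.abs(residuals)),
        "rmse": np.sqrt(np.mean(residuals ** 2)),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Toy values chosen so the arithmetic is easy to follow.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Explained variance and r2 agree whenever the residuals average to zero, which is consistent with both printing 0.97 above. MAE and RMSE are in the same units as the close price, so they are easier to judge against the price level of the stock.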


# Create new samples to test our model

In [10]:
amcStockPreparedData = prepareDataForPredictions(amcMergedDataFrame)

# Prepare the predictions for consumption

In [11]:
amcPreparedData = pd.DataFrame.from_dict(amcStockPreparedData)
amcPreparedData

Unnamed: 0,Search Interest,Open
0,2,30
1,49,20
2,66,6
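
The sample rows above are drawn one column at a time, each as a random integer between 0 and that column's maximum. The sketch below shows the same idea in a self-contained form. The function name `draw_random_samples`, the explicit `column_maxima` argument, and the maxima themselves are assumptions for illustration only.

```python
import random


def draw_random_samples(column_maxima, n_samples=3, seed=1024):
    """Draw one random integer per column, independently for each sample."""
    rng = random.Random(seed)   # local generator, so the global seed is untouched
    samples = []
    for _ in range(n_samples):
        # randint includes both ends, so 0 and the column maximum are possible.
        samples.append(
            {column: rng.randint(0, round(maximum))
             for column, maximum in column_maxima.items()}
        )
    return samples


# Hypothetical column maxima; the real ones come from the data frame.
samples = draw_random_samples({"Search Interest": 100, "Open": 60})
for sample in samples:
    print(sample)
```

Because each column is drawn independently, a sample can pair a high search interest with a low opening price, as in the last row above, even if that combination rarely or never occurs in the real data.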


# Predict what the close price will be

In [12]:
predictions = lr.predict(amcPreparedData)
predictions

array([29.84559392, 22.07508896,  8.74029546])

# Make it pretty

In [13]:
amcPredictedPrice = amcPreparedData.copy()
amcPredictedPrice['Price Prediction'] = predictions
amcPredictedPrice

Unnamed: 0,Search Interest,Open,Price Prediction
0,2,30,29.845594
1,49,20,22.075089
2,66,6,8.740295


# Classification - Logistic Regression

In [14]:
amcMergedDataFrame

Unnamed: 0,Search Interest,Open,Close,Volume,Amount Changed,Days Spread,Price Increase,Search Interest Above Avg
0,2,2.200000,2.010000,29873800,0.190000,0.200000,1,0
1,3,1.990000,1.980000,28148300,0.010000,0.120000,1,0
2,2,2.030000,2.010000,67363300,0.020000,0.260000,1,0
3,2,2.080000,2.050000,26150500,0.030000,0.090000,1,0
4,3,2.090000,2.140000,39553300,-0.050000,0.140000,0,0
...,...,...,...,...,...,...,...,...
118,17,57.040001,58.299999,116291800,-1.259998,4.299999,0,1
119,16,57.980000,56.700001,80351200,1.279999,3.099998,1,1
120,19,55.750000,54.060001,77596900,1.689999,3.320000,1,1
121,16,55.099998,58.110001,99310200,-3.010002,5.029999,0,1


# Prepare the Data and logistic Columns

In [15]:
amcMergedDataFrame['Price Increase'] = amcMergedDataFrame['Price Increase'].astype(int)
amcMergedDataFrame['Search Interest Above Avg'] = amcMergedDataFrame['Search Interest Above Avg'].astype(int)
logisticFeatureColumns=['Open', 'Close']
logisticTargetColumn = 'Search Interest Above Avg'

X=amcMergedDataFrame[logisticFeatureColumns]
y=amcMergedDataFrame[logisticTargetColumn]


In [16]:
y

0      0
1      0
2      0
3      0
4      0
      ..
118    1
119    1
120    1
121    1
122    1
Name: Search Interest Above Avg, Length: 123, dtype: int64

# Split the data from above into training and test sets

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Create the Logistic Regression

In [18]:
lr = LogisticRegression(solver="liblinear")
lr

LogisticRegression(solver='liblinear')

# Fit the data

In [19]:
lr.fit(X_train, y_train)

LogisticRegression(solver='liblinear')

# Score the model

In [20]:
lr.score(X_train, y_train)

0.9021739130434783

In [21]:
lr.score(X_test, y_test)

0.9032258064516129

# Predict on the test set and print the classification metrics

In [22]:
predictions = lr.predict(X_test)
printClassificationMetrics(y_test, predictions)

Confusion Matrix:
[[21  0]
 [ 3  7]]
------------------
Accuracy: 0.90
Recall: 0.70
Prediction: 1.00
f-measure: 0.82
------------------
              precision    recall  f1-score   support

           0       0.88      1.00      0.93        21
           1       1.00      0.70      0.82        10

    accuracy                           0.90        31
   macro avg       0.94      0.85      0.88        31
weighted avg       0.92      0.90      0.90        31
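
The headline numbers can be checked by hand from the confusion matrix. The sketch below reads the matrix with rows as actual classes and columns as predicted classes, and also computes the accuracy of always predicting class 0 as a point of comparison. The variable names are mine.

```python
# Confusion matrix copied from the output above.
# Rows are the actual class (0, 1), columns are the predicted class (0, 1).
tn, fp = 21, 0
fn, tp = 3, 7
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
recall = tp / (tp + fn)        # share of real above-average days that were found
precision = tp / (tp + fp)     # share of predicted above-average days that were right
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a model that always predicts class 0 (the majority class).
baseline_accuracy = (tn + fp) / total

print(f"{accuracy:.2f} {recall:.2f} {precision:.2f} {f1:.2f}")  # 0.90 0.70 1.00 0.82
print(f"{baseline_accuracy:.2f}")                               # 0.68
```

The comparison suggests the model adds something over always guessing "not above average", although with only 31 test rows the estimate is rough.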



# Create the data set to use for the predictions

In [23]:
amcStockPreparedData = prepareDataForPredictions(amcMergedDataFrame)
amcPreparedData = pd.DataFrame.from_dict(amcStockPreparedData)
amcPreparedData


Unnamed: 0,Open,Close
0,28,46
1,61,47
2,24,12


# Use the dummy dataset to test our prediction

In [24]:
predictions = lr.predict(amcPreparedData)
predictions

array([1, 1, 1])

In [25]:
pdPredictedStockTrend = amcPreparedData.copy()
pdPredictedStockTrend["Search Interest Above Average"] = predictions.astype(bool)
pdPredictedStockTrend

Unnamed: 0,Open,Close,Search Interest Above Average
0,28,46,True
1,61,47,True
2,24,12,True
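
All three samples come back as above average. One way to think about that result is to look at how a logistic model turns prices into a label: a weighted sum passes through the sigmoid function and is compared with a threshold of 0.5. The sketch below uses hypothetical weights chosen only for illustration. They are NOT the fitted coefficients, and `predict_label` is not part of scikit-learn.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(open_price, close_price, w_open, w_close, bias, threshold=0.5):
    """Return (probability, label) for one day under the given weights."""
    z = bias + w_open * open_price + w_close * close_price
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


# Hypothetical weights, NOT the fitted coefficients.
W_OPEN, W_CLOSE, BIAS = 0.1, 0.1, -2.0

# The three random samples above, plus the first row of the real data.
days = [(28, 46), (61, 47), (24, 12), (2.2, 2.01)]
labels = []
for open_price, close_price in days:
    probability, label = predict_label(open_price, close_price, W_OPEN, W_CLOSE, BIAS)
    labels.append(label)
    print(open_price, close_price, round(probability, 3), label)
```

With weights like these, any sample priced in the tens of dollars lands on the "above average" side. Calling `lr.predict_proba(amcPreparedData)` on the fitted model would show how close each real prediction was to the threshold.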


# Conclusion

The linear regression model worked very well.  On the test set it reached an r2 of 0.97, so I was able to predict the close price with high accuracy from the opening price and the search interest for the day.  In the real world this would be hard to use, because the search interest and the price changes are happening at the same time.

I am not as confident in the predictions of my second model, which uses logistic regression.  Here I looked from the other side to see if I could predict whether search interest was above average from the open and close prices of AMC for each day.  I thought it would be interesting if it showed that search interest is high only when the stock is doing well.  That did not prove out as a trend, but then again the accuracy of the model was high (0.90 on the test set, with a recall of 0.70 for the above-average class).
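
One possible way around the timing problem with same-day search interest is to use the previous day's value as the feature, since that number is known before the market opens. The sketch below shows the shift on the first five search-interest values from the table earlier in the notebook. `previous_day` is a hypothetical helper, and I have not tested whether a model built on the shifted feature still predicts well.

```python
def previous_day(values):
    """Shift values down by one day; the first day has no previous value."""
    return [None] + list(values[:-1])


# First five search-interest values from the table earlier in the notebook.
search_interest = [2, 3, 2, 2, 3]
lagged = previous_day(search_interest)

# Keep only the days that have a previous-day value to use as a feature.
usable = [
    (day, prev) for day, prev in enumerate(lagged) if prev is not None
]
print(lagged)   # [None, 2, 3, 2, 2]
print(usable)   # [(1, 2), (2, 3), (3, 2), (4, 2)]
```

With pandas the same shift can be written as `amcMergedDataFrame['Search Interest'].shift(1)`, followed by dropping the first row.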