## Purpose

We want to see whether Google search interest correlates with stock price changes. In this data set, each day has a search interest value pulled from Google, between 0 and 100. Google Trends scales these values relative to the busiest day in the selected time range: 100 marks the peak search traffic, and 0 means there was little to no traffic by comparison.
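To make the 0 - 100 scale concrete, here is a minimal sketch of that kind of scaling, assuming the value is each day's traffic relative to the busiest day in the window. The function name `scale_to_interest` and the raw counts are hypothetical; Google does not publish raw search counts, and its real calculation may differ in detail.

```python
def scale_to_interest(raw_counts):
    """Scale raw daily counts to 0-100 relative to the busiest day."""
    peak = max(raw_counts)
    if peak == 0:
        # No traffic at all: every day scores 0.
        return [0 for _ in raw_counts]
    return [round(100 * count / peak) for count in raw_counts]

# Hypothetical raw search counts for five days (not real data).
raw_counts = [120, 300, 6000, 1500, 90]
interest = scale_to_interest(raw_counts)
print(interest)
```

The busiest day always maps to 100, so a value only has meaning relative to the other days pulled in the same request.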

First I will check whether search interest and the opening price can predict the day's closing price.


# Imports

In [1]:
import yfinance as yf
import pandas as pd
from pytrends.request import TrendReq
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import random as rnd

import math 
from sklearn.metrics import explained_variance_score, mean_absolute_error, r2_score, mean_squared_error

rnd.seed(1024)

# Common methods

In [2]:
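def printMetrics(test, predictions):
    """Print regression metrics comparing the true values with the predictions."""
    # "Score" is the explained variance score; lr.score() reports r2 instead.
    print(f"Score: {explained_variance_score(test, predictions):.2f}")
    print(f"MAE: {mean_absolute_error(test, predictions):.2f}")
    print(f"RMSE: {math.sqrt(mean_squared_error(test, predictions)):.2f}")
    print(f"r2: {r2_score(test, predictions):.2f}")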

# Pull in prepared data for AMC stock: January 1 - June 30

In [3]:
amcMergedDataFrame = pd.read_csv('AMCDataClean.zip')  
amcMergedDataFrame

Unnamed: 0,Search Interest,Open,Close,Volume,Amount Changed,Days Spread
0,2,2.200000,2.010000,29873800,0.190000,0.200000
1,3,1.990000,1.980000,28148300,0.010000,0.120000
2,2,2.030000,2.010000,67363300,0.020000,0.260000
3,2,2.080000,2.050000,26150500,0.030000,0.090000
4,3,2.090000,2.140000,39553300,-0.050000,0.140000
...,...,...,...,...,...,...
118,17,57.040001,58.299999,116291800,-1.259998,4.299999
119,16,57.980000,56.700001,80351200,1.279999,3.099998
120,19,55.750000,54.060001,77596900,1.689999,3.320000
121,16,55.099998,58.110001,99310200,-3.010002,5.029999
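Since the goal is to look for a correlation, a quick first check is a Pearson correlation matrix. This is a sketch that uses only the nine rows visible in the preview above, so the numbers are illustrative and not a result for the full data set. `Amount Changed` is recomputed as `Open - Close`, which matches the preview values.

```python
import pandas as pd

# The nine rows visible in the data preview (not the full data set).
preview = pd.DataFrame({
    "Search Interest": [2, 3, 2, 2, 3, 17, 16, 19, 16],
    "Open": [2.20, 1.99, 2.03, 2.08, 2.09, 57.040001, 57.98, 55.75, 55.099998],
    "Close": [2.01, 1.98, 2.01, 2.05, 2.14, 58.299999, 56.700001, 54.060001, 58.110001],
})
preview["Amount Changed"] = preview["Open"] - preview["Close"]

# Pairwise Pearson correlation between all numeric columns.
correlations = preview.corr(method="pearson")
print(correlations["Search Interest"].round(2))
```

On the full frame the same call would be `amcMergedDataFrame.corr()`. A strong correlation between search interest and price level does not by itself show that search interest predicts the daily change.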


# Linear Regression Setup

In [4]:
featureColumns=['Search Interest', 'Open']
targetColumn = 'Close'

X=amcMergedDataFrame[featureColumns]
y=amcMergedDataFrame[targetColumn]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
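The split above shuffles the rows before splitting. Because each row is a trading day in date order, an alternative worth comparing is a chronological split that holds out the most recent days. The sketch below uses `shuffle=False` for that; the eight rows are a hypothetical stand-in rounded from the data preview, not the full data set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the notebook's data: eight trading days in date order.
frame = pd.DataFrame({
    "Search Interest": [2, 3, 2, 2, 3, 17, 16, 19],
    "Open": [2.20, 1.99, 2.03, 2.08, 2.09, 57.04, 57.98, 55.75],
    "Close": [2.01, 1.98, 2.01, 2.05, 2.14, 58.30, 56.70, 54.06],
})
X = frame[["Search Interest", "Open"]]
y = frame["Close"]

# shuffle=False keeps the row order, so the last 25% of days form the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False
)
print(list(X_train.index), list(X_test.index))
```

With a shuffled split, test days sit between training days, which can make the scores look better than they would on genuinely unseen future days.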

# Create the regression

In [5]:
lr = LinearRegression()
lr

LinearRegression()

# Fit Linear Model

In [6]:
lr.fit(X_train, y_train)

LinearRegression()

# How well does the model fit? (r2 on the training and test sets)

In [7]:
lr.score(X_train, y_train) 

0.9691667374025036

In [8]:
lr.score(X_test, y_test) 

0.9709538923793088
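A high r2 is easier to judge next to a baseline. A day's close is usually near its open, so simply predicting `Close = Open` may already score well. The sketch below measures that naive baseline on the nine rows visible in the data preview only, so its numbers are illustrative and not directly comparable to the scores above.

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Open and Close for the nine rows visible in the data preview (not the full data set).
open_prices = [2.20, 1.99, 2.03, 2.08, 2.09, 57.040001, 57.98, 55.75, 55.099998]
close_prices = [2.01, 1.98, 2.01, 2.05, 2.14, 58.299999, 56.700001, 54.060001, 58.110001]

# Naive baseline: predict that the stock closes at exactly its opening price.
baseline_r2 = r2_score(close_prices, open_prices)
baseline_mae = mean_absolute_error(close_prices, open_prices)
print(f"baseline r2: {baseline_r2:.3f}, baseline MAE: {baseline_mae:.2f}")
```

If the baseline on `y_test` scores about as well as the model, most of the fit comes from `Open`, and search interest adds little.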

# Print error metrics for the test-set predictions

In [9]:
predictions = lr.predict(X_test)
printMetrics(y_test, predictions)

Score: 0.97
MAE: 1.91
RMSE: 3.13
r2: 0.97
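To show what each of these numbers measures, here is a sketch that computes them by hand with NumPy. The values are hypothetical and chosen so that every prediction is exactly 1.0 too high, which makes the difference between the explained variance ("Score") and r2 visible.

```python
import numpy as np

# Hypothetical values: every prediction is exactly 1.0 too high.
y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([3.0, 5.0, 7.0, 9.0])
residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))         # average absolute error
rmse = np.sqrt(np.mean(residuals ** 2))  # penalizes large errors more
# Explained variance ignores a constant offset in the predictions; r2 does not.
explained_variance = 1 - np.var(residuals) / np.var(y_true)
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(mae, rmse, explained_variance, r2)
```

Here the explained variance is 1.0 while r2 is 0.8. In the output above the two agree to two decimals, which suggests the predictions carry little systematic offset.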


# Create new samples to test our model

In [10]:
numElements = 3
amcStockPreparedData = []
for _ in range(numElements):
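    sample = {}  # named "sample" to avoid shadowing the built-in dict
    for column in X.columns:
        minValue = 0  # assume min = 0; also avoids shadowing the built-in min
        maxValue = round(max(amcMergedDataFrame[column].values))
        sample[column] = rnd.randint(minValue, maxValue)
    amcStockPreparedData.append(sample)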
amcStockPreparedData

[{'Search Interest': 2, 'Open': 30},
 {'Search Interest': 49, 'Open': 20},
 {'Search Interest': 66, 'Open': 6}]

# Convert the new samples to a DataFrame for prediction

In [11]:
amcPreparedData = pd.DataFrame.from_dict(amcStockPreparedData)
amcPreparedData

Unnamed: 0,Search Interest,Open
0,2,30
1,49,20
2,66,6


# Predict what the close price will be

In [13]:
predictions = lr.predict(amcPreparedData)
predictions

array([29.84559392, 22.07508896,  8.74029546])
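Each prediction is just the fitted intercept plus each coefficient times its feature value. The sketch below checks that by hand against `predict`. The training rows are hypothetical (not the AMC data) and only reuse the notebook's column names, so the printed coefficients are not the ones from the model above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training rows (not the AMC data), using the notebook's column names.
train = pd.DataFrame({
    "Search Interest": [2, 3, 17, 16, 19, 8],
    "Open": [2.2, 2.0, 57.0, 58.0, 55.8, 12.5],
    "Close": [2.0, 2.1, 58.3, 56.7, 54.1, 13.0],
})
features = ["Search Interest", "Open"]
model = LinearRegression().fit(train[features], train["Close"])

new_rows = pd.DataFrame({"Search Interest": [2, 49], "Open": [30, 20]})

# A linear model predicts intercept + sum(coefficient * feature value).
manual = model.intercept_ + new_rows[features].to_numpy() @ model.coef_
print(dict(zip(features, model.coef_)), model.intercept_)
print(np.allclose(manual, model.predict(new_rows)))
```

Printing `lr.coef_` and `lr.intercept_` for the model above would show how much weight it gives search interest compared with the opening price.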

# Make it pretty

In [14]:
amcPredictedPrice = amcPreparedData.copy()
amcPredictedPrice['Price Prediction'] = predictions
amcPredictedPrice

Unnamed: 0,Search Interest,Open,Price Prediction
0,2,30,29.845594
1,49,20,22.075089
2,66,6,8.740295


# Classical - Linear Regression

In [None]:
lr = LinearRegression()    # use this algorithm to fit a line through the data points
lr

