# Homework 7
Stock and Google Search Correlation Analysis 2
#### Group 1
#### 20 July 2021

## Introduction
This notebook imports daily stock prices and Google search interest over time, then applies machine learning models to each of the following stocks:
* GameStop (GME)
* Apple (AAPL)
* Coke (KO)
* John Deere (DE)
* AMC (AMC)

## Import

In [332]:
import yfinance as yf
import pandas as pd
import numpy as np
import os
import math
import plotly.express as px
import random as rnd
from datetime import date
from datetime import timedelta
from pytrends.request import TrendReq
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, recall_score, precision_score, fbeta_score, classification_report
from sklearn.metrics import explained_variance_score, mean_absolute_error, r2_score, mean_squared_error
from sklearn.cluster import KMeans                      # k-means clustering
from sklearn.model_selection import train_test_split    # For generating test/train
from sklearn.linear_model import LinearRegression   # Linear regression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import Ridge
from sklearn.linear_model import LogisticRegression
%matplotlib inline

## Global Variables and Initialization

In [333]:
dataDir = r"./Data Files/"  #Directory of all data
today = date.today()  # Today's date
rnd.seed(1024)

## Global Functions

In [334]:
# Function gets stock data and trend data if needed
def get_data(ticker):
    if os.path.exists(f"{dataDir}{ticker}_{today}_year.csv"):
        #Get stored data
        stored_data = pd.read_csv(f"{dataDir}{ticker}_{today}_year.csv")
        # Get rid of index name
        stored_data.set_index('Unnamed: 0', inplace=True)
        stored_data.index.name = None
        return stored_data
    else:
        #Get new data
        # Connect to Google API
        pytrends = TrendReq(hl='en-US', tz=360)
        # Set Keyword
        kw_list = [ticker]
        # Google API only shows last 90 days so need to iterate
        # Set end of the first interval
        date90front = date.today()
        # Initiate dataframe
        trend_data = pd.DataFrame()
        for x in range(4):
            # Set start of interval
            date90back = date90front - timedelta(days=90)
            # Build Payload of 90 days
            pytrends.build_payload(kw_list,
                                   timeframe=f'{date90back} {date90front}',
                                   geo='')
            trend_90 = pytrends.interest_over_time()
            trend_data = pd.concat([trend_90, trend_data])
            date90front = date90back
        # Get Stock Data
        stock_data = yf.download(ticker,
                                 start=date.today() - timedelta(days=360),
                                 end=date.today(), interval="1d")
        # Combine Data
        new_data = stock_data.join(trend_data)
        # Export to data folder
        new_data.to_csv(f"{dataDir}{ticker}_{today}_year.csv")
        return new_data

# Function prints metrics of regression model
def PrintMetricsRegression(test, predictions):
    print(f"Score: {explained_variance_score(test, predictions):.2f}")
    print(f"MAE: {mean_absolute_error(test, predictions):.2f}")
    print(f"RMSE: {math.sqrt(mean_squared_error(test, predictions)):.2f}")
    print(f"r2: {r2_score(test, predictions):.2f}")

# Function prints metrics of classification model
def PrintMetricsClassiffication(test, predictions):
    print("Confusion Matrix:")
    print(confusion_matrix(test, predictions))
    print("------------------")
    print(f"Accuracy: {accuracy_score(test, predictions):.2f}")
    print(f"Recall: {recall_score(test, predictions):.2f}")
    print(f"Prediction: {precision_score(test, predictions):.2f}")
    print(f"f-measure: {fbeta_score(test, predictions, beta=1):.2f}")
    print("------------------")
    print(classification_report(test, predictions))


# Function generates random sample rows for predictions
def prepareDataForPredictions(X_df):
    numElements = 3
    random_df = []
    for _ in range(numElements):
        row = {}  # avoids shadowing the built-in dict
        for column in X_df.columns:
            minValue = 0  # assume min = 0
            maxValue = round(max(X_df[column].values))
            row[column] = rnd.randint(minValue, maxValue)
        random_df.append(row)
    return random_df

# Create categorical dummies
def createCategoricalDummies(dataFrame, categoryList):
    return pd.get_dummies(dataFrame[categoryList], prefix_sep = "::", drop_first = True)
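The 90-day stepping inside `get_data` can be sketched on its own. In the loop above, each window starts on the same date that the next window ends on, so a boundary date may appear twice after concatenation. The helper below (the name `trend_windows` is hypothetical, not part of the notebook) builds four adjacent windows with no shared day. As an additional caveat, Google Trends appears to scale each request relative to its own peak, so values from separately requested windows may not be on a common scale.

```python
from datetime import date, timedelta

def trend_windows(last_day, n_windows=4, days=90):
    """Return (start, end) date pairs, newest first, with no shared boundary day."""
    windows = []
    window_end = last_day
    for _ in range(n_windows):
        # A window of `days` days, counting both endpoints
        window_start = window_end - timedelta(days=days - 1)
        windows.append((window_start, window_end))
        # Step back one extra day so the next window does not repeat a date
        window_end = window_start - timedelta(days=1)
    return windows

for first, last in trend_windows(date(2021, 7, 20)):
    # Same "start end" string format that get_data passes as the timeframe
    print(f"{first} {last}")
```

Each printed line could be passed as the `timeframe` argument in place of `f'{date90back} {date90front}'`.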

## Data and Analysis

### GameStop (GME)
Connor Moore

#### Get Data

In [335]:
# Gets Data for last year
GME_DF = get_data("GME")
GME_DF

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume,GME,isPartial
2020-07-27,4.020000,4.120000,3.950000,4.010000,4.010000,2472700,25.0,False
2020-07-28,3.960000,4.050000,3.920000,3.940000,3.940000,4555400,30.0,False
2020-07-29,3.940000,4.180000,3.920000,4.060000,4.060000,2879600,33.0,False
2020-07-30,4.000000,4.230000,3.970000,4.100000,4.100000,2398500,22.0,False
2020-07-31,4.060000,4.160000,3.990000,4.010000,4.010000,1879400,23.0,False
...,...,...,...,...,...,...,...,...
2021-07-14,180.490005,182.380005,165.070007,167.619995,167.619995,3913800,28.0,False
2021-07-15,160.000000,171.990005,158.009995,166.820007,166.820007,4298600,31.0,False
2021-07-16,170.149994,179.470001,166.300003,169.039993,169.039993,3278800,27.0,False
2021-07-19,163.300003,176.000000,161.220001,173.490005,173.490005,2436900,,


#### Prepare Data

In [336]:
# Rename search interest
GME_DF.rename(columns = {"GME": "Search Interest"},inplace = True)
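# Add difference (Open - Close, so a positive value means the price fell during the day)
GME_DF["Price Difference"] = GME_DF["Open"]-GME_DF["Close"]
# Add truth value that determines if we want to buy or not that day (1 when Open > Close)
GME_DF['Buy'] = np.where(GME_DF['Price Difference'] > 0, 1, 0)
# Shift search interest up one row: each row now holds the NEXT day's value
# (shift(1) would give the previous day's value instead)
GME_DF['Search Interest'] = GME_DF['Search Interest'].shift(-1)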
# Delete isPartial
del GME_DF['isPartial']
# Remove NaN
GME_DF.dropna(inplace=True)
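The direction of the shift matters for what the model is allowed to see. A minimal sketch, using the first five GME search-interest values from the table above, compares `shift(-1)` with `shift(1)`:

```python
import pandas as pd

# First five GME search-interest values from the table above
interest = pd.Series(
    [25.0, 30.0, 33.0, 22.0, 23.0],
    index=["2020-07-27", "2020-07-28", "2020-07-29", "2020-07-30", "2020-07-31"],
)

comparison = pd.DataFrame({
    "same day": interest,
    "shift(-1): next day": interest.shift(-1),      # what the cell above uses
    "shift(1): previous day": interest.shift(1),    # the day-before value
})
print(comparison)
```

With `shift(-1)`, the row for 2020-07-27 carries the value from 2020-07-28, which matches the `X` table printed later in this section. If the goal is to predict from the previous day's interest, `shift(1)` would be the candidate to try.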

In [337]:
# Check values - no nulls - int or float
GME_DF.info()

<class 'pandas.core.frame.DataFrame'>
Index: 247 entries, 2020-07-27 to 2021-07-15
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Open              247 non-null    float64
 1   High              247 non-null    float64
 2   Low               247 non-null    float64
 3   Close             247 non-null    float64
 4   Adj Close         247 non-null    float64
 5   Volume            247 non-null    int64  
 6   Search Interest   247 non-null    float64
 7   Price Difference  247 non-null    float64
 8   Buy               247 non-null    int64  
dtypes: float64(7), int64(2)
memory usage: 19.3+ KB


#### Model 1 - Linear Regression

In [338]:
# Set features to target "Close"
features = ['Open','Search Interest']
target = "Close"
print(f"Feature categories: {features}")
print(f"Target feature: {target}")

Feature categories: ['Open', 'Search Interest']
Target feature: Close


In [339]:
X = GME_DF[features]
X

Unnamed: 0,Open,Search Interest
2020-07-27,4.020000,30.0
2020-07-28,3.960000,33.0
2020-07-29,3.940000,22.0
2020-07-30,4.000000,23.0
2020-07-31,4.060000,23.0
...,...,...
2021-07-09,190.880005,23.0
2021-07-12,191.419998,25.0
2021-07-13,187.679993,28.0
2021-07-14,180.490005,31.0


In [340]:
y = GME_DF[target]
y

2020-07-27      4.010000
2020-07-28      3.940000
2020-07-29      4.060000
2020-07-30      4.100000
2020-07-31      4.010000
                 ...    
2021-07-09    191.229996
2021-07-12    189.250000
2021-07-13    180.059998
2021-07-14    167.619995
2021-07-15    166.820007
Name: Close, Length: 247, dtype: float64

In [341]:
# Set training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

print(f"Length of X_train (feature training set): {len(X_train)}")
print(f"Length of y_train (target training set): {len(y_train)}")
print(f"Length of X_test (feature test set): {len(X_test)}")
print(f"Length of y_test (target test set): {len(y_test)}")

Length of X_train (feature training set): 185
Length of y_train (target training set): 185
Length of X_test (feature test set): 62
Length of y_test (target test set): 62
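The split above is not tied to `rnd.seed(1024)`: that call seeds Python's `random` module, while scikit-learn draws from its own random state, so each run can produce different scores. The AMC section later passes `random_state=1` to `train_test_split` for this reason. The plain-Python sketch below is a stand-in for that idea, not scikit-learn's implementation. The function name `seeded_split` and the rounding rule are assumptions, chosen because they reproduce the 185/62 lengths printed above.

```python
import math
import random

def seeded_split(rows, test_size=0.25, seed=1024):
    """Shuffle row positions with a private seeded generator and split them."""
    rng = random.Random(seed)                   # independent of the global rnd state
    positions = list(range(len(rows)))
    rng.shuffle(positions)
    n_test = math.ceil(len(rows) * test_size)   # assumption: test share is rounded up
    test = [rows[i] for i in positions[:n_test]]
    train = [rows[i] for i in positions[n_test:]]
    return train, test

rows = list(range(247))                  # stand-in for the 247 GME rows
train_a, test_a = seeded_split(rows)
train_b, test_b = seeded_split(rows)
print(len(train_a), len(test_a))         # 185 62
print(test_a == test_b)                  # True: same seed, same split
```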


In [342]:
rr = Ridge(solver='svd')
rr

Ridge(solver='svd')

In [343]:
rr.fit(X_train, y_train)

Ridge(solver='svd')

In [344]:
rr.score(X_train, y_train)

0.9725864884920568

In [345]:
rr.score(X_test, y_test)

0.9751790419601696

In [346]:
predictions = rr.predict(X_test)
PrintMetricsRegression(y_test, predictions)

Score: 0.98
MAE: 8.96
RMSE: 14.23
r2: 0.98
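To put these numbers in context, the same metrics can be computed by hand for a naive baseline that predicts the close as equal to the open. The sketch below uses only the eight complete GME rows displayed in the table at the top of this section, so it is an illustration rather than a replacement for the test-set metrics. The helper name `regression_metrics` is hypothetical.

```python
import math

# Open and Close for the eight complete GME rows displayed above
open_price = [4.02, 3.96, 3.94, 4.00, 4.06, 180.490005, 160.0, 170.149994]
close_price = [4.01, 3.94, 4.06, 4.10, 4.01, 167.619995, 166.820007, 169.039993]

def regression_metrics(actual, predicted):
    """Return (MAE, RMSE, R^2) for two equally long lists of numbers."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2

# Naive baseline: predict that the close equals the open
mae, rmse, r2 = regression_metrics(close_price, open_price)
print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"r2: {r2:.2f}")
```

On these eight rows the baseline already reaches an R² above 0.99, because the close stays near the open while the price itself ranges from about 4 to about 180. Comparing MAE and RMSE against such a baseline may say more about the contribution of search interest than R² alone.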


In [347]:
# Get predictor data based on X and convert to dataframe
sampleGME = pd.DataFrame.from_dict(prepareDataForPredictions(X))
sampleGME

Unnamed: 0,Open,Search Interest
0,9,61
1,199,41
2,266,12


In [348]:
predictions = rr.predict(sampleGME)
predictions

array([ 12.05832812, 192.77841298, 256.07989719])

In [349]:
sampleGME_predicted = sampleGME.copy()
sampleGME_predicted['Predictions'] = predictions
sampleGME_predicted

Unnamed: 0,Open,Search Interest,Predictions
0,9,61,12.058328
1,199,41,192.778413
2,266,12,256.079897


#### Model 2 - Logistic Regression

In [350]:
# Set features to target "Buy"
features = ['Open','Search Interest']
target = "Buy"

print(f"Feature categories: {features}")
print(f"Target feature: {target}")

Feature categories: ['Open', 'Search Interest']
Target feature: Buy


In [351]:
X = GME_DF[features]
X

Unnamed: 0,Open,Search Interest
2020-07-27,4.020000,30.0
2020-07-28,3.960000,33.0
2020-07-29,3.940000,22.0
2020-07-30,4.000000,23.0
2020-07-31,4.060000,23.0
...,...,...
2021-07-09,190.880005,23.0
2021-07-12,191.419998,25.0
2021-07-13,187.679993,28.0
2021-07-14,180.490005,31.0


In [352]:
y = GME_DF[target]
y

2020-07-27    1
2020-07-28    1
2020-07-29    0
2020-07-30    0
2020-07-31    1
             ..
2021-07-09    0
2021-07-12    1
2021-07-13    1
2021-07-14    1
2021-07-15    0
Name: Buy, Length: 247, dtype: int64

In [353]:
# Set training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

print(f"Length of X_train (feature training set): {len(X_train)}")
print(f"Length of y_train (target training set): {len(y_train)}")
print(f"Length of X_test (feature test set): {len(X_test)}")
print(f"Length of y_test (target test set): {len(y_test)}")

Length of X_train (feature training set): 185
Length of y_train (target training set): 185
Length of X_test (feature test set): 62
Length of y_test (target test set): 62


In [354]:
lr = LogisticRegression(solver="liblinear")
lr

LogisticRegression(solver='liblinear')

In [355]:
lr.fit(X_train, y_train)

LogisticRegression(solver='liblinear')

In [356]:
lr.score(X_train, y_train)

0.572972972972973

In [357]:
lr.score(X_test, y_test)

0.5967741935483871

In [358]:
predictions = lr.predict(X_test)
PrintMetricsClassiffication(y_test, predictions)

Confusion Matrix:
[[16  4]
 [21 21]]
------------------
Accuracy: 0.60
Recall: 0.50
Prediction: 0.84
f-measure: 0.63
------------------
              precision    recall  f1-score   support

           0       0.43      0.80      0.56        20
           1       0.84      0.50      0.63        42

    accuracy                           0.60        62
   macro avg       0.64      0.65      0.59        62
weighted avg       0.71      0.60      0.61        62
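The summary numbers above follow directly from the confusion matrix, where rows are the actual classes and columns are the predicted classes. A short sketch of the arithmetic, which also shows that the line labelled "Prediction" in the output is the precision:

```python
# Confusion matrix printed above: rows are actual classes, columns are predicted classes
confusion = [[16, 4],
             [21, 21]]
tn, fp = confusion[0]   # actual 0: predicted 0, predicted 1
fn, tp = confusion[1]   # actual 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.2f}")    # 0.60
print(f"Recall: {recall:.2f}")        # 0.50
print(f"Precision: {precision:.2f}")  # 0.84
print(f"f-measure: {f1:.2f}")         # 0.63
```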



In [359]:
# Get predictor data based on X and convert to dataframe
sampleGME = pd.DataFrame.from_dict(prepareDataForPredictions(X))
sampleGME

Unnamed: 0,Open,Search Interest
0,227,65
1,187,92
2,366,47


In [360]:
predictions = lr.predict(sampleGME)
predictions

array([1, 0, 1])

In [361]:
sampleGME_predicted = sampleGME.copy()
sampleGME_predicted['Predictions'] = predictions
sampleGME_predicted

Unnamed: 0,Open,Search Interest,Predictions
0,227,65,1
1,187,92,0
2,366,47,1


#### Analysis
The initial ridge regression used the opening price and the one-day-shifted search interest to predict the same day's closing price. The model fit the training data quite well and did a great job of predicting the closing price, with R² scores of 0.97 on the training set and 0.98 on the test set.

The second model used logistic regression to predict whether we should buy today or not (a calculation of whether net profit is positive or negative) based on the shifted search interest and the opening price. The model scored moderately, with an accuracy of 0.57 in training and 0.60 in testing.

### Apple (AAPL)
Ken Cupples

##### Purpose
The purpose of this notebook is to provide practice with scikit-learn using both linear regression and logistic regression models.  In this example, Apple stock is used to determine whether there is a correlation between Google search interest in Apple stock and the price of the stock itself.

Pull Google search interest for Apple for the second quarter of 2021

In [362]:
pytrends = TrendReq(hl='en-US', tz=360)

#build list of keywords in this case only use AAPL
kw_list = ["AAPL"]

# build the payload
pytrends.build_payload(kw_list, timeframe='2021-04-01 2021-06-30', geo='US')

# Store interest over time information in df and rename AAPL column to Search Interest
AppleTrendsdf = pytrends.interest_over_time()
AppleTrendsdf = AppleTrendsdf.rename(columns={"AAPL": 'Search Interest'})
AppleTrendsdf.reset_index(inplace=True, drop=True)

Prepare the Data

In [363]:
# Drops rows by position to remove weekends and holidays from AppleTrendsdf, since the stock only trades on weekdays
Index = 2
while Index < len(AppleTrendsdf):
    Next = Index +1
    if Next < len(AppleTrendsdf):
        AppleTrendsdf.drop([Index, Next], axis=0, inplace=True)
        Index +=5
    else:
        AppleTrendsdf.drop([Index], axis=0, inplace=True)
        Index +=5
AppleTrendsdf.drop([1, 60], axis=0, inplace=True)
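Dropping rows by position depends on where weekends and holidays fall in the series. An alternative worth considering is to keep the date index that `interest_over_time()` returns and join on dates instead. The sketch below assumes that approach. The search-interest values are made up for illustration, while the trading dates and closes are the first three rows of `AAPL.csv` shown in this section.

```python
import pandas as pd

# Hypothetical daily search-interest values, one per calendar day (illustrative only)
trends = pd.DataFrame(
    {"Search Interest": [37, 20, 9, 7, 44, 43]},
    index=pd.date_range("2021-04-01", periods=6, freq="D"),
)

# Trading days and closing prices from the first rows of AAPL.csv
stock = pd.DataFrame({
    "Date": ["4/1/2021", "4/5/2021", "4/6/2021"],
    "Close": [123.0, 125.900002, 126.209999],
})
stock["Date"] = pd.to_datetime(stock["Date"], format="%m/%d/%Y")

# An inner join keeps only dates present in both frames, so weekends
# and market holidays drop out without any positional bookkeeping
merged = stock.set_index("Date").join(trends, how="inner")
print(merged)
```

Because the join is keyed on the date, each closing price is paired with the search interest recorded for that same calendar day.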


In [364]:
#Import Apple stock data from csv file
AppleStockdf = pd.read_csv(f"{dataDir}AAPL.csv")
AppleStockdf

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,4/1/2021,123.660004,124.18,122.489998,123.0,122.791428,75089100
1,4/5/2021,123.870003,126.160004,123.07,125.900002,125.686516,88651200
2,4/6/2021,126.5,127.129997,125.650002,126.209999,125.995987,80171300
3,4/7/2021,125.830002,127.919998,125.139999,127.900002,127.683121,83466700
4,4/8/2021,128.949997,130.389999,128.520004,130.360001,130.138947,88844600
5,4/9/2021,129.800003,133.039993,129.470001,133.0,132.774475,106686700
6,4/12/2021,132.520004,132.850006,130.630005,131.240005,131.017456,91420000
7,4/13/2021,132.440002,134.660004,131.929993,134.429993,134.202042,91266500
8,4/14/2021,134.940002,135.0,131.660004,132.029999,131.806122,87222800
9,4/15/2021,133.820007,135.0,133.639999,134.5,134.271927,89347100


In [365]:
# Resets the index for the AppleTrends data frame and merges with the Apple stock data for the same time frame
AppleTrendsdf.reset_index(inplace=True, drop=True)
MergedStockApple = pd.concat([AppleStockdf, AppleTrendsdf], axis=1)


# Adds columns to determine search interest and price that is above average for the day of trading.
MeanSearchInterest = AppleTrendsdf["Search Interest"].mean()
MeanPrice = MergedStockApple["Close"].mean()
MergedStockApple["Interest Points Away From Mean"] = MergedStockApple["Search Interest"] - MeanSearchInterest
MergedStockApple["Daily Price Range"] = MergedStockApple["Open"] - MergedStockApple["Close"]
MergedStockApple["Price Points Away from Mean"] =  MergedStockApple["Close"] - MeanPrice
MergedStockApple["Search Interests Above Average"] = MergedStockApple["Interest Points Away From Mean"] > 0.0
MergedStockApple

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Search Interest,isPartial,Interest Points Away From Mean,Daily Price Range,Price Points Away from Mean,Search Interests Above Average
0,4/1/2021,123.660004,124.18,122.489998,123.0,122.791428,75089100,37,False,8.730159,0.660004,-6.568889,True
1,4/5/2021,123.870003,126.160004,123.07,125.900002,125.686516,88651200,44,False,15.730159,-2.029999,-3.668887,True
2,4/6/2021,126.5,127.129997,125.650002,126.209999,125.995987,80171300,43,False,14.730159,0.290001,-3.35889,True
3,4/7/2021,125.830002,127.919998,125.139999,127.900002,127.683121,83466700,39,False,10.730159,-2.07,-1.668887,True
4,4/8/2021,128.949997,130.389999,128.520004,130.360001,130.138947,88844600,9,False,-19.269841,-1.410004,0.791112,False
5,4/9/2021,129.800003,133.039993,129.470001,133.0,132.774475,106686700,7,False,-21.269841,-3.199997,3.431111,False
6,4/12/2021,132.520004,132.850006,130.630005,131.240005,131.017456,91420000,44,False,15.730159,1.279999,1.671116,True
7,4/13/2021,132.440002,134.660004,131.929993,134.429993,134.202042,91266500,51,False,22.730159,-1.989991,4.861104,True
8,4/14/2021,134.940002,135.0,131.660004,132.029999,131.806122,87222800,39,False,10.730159,2.910003,2.46111,True
9,4/15/2021,133.820007,135.0,133.639999,134.5,134.271927,89347100,7,False,-21.269841,-0.679993,4.931111,False


In [366]:
# Converts the boolean column to a 0/1 dummy column and drops the first level, since it is implied.  The info function shows no object or boolean types remain
categories = "Search Interests Above Average"
MergedStockApple = pd.concat(
    [MergedStockApple.drop(categories, axis=1), createCategoricalDummies(MergedStockApple, categories)], axis= 1)
MergedStockApple.drop(["Date", "isPartial"], axis=1, inplace=True)
MergedStockApple.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 11 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Open                            63 non-null     float64
 1   High                            63 non-null     float64
 2   Low                             63 non-null     float64
 3   Close                           63 non-null     float64
 4   Adj Close                       63 non-null     float64
 5   Volume                          63 non-null     int64  
 6   Search Interest                 63 non-null     int64  
 7   Interest Points Away From Mean  63 non-null     float64
 8   Daily Price Range               63 non-null     float64
 9   Price Points Away from Mean     63 non-null     float64
 10  True                            63 non-null     uint8  
dtypes: float64(8), int64(2), uint8(1)
memory usage: 5.1 KB


In [367]:
# Selecting the features: search interest plus the day's low, high, and opening prices
Features = ["Search Interest", "Low", "High", "Open"]
X = MergedStockApple[Features]
X

Unnamed: 0,Search Interest,Low,High,Open
0,37,122.489998,124.18,123.660004
1,44,123.07,126.160004,123.870003
2,43,125.650002,127.129997,126.5
3,39,125.139999,127.919998,125.830002
4,9,128.520004,130.389999,128.949997
5,7,129.470001,133.039993,129.800003
6,44,130.630005,132.850006,132.520004
7,51,131.929993,134.660004,132.440002
8,39,131.660004,135.0,134.940002
9,7,133.639999,135.0,133.820007


In [368]:
# Selecting the target: the closing price
Target = "Close"
y = MergedStockApple[Target]
y

0     123.000000
1     125.900002
2     126.209999
3     127.900002
4     130.360001
5     133.000000
6     131.240005
7     134.429993
8     132.029999
9     134.500000
10    134.160004
11    134.839996
12    133.110001
13    133.500000
14    131.940002
15    134.320007
16    134.720001
17    134.389999
18    133.580002
19    133.479996
20    131.460007
21    132.539993
22    127.849998
23    128.100006
24    129.740005
25    130.210007
26    126.849998
27    125.910004
28    122.769997
29    124.970001
30    127.449997
31    126.269997
32    124.849998
33    124.690002
34    127.309998
35    125.430000
36    127.099998
37    126.900002
38    126.849998
39    125.279999
40    124.610001
41    124.279999
42    125.059998
43    123.540001
44    125.889999
45    125.900002
46    126.739998
47    127.129997
48    126.110001
49    127.349998
50    130.479996
51    129.639999
52    130.149994
53    131.789993
54    130.460007
55    132.300003
56    133.979996
57    133.699997
58    133.4100

In [369]:
# Splitting the features and target defined above into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)


Linear Regression

In [370]:
# Defining lr for linear regression function to reduce typing below
lr = LinearRegression()
lr

LinearRegression()

In [371]:
#Using linear regression function to fit the model
lr.fit(X_train, y_train)

LinearRegression()

In [372]:
#Score the training model
lr.score(X_train, y_train)

0.9833709159910372

In [373]:
#Score the testing model
lr.score(X_test, y_test)

0.9610863879936604

In [374]:
#Produce the linear regression metrics indicating the quality of the model
PredictionsRegression = lr.predict(X_test)
PrintMetricsRegression(y_test, PredictionsRegression)

Score: 0.96
MAE: 0.58
RMSE: 0.72
r2: 0.96


Linear Regression Predicting New Samples

In [375]:
#Generating random values for each feature column (search interest, low, high, and open) for prediction
NumberElements = 3
SampleStockTrendRegression = []
for _ in range(NumberElements):
    sample = {}  # avoids shadowing the built-in dict
    for column in X.columns:
        minValue = 0  # avoids shadowing the built-in min
        maxValue = round(max(MergedStockApple[column].values))
        sample[column] = rnd.randint(minValue, maxValue)
    SampleStockTrendRegression.append(sample)
SampleStockTrendRegression

[{'Search Interest': 24, 'Low': 24, 'High': 35, 'Open': 99},
 {'Search Interest': 6, 'Low': 104, 'High': 39, 'Open': 111},
 {'Search Interest': 30, 'Low': 114, 'High': 134, 'Open': 88}]

In [376]:
#Defining a dataframe for the chosen values for search interest and opening price created above
pdSampleStockTrendRegression = pd.DataFrame.from_dict(SampleStockTrendRegression)
pdSampleStockTrendRegression

Unnamed: 0,Search Interest,Low,High,Open
0,24,24,35,99
1,6,104,39,111
2,30,114,134,88


In [377]:
#Predicting the closing price based upon the defined model
PredictionsRegression = lr.predict(pdSampleStockTrendRegression)
PredictionsRegression

array([-25.3535584 ,  30.30473019, 155.77292293])

In [378]:
#Adding the predicted values to the dataframe
pdPredictedStockTrend = pdSampleStockTrendRegression.copy()
pdPredictedStockTrend ['Predicted'] = PredictionsRegression
pdPredictedStockTrend

Unnamed: 0,Search Interest,Low,High,Open,Predicted
0,24,24,35,99,-25.353558
1,6,104,39,111,30.30473
2,30,114,134,88,155.772923
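The negative predicted price for the first sample comes from the inputs rather than from the model: the random values start at 0, far below the prices of roughly 120 to 135 that the model was trained on, and nothing stops `High` from falling below `Low`. One option is to draw each feature between its observed minimum and maximum. The helper `sample_in_range` below is a hypothetical sketch of that idea, using the first ten AAPL rows shown above.

```python
import random

# Feature values from the first ten AAPL rows shown above
observed = {
    "Search Interest": [37, 44, 43, 39, 9, 7, 44, 51, 39, 7],
    "Low": [122.489998, 123.07, 125.650002, 125.139999, 128.520004,
            129.470001, 130.630005, 131.929993, 131.660004, 133.639999],
    "High": [124.18, 126.160004, 127.129997, 127.919998, 130.389999,
             133.039993, 132.850006, 134.660004, 135.0, 135.0],
    "Open": [123.660004, 123.870003, 126.5, 125.830002, 128.949997,
             129.800003, 132.520004, 132.440002, 134.940002, 133.820007],
}

def sample_in_range(columns, n_samples=3, seed=1024):
    """Draw each feature uniformly between its observed minimum and maximum."""
    rng = random.Random(seed)
    return [
        {name: rng.uniform(min(values), max(values)) for name, values in columns.items()}
        for _ in range(n_samples)
    ]

samples = sample_in_range(observed)
for sample in samples:
    print(sample)
```

This keeps every value inside the range the model has seen. It still samples each column independently, so a `High` below the `Low` remains possible.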


Classification Logistic Regression

In [379]:
#Redefining the target for logistic regression by choosing a categorical variable as the target
Target = True
y = MergedStockApple[Target]
y

0     1
1     1
2     1
3     1
4     0
5     0
6     1
7     1
8     1
9     0
10    1
11    1
12    1
13    0
14    1
15    1
16    1
17    0
18    0
19    1
20    1
21    1
22    1
23    1
24    1
25    0
26    0
27    1
28    1
29    1
30    0
31    1
32    1
33    0
34    0
35    0
36    0
37    0
38    0
39    1
40    1
41    1
42    0
43    1
44    0
45    0
46    1
47    1
48    1
49    1
50    1
51    0
52    0
53    1
54    1
55    1
56    1
57    1
58    0
59    0
60    1
61    1
62    1
Name: True, dtype: uint8

In [380]:
# Splitting the features and target defined above into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [381]:
# Defining LogisticRegression function as lr to reduce the number of characters below
lr = LogisticRegression(solver="liblinear")
lr

LogisticRegression(solver='liblinear')

In [382]:
# Using logistic regression to fit the model
lr.fit(X_train, y_train)

LogisticRegression(solver='liblinear')

In [383]:
# Using logistic regression to score the training data
lr.score(X_train, y_train)

1.0

In [384]:
# Using logistic regression to score the test data
lr.score(X_test, y_test)

0.9375

In [385]:
#Produce the logistic regression metrics indicating the quality of the model and generate the confusion matrix
PredictionsClassification = lr.predict(X_test)
PrintMetricsClassiffication(y_test, PredictionsClassification)

Confusion Matrix:
[[ 4  0]
 [ 1 11]]
------------------
Accuracy: 0.94
Recall: 0.92
Prediction: 1.00
f-measure: 0.96
------------------
              precision    recall  f1-score   support

           0       0.80      1.00      0.89         4
           1       1.00      0.92      0.96        12

    accuracy                           0.94        16
   macro avg       0.90      0.96      0.92        16
weighted avg       0.95      0.94      0.94        16
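The perfect training score is less surprising once the target's origin is considered: `Search Interests Above Average` is computed from `Search Interest`, which is also one of the features. A simple threshold on that single feature therefore reproduces the label. The sketch below checks this on the first ten merged AAPL rows shown earlier, recovering the mean from the first row's `Interest Points Away From Mean` value.

```python
# Search Interest and the derived flag for the first ten merged AAPL rows
search_interest = [37, 44, 43, 39, 9, 7, 44, 51, 39, 7]
above_average = [True, True, True, True, False, False, True, True, True, False]

# Mean search interest implied by the first row: 37 is 8.730159 above the mean
mean_interest = 37 - 8.730159

# One-feature rule: predict "above average" whenever interest exceeds the mean
rule_predictions = [value > mean_interest for value in search_interest]
matches = sum(p == a for p, a in zip(rule_predictions, above_average))
print(f"Threshold rule matches {matches} of {len(above_average)} labels")
```

A target that is not derived from any of the features, such as whether the price rose that day, would give a more informative test of the search-interest signal.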



Classification Predicting New Samples

In [386]:
#Generating random values for each feature column (search interest, low, high, and open) for prediction
numElements = 3
SampleStockTrendsClassification = []
for _ in range(numElements):
    sample = {}  # avoids shadowing the built-in dict
    for column in X.columns:
        minValue = 0  # assume min = 0
        maxValue = round(max(MergedStockApple[column].values))
        sample[column] = rnd.randint(minValue, maxValue)
    SampleStockTrendsClassification.append(sample)
SampleStockTrendsClassification

[{'Search Interest': 39, 'Low': 22, 'High': 9, 'Open': 21},
 {'Search Interest': 40, 'Low': 21, 'High': 26, 'Open': 92},
 {'Search Interest': 44, 'Low': 85, 'High': 0, 'Open': 94}]

In [387]:
#Defining a dataframe for the chosen values for search interest and opening price created above
pdSampleStockTrendsClassification = pd.DataFrame.from_dict(SampleStockTrendsClassification)

In [388]:
#Predicting the class (search interest above average or not) based upon the defined model
predictionsClassification = lr.predict(pdSampleStockTrendsClassification)
predictionsClassification

array([1, 1, 1], dtype=uint8)

In [389]:
#Adding the predicted values to a copy of the dataframe
pdPredictedStockTrendsClassification = pdSampleStockTrendsClassification.copy()
pdPredictedStockTrendsClassification["Predicted"] = predictionsClassification.astype(bool)
pdPredictedStockTrendsClassification

Unnamed: 0,Search Interest,Low,High,Open,Predicted
0,39,22,9,21,True
1,40,21,26,92,True
2,44,85,0,94,True


#### Conclusion

Both the linear regression and logistic regression models indicate a reasonable correlation between Google search activity, price points during the day's trading, and the closing price for the dataset investigated in this notebook.  This could indicate that Apple is a stable stock, with Google search interest and the price change throughout the trading day being fairly consistent.  Note, however, that the logistic regression target (search interest above average) is derived directly from the Search Interest feature, so its high accuracy is expected.  Further investigation in this project will include less stable stocks, which may show whether this holds.


### AMC
Shawn Sonnack

#### Purpose

We are looking to see whether there is any correlation between Google search interest and stock price changes.  In this data set we have pulled a number from Google between 0 and 100.  A value of 0 means that there was little to no search traffic compared to normal, and a value of 100 means that the search traffic for that day was extremely high.

First I will dig in to see whether search interest and the opening price can predict the day's close price.


#### Pull in prepared data for AMC stock: January 1 - June 30

In [390]:
amcMergedDataFrame = pd.read_csv(f'{dataDir}AMCDataClean.zip')
amcMergedDataFrame

Unnamed: 0,Search Interest,Open,Close,Volume,Amount Changed,Days Spread,Price Increase,Search Interest Above Avg
0,2,2.200000,2.010000,29873800,0.190000,0.200000,1,0
1,3,1.990000,1.980000,28148300,0.010000,0.120000,1,0
2,2,2.030000,2.010000,67363300,0.020000,0.260000,1,0
3,2,2.080000,2.050000,26150500,0.030000,0.090000,1,0
4,3,2.090000,2.140000,39553300,-0.050000,0.140000,0,0
...,...,...,...,...,...,...,...,...
118,17,57.040001,58.299999,116291800,-1.259998,4.299999,0,1
119,16,57.980000,56.700001,80351200,1.279999,3.099998,1,1
120,19,55.750000,54.060001,77596900,1.689999,3.320000,1,1
121,16,55.099998,58.110001,99310200,-3.010002,5.029999,0,1


#### Linear Regression Setup

In [391]:
featureColumns=['Search Interest', 'Open']
targetColumn = 'Close'

X=amcMergedDataFrame[featureColumns]
y=amcMergedDataFrame[targetColumn]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

#### Create the regression

In [392]:
lr = LinearRegression()
lr

LinearRegression()

#### Fit Linear Model

In [393]:
lr.fit(X_train, y_train)

LinearRegression()

#### How confident are we in our model?

In [394]:
lr.score(X_train, y_train)

0.9691667374025036

In [395]:
lr.score(X_test, y_test)

0.9709538923793088

#### Print the regression metrics for the test-set predictions

In [396]:
predictions = lr.predict(X_test)
PrintMetricsRegression(y_test, predictions)

Score: 0.97
MAE: 1.91
RMSE: 3.13
r2: 0.97


#### Create new samples, to test our model

In [397]:
amcStockPreparedData = prepareDataForPredictions(X)

#### Prepare the predictions for consumption

In [398]:
amcPreparedData = pd.DataFrame.from_dict(amcStockPreparedData)
amcPreparedData

Unnamed: 0,Search Interest,Open
0,38,11
1,17,10
2,49,18


#### Predict what the close price will be

In [399]:
predictions = lr.predict(amcPreparedData)
predictions

array([12.39906455, 10.32854224, 20.04788933])

#### Make it pretty

In [400]:
amcPredictedPrice = amcPreparedData.copy()
amcPredictedPrice['Price Prediction'] = predictions
amcPredictedPrice

Unnamed: 0,Search Interest,Open,Price Prediction
0,38,11,12.399065
1,17,10,10.328542
2,49,18,20.047889


#### Classification - Logistic Regression

In [401]:
amcMergedDataFrame

Unnamed: 0,Search Interest,Open,Close,Volume,Amount Changed,Days Spread,Price Increase,Search Interest Above Avg
0,2,2.200000,2.010000,29873800,0.190000,0.200000,1,0
1,3,1.990000,1.980000,28148300,0.010000,0.120000,1,0
2,2,2.030000,2.010000,67363300,0.020000,0.260000,1,0
3,2,2.080000,2.050000,26150500,0.030000,0.090000,1,0
4,3,2.090000,2.140000,39553300,-0.050000,0.140000,0,0
...,...,...,...,...,...,...,...,...
118,17,57.040001,58.299999,116291800,-1.259998,4.299999,0,1
119,16,57.980000,56.700001,80351200,1.279999,3.099998,1,1
120,19,55.750000,54.060001,77596900,1.689999,3.320000,1,1
121,16,55.099998,58.110001,99310200,-3.010002,5.029999,0,1


#### Prepare the Data and logistic Columns

In [402]:
amcMergedDataFrame['Price Increase'] = amcMergedDataFrame['Price Increase'].astype(int)
amcMergedDataFrame['Search Interest Above Avg'] = amcMergedDataFrame['Search Interest Above Avg'].astype(int)
logisticFeatureColumns=['Open', 'Close']
logisticTargetColumn = 'Search Interest Above Avg'

X=amcMergedDataFrame[logisticFeatureColumns]
y=amcMergedDataFrame[logisticTargetColumn]


In [403]:
y

0      0
1      0
2      0
3      0
4      0
      ..
118    1
119    1
120    1
121    1
122    1
Name: Search Interest Above Avg, Length: 123, dtype: int64

#### Train the model with my data from above

In [404]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

#### Create the Logistic Regression

In [405]:
lr = LogisticRegression(solver="liblinear")
lr

LogisticRegression(solver='liblinear')

#### Fit the data

In [406]:
lr.fit(X_train, y_train)

LogisticRegression(solver='liblinear')

#### Score the model

In [407]:
lr.score(X_train, y_train)

0.9021739130434783

In [408]:
lr.score(X_test, y_test)

0.9032258064516129

#### Prepare the predictions

In [409]:
predictions = lr.predict(X_test)
PrintMetricsClassiffication(y_test, predictions)

Confusion Matrix:
[[21  0]
 [ 3  7]]
------------------
Accuracy: 0.90
Recall: 0.70
Prediction: 1.00
f-measure: 0.82
------------------
              precision    recall  f1-score   support

           0       0.88      1.00      0.93        21
           1       1.00      0.70      0.82        10

    accuracy                           0.90        31
   macro avg       0.94      0.85      0.88        31
weighted avg       0.92      0.90      0.90        31



#### Create the data set to use for the predictions

In [410]:
amcStockPreparedData = prepareDataForPredictions(X)
amcPreparedData = pd.DataFrame.from_dict(amcStockPreparedData)
amcPreparedData


Unnamed: 0,Open,Close
0,10,11
1,14,15
2,29,3


#### Use the dummy dataset to test our prediction

In [411]:
predictions = lr.predict(amcPreparedData)
predictions

array([0, 0, 1])

In [412]:
pdPredictedStockTrend = amcPreparedData.copy()
pdPredictedStockTrend["Search Interest Above Average"] = predictions.astype(bool)
pdPredictedStockTrend

Unnamed: 0,Open,Close,Search Interest Above Average
0,10,11,False
1,14,15,False
2,29,3,True


#### Conclusion
The linear regression model worked very well. It appears to predict a stock's close price with very high accuracy from the opening price and the day's search interest. In the real world this would be hard to use, because search interest and price changes happen at the same time, so a day's search-interest figure is not available before that day's close. I am less confident in the predictions of my second model, which uses logistic regression. There I looked at the problem from the other side, to see whether I could predict search interest from the daily open and close prices of AMC. It would have been interesting if the model had shown that search interest is high only when the stock is doing well. No such trend emerged, even though the model's accuracy was high (about 90% on the test set).
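
One way to address the timing concern is to feed the model only information that was available before the day being predicted, for example the previous day's search interest. The sketch below uses made-up values and hypothetical field names; it is not the notebook's data or method, only an illustration of lagging a feature by one day.

```python
days = [
    # (date, open, close, search_interest): made-up values for illustration
    ("2021-06-01", 10.0, 11.0, 40),
    ("2021-06-02", 11.5, 14.0, 55),
    ("2021-06-03", 14.0, 13.0, 90),
    ("2021-06-04", 13.5, 15.0, 70),
]

# Pair each day's open and close with the PREVIOUS day's search interest.
# The first day is dropped because it has no previous day.
lagged_rows = []
for prev_day, cur_day in zip(days, days[1:]):
    date, open_price, close_price, _ = cur_day
    lagged_rows.append({
        "date": date,
        "open": open_price,
        "prev_search_interest": prev_day[3],
        "close": close_price,
    })

for row in lagged_rows:
    print(row)
```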

### Coke
Arielle Swift

#### Get Data

In [413]:
CokeDataSetQtr2 = get_data("KO")

# Name unnamed date column
CokeDataSetQtr2.reset_index(inplace=True)
CokeDataSetQtr2.rename(columns = {"index": "Date"},inplace = True)
CokeDataSetQtr2

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,KO,isPartial
0,2020-07-27,48.180000,48.509998,48.180000,48.480000,46.968781,17346500,82.0,False
1,2020-07-28,48.340000,49.279999,48.099998,48.180000,46.678131,13872700,82.0,False
2,2020-07-29,48.139999,48.500000,47.820000,48.020000,46.523117,13758100,84.0,False
3,2020-07-30,47.669998,48.230000,47.200001,47.689999,46.203403,17276500,85.0,False
4,2020-07-31,47.439999,47.770000,46.730000,47.240002,45.767437,14849200,87.0,False
...,...,...,...,...,...,...,...,...,...
245,2021-07-14,55.020000,56.349998,54.959999,56.259998,56.259998,22002700,87.0,False
246,2021-07-15,56.240002,56.470001,55.910000,56.439999,56.439999,15068200,89.0,False
247,2021-07-16,56.459999,56.680000,56.259998,56.400002,56.400002,14857600,91.0,False
248,2021-07-19,56.080002,56.349998,55.160000,55.730000,55.730000,19527000,,


In [414]:
CokeDataSetQtr2['Date']= pd.to_datetime(CokeDataSetQtr2['Date'],format='%Y-%m-%d')
CokeDataSetQtr2['Date'] = CokeDataSetQtr2['Date'].dt.strftime('%m-%d-%Y')
CokeDataSetQtr2


Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,KO,isPartial
0,07-27-2020,48.180000,48.509998,48.180000,48.480000,46.968781,17346500,82.0,False
1,07-28-2020,48.340000,49.279999,48.099998,48.180000,46.678131,13872700,82.0,False
2,07-29-2020,48.139999,48.500000,47.820000,48.020000,46.523117,13758100,84.0,False
3,07-30-2020,47.669998,48.230000,47.200001,47.689999,46.203403,17276500,85.0,False
4,07-31-2020,47.439999,47.770000,46.730000,47.240002,45.767437,14849200,87.0,False
...,...,...,...,...,...,...,...,...,...
245,07-14-2021,55.020000,56.349998,54.959999,56.259998,56.259998,22002700,87.0,False
246,07-15-2021,56.240002,56.470001,55.910000,56.439999,56.439999,15068200,89.0,False
247,07-16-2021,56.459999,56.680000,56.259998,56.400002,56.400002,14857600,91.0,False
248,07-19-2021,56.080002,56.349998,55.160000,55.730000,55.730000,19527000,,


In [415]:
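# Compare parsed dates; month-first strings would sort by month before year
parsedDates = pd.to_datetime(CokeDataSetQtr2['Date'], format='%m-%d-%Y')
CokeDataSetQtr2 = CokeDataSetQtr2.loc[(parsedDates >= pd.Timestamp('2021-04-01'))
                     & (parsedDates < pd.Timestamp('2021-07-01'))]
CokeDataSetQtr2.reset_index(drop=True, inplace=True)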
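
A caution on date filters like this one: when `Date` is stored as `%m-%d-%Y` text, comparing the strings directly orders them by month before year. That only selects the intended quarter when each month appears in a single year, as happens to be the case for this one-year data set. The standard-library sketch below shows the difference; the `04-15-2020` entry is a hypothetical value added to expose the problem.

```python
from datetime import datetime

dates = ["07-27-2020", "04-01-2021", "06-30-2021", "04-15-2020"]  # last value is hypothetical

# Lexicographic comparison on month-first strings ignores the year.
as_strings = [d for d in dates if "04-01-2021" <= d < "07-01-2021"]

# Parsing first compares real calendar dates.
start, end = datetime(2021, 4, 1), datetime(2021, 7, 1)
as_dates = [d for d in dates if start <= datetime.strptime(d, "%m-%d-%Y") < end]

print(as_strings)  # includes the 2020 date
print(as_dates)    # only the two 2021 dates
```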

In [416]:
CokeDataSetQtr2['InvestToday'] = np.where(CokeDataSetQtr2.Close - CokeDataSetQtr2.Open>0, 1, 0)
CokeDataSetQtr2



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,KO,isPartial,InvestToday
0,04-01-2021,52.959999,53.150002,52.459999,52.509998,52.117294,15834700,85.0,False,0
1,04-05-2021,52.349998,53.220001,52.290001,52.810001,52.415054,16368700,85.0,False,1
2,04-06-2021,53.040001,53.650002,52.900002,53.189999,52.79221,15614300,82.0,False,1
3,04-07-2021,53.279999,53.5,53.119999,53.279999,52.881535,10062700,84.0,False,0
4,04-08-2021,53.169998,53.380001,52.970001,53.119999,52.722733,9695600,84.0,False,0
5,04-09-2021,53.169998,53.279999,52.810001,53.18,52.782284,10828200,84.0,False,1
6,04-12-2021,53.330002,53.549999,53.099998,53.349998,52.951012,8565300,83.0,False,1
7,04-13-2021,53.040001,53.290001,52.810001,53.09,52.692959,11071700,84.0,False,1
8,04-14-2021,52.98,53.189999,52.650002,53.080002,52.683033,9787600,84.0,False,1
9,04-15-2021,53.130001,53.66,53.119999,53.330002,52.931164,13078100,84.0,False,1


In [417]:
Continuous_Cols=[ 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
Categorical_Cols=['InvestToday']

In [418]:
Predictor_Cols = Categorical_Cols + Continuous_Cols

Target_Col = 'InvestToday'

X=CokeDataSetQtr2[Continuous_Cols]
y=CokeDataSetQtr2[Target_Col]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
print("Population:\n",y.value_counts(normalize=True)*100)
print("Train:\n", y_train.value_counts(normalize=True)*100)
print("Test:\n", y_test.value_counts(normalize=True)*100)

Population:
 0    50.0
1    50.0
Name: InvestToday, dtype: float64
Train:
 1    54.545455
0    45.454545
Name: InvestToday, dtype: float64
Test:
 0    60.0
1    40.0
Name: InvestToday, dtype: float64


In [419]:
knn = KNeighborsClassifier(n_neighbors=3)

In [420]:
knn.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=3)

In [421]:
knn.score(X_train, y_train)

0.7954545454545454

In [422]:
knn.score(X_test, y_test)

0.5

#### Confusion Matrix

In [423]:
predictions = knn.predict(X_test)
PrintMetricsClassiffication(y_test, predictions)

Confusion Matrix:
[[5 7]
 [3 5]]
------------------
Accuracy: 0.50
Recall: 0.62
Prediction: 0.42
f-measure: 0.50
------------------
              precision    recall  f1-score   support

           0       0.62      0.42      0.50        12
           1       0.42      0.62      0.50         8

    accuracy                           0.50        20
   macro avg       0.52      0.52      0.50        20
weighted avg       0.54      0.50      0.50        20



In [424]:
CokeDataSetQtr2

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,KO,isPartial,InvestToday
0,04-01-2021,52.959999,53.150002,52.459999,52.509998,52.117294,15834700,85.0,False,0
1,04-05-2021,52.349998,53.220001,52.290001,52.810001,52.415054,16368700,85.0,False,1
2,04-06-2021,53.040001,53.650002,52.900002,53.189999,52.79221,15614300,82.0,False,1
3,04-07-2021,53.279999,53.5,53.119999,53.279999,52.881535,10062700,84.0,False,0
4,04-08-2021,53.169998,53.380001,52.970001,53.119999,52.722733,9695600,84.0,False,0
5,04-09-2021,53.169998,53.279999,52.810001,53.18,52.782284,10828200,84.0,False,1
6,04-12-2021,53.330002,53.549999,53.099998,53.349998,52.951012,8565300,83.0,False,1
7,04-13-2021,53.040001,53.290001,52.810001,53.09,52.692959,11071700,84.0,False,1
8,04-14-2021,52.98,53.189999,52.650002,53.080002,52.683033,9787600,84.0,False,1
9,04-15-2021,53.130001,53.66,53.119999,53.330002,52.931164,13078100,84.0,False,1


In [425]:
numElements = 10
SampleStock = []
for _ in range(numElements):
    sample = {}  # avoid shadowing the built-in dict
    for column in X.columns:
        minValue = 0  # avoid shadowing the built-in min
        maxValue = round(max(CokeDataSetQtr2[column].values))
        sample[column] = rnd.randint(minValue, maxValue)
    SampleStock.append(sample)
SampleStock

[{'Open': 44,
  'High': 26,
  'Low': 51,
  'Close': 1,
  'Adj Close': 52,
  'Volume': 1439518},
 {'Open': 56,
  'High': 15,
  'Low': 48,
  'Close': 53,
  'Adj Close': 9,
  'Volume': 22392310},
 {'Open': 22,
  'High': 29,
  'Low': 56,
  'Close': 39,
  'Adj Close': 11,
  'Volume': 36866680},
 {'Open': 44,
  'High': 51,
  'Low': 51,
  'Close': 27,
  'Adj Close': 3,
  'Volume': 4501192},
 {'Open': 52,
  'High': 24,
  'Low': 55,
  'Close': 6,
  'Adj Close': 46,
  'Volume': 40024183},
 {'Open': 24,
  'High': 11,
  'Low': 13,
  'Close': 55,
  'Adj Close': 45,
  'Volume': 2306388},
 {'Open': 11,
  'High': 39,
  'Low': 51,
  'Close': 42,
  'Adj Close': 21,
  'Volume': 20200103},
 {'Open': 54,
  'High': 26,
  'Low': 29,
  'Close': 33,
  'Adj Close': 5,
  'Volume': 56684817},
 {'Open': 39,
  'High': 23,
  'Low': 6,
  'Close': 16,
  'Adj Close': 52,
  'Volume': 2563551},
 {'Open': 50,
  'High': 4,
  'Low': 49,
  'Close': 53,
  'Adj Close': 50,
  'Volume': 40063769}]

In [426]:
pdSampleStock = pd.DataFrame.from_dict(SampleStock)
predictions = knn.predict(pdSampleStock)
predictions

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [427]:
pdPredictedInvest = pdSampleStock
pdPredictedInvest["InvestToday?"] = predictions.astype(bool)
pdPredictedInvest

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume,InvestToday?
0,44,26,51,1,52,1439518,True
1,56,15,48,53,9,22392310,True
2,22,29,56,39,11,36866680,True
3,44,51,51,27,3,4501192,True
4,52,24,55,6,46,40024183,True
5,24,11,13,55,45,2306388,True
6,11,39,51,42,21,20200103,True
7,54,26,29,33,5,56684817,True
8,39,23,6,16,52,2563551,True
9,50,4,49,53,50,40063769,True
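
The synthetic rows above are drawn independently from 0 up to each column's maximum, so many of them could not occur in practice (for example a `High` below the `Low`). A hedged sketch of one alternative follows. `make_plausible_sample` and the price and volume ranges are hypothetical names and values chosen for illustration, and `Adj Close` is left out for brevity.

```python
import random


def make_plausible_sample(rng, price_range, volume_range):
    """Draw one synthetic row whose Open/High/Low/Close values are mutually consistent."""
    low_bound, high_bound = price_range
    # Draw four prices in the range, then assign them so Low <= Open, Close <= High.
    prices = sorted(rng.uniform(low_bound, high_bound) for _ in range(4))
    low, high = prices[0], prices[3]
    open_price, close_price = rng.sample(prices[1:3], 2)  # random order of the middle two
    return {
        "Open": round(open_price, 2),
        "High": round(high, 2),
        "Low": round(low, 2),
        "Close": round(close_price, 2),
        "Volume": rng.randint(*volume_range),
    }


rng = random.Random(1024)  # same seed value the notebook passes to rnd.seed
samples = [
    make_plausible_sample(rng, (52.0, 57.0), (8_000_000, 25_000_000))
    for _ in range(3)
]
for s in samples:
    print(s)
```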


In [428]:
lr = LogisticRegression(solver="liblinear")
lr

LogisticRegression(solver='liblinear')

In [429]:
lr.fit(X_train, y_train)


LogisticRegression(solver='liblinear')

In [430]:
lr.score(X_train, y_train)


0.5454545454545454

In [431]:
lr.score(X_test, y_test)

0.4

In [432]:
predictions = lr.predict(X_test)

In [433]:
fig = px.scatter(CokeDataSetQtr2, x='Close', y='Open',color='InvestToday')
fig.show()

### Bitcoin (BTC-USD)
Andrew T.

In [434]:
pytrends = TrendReq(hl='en-US', tz=360)

# Build the list of keywords; in this case only Bitcoin
kw_list = ["Bitcoin"]

# build the payload
pytrends.build_payload(kw_list, timeframe='2021-03-31 2021-06-29', geo='US')

# Store the interest-over-time data in a DataFrame and rename the Bitcoin column to Previous_Search_Interest
bitcoinTrendsdf = pytrends.interest_over_time()
bitcoinTrendsdf = bitcoinTrendsdf.rename(columns={'Bitcoin': 'Previous_Search_Interest'})
#telsaStockdf.set_index('date': 'Date', inplace=True)
bitcoinTrendsdf.reset_index(inplace=True, drop=True)
bitcoinTrendsdf

Unnamed: 0,Previous_Search_Interest,isPartial
0,29,False
1,41,False
2,29,False
3,22,False
4,22,False
5,23,False
6,24,False
7,23,False
8,24,False
9,22,False


In [435]:
bitcoinPricedf = pd.read_csv(f"{dataDir}BTC-USD.csv")
bitcoinPreviousPricedf = pd.read_csv(f"{dataDir}BTC-USD-Previous.csv")
mergedPrice = pd.concat([bitcoinPricedf, bitcoinPreviousPricedf], axis=1)
bitcoinPreviousPricedf

Unnamed: 0,Previous_Open,Previous_High,Previous_Low,Previous_Close,Previous_Adj_Close,Previous_Volume
0,58930.277344,59930.027344,57726.417969,58918.832031,58918.832031,65520826225
1,58926.5625,59586.070313,58505.277344,59095.808594,59095.808594,61669163792
2,59098.878906,60267.1875,58869.28125,59384.3125,59384.3125,58727860620
3,59397.410156,60110.269531,57603.890625,57603.890625,57603.890625,59641344484
4,57604.839844,58913.746094,57168.675781,58758.554688,58758.554688,50749662970
5,58760.875,59891.296875,57694.824219,59057.878906,59057.878906,60706272115
6,59171.933594,59479.578125,57646.808594,58192.359375,58192.359375,66058027988
7,58186.507813,58731.144531,55604.023438,56048.9375,56048.9375,75645303584
8,56099.914063,58338.738281,55879.085938,58323.953125,58323.953125,53053855641
9,58326.5625,58937.046875,57807.863281,58245.003906,58245.003906,46655208546


In [436]:
mergedStockPrice = pd.concat([mergedPrice, bitcoinTrendsdf], axis=1)
#mergedStockPrice.reset_index(inplace=True, drop=True)
mergedStockPrice.set_index('isPartial', drop=True)  # no effect: the returned DataFrame is not assigned
#mergedStockPrice['Previous_Close'] = mergedStockPrice['Close']
#meanSearchInterest = mergedStockPrice['Search_Interest'].mean()
#mergedStockPrice["Interest Points Away From Mean"] = mergedStockPrice["Search Interest"] - meanSearchInterest
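# Note: Open - Close > 0 is True on days when the close is below the open,
# so Price_Increase is 1 when the price fell and 0 when it rose
mergedStockPrice["Price_Increase"] = mergedStockPrice["Open"] - mergedStockPrice["Close"] > 0.0
mergedStockPrice["Price_Increase"] = mergedStockPrice["Price_Increase"]*1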


#pd.set_option('display.max_rows', len(mergedStockPrice))

mergedStockPrice

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Previous_Open,Previous_High,Previous_Low,Previous_Close,Previous_Adj_Close,Previous_Volume,Previous_Search_Interest,isPartial,Price_Increase
0,2021-04-01,58926.5625,59586.070313,58505.277344,59095.808594,59095.808594,61669163792,58930.277344,59930.027344,57726.417969,58918.832031,58918.832031,65520826225,29,False,0
1,2021-04-02,59098.878906,60267.1875,58869.28125,59384.3125,59384.3125,58727860620,58926.5625,59586.070313,58505.277344,59095.808594,59095.808594,61669163792,41,False,0
2,2021-04-03,59397.410156,60110.269531,57603.890625,57603.890625,57603.890625,59641344484,59098.878906,60267.1875,58869.28125,59384.3125,59384.3125,58727860620,29,False,1
3,2021-04-04,57604.839844,58913.746094,57168.675781,58758.554688,58758.554688,50749662970,59397.410156,60110.269531,57603.890625,57603.890625,57603.890625,59641344484,22,False,0
4,2021-04-05,58760.875,59891.296875,57694.824219,59057.878906,59057.878906,60706272115,57604.839844,58913.746094,57168.675781,58758.554688,58758.554688,50749662970,22,False,0
5,2021-04-06,59171.933594,59479.578125,57646.808594,58192.359375,58192.359375,66058027988,58760.875,59891.296875,57694.824219,59057.878906,59057.878906,60706272115,23,False,1
6,2021-04-07,58186.507813,58731.144531,55604.023438,56048.9375,56048.9375,75645303584,59171.933594,59479.578125,57646.808594,58192.359375,58192.359375,66058027988,24,False,1
7,2021-04-08,56099.914063,58338.738281,55879.085938,58323.953125,58323.953125,53053855641,58186.507813,58731.144531,55604.023438,56048.9375,56048.9375,75645303584,23,False,0
8,2021-04-09,58326.5625,58937.046875,57807.863281,58245.003906,58245.003906,46655208546,56099.914063,58338.738281,55879.085938,58323.953125,58323.953125,53053855641,24,False,1
9,2021-04-10,58253.777344,61276.664063,58038.707031,59793.234375,59793.234375,58238470525,58326.5625,58937.046875,57807.863281,58245.003906,58245.003906,46655208546,22,False,0
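
A quick check of the labelling rule against the first rows above: on 2021-04-01 the close (59095.808594) is above the open (58926.5625), yet `Price_Increase` is 0. The expression `Open - Close > 0` is true when the close is below the open. The sketch below contrasts it with `Close - Open > 0`; which of the two is intended is the author's call, and the variable names are only for illustration.

```python
rows = [
    # (Date, Open, Close) copied from the first three rows above
    ("2021-04-01", 58926.5625, 59095.808594),
    ("2021-04-02", 59098.878906, 59384.3125),
    ("2021-04-03", 59397.410156, 57603.890625),
]

# Rule as written in the notebook: 1 when the price fell during the day
as_written = [int(o - c > 0.0) for _, o, c in rows]

# Alternative rule: 1 when the price rose during the day
rose = [int(c - o > 0.0) for _, o, c in rows]

print("as written:      ", as_written)
print("close above open:", rose)
```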


In [437]:
mergedStockPrice.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91 entries, 0 to 90
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Date                      91 non-null     object 
 1   Open                      91 non-null     float64
 2   High                      91 non-null     float64
 3   Low                       91 non-null     float64
 4   Close                     91 non-null     float64
 5   Adj Close                 91 non-null     float64
 6   Volume                    91 non-null     int64  
 7   Previous_Open             91 non-null     float64
 8   Previous_High             91 non-null     float64
 9   Previous_Low              91 non-null     float64
 10  Previous_Close            91 non-null     float64
 11  Previous_Adj_Close        91 non-null     float64
 12  Previous_Volume           91 non-null     int64  
 13  Previous_Search_Interest  91 non-null     int64  
 14  isPartial   

In [438]:
#columns = ["Open", "High", "Low", "Close", "Search Interest", "Interest Points Away From Mean", "Price Increase Points", "Price Increase"]
columns = ["Previous_Open", "Previous_High", "Previous_Low", "Previous_Close", "Previous_Volume", "Previous_Search_Interest", "Close"]
mergedStockPrice = mergedStockPrice[columns]
mergedStockPrice

Unnamed: 0,Previous_Open,Previous_High,Previous_Low,Previous_Close,Previous_Volume,Previous_Search_Interest,Close
0,58930.277344,59930.027344,57726.417969,58918.832031,65520826225,29,59095.808594
1,58926.5625,59586.070313,58505.277344,59095.808594,61669163792,41,59384.3125
2,59098.878906,60267.1875,58869.28125,59384.3125,58727860620,29,57603.890625
3,59397.410156,60110.269531,57603.890625,57603.890625,59641344484,22,58758.554688
4,57604.839844,58913.746094,57168.675781,58758.554688,50749662970,22,59057.878906
5,58760.875,59891.296875,57694.824219,59057.878906,60706272115,23,58192.359375
6,59171.933594,59479.578125,57646.808594,58192.359375,66058027988,24,56048.9375
7,58186.507813,58731.144531,55604.023438,56048.9375,75645303584,23,58323.953125
8,56099.914063,58338.738281,55879.085938,58323.953125,53053855641,24,58245.003906
9,58326.5625,58937.046875,57807.863281,58245.003906,46655208546,22,59793.234375


In [439]:
features = list(mergedStockPrice.columns)
features.remove("Close")
target = "Close"

X = mergedStockPrice[features]
y = mergedStockPrice[target]

In [440]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

In [441]:
lr = LinearRegression()
lr

LinearRegression()

In [442]:
lr.fit(X_train, y_train)

LinearRegression()

In [443]:
lr.score(X_train, y_train)

0.9600860921909471

In [444]:
lr.score(X_test, y_test)

0.9614613962909508

In [445]:
predictions = lr.predict(X_test)
PrintMetricsRegression(y_test, predictions)

Score: 0.96
MAE: 1573.91
RMSE: 1988.37
r2: 0.96
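
Because `Previous_Close` is one of the features, it helps to compare the regression against a naive baseline that predicts each day's close as the previous day's close. The sketch below does this only for the ten rows of `mergedStockPrice` displayed earlier in this section, so its MAE is not directly comparable to the test-set figure above; it merely shows how such a baseline could be computed.

```python
# (Previous_Close, Close) pairs from the ten displayed rows of mergedStockPrice
pairs = [
    (58918.832031, 59095.808594),
    (59095.808594, 59384.3125),
    (59384.3125, 57603.890625),
    (57603.890625, 58758.554688),
    (58758.554688, 59057.878906),
    (59057.878906, 58192.359375),
    (58192.359375, 56048.9375),
    (56048.9375, 58323.953125),
    (58323.953125, 58245.003906),
    (58245.003906, 59793.234375),
]

# Naive prediction: today's close equals yesterday's close
abs_errors = [abs(close - prev_close) for prev_close, close in pairs]
baseline_mae = sum(abs_errors) / len(abs_errors)
print(f"naive baseline MAE on displayed rows: {baseline_mae:.2f}")
```

Computing the same baseline on the actual test rows would show how much the model adds beyond carrying the previous close forward.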


In [446]:
numElements = 3
samplePrice = []
for _ in range(numElements):
    sample = {}  # avoid shadowing the built-in dict
    for column in X.columns:
        minValue = 0  # We'll always allow at lea
        maxValue = round(max(mergedStockPrice[column].values))
        sample[column] = rnd.randint(minValue, maxValue)
    samplePrice.append(sample)
samplePrice


[{'Previous_Open': 11948,
  'Previous_High': 8226,
  'Previous_Low': 34695,
  'Previous_Close': 31818,
  'Previous_Volume': 119406232053,
  'Previous_Search_Interest': 41},
 {'Previous_Open': 27803,
  'Previous_High': 30215,
  'Previous_Low': 23176,
  'Previous_Close': 5105,
  'Previous_Volume': 122753068767,
  'Previous_Search_Interest': 63},
 {'Previous_Open': 51009,
  'Previous_High': 23506,
  'Previous_Low': 58654,
  'Previous_Close': 35309,
  'Previous_Volume': 24995876771,
  'Previous_Search_Interest': 12}]

In [447]:
pdSamplePrice = pd.DataFrame.from_dict(samplePrice)
pdSamplePrice


Unnamed: 0,Previous_Open,Previous_High,Previous_Low,Previous_Close,Previous_Volume,Previous_Search_Interest
0,11948,8226,34695,31818,119406232053,41
1,27803,30215,23176,5105,122753068767,63
2,51009,23506,58654,35309,24995876771,12


In [448]:
predictions = lr.predict(pdSamplePrice)
predictions



array([12085.74080034, -5118.65570803, 23689.78017242])

In [449]:
pdSamplePrice = pdSamplePrice.copy()
pdSamplePrice['Predicted'] = predictions
pdSamplePrice



Unnamed: 0,Previous_Open,Previous_High,Previous_Low,Previous_Close,Previous_Volume,Previous_Search_Interest,Predicted
0,11948,8226,34695,31818,119406232053,41,12085.7408
1,27803,30215,23176,5105,122753068767,63,-5118.655708
2,51009,23506,58654,35309,24995876771,12,23689.780172


#### Logistic Regression


In [450]:
pd.set_option('display.max_rows', len(mergedStockPrice))

# Note: Open - Close > 0 is True when the close is below the open (a falling day)
mergedStockPrice["Price_Increase"] = bitcoinPricedf["Open"] - bitcoinPricedf["Close"] > 0.0
mergedStockPrice['Price_Increase'] = mergedStockPrice.Price_Increase.astype(int)
mergedStockPrice




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Unnamed: 0,Previous_Open,Previous_High,Previous_Low,Previous_Close,Previous_Volume,Previous_Search_Interest,Close,Price_Increase
0,58930.277344,59930.027344,57726.417969,58918.832031,65520826225,29,59095.808594,0
1,58926.5625,59586.070313,58505.277344,59095.808594,61669163792,41,59384.3125,0
2,59098.878906,60267.1875,58869.28125,59384.3125,58727860620,29,57603.890625,1
3,59397.410156,60110.269531,57603.890625,57603.890625,59641344484,22,58758.554688,0
4,57604.839844,58913.746094,57168.675781,58758.554688,50749662970,22,59057.878906,0
5,58760.875,59891.296875,57694.824219,59057.878906,60706272115,23,58192.359375,1
6,59171.933594,59479.578125,57646.808594,58192.359375,66058027988,24,56048.9375,1
7,58186.507813,58731.144531,55604.023438,56048.9375,75645303584,23,58323.953125,0
8,56099.914063,58338.738281,55879.085938,58323.953125,53053855641,24,58245.003906,1
9,58326.5625,58937.046875,57807.863281,58245.003906,46655208546,22,59793.234375,0


In [451]:

columns = ["Previous_Open", "Previous_Close", "Previous_Search_Interest", "Price_Increase"]
mergedStockPrice = mergedStockPrice[columns]
mergedStockPrice

features = list(mergedStockPrice.columns)
features.remove("Price_Increase")
target = "Price_Increase"

X = mergedStockPrice[features]
y = mergedStockPrice[target]

In [452]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [453]:
logReg = LogisticRegression(solver="liblinear")
logReg

LogisticRegression(solver='liblinear')

In [454]:
logReg.fit(X_train, y_train)

LogisticRegression(solver='liblinear')

In [455]:
logReg.score(X_train, y_train)

0.5

In [456]:
logReg.score(X_test, y_test)

0.6086956521739131

In [457]:
predictions = logReg.predict(X_test)
PrintMetricsClassiffication(y_test, predictions)

Confusion Matrix:
[[ 3  3]
 [ 6 11]]
------------------
Accuracy: 0.61
Recall: 0.65
Prediction: 0.79
f-measure: 0.71
------------------
              precision    recall  f1-score   support

           0       0.33      0.50      0.40         6
           1       0.79      0.65      0.71        17

    accuracy                           0.61        23
   macro avg       0.56      0.57      0.55        23
weighted avg       0.67      0.61      0.63        23



In [459]:
# This model needs a random sample containing exactly the columns it was trained on, so generate a new one with prepareDataForPredictions

samplePrice = prepareDataForPredictions(X)
pdSamplePrice = pd.DataFrame.from_dict(samplePrice)
pdSamplePrice

Unnamed: 0,Previous_Open,Previous_Close,Previous_Search_Interest
0,54877,16067,65
1,36229,50501,83
2,27531,43100,65


In [460]:
predictions = logReg.predict(pdSamplePrice)

# Work on a copy so the sample frame keeps only the feature columns
pdPredicted = pdSamplePrice.copy()
pdPredicted["Price_Increase"] = predictions.astype(bool)
pdPredicted

Unnamed: 0,Previous_Open,Previous_Close,Previous_Search_Interest,Price_Increase
0,54877,16067,65,False
1,36229,50501,83,False
2,27531,43100,65,True


### John Deere (DE)
Dan Knobloch

#### Part 1: Regression - Linear Regression
##### Prepare the Data
Load the data, clean it up, prepare the features and targets, and split the data into training and test sets.

In [461]:
# read file, drop null values, convert binary values in trend and price to integers
JDStockTrend = pd.read_csv(f"{dataDir}DeereStockPrice.csv")
JDStockTrend.dropna(inplace=True)
JDStockTrend['trend_daily_increase'] = JDStockTrend.trend_daily_increase.astype(int)
JDStockTrend['price_daily_increase'] = JDStockTrend.price_daily_increase.astype(int)
JDStockTrend

Unnamed: 0,Date,Price,Trends,High,Volume,Previous_Close,trend_daily_increase,price_daily_increase
1,4/5/2021,374.809998,88,377.940002,1419500,372.12,1,1
2,4/6/2021,375.609985,83,381.839996,1309700,374.81,0,1
3,4/7/2021,374.790009,88,378.880005,1299300,375.61,1,0
4,4/8/2021,374.070007,79,374.529999,1250700,374.79,0,0
5,4/9/2021,377.0,82,378.079987,1224500,374.07,1,1
6,4/12/2021,378.26001,96,379.149994,1165100,377.0,1,1
7,4/13/2021,378.670013,90,384.399994,985100,378.26,0,1
8,4/14/2021,381.5,92,382.980011,1077100,378.67,1,1
9,4/15/2021,382.140015,88,385.950012,1127000,381.5,0,1
10,4/16/2021,383.070007,80,386.059998,1092000,382.14,0,1


In [462]:
# Determine the feature and target vectors. Here the target is the price, predicted from the previous day's closing price, volume traded, and trends data
featureColumns = ["Trends","Previous_Close","Volume","trend_daily_increase"]
target = "Price"

X=JDStockTrend[featureColumns]
y=JDStockTrend[target]

X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61 entries, 1 to 61
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Trends                61 non-null     int64  
 1   Previous_Close        61 non-null     float64
 2   Volume                61 non-null     int64  
 3   trend_daily_increase  61 non-null     int64  
dtypes: float64(1), int64(3)
memory usage: 2.4 KB


In [463]:
y

1     374.809998
2     375.609985
3     374.790009
4     374.070007
5     377.000000
6     378.260010
7     378.670013
8     381.500000
9     382.140015
10    383.070007
11    380.720001
12    370.269989
13    375.600006
14    368.359985
15    376.269989
16    380.450012
17    382.359985
18    379.799988
19    376.390015
20    370.850006
21    373.769989
22    379.579987
23    378.859985
24    389.910004
25    394.220001
26    391.399994
27    383.200012
28    373.630005
29    378.109985
30    384.000000
31    383.549988
32    369.690002
33    358.420013
34    355.220001
35    359.750000
36    359.359985
37    360.690002
38    357.739990
39    362.209991
40    361.100006
41    364.609985
42    356.709991
43    358.910004
44    356.640015
45    355.429993
46    356.579987
47    349.529999
48    341.440002
49    341.570007
50    335.540009
51    338.100006
52    336.559998
53    328.380005
54    328.970001
55    337.880005
56    342.100006
57    347.750000
58    350.619995
59    349.9899

In [464]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

#### Modeling with Linear Regression
Fit a line to our data set that minimizes the squared distance between the line and the points.
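
To make the idea concrete: ordinary least squares picks the slope and intercept that minimize the sum of squared vertical distances between the line and the points. The sketch below is a single-feature version in plain Python, whereas the notebook's `LinearRegression` handles several features at once. `fit_line` is a hypothetical helper, applied here to the first five `Previous_Close` and `Price` pairs from the table above.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


previous_close = [372.12, 374.81, 375.61, 374.79, 374.07]
price = [374.809998, 375.609985, 374.790009, 374.070007, 377.0]

slope, intercept = fit_line(previous_close, price)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")

# Residuals of a least-squares fit with an intercept sum to (numerically) zero
residual_sum = sum(p - (slope * x + intercept) for x, p in zip(previous_close, price))
print(f"sum of residuals: {residual_sum:.6f}")
```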

In [465]:
lr = LinearRegression()    # use this algorithm to fit the line through the data points
lr

LinearRegression()

In [466]:
lr.fit(X_train, y_train)
trainScore = lr.score(X_train, y_train)
testScore = lr.score(X_test, y_test)

print(f'the score with the training data set = {trainScore}')
print(f'the score with the test data set = {testScore}')

the score with the training data set = 0.9152033901019356
the score with the test data set = 0.8976667513186476


#### LR Metrics Output

In [467]:
predictions = lr.predict(X_test)
PrintMetricsRegression(y_test, predictions)

Score: 0.90
MAE: 4.29
RMSE: 5.18
r2: 0.90
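
For reference, the sketch below shows how MAE, RMSE and R² can be computed from true and predicted values. The numbers are made up for illustration and `regression_metrics` is a hypothetical helper; the definition of `PrintMetricsRegression` is not shown in this section, so this is not necessarily how it is implemented.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


y_true = [370.0, 375.0, 380.0, 385.0]  # made-up prices
y_pred = [372.0, 374.0, 383.0, 381.0]  # made-up predictions
mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}  r2: {r2:.2f}")
```

RMSE is never smaller than MAE, and the gap between them (5.18 versus 4.29 above) widens when a few errors are much larger than the rest.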


#### Predict some new samples

Define a few new samples.

In [468]:
numElements = 3
sampleStockTrend = []
for _ in range(numElements):
    sample = {}  # avoid shadowing the built-in dict
    for column in X.columns:
        minValue = 0  # assume min = 0
        maxValue = round(max(JDStockTrend[column].values))
        sample[column] = rnd.randint(minValue, maxValue)
    sampleStockTrend.append(sample)
sampleStockTrend

[{'Trends': 63,
  'Previous_Close': 335,
  'Volume': 1136452,
  'trend_daily_increase': 0},
 {'Trends': 28,
  'Previous_Close': 158,
  'Volume': 3137419,
  'trend_daily_increase': 0},
 {'Trends': 91,
  'Previous_Close': 286,
  'Volume': 1718015,
  'trend_daily_increase': 0}]

In [469]:
pdSampleStockTrend = pd.DataFrame.from_dict(sampleStockTrend)
pdSampleStockTrend

Unnamed: 0,Trends,Previous_Close,Volume,trend_daily_increase
0,63,335,1136452,0
1,28,158,3137419,0
2,91,286,1718015,0


In [470]:
predictions = lr.predict(pdSampleStockTrend)
predictions

array([335.6263195 , 167.580081  , 291.50703499])

In [471]:
pdPredictedStockTrend = pdSampleStockTrend.copy()
pdPredictedStockTrend['Predicted'] = predictions
pdPredictedStockTrend

Unnamed: 0,Trends,Previous_Close,Volume,trend_daily_increase,Predicted
0,63,335,1136452,0,335.626319
1,28,158,3137419,0,167.580081
2,91,286,1718015,0,291.507035


#### Part 2: Classification - Logistic Regression

##### Prepare the Data
Load the data, clean it up, prepare the features and targets, and split the data into training and test sets.

In [472]:
# Need to prepare the data separately for logistic regression because our feature and target vectors will be different
JDStockTrend

Unnamed: 0,Date,Price,Trends,High,Volume,Previous_Close,trend_daily_increase,price_daily_increase
1,4/5/2021,374.809998,88,377.940002,1419500,372.12,1,1
2,4/6/2021,375.609985,83,381.839996,1309700,374.81,0,1
3,4/7/2021,374.790009,88,378.880005,1299300,375.61,1,0
4,4/8/2021,374.070007,79,374.529999,1250700,374.79,0,0
5,4/9/2021,377.0,82,378.079987,1224500,374.07,1,1
6,4/12/2021,378.26001,96,379.149994,1165100,377.0,1,1
7,4/13/2021,378.670013,90,384.399994,985100,378.26,0,1
8,4/14/2021,381.5,92,382.980011,1077100,378.67,1,1
9,4/15/2021,382.140015,88,385.950012,1127000,381.5,0,1
10,4/16/2021,383.070007,80,386.059998,1092000,382.14,0,1


In [473]:
# Determine the feature and target vectors. Here the target is a classification of whether the price increased, based on the previous day's closing price, volume traded, and search trends data
featureColumns = ["Trends","Previous_Close","Volume","trend_daily_increase"]
target = "price_daily_increase"

X=JDStockTrend[featureColumns]
y=JDStockTrend[target]

X

Unnamed: 0,Trends,Previous_Close,Volume,trend_daily_increase
1,88,372.12,1419500,1
2,83,374.81,1309700,0
3,88,375.61,1299300,1
4,79,374.79,1250700,0
5,82,374.07,1224500,1
6,96,377.0,1165100,1
7,90,378.26,985100,0
8,92,378.67,1077100,1
9,88,381.5,1127000,0
10,80,382.14,1092000,0


In [474]:
y

1     1
2     1
3     0
4     0
5     1
6     1
7     1
8     1
9     1
10    1
11    0
12    0
13    1
14    0
15    1
16    1
17    1
18    0
19    0
20    0
21    1
22    1
23    0
24    1
25    1
26    0
27    0
28    0
29    1
30    1
31    0
32    0
33    0
34    0
35    1
36    0
37    1
38    0
39    1
40    0
41    1
42    0
43    1
44    0
45    0
46    1
47    0
48    0
49    1
50    0
51    1
52    0
53    0
54    1
55    1
56    1
57    1
58    1
59    0
60    0
61    0
Name: price_daily_increase, dtype: int64

In [475]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
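
Unlike the split in Part 1, this call does not pass `random_state`, so each run shuffles the rows differently and the scores below can change from run to run. The standard-library sketch below is not scikit-learn's implementation; it only illustrates how fixing a seed makes a shuffle-based split repeatable. `seeded_split` is a hypothetical helper, and rounding the test size up is an assumption chosen to match the 16 test rows reported in the metrics further down.

```python
import math
import random


def seeded_split(rows, test_size, seed=None):
    """Shuffle a copy of rows and split it; the same seed gives the same split."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)  # assumption: round the test share up
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(1, 62))  # stand-in for the 61 row labels of JDStockTrend
train_a, test_a = seeded_split(rows, 0.25, seed=1)
train_b, test_b = seeded_split(rows, 0.25, seed=1)

print(len(train_a), len(test_a))
print("same test rows with the same seed:", test_a == test_b)
```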

#### Modeling with Logistic Regression (classification)

In [476]:
lr = LogisticRegression(solver="liblinear")
lr

LogisticRegression(solver='liblinear')

In [477]:
lr.fit(X_train, y_train)

LogisticRegression(solver='liblinear')

In [478]:
lr.score(X_train, y_train)

0.5777777777777777

In [479]:
lr.score(X_test, y_test)

0.5

#### Classification Metrics Output

In [480]:
predictions = lr.predict(X_test)
PrintMetricsClassiffication(y_test, predictions)

Confusion Matrix:
[[4 4]
 [4 4]]
------------------
Accuracy: 0.50
Recall: 0.50
Prediction: 0.50
f-measure: 0.50
------------------
              precision    recall  f1-score   support

           0       0.50      0.50      0.50         8
           1       0.50      0.50      0.50         8

    accuracy                           0.50        16
   macro avg       0.50      0.50      0.50        16
weighted avg       0.50      0.50      0.50        16



#### Predict with the samples that were generated above

In [481]:
sampleStockTrend

[{'Trends': 63,
  'Previous_Close': 335,
  'Volume': 1136452,
  'trend_daily_increase': 0},
 {'Trends': 28,
  'Previous_Close': 158,
  'Volume': 3137419,
  'trend_daily_increase': 0},
 {'Trends': 91,
  'Previous_Close': 286,
  'Volume': 1718015,
  'trend_daily_increase': 0}]

In [482]:
pdSampleStockTrend = pd.DataFrame.from_dict(sampleStockTrend)

In [483]:
predictions = lr.predict(pdSampleStockTrend)
predictions

array([1, 0, 0])

In [484]:
pdPredictedStockTrend = pdSampleStockTrend.copy()
pdPredictedStockTrend["price_daily_increase"] = predictions.astype(bool)
pdPredictedStockTrend

Unnamed: 0,Trends,Previous_Close,Volume,trend_daily_increase,price_daily_increase
0,63,335,1136452,0,True
1,28,158,3137419,0,False
2,91,286,1718015,0,False


#### Conclusion

Comparing the results of the two methods (linear regression and logistic regression for classification), the data suggests that linear regression is a strong candidate for predicting the price of John Deere. In each linear regression run the score is around 90%, and the predictions for the generated sample data also look like reasonable closing prices. The logistic model (for classification) scores much lower: 50% on the test set in the run shown above, which is no better than chance. Even so, a stronger classification model would in some ways be more valuable to a day trader, because it would give them the confidence to buy shares of a company and make money in a short amount of time.

