Purpose

The purpose of this notebook is to provide practice with scikit-learn using both linear regression and logistic regression models.  In this example, Apple stock (AAPL) is used to determine whether there is a correlation between Google search interest in Apple stock and the price of the stock itself.

Imports

In [1]:
import random as rnd
rnd.seed(1024)
import math
import pandas as pd
from pytrends.request import TrendReq
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, recall_score, precision_score, fbeta_score, classification_report
from sklearn.metrics import explained_variance_score, mean_absolute_error, r2_score, mean_squared_error

Helper Functions

In [2]:
def createCategoricalDummies(dataFrame, categoryList):
    return pd.get_dummies(dataFrame[categoryList], prefix_sep = "::", drop_first = True)

def PrintMetricsRegression(test, predictions):
    print(f"Score: {explained_variance_score(test, predictions):.2f}")
    print(f"MAE: {mean_absolute_error(test, predictions):.2f}")
    print(f"RMSE: {math.sqrt(mean_squared_error(test, predictions)):.2f}")
    print(f"r2: {r2_score(test, predictions):.2f}")
    
def PrintMetricsClassiffication(test, predictions):
    print("Confusion Matrix:")
    print(confusion_matrix(test, predictions))
    print("------------------")
    print(f"Accuracy: {accuracy_score(test, predictions):.2f}")
    print(f"Recall: {recall_score(test, predictions):.2f}")
    # Labeled "Prediction" in the recorded output; the value printed is the precision score
    print(f"Prediction: {precision_score(test, predictions):.2f}")
    print(f"f-measure: {fbeta_score(test, predictions, beta=1):.2f}")
    print("------------------")
    print(classification_report(test, predictions))

Pull Google Search Interest for Apple for the Second Quarter of 2021

In [3]:
pytrends = TrendReq(hl='en-US', tz=360)

# Build the list of keywords; in this case only AAPL is used
kw_list = ["AAPL"] 

# build the payload
pytrends.build_payload(kw_list, timeframe='2021-04-01 2021-06-30', geo='US')

# Store interest over time information in df and rename AAPL column to Search Interest
AppleTrendsdf = pytrends.interest_over_time()
AppleTrendsdf = AppleTrendsdf.rename(columns={"AAPL": 'Search Interest'})
AppleTrendsdf.reset_index(inplace=True, drop=True)

Prepare the Data

In [4]:
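# Removes weekends and market holidays from AppleTrendsdf, since the stock only trades on business days.
# Note: drop() matches index labels, and labels do not shift after earlier drops,
# so the fixed stride of 5 below does not track calendar weekends exactly.
Index = 2
while Index < len(AppleTrendsdf):
    Next = Index + 1
    if Next < len(AppleTrendsdf):
        AppleTrendsdf.drop([Index, Next], axis=0, inplace=True)
        Index += 5
    else:
        AppleTrendsdf.drop([Index], axis=0, inplace=True)
        Index += 5
# Assuming one row per calendar day starting 2021-04-01, labels 1 and 60 are
# April 2 (Good Friday) and May 31 (Memorial Day), both market holidays
AppleTrendsdf.drop([1, 60], axis=0, inplace=True)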
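
The cell above removes rows by position. One alternative is to filter by date, which would require keeping the date index instead of discarding it with reset_index(drop=True). The following is a minimal sketch on stand-in data: the `trends` frame, the `holidays` list, and the business-day calendar are assumptions for illustration, not the notebook's data.

```python
import pandas as pd

# Hypothetical stand-in for the Google Trends frame: one row per calendar day
days = pd.date_range("2021-04-01", "2021-06-30", freq="D")
trends = pd.DataFrame({"Search Interest": range(len(days))}, index=days)

# Hypothetical stand-in for the trading calendar: weekdays minus two market holidays
holidays = pd.to_datetime(["2021-04-02", "2021-05-31"])
trading_days = pd.bdate_range("2021-04-01", "2021-06-30").difference(holidays)

# Keep only the rows whose date is a trading day
aligned = trends[trends.index.isin(trading_days)]
print(len(trends), len(aligned))
```

With the real data, the trading dates could instead come from the Date column of the stock CSV, so that the two frames are matched on actual dates rather than on row position.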


In [5]:
#Import Apple stock data from csv file 
AppleStockdf = pd.read_csv("https://raw.githubusercontent.com/atlas125gev/StockProject/main/Ken/AAPL.csv")
AppleStockdf

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,4/1/2021,123.660004,124.180000,122.489998,123.000000,122.791428,75089100
1,4/5/2021,123.870003,126.160004,123.070000,125.900002,125.686516,88651200
2,4/6/2021,126.500000,127.129997,125.650002,126.209999,125.995987,80171300
3,4/7/2021,125.830002,127.919998,125.139999,127.900002,127.683121,83466700
4,4/8/2021,128.949997,130.389999,128.520004,130.360001,130.138947,88844600
...,...,...,...,...,...,...,...
58,6/24/2021,134.449997,134.639999,132.929993,133.410004,133.410004,68711000
59,6/25/2021,133.460007,133.889999,132.809998,133.110001,133.110001,70783700
60,6/28/2021,133.410004,135.250000,133.350006,134.779999,134.779999,62111300
61,6/29/2021,134.800003,136.490005,134.350006,136.330002,136.330002,64556100


In [6]:
# Resets the index for the AppleTrends data frame and merges with the Apple stock data for the same time frame
AppleTrendsdf.reset_index(inplace=True, drop=True)
MergedStockApple = pd.concat([AppleStockdf, AppleTrendsdf], axis=1)


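# Adds columns measuring how far each day's search interest and closing price are from their means,
# plus a flag for above-average search interest. "Daily Price Range" is Open minus Close,
# so it is negative on days when the price closed above its open.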
MeanSearchInterest = AppleTrendsdf["Search Interest"].mean()
MeanPrice = MergedStockApple["Close"].mean()
MergedStockApple["Interest Points Away From Mean"] = MergedStockApple["Search Interest"] - MeanSearchInterest
MergedStockApple["Daily Price Range"] = MergedStockApple["Open"] - MergedStockApple["Close"]
MergedStockApple["Price Points Away from Mean"] =  MergedStockApple["Close"] - MeanPrice 
MergedStockApple["Search Interests Above Average"] = MergedStockApple["Interest Points Away From Mean"] > 0.0
MergedStockApple

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Search Interest,isPartial,Interest Points Away From Mean,Daily Price Range,Price Points Away from Mean,Search Interests Above Average
0,4/1/2021,123.660004,124.180000,122.489998,123.000000,122.791428,75089100,37,False,8.730159,0.660004,-6.568889,True
1,4/5/2021,123.870003,126.160004,123.070000,125.900002,125.686516,88651200,44,False,15.730159,-2.029999,-3.668887,True
2,4/6/2021,126.500000,127.129997,125.650002,126.209999,125.995987,80171300,43,False,14.730159,0.290001,-3.358890,True
3,4/7/2021,125.830002,127.919998,125.139999,127.900002,127.683121,83466700,39,False,10.730159,-2.070000,-1.668887,True
4,4/8/2021,128.949997,130.389999,128.520004,130.360001,130.138947,88844600,9,False,-19.269841,-1.410004,0.791112,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,6/24/2021,134.449997,134.639999,132.929993,133.410004,133.410004,68711000,6,False,-22.269841,1.039993,3.841115,False
59,6/25/2021,133.460007,133.889999,132.809998,133.110001,133.110001,70783700,4,False,-24.269841,0.350006,3.541112,False
60,6/28/2021,133.410004,135.250000,133.350006,134.779999,134.779999,62111300,29,False,0.730159,-1.369995,5.211110,True
61,6/29/2021,134.800003,136.490005,134.350006,136.330002,136.330002,64556100,34,False,5.730159,-1.529999,6.761113,True


In [7]:
# Convert the boolean column to a 0/1 dummy column. drop_first=True drops the False level, leaving a single column labeled True. The info() output shows no object or boolean dtypes remain.
categories = "Search Interests Above Average"
MergedStockApple = pd.concat(
    [MergedStockApple.drop(categories, axis=1), createCategoricalDummies(MergedStockApple, categories)], axis= 1)
MergedStockApple.drop(["Date", "isPartial"], axis=1, inplace=True)
MergedStockApple.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 11 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Open                            63 non-null     float64
 1   High                            63 non-null     float64
 2   Low                             63 non-null     float64
 3   Close                           63 non-null     float64
 4   Adj Close                       63 non-null     float64
 5   Volume                          63 non-null     int64  
 6   Search Interest                 63 non-null     int32  
 7   Interest Points Away From Mean  63 non-null     float64
 8   Daily Price Range               63 non-null     float64
 9   Price Points Away from Mean     63 non-null     float64
 10  True                            63 non-null     uint8  
dtypes: float64(8), int32(1), int64(1), uint8(1)
memory usage: 4.9 KB
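
The column labeled True in the listing above comes from how get_dummies names its output when given a single boolean Series. A small sketch on toy data (the `flags` values are made up) shows the effect, along with a plain integer cast as one possible alternative that keeps the original column name:

```python
import pandas as pd

# Toy boolean column standing in for "Search Interests Above Average"
flags = pd.Series([True, False, True, True], name="Search Interests Above Average")

# Same call as createCategoricalDummies: the False level is dropped,
# and the surviving column is labeled with the boolean value True
dummies = pd.get_dummies(flags, prefix_sep="::", drop_first=True)
print(list(dummies.columns))

# Alternative (not the notebook's approach): cast to int and keep the name
as_int = flags.astype(int)
print(as_int.name, as_int.tolist())
```

This is why later cells select the column with the boolean key True rather than with a descriptive string.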


In [8]:
# Select the feature columns: search interest plus the day's low, high, and opening prices
Features = ["Search Interest", "Low", "High", "Open"]
X = MergedStockApple[Features]
X

Unnamed: 0,Search Interest,Low,High,Open
0,37,122.489998,124.180000,123.660004
1,44,123.070000,126.160004,123.870003
2,43,125.650002,127.129997,126.500000
3,39,125.139999,127.919998,125.830002
4,9,128.520004,130.389999,128.949997
...,...,...,...,...
58,6,132.929993,134.639999,134.449997
59,4,132.809998,133.889999,133.460007
60,29,133.350006,135.250000,133.410004
61,34,134.350006,136.490005,134.800003


In [9]:
# Select the target for regression: the closing price
Target = "Close"
y = MergedStockApple[Target]
y

0     123.000000
1     125.900002
2     126.209999
3     127.900002
4     130.360001
         ...    
58    133.410004
59    133.110001
60    134.779999
61    136.330002
62    136.960007
Name: Close, Length: 63, dtype: float64

In [10]:
# Split the features and target into training (75%) and test (25%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
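
The imports cell calls rnd.seed(1024), which seeds Python's random module. To my knowledge scikit-learn's shuffling draws from NumPy's random state instead, so the split above can differ between runs. A hedged sketch of a repeatable split follows; the toy frame and the random_state value are assumptions added here, not part of the notebook's call.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the same number of rows as the notebook's data (63)
X_demo = pd.DataFrame({"feature": range(63)})
y_demo = pd.Series(range(63), name="target")

# random_state is an assumption added here; the notebook's call omits it
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1024)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1024)

# Both calls select the same test rows
print(split_a[1].index.equals(split_b[1].index))
print(len(split_a[0]), len(split_a[1]))
```

With 63 rows and test_size=0.25, the test set holds 16 rows, which matches the support total in the classification report later in the notebook.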


Linear Regression

In [11]:
# Create the linear regression estimator, named lr for brevity
lr = LinearRegression()
lr

LinearRegression()

In [12]:
# Fit the linear regression model to the training data
lr.fit(X_train, y_train)

LinearRegression()

In [13]:
# R^2 of the fitted model on the training data
lr.score(X_train, y_train) 

0.9802427410135256

In [14]:
# R^2 of the fitted model on the held-out test data
lr.score(X_test, y_test)

0.97072762662332

In [15]:
# Compute regression metrics on the test set to indicate the quality of the model
PredictionsRegression = lr.predict(X_test)
PrintMetricsRegression(y_test, PredictionsRegression)

Score: 0.97
MAE: 0.47
RMSE: 0.58
r2: 0.97
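
For reference, the four quantities printed by PrintMetricsRegression can be computed by hand. The sketch below uses toy values (not taken from the notebook) and plain Python; it also shows that "Score" (explained variance) and r2 differ whenever the mean prediction error is not zero.

```python
import math

# Toy values, not taken from the notebook
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 8.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]
mean_true = sum(y_true) / n
mean_err = sum(errors) / n

# Mean absolute error and root mean squared error
mae = sum(abs(e) for e in errors) / n
rmse = math.sqrt(sum(e * e for e in errors) / n)

# R^2 compares the residual sum of squares with the total sum of squares
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

# Explained variance subtracts the mean error before squaring
var_err = sum((e - mean_err) ** 2 for e in errors) / n
var_true = ss_tot / n
explained_variance = 1 - var_err / var_true

print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}  r2: {r2:.3f}  Score: {explained_variance:.4f}")
```

In the notebook's output both Score and r2 round to 0.97, which suggests the average prediction error on the test set is close to zero.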


Linear Regression Predicting New Samples

In [16]:
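# Generate random integer values for each feature column to use as new samples.
# Each feature is drawn independently from 0 up to its observed maximum,
# so a sample can be internally inconsistent (for example, Low above High).
NumberElements = 3
SampleStockTrendRegression = []
for _ in range(NumberElements):
    sample = {}  # avoids shadowing the built-in dict
    for column in X.columns:
        minValue = 0  # avoids shadowing the built-in min
        maxValue = round(max(MergedStockApple[column].values))
        sample[column] = rnd.randint(minValue, maxValue)
    SampleStockTrendRegression.append(sample)
SampleStockTrendRegression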

[{'Search Interest': 51, 'Low': 4, 'High': 123, 'Open': 99},
 {'Search Interest': 20, 'Low': 133, 'High': 25, 'Open': 113},
 {'Search Interest': 32, 'Low': 93, 'High': 94, 'Open': 99}]

In [17]:
# Build a DataFrame from the sampled feature values created above
pdSampleStockTrendRegression = pd.DataFrame.from_dict(SampleStockTrendRegression)
pdSampleStockTrendRegression

Unnamed: 0,Search Interest,Low,High,Open
0,51,4,123,99
1,20,133,25,113
2,32,93,94,99


In [18]:
# Predict the closing price for each sample with the fitted model
PredictionsRegression = lr.predict(pdSampleStockTrendRegression)
PredictionsRegression

array([56.63626172, 32.6311884 , 88.78436126])

In [19]:
# Add the predicted values to a copy of the sample DataFrame
pdPredictedStockTrend = pdSampleStockTrendRegression.copy()
pdPredictedStockTrend["Predicted"] = PredictionsRegression
pdPredictedStockTrend

Unnamed: 0,Search Interest,Low,High,Open,Predicted
0,51,4,123,99,56.636262
1,20,133,25,113,32.631188
2,32,93,94,99,88.784361
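
The samples above are drawn from 0 up to each column's maximum, so most of them fall well outside the price range the model was trained on. One possible variation is to draw each feature from within its observed range. The sketch below assumes hypothetical per-feature bounds (`observed_ranges` is made up for illustration); in the notebook the bounds could be taken from each column's minimum and maximum. It still draws features independently, so it does not guarantee that Low stays below High.

```python
import random

rng = random.Random(1024)

# Hypothetical (min, max) bounds per feature, for illustration only
observed_ranges = {
    "Search Interest": (0, 100),
    "Low": (120.0, 135.0),
    "High": (124.0, 137.0),
    "Open": (123.0, 136.0),
}

samples = []
for _ in range(3):
    # Draw one value per feature, uniformly within that feature's bounds
    sample = {
        column: round(rng.uniform(low, high), 2)
        for column, (low, high) in observed_ranges.items()
    }
    samples.append(sample)

print(samples)
```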


Classification Logistic Regression

In [20]:
# Redefine the target for logistic regression: the 0/1 dummy column (labeled True) that flags above-average search interest
Target = True
y = MergedStockApple[Target]
y

0     1
1     1
2     1
3     1
4     0
     ..
58    0
59    0
60    1
61    1
62    1
Name: True, Length: 63, dtype: uint8
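
This target is computed from Search Interest, which is also one of the features, so a classifier can in principle recover it from that single column. A quick check on the rows displayed earlier illustrates the relationship. The values are copied from the merged table, and the mean is the one implied by its first row.

```python
# Values copied from the "Search Interest" and "Search Interests Above Average"
# columns of the rows displayed in the merged table
search_interest = [37, 44, 43, 39, 9, 6, 4, 29, 34]
above_average = [True, True, True, True, False, False, False, True, True]

# Mean implied by the first displayed row: 37 - 8.730159
mean_interest = 37 - 8.730159

# A single threshold on the feature reproduces the label
rule = [value > mean_interest for value in search_interest]
print(rule == above_average)
```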

In [21]:
# Split the features and the new target into training (75%) and test (25%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [22]:
# Create the logistic regression estimator with the liblinear solver, reusing the name lr
lr = LogisticRegression(solver="liblinear")
lr

LogisticRegression(solver='liblinear')

In [23]:
# Fit the logistic regression model to the training data
lr.fit(X_train, y_train)

LogisticRegression(solver='liblinear')

In [24]:
# Accuracy of the logistic regression model on the training data
lr.score(X_train, y_train)

1.0

In [25]:
# Accuracy of the logistic regression model on the test data
lr.score(X_test, y_test)

0.9375

In [26]:
# Compute classification metrics and the confusion matrix on the test set
PredictionsClassification = lr.predict(X_test)
PrintMetricsClassiffication(y_test, PredictionsClassification)

Confusion Matrix:
[[ 4  0]
 [ 1 11]]
------------------
Accuracy: 0.94
Recall: 0.92
Prediction: 1.00
f-measure: 0.96
------------------
              precision    recall  f1-score   support

           0       0.80      1.00      0.89         4
           1       1.00      0.92      0.96        12

    accuracy                           0.94        16
   macro avg       0.90      0.96      0.92        16
weighted avg       0.95      0.94      0.94        16
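
The summary figures can be reproduced from the confusion matrix alone. The sketch below uses the four counts from the output above; note that the helper prints the precision score under the label "Prediction".

```python
# Confusion matrix from the output above: rows are actual classes, columns are predicted
tn, fp = 4, 0
fn, tp = 1, 11

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall: {recall:.2f}")
print(f"Precision: {precision:.2f}")
print(f"f-measure: {f1:.2f}")
```

The single error is one above-average day that the model classified as below average, which lowers recall for class 1 while leaving its precision at 1.00.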



Classification Predicting New Samples

In [27]:
# Generate random integer values for each feature column to use as new samples
numElements = 3
SampleStockTrendsClassification = []
for _ in range(numElements):
    sample = {}  # avoids shadowing the built-in dict
    for column in X.columns:
        minValue = 0  # lower bound used for every feature
        maxValue = round(max(MergedStockApple[column].values))
        sample[column] = rnd.randint(minValue, maxValue)
    SampleStockTrendsClassification.append(sample)
SampleStockTrendsClassification

[{'Search Interest': 50, 'Low': 24, 'High': 35, 'Open': 99},
 {'Search Interest': 6, 'Low': 104, 'High': 39, 'Open': 111},
 {'Search Interest': 30, 'Low': 114, 'High': 134, 'Open': 88}]

In [28]:
# Build a DataFrame from the sampled feature values created above
pdSampleStockTrendsClassification = pd.DataFrame.from_dict(SampleStockTrendsClassification)

In [29]:
# Predict whether search interest is above average for each sample
predictionsClassification = lr.predict(pdSampleStockTrendsClassification)
predictionsClassification

array([1, 0, 0], dtype=uint8)

In [30]:
# Add the predicted values to a copy of the sample DataFrame
pdPredictedStockTrendsClassification = pdSampleStockTrendsClassification.copy()
pdPredictedStockTrendsClassification["Predicted"] = predictionsClassification.astype(bool)
pdPredictedStockTrendsClassification

Unnamed: 0,Search Interest,Low,High,Open,Predicted
0,50,24,35,99,True
1,6,104,39,111,False
2,30,114,134,88,False


Conclusion

Both the linear regression and logistic regression models indicate a reasonable correlation between Google search activity, the price points during the day's trading, and the closing price for the dataset investigated in this notebook.  This could indicate that Apple is a stable stock, in that the Google search interest and the price change throughout the trading day are fairly consistent.  Further investigation in this project will include less stable stocks, which may help answer whether this is true.


