# Knobloch - Homework 7: Group Assignment to Introduce Machine Learning

## Purpose

This notebook contains Dan Knobloch's portion of the Homework 7 group notebook.
The intent is to practice with the scikit-learn library and to build an overall understanding of the classification and regression topics in machine learning.

## Primary Library

* scikit-learn (`sklearn`)



## Imports & Helpers

In [1]:
import pandas as pd
from seaborn import load_dataset
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression

#imports for regression model output analysis:
import math
from sklearn.metrics import explained_variance_score, mean_absolute_error, r2_score, mean_squared_error

#imports for Classification model output analysis:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, recall_score, precision_score, fbeta_score, classification_report

In [2]:
# Helper methods
def createCategoricalDummies(dataFrame, categoryList):
    return pd.get_dummies(dataFrame[categoryList], prefix_sep = "::", drop_first = True)
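
To make the effect of the helper above concrete, here is a minimal sketch of the same `pd.get_dummies` call on a toy frame. The `weekday` column and its values are hypothetical and do not come from the stock data.

```python
import pandas as pd

# Hypothetical toy frame; the column name and values are made up for illustration.
toy = pd.DataFrame({"weekday": ["Mon", "Tue", "Wed", "Mon"]})

# Same arguments as createCategoricalDummies: one indicator column per category,
# named "<column>::<value>", with the first category (in sorted order) dropped.
dummies = pd.get_dummies(toy[["weekday"]], prefix_sep="::", drop_first=True)

print(list(dummies.columns))
```

Dropping the first category avoids a redundant column: a row that is neither `Tue` nor `Wed` must be `Mon`.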

In [3]:
def printRegressionMetrics(test, predictions):
    print(f"Score: {explained_variance_score(test, predictions):.2f}")
    print(f"MAE: {mean_absolute_error(test, predictions):.2f}")
    print(f"RMSE: {math.sqrt(mean_squared_error(test, predictions)):.2f}")
    print(f"r2: {r2_score(test, predictions):.2f}")
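
As a rough illustration of what the four numbers printed by `printRegressionMetrics` mean, the sketch below recomputes them by hand with NumPy and compares them to the scikit-learn functions. The `y_true` and `y_pred` values are made up for the example.

```python
import math
import numpy as np
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)

# Made-up values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

errors = y_true - y_pred

# MAE: average absolute error, in the units of the target.
mae = np.mean(np.abs(errors))
# RMSE: square root of the average squared error; penalizes large misses more.
rmse = math.sqrt(np.mean(errors ** 2))
# R^2: 1 minus (residual sum of squares / total sum of squares around the mean).
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
# Explained variance: like R^2, but ignores a constant offset in the errors.
explained_variance = 1 - np.var(errors) / np.var(y_true)

print(mae, rmse, r2, explained_variance)
```

Explained variance and R^2 coincide when the errors average to zero, which is why the two values are often nearly identical.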

In [4]:
def printClassificationMetrics(test, predictions):
    print("Confusion Matrix:")
    print(confusion_matrix(test, predictions))
    print("------------------")
    print(f"Accuracy: {accuracy_score(test, predictions):.2f}")
    print(f"Recall: {recall_score(test, predictions):.2f}")
    print(f"Precision: {precision_score(test, predictions):.2f}")
    print(f"f-measure: {fbeta_score(test, predictions, beta=1):.2f}")
    print("------------------")
    print(classification_report(test, predictions))

# Part 1: Regression - Linear Regression
## Prepare the Data
Load the data, clean it up, prepare the features and target, and split the rows into training and test sets.

In [5]:
# read file, drop null values, convert binary values in trend and price to integers 
JDStockTrend = pd.read_csv(r"C:\Users\dk12955\BAISsummer2021\ClassProject\DeereStockPrice.csv")
JDStockTrend.dropna(inplace=True)
JDStockTrend['trend_daily_increase'] = JDStockTrend.trend_daily_increase.astype(int)
JDStockTrend['price_daily_increase'] = JDStockTrend.price_daily_increase.astype(int)
JDStockTrend

| | Date | Price | Trends | High | Volume | Previous_Close | trend_daily_increase | price_daily_increase |
|---|---|---|---|---|---|---|---|---|
| 1 | 4/5/2021 | 374.809998 | 88 | 377.940002 | 1419500 | 372.12 | 1 | 1 |
| 2 | 4/6/2021 | 375.609985 | 83 | 381.839996 | 1309700 | 374.81 | 0 | 1 |
| 3 | 4/7/2021 | 374.790009 | 88 | 378.880005 | 1299300 | 375.61 | 1 | 0 |
| 4 | 4/8/2021 | 374.070007 | 79 | 374.529999 | 1250700 | 374.79 | 0 | 0 |
| 5 | 4/9/2021 | 377.000000 | 82 | 378.079987 | 1224500 | 374.07 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 57 | 6/23/2021 | 347.750000 | 71 | 349.170013 | 3352700 | 342.10 | 0 | 1 |
| 58 | 6/24/2021 | 350.619995 | 74 | 354.739990 | 2140200 | 347.75 | 1 | 1 |
| 59 | 6/25/2021 | 349.989990 | 80 | 355.890015 | 6604800 | 350.62 | 1 | 0 |
| 60 | 6/28/2021 | 349.890015 | 72 | 350.950012 | 1304200 | 349.99 | 0 | 0 |


In [6]:
# Choose the feature columns and the target column. Here the target is the price,
# predicted from the previous day's closing price, the volume traded, and the search trends data.
featureColumns = ["Trends","Previous_Close","Volume","trend_daily_increase"]
target = "Price"

X=JDStockTrend[featureColumns]
y=JDStockTrend[target]

X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61 entries, 1 to 61
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Trends                61 non-null     int64  
 1   Previous_Close        61 non-null     float64
 2   Volume                61 non-null     int64  
 3   trend_daily_increase  61 non-null     int32  
dtypes: float64(1), int32(1), int64(2)
memory usage: 2.1 KB


In [7]:
y

1     374.809998
2     375.609985
3     374.790009
4     374.070007
5     377.000000
         ...    
57    347.750000
58    350.619995
59    349.989990
60    349.890015
61    348.929993
Name: Price, Length: 61, dtype: float64

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
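
For a sense of how many rows end up on each side of this split, here is a small sketch on stand-in arrays with the same number of rows (61) as the stock data. `X_demo` and `y_demo` are hypothetical placeholders, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with 61 rows, matching the row count reported by X.info().
X_demo = np.arange(61 * 2).reshape(61, 2)
y_demo = np.arange(61)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

# With test_size=0.25 the test size is rounded up, so 61 rows give 16 test rows.
print(len(X_tr), len(X_te))
```

With so few test rows, each test-set metric in this notebook rests on only 16 observations, so the numbers should be read with some caution.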

## Modeling with Linear Regression

Fit a line to the data set that minimizes the sum of squared vertical distances between the line and the data points (ordinary least squares).
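
The least-squares idea can be sketched directly with NumPy and compared against `LinearRegression`. The data below is synthetic and the coefficients (3, 2, -1) are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3 + 2*x1 - 1*x2 + small noise (arbitrary illustrative values).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X_demo, y_demo)

# Solve the same least-squares problem by hand: prepend a column of ones for the
# intercept, then find the coefficients that minimize the sum of squared residuals.
A = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print(model.intercept_, model.coef_)
print(beta)
```

Both approaches should produce the same intercept and coefficients up to floating-point error.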

In [9]:
lr = LinearRegression()    # ordinary least squares model used to fit the line through the data points
lr

LinearRegression()

In [10]:
lr.fit(X_train, y_train)
trainScore = lr.score(X_train, y_train) 
testScore = lr.score(X_test, y_test) 

print(f'the score with the training data set = {trainScore}')
print(f'the score with the test data set = {testScore}')

the score with the training data set = 0.9152033901019357
the score with the test data set = 0.8976667513188491
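
The value returned by `LinearRegression.score` is the coefficient of determination (R^2), the same quantity that `r2_score` computes from predictions. A minimal check on synthetic data, where every name and value is illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data with arbitrary coefficients, for illustration only.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=40)

# Fit on the first 30 rows and evaluate on the remaining 10.
model = LinearRegression().fit(X_demo[:30], y_demo[:30])

score_value = model.score(X_demo[30:], y_demo[30:])
r2_value = r2_score(y_demo[30:], model.predict(X_demo[30:]))

print(score_value, r2_value)
```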


## LR Metrics Output

In [11]:
predictions = lr.predict(X_test)
printRegressionMetrics(y_test, predictions)

Score: 0.90
MAE: 4.29
RMSE: 5.18
r2: 0.90


## Predict some new samples

Define a few new samples.

In [12]:
import random as rnd
rnd.seed(1024)

In [13]:
numElements = 3
sampleStockTrend = []
for _ in range(numElements):
    sample = {}  # named "sample" so the built-in dict is not shadowed
    for column in X.columns:
        minValue = 0  # assume the minimum is 0; named so the built-in min is not shadowed
        maxValue = round(max(JDStockTrend[column].values))
        sample[column] = rnd.randint(minValue, maxValue)
    sampleStockTrend.append(sample)
sampleStockTrend

[{'Trends': 2,
  'Previous_Close': 247,
  'Volume': 3266757,
  'trend_daily_increase': 1},
 {'Trends': 66,
  'Previous_Close': 51,
  'Volume': 3733943,
  'trend_daily_increase': 1},
 {'Trends': 92,
  'Previous_Close': 366,
  'Volume': 3093938,
  'trend_daily_increase': 1}]
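
Because the lower bound is fixed at 0, some generated values (for example a `Previous_Close` of 51) fall far below the closing prices in the rows shown earlier, so the model has to extrapolate. One possible variation, sketched here under the assumption that staying inside each column's observed range is preferable, draws between the column minimum and maximum. The helper name `sampleWithinObservedRange` and the small `demo` frame (copied from the first five rows displayed above) are hypothetical.

```python
import math
import random as rnd
import pandas as pd

# Stand-in frame built from the first five rows displayed above.
demo = pd.DataFrame({
    "Trends": [88, 83, 88, 79, 82],
    "Previous_Close": [372.12, 374.81, 375.61, 374.79, 374.07],
    "Volume": [1419500, 1309700, 1299300, 1250700, 1224500],
    "trend_daily_increase": [1, 0, 1, 0, 1],
})

def sampleWithinObservedRange(dataFrame, numElements):
    """Draw integer samples between each column's observed minimum and maximum."""
    samples = []
    for _ in range(numElements):
        sample = {}
        for column in dataFrame.columns:
            low = int(math.floor(dataFrame[column].min()))
            high = int(math.ceil(dataFrame[column].max()))
            sample[column] = rnd.randint(low, high)
        samples.append(sample)
    return samples

rnd.seed(1024)
samples = sampleWithinObservedRange(demo, 3)
print(samples)
```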

In [14]:
pdSampleStockTrend = pd.DataFrame.from_dict(sampleStockTrend)
pdSampleStockTrend

| | Trends | Previous_Close | Volume | trend_daily_increase |
|---|---|---|---|---|
| 0 | 2 | 247 | 3266757 | 1 |
| 1 | 66 | 51 | 3733943 | 1 |
| 2 | 92 | 366 | 3093938 | 1 |


In [15]:
predictions = lr.predict(pdSampleStockTrend)
predictions

array([250.28101518,  72.93647955, 366.1470389 ])

In [16]:
pdPredictedStockTrend = pdSampleStockTrend.copy()
pdPredictedStockTrend['Predicted'] = predictions
pdPredictedStockTrend

| | Trends | Previous_Close | Volume | trend_daily_increase | Predicted |
|---|---|---|---|---|---|
| 0 | 2 | 247 | 3266757 | 1 | 250.281015 |
| 1 | 66 | 51 | 3733943 | 1 | 72.93648 |
| 2 | 92 | 366 | 3093938 | 1 | 366.147039 |


# Part 2: Classification - Logistic Regression

## Prepare the Data
Load the data, clean it up, prepare the features and target, and split the rows into training and test sets.

In [17]:
# The data must be prepared separately for logistic regression because the target vector is different
JDStockTrend

| | Date | Price | Trends | High | Volume | Previous_Close | trend_daily_increase | price_daily_increase |
|---|---|---|---|---|---|---|---|---|
| 1 | 4/5/2021 | 374.809998 | 88 | 377.940002 | 1419500 | 372.12 | 1 | 1 |
| 2 | 4/6/2021 | 375.609985 | 83 | 381.839996 | 1309700 | 374.81 | 0 | 1 |
| 3 | 4/7/2021 | 374.790009 | 88 | 378.880005 | 1299300 | 375.61 | 1 | 0 |
| 4 | 4/8/2021 | 374.070007 | 79 | 374.529999 | 1250700 | 374.79 | 0 | 0 |
| 5 | 4/9/2021 | 377.000000 | 82 | 378.079987 | 1224500 | 374.07 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 57 | 6/23/2021 | 347.750000 | 71 | 349.170013 | 3352700 | 342.10 | 0 | 1 |
| 58 | 6/24/2021 | 350.619995 | 74 | 354.739990 | 2140200 | 347.75 | 1 | 1 |
| 59 | 6/25/2021 | 349.989990 | 80 | 355.890015 | 6604800 | 350.62 | 1 | 0 |
| 60 | 6/28/2021 | 349.890015 | 72 | 350.950012 | 1304200 | 349.99 | 0 | 0 |


In [18]:
# Choose the feature columns and the target column. Here the target is a class label: whether or not the price
# increased, predicted from the previous day's closing price, the volume traded, and the search trends data.
featureColumns = ["Trends","Previous_Close","Volume","trend_daily_increase"]
target = "price_daily_increase"

X=JDStockTrend[featureColumns]
y=JDStockTrend[target]

X

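| | Trends | Previous_Close | Volume | trend_daily_increase |
|---|---|---|---|---|
| 1 | 88 | 372.12 | 1419500 | 1 |
| 2 | 83 | 374.81 | 1309700 | 0 |
| 3 | 88 | 375.61 | 1299300 | 1 |
| 4 | 79 | 374.79 | 1250700 | 0 |
| 5 | 82 | 374.07 | 1224500 | 1 |
| ... | ... | ... | ... | ... |
| 57 | 71 | 342.10 | 3352700 | 0 |
| 58 | 74 | 347.75 | 2140200 | 1 |
| 59 | 80 | 350.62 | 6604800 | 1 |
| 60 | 72 | 349.99 | 1304200 | 0 |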


In [19]:
y

1     1
2     1
3     0
4     0
5     1
     ..
57    1
58    1
59    0
60    0
61    0
Name: price_daily_increase, Length: 61, dtype: int32

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
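
This split is called without `random_state`, so the rows assigned to each set, and therefore the scores that follow, change on every run. A hedged sketch of one alternative: fixing `random_state` makes the split repeatable, and `stratify` keeps the class proportions similar in both sets. The arrays below are stand-ins with a made-up 30/31 class balance, not the real labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 61 rows with a made-up, nearly balanced binary label.
X_demo = np.arange(61 * 2).reshape(61, 2)
y_demo = np.array([0] * 30 + [1] * 31)

def reproducible_split():
    # random_state fixes the shuffle; stratify keeps the 0/1 ratio in both sets.
    return train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
    )

X_tr, X_te, y_tr, y_te = reproducible_split()
X_tr2, X_te2, y_tr2, y_te2 = reproducible_split()

print(np.bincount(y_tr), np.bincount(y_te))
```

The value `random_state=1` simply mirrors the regression split earlier in the notebook; any fixed integer would serve the same purpose.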

## Modeling with Logistic Regression (classification)


In [21]:
lr = LogisticRegression(solver="liblinear")
lr

LogisticRegression(solver='liblinear')

In [22]:
lr.fit(X_train, y_train)

LogisticRegression(solver='liblinear')

In [23]:
lr.score(X_train, y_train)

0.5777777777777777

In [24]:
lr.score(X_test, y_test)

0.4375

## Classification Metrics Output

In [25]:
predictions = lr.predict(X_test)
printClassificationMetrics(y_test, predictions)

Confusion Matrix:
[[0 8]
 [1 7]]
------------------
Accuracy: 0.44
Recall: 0.88
Precision: 0.47
f-measure: 0.61
------------------
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         8
           1       0.47      0.88      0.61         8

    accuracy                           0.44        16
   macro avg       0.23      0.44      0.30        16
weighted avg       0.23      0.44      0.30        16
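
The summary numbers above can be recovered by hand from the confusion matrix, which may help in reading it. In scikit-learn's layout the rows are the actual classes and the columns are the predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`. The sketch below plugs in the four counts from the output above.

```python
# Counts taken from the confusion matrix printed above: [[0, 8], [1, 7]].
tn, fp = 0, 8   # actual class 0: none predicted correctly, 8 predicted as 1
fn, tp = 1, 7   # actual class 1: 1 predicted as 0, 7 predicted correctly

accuracy = (tp + tn) / (tp + tn + fp + fn)            # share of all predictions that are right
recall = tp / (tp + fn)                               # share of actual increases that were found
precision = tp / (tp + fp)                            # share of predicted increases that were real
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of precision and recall

print(f"{accuracy:.2f} {recall:.2f} {precision:.2f} {f1:.2f}")  # prints: 0.44 0.88 0.47 0.61
```

The high recall comes from the model predicting an increase for 15 of the 16 test rows, which is also why none of the 8 actual class-0 rows are classified correctly.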



## Predict with the samples that were generated above


In [26]:
import random as rnd
rnd.seed(1024)


In [27]:
sampleStockTrend

[{'Trends': 2,
  'Previous_Close': 247,
  'Volume': 3266757,
  'trend_daily_increase': 1},
 {'Trends': 66,
  'Previous_Close': 51,
  'Volume': 3733943,
  'trend_daily_increase': 1},
 {'Trends': 92,
  'Previous_Close': 366,
  'Volume': 3093938,
  'trend_daily_increase': 1}]

In [28]:
pdSampleStockTrend = pd.DataFrame.from_dict(sampleStockTrend)

In [29]:
predictions = lr.predict(pdSampleStockTrend)
predictions

array([0, 0, 0])
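
`predict` returns only the class label. To see how confident a logistic regression model is, `predict_proba` returns the estimated probability of each class, and for a binary problem the label is class 1 when that probability exceeds 0.5. The sketch below shows the relationship on synthetic data; `X_demo`, `y_demo`, and `clf` are illustrative names, not the notebook's model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, easily separable data for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

clf = LogisticRegression(solver="liblinear").fit(X_demo, y_demo)

# One row per sample; columns follow clf.classes_, here [0, 1].
proba = clf.predict_proba(X_demo[:5])
labels = clf.predict(X_demo[:5])

# Thresholding the class-1 probability at 0.5 reproduces the predicted labels.
labels_from_proba = (proba[:, 1] > 0.5).astype(int)

print(proba)
print(labels, labels_from_proba)
```

Applied to the three generated samples, the same call would show whether the all-zero predictions are confident or close to the 0.5 threshold.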

In [30]:
pdPredictedStockTrend = pdSampleStockTrend.copy()  # copy so the new column is not added to the sample frame
pdPredictedStockTrend["price_daily_increse"] = predictions.astype(bool)
pdPredictedStockTrend

| | Trends | Previous_Close | Volume | trend_daily_increase | price_daily_increse |
|---|---|---|---|---|---|
| 0 | 2 | 247 | 3266757 | 1 | False |
| 1 | 66 | 51 | 3733943 | 1 | False |
| 2 | 92 | 366 | 3093938 | 1 | False |


## Conclusion

Comparing the results of the two methods used (linear regression, and logistic regression for classification), the data suggest that linear regression is a strong candidate for predicting the price of John Deere stock. The linear regression score (R²) is about 0.90 on both the training and test sets, and the generated sample data also produce reasonable predictions for the closing price. The logistic regression classifier, by contrast, reaches an accuracy of only 44% on the test set, which is below the 50% expected from guessing on a test set split evenly between the two classes. Even so, a stronger classification model would in some ways provide more value to a day trader, because it would give them the confidence to buy shares of a company and make money in a short amount of time.