<a href="https://colab.research.google.com/github/dipesh2108/AI_Notes/blob/main/Decision_(CART)_trees.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Decision Trees in Python with Scikit-Learn
-------------------------------------------------------------

<hr>

Connect with author of this NB here - <a href= "https://www.linkedin.com/in/rocky-jagtiani-3b390649/">  Rocky Jagtiani </a>
<hr>

Introduction
-----------------
A decision tree is one of the most frequently and widely **used supervised machine learning algorithms, and it can perform both regression and classification tasks**. Hence the name <font color='green'><b>CART</b> - <u>C</u>lassification <u>A</u>nd <u>R</u>egression <u>T</u>rees.</font>

The decision tree algorithm builds a tree of nodes, each of which tests one attribute; the most informative attribute is placed at the root node. To make a prediction, we start at the root node and work our way down the tree, at each node following the branch that matches our condition or "decision". This process continues until a leaf node is reached, which contains the prediction or the outcome of the decision tree.

Consider a scenario where a person asks you to lend them your car for a day, and you have to make a decision whether or not to lend them the car. There are several factors that help determine your decision, some of which have been listed below:

![decison_tree_image](https://drive.google.com/uc?id=1nCTEZVfy_m6dMu_Qz7SQDZWNhGZZOFKz 'decison_tree_image')



![decison_tree_example](https://drive.google.com/uc?id=1F_T2ICas2htr6b-GDBFRIo3DTbPnIzSI 'decison_tree_example')

**Entropy** is a measure of impurity or disorder in a dataset within the context of a decision tree. In a binary classification problem, it quantifies how mixed the target classes are within a dataset. The entropy of a dataset is calculated as:

`Entropy(S) = -p1 * log2(p1) - p2 * log2(p2)`

where `p1` and `p2` are the proportions of the two classes within the dataset, and **the logarithm is to base 2**.

**Information Gain** is a metric used to find the most informative attribute for splitting a dataset in a decision tree. It measures the reduction in entropy (impurity) achieved by partitioning the data based on a particular attribute. The attribute with the highest information gain is chosen as the best split.

For example, consider a dataset of email messages that are classified as either "spam" or "not spam." You want to decide which feature to use for splitting the dataset in a decision tree.

- Initially, the dataset contains an equal proportion of spam and non-spam emails (50% each). The entropy of the dataset is 1.0 (maximum disorder) since it's a 50-50 mix.

- Now, you consider splitting the dataset based on the presence of the word "free" in the subject line. After this split, the "free" branch has 80% spam emails, and the "not free" branch has 20% spam emails. Each branch is more homogeneous than the original dataset, so its entropy is lower, indicating less disorder.

- You calculate the information gain as the reduction in entropy: `(Information Gain = Entropy(Original) - Weighted Average Entropy(Children))`. The weighted average entropy of the "free" branch and "not free" branch is lower than the entropy of the original dataset, indicating that the "free" feature provides information gain.

In decision tree construction, attributes with higher information gain are preferred because they lead to more effective splits, resulting in more homogeneous groups of data.
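The spam example above can be worked through numerically. This is only a minimal sketch: the helper name `entropy` is made up for illustration, and it assumes that the "free" and "not free" branches each hold half of the emails (which is what a 50-50 parent with 80% / 20% children implies).

```python
import math

def entropy(proportions):
    """Entropy (log base 2) of a list of class proportions."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

# Original dataset: 50% spam, 50% not spam
parent = entropy([0.5, 0.5])

# After splitting on the word "free"
free_branch = entropy([0.8, 0.2])       # 80% spam
not_free_branch = entropy([0.2, 0.8])   # 20% spam

# Assumption: each branch contains half of the emails
weighted_children = 0.5 * free_branch + 0.5 * not_free_branch
information_gain = parent - weighted_children

print(f"Entropy(Original)       : {parent:.3f}")
print(f"Weighted child entropy  : {weighted_children:.3f}")
print(f"Information Gain        : {information_gain:.3f}")
```

Under these assumptions the parent entropy is 1.0, the weighted child entropy is about 0.722, and the information gain is about 0.278.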

>**Advantages of Decision Trees**
------------------------------

There are several advantages of using decision trees for predictive analysis:

1> Decision trees can be used to predict both continuous and discrete values i.e. they work well for both regression and classification tasks.

2> They require relatively little effort to prepare the data and train the algorithm; for example, features do not need to be scaled.

3> They can be used to classify non-linearly separable data.

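4> They're fast and efficient at prediction time compared to KNN: classifying a sample only requires walking from the root to a leaf, whereas KNN has to compare the sample against the stored training points.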

<h4><font color='green'> Do play this Fun video <b>Friendly Introduction to Decision_(CART)_Trees</b></font></h4>

<a href="https://drive.google.com/open?id=1GQpsUHQgd9Jaw58E33vFtH4d2ACaXzQ1">
  <img src="https://drive.google.com/uc?id=14OOsd0HaKoMJjqu5YT5n7-HsvE6UVV7z" alt="Friendly Intro to Decision_(CART)_Trees" width="90" height="55">
</a>

<small>Credits : This video is recorded by Augmented Startups team</small>

# 1. Decision Tree for Classification
---------------------------------------------------------
<font color='green'><b>( We will be using DecisionTreeClassifier from sklearn.tree.</b> It is fast, simple and takes care of all the math. We will concentrate only on coding and solving a real-world problem. )</font><br><br>
<font color='red'>
Here, we will predict whether a <b>bank note is authentic or fake</b> depending upon four different attributes of the image of the note. The <u>attributes</u> are the variance, skewness and kurtosis (spelled <code>Curtosis</code> in the dataset) of the wavelet-transformed image, and the entropy of the image.</font>

**Note :** In the dataset the **class** variable can be **0 or 1**. **0 indicates authentic BankNote and 1 indicates fake BankNote.**

In [None]:
# Steps to upload any dataset into your Colab NB :
# step 1 : First Download the dataset to your local PC.
#          The link for downloading our dataset for practicing is https://drive.google.com/open?id=19YvsKMdlIZ_bxgJOSg4waIkVTyJ-UYx_
# step 2 : Run the below code and select the (above downloaded) dataset.
#from google.colab import files
#file = files.upload()

Saving bill_authentication.csv to bill_authentication.csv


In [None]:
# doing the minimum necessary imports
# more modules would be imported as and when needed

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# reading data from CSV file.
# reading bank currency note data into pandas dataframe.
bankdata = pd.read_csv("bill_authentication.csv")

# Exploratory Data Analysis
print(bankdata.shape)
print("------------")
#print(bankdata.head(10))
#bankdata.head()

# shuffling the 100% of the data
print(bankdata.sample(random_state=100, frac=1.0).head(10))

## shuffle the original dataframe
# bankdata = bankdata.sample(random_state=100, frac=1)

(1372, 5)
------------
      Variance  Skewness  Curtosis  Entropy  Class
1058  -1.56210   -2.2121   4.25910  0.27972      1
714    2.55590    3.3605   2.03210  0.26809      0
1061  -2.31470    3.6668  -0.69690 -1.24740      1
399    2.96950    5.6222   0.27561 -1.15560      0
382    0.86202    2.6963   4.29080  0.54739      0
376    3.23030    7.8384  -3.53480 -1.21510      0
987   -0.55648    3.2136  -3.30850 -2.79650      1
416    4.34830   11.1079  -4.08570 -4.25390      0
945   -1.76970    3.4329  -1.21440 -2.37890      1
595    3.18360    7.2321  -1.07130 -2.59090      0


In [None]:
bankdata.info()  # this helps in finding any missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1372 entries, 0 to 1371
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Variance  1372 non-null   float64
 1   Skewness  1372 non-null   float64
 2   Curtosis  1372 non-null   float64
 3   Entropy   1372 non-null   float64
 4   Class     1372 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 53.7 KB


<b> Analysis : </b> There is no missing data. This data is clean.

In [None]:
# Data Preprocessing
# Data preprocessing involves
# (1) Dividing the data into attributes and labels and
# (2) dividing the data into training and testing sets.

# To divide the data into attributes and labels, do :
X = bankdata.drop('Class', axis=1)
y = bankdata['Class']

# the final preprocessing step is to divide data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=100)
# default test_size parameter value is 0.25

# Training the Algorithm. Here we would use DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)

# make predictions on the test data
y_pred = classifier.predict(X_test)

# Evaluating the Algorithm
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

# Remember : for evaluating classification-based ML algo use
# confusion_matrix, classification_report and accuracy_score.

# And for evaluating regression-based ML Algo use Mean Squared Error(MSE)
# or RMSE (Root Mean Squared Error), ...

[[197   1]
 [  3 142]]
              precision    recall  f1-score   support

           0       0.98      0.99      0.99       198
           1       0.99      0.98      0.99       145

    accuracy                           0.99       343
   macro avg       0.99      0.99      0.99       343
weighted avg       0.99      0.99      0.99       343

0.9883381924198251


<b><font color='green'>Analysis</font></b> : From the confusion matrix, you can see that out of 343 test instances, our algorithm misclassified only 4. This is approx 99% accuracy.
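To see where these numbers come from, the accuracy can be recomputed by hand from the confusion matrix printed above. This is a small sketch in plain Python; it relies on the scikit-learn convention that rows are the actual classes and columns are the predicted classes.

```python
# Confusion matrix copied from the output above
# rows = actual class (0, 1), columns = predicted class (0, 1)
cm = [[197, 1],
      [3, 142]]

correct = cm[0][0] + cm[1][1]            # entries on the diagonal
total = sum(sum(row) for row in cm)      # all test instances
misclassified = total - correct
accuracy = correct / total

print("Test instances :", total)
print("Misclassified  :", misclassified)
print("Accuracy       :", accuracy)
```

The off-diagonal entries (1 and 3) are the misclassified notes, so the accuracy is 339 / 343.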

# 2. Decision Tree for Regression
------------------------------------------------------
<font color='green'><b>( We will be using DecisionTreeRegressor from sklearn.tree.</b> It is fast, simple and takes care of all the math. We will concentrate only on coding and solving a real-world problem. )</font><br><br>
<font color='red'>
We will use the petrol_consumption.csv dataset and <b>try to predict gas consumption</b> (in millions of gallons) in 48 US states <u>based upon</u> gas tax (in cents), per capita income (dollars), paved highways (in miles) and the proportion of the population with a driver's license. </font>

**Note :** In the dataset **Petrol_Consumption** is the target variable.

In [None]:
# Steps to upload any dataset into your Colab NB :
# step 1 : First Download the dataset to your local PC.
#          The link for downloading our dataset for practicing is https://drive.google.com/open?id=1_YH4VnFwlZBd3MS45TYyU7dzchn-4skk
# step 2 : Run the below code and select the (above downloaded) dataset.
#from google.colab import files
#files.upload()

Saving petrol_consumption.csv to petrol_consumption.csv



In [None]:
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Importing the Dataset
dataset = pd.read_csv('petrol_consumption.csv')

dataset.head()

   Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
0         9.0            3571            1976                         0.525                 541
1         9.0            4092            1250                         0.572                 524
2         9.0            3865            1586                         0.580                 561
3         7.5            4870            2351                         0.529                 414
4         8.0            4399             431                         0.544                 410


In [None]:
# To see statistical details of the dataset, execute the following command:

#dataset.describe()
dataset['Petrol_Consumption'].mean()*0.1  # 10% of the mean of the target variable

57.67708333333334

In [None]:
# Preparing the Data
# divide the data into attributes and labels
X = dataset.drop('Petrol_Consumption', axis=1)
y = dataset['Petrol_Consumption']

# dividing data into training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=10)

# Training and Making Predictions
# Note : we will be using the DecisionTreeRegressor class, not DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)

# To make predictions on the test set,
y_pred = regressor.predict(X_test)

# Now let's compare some of our predicted values with the actual values
df = pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})
df

    Actual  Predicted
35     644      566.0
23     547      534.0
42     632      610.0
40     587      540.0
45     510      457.0
20     649      540.0
3      414      471.0
30     571      554.0
7      467      460.0
6      344      464.0


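**Note** :

The records compared may be different in your case, depending upon the training and testing split. `train_test_split` shuffles the data randomly before splitting, so without a fixed seed we likely won't get the same training and test sets. Passing the same `random_state` (10 in the cell above) reproduces the same split on every run. The tree itself may still break ties between equally good splits differently unless `random_state` is also set on `DecisionTreeRegressor`.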


In [None]:
# Evaluating the Algorithm
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 52.3
Mean Squared Error: 4162.3
Root Mean Squared Error: 64.51588951568444


The root mean squared error for our algorithm is 64.52, which is more than *10 percent of the mean* of all the values in the '**Petrol_Consumption**' column ( 10% of the mean is **57.68** ). This means that our algorithm did not do a good prediction job; an RMSE of **less than 10%** of the mean <u>would have been better</u>.
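The three error metrics can also be recomputed by hand from the Actual / Predicted table shown earlier, which makes it clear what each one measures. This is a plain-Python sketch; the numbers are copied from the outputs above, and `mean_target` is taken from the earlier cell that printed 10% of the mean as 57.677...

```python
import math

# Actual and predicted values copied from the comparison table above
actual    = [644, 547, 632, 587, 510, 649, 414, 571, 467, 344]
predicted = [566, 534, 610, 540, 457, 540, 471, 554, 460, 464]

errors = [a - p for a, p in zip(actual, predicted)]

mae = sum(abs(e) for e in errors) / len(errors)   # Mean Absolute Error
mse = sum(e * e for e in errors) / len(errors)    # Mean Squared Error
rmse = math.sqrt(mse)                             # Root Mean Squared Error

# Mean of Petrol_Consumption over the whole dataset (from the earlier output)
mean_target = 576.7708333333334

print("MAE :", mae)
print("MSE :", mse)
print("RMSE:", rmse)
print("RMSE as a fraction of the mean:", rmse / mean_target)
```

The last line is the figure to compare against the 10% rule of thumb used in this notebook.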

There could be many reasons for a regression algorithm to not perform that well. Some reasons are :

1. Decision trees can easily overfit. This can be mitigated by validation methods and pruning. <font color='red'>Due to overfitting, the model performs well on the training data but not that well on the test data.</font> More about overfitting and how to handle it in the **Machine Learning - Intermediate** course.

2. The data sample is too small for training the model (only 48 rows). So instead of testing the model on just 20% of the data and judging it, we can do a better job by **CROSS VALIDATING** our model. In k-fold **CROSS VALIDATION** the data is split into k folds, and each fold is used once for testing while the remaining folds are used for training, so 100% of the data is used for training as well as for testing. **CROSS VALIDATION** is covered in the **Machine Learning - Intermediate** course.



In [None]:
## do cv = 5
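One possible way to fill in the `cv = 5` cell above is sketched below. It is not the course's official solution. To keep the snippet self-contained it uses a synthetic stand-in dataset (`X_demo`, `y_demo`, generated with `make_regression`) instead of `petrol_consumption.csv`; in the notebook you would pass the real `X` and `y` instead.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Assumption: synthetic stand-in with the same shape as the petrol data
# (48 rows, 4 features). Replace X_demo, y_demo with the real X, y.
X_demo, y_demo = make_regression(n_samples=48, n_features=4,
                                 noise=10.0, random_state=0)

model = DecisionTreeRegressor(random_state=0)

# 5-fold cross validation: each fold is used once as the test set.
# scikit-learn returns the negated RMSE, so we flip the sign.
neg_rmse = cross_val_score(model, X_demo, y_demo, cv=5,
                           scoring="neg_root_mean_squared_error")
rmse_per_fold = -neg_rmse

print("RMSE per fold:", rmse_per_fold)
print("Mean RMSE    :", rmse_per_fold.mean())
```

Averaging the RMSE over the 5 folds gives a steadier estimate than a single 80/20 split, which matters on a dataset this small.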


<hr />

**IMPORTANT REFERENCE**

If one wants to know **HOW Decision Trees are used for REGRESSION ?**

Read this - http://www.saedsayad.com/decision_tree_reg.htm

<hr />

<font color='red'><b><u>Problem Solving (25-30 mins):</u></b> </font>
--
<br>
<b><h4>Can you predict tomorrow's stock price for a stock like <u>HDFC or SBI bank</u> on BSE ?</h4></b>

<font color='green'> <b>Follow the steps </b></font>

**Introduction** <br>
The stock market is a market that enables the seamless buying and selling of company stocks. Every stock exchange has its own stock index value. The index is an average value calculated by combining several stocks.

This helps in representing the entire stock market and predicting the market’s movement over time. The stock market can have a huge impact on people and the country’s economy as a whole. Therefore, predicting the stock trends in an efficient manner can minimize the risk of loss and maximize profit.


**How does stock market work?** <br>
The concept behind how the stock market works is pretty simple. Operating much like an auction house, the stock market enables buyers and sellers to negotiate prices and make trades.

The stock market works through a network of exchanges; you may have heard of the New York Stock Exchange, Nasdaq, the BSE or the NSE. Companies list shares of their stock on an exchange through a process called an initial public offering or IPO. Investors purchase those shares, which allows the company to raise money to grow its business. Investors can then buy and sell these stocks among themselves, and the exchange tracks the supply and demand of each listed stock.

That supply and demand help determine the price for each security or the levels at which stock market participants — investors and traders — are willing to buy or sell.


**How Share Prices Are Set** <br>
To actually buy shares of a stock on a stock exchange, investors go through brokers — an intermediary trained in the science of stock trading, who can get an investor a stock at a fair price, at a moment’s notice. Investors simply let their broker know what stock they want, how many shares they want, and usually at a general price range. That’s called a “bid” and sets the stage for the execution of a trade. If an investor wants to sell shares of a stock, they tell their broker what stock to sell, how many shares, and at what price level. That process is called an “offer” or “ask price.”

**Predicting**  <br>
How the stock market will perform is one of the **most difficult things to predict**. There are so many factors involved in the prediction: physical vs. psychological factors, rational and irrational behavior, etc. All these aspects combine to make share prices volatile and very difficult to predict with a high degree of accuracy.
<br>
<font color = 'green'> <br>
We will try predicting the <b>next day's stock price</b> using <b>DECISION TREE REGRESSOR</b> </font>

Understanding the Problem Statement
--

Broadly, stock market analysis is divided into two parts – Fundamental Analysis and Technical Analysis.

**Fundamental Analysis** involves analyzing the company’s future profitability on the basis of its current business environment and financial performance.

**Technical Analysis**, on the other hand, includes reading the charts and using statistical figures to identify the trends in the stock market.

As you might have guessed, our focus will be on the technical analysis part. We’ll be using a dataset from **Quandl** (you can find historical data for various stocks here) and for this particular project, I have used the data for ‘HDFC bank Ltd- BSE’.

`data source` : https://www.quandl.com/ or  https://data.nasdaq.com

<u>Important</u> : <br>
<b>1. Please create <u>Student</u> Account on Quandl.com.</b> <br>
<b>2. Search</b> <font color='blue'>HDFC-Bank-Ltd, BSE</font> data and <b>download the csv file. You would get the latest data up to yesterday.</b>
<br><br>
<small>Our SuvenML team is not readily giving you the dataset, as we have done in previous NB's / case-studies.</small>


In [None]:
# doing minimum necessary imports

import pandas as pd                            # for loading and analysing data
import matplotlib.pyplot as plt                # for data visualization
from sklearn.tree import DecisionTreeRegressor # Our Decision Tree regressor

In [None]:
!pip install Quandl

Collecting Quandl
  Downloading Quandl-3.7.0-py2.py3-none-any.whl.metadata (1.3 kB)
Collecting inflection>=0.3.1 (from Quandl)
  Downloading inflection-0.5.1-py2.py3-none-any.whl.metadata (1.7 kB)
Downloading Quandl-3.7.0-py2.py3-none-any.whl (26 kB)
Downloading inflection-0.5.1-py2.py3-none-any.whl (9.5 kB)
Installing collected packages: inflection, Quandl
Successfully installed Quandl-3.7.0 inflection-0.5.1


In [None]:
# ### import Quandl into your program as
import quandl  # Stock market API for fetching Data

# you can then fetch the stock data directly into your code as :
quandl.ApiConfig.api_key = 'YOUR_API_KEY'   ## enter your own key here
stock_data = quandl.get("BSE/HDFC-BANK-LTD", start_date="2023-01-01", end_date="2023-10-01")
#stock_data = quandl.get('BSE/BOM500112', start_date='2023-02-21', end_date='2024-09-11') ## BOM500112-state-bank-of-india-eod-prices
# choose an end_date up to yesterday's date
print(stock_data.head())   # first 5 rows
print("----------------------")
print(stock_data.shape)    # (rows, columns)
print("----------------------")
print(stock_data.tail())   # last 5 rows

QuandlError: (Status 410) Something went wrong. Please try again. If you continue to have problems, please contact us at connect@quandl.com.

In [None]:
# How many samples do we have ?
stock_data.shape

(131, 12)

In [None]:
# checking whether any column or feature has missing values
stock_data.isnull().sum()

Open                         0
High                         0
Low                          0
Close                        0
WAP                          0
No. of Shares                0
No. of Trades                0
Total Turnover               0
Deliverable Quantity         0
% Deli. Qty to Traded Qty    0
dtype: int64


In [None]:
## working with shift()  --> https://www.geeksforgeeks.org/python-pandas-dataframe-shift/
print(stock_data[['Open', 'Close']].tail(4))
shifted_Open_Close = stock_data.loc[:,['Open', 'Close']].shift(-1)
print(shifted_Open_Close.tail(4))
print("-------------------------------")

shifted_Stock_Data = stock_data.copy()

print("-------------------------------")
shifted_Stock_Data['Open'] = shifted_Open_Close['Open']
print(shifted_Stock_Data[['Open', 'Close']].tail(4))

              Open   Close
Date                      
2023-08-29  573.95  575.20
2023-08-30  575.30  567.65
2023-08-31  568.65  561.30
2023-09-01  563.00  569.70
              Open   Close
Date                      
2023-08-29  575.30  567.65
2023-08-30  568.65  561.30
2023-08-31  563.00  569.70
2023-09-01     NaN     NaN
-------------------------------
-------------------------------
              Open   Close
Date                      
2023-08-29  575.30  575.20
2023-08-30  568.65  567.65
2023-08-31  563.00  561.30
2023-09-01     NaN  569.70
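The effect of `shift(-1)` in the cell above is easier to see on a tiny frame. This is a minimal sketch using the last four rows from the output above; the column name `Next_Open` and the variable names `demo` and `train_rows` are made up for illustration.

```python
import pandas as pd

# Last four rows of Open / Close, copied from the output above
demo = pd.DataFrame(
    {"Open":  [573.95, 575.30, 568.65, 563.00],
     "Close": [575.20, 567.65, 561.30, 569.70]},
    index=pd.to_datetime(["2023-08-29", "2023-08-30",
                          "2023-08-31", "2023-09-01"]),
)

# shift(-1) moves every value up by one row, so each date now carries
# the NEXT trading day's Open. The last row has no "next day" -> NaN.
demo["Next_Open"] = demo["Open"].shift(-1)
print(demo)

# Rows with a known next-day Open can be used for training;
# the last row is the one we want a prediction for.
train_rows = demo.dropna()
print(len(train_rows))
```

This is why the notebook keeps the last (NaN) row aside for later and then calls `dropna()` on the rest.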


In [None]:
## would be used later for testing purpose
## last date value
print(shifted_Stock_Data.values[-1 : ])

[[           nan 5.71350000e+02 5.62000000e+02 5.69700000e+02
  5.68120000e+02 1.71022000e+06 3.17770000e+04 9.71610451e+08
  9.62880000e+05 5.63000000e+01 9.35000000e+00 6.70000000e+00]]


In [None]:
# How many samples do we have ?
#stock_data.shape
shifted_Stock_Data.shape

(131, 12)

In [None]:
shifted_Stock_Data.dropna(inplace=True)
shifted_Stock_Data.isnull().sum()

Open                         0
High                         0
Low                          0
Close                        0
WAP                          0
No. of Shares                0
No. of Trades                0
Total Turnover               0
Deliverable Quantity         0
% Deli. Qty to Traded Qty    0
dtype: int64


In [None]:
# How many samples do we have ?
#stock_data.shape
shifted_Stock_Data.shape

(130, 12)

Now, the most important yet simple step :

> Decide and divide the data into dependent and independent variables.

> Using the `Date` column may not be useful in predicting the **Opening Price**; for that we would have to look at a **Time Series Forecasting** approach. In a simple regression approach, using the date to predict the opening price may not be a good idea.

> <font color='green'>We have to predict the <b>open price</b>, so this column is our <u>dependent variable</u>. Because of the shift, each row's open price is the <b>next day's</b> open, and we predict it from the current day's <b>High, Low, Close, ..., Turnover.</b></font>

In [None]:
# Let's select our features

X = shifted_Stock_Data.drop(['Open'] , axis=1)
y = shifted_Stock_Data.loc[ : ,'Open']

In [None]:
X.head(2) # head() shows the earliest 2 records

             High    Low   Close     WAP  No. of Shares  No. of Trades  Total Turnover  Deliverable Quantity  % Deli. Qty to Traded Qty  Spread H-L  Spread C-O
Date
2023-02-21  529.0   522.5  523.25  525.67       317073.0         8357.0     166675913.0               85589.0                      26.99        6.5       -3.95
2023-02-22  521.75  512.5  516.35  516.88       355801.0        10340.0     183906944.0              130667.0                      36.72        9.25      -4.85


In [None]:
y.head(2) # earliest 2 target values (next day's Open)

Date
2023-02-21    521.20
2023-02-22    518.65
Name: Open, dtype: float64


In [None]:
y.tail(2) # latest 2 stock prices

Date
2023-08-30    568.65
2023-08-31    563.00
Name: Open, dtype: float64


In [None]:
# split the entire data into Training and Test
# keep 80% for training and 20% for Testing
# so the test_size = 0.2
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X, y,
                                                 test_size = 0.2,
                                                 random_state = 0)

In [None]:
# Let's fit our DecisionTree Model over the training data.
regressor = DecisionTreeRegressor() # making the object of DecisionTreeRegressor
regressor.fit(x_train,y_train)

In [None]:
# Let's check the performance of the model.
# Is it within acceptable limits?
# That is, is the RMSE within 10% of the mean Open price?

# Get the predictions on the test set
y_pred = regressor.predict(x_test)

# Evaluating the Algorithm
from sklearn import metrics
import numpy as np
print('Root Mean Squared Error:',
      np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Root Mean Squared Error: 4.843274480380777


In [None]:
# Now, what is the mean of the 'Open' price?
# Note that we take the mean of 'Open' over the whole dataset.
# Splitting data into training and testing sets is only for the purpose of evaluating the model.
print(stock_data['Open'].mean())
print(0.1 * stock_data['Open'].mean())

566.3251908396948
56.63251908396948


<font color='green'> <b>Analysis : </b> 10% of 566 is ~ <b>56.6</b>. Our RMSE was ~<b>4.84</b>. <br>That is well within the 10% range (under 1% of the mean), which means our <u>model</u> is doing a good job on the test set.</font>

<font color='red'> Now let's predict the <u><b>OPEN</b></u> price for tomorrow. </font> <br><u>Remember</u> : my data is up to yesterday, so the <b>model</b> can predict a future price, i.e. <b>tomorrow's OPEN price</b>.

In [None]:
# Trying to predict tomorrow's rate, i.e. 2nd Sept 2023 in my case.

## testing for shifted data
import numpy as np

test_data = [[ 5.71350000e+02, 5.62000000e+02, 5.69700000e+02,
  5.68120000e+02, 1.71022000e+06, 3.17770000e+04, 9.71610451e+08,
  9.62880000e+05, 5.63000000e+01, 9.35000000e+00, 6.70000000e+00 ]]

regressor.predict(test_data)



array([568.8])

<font color='green'><b>Observation</b> : So our model says that <u>HDFC Bank Ltd (BSE)</u> would open at ___ on ___. </font>

