<a href="https://colab.research.google.com/github/dhiruvivek/Yes-Bank-ML/blob/main/Model_Train_Test_for_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Yes Bank Stock Closing Price Prediction**

##### **Project Type**    - Regression
##### **Contribution**    - Individual
**Name** - Vivek Tripathi


# **Project Summary -**

Yes Bank is a well-known bank in the Indian financial domain. Since 2018, it has been in the news because of the fraud case involving Rana Kapoor. Owing to this fact, it was interesting to see how that impacted the stock prices of the company and whether time series models or any other predictive models can do justice to such situations.

This dataset has monthly stock prices of the bank since its inception and includes closing, starting, highest, and lowest stock prices of every month. The main objective is to predict the stock's closing price of the month.

# **GitHub Link -**

https://github.com/dhiruvivek/Yes-Bank-ML

# **Problem Statement**


Yes Bank is a well-known bank in the Indian financial domain. Since 2018, it has been in the news because of the fraud case involving Rana Kapoor. Owing to this fact, it was interesting to see how that impacted the stock prices of the company and whether Time series models or any other predictive models can do justice to such situations.




---


This dataset has monthly stock prices of the bank since its inception and includes closing, starting, highest, and lowest stock prices of every month. The main objective is to predict the stock's closing price of the month.



---
We need to identify the independent variables and the dependent variable (closing price), and then measure how strongly the independent variables influence the dependent variable.



---

We will train several machine learning models and compare them to find out which one best fits our data.


# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Importing the libraries
import numpy as np
import pandas as pd
import math  # standard-library math; `from numpy import math` fails on recent NumPy releases

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
url='/content/drive/MyDrive/Yes Bank Stock Closing Price Prediction ML/data_YesBank_StockPrices.csv'
df= pd.read_csv(url)

### Dataset First View

In [None]:
# Dataset First Look
#top 5 rows
df.head()


**Date** - The month and year of the record.

**Open** - The opening price of the stock for that month.

**High** - The highest price of the stock during that month.

**Low** - The lowest price of the stock during that month.

**Close** - The closing (last) price of the stock for that month.



In [None]:
#bottom 5 rows
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

The dataset is small: it has only 185 rows and 5 columns.

### Dataset Information

In [None]:
# Dataset Info
df.info()

As the information above shows, there are no null values in the dataset.

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

There are no duplicate rows in the dataset.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum()

**The dataset does not have any null values or duplicate rows.**


In [None]:
# Visualizing the missing values
import missingno as msno  # to visualize the missing values

In [None]:
msno.matrix(df)

As the chart above shows, every column is filled to the same level, so there are no missing values.
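The individual checks above (shape, duplicates, nulls) can also be gathered into one small helper. This is only a sketch: `summarize_quality` and the toy `demo` frame are hypothetical names made up for illustration and are not part of the original notebook or the Yes Bank data.

```python
import numpy as np
import pandas as pd

def summarize_quality(frame: pd.DataFrame) -> dict:
    """Return row/column counts plus duplicate-row and missing-value totals."""
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_per_column": frame.isna().sum().to_dict(),
    }

# Hypothetical toy data with one duplicated row and one missing value.
demo = pd.DataFrame({
    "Open": [10.0, 12.5, 12.5, 14.0],
    "Close": [11.0, 13.0, 13.0, np.nan],
})
report = summarize_quality(demo)
print(report)
```

On the real data, calling `summarize_quality(df)` should agree with the separate cells above.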

### What did you know about your dataset?

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include="all")

In [None]:
#changing the date column to date time object
from datetime import datetime
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')  # e.g. 'Jul-05' -> 2005-07-01
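
To make the `'%b-%y'` format concrete, here is a small sketch on made-up strings in the same month-year layout (the values in `raw_dates` are hypothetical, not rows from the dataset). `%b` is the abbreviated month name and `%y` the two-digit year; since no day is given, each date falls on the first of the month. It assumes an English locale for the month names.

```python
import pandas as pd

# Hypothetical sample strings in the 'Mon-YY' layout.
raw_dates = pd.Series(["Jul-05", "Aug-05", "Nov-20"])

# Parse with an explicit format instead of letting pandas guess.
parsed = pd.to_datetime(raw_dates, format="%b-%y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```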

### Variables Description 

### Check Unique Values for each variable.

In [None]:
# Distribution of the closing price (histplot replaces the deprecated distplot).
plt.figure(figsize=(7,7))
sns.histplot(df['Close'], kde=True, stat='density', color="y")

In [None]:
# Select the numeric columns (Open, High, Low, Close); Date is excluded.
numeric_features = df.select_dtypes(include='number').columns
numeric_features

In [None]:
# Plot a histogram for each numeric column, with the mean (magenta) and median (cyan) marked.

for col in numeric_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = df[col]
    feature.hist(bins=50, ax=ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
plt.show()

In [None]:
# Segregating the dataset into dependent & independent variable.
X = df.drop(['Close','Date'],axis=1)         # Independent variables.
Y = df['Close']                              # Dependent variable.

**The histograms above show how each price column is distributed; they do not show correlation. How Open, High and Low relate to Close is examined with scatter plots in Chart 6.**

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Visualisation of closing price with respect to dates.

plt.figure(figsize=(18,7))
plt.plot(df['Date'], df['Close'],linewidth=5,color='green')
plt.xlabel('Year', fontsize=18)
plt.ylabel('Close Prices', fontsize=18)

plt.title('Closing Prices along different time period', fontsize=15)
plt.grid()
plt.show()

From 2018 onwards the closing price fell sharply, and the fraud case may be the reason.


##### 1. Why did you pick the specific chart?

This is a line chart. I picked it to see how the closing price changes over time.

The closing price peaked between 2017 and 2019.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Visualisation of skewness of the independent variables.
for label in X.columns:
    plt.figure(figsize=(15,5))
    ax = sns.histplot(df[label], kde=True, stat='density', color='rebeccapurple')
    ax.set_ylabel('Density', fontsize=15)
    ax.set_xlabel(label, fontsize=15)
    plt.grid()
    plt.show()



##### 1. Why did you pick the specific chart?

I selected this chart to examine the skewness of the independent variables.

##### 2. What is/are the insight(s) found from the chart?

**The independent variables look positively skewed.**

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Applying log transformation to the independent variables.
for label in X.columns:
    plt.figure(figsize=(15,5))
    vis = sns.histplot(np.log10(df[label]), kde=True, stat='density', color='rebeccapurple')
    vis.set_ylabel('Density', fontsize=15)
    vis.set_xlabel(f'log10({label})', fontsize=15)
    plt.grid()
    plt.show()


##### 1. Why did you pick the specific chart?

I chose this chart to see the effect of the log transformation.

##### 2. What is/are the insight(s) found from the chart?

After the log10 transformation the distributions look much closer to normal.
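
The visual impression can be backed by a number: the sample skewness should drop towards zero after the transformation. The sketch below uses synthetic log-normal data as a stand-in for right-skewed prices (`prices` is hypothetical, not the Yes Bank data), so it only illustrates the idea.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal); not the Yes Bank data.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=4.0, sigma=0.8, size=500))

# Sample skewness before and after the log10 transformation.
skew_raw = prices.skew()
skew_log = np.log10(prices).skew()

print(f"skew before: {skew_raw:.2f}, after log10: {skew_log:.2f}")
```

The same two `.skew()` calls can be applied to `df['Open']`, `df['High']` and `df['Low']` to check the claim on the real columns.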

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(14,6))
vis = sns.histplot(Y, kde=True, stat='density', color='black')
vis.set_ylabel('Density', fontsize=15)
vis.set_xlabel('Close', fontsize=15)
plt.grid()
plt.show()


     

##### 1. Why did you pick the specific chart?

I picked this chart to visualise the skewness of the dependent variable.

##### 2. What is/are the insight(s) found from the chart?

The dependent variable also looks positively skewed.
     

#### Chart - 5

In [None]:
# Chart - 5 visualization code

plt.figure(figsize=(14,6))
vis = sns.histplot(np.log10(Y), kde=True, stat='density', color='black')
vis.set_ylabel('Density', fontsize=15)
vis.set_xlabel('log10(Close)', fontsize=15)
plt.grid()
plt.show()



##### 1. Why did you pick the specific chart?

I picked this chart to see the effect of the log transformation on the dependent variable.

##### 2. What is/are the insight(s) found from the chart?

After the log transformation the dependent variable looks closer to a normal distribution.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

for label in X.columns:
    fig = plt.figure(figsize=(13,6))
    ax = fig.gca()
    feature = df[label]
    target = df['Close']
    correlation = feature.corr(target)   # Pearson correlation with Close
    ax.scatter(x=feature, y=target, s=45, color='r')
    ax.set_xlabel(label, fontsize=17)
    ax.set_ylabel('Closing Price', fontsize=17)
    ax.set_title(f'Closing Price - {label} (Correlation: {correlation:.3f})', fontsize=24)
    # Least-squares straight line through the points.
    z = np.polyfit(feature, target, 1)
    y_hat = np.poly1d(z)(feature)
    ax.plot(feature, y_hat, linestyle='--', lw=3, color='black')
    ax.grid()

plt.show()


##### 1. Why did you pick the specific chart?

Bivariate analysis: each independent variable is plotted against the closing price.

##### 2. What is/are the insight(s) found from the chart?

Each independent variable is highly correlated with the dependent variable.
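
For reference, the correlation figure in each chart title is the Pearson coefficient, i.e. the covariance of the two columns divided by the product of their standard deviations. A minimal sketch on made-up numbers follows; `pearson_r`, `high` and `close` are hypothetical and only illustrate the formula next to pandas' `Series.corr`.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y) -> float:
    """Pearson correlation computed directly from centred values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical monthly values, not taken from the dataset.
high = pd.Series([12.0, 15.0, 14.0, 20.0, 25.0])
close = pd.Series([11.0, 14.5, 13.0, 18.0, 24.0])

r_manual = pearson_r(high, close)
r_pandas = high.corr(close)          # pandas uses Pearson by default
print(round(r_manual, 4), round(r_pandas, 4))
```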


#### Chart - 7

In [None]:
# Chart - 7 visualization code


plt.figure(figsize=(15,7))
sns.heatmap(X.corr(),  annot=True, cmap="rocket_r")
plt.show()


     

##### 1. Why did you pick the specific chart?

I picked this heatmap to see the correlation among the independent variables.

##### 2. What is/are the insight(s) found from the chart?

All the independent variables are highly correlated with each other.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):
    # Calculate the variance inflation factor (VIF) of every column in X.
    vif = pd.DataFrame()
    vif["Variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif

calc_vif(X)



##### 1. Why did you pick the specific chart?

To detect multicollinearity among the independent variables.

##### 2. What is/are the insight(s) found from the chart?

The VIF scores are high, which implies that the independent variables are highly collinear with each other.

As all the variables are considered equally important for predicting the closing price, I will not perform any feature engineering here.
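
As a reminder of what a VIF score measures: for column j it is 1 / (1 - R_j²), where R_j² comes from regressing column j on the remaining columns. The sketch below computes this with NumPy on synthetic data. `vif_with_intercept` and `collinear` are hypothetical names, and the helper adds an intercept to each auxiliary regression, whereas `variance_inflation_factor` uses the matrix exactly as it is passed, so these numbers are not expected to match the table above.

```python
import numpy as np

def vif_with_intercept(features: np.ndarray) -> list:
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    n_rows, n_cols = features.shape
    result = []
    for j in range(n_cols):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        result.append(1.0 / (1.0 - r_squared))
    return result

# Synthetic data: two nearly identical columns and one independent column.
rng = np.random.default_rng(42)
base = rng.normal(size=200)
collinear = np.column_stack([
    base,
    base + rng.normal(scale=0.1, size=200),  # almost a copy of the first column
    rng.normal(size=200),                    # unrelated column
])
vifs = vif_with_intercept(collinear)
print([round(v, 1) for v in vifs])
```

The two near-duplicate columns get a large VIF while the unrelated column stays close to 1, which is the pattern to look for in Open, High and Low.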

## ***6. Feature Engineering & Data Pre-processing***

In [None]:
# Dataframe to store metrics.
i = 0
eval_metric = pd.DataFrame()
     

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:

# Splitting the dataset.
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.20,random_state=1)

# Training data is 80% of total dataset.
# Test data is 20% of total dataset.
     

In [None]:

# Scaling the data.
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
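
What the scaling cell does can be written out by hand: min-max scaling maps each column to (x - min) / (max - min), and the min and max are taken from the training rows only, then reused for the test rows. This is a sketch on made-up arrays (`train` and `test` here are hypothetical, not the notebook's variables).

```python
import numpy as np

# Hypothetical training and test rows with two feature columns.
train = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 300.0]])
test = np.array([[25.0, 500.0]])

# Statistics come from the training rows only.
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range   # reuses the training min and range

print(train_scaled)
print(test_scaled)
```

Note that a test value outside the training range (500 in the second column) scales to a value above 1; that is expected when the scaler is fitted on the training split alone.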

In [None]:

# Shape of the training dataset.
x_train.shape

# Rows = 148 & Columns = 3.


In [None]:
# Shape of the test dataset.
x_test.shape

# Rows = 37 & Columns = 3.
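
The two shapes follow from simple arithmetic on the 185 rows. The sketch assumes that `train_test_split` rounds the test share up to a whole number of rows and gives the remainder to the training set; the variable names are illustrative only.

```python
import math

n_rows = 185        # total rows in the dataset
test_size = 0.20    # share passed to train_test_split

# Assumed rule: test rows are rounded up, training rows are what is left.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```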