<a href="https://colab.research.google.com/github/gaurav4601/capstoneproject2/blob/master/Yes_Bank_Price_ML_Project_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img title="Almabetter" alt="Almabetter" src="https://pbs.twimg.com/profile_images/1649033540149866497/tg4B3SVf_400x400.jpg" width=70px>

## Yes Bank Stock Closing Price Predictor
<img title="Yes Bank" alt="Yes Bank Logo" src="https://logos-download.com/wp-content/uploads/2016/06/Yes_Bank_logo.png" width=120px>


#### **About the Project**
> Yes Bank is a prominent Indian bank that has been widely discussed since 2018 because of a fraud case involving its founder, Rana Kapoor. This raises the question of how the event affected the bank's stock price, and whether predictive models, such as time series models, can capture such situations.
>
> Our dataset contains monthly Yes Bank stock prices since the bank's inception, with the opening, highest, lowest, and closing price for each month. The primary objective of this study is to forecast the stock's monthly closing price, using a range of analytical methods to find the most accurate one.

<br>
<hr>
<br>

#### **A Little Bit 😶‍🌫️ About the Domain**

> Stocks represent ownership in a publicly traded company, which individuals and institutions can purchase in the form of shares. When you buy a share, you become a shareholder in that company and can potentially benefit from its success. The stock market is where these shares are bought and sold, and where investors can profit from fluctuations in a stock's price.
>
> The stock price is the current trading price of a particular stock. It changes with market demand, trading volume, the company's financial performance, and macroeconomic factors, which makes the stock market a volatile and uncertain environment. The price reflects the market's perceived value of the company, based on factors such as its financial performance, leadership, and growth potential.
>
> Overall, stock prices and shares are essential components of the stock market, which lets investors buy and sell stocks to generate profits, while companies raise capital by selling stock to fund expansion, operations, or new projects.

<img text='illustration' src='https://img.freepik.com/free-vector/hand-drawn-stock-market-concept-with-analysts_23-2149163670.jpg?w=900&t=st=1683438353~exp=1683438953~hmac=8b0a3826b6bc810e5f5cc124fc72171ef9b0f84ccffa76196b2fd6a0ab0e3a7e' width=350px>

In [56]:
# importing all required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px


#additional as required

import plotly.graph_objs as go 
import plotly.io as pio

# set the default template for all Plotly Express charts to 'plotly_white'
pio.templates.default = 'plotly_white'


>**Data Loading from CSV**

In [3]:
# loading csv file directly from the github-raw

data = pd.read_csv('https://raw.githubusercontent.com/gaurav4601/capstoneproject2/master/data_YesBank_StockPrices.csv')

#copy data to df

df = data.copy()

### 🏛️  **First Look at the Data**

In [4]:
# head
df.head()


|  | Date | Open | High | Low | Close |
|---|---|---|---|---|---|
| 0 | Jul-05 | 13.0 | 14.0 | 11.25 | 12.46 |
| 1 | Aug-05 | 12.58 | 14.88 | 12.55 | 13.42 |
| 2 | Sep-05 | 13.48 | 14.87 | 12.27 | 13.3 |
| 3 | Oct-05 | 13.2 | 14.47 | 12.4 | 12.99 |
| 4 | Nov-05 | 13.35 | 13.88 | 12.88 | 13.41 |


In [5]:
#last rows
df.tail()

|  | Date | Open | High | Low | Close |
|---|---|---|---|---|---|
| 180 | Jul-20 | 25.6 | 28.3 | 11.1 | 11.95 |
| 181 | Aug-20 | 12.0 | 17.16 | 11.85 | 14.37 |
| 182 | Sep-20 | 14.3 | 15.34 | 12.75 | 13.15 |
| 183 | Oct-20 | 13.3 | 14.01 | 12.11 | 12.42 |
| 184 | Nov-20 | 12.41 | 14.9 | 12.21 | 14.67 |


In [6]:
#shape of data
df.shape

(185, 5)

In [9]:
# sample

df.sample(5)

|  | Date | Open | High | Low | Close |
|---|---|---|---|---|---|
| 142 | May-17 | 326.0 | 330.3 | 275.15 | 286.38 |
| 135 | Oct-16 | 253.41 | 265.5 | 245.8 | 253.52 |
| 166 | May-19 | 163.3 | 178.05 | 133.05 | 147.95 |
| 81 | Apr-12 | 73.62 | 76.1 | 69.11 | 70.07 |
| 120 | Jul-15 | 169.0 | 175.58 | 156.45 | 165.74 |


In [11]:
# duplicate values

df.duplicated().sum()

0

In [12]:
# check for null or missing values
# (isna() is an alias of isnull(), so both calls return the same counts)
print(df.isnull().sum())
df.isna().sum()

Date     0
Open     0
High     0
Low      0
Close    0
dtype: int64


Date     0
Open     0
High     0
Low      0
Close    0
dtype: int64

In [13]:
# info about the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 185 entries, 0 to 184
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Date    185 non-null    object 
 1   Open    185 non-null    float64
 2   High    185 non-null    float64
 3   Low     185 non-null    float64
 4   Close   185 non-null    float64
dtypes: float64(4), object(1)
memory usage: 7.4+ KB


In [14]:
# number summary of dataset

df.describe()

|  | Open | High | Low | Close |
|---|---|---|---|---|
| count | 185.0 | 185.0 | 185.0 | 185.0 |
| mean | 105.541405 | 116.104324 | 94.947838 | 105.204703 |
| std | 98.87985 | 106.333497 | 91.219415 | 98.583153 |
| min | 10.0 | 11.24 | 5.55 | 9.98 |
| 25% | 33.8 | 36.14 | 28.51 | 33.45 |
| 50% | 62.98 | 72.55 | 58.0 | 62.54 |
| 75% | 153.0 | 169.19 | 138.35 | 153.3 |
| max | 369.95 | 404.0 | 345.5 | 367.9 |


### **Notes from the First Look at the Data**

- The dataset has 185 rows and 5 columns.
- There are no null, missing, or duplicate values in the dataset.
- All columns have appropriate data types. The `Date` column is stored as `object` and can be converted to ```datetime``` if required.



<br>
<br>

> **Columns Information**

In the stock market, the terms "open", "high", "low", and "close" refer to different prices of a stock or security at various points in time.

> **Open** refers to the price of a security at the beginning of a trading session, typically the first few minutes of trading.

> **High** refers to the highest price that a security reached during a trading session, whether it was in the opening minutes or later in the day.

> **Low** refers to the lowest price that a security reached during a trading session, whether it was in the opening minutes or later in the day.

> **Close** refers to the price of a security at the end of a trading session, typically at the close of the stock market.

These four terms describe the performance of a stock or security over a particular time period, such as a day, week, or month. In this dataset each row covers one month. The difference between the open and close prices is often used to calculate the return over the period, while the difference between the high and low prices measures the price range, a simple indicator of volatility, for that period.
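As a minimal sketch of those two calculations, the snippet below uses the first three rows copied from the `df.head()` output above. The column names `Return_pct` and `Range` are illustrative choices and are not part of the original dataset.

```python
import pandas as pd

# First three rows of the dataset, copied from the df.head() output above
sample = pd.DataFrame({
    "Date": ["Jul-05", "Aug-05", "Sep-05"],
    "Open": [13.0, 12.58, 13.48],
    "High": [14.0, 14.88, 14.87],
    "Low": [11.25, 12.55, 12.27],
    "Close": [12.46, 13.42, 13.30],
})

# Open-to-close return for the period, as a percentage of the opening price
# (Return_pct is a hypothetical column name used only for this illustration)
sample["Return_pct"] = (sample["Close"] - sample["Open"]) / sample["Open"] * 100

# High-low range: how far the price moved within the period
# (Range is likewise a hypothetical column name)
sample["Range"] = sample["High"] - sample["Low"]

print(sample[["Date", "Return_pct", "Range"]].round(2))
```

For July 2005, for example, the range is 14.0 - 11.25 = 2.75 and the open-to-close return is negative because the month closed below its opening price.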


## **Data Wrangling**

>**Data wrangling**, also known as ```data munging```, is the process of cleaning, transforming, and mapping raw data into a usable and structured format for further analysis. It involves various techniques such as data cleaning, data transformation, data normalization, data aggregation, and data enrichment.

Five common objectives of data wrangling are:

- ```Cleaning data``` : Data cleaning is the process of identifying and correcting or removing inaccuracies, inconsistencies, and errors in data. The primary objective of data cleaning is to ensure the accuracy and completeness of the data.

- ```Transforming data``` : Data transformation involves converting the format or structure of data to make it more usable for analysis. The objective of data transformation is to create a more structured dataset that can be easily analyzed and interpreted.

- ```Integrating data``` : Data integration involves combining data from multiple sources to create a more comprehensive dataset. The objective of data integration is to provide a more complete view of the data and enable more accurate analysis.

- ```Normalizing data``` : Data normalization is the process of organizing data in a consistent and standardized format. The objective of data normalization is to reduce redundancy, improve data consistency, and ensure that data is correctly interpreted and analyzed.

- ```Enriching data``` : Data enrichment involves adding additional data or metadata to existing datasets to provide additional insights or context. The objective of data enrichment is to improve the quality and usefulness of data for analysis and decision-making.
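A few of these objectives can be sketched on a tiny frame. The data below is a hypothetical toy example, not taken from the Yes Bank dataset, and the `Change` column is an illustrative name; treat it as one possible way to chain the steps rather than a prescribed pipeline.

```python
import pandas as pd

# Hypothetical toy data (not from the Yes Bank dataset): one duplicated row,
# one missing value, and prices stored as strings
raw = pd.DataFrame({
    "Date": ["Jan-20", "Feb-20", "Feb-20", "Mar-20"],
    "Close": ["10.5", "11.0", "11.0", None],
})

clean = (
    raw.drop_duplicates()                                  # cleaning: remove repeated rows
       .dropna(subset=["Close"])                           # cleaning: drop rows with no price
       .assign(Close=lambda d: d["Close"].astype(float))   # transforming: fix the dtype
       .reset_index(drop=True)
)

# Enriching: add a derived column with the change from the previous row
clean["Change"] = clean["Close"].diff()

print(clean)
```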

> **Note**

> The first look showed that there are no **```Missing / Duplicated / Null Values```** in the dataset,

  so we can skip the ```Data Wrangling``` step for now and revisit it if needed.



### **Data Types And Fixing**

Handling data types is an important aspect of machine learning projects. Here are three points that highlight its importance:

- Compatibility with ML algorithms: Machine learning algorithms have different requirements for data types. For instance, some algorithms work better with numerical data, while others work better with categorical data. Thus, it is important to handle data types to ensure that the data is compatible with the chosen algorithm.

- Data quality and accuracy: Incorrect data types can lead to incorrect predictions and inaccurate results. For instance, if a categorical variable is treated as a numerical variable, it may lead to wrong conclusions. Handling data types helps to ensure that the data is accurate and of good quality, which in turn improves the accuracy of the machine learning model.

- Efficient data processing: Handling data types can also improve the efficiency of data processing. Different data types have different sizes and processing requirements. For example, numerical data types require less storage space and can be processed faster than string data types. Thus, handling data types can improve the speed and efficiency of data processing, making it easier to work with large datasets.
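A small illustration of why the data type matters, assuming month labels in the same `Mon-YY` style that this dataset uses: stored as strings they sort alphabetically, while parsed as datetimes they sort chronologically.

```python
import pandas as pd

# Month labels in the dataset's 'Mon-YY' style, listed in chronological order
labels = pd.Series(["Jul-05", "Aug-05", "Sep-05", "Oct-05"])

# As plain strings they sort alphabetically, which scrambles the timeline
print(sorted(labels))  # ['Aug-05', 'Jul-05', 'Oct-05', 'Sep-05']

# Parsed as datetimes they sort chronologically
parsed = pd.to_datetime(labels, format="%b-%y")
print(parsed.sort_values().dt.month.tolist())  # [7, 8, 9, 10]
```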

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 185 entries, 0 to 184
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Date    185 non-null    object 
 1   Open    185 non-null    float64
 2   High    185 non-null    float64
 3   Low     185 non-null    float64
 4   Close   185 non-null    float64
dtypes: float64(4), object(1)
memory usage: 7.4+ KB


> All columns are in the correct format and data type. We will convert the ```Date``` column from object to datetime to make exploring and drilling down into the data easier.

In [21]:
# Convert the Date column to datetime dtype ('%b-%y' parses labels such as 'Jul-05')

df["Date"] = pd.to_datetime(df["Date"], format='%b-%y')

In [23]:
# Confirming Dtype

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 185 entries, 0 to 184
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Date    185 non-null    datetime64[ns]
 1   Open    185 non-null    float64       
 2   High    185 non-null    float64       
 3   Low     185 non-null    float64       
 4   Close   185 non-null    float64       
dtypes: datetime64[ns](1), float64(4)
memory usage: 7.4 KB


### **Feature Engineering**
> Feature engineering is the process of selecting, transforming, and creating new variables or features from existing data to improve the performance of machine learning models. It involves the identification of the most important variables and their relationships with the target variable, as well as the creation of new features that can capture complex patterns in the data.

In [31]:
# extract a month and year from the Date Column

df['Month'] = df.Date.dt.month_name()
df['Year'] = df.Date.dt.year

In [32]:
df.head()

|  | Date | Open | High | Low | Close | Month | Year |
|---|---|---|---|---|---|---|---|
| 0 | 2005-07-01 | 13.0 | 14.0 | 11.25 | 12.46 | July | 2005 |
| 1 | 2005-08-01 | 12.58 | 14.88 | 12.55 | 13.42 | August | 2005 |
| 2 | 2005-09-01 | 13.48 | 14.87 | 12.27 | 13.3 | September | 2005 |
| 3 | 2005-10-01 | 13.2 | 14.47 | 12.4 | 12.99 | October | 2005 |
| 4 | 2005-11-01 | 13.35 | 13.88 | 12.88 | 13.41 | November | 2005 |


In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 185 entries, 0 to 184
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Date    185 non-null    datetime64[ns]
 1   Open    185 non-null    float64       
 2   High    185 non-null    float64       
 3   Low     185 non-null    float64       
 4   Close   185 non-null    float64       
 5   Month   185 non-null    object        
 6   Year    185 non-null    int64         
dtypes: datetime64[ns](1), float64(4), int64(1), object(1)
memory usage: 10.2+ KB


## **EDA: Exploratory Data Analysis**
> EDA is a crucial part of any data science project: it reveals trends, anomalies, and relationships in the data before any modelling begins.


> **“If you torture the data long enough, it will confess to anything”**
>
> -- <cite> Ronald H. Coase, a renowned British Economist</cite>

In [57]:
# line chart of closing prices to observe the trend

fig = px.line(df,x='Date',y='Close')
fig.update_layout(title='Yes Bank Stock Closing Price Trend 2005-2020')


fig.show()

> Since this is time series data, a line chart is a natural way to show the trend over time.

> Key takeaway: the stock rises between 2015 and 2018 and then falls sharply after 2018.

> We can drill down further into the monthly trend for 2017 and 2018.

In [42]:
#looking for trend in 2017 and 2018

_2017 = df.query('Year == 2017')
_2018 = df.query('Year == 2018')

In [58]:
fig = px.line(_2017,x='Month',y='Close')
fig.update_layout(title='Monthly Trend in 2017')
fig.show()

In [59]:
fig = px.line(_2018,x='Month',y='Close')
fig.update_layout(title='Monthly Trend in 2018')
fig.show()

> The stock looks volatile in early 2017 and reaches a high of about Rs 360,
but throughout 2018 it shows a sharp decline and does not recover afterwards.


> This coincides with the Yes Bank scam involving founder Rana Kapoor coming to light.

[Recent News Article](https://www.ndtv.com/india-news/yes-bank-fraud-rs-5-000-crore-fraud-by-yes-banks-rana-kapoor-wadhawans-probe-agency-2913012)
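One way to put a number on such a fall is a drawdown, the percentage drop from the highest price seen so far. The sketch below is only an illustration: it uses six closing prices copied from the outputs shown earlier in this notebook, so the figures differ from what the full `df["Close"]` series would give.

```python
import pandas as pd

# A few closing prices copied from the outputs shown earlier in this notebook
# (a small subset only, so results differ from those for the full dataset)
close = pd.Series(
    [165.74, 253.52, 286.38, 147.95, 11.95, 14.67],
    index=pd.to_datetime(
        ["Jul-15", "Oct-16", "May-17", "May-19", "Jul-20", "Nov-20"],
        format="%b-%y",
    ),
    name="Close",
)

# Drawdown: percentage drop of each price from the highest price seen so far
running_peak = close.cummax()
drawdown = (close / running_peak - 1) * 100

print(drawdown.round(1))
print("Worst drawdown:", round(drawdown.min(), 1), "% in", drawdown.idxmin().strftime("%Y-%m"))
```

On this subset the worst drawdown is measured against the May 2017 close of 286.38, since that is the highest value among the six points.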

### Checking for Outliers in the Dataset

> An outlier is a data point that differs significantly from the other data points in the same dataset. Outliers can occur for various reasons, such as measurement errors, data processing errors, or genuine variability in the data. They can have a significant impact on data analysis and modeling, because they distort statistical measures such as the mean and standard deviation and affect the accuracy and validity of machine learning models. Outliers can also lead to incorrect conclusions about the relationships between variables.

In [55]:
# fig = px.box(df[['Open','High','Low','Close']],title='Box Plot For Outlier Detection')
# fig.show()


import plotly.graph_objs as go

# price columns to plot, one box per column
boxes = ['Open', 'High', 'Low', 'Close']

# set the colors for each box
colors = ['blue', 'orange', 'green', 'red']

# build one Box trace per column, pairing each column with its color
traces = [
    go.Box(y=df[col], name=col, marker=dict(color=color))
    for col, color in zip(boxes, colors)
]

# create the layout object: set the title and transparent backgrounds
layout = go.Layout(title='Box Plot For Outlier Detection', paper_bgcolor='rgba(255, 255, 255, 0)',
    plot_bgcolor='rgba(255, 255, 255, 0)')

# create the figure object and plot the traces
fig = go.Figure(data=traces, layout=layout)
fig.show()
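Box plots conventionally flag points that lie more than 1.5 × IQR beyond the quartiles. The same rule can be sketched numerically; the helper names `iqr_fences` and `iqr_outliers` below are hypothetical, the `demo` values are made up, and the quartiles pandas computes may differ slightly from the ones Plotly draws.

```python
import pandas as pd

def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Return the entries of a Series that fall outside the IQR fences."""
    low, high = iqr_fences(values.quantile(0.25), values.quantile(0.75), k)
    return values[(values < low) | (values > high)]

# Quartiles of Close taken from the df.describe() output earlier in the notebook
low, high = iqr_fences(33.45, 153.3)
print(f"Close fences: {low:.3f} to {high:.3f}")

# Hypothetical values, only to demonstrate the helper;
# on the notebook's DataFrame it could be called as iqr_outliers(df["Close"])
demo = pd.Series([10, 12, 11, 13, 12, 95])
print(iqr_outliers(demo))
```

With the quartiles reported by `df.describe()`, the upper fence for `Close` is about 333, below the reported maximum of 367.9, so at least the maximum closing price would be flagged as an upper outlier under this rule.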

