<a href="https://colab.research.google.com/github/abuazlan19121/Regression---Yes-Bank-Stock-Closing-Price-Prediction/blob/main/Regression_Yes_Bank_Stock_Closing_Price_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Regression - Yes Bank Stock Closing Price Prediction**



##### **Project Type**    - Regression
##### **Contribution**    - Individual

# **Project Summary -**

Yes Bank is a well-known bank in the Indian financial domain. Since 2018, it has been in the news due to a fraud case involving Rana Kapoor. This event has significantly impacted the stock prices of the company. The aim is to investigate this impact and determine if time series models or other predictive models can accurately forecast stock prices in such situations.

This dataset contains monthly stock prices of Yes Bank since its inception, including the opening, closing, highest, and lowest stock prices for each month. The main objective is to predict the stock’s closing price of the month.

**Understanding Stock**

A stock or share (also known as a company's 'equity') is a financial instrument representing ownership in a company. Units of stock are called "shares." Stocks are predominantly bought and sold on stock exchanges, though private sales can occur as well. They form the foundation of many individual investors' portfolios.

**Dataset Features**

Open: The price at which a security first trades upon the opening of an exchange on a trading day. This may not be the same as the previous day's closing price.

High: The highest price at which a stock traded during a period.

Low: The lowest price at which a stock traded during a period.

Close: The trading price of a stock at the end of a trading day. It is the most recent price until the next trading session. The closing price is calculated as the weighted average price of the last 30 minutes of trading (from 3:00 PM to 3:30 PM in the case of equity).
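As a rough illustration of the weighted-average rule above, the sketch below averages a handful of made-up trades. The trade prices, the volumes, and the function name `weighted_close` are hypothetical, and weighting by traded volume is an assumption; the dataset itself only stores the resulting monthly values.

```python
def weighted_close(prices, volumes):
    """Return the volume-weighted average of the given trade prices."""
    total_volume = sum(volumes)
    if total_volume == 0:
        raise ValueError("total volume must be positive")
    return sum(p * v for p, v in zip(prices, volumes)) / total_volume


# Hypothetical trades from the last 30 minutes of a session
prices = [100.0, 101.0, 102.0]
volumes = [200, 100, 100]

close_price = weighted_close(prices, volumes)
print(close_price)
```

Trades with larger volume pull the result toward their price, so the value sits closer to 100 than the plain mean of the three prices would.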

# **GitHub Link -**

https://github.com/abuazlan19121/Regression---Yes-Bank-Stock-Closing-Price-Prediction

# **Problem Statement**


**Objective:**
The primary goal of this project is to develop a robust and accurate predictive model to forecast the closing prices of Yes Bank's stock. This task requires understanding and capturing the complex dynamics and trends in the stock prices, particularly the historical trend of an increasing price followed by a sudden decline after 2018.

**Key Challenge:**
A significant challenge in this project is addressing the issue of multicollinearity present in the dataset. Multicollinearity arises when there is a high correlation between independent variables, leading to difficulties in model interpretation and potentially affecting prediction accuracy.

**Approach:**
To overcome this challenge, the model will incorporate techniques to handle multicollinearity effectively. This ensures that independent variables are appropriately considered in the prediction process, enhancing the model's interpretability and accuracy.

**Considerations:**

*   Historical trends and patterns in Yes Bank's stock prices.
*   Techniques to mitigate multicollinearity in the dataset.
*   Robust methods for accurate stock price prediction.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np  # Used for numerical and complex calculations
import pandas as pd  # Used for data manipulation and analysis

# Visualization libraries
import matplotlib.pyplot as plt  # Used for creating static, animated, and interactive visualizations
import seaborn as sns  # Used for making statistical graphics
import plotly.express as px  # Used for creating interactive visualizations

import math  # Standard-library math functions (replaces the numpy.math alias, which recent NumPy versions no longer provide)

# Scikit-learn libraries for preprocessing, model selection, and evaluation
from sklearn.preprocessing import MinMaxScaler  # Used for feature scaling
from scipy.stats import zscore  # Used for statistical analysis
from sklearn.model_selection import train_test_split  # Used for splitting data into training and testing sets
from sklearn.linear_model import LinearRegression  # Used for linear regression modeling
from sklearn.metrics import mean_squared_error  # Used for calculating mean squared error
from sklearn.metrics import r2_score  # Used for calculating R^2 score
from sklearn.metrics import mean_absolute_error  # Used for calculating mean absolute error
from sklearn import metrics  # Provides various machine learning evaluation metrics

import warnings  # Used to manage warnings
warnings.filterwarnings('ignore')  # Ignores warning messages

from datetime import datetime  # Used for manipulating dates and times


### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('/content/data_YesBank_StockPrices.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

### What did you know about your dataset?

The dataset comprises 185 rows and 5 columns, that is, 185 monthly observations with 5 fields each. The observations represent Yes Bank's stock prices over time, and the columns are the date, open price, high price, low price, and close price.

The Date column, which currently has the datatype 'object,' needs to be converted to DateTime. This change is necessary to perform date-related operations, such as calculating the day of the week, month, or year.
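The date-related operations mentioned above can be sketched on a small sample. The three date strings below are hypothetical values in the month-year layout (`%b-%y`) that the notebook parses later; `Year` and `Month` are illustrative column names, not columns of the dataset.

```python
import pandas as pd

# Hypothetical sample in the "Mon-YY" layout of the Date column
sample = pd.DataFrame({"Date": ["Jul-05", "Aug-05", "Jan-18"]})

# Parse the strings into datetime64 values
sample["Date"] = pd.to_datetime(sample["Date"], format="%b-%y")

# Once the column is datetime, the .dt accessor exposes date parts
sample["Year"] = sample["Date"].dt.year
sample["Month"] = sample["Date"].dt.month

print(sample.dtypes)
print(sample)
```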

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Convert the Date column from strings such as "Jul-05" to datetime
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')

In [None]:
df.head()

### What all manipulations have you done and insights you found?

Little data wrangling was needed because the dataset has only a few columns. The only change was converting the Date column from `object` to datetime.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
import plotly.graph_objects as go  # Import Plotly's graph objects module for creating the candlestick chart

# Create a candlestick chart
fig = go.Figure(go.Candlestick(
    x=df['Date'],  # Set the x-axis to the Date column (df.index is only the row number)
    open=df['Open'],  # Set the open prices
    close=df['Close'],  # Set the close prices
    high=df['High'],  # Set the high prices
    low=df['Low']  # Set the low prices
))

# Update layout of the chart
fig.update_layout(
    title={
        'text': 'Describing The Price Movements',  # Chart title
        'x': 0.5,  # Center the title horizontally
        'y': 0.95,  # Position the title near the top
        'font': {'color': 'white'}  # Set the title font color to white
    },
    xaxis=dict(
        title='Year',  # X-axis title
        title_font={'color': 'white'},  # Set the x-axis title font color to white
        tickfont={'color': 'white'}  # Set the x-axis tick font color to white
    ),
    yaxis=dict(
        title='Price',  # Y-axis title
        title_font={'color': 'white'},  # Set the y-axis title font color to white
        tickfont={'color': 'white'}  # Set the y-axis tick font color to white
    ),
    width=1450,  # Set the width of the chart
    height=1000,  # Set the height of the chart
    plot_bgcolor='rgb(36,40,47)',  # Set the background color of the plot area
    paper_bgcolor='rgb(51,56,66)'  # Set the background color of the entire figure
)

# Display the chart
fig.show()


##### 1. Why did you pick the specific chart?

We chose the candlestick chart for visualizing price movements because it effectively conveys essential information. It provides a clear representation of open, high, low, and close prices, making it an ideal choice for financial analysis, especially for stocks and other assets.

##### 2. What is/are the insight(s) found from the chart?

The analysis of Yes Bank stock prices shows a clear pattern. Before 2018, the stock consistently trended upward, reflecting positive growth and investor optimism. However, after this period, there was a significant decline, mainly due to the fraud case involving Rana Kapoor, the former CEO.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The impact of the Yes Bank fraud case on stock prices is clear in the abrupt trend change. The case led to increased scrutiny and regulatory interventions, causing negative sentiment about the bank's future prospects. As a result, investors reacted by selling their shares, leading to a rapid decline in stock prices.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

# Create a 2x2 subplot
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Plot distribution of 'Open' prices
sns.histplot(df['Open'], kde=True, ax=axes[0, 0], color='blue')
axes[0, 0].set_title('Distribution of Open Prices')

# Plot distribution of 'High' prices
sns.histplot(df['High'], kde=True, ax=axes[0, 1], color='green')
axes[0, 1].set_title('Distribution of High Prices')

# Plot distribution of 'Close' prices
sns.histplot(df['Close'], kde=True, ax=axes[1, 0], color='red')
axes[1, 0].set_title('Distribution of Close Prices')

# Plot distribution of 'Low' prices
sns.histplot(df['Low'], kde=True, ax=axes[1, 1], color='purple')
axes[1, 1].set_title('Distribution of Low Prices')

# Adjust layout for better spacing between plots
plt.tight_layout()

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

The chosen chart, which combines histograms and KDE plots, effectively visualizes the distribution of each variable in the dataset. It allows for the examination of central tendency, spread, and the shape of the distributions. The subplots enable easy comparison between variables.

##### 2. What is/are the insight(s) found from the chart?

The distributions of open, high, low, and close prices in the chart are positively skewed. This indicates that most data points are concentrated on the left side of the distributions, with a tail extending towards larger values on the right side.
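Positive skew can also be checked numerically rather than only by eye. The sketch below uses synthetic log-normal values as a stand-in for a price column (they are not the Yes Bank data) and reports the sample skewness together with the mean and median.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for a price column
prices = pd.Series(rng.lognormal(mean=4.0, sigma=0.8, size=500))

skewness = prices.skew()  # > 0 means a longer tail on the right
print(f"skewness = {skewness:.2f}")
print(f"mean = {prices.mean():.1f}, median = {prices.median():.1f}")
```

In a right-skewed distribution the long tail pulls the mean above the median, which is a quick sanity check alongside the histogram. On the notebook's data the same call would be `df['Close'].skew()`.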

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights about the positively skewed distributions of open, high, low, and close prices can inform strategic decision-making and identify potential buying opportunities. However, positive skewness alone does not imply negative growth. Assessing negative growth requires a comprehensive analysis of various factors, including trends, market conditions, and external influences. Concluding specific insights about negative growth based solely on skewness is not justified. Further analysis is needed to evaluate potential negative impacts on business growth.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
from plotly.subplots import make_subplots

# Create a 1x4 subplot using Plotly
fig = make_subplots(rows=1, cols=4, subplot_titles=('Close Boxplot', 'Open Boxplot', 'High Boxplot', 'Low Boxplot'))

# Add Close Boxplot
fig.add_trace(
    go.Box(y=df['Close'], name='Close', marker_color='blue'),
    row=1, col=1
)

# Add Open Boxplot
fig.add_trace(
    go.Box(y=df['Open'], name='Open', marker_color='green'),
    row=1, col=2
)

# Add High Boxplot
fig.add_trace(
    go.Box(y=df['High'], name='High', marker_color='red'),
    row=1, col=3
)

# Add Low Boxplot
fig.add_trace(
    go.Box(y=df['Low'], name='Low', marker_color='purple'),
    row=1, col=4
)

# Update layout
fig.update_layout(
    title_text='Boxplots of Stock Prices',
    width=1200,
    height=400,
    showlegend=False
)

# Show the plot
fig.show()

##### 1. Why did you pick the specific chart?

The chart used here is a boxplot. It was chosen for its effectiveness in comparing multiple variables, detecting outliers, visualizing distributions, and providing a concise summary of the data.

##### 2. What is/are the insight(s) found from the chart?

The presence of outliers in each feature indicates extreme values that deviate significantly from the overall pattern of the data. These outliers can potentially impact the model fitting process and the accuracy of predictions. Therefore, it is crucial to address these outliers before proceeding with model fitting.
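The points a boxplot draws beyond its whiskers usually follow the 1.5 × IQR rule, which can be reproduced in a few lines. This is a minimal sketch on made-up numbers; `iqr_outlier_mask` is a hypothetical helper, and the quartile method of a plotting library may differ slightly from NumPy's default.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up prices with one extreme value
prices = [10, 12, 13, 14, 15, 16, 18, 20, 300]
mask = iqr_outlier_mask(prices)

print(np.asarray(prices)[mask])
```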

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Handling outliers ensures that the model captures underlying patterns and relationships accurately, leading to more reliable predictions and interpretations. It also improves the model's robustness against extreme observations that may introduce bias or noise. Properly addressing outliers contributes to the overall validity and integrity of the analysis, enhancing the reliability of the model fitting process and subsequent predictions.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create subplots
fig = make_subplots(rows=1, cols=len(df.columns) - 1, subplot_titles=[f'Scatterplot of {col} vs Close' for col in df.columns if col != 'Close'])

# Add scatter plots for each feature against 'Close'
for i, column in enumerate(df.columns):
    if column != 'Close':
        fig.add_trace(
            go.Scatter(
                x=df[column],
                y=df['Close'],
                mode='markers',
                name=column
            ),
            row=1, col=i+1
        )

# Update layout
fig.update_layout(
    title_text='Scatterplots of Features vs Close',
    width=1500,
    height=500,
    showlegend=False
)

# Update axis titles
for i, column in enumerate(df.columns):
    if column != 'Close':
        fig.update_xaxes(title_text=column, row=1, col=i+1)
        fig.update_yaxes(title_text='Close', row=1, col=i+1)

# Show the plot
fig.show()


##### 1. Why did you pick the specific chart?

Scatter plots show the relationship between each feature and the 'Close' price directly. They reveal at a glance whether a relationship is linear, how tightly the points follow the trend, and whether any observations stand apart. This makes them a natural first step for identifying potential predictors and for communicating those relationships to stakeholders.

##### 2. What is/are the insight(s) found from the chart?

The scatter plots show that Open, High, and Low each have a strong, approximately linear relationship with the dependent variable, 'Close'. This suggests a consistent and predictable relationship between these variables. The Date panel is not linear; it traces the price history, including the decline after 2018.
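A linear relationship of this kind can be quantified with a correlation coefficient and a least-squares line. The sketch below does so on synthetic data; `open_price` and `close_price` are made-up arrays standing in for two price columns, not the notebook's values.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for an 'Open' and a 'Close' column
open_price = np.linspace(10, 400, 100)
close_price = 1.02 * open_price + rng.normal(0, 10, size=100)

# Pearson correlation coefficient between the two series
r = np.corrcoef(open_price, close_price)[0, 1]

# Degree-1 polynomial fit: coefficients come back highest power first
slope, intercept = np.polyfit(open_price, close_price, deg=1)

print(f"r = {r:.3f}")
print(f"close = {slope:.2f} * open + {intercept:.2f}")
```

A value of `r` close to 1 together with a stable slope is what "linear relationship" means in practice; the fitted line could be overlaid on the scatter plot as an extra trace.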

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Prediction and Forecasting :

With a clear understanding of the linear relationships, regression models can be developed to predict future 'Close' prices based on the values of the independent variables. This assists in forecasting stock performance and informing investment decisions.

Risk Assessment :

Analyzing the strength and direction of the linear relationships helps assess the risk associated with changes in the independent variables. This knowledge is valuable for risk management and portfolio optimization strategies.

Feature Selection :

Identifying the linear relationships enables determination of the most influential independent variables affecting the 'Close' price. This information guides feature selection and variable prioritization in future analyses or model development.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

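# Calculate the correlation matrix of the numeric price columns
# ('Date' is dropped because a datetime column has no place in a correlation matrix)
corr = df.drop(columns='Date').corr()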

# Create a heatmap to visualize the correlation matrix
sns.heatmap(
    corr,  # Data to plot
    annot=True,  # Display correlation coefficients on the heatmap
    cmap='coolwarm',  # Color map for heatmap
    fmt='.2f',  # Format for annotation text
    vmin=-1, vmax=1,  # Color scale range
    linewidths=0.5,  # Width of the lines that will divide each cell
    linecolor='black'  # Color of the lines that divide each cell
)

# Add titles and labels to the plot
plt.title('Correlation Heatmap of Features')
plt.show()  # Display the heatmap

##### 1. Why did you pick the specific chart?

We chose a heatmap because it uses color intensity to show the magnitude of a value across two dimensions, which makes the pairwise correlations between features easy to scan.

##### 2. What is/are the insight(s) found from the chart?

The presence of high correlations between independent variables in our dataset suggests potential multicollinearity. Multicollinearity can negatively impact model fitting and prediction accuracy, as small changes in one variable can cause unpredictable results in the model.

To assess the extent of multicollinearity, we can calculate the Variance Inflation Factor (VIF). Analyzing VIF values helps determine which variables to retain in the analysis and which may need removal to mitigate multicollinearity. This evaluation is crucial for ensuring the robustness and reliability of our models, supporting accurate predictions and interpretations of variable relationships.
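For reference, the VIF of a feature is 1 / (1 − R²), where R² comes from regressing that feature on all the others. The sketch below computes it with plain NumPy on synthetic columns; `vif_table` is a hypothetical helper and the `demo` frame is made-up data that only mimics strongly related price columns.

```python
import numpy as np
import pandas as pd


def vif_table(features: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing feature j on the rest."""
    vifs = {}
    for col in features.columns:
        y = features[col].to_numpy(dtype=float)
        others = features.drop(columns=col).to_numpy(dtype=float)
        X = np.column_stack([np.ones(len(y)), others])  # add an intercept column
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        residuals = y - X @ coef
        r2 = 1.0 - residuals.var() / y.var()
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")


rng = np.random.default_rng(1)
base = rng.uniform(10, 400, size=185)

# Synthetic columns: three move together, one is independent noise
demo = pd.DataFrame({
    "Open": base,
    "High": base * 1.10 + rng.normal(0, 5, size=185),
    "Low": base * 0.90 + rng.normal(0, 5, size=185),
    "Noise": rng.normal(0, 1, size=185),
})

vif = vif_table(demo)
print(vif.round(1))
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity; the independent column stays near 1 while the related columns score far higher.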

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Since all correlations are positive, there’s no direct evidence suggesting negative growth.

#### Pair Plot



In [None]:
# Pair Plot visualization code
sns.pairplot(df)
plt.show()

##### 1. Why did you pick the specific chart?

A pairplot is a powerful visualization tool used to explore relationships between multiple numerical features in a dataset.

##### 2. What is/are the insight(s) found from the chart?

The variables Open, High, and Low show a strong correlation with the Close variable, indicating a close relationship between the stock's opening, highest, lowest, and closing prices. The Open, High, and Low variables also exhibit a high correlation with each other, suggesting they move in sync and share similar trends. These correlations provide valuable insights for analyzing the Yes Bank stock and can serve as predictors of the closing price.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

The dataset contains no missing values, so no imputation was required.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Create a 2x2 subplot using Plotly
fig = make_subplots(rows=2, cols=2, subplot_titles=(
    'Close BoxPlot (Log-transformed)',
    'Open BoxPlot (Log-transformed)',
    'High BoxPlot (Log-transformed)',
    'Low BoxPlot (Log-transformed)'
))

# Add Log-transformed Close Boxplot
fig.add_trace(
    go.Box(
        y=np.log10(df['Close']),
        name='Close',
        marker_color='blue'
    ),
    row=1, col=1
)

# Add Log-transformed Open Boxplot
fig.add_trace(
    go.Box(
        y=np.log10(df['Open']),
        name='Open',
        marker_color='green'
    ),
    row=1, col=2
)

# Add Log-transformed High Boxplot
fig.add_trace(
    go.Box(
        y=np.log10(df['High']),
        name='High',
        marker_color='red'
    ),
    row=2, col=1
)

# Add Log-transformed Low Boxplot
fig.add_trace(
    go.Box(
        y=np.log10(df['Low']),
        name='Low',
        marker_color='purple'
    ),
    row=2, col=2
)

# Update layout
fig.update_layout(
    title_text='Boxplots of Log-transformed Stock Prices',
    width=1200,
    height=800,
    showlegend=False
)

# Update the y-axis titles (the x-axis only carries the category name)
fig.update_yaxes(title_text='Log-transformed Values')

# Show the plot
fig.show()


##### What all outlier treatment techniques have you used and why did you use those techniques?

The columns exhibited right-skewed distributions, so we applied a log10 transformation to reduce the skew and improve visualization. A log10 transformation reduces skewness, compresses the range of values, and diminishes the impact of outliers. After the transformation, the boxplots show no extreme values beyond the whiskers, which indicates that the outliers have been brought under control. Note that the transformation is applied only inside the plotting calls, so `df` itself is unchanged; the same transformation would have to be applied to the features before modeling for it to take effect. Log transformation also affects interpretability and should be aligned with the analysis goals and data characteristics.
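The effect of the transformation, and the step needed to read transformed values as prices again, can be sketched as follows. The data is synthetic (log-normal values standing in for a price column), and `log_prices` and `restored` are illustrative names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Synthetic right-skewed values standing in for a price column
prices = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=500))

# log10 compresses large values much more than small ones
log_prices = np.log10(prices)

print(f"skew before: {prices.skew():.2f}")
print(f"skew after:  {log_prices.skew():.2f}")

# Anything computed on the log scale must be mapped back with 10 ** x
restored = 10 ** log_prices
```

The back-transformation matters for interpretability: a model trained on log-scaled targets predicts log prices, and errors measured on that scale are not in currency units.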

### 3. Categorical Encoding

#### What all categorical encoding techniques have you used & why did you use those techniques?

Since our dataset consists exclusively of numerical features, categorical encoding is not required. With no categorical variables present, there is no need to convert them into numerical representations for analysis or modeling.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

#### 2. Feature Selection

##### What all feature selection methods have you used  and why?

Given the small size of the dataset, feature selection may be impractical. Reducing the feature space with a limited number of observations could result in unreliable or biased outcomes. Therefore, it is advisable to retain all available features for analysis and modeling to ensure a comprehensive and accurate representation of the data.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Since the dataset is already small, with a limited number of observations and only a handful of features, there is no need for dimensionality reduction techniques.

### 8. Data Splitting

In [None]:
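# Split the data into train and test sets (80:20)
from sklearn.model_selection import train_test_split

# Features: the numeric price columns. 'Date' is dropped because a datetime column cannot be passed to the regressors directly
x = df.drop(columns=['Date', 'Close'])
y = df['Close']  # Target: monthly closing price
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=34)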

In [None]:
xtrain.shape,xtest.shape,ytrain.shape,ytest.shape

##### What data splitting ratio have you used and why?

To train the model effectively, we are using an 80:20 split ratio, allocating 80% of the data for training and 20% for testing. However, due to the small dataset size, acquiring more data could be beneficial. Increasing the training data size enhances the model's ability to learn and generalize from the data. More data helps improve performance, reduces the risk of overfitting, and provides a more comprehensive view of the underlying patterns and relationships within the dataset.
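To make the consequence of the ratio concrete, the sketch below applies the same 80:20 split to placeholder arrays with 185 rows, the size reported for this dataset. The arrays `X` and `y` are dummy values, not the stock prices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (185)
X = np.arange(185 * 3).reshape(185, 3)
y = np.arange(185)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=34
)

print("train rows:", X_train.shape[0])
print("test rows: ", X_test.shape[0])
```

With so few rows in the test set, a single unusual month can noticeably move the evaluation metrics, which is one reason cross-validation is worth considering alongside a single hold-out split.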

## ***7. ML Model Implementation***

### ML Model - 1

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***