# **Project Name**    -    *YES Bank : Stock Closing Price*



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual
##### **Team Member 1 - Faizaan Bhati**

# **Project Summary -**

YES Bank has been one of the most talked-about stocks in the Indian financial sector, especially following the liquidity crisis and the subsequent intervention by the RBI and State Bank of India in 2020. This project aims to conduct a comprehensive Exploratory Data Analysis (EDA) and develop a Regression model to predict the stock's closing price.

The dataset contains monthly stock prices of YES Bank, including the opening price, the highest price of the month, the lowest price, and the closing price. The core of this project involves cleaning the data, handling date-time conversions, and identifying patterns—specifically looking at how the stock plummeted after 2018. We will employ various visualization techniques (Univariate, Bivariate, and Multivariate) to understand the distribution of prices and the correlation between different price points. By the end of this study, we aim to build a predictive model that can assist investors and stakeholders in understanding price movements based on historical trends.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The objective is to predict the monthly closing price of YES Bank stock. Since stock prices are highly volatile and influenced by market sentiment and financial health, the challenge lies in creating a model that captures the trend accurately despite the significant crash observed in the stock’s history.

**Key Questions to Answer:**

  1. How has the stock price trended over the years?

  2. Is there a strong correlation between the Open, High, and Low prices with the Closing price?

  3. Can we build a model that remains robust even after the 2018-2020 price drop?

#### **Define Your Business Objective?**

The primary business objective is to provide actionable insights into stock price behavior to help investors make informed decisions. By predicting the closing price, the bank and potential investors can better manage risk and identify potential buy/sell signals based on historical volatility.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Scaling and Modeling
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

%matplotlib inline

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# read data
df = pd.read_csv('/content/drive/MyDrive/data_YesBank_StockPrices.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(df.shape[1], "Columns")
print(df.shape[0], "rows")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print("Total Duplicate: ", duplicate_count)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
import missingno as msno

msno.matrix(df)
plt.show()  # call the function; a bare `plt.show` does nothing

### What did you know about your dataset?

The dataset is relatively small but high in impact. It consists of monthly stock price data. Preliminary observation shows a Date column that needs to be converted from a string format (e.g., 'Jul-05') to a datetime object for proper time-series analysis. There are no categorical variables other than the date; all other features are continuous numerical values representing currency.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(list(df.columns), "\n")

In [None]:
# Dataset Describe
df.describe()

### Variables Description

**Date**	(Object/String)	The month and year of the stock record (e.g., "Jul-05"). This needs to be converted to a datetime format for analysis.

**Open**	(Float/Numeric)	The price at which the YES Bank stock started trading at the beginning of that specific month.

**High**	(Float/Numeric)	The maximum price reached by the stock during that month.

**Low**	(Float/Numeric)	The minimum price the stock touched during that month.

**Close**	(Float/Numeric)	The final price at which the stock settled at the end of the month. This is the target variable for regression.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
df.columns

In [None]:
# 1. Converting Date column to datetime format
# The format '%b-%y' parses 'Jul-05' where %b is month abbreviation and %y is 2-digit year
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')

# 2. Sort the data by Date
df.sort_values(by='Date', inplace=True)

# 3. Check for any missing values created during conversion
print(f"Missing values after wrangling:\n{df.isnull().sum()}")

# 4. Create additional features: Month and Year for easier grouping
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

# 5. Check the first few rows to verify the changes
df.head()

### What all manipulations have you done and insights you found?

#### **Manipulations Performed:**

**Date Transformation:** I converted the Date column from a string (Object) to a proper datetime64 object. This allows us to perform time-series operations, sort chronologically, and extract specific periods.

**Chronological Sorting:** Stock data is often provided in reverse or random order. I sorted the dataset by Date to ensure that our line charts and rolling averages reflect the true progression of time.

**Feature Extraction:** I extracted Year and Month as separate columns. This enables grouped analysis (e.g., "Average closing price per year"), which supports the UBM rule required for this project.

**Integrity Check:** I verified that no null values were introduced during the type conversion.

#### **Insights Found:**

**Data Density:** After sorting, it becomes clear that we have a monthly frequency of data points.

**Time Span:** The dataset covers the journey of YES Bank from its early growth stages (starting around 2005) through its peak and the subsequent liquidity crisis.

**Feature Consistency:** All price features (Open, High, Low, Close) are numerical and on a similar scale, meaning they are ready for correlation analysis and scaling before feeding them into a Regression model.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(10, 6))
sns.histplot(df['Close'], kde=True, color='blue')
plt.title('Distribution of Closing Price')
plt.xlabel('Closing Price')
plt.ylabel('Frequency')
plt.axvline(df['Close'].mean(), color='red', linestyle='--', label=f"Mean: {df['Close'].mean():.2f}")
plt.axvline(df['Close'].median(), color='green', linestyle='-', label=f"Median: {df['Close'].median():.2f}")
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a Histogram with a Kernel Density Estimate (KDE) to visualize the distribution of our target variable (Close). This helps in identifying the spread of the data, its skewness, and the presence of outliers.

##### 2. What is/are the insight(s) found from the chart?

The distribution is positively skewed (right-skewed). The mean is significantly higher than the median, indicating that for most months, the stock price stayed low, while a few high-value months (the "glory days" of YES Bank) pulled the average upward.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing that the data is skewed suggests that we should apply a Log Transformation before modeling to normalize the distribution. This leads to better regression performance and more accurate price predictions for the client.
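
As a rough check of the log-transformation idea, the sketch below compares skewness before and after the transform. It uses synthetic, right-skewed numbers (`close_demo` is a made-up stand-in, not the YES Bank data) and assumes `np.log1p` as the transform; with the real dataset, `df['Close']` would take its place.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for df['Close'] (assumption: NOT the real data)
rng = np.random.default_rng(42)
close_demo = pd.Series(rng.lognormal(mean=4.0, sigma=0.9, size=200), name="Close")

# log1p computes log(1 + x), which stays well-behaved for prices near zero
log_close_demo = np.log1p(close_demo)

# pandas' skew() is roughly 0 for a symmetric distribution
skew_raw = close_demo.skew()
skew_log = log_close_demo.skew()
print(f"Skewness of raw prices: {skew_raw:.2f}")
print(f"Skewness after log1p:   {skew_log:.2f}")
```

A skewness much closer to zero after the transform would support using the log-scaled target for modeling.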

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(8, 5))
sns.boxplot(x=df['Close'], color='cyan')
plt.title('Box Plot of Closing Price')
plt.xlabel('Closing Price')
plt.grid(linestyle='--', alpha=0.5)
plt.show()

##### 1. Why did you pick the specific chart?

The Box Plot is the standard tool for identifying outliers and understanding the quartiles of the dataset.

##### 2. What is/are the insight(s) found from the chart?

The chart shows several data points beyond the upper whisker. These are the high prices from the 2017-2018 period. In a standard dataset, these might be treated as errors, but here they represent a historical reality of the stock's peak.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifying these "outliers" helps us realize that a simple linear model might struggle. We need to decide whether to treat these as extreme values or use a model (like Random Forest or XGBoost) that is robust to outliers to ensure the client gets a realistic prediction.
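
The points a box plot draws beyond its whiskers follow the usual 1.5 × IQR rule, and the same rule can be applied in code to list them. This is a minimal sketch: `iqr_outliers` is a hypothetical helper name and the prices are made up for illustration.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask for points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up closing prices (assumption: illustration only, not the real data)
demo_close = pd.Series([12, 15, 14, 18, 20, 22, 25, 30, 28, 350], name="Close")

mask = iqr_outliers(demo_close)
print(demo_close[mask])  # only the value 350 is flagged
```

On the real data, `iqr_outliers(df['Close'])` would return the months behind the points above the upper whisker, which can then be inspected rather than dropped.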

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(12, 6))
sns.lineplot(x='Year', y='Close', data=df, marker='o', color='red')
plt.title('Average Closing Price Year-on-Year')
plt.grid(True, alpha=0.3)
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** A Line Plot is essential for Bivariate analysis involving time. It allows us to see the relationship between Year (Time) and Close (Price).

##### 2. What is/are the insight(s) found from the chart?

**Answer:** The stock price saw a massive bull run until 2018, followed by a catastrophic collapse. This visually confirms the impact of the 2018 liquidity crisis on the bank's valuation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:** This insight is vital for the business objective. It shows that the stock's behavior changed fundamentally after 2018. We can suggest to the client that separate models or "regime-change" detection might be necessary for accurate future forecasting.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(10, 6))
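sns.histplot(df['Open'], kde=True, color='purple', bins=30)  # kde=True adds the density curve
plt.title('Distribution of Opening Price')
plt.xlabel('Opening Price')
plt.ylabel('Frequency')  # histplot's default statistic is a count, not a density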
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** I used a distribution plot (Histogram + KDE) to see how the opening prices are spread. It helps determine if the "starting" price of the month follows a similar pattern to our target variable (Close).

##### 2. What is/are the insight(s) found from the chart?

**Answer:** Similar to the closing price, the opening price is heavily right-skewed. Most of the data points are clustered between 0 and 100, while the higher prices are infrequent, representing the peak period of the bank.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


**Answer:** Yes. It confirms that the skew is consistent across the price features. Similar distributions alone do not prove a linear relationship between Open and Close, but they are consistent with one, and the regression plot in Chart 7 tests it directly.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(10, 6))
plt.scatter(df['Low'], df['High'], alpha=0.5, color='orange')
plt.plot([df['Low'].min(), df['Low'].max()], [df['Low'].min(), df['Low'].max()], 'k--', lw=2)
plt.title('Low Price vs High Price')
plt.xlabel('Monthly Low')
plt.ylabel('Monthly High')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** A scatter plot is the best way to visualize the relationship between two continuous numerical variables. I added a 45-degree reference line to see how much "room" there is between the monthly low and high.

##### 2. What is/are the insight(s) found from the chart?

**Answer:** There is an extremely strong linear correlation between the monthly Low and High. The points stay very close to the diagonal line, meaning that in any given month, the price range is somewhat predictable relative to its baseline.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:** Yes. It indicates **Multicollinearity**. Including all these features (Open, High, Low) in a linear regression can make the coefficient estimates unstable and hard to interpret, because the features carry almost identical information. Performing feature selection gives the business a simpler and more stable model.
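
One common way to quantify multicollinearity is the Variance Inflation Factor (VIF). The sketch below computes it with scikit-learn on synthetic prices; `vif_table` is a hypothetical helper, and the way `High` and `Low` are generated from `Open` is an assumption made only to mimic strongly correlated columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_table(features: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing column j on the others."""
    vifs = {}
    for col in features.columns:
        others = features.drop(columns=col)
        r2 = LinearRegression().fit(others, features[col]).score(others, features[col])
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")

# Synthetic prices (assumption): High and Low are noisy multiples of Open
rng = np.random.default_rng(0)
open_ = rng.uniform(10, 400, size=200)
demo = pd.DataFrame({
    "Open": open_,
    "High": open_ * 1.08 + rng.normal(0, 3, size=200),
    "Low": open_ * 0.92 + rng.normal(0, 3, size=200),
})

vif = vif_table(demo)
print(vif.round(1))
```

A frequently used rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity; running `vif_table(df[['Open', 'High', 'Low']])` would show how the real features compare.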

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(12, 5))
sns.boxplot(x='Month', y='Close', hue='Month', data=df, palette='Set3', legend=False)
plt.title('Stock Closing Price Distribution by Month')
plt.xlabel('Month (1-12)')
plt.ylabel('Closing Price')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** I chose a grouped Box Plot to check for seasonality. By looking at the distribution of the closing price for each month (Jan–Dec) across all years, we can see if certain months historically perform better.

##### 2. What is/are the insight(s) found from the chart?

**Answer:** The median price across all months remains relatively stable, but the variance (size of the boxes) is quite high. There is no clear "January Effect" or significant monthly seasonal pattern that dictates the stock price independently of the year.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:** Yes. This tells the business that time-based features like "Month" may not be strong predictors compared to price-based features. This helps focus the model-building efforts on price trends rather than calendar cycles.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(10, 6))
sns.regplot(x='Open', y='Close', data=df, scatter_kws={'alpha':0.5}, line_kws={'color':'red'})
plt.title('Relationship between Opening and Closing Price')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** A regplot (Regression Plot) combines a scatter plot with a best-fit line. This helps us see how well the Opening price can predict the Closing price.

##### 2. What is/are the insight(s) found from the chart?

**Answer:** The relationship is almost perfectly linear. As the opening price increases, the closing price follows suit with very little deviation (low variance around the regression line).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:** This is a very positive insight. It suggests that a simple Linear Regression model will likely have high accuracy ($R^2$) for this dataset.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(10, 6))
sns.histplot(np.log10(df['Close']), color='green', alpha=0.2)
plt.title('Distribution of Log-Transformed Closing Price')
plt.xlabel('log10(Closing Price)')  # the transform above uses base-10 logs
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** Since the original data was skewed (Chart 1), I applied a log transformation. This chart checks if the transformation successfully normalized the data.

##### 2. What is/are the insight(s) found from the chart?

**Answer:** The distribution now looks much more like a Normal Distribution (bell curve).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:** Yes. Inference with linear regression assumes normally distributed residuals, and a heavily skewed target makes that assumption harder to satisfy. Training on the log-transformed price should give a more stable and reliable model for the client's investment forecasts, as long as predictions are converted back to the original price scale.
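
Training on a log-scaled target means predictions must be mapped back to prices. One way to handle both steps is scikit-learn's `TransformedTargetRegressor`. The sketch below is illustrative only: the data is synthetic, and the choice of `log1p`/`expm1` and of a single log-scaled `Open` feature are assumptions, not the notebook's final model.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): Close is roughly Open times a noisy factor
rng = np.random.default_rng(1)
X_demo = rng.uniform(10, 400, size=(150, 1))            # stand-in for 'Open'
y_demo = X_demo[:, 0] * rng.lognormal(0, 0.1, size=150)  # stand-in for 'Close'

# The regressor is fitted on log1p(y); predict() applies expm1 automatically
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)

# The feature is log-scaled too, so the relationship is linear in log space
model.fit(np.log1p(X_demo), y_demo)

pred = model.predict(np.log1p(np.array([[50.0], [200.0]])))
print(pred)  # predictions are already back on the original price scale
```

Wrapping the transform this way avoids forgetting the inverse step when reporting predicted prices.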

#### Chart - 9

In [None]:
# Chart - 9 visualization code
df['Price_Range'] = df['High'] - df['Low']

plt.figure(figsize=(12, 6))
sns.lineplot(x='Year', y='Price_Range', data=df, color='orange')
plt.title('Monthly Price Volatility (High - Low) Over Years')
plt.ylabel('Price Spread')
plt.grid(True, alpha=0.3)
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** I created a new feature called Price_Range to measure volatility. This line chart tracks how the gap between the monthly high and low has evolved, which is a key indicator of market uncertainty.

##### 2. What is/are the insight(s) found from the chart?

**Answer:** Volatility was extremely low during the early years but spiked massively between 2018 and 2020. This coincides with the period of financial instability for YES Bank, where the stock price was fluctuating wildly within single months.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:** Yes. It warns the business that during crisis periods, the "spread" increases, making point-predictions (like a single Closing Price) less reliable. We might suggest providing a "prediction interval" rather than just a single number to the client.
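
A simple way to turn a point prediction into a rough interval is to add empirical quantiles of held-out residuals. The sketch below uses synthetic data and assumes the residual spread is constant, which the volatility chart suggests does not hold during crisis periods, so it is a starting point rather than a calibrated method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): Close equals Open plus noise
rng = np.random.default_rng(7)
X = rng.uniform(10, 400, size=(200, 1))
y = X[:, 0] + rng.normal(0, 15, size=200)

# Fit on the first 150 rows, measure errors on the last 50 (held out)
model = LinearRegression().fit(X[:150], y[:150])
residuals = y[150:] - model.predict(X[150:])

# 5th and 95th percentiles of held-out errors give a rough 90% band
lo_q, hi_q = np.quantile(residuals, [0.05, 0.95])

x_new = np.array([[120.0]])
point = model.predict(x_new)[0]
interval = (point + lo_q, point + hi_q)
print(f"Point: {point:.1f}, interval: ({interval[0]:.1f}, {interval[1]:.1f})")
```

For the real stock, the residual quantiles could be estimated separately for calm and volatile periods so that the band widens when the spread does.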

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(14, 7))
plt.plot(df['Date'], df['High'], label='High', alpha=0.7)
plt.plot(df['Date'], df['Low'], label='Low', alpha=0.7)
plt.plot(df['Date'], df['Close'], label='Close', color='black', linewidth=2)
plt.title('Comparison of High, Low, and Closing Prices')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** This is a multivariate time-series plot. It allows us to see how the three main price metrics move together over the entire history of the dataset.

##### 2. What is/are the insight(s) found from the chart?

**Answer:** All three metrics move in almost perfect synchronization. The "Low" line acts as a floor and the "High" line as a ceiling, with the "Close" price typically nestled between them. There are no significant periods where one metric decoupled from the others.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:** It confirms the extreme multicollinearity between features. For the business, this means we can potentially use just one of these features (like Open) to predict Close without losing much information, simplifying the model architecture.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
df['Rolling_Mean'] = df['Close'].rolling(window=12).mean()

plt.figure(figsize=(12, 6))
plt.plot(df['Date'], df['Close'], label='Actual Close', alpha=0.3)
plt.plot(df['Date'], df['Rolling_Mean'], label='12-Month Moving Average', color='red')
plt.title('Closing Price with 12-Month Moving Average')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** I chose a 12-month moving average to smooth out short-term fluctuations and highlight the long-term trend. This is a classic tool in financial EDA.

##### 2. What is/are the insight(s) found from the chart?

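**Answer:** The moving average gives a much clearer picture of the period where the long-term trend turned sharply negative. It filters out the "noise" of monthly volatility to show the underlying value erosion. (A true "death cross" requires a short-term average crossing below a long-term one, so a single 12-month average only shows the trend reversal.)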

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:** Yes. Moving averages can be used as engineered features in our regression model to provide the model with "context" about the previous year's performance, leading to more "Production Grade" predictions.
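
When moving averages are used as model inputs, they should only see past months, otherwise the current month's close leaks into its own predictor. A minimal sketch on a made-up series follows; the column names `Close_Lag1` and `Close_MA3_Lagged` are hypothetical, and a 3-month window is used only to keep the example small.

```python
import pandas as pd

# Tiny made-up monthly series (assumption: illustration only)
demo = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=6, freq="MS"),
    "Close": [10.0, 12.0, 11.0, 15.0, 14.0, 18.0],
})

# shift(1) makes each feature depend only on months before the current row
demo["Close_Lag1"] = demo["Close"].shift(1)
demo["Close_MA3_Lagged"] = demo["Close"].shift(1).rolling(window=3).mean()

print(demo)
# Row 3 (April): Close_MA3_Lagged = (10 + 12 + 11) / 3 = 11.0
```

The first rows are `NaN` because there is not enough history yet; they would need to be dropped before fitting a model.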

#### Chart - 12

In [None]:
# Chart - 12 visualization code
df['Price_Change_Pct'] = df['Close'].pct_change() * 100

plt.figure(figsize=(10, 6))
sns.histplot(df['Price_Change_Pct'].dropna(), kde=True, color='teal')
plt.title('Distribution of Monthly Percentage Price Change')
plt.xlabel('Percentage Change (%)')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** Looking at absolute prices can be misleading due to the large scale difference (from 400 to 10). Percentage change normalizes this and shows the true "growth" or "decay" rate.

##### 2. What is/are the insight(s) found from the chart?

**Answer:** Most monthly changes are centered around 0%, but there are significant "fat tails"—meaning there are several months with extreme positive or negative growth (±20% or more), which is characteristic of a volatile stock.
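
The "fat tails" observation can be backed with two simple numbers: how many months moved by 20% or more, and the excess kurtosis of the monthly changes. The sketch below uses a made-up price series; the 20% threshold mirrors the text and is otherwise an arbitrary choice.

```python
import pandas as pd

# Made-up closing prices (assumption: illustration only)
demo_close = pd.Series([100.0, 102.0, 101.0, 130.0, 128.0, 90.0, 91.0])

# Month-over-month change in percent; the first value is NaN and is dropped
pct = demo_close.pct_change().dropna() * 100

# Count months whose absolute move is 20% or more
extreme = pct[pct.abs() >= 20]
n_extreme = len(extreme)
print(f"Months with a move of 20% or more: {n_extreme} of {len(pct)}")

# pandas reports excess kurtosis: values well above 0 indicate heavy tails
print(f"Excess kurtosis: {pct.kurt():.2f}")
```

Applied to `df['Price_Change_Pct']`, the same two figures give stakeholders a concrete measure of how often extreme months occur.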

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:** It highlights the "Risk" factor. A business objective focused on stability would find this stock unsuitable, as the high frequency of extreme monthly changes makes it a speculative asset rather than a stable investment.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
plt.figure(figsize=(10, 6))
df.groupby('Year')['Close'].median().plot(kind='bar', color='skyblue')
plt.title('Median Closing Price per Year')
plt.ylabel('Median Close Price')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** A bar chart of medians is better than means for this dataset because of the skewness we identified in Chart 1. It provides a more accurate "typical" price for each year.

##### 2. What is/are the insight(s) found from the chart?

**Answer:** The "Golden Era" of YES Bank is clearly visible between 2016 and 2018. The sudden drop in 2019 and 2020 is staggering, with the median price falling to levels not seen since the bank's inception.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:** This simplifies the narrative for stakeholders. It visually proves that the bank's valuation has undergone a total reset, and any business strategy must be based on this "new normal" rather than hoping for a return to 2017 levels.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Chart - 14 visualization code
plt.figure(figsize=(10, 8))
sns.heatmap(df[['Open', 'High', 'Low', 'Close']].corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Price Features')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** A heatmap is the definitive way to visualize the correlation matrix. It allows us to see at a glance how strongly each independent variable relates to the target variable (Close).

##### 2. What is/are the insight(s) found from the chart?

**Answer:** The correlation between all price variables is close to 1.00 (near-perfect correlation). This confirms that Open, High, and Low are very strong linear predictors of Close.

#### Chart - 15 - Pair Plot

In [None]:
# Chart - 15 visualization code
sns.pairplot(df[['Open', 'High', 'Low', 'Close']], diag_kind='kde')
plt.suptitle('Pair Plot of Stock Price Variables', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** The Pair Plot allows us to see both the distribution of each variable and the scatter plots between all combinations in one view. It is the ultimate "Multivariate" summary.

##### 2. What is/are the insight(s) found from the chart?

**Answer:** All scatter plots show a strongly linear relationship. There are no non-linear curves or complex clusters. This reinforces the decision to use Linear Regression as a primary model.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

**Answer:**

Based on the extensive Exploratory Data Analysis of the YES Bank stock, I suggest the following strategy for the client:

**Address Multicollinearity:** The features Open, High, and Low are nearly 100% correlated with each other. For a "Production Grade" model, the client should not use all of them simultaneously in a linear model, because the coefficient estimates become unstable and hard to interpret. Instead, use a single price feature or perform Principal Component Analysis (PCA) to reduce dimensionality.

**Log-Transformation is Key:** Because the stock spent a short period at very high prices before the post-2018 drop, the raw data is heavily skewed. To make accurate predictions, the client should apply a Log Transformation to the target variable (Close). This reduces the skew, stabilizes the variance, and allows a Linear Regression model to capture the trend more effectively.

**Incorporate Time-Lagged Features:** Stock prices depend not just on the month's open, but also on historical trends. I suggest the client incorporate Moving Averages (12-month) and Lagged Features (previous month's close) into the model. This provides the "context" of the stock's decline which a simple snapshot cannot provide.

**Risk Mitigation:** The volatility analysis (Chart 9) shows that the stock's spread has widened significantly since 2018. The business objective should shift from "High Growth" to "Risk Management." I recommend using a Random Forest or XGBoost regressor alongside Linear Regression, as these ensemble methods handle outliers and sudden "regime changes" (like the 2018 crash) much better.
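
The PCA option mentioned under multicollinearity can be sketched as follows. The data is synthetic (High and Low are generated as noisy multiples of Open, which is an assumption), and keeping a single component is an illustrative choice rather than a recommendation tuned on the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic prices (assumption): three strongly correlated columns
rng = np.random.default_rng(3)
open_ = rng.uniform(10, 400, size=200)
demo = pd.DataFrame({
    "Open": open_,
    "High": open_ * 1.08 + rng.normal(0, 3, size=200),
    "Low": open_ * 0.92 + rng.normal(0, 3, size=200),
})

# Standardize first so no single column dominates the components
scaled = StandardScaler().fit_transform(demo)

# Compress the three price columns into one combined "price level" feature
pca = PCA(n_components=1)
pc1 = pca.fit_transform(scaled)

print(f"Variance explained by the first component: {pca.explained_variance_ratio_[0]:.3f}")
```

If the first component explains nearly all the variance on the real data as well, it can replace Open, High, and Low as a single model input.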

# **Conclusion**

The pair plot confirms that Open, High, Low, and Close prices are highly correlated and move together consistently, which is expected in stock market data. This suggests that these variables carry similar information and may lead to multicollinearity in predictive modeling.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***