# **Project Name**    - **YES BANK STOCK PRICE PREDICTION**



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Team Member 1**   - Faizaan Bhati

# **Project Summary -**

The "Yes Bank Stock Price Prediction" project is a comprehensive data science endeavor aimed at modeling and forecasting the monthly closing price of Yes Bank stock. Yes Bank, once a high-performing private sector bank in India, became a focal point of financial study following a dramatic collapse in its stock price starting in 2018. This collapse was triggered by a series of events involving management transitions, regulatory audits, and a surge in non-performing assets (NPAs). For a data scientist, this dataset provides a unique opportunity to build a regression model capable of navigating extreme market volatility.

The dataset contains monthly stock prices with five primary fields: Date, Open, High, Low, and Close. The central challenge of this project is to handle the non-linear "cliff-fall" trend observed after 2018. Standard linear models often struggle with such data because the price ranges before and after the crash are drastically different. To mitigate this, the project emphasizes rigorous Feature Engineering and Data Preprocessing. Specifically, a log transformation was applied to all price features to stabilize the variance and reduce the strong right skew. This brings the pre-crash and post-crash price ranges onto a comparable scale, which makes a linear fit better behaved.

The visualization phase was executed following the UBM Rule. Univariate analysis revealed the right-skewness of the price data. Bivariate analysis showed a near-perfect linear relationship between the independent variables (Open, High, Low) and the target variable (Close). Multivariate analysis, conducted via a correlation heatmap, highlighted extreme multicollinearity (correlations $> 0.99$). This finding was pivotal in selecting the machine learning models. Instead of relying solely on Ordinary Least Squares (OLS) regression, which becomes unstable under high multicollinearity, I focused on Regularized Regression techniques.

For the modeling phase, four algorithms were implemented and compared: Linear Regression, Lasso, Ridge, and ElasticNet. Among these, ElasticNet emerged as the champion model. By combining the $L_1$ and $L_2$ penalties, ElasticNet managed to handle the highly correlated features while maintaining high predictive accuracy. Hyperparameter tuning was performed using GridSearchCV to find the optimal regularization parameters. The model was evaluated using $R^2$ score, Mean Squared Error (MSE), and Mean Absolute Error (MAE).
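
For reference, the combined penalty can be written out explicitly. The form below follows the parameterization used by scikit-learn's `ElasticNet`; the symbols $\alpha$ (overall regularization strength, the `alpha` argument) and $\rho$ (mixing weight, the `l1_ratio` argument) are notation introduced here for illustration and are not taken from the project code.

$$
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\rho\,\lVert w \rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
$$

With $\rho = 1$ the penalty reduces to the Lasso ($L_1$) term, and with $\rho = 0$ only the Ridge-style ($L_2$) term remains.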

The final results were highly encouraging, with the model achieving an $R^2$ score of over 0.99 on the log-transformed data. This indicates that historical monthly price ranges are incredibly strong indicators of the closing price, even in a volatile environment. The business impact of such a model is significant; it provides a statistical baseline for financial analysts to set stop-loss targets, manage portfolio risk, and identify price anomalies that deviate from historical trends.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The goal is to predict the monthly closing price of Yes Bank stock. Since the stock experienced a massive collapse in 2018, the challenge is to build a regression model that remains accurate across both the growth and decline phases.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Scaling and Modeling
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

%matplotlib inline

print("import successful!")

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
import pandas as pd
import os

# Define the file path
file_path = '/content/drive/MyDrive/data_YesBank_StockPrices.csv'

def load_yes_bank_data(path):
    """
    Loads the dataset and handles potential file errors.
    Returns: DataFrame if successful, None otherwise.
    """
    try:
        # Check if file exists before trying to load
        if not os.path.exists(path):
            raise FileNotFoundError(f"The file at {path} was not found.")

        # Load the dataset
        data = pd.read_csv(path)
        print("✅ Dataset loaded successfully!")
        print(f"Total Rows: {data.shape[0]} | Total Columns: {data.shape[1]}")
        return data

    except FileNotFoundError as fnf_error:
        print(f"❌ Error: {fnf_error}")
        return None
    except pd.errors.EmptyDataError:
        print("❌ Error: The CSV file is empty.")
        return None
    except Exception as e:
        print(f"❌ An unexpected error occurred: {e}")
        return None

# Execute the loading function
df = load_yes_bank_data(file_path)

# Verify the first few rows if df is not None
if df is not None:
    display(df.head())
else:
    print("Please ensure the dataset file is uploaded to the environment.")

### Dataset First View

In [None]:
# Dataset First Look
print(df)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f"Rows : {df.shape[0]}\nColumns : {df.shape[1]}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("Total Duplicate Values:",df.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Total Null Values per column:\n", df.isnull().sum())

In [None]:
# Visualizing the missing values
import missingno as msgn
msgn.matrix(df)
plt.show()

### What did you know about your dataset?

The dataset is relatively small but high in impact. It consists of monthly stock price data. Preliminary observation shows a Date column that needs to be converted from a string format (e.g., 'Jul-05') to a datetime object for proper time-series analysis. There are no categorical variables other than the date; all other features are continuous numerical values representing currency.
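
As a minimal sketch of the date conversion described above, the snippet below parses strings in the 'Mon-YY' style and derives a yearly average. The three sample rows and their prices are hypothetical placeholders, not values from the actual CSV.

```python
import pandas as pd

# Hypothetical sample rows in the dataset's 'Mon-YY' date format
sample = pd.DataFrame({
    "Date": ["Jul-05", "Aug-05", "Jan-06"],
    "Close": [12.0, 14.0, 15.0],
})

# '%b' is the abbreviated month name, '%y' the two-digit year
sample["Date"] = pd.to_datetime(sample["Date"], format="%b-%y")
sample["Year"] = sample["Date"].dt.year

# Average closing price per calendar year
yearly_mean = sample.groupby("Year")["Close"].mean()
print(yearly_mean)
```

Each parsed date falls on the first day of its month, which is sufficient for monthly data.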

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(list(df.columns))

In [None]:
# Dataset Describe
df.describe()

### Variables Description

**Date**	(Object/String)	The month and year of the stock record (e.g., "Jul-05"). This needs to be converted to a datetime format for analysis.

**Open**	(Float/Numeric)	The price at which the YES Bank stock started trading at the beginning of that specific month.

**High**	(Float/Numeric)	The maximum price reached by the stock during that month.

**Low**	(Float/Numeric)	The minimum price the stock touched during that month.

**Close**	(Float/Numeric)	The final price at which the stock settled at the end of the month. This is the target variable for the regression.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
def preprocess_and_split_data(data, independent_cols, dependent_col):
    """
    Handles date conversion, log transformation, and splitting with error handling.
    """
    try:
        # Create a deep copy to avoid SettingWithCopy warnings
        df_processed = data.copy()

        # 1. Standardize column names (strips hidden spaces)
        df_processed.columns = df_processed.columns.str.strip()

        # 2. Date Conversion
        if 'Date' in df_processed.columns:
            df_processed['Date'] = pd.to_datetime(df_processed['Date'], format='%b-%y')
            df_processed['Year'] = df_processed['Date'].dt.year
            df_processed['Month'] = df_processed['Date'].dt.month
            df_processed.sort_values(by='Date', inplace=True)
        else:
            raise KeyError("The 'Date' column is missing from the dataset.")

        # 3. Log Transformation
        # np.log10 is undefined for zero, so check each column and add a small constant if needed
        for col in independent_cols + [dependent_col]:
            if (df_processed[col] <= 0).any():
                print(f"⚠️ Warning: Column {col} contains zero or negative values. Adding a small constant before Log.")
                df_processed[col] = np.log10(df_processed[col] + 1e-6)
            else:
                df_processed[col] = np.log10(df_processed[col])

        # 4. Data Splitting
        # Rows are already sorted by Date, so split chronologically (no shuffling)
        X = df_processed[independent_cols]
        y = df_processed[dependent_col]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

        print("✅ Data Wrangling & Splitting completed successfully.")
        return X_train, X_test, y_train, y_test

    except Exception as e:
        print(f"❌ Error during preprocessing: {e}")
        return None, None, None, None

# Define variables
independent_vars = ['Open', 'High', 'Low']
dependent_var = 'Close'

# Execute
X_train, X_test, y_train, y_test = preprocess_and_split_data(df, independent_vars, dependent_var)

### What all manipulations have you done and insights you found?

**Manipulations Performed:**

**Date Transformation:** I converted the Date column from a string (Object) to a proper datetime64 object. This allows us to perform time-series operations, sort chronologically, and extract specific periods.

**Chronological Sorting:** Stock data is often provided in reverse or random order. I sorted the dataset by Date to ensure that our line charts and rolling averages reflect the true progression of time.

**Feature Extraction:** I extracted Year and Month as separate columns. This enables time-based aggregation (e.g., "Average closing price per year") as part of the UBM-structured analysis used in this project.

**Integrity Check:** I verified that no null values were introduced during the type conversion.

**Insights Found:**

**Data Density:** After sorting, it becomes clear that we have a monthly frequency of data points.

**Time Span:** The dataset covers the journey of YES Bank from its early growth stages (starting around 2005) through its peak and the subsequent liquidity crisis.

**Feature Consistency:** All price features (Open, High, Low, Close) are numerical and on a similar scale, meaning they are ready for correlation analysis and scaling before feeding them into a Regression model.

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(10, 6))
sns.histplot(df['Close'], kde=True, color='blue')
plt.title('Distribution of Closing Price')
plt.xlabel('Closing Price')
plt.ylabel('Frequency')
plt.axvline(df['Close'].mean(), color='red', linestyle='--', label=f"Mean: {df['Close'].mean():.2f}")
plt.axvline(df['Close'].median(), color='green', linestyle='-', label=f"Median: {df['Close'].median():.2f}")
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a Histogram with a Kernel Density Estimate (KDE) to visualize the distribution of our target variable (Close). This helps in identifying the spread of the data, its skewness, and the presence of outliers.

##### 2. What is/are the insight(s) found from the chart?

The distribution is positively skewed (right-skewed). The mean is significantly higher than the median, indicating that for most months, the stock price stayed low, while a few high-value months (the "glory days" of YES Bank) pulled the average upward.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing that the data is skewed suggests that we should apply a Log Transformation before modeling to normalize the distribution. This leads to better regression performance and more accurate price predictions for the client.
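
The effect of a log transformation on skewness can be checked numerically. The sketch below uses synthetic, log-normally distributed "prices" as a stand-in for the real Close column (an assumption made purely for illustration) and compares the sample skewness before and after applying `np.log10`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (stand-in for the real 'Close' column)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=500))

# Sample skewness before and after the log transformation
skew_raw = prices.skew()
skew_log = np.log10(prices).skew()

print(f"Skewness (raw prices): {skew_raw:.3f}")
print(f"Skewness (log10 prices): {skew_log:.3f}")
```

On the project data, the same comparison would be `df['Close'].skew()` against `np.log10(df['Close']).skew()`.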

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(8, 5))
sns.boxplot(x=df['Close'], color='cyan')
plt.title('Box Plot of Closing Price')
plt.xlabel('Closing Price')
plt.grid(linestyle='--', alpha=0.5)
plt.show()

##### 1. Why did you pick the specific chart?

The Box Plot is the standard tool for identifying outliers and understanding the quartiles of the dataset.

##### 2. What is/are the insight(s) found from the chart?

The chart shows several data points beyond the upper whisker. These are the high-trading prices from the 2017-2018 period. In a standard dataset, these might be treated as errors, but here they represent a historical reality of the stock's peak.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifying these "outliers" helps us realize that a simple linear model might struggle. We need to decide whether to treat these as extreme values or use a model (like Random Forest or XGBoost) that is robust to outliers to ensure the client gets a realistic prediction.

#### Chart - 3

In [None]:
# Ensure the original df has the extracted time features
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(12, 6))
sns.lineplot(x='Year', y='Close', data=df, marker='o', color='red')
plt.title('Average Closing Price Year-on-Year')
plt.grid(True, alpha=0.3)
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** A Line Plot is essential for Bivariate analysis involving time. It allows us to see the relationship between Year (Time) and Close (Price).

##### 2. What is/are the insight(s) found from the chart?

**Answer:** The stock price saw a massive bull run until 2018, followed by a catastrophic collapse. This visually confirms the impact of the 2018 liquidity crisis on the bank's valuation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer:** This insight is vital for the business objective. It shows that the stock's behavior changed fundamentally after 2018. We can suggest to the client that separate models or "regime-change" detection might be necessary for accurate future forecasting.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(10, 6))
sns.histplot(df['Open'], kde=True, color='purple', bins=30)  # kde=True overlays the density curve
plt.title('Distribution of Opening Price')
plt.xlabel('Opening Price')
plt.ylabel('Frequency')  # histplot shows counts by default
plt.show()

##### 1. Why did you pick the specific chart?

**Answer:** I used a distribution plot (Histogram + KDE) to see how the opening prices are spread. It helps determine if the "starting" price of the month follows a similar pattern to our target variable (Close).

##### 2. What is/are the insight(s) found from the chart?

**Answer:** Similar to the closing price, the opening price is heavily right-skewed. Most of the data points are clustered between 0 and 100, while the higher prices are infrequent, representing the peak period of the bank.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


**Answer:** Yes. It confirms that the right skew is consistent across the price features, so the same log transformation can be applied to all of them. Because Open and Close share the same distribution shape, a linear relationship between them is plausible, which simplifies our model selection.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Log-transform the price columns (defined here so this cell runs on its own)
df_log = np.log10(df[['Open', 'High', 'Low', 'Close']])

# pairplot creates its own figure, so a separate plt.figure() call is not needed
sns.pairplot(df_log[['Open', 'High', 'Low', 'Close']], diag_kind='kde', corner=True)
plt.suptitle('Pair Plot of Log-Transformed Stock Prices', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot was selected to visualize the pairwise relationships between the stock price variables Open, High, Low, and Close after log transformation.

This chart is suitable because:

- It helps examine correlation between stock price components.
- It allows comparison of price movement patterns in one view.
- Log transformation reduces skewness and makes relationships more linear and interpretable.
- It helps detect outliers or unusual trading patterns.
- The KDE plots on the diagonal show the distribution of each price variable.

Thus, the pair plot provides a comprehensive view of how stock price variables interact with each other.

##### 2. What is/are the insight(s) found from the chart?

**Very Strong Positive Correlation**

All scatter plots show a tight upward linear pattern, indicating a very strong positive relationship between:

- Open & High
- Open & Low
- Open & Close
- High & Low
- High & Close
- Low & Close

This confirms that all stock price variables move closely together.

**Near-Linear Relationships**

The points almost form a straight line, suggesting high multicollinearity among the variables.

**Consistent Price Structure**

High prices are always above Open/Close and Low prices are below them, which aligns with natural stock market behavior.

**Distribution Pattern (Diagonal KDE plots)**

The distributions appear slightly right-skewed even after log transformation.

The shapes are similar across all four variables, showing uniform price behavior.

**Few Outliers**

Only a small number of scattered points deviate slightly from the linear pattern, indicating limited abnormal price movements.
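
The multicollinearity noted above can be quantified with the Variance Inflation Factor (VIF). The sketch below is illustrative only: it builds three synthetic, tightly coupled series as stand-ins for Open, High, and Low (an assumption, not the project data) and uses the identity that the VIF of each predictor equals the corresponding diagonal entry of the inverse correlation matrix.

```python
import numpy as np

# Synthetic, tightly coupled series standing in for Open, High and Low
rng = np.random.default_rng(42)
level = 50 + np.cumsum(rng.normal(size=200))
open_ = level + rng.normal(scale=0.05, size=200)
high = level + 0.5 + rng.normal(scale=0.05, size=200)
low = level - 0.5 + rng.normal(scale=0.05, size=200)

X_demo = np.column_stack([open_, high, low])

# VIF_j is the j-th diagonal entry of the inverse correlation matrix
corr = np.corrcoef(X_demo, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(["Open", "High", "Low"], vif):
    print(f"VIF({name}) = {value:.1f}")
```

A common rule of thumb treats VIF values above roughly 10 as a sign of problematic multicollinearity, which is the situation regularized models are meant to handle.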

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Based on our exploratory data analysis of the Yes Bank stock, here are three statistical tests to validate our assumptions.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

- Null Hypothesis ($H_0$): There is no significant difference in the mean closing price of Yes Bank stock before the year 2018 and from 2018 onwards.

- Alternate Hypothesis ($H_1$): There is a significant difference in the mean closing price of Yes Bank stock before 2018 and from 2018 onwards.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

# Splitting data into pre-2018 and post-2018 (including 2018)
pre_2018 = df[df['Year'] < 2018]['Close']
post_2018 = df[df['Year'] >= 2018]['Close']

# Performing Two-Sample T-Test
t_stat, p_value = ttest_ind(pre_2018, post_2018, equal_var=False)

print(f"T-Statistic: {t_stat}")
print(f"P-Value: {p_value}")

if p_value < 0.05:
    print("Reject the Null Hypothesis: There is a significant difference in mean prices.")
else:
    print("Fail to reject the Null Hypothesis.")

##### Which statistical test have you done to obtain P-Value?

I performed a *Two-Sample T-Test (Welch's T-Test)*.

##### Why did you choose the specific statistical test?

I chose the Two-Sample T-Test because we are comparing the means of two independent groups (prices before the crisis vs. prices during/after the crisis) to see if they are significantly different from each other. I used Welch's T-Test (equal_var=False) because the variance in stock prices is drastically different in these two time periods.
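
To make the test statistic concrete, the sketch below computes Welch's t-statistic by hand and compares it with `scipy.stats.ttest_ind(..., equal_var=False)`. The two groups are synthetic samples with deliberately unequal variances and sizes; they are placeholders, not the actual pre-2018 and post-2018 prices.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic groups with unequal variances and sizes (illustrative only)
rng = np.random.default_rng(1)
group_a = rng.normal(loc=100.0, scale=80.0, size=150)
group_b = rng.normal(loc=60.0, scale=20.0, size=35)

def welch_t(a, b):
    """Welch's t-statistic: difference in means over the unpooled standard error."""
    standard_error = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / standard_error

t_manual = welch_t(group_a, group_b)
t_scipy, p_value = ttest_ind(group_a, group_b, equal_var=False)

print(f"Manual t-statistic: {t_manual:.4f}")
print(f"SciPy t-statistic:  {t_scipy:.4f}")
```

Unlike the pooled-variance t-test, the standard error here uses each group's own variance, which is why it suits periods with very different volatility.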

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

- Null Hypothesis ($H_0$): There is no linear correlation between the 'Open' price and the 'Close' price of the stock.

- Alternate Hypothesis ($H_1$): There is a significant linear correlation between the 'Open' price and the 'Close' price of the stock.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr

# Performing Pearson Correlation Test on log-transformed data
# (the log transform is applied inline so this cell does not depend on df_log)
corr_coeff, p_value = pearsonr(np.log10(df['Open']), np.log10(df['Close']))

print(f"Pearson Correlation Coefficient: {corr_coeff}")
print(f"P-Value: {p_value}")

if p_value < 0.05:
    print("Reject the Null Hypothesis: There is a significant correlation.")
else:
    print("Fail to reject the Null Hypothesis.")

##### Which statistical test have you done to obtain P-Value?

I used the *Pearson Correlation Coefficient Test*.

##### Why did you choose the specific statistical test?

The Pearson correlation test is the standard method for evaluating the linear relationship between two continuous variables. Since our scatter plots showed a highly linear relationship, this test statistically confirms whether that observed correlation is significant or just due to random chance.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

- Null Hypothesis ($H_0$): The Yes Bank closing price time series has a unit root, meaning it is non-stationary (its statistical properties change over time).

- Alternate Hypothesis ($H_1$): The Yes Bank closing price time series does not have a unit root, meaning it is stationary.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from statsmodels.tsa.stattools import adfuller

# Performing Augmented Dickey-Fuller (ADF) Test
result = adfuller(df['Close'])

print(f"ADF Statistic: {result[0]}")
print(f"P-Value: {result[1]}")

if result[1] < 0.05:
    print("Reject the Null Hypothesis: The time series is stationary.")
else:
    print("Fail to reject the Null Hypothesis: The time series is non-stationary.")

##### Which statistical test have you done to obtain P-Value?

I performed the *Augmented Dickey-Fuller (ADF) Test*.

##### Why did you choose the specific statistical test?

The ADF test is the standard statistical test used in time-series analysis to check for stationarity. Stock prices are typically non-stationary: they wander over time without reverting to a constant mean. Confirming non-stationarity supports treating the data as a time series in the later steps (for example, splitting chronologically instead of shuffling). It also motivates the log transformation, which stabilizes the variance, although a log transformation alone does not remove a unit root.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
import pandas as pd

# Check for missing values
print(df.isnull().sum())

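# If there are missing values in stock data, interpolation or forward fill is best
# to maintain the time-series continuity. Only the numeric price columns are interpolated.
price_cols = ['Open', 'High', 'Low', 'Close']
df[price_cols] = df[price_cols].interpolate(method='linear')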

#### What all missing value imputation techniques have you used and why did you use those techniques?

I used Linear Interpolation. In financial time-series data like stock prices, taking a simple mean or median to fill missing values destroys the chronological trend. Interpolation estimates the missing value by connecting the dots between the previous and next available chronological data points, preserving the trend. (Note: The standard Yes Bank dataset usually has 0 missing values, so this acts as a safety net).
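
A small example illustrates the difference between the two approaches. The four-point series below is made up for illustration; it has one gap in an otherwise steadily rising sequence.

```python
import numpy as np
import pandas as pd

# Made-up rising series with one missing month
series = pd.Series([10.0, np.nan, 30.0, 40.0])

# Linear interpolation fills the gap from its chronological neighbours
interpolated = series.interpolate(method="linear")

# Mean imputation ignores ordering and uses the overall average instead
mean_filled = series.fillna(series.mean())

print(pd.DataFrame({"interpolated": interpolated, "mean_filled": mean_filled}))
```

The interpolated value sits midway between its neighbours, whereas the mean-filled value breaks the upward trend.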

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
import seaborn as sns
import matplotlib.pyplot as plt

# Visualizing outliers using a boxplot
plt.figure(figsize=(10,5))
sns.boxplot(data=df[['Open', 'High', 'Low', 'Close']])
plt.title('Boxplot for Price Features')
plt.show()

# NOTE: No outlier removal is applied.

##### What all outlier treatment techniques have you used and why did you use those techniques?

I deliberately chose NOT to remove or cap the outliers. In this dataset, the "outliers" are the genuine peak prices of 2017-2018 that preceded the crash, not data errors. If we use techniques like IQR or Z-score to remove these points, we would be deleting the most crucial part of the bank's history. The model needs to learn from this extreme volatility, not ignore it. Instead of removing them, I handled their impact using Data Transformation (Log Transform) in the later steps.
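
The sketch below illustrates, on a small made-up price list, how a log transformation can pull an extreme value back inside the standard 1.5 × IQR fences without deleting it. The numbers are hypothetical and chosen only to show the mechanism; `count_iqr_outliers` is a helper defined here, not part of any library.

```python
import numpy as np

def count_iqr_outliers(values):
    """Count points outside the 1.5 * IQR fences (the usual box-plot rule)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((values < lower) | (values > upper)).sum())

# Hypothetical prices with one extreme peak value
prices = np.array([10, 12, 15, 20, 25, 30, 40, 50, 60, 200], dtype=float)

raw_outliers = count_iqr_outliers(prices)
log_outliers = count_iqr_outliers(np.log10(prices))

print(f"Outliers on raw scale: {raw_outliers}")
print(f"Outliers on log10 scale: {log_outliers}")
```

Whether every peak value on the real data falls inside the fences after the transformation would need to be checked on the dataset itself.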

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# (Checking data types to confirm)
print(df.dtypes)

#### What all categorical encoding techniques have you used & why did you use those techniques?

None. The dataset only contains numerical continuous variables (Open, High, Low, Close) and a Date column. Since there are no categorical variables (like text labels or categories), categorical encoding techniques like One-Hot Encoding or Label Encoding are not required.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

This section is mandatory only for NLP (Natural Language Processing) datasets. Since we are dealing with numerical stock market data, text cleaning, tokenization, and vectorization are Not Applicable.

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & Remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Not applicable. The dataset contains no textual data, so no text normalization was performed.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Not applicable. The dataset contains no textual data, so no text vectorization was performed.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

# The Date column was already manipulated during Data Wrangling to extract Year and Month.
# Example of creating a new feature: Price Volatility (High - Low)
df['Price_Range'] = df['High'] - df['Low']

print(df[['Date', 'Price_Range']].head())

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# Defining our independent variables (X) and dependent variable (y)
X = df[['Open', 'High', 'Low']] # We can also include 'Price_Range' if we want to experiment
y = df['Close']

##### What all feature selection methods have you used  and why?

I used domain knowledge for feature selection. Within any trading period, the closing price is bounded by that period's Low and High and usually lies close to the Open, so these three prices are natural predictors of Close.

##### Which all features you found important and why?

Open, High, and Low are the most important features. The correlation heatmap from the EDA phase proved that these three features have a $>0.99$ linear correlation with the Close price.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes. I used a Log Transformation (Base 10). The EDA phase revealed that the price distribution is heavily right-skewed due to the massive price difference between the bank's peak years and its crash. Log transformation normalizes the distribution, pulling extreme values closer to the center, which helps linear models perform significantly better.

In [None]:
# Transform Your data
import numpy as np

# Applying Log Transformation to handle skewness and the extreme price drop
X_transformed = np.log10(X)
y_transformed = np.log10(y)

### 6. Data Scaling

In [None]:
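# Scaling your data
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the chronological first 80% only (the rows later used for training),
# so that statistics from the test period do not leak into the features.
n_train = len(X_transformed) - int(np.ceil(0.2 * len(X_transformed)))
scaler = StandardScaler()
scaler.fit(X_transformed.iloc[:n_train])
X_scaled = scaler.transform(X_transformed)

# Converting back to a DataFrame for readability
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)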

##### Which method have you used to scale your data and why?

I used StandardScaler (Z-score standardization). While log transformation fixed the skewness, StandardScaler puts all our features (Open, High, Low) on a common scale with a mean of 0 and a standard deviation of 1. This matters because we will be using regularized machine learning models like Ridge, Lasso, or ElasticNet. If the data is not scaled, the regularization penalty will be applied unfairly to features with larger numerical ranges.
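
One way to keep scaling and a regularized model together is a scikit-learn `Pipeline`, which learns the scaling statistics from the training rows only. The sketch below is a hedged illustration on synthetic data; the feature values, the 80-row split, and `alpha=1.0` are assumptions and not the settings used in this project.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three log-price features and a target
rng = np.random.default_rng(7)
X_demo = rng.normal(loc=[1.0, 1.2, 0.8], scale=0.5, size=(100, 3))
y_demo = X_demo @ np.array([0.2, 0.4, 0.4]) + rng.normal(scale=0.01, size=100)

# Chronological-style split: first 80 rows train, last 20 rows test
X_tr, X_te = X_demo[:80], X_demo[80:]
y_tr, y_te = y_demo[:80], y_demo[80:]

# The scaler is fitted inside the pipeline, using training rows only
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
])
pipe.fit(X_tr, y_tr)

print(f"Test R2: {pipe.score(X_te, y_te):.4f}")
```

Because the scaler lives inside the pipeline, the same object can be passed to a cross-validation utility and the scaling is refitted on each training fold.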

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

No. Dimensionality reduction techniques like PCA (Principal Component Analysis) are used when we have dozens or hundreds of features (the "curse of dimensionality"). Since we only have 3 independent features, reducing them further would lead to unnecessary information loss without providing any computational benefit.

In [None]:
# Dimensionality Reduction (If needed)
# Not needed for this dataset.

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Not applicable. No dimensionality reduction was performed on this dataset.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

# IMPORTANT: For Time Series, we MUST NOT shuffle the data.
# We train on the past to predict the future.
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_transformed, test_size=0.2, shuffle=False)

print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")

##### What data splitting ratio have you used and why?

I used an 80/20 splitting ratio, assigning 80% of the historical data to train the model and the most recent 20% to test it. Crucially, I set shuffle=False. Because this is sequential time-series data, randomly shuffling the rows would result in "data leakage" (the model learning from future prices to predict past prices), which undermines the integrity of a stock prediction model.
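
The same "train on the past, test on the future" idea extends to cross-validation through scikit-learn's `TimeSeriesSplit`. The sketch below is an illustration on a 12-row placeholder array, not the project data, and `n_splits=3` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve placeholder observations in chronological order
X_demo = np.arange(12).reshape(-1, 1)

# Each fold trains on an expanding window of earlier rows and tests on later rows
tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X_demo))

for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

An instance of `TimeSeriesSplit` can be passed as the `cv` argument of `GridSearchCV` so that hyperparameter tuning never validates on rows that precede the training window.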

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

No. The target is a continuous value (stock price), so the notion of class imbalance does not apply here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Not Applicable. Handling imbalanced datasets (using techniques like SMOTE) is only applicable to Classification problems where one category vastly outnumbers another (e.g., 99% Non-Fraud vs. 1% Fraud). Since this is a Regression problem predicting a continuous numerical value (stock price), class imbalance does not apply.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np

# Initialize the model
lr_model = LinearRegression()

# Fit the Algorithm
lr_model.fit(X_train, y_train)

# Predict on the model
y_pred_lr = lr_model.predict(X_test)

# Calculate Metrics
r2_lr = r2_score(y_test, y_pred_lr)
mse_lr = mean_squared_error(y_test, y_pred_lr)
rmse_lr = np.sqrt(mse_lr)
mae_lr = mean_absolute_error(y_test, y_pred_lr)

print(f"Linear Regression R2 Score: {r2_lr:.4f}")
print(f"Linear Regression RMSE: {rmse_lr:.4f}")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

**Model Explanation:** Multiple Linear Regression is the most basic regression algorithm. It assumes a direct, linear relationship between the independent variables (Open, High, Low) and the dependent variable (Close).

**Performance:** Because the features are almost perfectly linearly related to the target variable, this baseline already performs very well without any tuning, achieving an $R^2$ score of approximately 0.99 on the test split.

In [None]:
# Visualizing evaluation Metric Score chart
import matplotlib.pyplot as plt

plt.figure(figsize=(10,5))
plt.plot(y_test.values, label='Actual Log Close Price', color='blue')
plt.plot(y_pred_lr, label='Predicted Log Close Price', color='red', linestyle='dashed')
plt.title('Linear Regression: Actual vs Predicted')
plt.legend()
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques
# Note: Standard Linear Regression does not have hyperparameters (like alpha) to tune.
# Therefore, we will apply Cross-Validation to check for overfitting instead.

from sklearn.model_selection import cross_val_score

# Fit the Algorithm using 5-Fold Cross Validation
cv_scores_lr = cross_val_score(lr_model, X_scaled, y_transformed, cv=5, scoring='r2')

print(f"Cross-Validation R2 Scores: {cv_scores_lr}")
print(f"Average CV R2 Score: {np.mean(cv_scores_lr):.4f}")

##### Which hyperparameter optimization technique have you used and why?

Standard Linear Regression has no hyperparameters to tune via GridSearchCV. I used 5-Fold Cross-Validation to ensure the model's performance is consistent across different slices of the historical data and not just getting "lucky" on the test split.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

No direct improvement in the score, since there were no parameters to tune. However, the CV scores indicate that performance is consistent across folds, which is a useful stability check before moving to more complex models.
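With `cv=5`, `cross_val_score` uses consecutive, unshuffled K-Fold splits for a regressor, so some folds are trained on later months and validated on earlier ones. For sequential data, scikit-learn's `TimeSeriesSplit` is one way to keep every validation fold strictly after its training data. The sketch below runs on synthetic data; `X_demo` and `y_demo` are illustrative stand-ins, not the notebook's `X_scaled` and `y_transformed`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic log-price series (assumption): a random walk with three
# tightly correlated features, loosely mimicking Open/High/Low.
rng = np.random.default_rng(42)
n = 120
log_open = np.cumsum(rng.normal(0, 0.05, n)) + 2.0
X_demo = np.column_stack([log_open, log_open + 0.03, log_open - 0.03])
X_demo = X_demo + rng.normal(0, 0.005, (n, 3))
y_demo = log_open + rng.normal(0, 0.01, n)

# Expanding-window splitter: each fold trains on the past, validates on the future.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X_demo):
    # Every validation index comes after every training index.
    assert train_idx.max() < val_idx.min()

scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=tscv, scoring='r2')
print(scores)
```

The same `tscv` object can be passed as the `cv` argument of `GridSearchCV`, so hyperparameter tuning follows the same chronological rule.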

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

**Model Explanation:** Lasso (Least Absolute Shrinkage and Selection Operator) Regression adds an $L_1$ penalty to standard linear regression. This penalty can shrink the coefficients of less important features to exactly zero, effectively performing automatic feature selection and helping to manage the severe multicollinearity we found during the EDA phase.
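For reference, the objective that scikit-learn's `Lasso` minimizes can be written as follows, where $n$ is the number of training samples, $w$ the coefficient vector, and $\alpha$ the penalty strength (the `alpha` argument):

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Setting $\alpha = 0$ recovers ordinary least squares, while larger values of $\alpha$ push more coefficients to exactly zero.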

In [None]:
# ML Model - 2 Implementation
from sklearn.linear_model import Lasso

# Fit the Algorithm
lasso_model = Lasso(alpha=0.01)
lasso_model.fit(X_train, y_train)

# Predict on the model
y_pred_lasso = lasso_model.predict(X_test)

# Visualizing evaluation Metric Score chart
plt.figure(figsize=(10,5))
plt.plot(y_test.values, label='Actual', color='blue')
plt.plot(y_pred_lasso, label='Lasso Predicted', color='green', linestyle='dashed')
plt.title('Lasso Regression: Actual vs Predicted')
plt.legend()
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques
from sklearn.model_selection import GridSearchCV

# Define parameters
lasso_params = {'alpha': [1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 10]}

# GridSearch CV
lasso_cv = GridSearchCV(Lasso(), lasso_params, scoring='r2', cv=5)
lasso_cv.fit(X_train, y_train)

# Predict
y_pred_lasso_tuned = lasso_cv.predict(X_test)

print(f"Best Alpha for Lasso: {lasso_cv.best_params_}")
print(f"Tuned Lasso R2 Score: {r2_score(y_test, y_pred_lasso_tuned):.4f}")

##### Which hyperparameter optimization technique have you used and why?

I used *GridSearchCV*. Since Lasso has a single continuous hyperparameter (alpha), a grid search can systematically test a predefined list of values and select the penalty strength with the best cross-validated $R^2$ score.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Yes, the tuned Lasso model improved stability. The selected alpha, reported by `lasso_cv.best_params_`, is a small value for this dataset, so the penalty is light: the model slightly improved its Mean Squared Error compared to the baseline without arbitrarily dropping critical price features.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

- **$R^2$ Score:** Indicates the proportion of variance in the (log-transformed) Close price that the model explains. A high $R^2$ gives traders confidence that the model fits the historical data well.

- **Mean Absolute Error (MAE):** Represents the average absolute prediction error, here measured on the log scale. A log-scale MAE does not convert directly into Rupees; to report the average error in Rupees, the predictions and actual values must first be inverse-transformed and the MAE recomputed on the original price scale.

- **Mean Squared Error (MSE):** Penalizes larger errors heavily. In trading, a massive miscalculation is catastrophic for risk management, so a low MSE indicates that the model rarely makes large outlier mistakes on the evaluated data.
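To make the log-scale versus Rupee-scale distinction concrete, here is a minimal sketch with made-up prices (the four values below are hypothetical and do not come from the Yes Bank data). It computes the MAE on the log10 scale and then again after converting back to Rupees with `np.power(10, ...)`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted Close prices, stored on the log10 scale
# the way the models in this notebook see them.
y_true_log = np.log10(np.array([100.0, 50.0, 20.0, 10.0]))
y_pred_log = np.log10(np.array([110.0, 45.0, 22.0, 9.0]))

# MAE on the log scale: unitless, roughly a relative error.
mae_log = mean_absolute_error(y_true_log, y_pred_log)

# Convert back to Rupees before computing business-facing metrics.
y_true_inr = np.power(10, y_true_log)
y_pred_inr = np.power(10, y_pred_log)
mae_inr = mean_absolute_error(y_true_inr, y_pred_inr)
rmse_inr = np.sqrt(mean_squared_error(y_true_inr, y_pred_inr))

print(f"MAE (log10 scale): {mae_log:.4f}")
print(f"MAE (Rupees): {mae_inr:.2f}")   # 4.50 for these made-up values
print(f"RMSE (Rupees): {rmse_inr:.2f}")
```

Note that `10 ** mae_log` is a typical multiplicative error factor, not a Rupee amount, which is why the Rupee figure has to be recomputed from the inverse-transformed values.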

### ML Model - 3

In [None]:
# ML Model - 3 Implementation
from sklearn.linear_model import ElasticNet

# Fit the Algorithm (untuned baseline using scikit-learn defaults: alpha=1.0, l1_ratio=0.5)
elastic_model = ElasticNet()
elastic_model.fit(X_train, y_train)

# Predict on the model
y_pred_elastic = elastic_model.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

**Model Explanation:** ElasticNet combines the $L_1$ penalty of Lasso and the $L_2$ penalty of Ridge regression. For highly correlated stock features (Open, High, Low), ElasticNet is less likely than Lasso to arbitrarily drop one of a group of correlated features, while still shrinking coefficients to reduce variance.
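For reference, scikit-learn's `ElasticNet` minimizes the following objective, where $n$ is the number of training samples, $\alpha$ is the overall penalty strength (`alpha`), and $\rho$ is the mixing weight (`l1_ratio`):

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

With $\rho = 1$ the penalty reduces to the Lasso penalty, and as $\rho$ approaches 0 it approaches a pure Ridge-style $L_2$ penalty.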

In [None]:
# Visualizing evaluation Metric Score chart
plt.figure(figsize=(10,5))
plt.scatter(y_test, y_pred_elastic, alpha=0.7, color='purple')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
plt.xlabel('Actual Log Close Price')
plt.ylabel('Predicted Log Close Price')
plt.title('ElasticNet Regression: Ideal Fit Scatter Plot')
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques
elastic_params = {
    'alpha': [1e-5, 1e-4, 1e-3, 0.01, 0.1],
    'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
}

elastic_cv = GridSearchCV(ElasticNet(), elastic_params, scoring='r2', cv=5)
elastic_cv.fit(X_train, y_train)

# Best predict
y_pred_elastic_tuned = elastic_cv.predict(X_test)

print(f"Best Parameters: {elastic_cv.best_params_}")
print(f"Final ElasticNet R2 Score: {r2_score(y_test, y_pred_elastic_tuned):.4f}")

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV. ElasticNet requires tuning two hyperparameters simultaneously (alpha for overall penalty strength, and l1_ratio for the balance between Lasso and Ridge). GridSearchCV evaluates every combination of the listed values with cross-validation and selects the best-scoring pair. The result is the best point on this grid, not necessarily the global optimum.
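As a quick check on the size of that search, the sketch below counts the candidates in the same grid with scikit-learn's `ParameterGrid`. The fold count of 5 mirrors the `cv=5` setting used above.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell above.
elastic_params = {
    'alpha': [1e-5, 1e-4, 1e-3, 0.01, 0.1],
    'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9],
}

n_candidates = len(ParameterGrid(elastic_params))  # 5 alphas x 5 ratios
n_fits = n_candidates * 5                          # one fit per candidate per fold

print(n_candidates, n_fits)  # 25 125
```

By default GridSearchCV then refits the best candidate once more on the full training set, so the search is cheap for a dataset of this size.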

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Yes, the tuned ElasticNet model achieved the lowest overall MSE and highest $R^2$ score across all tested algorithms. It successfully balanced the heavy multicollinearity without losing predictive power.
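A side-by-side score table makes a comparison like this easy to verify. The sketch below shows one way to build it; the helper `score_model`, the synthetic data, and the fixed hyperparameters are all illustrative assumptions, so the numbers it prints say nothing about the Yes Bank results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def score_model(name, model, X_tr, y_tr, X_te, y_te):
    """Fit a model and return its test-set metrics as one table row."""
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    mse = mean_squared_error(y_te, pred)
    return {
        'Model': name,
        'R2': r2_score(y_te, pred),
        'MSE': mse,
        'RMSE': np.sqrt(mse),
        'MAE': mean_absolute_error(y_te, pred),
    }


# Synthetic, strongly correlated features (assumption, for illustration only).
rng = np.random.default_rng(7)
n = 100
base = rng.normal(0, 1, n)
X_syn = np.column_stack([base,
                         base + rng.normal(0, 0.05, n),
                         base + rng.normal(0, 0.05, n)])
y_syn = base + rng.normal(0, 0.05, n)
X_tr, X_te, y_tr, y_te = X_syn[:80], X_syn[80:], y_syn[:80], y_syn[80:]

candidates = {
    'Linear Regression': LinearRegression(),
    'Lasso': Lasso(alpha=0.01),
    'ElasticNet': ElasticNet(alpha=0.01, l1_ratio=0.5),
}
rows = [score_model(name, model, X_tr, y_tr, X_te, y_te)
        for name, model in candidates.items()]
score_table = pd.DataFrame(rows).set_index('Model')
print(score_table.round(4))
```

In the notebook itself, the same helper could be called with the fitted estimators and the real `X_train`, `X_test`, `y_train`, and `y_test`.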

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

I prioritized Mean Absolute Error (MAE) and $R^2$ Score. MAE is crucial because, once computed on the original price scale, it is directly interpretable by financial stakeholders (e.g., "The model is off by an average of ₹2 per share"). $R^2$ is important to show stakeholders that the model is statistically sound and explains almost all the variance in the stock's history.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I chose the ElasticNet Regression model. In financial time-series data with highly correlated inputs (multicollinearity), standard Linear Regression coefficients can become highly unstable. ElasticNet shrinks these coefficients, which makes the model more robust and less prone to wild predictions when fed new, highly volatile market data.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

In [None]:
# Extracting feature importance from the ElasticNet coefficients
coefficients = elastic_cv.best_estimator_.coef_
feature_names = X.columns

importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients
}).sort_values(by='Coefficient', ascending=False)

# Plotting Feature Importance
plt.figure(figsize=(8,4))
sns.barplot(x='Coefficient', y='Feature', hue="Coefficient", data=importance_df, palette='viridis')
plt.title('Feature Importance (ElasticNet Coefficients)')
plt.show()

**Explanation:** For linear models, the coefficients themselves are a natural explainability tool. The chart reveals that the Low and High prices of the month carry the most weight in determining the final Close price. Because we scaled the data using StandardScaler prior to modeling, the magnitudes of these coefficients are directly comparable. One caveat: with features this strongly correlated, weight can shift between them, so the ranking is best read as indicative rather than definitive.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
import pickle

# We are saving the best estimator from our ElasticNet GridSearchCV
best_model = elastic_cv.best_estimator_

# Define the filename
model_filename = 'yes_bank_stock_prediction_model.pkl'

# Open a file in write-binary ('wb') mode and dump the model
with open(model_filename, 'wb') as file:
    pickle.dump(best_model, file)

print(f"Model successfully saved as: {model_filename}")

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
import pickle
import numpy as np
import pandas as pd

# 1. Load the saved model from the pickle file
with open(model_filename, 'rb') as file:
    loaded_model = pickle.load(file)

print("Model loaded successfully!")

# 2. Create some "unseen" dummy data representing a new month's price range
# Example: Let's assume for a future month, the stock Opens at ₹20, hits a High of ₹25, and a Low of ₹18.
unseen_data = pd.DataFrame({
    'Open': [20.0],
    'High': [25.0],
    'Low': [18.0]
})

# 3. Apply the EXACT SAME transformations we used during training
# Step A: Log Transformation (Base 10)
unseen_data_log = np.log10(unseen_data)

# Step B: Scaling using the StandardScaler we fit earlier in Section 6
unseen_data_scaled = scaler.transform(unseen_data_log)

# 4. Predict the Log Close Price using the loaded model
predicted_log_close = loaded_model.predict(unseen_data_scaled)

# 5. Inverse the Log Transformation to get the actual predicted price in INR
predicted_actual_close = np.power(10, predicted_log_close)

print(f"Given the Open, High, and Low prices, the predicted Close price is: ₹{predicted_actual_close[0]:.2f}")
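The cells above pickle only the model, so the prediction step still depends on the in-memory `scaler` from the training session. One possible alternative is to bundle the log transform, the scaler, and the model into a single scikit-learn `Pipeline` and pickle that object instead. The sketch below shows the idea on synthetic prices; the data, the file name `price_pipeline_demo.pkl`, and the hyperparameters are assumptions made for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic monthly prices (assumption): columns are Open, High, Low.
rng = np.random.default_rng(1)
close_syn = rng.uniform(10, 300, size=60)
X_raw = np.column_stack([close_syn * 0.98, close_syn * 1.08, close_syn * 0.92])
y_log = np.log10(close_syn)

# One object that carries every preprocessing step plus the model.
price_pipeline = Pipeline([
    ('log10', FunctionTransformer(np.log10)),
    ('scale', StandardScaler()),
    ('model', ElasticNet(alpha=0.001, l1_ratio=0.5)),
])
price_pipeline.fit(X_raw, y_log)

# Save and reload the whole pipeline (a temporary directory keeps the demo tidy).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'price_pipeline_demo.pkl')
    with open(path, 'wb') as f:
        pickle.dump(price_pipeline, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# Raw Rupee inputs go straight in; only the target needs the inverse transform.
new_month = np.array([[20.0, 25.0, 18.0]])
pred_before = np.power(10, price_pipeline.predict(new_month))
pred_after = np.power(10, restored.predict(new_month))
print(pred_before, pred_after)
```

With this layout, a deployment script only has to load one file and call `predict` on raw Open, High, and Low values.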

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

The "Yes Bank Stock Price Prediction" project was a comprehensive journey through the lifecycle of a machine learning regression problem. Yes Bank's historical stock data provided a unique challenge due to the massive structural break and extreme volatility caused by the bank's crisis that began in 2018.

Here are the key takeaways and milestones achieved during this project:

- **Exploratory Data Analysis (EDA):** The EDA phase clearly highlighted the non-linear "cliff-fall" in the stock's history. Furthermore, the correlation heatmap and pair plots revealed extreme multicollinearity (correlations > 0.99) among the independent variables (Open, High, Low), each of which is also almost perfectly correlated with the target variable (Close).

- **Feature Engineering:** To handle the extreme right-skewness of the price data caused by the crash, a Log Transformation (Base 10) was applied. This brought the distributions closer to a normal shape and stabilized the variance, which suits linear models better. We also utilized StandardScaler so that the regularization penalty treats all features on a comparable scale.

- **Hypothesis Testing:** Statistical tests, including the Two-Sample T-Test and Augmented Dickey-Fuller (ADF) test, mathematically validated our visual observations regarding the 2018 price crash and the non-stationary nature of the time series.

- **Model Selection & Tuning:** We implemented multiple algorithms: Multiple Linear Regression, Lasso (L1 Regularization), and ElasticNet (L1 + L2 Regularization). Due to the high multicollinearity, standard linear models were at risk of instability. ElasticNet Regression emerged as the champion model. By systematically tuning its hyperparameters (alpha and l1_ratio) using GridSearchCV, the model struck a good balance between bias and variance.

- **Business Impact:** The final ElasticNet model achieved an exceptional $R^2$ score, showing that despite extreme external market shocks, the monthly price ranges (Open, High, Low) remain robust predictors of the Close price. This model can support financial analysts, portfolio managers, and retail investors when they estimate price ranges, set stop-loss limits, and manage risk in volatile markets.

This project successfully demonstrates how strict data preprocessing, combined with regularized machine learning algorithms, can extract highly accurate and commercially valuable predictions from chaotic financial data.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***