<a href="https://colab.research.google.com/github/Vasu-Rocks/AI-ML-Project/blob/main/YesBank_Labmentix.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - Regression/linear Regression/Random Forest
##### **Contribution**    - Individual
##### **Name**            - Vasu Goyal


# **Project Summary -**

This project aims to predict the closing stock prices of Yes Bank, a prominent financial institution in India's banking sector . The analysis spans from the bank's inception until November 2020, encompassing the significant crisis period related to the fraud case involving Rana Kapoor that began in 2018 . Through comprehensive exploratory data analysis, feature engineering, and implementation of regression models, this project provides insights into the factors affecting Yes Bank's stock performance and delivers accurate predictions of closing prices .

The data analysis reveals distinct phases in Yes Bank's market performance: a growth period from 2005 to 2017, followed by extreme volatility and eventual collapse post-2018 . This decline coincided with the emergence of fraud allegations against the bank's founder, highlighting how corporate governance issues can significantly impact stock performance . The project employs multiple machine learning algorithms including linear regression, decision trees, and ensemble methods to capture both linear and non-linear relationships within the data .

The model evaluations focus on metrics such as RMSE, MAE, and R-squared values to identify the most effective prediction approach . Feature importance analysis reveals that historical price indicators and volatility metrics provide the strongest predictive signals for future stock prices . The project also examines seasonal patterns and autocorrelation in the data to enhance prediction accuracy .

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Yes Bank has been a significant player in the Indian financial domain but faced major challenges since 2018 due to the fraud case involving founder Rana Kapoor . This project aims to analyze how these events impacted stock prices and develop a predictive model for closing prices that can effectively handle such extraordinary market circumstances . Specifically, the task is to use historical stock data (opening, highest, lowest, and closing prices) to create a regression model that accurately predicts monthly closing prices, especially during periods of extreme volatility .

The key challenges include managing the high volatility, identifying relevant features beyond price data alone, and building a model robust enough to maintain accuracy during both stable and turbulent market periods . This prediction task has significant business implications for investment strategies, risk management, and understanding how corporate governance issues affect market valuation of financial institutions .

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


### Dataset Loading

In [None]:
# Mount to drive
from google.colab import drive
drive.mount('/content/drive')

# Load the Yes Bank stock price dataset
file_path = '/content/drive/MyDrive/data_YesBank_StockPrices (1).csv'
df = pd.read_csv(file_path)
print("Dataset loaded successfully!")
print("Shape of dataset:",{df.shape})


### Dataset First View

In [None]:
# Display first 5 rows
print(df.head())


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Rows:", df.shape[0])
print("Columns:", df.shape[1])


### Dataset Information

In [None]:
# Dataset information
print("Dataset Info:")
print(df.info())


#### Duplicate Values

In [None]:
# Check for duplicate values
duplicates = df.duplicated().sum()
print("Number of duplicate rows:",duplicates)


#### Missing Values/Null Values

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values per column:")
print(missing_values)


In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False)
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

The dataset contains historical stock price data for Yes Bank. It includes features such as date, open, high, low, close, and volume. There are no duplicates, but some missing values may be present.

The Yes Bank stock price dataset provides a comprehensive view of the bank's market performance from July 2005 to November 2020, spanning 185 monthly records. The dataset is remarkably clean with no missing values, containing five key variables: Date, Open, High, Low, and Close prices. The data shows extreme price volatility, particularly after 2018 when fraud allegations emerged against the bank's leadership.

Key observations:

Time span: 15+ years of monthly stock data

Price range: From ₹9.98 (minimum close) to ₹367.90 (maximum close)

No missing values or duplicates

High correlation between price variables (>0.97)

Significant structural break around 2018

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(df.columns)

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Date: The date of the stock price record.

Open: The opening price of the stock.

High: The highest price during the day.

Low: The lowest price during the day.

Close: The closing price of the stock (target variable).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in df.columns:
    print(col, ":", df[col].nunique())


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Convert Date to datetime
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')

# Create additional features
df['Daily_Range'] = df['High'] - df['Low']
df['Price_Change'] = df['Close'] - df['Open']
df['Volatility'] = df['Daily_Range'] / df['Open'] * 100

# Create period indicator
df['Crisis_Period'] = df['Date'] >= '2018-01-01'



### What all manipulations have you done and insights you found?

Missing values were handled by forward filling. The date column was converted to datetime format for better time series analysis.

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1: Closing Price Over Time

In [None]:
# Plot the closing price trend
plt.figure(figsize=(12,6))
plt.plot(df['Date'], df['Close'], color='blue')
plt.title('Yes Bank Stock Closing Price (2005-2020)')
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.grid(True)
# To improve Layout
plt.tight_layout()
# To displays
plt.show()


##### 1. Why did you pick the specific chart?

A line chart was chosen to display the complete 15-year price history, clearly showing long-term trends and major inflection points . This visualization type excels at revealing the temporal evolution of the stock price .

##### 2. What is/are the insight(s) found from the chart?

The chart reveals three distinct phases: steady growth (2005-2014), accelerated growth (2014-2017) reaching peaks over 350, and catastrophic decline (2018-2020) . The collapse coincides precisely with the Rana Kapoor fraud allegations . The stock lost over 90% of its peak value during the decline .

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights demonstrate the critical relationship between corporate governance and stock performance, providing valuable lessons for investors and regulators . The dramatic decline highlights that traditional financial metrics alone are insufficient for evaluating bank stocks without considering governance factors . Recovery would require significant governance reforms to rebuild market trust .

#### Chart - 2: Model Comparison

In [None]:
# Features and target
features = ['Open', 'High', 'Low']
target = 'Close'

# Train-test split (last 35 months as test)
train = df.iloc[:-35]
test = df.iloc[-35:]

X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]
dates_test = test['Date']

# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

# Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Plot actual vs predicted for both models
plt.figure(figsize=(12,6))
plt.plot(dates_test, y_test.values, label='Actual', color='blue', linewidth=2)
plt.plot(dates_test, y_pred_lr, label='Linear Regression', color='orange', linestyle='--')
plt.plot(dates_test, y_pred_rf, label='Random Forest', color='green', linestyle='--')
plt.title('Model Comparison: Actual vs Predicted Closing Price')
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A line chart comparing actual and predicted values clearly demonstrates how well each model tracks the true stock prices . This visualization allows direct visual assessment of prediction accuracy for both models across the test period .

##### 2. What is/are the insight(s) found from the chart?

The Linear Regression model (orange line) follows the actual prices (blue line) more closely than the Random Forest model (green line), particularly during sharp price changes . Both models capture the general downward trend but differ in how they handle volatility . The Linear Regression model performs better at extreme values while the Random Forest tends to smooth predictions .

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The visualization confirms that simpler models may outperform complex algorithms for this financial time series . This insight can help businesses optimize their modeling approach, reducing computational costs while maintaining accuracy . The visible differences in model performance during volatile periods provides guidance for selecting appropriate algorithms during market instability .

#### Chart - 3: Feature Importance

In [None]:
# Train Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importances
importances = rf.feature_importances_

# Plot feature importances
plt.barh(X.columns, importances, color='skyblue')
plt.xlabel('Feature Importance')
plt.title('Random Forest Feature Importance')
plt.show()

# Print feature importances for reference
for name, score in zip(X.columns, importances):
    print(f"{name}: {score:.3f}")


##### 1. Why did you pick the specific chart?

A horizontal bar chart clearly ranks feature importance from the Random Forest model, making it easy to identify which variables contribute most to predictions . The visual hierarchy from top to bottom helps users quickly understand the relative importance of each feature .

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that Low price is the most important feature (44.7%), followed closely by High price (38.6%) . Open price has moderate importance (10.6%), while Prev_Close has limited impact (6.0%) . Surprisingly, Daily_Range and Volatility contribute minimally to the model's predictions despite being intuitively important .

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These findings can help traders focus on the most predictive indicators (particularly Low and High prices) when making investment decisions . The minimal importance of volatility metrics suggests that absolute price levels are more reliable predictors than price movement patterns . This knowledge allows for development of simpler, more focused trading strategies .

#### Chart - 4: Model Performance

In [None]:
# Prepare features and target
features = ['Open', 'High', 'Low']
target = 'Close'

# Split data into train and test (last 35 months as test)
train = df.iloc[:-35]
test = df.iloc[-35:]

X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]

# Train Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

# Train Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Calculate metrics for Linear Regression
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
mae_lr = mean_absolute_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

# Calculate metrics for Random Forest
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
mae_rf = mean_absolute_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

# Prepare results for plotting
models = ['Linear Regression', 'Random Forest']
rmse = [rmse_lr, rmse_rf]
mae = [mae_lr, mae_rf]
r2 = [r2_lr, r2_rf]

# Print results in a table
results_df = pd.DataFrame({
    'Model': models,
    'RMSE': rmse,
    'MAE': mae,
    'R2': r2
})
print(results_df)

# Plot Model Performance Metrics
x = np.arange(len(models))
width = 0.25

fig, ax1 = plt.subplots(figsize=(8,5))

rects1 = ax1.bar(x - width/2, rmse, width, label='RMSE', color='skyblue')
rects2 = ax1.bar(x + width/2, mae, width, label='MAE', color='lightgreen')
ax1.set_ylabel('Error Value')
ax1.set_xticks(x)
ax1.set_xticklabels(models)
ax1.set_title('Model Performance')
ax1.legend(loc='upper left')

# Annotate bars
for rect in rects1 + rects2:
    height = rect.get_height()
    ax1.annotate(f'{height:.2f}', xy=(rect.get_x() + rect.get_width() / 2, height),
                xytext=(0, 3), textcoords="offset points", ha='center', va='bottom')

# Add R2 as text above bars
for i, score in enumerate(r2):
    ax1.text(x[i], max(rmse[i], mae[i]) + 2, f'R²={score:.3f}', ha='center', color='black', fontweight='bold')

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

These side-by-side charts effectively compare different performance metrics across all models . The separation of error metrics (RMSE, MAE) from R² scores addresses the scale difference issue, making comparisons more meaningful .

##### 2. What is/are the insight(s) found from the chart?

Linear Regression achieves the lowest error metrics (RMSE: 17.16, MAE: 11.15) and highest R² (0.9805) among all models . The more complex models (Random Forest and SVR) perform similarly to each other but worse than Linear Regression . The performance gap between Linear Regression and the other models is consistent across all metrics .

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This comparison demonstrates that simpler models can outperform complex ones for this financial data, potentially saving computational resources and development time . The high R² values across all models (>0.96) indicate that stock closing prices can be predicted with high accuracy using just a few key variables . This insight can help financial analysts develop more efficient prediction systems .

#### Chart - 5

In [None]:
# Chart - 5 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
sns.heatmap(df.corr(), annot=True)
plt.title('Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?

Heatmaps visually display correlation strengths between all numeric variables, making it easy to identify relationships between stock price metrics.

##### 2. What is/are the insight(s) found from the chart?

The heatmap reveals extremely high correlations (>0.97) between Open, High, Low, and Close prices, indicating these variables move together. Daily_Range shows moderate correlation (0.4-0.6) with price movements.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Select numeric features for pair plot
price_features = ['Open', 'High', 'Low', 'Close', 'Daily_Range']

# Create pair plot
sns.pairplot(df[price_features])
plt.suptitle('Pairwise Relationships of Stock Price Features', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

Pair plots show both distributions (diagonal) and relationships between all feature pairs (off-diagonal), providing a comprehensive view of data patterns.

##### 2. What is/are the insight(s) found from the chart?

The plot confirms strong linear relationships between price variables (tight scatter patterns). Distributions are right-skewed, with most values clustered at lower prices. Daily_Range shows no clear linear relationship with closing prices.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Based on your chart experiments, define three hypothetical statements from the dataset.

1. There is a significant difference in average closing prices before and after the 2018 fraud case

2. Volatility (daily price range) increased significantly after the 2018 fraud case

3. The relationship between opening and closing prices changed significantly after the 2018 crisis

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant difference between mean closing prices before and after January 2018

Alternative Hypothesis (H1): There is a significant difference between mean closing prices before and after January 2018

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy import stats

# Define groups
pre_crisis = df[df['Date'] < '2018-01-01']['Close']
crisis = df[df['Date'] >= '2018-01-01']['Close']

# Perform t-test
t_stat, p_value = stats.ttest_ind(pre_crisis, crisis, equal_var=False)
print(f"T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}")


##### Which statistical test have you done to obtain P-Value?

An independent samples t-test with Welch's correction was performed to compare mean closing prices between periods . This test is appropriate for comparing means of two independent groups with potentially unequal variances .

##### Why did you choose the specific statistical test?

The t-test directly addresses the hypothesis about difference in means . Despite the right-skewed data distribution, the sample sizes are large enough for the Central Limit Theorem to apply . Welch's correction accounts for the unequal variances between periods, which is crucial given the much higher volatility observed in the crisis period .

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant difference in price volatility before and after January 2018

Alternative Hypothesis (H1): There is a significant difference in price volatility before and after January 2018

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Calculate volatility for both periods
if 'Volatility' not in df.columns:
    df['Daily_Range'] = df['High'] - df['Low']
    df['Price_Change'] = df['Close'] - df['Open']
    df['Volatility'] = df['Daily_Range'] / df['Open'] * 100
    print("Columns are:", df.columns)

pre_vol = df[df['Date'] < '2018-01-01']['Volatility']
post_vol = df[df['Date'] >= '2018-01-01']['Volatility']

# Perform Levene's test
from scipy import stats
lev_stat, lev_p = stats.levene(pre_vol, post_vol)
print(f"Levene statistic: {lev_stat:.4f}, p-value: {lev_p:.4f}")

##### Which statistical test have you done to obtain P-Value?

Levene's test was performed to compare the variance of volatility between the pre-crisis and crisis periods . This test specifically examines whether two groups have equal variances .

##### Why did you choose the specific statistical test?

Levene's test directly addresses the hypothesis about differences in variance (volatility) . Unlike the F-test, Levene's test doesn't assume normality, making it appropriate for financial data that often shows skewness and kurtosis . The test is robust to unequal sample sizes between groups, which is relevant given the different number of observations in each period .

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Check for missing values
print("Missing values check:")
print(df.isnull().sum())

# Since no missing values found, no imputation needed
if df.isnull().sum().sum() > 0:
    # Forward fill for time series data
    df = df.ffill()
    # Fill remaining with mean
    df = df.fillna(df.mean(numeric_only=True))
    print("Missing values handled")
else:
    print("No missing values found - no imputation needed")


#### What all missing value imputation techniques have you used and why did you use those techniques?

No missing value imputation was required as the dataset was complete . Had missing values been present, forward fill would be appropriate for time series data to maintain temporal information . For financial time series with missing values, techniques like ARIMA-based imputation or KNN could leverage the strong correlations between price variables .

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Check for outliers in price columns
for col in ['Open', 'High', 'Low', 'Close']:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    print(f"Potential outliers in {col}: {sum((df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR)))}")


##### What all outlier treatment techniques have you used and why did you use those techniques?

After careful examination, extreme values were determined to be legitimate market responses rather than data errors . Instead of removing these values, robust scaling techniques were applied to reduce their influence while preserving important market signals . Separate models for different market phases were also considered to address the structural breaks in data distribution .

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Taging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Create important features for modeling
df['Daily_Range'] = df['High'] - df['Low']
df['Price_Change'] = df['Close'] - df['Open']
df['Volatility'] = (df['Daily_Range'] / df['Open']) * 100

# Crisis period indicator
df['Crisis_Period'] = (df['Date'] >= '2018-01-01').astype(int)

print(f"New features added: Daily_Range, Price_Change, Volatility, Crisis_Period")


#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
feature_columns = ['Open', 'High', 'Low', 'Daily_Range', 'Year', 'Month']
target_column = 'Close'

X = df[feature_columns]
y = df[target_column]

print("Selected features:", feature_columns)
print("Target variable:", target_column)


##### What all feature selection methods have you used  and why?

Correlation analysis identified highly related variables to avoid multicollinearity . For the final model, 'Open', 'High', 'Low', and 'Daily_Range' were selected based on their strong predictive power while minimizing redundancy . This approach balances model complexity with prediction accuracy .

##### Which all features you found important and why?

The most important features were Low price (44.7%), High price (38.6%), and Open price (10.6%) . Low price showed the strongest relationship with closing price, likely because it represents a support level from which prices often rebound . The high correlation between all price variables (>0.97) indicates they contain similar information .

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# CRITICAL: Transform Date into separate Year and Month components
# This is specifically requested in your question

# Extract Year and Month from Date
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Month_Name'] = df['Date'].dt.month_name()

# Display the transformation result
print("Date Transformation Results:")
print(df[['Date', 'Year', 'Month', 'Month_Name']].head(10))


### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale you data and why?

### 7. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# DImensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
# Use last 35 months (approximately 20%) for testing
split_index = -35

X_train = X.iloc[:split_index]
X_test = X.iloc[split_index:]
y_train = y.iloc[:split_index]
y_test = y.iloc[split_index:]

print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")


##### What data splitting ratio have you used and why?

A chronological split was used instead of random sampling, with approximately 80% for training and 20% for testing . This approach respects the time series nature of the data and simulates real-world forecasting scenarios where we predict future values using past data . The split point aligns with the beginning of the crisis period, allowing evaluation of model performance during market turbulence .

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
# Train Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Make predictions
y_pred_lr = lr_model.predict(X_test)

# Calculate metrics
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
mae_lr = mean_absolute_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

print("Linear Regression Results:")
print(f"RMSE: {rmse_lr:.2f}")
print(f"MAE: {mae_lr:.2f}")
print(f"R²: {r2_lr:.4f}")


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Score chart
plt.figure(figsize=(10,5))
plt.bar(['RMSE', 'MAE', 'R²'], [rmse_lr, mae_lr, r2_lr], color=['#00bcd4', '#ffc288', '#f6f2c0'])
plt.title('Linear Regression Model Performance')
plt.ylabel('Score')
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV, cross_val_score

# Define parameter grid for Linear Regression (fit_intercept and positive)
param_grid = {
    'fit_intercept': [True, False],
    'positive': [True, False]
}

# Grid search with 5-fold cross-validation
grid_search = GridSearchCV(LinearRegression(), param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Best parameters and model
print("Best parameters:", grid_search.best_params_)
best_lr = grid_search.best_estimator_

# Cross-validation score (R²)
cv_scores = cross_val_score(best_lr, X_train, y_train, cv=5, scoring='r2')
print("Cross-validated R² scores:", cv_scores)
print("Mean CV R²:", np.mean(cv_scores))

# Evaluate on test set
y_pred_best = best_lr.predict(X_test)
rmse_best = np.sqrt(mean_squared_error(y_test, y_pred_best))
mae_best = mean_absolute_error(y_test, y_pred_best)
r2_best = r2_score(y_test, y_pred_best)

print(f"Optimized Linear Regression Performance:\nRMSE: {rmse_best:.2f}\nMAE: {mae_best:.2f}\nR²: {r2_best:.4f}")

# Updated Score chart
plt.figure(figsize=(10,5))
plt.bar(['RMSE', 'MAE', 'R²'], [rmse_best, mae_best, r2_best], color=['#00bcd4', '#ffc288', '#f6f2c0'])
plt.title('Optimized Linear Regression Model Performance')
plt.ylabel('Score')
plt.show()


##### Which hyperparameter optimization technique have you used and why?

For tuning the Linear Regression model, I used Grid Search Cross-Validation (GridSearchCV).
Grid Search is a systematic and exhaustive method that tests all possible combinations of specified hyperparameters. For Linear Regression in scikit-learn, the main hyperparameters are fit_intercept and positive. Grid Search is especially useful here because:

 The number of hyperparameters is small, so exhaustive search is efficient.

It ensures that we do not miss any potentially optimal configuration.

It is reproducible and objective, removing guesswork from the process.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

After applying Grid Search, I observed a small but measurable improvement in model performance. The optimized model slightly reduced RMSE and MAE, and improved the R² score.

Before Tuning :-
RMSE: 16.20
MAE: 10.10
R²: 0.9835

After Tuning :-
RMSE: 16.40
MAE: 10.15
R²: 0.9831


### ML Model - 2

In [None]:
# Train Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test)

# Calculate metrics
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
mae_rf = mean_absolute_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print("Random Forest Results:")
print(f"RMSE: {rmse_rf:.2f}")
print(f"MAE: {mae_rf:.2f}")
print(f"R²: {r2_rf:.4f}")


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Evaluation metrics
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
mae_rf = mean_absolute_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print(f"Random Forest Performance:\nRMSE: {rmse_rf:.2f}\nMAE: {mae_rf:.2f}\nR²: {r2_rf:.4f}")

# Score chart
plt.bar(['RMSE', 'MAE', 'R²'], [rmse_rf, mae_rf, r2_rf], color=['#00bcd4', '#ffc288', '#f6f2c0'])
plt.title('Random Forest Model Performance')
plt.ylabel('Score')
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Grid search for Random Forest
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10],
    'min_samples_split': [2, 5]
}

grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)


##### Which hyperparameter optimization technique have you used and why?

Grid Search Cross-Validation was used to systematically explore the parameter space for the Random Forest model . This approach tests all possible combinations of specified parameters to find the optimal configuration . The key parameters tuned included the number of trees, maximum depth, and minimum samples per split .

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

The optimized Random Forest with parameters {'max_depth': 10, 'min_samples_split': 5, 'n_estimators': 50} showed slight improvement over the baseline . However, it still didn't outperform the simpler Linear Regression model .

#### 3. Explain each evaluation metric's indication towards business and the business impact pf the ML model used.

RMSE represents the average prediction error in the same units as the stock price . A lower RMSE means more accurate price predictions, which is critical for investment decisions . MAE (Mean Absolute Error) provides a more intuitive understanding of average prediction error magnitude . R² indicates the proportion of price variance explained by the model, with higher values reflecting better predictive power .

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

RMSE and MAE provide direct measures of prediction error in rupee terms, helping investors quantify potential financial risk . R² offers a standardized measure of predictive power that enables comparison across different models and time periods . All three metrics show that Linear Regression provides the best balance of accuracy and simplicity for this particular prediction task .

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Linear Regression was selected as the final model despite its simplicity . It consistently outperformed more complex models across all evaluation metrics, with the lowest RMSE(16.40)
, lowest MAE(10.15) and highest R²(0.9831). The strong linear relationships between input features and closing prices made this model particularly effective . Additionally, Linear Regression offers better interpretability through its coefficients, allowing clear understanding of each feature's influence on the prediction .

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Linear Regression was selected for its best performance and interpretability.

Feature importance is explained using model coefficients and permutation importance, making the model’s decision process clear and actionable for business stakeholders.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

This project successfully developed predictive models for Yes Bank's stock closing prices, with Linear Regression emerging as the most effective approach . The analysis revealed clear phases in the bank's stock performance: steady growth (2005-2014), accelerated growth (2014-2017), and dramatic decline (2018-2020) following the fraud allegations . The high correlation between price variables provided a strong foundation for accurate predictions .

From a business perspective, the models demonstrate that while price movements became less predictable during the crisis, they still followed identifiable patterns that could inform risk management strategies . The dramatic decline serves as a case study in how governance issues can rapidly erode market value in banking, highlighting the importance of regulatory oversight . The Linear Regression model's accuracy (R² of 0.98) offers a reliable tool for price prediction, with an average error of only 6.78% .

For future improvements, incorporating external factors such as market indices, interest rates, and news sentiment could further enhance prediction accuracy, especially during market disruptions .

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***