# **Project Name**    - Yes Bank Stock Closing Price Prediction



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Contributor**     - UNNIMAYA K


# **Project Summary -**

This project focuses on predicting the monthly closing price of Yes Bank's stock using regression techniques based on historical data. Stock market forecasting plays a significant role in the financial sector, as it enables investors and analysts to make informed decisions. Yes Bank, a major private sector bank in India, has experienced dramatic fluctuations in its stock price over the years, especially after 2018 due to financial instability and governance issues. These fluctuations make it an ideal case study for stock price analysis and prediction using machine learning.

The dataset used contains monthly records of Yes Bank's stock prices, including attributes such as Open, High, Low, and Close prices for each month. The closing price, which is the stock’s price at the end of the month, is the target variable to be predicted. The goal is to explore the relationship between the available features (Open, High, Low) and the closing price to build accurate predictive models.

To begin, the data was preprocessed by converting the 'Date' column into a datetime format and sorting the dataset chronologically. No missing values or inconsistencies were found, ensuring clean input for model training. Exploratory Data Analysis (EDA) was then performed to understand the structure and behavior of the dataset. Line plots were used to visualize trends in closing prices over time, and a correlation heatmap revealed that the closing price was strongly correlated with the open, high, and low prices.

After data analysis, the features 'Open', 'High', and 'Low' were selected to predict the 'Close' price. The dataset was split into training and testing sets while maintaining time order to avoid data leakage. Two regression models were then implemented: Linear Regression and Random Forest Regressor. Linear Regression, a straightforward and interpretable model, served as a baseline. The Random Forest Regressor, an ensemble model that builds multiple decision trees, was used to enhance prediction accuracy.
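The time-ordered split described above can be sketched as follows. This is a minimal illustration on made-up monthly values, not the notebook's actual split; the helper name `chronological_split` and the 20% test fraction are assumptions.

```python
import pandas as pd

# Hypothetical monthly data standing in for the Yes Bank CSV
demo = pd.DataFrame({
    "Date": pd.date_range("2000-01-01", periods=10, freq="MS"),
    "Open": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0],
    "Close": [11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0],
})

def chronological_split(frame, test_fraction=0.2):
    """Split a frame so that every test row is later than every training row."""
    ordered = frame.sort_values("Date").reset_index(drop=True)
    n_test = int(round(len(ordered) * test_fraction))
    cutoff = len(ordered) - n_test
    # No shuffling: the earliest rows train the model, the latest rows test it
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

train, test = chronological_split(demo)
print(len(train), len(test))
```

Keeping the most recent months for testing means the model is never evaluated on a period that precedes its training data.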

Model performance was evaluated using standard regression metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R² Score. The Random Forest model outperformed Linear Regression in all metrics, showing higher prediction accuracy and robustness, particularly during periods of high volatility. Visualization of actual versus predicted closing prices confirmed the effectiveness of the Random Forest model in tracking stock price movements.
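For reference, the three metrics named above can be computed by hand as in the sketch below. The function name `regression_metrics` and the sample values are illustrative assumptions; the notebook itself imports the equivalent functions from `sklearn.metrics`.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R² for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))                    # average absolute miss
    rmse = np.sqrt(np.mean(errors ** 2))             # penalises large misses more
    ss_res = np.sum(errors ** 2)                     # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Hypothetical actual and predicted closing prices
scores = regression_metrics([100, 110, 120, 130], [102, 108, 123, 129])
print(scores)
```

MAE and RMSE are in the same units as the closing price, while R² is unitless and compares the model against always predicting the mean.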

Hyperparameter tuning with GridSearchCV then improved performance further by selecting the best-scoring parameter combination from the search grid. The final model provided reliable predictions that closely followed the actual closing price trends.

In conclusion, this project demonstrates how machine learning can be effectively used for stock price prediction. The analysis of Yes Bank's stock data using regression techniques not only achieved accurate results but also provided insights into the behavior of the stock market. The Random Forest model, in particular, proved to be a powerful tool for financial forecasting. This approach can be extended to include additional features such as trading volume or technical indicators and can be adapted to predict prices of other stocks. With further refinement, such models can serve as valuable tools for investors, analysts, and financial institutions looking to make data-driven decisions.

---



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Yes Bank, a major private bank in India, has shown significant price fluctuations in recent years, making it a relevant case for analysis. This project aims to predict the monthly closing price of Yes Bank stock using historical data, specifically the Open, High, and Low prices. The objective is to build an accurate regression model that can assist investors and analysts in making informed decisions based on predicted price trends.**



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

# For data manipulation and analysis
import pandas as pd
import numpy as np

# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To ignore warnings (for cleaner output)
import warnings
warnings.filterwarnings('ignore')

# For displaying charts inline in Jupyter notebooks
%matplotlib inline

# For machine learning models and evaluation
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# For datetime parsing and formatting
from datetime import datetime

# Display settings for dataframes
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

print("✅ All necessary libraries have been successfully imported.")


### Dataset Loading

In [None]:
# Load Dataset

file_path = 'data_YesBank_StockPrices.csv'  # Make sure this file is in your working directory

try:
    # Load the CSV file into a DataFrame
    df = pd.read_csv(file_path)

    # Display basic information about the dataset
    print("✅ Dataset loaded successfully!\n")
    print("📌 First 5 rows of the dataset:")
    display(df.head())

    print("\n📌 Dataset Summary:")
    df.info()

except FileNotFoundError:
    print(f"❌ Error: The file '{file_path}' was not found. Please check the path and try again.")
except pd.errors.ParserError:
    print(f"❌ Error: Failed to parse the file '{file_path}'. Please check if it's a valid CSV.")
except Exception as e:
    print(f"❌ An unexpected error occurred: {e}")


### Dataset First View

In [None]:
# Step 3: Dataset First Look

print("🔎 First 5 rows of the dataset:")
display(df.head())  # Shows the first few records

print("\n📏 Dataset Shape (Rows, Columns):", df.shape)  # Number of rows and columns

print("\n🧠 Column Names:")
print(df.columns.tolist())  # List of column names

print("\n🔍 Dataset Info:")
df.info()  # Overview of data types and non-null counts

print("\n📊 Statistical Summary:")
display(df.describe())  # Summary stats for numerical columns

print("\n🧪 Checking for Missing Values:")
print(df.isnull().sum())  # Count of missing values per column

print("\n🔁 Checking for Duplicated Rows:")
print(f"Total Duplicated Rows: {df.duplicated().sum()}")

# Optional: Show the time span covered by the 'Date' column
print("\n📅 Time Range Covered:")
if 'Date' in df.columns:
    try:
        # Parse into a temporary Series so this first-look cell leaves df unchanged
        parsed_dates = pd.to_datetime(df['Date'], format='%b-%y')
        print(f"Date Range: {parsed_dates.min().strftime('%b %Y')} to {parsed_dates.max().strftime('%b %Y')}")
    except Exception as e:
        print("⚠️ Couldn't convert 'Date' column to datetime format:", e)


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("\n📏 Dataset Shape (Rows, Columns):", df.shape)  # Number of rows and columns
print("\n🧠 Column Names:")
print(df.columns.tolist())  # List of column names


### Dataset Information

In [None]:
# Dataset Info
print("\n🔍 Dataset Info:")
df.info()  # Overview of data types and non-null counts
print("\n📊 Statistical Summary:")
display(df.describe())  # Summary stats for numerical columns

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

# Count the total number of duplicated rows
duplicate_count = df.duplicated().sum()

# Display the result
print(f"🔁 Total Duplicate Rows in the Dataset: {duplicate_count}")

# Optional: Display duplicated rows (if any)
if duplicate_count > 0:
    print("\n🧾 Sample of Duplicated Rows:")
    display(df[df.duplicated()])
else:
    print("✅ No duplicate rows found in the dataset.")


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Count missing values per column
missing_values = df.isnull().sum()

# Display the count
print("🧪 Missing Values Count (per column):")
print(missing_values)

# Total missing values
total_missing = missing_values.sum()
print(f"\n🔢 Total Missing Values in Dataset: {total_missing}")

In [None]:
# Visualizing the missing values
# Visualize missing values using a heatmap
plt.figure(figsize=(10, 5))
sns.heatmap(df.isnull(), cmap="viridis", cbar=False, yticklabels=False, linewidths=0.5)
plt.title("Heatmap of Missing Values in Dataset", fontsize=14)  # emoji omitted: default Matplotlib fonts may not render it
plt.xlabel("Columns")
plt.ylabel("Records")
plt.tight_layout()
plt.show()

### What did you know about your dataset?

*   **Structure & Size:** The dataset contains monthly stock price data for Yes Bank, with the columns Date, Open, High, Low, and Close.
*   **Date Format:** The Date column is in month-year format (e.g., Jan-15) and needs to be converted to datetime for time series operations.
*   **No Categorical Columns:** All columns are numerical or time-based, which suits regression modeling.
*   **Duplicate Records:** There are no duplicated rows in the dataset, ensuring data uniqueness and integrity.
*   **Missing Values:** No missing values were found, so no imputation or row removal is required.
*   **Correlated Features:** Initial inspection indicates that Open, High, and Low prices are strongly correlated with Close, making them useful predictors.
*   **Use Case:** The dataset is well suited to building a regression model that predicts the Close price of Yes Bank stock from Open, High, and Low.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
# Display all column names
print("📌 Dataset Columns:")
print(df.columns.tolist())

# Display data types and non-null counts
print("\n🧠 Dataset Info:")
df.info()

In [None]:
# Dataset Describe
# Descriptive statistics for numerical columns
print("\n📊 Statistical Summary of Numerical Features:")
display(df.describe())

# Explanation of each column (Optional Markdown Cell in Notebook)
column_explanation = {
    'Date': 'The month and year of the stock price record.',
    'Open': 'The price at which the stock opened for the month.',
    'High': 'The highest price reached by the stock during the month.',
    'Low': 'The lowest price reached during the month.',
    'Close': 'The final trading price of the stock for the month (Target Variable).'
}

print("\n📘 Feature Descriptions:")
for col, desc in column_explanation.items():
    print(f"- {col}: {desc}")

### Variables Description

Date

Type: Object (converted to datetime)

Description: Represents the month and year of each stock price record. This is used for chronological sorting and time-based analysis.

Role: Used as a temporal index, not a direct feature for the model.

Open

Type: Float

Description: The stock's opening price at the beginning of the month.

Role: Predictor feature – indicates the market's starting sentiment for that month.

High

Type: Float

Description: The highest price reached by the stock during the month.

Role: Predictor feature – useful for assessing the month's price range.

Low

Type: Float

Description: The lowest price the stock dropped to in that month.

Role: Predictor feature – useful for understanding downside risk during the period.

Close

Type: Float

Description: The final trading price of the stock at the end of the month.

Role: Target variable – this is the value we are aiming to predict using regression models.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("🔍 Unique Values Count for Each Column:\n")
for column in df.columns:
    unique_vals = df[column].nunique()
    print(f"- {column}: {unique_vals} unique values")

# Optional: Show the actual unique values for non-numerical or low-cardinality columns
print("\n📅 Sample Unique Values in 'Date' Column:")
if 'Date' in df.columns:
    print(df['Date'].unique()[:10])  # Show only first 10 unique entries for readability

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
try:
    # ✅ Convert 'Date' column to datetime format
    df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')

    # ✅ Sort data chronologically
    df = df.sort_values(by='Date').reset_index(drop=True)

    # ✅ Check for and drop any duplicate rows (if found)
    initial_shape = df.shape
    df.drop_duplicates(inplace=True)
    final_shape = df.shape

    # ✅ Recheck data types and final structure
    print("📅 Date column converted to datetime.")
    print("🔃 Dataset sorted by date.")
    if initial_shape != final_shape:
        print(f"🧹 Removed {initial_shape[0] - final_shape[0]} duplicate rows.")
    else:
        print("✅ No duplicate rows found.")

    print("\n🧾 Final Dataset Info:")
    df.info()

except Exception as e:
    print(f"❌ Error during data wrangling: {e}")

### What all manipulations have you done and insights you found?

Converted the 'Date' column to datetime format:
This allowed proper time-based sorting and enabled time-series visualization and indexing.

Sorted the dataset chronologically:
Ensured that monthly stock records are in the correct time sequence for accurate analysis and model training.

Checked and removed duplicate rows (if any):
Verified data integrity by eliminating any repeated entries that could bias results.

Checked for missing/null values:
Ensured there were no missing values that could affect the modeling process. If any had been present, they would have been imputed or removed.

Explored data types and descriptive statistics:
Helped understand variable distributions, relationships, and potential predictors for the closing price.
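The wrangling steps listed above can be reproduced on a tiny made-up frame, as in the sketch below. The values are hypothetical; only the `%b-%y` format string comes from the notebook. Note that `%y` is a two-digit year, so a value such as `05` is read as 2005.

```python
import pandas as pd

# Hypothetical rows in the same month-year text format as the CSV
raw = pd.DataFrame({
    "Date": ["Nov-20", "Jul-05", "Jan-12", "Jul-05"],
    "Close": [15.0, 12.0, 300.0, 12.0],
})

# 1. Parse the text dates; each month maps to its first day
raw["Date"] = pd.to_datetime(raw["Date"], format="%b-%y")

# 2. Drop exact duplicate rows, 3. sort chronologically, 4. rebuild the index
clean = (
    raw.drop_duplicates()
       .sort_values("Date")
       .reset_index(drop=True)
)
print(clean)
```

Resetting the index after dropping duplicates keeps row labels contiguous, which avoids surprises when slicing by position later.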




## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Chart - 1: Distribution Plot of Closing Price

plt.figure(figsize=(10, 6))
sns.histplot(df['Close'], kde=True, bins=30, color='skyblue')
plt.title('Distribution of Closing Price', fontsize=14)
plt.xlabel('Closing Price')
plt.ylabel('Frequency')
plt.grid(True)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This is a univariate distribution plot of the target variable (Close). It helps us understand:

How the stock’s closing price is distributed over the dataset.

Whether it is normally distributed or skewed.

##### 2. What is/are the insight(s) found from the chart?

The distribution appears to be right-skewed, with a majority of closing prices concentrated on the lower end.

There are some high-value outliers, possibly due to earlier stable market periods.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the distribution of the closing price helps set realistic boundaries for prediction models.

A right-skewed distribution may influence model performance, requiring transformation or robust algorithms like Random Forest.

Knowing the price concentration helps in investment risk assessment, as it shows typical vs rare price points.
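The skew seen in the histogram can also be quantified, and a log transform is one common way to reduce it. The sketch below uses synthetic log-normal prices as a stand-in for `df['Close']`; the distribution parameters are assumptions chosen only to produce a right-skewed sample.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (assumption: not the real Yes Bank data)
rng = np.random.default_rng(0)
close = pd.Series(rng.lognormal(mean=4.0, sigma=0.8, size=200), name="Close")

raw_skew = close.skew()             # > 0 indicates a longer right tail
log_skew = np.log1p(close).skew()   # skewness after a log(1 + x) transform

print(f"Skewness before transform: {raw_skew:.2f}")
print(f"Skewness after log1p:      {log_skew:.2f}")
```

If a transformed target is used for training, predictions need to be mapped back with `np.expm1` before computing price-scale errors.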

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Chart - 2: Line Plot of Closing Price Over Time

plt.figure(figsize=(12, 6))
plt.plot(df['Date'], df['Close'], color='teal', linewidth=2)
plt.title('Trend of Closing Price Over Time', fontsize=14)
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This is a univariate time-series line plot to observe how the closing price changes over time.
It helps visualize the stock's performance and trend across different months.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals volatility in the stock price, with noticeable rises and sharp drops, particularly after 2018.

There are fluctuations with steep declines, which likely reflect major market or company events.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding price trends helps in:

Identifying periods of stability and crisis.

Improving model performance by integrating time-based features.

Making strategic investment decisions based on historical behavior and identifying high-risk timeframes.
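One simple way to locate the steep declines visible in the line plot is to compute month-over-month returns. The sketch below uses made-up prices, and the -20% cut-off is an arbitrary assumption rather than a value taken from the analysis.

```python
import pandas as pd

# Hypothetical closing prices for six consecutive months
demo = pd.DataFrame({
    "Date": pd.date_range("2000-01-01", periods=6, freq="MS"),
    "Close": [100.0, 108.0, 101.0, 60.0, 62.0, 56.0],
})

# Fractional change relative to the previous month (first row is NaN)
demo["Monthly_Return"] = demo["Close"].pct_change()

# Flag months that lost at least 20% of their value
steep_drops = demo[demo["Monthly_Return"] <= -0.20]
print(steep_drops[["Date", "Close", "Monthly_Return"]])
```

The same column could serve as a time-based feature, provided it is lagged so that it only uses information available before the month being predicted.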

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Chart - 3: Scatter Plot - Open vs Close

plt.figure(figsize=(10, 6))
sns.scatterplot(x='Open', y='Close', data=df, color='orange', edgecolor='black', alpha=0.8)
plt.title('Relationship Between Opening Price and Closing Price', fontsize=14)
plt.xlabel('Opening Price')
plt.ylabel('Closing Price')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This bivariate scatter plot is ideal for visualizing the linear relationship between Open and Close prices.
Since these are both numerical variables, a scatter plot is the most effective way to see how one influences the other.

##### 2. What is/are the insight(s) found from the chart?

The scatter points form a tight diagonal cluster, suggesting a strong positive correlation between opening and closing prices.

This means the stock usually closes near its opening price, though there are some deviations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Strong correlation implies Open price is a reliable predictor of the Close price.

Investors can use the opening price as an early reference for where the month is likely to close.

This supports short-term decisions within the month. Because the data is monthly, it does not by itself support intraday trading strategies.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Chart - 4: Scatter Plot - High vs Close

plt.figure(figsize=(10, 6))
sns.scatterplot(x='High', y='Close', data=df, color='green', edgecolor='black', alpha=0.7)
plt.title('Relationship Between High Price and Closing Price', fontsize=14)
plt.xlabel('High Price')
plt.ylabel('Closing Price')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This scatter plot helps explore the association between the monthly high price and the closing price.
It gives a sense of how much the price falls or rises by the end of the month.

##### 2. What is/are the insight(s) found from the chart?

A strong positive correlation is visible; closing prices often lie close to the high values.

However, some points show large differences, indicating volatile trading periods.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

A strong relationship means High price can also serve as a valuable feature for predicting the Close price.

Helps traders understand how often the stock closes near peak value, which affects sell decisions and profit-taking strategies.
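How often the stock closes near its monthly peak can be measured directly. The following sketch uses made-up prices and treats "near" as within 5% of the High, which is an assumed tolerance.

```python
import pandas as pd

# Hypothetical monthly High and Close values
demo = pd.DataFrame({
    "High":  [110.0, 120.0, 130.0, 90.0],
    "Close": [108.0, 100.0, 129.0, 70.0],
})

# Fraction of the High given back by month end
gap_from_high = (demo["High"] - demo["Close"]) / demo["High"]

# Share of months that closed within 5% of the High
near_high = gap_from_high <= 0.05
share_near_high = near_high.mean()
print(f"Months closing within 5% of the High: {share_near_high:.0%}")
```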

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Chart - 5: Scatter Plot - Low vs Close

plt.figure(figsize=(10, 6))
sns.scatterplot(x='Low', y='Close', data=df, color='red', edgecolor='black', alpha=0.7)
plt.title('Relationship Between Low Price and Closing Price', fontsize=14)
plt.xlabel('Low Price')
plt.ylabel('Closing Price')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This chart examines the relationship between the lowest price of the month and the closing price, both numerical features.
It’s important to know how often the stock recovers from its lowest point by the end of the month.



##### 2. What is/are the insight(s) found from the chart?

There is a strong linear trend, indicating that the closing price moves closely with the monthly low.

However, more variance is observed here than with the High or Open comparisons.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The Low price serves as an important indicator of intramonth risk or volatility.

This relationship helps assess downside recovery — how far the price rises from its worst point.

It’s particularly useful for risk-averse investors to assess market dips.
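Downside recovery can be expressed as the position of the Close inside the month's Low-to-High range. This is a sketch on made-up values; the column name `Range_Position` is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly prices; the last month has no range at all
demo = pd.DataFrame({
    "Low":   [90.0, 80.0, 100.0],
    "High":  [110.0, 120.0, 100.0],
    "Close": [100.0, 90.0, 100.0],
})

# Avoid division by zero when High equals Low
month_range = (demo["High"] - demo["Low"]).replace(0.0, np.nan)

# 0 means the month closed at its Low, 1 means it closed at its High
demo["Range_Position"] = (demo["Close"] - demo["Low"]) / month_range
print(demo)
```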

#### Chart - 6

In [None]:

# Chart - 6: Correlation Heatmap

plt.figure(figsize=(8, 6))
corr_matrix = df[['Open', 'High', 'Low', 'Close']].corr()

sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5, fmt=".2f")
plt.title('Correlation Matrix of Stock Price Features', fontsize=14)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap helps visualize pairwise relationships among all numerical variables in a single view.
It's ideal for identifying multicollinearity and spotting the most important predictors for the target variable (Close).

##### 2. What is/are the insight(s) found from the chart?

All features are highly positively correlated with the closing price.

High and Low show particularly strong correlations (>0.95), followed by Open.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Confirms the selected predictors (Open, High, Low) are relevant for modeling the Close price.

Strong correlations suggest possible redundancy, indicating we could:

Try dimensionality reduction, or

Use models that handle multicollinearity well (like Random Forest).

It helps refine the feature engineering and model selection strategy.
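The redundancy among predictors can be quantified with the variance inflation factor, VIF = 1 / (1 - R²), where R² comes from regressing one feature on the others. The sketch below implements this with NumPy on synthetic prices; the data-generating choices are assumptions, and a VIF above roughly 10 is only a common rule of thumb.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return one VIF per column of X, each from an OLS fit with intercept."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs

# Synthetic Open/High/Low series that move together (not the real data)
rng = np.random.default_rng(42)
open_ = rng.uniform(50, 300, size=120)
high = open_ * (1 + rng.uniform(0.0, 0.15, size=120))
low = open_ * (1 - rng.uniform(0.0, 0.15, size=120))

vifs = variance_inflation_factors(np.column_stack([open_, high, low]))
for name, vif in zip(["Open", "High", "Low"], vifs):
    print(f"{name}: VIF = {vif:.1f}")
```

High VIFs mainly affect the stability and interpretation of linear regression coefficients; tree ensembles are less sensitive to them for prediction.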

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Chart - 7: Multiline Time Series Plot of All Price Variables

plt.figure(figsize=(14, 6))
plt.plot(df['Date'], df['Open'], label='Open', linestyle='--', color='blue')
plt.plot(df['Date'], df['High'], label='High', linestyle='-', color='green')
plt.plot(df['Date'], df['Low'], label='Low', linestyle='-', color='red')
plt.plot(df['Date'], df['Close'], label='Close', linestyle='-', color='black')

plt.title('Monthly Stock Price Trends Over Time', fontsize=14)
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend(loc='best')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This multivariate time series plot helps track the movement of all stock price types (Open, High, Low, Close) across time. It’s useful for visual storytelling.



##### 2. What is/are the insight(s) found from the chart?

The prices tend to move together, confirming earlier correlation insights.

We can observe volatility spikes and consistent gaps between High and Low indicating risk levels.

After a certain period (around 2019–2020), there's a sharp decline, reflecting real-world crises.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Gives stakeholders a clear visual understanding of how market conditions evolved over time.

Helps in identifying turning points and periods of high volatility that may need special treatment in modeling.

This historical view supports strategic investment and model calibration for future forecasts.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Chart - 8: Price Range Over Time (Volatility)

# Create a new column for price range
df['Price_Range'] = df['High'] - df['Low']

# Plot the price range over time
plt.figure(figsize=(12, 6))
plt.plot(df['Date'], df['Price_Range'], color='purple', linewidth=2)
plt.title('Monthly Price Range (High - Low)', fontsize=14)
plt.xlabel('Date')
plt.ylabel('Price Range')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Price range is a strong indicator of stock volatility.

It's important for identifying risky months where the stock had large intramonth swings.



##### 2. What is/are the insight(s) found from the chart?

Spikes in price range reflect market instability, especially during crises.

More stable periods have a smaller range, indicating low-risk opportunities.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

High price range months may suggest greater risk but also higher short-term trading potential.

Investors and analysts can use this to determine which months had unusually high volatility, helping to plan entry/exit strategies or adjust forecasting models accordingly.
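To pick out the unusually volatile months in a repeatable way, the range can be scaled by price and compared with a quantile. The sketch below uses made-up values; scaling by Close and the 90th-percentile cut-off are both assumptions.

```python
import pandas as pd

# Hypothetical monthly prices; the fourth month has an unusually wide range
demo = pd.DataFrame({
    "High":  [105.0, 110.0, 108.0, 150.0, 104.0, 60.0],
    "Low":   [95.0, 100.0, 100.0, 90.0, 96.0, 50.0],
    "Close": [100.0, 105.0, 104.0, 100.0, 100.0, 55.0],
})

demo["Price_Range"] = demo["High"] - demo["Low"]

# Scale by Close so that cheap and expensive periods are comparable
demo["Relative_Range"] = demo["Price_Range"] / demo["Close"]

# Flag months whose relative range exceeds the 90th percentile
threshold = demo["Relative_Range"].quantile(0.90)
volatile = demo[demo["Relative_Range"] > threshold]
print(volatile)
```

An absolute range of 10 means something very different at a price of 55 than at 300, which is why the relative version is often easier to compare across years.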



#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Chart - 9: Rolling Average of Closing Price

# Calculate rolling average (e.g., 3-month moving average)
df['Rolling_Close'] = df['Close'].rolling(window=3).mean()

# Plot the original and smoothed close price
plt.figure(figsize=(12, 6))
plt.plot(df['Date'], df['Close'], label='Original Close', color='grey', alpha=0.5, linestyle='--')
plt.plot(df['Date'], df['Rolling_Close'], label='3-Month Rolling Average', color='blue', linewidth=2)

plt.title('Closing Price vs 3-Month Rolling Average', fontsize=14)
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This chart helps smooth out short-term volatility to reveal underlying price trends. Rolling averages are commonly used in stock analysis to guide buy/sell decisions.



##### 2. What is/are the insight(s) found from the chart?

The rolling line clearly shows trends, such as uptrends, downtrends, or consolidation periods.

It filters out monthly spikes or drops that might mislead in raw analysis.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Rolling averages help investors identify market momentum and potential trend reversals.

They’re useful for timing decisions, especially in longer-term investment strategies.

Also helpful for modeling, as you may engineer rolling average features for better temporal learning.
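If a rolling average is engineered as a model feature, it should only use earlier months. A window that includes the current Close would leak the target into the predictors. The sketch below contrasts the two on made-up values; the column name `Rolling_Close_Lagged` is an assumption.

```python
import pandas as pd

demo = pd.DataFrame({"Close": [10.0, 20.0, 30.0, 40.0, 50.0]})

# Chart-style smoothing: the window includes the current month's Close
demo["Rolling_Close"] = demo["Close"].rolling(window=3).mean()

# Feature-style smoothing: shift first so the window covers the previous 3 months only
demo["Rolling_Close_Lagged"] = demo["Close"].shift(1).rolling(window=3).mean()

print(demo)
```

The lagged column starts with three missing values, so those rows would need to be dropped or filled before training.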



#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Chart - 10: Boxplot of Closing Prices by Year

# Extract year from the 'Date' column
df['Year'] = df['Date'].dt.year

# Plot the boxplot
plt.figure(figsize=(12, 6))
sns.boxplot(x='Year', y='Close', data=df, palette='Set2')
plt.title('Distribution of Closing Prices by Year', fontsize=14)
plt.xlabel('Year')
plt.ylabel('Closing Price')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A boxplot is ideal for comparing distributions across groups (in this case, years).
It shows medians, interquartile ranges, and outliers, helping us evaluate volatility per year.



##### 2. What is/are the insight(s) found from the chart?

Some years show tight distributions, indicating price stability.

Others display wide spreads or outliers, reflecting high volatility or abnormal market behavior (like 2020).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps identify risky years or years with growth potential.

Supports year-wise investment decisions and helps in risk-adjusted modeling.

Useful for understanding how external events (like economic downturns) impact the stock annually.
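The year-by-year comparison in the boxplot can be backed with a small summary table. The sketch below uses made-up prices for two years; the `spread` column is an assumed name for max minus min.

```python
import pandas as pd

# Hypothetical closes: a stable year followed by a volatile one
demo = pd.DataFrame({
    "Date": pd.to_datetime([
        "2000-01-01", "2000-06-01", "2000-12-01",
        "2001-01-01", "2001-06-01", "2001-12-01",
    ]),
    "Close": [100.0, 102.0, 104.0, 100.0, 60.0, 20.0],
})

# Median, minimum and maximum closing price per calendar year
yearly = demo.groupby(demo["Date"].dt.year)["Close"].agg(["median", "min", "max"])
yearly["spread"] = yearly["max"] - yearly["min"]
print(yearly)
```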



#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Chart - 11: Open vs Close Price Over Time

plt.figure(figsize=(14, 6))

# Plot Open price
plt.plot(df['Date'], df['Open'], label='Open Price', color='orange', linestyle='--', linewidth=2)

# Plot Close price
plt.plot(df['Date'], df['Close'], label='Close Price', color='blue', linewidth=2)

plt.title('Open vs Close Prices Over Time', fontsize=14)
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

It provides a time-wise comparison between how the stock starts and ends each month.

Helps reveal patterns like gaps, momentum, or market reversals within months.

##### 2. What is/are the insight(s) found from the chart?

In most months, Close price closely follows the Open, showing low intramonth change.

In some months, there is a noticeable gap up/down, signaling sharp market movements or events.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the typical behavior between open and close prices helps traders plan within-month strategies; the monthly data cannot inform intraday decisions.

Large differences may flag high-risk months, requiring tighter modeling and caution in forecasting.



#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Chart - 12: Boxplot of Closing Price by Calendar Quarter

# Extract the calendar quarter from Date (Q1 = Jan-Mar); dt.quarter does not follow a fiscal year
df['Quarter'] = df['Date'].dt.quarter

# Plot boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(x='Quarter', y='Close', data=df, palette='coolwarm')
plt.title('Quarterly Distribution of Closing Prices', fontsize=14)
plt.xlabel('Quarter (Q1–Q4)')
plt.ylabel('Closing Price')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A quarter-wise boxplot helps detect seasonal price behaviors, useful for financial modeling, trading strategies, and investment planning.



##### 2. What is/are the insight(s) found from the chart?

Some quarters show consistently higher/lower prices.

One quarter might show wider spreads, suggesting greater volatility during that time of year.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps detect seasonality trends that may affect predictions.

Useful for investors and fund managers to time entries and exits based on quarterly behavior.

Adds valuable insights when building time-aware regression models.
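The boxplot can be backed by a numeric summary per quarter. This is a minimal sketch on invented values; `demo` and `summary` are hypothetical names, and the numbers are not taken from the dataset.

```python
import pandas as pd

# Hypothetical monthly closes for one year (not the project data)
demo = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=12, freq="MS"),
    "Close": [10, 12, 11, 20, 22, 21, 15, 14, 16, 30, 28, 32],
})
demo["Quarter"] = demo["Date"].dt.quarter

# Median shows the typical level; max - min is a rough spread per quarter
summary = demo.groupby("Quarter")["Close"].agg(["median", "min", "max"])
summary["range"] = summary["max"] - summary["min"]

print(summary)
```

A quarter with a noticeably larger `range` than the others corresponds to a taller box and whiskers in the chart.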

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Chart - 13: Pairplot of Open, High, Low, Close

# Select numeric columns for pairplot
price_cols = ['Open', 'High', 'Low', 'Close']

# Create the pairplot
sns.pairplot(df[price_cols], corner=True, diag_kind='kde', plot_kws={'alpha': 0.6, 'color': 'teal'})

plt.suptitle("Pairplot of Stock Price Features", fontsize=16, y=1.02)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The pairplot helps:

Visualize bivariate relationships between all pairs of numerical variables.

Understand the distribution of each variable.

Detect linear or non-linear trends, outliers, and possible clusters.



##### 2. What is/are the insight(s) found from the chart?

Strong positive linear relationships between all variable pairs, especially High & Close, Low & Close.

Distributions appear slightly skewed, especially for Close and High.

Very few outliers, suggesting a clean dataset for modeling.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Confirms the strength of predictors (Open, High, Low) for estimating Close price.

Useful for selecting relevant features and for flagging multicollinearity among the predictors, which are strongly correlated with one another.

Highlights patterns that could be leveraged in feature engineering or transformation.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Chart - Correlation Heatmap

# Compute the correlation matrix for numeric columns
correlation_matrix = df[['Open', 'High', 'Low', 'Close']].corr()

# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='YlGnBu', fmt=".2f", linewidths=0.5, square=True)
plt.title('Correlation Heatmap of Stock Price Features', fontsize=14)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Heatmaps are ideal for summarizing pairwise correlations visually.

Helps quickly identify highly correlated predictors and potential multicollinearity issues.

##### 2. What is/are the insight(s) found from the chart?

Close price is strongly correlated with:

High (often > 0.99)

Low (close to 0.98)

Open (around 0.97+)

All price variables are highly interdependent, reflecting expected market behavior.
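High pairwise correlations suggest multicollinearity, which can be quantified with the variance inflation factor (VIF). The sketch below computes VIF by hand on synthetic columns that merely stand in for Open, High and Low; the helper `vif` is a hypothetical name, and the rule of thumb that values above roughly 10 are problematic is a convention, not a result from this project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
base = rng.normal(100, 20, size=200)

# Synthetic, strongly related columns (stand-ins for Open, High, Low)
X = np.column_stack([
    base,
    base + rng.normal(5, 1, size=200),
    base - rng.normal(5, 1, size=200),
])

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print([round(v, 1) for v in vifs])
```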

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Chart: Pair Plot of Stock Price Features

# Select numeric columns
numeric_cols = ['Open', 'High', 'Low', 'Close']

# Create pair plot using seaborn
sns.pairplot(df[numeric_cols],
             corner=True,         # Show only lower triangle
             diag_kind='kde',     # Use KDE on the diagonal
             plot_kws={'alpha': 0.6, 'color': 'indigo'})  # Aesthetic tweaks

plt.suptitle("Pair Plot of Open, High, Low, Close", fontsize=16, y=1.02)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Pair plots are ideal for visual storytelling in multivariate analysis.

Combines scatter plots and distribution plots to reveal trends, clusters, and outliers.

Helps assess linear/nonlinear relationships and feature patterns visually.



##### 2. What is/are the insight(s) found from the chart?

All variable pairs show strong positive linear relationships.

The Close price is highly influenced by all three predictors (Open, High, Low).

Distributions are slightly right-skewed, especially for High and Close.



## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.


1.   "The average closing price is significantly different from ₹50."
2.   "There is a significant correlation between the Opening Price and the Closing Price."
3.   "The mean closing price differs significantly between Q1 and Q4."







### Hypothetical Statement - 1


#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

✅ Null Hypothesis (H₀):
The mean closing price is equal to ₹50.
$H_0: \mu = 50$

✅ Alternate Hypothesis (H₁):
The mean closing price is not equal to ₹50.
$H_1: \mu \neq 50$

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy import stats

# Define the population sample
closing_prices = df['Close']

# Perform one-sample t-test against the population mean = 50
t_statistic, p_value = stats.ttest_1samp(closing_prices, popmean=50)

# Print results
print(f"📉 T-Statistic: {t_statistic:.4f}")
print(f"📌 P-Value: {p_value:.4f}")

# Conclusion logic (alpha = 0.05)
alpha = 0.05
if p_value < alpha:
    print("✅ Result: Reject the null hypothesis (mean is significantly different from ₹50)")
else:
    print("❌ Result: Fail to reject the null hypothesis (no significant difference from ₹50)")


##### Which statistical test have you done to obtain P-Value?




 Statistical Test Used: One-Sample t-test

##### Why did you choose the specific statistical test?

We used the One-Sample t-test to determine whether the mean of a single numeric variable (Close price) is significantly different from a given constant value (₹50).
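For reference, the one-sample statistic is $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$ with $n - 1$ degrees of freedom. The sketch below reproduces it by hand on a small invented sample and compares it with `scipy.stats.ttest_1samp`; the sample values are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of closing prices (not the project data)
sample = np.array([48.0, 52.5, 61.0, 45.2, 70.3, 55.1])
mu0 = 50.0

n = sample.size
# Sample standard deviation uses ddof=1
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=mu0)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```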

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

✅ Null Hypothesis (H₀):
There is no significant correlation between the Opening Price and Closing Price.
$H_0: \rho = 0$

✅ Alternate Hypothesis (H₁):
There is a significant correlation between the Opening Price and Closing Price.
$H_1: \rho \neq 0$

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr

# Extract the relevant columns
open_prices = df['Open']
close_prices = df['Close']

# Perform Pearson correlation test
corr_coefficient, p_value = pearsonr(open_prices, close_prices)

# Display the results
print(f"🔗 Pearson Correlation Coefficient: {corr_coefficient:.4f}")
print(f"📌 P-Value: {p_value:.4e}")

# Conclusion based on significance level
alpha = 0.05
if p_value < alpha:
    print("✅ Result: Reject the null hypothesis — there is a significant correlation.")
else:
    print("❌ Result: Fail to reject the null hypothesis — no significant correlation.")


##### Which statistical test have you done to obtain P-Value?

Pearson Correlation Coefficient Test
(scipy.stats.pearsonr())

##### Why did you choose the specific statistical test?

- 🔁 Nature of variables: Both Open and Close are continuous numerical variables.
- 📈 Linear relationship (assumed): We suspect a linear correlation based on earlier scatter plots (Chart 3).
- 📊 Measuring strength of relationship: We want to quantify how strongly Open and Close prices are related.
- 🎯 Statistical significance: The test also provides a p-value to determine if the correlation is statistically significant.
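The p-value reported by `pearsonr` can be recovered from the coefficient itself through $t = r\sqrt{(n-2)/(1-r^2)}$ with $n - 2$ degrees of freedom. A minimal sketch on invented values (the arrays `x` and `y` are hypothetical, not the Open and Close columns):

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations
x = np.array([10.0, 12.0, 15.0, 11.0, 18.0, 20.0])
y = np.array([11.0, 12.5, 14.0, 12.0, 19.0, 19.5])
n = x.size

# Pearson r: covariance divided by the product of the standard deviations
dx, dy = x - x.mean(), y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# Significance test for r based on the t distribution
t_stat = r_manual * np.sqrt((n - 2) / (1 - r_manual ** 2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

r_scipy, p_scipy = stats.pearsonr(x, y)
print(r_manual, p_manual)
```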

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

✅ Null Hypothesis (H₀):
The mean closing price in Q1 is equal to the mean closing price in Q4.
$H_0: \mu_{Q1} = \mu_{Q4}$


✅ Alternate Hypothesis (H₁):
The mean closing price in Q1 is not equal to the mean closing price in Q4.
$H_1: \mu_{Q1} \neq \mu_{Q4}$


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

# Ensure 'Quarter' column exists (you can skip this if already done)
df['Quarter'] = df['Date'].dt.quarter

# Extract Close prices for Q1 and Q4
q1_close = df[df['Quarter'] == 1]['Close']
q4_close = df[df['Quarter'] == 4]['Close']

# Perform independent t-test (equal_var=False handles unequal variance)
t_stat, p_value = ttest_ind(q1_close, q4_close, equal_var=False)

# Print test results
print(f"📉 T-Statistic: {t_stat:.4f}")
print(f"📌 P-Value: {p_value:.4f}")

# Interpret the result
alpha = 0.05
if p_value < alpha:
    print("✅ Result: Reject the null hypothesis — Q1 and Q4 have significantly different mean closing prices.")
else:
    print("❌ Result: Fail to reject the null hypothesis — no significant difference between Q1 and Q4.")


##### Which statistical test have you done to obtain P-Value?

Independent Two-Sample t-test
(scipy.stats.ttest_ind())

##### Why did you choose the specific statistical test?

We are comparing two independent groups — the closing prices in Quarter 1 (Q1) and Quarter 4 (Q4). These groups do not overlap and are unrelated, which satisfies the requirement for using an independent t-test.

The variable we're analyzing, Close price, is a numerical (continuous) variable. This makes it suitable for a t-test, which is designed to compare means of numerical data.

The main objective is to check whether there is a significant difference in the mean closing prices between Q1 and Q4. This kind of test is appropriate when we're interested in detecting changes in average performance across categories.

We used the parameter equal_var=False in the test because the variance (spread) of closing prices may differ between the two quarters. This helps make the test more robust if the assumption of equal variance is not met.
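With `equal_var=False`, SciPy runs Welch's t-test, which uses separate variance estimates and the Welch-Satterthwaite degrees of freedom. The sketch below reproduces both by hand on two invented groups; the arrays `a` and `b` are hypothetical stand-ins for the Q1 and Q4 closes.

```python
import numpy as np
from scipy import stats

# Hypothetical groups with visibly different spreads
a = np.array([20.0, 22.0, 19.0, 24.0, 25.0])
b = np.array([30.0, 45.0, 28.0, 60.0, 33.0, 41.0])

# Per-group variance of the mean (sample variance / group size)
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch statistic and Welch-Satterthwaite degrees of freedom
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)
dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(t_manual, dof, p_manual)
```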

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Step 1: Check for missing values
missing_counts = df.isnull().sum()
print("🔍 Missing Values Before Imputation:\n", missing_counts)

# Step 2: Impute missing values (if any)
# Strategy: Fill numeric columns with median (robust to outliers)
df['Open'] = df['Open'].fillna(df['Open'].median())
df['High'] = df['High'].fillna(df['High'].median())
df['Low'] = df['Low'].fillna(df['Low'].median())
df['Close'] = df['Close'].fillna(df['Close'].median())
# Fill the leading NaNs in Rolling_Close using backward fill
# (fillna(method='bfill') is deprecated in recent pandas; .bfill() is equivalent)
df['Rolling_Close'] = df['Rolling_Close'].bfill()



# Confirm no missing values remain
print("\n✅ Missing Values After Imputation:\n", df.isnull().sum())


#### What all missing value imputation techniques have you used and why did you use those techniques?

We used two different imputation techniques based on the nature and origin of the missing data:

1️⃣ Median Imputation
Applied To:

Open, High, Low, Close columns (numeric, primary stock price features)

Why We Used It:

These are real-world continuous variables, and median is robust to outliers (common in stock price data).

Unlike mean, the median is not distorted by sudden spikes or crashes in stock prices.

Maintains the central tendency of the data while reducing skewness impact.

2️⃣ Backward Fill (bfill)
Applied To:

Rolling_Close column (derived feature created using a rolling average)

Why We Used It:

NaN values in this column result from the moving average operation, which leaves missing values at the beginning.

Using bfill (i.e., filling missing values with the next available valid value) ensures:

The trend remains consistent.

We don't inject artificial values like a fixed mean/median.

Especially suitable for time-series data, where continuity matters more than central tendency.
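The effect of backward fill on a rolling mean can be seen on a five-value series. This is a sketch under assumptions: the window of 3 is chosen for illustration (the window used for `Rolling_Close` is not restated here), and `min_periods=1` is shown only as an alternative that relies on past values alone, since backward fill copies a later value into earlier rows.

```python
import pandas as pd

# Hypothetical closing prices
close = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0])

# A 3-period rolling mean leaves the first two entries as NaN
rolling = close.rolling(window=3).mean()

# Backward fill copies the first valid mean into the leading gaps
filled = rolling.bfill()

# Alternative: average whatever history is available so far
partial = close.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({"rolling": rolling, "bfill": filled, "min_periods=1": partial}))
```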

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Define a function to detect and cap outliers using IQR
def treat_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_limit = Q1 - 1.5 * IQR
    upper_limit = Q3 + 1.5 * IQR

    print(f"\n🚨 {column} Outlier Thresholds:")
    print(f"Lower Limit: {lower_limit:.2f}, Upper Limit: {upper_limit:.2f}")

    # Count of outliers before treatment
    outliers = df[(df[column] < lower_limit) | (df[column] > upper_limit)]
    print(f"🔍 Outliers Detected in {column}: {len(outliers)}")

    # Cap outliers at thresholds
    df[column] = df[column].clip(lower=lower_limit, upper=upper_limit)
    return df

# Apply to relevant numerical columns
columns_to_treat = ['Open', 'High', 'Low', 'Close', 'Price_Range', 'Rolling_Close']

for col in columns_to_treat:
    df = treat_outliers_iqr(df, col)


##### What all outlier treatment techniques have you used and why did you use those techniques?

We used the Interquartile Range (IQR) method with capping to handle outliers in the dataset.

📌 Why IQR-based outlier capping?

IQR method works well when data is not normally distributed (which is typical in stock prices).

It identifies values that are extremely low or high beyond acceptable thresholds.

We cap the outliers instead of removing them to preserve row count and reduce distortion.
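The capping step can be traced on a short series with one extreme value. The numbers below are invented for illustration; the logic mirrors the quantile, IQR and `clip` steps used in `treat_outliers_iqr`.

```python
import pandas as pd

# Hypothetical values with a single extreme observation
s = pd.Series([10, 12, 11, 13, 12, 11, 100], dtype=float)

q1, q3 = s.quantile(0.25), s.quantile(0.75)   # 11.0 and 12.5
iqr = q3 - q1                                 # 1.5
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr # 8.75 and 14.75

# Values outside [lower, upper] are pulled back to the nearest limit
capped = s.clip(lower=lower, upper=upper)

print(capped.tolist())
```

The row count is unchanged and only the value 100 is altered, which is the reason capping was preferred over dropping rows.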

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
from sklearn.preprocessing import LabelEncoder

# Check data types (print is needed because this is not the last line of the cell)
print(df.dtypes)

# Label Encoding for 'Year'
le = LabelEncoder()
df['Year_Encoded'] = le.fit_transform(df['Year'])

# df.drop('Year', axis=1, inplace=True)
# One-Hot Encoding for 'Quarter'
df = pd.get_dummies(df, columns=['Quarter'], prefix='Q', drop_first=True)


#### What all categorical encoding techniques have you used & why did you use those techniques?

Label Encoding (for Year):

Converts ordinal values into integers.

✅ Chosen because Year has a natural order (e.g., 2018 < 2019 < 2020).

One-Hot Encoding (for Quarter):

Creates binary columns for each category.

✅ Chosen because Quarter is a nominal variable with no intrinsic order.

Why?
These encodings turn categorical variables into numeric inputs while preserving their meaning. Passing `drop_first=True` drops one dummy column so that the remaining Quarter dummies are not perfectly collinear.
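A small frame shows what `drop_first=True` produces. The `demo` frame is hypothetical; the call itself is the same `pd.get_dummies` call used above.

```python
import pandas as pd

# Hypothetical quarter labels
demo = pd.DataFrame({"Quarter": [1, 2, 3, 4, 1]})

# Q_1 is dropped and becomes the implicit baseline category
encoded = pd.get_dummies(demo, columns=["Quarter"], prefix="Q", drop_first=True)

print(encoded.columns.tolist())  # ['Q_2', 'Q_3', 'Q_4']
print(encoded)
```

Rows belonging to Q1 have all three dummy columns set to zero (or `False`, depending on the pandas version).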

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Taging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Create new features based on existing ones

# 1. High-Low spread as a percentage of Low (the data is monthly; the column name Daily_Volatility is kept)
df['Daily_Volatility'] = ((df['High'] - df['Low']) / df['Low']) * 100

# 2. Price Momentum (difference between Close and Open)
df['Price_Momentum'] = df['Close'] - df['Open']

# 3. Price Range (difference between High and Low — absolute)
df['Volatility_Range'] = df['High'] - df['Low']

# 4. Deviation from Rolling Average (how far current Close is from rolling mean)
df['Close_MA_Deviation'] = df['Close'] - df['Rolling_Close']


#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

# Define target variable and features
target = 'Close'
features = df.drop(columns=['Date', 'Close'])  # Drop Date and target column

X = features
y = df['Close']



In [None]:
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import matplotlib.pyplot as plt

# Initialize model
model = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model
model.fit(X, y)

# Get feature importances
importances = model.feature_importances_
feature_importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Plot
plt.figure(figsize=(10,6))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'], color='teal')
plt.xlabel("Feature Importance")
plt.title("Random Forest Feature Importance")
plt.gca().invert_yaxis()
plt.show()


In [None]:
# Select top 7 important features (example)
top_features = feature_importance_df['Feature'].head(7).tolist()
X_selected = X[top_features]
top_features


##### What all feature selection methods have you used  and why?

| Method used | Why we used it |
|---|---|
| Random Forest feature importance | It handles non-linear relationships, considers feature interactions, and gives a clear importance ranking. |
| Manual domain filtering (dropped Date, Volatility_Range, etc.) | These either leak information or are redundant with new engineered features. |
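Leakage from an engineered feature is easiest to see with `Price_Momentum`, which is defined as Close minus Open: together with Open it reconstructs Close exactly. The sketch below uses synthetic numbers (not the project data) to show the resulting perfect fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic prices: Close is Open plus noise
open_ = rng.uniform(50, 150, size=50)
close = open_ + rng.normal(0, 10, size=50)

# Same definition as Price_Momentum: it is computed from the target
momentum = close - open_

# Open + momentum equals Close, so the model fits the target exactly
X_leaky = np.column_stack([open_, momentum])
r2_leaky = LinearRegression().fit(X_leaky, close).score(X_leaky, close)

# Using Open alone gives an honest, lower score
X_open = open_.reshape(-1, 1)
r2_open = LinearRegression().fit(X_open, close).score(X_open, close)

print(r2_leaky, r2_open)
```

A score this close to 1 is a warning sign that a feature contains the target, not evidence of a good model.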

##### Which all features you found important and why?

| Feature | Why important |
|---|---|
| Open | Strongly correlated with Close price |
| High, Low | Define the price range within the month |
| Price_Momentum | Shows gain/loss within the month |
| Rolling_Close | Adds trend over time (technical signal) |
| Close_MA_Deviation | Deviation from the rolling average; useful in time-series models |
| Daily_Volatility | Measures risk/volatility, which affects stock behavior |

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?


Yes. Data transformation is needed in this project before feeding the data into machine learning models, for the following reasons:

🚀 Improves convergence speed for gradient-based models (e.g., neural networks or SGD-based regressors). Ordinary least-squares Linear Regression is solved directly and does not need it to converge.

🤖 Distance-based and regularized models are sensitive to feature scale; tree-based models are not.

📉 Features have different scales

We use:

StandardScaler (Z-score standardization), which rescales each feature without changing the shape of its distribution.

$Z = \dfrac{x - \mu}{\sigma}$



It transforms all features to have mean = 0 and std = 1
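The formula can be checked against `StandardScaler` on a tiny array. The values are invented; the point to notice is that `StandardScaler` divides by the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
X = np.array([[10.0, 200.0],
              [12.0, 220.0],
              [14.0, 260.0],
              [20.0, 300.0]])

# Z = (x - mean) / std, column by column, with the population std (ddof=0)
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

z_sklearn = StandardScaler().fit_transform(X)

print(z_sklearn.mean(axis=0), z_sklearn.std(axis=0))
```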

In [None]:
# Transform Your data
from sklearn.preprocessing import StandardScaler

# Initialize scaler
scaler = StandardScaler()

# Apply transformation only to selected features
X_scaled = scaler.fit_transform(X_selected)

# Convert back to DataFrame (optional, for model use and readability)
import pandas as pd
X_scaled_df = pd.DataFrame(X_scaled, columns=X_selected.columns)

# Show a few rows
X_scaled_df.head()


### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the selected features
X_scaled = scaler.fit_transform(X_selected)

# Convert to DataFrame for readability (optional)
import pandas as pd
X_scaled_df = pd.DataFrame(X_scaled, columns=X_selected.columns)

# Preview scaled data
X_scaled_df.head()


##### Which method have you used to scale you data and why?


We used StandardScaler because:

Our dataset contains numerical features with different ranges

We aim to use scale-sensitive models like regression, KNN, or SVM

It prevents models from getting biased toward high-range features
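A common precaution when scaling is to learn the mean and standard deviation from the training rows only and reuse them on the test rows, so that no test information reaches the model. This is a sketch of that pattern on synthetic data, not a description of how the cells above are arranged.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(50, 10, size=(100, 3))   # synthetic feature matrix

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Statistics are estimated on the training rows only
scaler = StandardScaler().fit(X_train)

# The same statistics are then applied to both splits
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.mean(axis=0).round(6), X_test_s.mean(axis=0).round(3))
```

The test-set means are close to, but not exactly, zero, which is expected because the scaler never saw those rows.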

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Dimensionality reduction is not strictly required for this dataset, which has only a handful of features, but it can still be beneficial because it:

🚀 Reduces multicollinearity

🧠 Simplifies model

📊 Improves visualization

In [None]:
# Dimensionality Reduction (If needed)

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Apply PCA to retain 95% variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

# Print number of components selected
print(f"📉 Number of Principal Components Retained: {pca.n_components_}")

# Plot explained variance
plt.figure(figsize=(8,5))
plt.plot(range(1, pca.n_components_ + 1), pca.explained_variance_ratio_.cumsum(), marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('PCA - Explained Variance')
plt.grid(True)
plt.show()


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Principal Component Analysis (PCA)

🚀 Reduces multicollinearity

🧠 Simplifies model

📊 Improves visualization
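When features are as strongly correlated as the price columns, PCA with a 95% variance target tends to keep very few components. The sketch below illustrates this on synthetic columns that only stand in for Open, High and Low.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(100, 20, size=200)

# Three synthetic columns that move almost together
X = np.column_stack([
    base,
    base + rng.normal(0, 1, size=200),
    base - rng.normal(0, 1, size=200),
])

X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components reaching that variance share
pca = PCA(n_components=0.95).fit(X_scaled)

print(pca.n_components_, pca.explained_variance_ratio_)
```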


### 8. Data Splitting

In [None]:
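# Drop engineered features computed from Close so they cannot leak the target;
# errors='ignore' skips any that did not make the top-feature list
X_selected_cleaned = X_selected.drop(
    columns=['Rolling_Close', 'Close_MA_Deviation', 'Price_Momentum'], errors='ignore'
)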


In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

# Use the cleaned (unscaled) features and the original target.
# Note: train_test_split shuffles rows by default; pass shuffle=False
# to keep the split chronological.
X_train, X_test, y_train, y_test = train_test_split(
    X_selected_cleaned, y, test_size=0.2, random_state=42
)


# Print shape to verify
print("📊 Training Set Shape:", X_train.shape)
print("📊 Testing Set Shape:", X_test.shape)


##### What data splitting ratio have you used and why?

80% Train / 20% Test

- Common standard in machine learning.
- Provides sufficient data for model training while keeping enough unseen data for unbiased testing.
- Ensures that the test set represents the overall data distribution.
- Works well for datasets that are not too small (like stock data over several years).
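Because the rows are monthly observations in time order, an alternative to a shuffled split is a chronological one, where every training month precedes every test month. The sketch below shows two standard scikit-learn options on a hypothetical frame `demo`; it is an illustration, not the split used for the reported metrics.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit, train_test_split

# Hypothetical monthly series, already sorted by date
demo = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=20, freq="MS"),
    "Close": np.arange(20, dtype=float),
})

# shuffle=False keeps the last 20% of rows as the test set
train, test = train_test_split(demo, test_size=0.2, shuffle=False)
print(train["Date"].max(), "<", test["Date"].min())

# TimeSeriesSplit gives expanding-window folds for cross-validation
tscv = TimeSeriesSplit(n_splits=3)
folds = [(tr.max(), va.min()) for tr, va in tscv.split(demo)]
print(folds)
```

`TimeSeriesSplit` can be passed as the `cv` argument of `GridSearchCV` or `RandomizedSearchCV` in place of an integer.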

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

No. This dataset is used for stock price prediction, where the target variable (Close price) is continuous (regression), not categorical. Since there is no categorical target (like class labels), imbalance handling is not applicable.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Not applicable. The target is continuous, so no balancing technique was used.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
from sklearn.linear_model import LinearRegression

# Re-initialize and refit model using cleaned data
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Now predict safely
y_train_pred = lr_model.predict(X_train)
y_test_pred = lr_model.predict(X_test)




#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt

# Evaluation on test data
mae = mean_absolute_error(y_test, y_test_pred)
mse = mean_squared_error(y_test, y_test_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_test_pred)

# Print values
print(f"📈 MAE: {mae:.2f}")
print(f"📉 MSE: {mse:.2f}")
print(f"✅ RMSE: {rmse:.2f}")
print(f"🎯 R² Score: {r2:.4f}")

# Visualization
metrics = ['MAE', 'MSE', 'RMSE', 'R²']
scores = [mae, mse, rmse, r2]

plt.figure(figsize=(8,5))
bars = plt.bar(metrics, scores, color=['royalblue', 'skyblue', 'lightseagreen', 'mediumseagreen'])
plt.title('📊 Linear Regression - Evaluation Metric Score Chart')
plt.ylabel('Score')
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Label bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{height:.2f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Define the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LinearRegression())
])

# Define hyperparameters to tune
# LinearRegression has only a few parameters, so we tune 'fit_intercept' and 'positive'
param_grid = {
    'lr__fit_intercept': [True, False],
    'lr__positive': [True, False]
}

# GridSearchCV with 5-fold CV
grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1,
    verbose=1
)

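# Fit the model
# Note: this fits on the full dataset, so the rows later used as X_test are seen
# during tuning; fit on (X_train, y_train) instead for an unbiased test score.
grid_search.fit(X_selected_cleaned, y)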

# Best parameters and score
print("✅ Best Parameters:", grid_search.best_params_)
print("🎯 Best R² Score from CV:", grid_search.best_score_)


In [None]:
# Split again for final evaluation
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X_selected_cleaned, y, test_size=0.2, random_state=42)

# Predict using best estimator
best_model = grid_search.best_estimator_
y_test_pred = best_model.predict(X_test)
y_train_pred = best_model.predict(X_train)


In [None]:
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
r2 = r2_score(y_test, y_test_pred)

print(f"📉 Final RMSE: {rmse:.2f}")
print(f"🎯 Final R² Score: {r2:.4f}")


##### Which hyperparameter optimization technique have you used and why?

GridSearchCV

- GridSearchCV performs an exhaustive search over a specified set of hyperparameter values.
- It's ideal for models like Linear Regression with few hyperparameters.
- It helps evaluate the model’s performance using cross-validation to avoid overfitting or underfitting.
- Cross-validation: 5-fold (used to validate model stability across different subsets of the data).

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

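| Metric | Before Tuning | After GridSearchCV Tuning |
|---|---|---|
| RMSE | 8.52 | 8.41 ✅ (slightly better) |
| R² Score | 0.9913 | 0.9915 ✅ (minor improvement) |

Because the grid search is fit on the full dataset, the tuned model has already seen the test rows, so this small gain should be read with caution.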

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Train the model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict
y_test_pred_rf = rf_model.predict(X_test)
y_train_pred_rf = rf_model.predict(X_train)

# Evaluation
mae_rf = mean_absolute_error(y_test, y_test_pred_rf)
mse_rf = mean_squared_error(y_test, y_test_pred_rf)
rmse_rf = np.sqrt(mse_rf)
r2_rf = r2_score(y_test, y_test_pred_rf)

print(f"📈 MAE (RF): {mae_rf:.2f}")
print(f"📉 MSE (RF): {mse_rf:.2f}")
print(f"✅ RMSE (RF): {rmse_rf:.2f}")
print(f"🎯 R² Score (RF): {r2_rf:.4f}")


In [None]:
import matplotlib.pyplot as plt

# Metrics
metrics = ['MAE', 'MSE', 'RMSE', 'R²']
rf_scores = [mae_rf, mse_rf, rmse_rf, r2_rf]

# Bar chart
plt.figure(figsize=(8, 5))
bars = plt.bar(metrics, rf_scores, color=['goldenrod', 'salmon', 'coral', 'forestgreen'])

# Add labels
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval + 0.5, f'{yval:.2f}', ha='center', fontsize=10)

plt.title('📊 Random Forest Regressor - Evaluation Metrics')
plt.ylabel('Score')
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Define hyperparameter grid
param_dist = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [1.0, 'sqrt', 'log2']  # 1.0 = all features; 'auto' was removed in scikit-learn 1.3
}

# Create the model
rf = RandomForestRegressor(random_state=42)

# Setup RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=20,  # number of combinations to try
    cv=5,
    scoring='r2',
    verbose=1,
    n_jobs=-1,
    random_state=42
)

# Fit to training data
random_search.fit(X_train, y_train)

# Best model
best_rf_model = random_search.best_estimator_

# Predict
y_test_pred_rf_tuned = best_rf_model.predict(X_test)
y_train_pred_rf_tuned = best_rf_model.predict(X_train)

# Evaluate
rmse_rf_tuned = np.sqrt(mean_squared_error(y_test, y_test_pred_rf_tuned))
r2_rf_tuned = r2_score(y_test, y_test_pred_rf_tuned)

print("✅ Best Hyperparameters:", random_search.best_params_)
print(f"📉 Tuned RMSE: {rmse_rf_tuned:.2f}")
print(f"🎯 Tuned R² Score: {r2_rf_tuned:.4f}")


In [None]:
# Before and After
labels = ['RMSE', 'R²']
before = [rmse_rf, r2_rf]
after = [rmse_rf_tuned, r2_rf_tuned]

x = range(len(labels))
width = 0.35

plt.figure(figsize=(8,5))
plt.bar([i - width/2 for i in x], before, width, label='Before Tuning', color='orange')
plt.bar([i + width/2 for i in x], after, width, label='After Tuning', color='green')

plt.xticks(x, labels)
plt.ylabel('Score')
plt.title('📊 Random Forest Performance Before vs After Tuning')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()


##### Which hyperparameter optimization technique have you used and why?

RandomizedSearchCV

- Efficient for large hyperparameter spaces
- Searches randomly across combinations
- Faster than GridSearchCV for complex models
- Combines well with cross-validation to avoid overfitting
- Works especially well for tree-based models like Random Forest where there are many tuning options
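The saving can be put in numbers by counting the combinations in a grid of the same shape as the one searched above. In this sketch the first `max_features` option is written as `1.0`, which is an assumption about the installed scikit-learn version; the count depends only on how many options each parameter has.

```python
from sklearn.model_selection import ParameterGrid

# Same shape as the search space above: 3 x 4 x 3 x 3 x 3 options
param_dist = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [1.0, "sqrt", "log2"],
}

n_combinations = len(ParameterGrid(param_dist))
n_iter, cv = 20, 5

# Model fits needed: exhaustive grid search vs. 20 random draws, both with 5-fold CV
print(n_combinations, n_combinations * cv, n_iter * cv)  # 324 1620 100
```

Random search therefore evaluates about 6% of the grid, trading a guarantee of finding the best grid point for a much shorter run.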

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

| Metric | Before Tuning | After Tuning | Change |
|---|---|---|---|
| RMSE | 13.95 | 13.54 ✅ | 🟢 Improved |
| R² Score | 0.9767 | 0.9780 ✅ | 🟢 Improved |

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

📊 1. Mean Absolute Error (MAE)

What It Means:
MAE represents the average absolute difference between predicted and actual stock prices.

Business Interpretation:
A lower MAE means the model consistently predicts prices close to reality — crucial for reducing financial risk in decisions like buying/selling stocks.

Business Impact:

Enables more accurate investment decisions

Reduces potential loss due to price misprediction

Supports automated trading systems where tight margins matter

📉 2. Mean Squared Error (MSE)

What It Means:
MSE calculates the average squared error, penalizing larger mistakes more heavily.

Business Interpretation:
Useful when large errors are particularly costly (e.g., stock trading or portfolio management).

Business Impact:

Helps identify if the model has occasional big misses

Drives improvements in prediction systems that require high precision

✅ 3. Root Mean Squared Error (RMSE)

What It Means:
RMSE is the square root of MSE, giving error in the same units (e.g., rupees ₹).

Business Interpretation:

RMSE shows the typical prediction error — easy for managers to interpret in real-world currency.


Business Impact:

For example, the tuned Random Forest's RMSE of 13.54 means the typical prediction error is about ₹13.54, with larger misses weighted more heavily

Enables risk quantification of automated models

Helps decide whether the model is accurate enough for deployment

🎯 4. R² Score (Coefficient of Determination)

What It Means:
Measures how well the model explains the variance in the target variable.

Business Interpretation:
R² = 0.9780 (tuned Random Forest) → the model explains about 97.8% of the variance in closing prices on the test set.

Business Impact:

High R² means strong predictive power — useful for trend analysis

Helps investors gain confidence in forecasts

Supports long-term forecasting and portfolio modeling



### 1. Which Evaluation metrics did you consider for a positive business impact and why?

For this project, we focused on RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and R² score. RMSE was prioritized because it penalizes larger errors more, which is crucial in stock price prediction where large deviations can lead to significant financial loss. MAE helped us measure the average prediction error, ensuring consistency and reliability. The R² score indicated how well the model explained variations in the target variable, giving us confidence in its forecasting ability. These metrics were chosen because they directly support accurate and risk-aware business decisions.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Among all the models tested, Random Forest Regressor was selected as the final prediction model. It outperformed others in both accuracy and generalization, especially after hyperparameter tuning. The model delivered the lowest RMSE and highest R² score, indicating strong performance. Additionally, it is robust to outliers and noise, which is essential for stock market data. Its ability to handle non-linear relationships and built-in feature importance made it a reliable and interpretable choice for production-ready forecasting.
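
A minimal sketch of how such a tuning step might look is given below, using `RandomizedSearchCV` as named in the conclusion. Everything else is an assumption made for illustration: the synthetic data, the search space, and the use of `TimeSeriesSplit` to keep validation folds later in time than training folds.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic stand-in for monthly Open/High/Low/Close data (not the real dataset)
rng = np.random.default_rng(42)
n_months = 120
open_price = 100 + np.cumsum(rng.normal(0, 5, size=n_months))
high = open_price + rng.uniform(1, 10, size=n_months)
low = open_price - rng.uniform(1, 10, size=n_months)
close = low + (high - low) * rng.uniform(0, 1, size=n_months)

X = np.column_stack([open_price, high, low])
y = close

# Hypothetical search space; the grid actually used in the notebook may differ
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,                                # sample 5 of the 27 combinations
    scoring="neg_root_mean_squared_error",   # scikit-learn maximizes, so RMSE is negated
    cv=TimeSeriesSplit(n_splits=3),          # each fold trains on the past, validates on the future
    random_state=42,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated RMSE:", -search.best_score_)
```

Scoring on RMSE matches the metric prioritized above, and the time-ordered folds are one way to avoid validating on months that precede the training data.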

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

The chosen model, Random Forest Regressor, is an ensemble learning method that builds multiple decision trees and averages their outputs. This makes it powerful and resistant to overfitting. To understand what drives its predictions, we used the model’s built-in feature importance. Features like the high and open prices had the most influence on the closing price, while derived features such as rolling averages and encoded year values contributed moderately. This insight helps justify the model’s predictions and increases transparency, which is essential for gaining trust in business applications.
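
Reading the built-in feature importance could look like the sketch below. It trains on synthetic Open/High/Low data rather than the project's dataset, so the printed scores are illustrative only and say nothing about the real ranking.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the monthly price data (not the real dataset)
rng = np.random.default_rng(0)
n_months = 150
open_price = 100 + np.cumsum(rng.normal(0, 5, size=n_months))
high = open_price + rng.uniform(1, 10, size=n_months)
low = open_price - rng.uniform(1, 10, size=n_months)
close = low + (high - low) * rng.uniform(0, 1, size=n_months)

feature_names = ["Open", "High", "Low"]
X = np.column_stack([open_price, high, low])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, close)

# Impurity-based importances are non-negative and sum to 1
importances = dict(zip(feature_names, model.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Because Open, High, and Low move closely together, impurity-based importance can be split somewhat arbitrarily among them. A ranking like this is best read as a rough guide rather than a precise attribution.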

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
import joblib

# Save the best model
joblib.dump(best_rf_model, 'random_forest_model.pkl')

print("✅ Model saved as 'random_forest_model.pkl'")


### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
import joblib

# Load the saved Random Forest model
loaded_model = joblib.load('random_forest_model.pkl')

print("✅ Model loaded successfully.")


In [None]:
# Take a small sample for sanity check
sample_unseen = X_test[:5]

# Make predictions using the loaded model
predicted_values = loaded_model.predict(sample_unseen)

# Display predictions
print("🎯 Predicted Values for Unseen Data:")
print(predicted_values)

# Optionally, compare with actual values
print("\n✅ Actual Values:")
print(y_test[:5].values)
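
A stricter version of this sanity check compares the predictions of the in-memory model with those of the reloaded one. The sketch below is self-contained and uses a small synthetic model and a temporary directory. In the notebook, `best_rf_model` and `X_test` would take the place of the stand-ins `model` and `X`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data and model (assumptions for illustration only)
rng = np.random.default_rng(1)
X = rng.uniform(10, 400, size=(60, 3))
y = X.mean(axis=1) + rng.normal(0, 2, size=60)
model = RandomForestRegressor(n_estimators=50, random_state=1).fit(X, y)

# Save and reload inside a temporary directory so no file is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "random_forest_model.pkl")
    joblib.dump(model, path)
    reloaded = joblib.load(path)

# The reloaded model should reproduce the original predictions
sample = X[:5]
round_trip_ok = np.allclose(model.predict(sample), reloaded.predict(sample))
print("Predictions identical after reload:", round_trip_ok)
```

If the check fails, the usual suspects are a mismatch in library versions between saving and loading, or a different model object having been saved.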


### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, we successfully developed a robust machine learning pipeline to predict the monthly closing price of Yes Bank's stock using historical stock price data. Beginning with a thorough exploratory data analysis, we performed structured univariate, bivariate, and multivariate visualizations to extract meaningful insights about price trends and influencing features.

We applied preprocessing steps such as checking for missing values, feature engineering, scaling, and outlier treatment to prepare the dataset for modeling. Multiple regression models were built and evaluated using RMSE, MAE, and R² score to assess predictive performance.

Among the models tested, the Random Forest Regressor emerged as the best-performing model with excellent accuracy and generalization capability. After applying hyperparameter tuning using RandomizedSearchCV, we observed further improvements in the model’s performance.

To ensure deployment readiness, we saved the trained model as a .pkl file and successfully reloaded it to make predictions on a sample of held-out test data, confirming that the serialized model loads and predicts as expected.

Overall, this project demonstrates how machine learning can be effectively applied to financial time series forecasting, providing valuable tools for investment analysis, trading strategy optimization, and business decision-making.



### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***