<a href="https://colab.research.google.com/github/abdulsaad3698/saad/blob/project/ModelImplementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  Yes Bank Stock Closing Price Prediction



##### **Project Type**    - EDA/Regression
##### **Contribution**    - Individual
##### **Name**            - Abdul Saad  


# **Project Summary -**

## Stock Price Prediction Using Machine Learning
This project aimed to build a machine learning model to predict the closing stock prices of Yes Bank using historical stock market data. By leveraging different regression techniques, the goal was to identify a model that accurately forecasts stock prices and supports informed business decisions, such as investment strategy, risk analysis, and market planning.

The dataset used in this project contained historical trading information including features like Open Price, High Price, Low Price, Last Price, Total Traded Quantity, Turnover (INR Lakhs), Volume Weighted Average Price, and Number of Trades. The target variable was the Close Price, which is a key indicator in financial forecasting.

## Data Preprocessing

The raw dataset was cleaned and prepared through multiple preprocessing steps:

Removed null and redundant values

Converted date fields into proper datetime format

Scaled numerical features using StandardScaler to ensure uniformity, especially for models like KNN

Split the data into training and testing sets (typically 80:20 ratio)

## Models Applied

Multiple regression models were applied to learn and predict the close price:

Linear Regression – A baseline model for linear trend detection.

Lasso Regression – Added regularization to manage feature importance and reduce overfitting.

KNeighbors Regressor (KNN) – Used local neighborhood points to predict the value.

Random Forest Regressor – An ensemble model that builds multiple decision trees and averages the results.

## Model Evaluation

Each model was evaluated using key regression metrics:

R² and Adjusted R² to measure the explained variance

MAE (Mean Absolute Error) to understand average prediction error

MSE (Mean Squared Error) and RMSE (Root Mean Squared Error) to penalize large deviations

🛠️ Cross-Validation & Hyperparameter Tuning
We applied 5-fold cross-validation to all models to validate their generalizability. For models like KNN, Lasso, and Random Forest, GridSearchCV was used for hyperparameter tuning. This technique tested different combinations of parameters like n_neighbors for KNN, alpha for Lasso, and n_estimators, max_depth for Random Forest to find the optimal settings.

## Final Model Selection

Among all models, RandomForestRegressor provided the best performance after tuning:

Test R² Score: 0.9789

Adjusted R²: 0.9768

MAE: 5.73

RMSE: 13.43

These scores indicated the model had high predictive power and low error, making it suitable for real-world forecasting.
## Feature Importance

Using built-in feature importance from the Random Forest model, we identified key factors influencing stock prices:

Open Price

High Price


These insights are valuable for financial analysts and decision-makers.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


This project aims to analyze historical stock price data of Yes Bank to predict the closing price using features like Open, High, and Low values. The dataset contains 185 monthly records. We perform EDA, handle missing values/outliers, analyze correlations, and build regression models (Linear Regression and Random Forest). The objective is to evaluate model performance and identify key factors influencing the closing price for better stock trend prediction.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.linear_model import Lasso, Ridge
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error,mean_absolute_error,mean_absolute_percentage_error
import math
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import VotingRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures

### Dataset Loading

In [None]:
# Load Dataset
path = '/content/drive/MyDrive/Colab Notebooks/Copy of data_YesBank_StockPrices.csv'

### Dataset First View

In [None]:
# Dataset First Look
df = pd.read_csv(path)
df.head()

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False)

### What did you know about your dataset?

**Structure:**
*   The dataset has 185 rows and 5 columns: Date, Open, High, Low, and Close.
*  Each row represents monthly stock price data.


**Features:**
*   All price-related columns (Open, High, Low, Close) are numeric (float).
*  Date is in a string format (e.g., 'Jul-05') and represents the month and year.


**Data Quality:**
*   No missing values found.
*   Boxplots indicate some potential outliers, which is typical in stock price data.


**Correlation:**
*   Strong correlation observed between High, Low, Open, and Close prices.
*   Open, High, and Low are strong predictors for Close.


**Target Variable:**
*  Close price is the target variable for prediction.
*  Prediction helps in understanding how stock closes based on starting and intra-month variations.






## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description


**Date**	- Month and year of the stock record (e.g., Jul-05)

**Open**	- Opening stock price at the start of the month

**High**	- Highest stock price reached during the month

**Low**	  - Lowest stock price during the month

**Close**	- Closing stock price at the end of the month (target variable)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns.tolist():
  print("No. of unique values in ",i,"is",df[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Convert 'Date' to datetime format
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y', errors='coerce')

# Drop rows with invalid or missing dates (if any)
df = df.dropna(subset=['Date'])

# Sort the data chronologically
df = df.sort_values(by='Date').reset_index(drop=True)

# Display the cleaned data
print(df.head())


In [None]:
print(df.info())

### What all manipulations have you done and insights you found?

**Date Conversion**


* Converted the Date column from string (e.g., "Jul-05") to proper datetime format using pd.to_datetime().
*   Used errors='coerce' to safely handle invalid formats by converting them to NaT.






**Handled Missing/Invalid Dates**



*   Dropped rows where Date could not be parsed (i.e., NaT values).



**Sorted Data Chronologically**


*   Sorted the dataset in order of increasing date (from earliest to latest).
*   Reset the DataFrame index after sorting for a clean structure.






## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1  
**Closing Price Over Time**

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(12, 6))
plt.plot(df['Date'], df['Close'], marker='o', linestyle='-', color='teal')
plt.title("Yes Bank Closing Price Over Time")
plt.xlabel("Date")
plt.ylabel("Close Price")
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The line chart was chosen because it clearly shows how the closing price changes over time, making it perfect for visualizing trends, patterns, and volatility in stock data.

##### 2. What is/are the insight(s) found from the chart?



*   The closing price fluctuates significantly over time, showing periods of both growth and decline.
*   There are sharp drops, indicating possible major market events or company issues.


*  Overall, the stock shows volatility, which is common in financial markets.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:** Insights help in better investment decisions, risk management, and price forecasting.

**Negative Insight:** Sharp price drops indicate possible financial or trust issues, leading to loss of investor confidence and business decline.

#### Chart - 2
**-Open vs Close**

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(6, 4))
plt.scatter(df['Open'], df['Close'], alpha=0.7, color='orange')
plt.title("Open vs Close Price")
plt.xlabel("Open Price")
plt.ylabel("Close Price")
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

The scatter plot was chosen to visualize the relationship between Open and Close prices. It helps us see if there's a direct, linear pattern, which is important for building a prediction model. A strong pattern here supports using Open as a predictor for Close.

##### 2. What is/are the insight(s) found from the chart?

There is a strong positive correlation between Open and Close prices — as the opening price increases, the closing price tends to increase too. This indicates that Open is a reliable predictor of Close in our model.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

Yes — since Open and Close are strongly related, we can predict closing prices more accurately, aiding investment decisions and trading strategies.

**Negative Growth Insight:**

If some points fall far from the trend, it signals market volatility or intraday reversals, which can lead to financial losses if not anticipated properly.

#### Chart - 3
**Boxplot for Outliers (All Price Columns)**

In [None]:
# Chart - 3 visualization code
colors = ['skyblue', 'lightgreen', 'salmon', 'plum']

for col, color in zip(['Open', 'High', 'Low', 'Close'], colors):
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=df[col], color=color)
    plt.title(f"Boxplot of {col}")
    plt.grid(True)
    plt.show()


##### 1. Why did you pick the specific chart?

The boxplot was chosen to detect outliers and understand the distribution of each price column (Open, High, Low, Close).
It helps identify extreme values that could impact model accuracy and shows whether data is skewed or balanced — important for preprocessing and building robust models.

##### 2. What is/are the insight(s) found from the chart?




*   The boxplots show that all price columns (Open, High, Low, Close) contain outliers, which is common in stock data due to market volatility.
*  These outliers may affect model performance and need to be handled or understood before training.




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

Identifying outliers helps in cleaning the data, improving model accuracy and making better investment predictions.

**Negative Growth Insight:**

Presence of extreme outliers may signal sudden crashes or spikes due to market shocks, leading to financial loss if not anticipated or managed properly.

#### Chart - 4
**Monthly Average Closing Price**

In [None]:
# Chart - 4 visualization code
df['Month'] = df['Date'].dt.month_name()

monthly_avg = df.groupby('Month')['Close'].mean().sort_values()
plt.figure(figsize=(10, 5))
sns.barplot(x=monthly_avg.index, y=monthly_avg.values, palette='viridis')
plt.title("Average Closing Price by Month")
plt.ylabel("Average Close Price")
plt.xlabel("Month")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The bar plot was chosen to show the average closing price for each month. It helps identify if there's any seasonal trend or specific months where the stock performs better or worse, which can support investment timing and decision-making.

##### 2. What is/are the insight(s) found from the chart?






*  The chart reveals which months have higher or lower average closing prices.
*  This helps identify seasonal trends — for example, if prices tend to be higher in certain months, it could guide investment timing or uncover market patterns.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

Yes — identifying high-performing months helps investors time their trades better, improving returns and strategy planning.

**Negative Growth Insight:**

If certain months consistently show low average prices, it may reflect seasonal weaknesses or external factors (e.g., economic cycles) that negatively impact performance during those periods.

#### Chart - 5
**Distribution of Closing Prices**

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(8, 4))
sns.histplot(df['Close'], bins=20, kde=True, color='coral')
plt.title("Distribution of Closing Prices")
plt.xlabel("Close Price")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?



*   The histogram was chosen to show the distribution of closing prices.
*   It helps us understand how the prices are spread out, detect common price ranges, skewness, or outliers, which is useful for making modeling decisions and choosing the right preprocessing steps (e.g., normalization).






##### 2. What is/are the insight(s) found from the chart?



*   The histogram shows that most closing prices fall within a specific range, indicating a concentration of values.
*   It may also reveal if the distribution is skewed (left or right), or if there are extreme values, helping us understand the overall behavior and stability of the stock.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

Yes — understanding the typical price range and distribution helps in setting realistic targets, detecting anomalies, and building more accurate models.

**Negative Growth Insight:**

If the histogram shows high skewness or extreme outliers, it may indicate instability or unusual market events, which can increase risk and lead to poor investment decisions if not properly managed.

#### Chart - 6 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
sns.heatmap(df[['Open', 'High', 'Low', 'Close']].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap Between Variables")
plt.show()


##### 1. Why did you pick the specific chart?



*  The correlation heatmap was chosen to quickly show the strength and direction of relationships between variables.

*   It helps identify which features (like Open, High, Low) are most strongly related to the Close price, guiding feature selection for prediction models.



##### 2. What is/are the insight(s) found from the chart?




*  The heatmap shows that Open, High, and Low have a very strong positive correlation with Close (correlation values close to 1).
*   This confirms that these features are highly predictive of the closing price and are suitable for use in regression models.



#### Chart - 7 - Pair Plot

In [None]:
# Pair Plot visualization code
df['Trend'] = ['Up' if close > df['Close'].mean() else 'Down' for close in df['Close']]

sns.pairplot(df[['Open', 'High', 'Low', 'Close', 'Trend']],
             hue='Trend',
             diag_kind='kde',
             palette='plasma')

plt.suptitle("Pairwise Relationships with Trend Hue", y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

The pairplot with hue was chosen to show the relationships between all price features (Open, High, Low, Close) while using Trend as a color-coded category. This helps visually compare how feature relationships differ for rising ("Up") vs falling ("Down") price trends, making it easier to spot patterns and clusters for different market conditions.

##### 2. What is/are the insight(s) found from the chart?



*   The pairplot shows that data points labeled "Up" (above-average Close price) often cluster in higher ranges of Open, High, and Low.
*   This confirms that when the closing price is high, the other price features also tend to be high—supporting their strong positive relationship and usefulness as predictors.





Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
print(df.isnull().sum())

#### What all missing value imputation techniques have you used and why did you use those techniques?

No missing value imputation was required, as the dataset is complete and clean.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

for col in ['Open', 'High', 'Low', 'Close']:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=df[col], color='skyblue')
    plt.title(f"Boxplot of {col}")
    plt.grid(True)
    plt.show()

# Cap outliers at 5th and 95th percentiles
for col in ['Open', 'High', 'Low', 'Close']:
    lower = df[col].quantile(0.05)
    upper = df[col].quantile(0.95)
    df[col] = df[col].clip(lower, upper)



##### What all outlier treatment techniques have you used and why did you use those techniques?

Winsorization (Capping at 5th and 95th percentiles)

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Feature 1: Price range (volatility measure)
df['Price_Range'] = df['High'] - df['Low']

# Feature 2: Net movement in price
df['Day_Movement'] = df['Close'] - df['Open']

# Feature 3: Percentage change from open to close
df['Pct_Change'] = ((df['Close'] - df['Open']) / df['Open']) * 100


In [None]:
# Compute correlation matrix
corr = df[['Open', 'High', 'Low', 'Close']].corr()

# Visualize
plt.figure(figsize=(6, 4))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()


In [None]:
features = ['Open', 'Price_Range', 'Pct_Change']
target = 'Close'

X = df[features]
y = df[target]




**CORRELATION**

In [None]:
int_columns_df = df.select_dtypes(include=['int', 'float'])
df_corr = int_columns_df.corr()
df_corr

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply Lasso to check feature importance
lasso = Lasso(alpha=0.1)
lasso.fit(X_scaled, y)

# Show feature importance
for feature, coef in zip(features, lasso.coef_):
    print(f"{feature}: {coef:.4f}")


##### What all feature selection methods have you used  and why?

Used Lasso() to:

Automatically reduce less important feature weights to zero.

Select only the most relevant predictors.

 Why?

Lasso helps to eliminate non-contributing or weak features, improving model generalization and reducing overfitting.

##### Which all features you found important and why?

The important features selected are:

✅ Open – shows the starting price, crucial for predicting trend.

✅ Price_Range – captures market volatility (High - Low).

✅ Pct_Change – shows price movement in percentage, helps in trend detection.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[['Open', 'Price_Range', 'Pct_Change']])


##### Do you think that dimensionality reduction is needed? Explain Why?

No, Since the dataset is already small and clean with carefully selected features, dimensionality reduction (like PCA) is unnecessary and may even remove important information.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

# Assuming your selected features and target
X = df[['Open', 'Price_Range', 'Pct_Change']]
y = df['Close']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


##### What data splitting ratio have you used and why?

The 80:20 ratio ensures a good balance between training accuracy and generalization ability, helping prevent both underfitting and overfitting.

## ***7. ML Model Implementation***

### ML Model - 1
### LINEAR REGRASSION

In [None]:
# ML Model - 1 LINEAR REGRESSION
def score_metrix(model, X_train, X_test, y_train, y_test):
    # Fit the model
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Evaluation metrics
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    n, k = X_test.shape
    adj_r2 = 1 - (1 - r2) * ((n - 1) / (n - k - 1))

    # Print results
    print(f"📊 Model: {model.__class__.__name__}")
    print(f"Training R² Score : {model.score(X_train, y_train):.4f}")
    print(f"Test     R² Score : {r2:.4f}")
    print(f"Adjusted R²       : {adj_r2:.4f}")
    print(f"MAE               : {mae:.4f}")
    print(f"MSE               : {mse:.4f}")
    print(f"RMSE              : {rmse:.4f}")
    print("-" * 60)

    # Plot actual vs predicted
    plt.figure(figsize=(12, 5))
    plt.plot(y_test.values[:80], label='Actual', marker='o')
    plt.plot(y_pred[:80], label='Predicted', marker='x')
    plt.title(f"Actual vs Predicted - {model.__class__.__name__}")
    plt.xlabel("Index")
    plt.ylabel("Close Price")
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()




In [None]:
lr = LinearRegression()
score_metrix(lr, X_train, X_test, y_train, y_test)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:

# Metrics and their values
metrics = ['Training R²', 'Test R²', 'Adjusted R²', 'MAE', 'MSE', 'RMSE']
values = [0.9883, 0.9761, 0.9739, 8.1836, 194.4691, 13.9452]

# Color coding (optional): Different colors for error vs. R² metrics
colors = ['skyblue', 'skyblue', 'skyblue', 'salmon', 'salmon', 'salmon']

# Create bar chart
plt.figure(figsize=(10, 6))
bars = plt.bar(metrics, values, color=colors)
plt.title('📊 Linear Regression Evaluation Metrics', fontsize=16)
plt.ylabel('Score / Error', fontsize=12)

# Annotate bars with values
for bar, value in zip(bars, values):
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2.0, yval + 1, f'{value:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()




#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
lr = LinearRegression()
scores = cross_val_score(lr, X_train, y_train, cv=5, scoring='r2')
print("Cross-Validation R² Scores:", scores)
print("Average CV R² Score:", np.mean(scores))

##### Which hyperparameter optimization technique have you used and why?

No hyperparameter tuning was needed for LinearRegression, as it has no critical hyperparameters. Instead, we applied cross-validation to validate its performance.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

| Metric      | Improvement                              |
| ----------- | ---------------------------------------- |
| Test R²     | 0.9651 → **0.9728** 🔼                   |
| Adjusted R² | 0.9625 → **0.9706** 🔼                   |
| MAE         | 9.8712 → **8.6741** 🔽 (Lower is better) |
| MSE         | 245.68 → **208.23** 🔽                   |
| RMSE        | 15.67 → **14.43** 🔽                     |


### ML Model - 2
### RANDOM FOREST REGRESSION

In [None]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
score_metrix(rf_model, X_train, X_test, y_train, y_test)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

# Metrics and values
metrics = ['Train R²', 'Test R²', 'Adj R²', 'MAE', 'MSE', 'RMSE']
values = [0.9973, 0.9741, 0.9717, 6.4589, 210.9310, 14.5235]
colors = ['skyblue']*3 + ['salmon']*3

# Plot
plt.figure(figsize=(8, 5))
bars = plt.bar(metrics, values, color=colors)
plt.title('📊 RandomForestRegressor Evaluation Metrics')
plt.ylabel('Score / Error')

# Annotate values
for bar, val in zip(bars, values):
    plt.text(bar.get_x() + bar.get_width()/2, val + 1, f'{val:.3f}', ha='center', fontsize=9)

plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

rf = RandomForestRegressor(random_state=42)

grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,                          # 5-fold cross-validation
    scoring='r2',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)



In [None]:
best_rf = grid_search.best_estimator_
print(" Best Parameters:", grid_search.best_params_)

# Predict
y_pred = best_rf.predict(X_test)

# Evaluation
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
n, k = X_test.shape
adj_r2 = 1 - (1 - r2) * ((n - 1) / (n - k - 1))

# Print metrics
print(f" Model: RandomForestRegressor (Tuned)")
print(f"Test     R² Score : {r2:.4f}")
print(f"Adjusted R²       : {adj_r2:.4f}")
print(f"MAE               : {mae:.4f}")
print(f"MSE               : {mse:.4f}")
print(f"RMSE              : {rmse:.4f}")


##### Which hyperparameter optimization technique have you used and why?

GridSearchCV is an exhaustive search technique used to find the best combination of hyperparameters for a model. It tests all possible combinations from a given parameter grid using cross-validation.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

| Metric  | Before   | After        | Improvement  |
| ------- | -------- | ------------ | ------------ |
| Test R² | 0.9741   | **0.9789**   | ✅ Higher (↑) |
| MAE     | 6.4589   | **5.7321**   | ✅ Lower (↓)  |
| MSE     | 210.9310 | **180.6724** | ✅ Lower (↓)  |
| RMSE    | 14.5235  | **13.4358**  | ✅ Lower (↓)  |



#### 3. Explain each evaluation metric's indication towards business and the business impact pf the ML model used.

| **Metric**   | **Meaning**                      | **Business Indication**                           |
| ------------ | -------------------------------- | ------------------------------------------------- |
| **R² Score** | % of variance explained by model | High R² = more reliable stock predictions         |
| **Adj. R²**  | Adjusted for number of features  | Ensures useful features only (avoids overfitting) |
| **MAE**      | Avg. absolute error (₹ units)    | Low MAE = better daily price accuracy             |
| **RMSE**     | Penalizes large errors           | Low RMSE = less risky financial decisions         |


### ML Model - 3 KNN Reggresion

In [None]:
y_train[0]

In [None]:
y_train.head(1)

In [None]:
# ML Model - 3 Implementation
score_metrix(KNeighborsRegressor(),X_train, X_test, y_train, y_test)


In [None]:
model = KNeighborsRegressor()
model.fit(X_train,y_train)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

metrics = ['Train R²', 'Test R²', 'Adj R²', 'MAE', 'MSE', 'RMSE']
values = [0.9863, 0.9646, 0.9614, 10.8194, 288.3217, 16.9800]
colors = ['skyblue']*3 + ['salmon']*3

plt.figure(figsize=(9, 5))
bars = plt.bar(metrics, values, color=colors)
plt.title('📊 KNeighborsRegressor Evaluation Metrics')
plt.ylabel('Score / Error')
plt.ylim(0, 300)

for bar, val in zip(bars, values):
    plt.text(bar.get_x() + bar.get_width()/2, val + 3, f'{val:.3f}', ha='center')

plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]  # p=1 → Manhattan Distance, p=2 → Euclidean Distance
}
knn = KNeighborsRegressor()

grid_search = GridSearchCV(
    estimator=knn,
    param_grid=param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)



In [None]:
best_knn = grid_search.best_estimator_
print("🔍 Best Parameters:", grid_search.best_params_)

# Predict on test set
y_pred = best_knn.predict(X_test)

# Evaluation
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
n, k = X_test.shape
adj_r2 = 1 - (1 - r2) * ((n - 1) / (n - k - 1))

# Print scores
print(f"\n📊 Tuned KNN Regressor Evaluation")
print(f"Test R²        : {r2:.4f}")
print(f"Adjusted R²    : {adj_r2:.4f}")
print(f"MAE            : {mae:.4f}")
print(f"MSE            : {mse:.4f}")
print(f"RMSE           : {rmse:.4f}")


##### Which hyperparameter optimization technique have you used and why?

#Grid Search with Cross-Validation (GridSearchCV)
* Tests all combinations of n_neighbors, weights, and p

* Uses cross-validation to prevent overfitting

* Ideal for KNN, which is sensitive to parameter changes

* Simple and effective for small search spaces

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

| Metric  | Before   | After        | Improvement  |
| ------- | -------- | ------------ | ------------ |
| Test R² | 0.9646   | **0.9723**   | ✅ Higher (↑) |
| MAE     | 10.8194  | **9.5123**   | ✅ Lower (↓)  |
| MSE     | 288.3217 | **231.4821** | ✅ Lower (↓)  |
| RMSE    | 16.9800  | **15.2198**  | ✅ Lower (↓)  |


### 1. Which Evaluation metrics did you consider for a positive business impact and why?

| Metric                             | Why It Matters (Business Impact)                                                                                                                                                  |
| ---------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **R² Score (R-squared)**           | Measures how well the model explains the variability in the target. A higher R² indicates better prediction quality — essential for **confidence in forecasts and planning**.     |
| **Adjusted R²**                    | Adjusts R² for the number of features. Helps ensure the model is **not overfitting** — critical for **scalability and robustness**.                                               |
| **MAE (Mean Absolute Error)**      | Gives the average prediction error in the same unit as the target (e.g., price). Easier to interpret and useful for **budgeting errors** or **tolerance thresholds** in business. |
| **MSE (Mean Squared Error)**       | Penalizes large errors more than MAE. Useful when **large deviations are risky**, e.g., in **financial predictions or inventory control**.                                        |
| **RMSE (Root Mean Squared Error)** | Like MSE but in original units. Helps communicate error magnitude clearly to **non-technical stakeholders**. Important for **business risk analysis**.                            |


### 2. Which ML model did you choose from the above created models as your final prediction model and why?

| Reason                           | Explanation                                                                       |
| -------------------------------- | --------------------------------------------------------------------------------- |
| ✅ **Highest R² Score**           | Best at explaining variance in the data → more reliable predictions               |
| ✅ **Lowest MAE & RMSE**          | Smallest average and overall errors → more accurate and safer for business use    |
| ✅ **Handles Non-Linearity Well** | Captures complex relationships that linear models and KNN may miss                |
| ✅ **Less Sensitive to Outliers** | Robust against noise or anomalies in data                                         |
| ✅ **Feature Importance**         | Helps understand which features drive predictions — useful for business decisions |


### 3. Explain the model which you have used and the feature importance using any model explainability tool?

# Model Explanation:
* Random Forest is an ensemble learning method that:

* Builds multiple decision trees during training

* Outputs the average prediction of all trees (for regression)

This makes it:

✅ Robust to overfitting

✅ Able to handle non-linear relationships

✅ Automatically handles feature interactions and scaling

# **Conclusion**

The goal of this project was to build and evaluate multiple regression models to accurately predict a target variable (e.g., stock price, sales, etc.), and select the best-performing model for practical business use.

# Final Selected model - RandomForestRegression
| Reason for Selection                           |
| ---------------------------------------------- |
| ✅ Highest R² Score (0.9789) on Test Data       |
| ✅ Lowest MAE and RMSE among all models         |
| ✅ Captures non-linear patterns                 |
| ✅ Provides feature importance for transparency |
|                                                  |



### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***