# **Project Name**    - Yes Bank Stock Closing Price Prediction



##### **Project Type**    - Regression
##### **Contribution**    - Team
##### **Team Member 1 -** Asma Khan
##### **Team Member 2 -** Rohit Sagar Chavan
##### **Team Member 3 -** Mukesh Regar
##### **Team Member 4 -** Shuvadip Sahu

# **Project Summary -**

This project aims to analyze and predict the monthly closing prices of Yes Bank, a significant player in the Indian financial sector. The dataset spans from the bank's inception to the present, encompassing monthly stock prices, including opening, closing, highest, and lowest values for each month. The primary goal is to predict the monthly closing price of Yes Bank's stock using various machine learning and time series models.

The project begins with comprehensive data exploration and preprocessing using Pandas for efficient data manipulation and aggregation. This involves cleaning the data, handling missing values, and ensuring it is in a suitable format for analysis. Visualizations will be created using Matplotlib and Seaborn to understand the trends, seasonal patterns, and potential anomalies in the stock prices, particularly around significant events like the fraud case revelation. These visualizations help in gaining insights into the stock’s behavior and its correlation with various factors over time.

For the computational aspect, NumPy is used for efficient numerical operations on the dataset, and Scikit-learn's MinMaxScaler is used for scaling. The core of the project uses the statsmodels time series models ARIMA and SARIMA to predict the stock's closing price, with their orders tuned by a grid search on AIC. Model performance is evaluated using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) to determine the best-performing model.

The project architecture involves several key stages. Initially, data collection and preprocessing form the foundation, ensuring the dataset is clean and structured. Next, exploratory data analysis (EDA) provides valuable insights into the stock price trends and their driving factors. Following this, feature engineering and selection help in identifying the most relevant features that influence the stock’s closing price. The modeling phase includes training various models and fine-tuning their parameters to enhance predictive accuracy. Finally, model evaluation and validation are conducted to assess the models' performance and select the best one for predicting the future closing prices.

Overall, this project integrates data analysis, visualization, and predictive modeling to understand and forecast the monthly closing prices of Yes Bank's stock.

# **GitHub Link -**

https://github.com/asmakhan0212

# **Problem Statement**


**Predicting Monthly Closing Stock Prices for Yes Bank.**

The objective of this project is to predict the monthly closing stock price of Yes Bank. Given the historical stock price data since its inception, including the opening, closing, highest, and lowest prices for each month, the goal is to develop a predictive model that can accurately forecast the closing price.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error,mean_absolute_error
import itertools

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
# Load Dataset
pathway = '/content/drive/MyDrive/Colab Notebooks/data_YesBank_StockPrices.csv'
df = pd.read_csv(pathway)

In [None]:
df

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull())

### What did you know about your dataset?

The dataset comprises monthly stock prices of Yes Bank, including the closing, starting, highest, and lowest stock prices from its inception to the present. The primary objective is to predict the stock's closing price for each month.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

* **Date** - month of the record (one monthly candle per row)
* **Open** - opening price of the month
* **High** - highest price of the month
* **Low** - lowest price of the month
* **Close** - closing price of the month

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns:
  print("No. of unique values in ",i,"is",df[i].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
#Data Inspection
print(df.head())
print(df.info())
print(df.describe())


In [None]:
#Data Cleaning
print(df.isnull().sum())

In [None]:
# visualize Outliers
plt.figure(figsize=(12, 6))
df[['Open', 'High', 'Low', 'Close']].boxplot()
plt.xlabel('Price Type')
plt.ylabel('Price')
plt.title('Box Plot of Stock Prices')
plt.grid(True)
plt.show()

In [None]:
#Set the Date column as index

df.set_index('Date',inplace = True)

### What all manipulations have you done and insights you found?

The 'Date' column has been set as the DataFrame index, so it is no longer a regular column in the DataFrame.
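With 'Date' kept as plain text, the index sorts and plots as strings rather than as dates. The following is a minimal sketch of parsing such labels, assuming (the notebook does not confirm this) that they look like `Jul-05`, that is, an abbreviated month name and a two-digit year. The sample labels and the helper `parse_month` are hypothetical.

```python
from datetime import datetime

# Hypothetical month labels; the real file's date format is an assumption here.
raw_dates = ["Jul-05", "Aug-05", "Sep-05"]

def parse_month(label, fmt="%b-%y"):
    """Parse a label such as 'Jul-05' into a datetime on the first of that month."""
    return datetime.strptime(label, fmt)

parsed = [parse_month(label) for label in raw_dates]
print(parsed[0].date())
```

In pandas the same idea would be `df.index = pd.to_datetime(df.index, format='%b-%y')`, again only if the labels really follow that format.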

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 Box Plot
sns.boxplot(df['Close'])
plt.title('Boxplot of Close')
plt.show()

##### 1. Why did you pick the specific chart?

A box plot was chosen to visualize the distribution of the closing price and identify potential outliers.

##### 2. What is/are the insight(s) found from the chart?

The box plot shows the distribution of Yes Bank's closing prices. We can see the median closing price, the quartiles, and potential outliers, which are the data points beyond the whiskers of the box plot.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifying outliers can help in refining the predictive model, leading to more accurate predictions of Yes Bank's closing prices. This can be valuable for making informed investment decisions. However, further analysis is needed to determine if these outliers represent actual market trends or data anomalies.
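The points drawn beyond the whiskers of a box plot are, by default, those lying more than 1.5 times the interquartile range (IQR) outside the quartiles. A minimal sketch of that rule using only the standard library is shown here; the price values and the helper `iqr_outliers` are hypothetical and only illustrate the calculation.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical closing prices with one extreme month
closes = [10, 12, 13, 14, 15, 16, 18, 95]
outliers = iqr_outliers(closes)
print(outliers)
```

Counting the flagged values per column gives a quick numeric companion to the box plot.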

#### Chart - 2

In [None]:
# Chart - 2 visualization code
sns.boxplot(df['Open'])
plt.title('Boxplot of Open')
plt.show()


##### 1. Why did you pick the specific chart?

Similar to the closing price, a box plot was used to visualize the distribution of the opening price and identify potential outliers.

##### 2. What is/are the insight(s) found from the chart?

The box plot for 'Open' shows the distribution of Yes Bank's opening prices. It helps visualize the median opening price, quartiles, and potential outliers (data points beyond the whiskers). This gives an idea of the typical range of opening prices and any extreme values.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the distribution of opening prices, including outliers, can help traders and investors make more informed decisions. Identifying patterns and anomalies in opening prices can be useful for developing trading strategies or risk management.

#### Chart - 3

In [None]:
# Chart - 3 Line chart for "High"
sns.lineplot(df['High'])
plt.title('Lineplot of High')
plt.show()


##### 1. Why did you pick the specific chart?

A line chart was chosen to visualize the trend of Yes Bank's highest stock price over time.

##### 2. What is/are the insight(s) found from the chart?

The line chart shows the fluctuations in the highest price of Yes Bank's stock over time. It reveals periods of growth, decline, and potential volatility.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the historical trend of the highest stock price can be useful for investors and traders to identify potential buying or selling opportunities.

#### Chart - 4

In [None]:
# Chart - 4 Area Chart
df.plot.area(y='High')
plt.title('Area Chart of High')
plt.show()

##### 1. Why did you pick the specific chart?

An area chart was chosen to visualize the highest stock price over time, with the filled area emphasizing the magnitude of the price at each point.

##### 2. What is/are the insight(s) found from the chart?

The area chart shows how Yes Bank's highest monthly stock price rose and fell over time. It highlights periods of significant price changes and the overall trend.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, visualizing the trend can help investors understand the overall performance of the stock and identify periods of substantial growth or decline.

#### Chart - 5

In [None]:
# Chart - 5 histogram
sns.histplot(df['Low'])
plt.title('Histogram of Low')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram was chosen to visualize the distribution of Yes Bank's lowest stock prices, showing the frequency of different price ranges.

##### 2. What is/are the insight(s) found from the chart?

The histogram reveals the distribution of the lowest stock prices, indicating the most common price ranges and potential outliers. It helps understand the frequency of low prices.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the distribution of low prices can be useful for investors to assess the risk associated with the stock and make informed decisions.

#### Chart - 6

In [None]:
# Chart - 6 Horizontal bar chart
df.plot.barh(y='High')
plt.title('Horizontal Bar Chart of High')
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart was chosen to visualize the highest price of each month in a horizontal format, allowing for easier comparison of prices across different dates.

##### 2. What is/are the insight(s) found from the chart?

The chart displays the highest price for each date horizontally, making it easy to compare prices across different periods.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the horizontal bar chart can help identify periods when the monthly high was elevated or depressed, which can be useful for traders and investors to understand historical price patterns.

#### Chart - 7

In [None]:
# Chart - 7 Scatter plot
sns.scatterplot(x='High', y='Low', data=df)
plt.title('Scatter Plot of High and Low')
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot was chosen to visualize the relationship between Yes Bank's highest and lowest stock prices, exploring potential correlations or patterns.

##### 2. What is/are the insight(s) found from the chart?

The scatter plot shows the relationship between the highest and lowest prices for each period. It helps identify any correlation or trends between these two variables.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the relationship between high and low prices can be useful for traders to assess the volatility and potential trading range of the stock.

#### Chart - 8

In [None]:
# Chart - 8 Line Chart for "Low"
sns.lineplot(df['Low'])
plt.title('Lineplot of Low')
plt.show()


##### 1. Why did you pick the specific chart?

A line chart was chosen to visualize the trend of Yes Bank's lowest stock price over time.

##### 2. What is/are the insight(s) found from the chart?

The line chart shows the fluctuations in the lowest price of Yes Bank's stock over time. It reveals periods of growth, decline, and potential volatility in the lower price range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the historical trend of the lowest stock price can be useful for investors and traders to identify potential buying opportunities or assess downside risk.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(6, 4))
sns.lineplot(data=df, x='Close', y='Open')
plt.title("Open vs close")
plt.show()

##### 1. Why did you pick the specific chart?

A line plot of Open against Close was chosen to see how closely the month's opening price tracks its closing price. The x-axis here is the closing price, not time, so the chart shows the relationship between two variables rather than a trend over time.

##### 2. What is/are the insight(s) found from the chart?

In this chart, we compare the stock's opening and closing prices.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(6, 4))
sns.lineplot(data=df, x='Close', y='High')
plt.title("High vs close")
plt.show()

##### 1. Why did you pick the specific chart?

A line plot of High against Close was chosen to see how closely the month's highest price tracks its closing price. The x-axis here is the closing price, not time, so the chart shows the relationship between two variables rather than a trend over time.

##### 2. What is/are the insight(s) found from the chart?

In this chart, we compare the stock's high and closing prices.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(6,4))
sns.lineplot(data=df, x='Close', y='Low')
plt.title('Low vs Close')
plt.show()


##### 1. Why did you pick the specific chart?

A line plot of Low against Close was chosen to see how closely the month's lowest price tracks its closing price. The x-axis here is the closing price, not time, so the chart shows the relationship between two variables rather than a trend over time.

##### 2. What is/are the insight(s) found from the chart?

In this chart, we compare the stock's Low and closing prices.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Calculate the correlation matrix
correlation_matrix = df.corr()

# Create a heatmap
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.show()


##### 1. Why did you pick the specific chart?

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses. The range of correlation is [-1,1].

Thus, to see the correlation between all the variables along with the correlation coefficients, I used a correlation heatmap.
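Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a rough illustration of what one cell means, the sketch below computes the coefficient by hand for two short series; the price values and the helper `pearson_r` are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical opening and closing prices for five months
open_prices = [10.0, 12.0, 11.0, 15.0, 14.0]
close_prices = [11.0, 12.5, 11.5, 15.5, 13.5]
r = pearson_r(open_prices, close_prices)
print(round(r, 3))
```

A value near 1 means the two series move together almost linearly, which is the pattern expected between the price columns of the same stock.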

##### 2. What is/are the insight(s) found from the chart?

High Correlation:
Open, High, Low, and Close are strongly positively correlated with one another.
Low/No Correlation and Negative Correlation:
None can be read from this heatmap. At this point the DataFrame holds only the four price columns (Date is the index and no Daily Return column has been created), so the matrix has no Date or Daily Return entries.

#### Chart - 13 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df, diag_kind="kde")
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot is a great way to visualize the relationship between all pairs of variables in a dataset. It can help identify patterns, correlations, and potential outliers.

##### 2. What is/are the insight(s) found from the chart?

From the pair plot, the relationships between 'Open', 'High', 'Low', and 'Close' can be inspected pair by pair. Look for positive or negative correlations in the off-diagonal scatter plots and for potential outliers; the diagonal KDE plots show each variable's distribution.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy import stats
sample_mean = df['Close'].mean()
hypothesized_mean = 150

t_statistic, p_value = stats.ttest_1samp(a=df['Close'], popmean=hypothesized_mean)

print("T-statistic:", t_statistic)
print("P-value:", p_value)


##### Which statistical test have you done to obtain P-Value?

To obtain the p-value, I performed a one-sample t-test.

This test is appropriate because we are comparing the mean of a single sample (the closing prices of Yes Bank stock) to a known value (the hypothesized mean of 150).
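The statistic behind a one-sample t-test is the gap between the sample mean and the hypothesized mean, measured in standard errors. A minimal sketch of that formula follows, using hypothetical closing prices and a hypothetical helper `one_sample_t`; it does not compute the p-value, which `stats.ttest_1samp` obtains from the t-distribution.

```python
import math
import statistics

def one_sample_t(sample, popmean):
    """t = (sample mean - hypothesized mean) / (sample std / sqrt(n))."""
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)  # stdev uses n - 1
    return (mean - popmean) / std_err

# Hypothetical closing prices; the real test runs on df['Close']
closes = [148.0, 152.0, 150.0, 154.0, 146.0]
t_stat = one_sample_t(closes, popmean=145.0)
print(round(t_stat, 4))
```

A large absolute t-statistic means the sample mean lies far from the hypothesized value relative to the sampling noise.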



##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy import stats

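# Sample variances of the opening and closing prices
variance_open = df['Open'].var()
variance_close = df['Close'].var()

# Perform the F-test: the statistic is the ratio of the two sample variances
f_statistic = variance_open / variance_close
df1 = len(df['Open']) - 1   # numerator degrees of freedom
df2 = len(df['Close']) - 1  # denominator degrees of freedom

# One-sided (upper-tail) p-value: P(F >= f_statistic)
p_value = stats.f.sf(f_statistic, df1, df2)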

print("F-statistic:", f_statistic)
print("P-value:", p_value)


##### Which statistical test have you done to obtain P-Value?

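To obtain the p-value for the second hypothesis (comparing variances), I performed an F-test.

The F-test compares the variances of two samples, which in this case are the opening prices and the closing prices of Yes Bank stock. It assumes independent, normally distributed samples; the monthly Open and Close series come from the same months and are strongly correlated, so the p-value should be read as indicative only.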



##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy import stats

correlation_coefficient, p_value = stats.pearsonr(df['Open'], df['Close'])

print("Pearson correlation coefficient:", correlation_coefficient)
print("P-value:", p_value)


##### Which statistical test have you done to obtain P-Value?

For the third hypothesis (testing the correlation), I used the Pearson correlation coefficient test to obtain the p-value.

This test is commonly used to assess the linear relationship between two continuous variables, which in this scenario are the opening and closing prices of Yes Bank stock.



##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

There are no missing values in the dataset, so no imputation was required.

### 2. Handling Outliers

In [None]:
# visualize Outliers
plt.figure(figsize=(12, 6))
df[['Open', 'High', 'Low', 'Close']].boxplot()
plt.xlabel('Price Type')
plt.ylabel('Price')
plt.title('Box Plot of Stock Prices')
plt.grid(True)
plt.show()

##### What all outlier treatment techniques have you used and why did you use those techniques?

No outlier treatment was applied. The outliers visible in the box plot are significant for this dataset, so they were retained.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# Not applicable in this dataset as there are no categorical columns


#### What all categorical encoding techniques have you used & why did you use those techniques?

Not applicable in this dataset as there are no categorical columns


#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# Line plot of a continuous feature (Open) vs the target (Close)
plt.figure(figsize=(6, 4))
sns.lineplot(data=df, x='Open', y='Close')
plt.title("open vs close")
plt.show()

In [None]:
# Line plot of a continuous feature (Low) vs the target (Close)
plt.figure(figsize=(6, 4))
sns.lineplot(data=df, x='Low', y='Close')
plt.title("Low vs close")
plt.show()

In [None]:
# Line plot of a continuous feature (High) vs the target (Close)
plt.figure(figsize=(6, 4))
sns.lineplot(data=df, x='High', y='Close')
plt.title("High vs close")
plt.show()

In [None]:
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif(dataframe):
    """Return a DataFrame with the variance inflation factor of each column."""
    # Handle missing values (NaN) by dropping rows containing them
    dataframe = dataframe.dropna()

    # Handle infinite values (inf) by replacing them with a large finite number
    dataframe = dataframe.replace([np.inf, -np.inf], 1e9)

    vif_data = pd.DataFrame()
    vif_data["feature"] = dataframe.columns
    # VIF of column i: how well column i is explained by all the other columns
    vif_data["VIF"] = [
        variance_inflation_factor(dataframe.values, i)
        for i in range(dataframe.shape[1])
    ]
    return vif_data

# Pass a copy of df to avoid modifying the original
vif_sample_df = calculate_vif(df.copy())

vif_sample_df

##### What all feature selection methods have you used  and why?

We used the variance inflation factor (VIF) to check for multicollinearity among the features.
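The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on the other features. As an illustration only, the sketch below handles the special case of two features, where R² reduces to the squared Pearson correlation between them; the values and helper names are hypothetical, and the statsmodels function used in the notebook covers the general case.

```python
def r_squared_two_vars(xs, zs):
    """R^2 of regressing xs on zs with an intercept (the squared Pearson r)."""
    n = len(xs)
    mean_x, mean_z = sum(xs) / n, sum(zs) / n
    cov = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_z = sum((z - mean_z) ** 2 for z in zs)
    return cov ** 2 / (var_x * var_z)

def vif_two_predictors(xs, zs):
    """VIF = 1 / (1 - R^2); it grows without bound as the features align."""
    return 1.0 / (1.0 - r_squared_two_vars(xs, zs))

# Hypothetical 'High' and 'Low' values
high = [1.0, 2.0, 3.0, 4.0, 5.0]
low = [2.0, 1.0, 4.0, 3.0, 5.0]
vif = vif_two_predictors(high, low)
print(round(vif, 3))
```

A VIF close to 1 indicates little overlap between features, while large values indicate that a feature is nearly a linear combination of the others.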

##### Which all features you found important and why?

Answer Here.

### 6. Data Scaling

In [None]:
# Scaling your data
# Min-Max Scaling
min_max_scaler = MinMaxScaler(feature_range=(0,1))
data_min_max_scaled = min_max_scaler.fit_transform(df)
print(data_min_max_scaled)

##### Which method have you used to scale you data and why?

I used min-max scaling, which rescales each column to a fixed range, here 0 to 1.
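Min-max scaling maps each value to (x - min) / (max - min). The sketch below shows the formula with hypothetical numbers and helper names. As a variation on the notebook, which fits the scaler on the full DataFrame, it takes the minimum and maximum from the training portion only; this is a suggestion rather than the notebook's method.

```python
def fit_min_max(values):
    """Return the (min, max) pair learned from the given values."""
    return min(values), max(values)

def min_max_transform(values, lo, hi):
    """Rescale values so that lo maps to 0 and hi maps to 1."""
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical closing prices split into train and test
train_close = [10.0, 20.0, 30.0]
test_close = [40.0]

lo, hi = fit_min_max(train_close)  # statistics come from train only
scaled_train = min_max_transform(train_close, lo, hi)
scaled_test = min_max_transform(test_close, lo, hi)
print(scaled_train, scaled_test)
```

Values in the test set can fall outside the 0 to 1 range when they exceed what was seen in training, which is expected with this approach.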

### 8. Data Splitting

In [None]:
# Select the feature and target variables
x = df[['Open', 'High','Low']]  # Features
y = df['Close']  # Target variable

In [None]:
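# Split the data into train and test
# A random train_test_split is left disabled because shuffling would break the time order;
# the chronological split is done in the modelling section instead
#x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)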

##### What data splitting ratio have you used and why?

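We took the last 10 data points for the test set and the remainder for training, as done in the modelling code. The split is chronological rather than random because the data is a time series.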
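A chronological hold-out keeps the most recent observations for testing so that the model is never trained on the future. A minimal sketch is given here, with a hypothetical helper `chronological_split` and a made-up series of 25 values.

```python
def chronological_split(series, n_test):
    """Split an ordered sequence into (train, test), keeping the last n_test for test."""
    if not 0 < n_test < len(series):
        raise ValueError("n_test must be between 1 and len(series) - 1")
    return series[:-n_test], series[-n_test:]

# Hypothetical monthly closing prices, oldest first
closes = list(range(1, 26))
train_part, test_part = chronological_split(closes, n_test=10)
print(len(train_part), len(test_part))
```

The same slicing works on a pandas Series, which is what `df['Close'][:-10]` and `df['Close'][-10:]` do in the modelling cells.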

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
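# Applying ARIMA model (choosing a simple (1,1,1) order for demonstration)
# Hold out the last 10 months as the test set
train = df['Close'][:-10]
test = df['Close'][-10:]
# Pass the order explicitly; without it ARIMA defaults to (0, 0, 0)
arima_model = ARIMA(train, order=(1, 1, 1))
arima_result = arima_model.fit()
arima_forecast = arima_result.forecast(steps=10)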


In [None]:
# Predict the same number of steps as the length of the test data
arima_forecast = arima_result.forecast(steps=len(test))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.
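The two metrics used in this section can be written out directly: MAE averages the absolute errors, while RMSE takes the square root of the averaged squared errors and therefore penalizes large misses more. The sketch below uses hypothetical numbers and the hypothetical helper names `mae` and `rmse`; the notebook itself relies on the scikit-learn functions.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: square root of the average squared error."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)

# Hypothetical actual and forecast closing prices
actual = [10.0, 20.0, 30.0]
forecast = [12.0, 18.0, 33.0]
print(round(mae(actual, forecast), 4), round(rmse(actual, forecast), 4))
```

Both metrics are in the same units as the closing price, so they can be read as a typical forecast error in rupees.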

In [None]:
# Evaluate ARIMA model
arima_mae = mean_absolute_error(test, arima_forecast)
arima_rmse = np.sqrt(mean_squared_error(test, arima_forecast))
print(arima_mae)
print(arima_rmse)

In [None]:
# Visualization of model 1
# Plot the results
plt.figure(figsize=(27, 8))
plt.plot(train, label='Train')
plt.plot(test, label='Test')
plt.plot(test.index, arima_forecast, label='ARIMA Forecast')
plt.legend()
plt.title('ARIMA Model Forecast')
plt.xticks(rotation=90)
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Define the p, d, q parameters to take any value from 0 to 2
p = d = q = range(0, 3)

# Generate all different combinations of p, d and q triplets
pdq = list(itertools.product(p, d, q))

# Find the best ARIMA model parameters
best_aic = np.inf
best_order = None
best_mdl = None

for param in pdq:
    try:
        tmp_mdl = ARIMA(train, order=param).fit()
        tmp_aic = tmp_mdl.aic
        if tmp_aic < best_aic:
            best_aic = tmp_aic
            best_order = param
            best_mdl = tmp_mdl
    except Exception:
        # Skip orders for which the model fails to fit
        continue

print(f'Best ARIMA order: {best_order}')
print(f'Best AIC: {best_aic}')


In [None]:
# Refit ARIMA with a tuned order; (2, 2, 1) is hard-coded here and should match the best order printed by the search above
arima_model = ARIMA(train,order=(2, 2, 1))
arima_result = arima_model.fit()
arima_forecast = arima_result.forecast(steps=10)

In [None]:
# Predict the same number of steps as the length of the test data
arima_forecast = arima_result.forecast(steps=len(test))

In [None]:
arima_mae = mean_absolute_error(test, arima_forecast)
arima_rmse = np.sqrt(mean_squared_error(test, arima_forecast))
print(f'ARIMA MAE: {arima_mae}')
print(f'ARIMA RMSE: {arima_rmse}')

In [None]:
# Plot the results
plt.figure(figsize=(27, 8))
plt.plot(train, label='Train')
plt.plot(test, label='Test')
plt.plot(test.index, arima_forecast, label='ARIMA Forecast')
plt.legend()
plt.title('ARIMA Model Forecast')
plt.xticks(rotation=90)
plt.show()


##### Which hyperparameter optimization technique have you used and why?

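A grid search over the ARIMA order (p, d, q) was used, with each of p, d, and q ranging from 0 to 2. Every candidate was fitted on the training set and the order with the lowest AIC was selected.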
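The selection rule can be illustrated without fitting real models. In the sketch below, `toy_fit` is a hypothetical stand-in that returns a made-up log-likelihood and parameter count for each order, and one order is made to fail on purpose; only the AIC formula, AIC = 2k - 2 ln(L), and the keep-the-lowest loop reflect the actual technique.

```python
import itertools

def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def toy_fit(order):
    """Hypothetical stand-in for a model fit; returns (log-likelihood, n_params)."""
    p, d, q = order
    if order == (2, 2, 2):
        raise ValueError("toy example of a fit that fails")
    log_lik = -100.0 + 4.0 * (p >= 1) + 4.0 * (q >= 1) + 0.5 * (p + q) - abs(d - 1)
    return log_lik, p + q + 1

def select_order(candidates, fit_fn):
    """Return the (order, AIC) pair with the lowest AIC, skipping failed fits."""
    best_order, best_aic = None, float("inf")
    for order in candidates:
        try:
            log_lik, n_params = fit_fn(order)
        except ValueError:
            continue  # this candidate could not be fitted
        score = aic(log_lik, n_params)
        if score < best_aic:
            best_order, best_aic = order, score
    return best_order, best_aic

candidates = list(itertools.product(range(3), repeat=3))  # 27 (p, d, q) triplets
best_order, best_aic = select_order(candidates, toy_fit)
print(best_order, best_aic)
```

Because AIC charges two units per extra parameter, a larger order wins only when it improves the log-likelihood by more than one unit per added parameter.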


##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Fit SARIMA model
sarima_model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
sarima_model_fit = sarima_model.fit(disp=False)

In [None]:
# Forecast
sarima_forecast = sarima_model_fit.get_forecast(steps=len(test))
sarima_forecast_values = sarima_forecast.predicted_mean

In [None]:
print(sarima_forecast_values)

In [None]:
sarima_mae = mean_absolute_error(test, sarima_forecast_values)
sarima_rmse = np.sqrt(mean_squared_error(test, sarima_forecast_values))

print(f'SARIMA MAE: {sarima_mae}')
print(f'SARIMA RMSE: {sarima_rmse}')

In [None]:
# Plot the results
plt.figure(figsize=(27, 8))
plt.plot(train, label='Train')
plt.plot(test, label='Test')
plt.plot(test.index, sarima_forecast_values, label='SARIMA Forecast')
plt.legend()
plt.title('SARIMA Model Forecast')
plt.xticks(rotation=90)
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Define the p, d, q, P, D, Q parameters to take the value 0 or 1
p = d = q = range(0, 2)
P = D = Q = range(0, 2)
s = [12]  # Seasonality is fixed to 12 (monthly data)

# Generate all combinations of (p, d, q) triplets and (P, D, Q, s) seasonal orders
pdq = list(itertools.product(p, d, q))
seasonal_pdq = list(itertools.product(P, D, Q, s))

best_aic = np.inf
best_order = None
best_seasonal_order = None
best_mdl = None

for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            tmp_mdl = SARIMAX(train, order=param, seasonal_order=param_seasonal).fit(disp=False)
            tmp_aic = tmp_mdl.aic
            if tmp_aic < best_aic:
                best_aic = tmp_aic
                best_order = param
                best_seasonal_order = param_seasonal
                best_mdl = tmp_mdl
        except Exception:
            # Skip parameter combinations for which the model fails to fit
            continue


In [None]:
print(f'Best SARIMA order: {best_order}')
print(f'Best Seasonal order: {best_seasonal_order}')
print(f'Best AIC: {best_aic}')

In [None]:
# Fit the best SARIMA model
sarima_model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 0, 12))
sarima_model_fit = sarima_model.fit(disp=False)

# Forecast
sarima_forecast = sarima_model_fit.get_forecast(steps=len(test))
sarima_forecast_values = sarima_forecast.predicted_mean

In [None]:
# Evaluate SARIMA model
sarima_rmse = np.sqrt(mean_squared_error(test, sarima_forecast_values))
print(f'SARIMA RMSE: {sarima_rmse}')

In [None]:
# Plot the results for SARIMA
plt.figure(figsize=(27, 8))
plt.plot(train, label='Train')
plt.plot(test, label='Test')
plt.plot(test.index, sarima_forecast_values, label='SARIMA Forecast')
plt.legend()
plt.title('SARIMA Model Forecast')
plt.xticks(rotation=90)
plt.show()

##### Which hyperparameter optimization technique have you used and why?

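A grid search over the SARIMA order (p, d, q) and seasonal order (P, D, Q, s) was used, with each of p, d, q, P, D, and Q taking the value 0 or 1 and the seasonal period s fixed at 12. Every combination was fitted on the training set and the one with the lowest AIC was selected.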
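The size of this search space is easy to check before running it, since every combination means one model fit. A short sketch that only enumerates the candidates (no models are fitted here) is shown below; the variable `grid` is introduced for illustration.

```python
import itertools

# Same ranges as in the tuning cell: each parameter is 0 or 1, period fixed at 12
p = d = q = range(0, 2)
P = D = Q = range(0, 2)
s = [12]

pdq = list(itertools.product(p, d, q))              # non-seasonal orders
seasonal_pdq = list(itertools.product(P, D, Q, s))  # seasonal orders
grid = list(itertools.product(pdq, seasonal_pdq))   # every pairing of the two

print(len(pdq), len(seasonal_pdq), len(grid))
```

Widening each range by one value would raise the count from 64 to 729 fits, so the ranges are worth keeping small on a short monthly series.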

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
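The save and load cells are empty in this notebook. As a rough sketch of the round trip they describe, the example below pickles an object to a file and reads it back. The dictionary `model_info` is a hypothetical stand-in for the fitted model, and the file name is an assumption; the example only demonstrates the mechanics of `pickle.dump` and `pickle.load`.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted model object
model_info = {"order": (1, 1, 1), "seasonal_order": (1, 1, 0, 12)}

# Save to a file in a temporary directory (the file name is an assumption)
path = os.path.join(tempfile.mkdtemp(), "best_model.pkl")
with open(path, "wb") as fh:
    pickle.dump(model_info, fh)

# Load it back and check that nothing was lost
with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == model_info)
```

For the sanity check, the restored object would then be used to forecast the held-out months and the result compared with the forecast made before saving.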

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

There are no missing values in this dataset, and the outliers were kept because they are significant for this dataset. We used Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) as evaluation metrics, and ARIMA and SARIMA models to forecast the closing price. We observed that SARIMA worked better than ARIMA, so we will consider the SARIMA model for this dataset.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***