<a href="https://colab.research.google.com/github/ashishj-7523/retail-sales-prediction/blob/main/Retail_Sales_Prediction_Capstone_Project_by_AJ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Retail Sales Prediction**



##### **Project Type**    - EDA/Regression
##### **Contribution**    - Individual
##### **Ashish Kumar Jha**

# **Project Summary -**


**Project Summary: Predicting Retail Sales for Rossmann Drug Stores**

Retail sales prediction is a critical task for businesses, and Rossmann, with its vast network of drug stores across Europe, is no exception. Accurate sales forecasts are vital for optimizing inventory management, staff scheduling, and marketing strategies. In this capstone project, we are tasked with predicting daily sales for 1,115 Rossmann stores using historical sales data and additional store-related information.

**Business Context:**

Rossmann's extensive operations encompass over 3,000 drug stores in seven European countries. Store managers face the challenging task of forecasting sales for up to six weeks in advance. These forecasts depend on numerous factors, including promotional activities, competition from other stores, school and state holidays, seasonality, and the geographical location of each store. Furthermore, some stores have been temporarily closed for renovation, adding complexity to the prediction task. Currently, store managers rely on their own insights and circumstances, resulting in varying prediction accuracies.

**Data Overview:**

The dataset provided for this project comprises two primary components: the Rossmann stores data and the store data. The Rossmann stores data contains essential information such as the store number, day of the week, date, sales, the number of customers, store open status, and indicators for promotions and holidays. The store data provides additional details about each store, including its type, assortment, competition distance, and information about ongoing promotions.

**Project Steps:**

**Data Preparation and Exploration:**

In this initial step, we load the provided datasets into our working environment. We then explore the data's structure, check for missing values, and familiarize ourselves with its content. A deep understanding of the data's format and characteristics is crucial for subsequent steps.

**Data Preprocessing:**

Data preprocessing is fundamental for ensuring our data is suitable for analysis and modeling. This step includes addressing missing values, converting date columns into datetime objects, encoding categorical variables (e.g., one-hot encoding), and potentially performing feature engineering. Merging the two datasets using a common key, such as the 'Store' column, allows us to combine store-specific information with sales data.
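
As a rough illustration of the encoding step, the sketch below one-hot encodes two categorical store attributes with `pd.get_dummies`. The rows and the `demo` name are made up for illustration and are not taken from the Rossmann files.

```python
import pandas as pd

# Illustrative rows only; column names follow the data description.
demo = pd.DataFrame({
    "Store": [1, 2, 3],
    "StoreType": ["c", "a", "a"],
    "Assortment": ["a", "a", "c"],
})

# Each category becomes its own indicator column, e.g. StoreType_a.
encoded = pd.get_dummies(demo, columns=["StoreType", "Assortment"])
print(encoded.columns.tolist())
```

Indicator columns of this kind can be passed to most scikit-learn regressors, which do not accept raw string labels.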

**Exploratory Data Analysis (EDA):**

EDA involves visualizing and analyzing the data to gain insights into its distribution, patterns, and relationships. Through histograms, scatter plots, and summary statistics, we can understand the distribution of sales, customer behavior, and the impact of factors like promotions and holidays.

**Feature Selection:**

Feature selection helps us identify the most relevant variables for predicting sales. We can use techniques like feature importance from a machine learning model or correlation analysis to determine which features have the most significant impact.
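
One simple form of the correlation analysis mentioned above is to rank features by their absolute correlation with the target. The sketch below uses synthetic data (the coefficients, the `Noise` column, and the `df_demo` name are assumptions for illustration), so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for real columns; values are invented.
promo = rng.integers(0, 2, size=n)
customers = rng.normal(600, 100, size=n) + 150 * promo
noise = rng.normal(0, 1, size=n)  # unrelated to the target by construction
sales = 8 * customers + 500 * promo + rng.normal(0, 300, size=n)

df_demo = pd.DataFrame(
    {"Customers": customers, "Promo": promo, "Noise": noise, "Sales": sales}
)

# Absolute Pearson correlation of each feature with the target, strongest first.
corr_with_sales = (
    df_demo.corr()["Sales"].drop("Sales").abs().sort_values(ascending=False)
)
print(corr_with_sales)
```

Pearson correlation only captures linear relationships, so a low score does not by itself prove that a feature is useless.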

**Model Building:**

Model building involves selecting an appropriate machine learning model, splitting the dataset into training and testing sets, training the model on the training data, and evaluating its performance on the testing data. In this step, we start with a simple model like Linear Regression and gradually explore more complex models like Random Forest or XGBoost.
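
The text above does not say how the data is split. Because the forecasts look up to six weeks ahead, one reasonable option (an assumption here, not necessarily the split used later in this notebook) is a chronological hold-out of the final six weeks. A minimal sketch on an invented daily series:

```python
import pandas as pd

# Invented daily series; only the dates matter for the split.
dates = pd.date_range("2015-01-01", "2015-07-31", freq="D")
df_demo = pd.DataFrame({"Date": dates, "Sales": range(len(dates))})

# Hold out the final six weeks (42 days) as the test period.
cutoff = df_demo["Date"].max() - pd.Timedelta(weeks=6)
train = df_demo[df_demo["Date"] <= cutoff]
test = df_demo[df_demo["Date"] > cutoff]

print(len(train), len(test))
```

Splitting by date keeps every training row earlier than every test row, which a random split would not guarantee.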

**Model Tuning:**

Model tuning involves optimizing the model's hyperparameters to improve its performance. Techniques like Grid Search or Random Search can be used to find the best set of hyperparameters.
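
A small sketch of Grid Search, assuming scikit-learn is installed as in the import cell below. The synthetic data, the `alpha` grid, and the choice of `Lasso` are illustrative assumptions rather than the tuning setup of this project.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)

# Synthetic regression problem: only the first two features matter.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Try every alpha in the grid with 5-fold cross-validation.
param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0]}
search = GridSearchCV(
    Lasso(max_iter=10_000),
    param_grid,
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)

print(search.best_params_)
```

`RandomizedSearchCV` has a similar interface and samples a fixed number of candidates instead of trying all of them, which helps when the grid is large.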

**Model Interpretation:**

Model interpretation helps us understand which factors have the most significant impact on sales. Techniques like feature importance analysis and partial dependence plots can provide insights into how variables affect the target variable.

**Prediction:**

After training and fine-tuning the model, we can use it to make sales predictions on the test set. This step allows us to assess how well our model generalizes to unseen data.
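
To make "how well the model generalizes" concrete, the sketch below computes three common regression metrics with NumPy. The helper name `regression_report` and the sample numbers are hypothetical. Rows with zero actual sales are skipped for MAPE because the percentage error is undefined there, which is a design choice of this sketch.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MAE, RMSE and MAPE (in percent) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true

    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))

    # Percentage error is undefined when the actual value is zero.
    nonzero = y_true != 0
    mape = np.mean(np.abs(errors[nonzero] / y_true[nonzero])) * 100

    return {"MAE": mae, "RMSE": rmse, "MAPE_%": mape}

report = regression_report([100, 200, 0, 400], [110, 190, 0, 380])
print(report)
```

MAE and RMSE are in the same units as sales, while MAPE is relative, so it is easier to compare across stores of different sizes.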

**Conclusion and Reporting:**

In the final step, we summarize our findings, including the model's performance and any insights gained from the data. We can create visualizations and reports to present our results to stakeholders and provide recommendations based on our analysis.

In conclusion, this capstone project aims to leverage historical sales data and store-specific information to predict retail sales accurately. By following these step-by-step procedures and applying various data analysis and machine learning techniques, we can help Rossmann optimize its store operations and improve sales forecasting accuracy.

# **GitHub Link -**

https://github.com/ashishj-7523/retail-sales-prediction

# **Problem Statement**



**Problem Statement: Predicting Retail Sales for Rossmann Drug Stores**

In the highly competitive retail industry, accurate sales forecasting is of paramount importance for effective business management and strategy development. Our task is to address the sales prediction challenge faced by Rossmann, a prominent retail chain operating over 3,000 drug stores across seven European countries.

**Business Context:**

Rossmann's store managers are tasked with forecasting daily sales for each store, up to six weeks in advance. These sales forecasts are vital for various aspects of store management, including inventory planning, staff scheduling, and marketing campaign optimization. However, predicting sales accurately is a complex endeavor, as it depends on a multitude of factors, including:

**Promotions:**

The impact of ongoing promotions on sales needs to be understood and predicted.

**Competition:**

The presence of nearby competitors can significantly affect a store's sales.

**Seasonality:**

Sales patterns vary throughout the year due to seasonal trends.

**Holidays:**

School and state holidays influence customer traffic and purchasing behavior.

**Geographical Factors:**

Each store's location and the local demographics may impact sales.

**Store-Specific Information:**

Store characteristics such as type and assortment also play a role.

**Store Closure:**

Some stores may have been temporarily closed for refurbishment, affecting historical sales data.

**Project Objective:**

Our primary objective is to develop a robust sales prediction model that can accurately forecast daily sales for Rossmann drug stores. The model should take into account the various factors mentioned above and provide reliable sales forecasts for future periods.

## Data Description

#### Rossmann Stores Data.csv - historical data including Sales
#### store.csv  - supplemental information about the stores


#### Data fields
#### Most of the fields are self-explanatory.

* **Id** - an Id that represents a (Store, Date) duple within the set
*  **Store** - a unique Id for each store
*  **Sales** - the turnover for any given day (Dependent Variable)
* **Customers** - the number of customers on a given day
* **Open** - an indicator for whether the store was open: 0 = closed, 1 = open
* **StateHoliday** - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
* **SchoolHoliday** - indicates if the (Store, Date) was affected by the closure of public schools
* **StoreType** - differentiates between 4 different store models: a, b, c, d
* **Assortment** - describes an assortment level: a = basic, b = extra, c = extended. An assortment strategy in retailing involves the number and type of products that stores display for purchase by consumers.
* **CompetitionDistance** - distance in meters to the nearest competitor store
* **CompetitionOpenSince**[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
* **Promo** - indicates whether a store is running a promo on that day
* **Promo2** - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
* **Promo2Since**[Year/Week] - describes the year and calendar week when the store started participating in Promo2
* **PromoInterval** - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
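
The `PromoInterval` format can be turned into a per-month flag. The helper below is a hypothetical sketch: the function name is invented, and matching on the first three letters (so that a spelling such as "Sept" would also be recognised) is a defensive assumption rather than something stated in the data description.

```python
MONTH_ABBREVIATIONS = [
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
]

def promo2_starts_in_month(promo_interval, month):
    """Return True if a new Promo2 round starts in `month` (1-12)."""
    # Missing intervals (NaN or empty) mean there is no round to start.
    if not isinstance(promo_interval, str) or not promo_interval:
        return False
    # Compare on the first three letters of each listed month.
    starts = {name.strip()[:3] for name in promo_interval.split(",")}
    return MONTH_ABBREVIATIONS[month - 1] in starts

print(promo2_starts_in_month("Feb,May,Aug,Nov", 5))
print(promo2_starts_in_month("Feb,May,Aug,Nov", 6))
```

Applied row by row together with the month of `Date`, a flag like this could serve as an engineered feature.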

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
from google.colab import drive
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import math
import random
import pickle
import json
import datetime

from scipy import stats

from sklearn.preprocessing import RobustScaler, MinMaxScaler, LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error,mean_squared_error, mean_absolute_percentage_error
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import xgboost as xgb

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
# Mount Google Drive
drive.mount('/content/drive')

In [None]:
# Load Dataset

# Load our first dataset
data1 = pd.read_csv('/content/drive/My Drive/Capstone Project Retail Sales Prediction/Rossmann Stores Data.csv')

# Load our second dataset

data2 = pd.read_csv('/content/drive/My Drive/Capstone Project Retail Sales Prediction/store.csv')

### Dataset First View

In [None]:
# Dataset First Look
# represent first 5 rows
data1.head()

In [None]:
# represent last 5 rows

data1.tail()

In [None]:
# represent first 5 rows

data2.head()

In [None]:
# represent last 5 rows

data2.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

data1.shape

In [None]:
data2.shape

### Dataset Information

In [None]:
# Dataset Info

data1.info()

We can see that there are no null values in this dataset

In [None]:
data2.info()

There are many null values in most of the columns

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

data1.duplicated().sum()

This means data1 contains no duplicate rows.

In [None]:
data2.duplicated().sum()

data2 is also clean: it contains no duplicate rows.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

data1.isnull().sum()

No Null values here.

In [None]:
data2.isnull().sum()

Many Null values present here.

In [None]:
# Visualizing the missing values

# Create a heatmap of missing values
plt.figure(figsize=(10, 6))
sns.heatmap(data2.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

Here, we can see the distribution of missing values in different columns and at different places.
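
A heatmap shows where values are missing, while a small table gives the exact share per column. The sketch below uses an invented frame (`store_demo` and its values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Invented rows that mimic the shape of the store data.
store_demo = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "CompetitionDistance": [1270.0, 570.0, np.nan, 620.0],
    "Promo2SinceWeek": [np.nan, 13.0, 14.0, np.nan],
})

# Count and percentage of missing values per column, worst first.
missing = pd.DataFrame({
    "missing_count": store_demo.isnull().sum(),
    "missing_pct": store_demo.isnull().mean() * 100,
}).sort_values("missing_pct", ascending=False)

print(missing)
```

The same two expressions can be applied to `data2` directly.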

### What did you know about your dataset?

**Data1:**

1. **Null Values:** There are no null values in Data1.

2. **Data Type Adjustment:** The date column in Data1 is currently stored as an object (likely a string). To facilitate time-based analysis, we need to convert it to a datetime data type.

3. **Duplicates:** No duplicate rows were found in Data1.

**Data2:**

1. **Null Values:** Data2 exhibits a significant number of missing values across many columns. This suggests that we'll need to handle missing data during the data preprocessing phase.

2. **Data Type:** Further examination of Data2's columns may be necessary to ensure that data types are appropriate for their respective purposes.

3. **Duplicates:** Similar to Data1, no duplicate rows were identified in Data2.

In summary, Data1 appears to be relatively clean, with no missing values or duplicates. However, we do need to convert the date column's data type for time-based analysis. In contrast, Data2 presents a data quality challenge due to numerous missing values, which will require careful handling during data preprocessing. Additionally, verifying and adjusting data types in Data2 may be necessary for accurate analysis and modeling.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

data1.columns

In [None]:
data2.columns

In [None]:
# Dataset Describe

data1.describe()

In [None]:
data2.describe()

### Variables Description

**Data1:**

1. **Store:** Unique identifier for each Rossmann store.

2. **DayOfWeek:** Day of the week when the sales data was recorded (1 for Monday, 7 for Sunday).

3. **Date:** Date of sales records (to be converted to datetime data type).

4. **Sales:** Daily sales figures for each store.

5. **Customers:** Number of customers who visited the store on a given day.

6. **Open:** Binary variable (0 or 1) indicating whether the store was open (1) or closed (0) on a particular day.

7. **Promo:** Binary variable (0 or 1) indicating whether a promotional campaign was active for the store on the recorded day.

8. **StateHoliday:** Indicator of a state holiday: 'a' = public holiday, 'b' = Easter holiday, 'c' = Christmas, '0' = none.

9. **SchoolHoliday:** Binary variable (0 or 1) indicating whether there was a school holiday on the recorded day.

**Data2:**

1. **Store:** Unique identifier for each Rossmann store.

2. **StoreType:** Categorical variable describing the type of store (e.g., 'a', 'b', 'c', 'd').

3. **Assortment:** Categorizes the level of products available in stores (e.g., 'a', 'b', 'c').

4. **CompetitionDistance:** Numerical variable representing the distance to the nearest competitor store.

5. **CompetitionOpenSinceMonth:** Month when the nearest competitor store opened.

6. **CompetitionOpenSinceYear:** Year when the nearest competitor store opened.

7. **Promo2:** Binary variable (0 or 1) indicating whether Promo2 is active for the store.

8. **Promo2SinceWeek:** Calendar week when Promo2 started.

9. **Promo2SinceYear:** Year when Promo2 started.

10. **PromoInterval:** Months in which a new round of Promo2 starts (e.g., 'Feb,May,Aug,Nov').

Understanding these variables is essential for data preprocessing, feature engineering, and subsequent modeling steps. Note that the 'Date' column in Data1 should be converted to a datetime data type for time-based analysis.


### Check Unique Values for each variable.

In [None]:
# Check unique values for each variable in Data1
unique_values_data1 = {col: data1[col].unique() for col in data1.columns}

# Check unique values for each variable in Data2
unique_values_data2 = {col: data2[col].unique() for col in data2.columns}

# Print unique values for Data1
print("Unique Values in Data1:")
for col, values in unique_values_data1.items():
    print(f"{col}: {values}")

# Print unique values for Data2
print("\nUnique Values in Data2:")
for col, values in unique_values_data2.items():
    print(f"{col}: {values}")


In [None]:
# Check the number of unique values for each variable in Data1
num_uniques_data1 = {col: len(data1[col].unique()) for col in data1.columns}

# Check the number of unique values for each variable in Data2
num_uniques_data2 = {col: len(data2[col].unique()) for col in data2.columns}

# Print the number of unique values for Data1
print("Number of Unique Values in Data1:")
for col, num_uniques in num_uniques_data1.items():
    print(f"{col}: {num_uniques}")

# Print the number of unique values for Data2
print("\nNumber of Unique Values in Data2:")
for col, num_uniques in num_uniques_data2.items():
    print(f"{col}: {num_uniques}")
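
As a more compact alternative to the loops above, `DataFrame.nunique` returns the counts in one call. One detail to keep in mind: `len(col.unique())` counts a missing value as its own entry, whereas `nunique()` ignores missing values unless `dropna=False` is passed. The frame below is invented for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "StoreType": ["a", "b", "a", "d"],
    "Promo2": [0, 1, 1, 0],
    "PromoInterval": [None, "Jan,Apr,Jul,Oct", "Feb,May,Aug,Nov", None],
})

# Distinct non-missing values per column.
counts = demo.nunique()

# Distinct values per column, counting missing as one extra value.
counts_with_nan = demo.nunique(dropna=False)

print(counts)
print(counts_with_nan)
```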


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# copy the dataset

df1 = data1.copy()
df2 = data2.copy()

In [None]:
df1

In [None]:
df2

### Handle missing values

In [None]:
# check missing values in df1

df1.isnull().sum()

There are no missing values in df1, so nothing needs to be handled.

In [None]:
# check null values in df2

df2.isnull().sum()

There are many missing values in df2, so they need to be handled.

To handle missing values in df2, we can use various techniques depending on the nature of the data. Here's a suggested approach for each of the columns with missing values:

**CompetitionDistance:**

This numerical variable has only a few missing values, which can be filled with the median or mean of the non-missing values in the same column. The median is used here because it is less sensitive to extreme distances.

In [None]:
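# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas versions
df2['CompetitionDistance'] = df2['CompetitionDistance'].fillna(df2['CompetitionDistance'].median())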
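
To see why the choice between mean and median matters for a skewed column, consider the small invented series below (the distances are made up, not taken from `store.csv`). A single very distant competitor pulls the mean far above the typical value, while the median stays put.

```python
import numpy as np
import pandas as pd

# Invented distances in meters with one extreme value and one gap.
distance = pd.Series([250.0, 400.0, 600.0, 900.0, 75000.0, np.nan])

print(distance.mean())    # dominated by the 75000 entry
print(distance.median())  # the middle of the observed values

# Fill the gap with the median and keep the result.
filled = distance.fillna(distance.median())
print(filled.isnull().sum())
```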


**CompetitionOpenSinceMonth and CompetitionOpenSinceYear:**

 These columns represent the month and year when the nearest competitor store opened. Missing values could indicate that there is no nearby competition. You can fill these missing values with 0 to indicate no competition.

In [None]:
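# Assign the filled columns back instead of relying on inplace=True
df2['CompetitionOpenSinceMonth'] = df2['CompetitionOpenSinceMonth'].fillna(0)
df2['CompetitionOpenSinceYear'] = df2['CompetitionOpenSinceYear'].fillna(0)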


**Promo2SinceWeek, Promo2SinceYear, and PromoInterval:**

 These columns relate to a promotional campaign and its timing. Missing values might suggest that Promo2 was not active. Fill the missing values in 'Promo2SinceWeek' and 'Promo2SinceYear' with 0, and for 'PromoInterval', you can use 'NoPromo' to indicate no promotional intervals.

In [None]:
# Assign the filled columns back instead of relying on inplace=True
df2['Promo2SinceWeek'] = df2['Promo2SinceWeek'].fillna(0)
df2['Promo2SinceYear'] = df2['Promo2SinceYear'].fillna(0)
df2['PromoInterval'] = df2['PromoInterval'].fillna('NoPromo')
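
The reasoning that a missing Promo2 start date means "Promo2 not active" can be checked rather than assumed. The sketch below runs that check on an invented frame. On the real data the same comparison would be made on `data2` before the gaps are filled, and its outcome is not known here.

```python
import numpy as np
import pandas as pd

# Invented rows; the real check would use the unfilled store data.
store_demo = pd.DataFrame({
    "Store": [1, 2, 3],
    "Promo2": [0, 1, 0],
    "Promo2SinceWeek": [np.nan, 13.0, np.nan],
    "PromoInterval": [np.nan, "Jan,Apr,Jul,Oct", np.nan],
})

# Do the gaps occur exactly on the rows where the store is not in Promo2?
missing_week = store_demo["Promo2SinceWeek"].isnull()
no_promo2 = store_demo["Promo2"] == 0
gaps_match = bool((missing_week == no_promo2).all())

print(gaps_match)
```

If the check fails on real data, some gaps are genuinely unknown values, and filling them with 0 would be misleading.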


In [None]:
df2.isnull().sum()

### Remove duplicates

In [None]:
df1.duplicated().sum()

In [None]:
df2.duplicated().sum()

No Duplicates found in both dataset so no need to handle.

### Datatype Conversion

In [None]:
df1.info()

All data types are correct except 'Date', which needs to be converted to datetime.

In [None]:
# Convert the 'Date' column to datetime data type
df1['Date'] = pd.to_datetime(df1['Date'], format='%Y-%m-%d')

# Verify the data type conversion
print(df1['Date'].dtype)
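
With a fixed `format`, `pd.to_datetime` raises an error on the first value that does not match. When the cleanliness of the column is uncertain, passing `errors="coerce"` turns unparseable entries into `NaT` so they can be inspected. The values below are invented for illustration.

```python
import pandas as pd

raw_dates = pd.Series(["2015-07-31", "2015-07-30", "not a date"])

# Unparseable entries become NaT instead of raising an exception.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# List the original strings that could not be parsed.
bad_rows = raw_dates[parsed.isnull()]
print(bad_rows.tolist())
```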


In [None]:
df1.info()

In [None]:
df2.info()

The data types are appropriate for all columns in df2.

### What all manipulations have you done and insights you found?

**Data Manipulations:**

1. **Handling Missing Values in `df2`:**
   - Filled missing values in 'CompetitionDistance' with the median.
   - Filled missing values in 'CompetitionOpenSinceMonth' and 'CompetitionOpenSinceYear' with 0 to indicate no competition.
   - Filled missing values in 'Promo2SinceWeek' and 'Promo2SinceYear' with 0 to indicate no Promo2.
   - Filled missing values in 'PromoInterval' with 'NoPromo' to indicate no promotional intervals.

2. **Data Type Conversion in `df1`:**
   - Converted the 'Date' column from an object data type to a datetime data type for time-based analysis.

3. **Feature Engineering (after merging):**
   - Extracted date-related features such as 'Year', 'Month', 'WeekOfYear', and the day of the week from the 'Date' column of the merged dataset.

**Key Insights:**

1. **Data1 Insights:**
   - Examined the distribution of sales, customers, and other numerical variables.
   - Discovered that there are no missing values in `df1`.
   - Determined that there are no duplicate rows in `df1`.
   - Transformed the 'Date' column to datetime for time-based analysis.
   - Extracted date-related features to facilitate time-series analysis.

2. **Data2 Insights:**
   - Identified missing values in several columns such as 'CompetitionDistance,' 'CompetitionOpenSinceMonth,' 'CompetitionOpenSinceYear,' 'Promo2SinceWeek,' 'Promo2SinceYear,' and 'PromoInterval.'
   - Handled missing values by filling them appropriately based on domain knowledge.
   - Confirmed that there are no duplicate rows in `df2`.
   - Ensured that data types for all columns in `df2` are appropriate for their respective purposes.

These data manipulations and insights provide a solid foundation for further data analysis, feature engineering, and modeling in this retail sales prediction project. They ensure that the data is clean, well-structured, and ready for in-depth analysis.


### Merging both the Datasets

In [None]:
# merge the datasets on store data
df = df1.merge(right=df2, on="Store", how="left")
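
A left merge silently duplicates rows if the right table has repeated keys, and silently leaves gaps if a key is missing on the right. The sketch below shows two safeguards that `DataFrame.merge` offers, `validate` and `indicator`, on invented frames (store 3 is deliberately absent from the store table).

```python
import pandas as pd

sales_demo = pd.DataFrame({"Store": [1, 1, 2, 3], "Sales": [5263, 5020, 6064, 8314]})
stores_demo = pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]})

# validate raises MergeError if 'Store' is not unique in the right table;
# indicator adds a '_merge' column recording where each row was matched.
merged = sales_demo.merge(
    stores_demo, on="Store", how="left", validate="many_to_one", indicator=True
)

# Sales rows with no matching store record.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), unmatched["Store"].tolist())
```

The row count of the merged frame should equal the row count of the sales table. Any difference points to duplicate store IDs.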

In [None]:
df

In [None]:
df.info()

In [None]:
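# Extract some features from the date column

df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
# Series.dt.weekofyear was removed in pandas 2.0; use the ISO calendar week instead
df['WeekOfYear'] = df['Date'].dt.isocalendar().week.astype(int)
# Monday = 1 ... Sunday = 7; this recomputes the existing 'DayOfWeek' column from 'Date'
df['DayOfWeek'] = df['Date'].dt.dayofweek + 1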

In [None]:
df

In [None]:
df.sample(10)



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
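
If the target turns out to be right-skewed, a log transform is one option. The sketch below uses `np.log1p` so that zero values stay defined, and `np.expm1` to map predictions back to the original scale. The values are toy numbers, not project data.

```python
import numpy as np

# Toy sales values, including a zero (for example, a closed store).
sales = np.array([0.0, 3500.0, 5200.0, 8300.0, 41000.0])

# log1p(x) = log(1 + x), so a value of 0 maps to 0 instead of -inf.
log_sales = np.log1p(sales)

# expm1 is the exact inverse; apply it to model predictions made on the log scale.
restored = np.expm1(log_sales)
```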

### 6. Data Scaling

In [None]:
# Scaling your data
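
A minimal sketch of standardization with scikit-learn's `StandardScaler`, assuming a scale-sensitive model (for example, a regularized linear model) is used. The scaler is fitted on training rows only so that no information from the test rows leaks into it. The arrays are toy values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature arrays standing in for train and test splits.
X_train = np.array([[570.0], [1270.0], [14130.0], [620.0]])
X_test = np.array([[29910.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
```

Tree-based models are largely insensitive to feature scale, so this step may be unnecessary for them.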

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
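
Because the task is to forecast future sales, a chronological split may be more realistic than a random one. The sketch below holds out the most recent days as the test set. The column names, the toy date range, and the `holdout_days` value are assumptions; on real data a holdout matching the six-week forecast horizon could be considered.

```python
import pandas as pd

# Toy daily series; "Date" and "Sales" are assumed column names.
df = pd.DataFrame(
    {
        "Date": pd.date_range("2015-01-01", periods=10, freq="D"),
        "Sales": range(10),
    }
).sort_values("Date")

holdout_days = 3  # hypothetical; chosen only to fit the toy data
cutoff = df["Date"].max() - pd.Timedelta(days=holdout_days)

# Everything up to the cutoff is used for training, the rest for testing.
train = df[df["Date"] <= cutoff]
test = df[df["Date"] > cutoff]
```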

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model
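
As an illustration of the fit and predict steps, the sketch below trains a linear regression baseline on synthetic data. The model choice, the features, and the data-generating formula are all assumptions made so that the example runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data: two features with a linear effect on "sales" plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = 3000 + 2500 * X[:, 0] + 800 * X[:, 1] + rng.normal(0, 100, size=200)

# Simple ordered split: first 160 rows for training, last 40 for testing.
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# Fit the algorithm
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the held-out rows
y_pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
```

A simple baseline like this gives a reference score against which more complex models can be compared.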

#### 1. Explain the ML Model used and its performance using an Evaluation Metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
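
Before scores can be charted they have to be computed. The sketch below computes MAE, RMSE, and RMSPE with NumPy on toy arrays. The helper name `regression_scores` is hypothetical, and skipping zero-sales rows in RMSPE is a choice made here to avoid division by zero.

```python
import numpy as np


def regression_scores(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Return MAE, RMSE and RMSPE; RMSPE ignores rows where y_true is 0."""
    errors = y_true - y_pred
    nonzero = y_true != 0
    pct_errors = errors[nonzero] / y_true[nonzero]
    return {
        "MAE": float(np.mean(np.abs(errors))),
        "RMSE": float(np.sqrt(np.mean(errors**2))),
        "RMSPE": float(np.sqrt(np.mean(pct_errors**2))),
    }


# Toy values only; not results from this project.
y_true = np.array([100.0, 200.0, 0.0, 400.0])
y_pred = np.array([110.0, 180.0, 5.0, 400.0])
scores = regression_scores(y_true, y_pred)
```

The resulting dictionary can be passed to a bar chart, one bar per metric or per model.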

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian Optimization)

# Fit the Algorithm

# Predict on the model
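
A minimal sketch of grid search with time-aware cross-validation, assuming scikit-learn. `TimeSeriesSplit` keeps each validation fold after its training fold, which suits data ordered by date. The estimator, the parameter grid, and the synthetic data are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic data, assumed to be ordered by time.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(120, 3))
y = 5000 * X[:, 0] + rng.normal(0, 50, size=120)

search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 6]},       # deliberately tiny grid
    cv=TimeSeriesSplit(n_splits=3),         # validation folds always follow training folds
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the negated RMSE
)

# Fit the algorithm across all grid points and folds
search.fit(X, y)

best_depth = search.best_params_["max_depth"]
best_rmse = -search.best_score_

# Predict with the model refitted on the best parameters
y_pred = search.predict(X[-5:])
```

For larger grids, `RandomizedSearchCV` accepts the same `cv` and `scoring` arguments and samples a fixed number of combinations instead of trying them all.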

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with an updated Evaluation Metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using an Evaluation Metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian Optimization)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with an updated Evaluation Metric Score Chart.

Answer Here.

#### 3. Explain what each evaluation metric indicates for the business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using an Evaluation Metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian Optimization)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with an updated Evaluation Metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model you have used and its feature importance using a model explainability tool.

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in pickle or joblib file format for the deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and predict on unseen data as a sanity check.


In [None]:
# Load the File and predict unseen data.
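
The save, reload, and sanity-check steps can be sketched with the standard-library `pickle` module. The model here is a stand-in fitted on toy data, and the file name `sales_model.pkl` is hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model fitted on toy data (y = 2x + 1).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

# Save the fitted model to disk; the file name is hypothetical.
model_path = Path(tempfile.gettempdir()) / "sales_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back and predict on rows the model has not seen.
with open(model_path, "rb") as f:
    loaded = pickle.load(f)

X_unseen = np.array([[4.0], [5.0]])
same = np.allclose(model.predict(X_unseen), loaded.predict(X_unseen))
```

Pickle files should only be loaded from trusted sources, and the scikit-learn version used for loading should match the one used for saving.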

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***