
# **Project Name**    -



##### **Project Type**    - EDA/Regression
##### **Contribution**    - Individual/Team
**Created By**  - Ankit Chauhan


# **Project Summary -**

 **About Rossmann Stores**

Rossmann Stores is a German drugstore chain founded in 1972 by Dirk Roßmann. It is one of the largest drugstore chains in Europe, with over 3,000 stores in Germany, Poland, Hungary, the Czech Republic, Slovakia, Turkey, and Spain.

Rossmann Stores sells a wide range of products, including pharmaceuticals, cosmetics, toiletries, household goods, and food. The company is known for its low prices and its commitment to customer service.

Rossmann Stores has been a pioneer in the use of technology in the drugstore industry. The company was one of the first to introduce self-service checkout and online shopping. Rossmann Stores also uses data analytics to improve its product selection and marketing campaigns.

The company has been successful in recent years, thanks to its focus on low prices, customer service, and technology. Rossmann Stores is well-positioned to continue to grow in the future.

Here are some key facts about Rossmann Stores:

* Founded in 1972
* Over 3,000 stores in 7 countries
* Sells a wide range of products, including pharmaceuticals, cosmetics, toiletries, household goods, and food
* Known for its low prices and commitment to customer service
* A pioneer in the use of technology in the drugstore industry


**About dataset**

The Rossmann store project is a classic case study in retail analytics. The goal of the project is to predict the daily sales for each store in the Rossmann store chain, using historical sales data and features such as promotions and holidays.

The project dataset covers over 1,000 stores and roughly two and a half years of daily sales data. The data includes features such as:

* Store number
* Date
* Sales
* Promotions
* Holidays
* Competition

The project is typically divided into two phases:

1. **Data preparation:** This phase involves cleaning the data, handling missing values, and creating new features.
2. **Model training and evaluation:** This phase involves training various machine learning models to predict sales, and then evaluating the performance of the models.

The Rossmann store project is a challenging but rewarding project for anyone interested in learning about retail analytics and machine learning.

Here are some of the key takeaways from the project:

* **Data preparation is essential.** The quality of the data has a significant impact on the performance of the machine learning models.
* **There is no one-size-fits-all solution.** The best machine learning model for the Rossmann store project will depend on the specific data and the desired outcome.
* **Machine learning can be used to improve business outcomes.** The Rossmann store project has shown that machine learning can be used to improve sales forecasting and inventory management.

# **GitHub Link -**

https://github.com/ankit-chuahan/rossman_sales_analysis__/blob/main/ml_end_term_ankit.ipynb

# **Problem Statement**


**Write Problem Statement Here.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following questions are answered for each algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score, r2_score as r2
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
# Load Dataset
rossman = pd.read_csv('Rossmann Stores Data.csv')
store = pd.read_csv('store.csv')

### Dataset First View

In [None]:
# Dataset First Look
# top n rows of rossman dataset
rossman.head(5)

In [None]:
# top n rows of store dataset
store.head(5)

In [None]:
# merge both datasets
# 'Store' is the common column in both dataframes
df = pd.merge(rossman, store, on = 'Store', how ='left')
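
The left merge assumes that each `Store` id appears only once in the store table. A minimal sketch of how that assumption could be checked, using two small made-up frames (`sales_toy` and `store_toy` are hypothetical stand-ins, not the project files):

```python
import pandas as pd

# Hypothetical toy frames standing in for the sales and store tables.
sales_toy = pd.DataFrame({"Store": [1, 1, 2, 3], "Sales": [5263, 5020, 6064, 8314]})
store_toy = pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]})

# validate="many_to_one" raises a MergeError if a Store id is duplicated in store_toy.
# indicator=True adds a "_merge" column that records where each row was matched.
merged_toy = pd.merge(
    sales_toy,
    store_toy,
    on="Store",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Rows flagged "left_only" have sales records but no store metadata.
unmatched = merged_toy[merged_toy["_merge"] == "left_only"]
print(len(merged_toy), "rows after merge;", len(unmatched), "without store metadata")
```

A left merge keeps every sales row, so the row count should equal that of the sales table; a larger count would point to duplicate keys in the store table.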

In [None]:
# sample  of dataframe
df.sample(5)

In [None]:
# checking shape of dataframe
rossman.shape,store.shape

### Dataset Rows & Columns count

In [None]:
# To check the shape of dataframe
print("The shape of the dataset is", df.shape)
print("No. of rows -", df.shape[0])
print("No. of columns -", df.shape[1])

### Dataset Information

In [None]:
# Dataset Info
# info use to find datatype and memory usage
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# checking duplicates
print("Number of duplicate rows in the dataframe:", df.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# checking null values
df.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(),cmap="Greens")
plt.show()
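
The heatmap shows where values are missing but not how many. A minimal sketch of a per-column summary, run on a small made-up frame (`toy` is hypothetical; the column names are borrowed from the dataset only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing values.
toy = pd.DataFrame({
    "CompetitionDistance": [570.0, np.nan, 14130.0, 620.0],
    "Promo2SinceWeek": [np.nan, 13.0, np.nan, np.nan],
    "Sales": [5263, 6064, 8314, 13995],
})

# Count and percentage of missing values per column.
missing_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(1),
})

# Keep only columns that actually have gaps, worst first.
missing_summary = (
    missing_summary[missing_summary["missing_count"] > 0]
    .sort_values("missing_pct", ascending=False)
)
print(missing_summary)
```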

### What did you know about your dataset?

The dataset contains daily sales records for over 1,000 Rossmann stores over a period of several years. After merging with the store table, each row holds the day's sales and customer count, whether the store was open, promotion and holiday flags, and store-level attributes such as store type, assortment level, and details about the nearest competitor.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

* **Store**: The unique identifier for each store.
* **Date**: The calendar date of the record.
* **DayOfWeek**: The day of the week of the record.
* **Sales**: The turnover of the store on that day. This is the target variable.
* **Customers**: The number of customers in the store on that day.
* **Open**: A binary variable indicating whether the store was open on that day.
* **Promo**: A binary variable indicating whether the store was running a promotion on that day.
* **StateHoliday**: Indicates a state holiday: 0 means no holiday, and a, b, c mark different holiday types.
* **SchoolHoliday**: A binary variable indicating whether the record was affected by a school holiday.
* **StoreType**: The type of store (a, b, c, or d).
* **Assortment**: The assortment level of products sold at the store.
* **CompetitionDistance**: The distance in meters to the nearest competitor store.
* **CompetitionOpenSinceMonth**: The month in which the nearest competitor opened.
* **CompetitionOpenSinceYear**: The year in which the nearest competitor opened.
* **Promo2**: A binary variable indicating whether the store participates in Promo2, a continuing and consecutive promotion.
* **Promo2SinceWeek**: The calendar week in which the store started participating in Promo2.
* **Promo2SinceYear**: The year in which the store started participating in Promo2.
* **PromoInterval**: The months in which Promo2 starts anew, for example "Feb,May,Aug,Nov".
* **Year**: The year of the record, extracted from Date.
* **Month**: The month of the record, extracted from Date.
* **Day**: The day of the month of the record, extracted from Date.


### Check Unique Values for each variable.

# **Checking Distinct values in columns**

In [None]:
df['DayOfWeek'].unique()


In [None]:
df['Open'].unique()


Here 1 means open and 0 means closed.

In [None]:
df['StateHoliday'].unique()

Some values are stored as strings and some as numbers, so the column has to be made consistent. Here 0 indicates no holiday, and a, b, and c indicate different holiday types.

In [None]:
# Replacing all '0' with 0
df['StateHoliday'] = df['StateHoliday'].replace("0",0)

In [None]:
# checking unique values in the StoreType column
df['StoreType'].unique()

So there are 4 store types (a, b, c, d).

In [None]:
# checking unique values in assortment column
df['Assortment'].unique()

There are 3 assortment levels: a = basic, b = extra, c = extended.

In [None]:
# checking unique values in PromoInterval
df['PromoInterval'].unique()

In [None]:
# checking the distinct daily customer counts and how many distinct counts there are
print(df['Customers'].unique())
print('number of distinct daily customer counts:', df['Customers'].nunique())

In [None]:
# checking unique values in Promo
df['Promo'].unique()

Where 1 means there is promotion and 0 means no promotion

In [None]:
# checking unique values in SchoolHoliday
df['SchoolHoliday'].unique()

In [None]:
# checking unique values in CompetitionOpenSinceMonth
df['CompetitionOpenSinceMonth'].unique()

In [None]:
# checking unique values in CompetitionOpenSinceYear column
df['CompetitionOpenSinceYear'].unique()

In [None]:
# checking unique values in promo2 column
df['Promo2'].unique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Encode the categorical (object) columns as integers.

# StateHoliday: 0 = no holiday, a/b/c = different holiday types
df['StateHoliday'] = df['StateHoliday'].replace({'0': 0, 'a': 1, 'b': 2, 'c': 3}).astype(int)

# Assortment: levels a, b, c
df['Assortment'] = df['Assortment'].replace({'a': 0, 'b': 1, 'c': 2}).astype(int)

# StoreType: types a, b, c, d
df['StoreType'] = df['StoreType'].replace({'a': 0, 'b': 1, 'c': 2, 'd': 3}).astype(int)
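
Integer codes give the store types an order (0 < 1 < 2 < 3) that linear models will treat as meaningful. As an alternative worth considering, and not the approach used in this notebook, here is a minimal sketch of one-hot encoding with `pd.get_dummies` on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy frame with the original letter codes.
toy = pd.DataFrame({
    "StoreType": ["a", "c", "d", "a", "b"],
    "Sales": [5263, 6064, 8314, 4822, 10231],
})

# One indicator column per store type; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["StoreType"], prefix="StoreType", dtype=int)
print(encoded)
```

Tree-based models are usually less sensitive to this choice, so the integer codes may be adequate there.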

In [None]:
# Convert "Date" to datetime and extract year, month and day.
# The 'Date' column itself is kept in the dataframe.
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
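
Other calendar features can be derived from the same column. A minimal sketch on two made-up dates (`toy` and the `WeekOfYear` column are illustrative assumptions, not part of the notebook):

```python
import pandas as pd

# Hypothetical toy frame with dates stored as strings, as in the raw file.
toy = pd.DataFrame({"Date": ["2015-07-31", "2013-01-01"]})
toy["Date"] = pd.to_datetime(toy["Date"])

# The .dt accessor exposes the date parts without a Python-level loop.
toy["Year"] = toy["Date"].dt.year
toy["Month"] = toy["Date"].dt.month
toy["Day"] = toy["Date"].dt.day

# ISO calendar week, which may help capture weekly seasonality.
toy["WeekOfYear"] = toy["Date"].dt.isocalendar().week.astype(int)
print(toy)
```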

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Distribution of Customers
# histplot with kde=True replaces the deprecated sns.distplot

sns.histplot(df['Customers'], kde=True)
plt.title('Distribution of Customers')
plt.show()

##### 1. Why did you pick the specific chart?

The distribution plot is used to visualize the distribution of a single variable. In this case, we are using the distribution plot to visualize the distribution of the 'Customers' variable.

##### 2. What is/are the insight(s) found from the chart?

The distribution of customers is positively skewed: most daily records show a relatively small number of customers, while a few store-days have very high counts.
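
Skewness can also be quantified rather than judged by eye. A minimal sketch on synthetic right-skewed data (the lognormal sample is an assumption standing in for customer counts, not the real column):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for daily customer counts.
rng = np.random.default_rng(0)
customers_toy = pd.Series(rng.lognormal(mean=6.0, sigma=0.5, size=5000))

# Sample skewness: values well above 0 indicate a long right tail.
raw_skew = customers_toy.skew()

# log1p compresses the right tail and often brings skewness closer to 0.
log_skew = np.log1p(customers_toy).skew()

print(f"skewness before log1p: {raw_skew:.2f}, after: {log_skew:.2f}")
```

Whether a log transform helps the final model depends on the algorithm, so it is best treated as something to test rather than a default.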


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the gained insights can help creating a positive business impact. For example, the business can use the insights to identify stores that are underperforming in terms of customer traffic. The business can then take steps to improve the performance of these stores, such as increasing marketing efforts or improving the product selection.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(df['Sales'], kde=True)
plt.title('Distribution of Sales')
plt.show()

##### 1. Why did you pick the specific chart?

The distribution plot is used to visualize the distribution of a single variable. In this case, we are using the distribution plot to visualize the distribution of the 'Sales' variable.


##### 2. What is/are the insight(s) found from the chart?

The distribution of sales is positively skewed: most daily records show low to moderate sales, while a few store-days have very high sales.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help creating a positive business impact. For example, the business can use the insights to identify stores that are underperforming in terms of sales. The business can then take steps to improve the performance of these stores, such as increasing marketing efforts or improving the product selection.


#### Chart - 3

In [None]:
# Chart - 3 visualization code
# a bar graph of sales by store type
sns.barplot(x='StoreType', y='Sales', data=df)
plt.title('sales by store type')
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart compares a numeric measure across the levels of a categorical variable. Here, `sns.barplot` shows the mean of `Sales` for each store type.


##### 2. What is/are the insight(s) found from the chart?

The bar chart shows that store type A has the highest average sales, followed by store type B, store type C, and store type D.
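
The bar heights drawn by `sns.barplot` are per-category means, so the same ranking can be read from a `groupby`. A minimal sketch on made-up numbers (`toy` is hypothetical and does not reproduce the real averages):

```python
import pandas as pd

# Hypothetical toy frame using the integer store-type codes.
toy = pd.DataFrame({
    "StoreType": [0, 0, 1, 1, 2],
    "Sales": [5000, 7000, 10000, 12000, 6000],
})

# Mean sales and number of records per store type.
sales_by_type = toy.groupby("StoreType")["Sales"].agg(["mean", "count"])
print(sales_by_type.sort_values("mean", ascending=False))
```

Including the record count alongside the mean helps spot categories whose average rests on very few rows.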


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help creating a positive business impact. For example, the business can use the insights to identify the most successful store type and then open more stores of that type. The business can also use the insights to develop marketing campaigns that are specifically targeted to each store type.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Sales by assortment
sns.barplot(x='Assortment', y='Sales', data=df)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart compares a numeric measure across the levels of a categorical variable. Here, `sns.barplot` shows the mean of `Sales` for each assortment type.


##### 2. What is/are the insight(s) found from the chart?

The bar chart shows that assortment type C has the highest average sales, followed by assortment type B and assortment type A.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help creating a positive business impact. For example, the business can use the insights to identify the most successful assortment type and then stock more products of that type. The business can also use the insights to develop marketing campaigns that are specifically targeted to each assortment type.


#### Chart - 5

In [None]:
# Chart - 5 visualization code
# sales affected by holidays

sns.barplot(x='StateHoliday', y='Sales', data=df)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart compares a numeric measure across the levels of a categorical variable. Here, `sns.barplot` shows the mean of `Sales` for each state holiday code.


##### 2. What is/are the insight(s) found from the chart?

The bar chart shows that average sales are highest for code 0 (no state holiday), followed by holiday codes 1, 2, and 3 (originally `a`, `b`, and `c`).


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help create a positive business impact. For example, the business can use the insights to plan marketing campaigns, promotions, and staffing levels around state holidays, when average sales are lower than on regular days.


#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Sales trend by month
sns.lineplot(x='Month', y='Sales', data=df,hue='Year')
plt.title("Sales trend by month")
plt.show()

##### 1. Why did you pick the specific chart?

The line chart is used to visualize the trend of a variable over time. In this case, we are using the line chart to visualize the trend of sales over time.


##### 2. What is/are the insight(s) found from the chart?

The line chart shows that sales have been increasing over time. There is a seasonal pattern to sales, with sales being highest in the summer months and lowest in the winter months.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help creating a positive business impact. For example, the business can use the insights to plan marketing campaigns and promotions around the seasonal patterns in sales. The business can also use the insights to ensure that they have enough staff on hand to meet the increased demand during the summer months.


#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Sale Vs CompetitionOpenSinceYear
plt.figure(figsize=(15,6))
sns.pointplot(x= 'CompetitionOpenSinceYear', y= 'Sales', data=df, color='green')
plt.xticks(rotation=90)
plt.title('Plot between Sales and Competition Open Since year')
plt.show()


##### 1. Why did you pick the specific chart?

The pointplot is used to visualize the relationship between two variables. In this case, we are using the pointplot to visualize the relationship between sales and the year in which the nearest competitor opened.


##### 2. What is/are the insight(s) found from the chart?

The pointplot shows a negative relationship between sales and the year in which the nearest competitor opened. In other words, stores whose nearest competitor opened more recently tend to have lower sales. Note that this variable describes the competitor, not the age of the Rossmann store itself.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help creating a positive business impact. For example, the business can use the insights to identify stores that are at risk of losing sales to competitors. The business can then take steps to improve the performance of these stores, such as increasing marketing efforts or improving the product selection.


#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Plotting a pie chart of the assortment counts
assortment_counts = df['Assortment'].value_counts()

plt.pie(assortment_counts, labels=assortment_counts.index, autopct='%1.1f%%')
plt.title('Assortment Pie Chart')
plt.show()

##### 1. Why did you pick the specific chart?

The pie chart is used to visualize the proportions of different categories in a dataset. In this case, we are using the pie chart to visualize the proportions of different assortment types in the dataset.


##### 2. What is/are the insight(s) found from the chart?

The pie chart shows that assortment type C is the most common assortment type, followed by assortment type B and assortment type A.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the business can use the insights to ensure that they have a sufficient supply of products of each assortment type. The business can also use the insights to develop marketing campaigns that are specifically targeted to each assortment type.


#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Sales by PromoInterval
sns.barplot(x='PromoInterval', y='Sales', data=df)
plt.title('Sales by PromoInterval')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart compares a numeric measure across the levels of a categorical variable. Here, `sns.barplot` shows the mean of `Sales` for each promo interval.


##### 2. What is/are the insight(s) found from the chart?

The bar chart shows that sales are highest for promo intervals of 0 and 1, and lowest for promo intervals of 5 and 6.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the business can use the insights to adjust the frequency of their promotions. The business can also use the insights to develop marketing campaigns that are specifically targeted to customers who are more likely to make purchases during certain promo intervals.


#### Chart - 10

In [None]:
# Chart - 10 visualization code
# scatterplot of Competition Distance and Sales
sns.scatterplot(x=df['CompetitionDistance'], y=df['Sales'])
plt.title('Sales vs. Competition Distance')
plt.show()

##### 1. Why did you pick the specific chart?

The scatterplot is used to visualize the relationship between two variables. In this case, we are using the scatterplot to visualize the relationship between sales and the distance to the nearest competitor.


##### 2. What is/are the insight(s) found from the chart?

The scatterplot shows a negative relationship between sales and the distance to the nearest competitor. This means that stores located farther from their nearest competitor tend to have lower sales, while stores with a competitor close by tend to have higher sales.
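
A scatterplot suggests the direction of a relationship but not its strength. A minimal sketch of how a correlation coefficient and p-value could back up the visual impression, run on synthetic data (the `distance` and `sales` arrays are made up and do not reproduce the Rossmann figures):

```python
import numpy as np
from scipy import stats

# Synthetic data with a mild negative relationship plus noise.
rng = np.random.default_rng(1)
distance = rng.exponential(scale=5000, size=500)
sales = 6000 - 0.05 * distance + rng.normal(0, 500, size=500)

# Pearson measures linear association; Spearman measures monotonic association
# and is less sensitive to skewed values such as very large distances.
pearson_r, pearson_p = stats.pearsonr(distance, sales)
spearman_rho, spearman_p = stats.spearmanr(distance, sales)

print(f"Pearson r = {pearson_r:.2f} (p = {pearson_p:.3g})")
print(f"Spearman rho = {spearman_rho:.2f} (p = {spearman_p:.3g})")
```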


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the business can use the insights to identify stores that are at risk of losing sales to competitors. The business can then take steps to improve the performance of these stores, such as increasing marketing efforts or improving the product selection.


#### Chart - 11

In [None]:
# Chart - 11 visualization code
sns.scatterplot(x='Customers', y='Sales', data=df)
plt.title('Sales vs. Customers')
plt.show()

##### 1. Why did you pick the specific chart?

The scatterplot is used to visualize the relationship between two variables. In this case, we are using the scatterplot to visualize the relationship between sales and the number of customers.


##### 2. What is/are the insight(s) found from the chart?

The scatterplot shows that there is a positive relationship between sales and the number of customers. This means that stores with more customers tend to have higher sales.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the business can use the insights to identify stores that are underperforming in terms of sales. The business can then take steps to improve the performance of these stores, such as increasing marketing efforts or improving the product selection.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Density distributions of the continuous features
plt.figure(figsize=(20,10))

# First plot (density of Sales)
plt.subplot(2,2,1)
sns.kdeplot(df["Sales"], color="Green", fill=True)  # kernel density estimate (KDE) plot
plt.xlabel("Sales")
plt.ylabel("Density")
plt.title('Density distribution of Sales', size=15)

# Second plot (density of CompetitionDistance)
plt.subplot(2,2,2)
sns.kdeplot(df["CompetitionDistance"], color="Blue", fill=True)  # kernel density estimate (KDE) plot
plt.xlabel("CompetitionDistance")
plt.ylabel("Density")
plt.title('Density distribution of CompetitionDistance', size=15)

# Third plot (density of Customers)
plt.subplot(2,2,3)
sns.kdeplot(df["Customers"], color="Red", fill=True)  # kernel density estimate (KDE) plot
plt.xlabel("Customers")
plt.ylabel("Density")
plt.title('Density distribution of Customers', size=15)

plt.tight_layout()
plt.show()

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Keep only the meaningful columns; drop the ones that add little to the picture
columns_to_drop = ['Year', 'DayOfWeek']
corr_df = df.drop(columns=columns_to_drop)
# StateHoliday was already integer-encoded (0-3) during data wrangling,
# so no further mapping of 'a', 'b', 'c' is needed here.

In [None]:
# correlation heatmap (computed on the numeric columns only)
plt.figure(figsize=(16,10))
sns.heatmap(corr_df.corr(numeric_only=True), cmap="coolwarm", annot=True)
plt.show()

##### 1. Why did you pick the specific chart?

The heatmap is used to visualize the correlation between multiple variables. In this case, we are using the heatmap to visualize the correlation between all of the numerical variables in the dataset.


##### 2. What is/are the insight(s) found from the chart?

- Sales have a strong positive correlation with Customers.
- Sales have a weak positive correlation with CompetitionDistance.
- Sales have a weak negative correlation with PromoInterval.
- Customers have a strong positive correlation with CompetitionDistance.
- Customers have a weak negative correlation with PromoInterval.
- CompetitionDistance has a weak negative correlation with PromoInterval.
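
To read the correlations with the target directly instead of off the heatmap, the `Sales` column of the correlation matrix can be sorted by absolute value. A minimal sketch on synthetic data (`toy` is made up, so the numbers are illustrative only):

```python
import numpy as np
import pandas as pd

# Synthetic frame: Sales is driven by Customers, CompetitionDistance is unrelated noise.
rng = np.random.default_rng(0)
n = 300
customers = rng.integers(200, 1500, size=n)
toy = pd.DataFrame({
    "Customers": customers,
    "Sales": customers * 9 + rng.normal(0, 500, size=n),
    "CompetitionDistance": rng.uniform(100, 20000, size=n),
    "StoreType": rng.choice(["a", "b", "c", "d"], size=n),
})

# numeric_only=True skips text columns such as StoreType.
sales_corr = (
    toy.corr(numeric_only=True)["Sales"]
    .drop("Sales")                           # drop the trivial self-correlation
    .sort_values(key=abs, ascending=False)   # strongest relationships first
)
print(sales_corr)
```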

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

 * Null Hypothesis (H0): There is no significant difference in sales between stores with and without promotions.
 * Alternative Hypothesis (H1): There is a significant difference in sales between stores with and without promotions.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import scipy.stats as stats

# Create two groups based on the 'Promo' flag, which marks whether a promotion
# ran on that day. 'PromoInterval' only lists the Promo2 restart months and
# still contains NaN at this point, so it cannot separate the two groups.
promotion_group = df[df['Promo'] == 1]['Sales']
non_promotion_group = df[df['Promo'] == 0]['Sales']

# Welch's t-test compares the two means without assuming equal variances
t_statistic, p_value = stats.ttest_ind(promotion_group, non_promotion_group, equal_var=False)
print("t-statistic:", t_statistic, "p-value:", p_value)

# Set the significance level
alpha = 0.05

# Make a decision based on the p-value
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in sales between stores with and without promotions.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in sales between stores with and without promotions.")

##### Why did you choose the specific statistical test?

I chose the independent two-sample t-test because it compares the means of two independent groups, here sales with and without promotions. Sales are positively skewed rather than normally distributed, but with a sample this large the sampling distribution of the mean is approximately normal, so the test remains a reasonable choice.
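
Because sales are skewed, a rank-based test is a useful cross-check on the t-test. A minimal sketch on synthetic skewed samples (`promo_sales` and `no_promo_sales` are made up; the real groups would come from the dataframe):

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed samples standing in for the two sales groups.
rng = np.random.default_rng(42)
promo_sales = rng.lognormal(mean=9.0, sigma=0.4, size=2000)
no_promo_sales = rng.lognormal(mean=8.7, sigma=0.4, size=2000)

# Welch's t-test: compares means without assuming equal variances.
t_stat, p_welch = stats.ttest_ind(promo_sales, no_promo_sales, equal_var=False)

# Mann-Whitney U: rank-based, makes no normality assumption.
u_stat, p_mwu = stats.mannwhitneyu(promo_sales, no_promo_sales, alternative="two-sided")

alpha = 0.05
print("Welch t-test rejects H0:", p_welch < alpha)
print("Mann-Whitney U rejects H0:", p_mwu < alpha)
```

If both tests reach the same decision, the conclusion does not hinge on the normality assumption.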


### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
df.isnull().sum()

In [None]:
# Handling Missing Values & Missing Value Imputation
# Replace nulls in CompetitionDistance with the column median.
# Assigning the result back avoids chained in-place modification.
df['CompetitionDistance'] = df['CompetitionDistance'].fillna(df['CompetitionDistance'].median())

# Replace nulls with 0 in the competition-opening and Promo2 columns
zero_fill_columns = [
    'CompetitionOpenSinceMonth',
    'CompetitionOpenSinceYear',
    'Promo2SinceWeek',
    'Promo2SinceYear',
    'PromoInterval',
]
df[zero_fill_columns] = df[zero_fill_columns].fillna(0)

In [None]:
df

#### What all missing value imputation techniques have you used and why did you use those techniques?


**Missing Value Imputation Techniques:**

1. **Median Imputation:** Used for the `CompetitionDistance` column. Missing values are replaced with the column median, which is not pulled toward extreme values the way the mean is.

2. **Zero Imputation:** Used for the `CompetitionOpenSinceMonth`, `CompetitionOpenSinceYear`, `Promo2SinceWeek`, `Promo2SinceYear`, and `PromoInterval` columns. Missing values are replaced with 0, which acts as a marker for "nothing recorded".

**Reasons for Choosing These Techniques:**

1. **Median Imputation:**
    - The `CompetitionDistance` column contains continuous data.
    - The median is a robust measure of central tendency, so a few very distant competitors do not distort the imputed value.
    - The imputed value is a typical distance that lies within the bulk of the observed data.

2. **Zero Imputation:**
    - `CompetitionOpenSinceMonth`, `CompetitionOpenSinceYear`, `Promo2SinceWeek`, and `Promo2SinceYear` hold month, year, and week numbers, and `PromoInterval` is categorical. Imputing a mean or median would invent a date or interval that was never recorded.
    - A 0 is not a valid month, year, week, or interval, so it marks rows where no competition opening date or recurring promotion is recorded.
    - It is simple to implement, but the 0 is a placeholder and should not be read as a real numeric value.
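
The two imputation strategies can be sketched on a toy frame as follows. The values are made up for illustration, and the `Promo2SinceWeek_missing` indicator column is an optional addition that is not part of this notebook; it only shows one way to keep "filled with 0" distinguishable from real values.

```python
import numpy as np
import pandas as pd

# Toy frame reusing two column names from the store data (values are made up)
toy = pd.DataFrame({
    "CompetitionDistance": [100.0, 250.0, np.nan, 90000.0],
    "Promo2SinceWeek": [14.0, np.nan, 40.0, np.nan],
})

# Median imputation: the single extreme value (90000) does not move the median
median_distance = toy["CompetitionDistance"].median()
toy["CompetitionDistance"] = toy["CompetitionDistance"].fillna(median_distance)

# Optional indicator (hypothetical column) recording which rows were missing
toy["Promo2SinceWeek_missing"] = toy["Promo2SinceWeek"].isna().astype(int)

# Zero imputation: 0 is a placeholder for "no Promo2 start week recorded"
toy["Promo2SinceWeek"] = toy["Promo2SinceWeek"].fillna(0)

print(toy)
```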


### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Identify the outliers with a box plot of the 'Sales' variable
sns.boxplot(x='Sales', data=df)

# Treatment used here: winsorize 'Sales', capping the lowest and highest 5% of values
from scipy.stats.mstats import winsorize

df['Sales'] = winsorize(df['Sales'], limits=[0.05, 0.05])

# Alternative (not applied): remove rows outside 1.5 * IQR of 'Sales'.
# Uncomment the last line to drop outliers instead of capping them.
q1 = df['Sales'].quantile(0.25)
q3 = df['Sales'].quantile(0.75)
iqr = q3 - q1

# df = df[(df['Sales'] >= q1 - 1.5*iqr) & (df['Sales'] <= q3 + 1.5*iqr)]

##### What all outlier treatment techniques have you used and why did you use those techniques?


**Outlier Treatment Techniques:**

1. **Winsorizing:** Used to treat outliers in the `Sales` variable. With `limits=[0.05, 0.05]`, the lowest 5% of values are raised to the smallest remaining value and the highest 5% are lowered to the largest remaining value. No rows are dropped.

2. **Removing Outliers:** Alternatively, outliers can be removed from the dataset. This technique is appropriate when outliers are not representative of the underlying population.

**Reasons for Choosing These Techniques:**

1. **Winsorizing:**
    - The `Sales` variable contains continuous data.
    - Winsorizing keeps every row and changes only the capped tails, so the sample size and the central part of the distribution are preserved.
    - Using winsorizing ensures that the outliers do not have an undue influence on the analysis.

2. **Removing Outliers:**
    - This technique is not used in this example. However, it could be appropriate if the outliers are not representative of the underlying population.

**Additional Notes:**

- The choice of outlier treatment technique depends on the specific data and the goals of the analysis.
- It is important to carefully consider the impact of outlier treatment on the results of the analysis.
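
To make the difference between the two treatments concrete, here is a small sketch on made-up numbers (not the Rossmann data). It uses 10% limits instead of 5% only because the toy array has ten values; the variable names `capped` and `kept` are hypothetical.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Made-up daily sales with one extreme value at each end
sales = np.array(
    [10, 4800, 5000, 5100, 5200, 5300, 5400, 5500, 5600, 60000], dtype=float
)

# Winsorizing: cap the lowest and highest 10% of values, keeping every row
capped = np.asarray(winsorize(sales, limits=[0.1, 0.1]))

# IQR rule: drop values outside [q1 - 1.5*iqr, q3 + 1.5*iqr]
q1, q3 = np.percentile(sales, [25, 75])
iqr = q3 - q1
kept = sales[(sales >= q1 - 1.5 * iqr) & (sales <= q3 + 1.5 * iqr)]

print("winsorized:", capped)
print("after IQR removal:", kept)
```

Winsorizing returns the same number of observations with the two extremes pulled in, while the IQR rule returns a shorter array.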

In [None]:
# creating a copy of older df
new_df = df.copy()

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# Perform one-hot encoding on the 'PromoInterval' column of the DataFrame 'new_df'
new_df = pd.get_dummies(new_df, columns=['PromoInterval'],drop_first=True)


#### What all categorical encoding techniques have you used & why did you use those techniques?


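- **One-hot encoding:** Used to encode the `PromoInterval` column. This technique creates one binary feature per category. Because `drop_first=True` is set, the first category is dropped and is represented by all remaining binary features being 0.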

**Reasons for Choosing These Techniques:**

- **One-hot encoding:**
    - The `PromoInterval` column contains multiple categories.
    - One-hot encoding is a simple and effective way to encode categorical variables with multiple categories.
    - It is also a popular technique for encoding categorical variables in machine learning models.
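
A minimal sketch of what `pd.get_dummies` with `drop_first=True` produces is shown below. The toy frame is an assumption for illustration: it uses the string `"0"` for "no promotion interval", whereas the notebook fills missing values with the number 0.

```python
import pandas as pd

# Toy column with three categories; "0" stands for "no promotion interval"
toy = pd.DataFrame(
    {"PromoInterval": ["0", "Jan,Apr,Jul,Oct", "Feb,May,Aug,Nov", "0"]}
)

# drop_first=True removes the first category in sorted order ("0"),
# so three categories become two binary columns
encoded = pd.get_dummies(toy, columns=["PromoInterval"], drop_first=True)

print(encoded)
```

Rows that belonged to the dropped category have 0 (or `False`) in every remaining column, which avoids a redundant column in linear models.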


In [None]:
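# Label encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Cast to str first in case StateHoliday mixes the number 0 with string codes
new_df['StateHoliday'] = le.fit_transform(new_df['StateHoliday'].astype(str))
# One-hot encode on new_df itself, because the data-splitting cells read from new_df
new_df = pd.get_dummies(new_df, columns=['StoreType', 'Assortment', 'StateHoliday'])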

### 4. Data Splitting

In [None]:
# defining dependent variable and independent variable
dependent_variables = 'Sales'

independent_variables = list(new_df.columns.drop(['Promo2SinceYear','Sales','Date']))

In [None]:
# Create the data of independent variables
X = new_df[independent_variables].values

# Create the data of dependent variable
y = new_df[dependent_variables].values

In [None]:
# Split the data into train and test sets (80% train, 20% test)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 0)
print(X_train.shape)
print(X_test.shape)


In [None]:
X_train

In [None]:
X_test

##### What data splitting ratio have you used and why?

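An 80:20 split was used (`test_size=0.2`): 80% of the rows train the model and the remaining 20% are held out for testing. Setting `random_state=0` makes the split reproducible.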

### 5. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
from sklearn.linear_model import LinearRegression

# Fit the Algorithm
# Train a linear regression model on the training set
reg = LinearRegression().fit(X_train, y_train)

# Check the R-squared score on the training data
print("Train R2 :", reg.score(X_train, y_train))

# Predict on the model
# Predict the dependent variable for the test set (20% of the data)
y_pred = reg.predict(X_test)
y_pred

In [None]:
# Check the intercept of the fitted model
reg.intercept_

In [None]:
# Check the coefficients of the independent variables
reg.coef_

In [None]:
#Predicting on Train Dataset
y_pred_train = reg.predict(X_train)
y_pred_train

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
#  evaluation Metric Score chart
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
#Calculate MSE & RMSE for Test Prediction
MSE  = mean_squared_error(y_test, y_pred)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

# calculate the R-squared score between the true target values (y_test) and the predicted values
r2 = r2_score(y_test, y_pred)
print("R2 :" ,r2)


In [None]:
# true target values (y_test) and the corresponding predicted values (y_pred) side by side.
pd.DataFrame(zip(y_test, y_pred), columns = ['actual', 'pred'])
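
For reference, the three metrics above can be reproduced by hand. The sketch below uses four made-up values (not model output from this notebook) and checks the manual formulas against `sklearn.metrics`; the names `y_true` and `y_hat` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

# MSE: mean of squared errors; RMSE: its square root, in the units of the target
mse_manual = np.mean((y_true - y_hat) ** 2)
rmse_manual = np.sqrt(mse_manual)

# R2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE :", mse_manual, "| sklearn:", mean_squared_error(y_true, y_hat))
print("RMSE:", rmse_manual)
print("R2  :", r2_manual, "| sklearn:", r2_score(y_true, y_hat))
```

RMSE is expressed in the same units as `Sales`, which makes it easier to interpret than MSE, while R2 gives the share of variance in the target that the model explains.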

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
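
One possible way to fill in this step is sketched below. It is an assumption, not the method used in this notebook: plain `LinearRegression` has no regularisation hyperparameter to tune, so the sketch swaps in `Ridge` and searches its `alpha` with 5-fold `GridSearchCV`. The data is synthetic and `X_demo` / `y_demo` are hypothetical stand-ins for `X_train` / `y_train`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (stand-in for X_train / y_train)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=200)

# 5-fold grid search over the Ridge regularisation strength
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_demo, y_demo)

# The scorer is negated RMSE, so flip the sign to report RMSE
best_alpha = search.best_params_["alpha"]
best_rmse = -search.best_score_
print("best alpha:", best_alpha, "| cross-validated RMSE:", round(best_rmse, 4))
```

Grid search is exhaustive and easy to reason about for a single hyperparameter; for larger search spaces, randomized search is a common alternative.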

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
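
A minimal sketch of the save, reload, and sanity-check steps is shown below, assuming `joblib` is the chosen format. The model here is a tiny `LinearRegression` fitted on made-up numbers, and the file name `sales_model.joblib` is hypothetical; in the notebook the fitted `reg` model would take its place.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model fitted on made-up data (y = 2 * x)
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X_demo, y_demo)

# Save the model to disk, then load it back (a temporary directory keeps the demo clean)
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "sales_model.joblib")  # hypothetical file name
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

# Sanity check: the reloaded model should predict the same value for unseen input
unseen = np.array([[5.0]])
original_pred = model.predict(unseen)
reloaded_pred = reloaded.predict(unseen)
print("original:", original_pred, "| reloaded:", reloaded_pred)
```

For a real deployment the file would be written to a persistent path, and the same library versions should be used when loading the model as when saving it.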

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***