<a href="https://colab.research.google.com/github/coreyalejandro/Prediction-of-Product-Sales/blob/main/Prediction-of-Product-Sales.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Prediction of Product Sales**

## **Project Overview**

### Introduction
### Understanding The Task

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Certain attributes of each product and store have been defined. The aim is to build a predictive model and predict the sales of each product at a particular outlet.

Using this model, BigMart will try to understand the properties of products and outlets which play a key role in increasing sales.
The data scientists at BigMart also left us with this short, but important message:
>Please note that the data may have missing values as some stores might not report all the data due to technical glitches. Hence, it will be required to treat them accordingly.

### Hypothesis Generation

Based on our understanding of the problem and the domain, as well as our personal experiences and perspectives, we have generated the following hypotheses about the factors that might influence product sales at BigMart:

#### Product-Related Hypotheses

- **Quality**: Products with superior quality, as indicated by user ratings, are likely to have higher sales.
- **Price**: The relationship between price and sales is likely to be complex. Extremely high prices might deter some customers, while very low prices might raise questions about the product's quality. There might be an optimal price range that maximizes sales.
- **Packaging**: The packaging of a product, including its visual appeal and the information it provides, might influence sales. Products with more attractive or informative packaging might have higher sales.
- **Brand**: Products from well-known or highly regarded brands are likely to have higher sales.

#### Store-Related Hypotheses

- **Location**: Stores located in more populated or affluent areas, or in areas with fewer competing stores, are likely to have higher overall sales.
- **Size and Layout**: Larger stores, or stores with more effective layouts, are likely to have higher overall sales.
- **Staff**: Stores with more knowledgeable and helpful staff are likely to have higher overall sales.

#### Sales and Marketing Strategy Hypotheses

- **Product Bundling**: Bundling related products together, or placing them near each other in the store, might increase sales of those products.
- **Sales**: Products that are on sale are likely to have higher sales. However, if a product is always on sale, customers might start to perceive the sale price as the regular price.
- **One-Stop Shopping**: Grouping all the products needed for a particular event or occasion together might increase sales of those products.
- **In-Store Advertising**: Advertising a product in different parts of the store, not just near where it's usually located, might increase sales of that product.

These hypotheses will guide our exploratory data analysis. We will explore the data with these hypotheses in mind, looking for patterns and relationships that might support or contradict them.

## Loading and Inspecting the Data

### Loading

#### Installs and Imports

In [2]:
# Installs (uncomment if a package is missing from your environment)
#!pip install missingno
#!pip install matplotlib-venn
# Note: pprint is part of the Python standard library, so it does not need to be installed.


In [7]:
# Import necessary libraries

# For Data Analysis
import pandas as pd  # A powerful data manipulation library in Python
import numpy as np  # A library that supports multi-dimensional arrays and matrices

# For Data Visualization
import matplotlib.pyplot as plt  # A Python plotting library used for 2D graphics
import seaborn as sns  # A Python data visualization library based on matplotlib
import missingno as msno # A Python library for visualizing missing data
import matplotlib.patches as mpatches  # A module for drawing shapes

# For Working with Dates
from datetime import datetime  # A class that supports manipulating dates and times

# For Displaying Outputs
from IPython.display import display  # A tool for displaying different outputs
import pprint # A Python module that allows for pretty printing of data structures

#### Dataset
The BigMart sales_predictions_2023 dataset can be found [here](https://raw.githubusercontent.com/coreyalejandro/Prediction-of-Product-Sales/3b70bdd21e316dadafe354cd715804825801fdc9/Data/sales_predictions_2023.csv).

#### Data Dictionary

- `Item_Identifier`: Unique product ID
- `Item_Weight`: Weight of product
- `Item_Fat_Content`: Whether the product is low fat or regular
- `Item_Visibility`: The percentage of total display area of all products in a store allocated to the particular product
- `Item_Type`: The category to which the product belongs
- `Item_MRP`: Maximum Retail Price (list price) of the product
- `Outlet_Identifier`: Unique store ID
- `Outlet_Establishment_Year`: The year in which store was established
- `Outlet_Size`: The size of the store in terms of ground area covered
- `Outlet_Location_Type`: The type of city in which the store is located
- `Outlet_Type`: Whether the outlet is a grocery store or some sort of supermarket
- `Item_Outlet_Sales`: Sales of the product in the particular store. This is the target variable to be predicted.

#### DataFrame

In [8]:
# Loading dataset and reading it into Pandas DataFrame
data_url = 'https://raw.githubusercontent.com/coreyalejandro/Prediction-of-Product-Sales/3b70bdd21e316dadafe354cd715804825801fdc9/Data/sales_predictions_2023.csv'
df_raw = pd.read_csv(data_url)

# Check that the load succeeded by previewing the first five rows of the DataFrame.
pprint.pprint(df_raw.head())
print('\nThe data appear to have loaded successfully.')

  Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility  \
0           FDA15         9.30          Low Fat         0.016047   
1           DRC01         5.92          Regular         0.019278   
2           FDN15        17.50          Low Fat         0.016760   
3           FDX07        19.20          Regular         0.000000   
4           NCD19         8.93          Low Fat         0.000000   

               Item_Type  Item_MRP Outlet_Identifier  \
0                  Dairy  249.8092            OUT049   
1            Soft Drinks   48.2692            OUT018   
2                   Meat  141.6180            OUT049   
3  Fruits and Vegetables  182.0950            OUT010   
4              Household   53.8614            OUT013   

   Outlet_Establishment_Year Outlet_Size Outlet_Location_Type  \
0                       1999      Medium               Tier 1   
1                       2009      Medium               Tier 3   
2                       1999      Medium               Tier

### Inspecting the Data

In [9]:
# Inspect the shape of the DataFrame
print(f'\nThe DataFrame has {df_raw.shape[0]} rows and {df_raw.shape[1]} columns.')

# Inspect the data types of the DataFrame
print('\nData types of the DataFrame:\n')
pprint.pprint(df_raw.dtypes)

# Additional information about the DataFrame
# (info() prints its report directly and returns None, so it is not wrapped in pprint)
print('\n')
print('\nAdditional information about the DataFrame:\n')
df_raw.info()

# Statistical summary of the DataFrame
print('\nStatistical summary of the DataFrame:\n')
pprint.pprint(df_raw.describe())


The DataFrame has 8523 rows and 12 columns.

Data types of the DataFrame:

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object



Additional information about the DataFrame:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility           

### Data Reports


#### Hypotheses Report

Now that we have generated our hypotheses and inspected the data, let's compare the variables we have in the data with the variables we need to test our hypotheses. We'll use a table to visualize this comparison.

First, let's list the variables we need to test our hypotheses:

- Quality (not available in the data)
- Price (available as 'Item_MRP')
- Packaging (not directly available in the data)
- Brand (not directly available, but we can use 'Item_Identifier' as a proxy)
- Store location (available as 'Outlet_Location_Type')
- Store size and layout (available as 'Outlet_Size' and 'Outlet_Type')
- Staff (not available in the data)
- Product bundling (not directly available in the data)
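- Sales promotions (not directly available in the data; 'Item_Outlet_Sales' is the target variable, not an indicator of whether a product was on sale)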
- One-stop shopping (not directly available in the data)
- In-store advertising (not directly available in the data)

As we can see, some of the hypothesized variables are not directly available in the data. However, we can use the available variables as proxies, and/or we can create new features based on the available variables to test our hypotheses. For example, we can use 'Item_Identifier' as a proxy for brand, and we can create new features based on 'Item_Type' and 'Outlet_Type' to test the hypotheses about product bundling and one-stop shopping.

Below, we created a table to visualize the overlap between the hypothesized variables and the variables found in the data.

**<p align="center">Hypotheses Variables vs Data Variables</p>**

| Hypotheses Variable | Data Variable | Inferences (and Assumptions) |
| --- | --- | --- |
| Quality | Not directly available | Could be inferred from 'Item_MRP' (assuming that higher-priced items are of higher quality) and 'Item_Fat_Content' (assuming that low-fat items are of higher quality) |
| Price | 'Item_MRP' | - |
| Packaging | Not directly available | Could be inferred from 'Item_Type' and 'Item_Fat_Content' (assuming that certain types of items have certain types of packaging) |
| Brand | 'Item_Identifier' | - |
| Store location | 'Outlet_Location_Type' | - |
| Store size and layout | 'Outlet_Size' and 'Outlet_Type' | - |
| Staff | Not available | - |
| Product bundling | Not directly available | Could be inferred from 'Item_Type' and 'Outlet_Type' (assuming that certain types of items are more likely to be bundled together and sold at stores of certain types) |
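| Sales (promotions) | Not directly available | 'Item_Outlet_Sales' is the target variable rather than an indicator of whether a product was on sale |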
| One-stop shopping | Not directly available | Could be inferred from 'Item_Type' and 'Outlet_Type' (assuming that stores of certain types offer a wider range of items) |
| In-store advertising | Not directly available | Could be inferred from 'Item_Visibility' (assuming that more visible items are more likely to be advertised in the store) |
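
As a quick sanity check on the table above, the sketch below confirms that the columns named for four of the hypotheses actually exist in a DataFrame. The mapping `HYPOTHESIS_COLUMNS` and the helper `missing_hypothesis_columns` are hypothetical names introduced here for illustration, and the one-row `toy` frame stands in for `df_raw`.

```python
import pandas as pd

# Hypothetical mapping from hypothesis to the column(s) the table above pairs it with.
HYPOTHESIS_COLUMNS = {
    "Price": ["Item_MRP"],
    "Brand": ["Item_Identifier"],
    "Store location": ["Outlet_Location_Type"],
    "Store size and layout": ["Outlet_Size", "Outlet_Type"],
}


def missing_hypothesis_columns(df, mapping=HYPOTHESIS_COLUMNS):
    """Return {hypothesis: [columns absent from df]} for hypotheses with gaps."""
    gaps = {}
    for hypothesis, columns in mapping.items():
        absent = [col for col in columns if col not in df.columns]
        if absent:
            gaps[hypothesis] = absent
    return gaps


# Toy one-row frame standing in for df_raw; 'Outlet_Type' is left out on purpose.
toy = pd.DataFrame({
    "Item_MRP": [249.8092],
    "Item_Identifier": ["FDA15"],
    "Outlet_Location_Type": ["Tier 1"],
    "Outlet_Size": ["Medium"],
})

gaps = missing_hypothesis_columns(toy)
print(gaps)
```

Calling `missing_hypothesis_columns(df_raw)` on the loaded data should return an empty dictionary if every mapped column is present.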

#### Data Inspection Report

Here are the observations from the initial inspection of the data:

- The DataFrame has 8523 rows and 12 columns.
- The columns are a mix of numerical (float64 and int64) and categorical (object) data types.
- There are missing values in the 'Item_Weight' and 'Outlet_Size' columns that we'll need to handle appropriately.
- The 'Item_Outlet_Sales' column, which is our target variable, is a continuous variable and doesn't have any missing values.

Here's a more detailed inspection report:

**Item_Identifier**:
- This is a categorical variable with unique identifiers for each product.
- There are no missing values in this column.

**Item_Weight**:
- This is a numerical variable representing the weight of the product.
- There are missing values in this column that we'll need to handle.

**Item_Fat_Content**:
- This is a categorical variable representing whether the product is low fat or regular.
- There are no missing values in this column.

**Item_Visibility**:
- This is a numerical variable representing the visibility of the product in the store.
- There are no missing values in this column.

**Item_Type**:
- This is a categorical variable representing the type of the product.
- There are no missing values in this column.

**Item_MRP**:
- This is a numerical variable representing the maximum retail price of the product.
- There are no missing values in this column.

**Outlet_Identifier**:
- This is a categorical variable with unique identifiers for each store.
- There are no missing values in this column.

**Outlet_Establishment_Year**:
- This is a numerical variable representing the year the store was established.
- There are no missing values in this column.

**Outlet_Size**:
- This is a categorical variable representing the size of the store.
- There are missing values in this column that we'll need to handle.

**Outlet_Location_Type**:
- This is a categorical variable representing the type of location where the store is situated.
- There are no missing values in this column.

**Outlet_Type**:
- This is a categorical variable representing the type of the store.
- There are no missing values in this column.

**Item_Outlet_Sales**:
- This is our target variable. It's a numerical variable representing the sales of the product in the particular store.
- There are no missing values in this column.
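
The per-column observations above can also be produced programmatically. The following is a minimal sketch, assuming a helper we call `column_report` (a hypothetical name, not part of the original notebook); it is shown on a small toy frame rather than on `df_raw`.

```python
import numpy as np
import pandas as pd


def column_report(df):
    """Summarize dtype, missing count, missing share, and unique count per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(),  # nunique() ignores missing values by default
    })


# Toy frame with made-up gaps, standing in for df_raw
toy = pd.DataFrame({
    "Item_Weight": [9.30, np.nan, 17.50, 19.20],
    "Outlet_Size": ["Medium", "Medium", None, "High"],
    "Item_Outlet_Sales": [3735.138, 443.4228, 2097.27, 732.38],
})

report = column_report(toy)
print(report)
```

Running `column_report(df_raw)` would give one row per column of the real dataset, which makes it easy to confirm which columns contain missing values.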

In [None]:
# The loaded data already live in df_raw (no variable named df exists),
# so create a separate copy of it for the cleaned data
df_clean = df_raw.copy()

### Remove Unnecessary Columns
Based on these observations, we might consider the following data cleaning steps:

- **Remove Unnecessary Columns**: At this stage, all columns seem to be relevant to our analysis. However, if we find that a column is not contributing useful information for our predictive model, we might consider removing it. For example, if Item_Identifier is just a unique ID for each item and doesn't contain any meaningful pattern, it might not be useful for our model.
- **Convert Columns**: We could convert the Outlet_Establishment_Year column to a more interpretable format, such as the age of the store in years. This would involve subtracting the establishment year from the current year.


In [11]:
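# Convert 'Outlet_Establishment_Year' to 'Outlet_Age'
from datetime import datetime  # same import style as the imports cell, so the name is not rebound to the module

# Note: the computed age depends on the year in which this cell is run
current_year = datetime.now().year
df_raw['Outlet_Age'] = current_year - df_raw['Outlet_Establishment_Year']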


# Display the first few rows of the dataframe to confirm the changes
df_raw.head()

|  | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales | Outlet_Age |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 | 24 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 | 14 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 | 24 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 | 25 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 | 36 |


## Updated Data Dictionary


Here is the updated data dictionary for the cleaned variables in the `df_clean` dataframe:

- `Item_Identifier`: Unique product ID
- `Item_Weight`: Weight of product
- `Item_Fat_Content`: Whether the product is low fat or regular
- `Item_Visibility`: The percentage of total display area of all products in a store allocated to the particular product
- `Item_Type`: The category to which the product belongs
- `Item_MRP`: Maximum Retail Price (list price) of the product
- `Outlet_Identifier`: Unique store ID
- `Outlet_Size`: The size of the store in terms of ground area covered
- `Outlet_Location_Type`: The type of city in which the store is located
- `Outlet_Type`: Whether the outlet is a grocery store or some sort of supermarket
- `Item_Outlet_Sales`: Sales of the product in the particular store. This is the target variable to be predicted.
- `Outlet_Age`: The age of the store in years, calculated as the current year minus the year the store was established.

### Missing Data

Next, we'll need to handle the missing values in the 'Item_Weight' and 'Outlet_Size' columns. The strategy for handling these missing values will depend on the nature of these variables and the business context. For example, we might fill the missing values with the mean or median of the 'Item_Weight' column, or we might use a more complex imputation method based on other variables. For the 'Outlet_Size' column, we might fill the missing values with the most common size, or we might use a more complex method based on other variables.
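
To make these options concrete, here is one possible sketch, not a final decision on the cleaning strategy. It assumes that rows sharing an 'Item_Identifier' describe the same product, so a missing weight can be borrowed from that product's other rows, with the overall median as a fallback; 'Outlet_Size' is filled with the most common size. The function name `impute_missing` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_missing(df):
    """Return a copy of df with 'Item_Weight' and 'Outlet_Size' filled in."""
    out = df.copy()

    # Step 1: use the median weight recorded for the same product in other rows.
    per_item_median = out.groupby("Item_Identifier")["Item_Weight"].transform("median")
    out["Item_Weight"] = out["Item_Weight"].fillna(per_item_median)

    # Step 2: products with no recorded weight at all fall back to the overall median.
    out["Item_Weight"] = out["Item_Weight"].fillna(out["Item_Weight"].median())

    # Step 3: fill the categorical size with the most common value.
    # (Assumes at least one non-missing size exists.)
    most_common_size = out["Outlet_Size"].mode().iloc[0]
    out["Outlet_Size"] = out["Outlet_Size"].fillna(most_common_size)
    return out


# Toy frame with made-up gaps, standing in for df_clean
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01", "NCD19"],
    "Item_Weight": [9.30, np.nan, 5.92, np.nan],
    "Outlet_Size": ["Medium", None, "Medium", "High"],
})

filled = impute_missing(toy)
print(filled)
```

The function works on a copy, so the input frame keeps its missing values and different strategies can be compared side by side. If imputation is later evaluated as part of a predictive model, the fill values would typically be computed on the training split only to avoid leaking information from the test data.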