<a href="https://colab.research.google.com/github/diazid/sales-predictions/blob/main/sales_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 1 - Part 2: Food Sales Prediction

Name: Israel Diaz



**Description:**

This project predicts sales of food items sold at various stores. The goal is to help the retailer understand which product and outlet properties play a crucial role in increasing sales.

## Loading Data

Loading the data from a Google Drive link.

In [1]:
filepath = 'https://drive.google.com/uc?export=download&id=1apwZQiYRcktux62Ki6qaJa_JI-hDGb75'

In [2]:
# IMPORTING PANDAS AND NUMPY LIBRARIES
import pandas as pd
import numpy as np

In [3]:
#LOADING DATA INTO PANDAS DATAFRAME
df = pd.read_csv(filepath)

Previewing the content and info.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


Two of the variables have missing values:

* Number of entries: 8523
* `Item_Weight`: 7060 non-null values
* `Outlet_Size`: 6113 non-null values

All other variables are complete.
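The non-null counts above translate into 1463 missing `Item_Weight` values (about 17% of the 8523 rows) and 2410 missing `Outlet_Size` values (about 28%). A minimal sketch of how such a summary could be computed is shown below; the `toy` frame and its values are hypothetical stand-ins, not the real data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data set.
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "High", "Small"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

# Count missing values per column and express them as a share of all rows.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
})

# Keep only the columns that actually have gaps.
missing = missing[missing["n_missing"] > 0]
print(missing)
```

Applied to `df`, the same two expressions would list only `Item_Weight` and `Outlet_Size`.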

In [5]:
df.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [6]:
df.shape

(8523, 12)

The data frame has 8523 entries and 12 variables. Each variable represents the following:



| Variable Name | Description |
|---------------------------|-------------|
| Item_Identifier | Unique product ID |
| Item_Weight | Weight of product |
| Item_Fat_Content | Whether the product is low fat or regular |
| Item_Visibility | The percentage of total display area of all products in a store allocated to the particular product |
| Item_Type | The category to which the product belongs |
| Item_MRP | Maximum Retail Price (list price) of the product |
| Outlet_Identifier | Unique store ID |
| Outlet_Establishment_Year | The year in which the store was established |
| Outlet_Size | The size of the store in terms of ground area covered |
| Outlet_Location_Type | The type of area in which the store is located |
| Outlet_Type | Whether the outlet is a grocery store or some sort of supermarket |
| Item_Outlet_Sales | Sales of the product in the particular store. This is the target variable to be predicted. |

In the next section I will deal with data preparation.

## Data Cleaning

### Data types

In [7]:
df.dtypes

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object

All data types seem to be correct, at least at first glance. If a later step shows that a data type needs to change, I will change it then.
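If a type change does become necessary, it could look like the sketch below. This is only an illustration on a hypothetical `toy` frame; converting `Outlet_Size` to pandas' categorical dtype is an assumption about what such a change might be, not a step taken in this notebook.

```python
import pandas as pd

# Hypothetical toy frame; the column names mirror the real data set.
toy = pd.DataFrame({
    "Outlet_Size": ["Medium", "High", "Small", "Medium"],
    "Outlet_Establishment_Year": [1999, 2009, 1998, 1987],
})

# Convert a text column to pandas' categorical dtype.
toy["Outlet_Size"] = toy["Outlet_Size"].astype("category")

# The integer column is left untouched.
print(toy.dtypes)
```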

### Finding Duplicates

In [8]:
df.duplicated().sum()

0

There are no duplicated entries in the data set.

### Solving Missing Values

In [9]:
df.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

There are missing values in the `Item_Weight` and `Outlet_Size` columns.

For `Item_Weight`, I'll impute the mean value by `Item_Type`: rather than using a single overall mean for all the missing values, each missing weight gets the mean of its own `Item_Type` group.


In [10]:
# Fill each missing Item_Weight with the mean weight of its Item_Type group
df['Item_Weight'] = df.groupby("Item_Type")["Item_Weight"].transform(lambda x: x.fillna(x.mean()))

# taken from
# 'https://datascience.stackexchange.com/questions/37717/imputation-missing-values-other-than-using-mean-median-in-python'

In [11]:
df.isna().sum()

Item_Identifier                 0
Item_Weight                     0
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

Done!
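To make the group-wise imputation concrete, here is a small worked sketch on hypothetical toy data. It uses `transform("mean")` followed by `fillna`, which should be equivalent to the lambda-based approach; the weights are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one gap in each item type.
toy = pd.DataFrame({
    "Item_Type": ["Dairy", "Dairy", "Dairy", "Meat", "Meat"],
    "Item_Weight": [10.0, 14.0, np.nan, 20.0, np.nan],
})

# Mean weight per item type, broadcast back to the original row order.
# NaN values are skipped when the mean is computed.
group_mean = toy.groupby("Item_Type")["Item_Weight"].transform("mean")

# Fill each gap with the mean of its own group.
toy["Item_Weight"] = toy["Item_Weight"].fillna(group_mean)
print(toy)
```

The missing Dairy weight becomes 12.0 (the mean of 10.0 and 14.0) and the missing Meat weight becomes 20.0, whereas a single overall mean would have filled both gaps with about 14.67.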

The other column with missing values is `Outlet_Size`:

In [12]:
df['Outlet_Size'].value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

There are three categories in the `Outlet_Size` variable:
* Small
* Medium
* High

Because this variable is categorical and we have no further information about why the values are missing, I will impute the missing values with an `Unknown` category.

Once this is corrected, I can continue with the previous analysis of `Outlet_Size` and `Item_Fat_Content`.
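One way to look for a reason behind the gaps is to check whether they cluster in particular outlets. The sketch below does this on hypothetical toy data (the outlet IDs `OUT_A`, `OUT_B`, and `OUT_C` are invented); it says nothing about what the real data would show.

```python
import pandas as pd

# Hypothetical toy data with invented outlet identifiers.
toy = pd.DataFrame({
    "Outlet_Identifier": ["OUT_A", "OUT_A", "OUT_B", "OUT_B", "OUT_C"],
    "Outlet_Size": ["Medium", "Medium", None, None, "Small"],
})

# Share of rows with a missing Outlet_Size, per outlet.
missing_share = (
    toy["Outlet_Size"].isna()
    .groupby(toy["Outlet_Identifier"])
    .mean()
)
print(missing_share)

# value_counts(dropna=False) shows the gaps as their own row before imputing.
print(toy["Outlet_Size"].value_counts(dropna=False))
```

If the share is 1.0 for some outlets and 0.0 for the rest, the size is simply unrecorded for whole stores, which would support using a separate category rather than guessing a size.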

#### Imputing missing values to the `Unknown` category

In [13]:
# IMPUTING MISSING VALUES TO UNKNOWN CATEGORY

# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which newer pandas versions warn about.
df['Outlet_Size'] = df['Outlet_Size'].fillna('Unknown')

#### Checking the data with imputed values

In [14]:
df.isna().sum()

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

In [15]:
df['Outlet_Size'].unique()

array(['Medium', 'Unknown', 'High', 'Small'], dtype=object)

Done!

No more missing values. 

### Looking for Inconsistencies

In [16]:
df['Item_Fat_Content'].unique()

array(['Low Fat', 'Regular', 'low fat', 'LF', 'reg'], dtype=object)

I will assume that the category `LF` is the same as `Low Fat`, and that `reg` is the same as `Regular`. The label `low fat` will be capitalized to `Low Fat`.

In [17]:
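# Map the variant labels onto the two canonical categories.
# Assigning back avoids replace(inplace=True) on a column selection.
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace({'LF': 'Low Fat',
                                                         'low fat': 'Low Fat',
                                                         'reg': 'Regular'})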

df['Item_Fat_Content'].unique()

array(['Low Fat', 'Regular'], dtype=object)

In [18]:
df['Item_Identifier'].unique()

array(['FDA15', 'DRC01', 'FDN15', ..., 'NCF55', 'NCW30', 'NCW05'],
      dtype=object)

In [19]:
df['Outlet_Establishment_Year'].unique()

array([1999, 2009, 1998, 1987, 1985, 2002, 2007, 1997, 2004])

In [20]:
df['Outlet_Identifier'].unique()

array(['OUT049', 'OUT018', 'OUT010', 'OUT013', 'OUT027', 'OUT045',
       'OUT017', 'OUT046', 'OUT035', 'OUT019'], dtype=object)

In [21]:
df['Outlet_Location_Type'].unique()

array(['Tier 1', 'Tier 3', 'Tier 2'], dtype=object)

In [22]:
df['Outlet_Type'].unique()

array(['Supermarket Type1', 'Supermarket Type2', 'Grocery Store',
       'Supermarket Type3'], dtype=object)

I see no more inconsistencies.
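The column-by-column checks above could also be run in a single pass. The following is a minimal sketch on hypothetical toy data; the list of columns to scan is an assumption and would need to match the categorical columns of the real frame.

```python
import pandas as pd

# Hypothetical toy frame reproducing the kind of label variants seen above.
toy = pd.DataFrame({
    "Item_Fat_Content": ["Low Fat", "low fat", "LF", "Regular", "reg"],
    "Outlet_Location_Type": ["Tier 1", "Tier 3", "Tier 2", "Tier 1", "Tier 3"],
})

# Collect the distinct labels of each categorical column.
columns_to_scan = ["Item_Fat_Content", "Outlet_Location_Type"]
labels = {col: sorted(toy[col].unique()) for col in columns_to_scan}
for col, values in labels.items():
    print(f"{col}: {values}")

# Folding case and trimming whitespace reveals labels that differ only in spelling style.
folded = toy["Item_Fat_Content"].str.strip().str.lower()
print(folded.nunique(), "distinct labels after case folding")
```

Case folding merges `Low Fat` and `low fat`, but abbreviations such as `LF` and `reg` still need an explicit mapping like the one used for `Item_Fat_Content`.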

### Summary statistics

In [23]:
df.describe()

| | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Item_Outlet_Sales |
|---|---|---|---|---|---|
| count | 8523.0 | 8523.0 | 8523.0 | 8523.0 | 8523.0 |
| mean | 12.85789 | 0.066132 | 140.992782 | 1997.831867 | 2181.288914 |
| std | 4.232804 | 0.051598 | 62.275067 | 8.37176 | 1706.499616 |
| min | 4.555 | 0.0 | 31.29 | 1985.0 | 33.29 |
| 25% | 9.31 | 0.026989 | 93.8265 | 1987.0 | 834.2474 |
| 50% | 12.867061 | 0.053931 | 143.0128 | 1999.0 | 1794.331 |
| 75% | 16.0 | 0.094585 | 185.6437 | 2004.0 | 3101.2964 |
| max | 21.35 | 0.328391 | 266.8884 | 2009.0 | 13086.9648 |


## Exploratory Visuals

## Explanatory Visuals