<a href="https://colab.research.google.com/github/diazid/sales-predictions/blob/main/sales_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 1 - Part 1: Food Sales Prediction

Name: Israel Diaz



## Loading Data

Load the dataset from a hosted CSV file (a Google Drive download link).

In [1]:
filepath = 'https://drive.google.com/uc?export=download&id=1apwZQiYRcktux62Ki6qaJa_JI-hDGb75'

In [2]:
#IMPORTING PANDAS LIBRARY
import pandas as pd

In [3]:
#LOADING DATA INTO PANDAS DATAFRAME
df = pd.read_csv(filepath)

Previewing the content and info.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


The data has the following:
* Entries: 8523
* `Item_Weight`: 7060 non-null values (1463 missing)
* `Outlet_Size`: 6113 non-null values (2410 missing)

All other columns are complete.
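To put the gaps in proportion, the missing counts can be expressed as a percentage of all rows. The following is a minimal sketch on a small stand-in frame: `toy` and the helper `missing_summary` are hypothetical names introduced for illustration, and the toy values are not the real data.

```python
import pandas as pd

# Toy frame standing in for `df`; the values are illustrative only.
toy = pd.DataFrame({
    "Item_Weight": [9.3, None, 17.5, None],
    "Outlet_Size": ["Medium", None, "Medium", "High"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isna().sum()
    percent = (counts / len(frame) * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only the columns that actually have gaps, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


summary = missing_summary(toy)
print(summary)
```

Applied to the real `df`, the same calculation gives roughly 17% missing for `Item_Weight` (1463 of 8523) and roughly 28% for `Outlet_Size` (2410 of 8523).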

In [5]:
df.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [6]:
df.shape

(8523, 12)

## Data Cleaning

In [7]:
df.dtypes

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object

All data types seem to be correct.

In [8]:
df.duplicated().sum()

0

There are no duplicate rows.

In [9]:
df.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

As we saw before, the `Item_Weight` and `Outlet_Size` columns have missing values. We'll explore those columns in the following steps.

We will remove the `Item_Weight` column because of its low relevance: knowing how much an item weighs is not important for this analysis.
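One way to sanity-check the relevance claim before dropping the column is to look at how strongly each numeric feature correlates with the target. The sketch below uses only the five rows printed by `df.head()`, so the numbers it produces are not meaningful on their own; the names `sample` and `corr_with_sales` are introduced here for illustration. The same expression could be run on the full `df` before the drop.

```python
import pandas as pd

# The five rows shown by df.head(), restricted to numeric columns of interest.
sample = pd.DataFrame({
    "Item_Weight": [9.3, 5.92, 17.5, 19.2, 8.93],
    "Item_MRP": [249.8092, 48.2692, 141.618, 182.095, 53.8614],
    "Item_Outlet_Sales": [3735.138, 443.4228, 2097.27, 732.38, 994.7052],
})

# Pearson correlation of each feature with the target column.
corr_with_sales = (
    sample.corr()["Item_Outlet_Sales"]
    .drop("Item_Outlet_Sales")  # the target's correlation with itself is always 1
)
print(corr_with_sales)
```

A correlation close to zero on the full dataset would support dropping the column; a clearly non-zero value would suggest imputing it instead.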

In [10]:
df.drop(columns=['Item_Weight'], inplace=True)

In [11]:
df['Outlet_Size'].value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

Two of the three sizes (Medium and Small) have more than 2000 entries each; High has only 932.



In [12]:
df.groupby(by=['Item_Type'])['Outlet_Size'].value_counts()

Item_Type              Outlet_Size
Baking Goods           Medium         203
                       Small          187
                       High            73
Breads                 Medium          83
                       Small           71
                       High            25
Breakfast              Medium          36
                       Small           30
                       High            13
Canned                 Medium         217
                       Small          189
                       High            65
Dairy                  Medium         218
                       Small          198
                       High            80
Frozen Foods           Medium         274
                       Small          249
                       High            92
Fruits and Vegetables  Medium         413
                       Small          328
                       High           142
Hard Drinks            Medium          75
                       Small           50

As the output shows, High is the smallest category within every item type, while Medium and Small have similar counts. It is therefore plausible that the missing values belong to the High category. Based on this insight, we will impute the missing values as High.
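Since outlet size is a property of the store rather than of the item, it can also help to see which outlets the missing values come from before imputing. The following is a minimal sketch on the five rows shown by `df.head()`; the names `sample`, `size_missing`, and `missing_by_outlet` are introduced here for illustration.

```python
import pandas as pd

# Outlet columns from the five rows shown by df.head().
sample = pd.DataFrame({
    "Outlet_Identifier": ["OUT049", "OUT018", "OUT049", "OUT010", "OUT013"],
    "Outlet_Type": [
        "Supermarket Type1", "Supermarket Type2", "Supermarket Type1",
        "Grocery Store", "Supermarket Type1",
    ],
    "Outlet_Size": ["Medium", "Medium", "Medium", None, "High"],
})

# Count missing sizes and total rows for each outlet.
missing_by_outlet = (
    sample.assign(size_missing=sample["Outlet_Size"].isna())
    .groupby(["Outlet_Identifier", "Outlet_Type"])["size_missing"]
    .agg(missing="sum", rows="count")
)
print(missing_by_outlet)
```

If the missing values on the full data turn out to be concentrated in a few outlets, the outlet type of those stores is additional evidence for or against the High assumption.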

In [13]:
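# IMPUTING MISSING VALUES TO HIGH CATEGORY
# Assign the result back: fillna(inplace=True) on a selected column may not update df in recent pandas versions
df['Outlet_Size'] = df['Outlet_Size'].fillna('High')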


In [14]:
df.groupby(by=['Item_Type'])['Outlet_Size'].value_counts()

Item_Type              Outlet_Size
Baking Goods           High           258
                       Medium         203
                       Small          187
Breads                 High            97
                       Medium          83
                       Small           71
Breakfast              High            44
                       Medium          36
                       Small           30
Canned                 High           243
                       Medium         217
                       Small          189
Dairy                  High           266
                       Medium         218
                       Small          198
Frozen Foods           High           333
                       Medium         274
                       Small          249
Fruits and Vegetables  High           491
                       Medium         413
                       Small          328
Hard Drinks            High            89
                       Medium          75

After imputation, the three sizes have counts of the same order within each item type, with High now the largest category.

In [15]:
df.isna().sum()

Item_Identifier              0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

Now we have no more missing values. 

The next step will be to find inconsistencies in the data.

In [16]:
df['Item_Fat_Content'].unique()

array(['Low Fat', 'Regular', 'low fat', 'LF', 'reg'], dtype=object)

We see inconsistencies in the category names. We'll assume that `LF` and `low fat` stand for Low Fat, and `reg` for Regular.

In [17]:
# Assign the result back instead of using inplace=True on a selected column
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace({'LF': 'Low Fat',
                                                         'low fat': 'Low Fat',
                                                         'reg': 'Regular'})

df['Item_Fat_Content'].unique()

array(['Low Fat', 'Regular'], dtype=object)
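The label cleanup above can be paired with a guard that catches any variant the mapping does not cover. This is a minimal sketch on a stand-in Series holding the five labels seen in the data; `labels`, `canonical`, `allowed`, and `unexpected` are names introduced here for illustration.

```python
import pandas as pd

# The five label variants reported by unique() before cleaning.
labels = pd.Series(["Low Fat", "Regular", "low fat", "LF", "reg"])

# Map every known variant to its canonical spelling.
canonical = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
cleaned = labels.replace(canonical)

# Anything left outside the allowed set is a variant the mapping missed.
allowed = {"Low Fat", "Regular"}
unexpected = set(cleaned.unique()) - allowed
if unexpected:
    raise ValueError(f"Unmapped fat-content labels: {sorted(unexpected)}")

print(cleaned.value_counts())
```

The same check makes the cleaning step fail loudly if a new spelling appears in a later version of the data.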

In [18]:
df['Item_Type'].unique()

array(['Dairy', 'Soft Drinks', 'Meat', 'Fruits and Vegetables',
       'Household', 'Baking Goods', 'Snack Foods', 'Frozen Foods',
       'Breakfast', 'Health and Hygiene', 'Hard Drinks', 'Canned',
       'Breads', 'Starchy Foods', 'Others', 'Seafood'], dtype=object)

In [19]:
df['Item_Identifier'].unique()

array(['FDA15', 'DRC01', 'FDN15', ..., 'NCF55', 'NCW30', 'NCW05'],
      dtype=object)

In [20]:
df['Outlet_Establishment_Year'].unique()

array([1999, 2009, 1998, 1987, 1985, 2002, 2007, 1997, 2004])

In [21]:
df['Outlet_Identifier'].unique()

array(['OUT049', 'OUT018', 'OUT010', 'OUT013', 'OUT027', 'OUT045',
       'OUT017', 'OUT046', 'OUT035', 'OUT019'], dtype=object)

In [22]:
df['Outlet_Location_Type'].unique()

array(['Tier 1', 'Tier 3', 'Tier 2'], dtype=object)

In [23]:
df['Outlet_Type'].unique()

array(['Supermarket Type1', 'Supermarket Type2', 'Grocery Store',
       'Supermarket Type3'], dtype=object)

Summary statistics

In [26]:
df.describe()

| | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Item_Outlet_Sales |
|---|---|---|---|---|
| count | 8523.0 | 8523.0 | 8523.0 | 8523.0 |
| mean | 0.066132 | 140.992782 | 1997.831867 | 2181.288914 |
| std | 0.051598 | 62.275067 | 8.37176 | 1706.499616 |
| min | 0.0 | 31.29 | 1985.0 | 33.29 |
| 25% | 0.026989 | 93.8265 | 1987.0 | 834.2474 |
| 50% | 0.053931 | 143.0128 | 1999.0 | 1794.331 |
| 75% | 0.094585 | 185.6437 | 2004.0 | 3101.2964 |
| max | 0.328391 | 266.8884 | 2009.0 | 13086.9648 |


## Exploratory Visuals

## Explanatory Visuals