# Food Sales Prediction Project
<p>
    The information used in this project corresponds to 2013 sales of <strong>1559</strong> products and was collected by data scientists at BigMart across 10 stores in different cities. Certain features of each product and store have been defined.
    To help the retailer understand which product and outlet properties play a crucial role in sales, <strong>this project aims to predict the sales of food products in each outlet and to identify the product and outlet properties that play a key role in increasing sales.</strong>
</p>
<p><a href="https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/#ProblemStatement" target="_blank">source: <i>Analytics Vidhya 2013-2022</i></a></p>
<table>
    <tbody>
        <tr>
            <td>Variable</td>
            <td >Description</td>
        </tr>
        <tr>
            <td >Item_Identifier</td>
            <td >Unique product ID</td>
        </tr>
        <tr>
            <td>Item_Weight</td>
            <td >Weight of product</td>
        </tr>
        <tr>
            <td>Item_Fat_Content</td>
            <td>Whether the product is low fat or not</td>
        </tr>
        <tr>
            <td>Item_Visibility</td>
            <td>The % of total display area of all products in a store allocated to the particular product</td>
        </tr>
        <tr>
            <td>Item_Type</td>
            <td>The category to which the product belongs</td>
        </tr>
        <tr>
            <td>Item_MRP</td>
            <td>Maximum Retail Price (list price) of the product</td>
        </tr>
        <tr>
            <td>Outlet_Identifier</td>
            <td>Unique store ID</td>
        </tr>
        <tr>
            <td>Outlet_Establishment_Year</td>
            <td>The year in which store was established</td>
        </tr>
        <tr>
            <td>Outlet_Size</td>
            <td>The size of the store in terms of ground area covered</td>
        </tr>
        <tr>
            <td>Outlet_Location_Type</td>
            <td>The type of city in which the store is located</td>
        </tr>
        <tr>
            <td>Outlet_Type</td>
            <td>Whether the outlet is just a grocery store or some sort of supermarket</td>
        </tr>
    </tbody>
</table>

## Importing Pandas Module

In [3]:
import pandas as pd

## Loading the Data

In [107]:
df = pd.read_csv("sales_predictions.csv")
df.head()

,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


## Characterizing DataFrame

In [5]:
df.shape

(8523, 12)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


<p style="color:rgb(200,0,0);">The columns Item_Weight and Outlet_Size contain missing values. The next section checks for duplicated rows.</p>
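The missing counts can be read off the `df.info()` output (8523 - 7060 = 1463 for Item_Weight and 8523 - 6113 = 2410 for Outlet_Size), but they can also be computed directly. The following is a minimal sketch on a toy frame; the name `toy` and its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the sales data (illustrative values only)
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "High", "Small"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

# Count missing values per column, then keep only the columns that have any
missing = toy.isna().sum()
missing = missing[missing > 0]
print(missing)
```

Applied to the real `df`, the same two lines should list only Item_Weight and Outlet_Size.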

## Remove Duplicate or Irrelevant Observations

In [13]:
df.duplicated().sum()

0

<p style="color:green;">There are no duplicated rows.</p>

In [65]:
df["Item_Weight"].isna().sum()

1463

Item_Weight is treated here as irrelevant to the sales prediction, and 1463 of its values are missing, so this column will be deleted. Outlet_Establishment_Year, the year in which the store was established, is also considered irrelevant to this analysis and will be removed as well.
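The relevance of a numeric column is an assumption that can be checked before dropping it. One possible check, sketched here on a toy frame, is the Pearson correlation between weight and sales. The helper name `weight_sales_correlation` and the toy values are hypothetical, and a low linear correlation alone would not prove that a feature is useless.

```python
import numpy as np
import pandas as pd

def weight_sales_correlation(frame: pd.DataFrame) -> float:
    """Pearson correlation between weight and sales; rows with a NaN in either column are ignored."""
    return frame["Item_Weight"].corr(frame["Item_Outlet_Sales"])

# Toy data: the three complete rows are perfectly linear, the fourth has no weight
toy = pd.DataFrame({
    "Item_Weight": [1.0, 2.0, 3.0, np.nan],
    "Item_Outlet_Sales": [2.0, 4.0, 6.0, 100.0],
})

r = weight_sales_correlation(toy)
print(round(r, 3))
```

On the real data this would have to run before the `drop` call, while `Item_Weight` is still present.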

In [108]:
df.drop(columns = ["Item_Weight", "Outlet_Establishment_Year"], inplace=True)
df.head()

,Item_Identifier,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,Low Fat,0.016047,Dairy,249.8092,OUT049,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,Regular,0.019278,Soft Drinks,48.2692,OUT018,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,Low Fat,0.01676,Meat,141.618,OUT049,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,Regular,0.0,Fruits and Vegetables,182.095,OUT010,,Tier 3,Grocery Store,732.38
4,NCD19,Low Fat,0.0,Household,53.8614,OUT013,High,Tier 3,Supermarket Type1,994.7052


## Fixing Structural Errors

### Type of Information in Each Column

In [87]:
df.dtypes

Item_Identifier          object
Item_Fat_Content         object
Item_Visibility         float64
Item_Type                object
Item_MRP                float64
Outlet_Identifier        object
Outlet_Size              object
Outlet_Location_Type     object
Outlet_Type              object
Item_Outlet_Sales       float64
dtype: object

The data type of each column is consistent with the information it holds.

### Number of Products

In [83]:
len(df["Item_Identifier"].unique())  # number of distinct products; matches the 1559 stated in the introduction

1559

### Columns Categories

In [88]:
df["Item_Fat_Content"].unique()

array(['Low Fat', 'Regular', 'low fat', 'LF', 'reg'], dtype=object)

<div style="background-color:rgb(150,0,0); padding:10px;">
    <p>There are some irregularities in the Item_Fat_Content; it can be deduced that</p>
    <ul>
        <li>LF = Low Fat</li>
        <li>low fat = Low Fat</li>
        <li>reg = Regular</li>
    </ul>
</div>

In [109]:
# Map the inconsistent labels onto the two canonical categories
df["Item_Fat_Content"] = df["Item_Fat_Content"].replace({"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"})
df["Item_Fat_Content"].unique()

array(['Low Fat', 'Regular'], dtype=object)

In [90]:
df["Item_Type"].unique()

array(['Dairy', 'Soft Drinks', 'Meat', 'Fruits and Vegetables',
       'Household', 'Baking Goods', 'Snack Foods', 'Frozen Foods',
       'Breakfast', 'Health and Hygiene', 'Hard Drinks', 'Canned',
       'Breads', 'Starchy Foods', 'Others', 'Seafood'], dtype=object)

In [91]:
df["Outlet_Location_Type"].unique()

array(['Tier 1', 'Tier 3', 'Tier 2'], dtype=object)

In [92]:
df["Outlet_Type"].unique()

array(['Supermarket Type1', 'Supermarket Type2', 'Grocery Store',
       'Supermarket Type3'], dtype=object)

## Identifying and Handling Missing Data

In [14]:
df["Outlet_Size"].unique()

array(['Medium', nan, 'High', 'Small'], dtype=object)

Outlet size is likely to be an important feature for predicting sales, so it is worth grouping the data by outlet to look for patterns that can guide how the missing values are handled.

In [64]:
df.groupby(["Outlet_Type","Outlet_Location_Type","Outlet_Identifier"])["Outlet_Size"].unique()

Outlet_Type        Outlet_Location_Type  Outlet_Identifier
Grocery Store      Tier 1                OUT019                [Small]
                   Tier 3                OUT010                  [nan]
Supermarket Type1  Tier 1                OUT046                [Small]
                                         OUT049               [Medium]
                   Tier 2                OUT017                  [nan]
                                         OUT035                [Small]
                                         OUT045                  [nan]
                   Tier 3                OUT013                 [High]
Supermarket Type2  Tier 3                OUT018               [Medium]
Supermarket Type3  Tier 3                OUT027               [Medium]
Name: Outlet_Size, dtype: object

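The grouping shows that every outlet has either a single recorded size or none at all: OUT010, OUT017 and OUT045 have no size in any row, so their size cannot be inferred from other rows of the same outlet. The missing values in this column will therefore be labeled as "Missing".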
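The check behind this decision can be sketched as follows: count the distinct non-missing sizes per outlet and list the outlets that have none. The toy frame below is an assumption made for illustration; it only borrows three outlet identifiers and their sizes from the grouping shown above.

```python
import numpy as np
import pandas as pd

# Toy frame: OUT010 has no recorded size in any of its rows
toy = pd.DataFrame({
    "Outlet_Identifier": ["OUT049", "OUT049", "OUT010", "OUT010", "OUT019"],
    "Outlet_Size": ["Medium", "Medium", np.nan, np.nan, "Small"],
})

# Number of distinct non-missing sizes per outlet (nunique ignores NaN by default)
sizes_per_outlet = toy.groupby("Outlet_Identifier")["Outlet_Size"].nunique()

# Outlets with zero recorded sizes cannot be filled from their own rows
unknown_outlets = sizes_per_outlet[sizes_per_outlet == 0].index.tolist()
print(unknown_outlets)

# Label the remaining gaps explicitly instead of guessing a size
toy["Outlet_Size"] = toy["Outlet_Size"].fillna("Missing")
```

Keeping "Missing" as its own category avoids guessing a size, at the cost of one extra level in the feature.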

In [110]:
df["Outlet_Size"] = df["Outlet_Size"].fillna("Missing")
df["Outlet_Size"].isna().sum()

0

## Validating and QA

In [111]:
df = df.sort_values(["Outlet_Location_Type","Outlet_Type","Outlet_Identifier","Item_Identifier"])
df.head()

,Item_Identifier,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
2879,DRA24,Regular,0.069909,Soft Drinks,163.2868,OUT019,Small,Tier 1,Grocery Store,491.3604
6179,DRA59,Regular,0.223985,Soft Drinks,186.2924,OUT019,Small,Tier 1,Grocery Store,555.2772
1708,DRC25,Low Fat,0.07944,Soft Drinks,86.7882,OUT019,Small,Tier 1,Grocery Store,85.8882
2950,DRD15,Low Fat,0.099442,Dairy,233.1642,OUT019,Small,Tier 1,Grocery Store,697.0926
2766,DRD25,Low Fat,0.13827,Soft Drinks,111.686,OUT019,Small,Tier 1,Grocery Store,452.744


In [115]:
df_indexed = df.set_index(["Outlet_Location_Type","Outlet_Type","Outlet_Identifier","Item_Identifier"])
df_indexed

Outlet_Location_Type,Outlet_Type,Outlet_Identifier,Item_Identifier,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Size,Item_Outlet_Sales
Tier 1,Grocery Store,OUT019,DRA24,Regular,0.069909,Soft Drinks,163.2868,Small,491.3604
Tier 1,Grocery Store,OUT019,DRA59,Regular,0.223985,Soft Drinks,186.2924,Small,555.2772
Tier 1,Grocery Store,OUT019,DRC25,Low Fat,0.079440,Soft Drinks,86.7882,Small,85.8882
Tier 1,Grocery Store,OUT019,DRD15,Low Fat,0.099442,Dairy,233.1642,Small,697.0926
Tier 1,Grocery Store,OUT019,DRD25,Low Fat,0.138270,Soft Drinks,111.6860,Small,452.7440
...,...,...,...,...,...,...,...,...,...
Tier 3,Supermarket Type3,OUT027,NCZ06,Low Fat,0.093706,Household,253.8698,Medium,3297.7074
Tier 3,Supermarket Type3,OUT027,NCZ17,Low Fat,0.079047,Health and Hygiene,39.8506,Medium,1480.0734
Tier 3,Supermarket Type3,OUT027,NCZ30,Low Fat,0.026058,Household,121.9098,Medium,3374.2744
Tier 3,Supermarket Type3,OUT027,NCZ53,Low Fat,0.024359,Health and Hygiene,190.4214,Medium,5652.6420


In [117]:
df_indexed.to_csv(path_or_buf="food_sales_forecast_cleaned.csv")
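Because the frame is written with a four-level index, reading the file back needs the same columns passed as `index_col`. The sketch below shows the round trip on a two-row toy frame, using an in-memory buffer in place of the real file; the names `toy`, `buffer` and `restored` are assumptions made for illustration.

```python
import io
import pandas as pd

index_cols = ["Outlet_Location_Type", "Outlet_Type", "Outlet_Identifier", "Item_Identifier"]

# Two-row toy frame with the same index columns as the cleaned data
toy = pd.DataFrame({
    "Outlet_Location_Type": ["Tier 1", "Tier 1"],
    "Outlet_Type": ["Grocery Store", "Grocery Store"],
    "Outlet_Identifier": ["OUT019", "OUT019"],
    "Item_Identifier": ["DRA24", "DRA59"],
    "Item_MRP": [163.2868, 186.2924],
})
toy_indexed = toy.set_index(index_cols)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_indexed.to_csv(buffer)
buffer.seek(0)

# Passing the index columns restores the same MultiIndex on load
restored = pd.read_csv(buffer, index_col=index_cols)
print(restored.index.names)
```

For the real file, the buffer would be replaced by the path `"food_sales_forecast_cleaned.csv"`.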

## Summary Statistics

In [65]:
df.describe()  # run before Item_Weight and Outlet_Establishment_Year were dropped, so both still appear below

,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,7060.0,8523.0,8523.0,8523.0,8523.0
mean,12.857645,0.066132,140.992782,1997.831867,2181.288914
std,4.643456,0.051598,62.275067,8.37176,1706.499616
min,4.555,0.0,31.29,1985.0,33.29
25%,8.77375,0.026989,93.8265,1987.0,834.2474
50%,12.6,0.053931,143.0128,1999.0,1794.331
75%,16.85,0.094585,185.6437,2004.0,3101.2964
max,21.35,0.328391,266.8884,2009.0,13086.9648


## Working Section

In [52]:
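# Mean weight per product; needs Item_Weight, so this runs on the frame before that column is dropped
items_serie = df.groupby("Item_Identifier")["Item_Weight"].mean()
# Draft idea: fill each missing weight with the mean weight of the same product
# df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Identifier"].map(items_serie))
# df.groupby("Item_Identifier")["Item_Weight"].mean().isna().sum()  # products with no recorded weight
len(items_serie.unique())  # number of distinct mean weights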

451
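The working cell above computes the mean weight per product, apparently as a first step toward filling the missing weights instead of dropping the column. A minimal sketch of that alternative is shown below on a toy frame; the values are illustrative assumptions, and the notebook itself drops `Item_Weight` rather than imputing it.

```python
import numpy as np
import pandas as pd

# Toy frame: FDA15 has one known and one missing weight, NCD19 has no known weight
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01", "DRC01", "NCD19"],
    "Item_Weight": [9.3, np.nan, 5.92, 5.92, np.nan],
})

# Per-row mean weight of the product that row belongs to (NaN if the product has no weight at all)
item_means = toy.groupby("Item_Identifier")["Item_Weight"].transform("mean")

# Fill only the missing entries; known weights are left untouched
toy["Item_Weight"] = toy["Item_Weight"].fillna(item_means)
print(toy)
```

Products with no recorded weight in any row, like NCD19 in the toy frame, stay missing and would need a separate fallback.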

In [47]:
def print_rounded(value):
    """Print a weight rounded to two decimals, skipping missing values."""
    if not pd.isna(value):
        print(round(value, 2))

# Preview the first 19 mean weights
for weight in items_serie.iloc[:19]:
    print_rounded(weight)

11.6
19.35
8.27
7.39
6.12
8.79
12.3
16.75
5.92
17.85
8.26
17.85
5.73
13.8
13.0
8.67
12.1
6.96
15.0


In [55]:
len(df["Outlet_Identifier"].unique())

10