# Food Sales Prediction Project
<p>
    The data used in this project corresponds to 2013 sales and was collected by data scientists at BigMart across 10 stores in different cities. Several features are defined for each product and each store.
    To help the retailer understand which product properties and outlets play a crucial role in sales, this project aims to predict the sales of food products in each outlet.
</p>
<p><a href="https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/#ProblemStatement" target="_blank">Source: <i>Analytics Vidhya 2013-2022</i></a></p>
<table>
    <tbody>
        <tr>
            <td>Variable</td>
            <td >Description</td>
        </tr>
        <tr>
            <td >Item_Identifier</td>
            <td >Unique product ID</td>
        </tr>
        <tr>
            <td>Item_Weight</td>
            <td >Weight of product</td>
        </tr>
        <tr>
            <td>Item_Fat_Content</td>
            <td>Whether the product is low fat or not</td>
        </tr>
        <tr>
            <td>Item_Visibility</td>
            <td>The % of total display area of all products in a store allocated to the particular product</td>
        </tr>
        <tr>
            <td>Item_Type</td>
            <td>The category to which the product belongs</td>
        </tr>
        <tr>
            <td>Item_MRP</td>
            <td>Maximum Retail Price (list price) of the product</td>
        </tr>
        <tr>
            <td>Outlet_Identifier</td>
            <td>Unique store ID</td>
        </tr>
        <tr>
            <td>Outlet_Establishment_Year</td>
            <td>The year in which the store was established</td>
        </tr>
        <tr>
            <td>Outlet_Size</td>
            <td>The size of the store in terms of ground area covered</td>
        </tr>
        <tr>
            <td>Outlet_Location_Type</td>
            <td>The type of city in which the store is located</td>
        </tr>
        <tr>
            <td>Outlet_Type</td>
            <td>Whether the outlet is just a grocery store or some sort of supermarket</td>
        </tr>
    </tbody>
</table>

## Importing Pandas Module

In [1]:
import pandas as pd

## Loading the data

In [62]:
df = pd.read_csv("sales_predictions.csv")
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


## Characterizing DataFrame

In [4]:
df.shape

(8523, 12)

In [6]:
df.dtypes

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB
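
The non-null counts above imply 8523 - 7060 = 1463 missing values in Item_Weight and 8523 - 6113 = 2410 in Outlet_Size. As a minimal sketch, the same numbers can be obtained directly with `isna()`. The small `toy` frame below is made up for illustration so that the snippet runs on its own; on the real data, `df` would take its place.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "Item_Weight": [9.3, None, 17.5, None],
    "Outlet_Size": ["Medium", None, "High", "Small"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

# Count of missing values per column, and the same as a percentage of rows.
missing = toy.isna().sum()
missing_pct = (toy.isna().mean() * 100).round(1)

# Show only the columns that actually have missing values.
summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
print(summary[summary["missing"] > 0])
```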


<p style="color:rgb(200,0,0);">The Item_Weight and Outlet_Size columns have missing values. Let's check whether there are duplicated rows.</p>

In [13]:
df.duplicated().sum()

0

<p style="color:green;">There are no duplicated rows. Let's check the categories in the Outlet_Size column.</p>

### Identifying and making decisions about missing data

In [14]:
df["Outlet_Size"].unique() # missing data has been categorized as NaN

array(['Medium', nan, 'High', 'Small'], dtype=object)

Outlet size is likely an important feature for predicting sales, so it is worth grouping the data to look for patterns that can help decide whether, and how, to replace the missing values.

In [64]:
df.groupby(["Outlet_Type","Outlet_Location_Type","Outlet_Identifier"])["Outlet_Size"].unique()

Outlet_Type        Outlet_Location_Type  Outlet_Identifier
Grocery Store      Tier 1                OUT019                [Small]
                   Tier 3                OUT010                  [nan]
Supermarket Type1  Tier 1                OUT046                [Small]
                                         OUT049               [Medium]
                   Tier 2                OUT017                  [nan]
                                         OUT035                [Small]
                                         OUT045                  [nan]
                   Tier 3                OUT013                 [High]
Supermarket Type2  Tier 3                OUT018               [Medium]
Supermarket Type3  Tier 3                OUT027               [Medium]
Name: Outlet_Size, dtype: object
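
In the grouping above, each outlet has either one size or none, and three outlets (OUT010, OUT017, OUT045) have no recorded size. One possible way to act on this, shown here only as a sketch, is to fill a missing size with the most common size among outlets of the same Outlet_Type. That rule is an assumption, not a decision taken in this notebook, and the `toy` frame and the `Outlet_Size_filled` column are hypothetical names used so the example runs on its own.

```python
import pandas as pd

# Toy frame with one row per outlet (a simplified stand-in for the real data).
toy = pd.DataFrame({
    "Outlet_Identifier": ["OUT019", "OUT010", "OUT035", "OUT017", "OUT046", "OUT013"],
    "Outlet_Type": [
        "Grocery Store", "Grocery Store",
        "Supermarket Type1", "Supermarket Type1",
        "Supermarket Type1", "Supermarket Type1",
    ],
    "Outlet_Size": ["Small", None, "Small", None, "Small", "High"],
})

# Most common known size per outlet type (rows without a size are dropped first).
size_by_type = (
    toy.dropna(subset=["Outlet_Size"])
       .groupby("Outlet_Type")["Outlet_Size"]
       .agg(lambda sizes: sizes.mode().iloc[0])
)

# Fill only the missing sizes; the original column is left untouched.
toy["Outlet_Size_filled"] = toy["Outlet_Size"].fillna(toy["Outlet_Type"].map(size_by_type))
print(toy)
```

Writing to a new column keeps the original values available for comparison. On the real data the rows are per product rather than per outlet, so larger outlets would weigh more in the mode unless the frame is first reduced to one row per outlet.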

In [20]:
df.loc[df["Outlet_Size"].isna()]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
3,FDX07,19.200,Regular,0.000000,Fruits and Vegetables,182.0950,OUT010,1998,,Tier 3,Grocery Store,732.3800
8,FDH17,16.200,Regular,0.016687,Frozen Foods,96.9726,OUT045,2002,,Tier 2,Supermarket Type1,1076.5986
9,FDU28,19.200,Regular,0.094450,Frozen Foods,187.8214,OUT017,2007,,Tier 2,Supermarket Type1,4710.5350
25,NCD06,13.000,Low Fat,0.099887,Household,45.9060,OUT017,2007,,Tier 2,Supermarket Type1,838.9080
28,FDE51,5.925,Regular,0.161467,Dairy,45.5086,OUT010,1998,,Tier 3,Grocery Store,178.4344
...,...,...,...,...,...,...,...,...,...,...,...,...
8502,NCH43,8.420,Low Fat,0.070712,Household,216.4192,OUT045,2002,,Tier 2,Supermarket Type1,3020.0688
8508,FDW31,11.350,Regular,0.043246,Fruits and Vegetables,199.4742,OUT045,2002,,Tier 2,Supermarket Type1,2587.9646
8509,FDG45,8.100,Low Fat,0.214306,Fruits and Vegetables,213.9902,OUT010,1998,,Tier 3,Grocery Store,424.7804
8514,FDA01,15.000,Regular,0.054489,Canned,57.5904,OUT045,2002,,Tier 2,Supermarket Type1,468.7232


In [25]:
df.loc[df["Item_Weight"].isna() | df["Outlet_Size"].isna()]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
3,FDX07,19.20,Regular,0.000000,Fruits and Vegetables,182.0950,OUT010,1998,,Tier 3,Grocery Store,732.3800
7,FDP10,,Low Fat,0.127470,Snack Foods,107.7622,OUT027,1985,Medium,Tier 3,Supermarket Type3,4022.7636
8,FDH17,16.20,Regular,0.016687,Frozen Foods,96.9726,OUT045,2002,,Tier 2,Supermarket Type1,1076.5986
9,FDU28,19.20,Regular,0.094450,Frozen Foods,187.8214,OUT017,2007,,Tier 2,Supermarket Type1,4710.5350
18,DRI11,,Low Fat,0.034238,Hard Drinks,113.2834,OUT027,1985,Medium,Tier 3,Supermarket Type3,2303.6680
...,...,...,...,...,...,...,...,...,...,...,...,...
8504,NCN18,,Low Fat,0.124111,Household,111.7544,OUT027,1985,Medium,Tier 3,Supermarket Type3,4138.6128
8508,FDW31,11.35,Regular,0.043246,Fruits and Vegetables,199.4742,OUT045,2002,,Tier 2,Supermarket Type1,2587.9646
8509,FDG45,8.10,Low Fat,0.214306,Fruits and Vegetables,213.9902,OUT010,1998,,Tier 3,Grocery Store,424.7804
8514,FDA01,15.00,Regular,0.054489,Canned,57.5904,OUT045,2002,,Tier 2,Supermarket Type1,468.7232


In [28]:
len(df["Item_Identifier"].unique())

1559

In [52]:
# Mean weight of each product across all outlets, indexed by Item_Identifier
items_serie = df.groupby("Item_Identifier")["Item_Weight"].mean()
# Number of distinct mean weights (NaN, if present, counts as one value)
len(items_serie.unique())

451

In [47]:
# Preview the first 19 per-item mean weights, rounded to two decimals
for mean_weight in items_serie.head(19):
    print(round(mean_weight, 2))

11.6
19.35
8.27
7.39
6.12
8.79
12.3
16.75
5.92
17.85
8.26
17.85
5.73
13.8
13.0
8.67
12.1
6.96
15.0
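
The per-item means computed above suggest a way to fill missing weights: use the mean weight recorded for the same Item_Identifier in other rows. The sketch below illustrates that idea under the assumption that a product's weight does not vary between outlets, which would need to be checked on the real data. The `toy` frame and the `Item_Weight_filled` column are hypothetical names chosen for the example.

```python
import pandas as pd

# Toy frame: FDA15 has one known weight, FDX07 has none at all.
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01", "DRC01", "FDX07"],
    "Item_Weight": [9.3, None, 5.92, 5.92, None],
})

# Per-row mean weight of the row's item; NaN values are ignored by mean().
item_mean = toy.groupby("Item_Identifier")["Item_Weight"].transform("mean")

# Fill only the missing weights with the mean for the same item.
toy["Item_Weight_filled"] = toy["Item_Weight"].fillna(item_mean)

# Items with no recorded weight anywhere stay missing and need another strategy.
still_missing = toy["Item_Weight_filled"].isna().sum()
print(toy)
print("Still missing:", still_missing)
```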


In [55]:
len(df["Outlet_Identifier"].unique())

10

## Summary Statistics

In [65]:
df.describe()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,7060.0,8523.0,8523.0,8523.0,8523.0
mean,12.857645,0.066132,140.992782,1997.831867,2181.288914
std,4.643456,0.051598,62.275067,8.37176,1706.499616
min,4.555,0.0,31.29,1985.0,33.29
25%,8.77375,0.026989,93.8265,1987.0,834.2474
50%,12.6,0.053931,143.0128,1999.0,1794.331
75%,16.85,0.094585,185.6437,2004.0,3101.2964
max,21.35,0.328391,266.8884,2009.0,13086.9648
