# Supervised Learning Project: Big Mart Sales

BigMart is a big supermarket chain, with stores all around the country. The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and predict the sales of each product at a particular outlet.

* **Data Dictionary**

The data consist of a training set (8,523 rows) and a test set (5,681 rows). The training set contains both the input variables and the output variable.

Train file:
CSV containing the item outlet information with a sales value

Variable Description
* Item_Identifier ---- Unique product ID
* Item_Weight ---- Weight of product
* Item_Fat_Content ---- Whether the product is low fat or not
* Item_Visibility ---- The % of the total display area of all products in a store allocated to the particular product
* Item_Type ---- The category to which the product belongs
* Item_MRP ---- Maximum Retail Price (list price) of the product
* Outlet_Identifier ---- Unique store ID
* Outlet_Establishment_Year ---- The year in which the store was established
* Outlet_Size ---- The size of the store in terms of ground area covered
* Outlet_Location_Type ---- The type of city in which the store is located
* Outlet_Type ---- Whether the outlet is just a grocery store or some sort of supermarket
* Item_Outlet_Sales ---- Sales of the product in the particular store. This is the outcome variable to be predicted.



Test file:
CSV containing item outlet combinations for which sales need to be forecasted

Variable Description
* Item_Identifier ---- Unique product ID
* Item_Weight ---- Weight of product
* Item_Fat_Content ---- Whether the product is low fat or not
* Item_Visibility ---- The % of the total display area of all products in a store allocated to the particular product
* Item_Type ---- The category to which the product belongs
* Item_MRP ---- Maximum Retail Price (list price) of the product
* Outlet_Identifier ---- Unique store ID
* Outlet_Establishment_Year ---- The year in which the store was established
* Outlet_Size ---- The size of the store in terms of ground area covered
* Outlet_Location_Type ---- The type of city in which the store is located
* Outlet_Type ---- Whether the outlet is just a grocery store or some sort of supermarket


This is a supervised machine learning problem with the target label Item_Outlet_Sales.

Since the aim is to predict a continuous sales value for each row of the test dataset, this is a regression task.

Importing libraries necessary for this project.

In [None]:
# Libraries for data manipulation.
import pandas as pd
import numpy as np

# Libraries for data visualization.
import seaborn as sns
import matplotlib.pyplot as plt

# Libraries for model building.
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor


# Exploratory Data Analysis

Exploratory data analysis is an approach to analyzing data sets and extracting useful information from them. The analysis starts with a descriptive exploration of the data, such as counting missing records and values, and moves on to a visual exploration that represents the data in more intuitive formats. It often relies on statistical graphics and other data visualization methods.

In [None]:
# Load dataset.
df_train = pd.read_csv('../input/big-mart-salescsv/Train_UWu5bXk.csv')
df_test = pd.read_csv('../input/big-mart-salescsv/Test_u94Q5KV.csv')

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
# Check the shape of the data.
print('Training data: {}'.format(df_train.shape))
print('Test data: {}'.format(df_test.shape))

In [None]:
# Check for null values on training data.
print(df_train.isnull().sum())

In [None]:
# Check for null values on test data.
print(df_test.isnull().sum())

In [None]:
# Generate descriptive statistics on training data.
df_train.describe()

In [None]:
# Generate descriptive statistics on test data.
df_test.describe()

**The graphs below show the univariate distribution of each numeric variable.**

In [None]:
plt.style.use('ggplot')

for column in df_train.describe().columns:
    sns.displot(df_train[column].dropna(), kde=True, element='step')
    plt.show()

**Boxplot**

In [None]:
for column in df_train.describe().columns:
    sns.boxplot(x=df_train[column].dropna())
    plt.show()

**Relationship between variables**

In [None]:
for column in df_train.describe().columns:
    sns.relplot(data=df_train.dropna(), x=column, y='Item_Outlet_Sales')
    plt.show()

**Analysis of the 'Item_Type' categorical variable to see the distribution of item types across outlets**

Among the item types, 'Fruits and Vegetables' appears most often in the training records, while 'Seafood' appears least often.

In [None]:
plt.figure(figsize=(15,10))
ax = sns.countplot(x=df_train['Item_Type'])
plt.xticks(rotation=90)

# Annotate each bar with its count before rendering the figure.
for p in ax.patches:
    ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.2, p.get_height()+20))

plt.show()

**Distribution of the 'Outlet_Size' variable**

In [None]:
plt.figure(figsize=(10,8))
ax = sns.countplot(x=df_train['Outlet_Size'])

# Annotate each bar with its count before rendering the figure.
for p in ax.patches:
    ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.3, p.get_height()+30))

plt.show()

**Distribution of the 'Outlet_Location_Type' variable**

In [None]:
plt.figure(figsize=(10,8))
ax = sns.countplot(x=df_train['Outlet_Location_Type'])

# Annotate each bar with its count before rendering the figure.
for p in ax.patches:
    ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.3, p.get_height()+50))

plt.show()

**Distribution of the 'Outlet_Type' variable**

In [None]:
plt.figure(figsize=(10,8))
ax = sns.countplot(x=df_train['Outlet_Type'])

# Annotate each bar with its count before rendering the figure.
for p in ax.patches:
    ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.3, p.get_height()+30))

plt.show()

**Impact of the 'Item_Fat_Content' on 'Item_Outlet_Sales'**

In [None]:
# Resolve naming discrepancies on 'Item_Fat_Content' variable.
df_train['Item_Fat_Content'] = df_train['Item_Fat_Content'].replace({'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'})

# Median sales per fat-content category.
df_item_fat_pivot = df_train.pivot_table(index='Item_Fat_Content', values='Item_Outlet_Sales', aggfunc='median')

ax = df_item_fat_pivot.plot(kind='bar', color='blue', figsize=(12,8), alpha=0.6)
plt.ylabel('Item_Outlet_Sales')
plt.title('Impact of the Item_Fat_Content on Item_Outlet_Sales')
plt.xticks(rotation=0)

# Annotate each bar with its value before rendering the figure.
for p in ax.patches:
    ax.annotate('{:.2f}'.format(p.get_height()), (p.get_x()+0.2, p.get_height()+30))

plt.show()

**Impact of the 'Outlet_Type' on 'Item_Outlet_Sales'**

In [None]:
# Median sales per outlet type.
df_outlet_type_pivot = df_train.pivot_table(index='Outlet_Type', values='Item_Outlet_Sales', aggfunc='median')

ax = df_outlet_type_pivot.plot(kind='bar', color='maroon', figsize=(12,8), alpha=0.6)
plt.ylabel('Item_Outlet_Sales')
plt.title('Impact of the Outlet_Type on Item_Outlet_Sales')
plt.xticks(rotation=0)

# Annotate each bar with its value before rendering the figure.
for p in ax.patches:
    ax.annotate('{:.2f}'.format(p.get_height()), (p.get_x()+0.2, p.get_height()+30))

plt.show()

**Correlation Matrix**

In [None]:
plt.figure(figsize=(20,10))
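# Restrict the correlation to numeric columns; recent pandas versions raise an error on string columns.
ax = sns.heatmap(df_train.corr(numeric_only=True), annot=True, square=True, cmap='inferno')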
plt.show()

# Feature Engineering

Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. This process has two main goals:

- Preparing the proper input dataset, compatible with the machine learning algorithm requirements.
- Improving the performance of machine learning models.


In [None]:
# Join training and test data to apply data mining techniques.
df_train_aux = df_train.copy()
df_test_aux = df_test.copy()
df_train_aux['Source_Data'] = 'Train'
df_test_aux['Source_Data'] = 'Test'

df_data = pd.concat([df_train_aux, df_test_aux], ignore_index=True)

In [None]:
df_data

In [None]:
df_data.isnull().sum()

**Resolve naming discrepancies on 'Item_Fat_Content' variable.**

In [None]:
df_data['Item_Fat_Content'] = df_data['Item_Fat_Content'].replace({'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'})

df_data['Item_Fat_Content'].value_counts()

**Treat Missing Values (NaN)**

- Item_Weight

    Judging from the boxplot of the Item_Weight variable, it is reasonable to assume that it has an approximately normal (Gaussian) distribution. In this case, the missing values can be replaced by the median of the Item_Weight column.

In [None]:
df_data['Item_Weight'].median()

In [None]:
# Replace missing values on Item_Weight column.
df_data['Item_Weight'] = df_data['Item_Weight'].fillna(df_data['Item_Weight'].median())

In [None]:
df_data.isnull().sum()

- Outlet_Size
    
    The missing values of the Outlet_Size column will be replaced by 'Medium', because it is the most frequent value in the column.

In [None]:
df_data['Outlet_Size'].value_counts()

In [None]:
# Replace missing values on Outlet_Size column.
df_data['Outlet_Size'] = df_data['Outlet_Size'].fillna('Medium')

In [None]:
df_data.isnull().sum()

In [None]:
df_data

- Item_Visibility

    The Item_Visibility column contains some items with a value of 0 (no visibility), but every item needs to be visible to customers. This suggests that those items were not available and were marked as 0. Therefore, these zeros need to be treated as missing values.

In [None]:
# Number of items marked with visibility 0.
df_data[df_data['Item_Visibility'] == 0]['Item_Visibility'].count()

Again, the approach is to replace those values with the median of the column, since the boxplot of Item_Visibility shows some outliers and the median is less sensitive to outliers than the mean.
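
The difference between the two statistics can be seen on a small made-up sample (the values below are illustrative, not taken from the dataset): a single extreme value pulls the mean upward while leaving the median unchanged.

```python
import numpy as np

# Hypothetical visibility values; the last entry of the second array plays the role of an outlier.
visibility = np.array([0.02, 0.03, 0.04, 0.05, 0.06])
visibility_with_outlier = np.array([0.02, 0.03, 0.04, 0.05, 0.30])

mean_before, mean_after = visibility.mean(), visibility_with_outlier.mean()
median_before, median_after = np.median(visibility), np.median(visibility_with_outlier)

# The mean moves from 0.040 to 0.088, while the median stays at 0.040.
print('Mean:   {:.3f} -> {:.3f}'.format(mean_before, mean_after))
print('Median: {:.3f} -> {:.3f}'.format(median_before, median_after))
```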

In [None]:
df_data['Item_Visibility'].median()

In [None]:
# Replace values 0 on Item_Visibility column.
df_data.loc[df_data['Item_Visibility']<=0 , 'Item_Visibility'] = df_data['Item_Visibility'].median()

In [None]:
df_data

- Outlet_Establishment_Year

In [None]:
df_data['Outlet_Establishment_Year'].value_counts()

In [None]:
df_data['Outlet_Years'] = 2013 - df_data['Outlet_Establishment_Year']
df_data['Outlet_Years'].describe()

- Item_Type

In [None]:
df_data['Item_Type'].value_counts()

- Item_Identifier

The item types can be grouped into three main categories: Food, Drink and Non-Consumable.

In [None]:
df_data['Item_Identifier'].value_counts()

The item identifiers start with either 'FD' (Food), 'DR' (Drink) or 'NC' (Non-Consumable).

In [None]:
# Get only the first two characters.
df_data['New_Item_Type'] = df_data['Item_Identifier'].str[:2]

In [None]:
# Rename the 'New_Item_Type' to more intuitive categories.
df_data['New_Item_Type'] = df_data['New_Item_Type'].map({'FD':'Food', 'NC':'Non-Consumable', 'DR':'Drink'}) 

df_data['New_Item_Type'].value_counts()

Mark non-consumables as separate category in 'Item_Fat_Content'.

In [None]:
df_data.loc[df_data['New_Item_Type'] == 'Non-Consumable', 'Item_Fat_Content'] = 'Non-Edible'

df_data['Item_Fat_Content'].value_counts()

Calculate the mean visibility of each product across outlets, then express each row's visibility relative to that mean.
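
As a rough illustration of this ratio, here is a sketch on a made-up three-row frame (the rows and the `Visibility_Ratio` column name are hypothetical). A value above 1 means the product is displayed more prominently in that row's outlet than it is on average.

```python
import pandas as pd

# Toy data: one product stocked in two outlets, another in a single outlet.
toy = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'FDA15', 'DRC01'],
    'Item_Visibility': [0.02, 0.04, 0.10],
})

# Mean visibility of each product, broadcast back to every row of that product.
mean_per_item = toy.groupby('Item_Identifier')['Item_Visibility'].transform('mean')

# Ratio of the row's visibility to the product's mean visibility.
toy['Visibility_Ratio'] = toy['Item_Visibility'] / mean_per_item
print(toy)
```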

In [None]:
item_visibility_avg = df_data.pivot_table(values='Item_Visibility', index='Item_Identifier')

item_visibility_avg

In [None]:
# Divide each row's visibility by the mean visibility of the same product across outlets.
df_data['Item_Visibility_Avg'] = df_data['Item_Visibility'] / df_data['Item_Identifier'].map(item_visibility_avg['Item_Visibility'])

df_data

**Transform categorical variables**
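
One-hot encoding replaces each categorical column with one indicator column per category. A minimal sketch on a made-up frame (the rows are illustrative, and `dtype=int` is chosen here only so that the indicators print as 0/1 rather than booleans):

```python
import pandas as pd

toy = pd.DataFrame({
    'Outlet_Size': ['Small', 'Medium', 'High', 'Medium'],
    'Item_Weight': [9.3, 5.92, 17.5, 19.2],
})

# Each category of 'Outlet_Size' becomes its own 0/1 column; numeric columns are left untouched.
encoded = pd.get_dummies(toy, columns=['Outlet_Size'], prefix=['Outlet'], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The new column names are built from the prefix and the category value, so the prefix list should line up one-to-one with the encoded columns.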

In [None]:
# Use one-hot encoding.
df_data = pd.get_dummies(
    df_data,
    columns=['Item_Fat_Content', 'Outlet_Identifier', 'Outlet_Size',
             'Outlet_Location_Type', 'Outlet_Type', 'New_Item_Type'],
    # One prefix per encoded column, in the same order as `columns`.
    prefix=['Item', 'Outlet', 'Outlet', 'Outlet', 'Outlet', 'Item'],
)

In [None]:
df_data.iloc[:, :15]

In [None]:
df_data.columns.tolist()

# Model Building

In [None]:
# Remove columns that will not be used for model training.
df_mdl = df_data.drop(columns=['Item_Identifier', 'Item_Type', 'Outlet_Establishment_Year'])

df_mdl

In [None]:
# Split the data in training and test sets.
df_mdl_train = df_mdl.loc[df_mdl['Source_Data'] == 'Train']
df_mdl_test = df_mdl.loc[df_mdl['Source_Data'] == 'Test']

In [None]:
df_mdl_train

In [None]:
df_mdl_test

In [None]:
# Remove columns that will not be used.
df_mdl_train = df_mdl_train.drop(columns=['Source_Data'])
df_mdl_test = df_mdl_test.drop(columns=['Item_Outlet_Sales', 'Source_Data'])


In [None]:
df_mdl_test

In [None]:
x_train = df_mdl_train.drop(columns=['Item_Outlet_Sales']).to_numpy()
y_train = df_mdl_train['Item_Outlet_Sales'].to_numpy()

x_test = df_mdl_test.to_numpy()

In [None]:
# Standardize features.
scaler = StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)

x_test = scaler.transform(x_test)

# Random Forest

Random forests are an ensemble learning method for classification, regression and other tasks. They construct a multitude of decision trees at training time and output either the most frequent class among the individual trees (classification) or the mean of their predictions (regression).
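
The averaging step for regression can be checked directly. The sketch below uses synthetic data from `make_regression` (an assumption for illustration, not the BigMart data) and compares the forest's prediction with the mean of its individual trees, which scikit-learn exposes through the `estimators_` attribute.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, used only for illustration.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X, y)

# Average the predictions of the individual decision trees.
tree_mean = np.mean([tree.predict(X) for tree in forest.estimators_], axis=0)

# The forest's own prediction matches that average up to floating-point error.
print(np.allclose(tree_mean, forest.predict(X)))
```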

In [None]:
reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(x_train, y_train)

In [None]:
y_pred = reg.predict(x_test)
y_pred

In [None]:
df_result = pd.DataFrame(columns=['Item_Identifier', 'Outlet_Identifier', 'Item_Outlet_Sales'])
df_result['Item_Identifier'] = df_test['Item_Identifier']
df_result['Outlet_Identifier'] = df_test['Outlet_Identifier']
df_result['Item_Outlet_Sales'] = y_pred

df_result.to_csv('result.csv', index=False)

In [None]:
df_result

In [None]:
reg.score(x_train, y_train)

# Considerations

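The random forest obtained a coefficient of determination (R²) of about 0.9 on the training data. Since this score is computed on the same data the model was fitted on, it does not measure performance on unseen data.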
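
One way to estimate performance on unseen data is to hold out part of the labelled data for validation. The following is a minimal sketch on synthetic data (generated with `make_regression`, not the BigMart data; the sizes and seeds are arbitrary), using `train_test_split` and `mean_squared_error` from scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared features and target.
X, y = make_regression(n_samples=500, n_features=10, noise=25.0, random_state=42)

# Keep 20% of the rows aside; the model never sees them during fitting.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_fit, y_fit)

# R2 on the rows used for fitting versus the held-out rows.
r2_fit = model.score(X_fit, y_fit)
r2_val = model.score(X_val, y_val)
rmse_val = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))

print('R2 on fitted data:     {:.3f}'.format(r2_fit))
print('R2 on held-out data:   {:.3f}'.format(r2_val))
print('RMSE on held-out data: {:.2f}'.format(rmse_val))
```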

However, other methods can also be applied to this dataset, for example:

- Support Vector Regressor (SVR)
- Bagging Regression
- Gradient Boosting 
- Artificial Neural Networks (ANN)
- Among others.
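
As a rough sketch of how such alternatives could be compared, the snippet below scores a few scikit-learn regressors with 5-fold cross-validation on synthetic data (not the BigMart data). The hyperparameters are mostly library defaults and the candidate list is only an illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data, used only for illustration.
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)

candidates = {
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=0),
    'Bagging': BaggingRegressor(random_state=0),
    'Gradient Boosting': GradientBoostingRegressor(random_state=0),
    # SVR is sensitive to feature scale, so it is wrapped in a scaling pipeline.
    'SVR': make_pipeline(StandardScaler(), SVR()),
}

# Mean R2 across 5 folds for each candidate model.
results = {name: cross_val_score(model, X, y, cv=5, scoring='r2').mean()
           for name, model in candidates.items()}

for name, score in results.items():
    print('{}: {:.3f}'.format(name, score))
```

Scores obtained this way are only comparable when every model is evaluated on the same folds and the same features.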