# Implementing Semi-supervised learning (SSL) to predict sales price of a given product
### Source: https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/
### Data: Test.csv & Train.csv
### Problem Statement
The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store.

Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.

 

Please note that the data may have missing values as some stores might not report all the data due to technical glitches. Hence, it will be required to treat them accordingly.



In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import LabelEncoder

### Read data sets from file

In [9]:
train = pd.read_csv('./Test.csv')
test = pd.read_csv('./Train.csv')

### Preprocess data

Select features to be used for data modeling

In [10]:
# fields to keep
fields = [
    'Item_Weight',
    'Item_Fat_Content',
    'Outlet_Establishment_Year',
    'Outlet_Size',
    'Item_Visibility',
    'Item_MRP',
    'Outlet_Location_Type',
    'Outlet_Type',
    'Item_Outlet_Sales'
]

train = train[train.columns.intersection(fields)]
test = test[test.columns.intersection(fields)]

Transform data
- Impute missing weight with mean
- Reduce categorical values and encode them

In [11]:
# impute mean
train['Item_Weight'].fillna((train['Item_Weight'].mean()), inplace=True)
test['Item_Weight'].fillna((test['Item_Weight'].mean()), inplace=True)

# reduce fat content to two categories
train['Item_Fat_Content'] = train['Item_Fat_Content'].replace(
    ['low fat', 'LF'], ['Low Fat', 'Low Fat'])
train['Item_Fat_Content'] = train['Item_Fat_Content'].replace(['reg'], [
                                                              'Regular'])
test['Item_Fat_Content'] = test['Item_Fat_Content'].replace(
    ['low fat', 'LF'], ['Low Fat', 'Low Fat'])
test['Item_Fat_Content'] = test['Item_Fat_Content'].replace(['reg'], [
                                                            'Regular'])

# calculate establishment year
train['Outlet_Establishment_Year'] = 2013 - train['Outlet_Establishment_Year']
test['Outlet_Establishment_Year'] = 2013 - test['Outlet_Establishment_Year']
# default small for missing outlet_size
train['Outlet_Size'].fillna('Small', inplace=True)
test['Outlet_Size'].fillna('Small', inplace=True)

# label encoding cate. var.
col = ['Outlet_Size', 'Outlet_Location_Type',
       'Outlet_Type', 'Item_Fat_Content']
test['Item_Outlet_Sales'] = 0

combi = train.append(test)
number = LabelEncoder()
for i in col:
    combi[i] = number.fit_transform(combi[i].astype('str'))
    combi[i] = combi[i].astype('int')
train = combi[:train.shape[0]]
test = combi[train.shape[0]:]

Split data into X_train, y_train, X_test

In [12]:
# set features and labels
test = test.drop('Item_Outlet_Sales', axis=1)
y_train = train['Item_Outlet_Sales']
train = train.drop('Item_Outlet_Sales', axis=1)

features = train.columns
target = 'Item_Outlet_Sales'

X_train, X_test = train, test

   Item_Fat_Content  Item_MRP  Item_Visibility  Item_Weight  \
0                 0  107.8622         0.007565    20.750000   
1                 1   87.3198         0.038428     8.300000   
2                 0  241.7538         0.099575    14.600000   
3                 0  155.0340         0.015388     7.315000   
4                 1  234.2300         0.118599    12.695633   

   Outlet_Establishment_Year  Outlet_Location_Type  Outlet_Size  Outlet_Type  
0                         14                     0            1            1  
1                          6                     1            2            1  
2                         15                     2            2            0  
3                          6                     1            2            1  
4                         28                     2            1            3  


### Data Modeling

Get rough estimates by training different regressor algorithms with the labelled data.
- Select the algorithm that gives us the best result

In [13]:
from sklearn.linear_model import BayesianRidge, Ridge, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

In [14]:
model_factory = [
    RandomForestRegressor(),
    # XGBRegressor(nthread=1),
    # MLPRegressor(),
    Ridge(),
    BayesianRidge(),
    ExtraTreesRegressor(),
    ElasticNet(),
    KNeighborsRegressor(),
    GradientBoostingRegressor()
]

for model in model_factory:
    model.seed = 42
    num_folds = 3

    scores = cross_val_score(
        model,
        X_train,
        y_train,
        cv=num_folds,
        scoring='neg_mean_squared_error'
    )
    score_description = " %0.2f (+/- %0.2f)" % (np.sqrt(scores.mean()*-1), scores.std() * 2)

    print('{model:25} CV-5 RMSE: {score}'.format(
        model=model.__class__.__name__,
        score=score_description
    ))

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').