
# Regression Benchmark

### Importing Libraries

In [1]:
#importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Importing Dataset

In [3]:
data=pd.read_csv('/content/drive/MyDrive/train_bm.csv')

In [4]:
data.shape

(8523, 12)

In [5]:
data.head()

,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [6]:
data.isnull().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

### Shuffling and Creating Train and Test Set

In [7]:
from sklearn.utils import shuffle

# Shuffling the Dataset
data = shuffle(data, random_state = 42)

#creating 4 divisions
div = int(data.shape[0]/4)

# 3 parts to train set and 1 part to test set
train = data.iloc[:3*div+1,:]
test = data.iloc[3*div+1:]

In [8]:
train.shape, test.shape, data.shape

((6391, 12), (2132, 12), (8523, 12))
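
The shuffle-and-slice split above can also be written with scikit-learn's `train_test_split`. The sketch below is only an illustration on a hypothetical toy DataFrame (`toy`, `toy_train`, and `toy_test` are made-up names, not part of this notebook). On the real data the row counts may differ by one from the manual slice because of how the 25% is rounded.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataset: 20 rows, one feature, one target.
toy = pd.DataFrame({
    "Item_MRP": np.linspace(50.0, 250.0, 20),
    "Item_Outlet_Sales": np.linspace(400.0, 4000.0, 20),
})

# Rows are shuffled by default; test_size=0.25 keeps one quarter for testing.
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=42)

print(toy_train.shape, toy_test.shape)
```

Both returned frames are independent of `toy`, so new columns can be added to them without a `SettingWithCopyWarning`.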

In [9]:
train.head()

,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
7503,FDI28,14.3,Low Fat,0.0263,Frozen Foods,79.4302,OUT013,1987,High,Tier 3,Supermarket Type1,1743.0644
2957,NCM17,7.93,Low Fat,0.071136,Health and Hygiene,42.7086,OUT046,1997,Small,Tier 1,Supermarket Type1,356.8688
7031,FDC14,14.5,Regular,0.041313,Canned,42.0454,OUT049,1999,Medium,Tier 1,Supermarket Type1,377.5086
1084,DRC36,,Regular,0.044767,Soft Drinks,173.7054,OUT027,1985,Medium,Tier 3,Supermarket Type3,5778.4782
856,FDS27,10.195,Regular,0.012456,Meat,197.511,OUT035,2004,Small,Tier 2,Supermarket Type1,2356.932


In [10]:
test.head()

,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
4904,FDC34,16.0,Regular,0.172759,Snack Foods,157.0972,OUT046,1997,Small,Tier 1,Supermarket Type1,4673.916
514,FDL48,19.35,Regular,0.082198,Baking Goods,49.2034,OUT013,1987,High,Tier 3,Supermarket Type1,729.051
7879,NCC06,19.0,Low Fat,0.027028,Household,127.0336,OUT049,1999,Medium,Tier 1,Supermarket Type1,3962.8416
409,DRP35,,Low Fat,0.090427,Hard Drinks,126.2336,OUT027,1985,Medium,Tier 3,Supermarket Type3,3195.84
6596,NCM29,11.5,Low Fat,0.017678,Health and Hygiene,129.2626,OUT045,2002,,Tier 2,Supermarket Type1,1705.1138


## Simple Mean (mean of Item_Outlet_Sales)

In [11]:
# storing the train-set mean in a new test-set column named "simple_mean"
# assign() returns a new DataFrame, so no SettingWithCopyWarning is raised
test = test.assign(simple_mean=train['Item_Outlet_Sales'].mean())


In [12]:
#calculating mean absolute error
from sklearn.metrics import mean_absolute_error as MAE

simple_mean_error = MAE(test['Item_Outlet_Sales'] , test['simple_mean'])
simple_mean_error

1369.1543806463005
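
The mean absolute error used here is the average absolute gap between the true and predicted values:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$$

As a minimal sketch, the same simple-mean baseline can be reproduced with scikit-learn's `DummyRegressor` and checked against the formula by hand. The toy arrays below are hypothetical and only stand in for `Item_Outlet_Sales`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical toy targets standing in for Item_Outlet_Sales.
y_train = np.array([100.0, 200.0, 300.0, 400.0])
y_test = np.array([150.0, 350.0])

# DummyRegressor ignores the features, so placeholder columns are enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Every prediction is the mean of y_train (250 for this toy data).
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
pred = baseline.predict(X_test)

# MAE by hand: average of the absolute differences.
mae_manual = np.mean(np.abs(y_test - pred))
mae_sklearn = mean_absolute_error(y_test, pred)
print(mae_manual, mae_sklearn)
```

Both values agree, which confirms that the `MAE` call above measures nothing more than the average absolute miss of a constant prediction.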

## Mean Item Outlet Sales with respect to Outlet_Type

In [13]:
out_type = pd.pivot_table(train, values='Item_Outlet_Sales', index = ['Outlet_Type'], aggfunc='mean')
out_type

Outlet_Type,Item_Outlet_Sales
Grocery Store,339.713738
Supermarket Type1,2269.172538
Supermarket Type2,1961.889047
Supermarket Type3,3735.211455


In [14]:
# initializing new column to zero (float, so the means fit without a dtype change)
# assign() returns a new DataFrame, so no SettingWithCopyWarning is raised
test = test.assign(Out_type_mean=0.0)

# For every unique entry in Outlet_Type
for i in train['Outlet_Type'].unique():
  # Assign the train-set mean for that outlet type to the matching test rows
  # (a single .loc call instead of chained indexing, which may write to a temporary copy)
  test.loc[test['Outlet_Type'] == i, 'Out_type_mean'] = train.loc[train['Outlet_Type'] == i, 'Item_Outlet_Sales'].mean()


In [15]:
#calculating mean absolute error
out_type_error = MAE(test['Item_Outlet_Sales'] , test['Out_type_mean'] )
out_type_error

1140.9971466514626
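
The per-category lookup can also be expressed without an explicit loop. The following is a sketch under the assumption that a `groupby` plus `Series.map` is acceptable in place of the loop; the toy frames are hypothetical and only reuse the notebook's column names.

```python
import pandas as pd

# Hypothetical toy frames with the same column names as the notebook.
toy_train = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Grocery Store",
                    "Supermarket Type1", "Supermarket Type1"],
    "Item_Outlet_Sales": [300.0, 500.0, 2000.0, 3000.0],
})
toy_test = pd.DataFrame({
    "Outlet_Type": ["Supermarket Type1", "Grocery Store"],
    "Item_Outlet_Sales": [2600.0, 350.0],
})

# One mean per outlet type, computed on the training rows only.
type_means = toy_train.groupby("Outlet_Type")["Item_Outlet_Sales"].mean()

# map() looks up each test row's outlet type in that Series.
# A type that never appears in the training data would come back as NaN.
toy_test = toy_test.assign(Out_type_mean=toy_test["Outlet_Type"].map(type_means))
print(toy_test)
```

One difference from the loop is worth noting: unseen categories end up as `NaN` rather than keeping the initial zero, so they would need an explicit fallback before computing the error.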

## Mean Item Outlet Sales with respect to Outlet_Establishment_Year

In [16]:
out_year = pd.pivot_table(train, values='Item_Outlet_Sales', index = ['Outlet_Establishment_Year'], aggfunc='mean')
out_year

Outlet_Establishment_Year,Item_Outlet_Sales
1985,2506.409498
1987,2255.42999
1997,2213.285178
1998,328.811495
1999,2316.032587
2002,2163.635786
2004,2373.476783
2007,2295.841815
2009,1961.889047


In [17]:
# initializing new column to zero (float, so the means fit without a dtype change)
# assign() returns a new DataFrame, so no SettingWithCopyWarning is raised
test = test.assign(Out_year_mean=0.0)

# For every unique entry in Outlet_Establishment_Year
for i in train['Outlet_Establishment_Year'].unique():
  # Assign the train-set mean for that establishment year to the matching test rows
  # (a single .loc call instead of chained indexing, which may write to a temporary copy)
  test.loc[test['Outlet_Establishment_Year'] == i, 'Out_year_mean'] = train.loc[train['Outlet_Establishment_Year'] == i, 'Item_Outlet_Sales'].mean()


In [18]:
#calculating mean absolute error
out_year_error = MAE(test['Item_Outlet_Sales'] , test['Out_year_mean'] )
out_year_error

1265.9192742645316

## Mean Item Outlet Sales with respect to Outlet_Location_Type

In [19]:
out_loc = pd.pivot_table(train, values = 'Item_Outlet_Sales', index = ['Outlet_Location_Type'], aggfunc = 'mean')
out_loc

Outlet_Location_Type,Item_Outlet_Sales
Tier 1,1839.154066
Tier 2,2276.975327
Tier 3,2281.328336


In [20]:
# Initializing new column to zero (float, so the means fit without a dtype change)
# assign() returns a new DataFrame, so no SettingWithCopyWarning is raised
test = test.assign(out_loc_mean=0.0)

# For every unique entry in Outlet_Location_Type
for i in train['Outlet_Location_Type'].unique():
  # Calculate the train-set mean for that location type and assign it to the matching test rows
  # (a single .loc call instead of chained indexing, which may write to a temporary copy)
  test.loc[test['Outlet_Location_Type'] == i, 'out_loc_mean'] = train.loc[train['Outlet_Location_Type'] == i, 'Item_Outlet_Sales'].mean()


In [21]:
#calculating mean absolute error
out_loc_error = MAE(test['Item_Outlet_Sales'] , test['out_loc_mean'] )
out_loc_error

1363.6202533673845
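
The grouped baselines in this notebook all follow the same pattern: compute group means on the train set, look them up for each test row, and score the result. A rough sketch of one reusable function is shown below. `group_mean_mae` is a hypothetical helper, not part of the notebook, and the fallback to the overall train mean for unseen groups is an added assumption. The toy frames are made up for illustration.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error


def group_mean_mae(train_df, test_df, keys, target="Item_Outlet_Sales"):
    """MAE of predicting each test row with the train-set mean of its group.

    Hypothetical helper: groups missing from train_df fall back to the
    overall train mean (an assumption, the notebook has no such fallback).
    """
    keys = [keys] if isinstance(keys, str) else list(keys)
    means = train_df.groupby(keys)[target].mean().rename("group_mean").reset_index()
    # A left merge keeps every test row, in its original order.
    merged = test_df.merge(means, on=keys, how="left")
    pred = merged["group_mean"].fillna(train_df[target].mean())
    return mean_absolute_error(merged[target], pred)


# Hypothetical toy data reusing the notebook's column names.
toy_train = pd.DataFrame({
    "Outlet_Location_Type": ["Tier 1", "Tier 1", "Tier 3", "Tier 3"],
    "Outlet_Establishment_Year": [1999, 1997, 1985, 1985],
    "Item_Outlet_Sales": [1000.0, 2000.0, 3000.0, 5000.0],
})
toy_test = pd.DataFrame({
    "Outlet_Location_Type": ["Tier 1", "Tier 3", "Tier 2", "Tier 1"],
    "Outlet_Establishment_Year": [1999, 1985, 2004, 1997],
    "Item_Outlet_Sales": [1100.0, 3800.0, 2650.0, 1900.0],
})

mae_loc = group_mean_mae(toy_train, toy_test, "Outlet_Location_Type")
mae_both = group_mean_mae(
    toy_train, toy_test, ["Outlet_Location_Type", "Outlet_Establishment_Year"]
)
print(mae_loc, mae_both)
```

Passing a list of column names groups on several columns at once, so the same helper covers single-column and combined baselines.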

## Mean Item_Outlet_Sales with respect to both Outlet_Location_Type and Outlet_Establishment_Year

In [22]:
combo = pd.pivot_table(train, values = 'Item_Outlet_Sales', index = ['Outlet_Location_Type','Outlet_Establishment_Year'], aggfunc = 'mean')
combo

Outlet_Location_Type,Outlet_Establishment_Year,Item_Outlet_Sales
Tier 1,1985,350.670086
Tier 1,1997,2213.285178
Tier 1,1999,2316.032587
Tier 2,2002,2163.635786
Tier 2,2004,2373.476783
Tier 2,2007,2295.841815
Tier 3,1985,3735.211455
Tier 3,1987,2255.42999
Tier 3,1998,328.811495
Tier 3,2009,1961.889047


In [23]:
# Initializing new column to zero (float, so the means fit without a dtype change)
# assign() returns a new DataFrame, so no SettingWithCopyWarning is raised
test = test.assign(Super_mean=0.0)

# Assigning variables to strings (to shorten code length)
s2 = 'Outlet_Location_Type'
s1 = 'Outlet_Establishment_Year'

# For every unique value in s1
for i in test[s1].unique():
  # For every unique value in s2
  for j in test[s2].unique():
    # Rows matching both values of s1 and s2 simultaneously
    test_mask = (test[s1] == i) & (test[s2] == j)
    train_mask = (train[s1] == i) & (train[s2] == j)
    # Skip combinations that do not occur in the test set
    if test_mask.any():
      # Assign the train-set mean of that combination to the matching test rows
      test.loc[test_mask, 'Super_mean'] = train.loc[train_mask, 'Item_Outlet_Sales'].mean()


In [24]:
#calculating mean absolute error
super_mean_error = MAE(test['Item_Outlet_Sales'] , test['Super_mean'] )
super_mean_error

1140.3845730546504
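
To compare the five baselines side by side, the errors reported in the cell outputs above can be collected into one table. This is only a convenience sketch: the numbers are copied from those outputs and rounded to two decimals, and the `improvement_pct` column name is made up for illustration.

```python
import pandas as pd

# MAE values copied from the cell outputs above, rounded to two decimals.
errors = pd.Series({
    "simple_mean": 1369.15,
    "Outlet_Type": 1141.00,
    "Outlet_Establishment_Year": 1265.92,
    "Outlet_Location_Type": 1363.62,
    "Location_Type + Establishment_Year": 1140.38,
}, name="MAE")

# Relative improvement over the simple-mean baseline, in percent.
summary = pd.DataFrame({
    "MAE": errors,
    "improvement_pct": (1 - errors / errors["simple_mean"]) * 100,
}).sort_values("MAE")
print(summary.round(2))
```

Sorted this way, the combined location-and-year baseline and the Outlet_Type baseline sit at the top, less than one MAE unit apart, while Outlet_Location_Type alone barely improves on the simple mean.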