# Supervised Learning - Linear Regression

Do you remember the recipe for Machine Learning? Let me remind you once again!

* Define Problem : We start by defining the problem we are trying to solve. This can be as simple as predicting your next semester's result based on your previous results.
* Collect Data : The next step is to collect relevant data based on the problem definition. This can be your grades in different semesters.
* Prepare Data : The data collected for our problem is preprocessed. This can mean removing redundant grades and filling in the missing ones.
* Select Model (Algorithm) : After the data is ready, we proceed to select the machine learning model. The selection is based on the problem type (e.g. classification or regression) and the data that is available to us. In our case, the model can be a linear regression model.
* Train Model : The selected model is then trained to learn from the data we have collected.
* Evaluate Model : The final step is to evaluate the trained model for accuracy and view the results.

This is exactly what we are going to do here.
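
The recipe can be sketched end to end on a toy version of the semester-grades example. The grade values below are made up purely for illustration, and `LinearRegression` from scikit-learn is assumed as the model, so treat this as a minimal sketch rather than a full workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Collect data: hypothetical grades (out of 100) for semesters 1 to 5
semesters = np.array([1, 2, 3, 4, 5], dtype=float)
grades = np.array([62.0, np.nan, 71.0, 73.0, 78.0])

# Prepare data: drop the semester whose grade is missing
mask = ~np.isnan(grades)
X = semesters[mask].reshape(-1, 1)  # scikit-learn expects a 2-D feature array
y = grades[mask]

# Select and train the model
model = LinearRegression()
model.fit(X, y)

# Evaluate: R^2 on the training data, then predict semester 6
r2 = model.score(X, y)
next_grade = model.predict(np.array([[6.0]]))[0]
print(f"R^2 = {r2:.3f}, predicted grade for semester 6 = {next_grade:.1f}")
```

The score here is computed on the same rows the model was trained on, which is only acceptable for a toy example; with real data, a held-out test set gives a more honest estimate.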

## Step 1 - Define Problem

The data scientists at AwesomeMart have collected 2013 sales data for 1559 products across 10 stores in different cities. The aim is to build a predictive model and find out the sales of each product at a particular store using machine learning.

Using this model, AwesomeMart will try to understand the properties of products and stores which play a key role in increasing sales.

## Step 2 - Collect & Prepare Data

## Step 2.1 - Import Data & Primary Data Analysis

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

#Reading the dataset in a dataframe using Pandas
df = pd.read_csv("train.csv")


Now let us do some quick data analysis!

In [2]:
df.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [3]:
df.shape

(8523, 12)

In [4]:
df.describe()

| | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Item_Outlet_Sales |
|---|---|---|---|---|---|
| count | 7060.0 | 8523.0 | 8523.0 | 8523.0 | 8523.0 |
| mean | 12.857645 | 0.066132 | 140.992782 | 1997.831867 | 2181.288914 |
| std | 4.643456 | 0.051598 | 62.275067 | 8.37176 | 1706.499616 |
| min | 4.555 | 0.0 | 31.29 | 1985.0 | 33.29 |
| 25% | 8.77375 | 0.026989 | 93.8265 | 1987.0 | 834.2474 |
| 50% | 12.6 | 0.053931 | 143.0128 | 1999.0 | 1794.331 |
| 75% | 16.85 | 0.094585 | 185.6437 | 2004.0 | 3101.2964 |
| max | 21.35 | 0.328391 | 266.8884 | 2009.0 | 13086.9648 |


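Here are a few inferences you can draw from the output of the describe() function, together with df.shape:

* The average item price (Item_MRP) is about 141.
* The oldest outlet was established in 1985 and the newest in 2009.
* Sales per record (Item_Outlet_Sales) range from about 33 to about 13,087.
* Item_Weight has only 7,060 non-null values out of 8,523, so some weights are missing.
* The dataset has 8,523 records and 12 columns. These are product and store records, not 8,523 distinct products.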
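
The same kind of reading can be practised on a small frame. The sketch below uses five rows copied from the `df.head()` output (not the full dataset), so its numbers differ from the full summary; it only shows how individual statistics can be looked up in what `describe()` returns.

```python
import pandas as pd

# Five rows copied from the df.head() output; the full dataset has 8,523 rows
sample = pd.DataFrame({
    "Item_MRP": [249.8092, 48.2692, 141.618, 182.095, 53.8614],
    "Outlet_Establishment_Year": [1999, 2009, 1999, 1998, 1987],
    "Item_Outlet_Sales": [3735.138, 443.4228, 2097.27, 732.38, 994.7052],
})

summary = sample.describe()

# describe() returns a DataFrame indexed by statistic name,
# so single values can be looked up with .loc[statistic, column]
mean_mrp = summary.loc["mean", "Item_MRP"]
oldest_outlet = int(summary.loc["min", "Outlet_Establishment_Year"])
max_sales = summary.loc["max", "Item_Outlet_Sales"]
n_rows, n_cols = sample.shape

print(f"mean price: {mean_mrp:.2f}")
print(f"oldest outlet: {oldest_outlet}")
print(f"max sales: {max_sales:.2f}")
print(f"rows: {n_rows}, columns: {n_cols}")
```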
    
For the non-numerical values (e.g. Item_Fat_Content, Item_Type etc.), we can look at frequency distribution to understand whether they make sense or not. The frequency table can be printed by following command:

In [5]:
df['Item_Fat_Content'].value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64
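
The frequency table suggests that `LF` and `low fat` are alternative spellings of `Low Fat`, and that `reg` stands for `Regular`. Assuming that reading is correct, the labels can be merged before modeling. The sketch below uses a small made-up Series rather than the real column, and the mapping itself is an assumption.

```python
import pandas as pd

# Small made-up sample containing the five spellings seen in the frequency table
fat_content = pd.Series(["Low Fat", "Regular", "LF", "reg", "low fat", "Low Fat"])

# Map each variant spelling to its canonical label (assumed mapping)
label_map = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
cleaned = fat_content.replace(label_map)

# Only the two canonical categories should remain
counts = cleaned.value_counts()
print(counts)
```

Applied to the counts shown above, the same mapping would give 5089 + 316 + 112 = 5517 rows of `Low Fat` and 2889 + 117 = 3006 rows of `Regular`.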

## Step 2.2 - Finding & Imputing Missing Values

In [6]:
df.isnull().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

In [7]:
from sklearn.impute import SimpleImputer
# Fill missing Item_Weight values with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df[['Item_Weight']] = imputer.fit_transform(df[['Item_Weight']])

In [8]:
df['Outlet_Size'] = df['Outlet_Size'].fillna('Medium')

In [9]:
df.isnull().sum()

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

Awesome! Now we don't have any missing values.
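
The two strategies used in this step, the column mean for the numeric `Item_Weight` and a single fill value for the categorical `Outlet_Size`, can also be written with plain pandas. The sketch below uses a small made-up frame and fills the categorical column with its most frequent value; that this is the reason `'Medium'` was chosen is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps in both a numeric and a categorical column
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 19.2, np.nan],
    "Outlet_Size": ["Medium", "Medium", None, "High", "Small"],
})

# Numeric column: replace missing values with the mean of the observed values
weight_mean = toy["Item_Weight"].mean()
toy["Item_Weight"] = toy["Item_Weight"].fillna(weight_mean)

# Categorical column: replace missing values with the most frequent category
most_common_size = toy["Outlet_Size"].mode()[0]
toy["Outlet_Size"] = toy["Outlet_Size"].fillna(most_common_size)

# Every column should now report zero missing values
print(toy.isnull().sum())
```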

## Step 2.3 - Data Visualization

In [10]:
# plt.figure(figsize=(6,6))
# sns.boxplot(x = 'Item_MRP', y = 'Item_Outlet_Sales', data = df)

In [11]:
# plt.figure(figsize=(6,6))
# sns.barplot(x = 'Item_Weight', y = 'Item_Outlet_Sales', data = df)

In [12]:
# plt.figure(figsize=(6,6))
# sns.violinplot(x = 'Outlet_Size', y = 'Item_Outlet_Sales', data = df)

## Step 3 - Modeling
Since sklearn requires all inputs to be numeric, we should convert all our categorical variables into numeric ones by encoding the categories.
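
One common way to encode categories is one-hot encoding with `pd.get_dummies`, which creates one 0/1 column per category. The sketch below applies it to a small made-up frame; the column names mirror the dataset, but the rows are illustrative only.

```python
import pandas as pd

# Made-up rows: one categorical column and one numeric column
toy = pd.DataFrame({
    "Outlet_Size": ["Medium", "High", "Small", "Medium"],
    "Item_MRP": [249.81, 48.27, 141.62, 182.10],
})

# One new 0/1 column per category; the numeric column passes through unchanged
encoded = pd.get_dummies(toy, columns=["Outlet_Size"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one of the `Outlet_Size_*` columns set to 1, so the category is preserved without implying any numeric order between sizes.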

In [15]:
df.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | Medium | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [14]:
# Item_Identifier is a product ID with a very large number of distinct values,
# so it is inspected but not one-hot encoded here
df.Item_Identifier.value_counts()

# Merge the inconsistent spellings seen in value_counts() before encoding
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace({'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'})
df.Item_Fat_Content.value_counts()
# Bracket indexing is needed because the dummy column name contains a space
Item_Fat_Content_New = pd.get_dummies(df.Item_Fat_Content, prefix='Item_Fat_Content')['Item_Fat_Content_Low Fat']

# Item_Type has many categories, so keep every dummy column
df.Item_Type.value_counts()
Item_Type_New = pd.get_dummies(df.Item_Type, prefix='Item_Type')

# Encode the outlet attributes the same way
Outlet_Size_New = pd.get_dummies(df.Outlet_Size, prefix='Outlet_Size')
Outlet_Location_Type_New = pd.get_dummies(df.Outlet_Location_Type, prefix='Outlet_Location_Type')
Outlet_Type_New = pd.get_dummies(df.Outlet_Type, prefix='Outlet_Type')