# **Regression – Prediction of Grocery Sales**
- **Author:** Yvon Bilodeau
- **Last updated:** August 2022
---

## **Project Description**

### **Overview**

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and predict the sales of each product at a particular outlet.

Using this model, BigMart will try to understand the properties of products and outlets which play a key role in increasing sales.

### **Data Source**



The data was sourced from [analyticsvidhya.com](https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/).

There are 8523 rows, and 12 columns.
The rows represent 8523 observations, and the columns represent 11 features and 1 target variable.

### **Data Dictionary**


- **Item_Identifier:** Unique product ID
- **Item_Weight:** Weight of product
- **Item_Fat_Content:** Whether the product is low fat or regular
- **Item_Visibility:** The percentage of total display area of all products in store allocated to the particular product
- **Item_Type:** The category to which the product belongs
- **Item_MRP:** Maximum Retail Price (list price) of the product
- **Outlet_Identifier:** Unique store ID
- **Outlet_Establishment_Year:** The year in which store was established
- **Outlet_Size:** The size of the store in terms of ground area covered
- **Outlet_Location_Type:** The type of area in which the store is located
- **Outlet_Type:** Whether the outlet is a grocery store or some sort of supermarket
- **Item_Outlet_Sales:** Sales of the product in the particular store. This is the target variable to be predicted.

## **Import Libraries | Load the Dataset**

### **Import Libraries**

In [None]:
import numpy as np
import pandas as pd

from scipy import stats
# StatsModels
import statsmodels.api as sm

import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter
price_0_fmt = StrMethodFormatter("${x:,.0f}")
price_2_fmt = StrMethodFormatter("${x:,.2f}")
perc_0_fmt = StrMethodFormatter('{x:.0%}')
perc_2_fmt = StrMethodFormatter('{x:.2%}') 
weight_fmt= StrMethodFormatter("{x:.6}")
density_fmt= StrMethodFormatter("{x:.7}")
# Fixed-point formats: 2 decimal places and 0 decimal places
decimal_2_fmt = StrMethodFormatter('{x:,.2f}')
decimal_0_fmt = StrMethodFormatter('{x:,.0f}')

import seaborn as sns

import missingno as msno

# Preprocessing
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Dummy Regression
from sklearn.dummy import DummyRegressor
# Linear Regression
from sklearn.linear_model import LinearRegression
# Lasso Regression
from sklearn.linear_model import Lasso
# Ridge Regression
from sklearn.linear_model import Ridge
# Elastic Net Regression
from sklearn.linear_model import ElasticNet
# Decision Trees
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree
# Bagged Trees
from sklearn.ensemble import BaggingRegressor
# Support Vector Regression
from sklearn.svm import LinearSVR
from sklearn.svm import SVR
# K Neighbors Regression
from sklearn.neighbors import KNeighborsRegressor
# Random Forests
from sklearn.ensemble import RandomForestRegressor
# Gradient Boost
from sklearn.ensemble import GradientBoostingRegressor
# Light Gradient Boost
from lightgbm import LGBMRegressor
# XGBoost - eXtreme Gradient Boost
from xgboost import XGBRegressor

# Import GridSearch for Model Hypertuning
from sklearn.model_selection import GridSearchCV

# Regression Metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

# Set global scikit-learn configuration 
from sklearn import set_config
# Display estimators as a diagram in a Jupyter lab or notebook context
set_config(display='diagram') # {'text', 'diagram'}, default=None

# Calculate the correlation/strength-of-association of features
# in a data-set with dython
from dython.nominal import associations

In [None]:
# Set display options
# Display all columns
pd.set_option('display.max_columns', None)
# Display full width of field
pd.set_option('display.max_colwidth', None)
# Display table across entire available view
pd.set_option('display.width', -1)
# Display all rows
pd.set_option('display.max_rows', None)

In [None]:
image_folder_path = "C:/Users/DELL/Documents/GitHub/Regression---Prediction-of-Grocery-Sales/Images/"

### **Load the Dataset**

In [None]:
# Load the dataset
url = 'https://raw.githubusercontent.com/YBilodeau/Regression---Prediction-of-Grocery-Sales/main/Data/Grocery_Sales.csv'
df = pd.read_csv(url)

In [None]:
# Create copies of the dataset for editing 
eda_df = df.copy()
ml_df = df.copy()

## **Inspect the Data**

### **Display the Row and Column Count**

In [None]:
# Display the number of rows and columns for the dataframe
df.shape
print(f'There are {df.shape[0]} rows, and {df.shape[1]} columns.')
print(f'The rows represent {df.shape[0]} observations, and the columns represent {df.shape[1]-1} features and 1 target variable.')

### **Display Data Types**

In [None]:
# Display the column names and datatypes for each column
# Columns with mixed datatypes are identified as an object datatype
df.dtypes

### **Display Column Names, Count of Non-Null Values, and Data Types**

In [None]:
# Display the column names, count of non-null values, and their datatypes
df.info()

## **Clean the Data**

### **Display First (5) Rows**

In [None]:
# Display the first (5) rows of the dataframe
df.head().style.format({ "Item_MRP":          price_2_fmt,
                         "Item_Outlet_Sales": price_2_fmt,
                         "Item_Visibility":   perc_2_fmt, 
                         "Item_Weight":       weight_fmt})

- The data appears to have loaded correctly.

### **Display the Descriptive Statistics**

In [None]:
# Display the descriptive statistics for the numeric columns
df.describe()

In [None]:
# Display the descriptive statistics for the non-numeric columns
df.describe(exclude="number")

### **Remove Unnecessary Columns**

#### **'Item_Identifier' column**

In [None]:
# Calculate the count of unique values for this column
unique_values = df.Item_Identifier.nunique()
# Display the count of unique values for this column
print(f'This column has {unique_values} unique values.')

- The high cardinality of this object column may adversely impact machine learning model prediction performance and processing times, as well as exaggerate its feature importance.
- Dropping it for machine learning will be reconsidered during Preprocessing.
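
To make the cardinality concern concrete, the sketch below one-hot encodes a small, made-up frame and counts the resulting columns. The `toy_df` values and the `one_hot_widths` name are illustrative assumptions, not taken from the dataset. Each distinct identifier becomes its own column, so a column with 1559 unique values would add roughly 1559 mostly-empty columns.

```python
import pandas as pd

# Hypothetical toy data: 6 rows, 5 distinct product IDs, 2 distinct outlets
toy_df = pd.DataFrame({
    'Item_Identifier':   ['FDA15', 'DRC01', 'FDN15', 'FDX07', 'NCD19', 'FDA15'],
    'Outlet_Identifier': ['OUT049', 'OUT018', 'OUT049', 'OUT018', 'OUT049', 'OUT018'],
})

# One-hot encode each column separately and record the resulting column count
one_hot_widths = {}
for column_name in toy_df.columns:
    encoded = pd.get_dummies(toy_df[column_name])
    one_hot_widths[column_name] = encoded.shape[1]
    print(f'{column_name}: {toy_df[column_name].nunique()} unique values '
          f'-> {encoded.shape[1]} one-hot columns')
```

The number of encoded columns grows with the number of unique values, not with the number of rows, which is why a near-unique identifier is usually a poor candidate for one-hot encoding.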

### **Remove Unnecessary Rows**

In [None]:
# Display the number of duplicate rows in the dataset
print(f'There are {df.duplicated().sum()} duplicate rows.')

- No duplicates were found or dropped.

### **Missing Values**

In [None]:
# Display missing values by column
msno.matrix(df, figsize=(16,3), labels=True, 
            fontsize=12, sort="descending", color=(0,0,0));

In [None]:
# Display the count of missing values by column
print(df.isna().sum())

In [None]:
# Display the percentage of missing values by column
print(df.isna().sum()/len(df)*100)

#### **'Item_Weight' column**


- 'Item_Weight' has 1463 (17.16%) missing values.
- All rows with the same 'Item_Identifier' should have the same 'Item_Weight', so missing values can be imputed from other rows that share the same 'Item_Identifier'.
- For EDA, this step can be applied now.
- For Machine Learning, this step will be applied after the train_test_split, using 'Item_Identifier's from the Train dataset only to prevent data leakage.
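
The leak-free variant described in the last bullet could look roughly like the sketch below. It is only an illustration under stated assumptions: `train_df`, `test_df`, and `impute_item_weight` are hypothetical names, the values are toy data, and the fallback to the training median for items never seen in training is an assumption rather than the method used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the train and test splits
train_df = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'FDA15', 'DRC01', 'DRC01'],
    'Item_Weight':     [9.3,     np.nan,  5.92,    5.92],
})
test_df = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'DRC01', 'NCD19'],
    'Item_Weight':     [np.nan,  np.nan,  np.nan],
})

# Learn the per-item weight and a fallback median from the training split only
train_weight_by_item = train_df.groupby('Item_Identifier')['Item_Weight'].mean()
train_median_weight = train_df['Item_Weight'].median()

def impute_item_weight(split_df):
    """Fill missing weights using values learned from the training split."""
    filled = split_df.copy()
    # Look up each row's item in the training-derived mapping
    looked_up = filled['Item_Identifier'].map(train_weight_by_item)
    filled['Item_Weight'] = filled['Item_Weight'].fillna(looked_up)
    # Assumption: items never seen in training fall back to the training median
    filled['Item_Weight'] = filled['Item_Weight'].fillna(train_median_weight)
    return filled

train_imputed = impute_item_weight(train_df)
test_imputed = impute_item_weight(test_df)
print(test_imputed)
```

Because both the lookup table and the fallback are computed from `train_df` alone, no information from the test rows influences the imputed values.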

In [None]:
# Calculate the rounded mean 'Item_Weight' for each row's 'Item_Identifier'
# (missing weights are ignored when the mean is calculated)
mean_item_weight = (eda_df.groupby('Item_Identifier')['Item_Weight']
                          .transform('mean')
                          .round(3))

# Fill only the missing 'Item_Weight' values with the mean for that item
eda_df['Item_Weight'] = eda_df['Item_Weight'].fillna(mean_item_weight)

In [None]:
# Identify any remaining 'Item_Identifier's without 'Item_Weight' for the eda_df
print(eda_df.Item_Weight.isnull().sum())
eda_df[eda_df.Item_Weight.isnull()]

- For EDA, the remaining (4) values can be imputed from the median value of the column.


In [None]:
# Calculate the median value for the column
median_item_weight = eda_df['Item_Weight'].median()

# Fill the column's missing values with the median value for the column
# (assign the result; inplace=True on a selected column may not update eda_df)
eda_df['Item_Weight'] = eda_df['Item_Weight'].fillna(median_item_weight)

In [None]:
# Confirm there are no remaining missing values for this column
print(eda_df.Item_Weight.isnull().sum())

#### **'Outlet_Size' column**
- 'Outlet_Size' has 2410 (28%) missing values out of 8523 rows, which is significant.

In [None]:
# Display the unique values for the column
print(df['Outlet_Size'].unique())

In [None]:
# Group the unique values by 'Outlet_Type' and 'Outlet_Location_Type'
df.groupby(['Outlet_Type','Outlet_Location_Type'])['Outlet_Size'].unique()

- It does not appear that we can impute 'Outlet_Size' based on 'Outlet_Type' and/or 'Outlet_Location_Type'.
- Missing values will be replaced with 'Unknown' for this Object Data Type.
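
One way to back up the conclusion above is to count, for each combination of 'Outlet_Type' and 'Outlet_Location_Type', how many distinct known sizes exist alongside the missing rows. The sketch below uses made-up values, and the names `toy_df`, `summary`, and `imputable` are hypothetical. A group would only be safely imputable if it had missing rows and exactly one known size.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing outlet sizes
toy_df = pd.DataFrame({
    'Outlet_Type': ['Grocery Store', 'Grocery Store', 'Supermarket Type1',
                    'Supermarket Type1', 'Supermarket Type1', 'Supermarket Type1'],
    'Outlet_Location_Type': ['Tier 1', 'Tier 3', 'Tier 2',
                             'Tier 2', 'Tier 2', 'Tier 1'],
    'Outlet_Size': ['Small', np.nan, 'Small', np.nan, 'Medium', 'Medium'],
})

group_columns = ['Outlet_Type', 'Outlet_Location_Type']
summary = toy_df.groupby(group_columns)['Outlet_Size'].agg(
    known_sizes='nunique',                  # distinct non-missing sizes in the group
    missing_rows=lambda s: s.isna().sum(),  # rows that would need imputing
)

# A group is unambiguous only if it has missing rows and exactly one known size
summary['imputable'] = (summary['missing_rows'] > 0) & (summary['known_sizes'] == 1)
print(summary)
```

In this toy example no group qualifies: the groups with missing rows have either zero or two known sizes, which mirrors the situation where a placeholder such as 'Unknown' is the safer choice.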

In [None]:
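# Replace missing values with 'Unknown'
# (assign the result; inplace=True on a selected column may not update the dataframe)
df['Outlet_Size'] = df['Outlet_Size'].fillna('Unknown')
eda_df['Outlet_Size'] = eda_df['Outlet_Size'].fillna('Unknown')
ml_df['Outlet_Size'] = ml_df['Outlet_Size'].fillna('Unknown')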

In [None]:
# Confirm missing values have been replaced
print(df['Outlet_Size'].unique())
print(eda_df['Outlet_Size'].unique())
print(ml_df['Outlet_Size'].unique())

### **Inspect Column Datatypes for Errors**

- Ensure all columns match the data types listed in the data dictionary.
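
A datatype check like this can also be automated. The sketch below is a minimal illustration: the `expected_kinds` mapping is an assumption inferred from the data dictionary, and `toy_df` is made-up data in which the year column is deliberately stored as text so that the check has something to report.

```python
import pandas as pd

# Assumed expected kind per column, inferred from the data dictionary
expected_kinds = {
    'Item_Weight': 'numeric',
    'Item_Fat_Content': 'text',
    'Outlet_Establishment_Year': 'numeric',
}

# Hypothetical toy frame; the year column is deliberately stored as text
toy_df = pd.DataFrame({
    'Item_Weight': [9.3, 5.92],
    'Item_Fat_Content': ['Low Fat', 'Regular'],
    'Outlet_Establishment_Year': ['1999', '2009'],
})

# Collect the columns whose actual kind differs from the expected kind
mismatches = []
for column_name, expected_kind in expected_kinds.items():
    is_numeric = pd.api.types.is_numeric_dtype(toy_df[column_name])
    actual_kind = 'numeric' if is_numeric else 'text'
    if actual_kind != expected_kind:
        mismatches.append(column_name)

print(f'Columns with unexpected datatypes: {mismatches}')
```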

In [None]:
# Display column names and datatypes
df.dtypes

In [None]:
# Display first (5) rows
df.head()

- All columns match their datatypes.

### **Inspect Column Names for Errors**

- Check for common syntax errors which may include extra white spaces at the beginning or end of strings or column names.

- Check for typos or inconsistencies in strings that need to be fixed (example: cat, Cat, cats).

In [None]:
# Display column names
df.columns

- No issues with column names noted.

### **Inspect Column Values for Errors**

#### **Object Datatypes**

- Check for common syntax errors which may include extra white spaces at the beginning or end of strings or column names.

- Check for typos or inconsistencies in strings that need to be fixed.
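
Both checks can be partly automated for string columns. The sketch below, on made-up values, trims whitespace and lowercases each value, then lists every normalized form that appears under more than one raw spelling. The names `toy_values`, `normalized`, and `inconsistent` are hypothetical.

```python
import pandas as pd

# Hypothetical raw category values with case and whitespace inconsistencies
toy_values = pd.Series(['Low Fat', 'low fat', 'LF', 'Regular', ' Regular', 'reg'])

# Normalize: trim surrounding whitespace and convert to lowercase
normalized = toy_values.str.strip().str.lower()

# Group the raw spellings by their normalized form
variants = toy_values.groupby(normalized).unique()

# Keep only the normalized forms that have more than one raw spelling
inconsistent = variants[variants.apply(len) > 1]
print(inconsistent)
```

This only catches differences in case and surrounding whitespace. Abbreviations such as 'LF' or 'reg' normalize to their own groups, so they still need the visual inspection of unique values used in the cells that follow.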

In [None]:
# Display the descriptive statistics for the non-numeric columns
df.describe(exclude="number")

In [None]:
# Create a series of the datatypes
data_types = df.dtypes
# Create a filter to select only the object datatypes
object_data_types = data_types[(data_types == "object")]
# Display the series of object datatypes
object_data_types

##### **'Item_Identifier' column**

In [None]:
# Display the unique values from the column
print(df['Item_Identifier'].unique())

- This column has 1559 unique values, which is too many to inspect visually.

##### **'Item_Fat_Content' column**

In [None]:
# Display the unique values from the column
print(df['Item_Fat_Content'].unique())

In [None]:
df.groupby(['Item_Fat_Content'])['Item_Fat_Content'].count()

- Replace 'low fat' and 'LF' with 'Low Fat'.
- Replace 'reg' with 'Regular'.

In [None]:
# Define a dictionary with key/value pairs
# (named to avoid shadowing the built-in dict type)
fat_content_map = {"low fat": 'Low Fat', "reg": 'Regular', "LF": 'Low Fat'}

In [None]:
# Replace values using dictionary
eda_df.replace({'Item_Fat_Content': fat_content_map}, inplace = True)
ml_df.replace({'Item_Fat_Content': fat_content_map}, inplace = True)

In [None]:
# Display unique values to confirm they have been updated
print(eda_df['Item_Fat_Content'].unique())
print(ml_df['Item_Fat_Content'].unique())

- Replacement values confirmed.

##### **'Item_Type' column**

In [None]:
# Display the unique values from the column
print(df['Item_Type'].unique())

In [None]:
df.groupby(['Item_Type'])['Item_Type'].count()

- No issues noted.

##### **'Outlet_Identifier' column**

In [None]:
# Display the unique values from the column
print(df['Outlet_Identifier'].unique())

In [None]:
df.groupby(['Outlet_Identifier'])['Outlet_Identifier'].count()

- No issues noted.

##### **'Outlet_Size' column**

In [None]:
# Display the unique values from the column
print(df['Outlet_Size'].unique())

In [None]:
df.groupby(['Outlet_Size'])['Outlet_Size'].count()

- No issues noted.

##### **'Outlet_Location_Type' column**

In [None]:
# Display the unique values from the column
print(df['Outlet_Location_Type'].unique())

In [None]:
df.groupby(['Outlet_Location_Type'])['Outlet_Location_Type'].count()

- No issues noted.

##### **'Outlet_Type' column**

In [None]:
# Display the unique values from the column
print(df['Outlet_Type'].unique())

In [None]:
df.groupby(['Outlet_Type'])['Outlet_Type'].count()

- No issues noted.

#### **Numerical Datatypes**

In [None]:
# Display the descriptive statistics for the numeric columns
df.describe().round(3)

- **'Item_Weight'** - No unusual statistics were noted.
- **'Item_Visibility'** - The percentage of total display area of all products in a store allocated to the particular product appears to have an extreme range (minimum value of 0.000 and maximum value of 0.328).
- **'Item_MRP'** - No unusual statistics were noted.
- **'Outlet_Establishment_Year'** - No unusual statistics were noted.
- **'Item_Outlet_Sales'** - No unusual statistics were noted.
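
The minimum 'Item_Visibility' of 0.000 may deserve a closer look, since a value of exactly zero would mean the product was given no display area at all. A quick way to size the issue is to count the zero rows, sketched below on made-up values. The names `toy_df`, `zero_mask`, `zero_count`, and `zero_share` are hypothetical.

```python
import pandas as pd

# Hypothetical visibility values, including two exact zeros
toy_df = pd.DataFrame({'Item_Visibility': [0.0, 0.016, 0.0, 0.328, 0.054]})

# Flag the rows where visibility is exactly zero
zero_mask = toy_df['Item_Visibility'] == 0

# Count the flagged rows and calculate their share of all rows
zero_count = int(zero_mask.sum())
zero_share = zero_mask.mean()
print(f'Rows with zero visibility: {zero_count} ({zero_share:.1%})')
```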


## **Exploratory Data Analysis**

#### **Functions**

##### **Statistics Function**

In [None]:
# Create a function to display supplemental statistics 
def column_statistics(df, column_name, max_unique_values_to_disply=20):
    # Display the count and percentage of missing values for this column
    missing_count = df[column_name].isna().sum()
    missing_percentage = round(missing_count / df.shape[0] * 100, 1)
    print(f'Missing Values: {missing_count} ({missing_percentage}%)')

    # Determine Outliers - Only if this is a numeric column
    if (df[column_name].dtype == 'int64') or (df[column_name].dtype == 'float64'):
        # Create outlier filters
        q1 = df[column_name].quantile(0.25) # 25th percentile
        q3 = df[column_name].quantile(0.75) # 75th percentile
        iqr = q3 - q1 # Interquartile range
        low_limit = q1 - (1.5 * iqr) # low limit
        high_limit = q3 + (1.5 * iqr) # high limit
        # Create outlier dataframes
        low_df = df[(df[column_name] < low_limit)]
        high_df = df[(df[column_name] > high_limit)]
        # Calculate the outlier counts and percentages
        low_outlier_count = low_df.shape[0]
        low_outlier_percentage = round(low_outlier_count / df.shape[0] * 100, 1)
        high_outlier_count = high_df.shape[0]
        high_outlier_percentage = round(high_outlier_count / df.shape[0] * 100, 1)
        # Display the outlier counts.
        print(f'Outliers: {low_outlier_count} ({low_outlier_percentage}%) low, {high_outlier_count} ({high_outlier_percentage}%) high')
        
    # Display the count of unique values for this column
    print(f'Unique values: {df[column_name].nunique()}')

    # Display the unique values including NaN and their counts for this column,
    # if the number of unique values is below the function parameter
    if df[column_name].nunique() < max_unique_values_to_disply:
        print(df[column_name].value_counts(dropna=False))

##### **Function to Display Histogram**

In [None]:
# Create a function to create a Histogram
def hist_plot(df, column_name, bin_count='auto',
              fs=(8,4), file_name='',
              tit_lab='', x_lab='', y_lab='', 
              fmt='',
              hza='center', rot=0):
    # Variables
    feature = df[column_name]
    mean = feature.mean()
    median = feature.median()
    # Plot
    fig, ax = plt.subplots(nrows=1, figsize=fs, facecolor='w')
    sns.histplot(data=df, x=column_name, 
                 color='#069AF3', linewidth=2, bins=bin_count)
    # Title and labels
    plt.title(tit_lab, fontsize = 18, weight='bold')
    plt.xlabel(x_lab, fontsize = 14, weight='bold')
    plt.ylabel('Instances', fontsize = 14, weight='bold')
    # String format
    if fmt != '':
        ax.xaxis.set_major_formatter(fmt)
    # Ticks
    plt.xticks(fontsize=10, weight='bold')
    plt.yticks(fontsize=10, weight='bold')
    # Face 
    ax.set_facecolor('lightblue')
    # Spines
    for axis in ['top','bottom','left','right']:
        ax.spines[axis].set_linewidth(3)
    # Vertical lines
    ax.axvline(mean, color = 'magenta', linewidth=2, 
               label=f'Mean = {mean:,.2f}')
    ax.axvline(median, ls='dotted', color = 'darkmagenta',  linewidth=2, 
               label=f'Median = {median:,.2f}')
    # Layout
    ax.legend();
    plt.tight_layout()
    # Save an image of the plot (before plt.show(), which can clear the figure)
    if file_name != '':
        folder_file_name = image_folder_path + file_name
        fig.savefig(folder_file_name, format='png', 
                    facecolor='w', edgecolor='w')
    # Display
    plt.show();

##### **Function to Display a KDE Plot**

In [None]:
# Create a function to create a KDE Plot
def kde_plot(df, column_name,
             fs=(8,4), file_name='',
             tit_lab='', x_lab='', y_lab='',
             fmt='',
             hza='center', rot=0):    
    # Variables
    feature = df[column_name]
    mean = feature.mean()
    median = feature.median()
    # Plot
    fig, ax = plt.subplots(nrows=1, figsize=fs, facecolor='w')
    sns.kdeplot(data=df, x=column_name, 
                color='#069AF3', linewidth=2, fill=True)
    # Title and labels
    plt.title(tit_lab, fontsize = 18, weight='bold')
    plt.xlabel(x_lab, fontsize = 14, weight='bold')
    plt.ylabel('Density', fontsize = 14, weight='bold')
    # String format
    if fmt != '':
        ax.xaxis.set_major_formatter(fmt)
    # Ticks
    plt.xticks(fontsize=10, weight='bold')
    plt.yticks(fontsize=10, weight='bold')
    # Face 
    ax.set_facecolor('lightblue')
    # Spines
    for axis in ['top','bottom','left','right']:
        ax.spines[axis].set_linewidth(3)
    # Vertical lines
    ax.axvline(mean, color = 'magenta', linewidth=2, 
               label=f'Mean = {mean:,.2f}')
    ax.axvline(median, ls='dotted', color = 'darkmagenta',  linewidth=2, 
               label=f'Median = {median:,.2f}')
    # Layout
    ax.legend();
    plt.tight_layout()
    # Save an image of the plot (before plt.show(), which can clear the figure)
    if file_name != '':
        folder_file_name = image_folder_path + file_name
        fig.savefig(folder_file_name, format='png', 
                    facecolor='w', edgecolor='w')
    # Display
    plt.show();

##### **Function to Display a Boxplot**

In [None]:
# Create a function to create a Box Plot
def box_plot(df, column_name,
             fs=(8,4), file_name='',
             tit_lab='', x_lab='', y_lab='', 
             fmt='',
             hza='center', rot=0):    
    # Variables
    feature = df[column_name]
    mean = feature.mean()
    median = feature.median()
    # Plot
    fig, ax = plt.subplots(nrows=1, figsize=fs, facecolor='w')
    sns.boxplot(data=df, x=column_name, width=.5, color='#069AF3', ax=ax,
                medianprops={'color':'k', 'linewidth':2},
                whiskerprops={'color':'k', 'linewidth':2},
                boxprops={'facecolor':'#069AF3', 
                          'edgecolor':'k', 'linewidth':2},
                capprops={'color':'k', 'linewidth':3},
                flierprops={'marker':'o', 'markersize':8, 
                            'markerfacecolor':'#069AF3', 
                            'markeredgecolor':'k'}); 
    # Title and labels
    plt.title(tit_lab, fontsize = 18, weight='bold')
    plt.xlabel(x_lab, fontsize = 14, weight='bold')
    plt.ylabel('', fontsize = 14, weight='bold')
    # String format
    if fmt != '':
        ax.xaxis.set_major_formatter(fmt)
    # Ticks
    plt.xticks(fontsize=10, weight='bold')
    plt.yticks(fontsize=10, weight='bold')
    # Face 
    ax.set_facecolor('lightblue')
    # Spines
    for axis in ['top','bottom','left','right']:
        ax.spines[axis].set_linewidth(3)
    # Vertical lines
    ax.axvline(mean, color = 'magenta', linewidth=2, 
               label=f'Mean = {mean:,.2f}')
    ax.axvline(median, ls='dotted', color = 'darkmagenta',  linewidth=2, 
               label=f'Median = {median:,.2f}')

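    # Layout
    ax.legend();
    plt.tight_layout()
    # Save an image of the plot (before plt.show(), which can clear the figure)
    if file_name != '':
        folder_file_name = image_folder_path + file_name
        fig.savefig(folder_file_name, format='png', 
                    facecolor='w', edgecolor='w')
    # Display
    plt.show();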

##### **Function to Display Countplot**

In [None]:
# Create a function to create a Count Plot
def count_plot(df, column_name, label_order, 
               fs=(8,4), file_name='',
               tit_lab='', x_lab='', y_lab='', 
               hza='center', rot=0):       
    # Plot    
    fig, ax = plt.subplots(nrows=1, figsize=fs, facecolor='w')
    sns.countplot(data=df, x=column_name, lw=3, ec='k', 
                  color='#069AF3', order=label_order)
#     plt.xticks(weight='bold', rotation=rot, ha=hza)
    # Title and labels
    plt.title(tit_lab, fontsize = 18, weight='bold')
    plt.xlabel(x_lab, fontsize = 14, weight='bold')
    plt.ylabel('Instances', fontsize = 14, weight='bold')
    # Ticks
    plt.xticks(fontsize=12, weight='bold', ha=hza, rotation=rot,)
    plt.yticks(fontsize=10, weight='bold')
    # Face 
    ax.set_facecolor('lightblue')
    # Spines
    for axis in ['top','bottom','left','right']:
        ax.spines[axis].set_linewidth(3)
    # Layout
    plt.tight_layout()
    # Save an image of the plot (before plt.show(), which can clear the figure)
    if file_name != '':
        folder_file_name = image_folder_path + file_name
        fig.savefig(folder_file_name, format='png', 
                    facecolor='w', edgecolor='w')
    # Display
    plt.show();

##### **Function to Display Scatter Plot**

In [None]:
# Create a function to create a Scatter Plot
def scatter_plot(df, x, y, file_name='',
                 tit_lab='', x_lab='', y_lab=''):
    # Variables
    palette_dict = {0: 'indigo' , 1: 'magenta'}
    # Plot
    fig, ax = plt.subplots(figsize=(8,4), facecolor='w');
    sns.scatterplot(x=df[x], y=df[y]);
    # Regression Line
    m, b, *_ = stats.linregress(df[x], df[y])
    # Labels
    plt.title(tit_lab, fontsize = 22, weight='bold')
    plt.xlabel(x_lab, fontsize = 14, weight='bold')
    plt.ylabel(y_lab, fontsize = 14, weight='bold');
    # Ticks
    plt.xticks(fontsize = 10, weight='bold')
    plt.yticks(fontsize = 10, weight='bold');
    # Face 
    ax.set_facecolor('lightblue')
    # Spines
    for axis in ['top','bottom','left','right']:
        ax.spines[axis].set_linewidth(3)
    # String format (price_fmt was undefined; price_0_fmt is assumed to be the intended formatter)
    ax.yaxis.set_major_formatter(price_0_fmt);
    # Layout
    # plt.legend(bbox_to_anchor=(1.23, 1))
    plt.tight_layout()
    # Save an image of the plot (before plt.show(), which can clear the figure)
    if file_name != '':
        folder_file_name = image_folder_path + file_name
        fig.savefig(folder_file_name, format='png', 
                    facecolor='w', edgecolor='w')
    # Display
    plt.show();

##### **Function to Display Reg Plot**

In [None]:
# Create a function to create a Reg Plot
def reg_plot(df, x, y, file_name='',
             tit_lab='', x_lab='', y_lab='',
             x_fmt='', y_fmt=''):
    # Variables
    palette_dict = {0: 'indigo' , 1: 'magenta'}
    # Plot
    fig, ax = plt.subplots(figsize=(8,4), facecolor='w');
    sns.scatterplot(x=df[x], y=df[y]);
    sns.regplot(data=df, x=x,y=y,
                line_kws={'color':'black'},
                scatter_kws={'s':1}); 
    # Regression Line
    m, b, *_ = stats.linregress(df[x], df[y])
    # Labels
    plt.title(tit_lab, fontsize = 22, weight='bold')
    plt.xlabel(x_lab, fontsize = 14, weight='bold')
    plt.ylabel(y_lab, fontsize = 14, weight='bold')
    # String format
    if x_fmt != '':
        ax.xaxis.set_major_formatter(x_fmt)
    if y_fmt != '':
        ax.yaxis.set_major_formatter(y_fmt)
    # Ticks
    plt.xticks(fontsize = 10, weight='bold')
    plt.yticks(fontsize = 10, weight='bold');
    # Face 
    ax.set_facecolor('lightblue')
    # Spines
    for axis in ['top','bottom','left','right']:
        ax.spines[axis].set_linewidth(3)
    # Layout
    # plt.legend(bbox_to_anchor=(1.23, 1))
    plt.tight_layout()
    # Save an image of the plot (before plt.show(), which can clear the figure)
    if file_name != '':
        folder_file_name = image_folder_path + file_name
        fig.savefig(folder_file_name, format='png', 
                    facecolor='w', edgecolor='w')
    # Display
    plt.show();

##### **Function to Display Reg Plot - Version 2**

In [None]:
# Create a function to create a Reg Plot
# Version 2 distinguishes 'Outlet_Type' and draws a LOWESS trend line for each type
def reg_2_plot(df, x, y, file_name='',
               tit_lab='', x_lab='', y_lab='',
               x_fmt='', y_fmt=''):
    # Plot
    fig, ax = plt.subplots(figsize=(8,4), facecolor='w');
    sns.scatterplot(data=df, x=x, y=y, hue='Outlet_Type', alpha=.8, ax=ax);
    # Trend line color for each outlet type
    line_colors = {'Grocery Store': 'green',
                   'Supermarket Type1': 'blue',
                   'Supermarket Type2': 'orange',
                   'Supermarket Type3': 'red'}
    # Draw one LOWESS trend line per outlet type, using the x and y parameters
    # (the scatter markers are hidden with a marker size of 0)
    for outlet_type, line_color in line_colors.items():
        sns.regplot(data=df[df['Outlet_Type'] == outlet_type], 
                    x=x, y=y, 
                    lowess=True,
                    scatter_kws={'s':0},
                    line_kws={'color':line_color, 'lw':3},
                    ax=ax)
    # Regression Line
    m, b, *_ = stats.linregress(df[x], df[y])
    # Labels
    plt.title(tit_lab, fontsize = 22, weight='bold')
    plt.xlabel(x_lab, fontsize = 14, weight='bold')
    plt.ylabel(y_lab, fontsize = 14, weight='bold');
    # String format
    if x_fmt != '':
        ax.xaxis.set_major_formatter(x_fmt)
    if y_fmt != '':
        ax.yaxis.set_major_formatter(y_fmt)
    # Ticks
    plt.xticks(fontsize = 10, weight='bold')
    plt.yticks(fontsize = 10, weight='bold');
    # Face 
    ax.set_facecolor('lightblue')
    # Spines
    for axis in ['top','bottom','left','right']:
        ax.spines[axis].set_linewidth(3)
    # Layout
    # plt.legend(bbox_to_anchor=(1.23, 1))
    plt.tight_layout()
    # Save an image of the plot (before plt.show(), which can clear the figure)
    if file_name != '':
        folder_file_name = image_folder_path + file_name
        fig.savefig(folder_file_name, format='png', 
                    facecolor='w', edgecolor='w')
    # Display
    plt.show();

##### **Function to Display Bar Plot**

In [None]:
# Create a function to create a Bar Plot
def bar_plot(df, x_column_name, y_column_name, label_order, 
             fs=(8,4), file_name='',
             tit_lab='', x_lab='', y_lab='', 
             fmt='',
             hza='center', rot=0):
    # Plot
    fig, ax = plt.subplots(nrows=1, figsize=fs, facecolor='w')
    sns.barplot(data=df, 
                y=y_column_name, 
                x=x_column_name, 
                order=label_order,
                palette='viridis');
    # Title and labels
    plt.title(tit_lab, fontsize = 18, weight='bold')
    plt.xlabel(x_lab, fontsize = 14, weight='bold')
    plt.ylabel(y_lab, fontsize = 14, weight='bold')
    # Ticks
    plt.xticks(fontsize=10, weight='bold', rotation=rot, ha=hza)
    plt.yticks(fontsize=10, weight='bold')
    # Face
    ax.set_facecolor('w')
    # String format
    if fmt != '':
        ax.yaxis.set_major_formatter(fmt)
    # Spines
    for axis in ['top','bottom','left','right']:
        ax.spines[axis].set_linewidth(3)
    # Layout
    plt.tight_layout()
    # Save an image of the plot (before plt.show(), which can clear the figure)
    if file_name != '':
        folder_file_name = image_folder_path + file_name
        fig.savefig(folder_file_name, format='png', 
                    facecolor='w', edgecolor='w')
    # Display
    plt.show();

##### **Function to Display Line Plot for Model Metrics**

In [None]:
# Create a function to create a Line Plot
def line_plot(df, column_1='Train R2', column_2='Test R2',
             fs=(12,8), file_name='',
             tit_lab='', x_lab='', y_lab='', 
             fmt='',
             hza='center', rot=0):
    # Plot
    fig, ax = plt.subplots(nrows=1, figsize=fs, facecolor='w')
    # Use the dataframe passed to the function rather than a global variable
    sns.lineplot(data=df[column_1], color="blueviolet", 
                 linewidth=3, markersize=10, marker='o', label='Train');
    sns.lineplot(data=df[column_2], color="yellowgreen", 
                 linewidth=3, markersize=10, marker='o', label='Test');
    # Title and labels
    plt.title(tit_lab, fontsize = 18, weight='bold')
    plt.xlabel(x_lab, fontsize = 14, weight='bold')
    plt.ylabel(y_lab, fontsize = 14, weight='bold')
    # Ticks
    plt.xticks(fontsize=10, weight='bold', rotation=rot, ha=hza)
    plt.yticks(fontsize=10, weight='bold')
    # Face
    ax.set_facecolor('w')
    # String format
    if fmt!= '':
        ax.yaxis.set_major_formatter(fmt)
    # Spines
    for axis in ['top','bottom','left','right']:
        ax.spines[axis].set_linewidth(3)
    # Layout
    plt.tight_layout()
    # Save an image of the plot (before plt.show(), which can clear the figure)
    if file_name != '':
        folder_file_name = image_folder_path + file_name
        fig.savefig(folder_file_name, format='png', 
                    facecolor='w', edgecolor='w')
    # Display
    plt.show();

##### **Function to Display Skew**

In [None]:
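# Creates a function to determine skew
# (heuristic: compares the mean to the median)
def skew_function(df, column_name):
    feature = df[column_name]
    mean = feature.mean()
    median = feature.median()
    if median < mean:
        print('This feature is positively skewed.')
    elif median > mean:
        print('This feature is negatively skewed.')
    else:
        print('This feature shows no skew (the mean equals the median).')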

##### **Function to Display Kurtosis**

In [None]:
# Creates a function to determine kurtosis
def kurtosis_function(df, column_name):
    # fisher=False returns Pearson kurtosis, for which a normal distribution scores 3
    kurt = stats.kurtosis(df[column_name], fisher = False)
    if kurt > 3:
        print(f'This feature is Leptokurtic, because it has a kurtosis value of {kurt}.')
        if kurt < 3.5:
            print('Though we could say it is Mesokurtic, as the value is close to 3.')
    elif kurt < 3:
        print(f'This feature is Platykurtic, because it has a kurtosis value of {kurt}.')
        if kurt > 2.5:
            print('Though we could say it is Mesokurtic, as the value is close to 3.')
    else:
        print(f'This feature is Mesokurtic, because it has a kurtosis value of {kurt}.')

##### **Function to Display Outliers**

In [None]:
# Creates a function to determine the number of outliers based on z-score
def outlier_function(df, column_name):
    outliers = np.abs(stats.zscore(df[column_name])) > 3
    print(f'This feature has {outliers.sum()} outliers.')

##### **Function to Describe Distribution**

In [None]:
# Determine if this feature has skew, and if so, which direction (+/-)
# Determine the kurtosis of the feature; Mesokurtic, Leptokurtic, or Platykurtic
# Determine the number of outliers for this feature based on zscore
def dist_desc(df, column_name):
    skew_function(df, column_name)
    kurtosis_function(df, column_name)
    outlier_function(df, column_name)

### **Categorical Columns**

#### **Summary Statistics**

In [None]:
# Display the descriptive statistics for the non-numeric columns
eda_df.describe(exclude=('number'))

#### **'Item_Identifier' column**

In [None]:
# Display column statistics
eda_df.Item_Identifier.describe()

In [None]:
# Display supplemental column statistics
column_statistics(eda_df, 'Item_Identifier')

In [None]:
# Display normalized value counts
eda_df['Item_Identifier'].value_counts(normalize=True).head()

#### **'Item_Fat_Content' column**

In [None]:
# Display column statistics
eda_df.Item_Fat_Content.describe()

In [None]:
# Display supplemental column statistics
column_statistics(eda_df, 'Item_Fat_Content')

In [None]:
# Display normalized value counts
eda_df['Item_Fat_Content'].value_counts(normalize=True)

In [None]:
# Utilize function to display count plot
count_plot(eda_df, 'Item_Fat_Content', ['Low Fat', 'Regular'], 
           tit_lab='Item Fat Content')

#### **'Item_Type' column**

In [None]:
# Display column statistics
eda_df.Item_Type.describe()

In [None]:
# Display supplemental column statistics
column_statistics(eda_df, 'Item_Type')

In [None]:
# Display normalized value counts
eda_df['Item_Type'].value_counts(normalize=True)

In [None]:
labels = ['Fruits and Vegetables', 'Snack Foods','Household', 'Frozen Foods',
              'Dairy','Canned', 'Baking Goods', 'Health and Hygiene',
              'Soft Drinks', 'Meat', 'Breads', 'Hard Drinks', 'Others',
              'Starchy Foods', 'Breakfast', 'Seafood']

# Utilize function to display count plot
count_plot(eda_df, 'Item_Type', labels, 
           fs=(14,6), hza='right', rot=30,
           tit_lab='Item Type')

#### **'Outlet_Size' column**

In [None]:
# Display column statistics
eda_df['Outlet_Size'].describe()

In [None]:
# Display supplemental column statistics
column_statistics(eda_df, 'Outlet_Size')

In [None]:
# Display normalized value counts
eda_df['Outlet_Size'].value_counts(normalize=True)

In [None]:
# Define the order of the category labels
labels = ['Medium', 'Unknown', 'Small', 'High']

# Utilize function to display count plot
count_plot(eda_df, 'Outlet_Size', labels, 
           tit_lab='Outlet Size')

#### **'Outlet_Location_Type' column**

In [None]:
# Display column statistics
eda_df['Outlet_Location_Type'].describe()

In [None]:
# Display supplemental column statistics
column_statistics(eda_df, 'Outlet_Location_Type')

In [None]:
# Display normalized value counts
eda_df['Outlet_Location_Type'].value_counts(normalize=True)

In [None]:
# Define the order of the category labels
labels = ['Tier 3', 'Tier 2', 'Tier 1']

# Utilize function to display count plot
count_plot(eda_df, 'Outlet_Location_Type', labels, 
           tit_lab='Outlet Location Type')

#### **'Outlet_Type' column**

In [None]:
# Display column statistics
eda_df['Outlet_Type'].describe()

In [None]:
# Display supplemental column statistics
column_statistics(eda_df, 'Outlet_Type')

In [None]:
# Display normalized value counts
eda_df['Outlet_Type'].value_counts(normalize=True)

In [None]:
eda_df['Outlet_Type'].unique()

In [None]:
# Define the order of the category labels
labels = ['Supermarket Type1', 'Grocery Store', 
          'Supermarket Type3', 'Supermarket Type2']

# Utilize function to display count plot
count_plot(eda_df, 'Outlet_Type', labels, 
           tit_lab='Outlet Type',
           rot=30, hza='right')

### **Numerical Columns**

#### **Summary Statistics**

In [None]:
# Display the descriptive statistics for the numeric columns
eda_df.describe()

#### **'Item_Weight' column**

**Statistics**

In [None]:
# Display column statistics
eda_df.Item_Weight.describe()

In [None]:
# Display supplemental column statistics
column_statistics(eda_df, 'Item_Weight')

**Plots**

In [None]:
# Utilize function to display histogram plot
hist_plot(eda_df, 'Item_Weight',
          tit_lab='Item Weight', 
          x_lab='Pounds')

In [None]:
# Utilize function to display a KDE plot
kde_plot(eda_df, 'Item_Weight',
         tit_lab='Item Weight', 
         x_lab='Pounds')

In [None]:
box_plot(eda_df, 'Item_Weight',
             tit_lab='Item Weight', 
             x_lab='Pounds')

In [None]:
reg_plot(eda_df, 'Item_Weight', 'Item_Outlet_Sales', 
         tit_lab='Correlation', 
         x_lab='Item Weight (Pounds)', y_lab='Item Outlet Sales (Dollars)',
         y_fmt=price_0_fmt)

**Distribution Description**

- This feature has a continuous distribution.

In [None]:
# Determine if this feature has skew, and if so, which direction (+/-)
# Determine the kurtosis of the feature; Mesokurtic, Leptokurtic, or Platykurtic
# Determine the number of outliers for this feature based on zscore
dist_desc(eda_df, 'Item_Weight')

#### **'Item_Visibility' column**

**Statistics**

In [None]:
# Display column statistics
eda_df.Item_Visibility.describe()

In [None]:
# Display supplemental column statistics
column_statistics(eda_df, 'Item_Visibility')

**Plots**

In [None]:
# Utilize function to display histogram plot
hist_plot(eda_df, 'Item_Visibility',
          tit_lab='Item Visibility', 
          x_lab='Percentage of display area',
          fmt=perc_0_fmt)

In [None]:
# Utilize function to display a KDE plot
kde_plot(eda_df, 'Item_Visibility',
         tit_lab='Item Visibility', 
         x_lab='Percentage of display area',
         fmt=perc_0_fmt)

In [None]:
box_plot(eda_df, 'Item_Visibility',
         tit_lab='Item Visibility', 
         x_lab='Percentage of display area',
         fmt=perc_0_fmt)

In [None]:
# Utilize function to display regplot 
reg_plot(eda_df, 'Item_Visibility', 'Item_Outlet_Sales', 
         tit_lab='Correlation', 
         x_lab='Item_Visibility (Percentage of display area)', 
         y_lab='Item Outlet Sales (Dollars)',
         x_fmt=perc_0_fmt, y_fmt=price_0_fmt)

**Distribution Description**

- This feature has a continuous distribution.

In [None]:
# Determine if this feature has skew, and if so, which direction (+/-)
# Determine the kurtosis of the feature; Mesokurtic, Leptokurtic, or Platykurtic
# Determine the number of outliers for this feature based on zscore
dist_desc(eda_df, 'Item_Visibility')

- For EDA, outliers do not need to be removed.
- For Machine Learning, outliers may be removed from the ml_df to determine if it will improve model performance.

#### **'Item_MRP' column**

**Statistics**

In [None]:
# Display column statistics
eda_df.Item_MRP.describe()

In [None]:
# Display supplemental column statistics
column_statistics(eda_df, 'Item_MRP')

**Plots**

In [None]:
# Utilize function to display histogram plot
hist_plot(eda_df, 'Item_MRP',
          tit_lab='Item MRP', x_lab='Dollars',
          fmt=price_0_fmt)

In [None]:
# Utilize function to display a KDE plot
kde_plot(eda_df, 'Item_MRP',
         tit_lab='Item MRP', 
         x_lab='Dollars',
          fmt=price_0_fmt)

In [None]:
box_plot(eda_df, 'Item_MRP',
         tit_lab='Item MRP', 
         x_lab='Dollars',
         fmt=price_0_fmt)

In [None]:
reg_plot(eda_df, 'Item_MRP', 'Item_Outlet_Sales', 
         tit_lab='Correlation', 
         x_lab='Item MRP (Dollars)', 
         y_lab='Item Outlet Sales (Dollars)',
         x_fmt=price_0_fmt, y_fmt=price_0_fmt)

**Distribution Description**

- This feature has a continuous distribution.

In [None]:
# Determine if this feature has skew, and if so, which direction (+/-)
# Determine the kurtosis of the feature; Mesokurtic, Leptokurtic, or Platykurtic
# Determine the number of outliers for this feature based on zscore
dist_desc(eda_df, 'Item_MRP')

#### **'Outlet_Establishment_Year' column**

In [None]:
# Display column statistics
eda_df['Outlet_Establishment_Year'].describe()

In [None]:
# Display supplemental column statistics
column_statistics(eda_df, 'Outlet_Establishment_Year')

In [None]:
# Display normalized value counts
eda_df['Outlet_Establishment_Year'].value_counts(normalize=True)

**Plots**

In [None]:
# Utilize function to display histogram plot
hist_plot(eda_df, 'Outlet_Establishment_Year',
          tit_lab='Outlet Establishment Year', 
          x_lab='Year')

In [None]:
# Utilize function to display a KDE plot
kde_plot(eda_df, 'Outlet_Establishment_Year',
         tit_lab='Outlet Establishment Year', 
         x_lab='Year')

In [None]:
box_plot(eda_df, 'Outlet_Establishment_Year',
         tit_lab='Outlet Establishment Year', 
         x_lab='Year')

In [None]:
# Utilize function to display regplot 
reg_plot(eda_df, 'Outlet_Establishment_Year', 'Item_Outlet_Sales', 
         tit_lab='Correlation', 
         x_lab='Outlet Establishment Year', 
         y_lab='Item Outlet Sales',
         y_fmt=price_0_fmt)

**Distribution Description**

- This feature has a discrete distribution.

In [None]:
# Determine if this feature has skew, and if so, which direction (+/-)
# Determine the kurtosis of the feature; Mesokurtic, Leptokurtic, or Platykurtic
# Determine the number of outliers for this feature based on zscore
dist_desc(eda_df, 'Outlet_Establishment_Year')

#### **'Item_Outlet_Sales' column**

**Statistics**

In [None]:
# Display column statistics
eda_df.Item_Outlet_Sales.describe()

In [None]:
# Display supplemental column statistics
column_statistics(eda_df, 'Item_Outlet_Sales')

**Plots**

In [None]:
# Utilize function to display histogram plot
hist_plot(eda_df, 'Item_Outlet_Sales',
          tit_lab='Item Outlet Sales', 
          x_lab='Dollars',
          fmt=price_0_fmt)

In [None]:
# Utilize function to display a KDE plot
kde_plot(eda_df, 'Item_Outlet_Sales',
         tit_lab='Item Outlet Sales', 
         x_lab='Dollars',
         fmt=price_0_fmt)

In [None]:
box_plot(eda_df, 'Item_Outlet_Sales',
         tit_lab='Item Outlet Sales', 
         x_lab='Dollars',
         fmt=price_0_fmt)

**Distribution Description**

- This feature has a continuous distribution.

In [None]:
# Determine if this feature has skew, and if so, which direction (+/-)
# Determine the kurtosis of the feature; Mesokurtic, Leptokurtic, or Platykurtic
# Determine the number of outliers for this feature based on zscore
dist_desc(eda_df, 'Item_Outlet_Sales')

- This column is our target, so outliers will not be removed.

### **Feature Correlation**

In [None]:
# Plot Correlation Heatmap
plt.figure(figsize = (8,8),facecolor='w')
corr = eda_df.corr().abs()
mask = np.triu(np.ones_like(corr))
sns.heatmap(corr,square=True, cmap='viridis', annot=True, mask=mask);
plt.title('Correlation Heatmap', fontsize = 16, weight='bold')
plt.xticks(fontsize = 12, weight='bold', rotation=90)
plt.yticks(fontsize = 12, weight='bold', rotation=0);
plt.tight_layout()
plt.show()

In [None]:
# Calculate the correlation/strength-of-association of features in data-set 
# with both categorical and continuous features using:
# - Pearson's R for continuous-continuous cases
# - Correlation Ratio for categorical-continuous cases 
# - Cramer's V or Theil's U for categorical-categorical cases
associations(eda_df, 
             figsize=(10,10), 
             cmap='viridis', 
             display_columns='Item_Outlet_Sales', 
             hide_rows='Item_Outlet_Sales')

## **Explanatory Data Analysis**

### **Company**

In [None]:
# The earliest establishment year identifies when the first outlet opened
oldest_outlet = eda_df.Outlet_Establishment_Year.min()
print(f'The first outlet store was opened in {oldest_outlet}.')

In [None]:
number_outlets = eda_df.Outlet_Identifier.nunique()
print(f'The company has {number_outlets} outlet stores.')

In [None]:
number_items = eda_df.Item_Identifier.nunique()
number_item_types = eda_df.Item_Type.nunique()
print(f'The company offers {number_items} items across a total of {number_item_types} product categories.')

In [None]:
total_sales = eda_df.Item_Outlet_Sales.sum()
print(f'Total sales for the period were ${total_sales:,.2f}.')

### **Outlets**

In [None]:
# Create a dataframe grouped by Outlet_Identifier displaying the
# aggregated sum of Item_Outlet_Sales
outlet_identifier_df = eda_df.groupby(['Outlet_Identifier'])\
                 ['Item_Outlet_Sales'].agg(['sum'])\
                 .sort_values(['sum'], ascending = False)
outlet_identifier_df['sum'] = round(outlet_identifier_df['sum'],2)

In [None]:
# Create column by copying from index
outlet_identifier_df.insert(loc = 0,
          column = 'Outlet_Identifier',
          value = outlet_identifier_df.index)

In [None]:
# Reset index
outlet_identifier_df.reset_index(drop=True, inplace=True)

In [None]:
# Rename aggregate column name
outlet_identifier_df = outlet_identifier_df.rename(columns={'sum': 'Total Sales'})

In [None]:
# Display the dataframe
outlet_identifier_df.style.format({'Total Sales': price_2_fmt})
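
The four cells above (aggregate, copy the index into a column, reset the index, rename) can also be expressed as a single chain. The following is a minimal sketch on toy data, assuming only the two column names used in the notebook; `toy_df` and `toy_totals` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy data for illustration; column names mirror the notebook's
toy_df = pd.DataFrame({
    'Outlet_Identifier': ['OUT1', 'OUT2', 'OUT1', 'OUT3', 'OUT2'],
    'Item_Outlet_Sales': [100.0, 250.5, 50.25, 75.0, 10.0],
})

# as_index=False keeps the group key as a regular column, so no separate
# insert or reset_index step is needed afterwards
toy_totals = (
    toy_df.groupby('Outlet_Identifier', as_index=False)['Item_Outlet_Sales']
          .sum()
          .rename(columns={'Item_Outlet_Sales': 'Total Sales'})
          .sort_values('Total Sales', ascending=False, ignore_index=True)
          .round({'Total Sales': 2})
)
print(toy_totals)
```

The same pattern would apply to the Outlet_Size and Outlet_Location_Type summaries later in this section.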


In [None]:
# Utilize function to display bar plot
bar_plot(df=outlet_identifier_df, 
         x_column_name='Outlet_Identifier', 
         y_column_name='Total Sales', 
         fs=(10,5), file_name='Total Sales by Outlet.png',
         tit_lab='Total Sales by Outlet',
         label_order=outlet_identifier_df.Outlet_Identifier,
         fmt=price_2_fmt,
         hza='center', rot=0)

In [None]:
# Create a dataframe of the distinct outlets and their attributes
outlet_df = eda_df.loc[:,['Outlet_Identifier', 'Outlet_Size','Outlet_Type','Outlet_Location_Type']]
# Drop duplicates
outlet_df.drop_duplicates(inplace=True, ignore_index=True)
# Display the dataframe
outlet_df.style.hide(axis='index')

#### **Outlet Size**

In [None]:
# Create a dataframe grouped by Outlet_Size displaying the
# aggregated sum of Item_Outlet_Sales for the entire company
company_sales_by_outlet_size_df = eda_df.groupby('Outlet_Size')['Item_Outlet_Sales'].agg(['sum']).sort_values('sum', ascending = False).head(10)
company_sales_by_outlet_size_df['sum'] = round(company_sales_by_outlet_size_df['sum'],3)


In [None]:
# Create column by copying from index
company_sales_by_outlet_size_df.insert(loc = 0,
          column = 'Outlet_Size',
          value = company_sales_by_outlet_size_df.index)

In [None]:
# Reset index
company_sales_by_outlet_size_df.reset_index(drop=True, inplace=True)

In [None]:
# Rename aggregate column name
company_sales_by_outlet_size_df = company_sales_by_outlet_size_df.rename(columns={'sum': 'Total Sales'})

In [None]:
# Display the dataframe
company_sales_by_outlet_size_df#.style.format({'Total Sales': price_2_fmt})

In [None]:
# Utilize function to display bar plot
bar_plot(df=company_sales_by_outlet_size_df, 
         x_column_name='Outlet_Size', 
         y_column_name='Total Sales', 
         fs=(10,5), file_name='Total Sales by Outlet Size.png',
         tit_lab='Total Sales by Outlet Size',
         label_order=company_sales_by_outlet_size_df.Outlet_Size,
         fmt=price_2_fmt,
         hza='center', rot=0)

In [None]:
# Create a dataframe indexed by Outlet_Size displaying the average
# total Item_Outlet_Sales and the average item count per outlet
outlet_size_sales_df = pd.DataFrame()

for size in eda_df['Outlet_Size'].unique():
    # Named size_filter to avoid shadowing the built-in filter()
    size_filter = eda_df['Outlet_Size'] == size
    filtered_df = eda_df[size_filter].copy()
    
    size_sales_sum = filtered_df['Item_Outlet_Sales'].sum()
    size_outlet_count = filtered_df['Outlet_Identifier'].nunique()
    size_item_count = filtered_df['Item_Identifier'].count()
    
    average_size_sales = round(size_sales_sum/size_outlet_count, 2) 
    average_size_count = round(size_item_count/size_outlet_count, 2)
    
    outlet_size_sales_df.loc[size, 'Average Outlet Sales'] = average_size_sales
    outlet_size_sales_df.loc[size, 'Average Item Count'] = average_size_count

In [None]:
outlet_size_sales_df = outlet_size_sales_df.sort_values(by=['Average Outlet Sales'], ascending=False)
outlet_size_sales_df
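
The loop above divides each size group's total sales and row count by its number of distinct outlets. As a hedged alternative, the same per-outlet averages can be sketched with a single named aggregation; the toy data and the names `toy_df`, `per_size`, and `toy_averages` are assumptions for illustration only.

```python
import pandas as pd

# Toy data for illustration; column names mirror the notebook's
toy_df = pd.DataFrame({
    'Outlet_Size': ['Small', 'Small', 'Small', 'Medium', 'Medium', 'Small'],
    'Outlet_Identifier': ['OUT1', 'OUT1', 'OUT2', 'OUT3', 'OUT3', 'OUT2'],
    'Item_Identifier': ['A', 'B', 'A', 'A', 'C', 'C'],
    'Item_Outlet_Sales': [100.0, 200.0, 300.0, 500.0, 700.0, 400.0],
})

# One pass over the data: total sales, distinct outlets, and row count per size
per_size = toy_df.groupby('Outlet_Size').agg(
    total_sales=('Item_Outlet_Sales', 'sum'),
    outlet_count=('Outlet_Identifier', 'nunique'),
    row_count=('Item_Identifier', 'count'),
)

# Divide by the number of outlets to get per-outlet averages
toy_averages = pd.DataFrame({
    'Average Outlet Sales': (per_size['total_sales'] / per_size['outlet_count']).round(2),
    'Average Item Count': (per_size['row_count'] / per_size['outlet_count']).round(2),
}).sort_values('Average Outlet Sales', ascending=False)
print(toy_averages)
```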

In [None]:
labels = ['Medium', 'High', 'Small', 'Unknown']

# Utilize function to display bar plot
bar_plot(df=outlet_size_sales_df, 
         x_column_name=outlet_size_sales_df.index,
         y_column_name='Average Outlet Sales', 
         fs=(10,5), file_name='Average Outlet Sales by Outlet Size.png',
         tit_lab='Average Outlet Sales by Outlet Size',
         label_order=labels,
         fmt=price_2_fmt,
         hza='center', rot=0)

In [None]:
outlet_size_df = eda_df.groupby(['Outlet_Size','Outlet_Identifier'])\
                 ['Item_Outlet_Sales'].agg(['sum'])\
                 .sort_values(['sum'], ascending = False)
outlet_size_df['sum'] = round(outlet_size_df['sum'],3)

In [None]:
outlet_size_df

In [None]:
# Drop 'Outlet_Size'
ml_df = ml_df.drop(columns=['Outlet_Size'])

#### **Location Type**

In [None]:
# Create a dataframe grouped by Outlet_Location_Type displaying the
# aggregated sum of Item_Outlet_Sales for the entire company
company_sales_by_outlet_location_type_df = eda_df.groupby('Outlet_Location_Type')['Item_Outlet_Sales'].agg(['sum']).sort_values('sum', ascending = False).head(10)
company_sales_by_outlet_location_type_df['sum'] = round(company_sales_by_outlet_location_type_df['sum'],3)
company_sales_by_outlet_location_type_df

In [None]:
# Create column by copying from index
company_sales_by_outlet_location_type_df.insert(loc = 0,
          column = 'Outlet_Location_Type',
          value = company_sales_by_outlet_location_type_df.index)

In [None]:
# Reset index
company_sales_by_outlet_location_type_df.reset_index(drop=True, inplace=True)

In [None]:
# Rename aggregate column name
company_sales_by_outlet_location_type_df = company_sales_by_outlet_location_type_df.rename(columns={'sum': 'Total Sales'})

In [None]:
# Display the dataframe
company_sales_by_outlet_location_type_df#.style.format({'Total Sales': price_2_fmt})

In [None]:
# Utilize function to display bar plot
bar_plot(df=company_sales_by_outlet_location_type_df, 
         x_column_name='Outlet_Location_Type', 
         y_column_name='Total Sales', 
         fs=(10,5), file_name='Total Sales by Outlet Location Type.png',
         tit_lab='Total Sales by Outlet Location Type',
         label_order=company_sales_by_outlet_location_type_df.Outlet_Location_Type,
         fmt=price_2_fmt,
         hza='center', rot=0)

In [None]:
# Create a dataframe indexed by Outlet_Location_Type displaying the average
# total Item_Outlet_Sales and the average item count per outlet
average_outlet_sales_by_location_type_df = pd.DataFrame()
for out_type in eda_df['Outlet_Location_Type'].unique():
    # Named type_filter to avoid shadowing the built-in filter()
    type_filter = eda_df['Outlet_Location_Type'] == out_type
    filtered_df = eda_df[type_filter].copy()
    
    type_sales_sum = filtered_df['Item_Outlet_Sales'].sum()
    type_outlet_count = filtered_df['Outlet_Identifier'].nunique()
    type_item_count = filtered_df['Item_Identifier'].count()
    
    average_type_sales = round(type_sales_sum/type_outlet_count, 2)  
    average_type_count = round(type_item_count/type_outlet_count, 2)
    
    average_outlet_sales_by_location_type_df.loc[out_type, 'Average Outlet Sales'] = average_type_sales
    average_outlet_sales_by_location_type_df.loc[out_type, 'Average Item Count'] = average_type_count

In [None]:
# Display the dataframe
average_outlet_sales_by_location_type_df = average_outlet_sales_by_location_type_df.sort_values(by=['Average Outlet Sales'], ascending=False)
average_outlet_sales_by_location_type_df

In [None]:
labels = ['Tier 2', 'Tier 3', 'Tier 1']

# Utilize function to display bar plot
bar_plot(df=average_outlet_sales_by_location_type_df, 
         x_column_name=average_outlet_sales_by_location_type_df.index, 
         y_column_name='Average Outlet Sales', 
         fs=(10,5), file_name='Average Outlet Sales by Outlet Location Type.png',
         tit_lab='Average Outlet Sales by Outlet Location Type',
         label_order=labels,
         fmt=price_2_fmt,
         hza='center', rot=0)

In [None]:
labels = ['Tier 2', 'Tier 3', 'Tier 1']

# Utilize function to display bar plot
bar_plot(df=average_outlet_sales_by_location_type_df, 
         x_column_name=average_outlet_sales_by_location_type_df.index, 
         y_column_name='Average Item Count', 
         fs=(10,5), file_name='Average Outlet Item Count by Outlet Location Type.png',
         tit_lab='Average Outlet Item Count by Outlet Location Type',
         label_order=labels,
         fmt=price_2_fmt,
         hza='center', rot=0)

In [None]:
outlet_sales_by_location_type_df = eda_df.groupby(['Outlet_Location_Type','Outlet_Identifier'])\
                 ['Item_Outlet_Sales'].agg(['sum'])\
                 .sort_values(['sum'], ascending = False)
outlet_sales_by_location_type_df['sum'] = round(outlet_sales_by_location_type_df['sum'],2)

In [None]:
outlet_sales_by_location_type_df

In [None]:
# Drop 'Outlet_Location_Type'
ml_df = ml_df.drop(columns=['Outlet_Location_Type'])

#### **Outlet Type**

In [None]:
# Create a dataframe grouped by Outlet_Type displaying the
# aggregated sum of Item_Outlet_Sales for the entire company
company_sales_by_outlet_type_df = eda_df.groupby('Outlet_Type')['Item_Outlet_Sales'].agg(['sum']).sort_values('sum', ascending = False).head(10)
company_sales_by_outlet_type_df['sum'] = round(company_sales_by_outlet_type_df['sum'],3)
company_sales_by_outlet_type_df

In [None]:
# labels = ['Supermarket Type1', 'Supermarket Type3', 
#           'Supermarket Type2', 'Grocery Store']

# bar_plot(company_sales_by_outlet_type_df, 'sum',
#          fs=(10,5), file_name='Total Sales by Outlet Type.png',
#          tit_lab='Total Company Sales by Outlet Type',
#          label_order=labels,
#          fmt=price_0_fmt,
#          hza='center', rot=0)

In [None]:
# Create a dataframe grouped by Outlet_Type displaying the
# aggregated sum of Item_Outlet_Sales for the entire company
company_sales_by_outlet_type_df = eda_df.groupby('Outlet_Type')['Item_Outlet_Sales']\
                .agg(['sum']).sort_values('sum', ascending = False).head(10)\
                .reset_index()
company_sales_by_outlet_type_df['sum'] = round(company_sales_by_outlet_type_df['sum'],3)
company_sales_by_outlet_type_df.style.format({"sum":  "${:20,.2f}"})

In [None]:
type(company_sales_by_outlet_type_df)

In [None]:
# labels = ['Supermarket Type1', 'Supermarket Type3', 
#           'Supermarket Type2', 'Grocery Store']

# bar_plot(company_sales_by_outlet_type_df, column_name='sum',
#          fs=(10,5), file_name='Total Sales by Outlet Type.png',
#          tit_lab='Total Company Sales by Outlet Type',
#          label_order=labels,
#          fmt=price_0_fmt,
#          hza='center', rot=0)

In [None]:
# Create a dataframe indexed by Outlet_Type displaying the
# average total Item_Outlet_Sales per outlet
average_outlet_sales_by_type_df = pd.DataFrame()
for out_type in eda_df['Outlet_Type'].unique():
    # Named type_filter to avoid shadowing the built-in filter()
    type_filter = eda_df['Outlet_Type'] == out_type
    filtered_df = eda_df[type_filter].copy()
    type_sales_sum = filtered_df['Item_Outlet_Sales'].sum()
    type_outlet_count = filtered_df['Outlet_Identifier'].nunique()
    average_type_sales = round(type_sales_sum/type_outlet_count, 2)  
    average_outlet_sales_by_type_df.loc[out_type, 'Average Outlet Sales'] = average_type_sales

In [None]:
# Display the dataframe
average_outlet_sales_by_type_df

In [None]:
# labels = ['Supermarket Type3', 'Supermarket Type1', 
#           'Supermarket Type2', 'Grocery Store']

# bar_plot(average_outlet_sales_by_type_df, 'Average Outlet Sales',
#          file_name='Average Outlet Sales by Outlet Type.png',
#          fs=(10,5), tit_lab='Average Outlet Sales by Outlet Type',
#          label_order=labels,
#          fmt=price_0_fmt,
#          hza='center', rot=0)

In [None]:
outlet_type_df = eda_df.groupby(['Outlet_Type','Outlet_Identifier'])\
                 ['Item_Outlet_Sales'].agg(['sum'])\
                 .sort_values(['sum'], ascending = False)
outlet_type_df['sum'] = round(outlet_type_df['sum'],2)

In [None]:
outlet_type_df

### **Item Types**

In [None]:
# Create a dataframe grouped by Item_Types displaying the
# aggregated sum of Item_Outlet_Sales
item_types_df = eda_df.groupby('Item_Type')['Item_Outlet_Sales'].agg(['sum']).sort_values('sum', ascending = False)
item_types_df['sum'] = round(item_types_df['sum'],2)
item_types_df

In [None]:
# bar_plot(item_types_df, 'sum', 
#          label_order=item_types_df.index, 
#          fs=(10,5), file_name='Total Sales by Item Type.png',
#          tit_lab='Total Sales by Item Type', 
#          fmt=price_0_fmt,
#          hza='right', rot=30)

### **Items**

In [None]:
total_sales_by_item = eda_df.groupby('Item_Identifier')['Item_Outlet_Sales'].agg(['sum']).sort_values('sum', ascending = False)
total_sales_by_item['sum'] = round(total_sales_by_item['sum'],2)

In [None]:
top_10_total_sales_by_item = total_sales_by_item.head(10)

In [None]:
# bar_plot(top_10_total_sales_by_item, 'sum', 
#          label_order=top_10_total_sales_by_item.index, 
#          tit_lab='Total Sales Top 10 Items', 
#          file_name='Total Sales Top 10 Items.png',
#          fmt=price_0_fmt,
#          hza='center', rot=0)

In [None]:
# Calculate 20% of the number of items
round(len(total_sales_by_item)*.2)

In [None]:
top_perc_20_df = total_sales_by_item.head(round(len(total_sales_by_item)*.2))
top_perc_20_df.head(5)

In [None]:
bottom_perc_20_df = total_sales_by_item.tail(round(len(total_sales_by_item)*.2))
bottom_perc_20_df.tail(5).sort_values('sum', ascending = True)

#### **Item_Fat_Content**

In [None]:
top_perc_20_df

## **Preprocessing for Machine Learning**

### **Validation Split**

In [None]:
# Define features (X) and target (y)
X = ml_df.drop(columns = ['Item_Outlet_Sales'])
y = ml_df['Item_Outlet_Sales']

In [None]:
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

### **Missing Value Imputation**

In [None]:
# Loop through index values of the dataframe
for ind in X_train.index:
    # Create a filter to select the Item_Identifier corresponding to the index
    item_filter = X_train['Item_Identifier'] == X_train.loc[ind,'Item_Identifier']

    # Calculate the rounded mean 'Item_Weight'
    # of this row's 'Item_Identifier'
    mean_item_weight = round(X_train.loc[item_filter,'Item_Weight'].mean(), 3)

    # Assign mean_item_weight to the 'Item_Weight' column of this row
    X_train.loc[ind,'Item_Weight'] = mean_item_weight

In [None]:
# Identify any remaining 'Item_Identifier's without 'Item_Weight' in X_train
print(X_train.Item_Weight.isnull().sum())
X_train[X_train.Item_Weight.isnull()].head()

In [None]:
# Loop through index values of the dataframe
for ind in X_test.index:
    # Create a filter on X_train to select the Item_Identifier corresponding
    # to the index, so test rows are imputed from training data only
    item_filter = X_train['Item_Identifier'] == X_test.loc[ind,'Item_Identifier']

    # Calculate the rounded mean 'Item_Weight'
    # of this row's 'Item_Identifier'
    mean_item_weight = round(X_train.loc[item_filter,'Item_Weight'].mean(), 3)

    # Assign mean_item_weight to the 'Item_Weight' column of this row
    X_test.loc[ind,'Item_Weight'] = mean_item_weight

In [None]:
# Identify any remaining 'Item_Identifier's without 'Item_Weight'
print(X_test.Item_Weight.isnull().sum())
X_test[X_test.Item_Weight.isnull()].head()

- These remaining missing values will be imputed in the pipeline using SimpleImputer(strategy='median').
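
The row-by-row loops above recompute an item's mean weight for every row, which can be slow on larger data. A vectorized variant is sketched below on toy data; `toy_train`, `toy_test`, and `train_item_means` are hypothetical names. Note one deliberate difference, stated here as an assumption about the intent: this sketch fills only the missing weights and leaves known weights untouched, whereas the loops overwrite every row with the training mean.

```python
import numpy as np
import pandas as pd

# Toy data for illustration; column names mirror the notebook's
toy_train = pd.DataFrame({
    'Item_Identifier': ['A', 'A', 'B', 'B', 'C'],
    'Item_Weight': [9.3, np.nan, 5.0, 5.0, np.nan],
})
toy_test = pd.DataFrame({
    'Item_Identifier': ['A', 'C', 'D'],
    'Item_Weight': [np.nan, np.nan, 7.5],
})

# Mean weight per item, learned from the training rows only
train_item_means = toy_train.groupby('Item_Identifier')['Item_Weight'].mean().round(3)

# Fill only the missing weights; items with no known training weight stay NaN
# and are left for the pipeline's median imputer
toy_train['Item_Weight'] = toy_train['Item_Weight'].fillna(
    toy_train['Item_Identifier'].map(train_item_means))
toy_test['Item_Weight'] = toy_test['Item_Weight'].fillna(
    toy_test['Item_Identifier'].map(train_item_means))

print(toy_train)
print(toy_test)
```

Because the means come from the training rows only, the test set is still imputed without using any test-set information.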

### **Cardinality**

- **The Curse of Dimensionality** - _As the number of features grows, the amount of data needed to distinguish accurately between these features (in order to make a prediction) and to generalize our model grows exponentially._
- The 'Item_Identifier' column has 1559 unique values.
- The high cardinality of this object column may adversely impact machine learning model prediction performance and processing times, as well as exaggerate its feature importance, so it will be dropped.
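
To make the cardinality concern concrete, the sketch below counts unique values per column and shows how many columns one-hot encoding would produce. It uses toy data with 200 made-up identifiers; `toy_X`, `cardinality`, and `encoded_width` are hypothetical names, and `pd.get_dummies` stands in here for the notebook's `OneHotEncoder` purely for brevity.

```python
import pandas as pd

# Toy data for illustration: one high cardinality and one low cardinality column
toy_X = pd.DataFrame({
    'Item_Identifier': [f'ITEM{i:03d}' for i in range(200)],
    'Item_Fat_Content': ['Low Fat', 'Regular'] * 100,
})

# Number of unique values per column
cardinality = toy_X.nunique()

# One-hot encoding creates one new column per unique value
encoded_width = pd.get_dummies(toy_X).shape[1]

print(cardinality)
print(encoded_width)
```

By the same logic, one-hot encoding a column with 1559 unique values would add 1559 mostly-empty columns to a dataset of 8523 rows.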

In [None]:
# Drop 'Item_Identifier'
X_train = X_train.drop(columns=['Item_Identifier'])
X_test  = X_test.drop(columns=['Item_Identifier'])

### **Linear Regression Model Assumptions**

In [None]:
# Create a second version of X for Linear Regression models
X_2_train = X_train.copy()
X_2_test  = X_test.copy()

#### **Assumption of Linearity**

##### **Item_Weight**

In [None]:
reg_plot(eda_df, 'Item_Weight', 'Item_Outlet_Sales', 
         tit_lab='Correlation', 
         x_lab='Item Weight (Pounds)', 
         y_lab='Item Outlet Sales (dollars)',
         y_fmt=price_0_fmt)

- Based on the plot above, 'Item_Weight' does not appear to have a linear relationship with the target, 'Item_Outlet_Sales'.
- 'Item_Weight' should not be retained; it should be dropped from the X_2_train and X_2_test dataframes.

In [None]:
# Drop 'Item_Weight'
X_2_train = X_2_train.drop(columns=['Item_Weight'])
X_2_test  = X_2_test.drop(columns=['Item_Weight'])

##### **Item_Visibility**

In [None]:
# Utilize function to display regplot 
reg_plot(eda_df, 'Item_Visibility', 'Item_Outlet_Sales', 
         tit_lab='Correlation', 
         x_lab='Item Visibility (Percentage of display area)', 
         y_lab='Item Outlet Sales',
         x_fmt=perc_0_fmt, y_fmt=price_0_fmt)

- Based on the plot above, 'Item_Visibility' may not have a linear relationship with the target, 'Item_Outlet_Sales'. Instead, the data appear to fall into two distinct tiers, each of which is uncorrelated with the target when considered separately.

In [None]:
reg_2_plot(eda_df, 'Item_Visibility', 'Item_Outlet_Sales', 
           tit_lab='Correlation', 
           x_lab='Item Visibility (Percentage of display area)', 
           y_lab='Item Outlet Sales (dollars)',
           x_fmt=perc_0_fmt, y_fmt=price_0_fmt)

- Based on the plot above, 'Item_Visibility' does not appear to have a linear relationship with the target, 'Item_Outlet_Sales'.
- 'Item_Visibility' should not be retained; it should be dropped from the X_2_train and X_2_test dataframes.

In [None]:
# Drop 'Item_Visibility'
X_2_train = X_2_train.drop(columns=['Item_Visibility'])
X_2_test  = X_2_test.drop(columns=['Item_Visibility'])

##### **Item_MRP**

In [None]:
reg_plot(ml_df, 'Item_MRP', 'Item_Outlet_Sales', 
         tit_lab='Correlation', 
         x_lab='Item MRP (dollars)', 
         y_lab='Item Outlet Sales (dollars)')

- Based on the plot above, 'Item_MRP' does appear to have a linear relationship with the target, 'Item_Outlet_Sales'.
- 'Item_MRP' should be retained; it should not be dropped from the X_2_train and X_2_test dataframes.

#### **Assumption of Little-to-No Multicollinearity**

Correlation can be used to identify pairs of features that are too collinear to include together in the model.
- The threshold utilized will be pairs of features that have a correlation value less than -0.8 or greater than +0.8.


In [None]:
# Calculate the absolute values of the correlations
correlation = ml_df.drop(columns='Item_Outlet_Sales').corr().abs()
correlation

In [None]:
# Plot Correlation Heatmap
plt.figure(figsize = (8,8),facecolor='w')
correlation = ml_df.drop(columns='Item_Outlet_Sales').corr().abs()
mask = np.triu(np.ones_like(correlation))
sns.heatmap(correlation,square=True, cmap='viridis', annot=True, mask=mask);
plt.title('Correlation Heatmap', fontsize = 16, weight='bold')
plt.xticks(fontsize = 12, weight='bold', rotation=90)
plt.yticks(fontsize = 12, weight='bold', rotation=0);
plt.tight_layout()
plt.show()

- The highest correlation noted is 0.075 between 'Item_Visibility' and 'Outlet_Establishment_Year', well below the threshold value of 0.80.
- No features need to be dropped based on multicollinearity.
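
Rather than reading the heatmap by eye, the threshold check can also be done programmatically. The sketch below is an illustration on synthetic features (the data and the names `toy_features`, `upper`, and `high_pairs` are assumptions): it lists every feature pair whose absolute correlation exceeds the 0.8 cut-off.

```python
import numpy as np
import pandas as pd

# Synthetic features for illustration only
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy_features = pd.DataFrame({
    'a': a,
    'b': a * 2 + rng.normal(scale=0.1, size=500),  # nearly collinear with 'a'
    'c': rng.normal(size=500),                     # independent of both
})

threshold = 0.8  # the cut-off stated above
corr_abs = toy_features.corr().abs()

# Keep the strict upper triangle so each pair appears once and self-pairs are excluded
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > threshold]
print(high_pairs)
```

An empty result would correspond to the conclusion above that no features need to be dropped.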

In [None]:
X_train.head()

In [None]:
X_2_train.head()

### **Instantiate Column Selectors**

In [None]:
# Selectors
cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')

# Selectors for low and high cardinality categorical columns,
# used to separate out high cardinality categorical columns
# when analyzing dropping or label encoding options
low_unique_cat_selector = [col for col in X_train.columns if X_train[col].dtype=='object' and X_train[col].nunique() < 1500]
high_unique_cat_selector = [col for col in X_train.columns if X_train[col].dtype=='object' and X_train[col].nunique() >= 1500]

In [None]:
# Display categorical column names
cat_selector(X_train)

In [None]:
# Display categorical column names with low number of unique values
low_unique_cat_selector

In [None]:
# Display categorical column names with a high number of unique values
high_unique_cat_selector

In [None]:
# Display numerical column names
num_selector(X_train)

In [None]:
# Display numerical column names for the linear regression feature set
num_selector(X_2_train)

### **Instantiate Transformers**

In [None]:
# Imputers
freq_imputer = SimpleImputer(strategy='most_frequent')
median_imputer = SimpleImputer(strategy='median')
# Scaler
scaler = StandardScaler()
# One Hot Encoder
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
# Label Encoder
# Note: LabelEncoder is designed for target labels (y), not feature columns
lab = LabelEncoder()

### **Instantiate Pipelines**

In [None]:
# Numeric pipeline
numeric_pipe = make_pipeline(median_imputer, scaler)
numeric_pipe

In [None]:
# Numeric pipeline without scaling
# Tree based models do not require scaling,
# so skipping it reduces processing time for them
numeric_no_scaler_pipe = make_pipeline(median_imputer)
numeric_no_scaler_pipe

In [None]:
# Categorical pipeline
categorical_pipe = make_pipeline(freq_imputer, ohe)
categorical_pipe

In [None]:
# Categorical pipeline for low-cardinality columns
categorical_low_unique_pipe = make_pipeline(freq_imputer, ohe)
categorical_low_unique_pipe

In [None]:
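# Categorical pipeline for high-cardinality columns
# Note: LabelEncoder is intended for target labels rather than feature columns;
# this pipeline is not used in the ColumnTransformers below
categorical_high_unique_pipe = make_pipeline(freq_imputer, lab)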
categorical_high_unique_pipe

### **Instantiate ColumnTransformer**

In [None]:
# Tuples for Column Transformer
number_tuple = (numeric_pipe, num_selector)
number_no_scaling_tuple = (numeric_no_scaler_pipe, num_selector)
category_tuple = (categorical_pipe, cat_selector)
categorical_low_unique_tuple = (categorical_low_unique_pipe, low_unique_cat_selector)
categorical_high_unique_tuple = (categorical_high_unique_pipe, high_unique_cat_selector)

In [None]:
# ColumnTransformer
preprocessor = make_column_transformer(number_tuple, 
                                       category_tuple, 
                                       remainder='passthrough',
                                       verbose_feature_names_out=False)
preprocessor

In [None]:
# ColumnTransformer without scaling
# - Scaling is not required for tree based models
no_scaling_preprocessor = make_column_transformer(number_no_scaling_tuple, 
                                              category_tuple, 
                                              remainder='passthrough',
                                              verbose_feature_names_out=False)
no_scaling_preprocessor

### **Fit and Transform Data**

In [None]:
# Fit on Train
# Default X_train
preprocessor.fit(X_train)
# Transform Train and Test on Default X_train and X_test
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)

In [None]:
feature_names = preprocessor.get_feature_names_out()

In [None]:
feature_names[0:10]

In [None]:
# Fit on Train
# Linear regression X_train
preprocessor.fit(X_2_train)
# Transform Train and Test on X_2_train and X_2_test 
# used for Linear Regression models
X_2_train_processed = preprocessor.transform(X_2_train)
X_2_test_processed = preprocessor.transform(X_2_test)

In [None]:
feature_names_2 = preprocessor.get_feature_names_out()

In [None]:
feature_names_2[0:10]

### **Inspect the Result**

In [None]:
# Check for missing values and that data is scaled and one-hot encoded
print(f'There are {np.isnan(X_train_processed).sum().sum()} missing values in X_train_processed.')
print(f'There are {np.isnan(X_test_processed).sum().sum()} missing values in X_test_processed.')
print('\n')
print('All data in X_train_processed are', X_train_processed.dtype)
print('All data in X_test_processed are', X_test_processed.dtype)
print('\n')
print('The shape of X_train_processed is', X_train_processed.shape)
print('The shape of X_test_processed is', X_test_processed.shape)

In [None]:
# Create dataframes from the processed arrays
# Default X_train and X_test
X_train_df = pd.DataFrame(X_train_processed, 
                          columns = feature_names, 
                          index=X_train.index)
X_test_df = pd.DataFrame(X_test_processed, 
                         columns = feature_names, 
                         index=X_test.index)
# Linear Regression X_train and X_test
X_2_train_df = pd.DataFrame(X_2_train_processed, 
                            columns = feature_names_2, 
                            index=X_2_train.index)
X_2_test_df = pd.DataFrame(X_2_test_processed, 
                           columns = feature_names_2, 
                           index=X_2_test.index)

In [None]:
X_train_df.describe().round(2)

In [None]:
X_2_train_df.describe().round(2)



---



## **Machine Learning Models**

### **Metric Function**

In [None]:
# Create a dataframe to store model performance metrics
model_metrics_df = pd.DataFrame()

In [None]:
# Create a function that takes a fitted pipeline (or model), predicts on the
# train and test sets, and prints and stores the MAE, MSE, RMSE, and R2 metrics.
# It relies on the global y_train, y_test, and model_metrics_df objects.
def evaluation_model(pipe, model_name='', 
                     x_train=None, x_test=None, params=None):
  for split, x, y in [('Train', x_train, y_train), ('Test', x_test, y_test)]:
    # Predict once and reuse the predictions for every metric
    y_pred = pipe.predict(x)
    mae = round(mean_absolute_error(y, y_pred), 3)
    mse = round(mean_squared_error(y, y_pred), 3)
    rmse = round(np.sqrt(mean_squared_error(y, y_pred)), 3)
    r2 = round(r2_score(y, y_pred), 7)
    # Store the metrics for the model comparison table
    model_metrics_df.loc[model_name, f'{split} MAE'] = mae
    model_metrics_df.loc[model_name, f'{split} MSE'] = mse
    model_metrics_df.loc[model_name, f'{split} RMSE'] = rmse
    model_metrics_df.loc[model_name, f'{split} R2'] = r2
    print(f'{model_name} {split} Scores')
    print(f'MAE: {mae:,.2f} \nMSE: {mse:,.2f} \nRMSE: {rmse:,.2f} \nR2: {r2:.4f}\n')

### **Baseline Model**

In [None]:
# Make an instance of the model
dummy = DummyRegressor(strategy='mean')
# Make a model pipeline
dummy_pipe = make_pipeline(preprocessor, dummy)
# Fit the model
dummy_pipe.fit(X_train, y_train)

In [None]:
# Display model performance metrics
evaluation_model(pipe=dummy_pipe, model_name='Dummy Model',
                 x_train=X_train, x_test=X_test)

### **Linear Regression Model**

#### **Version 1**

In [None]:
# Make an instance of the model
lin_reg = LinearRegression(positive=True)
# Make a model pipeline
lin_reg_pipe = make_pipeline(preprocessor, lin_reg)
# Fit the model
lin_reg_pipe.fit(X_2_train, y_train)

In [None]:
# Display model performance metrics
evaluation_model(pipe=lin_reg_pipe, 
                 model_name='Linear Regression', 
                 x_train=X_2_train,
                 x_test=X_2_test)

#### **Version 2**

##### **Assumption of Normality**

In [None]:
# Calculate the residual errors
train_residuals = y_train - lin_reg_pipe.predict(X_2_train)
test_residuals = y_test - lin_reg_pipe.predict(X_2_test)

In [None]:
# Display the first 10 residuals for the test set
test_residuals[0:10]

In [None]:
# Display a QQ Plot
sm.graphics.qqplot(test_residuals, line='45', fit=True);

In the QQ plot above, the quantiles of the residuals are plotted on the y-axis and the quantiles of a perfect normal distribution are plotted on the x-axis.

- If the residuals were perfectly normally distributed, their quantiles would equal the theoretical values.
- The red diagonal line shows the expected values when the residuals are normal.
- The further the markers deviate from the red line, the more the residuals violate the assumption of normality.
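
To make the QQ plot reading concrete, the sketch below computes the two sets of quantiles by hand for a made-up, right-skewed sample. The helper `qq_points` and the plotting positions `(i - 0.5) / n` are assumptions of this sketch; `sm.graphics.qqplot` may use a different convention internally.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical, sample) quantiles for a normal QQ plot."""
    sample = np.sort(np.asarray(residuals, dtype=float))
    n = len(sample)
    # Standardize the sample so it is comparable to a standard normal
    sample = (sample - sample.mean()) / sample.std(ddof=1)
    # Plotting positions (i - 0.5) / n are an assumption of this sketch
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical, sample

# Made-up, strongly right-skewed "residuals"
skewed = np.exp(np.linspace(0, 5, 50))
theoretical, sample = qq_points(skewed)
# A positive gap means the largest residuals sit above the 45-degree line
upper_tail_gap = sample[-1] - theoretical[-1]
```

A large gap in either tail corresponds to markers bending away from the red line in the plot.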

##### **Assumption of Homoscedasticity**

In [None]:
fig, ax = plt.subplots()
ax.scatter(lin_reg_pipe.predict(X_2_test), test_residuals, ec='white', lw=1)
ax.axhline(0)
ax.set(ylabel='Residuals',xlabel='Predicted Value');

If the assumption of homoscedasticity is met, the residuals should show no clear pattern and should be approximately equally spread out.
- It's okay if there is some variability at various points along the x-axis.
- What we really DON'T want to see is a clear cone shape in the residuals.
- The residual plot above clearly shows a cone shape, with tightly clustered residuals on the left that spread out as we move towards the right.
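
The cone shape can also be checked numerically by comparing the spread of the residuals across bins of the predicted value. This is a rough sketch on synthetic data, not a formal test; `residual_spread_by_bin` and the simulated values are hypothetical.

```python
import numpy as np

def residual_spread_by_bin(predicted, residuals, n_bins=4):
    """Standard deviation of the residuals within equal-sized bins of the predictions."""
    predicted = np.asarray(predicted, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    # Sort the residuals by predicted value, then split them into n_bins groups
    order = np.argsort(predicted)
    groups = np.array_split(residuals[order], n_bins)
    return [float(np.std(group)) for group in groups]

# Synthetic cone: the noise grows in proportion to the predicted value
rng = np.random.default_rng(0)
pred = np.linspace(100, 4000, 400)
resid = rng.normal(0, 0.3 * pred)
spread = residual_spread_by_bin(pred, resid)
```

Roughly equal values in `spread` would be consistent with homoscedasticity, while values that climb from the first bin to the last match the cone described above. With the notebook's objects, the call would presumably be `residual_spread_by_bin(lin_reg_pipe.predict(X_2_test), test_residuals)`.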

##### **Remove Outliers**

In [None]:
# Convert y_train to z-scores with StandardScaler()
z_scores = scaler.fit_transform(y_train.values.reshape(-1,1))

In [None]:
# Convert the z-scores back to a pd.Series
# with the same index that it had originally
z_scores = pd.Series(z_scores.flatten(), 
                    index=y_train.index )

In [None]:
# Create an outlier filter
idx_outliers = z_scores > 3

In [None]:
# Display the number of outliers
idx_outliers.sum()

In [None]:
# Create a cleaned version of y_train and X_train with outliers removed
y_train_cln = y_train[~idx_outliers]
X_train_cln = X_2_train_df.loc[y_train_cln.index]

In [None]:
y_train_cln.head()

In [None]:
X_train_cln.head()
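
As a small worked example of the z-score filter used above, the sketch below applies the same `z > 3` rule to a made-up series. The values are hypothetical; `ddof=0` is used because `StandardScaler` divides by the population standard deviation.

```python
import numpy as np
import pandas as pd

# Made-up sales: nineteen typical values and one extreme value
sales = pd.Series([100.0] * 19 + [1000.0])

# z-score with the population standard deviation (ddof=0), as StandardScaler does
z = (sales - sales.mean()) / sales.std(ddof=0)

# Same rule as above: flag observations more than 3 standard deviations above the mean
is_outlier = z > 3
sales_clean = sales[~is_outlier]
```

Note that `z > 3` only flags unusually high values; `z.abs() > 3` would also flag unusually low ones, which may or may not be wanted for a sales target.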

##### **Remove Columns with Insignificant P-Values**

- Check the p-value for each coefficient (the P>|t| column of the statsmodels summary) and remove any features whose p-values are above the chosen significance level (commonly 0.05).

### **Linear Regression with Elastic Net Model**

In [None]:
# Make an instance of the model
el_net_reg = ElasticNet(positive=True, random_state=42)
# Make a model pipeline
el_net_reg_pipe = make_pipeline(preprocessor, el_net_reg)
# Fit the model
el_net_reg_pipe.fit(X_2_train, y_train)

In [None]:
# Display model performance metrics
evaluation_model(pipe=el_net_reg_pipe, 
                 model_name='Elastic Net', 
                 x_train=X_2_train,
                 x_test=X_2_test)

In [None]:
# Looking at options for tuning this model
el_net_reg.get_params()

In [None]:
# Create dictionary of hyperparameters to test
param_grid = {'elasticnet__alpha': [2.0, 2.25, 2.5, 2.75, 3],
              'elasticnet__l1_ratio': [0.98, 0.99, 1],
              'elasticnet__max_iter': [100000]}        
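
The `elasticnet__` prefix in the grid above comes from how `make_pipeline` names its steps: each step is named after its class in lowercase, and a parameter is addressed as `<step name>__<parameter>`. The sketch below illustrates this on a small stand-in pipeline (`demo_pipe` is hypothetical and uses a plain `StandardScaler` instead of the notebook's preprocessor).

```python
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline; the notebook's pipeline uses its own preprocessor instead
demo_pipe = make_pipeline(StandardScaler(), ElasticNet())

# Step names are generated from the lowercase class names
step_names = [name for name, _ in demo_pipe.steps]

# Every tunable ElasticNet parameter is exposed with the step name as a prefix
tunable = sorted(key for key in demo_pipe.get_params()
                 if key.startswith('elasticnet__'))
```

Listing the keys this way is a quick check that the names in `param_grid` match what `GridSearchCV` expects.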

In [None]:
# Make an instance of the model
grid_pipe = GridSearchCV(el_net_reg_pipe, param_grid)

In [None]:
# Fit the model
grid_pipe.fit(X_2_train, y_train)

In [None]:
grid_pipe.best_params_

In [None]:
el_net_reg_pipe = grid_pipe.best_estimator_

In [None]:
# Display model performance metrics
evaluation_model(pipe=el_net_reg_pipe, 
                 model_name='Elastic Net', 
                 x_train=X_2_train,
                 x_test=X_2_test)

### **K Nearest Neighbors Model**

In [None]:
# Make an instance of the model
knn = KNeighborsRegressor()
# Make a model pipeline
knn_pipe = make_pipeline(preprocessor, knn)
# Fit the model
knn_pipe.fit(X_train, y_train)

In [None]:
# Display model performance metrics
evaluation_model(pipe=knn_pipe, model_name='K Nearest Neighbors',
                 x_train=X_train, x_test=X_test)

In [None]:
# Look at the hyperparameters
knn.get_params()

In [None]:
# Create dictionary of hyperparameters to test
param_grid = {'kneighborsregressor__n_neighbors': range(25,40),
              'kneighborsregressor__leaf_size': range(2,5)}

In [None]:
# Make an instance of the model
grid_pipe = GridSearchCV(knn_pipe, param_grid)

In [None]:
# Fit the model
grid_pipe.fit(X_train, y_train)

In [None]:
grid_pipe.best_params_

In [None]:
knn_pipe = grid_pipe.best_estimator_

In [None]:
# Display model performance metrics
evaluation_model(pipe=knn_pipe, model_name='K Nearest Neighbors',
                 x_train=X_train, x_test=X_test)

### **Decision Tree Model**

In [None]:
# Make an instance of the model
dec_tree = DecisionTreeRegressor(random_state = 42)
# Make a model pipeline
dec_tree_pipe = make_pipeline(no_scaling_preprocessor, dec_tree)
# Fit the model
dec_tree_pipe.fit(X_train, y_train)

In [None]:
# Display model performance metrics
evaluation_model(pipe=dec_tree_pipe, model_name='Decision Tree',
                 x_train=X_train, x_test=X_test)

- The default decision tree had a much higher R2 score on the training data than it did on the test data.  
- This is an indication that there is extremely high variance and the model is overfit.  

In [None]:
# Looking at options for tuning this model
dec_tree.get_params()

In [None]:
# Display the default model depth
dec_tree.get_depth()

In [None]:
# Display the default model number of leaves
dec_tree.get_n_leaves()

In [None]:
# List of values to try for max_depth:
max_depth_range = list(range(2, 20)) # will try every value from 2 through 19
# List to store the score for each value of max_depth:
train_r2 = []
test_r2 = []
for depth in max_depth_range:
    dec_tree = DecisionTreeRegressor(max_depth = depth, 
                             random_state = 42)
    dec_tree_pipe = make_pipeline(no_scaling_preprocessor, dec_tree)
    dec_tree_pipe.fit(X_train, y_train)
    train_score = dec_tree_pipe.score(X_train, y_train)
    test_score = dec_tree_pipe.score(X_test, y_test)
    train_r2.append(train_score)
    test_r2.append(test_score)

In [None]:
# Visualize the max_depths to display which achieves the highest R2 score
plt.plot(max_depth_range, train_r2)
plt.plot(max_depth_range, test_r2)
plt.xlabel('max_depth')
plt.ylabel('R2');

- From the plot above, the highest test R2 score appears to be achieved when the hyperparameter max_depth equals 5.
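
The peak can also be read off programmatically rather than from the plot. The helper `best_depth` and the scores below are hypothetical and for illustration only; they are not results from this notebook.

```python
import numpy as np

def best_depth(depths, scores):
    """Return the depth with the highest score (the first one on ties) and that score."""
    best = int(np.argmax(scores))
    return depths[best], scores[best]

# Hypothetical scores, for illustration only
demo_depths = list(range(2, 8))
demo_test_r2 = [0.43, 0.52, 0.58, 0.60, 0.59, 0.57]
depth, score = best_depth(demo_depths, demo_test_r2)
```

With the lists built in the loop above, the call would be `best_depth(max_depth_range, test_r2)`. Choosing a hyperparameter on the test set makes the reported test score somewhat optimistic, which is one reason the later cells switch to `GridSearchCV`.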

In [None]:
# Make an instance of the model
dec_tree = DecisionTreeRegressor(max_depth=5, random_state = 42)
# Make a model pipeline
dec_tree_pipe = make_pipeline(no_scaling_preprocessor, dec_tree)
# Fit the model
dec_tree_pipe.fit(X_train, y_train)

In [None]:
# Display model performance metrics
evaluation_model(pipe=dec_tree_pipe, model_name='Decision Tree',
                 x_train=X_train, x_test=X_test)

In [None]:
# Make an instance of the model
dec_tree = DecisionTreeRegressor(random_state = 42)
# Make a model pipeline
dec_tree_pipe = make_pipeline(no_scaling_preprocessor, dec_tree)
# Fit the model
dec_tree_pipe.fit(X_train, y_train)

In [None]:
# Create dictionary of hyperparameters to test
param_grid = {'decisiontreeregressor__max_depth': range(4,8),
              'decisiontreeregressor__min_samples_leaf': range(45,60),
              'decisiontreeregressor__min_samples_split': range(2,4)}       

In [None]:
# Make an instance of the model
grid_pipe = GridSearchCV(dec_tree_pipe, param_grid)

In [None]:
# Fit the model
grid_pipe.fit(X_train, y_train)

In [None]:
grid_pipe.best_params_

In [None]:
dec_tree_pipe = grid_pipe.best_estimator_

In [None]:
# Display model performance metrics
evaluation_model(pipe=dec_tree_pipe, model_name='Decision Tree',
                 x_train=X_train, x_test=X_test)

In [None]:
# Plot the Decision Tree
# Make an instance of the model
dec_tree = DecisionTreeRegressor(max_depth=6, min_samples_leaf=51,
                                 min_samples_split=2, random_state=42)
# Fit the model
dec_tree.fit(X_2_train_df, y_train)
# Display the plot
plt.figure(figsize=(100,10))
a = tree.plot_tree(dec_tree, feature_names=feature_names_2,
                   filled=True, 
                   rounded=True, 
                   fontsize=10)

### **Random Forest Model**

In [None]:
# Make an instance of the model
ran_for = RandomForestRegressor(random_state=42)
# Make a model pipeline
ran_for_pipe = make_pipeline(no_scaling_preprocessor, ran_for)
# Fit the model
ran_for_pipe.fit(X_train, y_train)

In [None]:
# Display model performance metrics
evaluation_model(pipe=ran_for_pipe, model_name='Random Forest',
                 x_train=X_train, x_test=X_test)

In [None]:
# Looking at some hyperparameters that seem tunable
ran_for.get_params()

In [None]:
est_depths = [estimator.get_depth() for estimator in ran_for.estimators_]
max(est_depths)

In [None]:
depths = range(1, max(est_depths))
scores = pd.DataFrame(index=depths, columns=['Test Score'])
for depth in depths:    
   ran_for = RandomForestRegressor(max_depth=depth, random_state=42)
   ran_for_pipe = make_pipeline(no_scaling_preprocessor, ran_for)
   ran_for_pipe.fit(X_train, y_train)
   scores.loc[depth, 'Train Score'] = ran_for_pipe.score(X_train, y_train)
   scores.loc[depth, 'Test Score'] = ran_for_pipe.score(X_test, y_test)

In [None]:
# Plot the scores
plt.plot(scores['Test Score'])
plt.plot(scores['Train Score'])
plt.show()

In [None]:
# Make an instance of the model
ran_for = RandomForestRegressor(max_depth=5, random_state=42)
# Make a model pipeline
ran_for_pipe = make_pipeline(no_scaling_preprocessor, ran_for)
# Fit the model
ran_for_pipe.fit(X_train, y_train)

In [None]:
# Display model performance metrics
evaluation_model(pipe=ran_for_pipe, model_name='Random Forest',
                 x_train=X_train, x_test=X_test)

In [None]:
# Looking at some hyperparameters that seem tunable
ran_for.get_params()

In [None]:
# Create dictionary of hyperparameters to test
param_grid = {'randomforestregressor__n_estimators': range(80,95),
               'randomforestregressor__max_depth': [4, 5, 6],
               'randomforestregressor__min_samples_split': [2, 3]}

In [None]:
# Make an instance of the model
grid_pipe = GridSearchCV(ran_for_pipe, param_grid)

In [None]:
# Fit the model
grid_pipe.fit(X_train, y_train)

In [None]:
grid_pipe.best_params_

In [None]:
ran_for_pipe = grid_pipe.best_estimator_

In [None]:
# Display model performance metrics
evaluation_model(pipe=ran_for_pipe, model_name='Random Forest',
                 x_train=X_train, x_test=X_test)

### **Extreme Gradient Boosted Machine Model**

In [None]:
# Make an instance of the model
xgb_reg = XGBRegressor()
# Make a model pipeline
xgb_reg_pipe = make_pipeline(preprocessor, xgb_reg)
# Fit the model
xgb_reg_pipe.fit(X_train, y_train)

In [None]:
# Display model performance metrics
evaluation_model(pipe=xgb_reg, 
                 model_name='XGBoost',
                 x_train=X_train_processed, x_test=X_test_processed)

In [None]:
# Display the model's hyperparameters available for tuning
xgb_reg.get_params()

In [None]:
# Create a dictionary of hyperparameters to test
param_grid ={'gamma': [0],
             'learning_rate': [.0045], 
             'max_depth': [3],
             'min_child_weight': [0],
             'subsample': [.77],
             'colsample_bytree': [.85],
             'n_estimators': [980]}

In [None]:
# Make an instance of the model
grid_pipe = GridSearchCV(xgb_reg, param_grid,
                        cv = 2, n_jobs = 5,
                        verbose=True)

In [None]:
# Fit the model
grid_pipe.fit(X_train_processed, y_train)

In [None]:
# Display the best hyperparameters
grid_pipe.best_params_

In [None]:
xgb_reg_pipe = grid_pipe.best_estimator_

In [None]:
# Display model performance metrics
evaluation_model(pipe=xgb_reg_pipe, model_name='XGBoost',
                 x_train=X_train_processed, x_test=X_test_processed)

### **Light Gradient Boosted Machine Model**

In [None]:
# Make an instance of the model
lgb_reg = LGBMRegressor()
# Make a model pipeline
lgb_reg_pipe = make_pipeline(preprocessor, lgb_reg)
# Fit the model
lgb_reg_pipe.fit(X_train, y_train)

In [None]:
# Display model performance metrics
evaluation_model(pipe=lgb_reg, 
                 model_name='LGBoost',
                 x_train=X_train_processed, x_test=X_test_processed)

In [None]:
# Display the model's hyperparameters available for tuning
lgb_reg.get_params()

In [None]:
# Create a dictionary of hyperparameters to test
param_grid ={#'objective':['reg:squarederror'],
             'gamma': [0],
             'learning_rate': [.015], 
             'max_depth': [3],
             'min_child_weight': [0],
             'colsample_bytree': [1],
             'n_estimators': [300]}

In [None]:
# Make an instance of the model
grid_pipe = GridSearchCV(lgb_reg, param_grid,
                        cv = 2, n_jobs = 5,
                        verbose=True)

In [None]:
# Fit the model
grid_pipe.fit(X_train_processed, y_train)

In [None]:
# Display the best hyperparameters
grid_pipe.best_params_

In [None]:
# Keep the best estimator found by the grid search
lgb_reg_pipe = grid_pipe.best_estimator_

In [None]:
# Display model performance metrics
evaluation_model(pipe=lgb_reg_pipe, model_name='LGBoost',
                 x_train=X_train_processed, x_test=X_test_processed)

## **Model Performance Comparison**

In [None]:
model_metrics_df = model_metrics_df.drop(index=['Dummy Model'])

In [None]:
model_metrics_df.sort_values(by='Test R2', ascending=False)\
                .style.format({"Train MAE":  "${:20,.2f}", 
                               "Train MSE":  "${:20,.2f}", 
                               "Train RMSE": "${:20,.2f}",
                               "Train R2":   "{:.4%}",
                               "Test MAE":   "${:20,.2f}",
                               "Test MSE":   "${:20,.2f}",
                               "Test RMSE":  "${:20,.2f}",
                               "Test R2":    "{:.4%}"})\
                .background_gradient(cmap='Blues_r', 
                subset=['Test RMSE'])\
                .background_gradient(cmap='Blues', 
                subset=['Test R2'])

### **Mean Absolute Error (MAE)**

$$ \Large MAE = \frac{\sum_{i=1}^{n}|y_{i} - \hat y_{i}|}{n}$$
- To prevent positive and negative errors from cancelling each other out, we take the absolute value of the errors before we sum them.

- MAE is in the same units as the original target.
- It answers the question: on average, how far off are the model's predictions from the true values?
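
The cancellation problem is easy to see on a small made-up example, where the plain mean of the errors is zero even though every prediction is wrong. The values below are hypothetical.

```python
import numpy as np

# Made-up true values and predictions
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 350.0, 350.0])

errors = y_true - y_pred          # [-10, 10, -50, 50]

# Positive and negative errors cancel out in the plain mean
mean_error = errors.mean()        # 0.0

# Taking absolute values first gives the typical size of an error
mae = np.abs(errors).mean()       # 30.0
```

Here the MAE of 30 is in the same units as the target, so it can be read directly as "off by about 30 on average".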

In [None]:
line_plot(model_metrics_df, column_1='Train MAE', column_2='Test MAE',
             fs=(10,5), file_name='Model Performance MAE Scores.png',
             tit_lab='Model Performance', x_lab='', y_lab='MAE', 
             fmt=price_0_fmt,
             hza='center', rot=20)

### **Mean Squared Error (MSE)**

- To prevent positive and negative errors from cancelling each other out, we could also square the errors (since a negative number squared becomes a positive number).

$$ \Large MSE = \frac{\sum_{i=1}^{n}(y_{i} - \hat y_{i})^2}{n}$$


- Statisticians often prefer MSE over MAE because squaring $y_{i} - \hat y_{i}$, instead of taking the absolute value, punishes larger errors more severely.

- Unlike MAE, MSE is no longer in the same units as the target; it is in units squared.


In [None]:
line_plot(model_metrics_df, column_1='Train MSE', column_2='Test MSE',
             fs=(10,5), file_name='Model Performance MSE Scores.png',
             tit_lab='Model Performance', x_lab='', y_lab='MSE', 
             fmt=price_0_fmt,
             hza='center', rot=20)

### **Root-Mean Squared Error (RMSE)**

- To convert MSE back to the same units as the original target, we can take the square root of the MSE to get RMSE.

$$ \Large RMSE = \sqrt{\frac{\sum_{i=1}^{n}(y_{i} - \hat y_{i})^2}{n}}$$

>- RMSE is often the most useful of MAE, MSE, and RMSE: like MSE it penalizes large errors more heavily, and like MAE it is in the same units as the target.
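
The sketch below compares two made-up sets of predictions that have the same MAE but very different MSE and RMSE, which shows how squaring penalizes a single large error. All names and values are hypothetical.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def rmse(y_true, y_pred):
    """Root-mean squared error, back in the units of the target."""
    return float(np.sqrt(mse(y_true, y_pred)))

y_true = np.array([100.0, 200.0, 300.0, 400.0])
# Every prediction is off by 30
pred_even = y_true + 30.0
# Three perfect predictions and one that is off by 120 (same MAE of 30)
pred_spiky = y_true + np.array([0.0, 0.0, 0.0, 120.0])

mse_even, rmse_even = mse(y_true, pred_even), rmse(y_true, pred_even)        # 900, 30
mse_spiky, rmse_spiky = mse(y_true, pred_spiky), rmse(y_true, pred_spiky)    # 3600, 60
```

Both sets have an MAE of 30, but the single large miss doubles the RMSE from 30 to 60.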

In [None]:
line_plot(model_metrics_df, column_1='Train RMSE', column_2='Test RMSE',
             fs=(10,5), file_name='Model Performance RMSE Scores.png',
             tit_lab='Model Performance', x_lab='', y_lab='RMSE', 
             fmt=price_0_fmt,
             hza='center', rot=20)

### **Coefficient of Determination (R2)**


> **The $R^2$ or Coefficient of determination is a statistical measure that is used to assess the goodness of fit of a regression model**

>- Values typically fall between 0 and 1.
>  - $R^2$ is the proportion (%) of the variance in our target that our model could explain.
>  - $R^2$=0.8 means our model can explain 80% of the variance in our target.
>  - If we have a REALLY BAD model, we may get a negative $R^2$.



- The **Sum of Squared Errors (SSE)** for our model's **predicted values ($\hat{y}_i$) vs the true values ($y_i$)**: 
$$\text{SSE of our Predictions } = \sum_i(y_i - \hat y_i)^2$$

- The **SSE if we use the mean as our prediction ($\bar{y}$) vs the true values ($y_i$)**:

 $$\text{SSE of the Mean } = \sum_i(y_i - \bar y)^2$$


- $R^2$ (R-Square) calculates how much better our model's predictions are vs if we just used the mean instead. 


$$ \large R^2 = 1 - \dfrac{\text{SSE of our Predictions}}{ \text{SSE of the Mean }}  $$


<br>

$$ \large R^2  = 1 - \dfrac{\sum_i(y_i - \hat y_i)^2}{\sum_i(y_i - \bar y)^2} $$
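
The formula can be checked on a small made-up example, including the negative case mentioned earlier. The helper `r_squared` and the values below are hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (SSE of the predictions) / (SSE of the mean)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse_model = np.sum((y_true - y_pred) ** 2)
    sse_mean = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - sse_model / sse_mean)

y_true = np.array([100.0, 200.0, 300.0, 400.0])

# Perfect predictions explain all of the variance
r2_perfect = r_squared(y_true, y_true)                        # 1.0
# Always predicting the mean is exactly as good as the baseline
r2_mean = r_squared(y_true, np.full(4, y_true.mean()))        # 0.0
# Predictions worse than the mean give a negative score
r2_bad = r_squared(y_true, y_true[::-1])                      # -3.0
```

For the reversed predictions, the SSE of the predictions is 200,000 against 50,000 for the mean, so $R^2 = 1 - 4 = -3$.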



In [None]:
line_plot(model_metrics_df, column_1='Train R2', column_2='Test R2',
             fs=(10,5), file_name='Model Performance R2 Scores.png',
             tit_lab='Model Performance', x_lab='', y_lab='R2', 
             fmt=perc_0_fmt,
             hza='center', rot=20)

## **Feature Importances**

In [None]:
feature_importance_df = pd.DataFrame(ran_for_pipe.named_steps['randomforestregressor']
                                     .feature_importances_,
                                     index = feature_names,  
                                     columns=['Feature Importance'])\
                                     .sort_values('Feature Importance', 
                                     ascending=False)

In [None]:
# Create Feature column from index
feature_importance_df.insert(loc = 0,
          column = 'Feature',
          value = feature_importance_df.index)

In [None]:
# Reset index
feature_importance_df.reset_index(drop=True, inplace=True)

In [None]:
# Display the first (5) rows of the dataframe
feature_importance_df.head(5).style.format({"Feature Importance":  "{:.4%}"})

In [None]:
# Utilize function to display bar plot
bar_plot(df=feature_importance_df, 
         x_column_name='Feature', 
         y_column_name='Feature Importance', 
         fs=(10,5), file_name='Feature Importance.png',
         tit_lab='Top 5 Features by Importance',
         label_order=feature_importance_df.Feature.head(5),
         fmt=perc_0_fmt,
         hza='right', rot=20)

## **Summary**

## **Recommendations**

---

## **To do List**

- EDA
  - Groupby objects to dataframes


- Writeup: Summary/Recommendation/ReadMe
  - Statistical Equations and explanations
  - DataFrames to images
    - Jupyter NB Download Menu - Dataframe as Image (PDF or MD)
    - df to png: [link](https://pypi.org/project/dataframe-image/)
      - `import dataframe_image as dfi`
      - `dfi.export(styled_df, 'filename.png')`

---

- Feature Importance
  - Stack 5
  - SKLearn Permutation


- Model Stacking 


- Linear Regression Statsmodels 
  - Linear Regression Model Assumptions


- Functions 
  - utilize `*args` and `**kwargs`
