# Predicting Supermarket Sales
- Andrea Cohen
- 03.22.23

## Data:
- Original data source: https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/

## Data Dictionary:

Variable Name | Description
---| ---
Item_Identifier| Unique product ID
Item_Weight| Weight of product
Item_Fat_Content| Whether the product is low fat or regular
Item_Visibility| The percentage of total display area of all products in a store allocated to the particular product
Item_Type| The category to which the product belongs
Item_MRP| Maximum Retail Price (list price) of the product
Outlet_Identifier| Unique store ID
Outlet_Establishment_Year| The year in which store was established
Outlet_Size| The size of the store in terms of ground area covered
Outlet_Location_Type| The type of area in which the store is located
Outlet_Type| Whether the outlet is a grocery store or some sort of supermarket
Item_Outlet_Sales| Sales of the product in the particular store. This is the target variable to be predicted.


## Preliminary steps

### Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
import joblib
from sklearn.inspection import permutation_importance

### Set the random state for reproducibility

In [2]:
SEED = 321
np.random.seed(SEED)

### Set pandas to display more columns


In [3]:
pd.set_option('display.max_columns', 50)

### Custom functions

In [4]:
# for evaluating a regression model using r-squared and RMSE
# RMSE is taken as the square root of MSE because the `squared` argument
# of mean_squared_error is deprecated in newer scikit-learn releases
def evaluate_regression(model, X_train, y_train, X_test, y_test):
    y_pred_train = model.predict(X_train)
    r2_train = metrics.r2_score(y_train, y_pred_train)
    rmse_train = np.sqrt(metrics.mean_squared_error(y_train, y_pred_train))
    print(f"Training Data:\tR^2= {r2_train:.2f}\tRMSE= {rmse_train:.2f}")
    y_pred_test = model.predict(X_test)
    r2_test = metrics.r2_score(y_test, y_pred_test)
    rmse_test = np.sqrt(metrics.mean_squared_error(y_test, y_pred_test))
    print(f"Test Data:\tR^2= {r2_test:.2f}\tRMSE= {rmse_test:.2f}")

In [5]:
# for feature importance
def get_importances(model, feature_names = None, name = 'Feature Importance', sort = False, ascending = True):
    # `is None` avoids an ambiguous element-wise comparison when an array of names is passed
    if feature_names is None:
        feature_names = model.feature_names_in_
    importances = pd.Series(model.feature_importances_, index = feature_names, name = name)
    if sort:
        importances = importances.sort_values(ascending = ascending)
    return importances

In [6]:
# for plotting importances
def plot_importance(importances, top_n = None,  figsize = (8,6)):
    if top_n is None:
        plot_vals = importances.sort_values()
        title = "All Features - Ranked by Importance"
    else:
        plot_vals = importances.sort_values().tail(top_n)
        title = f"Top {top_n} Most Important Features"
    ax = plot_vals.plot(kind = 'barh', figsize = figsize)
    ax.set(xlabel = 'Importance', ylabel = 'Feature Names', title = title)
    return ax

In [7]:
# for creating a dictionary of each feature and its color
def get_color_dict(importances, color_rest = '#006ba4' , color_top = 'green', top_n = 7):
    highlight_feats = importances.sort_values(ascending = True).tail(top_n).index
    colors_dict = {col: color_top if col in highlight_feats else color_rest for col in importances.index}
    return colors_dict

In [8]:
# for creating a color-coded plot
def plot_importance_color(importances, top_n = None,  figsize = (8, 6), color_dict = None):
    if top_n is None:
        plot_vals = importances.sort_values()
        title = "All Features - Ranked by Importance"
    else:
        plot_vals = importances.sort_values().tail(top_n)
        title = f"Top {top_n} Most Important Features"
    if color_dict is not None:
        colors = plot_vals.index.map(color_dict)
        ax = plot_vals.plot(kind = 'barh', figsize = figsize, color = colors)
    else:
        ax = plot_vals.plot(kind = 'barh', figsize = figsize)
    ax.set(xlabel = 'Importance', ylabel = 'Feature Names', title = title)
    return ax

### Load the data

In [9]:
df = pd.read_csv('Data/sales_predictions.csv')
display(df.head())
display(df.info())

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


None

### Inspect the data

In [10]:
#how many rows and columns?
df.shape

(8523, 12)

- There are 8523 rows and 12 columns.

In [11]:
#what are the datatypes of each variable?
df.dtypes

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object

- Item_Identifier, Item_Fat_Content, Item_Type, Outlet_Identifier, Outlet_Size, Outlet_Location_Type, and Outlet_Type are all datatype object.
- Item_Weight, Item_Visibility, Item_MRP, and Item_Outlet_Sales are all datatype float64.
- Outlet_Establishment_Year is datatype int64.

In [12]:
display(df.describe(include='number'))
display(df.describe(exclude='number'))

| | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Item_Outlet_Sales |
|---|---|---|---|---|---|
| count | 7060.0 | 8523.0 | 8523.0 | 8523.0 | 8523.0 |
| mean | 12.857645 | 0.066132 | 140.992782 | 1997.831867 | 2181.288914 |
| std | 4.643456 | 0.051598 | 62.275067 | 8.37176 | 1706.499616 |
| min | 4.555 | 0.0 | 31.29 | 1985.0 | 33.29 |
| 25% | 8.77375 | 0.026989 | 93.8265 | 1987.0 | 834.2474 |
| 50% | 12.6 | 0.053931 | 143.0128 | 1999.0 | 1794.331 |
| 75% | 16.85 | 0.094585 | 185.6437 | 2004.0 | 3101.2964 |
| max | 21.35 | 0.328391 | 266.8884 | 2009.0 | 13086.9648 |


| | Item_Identifier | Item_Fat_Content | Item_Type | Outlet_Identifier | Outlet_Size | Outlet_Location_Type | Outlet_Type |
|---|---|---|---|---|---|---|---|
| count | 8523 | 8523 | 8523 | 8523 | 6113 | 8523 | 8523 |
| unique | 1559 | 5 | 16 | 10 | 3 | 3 | 4 |
| top | FDW13 | Low Fat | Fruits and Vegetables | OUT027 | Medium | Tier 3 | Supermarket Type1 |
| freq | 10 | 5089 | 1232 | 935 | 2793 | 3350 | 5577 |


In [13]:
#are there any duplicates?
df.duplicated().sum()

0

- There are 0 duplicates.

In [14]:
#identify missing values
df.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

- There are 1463 missing values in Item_Weight, and there are 2410 missing values in Outlet_Size.

- For the column 'Item_Weight':
    - Dropping rows is not a good option because 17% of rows are missing a value, far more than the roughly 2% that could be dropped without losing much data.
    - Dropping the column is not a good option because the weight of an item might help predict its sales. Also, fewer than 50% of the values are missing, which is too few to justify removing the column.
    - Creating a new category is not a good option because the column is numeric (float) rather than categorical (object).
    - Imputing with the mean is a reasonable option because it keeps every row and uses the average weight as a stand-in for each unknown value.

- For the column 'Outlet_Size':
    - Dropping rows is not a good option because 28% of rows are missing a value, far more than the roughly 2% that could be dropped without losing much data.
    - Dropping the column is not a good option because the size of the outlet might help predict sales. Also, fewer than 50% of the values are missing, which is too few to justify removing the column.
    - Imputing with the mean is not a good option because the column is categorical (object) rather than numeric (float or int).
    - Creating a new category is a good option because the information is categorical, and there might be a pattern to the missing data.
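
As a rough illustration of the two strategies discussed above, the sketch below fills a numeric column with its mean and fills a categorical column with a placeholder category. The toy DataFrame is made up for the example, and the 'MISSING' label is simply one possible placeholder.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values) with one gap in each column
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 19.2],
    "Outlet_Size": pd.Series(["Medium", "High", np.nan, "Small"], dtype=object),
})

# Numeric column: replace NaN with the mean of the observed values
mean_imputer = SimpleImputer(strategy="mean")
weights = mean_imputer.fit_transform(toy[["Item_Weight"]]).ravel()

# Categorical column: treat missingness as its own category
const_imputer = SimpleImputer(strategy="constant", fill_value="MISSING")
sizes = const_imputer.fit_transform(toy[["Outlet_Size"]]).ravel()

print(weights)
print(sizes)
```

In a modeling workflow the imputers would normally be fit on the training split only, which is what placing them inside a pipeline achieves.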

In [15]:
#find and fix any inconsistent categories of data
str_cols = df.select_dtypes(include='object').columns
for col in str_cols:
    print(f'Column= {col}')
    print(df[col].value_counts())
    print(' ')

Column= Item_Identifier
FDW13    10
FDG33    10
NCY18     9
FDD38     9
DRE49     9
         ..
FDY43     1
FDQ60     1
FDO33     1
DRF48     1
FDC23     1
Name: Item_Identifier, Length: 1559, dtype: int64
 
Column= Item_Fat_Content
Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64
 
Column= Item_Type
Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Hard Drinks               214
Others                    169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: Item_Type, dtype: int64
 
Column= Outlet_Identifier
OUT027    935
OUT013    932
OUT049    930
OUT046    930
OUT035    930
OUT045    929
OUT018  

- From the data dictionary, we know that Item_Fat_Content, Item_Type, Outlet_Size, Outlet_Location_Type, and Outlet_Type should be categorical data types.
- For Item_Fat_Content, Low Fat, LF, and low fat are all probably the same category.
- Also Regular and reg are probably the same category.
- For the rest of the categorical columns, all data categories appear distinct.

In [16]:
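# assign the result back; calling replace with inplace=True on a selected column may not update df in newer pandas
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace({'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'})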
print('Column = Item_Fat_Content')
display(df['Item_Fat_Content'].value_counts())

Column = Item_Fat_Content


Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64

- There are no more inconsistent categories of data.

In [17]:
#ordinal encoding
df['Outlet_Size'].value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

In [18]:
replacement_dictionary = {'High':2, 'Medium':1, 'Small':0}
# map each size label to its ordinal rank; missing values stay NaN
df['Outlet_Size'] = df['Outlet_Size'].map(replacement_dictionary)
df['Outlet_Size']

0       1.0
1       1.0
2       1.0
3       NaN
4       2.0
       ... 
8518    2.0
8519    NaN
8520    0.0
8521    1.0
8522    0.0
Name: Outlet_Size, Length: 8523, dtype: float64

In [19]:
#for any numerical columns obtain the summary statistics of each (min, max, mean)
df.describe()

| | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Outlet_Size | Item_Outlet_Sales |
|---|---|---|---|---|---|---|
| count | 7060.0 | 8523.0 | 8523.0 | 8523.0 | 6113.0 | 8523.0 |
| mean | 12.857645 | 0.066132 | 140.992782 | 1997.831867 | 0.761819 | 2181.288914 |
| std | 4.643456 | 0.051598 | 62.275067 | 8.37176 | 0.697463 | 1706.499616 |
| min | 4.555 | 0.0 | 31.29 | 1985.0 | 0.0 | 33.29 |
| 25% | 8.77375 | 0.026989 | 93.8265 | 1987.0 | 0.0 | 834.2474 |
| 50% | 12.6 | 0.053931 | 143.0128 | 1999.0 | 1.0 | 1794.331 |
| 75% | 16.85 | 0.094585 | 185.6437 | 2004.0 | 1.0 | 3101.2964 |
| max | 21.35 | 0.328391 | 266.8884 | 2009.0 | 2.0 | 13086.9648 |


- The min item weight is 4.56, the max item weight is 21.35, and the mean item weight is 12.86.
- The min item visibility is 0.00, the max item visibility is 0.33, and the mean item visibility is 0.07.
- The min item MRP is 31.29, the max item MRP is 266.89, and the mean item MRP is 140.99.
- The min outlet establishment year is 1985, the max outlet establishment year is 2009, and the mean outlet establishment year is 1997.83.
- The min item outlet sales is 33.29, the max item outlet sales is 13086.96, and the mean item outlet sales is 2181.29.

In [20]:
# drop unnecessary columns
df = df.drop(columns = ['Item_Identifier'])
df.columns

Index(['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type',
       'Item_MRP', 'Outlet_Identifier', 'Outlet_Establishment_Year',
       'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type',
       'Item_Outlet_Sales'],
      dtype='object')

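- According to the data dictionary, Item_Identifier is a unique product ID. With 1559 distinct values, it labels individual products rather than describing a property of them, so it is unlikely to help with making predictions and is dropped.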

## Remaking, Saving, and Explaining the Models

### Make X_train and X_test as DataFrames with the feature names extracted from the column transformer

#### Train Test Split

In [21]:
## Make x and y variables
y = df['Item_Outlet_Sales'].copy()
X = df.drop(columns=['Item_Outlet_Sales']).copy()
## train-test-split with random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=SEED)

In [22]:
# make categorical selector and verify it works 
cat_sel = make_column_selector(dtype_include='object')
cat_sel(X_train)

['Item_Fat_Content',
 'Item_Type',
 'Outlet_Identifier',
 'Outlet_Location_Type',
 'Outlet_Type']

In [23]:
# make numeric selector and verify it works 
num_sel = make_column_selector(dtype_include='number')
num_sel(X_train)

['Item_Weight',
 'Item_Visibility',
 'Item_MRP',
 'Outlet_Establishment_Year',
 'Outlet_Size']

In [24]:
# make pipelines for categorical vs numeric data
# note: `sparse_output` replaced the older `sparse` argument in scikit-learn 1.2
cat_pipe = make_pipeline(SimpleImputer(strategy = 'constant', fill_value = 'MISSING'),
                         OneHotEncoder(handle_unknown = 'ignore', sparse_output = False))
num_pipe = make_pipeline(SimpleImputer(strategy = 'mean'))

In [25]:
# make the preprocessing column transformer
preprocessor = make_column_transformer((num_pipe, num_sel), (cat_pipe, cat_sel), verbose_feature_names_out = False)                                  
preprocessor

In [26]:
# fit column transformer and run get_feature_names_out
preprocessor.fit(X_train)
feature_names = preprocessor.get_feature_names_out()
feature_names

array(['Item_Weight', 'Item_Visibility', 'Item_MRP',
       'Outlet_Establishment_Year', 'Outlet_Size',
       'Item_Fat_Content_Low Fat', 'Item_Fat_Content_Regular',
       'Item_Type_Baking Goods', 'Item_Type_Breads',
       'Item_Type_Breakfast', 'Item_Type_Canned', 'Item_Type_Dairy',
       'Item_Type_Frozen Foods', 'Item_Type_Fruits and Vegetables',
       'Item_Type_Hard Drinks', 'Item_Type_Health and Hygiene',
       'Item_Type_Household', 'Item_Type_Meat', 'Item_Type_Others',
       'Item_Type_Seafood', 'Item_Type_Snack Foods',
       'Item_Type_Soft Drinks', 'Item_Type_Starchy Foods',
       'Outlet_Identifier_OUT010', 'Outlet_Identifier_OUT013',
       'Outlet_Identifier_OUT017', 'Outlet_Identifier_OUT018',
       'Outlet_Identifier_OUT019', 'Outlet_Identifier_OUT027',
       'Outlet_Identifier_OUT035', 'Outlet_Identifier_OUT045',
       'Outlet_Identifier_OUT046', 'Outlet_Identifier_OUT049',
       'Outlet_Location_Type_Tier 1', 'Outlet_Location_Type_Tier 2',
       'Outlet_

In [27]:
# create a preprocessed DataFrame for the training set.
X_train_df = pd.DataFrame(preprocessor.transform(X_train), columns = feature_names, index = X_train.index)
X_train_df.head()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Outlet_Size,Item_Fat_Content_Low Fat,Item_Fat_Content_Regular,Item_Type_Baking Goods,Item_Type_Breads,Item_Type_Breakfast,Item_Type_Canned,Item_Type_Dairy,Item_Type_Frozen Foods,Item_Type_Fruits and Vegetables,Item_Type_Hard Drinks,Item_Type_Health and Hygiene,Item_Type_Household,Item_Type_Meat,Item_Type_Others,Item_Type_Seafood,Item_Type_Snack Foods,Item_Type_Soft Drinks,Item_Type_Starchy Foods,Outlet_Identifier_OUT010,Outlet_Identifier_OUT013,Outlet_Identifier_OUT017,Outlet_Identifier_OUT018,Outlet_Identifier_OUT019,Outlet_Identifier_OUT027,Outlet_Identifier_OUT035,Outlet_Identifier_OUT045,Outlet_Identifier_OUT046,Outlet_Identifier_OUT049,Outlet_Location_Type_Tier 1,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
8269,7.22,0.064142,61.251,1998.0,0.760582,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
7604,6.135,0.079294,111.286,2009.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2762,12.15,0.028593,151.0708,2004.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
6464,5.945,0.093009,127.8652,2004.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
4707,18.2,0.066285,247.2092,2004.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


In [28]:
# create a preprocessed DataFrame for the test set
X_test_df = pd.DataFrame(preprocessor.transform(X_test), columns = feature_names, index = X_test.index)
X_test_df.head()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Outlet_Size,Item_Fat_Content_Low Fat,Item_Fat_Content_Regular,Item_Type_Baking Goods,Item_Type_Breads,Item_Type_Breakfast,Item_Type_Canned,Item_Type_Dairy,Item_Type_Frozen Foods,Item_Type_Fruits and Vegetables,Item_Type_Hard Drinks,Item_Type_Health and Hygiene,Item_Type_Household,Item_Type_Meat,Item_Type_Others,Item_Type_Seafood,Item_Type_Snack Foods,Item_Type_Soft Drinks,Item_Type_Starchy Foods,Outlet_Identifier_OUT010,Outlet_Identifier_OUT013,Outlet_Identifier_OUT017,Outlet_Identifier_OUT018,Outlet_Identifier_OUT019,Outlet_Identifier_OUT027,Outlet_Identifier_OUT035,Outlet_Identifier_OUT045,Outlet_Identifier_OUT046,Outlet_Identifier_OUT049,Outlet_Location_Type_Tier 1,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
8077,15.25,0.061531,132.2968,2007.0,0.760582,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2391,17.85,0.044463,127.102,1997.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
163,7.27,0.071078,114.2518,1997.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
4608,12.822634,0.075142,145.8444,1985.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
5544,13.5,0.121633,161.692,1998.0,0.760582,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0


## LinearRegression

### Fit and evaluate the LinearRegression model using the dataframe X_train and X_test data

### Extract and visualize the coefficients that the model determined

#### Select the top 3 most impactful features and interpret their coefficients

### Save the figure as a .png file inside the repository
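
A minimal sketch of exporting a bar chart to a .png file follows. The coefficient values, the file name, and the temporary output folder are all assumptions for illustration; in the notebook the path would point inside the repository instead.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical coefficients standing in for the fitted model's values
coeffs = pd.Series({"feat_a": 5.0, "feat_b": -3.0, "feat_c": 0.5}, name="Coefficient")

fig, ax = plt.subplots(figsize=(8, 6))
ordered = coeffs.sort_values()
ax.barh(ordered.index, ordered.values)
ax.set(xlabel="Coefficient", ylabel="Feature Names", title="LinearRegression Coefficients")
fig.tight_layout()

# Written to a temporary folder so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    png_path = os.path.join(tmp_dir, "linreg_coefficients.png")  # hypothetical file name
    fig.savefig(png_path, dpi=300, bbox_inches="tight")
    saved_ok = os.path.exists(png_path)

plt.close(fig)
print(saved_ok)
```

The plot_importance helper defined earlier returns an Axes, so its figure can be saved the same way through ax.get_figure().savefig(...).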

## Tree-Based Model

### Fit and evaluate the tree-based regression model using the dataframe X_train and X_test data

### Extract and visualize the feature importances that the model determined

#### Identify the top 5 most important features

### Save the figure as a .png file inside the repository

## Serialize the Best Models with Joblib

### Save the following key: value pairs as a dictionary in a joblib file named "best-models.joblib"
- "preprocessor": preprocessing column transformer
- "X_train": training features.
- "X_test": test features.
- "y_train": training target.
- "y_test": test target.
- "LinearRegression": best linear regression
- "RandomForestRegressor"/"DecisionTreeRegressor": best tree-based model
#### Save the joblib file inside the repository
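
A minimal sketch of the save-and-reload round trip is shown below. It stores a reduced dictionary with a toy model in a temporary folder (both assumptions, so the block is self-contained); the full dictionary would hold the keys listed above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model and data standing in for the notebook's objects
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2 * X_demo.ravel() + 1
lin_reg = LinearRegression().fit(X_demo, y_demo)

# Reduced version of the export dictionary
export = {"X_train": X_demo, "y_train": y_demo, "LinearRegression": lin_reg}

# A temporary folder keeps the sketch from writing into the repository
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best-models.joblib")
    joblib.dump(export, path)
    loaded = joblib.load(path)

print(loaded.keys())
print(loaded["LinearRegression"].predict([[10.0]]))
```

Reloading right after saving is a quick check that every object in the dictionary can be serialized and that the restored model still predicts.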

Update your README.

Insert your exported figures from above into your README file. You should have the following:
- Your LinearRegression coefficients plot.
- Your interpretation of your coefficients.
- Your tree-based model's feature importances.
- Your interpretation of your feature importances.