<a href="https://colab.research.google.com/github/andreacohen7/food-sales-predictions/blob/main/Food_Sales_Predictions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Food Sales Predictions
- Andrea Cohen
- 12.14.22

## Task:
Help the retailer by using machine learning to predict future sales from the data provided.

## Loading Data

In [28]:
#mount the drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [29]:
#imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import set_config
set_config(display='diagram')

In [30]:
# Import the data
filename = '/content/sales_predictions.csv'
df = pd.read_csv(filename)
df.head()

,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [31]:
#explore the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


  - There are missing data in one float64 column ('Item_Weight', 1,463 missing values) and one object column ('Outlet_Size', 2,410 missing values).  There are no missing data in the int64 column.
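
The missing-value counts can also be read off directly instead of being inferred from `df.info()`. The following is a minimal sketch on a made-up toy frame (the `toy` DataFrame and its values are assumptions for illustration, not the sales data); the same calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the sales data (values are made up).
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "High", "Small"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)

# Share of rows missing in each column, as a percentage.
missing_pct = (toy.isna().mean() * 100).round(1)
print(missing_pct)
```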

In [33]:
dtypes = df.dtypes
str_cols = dtypes[dtypes=='object'].index
for col in str_cols:
  print(f'Column= {col}')
  print(df[col].value_counts())
  print('\n')

Column= Item_Identifier
FDW13    10
FDG33    10
NCY18     9
FDD38     9
DRE49     9
         ..
FDY43     1
FDQ60     1
FDO33     1
DRF48     1
FDC23     1
Name: Item_Identifier, Length: 1559, dtype: int64


Column= Item_Fat_Content
Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64


Column= Item_Type
Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Hard Drinks               214
Others                    169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: Item_Type, dtype: int64


Column= Outlet_Identifier
OUT027    935
OUT013    932
OUT049    930
OUT046    930
OUT035    930
OUT045    929
OUT018  

  - The column 'Outlet_Size' has ordinal categorical data.
  - The column 'Item_Fat_Content' has inconsistencies in the data categories.

In [34]:
df.describe()

,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,7060.0,8523.0,8523.0,8523.0,8523.0
mean,12.857645,0.066132,140.992782,1997.831867,2181.288914
std,4.643456,0.051598,62.275067,8.37176,1706.499616
min,4.555,0.0,31.29,1985.0,33.29
25%,8.77375,0.026989,93.8265,1987.0,834.2474
50%,12.6,0.053931,143.0128,1999.0,1794.331
75%,16.85,0.094585,185.6437,2004.0,3101.2964
max,21.35,0.328391,266.8884,2009.0,13086.9648


  - The numeric data columns have different scales.

### Drop duplicates

In [35]:
display(df.duplicated())
df.duplicated().sum()

0       False
1       False
2       False
3       False
4       False
        ...  
8518    False
8519    False
8520    False
8521    False
8522    False
Length: 8523, dtype: bool

0

  - There are 0 duplicate rows, so nothing needs to be dropped.
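
For reference, here is a small sketch of what the check and the drop would look like if duplicates were present. The `toy` frame is a made-up example, not the sales data.

```python
import pandas as pd

# Hypothetical toy frame in which the third row repeats the first.
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDA15"],
    "Outlet_Identifier": ["OUT049", "OUT018", "OUT049"],
    "Item_MRP": [249.8, 48.3, 249.8],
})

# duplicated() flags every repeat of an earlier row; the sum counts them.
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates, "duplicate rows")

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(deduped), "rows after dropping duplicates")
```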

### Fix inconsistencies in categorical data

In [36]:
#There are inconsistencies in the Item_Fat_Content column
#Low Fat, LF, and low fat are probably the same category
#Regular, reg are probably the same category
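#assign the result back instead of calling replace with inplace=True on a selected column
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace({'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'})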
print('Column = Item_Fat_Content')
display(df['Item_Fat_Content'].value_counts())

Column = Item_Fat_Content


Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64
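
One way to confirm that a cleanup like this left no stray labels is to compare the remaining categories against the expected set. This is a minimal sketch on a made-up Series; the names `label_map`, `expected`, and `unexpected` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample containing the label variants seen in the value counts.
fat = pd.Series(["Low Fat", "LF", "reg", "low fat", "Regular"],
                name="Item_Fat_Content")

# Map each variant spelling to its canonical label.
label_map = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
cleaned = fat.replace(label_map)

# Any label outside the expected set would signal a variant that was missed.
expected = {"Low Fat", "Regular"}
unexpected = set(cleaned.unique()) - expected
print(cleaned.value_counts())
print("Unexpected labels:", unexpected)
```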

### Ordinal encoding

In [37]:
#ordinal encoding
df['Outlet_Size'].value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

In [38]:
#map each size label to its rank; missing values stay NaN
replacement_dictionary = {'High':2, 'Medium':1, 'Small':0}
df['Outlet_Size'] = df['Outlet_Size'].map(replacement_dictionary)
df['Outlet_Size']

0       1.0
1       1.0
2       1.0
3       NaN
4       2.0
       ... 
8518    2.0
8519    NaN
8520    0.0
8521    1.0
8522    0.0
Name: Outlet_Size, Length: 8523, dtype: float64
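
A dictionary-based encoding turns any label that is absent from the dictionary into NaN without raising an error, so it can be worth checking for unmapped labels first. The sketch below assumes a made-up `sizes` Series and a hypothetical `size_order` mapping that mirrors the one used above.

```python
import pandas as pd

# Hypothetical sample with one missing value.
sizes = pd.Series(["Medium", "Small", None, "High"], name="Outlet_Size")

# Rank of each size label, smallest to largest.
size_order = {"Small": 0, "Medium": 1, "High": 2}

# Labels present in the data but absent from the mapping would become NaN.
unmapped = set(sizes.dropna().unique()) - set(size_order)
print("Unmapped labels:", unmapped)

# map() applies the ranks and leaves genuinely missing values as NaN.
encoded = sizes.map(size_order)
print(encoded)
```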

### Identify the features (X) and target (y):   
Assign the "Item_Outlet_Sales" column as your target and the rest of the relevant variables as your features matrix.

In [39]:
#assign X and y
y = df['Item_Outlet_Sales']
X = df.drop(columns = ['Item_Outlet_Sales'])

### Perform a train test split

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
display(len(X_train))
display(len(X_test))
display(len(y_train))
display(len(y_test))

6392

2131

6392

2131
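
The split sizes follow from the default behaviour of `train_test_split`: when neither `test_size` nor `train_size` is given, 25% of the rows go to the test set, rounded up. A minimal sketch with placeholder arrays of the same length as the dataset (the array contents are made up):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 8523  # number of rows reported by df.info()

# Placeholder arrays; only their length matters for this check.
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# The default test share is 0.25, rounded up to a whole number of rows.
expected_test = math.ceil(0.25 * n_rows)
print(len(X_tr), len(X_te), expected_test)
```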

### Create a preprocessing object to prepare the dataset for Machine Learning

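  - Mean imputation will be used for the numeric columns, on the assumption that the column mean is a reasonable estimate of a missing value. Because 'Outlet_Size' was ordinal-encoded to floats above, it is now numeric and is mean-imputed along with 'Item_Weight'.
  - Constant imputation (fill value 'Missing') will be used for the object columns, so that missingness is kept as its own category in case it follows a pattern. None of the remaining object columns has missing values in this dataset, so this step acts as a safeguard.
  - The numeric columns will then be scaled and the categorical (object) columns one-hot encoded.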

#### Make sure your imputation of missing values occurs after the train test split using SimpleImputer.
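
The reason for imputing after the split is that the fill value should be learned from the training rows only, so that no information from the test set leaks into preprocessing. A minimal sketch with made-up weights (the arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column train and test sets with missing entries.
train_weights = np.array([[10.0], [14.0], [np.nan]])
test_weights = np.array([[np.nan], [30.0]])

# Fit on the training data only: the learned mean is (10 + 14) / 2 = 12.
imputer = SimpleImputer(strategy="mean")
imputer.fit(train_weights)

# Both sets are filled with the training mean; the test value 30 has no influence.
train_filled = imputer.transform(train_weights)
test_filled = imputer.transform(test_weights)
print(imputer.statistics_)
print(test_filled)
```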

In [41]:
#instantiate column selectors
#selectors
cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')

In [42]:
#instantiate transformers
#imputers
constant_imputer = SimpleImputer(strategy = 'constant', fill_value = 'Missing')
mean_imputer = SimpleImputer(strategy = 'mean')
#scaler
scaler = StandardScaler()
#One-hot encoder
#note: the 'sparse' argument was renamed 'sparse_output' in scikit-learn 1.2
ohe = OneHotEncoder(handle_unknown = 'ignore', sparse_output = False)

In [43]:
#instantiate pipelines
#numeric pipeline
numeric_pipe = make_pipeline(mean_imputer, scaler)
numeric_pipe

In [44]:
#categorical pipeline
categorical_pipe = make_pipeline(constant_imputer, ohe)
categorical_pipe

In [45]:
#instantiate ColumnTransformer
#tuples for column transformer
number_tuple = (numeric_pipe, num_selector)
category_tuple = (categorical_pipe, cat_selector)
#ColumnTransformer
preprocessor = make_column_transformer(number_tuple, category_tuple, remainder = 'passthrough')
preprocessor

In [46]:
#transform data
#fit on train
preprocessor.fit(X_train)

In [47]:
#transform train and test
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)

In [48]:
#inspect the result
#check for missing values and that data is scaled and one-hot encoded
print(np.isnan(X_train_processed).sum(), 'missing values in training data')
print(np.isnan(X_test_processed).sum(), 'missing values in testing data')
print('\n')
print('All data in X_train_processed are', X_train_processed.dtype)
print('All data in X_test_processed are', X_test_processed.dtype)
print('\n')
print('shape of data is', X_train_processed.shape)
print('\n')
X_train_processed

0 missing values in training data
0 missing values in testing data


All data in X_train_processed are float64
All data in X_test_processed are float64


shape of data is (6392, 1590)




array([[ 0.81724868, -0.71277507,  1.82810922, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.5563395 , -1.29105225,  0.60336888, ...,  0.        ,
         1.        ,  0.        ],
       [-0.13151196,  1.81331864,  0.24454056, ...,  1.        ,
         0.        ,  0.        ],
       ...,
       [ 1.11373638, -0.92052713,  1.52302674, ...,  1.        ,
         0.        ,  0.        ],
       [ 1.76600931, -0.2277552 , -0.38377708, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.81724868, -0.95867683, -0.73836105, ...,  1.        ,
         0.        ,  0.        ]])

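  - Missing values have been imputed; none remain.
  - All data are numeric (float64).
  - The 11 feature columns have been expanded to 1,590 by one-hot encoding. Most of the new columns come from 'Item_Identifier', which has 1,559 distinct values in the full dataset.
  - The numeric columns appear to be scaled.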