<a href="https://colab.research.google.com/github/allensheneka/predict-sales/blob/analysis/Predict_Sales_Predictions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Predict Sales - Predictions

Sheneka Allen


##Goal:
Help the retailer by using machine learning to make predictions about future sales based on the data provided.

##Assignment:
1. Identify the features (X) and target (y): 
>Assign the "Item_Outlet_Sales" column as your target and the rest of the relevant variables as your features matrix.  
2. Perform a train test split 
3. Create a preprocessing pipeline to prepare the dataset for machine learning




In [None]:
# import key libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

#import set_config to display a drawing of the ML process/pipeline
from sklearn import set_config
set_config(display='diagram')

In [None]:
#mount drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#upload data

filename = '/content/drive/MyDrive/Data Science/sales_predictions.csv'
df = pd.read_csv(filename)
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


##Quick Data Inspection Summary
Missing data in outlet size (row 3 of the preview); other columns still need to be checked

5 numeric columns:  item weight, item visibility, item MRP, outlet establishment year, item outlet sales

Nominal data:  item identifier, item fat content, item type, outlet type, outlet identifier

Ordinal data:  outlet size, outlet location type

##Inspect Data

In [None]:
# overview of dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [None]:
# count the rows that contain at least one missing value
df.isna().any(axis=1).sum()

3873

In [None]:
# total missing values per column
df.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64
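
The two counts above answer different questions: the first counts rows with at least one gap, the second counts gaps per column. Here is a minimal sketch of the difference on a made-up three-row frame (the `toy` frame and its values are illustrative assumptions, not taken from the sales file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the sales data
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5],
    "Outlet_Size": ["Medium", "Small", np.nan],
    "Item_MRP": [249.8, 48.3, 141.6],
})

rows_with_gaps = toy.isna().any(axis=1).sum()   # rows containing at least one NaN
gaps_per_column = toy.isna().sum()              # NaN count for each column
total_gaps = toy.isna().sum().sum()             # every NaN cell in the frame

print(rows_with_gaps, total_gaps)
print(gaps_per_column)
```

When the row count equals the cell total, as with 3873 = 1463 + 2410 in the sales data, no row is missing more than one value.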

In [None]:
# inspect row and column counts, i.e., shape
df.shape

(8523, 12)

In [None]:
# inspect ordinal categories
df.Outlet_Size.value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

3 ordinal categories for outlet size:  Small(0), Medium(1), High(2)

In [None]:
# inspect ordinal categories
df.Outlet_Location_Type.value_counts()

Tier 3    3350
Tier 2    2785
Tier 1    2388
Name: Outlet_Location_Type, dtype: int64

3 ordinal categories for outlet location type:  Tier 1(0), Tier 2(1), Tier 3(2)

In [None]:
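# Ordinal encoding of 'Outlet_Size' and 'Outlet_Location_Type'
# assign the result back; replace(inplace=True) on df['col'] is chained
# assignment and may not update df in newer pandas versions
replacement_dictionary = {'High':2, 'Tier 3':2, 'Medium':1, 'Tier 2':1, 'Small':0, 'Tier 1':0}
df['Outlet_Size'] = df['Outlet_Size'].replace(replacement_dictionary)
df['Outlet_Location_Type'] = df['Outlet_Location_Type'].replace(replacement_dictionary)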

df['Outlet_Size']

0       1.0
1       1.0
2       1.0
3       NaN
4       2.0
       ... 
8518    2.0
8519    NaN
8520    0.0
8521    1.0
8522    0.0
Name: Outlet_Size, Length: 8523, dtype: float64

In [None]:
df['Outlet_Location_Type']

0       0
1       2
2       0
3       2
4       2
       ..
8518    2
8519    1
8520    1
8521    2
8522    0
Name: Outlet_Location_Type, Length: 8523, dtype: int64
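
An alternative to one shared dictionary is one mapping per column, which keeps each ordering explicit. This is only a sketch on made-up values; the names `size_order` and `tier_order` are assumptions introduced for illustration and are not part of the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical sample values mirroring the two ordinal columns
toy = pd.DataFrame({
    "Outlet_Size": ["Medium", "High", np.nan, "Small"],
    "Outlet_Location_Type": ["Tier 1", "Tier 3", "Tier 3", "Tier 2"],
})

# One mapping per column, so a label can never be applied to the wrong column
size_order = {"Small": 0, "Medium": 1, "High": 2}
tier_order = {"Tier 1": 0, "Tier 2": 1, "Tier 3": 2}

toy["Outlet_Size"] = toy["Outlet_Size"].map(size_order)  # NaN stays NaN
toy["Outlet_Location_Type"] = toy["Outlet_Location_Type"].map(tier_order)

print(toy)
```

Note that `map` turns any label missing from the dictionary into NaN, while `replace` leaves such labels unchanged, so `map` makes unexpected categories easier to spot.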

##Define X (features) and y (target)

In [None]:
# define X (features) and y (target)
X = df.drop(columns = 'Item_Outlet_Sales')
y = df['Item_Outlet_Sales']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
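
Because no test_size is passed, scikit-learn falls back to its default of holding out 25% of the rows. The split sizes can be checked with a stand-in array of the same length (the `rows` array is an assumption used in place of the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(8523)  # stand-in for the 8523 rows of the sales data

# No test_size given, so scikit-learn holds out 25% of the rows
train_rows, test_rows = train_test_split(rows, random_state=42)

print(len(train_rows), len(test_rows))
```

This should give 6392 training rows and 2131 testing rows.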

##Instantiate Transformers (Selectors, Imputers, OneHotEncoder)

In [None]:
# instantiate Selectors:  categorical (object) and numerical (number)
cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')

In [None]:
# instantiate Imputers:  categorical (most frequent) and numerical (mean)
freq_imputer = SimpleImputer(strategy='most_frequent')
mean_imputer = SimpleImputer(strategy='mean')

# instantiate Scaler
scaler = StandardScaler()

# instantiate One-hot encoder
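# handle_unknown='ignore' encodes categories not seen during fit as all zeros instead of raising an error
# sparse=False returns a dense NumPy array (the parameter is named sparse_output in scikit-learn 1.2+)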
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
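
To illustrate what handle_unknown='ignore' does, here is a small sketch with made-up outlet types. The `train` and `test` arrays are assumptions for illustration. The sketch keeps the default sparse output and converts it with `toarray()` so that it runs regardless of whether the installed version uses `sparse` or `sparse_output`:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical outlet types seen during fitting
train = np.array([["Grocery Store"], ["Supermarket Type1"]])
# The second value was never seen during fitting
test = np.array([["Grocery Store"], ["Supermarket Type3"]])

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Default output is a SciPy sparse matrix; toarray() makes it dense for printing
encoded = encoder.transform(test).toarray()
print(encoded)
```

The unseen category is encoded as a row of zeros rather than raising an error.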

##Instantiate Pipelines

In [None]:
# Numeric pipeline
# pass in (imputer, scaler) to combine into a pipeline
numeric_pipe = make_pipeline(mean_imputer, scaler)
numeric_pipe
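
As a rough check of what the numeric pipeline does, the sketch below runs the same two steps on a single made-up column (the `weights` array is an assumption, not data from the notebook):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single numeric column with one missing value
weights = np.array([[1.0], [np.nan], [3.0]])

pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
scaled = pipe.fit_transform(weights)

# The NaN is replaced by the column mean (2.0), which scales to 0
print(scaled.ravel())
```

The imputer runs first, so the scaler never sees a missing value.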

In [None]:
# Categorical pipeline
# pass in (imputer, ohe) to combine into a pipeline
categorical_pipe = make_pipeline(freq_imputer, ohe)
categorical_pipe


##Instantiate ColumnTransformer and Create one Preprocessing Object

In [None]:
# build the tuples for the ColumnTransformer
# each tuple pairs a pipeline with the selector for its columns: (pipeline, selector)
number_tuple = (numeric_pipe, num_selector)
category_tuple = (categorical_pipe, cat_selector)

# instantiate ColumnTransformer
# all preprocessing is in ONE preprocessing object
preprocessor = make_column_transformer(number_tuple, category_tuple)
preprocessor

In [None]:
# fit preprocessor on the training data
preprocessor.fit(X_train)

In [None]:
# transform both the training and testing data (this will output a NumPy array)
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)
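
Fitting on the training data only means the test data is transformed with statistics learned from the training set, which avoids leaking test information into preprocessing. Here is a minimal sketch with a scaler on made-up numbers (both arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0], [20.0], [30.0]])   # hypothetical training column
test = np.array([[20.0], [40.0]])            # hypothetical test column

scaler = StandardScaler().fit(train)         # learns mean=20 from train only
test_scaled = scaler.transform(test)         # reuses the training mean and std

print(scaler.mean_, test_scaled.ravel())
```

The test value 20 maps to 0 because it equals the training mean, even though the test column's own mean is 30.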

In [None]:
# Check for missing values and that data is scaled and one-hot encoded
print(np.isnan(X_train_processed).sum().sum(), 'missing values in training data')
print(np.isnan(X_test_processed).sum().sum(), 'missing values in testing data')
print('\n')
print('All data in X_train_processed are', X_train_processed.dtype)
print('All data in X_test_processed are', X_test_processed.dtype)
print('\n')
print('shape of data is', X_train_processed.shape)
print('\n')

X_train_processed

0 missing values in training data
0 missing values in testing data


All data in X_train_processed are float64
All data in X_test_processed are float64


shape of data is (6392, 1591)




array([[ 0.81724868, -0.71277507,  1.82810922, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.5563395 , -1.29105225,  0.60336888, ...,  0.        ,
         1.        ,  0.        ],
       [-0.13151196,  1.81331864,  0.24454056, ...,  1.        ,
         0.        ,  0.        ],
       ...,
       [ 1.11373638, -0.92052713,  1.52302674, ...,  1.        ,
         0.        ,  0.        ],
       [ 1.76600931, -0.2277552 , -0.38377708, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.81724868, -0.95867683, -0.73836105, ...,  1.        ,
         0.        ,  0.        ]])

The preprocessed data looks good and is ready for use with a machine learning model.

Yea! No missing data.

All train and test data are floats. 

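The training set has 6392 rows, about 75% of the original 8523, because train_test_split holds out 25% of the rows for testing by default. The feature count grew from 11 (12 columns minus the target) to 1591 because one-hot encoding adds one column per category; most of the new columns likely come from the many distinct values in Item_Identifier.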
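
The width of the processed array can be predicted as the number of numeric columns plus one column per distinct category. The sketch below checks this on a made-up frame; the `toy` frame is an assumption, and the columns are listed by name instead of using selectors to keep the example short:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: one numeric column, two categorical columns
toy = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
    "Item_Identifier": ["FDA15", "DRC01", "FDN15", "FDX07"],  # all unique
    "Item_Type": ["Dairy", "Soft Drinks", "Dairy", "Dairy"],
})

pre = make_column_transformer(
    (StandardScaler(), ["Item_MRP"]),
    (OneHotEncoder(handle_unknown="ignore"), ["Item_Identifier", "Item_Type"]),
)
out = pre.fit_transform(toy)

# Expected width: 1 numeric column + one column per distinct category
expected = 1 + toy["Item_Identifier"].nunique() + toy["Item_Type"].nunique()
print(out.shape, expected)
```

An identifier column with a distinct value in nearly every row adds nearly one column per row, which is worth keeping in mind when deciding whether such a column belongs in the feature matrix.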