We are going to find out which ketchup is the greatest! We will load the ketchup dataset and choose a target variable from it to predict.

In [13]:
import pandas as pd
import numpy as np

In [14]:
ketchup_data = pd.read_csv('./Ketchup.csv', index_col=0)

ketchup_data.shape

(4956, 7)

In [15]:
ketchup_data.columns

Index(['Ketchup.hid', 'Ketchup.id', 'Ketchup.choice', 'price.heinz',
       'price.hunts', 'price.delmonte', 'price.stb'],
      dtype='object')

#### Our target variable is `Ketchup.choice`, which we will plug into a few different models

In [16]:
# First we must do some data cleaning, checking if any of the columns contain null values

null_columns = [col for col in ketchup_data.columns if ketchup_data[col].hasnans]

null_columns # the data is completely clean?

[]

In [17]:
## Now we must separate the columns into numerical and categorical columns to plug into our pipeline

In [18]:
X = ketchup_data.copy()
y = X.pop('Ketchup.choice')

In [19]:
numerical_columns = [col for col in ketchup_data.columns if ketchup_data[col].dtype in ['int64', 'float64']]

numerical_columns

['Ketchup.hid',
 'Ketchup.id',
 'price.heinz',
 'price.hunts',
 'price.delmonte',
 'price.stb']

##### I don't think we need the ids here, as they don't contribute any feature value to the prediction. The only features the consumer knows about are the prices of heinz, hunts, delmonte, and stb.

In [20]:
modified_x = X.drop(columns=['Ketchup.hid', 'Ketchup.id'])


modified_numerical_columns = [col for col in modified_x if modified_x[col].dtype in ['int64', 'float64']]

modified_numerical_columns

['price.heinz', 'price.hunts', 'price.delmonte', 'price.stb']

In [21]:
modified_categorical_columns = [col for col in modified_x if modified_x[col].dtype == 'object']

modified_categorical_columns # empty: none of the remaining features are categorical

[]
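The dtype-based split above can also be written with pandas' `select_dtypes`. The following is only a minimal sketch on a made-up toy frame; the `brand_label` column and all values are hypothetical and are not part of the ketchup data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the features (values are made up).
toy = pd.DataFrame({
    "price.heinz": [1.39, 1.46],
    "price.hunts": [1.36, 1.43],
    "brand_label": ["a", "b"],  # hypothetical text column, not in the real data
})

# Columns with a numeric dtype (ints and floats).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric is treated as categorical here.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Selecting by `"number"` avoids listing each integer and float dtype by name.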

In [22]:
## Create the pipeline
from sklearn.pipeline import Pipeline # the pipeline to make the data modeling process easier and cleaner
from sklearn.ensemble import RandomForestRegressor # the model we will be using with the data
from sklearn.model_selection import train_test_split # split the data into sections
from sklearn.metrics import mean_absolute_error # analyze the prediction score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

y_df = pd.DataFrame(y)

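# One-hot encode the target: one 0/1 indicator column per brand.
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer versions need sparse_output=False.
y_encoder = OneHotEncoder(sparse=False)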

encoded_y = pd.DataFrame(y_encoder.fit_transform(y_df))

new_y = encoded_y

pipeline = Pipeline(steps=[('model', RandomForestRegressor())])

train_x, validation_x, train_y, validation_y = train_test_split(modified_x, new_y)

pipeline.fit(train_x, train_y)

# mean_absolute_error expects (y_true, y_pred)
mean_absolute_error(validation_y, pipeline.predict(validation_x))

0.2472264511242455

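#### Since there are no NaN values, we can skip imputation entirely and focus on encoding the one categorical column, Ketchup.choice, which happens to be our target variable. One-hot encoding it and fitting a regressor yields a mean absolute error of about 0.247, averaged over the 0/1 indicator columns. That number needs care: assuming Ketchup.choice has four brands, predicting 0 in every column would already score exactly 0.25, so this result is only slightly better than that trivial baseline rather than close to exact.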
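To see why an MAE near 0.25 is not impressive for a one-hot target, here is a small dependency-free sketch. It assumes four classes and uses made-up labels; `mean_absolute_error_2d` and `one_hot` are hypothetical helpers written for this illustration, averaging over every cell the way a uniform average over output columns does.

```python
def mean_absolute_error_2d(y_true, y_pred):
    """Mean absolute error averaged over every cell of two equal-shaped 2-D lists."""
    total = 0.0
    count = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += abs(t - p)
            count += 1
    return total / count


def one_hot(label, classes):
    """Return a list with 1.0 at the position of `label` and 0.0 elsewhere."""
    return [1.0 if label == c else 0.0 for c in classes]


# Assumed class list and made-up labels, for illustration only.
classes = ["heinz", "hunts", "delmonte", "stb"]
labels = ["heinz", "heinz", "hunts", "stb", "delmonte", "heinz"]

y_true = [one_hot(label, classes) for label in labels]

# A "model" that ignores the inputs and predicts 0 for every indicator.
all_zero = [[0.0] * len(classes) for _ in labels]

baseline_mae = mean_absolute_error_2d(y_true, all_zero)
print(baseline_mae)
```

Each row contains exactly one 1 among four cells, so the all-zero prediction is wrong in one cell out of four regardless of which brands appear.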

In [23]:
## Let's try ordinal encoding instead

from sklearn.preprocessing import OrdinalEncoder

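encoder = OrdinalEncoder()

# Ordinal-encode the target into a single column of integer codes (0, 1, 2, ...).
# ravel() flattens it to 1-D, which is what the regressor expects for a single target.
o_encoded_y = pd.Series(encoder.fit_transform(y_df).ravel())

# The price features are already numeric, so the pipeline itself needs no encoder step.
pipelinev2 = Pipeline(steps=[('model', RandomForestRegressor())])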

o_e_train_x, o_e_validation_x, o_e_train_y, o_e_validation_y = train_test_split(modified_x, o_encoded_y)

pipelinev2.fit(o_e_train_x, o_e_train_y)

print(o_e_validation_x.head())

# mean_absolute_error expects (y_true, y_pred)
mean_absolute_error(o_e_validation_y, pipelinev2.predict(o_e_validation_x))

      price.heinz  price.hunts  price.delmonte  price.stb
1902         1.39         1.36            1.49       0.95
222          1.46         1.43            1.45       0.99
1830         0.79         1.43            1.45       0.99
326          1.39         0.79            1.39       0.95
4684         1.46         1.43            0.99       0.79


0.6382125649968083

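##### The two scores are not directly comparable: the one-hot MAE (about 0.25) is averaged over 0/1 indicator columns, while the ordinal MAE (about 0.64) is measured on integer brand codes whose order and spacing are arbitrary. The lower one-hot number therefore does not by itself show that one-hot encoding is the better method. Next, we can try passing the data into a neural network instead, using Keras.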
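One way to put both encodings on the same footing is to turn each model's regression output back into a predicted class and compare accuracy. The sketch below is a dependency-free illustration with made-up predictions; `argmax` and `accuracy` are hypothetical helpers, and the numbers are not results from the ketchup data.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(true_codes, pred_codes):
    """Fraction of positions where the predicted class equals the true class."""
    return sum(t == p for t, p in zip(true_codes, pred_codes)) / len(true_codes)


# Made-up true class codes (0..3) for four validation rows.
true_codes = [0, 1, 3, 2]

# Made-up outputs of a regressor trained on one-hot targets: one score per class.
one_hot_preds = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.4, 0.1, 0.2, 0.3],
    [0.1, 0.1, 0.6, 0.2],
]

# Made-up outputs of a regressor trained on ordinal codes: one number per row.
ordinal_preds = [0.2, 1.6, 1.6, 2.1]

# One-hot: pick the class with the highest score.
one_hot_codes = [argmax(row) for row in one_hot_preds]

# Ordinal: round to the nearest valid code and clip to the range 0..3.
ordinal_codes = [min(3, max(0, round(p))) for p in ordinal_preds]

one_hot_acc = accuracy(true_codes, one_hot_codes)
ordinal_acc = accuracy(true_codes, ordinal_codes)
print(one_hot_acc, ordinal_acc)
```

Accuracy is defined the same way for both encodings, so the comparison no longer depends on the scale of the encoded target.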

In [24]:
from tensorflow import keras
from tensorflow.keras import layers

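ketchup_nn = keras.Sequential(layers=[
    layers.Dense(units=512, input_shape=[4]),
    layers.Dense(units=512, activation='relu'),
    layers.Dense(units=512, activation='relu'),
    layers.Dense(units=1),
])

# Caution: the model is never compiled or fit, so these predictions come from
# randomly initialized weights and the score below says nothing about the data.
predictions = ketchup_nn.predict(o_e_validation_x)

# mean_absolute_error expects (y_true, y_pred)
mean_absolute_error(o_e_validation_y, predictions)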



1.5301726231647654
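The score above comes from a network that was never trained, which is why it is so poor. As a rough illustration of how much training matters, here is a plain-Python stand-in (not Keras) that fits a single linear unit by gradient descent on made-up data. All names and values are hypothetical; in Keras the analogous steps would be calling `compile` and `fit` before `predict`.

```python
import random


def predict(w, b, xs):
    """Output of a single linear unit for each input."""
    return [w * x + b for x in xs]


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length lists."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up data following y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# Random initial weights, like a freshly constructed network.
rng = random.Random(0)
w = rng.uniform(-1.0, 1.0)
b = rng.uniform(-1.0, 1.0)

untrained_mae = mae(ys, predict(w, b, xs))

# Plain gradient descent on mean squared error.
learning_rate = 0.05
for _ in range(2000):
    preds = predict(w, b, xs)
    grad_w = sum(2 * (p - t) * x for p, t, x in zip(preds, ys, xs)) / len(xs)
    grad_b = sum(2 * (p - t) for p, t in zip(preds, ys)) / len(xs)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

trained_mae = mae(ys, predict(w, b, xs))
print(untrained_mae, trained_mae)
```

The untrained error reflects only the random starting weights, while the trained error shrinks toward zero on this toy problem.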

training_model = keras.Sequential([
    layers.Dense(units=1, input_shape=[3])
])

