# Pizza Price Prediction Project Documentation

## Project Overview

This project develops a machine learning model that predicts the price of a pizza from its features: company, diameter, topping, variant, size, extra sauce, and extra cheese. The dataset is stored in a CSV file named "pizza_v1.csv".

## Data Preprocessing

1. **Importing Libraries**: The necessary libraries for data manipulation, visualization, and machine learning are imported. These include `pandas`, `matplotlib.pyplot`, `sklearn.linear_model`, `sklearn.metrics`, and `numpy`.

2. **Loading and Understanding Data**: The data is loaded from the CSV file using the `pd.read_csv()` function, and basic information about the dataset is printed using `data.info()`.

3. **Data Cleaning**: The dataset is cleaned by removing unwanted characters ('Rp') from the 'price_rupiah' column, and converting it to a numeric format by removing commas and using `pd.to_numeric()`.

4. **One-Hot Encoding**: Categorical columns ('company', 'topping', 'variant', 'size', 'extra_sauce', 'extra_cheese') are selected for one-hot encoding using `pd.get_dummies()`. This process converts categorical variables into binary vectors, making it easier for the machine learning model to process them.

5. **Boolean to Integer Conversion**: Depending on the pandas version, `pd.get_dummies()` returns the indicator columns as booleans (the default since pandas 2.0) or as integers. Any boolean columns are converted to integers (0 and 1) so that every feature passed to the model is numeric.

6. **Data Splitting**: The dataset is split into training and test sets using `train_test_split()` from `sklearn.model_selection`.
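
The cleaning and encoding steps above can be sketched on a small made-up table. The rows below are invented for illustration and are not taken from the project's CSV; only the column names and the `Rp`/comma price format follow the dataset. The split itself is left out to keep the sketch pandas-only; the resulting `X` and `y` are what would be handed to `train_test_split()`.

```python
import pandas as pd

# Hypothetical rows that mimic the dataset's layout (not real data).
raw = pd.DataFrame({
    "company": ["A", "B", "A"],
    "price_rupiah": ["Rp235,000", "Rp120,000", "Rp98,000"],
    "diameter": [22.0, 16.0, 12.0],
    "size": ["jumbo", "reguler", "small"],
    "extra_cheese": ["yes", "no", "yes"],
})

df = raw.copy()

# Data cleaning: strip the 'Rp' prefix and the thousands separators, then parse as numbers.
df["price_rupiah"] = pd.to_numeric(
    df["price_rupiah"]
    .str.replace("Rp", "", regex=False)
    .str.replace(",", "", regex=False)
)

# One-hot encoding: each category value becomes its own indicator column.
categorical_cols = ["company", "size", "extra_cheese"]
encoded = pd.get_dummies(df, columns=categorical_cols)

# Boolean to integer conversion: does nothing if the indicators are already integers.
bool_cols = encoded.select_dtypes(include="bool").columns
encoded = encoded.astype({col: int for col in bool_cols})

# Separate the features from the target before splitting.
X = encoded.drop(columns=["price_rupiah"])
y = encoded["price_rupiah"]

print(X.columns.tolist())
print(y.tolist())
```

Each categorical column expands into one indicator column per distinct value, which is why the real dataset grows from 8 to 49 columns after encoding.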

## Model Building and Evaluation

7. **Model Selection**: The selected machine learning model for this project is Linear Regression, implemented using `sklearn.linear_model.LinearRegression`.

8. **Model Training and Prediction**: The model is trained on the training data (X_train and Y_train) using `model.fit()`, and then used to predict pizza prices on the test data (X_test) using `model.predict()`.

9. **Error Metrics Calculation**: The model is evaluated on the test set with several error metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Root Mean Squared Logarithmic Error (RMSLE), and a mean percentage error (the average absolute error expressed as a percentage of the true price).
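
As a worked illustration of how these metrics relate to each other, the sketch below computes them with NumPy alone. The function name `regression_errors` and the two prices are made up for this example; the notebook itself uses `sklearn.metrics` for MAE and MSE.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return MAE, MSE, RMSE, RMSLE and mean absolute percentage error as a dict."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_pred - y_true

    mae = np.mean(np.abs(residuals))
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)

    # RMSLE needs non-negative inputs, so negative values are clipped to 0 first.
    log_diff = np.log1p(np.maximum(y_pred, 0)) - np.log1p(np.maximum(y_true, 0))
    rmsle = np.sqrt(np.mean(log_diff ** 2))

    # Percentage error is undefined when a true value is 0; prices are assumed positive.
    mape = np.mean(np.abs(residuals) / y_true) * 100

    return {"mae": mae, "mse": mse, "rmse": rmse, "rmsle": rmsle, "mape": mape}

# Made-up prices: the predictions are off by +10,000 and -20,000 rupiah.
errors = regression_errors([100_000, 200_000], [110_000, 180_000])
print(errors)
```

For these two made-up prices the MAE is 15,000 rupiah and the mean percentage error is 10%. MAE and RMSE are in rupiah, MSE is in squared rupiah, and RMSLE and the percentage error are scale-free, which makes the last two easier to compare across price ranges.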

## User Interaction

10. **User Input for Prediction**: A function named `get_user_input()` is created to allow users to input the features of a pizza they want to predict the price for. The function guides the user through selecting options for categorical columns and providing the pizza's diameter in centimeters.

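11. **User Input Preprocessing**: The user's input is placed in a one-row DataFrame and one-hot encoded. Indicator columns that the single row does not produce are added with a value of 0, and the columns are reordered to match the format of the training data.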

12. **Price Prediction for User Input**: The trained model is then used to predict the price of the user-defined pizza using the provided features.
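
The alignment step matters because encoding a single row only creates indicator columns for the categories that appear in it. A minimal sketch of one way to align such a row is shown below. It uses `DataFrame.reindex`, which is an alternative to the column-by-column loop in the notebook, and the column list and input values are hypothetical.

```python
import pandas as pd

# Hypothetical feature columns standing in for those produced when encoding the training data.
training_columns = [
    "diameter",
    "company_A", "company_B",
    "size_jumbo", "size_small",
    "extra_cheese_no", "extra_cheese_yes",
]

# One pizza described by the user (made-up values).
user_row = pd.DataFrame(
    {"company": ["B"], "size": ["small"], "extra_cheese": ["yes"], "diameter": [12.0]}
)

# Encoding a single row only creates columns for the categories present in it.
user_encoded = pd.get_dummies(user_row, columns=["company", "size", "extra_cheese"])

# reindex adds the missing indicator columns as 0, drops any unknown ones,
# and puts everything in the training order.
user_aligned = user_encoded.reindex(columns=training_columns, fill_value=0).astype(float)

print(user_aligned.iloc[0].to_dict())
```

Note that an option the model never saw during training would be dropped silently by this approach, leaving every indicator for that feature at 0, so validating the typed input against the available options first is worth considering.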

## Conclusion

This project demonstrates how to build a machine learning model that predicts pizza prices from a small set of features. A Linear Regression model is trained on 80% of the 129 rows and evaluated on the remaining 20%, where it reaches a mean absolute error of about 14,646 rupiah and a mean percentage error of about 18.7%. The project also provides an interactive console prompt through which users can obtain a price prediction for a customized pizza. The model's accuracy and reliability depend on the quality and size of the dataset, as well as on the choice of features used for prediction.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

In [2]:
data=pd.read_csv("./pizza_v1 (3).csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129 entries, 0 to 128
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   company       129 non-null    object 
 1   price_rupiah  129 non-null    object 
 2   diameter      129 non-null    float64
 3   topping       129 non-null    object 
 4   variant       129 non-null    object 
 5   size          129 non-null    object 
 6   extra_sauce   129 non-null    object 
 7   extra_cheese  129 non-null    object 
dtypes: float64(1), object(7)
memory usage: 8.2+ KB


In [3]:
df = data.copy()  # work on a copy so the raw data stays unchanged

# Remove the 'Rp' prefix from the price_rupiah column
df['price_rupiah'] = df['price_rupiah'].str.replace('Rp', '', regex=False)

# Remove the thousands separators and convert price_rupiah to numeric
df['price_rupiah'] = pd.to_numeric(df['price_rupiah'].str.replace(',', '', regex=False))

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129 entries, 0 to 128
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   company       129 non-null    object 
 1   price_rupiah  129 non-null    int64  
 2   diameter      129 non-null    float64
 3   topping       129 non-null    object 
 4   variant       129 non-null    object 
 5   size          129 non-null    object 
 6   extra_sauce   129 non-null    object 
 7   extra_cheese  129 non-null    object 
dtypes: float64(1), int64(1), object(6)
memory usage: 8.2+ KB


In [4]:
df2 = df.copy()  # independent copy of the cleaned data

# Select categorical columns for one-hot encoding
categorical_cols = ['company', 'topping', 'variant', 'size', 'extra_sauce', 'extra_cheese']

# Apply one-hot encoding
df_encoded = pd.get_dummies(df2, columns=categorical_cols)

# Print the encoded DataFrame
df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129 entries, 0 to 128
Data columns (total 49 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   price_rupiah               129 non-null    int64  
 1   diameter                   129 non-null    float64
 2   company_A                  129 non-null    uint8  
 3   company_B                  129 non-null    uint8  
 4   company_C                  129 non-null    uint8  
 5   company_D                  129 non-null    uint8  
 6   company_E                  129 non-null    uint8  
 7   topping_beef               129 non-null    uint8  
 8   topping_black papper       129 non-null    uint8  
 9   topping_chicken            129 non-null    uint8  
 10  topping_meat               129 non-null    uint8  
 11  topping_mozzarella         129 non-null    uint8  
 12  topping_mushrooms          129 non-null    uint8  
 13  topping_onion              129 non-null    uint8  

In [5]:
df_encoded.head()

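| | price_rupiah | diameter | company_A | company_B | company_C | company_D | company_E | topping_beef | topping_black papper | topping_chicken | ... | size_XL | size_jumbo | size_large | size_medium | size_reguler | size_small | extra_sauce_no | extra_sauce_yes | extra_cheese_no | extra_cheese_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 235000 | 22.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 198000 | 20.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 120000 | 16.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 3 | 155000 | 14.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 4 | 248000 | 18.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |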


In [6]:
df3 = df_encoded.copy()  # independent copy of the encoded data

# Iterate over the columns
for column in df3.columns:
    if df3[column].dtype == bool:  # Check if column contains boolean values
        df3[column] = df3[column].astype(int)  # Convert boolean values to integers (0 and 1)

# Print the modified DataFrame
df3.head()

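| | price_rupiah | diameter | company_A | company_B | company_C | company_D | company_E | topping_beef | topping_black papper | topping_chicken | ... | size_XL | size_jumbo | size_large | size_medium | size_reguler | size_small | extra_sauce_no | extra_sauce_yes | extra_cheese_no | extra_cheese_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 235000 | 22.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 198000 | 20.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 120000 | 16.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 3 | 155000 | 14.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 4 | 248000 | 18.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |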


In [42]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129 entries, 0 to 128
Data columns (total 49 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   price_rupiah               129 non-null    int64  
 1   diameter                   129 non-null    float64
 2   company_A                  129 non-null    int32  
 3   company_B                  129 non-null    int32  
 4   company_C                  129 non-null    int32  
 5   company_D                  129 non-null    int32  
 6   company_E                  129 non-null    int32  
 7   topping_beef               129 non-null    int32  
 8   topping_black papper       129 non-null    int32  
 9   topping_chicken            129 non-null    int32  
 10  topping_meat               129 non-null    int32  
 11  topping_mozzarella         129 non-null    int32  
 12  topping_mushrooms          129 non-null    int32  
 13  topping_onion              129 non-null    int32  

In [7]:
X = df3.drop(columns = ['price_rupiah'])
Y = df3['price_rupiah']

In [8]:
def findError(Y_test_array, Y_pred_array):
    # Calculate MAE
    mae = mean_absolute_error(Y_test_array, Y_pred_array)

    # Calculate MSE
    mse = mean_squared_error(Y_test_array, Y_pred_array)

    # Calculate RMSE
    rmse = np.sqrt(mse)

    # Calculate RMSLE
    masked_Y_test = np.maximum(Y_test_array, 0)
    masked_Y_pred = np.maximum(Y_pred_array, 0)
    rmsle = np.sqrt(np.mean(np.square(np.log1p(masked_Y_test) - np.log1p(masked_Y_pred))))

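    # Calculate the absolute percentage error of each prediction
    # (assumes no true price is 0, which would cause a division by zero)
    percentage_error = (np.abs(Y_pred_array - Y_test_array) / Y_test_array) * 100

    # Calculate the mean of the absolute percentage errors
    mean_percentage_error = np.mean(percentage_error)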

    # Print the results
    print("Mean Percentage Error:", mean_percentage_error)
    print("Mean Absolute Error (MAE):", mae)
    print("Mean Squared Error (MSE):", mse)
    print("Root Mean Squared Error (RMSE):", rmse)
    print("Root Mean Squared Logarithmic Error (RMSLE):", rmsle)

In [9]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Split the data into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, Y_train)

Y_pred = model.predict(X_test)

# Store predicted and original values in separate arrays
Y_pred_array = np.array(Y_pred)
Y_test_array = np.array(Y_test)

findError(Y_test_array, Y_pred_array)

Mean Percentage Error: 18.708802325345246
Mean Absolute Error (MAE): 14646.384395494386
Mean Squared Error (MSE): 381879040.839484
Root Mean Squared Error (RMSE): 19541.725636173585
Root Mean Squared Logarithmic Error (RMSLE): 0.19938915397598184


## Price Prediction Function


In [24]:
# Function to get user input for prediction
def get_user_input():
    user_input = {}
    for column in categorical_cols:
        unique_options = df[column].unique()
        print(f"Available options for '{column}': {', '.join(unique_options)}")
        user_choice = input(f"Please select one option from the above for '{column}': ")
        user_input[column] = user_choice

    user_input['diameter'] = float(input("Diameter (cm): "))

    # Create a DataFrame with the user input to match the format of the encoded DataFrame
    user_input_df = pd.DataFrame(user_input, index=[0])

    # Apply one-hot encoding to the user input DataFrame
    user_input_encoded = pd.get_dummies(user_input_df, columns=categorical_cols)

    # Add every column of the training DataFrame that the single input row did not produce.
    # This also adds a placeholder 'price_rupiah' column, which is dropped before prediction.
    missing_cols = set(df_encoded.columns) - set(user_input_encoded.columns)
    for col in missing_cols:
        user_input_encoded[col] = 0

    # Reorder the columns to match the training DataFrame
    user_input_encoded = user_input_encoded[df_encoded.columns]

    return user_input_encoded

# Get user input
user_input = get_user_input()

# Drop the placeholder target column; a separate name keeps the training features X intact
X_user = user_input.drop(columns=['price_rupiah'])

predicted_price = model.predict(X_user)

print("Predicted Price (in Rupiah):", predicted_price[0])

Available options for 'company': A, B, C, D, E
Please select one option from the above for 'company': A
Available options for 'topping': chicken, papperoni, mushrooms, smoked beef, mozzarella, black papper, tuna, meat, sausage, onion, vegetables, beef
Please select one option from the above for 'topping': mushrooms
Available options for 'variant': double_signature, american_favorite, super_supreme, meat_lovers, double_mix, classic, crunchy, new_york, double_decker, spicy_tuna, BBQ_meat_fiesta, BBQ_sausage, extravaganza, meat_eater, gournet_greek, italian_veggie, thai_veggie, american_classic, neptune_tuna, spicy tuna
Please select one option from the above for 'variant': spicy tuna
Available options for 'size': jumbo, reguler, small, medium, large, XL
Please select one option from the above for 'size': XL
Available options for 'extra_sauce': yes, no
Please select one option from the above for 'extra_sauce': yes
Available options for 'extra_cheese': yes, no
Please select one option from