# Project: Restaurant Revenue Prediction

## Topic: Import libraries
- **Purpose:** Import the libraries that support the machine learning workflow: pandas and NumPy for data manipulation, scikit-learn for feature selection, scaling, model training, and evaluation, and pickle for saving the trained model.


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
import pickle
import numpy as np

## Topic: Dataset Loading and Initial Exploration
- **Purpose:** Load and examine the Restaurant Revenue dataset to ensure data readiness for analysis.

In [2]:
# Load dataset
dataset = pd.read_csv("cleaned_restaurant_data.csv")
dataset.head()

Unnamed: 0,Name,Location,Cuisine,Rating,Seating Capacity,Average Meal Price,Marketing Budget,Social Media Followers,Chef Experience Years,Number of Reviews,Avg Review Length,Ambience Score,Service Quality Score,Parking Availability,Weekend Reservations,Weekday Reservations,Revenue
0,Restaurant 0,Rural,Japanese,4.0,38,73.98,2224.0,23406.0,13,185,161.924906,1.3,7.0,Yes,13,4,638945.5
1,Restaurant 1,Downtown,Mexican,3.2,76,28.11,4416.0,42741.0,8,533,148.759717,2.6,3.4,Yes,48,6,490207.8
2,Restaurant 2,Rural,Italian,4.7,48,48.29,2796.0,37285.0,18,853,56.849189,5.3,6.7,No,27,14,541368.6
3,Restaurant 3,Rural,Italian,4.4,34,51.55,1167.0,15214.0,13,82,205.433265,4.6,2.8,Yes,9,17,404556.8
4,Restaurant 4,Downtown,Japanese,4.9,88,75.98,3639.0,40171.0,9,78,241.681584,8.6,2.1,No,37,26,1350758.0
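
Beyond `head()`, a readiness check usually covers missing values and column types. The sketch below illustrates this on a small hypothetical frame (`sample`) that stands in for `cleaned_restaurant_data.csv` and carries only a subset of its columns; the variable names are assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV: three rows, a subset of its columns.
sample = pd.DataFrame({
    "Location": ["Rural", "Downtown", "Rural"],
    "Rating": [4.0, 3.2, 4.7],
    "Seating Capacity": [38, 76, 48],
    "Revenue": [638945.5, 490207.8, 541368.6],
})

# Count missing values per column; any non-zero entry needs handling first.
missing_per_column = sample.isna().sum()

# Split column names by dtype to see which ones still need encoding.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print("Categorical:", categorical_cols)
print("Numeric:", numeric_cols)
```

Running the same three checks on `dataset` would show which columns `pd.get_dummies` will expand in the next step.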


## Topic: Data Preprocessing
- **Purpose:** Prepare the dataset by converting categorical variables into numerical ones and separating features (X) and the target variable (y).

In [3]:
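# One-hot encode all text columns; drop_first=True omits one level per column.
# Note: this also expands the identifier column Name into one dummy per restaurant.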
df = pd.get_dummies(dataset, dtype=int, drop_first=True)
df.head()

Unnamed: 0,Rating,Seating Capacity,Average Meal Price,Marketing Budget,Social Media Followers,Chef Experience Years,Number of Reviews,Avg Review Length,Ambience Score,Service Quality Score,...,Name_Restaurant 998,Name_Restaurant 999,Location_Rural,Location_Suburban,Cuisine_French,Cuisine_Indian,Cuisine_Italian,Cuisine_Japanese,Cuisine_Mexican,Parking Availability_Yes
0,4.0,38,73.98,2224.0,23406.0,13,185,161.924906,1.3,7.0,...,0,0,1,0,0,0,0,1,0,1
1,3.2,76,28.11,4416.0,42741.0,8,533,148.759717,2.6,3.4,...,0,0,0,0,0,0,0,0,1,1
2,4.7,48,48.29,2796.0,37285.0,18,853,56.849189,5.3,6.7,...,0,0,1,0,0,0,1,0,0,0
3,4.4,34,51.55,1167.0,15214.0,13,82,205.433265,4.6,2.8,...,0,0,1,0,0,0,1,0,0,1
4,4.9,88,75.98,3639.0,40171.0,9,78,241.681584,8.6,2.1,...,0,0,0,0,0,0,0,1,0,0
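
The `Name_Restaurant ...` columns in the output come from encoding the identifier column along with the real categories. A possible alternative, sketched here on a hypothetical three-row frame (`toy`), is to drop `Name` before encoding. This is an illustration of the idea, not a change the original notebook makes.

```python
import pandas as pd

# Hypothetical miniature of the dataset, for illustration only.
toy = pd.DataFrame({
    "Name": ["Restaurant 0", "Restaurant 1", "Restaurant 2"],
    "Location": ["Rural", "Downtown", "Suburban"],
    "Parking Availability": ["Yes", "Yes", "No"],
    "Revenue": [638945.5, 490207.8, 541368.6],
})

# Encoding every text column turns each unique Name into its own column.
encoded_all = pd.get_dummies(toy, dtype=int, drop_first=True)

# Dropping the identifier first keeps only the meaningful categories.
encoded_no_name = pd.get_dummies(toy.drop(columns="Name"), dtype=int, drop_first=True)

print(encoded_all.columns.tolist())
print(encoded_no_name.columns.tolist())
```

In this notebook the extra columns are filtered out later by feature selection, so the final feature set is unaffected, but dropping the identifier early keeps the intermediate frame much narrower.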


- **Purpose:** Separate the independent variables (features) from the dependent variable (target) used for prediction.

In [4]:
# Separate independent and dependent variables
X = df.drop('Revenue', axis=1)
y = df['Revenue']

In [5]:
X

Unnamed: 0,Rating,Seating Capacity,Average Meal Price,Marketing Budget,Social Media Followers,Chef Experience Years,Number of Reviews,Avg Review Length,Ambience Score,Service Quality Score,...,Name_Restaurant 998,Name_Restaurant 999,Location_Rural,Location_Suburban,Cuisine_French,Cuisine_Indian,Cuisine_Italian,Cuisine_Japanese,Cuisine_Mexican,Parking Availability_Yes
0,4.0,38,73.98,2224.0,23406.0,13,185,161.924906,1.3,7.0,...,0,0,1,0,0,0,0,1,0,1
1,3.2,76,28.11,4416.0,42741.0,8,533,148.759717,2.6,3.4,...,0,0,0,0,0,0,0,0,1,1
2,4.7,48,48.29,2796.0,37285.0,18,853,56.849189,5.3,6.7,...,0,0,1,0,0,0,1,0,0,0
3,4.4,34,51.55,1167.0,15214.0,13,82,205.433265,4.6,2.8,...,0,0,1,0,0,0,1,0,0,1
4,4.9,88,75.98,3639.0,40171.0,9,78,241.681584,8.6,2.1,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8363,3.4,54,34.85,1102.0,11298.0,11,380,253.919515,9.5,5.0,...,0,0,0,1,0,1,0,0,0,1
8364,3.7,49,36.88,1988.0,20432.0,9,713,175.590195,2.7,2.6,...,0,0,1,0,0,1,0,0,0,0
8365,4.7,88,46.87,5949.0,63945.0,6,436,222.953647,4.8,1.7,...,0,0,0,0,0,0,1,0,0,1
8366,3.1,31,44.53,707.0,7170.0,1,729,178.482851,6.1,2.1,...,0,0,1,0,0,0,0,0,0,0


In [6]:
y

0       6.389455e+05
1       4.902078e+05
2       5.413686e+05
3       4.045568e+05
4       1.350758e+06
            ...     
8363    4.346535e+05
8364    4.149779e+05
8365    9.303959e+05
8366    3.114935e+05
8367    5.341430e+05
Name: Revenue, Length: 8368, dtype: float64

## Topic: Feature Selection
- **Purpose:** Select the top 5 features most relevant to the target variable to improve model performance.

In [7]:
# Select top 5 features using f_regression
kbest = SelectKBest(score_func=f_regression, k=5)
X_kbest = kbest.fit_transform(X, y)
selected_features = X.columns[kbest.get_support()]
X_kbest = pd.DataFrame(X_kbest, columns=selected_features)

In [8]:
X_kbest

Unnamed: 0,Seating Capacity,Average Meal Price,Location_Rural,Cuisine_Japanese,Cuisine_Mexican
0,38.0,73.98,1.0,1.0,0.0
1,76.0,28.11,0.0,0.0,1.0
2,48.0,48.29,1.0,0.0,0.0
3,34.0,51.55,1.0,0.0,0.0
4,88.0,75.98,0.0,1.0,0.0
...,...,...,...,...,...
8363,54.0,34.85,0.0,0.0,0.0
8364,49.0,36.88,1.0,0.0,0.0
8365,88.0,46.87,0.0,0.0,0.0
8366,31.0,44.53,1.0,0.0,0.0


In [9]:
selected_features

Index(['Seating Capacity', 'Average Meal Price', 'Location_Rural',
       'Cuisine_Japanese', 'Cuisine_Mexican'],
      dtype='object')
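
To see why particular columns were chosen, the fitted selector's `scores_` attribute can be inspected. `f_regression` scores each feature by its univariate linear relationship with the target, so a higher F-statistic means a stronger linear association. The sketch below uses synthetic data (`X_demo`, `y_demo`, both assumptions) in which only one column drives the target.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (an assumption for illustration): only `signal` drives the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})
y_demo = 3.0 * X_demo["signal"] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(score_func=f_regression, k=1).fit(X_demo, y_demo)

# scores_ holds one F-statistic per input column, in column order.
scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
selected = X_demo.columns[selector.get_support()].tolist()

print(scores)
print(selected)
```

Applied to the notebook's `kbest` object, `pd.Series(kbest.scores_, index=X.columns)` would give the same kind of ranking for the restaurant features.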

## Topic: Data Splitting and Scaling
- **Purpose:** Split the data into training and testing sets and standardize feature values for better model performance.

In [10]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_kbest, y, test_size=0.25, random_state=0)

In [11]:
X_train

Unnamed: 0,Seating Capacity,Average Meal Price,Location_Rural,Cuisine_Japanese,Cuisine_Mexican
4488,59.0,33.09,0.0,0.0,1.0
4174,51.0,38.30,0.0,0.0,0.0
1108,88.0,51.59,0.0,0.0,0.0
1232,73.0,35.54,0.0,0.0,0.0
4539,80.0,33.03,0.0,0.0,0.0
...,...,...,...,...,...
4373,48.0,30.62,1.0,0.0,1.0
7891,71.0,38.29,0.0,0.0,0.0
4859,69.0,65.44,0.0,0.0,0.0
3264,73.0,56.75,0.0,0.0,0.0


In [12]:
X_test

Unnamed: 0,Seating Capacity,Average Meal Price,Location_Rural,Cuisine_Japanese,Cuisine_Mexican
5845,36.0,61.88,1.0,0.0,0.0
4703,42.0,47.71,1.0,0.0,0.0
852,56.0,25.55,0.0,0.0,1.0
4209,73.0,26.05,0.0,0.0,1.0
140,43.0,65.19,1.0,0.0,0.0
...,...,...,...,...,...
1790,35.0,34.53,1.0,0.0,0.0
6338,47.0,32.84,1.0,0.0,1.0
6751,55.0,30.74,0.0,0.0,0.0
8098,45.0,43.01,1.0,0.0,0.0


In [13]:
y_train

4488     445208.31
4174     462327.40
1108    1029423.31
1232     606621.48
4539     611627.83
           ...    
4373     337069.22
7891     625811.12
4859    1026300.16
3264     934975.75
2732     696806.90
Name: Revenue, Length: 6276, dtype: float64

In [14]:
y_test

5845    517141.51
4703    469501.35
852     342779.65
4209    435994.85
140     642624.58
          ...    
1790    286448.87
6338    366418.56
6751    405363.09
8098    451223.06
235     662599.11
Name: Revenue, Length: 2092, dtype: float64

In [15]:
# Feature scaling
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=selected_features)
X_test = pd.DataFrame(scaler.transform(X_test), columns=selected_features)


In [16]:
X_train

Unnamed: 0,Seating Capacity,Average Meal Price,Location_Rural,Cuisine_Japanese,Cuisine_Mexican
0,-0.069059,-1.028239,-0.701790,-0.440269,2.237351
1,-0.528305,-0.666273,-0.701790,-0.440269,-0.446957
2,1.595710,0.257054,-0.701790,-0.440269,-0.446957
3,0.734622,-0.858025,-0.701790,-0.440269,-0.446957
4,1.136463,-1.032408,-0.701790,-0.440269,-0.446957
...,...,...,...,...,...
6271,-0.700523,-1.199843,1.424928,-0.440269,2.237351
6272,0.619811,-0.666968,-0.701790,-0.440269,-0.446957
6273,0.504999,1.219286,-0.701790,-0.440269,-0.446957
6274,0.734622,0.615546,-0.701790,-0.440269,-0.446957


In [17]:
X_test

Unnamed: 0,Seating Capacity,Average Meal Price,Location_Rural,Cuisine_Japanese,Cuisine_Mexican
0,-1.389393,0.971954,1.424928,-0.440269,-0.446957
1,-1.044958,-0.012510,1.424928,-0.440269,-0.446957
2,-0.241276,-1.552083,-0.701790,-0.440269,2.237351
3,0.734622,-1.517345,-0.701790,-0.440269,2.237351
4,-0.987552,1.201918,1.424928,-0.440269,-0.446957
...,...,...,...,...,...
2087,-1.446798,-0.928195,1.424928,-0.440269,-0.446957
2088,-0.757929,-1.045608,1.424928,-0.440269,2.237351
2089,-0.298682,-1.191506,-0.701790,-0.440269,-0.446957
2090,-0.872740,-0.339044,1.424928,-0.440269,-0.446957
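
In the cells above, the scaler is fitted on the training rows only, but `SelectKBest` was fitted on the full dataset before the split, so the test rows influenced which features were kept. One way to keep every fitted step on the training side is a scikit-learn `Pipeline`. The sketch below shows the pattern on synthetic data; the data, step names, and `k=2` are assumptions chosen for the demo.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (assumption): 8 features, 2 of them informative.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 8))
y_demo = 5.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Each step is fitted on the training rows only when pipe.fit is called.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=2)),
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
pipe.fit(X_tr, y_tr)

# score() applies the fitted selector and scaler to the test rows, then returns R^2.
test_r2 = pipe.score(X_te, y_te)
kept = pipe.named_steps["select"].get_support().tolist()
print(round(test_r2, 3), kept)
```

A pipeline also removes the need to rebuild DataFrames after each transform, since the steps pass arrays to one another internally.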


## Topic: Model Training and Evaluation
- **Purpose:** Train five regression models (linear regression, linear and RBF support vector regression, decision tree, random forest) and compare their R2 scores on the test set.

In [18]:
# Initialize lists to store R2 scores
acclin = []
accsvml = []
accsvmnl = []
accdes = []
accrf = []

# Linear Regression model
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
acclin.append(r2_score(y_test, y_pred))

# Support Vector Machine (linear) model
svm_l = SVR(kernel='linear')
svm_l.fit(X_train, y_train)
y_pred = svm_l.predict(X_test)
accsvml.append(r2_score(y_test, y_pred))

# Support Vector Machine (non-linear) model
svm_nl = SVR(kernel='rbf')
svm_nl.fit(X_train, y_train)
y_pred = svm_nl.predict(X_test)
accsvmnl.append(r2_score(y_test, y_pred))

# Decision Tree model
d_tree = DecisionTreeRegressor(random_state=0)
d_tree.fit(X_train, y_train)
y_pred = d_tree.predict(X_test)
accdes.append(r2_score(y_test, y_pred))

# Random Forest model
rf = RandomForestRegressor(n_estimators=10, random_state=0)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
accrf.append(r2_score(y_test, y_pred))


In [19]:
# Combine all the results into a DataFrame for easy comparison
result = pd.DataFrame(index=['R2 Score'], columns=['Linear', 'SVMl', 'SVMnl', 'Decision', 'Random'])
result['Linear'] = acclin
result['SVMl'] = accsvml
result['SVMnl'] = accsvmnl
result['Decision'] = accdes
result['Random'] = accrf

# Print the results
print(result)

            Linear      SVMl     SVMnl  Decision    Random
R2 Score  0.957236  0.020826 -0.035836  0.997943  0.998821
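
The near-zero SVR scores may be a consequence of the target's scale rather than of the model family: `SVR` defaults to `C=1.0` and `epsilon=0.1`, which are small relative to revenue values in the hundreds of thousands. A common remedy is to standardize the target during fitting. The sketch below demonstrates the effect on synthetic data with a revenue-like scale; the data and variable names are assumptions, and the result is not a claim about what the restaurant dataset would score.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data (assumption): standardized features, target on a revenue-like scale.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = 500_000 + 100_000 * X_demo[:, 0] + 50_000 * X_demo[:, 1]

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Plain SVR with default C and epsilon, fitted on the raw target.
raw_svr = SVR(kernel="linear").fit(X_tr, y_tr)
raw_r2 = r2_score(y_te, raw_svr.predict(X_te))

# Same SVR, but the target is standardized for fitting and mapped back for prediction.
scaled_svr = TransformedTargetRegressor(
    regressor=SVR(kernel="linear"),
    transformer=StandardScaler(),
).fit(X_tr, y_tr)
scaled_r2 = r2_score(y_te, scaled_svr.predict(X_te))

print(f"raw target R2:    {raw_r2:.3f}")
print(f"scaled target R2: {scaled_r2:.3f}")
```

Tuning `C` and `epsilon` directly is another option; wrapping the regressor has the advantage that predictions come back in the original revenue units.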


## Topic: Saving and Loading Models
- **Purpose:** Persist the trained model and scaler for future predictions without retraining.

In [20]:
# Save Random Forest model and scaler
with open('restaurant_revenue_rf_model.pkl', 'wb') as model_file:
    pickle.dump(rf, model_file)

with open('scaler.pkl', 'wb') as scaler_file:
    pickle.dump(scaler, scaler_file)

In [22]:
# Save the same model again with a `.sav` extension (the file is still a pickle)
with open('restaurant_revenue_rf_model.sav', 'wb') as file:
    pickle.dump(rf, file)

# Loading a model from `.sav`
with open('restaurant_revenue_rf_model.sav', 'rb') as file:
    loaded_model = pickle.load(file)


In [23]:
# Load the saved model and scaler
with open('restaurant_revenue_rf_model.pkl', 'rb') as model_file:
    loaded_rf_model = pickle.load(model_file)

with open('scaler.pkl', 'rb') as scaler_file:
    loaded_scaler = pickle.load(scaler_file)
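
Saving the scaler and the model as separate files works, but the two must always be loaded and applied together. A possible alternative is to pickle a single `Pipeline` that contains both. The sketch below shows a save and load round trip on synthetic data; the file name `revenue_pipeline.pkl` and the data are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data (assumption), standing in for the five selected features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([3.0, 2.0, 0.0, 1.0, -1.0])

# Scaler and model travel together, so they cannot get out of sync on disk.
bundle = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestRegressor(n_estimators=10, random_state=0)),
])
bundle.fit(X_demo, y_demo)

# Hypothetical file name, written to a temporary directory for this demo.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "revenue_pipeline.pkl")
    with open(path, "wb") as f:
        pickle.dump(bundle, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline should reproduce the original predictions.
same = bool(np.allclose(bundle.predict(X_demo), restored.predict(X_demo)))
print(same)
```

With this layout, prediction on new data is a single `restored.predict(raw_features)` call, with scaling handled inside the pipeline.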

## Topic: Prediction on New Input
- **Purpose:** Use the trained model to predict Revenue for new input data.

In [24]:
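# Values must follow the order of selected_features; named columns avoid
# scikit-learn's feature-name warnings, since scaler and model were fitted on DataFrames.
new_input = pd.DataFrame([[38, 73.98, 1, 1, 0]], columns=selected_features)

# Scale the new input with the scaler fitted on the training data
scaled_input = pd.DataFrame(loaded_scaler.transform(new_input), columns=selected_features)

# Make the prediction using the loaded model
prediction = loaded_rf_model.predict(scaled_input)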

# Print the prediction
print("Prediction for the new input:", prediction)

Prediction for the new input: [640552.801]
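
Because the model expects the five features in a fixed order, a positional array is easy to get wrong. One option is a small helper that accepts values by name and arranges them itself. The sketch below is a hypothetical wrapper (`predict_revenue`, `FEATURES`) exercised with a synthetic scaler and model; none of these names or numbers come from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

FEATURES = ["Seating Capacity", "Average Meal Price", "Location_Rural",
            "Cuisine_Japanese", "Cuisine_Mexican"]

def predict_revenue(model, scaler, values):
    """Predict revenue from a dict keyed by feature name (hypothetical helper)."""
    missing = [name for name in FEATURES if name not in values]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # Build the row in the training column order, regardless of dict order.
    row = pd.DataFrame([[values[name] for name in FEATURES]], columns=FEATURES)
    scaled = pd.DataFrame(scaler.transform(row), columns=FEATURES)
    return float(model.predict(scaled)[0])

# Synthetic stand-ins for the fitted scaler and model (assumption, not the real data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.uniform(0, 100, size=(50, 5)), columns=FEATURES)
y_demo = 1000.0 * X_demo["Seating Capacity"] + 500.0 * X_demo["Average Meal Price"]
demo_scaler = StandardScaler().fit(X_demo)
X_scaled = pd.DataFrame(demo_scaler.transform(X_demo), columns=FEATURES)
demo_model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_scaled, y_demo)

estimate = predict_revenue(demo_model, demo_scaler, {
    "Average Meal Price": 73.98, "Seating Capacity": 38,
    "Location_Rural": 1, "Cuisine_Japanese": 1, "Cuisine_Mexican": 0,
})
print(estimate)
```

With the notebook's objects, the equivalent call would pass `loaded_rf_model` and `loaded_scaler` in place of the synthetic stand-ins.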


