## Problem Statement

#### Goal:

- The goal of this problem is to predict the Annual Turnover of a restaurant based on the variables provided in the data set. 

#### Metric to measure

- The measure of accuracy will be `RMSE` (Root mean square error)

- The predicted Annual Turnover for each restaurant in the Test dataset will be compared with the actual Annual Turnover to calculate the RMSE value of the entire prediction. The lower the RMSE value, the better the model will be.

## Data Dictionary

- `Registration Number` This is a restaurant ID
- `Annual Turnover` Annual revenue of the restaurant in INR <font color="red">TARGET</font>
- `Cuisine` Type of cuisine served in the restaurant
- `City` City at which the restaurant is located
- `Restaurant Location` This variable determines whether the restaurant is located near a business hub or a party hub
- `Opening Day of Restaurant` Date of the opening of the restaurant
- `Facebook Popularity Quotient` "Out of 100, this is the popularity of the restaurant on Facebook"
- `Endoresed By` This variable tells us what kind of celebrity endorses the restaurant
- `Instagram Popularity Quotient` "Out of 100, this is the popularity of the restaurant on Instragram"
- `Fire Audit` "This variable tells us whether the fire audit is succesfull in the restaurant. Here 1 means appropriate fire safety is present, 0 means appropriate fire safety is not present"
- `Liquor License Obtained` "This variable tells us whether the restaurant has liquour license or not. 1 means liquor license is present, 0 means otherwise"
- `Situated in a Multi Complex` "This variable tells us whether the restaurant is situated in a multi complex. 1 means the restaurant is present in the multi complex, 0 means otherwise"
- `Dedicated Parking` "This variable tells us whether the restaurant has a dedicated parking space. 1 means dedicated parking space is present, 0 means otherwise"
- `Open Sitting Available` "This variable tells us whether the restaurant has opening sitting. 1 means opening sitting is present, 0 means otherwise"
- `Resturant Tier` This variable tells us what tier the restaurant belongs to.
- `Restaurant Type` This variable tells us the type of restaurant.
- `Restaurant Theme` This variable tells us the theme of the restaurant by which it is designed.
- `Restaurant Zomato Rating` "This variable tells us the Zomato rating of the restaurant on a scale of 1 to 5, 5 being the highest."
- `Restaurant City Tier` This variable tells us the tier that restaurant belongs to
- `Order Wait Time` "This variable rates the waiting time of the restaurant on a scale of 1 to 10, 10 being the highest"
- `Staff Responsivness` "This variable rates the staff responsive of the restaurant on a scale of 1 to 8, 8 being the highest"
- `Value for Money` "This variable rates the staff responsive of the restaurant on a scale of 1 to 7, 7 being the highest"
- `Hygiene Rating` "This is the Hygiene Rating of restuarent on a scale of 1 to 10,10 being the highest"
- `Food Rating` "This is the food Rating of restuarent on a scale of 1 to 10, 10 being the highest"
- `Overall Restaurant Rating` "This is the overall restaurant rating on a scale of 1 to 10, 10 being the highest"
- `Live Music Rating` "This variable gives an indication about the satisfacation from the Live musicon a scale of 1 to 10, 10 being the highest, NA means restuarant do not offer Live music"
- `Comedy Gigs Rating` "This variable gives an indication about the satisfacation from the Comedy Show on a scale of 1 to 6, 6 being the highest NA means restuarant do not offer any comedy gigs"
- `Value Deals Rating` "This variable gives an indication about the satisfacation from the Value Deals on a scale of 1 to 7, 7 being the highest NA means restuarant do not offer any value deals"
- `Live Sports Rating` "This variable gives an indication about the satisfacation from the Live screening of Sports on a scale of 1 to 6, 6 being the highestNA means restuarant do not have live screening"
- `Ambience` "This variable gives us an indication about the ambience feel level rating of the restuarant on a scale of 0 to 10, 10 being the highest"
- `Lively` "This variable rates the lively atmosphere of the restaurant on a scale of 1 to 10, 10 being the highest"
- `Service` This variable gives us an indication about the service satisfaction level rating of the restuarant. Here Rating of 10 means highly Satisfied from the service and 0 means otherwise
- `Comfortablility` "This variable gives us an indication about the comfort level rating of the restuarent on a scale of 0 to 10, 10 being the highest"
- `Privacy` "This variable gives us an indication about the privacy level of the restuarant on a scale of 0 to 10, 10 being the highest"


In [2]:
# this will help in making the Python code more structured automatically (good coding practice)
# %load_ext nb_black

%pip install xgboost

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# to split the data into train and test
from sklearn.model_selection import train_test_split

# to build linear regression_model
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import LinearSVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import VotingRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import HuberRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import RANSACRegressor
from catboost import CatBoostRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.gaussian_process.kernels import DotProduct
from sklearn.ensemble import StackingRegressor

# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# to suppress warnings
import warnings

warnings.filterwarnings("ignore")




In [31]:
train_data_df = pd.read_csv("Train_dataset_(1).csv")
test_data_df = pd.read_csv("Test_dataset_(1).csv")

In [32]:
# Identify object dtype columns
object_columns = train_data_df.select_dtypes(include=['object']).columns

# Convert object dtype columns to category
train_data_df[object_columns] = train_data_df[object_columns].astype('category')
train_data_df['Restaurant City Tier'] = train_data_df['Restaurant City Tier'].astype('category')

In [33]:
# Identify object dtype columns
object_columns = test_data_df.select_dtypes(include=['object']).columns

# Convert object dtype columns to category
test_data_df[object_columns] = test_data_df[object_columns].astype('category')
test_data_df['Restaurant City Tier'] = test_data_df['Restaurant City Tier'].astype('category')

In [34]:
train_data_df['Facebook Popularity Quotient'] = \
train_data_df['Facebook Popularity Quotient'].fillna(train_data_df['Facebook Popularity Quotient'].mean())

train_data_df['Comedy Gigs Rating'] = \
train_data_df['Comedy Gigs Rating'].fillna(train_data_df['Comedy Gigs Rating'].mean())

train_data_df['Value Deals Rating'] = \
train_data_df['Value Deals Rating'].fillna(train_data_df['Value Deals Rating'].mean())

train_data_df['Live Sports Rating'] = \
train_data_df['Live Sports Rating'].fillna(train_data_df['Live Sports Rating'].mean())

train_data_df['Instagram Popularity Quotient'] = \
train_data_df['Instagram Popularity Quotient'].fillna(train_data_df['Instagram Popularity Quotient'].mean())

train_data_df['Resturant Tier'] = \
train_data_df['Resturant Tier'].fillna(train_data_df['Resturant Tier'].mean())

train_data_df['Overall Restaurant Rating'] = \
train_data_df['Overall Restaurant Rating'].fillna(train_data_df['Overall Restaurant Rating'].mean())

train_data_df['Live Music Rating'] = \
train_data_df['Live Music Rating'].fillna(train_data_df['Live Music Rating'].mean())

train_data_df['Ambience'] = \
train_data_df['Ambience'].fillna(train_data_df['Ambience'].mean())

In [35]:
test_data_df['Facebook Popularity Quotient'] = \
test_data_df['Facebook Popularity Quotient'].fillna(test_data_df['Facebook Popularity Quotient'].mean())

test_data_df['Comedy Gigs Rating'] = \
test_data_df['Comedy Gigs Rating'].fillna(test_data_df['Comedy Gigs Rating'].mean())

test_data_df['Value Deals Rating'] = \
train_data_df['Value Deals Rating'].fillna(test_data_df['Value Deals Rating'].mean())

test_data_df['Live Sports Rating'] = \
test_data_df['Live Sports Rating'].fillna(test_data_df['Live Sports Rating'].mean())

test_data_df['Instagram Popularity Quotient'] = \
test_data_df['Instagram Popularity Quotient'].fillna(test_data_df['Instagram Popularity Quotient'].mean())

test_data_df['Resturant Tier'] = \
test_data_df['Resturant Tier'].fillna(test_data_df['Resturant Tier'].mean())

test_data_df['Overall Restaurant Rating'] = \
test_data_df['Overall Restaurant Rating'].fillna(test_data_df['Overall Restaurant Rating'].mean())

test_data_df['Live Music Rating'] = \
test_data_df['Live Music Rating'].fillna(test_data_df['Live Music Rating'].mean())

test_data_df['Ambience'] = \
test_data_df['Ambience'].fillna(test_data_df['Ambience'].mean())

In [36]:
combined_data = pd.concat([train_data_df, test_data_df], axis=0)

In [22]:
numeric_columns = combined_data.select_dtypes(include=['number']).columns

# Specify the degree of the polynomial
degree = 2

# Manually create polynomial features
poly_features = np.hstack([combined_data[numeric_columns]**d for d in range(1, degree + 1)])

# Create new column names for the polynomial features
poly_feature_names = [f'poly_{col}^{d}' for col in numeric_columns for d in range(1, degree + 1)]

# Create a DataFrame with the new polynomial features
df_poly = pd.DataFrame(data=poly_features, columns=poly_feature_names, index=combined_data.index)

# Concatenate the original DataFrame with the new polynomial features
combined_data = pd.concat([combined_data, df_poly], axis=1)

In [37]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
columns_to_scalar = ['Facebook Popularity Quotient', 
                     'Instagram Popularity Quotient',
                     'Restaurant Zomato Rating',
                     'Staff Responsivness',
                     'Value for Money',
                     'Overall Restaurant Rating',
                     'Ambience',
                     'Lively',
                     'Comfortablility',
                     'Privacy']
combined_data[columns_to_scalar] = scaler.fit_transform(combined_data[columns_to_scalar])

In [38]:
combined_data_dummies = pd.get_dummies(
    combined_data,
    columns=combined_data.select_dtypes(include=["object","category"]).columns.tolist(),
    drop_first=True,
)

In [39]:
train_data_dummies_df = combined_data_dummies.iloc[:len(train_data_df)]
test_data_dummies_df = combined_data_dummies.iloc[len(train_data_df):]

In [40]:
train_data_dummies_df = train_data_dummies_df.drop('Registration Number', axis=1).copy()

original_index_column = test_data_dummies_df['Registration Number']
test_data_dummies_df = test_data_dummies_df.drop('Registration Number', axis=1).copy()

In [41]:
# defining the dependent and independent variables
X_train = train_data_dummies_df.drop(["Annual Turnover"], axis=1).copy()
y_train = train_data_dummies_df["Annual Turnover"]

## Ensemble Model

In [50]:
ridge_model = Ridge(alpha=0.5, solver='sag')
lasso_model = Lasso(alpha=2.0, max_iter=1000)
elastic_net_model = ElasticNet(alpha=10.0, l1_ratio=0.7)
random_forest_model = RandomForestRegressor(max_depth=None, 
                                            max_features='sqrt', 
                                            min_samples_leaf=1, 
                                            min_samples_split=5, 
                                            n_estimators=200,
                                            random_state=1)
gradient_boosting_model = GradientBoostingRegressor(n_estimators=50, random_state=1)
knn_model = KNeighborsRegressor(n_neighbors=5)
neural_network_model = MLPRegressor(activation='relu', 
                                    alpha=0.1, 
                                    hidden_layer_sizes=(100, 50), 
                                    learning_rate='invscaling', 
                                    max_iter=1000, 
                                    random_state=1)
linear_svr_model = LinearSVR(C=1000.0)
extra_trees_model = ExtraTreesRegressor(max_depth= None, 
                                        max_features= 'sqrt',
                                        min_samples_leaf= 1,
                                        min_samples_split= 2,
                                        n_estimators= 200,
                                        random_state=1)
huber_model = HuberRegressor(alpha= 0.001, epsilon=1.75)
xgboost_model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=1)
catboost_model = CatBoostRegressor(iterations=100, learning_rate=0.2, depth=8, random_seed=1)
#kernel_dot = DotProduct(sigma_0=1.0)
kernel_dot = RBF(length_scale=2)
gp_model = GaussianProcessRegressor(kernel=kernel_dot, random_state=1, alpha=0.1)
from sklearn.ensemble import AdaBoostRegressor

In [57]:
base_regressors = [
                   ('ridge', ridge_model),
                  # ('lasso', lasso_model),
                   ('elastic_net', elastic_net_model),
                   ('random_forest', random_forest_model),
                   ('gradient_boosting', gradient_boosting_model),
                   ('knn', knn_model),
                   ('neural_network', neural_network_model),
                   #('linear_svr', linear_svr_model),
                   ('extra_trees', extra_trees_model),
                   ('huber', huber_model),
                   ('xgboost', xgboost_model),    
                   ('catboost', catboost_model),
                   ('gb', gp_model),
                 ]

weighted = [0.10, 0.09, 0.10, 0.08, 0.09, 0.08, 0.10, 0.08, 0.10, 0.10, 0.08]

# Ensembling using VotingRegressor
ensemble_model = VotingRegressor(estimators=base_regressors,
                                    weights=weighted,  # Adjust weights as needed
                                    n_jobs=-1,  # Use all available cores for fitting
                                    verbose=True
                                )

                
'''ensemble_model = StackingRegressor(
    estimators=base_regressors,
    final_estimator=random_forest_model,
    n_jobs=-1,
    cv=5
)'''

ensemble_model = AdaBoostRegressor(
    ensemble_model, 
    n_estimators=50, 
    learning_rate=0.1
)

In [58]:
ensemble_model.fit(X_train, y_train)

In [440]:
from sklearn.model_selection import cross_val_score

scores = []
for name, model in ensemble_model.estimators:
    score = np.mean(cross_val_score(model, X_train, y_train,  cv=5, scoring='neg_mean_squared_error'))
    scores.append((name, score))

# Rank the models based on performance
ranked_models = sorted(scores, key=lambda x: x[1], reverse=True)

# Print the ranking
print("Model Rankings:")
for rank, (name, score) in enumerate(ranked_models, start=1):
    print(f"{rank}. {name}: {score}")

0:	learn: 21959867.2233717	total: 23.2ms	remaining: 2.29s
1:	learn: 21624908.4031818	total: 49.8ms	remaining: 2.44s
2:	learn: 21311729.8474550	total: 74.2ms	remaining: 2.4s
3:	learn: 21042333.2953784	total: 98.3ms	remaining: 2.36s
4:	learn: 20783464.6127909	total: 123ms	remaining: 2.34s
5:	learn: 20566522.2213148	total: 149ms	remaining: 2.33s
6:	learn: 20388424.9735914	total: 173ms	remaining: 2.29s
7:	learn: 20203923.2996273	total: 197ms	remaining: 2.27s
8:	learn: 19981069.9595116	total: 225ms	remaining: 2.27s
9:	learn: 19565889.2129285	total: 252ms	remaining: 2.26s
10:	learn: 19394213.4259519	total: 277ms	remaining: 2.24s
11:	learn: 19248064.7360854	total: 302ms	remaining: 2.22s
12:	learn: 19140734.8687972	total: 326ms	remaining: 2.18s
13:	learn: 19022960.7787437	total: 352ms	remaining: 2.16s
14:	learn: 18906070.1156573	total: 376ms	remaining: 2.13s
15:	learn: 18581143.8762287	total: 400ms	remaining: 2.1s
16:	learn: 18432837.0223562	total: 423ms	remaining: 2.06s
17:	learn: 18309015.37

49:	learn: 14614714.4968256	total: 1.27s	remaining: 1.27s
50:	learn: 14486182.2793430	total: 1.3s	remaining: 1.25s
51:	learn: 14413793.6247665	total: 1.33s	remaining: 1.23s
52:	learn: 14357921.8803957	total: 1.35s	remaining: 1.2s
53:	learn: 14314330.3977822	total: 1.38s	remaining: 1.18s
54:	learn: 14239665.9511769	total: 1.4s	remaining: 1.15s
55:	learn: 14030805.6708163	total: 1.43s	remaining: 1.12s
56:	learn: 13953028.2685724	total: 1.46s	remaining: 1.1s
57:	learn: 13836932.7110945	total: 1.48s	remaining: 1.07s
58:	learn: 13780660.4800114	total: 1.51s	remaining: 1.05s
59:	learn: 13606779.3270795	total: 1.53s	remaining: 1.02s
60:	learn: 13568947.5078587	total: 1.56s	remaining: 998ms
61:	learn: 13505769.9683769	total: 1.59s	remaining: 972ms
62:	learn: 13453103.8710402	total: 1.61s	remaining: 947ms
63:	learn: 13418344.7139927	total: 1.64s	remaining: 921ms
64:	learn: 13386510.8500679	total: 1.66s	remaining: 894ms
65:	learn: 13355517.2193167	total: 1.69s	remaining: 869ms
66:	learn: 1330093

95:	learn: 12187164.8657643	total: 2.42s	remaining: 101ms
96:	learn: 12168178.4861278	total: 2.45s	remaining: 75.8ms
97:	learn: 12139726.3609593	total: 2.48s	remaining: 50.6ms
98:	learn: 12066345.1009771	total: 2.5s	remaining: 25.3ms
99:	learn: 12037475.9594242	total: 2.53s	remaining: 0us
0:	learn: 20867695.0441097	total: 26ms	remaining: 2.57s
1:	learn: 20623414.5754675	total: 49.9ms	remaining: 2.45s
2:	learn: 20493934.3950926	total: 74.3ms	remaining: 2.4s
3:	learn: 20376218.9808488	total: 99.1ms	remaining: 2.38s
4:	learn: 20276053.8970946	total: 111ms	remaining: 2.11s
5:	learn: 19911392.0875077	total: 135ms	remaining: 2.11s
6:	learn: 19789434.2146171	total: 159ms	remaining: 2.11s
7:	learn: 19645139.9410597	total: 183ms	remaining: 2.1s
8:	learn: 19520752.9453448	total: 206ms	remaining: 2.08s
9:	learn: 19401586.8531283	total: 231ms	remaining: 2.08s
10:	learn: 19275799.5943763	total: 258ms	remaining: 2.09s
11:	learn: 18973016.4849004	total: 281ms	remaining: 2.06s
12:	learn: 18866423.3721

44:	learn: 15925677.8519507	total: 1.09s	remaining: 1.34s
45:	learn: 15828864.6929696	total: 1.12s	remaining: 1.32s
46:	learn: 15761690.9773549	total: 1.15s	remaining: 1.29s
47:	learn: 15695829.8171176	total: 1.17s	remaining: 1.27s
48:	learn: 15523381.1871077	total: 1.19s	remaining: 1.24s
49:	learn: 15428028.6364783	total: 1.22s	remaining: 1.22s
50:	learn: 15342382.4269722	total: 1.24s	remaining: 1.19s
51:	learn: 15178767.5083782	total: 1.26s	remaining: 1.17s
52:	learn: 15116169.3126436	total: 1.29s	remaining: 1.14s
53:	learn: 15052964.7706926	total: 1.32s	remaining: 1.12s
54:	learn: 14914309.6122545	total: 1.34s	remaining: 1.1s
55:	learn: 14843319.1644748	total: 1.37s	remaining: 1.07s
56:	learn: 14731648.7461493	total: 1.39s	remaining: 1.05s
57:	learn: 14679344.7743665	total: 1.42s	remaining: 1.02s
58:	learn: 14627578.7168792	total: 1.44s	remaining: 999ms
59:	learn: 14479756.0472021	total: 1.46s	remaining: 976ms
60:	learn: 14435399.9699276	total: 1.49s	remaining: 951ms
61:	learn: 1438

In [19]:
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
        },
        index=[0],
    )

    return df_perf

In [59]:
# Checking model performance on train set
print("Training Performance:")
model_perf_train = model_performance_regression(
    ensemble_model, X_train, y_train
)
model_perf_train

Training Performance:


Unnamed: 0,RMSE
0,11309760.0


In [None]:
weights=[0.10, 0.10, 0.10, 0.08, 0.08, 0.08, 0.10, 0.08, 0.10, 0.10, 0.08] 1.377060e+07
weights=[0.10, 0.09, 0.10, 0.08, 0.09, 0.08, 0.10, 0.08, 0.10, 0.10, 0.08] 1.373557e+07
weights=[0.11, 0.09, 0.10, 0.08, 0.09, 0.08, 0.10, 0.08, 0.10, 0.10, 0.07] 1.386379e+07
robust 1.266516e+07
robust after combine 1.266497e+07 best
stanndard 1.264114e+07
robust after median 1.292834e+07
no weights 1.292834e+07


In [45]:
X_test = test_data_dummies_df.drop(["Annual Turnover"], axis=1).copy()

In [46]:
prediction = ensemble_model.predict(X_test)
prediction[:10]

array([27637932.63354355, 35447905.45848008, 30970284.7248807 ,
       35537906.89339888, 37674036.99498414, 32621834.60350041,
       27682528.67902781, 28505730.63521985, 28357395.21349102,
       26214482.35112416])

In [47]:
solution_df = pd.DataFrame(original_index_column)

In [48]:
solution_df['Annual Turnover'] = prediction
solution_df.head()

Unnamed: 0,Registration Number,Annual Turnover
0,20001,27637930.0
1,20002,35447910.0
2,20003,30970280.0
3,20004,35537910.0
4,20005,37674040.0


In [49]:
## Exporting the data frame to a '.csv' file and setting the index = False as we do want the index

solution_df.to_csv('Submission.csv',index=False)

## Other models