# Problem Statement:

Big Basket, a food delivery service operating in multiple countries, aims to improve the overall customer experience by analysing the ratings provided by its customers. The target variable in this problem is the rating given by the customers after they receive their food delivery. The company wants to understand the factors that influence the customers' ratings and identify the areas that need improvement.



The challenge for Big Basket is to collect and analyse the ratings data from multiple countries, as customer preferences and expectations may vary across different regions. Additionally, the company needs to develop a system that can capture the feedback from customers in a timely and efficient manner to ensure that the issues are addressed promptly.



Thus, the problem statement for Big Basket is to develop a data-driven approach to understand the factors influencing customer ratings across different countries, and to use this information to enhance the overall customer experience by addressing the areas that require improvement.


# Description of Data:
    
    
• Restaurant ID: Unique ID of each restaurant across the cities covered

• Restaurant Name: Name of the restaurant

• Country Code: Code of the country in which the restaurant is located

• City: City in which the restaurant is located

• Address: Address of the restaurant

• Locality: Location within the city

• Locality Verbose: Detailed description of the locality

• Longitude: Longitude coordinate of the restaurant's location

• Latitude: Latitude coordinate of the restaurant's location

• Cuisines: Cuisines offered by the restaurant

• Average Cost for two: Average cost for two people, in the local currency

• Currency: Currency of the country

• Has Table booking: yes/no

• Has Online delivery: yes/no

• Is delivering now: yes/no

• Switch to order menu: yes/no

• Price range: Price range of the food

• Aggregate rating: Average rating out of 5

• Rating color: Colour assigned according to the aggregate rating

• Rating text: Text label assigned according to the aggregate rating

• Votes: Number of ratings cast by customers


In [1]:
import pandas as pd
import numpy as np

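# Load the restaurant data and the country-code lookup table
df1 = pd.read_csv('Big Basket Food Delivery.csv', encoding='ISO-8859-1')
df2 = pd.read_excel('Country-Code.xlsx')

# Attach the country name to each restaurant via the shared 'Country Code' key
df = pd.merge(df1, df2, on='Country Code')

# Save the merged data without writing the row index as an extra column
df.to_csv('merged_file.csv', index=False)

# Drop the columns that are not used as model inputs
df = df.drop(["Restaurant Name", "City", "Locality", "Locality Verbose", "Switch to order menu", "Rating color", "Rating text"], axis=1)

# Count the missing values in each remaining column
df.isnull().sum()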

Restaurant ID           0
Country Code            0
Address                 0
Longitude               0
Latitude                0
Cuisines                9
Average Cost for two    1
Currency                0
Has Table booking       0
Has Online delivery     0
Is delivering now       0
Price range             2
Aggregate rating        0
Votes                   0
Country                 0
dtype: int64
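
`pd.merge` performs an inner join by default, so any restaurant whose `Country Code` is missing from the lookup table is dropped without a message. The following is a minimal sketch, on made-up toy tables rather than the real files, of how the join could be checked; the table contents and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy tables standing in for the restaurant data and the lookup table
restaurants = pd.DataFrame({
    "Restaurant ID": [1, 2, 3],
    "Country Code": [1, 216, 999],   # 999 has no match in the lookup table
})
countries = pd.DataFrame({
    "Country Code": [1, 216],
    "Country": ["India", "United States"],
})

# how="left" keeps every restaurant row; indicator=True adds a "_merge" column
# that records whether a match was found; validate="many_to_one" raises an
# error if a country code appears more than once in the lookup table
merged = pd.merge(
    restaurants, countries,
    on="Country Code", how="left",
    indicator=True, validate="many_to_one",
)

# Rows that would have been lost in an inner join
unmatched = merged[merged["_merge"] == "left_only"]
print(len(restaurants), len(merged), len(unmatched))
```

If `unmatched` is empty on the real data, the inner join used above loses nothing.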

In [2]:
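# Define a function to fill missing values with the mode (most frequent value)
def fill_mode(df, col_name):
    mode_val = df[col_name].mode()[0]
    # Assign the result back instead of calling fillna(inplace=True) on a
    # column selection, which may not update df in recent pandas versions
    df[col_name] = df[col_name].fillna(mode_val)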

# Apply the function to fill missing values in 'Cuisines' column
fill_mode(df, 'Cuisines')


In [3]:
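# Replace placeholder symbols with NaN and fill missing values with the median for 'Average Cost for two'
df['Average Cost for two'] = df['Average Cost for two'].replace(['?', '*', '&'], np.nan)
# Coerce to numeric so that median() also works if the column was read as text
df['Average Cost for two'] = pd.to_numeric(df['Average Cost for two'], errors='coerce')
median_val = df['Average Cost for two'].median()
df['Average Cost for two'] = df['Average Cost for two'].fillna(median_val)

# Replace placeholder symbols with NaN and fill missing values with the median for 'Price range'
df['Price range'] = df['Price range'].replace(['?', '*', '&'], np.nan)
df['Price range'] = pd.to_numeric(df['Price range'], errors='coerce')
median_val = df['Price range'].median()
df['Price range'] = df['Price range'].fillna(median_val)
df['Price range'] = df['Price range'].replace(-1, median_val)  # replace the sentinel value -1 with the median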
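
As a small illustration of the cleaning step above, the sketch below shows how placeholder symbols can be turned into missing values and then filled with the median. The sample values are invented, and the use of `pd.to_numeric` is one possible approach rather than the only one.

```python
import pandas as pd

# Hypothetical column holding numbers stored as text plus placeholder symbols
raw = pd.Series(["500", "?", "1200", "*", None, "300"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN
cost = pd.to_numeric(raw, errors="coerce")

# median() skips NaN by default, so it is computed from the valid values only
median_val = cost.median()
cost_filled = cost.fillna(median_val)

print(median_val)
print(cost_filled.tolist())
```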


In [4]:
df.isnull().sum()

Restaurant ID           0
Country Code            0
Address                 0
Longitude               0
Latitude                0
Cuisines                0
Average Cost for two    0
Currency                0
Has Table booking       0
Has Online delivery     0
Is delivering now       0
Price range             0
Aggregate rating        0
Votes                   0
Country                 0
dtype: int64

In [5]:
# Define the mapping dictionary
cat_mapping = {
    "Yes": 1,
    "No": 0
}

# Define the mapping function
def map_cats(x):
    return cat_mapping[x]

# Apply the mapping function to the specified columns
df["Has Table booking"] = df["Has Table booking"].apply(map_cats)
df["Has Online delivery"] = df["Has Online delivery"].apply(map_cats)
df["Is delivering now"] = df["Is delivering now"].apply(map_cats)
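
The dictionary lookup in `map_cats` raises a `KeyError` as soon as a column contains any value other than "Yes" or "No". A possible alternative, sketched here on invented data, is `Series.map`, which returns NaN for unmapped values so that they can be inspected first.

```python
import pandas as pd

# Hypothetical flag column; note the stray lower-case value with a trailing space
flags = pd.Series(["Yes", "No", "yes ", "No"])
cat_mapping = {"Yes": 1, "No": 0}

# Values missing from the dictionary become NaN instead of raising an error
mapped = flags.map(cat_mapping)
unexpected = flags[mapped.isna()].unique().tolist()
print(unexpected)

# Normalise whitespace and capitalisation before mapping
cleaned = flags.str.strip().str.capitalize().map(cat_mapping)
print(cleaned.tolist())
```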


In [6]:
df.columns

Index(['Restaurant ID', 'Country Code', 'Address', 'Longitude', 'Latitude',
       'Cuisines', 'Average Cost for two', 'Currency', 'Has Table booking',
       'Has Online delivery', 'Is delivering now', 'Price range',
       'Aggregate rating', 'Votes', 'Country'],
      dtype='object')

In [7]:
# Define the function to factorize a single column
def factorize_column(col):
    factorized, _ = pd.factorize(col)
    return factorized

# Factorize the specified columns
df["Address"] = factorize_column(df["Address"])
df["Cuisines"] = factorize_column(df["Cuisines"])
df["Currency"] = factorize_column(df["Currency"])
df["Country"] = factorize_column(df["Country"])
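
`pd.factorize` assigns integer codes in order of first appearance, so the numeric distance between two codes carries no meaning of its own. The sketch below, on invented currency names, shows the codes next to a one-hot encoding, which is one common alternative for distance-based models; whether it suits this dataset is a judgment call, since columns such as `Address` have very many distinct values.

```python
import pandas as pd

# Hypothetical values for illustration
currency = pd.Series(["Rupees", "Dollar", "Rupees", "Pounds"])

# codes follow the order of first appearance; uniques holds the original labels
codes, uniques = pd.factorize(currency)
print(codes)
print(list(uniques))

# One-hot encoding: one 0/1 column per category, with no implied ordering
one_hot = pd.get_dummies(currency, dtype=int)
print(one_hot)
```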


In [8]:
# Define the function to cap the values in each numeric column at its 20th and 80th percentiles.
# Note: df is modified in place, and the binary columns, the factorized codes
# and the target column 'Aggregate rating' are capped as well.
def cap_data(df):
    for col in df.columns:
        print("capping the", col)
        if df[col].dtype == 'float64' or df[col].dtype == 'int64':
            percentiles = df[col].quantile([0.20, 0.80]).values
            df.loc[df[col] <= percentiles[0], col] = percentiles[0]
            df.loc[df[col] >= percentiles[1], col] = percentiles[1]
    return df

# Cap the values in the dataframe (df1 now refers to the same object as df)
df1 = cap_data(df)


capping the Restaurant ID
capping the Country Code
capping the Address
capping the Longitude
capping the Latitude
capping the Cuisines
capping the Average Cost for two
capping the Currency
capping the Has Table booking
capping the Has Online delivery
capping the Is delivering now
capping the Price range
capping the Aggregate rating
capping the Votes
capping the Country
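
Percentile capping affects every numeric column it is applied to, including 0/1 flags. The sketch below uses an invented toy frame and a hypothetical helper, `cap_columns`, to show that a rare binary flag collapses to a single value when capped at the 20th and 80th percentiles, and how the capping could be limited to chosen columns.

```python
import pandas as pd

# Hypothetical toy data: a skewed count column and a rare binary flag
toy = pd.DataFrame({
    "Votes": [1, 5, 10, 20, 50, 100, 200, 500, 1000, 5000],
    "Has Table booking": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})

def cap_columns(df, columns, lower=0.20, upper=0.80):
    """Return a copy of df with the listed columns clipped to the given quantiles."""
    out = df.copy()
    for col in columns:
        lo, hi = out[col].quantile([lower, upper])
        out[col] = out[col].clip(lower=lo, upper=hi)
    return out

capped_all = cap_columns(toy, toy.columns)   # caps the flag as well
capped_votes = cap_columns(toy, ["Votes"])   # caps only the count column

print(capped_all["Has Table booking"].nunique())
print(capped_votes["Has Table booking"].nunique())
```

Because the helper works on a copy, the original frame stays available for comparison.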


In [9]:
# Split the data into the independent variables (X) and the dependent (target) variable (y)
X = df1.drop(['Aggregate rating'], axis=1)
y = df1['Aggregate rating']

In [10]:
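# Split the input and output data into train and test sets.
# The split is positional and the rows are not shuffled:
# the first 80% of the rows are used for training and the last 20% for testing.
train_size = int(0.8 * len(df1))   # 80% of the data for training
X_train = X.iloc[:train_size]
X_test = X.iloc[train_size:]
y_train = y.iloc[:train_size]
y_test = y.iloc[train_size:]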
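
A positional split is only representative if the row order is unrelated to the target. Whether that holds for the merged file is not shown here, so the following is a hedged sketch, on an invented frame grouped by country, of a shuffled 80/20 split with a fixed random seed.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose rows are grouped by country
toy = pd.DataFrame({"Country": ["A"] * 8 + ["B"] * 2, "x": range(10)})

# Shuffle the row positions with a fixed seed so the split is reproducible
rng = np.random.default_rng(seed=0)
shuffled_idx = rng.permutation(len(toy))

train_size = int(0.8 * len(toy))
train = toy.iloc[shuffled_idx[:train_size]]
test = toy.iloc[shuffled_idx[train_size:]]

# For comparison: the positional split puts only country "B" in the test set
pos_test = toy.iloc[train_size:]
print(pos_test["Country"].unique().tolist())
print(sorted(test.index.tolist()))
```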

In [11]:
# Print the shapes of the training and testing sets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (7640, 14)
Shape of X_test: (1911, 14)
Shape of y_train: (7640,)
Shape of y_test: (1911,)


In [12]:
def preprocessor(df):
    # Work on a copy so that the DataFrame passed in is left unchanged
    df = df.copy()

    # Fill missing values with mode for 'Cuisines' column
    df['Cuisines'] = df['Cuisines'].fillna(df['Cuisines'].mode()[0])

    # Replace placeholder symbols with NaN and fill missing values with median for 'Average Cost for two'
    df['Average Cost for two'] = df['Average Cost for two'].replace(['?', '*', '&'], np.nan)
    df['Average Cost for two'] = pd.to_numeric(df['Average Cost for two'], errors='coerce')
    median_val = df['Average Cost for two'].median()
    df['Average Cost for two'] = df['Average Cost for two'].fillna(median_val)

    # Replace placeholder symbols with NaN and fill missing values with median for 'Price range'
    df['Price range'] = df['Price range'].replace(['?', '*', '&'], np.nan)
    df['Price range'] = pd.to_numeric(df['Price range'], errors='coerce')
    median_val = df['Price range'].median()
    df['Price range'] = df['Price range'].fillna(median_val)
    df['Price range'] = df['Price range'].replace(-1, median_val)  # replace the sentinel value -1 with the median

    # Map categorical variables
    cat_mapping = {
        "Yes": 1,
        "No": 0
    }
    df["Has Table booking"] = df["Has Table booking"].map(cat_mapping)
    df["Has Online delivery"] = df["Has Online delivery"].map(cat_mapping)
    df["Is delivering now"] = df["Is delivering now"].map(cat_mapping)

    # Factorize categorical variables
    df["Address"] = pd.factorize(df["Address"])[0]
    df["Cuisines"] = pd.factorize(df["Cuisines"])[0]
    df["Currency"] = pd.factorize(df["Currency"])[0]
    df["Country"] = pd.factorize(df["Country"])[0]

    # Cap data
    df = cap_data(df)

    return df

In [13]:
import numpy as np

class StandardScaler:
    def __init__(self):
        self.mean_ = None
        self.std_ = None
        
    def fit(self, X):
        self.mean_ = np.mean(X, axis=0)
        self.std_ = np.std(X, axis=0)
        
    def transform(self, X):
        X = (X - self.mean_) / self.std_
        return X
    
    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)

    
class ColumnTransformer:
    def __init__(self, transformers):
        self.transformers = transformers
        
    def fit_transform(self, X):
        transformed_features = []
        for name, transformer, features in self.transformers:
            X_temp = transformer.fit_transform(X[features])
            transformed_features.append(X_temp)
        
        X_transformed = np.hstack(transformed_features)
        return X_transformed
    
    def transform(self, X):
        transformed_features = []
        for name, transformer, features in self.transformers:
            X_temp = transformer.transform(X[features])
            transformed_features.append(X_temp)
        
        X_transformed = np.hstack(transformed_features)
        return X_transformed

    
class Pipeline:
    def __init__(self, steps):
        self.steps = steps
        
    def fit(self, X, y):
        X_temp = X.copy()
        for name, step in self.steps[:-1]:
            X_temp = step.fit_transform(X_temp)
        
        self.steps[-1][1].fit(X_temp, y)
        
    def predict(self, X):
        X_temp = X.copy()
        for name, step in self.steps[:-1]:
            X_temp = step.transform(X_temp)
        
        y_pred = self.steps[-1][1].predict(X_temp)
        return y_pred
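
The `StandardScaler` above divides by the column standard deviation, so a column with zero variance in the training data would produce NaN values (0 divided by 0). The class below, `SafeStandardScaler`, is a hypothetical variant written for illustration, not part of the notebook; it replaces a zero standard deviation with 1 so that constant columns simply become 0.

```python
import numpy as np

class SafeStandardScaler:
    """Hypothetical scaler variant that tolerates constant columns."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        std = X.std(axis=0)
        # A constant column has std 0; dividing by 1 leaves it at 0 after centring
        self.std_ = np.where(std == 0, 1.0, std)
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) / self.std_

    def fit_transform(self, X):
        return self.fit(X).transform(X)

# Toy input whose second column is constant
X = np.array([[1.0, 7.0], [2.0, 7.0], [3.0, 7.0]])
Z = SafeStandardScaler().fit_transform(X)
print(Z)
```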

In [14]:
class IdentityTransformer:
    def __init__(self):
        pass

    def fit(self, X):
        pass

    def transform(self, X):
        return X

    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)

        
class KNeighborsRegressor:
    def __init__(self, n_neighbors=4):
        self.n_neighbors = n_neighbors

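    def fit(self, X, y):
        # Store the training data as NumPy arrays so that indexing is positional
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y, dtype=float)

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        y_pred = np.zeros(len(X))
        for i in range(len(X)):
            # Squared Euclidean distance from test point i to every training point
            distances = np.sum((self.X_train - X[i])**2, axis=1)
            # Average the targets of the n_neighbors closest training points
            indices = np.argsort(distances)[:self.n_neighbors]
            y_pred[i] = np.mean(self.y_train[indices])
        return y_pred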
    
    
# Define the numerical and categorical features
numerical_features = ['Average Cost for two', 'Price range']
categorical_features = ['Has Table booking', 'Has Online delivery', 'Is delivering now', 'Address', 'Cuisines', 'Currency', 'Country']

preprocessor_step = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', IdentityTransformer(), categorical_features)
    ]
)

# Create the full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor_step),
    ('model', KNeighborsRegressor())
])
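
In the configuration above only the two numerical features are standardised; the factorized codes pass through `IdentityTransformer` unchanged. Since k-nearest neighbours relies on Euclidean distance, a column with a large numeric range can then outweigh the scaled ones. The sketch below illustrates this on invented numbers; `knn_predict` is a hypothetical vectorised helper, not the class defined above.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=1):
    """Brute-force k-nearest-neighbour regression with squared Euclidean distance."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    # Pairwise squared distances, shape (n_query, n_train), via broadcasting
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

# Column 0 is a small-scale feature; column 1 is an integer code with a large range
X_train = np.array([[0.0, 100.0], [5.0, 900.0]])
y_train = np.array([1.0, 4.0])
X_query = np.array([[4.9, 110.0]])

# With the code column included, its large differences decide the neighbour
with_codes = knn_predict(X_train, y_train, X_query)
# With only the small-scale feature, the other training point is closer
without_codes = knn_predict(X_train[:, :1], y_train, X_query[:, :1])
print(with_codes, without_codes)
```

On this toy input the chosen neighbour, and therefore the prediction, changes depending on whether the unscaled code column is included.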

In [15]:
# Preprocess the data and train the model by applying the preprocessing step and the model directly
# (the `pipeline` object defined above is not used here)
preprocessed_X_train = preprocessor_step.fit_transform(X_train)
model = KNeighborsRegressor(n_neighbors=4)
model.fit(preprocessed_X_train, y_train)

# Preprocess the test data and make predictions
preprocessed_X_test = preprocessor_step.transform(X_test)
y_pred = model.predict(preprocessed_X_test)
y_pred


array([3.325, 3.325, 3.325, ..., 3.325, 3.325, 3.325])

In [16]:
# Calculate SSE, MSE and RMSE
sse = np.sum((y_pred - y_test)**2)
mse = sse / len(y_test)
rmse = np.sqrt(mse)

print("MSE:", mse)
print("RMSE:", rmse)


MSE: 3.1582048011512303
RMSE: 1.77713387260252
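
An RMSE is easier to interpret next to a baseline such as always predicting the mean rating. The sketch below uses a hypothetical helper, `regression_report`, and invented target values to compute MSE, RMSE, MAE and R²; an R² of 0 means a model does no better than that constant baseline.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 for two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    mae = np.mean(np.abs(y_true - y_pred))
    # R^2 compares the model with a constant predictor at the mean of y_true
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / ss_tot
    return {"mse": mse, "rmse": np.sqrt(mse), "mae": mae, "r2": r2}

# Hypothetical targets and a constant baseline that always predicts their mean
y_true = np.array([1.0, 2.0, 3.0, 4.0])
constant = np.full_like(y_true, y_true.mean())
report = regression_report(y_true, constant)
print(report)
```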
