# Part B – Predictive Modelling
### I. Feature Engineering:


In [190]:
import pandas as pd
import numpy as np


## Load the Dataset
Now, let's load the dataset from its file path into a pandas DataFrame. This is the dataset that contains information about restaurants.

In [191]:
# Provide the path to your dataset
file_path = "./data/zomato_df_final_data.csv"

# Load the dataset
zomato_df = pd.read_csv(file_path)

# Display the first few rows of the dataset
print(zomato_df.head())

                                             address   cost  \
0                      371A Pitt Street, CBD, Sydney   50.0   
1      Shop 7A, 2 Huntley Street, Alexandria, Sydney   80.0   
2   Level G, The Darling at the Star, 80 Pyrmont ...  120.0   
3   Sydney Opera House, Bennelong Point, Circular...  270.0   
4              20 Campbell Street, Chinatown, Sydney   55.0   

                                       cuisine        lat  \
0   ['Hot Pot', 'Korean BBQ', 'BBQ', 'Korean'] -33.876059   
1  ['Cafe', 'Coffee and Tea', 'Salad', 'Poké'] -33.910999   
2                                 ['Japanese'] -33.867971   
3                        ['Modern Australian'] -33.856784   
4                            ['Thai', 'Salad'] -33.879035   

                                                link         lng  \
0    https://www.zomato.com/sydney/sydney-madang-cbd  151.207605   
1  https://www.zomato.com/sydney/the-grounds-of-a...  151.193793   
2        https://www.zomato.com/sydney/sokyo-pyrmo

### Check for Missing Data
Before we can clean the data, we need to understand how much data is missing and in which columns. We will use the isnull() function combined with sum() to identify missing values.

In [192]:
# Check for missing values in the dataset
print("Missing values in each column:\n", zomato_df.isnull().sum())


Missing values in each column:
 address             0
cost              346
cuisine             0
lat               192
link                0
lng               192
phone               0
rating_number    3316
rating_text      3316
subzone             0
title               0
type               48
votes            3316
groupon             0
color               0
cost_2            346
cuisine_color       0
dtype: int64
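
Raw counts are easier to judge next to the share of rows they represent. The sketch below builds a small summary table; the helper name `missing_summary` and the toy frame are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only columns that actually have gaps, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for zomato_df (values are made up)
toy = pd.DataFrame({
    "cost": [50.0, np.nan, 120.0, 270.0],
    "rating_number": [4.0, np.nan, np.nan, 4.9],
    "title": ["A", "B", "C", "D"],
})
summary = missing_summary(toy)
print(summary)
```

On the real data the same helper could be called as `missing_summary(zomato_df)`.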


### Remove or Impute Missing Values
Once we identify missing values, we can decide how to handle them. Generally, there are two options:

1. Remove rows or columns with missing values.

2. Impute (fill) missing values with meaningful data such as the median, mean, or most frequent value.

For this task, we will remove only the rows that are missing rating_number or cost; gaps in the remaining feature columns are imputed with the median later, at modelling time:

In [193]:
# Data Cleaning: Dropping rows with missing 'rating_number' or 'cost'
zomato_df_clean = zomato_df.dropna(subset=['rating_number', 'cost'])

# Feature Encoding: one-hot encode 'rating_text' and 'type'
# ('cuisine' is left as-is here and dropped from the features later)
zomato_df_clean = pd.get_dummies(zomato_df_clean, columns=['rating_text', 'type'], drop_first=True)

# Check if the data is clean and encoded
zomato_df_clean.head()

Unnamed: 0,address,cost,cuisine,lat,link,lng,phone,rating_number,subzone,title,...,"type_['Food Court', 'Fast Food']",type_['Food Court'],type_['Food Truck'],"type_['Pub', 'Bar']","type_['Pub', 'Casual Dining']","type_['Pub', 'Club']","type_['Pub', 'Wine Bar']",type_['Pub'],"type_['Wine Bar', 'Casual Dining']",type_['Wine Bar']
0,"371A Pitt Street, CBD, Sydney",50.0,"['Hot Pot', 'Korean BBQ', 'BBQ', 'Korean']",-33.876059,https://www.zomato.com/sydney/sydney-madang-cbd,151.207605,02 8318 0406,4.0,CBD,Sydney Madang,...,False,False,False,False,False,False,False,False,False,False
1,"Shop 7A, 2 Huntley Street, Alexandria, Sydney",80.0,"['Cafe', 'Coffee and Tea', 'Salad', 'Poké']",-33.910999,https://www.zomato.com/sydney/the-grounds-of-a...,151.193793,02 9699 2225,4.6,"The Grounds of Alexandria, Alexandria",The Grounds of Alexandria Cafe,...,False,False,False,False,False,False,False,False,False,False
2,"Level G, The Darling at the Star, 80 Pyrmont ...",120.0,['Japanese'],-33.867971,https://www.zomato.com/sydney/sokyo-pyrmont,151.19521,1800 700 700,4.9,"The Star, Pyrmont",Sokyo,...,False,False,False,False,False,False,False,False,False,False
3,"Sydney Opera House, Bennelong Point, Circular...",270.0,['Modern Australian'],-33.856784,https://www.zomato.com/sydney/bennelong-restau...,151.215297,02 9240 8000,4.9,Circular Quay,Bennelong Restaurant,...,False,False,False,False,False,False,False,False,False,False
4,"20 Campbell Street, Chinatown, Sydney",55.0,"['Thai', 'Salad']",-33.879035,https://www.zomato.com/sydney/chat-thai-chinatown,151.206409,02 8317 4811,4.5,Chinatown,Chat Thai,...,False,False,False,False,False,False,False,False,False,False
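
The type values are strings that look like Python lists, so pd.get_dummies creates one column per distinct combination (for example `type_['Pub', 'Bar']`). A possible alternative, sketched here on toy rows and assuming every value parses with `ast.literal_eval`, is to produce one indicator column per individual type.

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy rows mimicking the string-encoded lists in the 'type' column
toy = pd.DataFrame({"type": ["['Pub', 'Bar']", "['Pub']", "['Wine Bar', 'Casual Dining']"]})

# Parse each string into a real Python list
parsed = toy["type"].apply(ast.literal_eval)

# One indicator column per individual type rather than per combination
mlb = MultiLabelBinarizer()
indicators = pd.DataFrame(
    mlb.fit_transform(parsed),
    columns=[f"type_{name}" for name in mlb.classes_],
    index=toy.index,
)
print(indicators)
```

Rows with a missing type would need to be dropped or filled before parsing, since `ast.literal_eval` only accepts strings.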


### Handling Categorical Variables
Next, we need to encode categorical variables (like cuisine, type, and rating_text) into numerical format since machine learning models can't work with raw text or categories.

#### a. Label Encoding for rating_text
We will use Label Encoding to convert the rating_text column into numbers:

In [194]:
from sklearn.preprocessing import LabelEncoder

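# zomato_cleaned is not created in the cells shown above.
# Assumption: it is zomato_df with every row containing a missing value removed,
# which is consistent with the all-non-null info() summary further down.
zomato_cleaned = zomato_df.dropna().copy()

# Initialize label encoder
label_encoder = LabelEncoder()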

# Encode the 'rating_text' column
zomato_cleaned['rating_text_encoded'] = label_encoder.fit_transform(zomato_cleaned['rating_text'])

# Check the encoding
print(zomato_cleaned[['rating_text', 'rating_text_encoded']].drop_duplicates())


    rating_text  rating_text_encoded
0     Very Good                    4
1     Excellent                    1
28         Good                    2
44      Average                    0
255        Poor                    3
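
LabelEncoder assigns codes in alphabetical order, which is why 'Excellent' receives 1 and 'Poor' receives 3 in the output above. If the codes are meant to reflect rating quality, an explicit mapping is one option. The sketch below assumes the quality order from the label meanings and uses a toy series rather than the real column.

```python
import pandas as pd

# Assumed quality order, worst to best
rating_order = ["Poor", "Average", "Good", "Very Good", "Excellent"]
rating_to_code = {label: code for code, label in enumerate(rating_order)}

# Toy series standing in for zomato_cleaned['rating_text']
ratings = pd.Series(["Very Good", "Excellent", "Good", "Average", "Poor"])
encoded = ratings.map(rating_to_code)
print(encoded.tolist())
```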


### Final Dataset Check
After cleaning and encoding, we will check the structure of the final dataset:

In [195]:
# Check the final cleaned dataset
print(zomato_cleaned.info())

# Preview the first few rows
zomato_cleaned.head()


<class 'pandas.core.frame.DataFrame'>
Index: 6949 entries, 0 to 10212
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   address              6949 non-null   object 
 1   cost                 6949 non-null   float64
 2   cuisine              6949 non-null   object 
 3   lat                  6949 non-null   float64
 4   link                 6949 non-null   object 
 5   lng                  6949 non-null   float64
 6   phone                6949 non-null   object 
 7   rating_number        6949 non-null   float64
 8   rating_text          6949 non-null   object 
 9   subzone              6949 non-null   object 
 10  title                6949 non-null   object 
 11  type                 6949 non-null   object 
 12  votes                6949 non-null   float64
 13  groupon              6949 non-null   bool   
 14  color                6949 non-null   object 
 15  cost_2               6949 non-null   float

Unnamed: 0,address,cost,cuisine,lat,link,lng,phone,rating_number,rating_text,subzone,title,type,votes,groupon,color,cost_2,cuisine_color,rating_text_encoded,suburbs
0,"371A Pitt Street, CBD, Sydney",50.0,"['Hot Pot', 'Korean BBQ', 'BBQ', 'Korean']",-33.876059,https://www.zomato.com/sydney/sydney-madang-cbd,151.207605,02 8318 0406,4.0,Very Good,CBD,Sydney Madang,['Casual Dining'],1311.0,False,#e15307,5.243902,#6f706b,4,CBD
1,"Shop 7A, 2 Huntley Street, Alexandria, Sydney",80.0,"['Cafe', 'Coffee and Tea', 'Salad', 'Poké']",-33.910999,https://www.zomato.com/sydney/the-grounds-of-a...,151.193793,02 9699 2225,4.6,Excellent,"The Grounds of Alexandria, Alexandria",The Grounds of Alexandria Cafe,['Café'],3236.0,False,#9c3203,7.560976,#6f706b,1,Alexandria
2,"Level G, The Darling at the Star, 80 Pyrmont ...",120.0,['Japanese'],-33.867971,https://www.zomato.com/sydney/sokyo-pyrmont,151.19521,1800 700 700,4.9,Excellent,"The Star, Pyrmont",Sokyo,['Fine Dining'],1227.0,False,#7f2704,10.650407,#6f706b,1,Pyrmont
3,"Sydney Opera House, Bennelong Point, Circular...",270.0,['Modern Australian'],-33.856784,https://www.zomato.com/sydney/bennelong-restau...,151.215297,02 9240 8000,4.9,Excellent,Circular Quay,Bennelong Restaurant,"['Fine Dining', 'Bar']",278.0,False,#7f2704,22.235772,#4186f4,1,Circular Quay
4,"20 Campbell Street, Chinatown, Sydney",55.0,"['Thai', 'Salad']",-33.879035,https://www.zomato.com/sydney/chat-thai-chinatown,151.206409,02 8317 4811,4.5,Excellent,Chinatown,Chat Thai,['Casual Dining'],2150.0,False,#a83703,5.630081,#6f706b,1,Chinatown


#### Code to Create the suburbs Column

In [196]:
# Split the 'subzone' column by commas and create a new 'suburbs' column
# Assuming 'subzone' has values like 'The Grounds of Alexandria, Alexandria'
zomato_cleaned['suburbs'] = zomato_cleaned['subzone'].str.split(',').str[-1].str.strip()

# Check the updated dataframe
zomato_cleaned[['subzone', 'suburbs']].head()



Unnamed: 0,subzone,suburbs
0,CBD,CBD
1,"The Grounds of Alexandria, Alexandria",Alexandria
2,"The Star, Pyrmont",Pyrmont
3,Circular Quay,Circular Quay
4,Chinatown,Chinatown


In [197]:
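# Count the distinct suburbs (only a cell's last expression is displayed,
# so wrap this in print() to see the value)
zomato_cleaned['suburbs'].nunique()

# Preview the first few rows
zomato_cleaned.head()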

Unnamed: 0,address,cost,cuisine,lat,link,lng,phone,rating_number,rating_text,subzone,title,type,votes,groupon,color,cost_2,cuisine_color,rating_text_encoded,suburbs
0,"371A Pitt Street, CBD, Sydney",50.0,"['Hot Pot', 'Korean BBQ', 'BBQ', 'Korean']",-33.876059,https://www.zomato.com/sydney/sydney-madang-cbd,151.207605,02 8318 0406,4.0,Very Good,CBD,Sydney Madang,['Casual Dining'],1311.0,False,#e15307,5.243902,#6f706b,4,CBD
1,"Shop 7A, 2 Huntley Street, Alexandria, Sydney",80.0,"['Cafe', 'Coffee and Tea', 'Salad', 'Poké']",-33.910999,https://www.zomato.com/sydney/the-grounds-of-a...,151.193793,02 9699 2225,4.6,Excellent,"The Grounds of Alexandria, Alexandria",The Grounds of Alexandria Cafe,['Café'],3236.0,False,#9c3203,7.560976,#6f706b,1,Alexandria
2,"Level G, The Darling at the Star, 80 Pyrmont ...",120.0,['Japanese'],-33.867971,https://www.zomato.com/sydney/sokyo-pyrmont,151.19521,1800 700 700,4.9,Excellent,"The Star, Pyrmont",Sokyo,['Fine Dining'],1227.0,False,#7f2704,10.650407,#6f706b,1,Pyrmont
3,"Sydney Opera House, Bennelong Point, Circular...",270.0,['Modern Australian'],-33.856784,https://www.zomato.com/sydney/bennelong-restau...,151.215297,02 9240 8000,4.9,Excellent,Circular Quay,Bennelong Restaurant,"['Fine Dining', 'Bar']",278.0,False,#7f2704,22.235772,#4186f4,1,Circular Quay
4,"20 Campbell Street, Chinatown, Sydney",55.0,"['Thai', 'Salad']",-33.879035,https://www.zomato.com/sydney/chat-thai-chinatown,151.206409,02 8317 4811,4.5,Excellent,Chinatown,Chat Thai,['Casual Dining'],2150.0,False,#a83703,5.630081,#6f706b,1,Chinatown


The data is now cleaned, and the categorical features rating_text and type have been one-hot encoded. The next step is to build the regression models:

#### Linear Regression Model 1:
We'll use linear regression to predict the restaurant rating using the features in the dataset.
#### Linear Regression Model 2:
We'll implement another model using gradient descent.
#### Mean Squared Error (MSE):
We'll calculate and report the MSE for both models.

Let’s start by building the first linear regression model and evaluating its performance. I'll split the data into training and test sets and then train the model.

In [198]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

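# Defining features (X) and target variable (y)
# Note: the rating_text_* dummy columns remain in X. They are derived from the
# rating itself, which likely contributes to the very low MSE reported below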
X = zomato_df_clean.drop(columns=['rating_number', 'address', 'cuisine', 'link', 'phone', 'title', 'subzone', 'votes', 'groupon', 'color', 'cost_2', 'cuisine_color'])
y = zomato_df_clean['rating_number']

# Impute missing values by replacing them with the median for numerical features
X_filled = X.fillna(X.median())

# Re-run the train-test split with the imputed data
X_train_filled, X_test_filled, y_train, y_test = train_test_split(X_filled, y, test_size=0.2, random_state=0)

# Build Model 1: Linear Regression
linear_reg_model = LinearRegression()
linear_reg_model.fit(X_train_filled, y_train)

# Predict on the test set
y_pred_filled = linear_reg_model.predict(X_test_filled)

# Calculate Mean Squared Error (MSE) after imputation
mse_model_1_filled = mean_squared_error(y_test, y_pred_filled)

mse_model_1_filled



0.045606445171681706
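
For reference, the MSE is the mean of the squared differences between the true and predicted values. The small check below computes it by hand and with scikit-learn; the numbers are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up ratings and predictions
y_true = np.array([4.0, 3.5, 2.0, 4.5])
y_hat = np.array([3.8, 3.5, 2.5, 4.0])

# MSE = mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)
mse_sklearn = mean_squared_error(y_true, y_hat)
print(mse_manual, mse_sklearn)
```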

The second model uses gradient descent for optimization. I'll use the Stochastic Gradient Descent (SGD) regressor from scikit-learn to build this model.

In [199]:
from sklearn.linear_model import SGDRegressor

# Model 2: Linear Regression with Gradient Descent (SGD Regressor)
sgd_regressor = SGDRegressor(max_iter=1000, tol=1e-3, random_state=0)

# Train the model using the same filled dataset
sgd_regressor.fit(X_train_filled, y_train)

# Predict on the test set
y_pred_sgd = sgd_regressor.predict(X_test_filled)

# Calculate Mean Squared Error (MSE) for the SGD model
mse_model_2 = mean_squared_error(y_test, y_pred_sgd)

mse_model_2


1.6100809559982215e+25

The Mean Squared Error (MSE) for the second model using Gradient Descent (SGD Regressor) is extremely high, suggesting that the model may not be converging properly. This could be due to several reasons, such as improper scaling of features or issues with the optimization process.

We can attempt feature scaling to see if it improves the model performance. 

In [200]:
from sklearn.preprocessing import StandardScaler

# Apply Standard Scaling to the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_filled)
X_test_scaled = scaler.transform(X_test_filled)

# Rebuild the SGD model with scaled features
sgd_regressor.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred_sgd_scaled = sgd_regressor.predict(X_test_scaled)

# Calculate MSE after scaling
mse_model_2_scaled = mean_squared_error(y_test, y_pred_sgd_scaled)

mse_model_2_scaled


6199.1493534606225

After applying feature scaling, the Mean Squared Error (MSE) for the SGD (Gradient Descent) model has improved to 6199.15. While this is much better than the previous result, it's still higher compared to the first linear regression model.
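
To illustrate why feature scale matters so much for gradient descent, here is a minimal from-scratch sketch of batch gradient descent for linear regression. It is not the algorithm inside SGDRegressor (which updates on one sample at a time); the function names, learning rate, and synthetic data are all assumptions made for the example.

```python
import numpy as np


def gradient_descent(X, y, lr=0.1, n_iter=200):
    """Fit y ~ X @ w + b by batch gradient descent on the MSE."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        residual = X @ w + b - y
        # Gradients of mean(residual ** 2) with respect to w and b
        w -= lr * (2.0 / n_samples) * (X.T @ residual)
        b -= lr * 2.0 * residual.mean()
    return w, b


def mse(X, y, w, b):
    return float(np.mean((X @ w + b - y) ** 2))


# Synthetic stand-in data: two features on very different scales
rng = np.random.default_rng(0)
X_raw = np.column_stack([rng.uniform(10, 300, 500), rng.uniform(0, 3000, 500)])
y = 2.5 + 0.004 * X_raw[:, 0] + 0.0003 * X_raw[:, 1] + rng.normal(0, 0.1, 500)

# Standardise each column to zero mean and unit variance
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Silence overflow warnings: the unscaled run is expected to blow up
with np.errstate(all="ignore"):
    mse_raw = mse(X_raw, y, *gradient_descent(X_raw, y))
mse_scaled = mse(X_scaled, y, *gradient_descent(X_scaled, y))

print("unscaled MSE is finite:", np.isfinite(mse_raw))
print("scaled MSE:", round(mse_scaled, 4))
```

With the same learning rate, the unscaled run overshoots on every step and the weights grow without bound, while the standardised run settles near the noise level of the synthetic target.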

The high MSE (Mean Squared Error) of the SGD regression model can have several causes:

1. Model Optimization and Convergence:
Stochastic Gradient Descent (SGD) is sensitive to the learning rate, the number of iterations, and the tolerance. If these parameters are poorly tuned, the model may stop before reaching a good solution, or may diverge, leading to poor predictions and a high error.
2. Feature Scaling:
SGD is particularly sensitive to feature scale. Standardization puts every column on a comparable scale, but a one-hot column that is almost always zero has a very small standard deviation, so the few rows where it is set receive very large standardized values. A handful of such rows in the test set may be enough to dominate the MSE.
3. Complexity of the Problem:
The relationship between the input features and the target variable (restaurant rating) may be non-linear or more complex than a simple linear model can capture. In that case neither SGD nor ordinary linear regression is the best fit for this data.
Non-linear models such as decision trees, random forests, or more advanced methods might model this complexity better.
4. Noise in the Data:
Outliers, irrelevant features, or noisy records can distort the model's predictions. Even after scaling, noisy data can result in inaccurate predictions and a higher MSE.
5. Feature Engineering:
The dataset might benefit from further feature engineering. Some of the one-hot encoded features may carry little predictive power, and important interactions between features may have been missed.
6. Imbalanced Target Distribution:
If the target variable (restaurant rating) is skewed, the model may underfit sparsely populated regions of the target range and predict poorly for those cases.
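
The tuning mentioned in point 1 can be sketched as a small grid search. This is an illustration on synthetic data, not a result for the Zomato dataset; the grid values are arbitrary assumptions, and the step name `sgdregressor` comes from how `make_pipeline` names its steps.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(10, 300, 400), rng.uniform(0, 3000, 400)])
y = 2.5 + 0.004 * X[:, 0] + 0.0003 * X[:, 1] + rng.normal(0, 0.1, 400)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Scaling lives inside the pipeline, so it is re-fit on each training fold
pipeline = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3, random_state=0))

# Small, illustrative grid over the initial learning rate and the L2 penalty
param_grid = {
    "sgdregressor__eta0": [0.001, 0.01, 0.1],
    "sgdregressor__alpha": [1e-5, 1e-4, 1e-3],
}
search = GridSearchCV(pipeline, param_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(X_train, y_train)

mse_sgd_tuned = mean_squared_error(y_test, search.predict(X_test))
print(search.best_params_, mse_sgd_tuned)
```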

Summary:
SGD is generally harder to tune than standard linear regression and requires careful adjustment of its hyperparameters. The higher MSE here suggests that SGD is not a good choice for this dataset without further tuning or additional feature engineering.

The Mean Squared Error (MSE) on the test data for both regression models:

Linear Regression (Model 1):
MSE: 0.0456

SGD Regression (Model 2):
MSE before scaling: 1.61 × 10^25 (indicating severe convergence issues)
MSE after scaling: 6199.15

The Linear Regression model (Model 1) performed much better than the SGD Regression model (Model 2), even after scaling. The results suggest that the linear regression model is more suitable for this dataset.

# III. Classification:

## Step 1: Simplifying the problem into binary classification
We will categorize restaurants based on their rating_text:

##### Class 1: ‘Poor’ and ‘Average’ records.
##### Class 2: ‘Good’, ‘Very Good’, and ‘Excellent’ records.
I will first map the ratings into these two classes and then build a logistic regression model for the binary classification problem.

In [201]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
# Load the dataset again
file_path = './data/zomato_df_final_data.csv'
zomato_df = pd.read_csv(file_path)

zomato_df_clean.columns


Index(['address', 'cost', 'cuisine', 'lat', 'link', 'lng', 'phone',
       'rating_number', 'subzone', 'title', 'votes', 'groupon', 'color',
       'cost_2', 'cuisine_color', 'rating_text_Excellent', 'rating_text_Good',
       'rating_text_Poor', 'rating_text_Very Good',
       'type_['Bakery', 'Dessert Parlour']', 'type_['Bakery', 'Pub']',
       'type_['Bakery']', 'type_['Bar', 'Café']',
       'type_['Bar', 'Casual Dining']', 'type_['Bar', 'Club']',
       'type_['Bar', 'Pub']', 'type_['Bar', 'Wine Bar']', 'type_['Bar']',
       'type_['Beverage Shop', 'Food Court']', 'type_['Beverage Shop']',
       'type_['Café', 'Bakery']', 'type_['Café', 'Bar']',
       'type_['Café', 'Beverage Shop']', 'type_['Café', 'Casual Dining']',
       'type_['Café', 'Dessert Parlour']', 'type_['Café', 'Food Court']',
       'type_['Café', 'Wine Bar']', 'type_['Café']',
       'type_['Casual Dining', 'Bakery']', 'type_['Casual Dining', 'Bar']',
       'type_['Casual Dining', 'Café']',
       'type_['Casu

In [202]:
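# Binary target as implemented here: the one-hot column 'rating_text_Poor'
# True  -> 'Poor'
# False -> every other rating ('Average', 'Good', 'Very Good', 'Excellent')
# Note: this differs from the Step 1 definition, which also places 'Average' in Class 1

# Mapping the binary classes
zomato_df_clean['rating_binary'] = zomato_df_clean['rating_text_Poor']

# Intended to recode 0 as 1. The column is boolean and the reported labels stay
# False/True, so this call has no visible effect on the target
zomato_df_clean['rating_binary'] = zomato_df_clean['rating_binary'].replace({0: 1})

# Prepare the features (X) and target (y)
# Note: the rating_text_* dummy columns, including 'rating_text_Poor', are not dropped and remain in X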
X_class_binary = zomato_df_clean.drop(columns=['rating_number', 'rating_binary', 'address', 'cuisine', 'link', 'phone', 'title', 'subzone', 'votes', 'groupon', 'color', 'cost_2', 'cuisine_color'])
y_class_binary = zomato_df_clean['rating_binary']

# Impute missing values using the median for numeric columns
imputer = SimpleImputer(strategy='median')
X_class_binary_filled = imputer.fit_transform(X_class_binary)

# The binary classification problem is now ready for modeling
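
The cell above derives the target from rating_text_Poor alone, whereas Step 1 also places 'Average' in Class 1. A mapping that follows the Step 1 definition could look like the sketch below. It works on the raw rating_text column (before one-hot encoding) and uses a toy frame; the 0/1 coding is an assumption.

```python
import pandas as pd

# Toy frame standing in for the rows with a non-missing rating_text
toy = pd.DataFrame({"rating_text": ["Poor", "Average", "Good", "Very Good", "Excellent"]})

# Class 2 (coded 1): 'Good', 'Very Good', 'Excellent'
# Class 1 (coded 0): 'Poor', 'Average'
positive_labels = ["Good", "Very Good", "Excellent"]
toy["rating_binary"] = toy["rating_text"].isin(positive_labels).astype(int)
print(toy)
```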


## Build a logistic regression model

In [203]:
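# X_class, y_class and log_reg_model are not defined in the cells shown.
# Assumption: they follow the Step 1 definition, with y_class True for
# 'Good', 'Very Good' and 'Excellent' and False for 'Poor' and 'Average'
X_class = X_class_binary
y_class = zomato_df_clean[['rating_text_Good', 'rating_text_Very Good', 'rating_text_Excellent']].any(axis=1)

# Impute missing values in the feature set using median values
X_class_filled = X_class.fillna(X_class.median())

# Split the data again after filling missing values
X_train_class_filled, X_test_class_filled, y_train_class, y_test_class = train_test_split(X_class_filled, y_class, test_size=0.2, random_state=0)

# Build the Logistic Regression model with filled data
log_reg_model = LogisticRegression(max_iter=1000)
log_reg_model.fit(X_train_class_filled, y_train_class)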

# Predict on the test set
y_pred_class_filled = log_reg_model.predict(X_test_class_filled)

# Generate the confusion matrix
conf_matrix_filled = confusion_matrix(y_test_class, y_pred_class_filled)

# Calculate accuracy, precision, recall, and F1 score
accuracy = accuracy_score(y_test_class, y_pred_class_filled)
precision = precision_score(y_test_class, y_pred_class_filled)
recall = recall_score(y_test_class, y_pred_class_filled)
f1 = f1_score(y_test_class, y_pred_class_filled)

# Generate a full classification report
classification_rep = classification_report(y_test_class, y_pred_class_filled)

# Print confusion matrix and accuracy table
print("Confusion Matrix:")
print(conf_matrix_filled)

print("\nAccuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)

# Optionally print the full classification report
print("\nClassification Report:")
print(classification_rep)


Confusion Matrix:
[[941   0]
 [  0 476]]

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-Score: 1.0

Classification Report:
              precision    recall  f1-score   support

       False       1.00      1.00      1.00       941
        True       1.00      1.00      1.00       476

    accuracy                           1.00      1417
   macro avg       1.00      1.00      1.00      1417
weighted avg       1.00      1.00      1.00      1417



The confusion matrix for the logistic regression model is as follows:

[[941   0]
 [  0 476]]

This indicates that the model correctly classified all instances, with:

941 true negatives (correctly classified as Class 1: 'Poor' and 'Average' ratings)
476 true positives (correctly classified as Class 2: 'Good', 'Very Good', and 'Excellent' ratings)
0 false negatives and 0 false positives.

The model appears to have perfectly separated the two classes in this case.
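
As a reminder of how the reported metrics follow from the four cells of a confusion matrix, the sketch below works through a deliberately imperfect, made-up matrix. With zero errors, as in the matrix above, every metric reduces to 1.0.

```python
import numpy as np

# Illustrative confusion matrix in scikit-learn layout: rows = actual, columns = predicted
# [[TN, FP],
#  [FN, TP]]
conf = np.array([[90, 10],
                 [5, 45]])
tn, fp, fn, tp = conf.ravel()

accuracy = (tp + tn) / conf.sum()
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```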

#### Confusion matrix to report the results of using the classification model on the test data.

In [204]:
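# Binary target as implemented here: the one-hot column 'rating_text_Poor'
# True  -> 'Poor'
# False -> every other rating ('Average', 'Good', 'Very Good', 'Excellent')

zomato_df_clean['rating_binary'] = zomato_df_clean['rating_text_Poor']

# Intended to recode 0 as 1. The column is boolean and the reported labels stay
# False/True, so this call has no visible effect on the target
zomato_df_clean['rating_binary'] = zomato_df_clean['rating_binary'].replace({0: 1})

# Prepare the features (X) and target (y)
# Note: the rating_text_* dummy columns, including 'rating_text_Poor', are not dropped and remain in X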
X_class_binary = zomato_df_clean.drop(columns=['rating_number', 'rating_binary', 'address', 'cuisine', 'link', 'phone', 'title', 'subzone', 'votes', 'groupon', 'color', 'cost_2', 'cuisine_color'])
y_class_binary = zomato_df_clean['rating_binary']

# Impute missing values using the median for numeric columns
imputer = SimpleImputer(strategy='median')
X_class_binary_filled = imputer.fit_transform(X_class_binary)

# Split the data into training (80%) and test sets (20%)
X_train_class_binary, X_test_class_binary, y_train_class_binary, y_test_class_binary = train_test_split(X_class_binary_filled, y_class_binary, test_size=0.2, random_state=0)

# Build a Logistic Regression model
log_reg_model_binary = LogisticRegression(max_iter=1000)
log_reg_model_binary.fit(X_train_class_binary, y_train_class_binary)

# Predict on the test set
y_pred_class_binary = log_reg_model_binary.predict(X_test_class_binary)

# Generate the confusion matrix for this refined classification task
conf_matrix_binary_filled = confusion_matrix(y_test_class_binary, y_pred_class_binary)

# Calculate accuracy, precision, recall, and F1 score
accuracy = accuracy_score(y_test_class_binary, y_pred_class_binary)
precision = precision_score(y_test_class_binary, y_pred_class_binary)
recall = recall_score(y_test_class_binary, y_pred_class_binary)
f1 = f1_score(y_test_class_binary, y_pred_class_binary)

# Generate a full classification report
classification_rep = classification_report(y_test_class_binary, y_pred_class_binary)

# Display the confusion matrix and accuracy metrics
print("Confusion Matrix:")
print(conf_matrix_binary_filled)

print("\nAccuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)

# Optionally, print the full classification report
print("\nClassification Report:")
print(classification_rep)

Confusion Matrix:
[[1371    0]
 [   0   46]]

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-Score: 1.0

Classification Report:
              precision    recall  f1-score   support

       False       1.00      1.00      1.00      1371
        True       1.00      1.00      1.00        46

    accuracy                           1.00      1417
   macro avg       1.00      1.00      1.00      1417
weighted avg       1.00      1.00      1.00      1417




This code performs binary classification using logistic regression. It derives the target from the one-hot column rating_text_Poor, so True marks 'Poor' restaurants and False marks every other rating ('Average', 'Good', 'Very Good', 'Excellent'). Missing values in the features are filled using the median, and the data is split into 80% training and 20% testing sets. A logistic regression model is then trained on the training data and used to predict the test data. Finally, the model's performance is evaluated using a confusion matrix, which shows the counts of correct and incorrect classifications (True Positives, False Positives, True Negatives, False Negatives).
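
One thing worth checking before reading too much into the scores: the drop list in the cell above removes rating_number but leaves the rating_text_* dummy columns in the feature matrix, and the target is taken from rating_text_Poor. A model can then read the label directly from its inputs. The sketch below shows, on a made-up toy frame, one way to exclude such columns; the name `leaky_columns` is illustrative.

```python
import pandas as pd

# Toy frame with the same kind of columns as zomato_df_clean (values are made up)
toy = pd.DataFrame({
    "cost": [50.0, 80.0, 120.0],
    "rating_number": [4.0, 2.1, 4.9],
    "rating_text_Excellent": [False, False, True],
    "rating_text_Poor": [False, True, False],
    "type_['Pub']": [True, False, False],
})

# Columns that encode the rating itself and would leak the target into X
leaky_columns = [c for c in toy.columns if c.startswith("rating_text_")] + ["rating_number"]

y = toy["rating_text_Poor"]
X = toy.drop(columns=leaky_columns)
print(list(X.columns))
```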

### Draw your conclusions and observations about the performance of the model relevant to the classes’ distributions.



## Observations from the Classification Reports:
#### 1. First Report:

##### Class Distribution:
False (Class 1: 'Poor' and 'Average'): 941 instances
True (Class 2: 'Good', 'Very Good', and 'Excellent'): 476 instances
##### Performance:
Precision, Recall, F1-score: All are 1.00 for both classes, indicating perfect classification.
Accuracy: 100%, meaning all 1417 test instances were correctly classified.

#### 2. Second Report:

##### Class Distribution:
False (every rating other than 'Poor'): 1371 instances (the large majority)
True ('Poor'): 46 instances (a small minority)
##### Performance:
Precision, Recall, F1-score: Still 1.00 for both classes, indicating perfect classification.
Accuracy: 100%, as in the first report, with all 1417 test instances classified correctly.

##### Conclusions:
1. Class Imbalance:

i. In the second report the classes are heavily imbalanced: False has 1371 instances and True has only 46. Despite this imbalance, the model achieved perfect precision and recall.
ii. With so few True instances, the scores for that class rest on a small sample, so the robustness of the model deserves further investigation. Perfect scores on this split do not guarantee that the model generalizes to other datasets with a similar imbalance.

2. Perfect Metrics:

i. Both reports show perfect precision, recall, and F1-scores, as well as 100% accuracy, on the held-out test data. This may mean the classification task was relatively easy, but a likely explanation is target leakage: the rating_text_* dummy columns are not dropped from the feature set, and the target is derived from them. Overfitting alone would not normally produce perfect scores on a test set.

3. Evaluation:

i. While perfect classification is achieved, the class imbalance in the second report (with very few True instances) raises concerns about the model's ability to handle more balanced or different distributions in unseen datasets.
ii. In practical applications, handling class imbalance (for example with resampling techniques or class weights) might be required to ensure the model's robustness.
#### Recommendation:
While the results look impressive, further evaluation is necessary: the model should be retrained without the rating-derived features and tested on a more balanced or different test set to confirm that it is not overfitting or being biased by the class distribution.
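
One way to carry out such a check is stratified cross-validation with class weighting. The sketch below runs on synthetic imbalanced data, so the scores say nothing about the restaurant dataset; the 5% positive rate and the choice of F1 as the metric are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data: roughly 5% positives (illustrative only)
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0)

# class_weight='balanced' reweights classes inversely to their frequency
model = LogisticRegression(max_iter=1000, class_weight="balanced")

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
f1_scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print(f1_scores.mean(), f1_scores.std())
```

Reporting the spread across folds, rather than a single split, gives a better sense of how stable the minority-class score is.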

To repeat the classification task with three other models, I will select commonly used classifiers from scikit-learn and compare them against the logistic regression model used previously.

Selected Models:
1. Decision Tree Classifier
2. Random Forest Classifier
3. Support Vector Machine (SVM)

We will train these models and predict on the same dataset, then compare their performance using metrics like accuracy, precision, recall, and F1-score.

In [205]:
# Define a function to evaluate the performance of each model
def evaluate_model(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    # Confusion Matrix
    conf_matrix = confusion_matrix(y_test, y_pred)
    
    # Classification Report
    class_report = classification_report(y_test, y_pred)
    
    print(f"Confusion Matrix for {model.__class__.__name__}:\n", conf_matrix)
    print(f"\nClassification Report for {model.__class__.__name__}:\n", class_report)

# Models to compare
models = [
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    SVC(kernel='linear')  # Linear kernel for simplicity
]

# Train and evaluate each model
for model in models:
    print(f"\nEvaluating {model.__class__.__name__}...\n")
    evaluate_model(model, X_train_class_binary, X_test_class_binary, y_train_class_binary, y_test_class_binary)


Evaluating DecisionTreeClassifier...

Confusion Matrix for DecisionTreeClassifier:
 [[1371    0]
 [   0   46]]

Classification Report for DecisionTreeClassifier:
               precision    recall  f1-score   support

       False       1.00      1.00      1.00      1371
        True       1.00      1.00      1.00        46

    accuracy                           1.00      1417
   macro avg       1.00      1.00      1.00      1417
weighted avg       1.00      1.00      1.00      1417


Evaluating RandomForestClassifier...

Confusion Matrix for RandomForestClassifier:
 [[1371    0]
 [   0   46]]

Classification Report for RandomForestClassifier:
               precision    recall  f1-score   support

       False       1.00      1.00      1.00      1371
        True       1.00      1.00      1.00        46

    accuracy                           1.00      1417
   macro avg       1.00      1.00      1.00      1417
weighted avg       1.00      1.00      1.00      1417


Evaluating SVC...

### Explanation:
1. evaluate_model function: This function takes a model, trains it on the training data, predicts the results on the test data, and prints both the confusion matrix and the classification report.
2. DecisionTreeClassifier: A simple decision tree algorithm.
3. RandomForestClassifier: An ensemble method using multiple decision trees.
4. Support Vector Classifier (SVC): An SVM model with a linear kernel.

##### Steps:
1. Training: Each model is trained on the same training data.
2. Prediction: Each model predicts on the same test data.
3. Evaluation: For each model, a confusion matrix and classification report are generated to assess its performance.
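
To compare the models side by side rather than reading separate reports, the metrics can be collected into one table. This is a sketch on synthetic data; the fixed random_state values are added only to make the example repeatable and are not part of the original cell.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the restaurant features (illustrative only)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

models = [
    DecisionTreeClassifier(random_state=0),
    RandomForestClassifier(random_state=0),
    SVC(kernel="linear"),
]

# One row of metrics per model
rows = []
for model in models:
    y_pred = model.fit(X_train, y_train).predict(X_test)
    rows.append({
        "model": model.__class__.__name__,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    })

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))
```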