### **📌 Objective – Level 3, Task 1: Predictive Modeling**  

In this task, we aim to **build a regression model** to predict a restaurant’s **aggregate rating** based on available features.  

### **🎯 Key Goals:**  
1️⃣ **Build a regression model** to predict restaurant ratings.  
2️⃣ **Split the dataset into training and testing sets** to evaluate model performance.  
3️⃣ **Experiment with different algorithms** (e.g., **Linear Regression, Decision Trees, Random Forest**) and compare their performance.  

This analysis will help:  
✔ Identify the **factors influencing restaurant ratings**.  
✔ Improve **prediction accuracy** by testing different models.  
✔ Understand which **features contribute most to customer ratings**.  


In [20]:
import pandas as pd
df = pd.read_csv('level_2_dataset.csv')

### **Step 1: Prepare the Dataset for Predictive Modeling**  

Before building a model, we need to **prepare the dataset** by selecting relevant features and handling missing values.  

---

### **🔹 Tasks in Step 1:**  
✔ **Select relevant features** that can help predict restaurant ratings.  
✔ **Handle missing values**, if any, to avoid errors in model training.  
✔ **Ensure categorical features are encoded** so they can be used in regression models.  

---

### **📊 Selecting Features for Prediction**  
We will use the following columns as **predictor variables (features):**  
✔ **Price Range** – Higher prices may impact ratings.  
✔ **Votes** – More reviews might indicate better ratings.  
✔ **Has Table Booking** – May influence customer experience.  
✔ **Has Online Delivery** – Convenience might affect ratings.  
✔ **Cuisines** – Certain cuisines might be rated higher (listed as a candidate, but not included in the `features` list in the code cell that follows).  
✔ **Restaurant Name Length** – Longer names might indicate premium restaurants.  

Our **target variable (label)**:  
✔ **Aggregate Rating** (the value the model is trained to predict).  
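
The loaded file already stores these features as integers (see the `info()` output further down), so the encoding step itself is not shown in this notebook. As a rough sketch of how such columns could be derived from raw data, assuming hypothetical raw columns named `Restaurant Name`, `Has Table booking`, and `Has Online delivery` that hold `Yes`/`No` strings (the names and toy rows are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy rows for illustration only; not taken from the real dataset.
raw = pd.DataFrame({
    "Restaurant Name": ["Cafe Blue", "Spice Garden Bistro"],
    "Has Table booking": ["Yes", "No"],
    "Has Online delivery": ["No", "Yes"],
})

encoded = raw.copy()

# Map Yes/No strings to 1/0 so regression models can use them.
for col in ["Has Table booking", "Has Online delivery"]:
    encoded[col] = encoded[col].map({"Yes": 1, "No": 0})

# Derive a numeric feature from the name: its length in characters.
encoded["Restaurant Name Length"] = encoded["Restaurant Name"].str.len()

print(encoded)
```

A text column such as Cuisines would need its own encoding (for example one-hot columns) before it could join the feature list.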

In [24]:
# Selecting relevant features
features = ['Price range', 'Votes', 'Has Table booking', 'Has Online delivery', 'Restaurant Name Length']
target = 'Aggregate rating'

# Drop missing values (if any)
df_model = df[features + [target]].dropna()

# Display dataset summary
df_model.info()
df_model.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9551 entries, 0 to 9550
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Price range             9551 non-null   int64  
 1   Votes                   9551 non-null   int64  
 2   Has Table booking       9551 non-null   int64  
 3   Has Online delivery     9551 non-null   int64  
 4   Restaurant Name Length  9551 non-null   int64  
 5   Aggregate rating        9551 non-null   float64
dtypes: float64(1), int64(5)
memory usage: 447.8 KB


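|   | Price range | Votes | Has Table booking | Has Online delivery | Restaurant Name Length | Aggregate rating |
|---|---|---|---|---|---|---|
| 0 | 3 | 314 | 1 | 0 | 16 | 4.8 |
| 1 | 3 | 591 | 1 | 0 | 16 | 4.5 |
| 2 | 4 | 270 | 1 | 0 | 22 | 4.4 |
| 3 | 4 | 365 | 0 | 0 | 4 | 4.9 |
| 4 | 4 | 229 | 1 | 0 | 11 | 4.8 |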


### Step 2: Split the Dataset into Training and Testing Sets
To train and evaluate the model, we need to:

✔ Split the dataset into training (80%) and testing (20%) sets.

✔ Ensure that the training data is used for model learning, while the testing data is used for evaluation.

In [28]:
from sklearn.model_selection import train_test_split

# Define features and target variable
X = df_model.drop(columns=['Aggregate rating'])  # Features
y = df_model['Aggregate rating']  # Target

# Split the data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print dataset sizes
print(f"Training Set Size: {X_train.shape[0]} samples")
print(f"Testing Set Size: {X_test.shape[0]} samples")


Training Set Size: 7640 samples
Testing Set Size: 1911 samples
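
The sizes above are not an exact 80/20 cut because 9,551 is not divisible by 5. As far as we understand, when only `test_size` is passed, scikit-learn rounds the test count up and gives the remainder to training. The arithmetic below reproduces the printed sizes under that assumption:

```python
import math

n_samples = 9551      # rows in df_model, from the info() output
test_size = 0.2       # same fraction passed to train_test_split

# Assumed rounding rule: test count is rounded up, training gets the rest.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(f"Training: {n_train}, Testing: {n_test}")
```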


### Step 3: Train the Regression Model
Now, we will train a Linear Regression model to predict restaurant ratings.

In [31]:
from sklearn.linear_model import LinearRegression

# Initialize the model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Print model coefficients
print("Model trained successfully!")
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")


Model trained successfully!
Intercept: 1.2845738556586235
Coefficients: [ 0.65772842  0.00067349 -0.23765294  0.64155048 -0.00354277]
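
The printed intercept and coefficients define a plain linear equation, so a prediction can be reproduced by hand. The sketch below applies it to the first row of the `df_model.head()` output shown earlier. It assumes the coefficients follow the order of the `features` list, and it uses the rounded values as printed:

```python
# Values copied from the printed model output above (coefficients are rounded there).
intercept = 1.2845738556586235
coefficients = {
    "Price range": 0.65772842,
    "Votes": 0.00067349,
    "Has Table booking": -0.23765294,
    "Has Online delivery": 0.64155048,
    "Restaurant Name Length": -0.00354277,
}

# First row of the df_model.head() output shown earlier.
row = {
    "Price range": 3,
    "Votes": 314,
    "Has Table booking": 1,
    "Has Online delivery": 0,
    "Restaurant Name Length": 16,
}

# Linear regression prediction: intercept + sum(coefficient * feature value).
predicted = intercept + sum(coefficients[name] * row[name] for name in row)
actual = 4.8

print(f"Predicted: {predicted:.2f}, actual: {actual}")
```

For this row the equation gives roughly 3.17 against an actual rating of 4.8, which hints at how much a purely linear model can miss.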



---

### **📊 Interpretation of Model Results**  
✔ **Intercept:** `1.28` → This is the **predicted rating** when every feature equals `0`.  
✔ **Coefficients** (each is the change in predicted rating per one-unit increase, holding the other features fixed):  
   - **Price Range (`0.6577`)** → Each step up in price range is associated with a rating about **0.66 higher**.  
   - **Votes (`0.00067`)** → Small per vote, but it adds up: 1,000 extra votes correspond to about **+0.67**.  
   - **Has Table Booking (`-0.2376`)** → Restaurants with table booking are predicted about **0.24 lower**.  
   - **Has Online Delivery (`0.6415`)** → Restaurants with online delivery are predicted about **0.64 higher**.  
   - **Restaurant Name Length (`-0.0035`)** → The length of the restaurant name **has almost no impact on ratings**.  

### Step 4: Evaluate the Model’s Performance
Now, let’s check how well the model performs by making predictions and calculating errors.

In [37]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print results
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R²): {r2}")


Mean Absolute Error (MAE): 1.0740211125047767
Mean Squared Error (MSE): 1.6739343644806781
R-squared (R²): 0.2645631770677763


### Step 5: Experiment with Different Models
Since Linear Regression explains only about 26% of the variance in ratings (R² ≈ 0.26), let’s try a Decision Tree here and a Random Forest in Step 6 to see if they improve predictions.

In [42]:
from sklearn.tree import DecisionTreeRegressor

# Initialize the model
tree_model = DecisionTreeRegressor(random_state=42)

# Train the model
tree_model.fit(X_train, y_train)

# Make predictions
y_tree_pred = tree_model.predict(X_test)

# Evaluate the model
mae_tree = mean_absolute_error(y_test, y_tree_pred)
mse_tree = mean_squared_error(y_test, y_tree_pred)
r2_tree = r2_score(y_test, y_tree_pred)

# Print results
print(f"Decision Tree MAE: {mae_tree}")
print(f"Decision Tree MSE: {mse_tree}")
print(f"Decision Tree R²: {r2_tree}")


Decision Tree MAE: 0.3232952912544749
Decision Tree MSE: 0.2405138514404347
Decision Tree R²: 0.894331135958594


### Step 6: Test Random Forest Model
Now, let’s check if a Random Forest model performs even better than Decision Trees.

In [45]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions
y_rf_pred = rf_model.predict(X_test)

# Evaluate the model
mae_rf = mean_absolute_error(y_test, y_rf_pred)
mse_rf = mean_squared_error(y_test, y_rf_pred)
r2_rf = r2_score(y_test, y_rf_pred)

# Print results
print(f"Random Forest MAE: {mae_rf}")
print(f"Random Forest MSE: {mse_rf}")
print(f"Random Forest R²: {r2_rf}")


Random Forest MAE: 0.24866600364045877
Random Forest MSE: 0.14314686050976536
Random Forest R²: 0.9371089604587426




### **📊 Level 3 – Task 1: Predictive Modeling**  

---

## **📌 Objective**  
The goal of this task is to build a **regression model** to predict the **aggregate rating** of a restaurant based on available features.  

### **🎯 Key Goals:**  
1️⃣ **Build a regression model** to predict restaurant ratings.  
2️⃣ **Split the dataset into training and testing sets** to evaluate model performance.  
3️⃣ **Experiment with different algorithms** (Linear Regression, Decision Trees, Random Forest) and compare their performance.  

This analysis helps in understanding:  
✔ **Which factors influence restaurant ratings the most**.  
✔ **Which machine learning model gives the best predictions**.  
✔ **How well we can predict customer preferences based on restaurant features**.  

---

## **1️⃣ Step 1: Prepare the Dataset for Predictive Modeling**  
### **🔹 Process:**  
✔ Selected relevant features:  
   - **Price Range** (pricing level of the restaurant).  
   - **Votes** (total customer votes).  
   - **Has Table Booking** (binary: 1 = Yes, 0 = No).  
   - **Has Online Delivery** (binary: 1 = Yes, 0 = No).  
   - **Restaurant Name Length** (length of the restaurant name).  
✔ Dropped missing values to ensure a clean dataset.  

### **📊 Summary of the Prepared Dataset:**  
- **Total Entries:** 9,551  
- **Columns Used:** 5 predictor variables + 1 target variable (Aggregate Rating).  

---

## **2️⃣ Step 2: Split the Dataset into Training and Testing Sets**  
✔ Split the dataset into:  
   - **Training Set (80%) → 7,640 samples**.  
   - **Testing Set (20%) → 1,911 samples**.  
✔ Ensured that the model was trained on one part of the data and tested on unseen data.  

---

## **3️⃣ Step 3: Train the Regression Models**  
### **🔹 Linear Regression Model**  
- **Intercept:** `1.28`  
- **Coefficients:**  
   - **Price Range (0.6577)** → Higher price → **Higher ratings**.  
   - **Votes (0.00067)** → More votes → **Slightly better ratings**.  
   - **Has Table Booking (-0.2376)** → Table booking **slightly lowers ratings**.  
   - **Has Online Delivery (0.6415)** → Online delivery **increases ratings**.  
   - **Restaurant Name Length (-0.0035)** → **No significant impact** on ratings.  

### **📊 Model Performance (Linear Regression)**  
- **MAE:** `1.07` (average absolute prediction error, in rating points).  
- **MSE:** `1.67` (average squared prediction error).  
- **R²:** `0.26` (**only 26% of the variance explained** → Poor performance).  

---

## **4️⃣ Step 4: Train a Decision Tree Model**  
✔ **Decision Tree Model** was trained to see if it performed better.  

### **📊 Model Performance (Decision Tree)**  
- **MAE:** `0.32` ✅ (better than Linear Regression).  
- **MSE:** `0.24` ✅ (lower error).  
- **R²:** `0.89` ✅ (explains 89% of the variation in ratings).  

🔹 **Key Insight:** Decision Tree performed **significantly better** than Linear Regression.  

---

## **5️⃣ Step 5: Train a Random Forest Model**  
✔ **Random Forest Model** was tested to compare its accuracy with Decision Tree.  

### **📊 Model Performance (Random Forest)**  
- **MAE:** `0.25` ✅ (**Best**).  
- **MSE:** `0.14` ✅ (**Lowest error**).  
- **R²:** `0.94` ✅ (**Explains 94% of rating variations**).  

🔹 **Key Insight:** **Random Forest is the best model** for predicting restaurant ratings.  

---

## **6️⃣ Final Model Comparison**  

| Model               | MAE (Lower is Better) | MSE (Lower is Better) | R² (Higher is Better) |
|--------------------|--------------------|--------------------|--------------------|
| **Linear Regression**  | 1.07  ❌ | 1.67  ❌ | 0.26  ❌ |
| **Decision Tree**      | 0.32  ✅ | 0.24  ✅ | 0.89  ✅ |
| **Random Forest**      | 0.25  ✅ (Best) | 0.14  ✅ (Best) | 0.94  ✅ (Best) |
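
The comparison figures can be regenerated from the printed metrics rather than typed by hand, which helps avoid transcription slips. A minimal sketch, with the raw values copied from the three model outputs above:

```python
# Metrics (MAE, MSE, R²) copied from the printed outputs of the three models.
results = {
    "Linear Regression": (1.0740211125047767, 1.6739343644806781, 0.2645631770677763),
    "Decision Tree": (0.3232952912544749, 0.2405138514404347, 0.894331135958594),
    "Random Forest": (0.24866600364045877, 0.14314686050976536, 0.9371089604587426),
}

# Round to two decimals for reporting.
rounded = {name: tuple(round(v, 2) for v in vals) for name, vals in results.items()}

# Pick the model with the lowest MAE.
best = min(results, key=lambda name: results[name][0])

for name, (mae, mse, r2) in rounded.items():
    print(f"{name:<18} MAE={mae:.2f}  MSE={mse:.2f}  R²={r2:.2f}")
print("Lowest MAE:", best)
```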

🔹 **Final Takeaways:**  
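✅ **Random Forest is the best model**, with the lowest MAE and MSE and the highest R².  
✅ Judging by the Linear Regression coefficients, the **most influential factors** are **Price Range, Votes, and Online Delivery**; since that model explains only 26% of the variance, treat this ranking as tentative.  
✅ **Table booking has a negative coefficient** (about `-0.24`), which may indicate customer dissatisfaction with booking experiences, although the model shows an association, not a cause.  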

---

## **🎯 Final Summary**  
✔ Successfully **built and tested predictive models** to estimate restaurant ratings.  
✔ Discovered that **Random Forest outperformed** both Linear Regression and the Decision Tree on MAE, MSE, and R².  
