# Task 4: Determining the Appropriate Machine Learning Model for Each Dataset

## Introduction  
This notebook presents a detailed analysis to determine the suitable machine learning model type (Regression or Classification) for each dataset in the GoRiyadh project. Based on the exploratory data analysis performed in Task 2 (files: `Task_2_Hotels-2.ipynb`, `Task_2_Restaurants.ipynb`, and `GoRiyadh_Cafe_dataset_analysis.ipynb`), we will identify the target variable, propose relevant algorithms, and justify our choices.  

The ultimate goal of the project is to assist tourists in estimating costs and making informed decisions about hotels, restaurants, and cafes in Riyadh.

## 1. Hotels Dataset

### Data Overview (from Task 2)
- **Columns:** `hotel_id`, `hotel_name`, `price`, `base_price`, `checkIn`, `checkOut`, `count` (number of reviews), `rating`, `Info` (services), `latitude`, `longitude`.
- **Target Variable:** `price` (nightly rate) – a continuous numerical value.
- **Key Characteristics:**
  - Missing values in `price` (~1.3%) and in geographic coordinates (~9%).
  - Outliers present (luxury hotels with extremely high prices).
  - Rating distribution: 49.1% of hotels have a rating of 4.5, with ratings ranging from 2.5 to 5.
  - Review counts vary from 1 to 1626.
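
The characteristics above suggest a few preparation steps before any model is fitted. The following is a minimal sketch, not the project's actual pipeline: `prepare_hotels` is a hypothetical helper, the sample rows are invented, and both median imputation for coordinates and a `log1p` transform for price are assumed strategies.

```python
import numpy as np
import pandas as pd


def prepare_hotels(df: pd.DataFrame) -> pd.DataFrame:
    """Clean a hotels frame for price regression (illustrative only)."""
    out = df.copy()
    # Rows without a target value cannot be used for supervised training.
    out = out.dropna(subset=["price"])
    # Assumed strategy: fill missing coordinates with the column median.
    for col in ["latitude", "longitude"]:
        out[col] = out[col].fillna(out[col].median())
    # log1p compresses the long right tail created by luxury hotels.
    out["log_price"] = np.log1p(out["price"])
    return out.reset_index(drop=True)


# Tiny made-up sample, not taken from the real dataset.
sample = pd.DataFrame(
    {
        "price": [300.0, 450.0, np.nan, 5200.0],
        "rating": [4.5, 4.0, 3.5, 5.0],
        "latitude": [24.71, np.nan, 24.69, 24.80],
        "longitude": [46.67, 46.70, np.nan, 46.62],
    }
)
clean = prepare_hotels(sample)
print(clean)
```

In a real workflow the median would be computed on the training split only, so that no information from the test rows leaks into the imputation.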

### Model Type: **Regression**  
The goal is to predict a continuous numerical value (`price`) from features such as location, rating, and services.

### Proposed Algorithms
- **Linear Regression** – A simple, interpretable baseline model.
- **Random Forest Regressor** – Handles non-linear relationships, overfits less than a single decision tree, and provides feature importances.
- **XGBoost Regressor** – High predictive accuracy, handles missing values natively, and works well on mixed tabular features.

### Justification
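- Outliers and high price variability favor tree-based ensembles (Random Forest, XGBoost), which are less sensitive than linear regression to extreme feature values; extreme values in the target itself may still call for a log transform or a robust loss.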
- Geographic features (latitude/longitude) can be leveraged to capture location-based price trends.
- Linear regression serves as a quick benchmark to evaluate more complex models.
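
The benchmark idea above can be sketched as a cross-validated comparison between the linear baseline and a Random Forest. Everything in this example is an assumption made for illustration: the data is synthetic, the price rule and the centre point (24.7, 46.7) are invented, and XGBoost is left out only to keep the example dependent on scikit-learn alone.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300
# Synthetic stand-ins for latitude, longitude, and rating (not real data).
lat = rng.uniform(24.5, 25.0, n)
lon = rng.uniform(46.5, 47.0, n)
rating = rng.uniform(2.5, 5.0, n)
# Made-up rule: prices fall with distance from an arbitrary centre point.
dist = np.hypot(lat - 24.7, lon - 46.7)
price = 200 + 150 * rating + 400 * np.exp(-20 * dist) + rng.normal(0, 30, n)

X = np.column_stack([lat, lon, rating])
models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
mae = {}
for name, model in models.items():
    # scikit-learn returns negated MAE so that higher is always better.
    scores = cross_val_score(model, X, price, cv=5, scoring="neg_mean_absolute_error")
    mae[name] = -scores.mean()
    print(f"{name}: MAE = {mae[name]:.1f}")
```

Mean absolute error is used here because it is expressed in the same unit as the price and is less dominated by a few luxury hotels than squared error.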

## 2. Restaurants Dataset

### Data Overview (from Task 2)
- **Columns:** `name`, `categories` (cuisine type), `address`, `lat`, `lng`, `price` (price category), `likes`, `photos`, `tips`, `rating`, `ratingsSignals`.
- **Target Variable:** `price` – a categorical variable with five classes:  
  - `Cheap` (55.8%)  
  - `Moderate` (31.2%)  
  - `Unknown` (10.4%), likely a placeholder for a missing price label rather than a true price level  
  - `Expensive` (2.1%)  
  - `Very Expensive` (0.5%)  
- **Key Characteristics:**
  - Significant class imbalance (Cheap and Moderate dominate).
  - Missing values in `rating` and `ratingsSignals` (~59% missing).
  - Rich set of predictors: cuisine type, location, popularity metrics (likes, photos, tips), and review information.

### Model Type: **Multiclass Classification**  
The objective is to predict the price category of a restaurant using its features.

### Proposed Algorithms
- **Logistic Regression (Multinomial)** – Simple and efficient for multi-class problems, provides probabilistic outputs.
- **Random Forest Classifier** – Handles non-linear patterns, resistant to overfitting, and captures feature interactions.
- **XGBoost Classifier** – State-of-the-art gradient boosting, supports missing values natively, and can handle class imbalance via weighted training.
- **Support Vector Machine (SVM) with RBF Kernel** – Effective in high-dimensional spaces and when classes are not linearly separable.

### Justification
- The target is categorical, so classification is the natural choice.
- The severe class imbalance requires algorithms that can incorporate class weights or resampling techniques (e.g., SMOTE).
- XGBoost handles missing predictor values natively; Random Forest support depends on the implementation and version (older scikit-learn releases require imputation first).
- The variety of feature types (numerical, categorical, spatial) makes ensemble methods particularly suitable.
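
The points about class imbalance and missing predictors can be sketched with a scikit-learn pipeline. This is an illustration under stated assumptions, not the project's model: the data is synthetic, the class counts are invented, `Unknown` rows are assumed to have been removed beforehand, and class weighting is shown instead of SMOTE to avoid an extra dependency.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Made-up, imbalanced labels; "Unknown" rows are assumed to be excluded.
classes = ["Cheap", "Moderate", "Expensive", "Very Expensive"]
counts = [400, 150, 40, 10]
y = np.repeat(classes, counts)
level = np.repeat(np.arange(4), counts)

# Synthetic predictors loosely tied to the price level.
likes = rng.poisson(20 + 30 * level).astype(float)
rating = 6.0 + 0.8 * level + rng.normal(0, 0.7, level.size)
rating[rng.random(level.size) < 0.59] = np.nan  # mimic ~59% missing ratings
X = np.column_stack([likes, rating])

clf = make_pipeline(
    SimpleImputer(strategy="median"),  # explicit imputation for missing ratings
    RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=0
    ),
)
# Macro F1 weights every class equally, so rare classes are not drowned out.
f1 = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
print(f"macro F1 per fold: {np.round(f1, 2)}")
```

Plain accuracy would look good here even for a model that always predicts `Cheap`, which is why a per-class metric such as macro F1 is the safer choice for this target.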

### Optional Consideration
If the price categories are later converted to ordinal numeric values (e.g., 1=Cheap, 2=Moderate, …), **Ordinal Regression** could be explored. However, given the current labels, classification remains the most appropriate approach.
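
As a minimal sketch of that conversion, the labels can be mapped to ordered integers with pandas. `PRICE_ORDER` and `encode_price` are hypothetical names introduced for this example, and leaving `Unknown` unmapped (as a missing value) is an assumption rather than a decision taken in the project.

```python
import pandas as pd

# Assumed ordering of the price labels, from lowest to highest.
PRICE_ORDER = ["Cheap", "Moderate", "Expensive", "Very Expensive"]


def encode_price(labels: pd.Series) -> pd.Series:
    """Map price labels to 1..4; any other label (e.g. 'Unknown') becomes <NA>."""
    mapping = {label: rank for rank, label in enumerate(PRICE_ORDER, start=1)}
    # Labels missing from the mapping turn into NaN, then into <NA> under Int64.
    return labels.map(mapping).astype("Int64")


example = pd.Series(["Cheap", "Unknown", "Very Expensive", "Moderate"])
encoded = encode_price(example)
print(encoded.tolist())
```

The nullable `Int64` dtype keeps the codes as integers while still allowing unlabeled rows to be filtered out with `dropna()`.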

## 3. Cafes Dataset

### Data Overview (from Task 2)
- **Columns:** `coffeeName`, `rating`, `rating_count`, `url`, `24_hours` (boolean), `lon`, `lat`.
- **Available Target Variables:**  
  - `rating` (continuous, 1.0–5.0) – currently the most complete numerical feature.  
  - **No price column** is present in the provided data.
- **Key Characteristics:**
  - 62% of cafes have a rating above 4.0.
  - `rating_count` ranges up to 10,000+, with a right-skewed distribution.
  - 27% of cafes operate 24 hours (710 out of 2605).
  - Geographic clustering observed in specific commercial areas.

### Current Objective (Price Unavailable)
Given the absence of price information, the immediate feasible target is **rating prediction** – estimating a cafe's rating based on location, number of reviews, and 24-hour operation. This still provides value to tourists seeking high-quality cafes.

### Model Type: **Regression**  
Because rating is a continuous value, this is a regression problem.

### Proposed Algorithms (for Rating Prediction)
- **Linear Regression** – Baseline model.
- **Random Forest Regressor** – Captures non-linear effects and interactions.
- **XGBoost Regressor** – Handles missing data natively and often performs strongly on tabular data.

### Future Objective (When Price Data Becomes Available)
Once price information is collected, the primary goal shifts to **price prediction** (also Regression), using the same algorithms. The cafe data would then align perfectly with the project's cost-estimation theme.

### Justification for Current Choice
- Rating is a continuous variable, making regression appropriate.
- The combination of numerical (`rating_count`), binary (`24_hours`), and spatial (`lat`, `lon`) features can be effectively modeled by tree-based ensembles.
- Linear regression offers a simple benchmark.
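
A rough sketch of the rating-prediction setup is shown below. The column names follow the cafes dataset, but the values and the relationship between review count and rating are invented. scikit-learn's `GradientBoostingRegressor` is swapped in for XGBoost to avoid an extra dependency, and a mean-only `DummyRegressor` is added as a reference point.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# Synthetic cafes; column names follow the dataset, values are invented.
cafes = pd.DataFrame(
    {
        "rating_count": rng.lognormal(mean=4.0, sigma=1.5, size=n).round(),
        "24_hours": rng.random(n) < 0.27,
        "lat": rng.uniform(24.5, 25.0, n),
        "lon": rng.uniform(46.5, 47.0, n),
    }
)
# Invented relationship so the example has some signal to learn.
cafes["rating"] = np.clip(
    3.5 + 0.15 * np.log1p(cafes["rating_count"]) + rng.normal(0, 0.3, n), 1.0, 5.0
)

X = pd.DataFrame(
    {
        "log_rating_count": np.log1p(cafes["rating_count"]),  # tame the right skew
        "is_24_hours": cafes["24_hours"].astype(int),
        "lat": cafes["lat"],
        "lon": cafes["lon"],
    }
)
y = cafes["rating"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

results = {}
for name, model in {
    "mean_baseline": DummyRegressor(strategy="mean"),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}.items():
    model.fit(X_train, y_train)
    results[name] = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = {results[name]:.3f}")
```

Because most ratings cluster above 4.0, a model is only useful if it clearly beats the mean baseline, which makes that comparison worth reporting alongside any headline error figure.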

## Task 4: Model Selection Summary

| Dataset | Model Type | Proposed Algorithms | Target Variable |
| :--- | :--- | :--- | :--- |
| Hotels | Regression | Linear Regression, Random Forest Regressor, XGBoost Regressor | price (nightly rate) |
| Restaurants | Multiclass Classification | Logistic Regression (multinomial), Random Forest Classifier, XGBoost Classifier, SVM | price (category) |
| Cafes (Current) | Regression | Linear Regression, Random Forest Regressor, XGBoost Regressor | rating |
| Cafes (Future) | Regression | Linear Regression, Random Forest Regressor, XGBoost Regressor | price (when available) |