# Ride Price Estimation System

This notebook implements an end-to-end machine learning workflow for estimating ride prices on a synthetic dataset that I generate in code. It covers the required steps: problem framing, dataset design and justification, data exploration, cleaning and feature engineering, regression and classification models, evaluation, and ethical reflection.

## 1. ML Mindset & Problem Framing

**Problem type:** Supervised learning, regression (predict continuous `ride_price`).

**Real-world scenario:** Estimating the price of a ride (similar to taxi or ride-hailing) based on trip and contextual factors such as distance, duration, time of day, traffic, and demand.

**Why ML instead of fixed rules?**
- Relationships between features and price are not perfectly linear or obvious.
- Factors like traffic, weather, and demand interact in complex ways.
- A fixed formula would be brittle and hard to tune, while an ML model can learn patterns from historical data.

**What the model should learn:**
- Longer distance and duration generally increase price.
- Peak hours, high demand, and heavy traffic tend to increase price (e.g., surge-like behavior).
- Some conditions (e.g., very bad weather) may add extra fees.

In this project we will also create a **classification** target (`high_cost` yes/no) to identify whether a ride is considered expensive compared to typical rides.
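
One way such a binary target could be derived is by thresholding the price at a quantile. The following is a minimal sketch only: the 75th-percentile cutoff and the helper name `add_high_cost_label` are assumptions made for illustration, not necessarily the definition used later in this notebook.

```python
import pandas as pd


def add_high_cost_label(df: pd.DataFrame, quantile: float = 0.75) -> pd.DataFrame:
    """Return a copy of df with a binary `high_cost` column (hypothetical helper).

    Assumption: a ride counts as expensive when its price is strictly above
    the chosen quantile of all ride prices in df.
    """
    threshold = df["ride_price"].quantile(quantile)
    labeled = df.copy()
    labeled["high_cost"] = (labeled["ride_price"] > threshold).astype(int)
    return labeled


# Tiny illustrative frame, not the project dataset
toy_prices = pd.DataFrame({"ride_price": [8.0, 10.0, 12.0, 14.0, 40.0]})
toy_labeled = add_high_cost_label(toy_prices)
print(toy_labeled)
```

If a data-dependent threshold like this is used, computing it on the training split only avoids leaking information from the test set into the labels.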

In [None]:
# 2. Imports and basic configuration

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, confusion_matrix

np.random.seed(42)
plt.style.use("seaborn-v0_8")

## 3. Dataset Design and Justification

I create a synthetic dataset that mimics a ride-hailing platform.

**Features (inputs):**
1. `distance_km` (numeric): Distance of the trip in kilometers. Longer trips usually cost more.
2. `duration_min` (numeric): Trip duration in minutes. Captures time-based pricing (e.g., traffic jams increase time).
3. `time_of_day` (categorical: `morning`, `afternoon`, `evening`, `night`): Prices may be higher during peak hours (morning/evening).
4. `traffic_level` (categorical: `low`, `medium`, `high`): High traffic increases duration and often price.
5. `weather` (categorical: `clear`, `rainy`, `stormy`): Bad weather may increase prices (higher risk, slower traffic) or add extra fees.
6. `demand_level` (categorical: `low`, `normal`, `high`): Models surge pricing when demand is high.
7. `pickup_zone` (categorical: `city_center`, `suburbs`, `airport`): Some locations (e.g., airport) often have higher base fares.

**Target variable:**
- `ride_price` (continuous): Final ride price in a chosen currency.
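
The schema above can also be written down as a small check that runs before modeling. This is a sketch under assumptions: `validate_rides` is a hypothetical helper, and the rule that numeric columns must be non-negative is my own addition. The allowed categories are copied from the feature list.

```python
import pandas as pd

# Allowed categories, copied from the feature list above
ALLOWED_CATEGORIES = {
    "time_of_day": {"morning", "afternoon", "evening", "night"},
    "traffic_level": {"low", "medium", "high"},
    "weather": {"clear", "rainy", "stormy"},
    "demand_level": {"low", "normal", "high"},
    "pickup_zone": {"city_center", "suburbs", "airport"},
}
NUMERIC_COLUMNS = ["distance_km", "duration_min", "ride_price"]


def validate_rides(df: pd.DataFrame) -> list:
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []
    for col in NUMERIC_COLUMNS + list(ALLOWED_CATEGORIES):
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    for col in NUMERIC_COLUMNS:
        # Assumption: distances, durations, and prices are never negative
        if col in df.columns and (df[col] < 0).any():
            problems.append(f"negative values in {col}")
    for col, allowed in ALLOWED_CATEGORIES.items():
        if col in df.columns:
            unexpected = set(df[col].dropna().unique()) - allowed
            if unexpected:
                problems.append(f"unexpected values in {col}: {sorted(unexpected)}")
    return problems


good_row = pd.DataFrame({
    "distance_km": [5.0], "duration_min": [12.0], "time_of_day": ["morning"],
    "traffic_level": ["low"], "weather": ["clear"], "demand_level": ["normal"],
    "pickup_zone": ["suburbs"], "ride_price": [11.5],
})
bad_row = good_row.assign(weather="foggy", distance_km=-1.0)

print(validate_rides(good_row))
print(validate_rides(bad_row))
```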

**Feature justification summary:**
- Distance and duration: core components of almost any ride pricing algorithm.
- Time of day and traffic: capture peak hour effects and congestion.
- Weather and demand: capture dynamic pricing behavior.
- Pickup zone: captures fixed-area surcharges (e.g., airport fees, city-center congestion charges).

**Feature considered but excluded:**
- `driver_rating` (1–5 stars). I decided **not** to include this because rating is subjective, can be highly biased, and is not typically used directly for pricing in many systems (it is more related to quality control and matching than to fare calculation). Including it could also introduce ethical issues (e.g., unfairly charging more due to biased ratings).

In [None]:
# 4. Synthetic dataset generation (data created by me in code)

# NOTE: For the GitHub version of this project, the same logic also exists in
# `generate_rides.py`, which saves the dataset to `data/rides.csv`.
# Here we regenerate the data inside the notebook so it is fully reproducible.

n_samples = 600  # more than the minimum 150 rows for a richer dataset

# Numeric features
distance_km = np.round(np.random.uniform(1, 25, size=n_samples), 2)  # 1-25 km
base_speed_kmph = np.random.normal(30, 5, size=n_samples)  # average speed
base_speed_kmph = np.clip(base_speed_kmph, 10, 60)

# Duration roughly distance / speed * 60, with some noise
true_duration = distance_km / base_speed_kmph * 60
noise_duration = np.random.normal(0, 5, size=n_samples)
duration_min = np.clip(true_duration + noise_duration, 5, 90)

# Categorical features
time_of_day = np.random.choice(["morning", "afternoon", "evening", "night"], size=n_samples, p=[0.25, 0.3, 0.3, 0.15])
traffic_level = np.random.choice(["low", "medium", "high"], size=n_samples, p=[0.3, 0.4, 0.3])
weather = np.random.choice(["clear", "rainy", "stormy"], size=n_samples, p=[0.7, 0.25, 0.05])
demand_level = np.random.choice(["low", "normal", "high"], size=n_samples, p=[0.2, 0.5, 0.3])
pickup_zone = np.random.choice(["city_center", "suburbs", "airport"], size=n_samples, p=[0.4, 0.4, 0.2])

# Price construction (not visible to model, only to us):
# base fare + distance component + time component + surcharges
base_fare = 2.0
price_per_km = 0.8
price_per_min = 0.3

price = base_fare + distance_km * price_per_km + duration_min * price_per_min

# Multiplicative effects: look up one multiplier per ride (1.0 = no change)
time_mult = {"morning": 1.15, "evening": 1.15, "night": 1.05}  # morning/evening = peak
traffic_mult = {"high": 1.2, "medium": 1.05}
weather_mult = {"rainy": 1.05, "stormy": 1.15}
demand_mult = {"high": 1.3, "low": 0.9}  # surge / low-demand discount

for values, mult in [
    (time_of_day, time_mult),
    (traffic_level, traffic_mult),
    (weather, weather_mult),
    (demand_level, demand_mult),
]:
    price *= np.array([mult.get(v, 1.0) for v in values])

# Additive pickup-zone surcharges (airport fee, city-center charge)
zone_fee = {"airport": 5.0, "city_center": 1.5}
price += np.array([zone_fee.get(z, 0.0) for z in pickup_zone])

# Add random noise
price = price + np.random.normal(0, 2.5, size=n_samples)
price = np.round(np.clip(price, 5, None), 2)

# Build DataFrame
data = pd.DataFrame({
    "distance_km": distance_km,
    "duration_min": duration_min,
    "time_of_day": time_of_day,
    "traffic_level": traffic_level,
    "weather": weather,
    "demand_level": demand_level,
    "pickup_zone": pickup_zone,
    "ride_price": price,
})

data.head()
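
The pricing rule in the generation cell can be checked by hand for a single ride. The sketch below reuses the constants and multipliers from that cell and leaves out the random noise and the final clipping. The function name `noiseless_price` is a hypothetical helper introduced only for this illustration.

```python
# Constants and multipliers copied from the generation cell; noise is left out
BASE_FARE, PRICE_PER_KM, PRICE_PER_MIN = 2.0, 0.8, 0.3
TIME_MULT = {"morning": 1.15, "afternoon": 1.0, "evening": 1.15, "night": 1.05}
TRAFFIC_MULT = {"low": 1.0, "medium": 1.05, "high": 1.2}
WEATHER_MULT = {"clear": 1.0, "rainy": 1.05, "stormy": 1.15}
DEMAND_MULT = {"low": 0.9, "normal": 1.0, "high": 1.3}
ZONE_FEE = {"city_center": 1.5, "suburbs": 0.0, "airport": 5.0}


def noiseless_price(distance_km, duration_min, time_of_day, traffic_level,
                    weather, demand_level, pickup_zone):
    """Deterministic part of the synthetic price (hypothetical helper)."""
    price = BASE_FARE + distance_km * PRICE_PER_KM + duration_min * PRICE_PER_MIN
    price *= TIME_MULT[time_of_day]
    price *= TRAFFIC_MULT[traffic_level]
    price *= WEATHER_MULT[weather]
    price *= DEMAND_MULT[demand_level]
    return price + ZONE_FEE[pickup_zone]


# Same 10 km / 20 min trip under quiet and busy conditions
quiet = noiseless_price(10, 20, "afternoon", "low", "clear", "normal", "suburbs")
busy = noiseless_price(10, 20, "evening", "high", "clear", "high", "airport")
print(f"quiet ride: {quiet:.2f}, busy ride: {busy:.2f}")
```

The quiet ride costs 2 + 8 + 6 = 16.00. The busy ride costs 16 × 1.15 × 1.2 × 1.3 + 5 = 33.704, so context more than doubles the price of an identical trip.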

## 5. Load / Save Dataset

To keep the project reproducible and **separate code from data**, I:

- Generate the synthetic data in this notebook.
- Save it to `../data/rides.csv` so that it can be reused and submitted to GitHub.
- In a real project, this step would be replaced by loading a real CSV exported from a database or data warehouse.

In [None]:
# Save dataset to CSV for the repository (relative to this notebook)
from pathlib import Path

csv_path = Path("../data/rides.csv")
csv_path.parent.mkdir(parents=True, exist_ok=True)

data.to_csv(csv_path, index=False)
print("Saved CSV to:", csv_path.resolve())

# (Optional) reload from CSV to show that it works
loaded = pd.read_csv(csv_path)
loaded.head()

## 6. Data Exploration & Understanding

In this section, I inspect the dataset to understand its structure and quality.
I look at:

- Basic info (column types, non-null counts)
- Summary statistics for numerical variables
- Category distributions for categorical variables
- Simple visualizations to understand relationships and potential outliers.

In [None]:
# Basic info
print("Shape:", data.shape)
print("\nInfo:")
data.info()  # info() prints its report directly and returns None

In [None]:
# Summary statistics for numeric columns
numeric_cols = ["distance_km", "duration_min", "ride_price"]
data[numeric_cols].describe()

In [None]:
# Category distributions
cat_cols = ["time_of_day", "traffic_level", "weather", "demand_level", "pickup_zone"]
for col in cat_cols:
    print(f"\nValue counts for {col}:")
    print(data[col].value_counts())

In [None]:
# Visualizations: relationships and outliers

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

sns.scatterplot(ax=axes[0], x="distance_km", y="ride_price", data=data, alpha=0.6)
axes[0].set_title("Distance vs Ride Price")

sns.scatterplot(ax=axes[1], x="duration_min", y="ride_price", data=data, alpha=0.6)
axes[1].set_title("Duration vs Ride Price")

plt.tight_layout()
plt.show()

# Boxplot for price distribution and potential outliers
plt.figure(figsize=(4, 4))
sns.boxplot(y="ride_price", data=data)
plt.title("Ride Price Distribution")
plt.show()

From the plots and summaries we can see:

- `ride_price` increases with both `distance_km` and `duration_min`, as expected.
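- The boxplot shows a few high-price rides, which come from long trips combined with peak hours, heavy traffic, bad weather, or high demand (the largest multiplier, 1.3).
- Most categories are well represented, which is good for model training. The rarest is `stormy` weather, sampled with probability 0.05 (about 30 of 600 rows in expectation), so its estimated effect will be the noisiest.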
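
The boxplot observation above can be turned into a count with the usual 1.5 × IQR rule, which is also seaborn's default whisker setting. This is a minimal sketch on made-up prices: `iqr_outliers` is a hypothetical helper and the numbers are not taken from the project dataset.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up prices with one obviously expensive ride
toy_ride_prices = pd.Series(
    [10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 60.0], name="ride_price"
)
outlier_mask = iqr_outliers(toy_ride_prices)
print(toy_ride_prices[outlier_mask])
```

On the real data the same mask could be applied to `data["ride_price"]` to see which combinations of features produce the flagged rides.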

Next, we formalize the preprocessing steps so that data cleaning and feature engineering are done **inside a reproducible pipeline**.

## 7. Data Cleaning & Feature Engineering

Here I prepare the data for modeling. I:

- Separate features (`X`) and target (`y`).
- Define which columns are numerical vs categorical.
- Build a preprocessing pipeline that:
  - Imputes missing values (mean for numeric, most frequent for categorical) – important for real data even if our synthetic data has no missing values.
  - Scales numerical features (so that distance and duration are on a comparable scale).
  - One-hot encodes categorical variables (so models can use them).

Poor data quality (missing values, wrong labels, extreme outliers) could:
- Make the model learn noise instead of real patterns.
- Lead to unstable or biased predictions.
- Reduce generalization to new, unseen rides.
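
To make the imputation step concrete, here is a small sketch on a toy frame with deliberately missing values. The frame and the variable names are illustrative assumptions. The two strategies are the ones named above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column (not the project data)
toy_missing = pd.DataFrame({
    "distance_km": [2.0, np.nan, 10.0],
    "weather": pd.Series(["clear", "clear", np.nan], dtype=object),
})

num_imputer = SimpleImputer(strategy="mean")           # fills with the column mean
cat_imputer = SimpleImputer(strategy="most_frequent")  # fills with the mode

toy_filled = toy_missing.copy()
toy_filled["distance_km"] = num_imputer.fit_transform(toy_missing[["distance_km"]]).ravel()
toy_filled["weather"] = cat_imputer.fit_transform(toy_missing[["weather"]]).ravel()
print(toy_filled)
```

The missing distance becomes 6.0, the mean of 2.0 and 10.0, and the missing weather becomes `clear`, the most frequent value.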

In [None]:
from sklearn.impute import SimpleImputer

# Split features and target
X = data.drop(columns=["ride_price"])
y = data["ride_price"]

numeric_features = ["distance_km", "duration_min"]
categorical_features = ["time_of_day", "traffic_level", "weather", "demand_level", "pickup_zone"]

# Preprocessing for numeric data: impute (mean) + scale
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]
)

# Preprocessing for categorical data: impute (most frequent) + one-hot encode
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# Combine into a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

preprocessor

## 8. Regression Model: Price Prediction (Linear Regression)

I now train a **Linear Regression** model to predict the continuous `ride_price`.

Steps:
- Split the data into training and test sets (80% / 20%).
- Use the preprocessing pipeline defined above.
- Fit a Linear Regression model.
- Evaluate using RMSE (root mean squared error) and R².
- Plot predicted vs actual prices to visually inspect performance.
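
Both metrics are simple enough to compute by hand, which makes them easier to interpret. The sketch below uses four made-up prices, not results from this notebook, and compares the manual values with scikit-learn's.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted prices for illustration only
y_true_toy = np.array([10.0, 20.0, 30.0, 40.0])
y_hat_toy = np.array([12.0, 18.0, 33.0, 37.0])

residuals = y_true_toy - y_hat_toy                        # -2, 2, -3, 3
rmse_manual = np.sqrt(np.mean(residuals ** 2))            # sqrt(26 / 4)
ss_res = np.sum(residuals ** 2)                           # 26
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)    # 500
r2_manual = 1 - ss_res / ss_tot                           # 1 - 26/500 = 0.948

print(f"RMSE manual: {rmse_manual:.4f}")
print(f"RMSE sklearn: {np.sqrt(mean_squared_error(y_true_toy, y_hat_toy)):.4f}")
print(f"R^2 manual: {r2_manual:.3f}, sklearn: {r2_score(y_true_toy, y_hat_toy):.3f}")
```

RMSE is in the same units as the price, so it reads as a typical error size. R² is the share of price variance explained relative to always predicting the mean.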

In [None]:
# Train-test split for regression
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

len(X_train), len(X_test)

In [None]:
# Build regression pipeline: preprocessing + linear regression
regression_model = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        ("model", LinearRegression()),
    ]
)

# Train
regression_model.fit(X_train, y_train)

# Predict
y_pred = regression_model.predict(X_test)

# Evaluation metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.2f}")
print(f"R^2 : {r2:.3f}")

In [None]:
# Plot predicted vs actual prices
plt.figure(figsize=(6, 6))
sns.scatterplot(x=y_test, y=y_pred, alpha=0.6)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Actual vs Predicted Ride Prices (Linear Regression)")
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
plt.plot(lims, lims, "--", color="red")
plt.show()

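The regression model achieves a relatively low RMSE and a high R² on this synthetic dataset. This is expected: the price was generated from a rule that is linear in distance and duration, with the categorical effects applied as multipliers and fixed surcharges. The added noise has a standard deviation of 2.5, so no model should be expected to reach an RMSE much below that value.

The scatter plot shows that most points lie close to the red diagonal line, meaning predicted prices are close to the true prices. The remaining deviations come from the random noise added during data generation and from the multiplicative effects, which a purely additive linear model can only approximate.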