# Mysore Taxi Trip Duration Prediction

## 1. Introduction & Objective
In this notebook, we aim to predict the total trip duration of taxi rides in Mysore. Accurate estimation of trip duration is crucial for ride-hailing services to optimize fleet management and improve user experience.

**Methodology:**
1. **Data Cleaning:** Remove geographic and temporal outliers (e.g., trips > 100km, impossible speeds).
2. **Feature Engineering:** Create features based on distance (Haversine/Manhattan), time (Cyclic encoding), and spatial clustering.
3. **Modeling:** Establish a baseline using Linear Regression and improve upon it using Gradient Boosting (LightGBM).
4. **Error Analysis:** Evaluate where the model performs well and where it fails.

---

## 2. Imports and Setup
We start by loading necessary libraries and configuring the environment.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error, mean_squared_error
import lightgbm as lgb
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.float_format', '{:,.4f}'.format)
plt.style.use('seaborn-v0_8-whitegrid')

---

## 3. Data Loading & Helper Functions
We define the **Haversine** distance function early, as it is critical for both cleaning and feature engineering.

In [None]:
# Load data (update the path as necessary)
try:
    from google.colab import drive
    drive.mount('/drive')
    train = pd.read_csv('/drive/MyDrive/1/train/train.csv')
except Exception:  # Not running in Colab, or the Drive path is unavailable
    train = pd.read_csv('train.csv')  # Fallback for local environment

# Convert Timestamps
train['pickup_datetime'] = pd.to_datetime(train['pickup_datetime'])
train['dropoff_datetime'] = pd.to_datetime(train['dropoff_datetime'])

# Haversine Distance Function
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in km
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1)*np.cos(lat2)*np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return R * c

print(f"Dataset Shape: {train.shape}")
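
As a quick sanity check on the function above, one degree of latitude should come out at roughly 111 km, and the distance from a point to itself should be zero. The sketch below repeats the function so that it runs on its own; the test coordinates are illustrative and not taken from the dataset.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in km (same constant as above)
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return R * 2 * np.arcsin(np.sqrt(a))

# Illustrative inputs (not from the dataset)
one_degree_lat = haversine(0.0, 0.0, 1.0, 0.0)  # expected: close to 6371 * pi / 180
same_point = haversine(12.3, 76.6, 12.3, 76.6)  # expected: 0

print(f"1 degree of latitude: {one_degree_lat:.2f} km")
print(f"Same point: {same_point:.2f} km")
```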

## 4. Exploratory Data Analysis & Cleaning

### 4.1 Feature Creation for Cleaning
We calculate distance and speed to identify impossible trips.

In [None]:
# Calculate Distance (Km)
train["distance_km"] = haversine(
    train["pickup_latitude"], train["pickup_longitude"],
    train["dropoff_latitude"], train["dropoff_longitude"]
)

# Calculate Speed (Km/h)
train["duration_hr"] = train["trip_duration"] / 3600
train["avg_speed_kmph"] = train["distance_km"] / train["duration_hr"]

# Filter Passenger Count (Remove 0 or > 6 passengers)
train = train[(train["passenger_count"] > 0) & (train["passenger_count"] <= 6)]

### 4.2 Removing Outliers
We remove records that are likely noise or GPS errors:
1.  **Distance:** Trips of 100 km or more are removed (likely outside the city or recording errors).
2.  **Speed:** Trips with an average speed below 1 km/h (effectively stationary) or above 80 km/h (illegal/impossible in Mysore) are removed.
3.  **Noise:** Trips longer than 2 km with an average speed below 2 km/h are removed.

In [None]:
initial_len = len(train)

train = train[
    (train["distance_km"] < 100) &
    (train["avg_speed_kmph"] >= 1) &
    (train["avg_speed_kmph"] <= 80) &
    ~((train["distance_km"] > 2) & (train["avg_speed_kmph"] < 2))
].copy()  # Copy so that later column assignments do not target a slice of the original frame

print(f"Rows removed: {initial_len - len(train)}")

# Visual check of cleaning
plt.figure(figsize=(10, 6))
plt.scatter(train["distance_km"], train["duration_hr"], s=1, alpha=0.3, c='teal')
plt.xlabel("Distance (km)")
plt.ylabel("Trip Duration (hr)")
plt.title("Distance vs Trip Duration (Cleaned Data)")
plt.show()
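
To make the combined filter easier to follow, the sketch below applies the same four conditions to a handful of made-up trips and marks which ones survive. The `toy` frame and its values are hypothetical; only the thresholds come from the cell above.

```python
import pandas as pd

# Hypothetical trips: distance in km and average speed in km/h
toy = pd.DataFrame({
    "distance_km":    [5.0, 150.0, 0.5, 3.0, 1.0, 10.0],
    "avg_speed_kmph": [30.0, 50.0, 0.5, 1.5, 1.5, 90.0],
})

# Same conditions as the cleaning cell
keep = (
    (toy["distance_km"] < 100) &
    (toy["avg_speed_kmph"] >= 1) &
    (toy["avg_speed_kmph"] <= 80) &
    ~((toy["distance_km"] > 2) & (toy["avg_speed_kmph"] < 2))
)

toy["kept"] = keep
print(toy)
```

Note that the 1 km trip at 1.5 km/h is kept while the 3 km trip at the same speed is dropped: the last condition only targets slow trips that also cover more than 2 km.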


---

## 5. Feature Engineering

### 5.1 Temporal & Cyclic Features
Time is cyclical: 23:00 is one hour away from 00:00, yet numerically 23 is far from 0. Sine/cosine transformations place each hour (and each weekday) on a circle, so neighbouring times stay close to each other.


In [None]:
# Extract Components
train["pickup_hour"] = train["pickup_datetime"].dt.hour
train["pickup_weekday"] = train["pickup_datetime"].dt.weekday
train["pickup_month"] = train["pickup_datetime"].dt.month

# Cyclic Encoding
train["hour_sin"] = np.sin(2 * np.pi * train["pickup_hour"] / 24)
train["hour_cos"] = np.cos(2 * np.pi * train["pickup_hour"] / 24)
train["weekday_sin"] = np.sin(2 * np.pi * train["pickup_weekday"] / 7)
train["weekday_cos"] = np.cos(2 * np.pi * train["pickup_weekday"] / 7)
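
The effect of the encoding can be checked directly by comparing distances between encoded hours. The sketch below is illustrative only; `encode_hour` is a hypothetical helper that mirrors the `hour_sin`/`hour_cos` formulas above.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle (hypothetical helper)."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

# Euclidean distance between encoded hours
d_23_to_0 = np.linalg.norm(encode_hour(23) - encode_hour(0))
d_12_to_0 = np.linalg.norm(encode_hour(12) - encode_hour(0))

print(f"23:00 vs 00:00 -> {d_23_to_0:.3f}")
print(f"12:00 vs 00:00 -> {d_12_to_0:.3f}")
```

After encoding, 23:00 sits much closer to midnight than noon does, which is the behaviour the raw hour value cannot express.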


### 5.2 Spatial Features (Geometry & Bearing)
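Taxis follow streets rather than straight lines, so where the street layout is roughly grid-like, "Manhattan distance" (the sum of the north-south and east-west legs) is often more useful than the straight-line Haversine distance. "Bearing" tells us the direction of travel.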

In [None]:
# Manhattan Distance
def manhattan_distance(lat1, lon1, lat2, lon2):
    lat_dist = haversine(lat1, lon1, lat2, lon1)
    lon_dist = haversine(lat1, lon1, lat1, lon2)
    return lat_dist + lon_dist

train["manhattan_km"] = manhattan_distance(
    train["pickup_latitude"], train["pickup_longitude"],
    train["dropoff_latitude"], train["dropoff_longitude"]
)

# Bearing (Direction)
def calculate_bearing(lat1, lon1, lat2, lon2):
    lat1, lat2 = np.radians(lat1), np.radians(lat2)
    diff_lon = np.radians(lon2 - lon1)
    x = np.sin(diff_lon) * np.cos(lat2)
    y = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(diff_lon)
    bearing = np.degrees(np.arctan2(x, y))
    return (bearing + 360) % 360

train["bearing"] = calculate_bearing(
    train["pickup_latitude"], train["pickup_longitude"],
    train["dropoff_latitude"], train["dropoff_longitude"]
)
train["bearing_sin"] = np.sin(np.radians(train["bearing"]))
train["bearing_cos"] = np.cos(np.radians(train["bearing"]))
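
A few spot checks help confirm that the two functions behave as intended: bearings should read 0 for due north, 90 for due east, and 180 for due south, and the Manhattan distance should never be shorter than the direct distance. The sketch below repeats the definitions so that it runs on its own; the coordinates are illustrative and not taken from the dataset.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in km
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return R * 2 * np.arcsin(np.sqrt(a))

def manhattan_distance(lat1, lon1, lat2, lon2):
    # North-south leg plus east-west leg
    return haversine(lat1, lon1, lat2, lon1) + haversine(lat1, lon1, lat1, lon2)

def calculate_bearing(lat1, lon1, lat2, lon2):
    lat1, lat2 = np.radians(lat1), np.radians(lat2)
    diff_lon = np.radians(lon2 - lon1)
    x = np.sin(diff_lon) * np.cos(lat2)
    y = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(diff_lon)
    return (np.degrees(np.arctan2(x, y)) + 360) % 360

# Illustrative spot checks
north = calculate_bearing(0.0, 0.0, 1.0, 0.0)
east = calculate_bearing(0.0, 0.0, 0.0, 1.0)
south = calculate_bearing(1.0, 0.0, 0.0, 0.0)

direct = haversine(12.30, 76.60, 12.35, 76.65)
grid = manhattan_distance(12.30, 76.60, 12.35, 76.65)

print(f"Bearings (N, E, S): {north:.1f}, {east:.1f}, {south:.1f}")
print(f"Direct: {direct:.2f} km, Manhattan: {grid:.2f} km")
```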


### 5.3 Zone-Based Target Encoding
We divide the map into small grids (zones) and calculate the average trip duration for pickups and dropoffs in those zones. This acts as a proxy for "traffic density" in specific neighborhoods.

In [None]:
# Define Grid Size (~1km)
GRID_SIZE = 0.01

# Create Zone IDs
train["pickup_zone"] = (
    (train["pickup_longitude"] // GRID_SIZE).astype(int).astype(str) + "_" +
    (train["pickup_latitude"] // GRID_SIZE).astype(int).astype(str)
)
train["dropoff_zone"] = (
    (train["dropoff_longitude"] // GRID_SIZE).astype(int).astype(str) + "_" +
    (train["dropoff_latitude"] // GRID_SIZE).astype(int).astype(str)
)

# Calculate Mean Duration per Zone
pickup_mean = train.groupby("pickup_zone")["trip_duration"].mean()
dropoff_mean = train.groupby("dropoff_zone")["trip_duration"].mean()

# Map features to dataset
train["pickup_zone_mean_duration"] = train["pickup_zone"].map(pickup_mean)
train["dropoff_zone_mean_duration"] = train["dropoff_zone"].map(dropoff_mean)
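
One caveat with target encoding is that means computed over every row include the target values of rows later used for validation, which can make validation scores look better than they are. A minimal sketch of one common alternative follows: compute the zone means on the training rows only, then map them onto the validation rows, falling back to the global training mean for zones not seen in training. The toy data, the `fit_rows`/`val_rows` names, and the fallback choice are assumptions for illustration, not part of this notebook's pipeline.

```python
import pandas as pd

# Hypothetical trips with pre-computed zone IDs
toy = pd.DataFrame({
    "pickup_zone":   ["A", "A", "B", "B", "A", "C"],
    "trip_duration": [600, 800, 300, 500, 1000, 400],
})

# Assumed split: first four rows for fitting, last two for validation
fit_rows = toy.iloc[:4].copy()
val_rows = toy.iloc[4:].copy()

# Zone means and the global fallback come from the fitting rows only
zone_mean = fit_rows.groupby("pickup_zone")["trip_duration"].mean()
global_mean = fit_rows["trip_duration"].mean()

fit_rows["pickup_zone_mean_duration"] = fit_rows["pickup_zone"].map(zone_mean)
val_rows["pickup_zone_mean_duration"] = (
    val_rows["pickup_zone"].map(zone_mean).fillna(global_mean)
)

print(val_rows)
```

Here zone "C" never appears in the fitting rows, so its validation row receives the global mean instead of a missing value.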


### 5.4 Additional Interaction Features
We add a distance-by-hour interaction term, a flag for pickups during hours 10 through 20 (treated here as rush hours), and a flag for trips longer than 10 km.

In [None]:
train["distance_hour_interaction"] = train["distance_km"] * train["pickup_hour"]
train["is_rush_hour"] = train["pickup_hour"].between(10, 20).astype(int)  # Hours 10 through 20, inclusive
train["is_long_trip"] = (train["distance_km"] > 10).astype(int)

---

## 6. Model Building & Evaluation

We define our feature set and target variable. Since trip duration is right-skewed, we predict `log(trip_duration + 1)` to normalize the distribution and use RMSLE as our metric.
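
For reference, with $y_i$ the actual duration in seconds, $\hat{y}_i$ the predicted duration, and $n$ the number of validation trips (symbols introduced here for illustration), RMSLE is:

```latex
\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(1+\hat{y}_i) - \log(1+y_i)\bigr)^2}
```

Because the error is measured on the log scale, it reflects relative rather than absolute differences: a 60-second miss on a 5-minute trip counts more than the same miss on a 1-hour trip.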


In [None]:
# Feature Selection
features = [
    "distance_km", "manhattan_km",
    "pickup_zone_mean_duration", "dropoff_zone_mean_duration",
    "hour_sin", "hour_cos", "weekday_sin", "weekday_cos",
    "bearing_sin", "bearing_cos", "passenger_count",
    "distance_hour_interaction", "is_rush_hour", "is_long_trip"
]

X = train[features]
y = train["trip_duration"]

# Train/Validation Split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Log Transformation of Target
y_train_log = np.log1p(y_train)
y_val_log = np.log1p(y_val)
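
Training on the log-transformed target means that the ordinary RMSE of log-scale predictions equals the RMSLE of the back-transformed predictions. The sketch below checks this on made-up numbers; the `y_true` and `y_pred` arrays are illustrative and are not notebook outputs.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Made-up durations in seconds
y_true = np.array([300.0, 900.0, 1800.0, 3600.0])
y_pred = np.array([360.0, 800.0, 2000.0, 3000.0])

# RMSE computed on the log1p scale
rmse_on_logs = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# RMSLE computed from the values on the original scale
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))

print(f"RMSE on log1p values: {rmse_on_logs:.6f}")
print(f"RMSLE on raw values:  {rmsle:.6f}")
```

This is why the models below can be fitted and early-stopped with plain RMSE on the log target while still being reported in RMSLE.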


### 6.1 Baseline: Linear Regression

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train_log)

pred_log_lr = lr.predict(X_val)
pred_lr = np.expm1(pred_log_lr) # Inverse log

rmsle_lr = np.sqrt(mean_squared_log_error(y_val, np.maximum(pred_lr, 0)))
print(f"Baseline Linear Regression RMSLE: {rmsle_lr:.4f}")

### 6.2 Advanced Model: LightGBM
Gradient boosting usually handles non-linear relationships (traffic patterns, geographic boundaries) much better than linear models.

In [None]:
lgb_model = lgb.LGBMRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    num_leaves=64,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1
)

lgb_model.fit(
    X_train, y_train_log,
    eval_set=[(X_val, y_val_log)],
    eval_metric="rmse",
    callbacks=[lgb.early_stopping(10)]
)

pred_log_lgb = lgb_model.predict(X_val)
pred_lgb = np.expm1(pred_log_lgb)

rmsle_lgb = np.sqrt(mean_squared_log_error(y_val, np.maximum(pred_lgb, 0)))
print(f"\nLightGBM RMSLE: {rmsle_lgb:.4f}")

---
## 7. Error Analysis & Conclusion

We analyze feature importance to see what drives the model and visualize where errors occur.

In [None]:
# Feature Importance
lgb.plot_importance(lgb_model, max_num_features=15, height=0.5)
plt.title("LightGBM Feature Importance")
plt.show()

# Error Distribution
abs_error = np.abs(pred_lgb - y_val)
plt.figure(figsize=(10, 6))
plt.scatter(y_val, abs_error, s=2, alpha=0.3, c='crimson')
plt.xlabel("Actual Trip Duration (sec)")
plt.ylabel("Absolute Error (sec)")
plt.title("Error Analysis: Actual vs Absolute Error")
plt.show()
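
A scatter plot can hide how the typical error changes with trip length. One possible complement, sketched on made-up numbers below, is to bucket trips by actual duration and summarize the absolute error per bucket. The bin edges and the `toy_actual`/`toy_pred` arrays are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up actual and predicted durations in seconds
toy_actual = np.array([200, 400, 700, 1100, 1500, 2500, 4000])
toy_pred = np.array([260, 380, 650, 1300, 1200, 2000, 2800])

errors = pd.DataFrame({
    "actual": toy_actual,
    "abs_error": np.abs(toy_pred - toy_actual),
})

# Assumed bin edges: up to 10 min, 10-30 min, over 30 min
bins = [0, 600, 1800, np.inf]
labels = ["<=10 min", "10-30 min", ">30 min"]
errors["bucket"] = pd.cut(errors["actual"], bins=bins, labels=labels)

# Count and mean absolute error per duration bucket
summary = errors.groupby("bucket", observed=True)["abs_error"].agg(["count", "mean"])
print(summary)
```

In the notebook, the same grouping could be applied to `y_val` and `pred_lgb` to see whether long trips account for most of the absolute error.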

---
### Summary Findings
- Cleaning Matters: Filtering trips with unrealistic average speeds (<1 km/h or >80 km/h), along with trips of 100 km or more, removed records that were likely GPS or logging errors.
- Performance:
  - Baseline (Linear Regression): RMSLE ~0.53
  - LightGBM: RMSLE ~0.34
- Improvement: Tree-based boosting lowers RMSLE from ~0.53 to ~0.34, a relative reduction of roughly 36%.
- Key Drivers: Trip duration is primarily driven by raw distance (`distance_km`), but local traffic patterns (`pickup_zone_mean_duration`) and time of day (`hour_sin`) play substantial roles.