# **Uber Ride Analytics & ML**

#### **This project turns raw trip logs into answers: Where do we lose rides? Why? Can we predict the outcome before it happens?**

# **What’s inside**

##### Exploratory Analytics: time features, demand patterns, hotspots, and failure points.
##### Clean Visuals (Matplotlib): heatmaps, distributions, top locations/routes, revenue and ratings views.

##### ML Model (scikit-learn): predicts whether a ride will complete using a robust preprocessing pipeline.

##### What you can change: tune the decision threshold to trade precision vs recall, and inspect feature influence.

# **Data at a glance**

##### A single CSV (~150k rows) of ride bookings with fields like:

##### When: Date, Booking Time → combined into event_time, plus derived weekday and hour

##### Outcome: Booking Status (Completed / Cancelled / Incomplete / No Driver)

##### Where/What: Pickup Location, Drop Location, Vehicle Type

##### Value/Distance: Booking Value, Ride Distance, Payment Method

##### Why it failed: Reason for cancelling by Customer, Driver Cancellation Reason
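The time features listed above can be derived with a few pandas calls. This is a minimal sketch on a two-row stand-in frame: the `Date` and `Booking Time` column names come from the field list, but the date/time string formats shown here are assumptions about the CSV, not confirmed by it.

```python
import pandas as pd

# Tiny stand-in for the real CSV; the string formats are assumed.
df = pd.DataFrame({
    "Date": ["2024-03-23", "2024-03-24"],
    "Booking Time": ["12:29:38", "18:01:39"],
})

# Combine the two text columns into one timestamp.
# errors="coerce" turns unparseable rows into NaT instead of raising.
df["event_time"] = pd.to_datetime(
    df["Date"] + " " + df["Booking Time"], errors="coerce"
)

# Derive the calendar features used throughout the analysis.
df["weekday"] = df["event_time"].dt.day_name()
df["hour"] = df["event_time"].dt.hour
```

Rows that end up as `NaT` are worth counting before any grouping, since they silently drop out of time-based aggregations.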

# **Questions I Answer:**

##### When and where is demand highest? (hour × weekday heatmap; daily trends)
##### What fails most often and why? (status mix; top cancellation reasons)
##### Who/what drives revenue? (payment methods, vehicle types, top routes)
##### Can we predict completion before the ride starts? (logistic regression)
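For the demand question, the hour × weekday heatmap starts from a simple count table. The sketch below builds that table with `pd.crosstab` on four made-up bookings; the timestamps are illustrative only and `events` is a hypothetical name for the prepared frame.

```python
import pandas as pd

# Four made-up bookings standing in for the real event_time column.
events = pd.DataFrame({
    "event_time": pd.to_datetime([
        "2024-03-23 08:15:00", "2024-03-23 08:40:00",
        "2024-03-23 18:05:00", "2024-03-25 08:30:00",
    ])
})
events["weekday"] = events["event_time"].dt.day_name()
events["hour"] = events["event_time"].dt.hour

# Rows are weekdays, columns are hours, cells are booking counts.
# Weekday/hour pairs with no bookings are filled with 0.
demand = pd.crosstab(events["weekday"], events["hour"])
```

The resulting table can be passed to Matplotlib (for example `plt.imshow(demand.values)`) to draw the heatmap; reindexing the rows into calendar order first makes the plot easier to read.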

# **The ML model (brief)**

### **Goal: estimate the probability that a ride will complete.**

### Pipeline:
##### SimpleImputer → StandardScaler (numerics) + OneHotEncoder (categoricals) → LogisticRegression(class_weight='balanced').

### Features: 
##### hour, weekday, distance, ratings, booking value, vehicle type, pickup/drop locations (trimmed to the top-k categories), payment method.
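The pipeline described above could be wired up roughly as follows. This is a sketch, not the notebook's exact code: the snake_case column names, the imputation strategies, `max_iter=1000`, and the synthetic data and label are all assumptions made so the example runs on its own. Ratings and pickup/drop columns are left out for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; the real notebook may use different ones.
numeric_cols = ["hour", "ride_distance", "booking_value"]
categorical_cols = ["weekday", "vehicle_type", "payment_method"]

# Numerics: fill gaps with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categoricals: fill gaps with the mode, then one-hot encode.
# handle_unknown="ignore" keeps unseen categories from raising at predict time.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, numeric_cols),
    ("cat", categorical, categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Synthetic data so the sketch is runnable; not the project's dataset.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "hour": rng.integers(0, 24, n).astype(float),
    "ride_distance": rng.uniform(1, 40, n),
    "booking_value": rng.uniform(50, 900, n),
    "weekday": rng.choice(["Mon", "Tue", "Sat"], n),
    "vehicle_type": rng.choice(["Auto", "Sedan", "Bike"], n),
    "payment_method": rng.choice(["UPI", "Cash", "Card"], n),
})
X.loc[::10, "ride_distance"] = np.nan          # simulate missing distances
y = (X["booking_value"] > 400).astype(int)     # made-up label, 1 = completed

model.fit(X, y)
proba = model.predict_proba(X)[:, 1]           # P(completed) for each ride
```

Keeping imputation, scaling, and encoding inside the `Pipeline` means they are fitted only on training folds during cross-validation, which avoids leaking test statistics into preprocessing.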

### Why logistic regression?
##### Fast, stable, and explainable. Swap in tree-based models or gradient boosting to capture nonlinear patterns.



# **How to run (Kaggle)**

##### Open the notebook(s) and run the EDA/visualization cells to explore the data.
##### Run the one-cell classifier to train/evaluate (confusion matrix + ROC).
##### (Optional) add the threshold-tuning snippet to maximize F1 or match real costs.
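One possible shape for the threshold-tuning step is sketched below, using scikit-learn's `precision_recall_curve` to pick the F1-maximizing cutoff. The `y_true` and `y_prob` arrays are toy values standing in for held-out labels and predicted completion probabilities; they are not outputs of the notebook.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels (1 = completed) and predicted probabilities, for illustration.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.20, 0.90, 0.55])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# The last precision/recall pair has no matching threshold, so drop it.
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)   # guard against division by zero

best_threshold = thresholds[int(np.argmax(f1))]

# Apply the tuned cutoff instead of the default 0.5.
y_pred = (y_prob >= best_threshold).astype(int)
```

Tuning on a validation split rather than the test set keeps the reported test metrics honest.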

# **Key findings**

##### The logistic regression pipeline (imputation + scaling + one-hot encoding) achieved roughly:
##### Accuracy ~0.79, Precision ~1.00, Recall ~0.68, F1 ~0.81, ROC AUC ~0.82.
##### The model is conservative: it produces almost no false positives (near-perfect precision) but misses about a third of completed rides (recall ~0.68). Lowering the threshold catches more completed rides at the cost of some precision.
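As a quick sanity check, the reported F1 follows from the reported precision and recall, since F1 is their harmonic mean. The snippet only reuses the rounded figures quoted above.

```python
# Rounded metrics as reported in the findings.
precision, recall = 1.00, 0.68

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))
```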

##### Feature influence (log-odds for “Completed”) showed strong signals from payment method, booking value, and certain pickup/drop zones.
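Log-odds coefficients are easier to read as odds ratios: exponentiating a coefficient gives the multiplicative change in the odds of "Completed" for a one-unit increase in that (scaled or one-hot) feature. The coefficient values and feature names below are placeholders chosen for illustration, not the model's actual weights.

```python
import math

# Placeholder coefficients (log-odds); NOT values from the fitted model.
coefs = {"feature_a": 0.9, "feature_b": 0.4, "feature_c": -0.7}

# exp(coef) > 1 raises the odds of completion, < 1 lowers them.
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

# Rank features by strength of influence, regardless of direction.
ranked = sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)
```

With a fitted scikit-learn pipeline, the real coefficients would come from the classifier's `coef_` attribute, paired with the preprocessor's `get_feature_names_out()`.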

# **Next steps**

##### Try HistGradientBoosting / XGBoost / LightGBM and probability calibration.
##### Enrich with geo features (clustering zones), temporal signals (rolling demand), and weather.
##### Optimize thresholds against business costs of false positives/negatives.
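Optimizing against business costs can be as simple as scoring each candidate threshold by the total cost of its mistakes. In this sketch the cost values, the toy labels, and the toy probabilities are all assumptions; real costs would have to come from the business.

```python
import numpy as np

# Toy labels (1 = completed) and predicted probabilities, for illustration.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.20, 0.90, 0.55])

# Hypothetical costs: a false "will complete" is 5x worse than a missed one.
COST_FP = 5.0
COST_FN = 1.0

def total_cost(threshold: float) -> float:
    """Total misclassification cost when predicting 1 for prob >= threshold."""
    pred = y_prob >= threshold
    fp = np.sum(pred & (y_true == 0))    # predicted complete, actually failed
    fn = np.sum(~pred & (y_true == 1))   # predicted fail, actually completed
    return float(COST_FP * fp + COST_FN * fn)

# Only thresholds equal to an observed score can change the predictions.
candidates = np.unique(y_prob)
costs = np.array([total_cost(t) for t in candidates])
best_threshold = candidates[int(np.argmin(costs))]
```

Because false positives are priced higher here, the chosen cutoff moves upward; reversing the cost ratio would push it down.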