## Input Preparation for API Inference

### Notebook Description

This notebook outlines the process of handling and transforming raw user input into the appropriate format required by the trained machine learning model during API inference. It includes notes and best practices to consider when implementing transformation logic in the backend, ensuring consistency with the preprocessing steps used during model training. The goal is to bridge the gap between real-time input and model-ready data.

### Expected User Input & Handling Notes

* Accept the following fields from the user:

  * `store_and_fwd_flag`: "N" / "Y"
  * `vendor_id`: should be either **1** or **2**
  * `passenger_count`: must be an integer between **1** and **6**
  * `pickup_longitude` / `pickup_latitude`: coordinates must be **within NYC bounds**
  * `dropoff_longitude` / `dropoff_latitude`: also must be **within NYC bounds**
  * `date`: in **YYYY-MM-DD** format
  * `time`: in **HH:MM** format (24-hour)

* Validate the range and format of each input field in the API code:

  * Enforce **valid ranges** for vendor ID and passenger count
  * Validate that `store_and_fwd_flag` is either "N" or "Y"
  * **Check geographic coordinates** fall within NYC city limits
  * **Parse and validate** the date and time, then break them down into derived features such as:

    * `hour`, `minute`, `weekday`, `month`, etc.

* No need to manually encode variables—**the model pipeline handles all encoding and scaling internally**.
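
The checks listed above could be sketched as a single validation function. This is a minimal sketch, not the project's actual API code: the function name `validate_trip_input` is hypothetical, and the NYC bounding box values are illustrative assumptions that should be replaced with the limits used when cleaning the training data.

```python
from datetime import datetime

# Assumed bounding box for NYC (illustrative values, not taken from the
# training code); replace with the limits used to filter the training data.
NYC_LAT_RANGE = (40.4774, 40.9176)
NYC_LON_RANGE = (-74.2591, -73.7004)


def validate_trip_input(payload: dict) -> datetime:
    """Validate a raw request payload and return the parsed pickup datetime.

    Raises ValueError with a readable message on the first invalid field.
    """
    if payload.get("store_and_fwd_flag") not in ("N", "Y"):
        raise ValueError("store_and_fwd_flag must be 'N' or 'Y'")
    if payload.get("vendor_id") not in (1, 2):
        raise ValueError("vendor_id must be 1 or 2")

    count = payload.get("passenger_count")
    # bool is a subclass of int, so reject it explicitly
    if not isinstance(count, int) or isinstance(count, bool) or not 1 <= count <= 6:
        raise ValueError("passenger_count must be an integer between 1 and 6")

    for prefix in ("pickup", "dropoff"):
        lat = payload.get(f"{prefix}_latitude")
        lon = payload.get(f"{prefix}_longitude")
        if not isinstance(lat, (int, float)) or not isinstance(lon, (int, float)):
            raise ValueError(f"{prefix} coordinates must be numeric")
        if not NYC_LAT_RANGE[0] <= lat <= NYC_LAT_RANGE[1]:
            raise ValueError(f"{prefix}_latitude is outside the NYC bounds")
        if not NYC_LON_RANGE[0] <= lon <= NYC_LON_RANGE[1]:
            raise ValueError(f"{prefix}_longitude is outside the NYC bounds")

    try:
        # strptime rejects impossible values such as month 13 or hour 25
        return datetime.strptime(
            f"{payload.get('date')} {payload.get('time')}", "%Y-%m-%d %H:%M"
        )
    except ValueError as exc:
        raise ValueError("date must be YYYY-MM-DD and time must be HH:MM (24-hour)") from exc
```

Returning the parsed `datetime` means the derived time features can be computed from the same object that was validated, rather than parsing the strings a second time.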

### Input Processing and Prediction Plan

The following steps outline the full transformation and prediction flow, from raw user input to final trip duration output:

1. **Convert date and time** fields into a proper `datetime` object.

2. **Feature Engineering** – extract and compute all required features:

   * Calculate `trip_distance` and its variations:

     * `trip_distance`, `trip_distance_sqrt`, `trip_distance_square`, `trip_distance_cube`
     * `log_trip_distance`, `log_trip_distance_sqrt`, `log_trip_distance_square`, `log_trip_distance_cube`
   * Airport flags:

     * `is_jfk_airport`, `is_lg_airport`
   * Coordinate-based features:

     * `coord_square_sum`, `coord_arithmetic_mean`, `coord_harmonic_mean`
   * Time-based features:

     * `hour`, `minute`, `weekday`, `month`, `season`, `virtual_time`, `virtual_time_dist_sqrt`

3. **Drop all unused or irrelevant features** — only keep the features required by the model.

4. **Ensure feature consistency** — verify all expected columns are present in the DataFrame **in the correct format and order**.

5. **Load the pickled model pipeline** and pass the final processed input to it for prediction.

6. **Convert prediction output**:

   * Since the model predicts `log_trip_duration`, apply the inverse transformation using `expm1()` to get the original scale.
   * Round the result to the nearest integer to represent the trip duration in **seconds**.
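
Step 6 relies on `expm1()` undoing the transform applied to the target. The short round trip below illustrates this, assuming the training target was built with `np.log1p` (which the use of `expm1()` implies but this notebook does not show).

```python
import numpy as np

duration_seconds = 437                      # a duration on the original scale
log_duration = np.log1p(duration_seconds)   # log(1 + x), the assumed target transform

# expm1 computes exp(x) - 1, which undoes log1p
recovered = int(np.round(np.expm1(log_duration)))
```

If the target had instead been built with a plain `np.log`, the matching inverse would be `np.exp`, so it is worth confirming the transform in the training code before wiring this into the API.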


In [67]:
import pandas as pd
import numpy as np

In [68]:
import warnings

# ignore FutureWarnings and UserWarnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

In [69]:
# example to work through

trip_dict = {  
    "store_and_fwd_flag": "N",
    "vendor_id": 1,   
    "passenger_count": 1,
    "pickup_longitude": -73.988609,
    "pickup_latitude": 40.748977,
    "dropoff_longitude": -73.992797,
    "dropoff_latitude": 40.763408,
    "date": "2016-03-23",
    "time": "02:24",
}

# actual trip_duration: 437 seconds


In [70]:
trip = pd.DataFrame([trip_dict])
trip

Unnamed: 0,store_and_fwd_flag,vendor_id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,date,time
0,N,1,1,-73.988609,40.748977,-73.992797,40.763408,2016-03-23,02:24


In [71]:
trip["pickup_datetime"] = trip["date"] + " " + trip["time"] + ":00"
trip["pickup_datetime"] = pd.to_datetime(trip["pickup_datetime"])

trip

Unnamed: 0,store_and_fwd_flag,vendor_id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,date,time,pickup_datetime
0,N,1,1,-73.988609,40.748977,-73.992797,40.763408,2016-03-23,02:24,2016-03-23 02:24:00


In [72]:
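# Flat-earth (equirectangular) distance approximation, following:
# https://www.chegg.com/homework-help/questions-and-answers/point-latitude-373198-point-longitude-121936-point-b-latitude-373185-point-b-longitude-121-q56508606

# Earth's polar radius in km (the mean radius is about 6371 km).
# This value must match the one used when the training features were built.
R = 6356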

# Convert degrees to radians
lat1 = np.radians(trip["pickup_latitude"])
lat2 = np.radians(trip["dropoff_latitude"])
lon1 = np.radians(trip["pickup_longitude"])
lon2 = np.radians(trip["dropoff_longitude"])

# North-south (x) and east-west (y) components of the distance in km
x = R * (lat1 - lat2)
y = R * (lon1 - lon2) * np.cos(lat2)

# Euclidean distance approximation, computed once and reused below
dist_sq = x**2 + y**2
dist = np.sqrt(dist_sq)

trip["trip_distance"] = dist
trip["trip_distance_sqrt"] = np.sqrt(dist)
trip["trip_distance_square"] = dist_sq
trip["trip_distance_cube"] = dist**3

# log1p variants of the same four features
trip["log_trip_distance"] = np.log1p(dist)
trip["log_trip_distance_sqrt"] = np.log1p(np.sqrt(dist))
trip["log_trip_distance_square"] = np.log1p(dist_sq)
trip["log_trip_distance_cube"] = np.log1p(dist**3)

trip


Unnamed: 0,store_and_fwd_flag,vendor_id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,date,time,pickup_datetime,trip_distance,trip_distance_sqrt,trip_distance_square,trip_distance_cube,log_trip_distance,log_trip_distance_sqrt,log_trip_distance_square,log_trip_distance_cube
0,N,1,1,-73.988609,40.748977,-73.992797,40.763408,2016-03-23,02:24,2016-03-23 02:24:00,1.639093,1.280271,2.686627,4.403631,0.970435,0.824294,1.304712,1.687071
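
As a sanity check on the approximation above, the sketch below compares it with the haversine (great-circle) distance for the same example trip. Haversine is a swapped-in technique used here only for comparison. The model must still receive the approximation it was trained on. The helper names `equirectangular_km` and `haversine_km` are hypothetical.

```python
import numpy as np

R = 6356  # same radius as in the cell above, in km


def equirectangular_km(lat1, lon1, lat2, lon2):
    """Flat-earth approximation, mirroring the notebook's formula."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    x = R * (lat1 - lat2)
    y = R * (lon1 - lon2) * np.cos(lat2)
    return np.sqrt(x**2 + y**2)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance on a sphere of radius R."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * R * np.arcsin(np.sqrt(a))


# Pickup and dropoff of the example trip
approx = equirectangular_km(40.748977, -73.988609, 40.763408, -73.992797)
great_circle = haversine_km(40.748977, -73.988609, 40.763408, -73.992797)
difference_km = abs(approx - great_circle)
```

For trips of a few kilometres the two values should differ by far less than a metre, which is why the cheaper approximation is adequate within city limits.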


In [73]:
# Coordinates are taken from Google Maps
JFK_LATITUDE_RANGE = [40.620998, 40.683139]
JFK_LONGITUDE_RANGE = [-73.841476, -73.729188]

LG_LATITUDE_RANGE = [40.763557, 40.787499]
LG_LONGITUDE_RANGE = [-73.899899, -73.848085]

# JFK bounding box
trip["is_jfk_airport"] = (
    ((trip["pickup_latitude"].between(JFK_LATITUDE_RANGE[0], JFK_LATITUDE_RANGE[1])) &
    (trip["pickup_longitude"].between(JFK_LONGITUDE_RANGE[0], JFK_LONGITUDE_RANGE[1])))
    |
    ((trip["dropoff_latitude"].between(JFK_LATITUDE_RANGE[0], JFK_LATITUDE_RANGE[1])) &
    (trip["dropoff_longitude"].between(JFK_LONGITUDE_RANGE[0], JFK_LONGITUDE_RANGE[1])))
).astype("int")

# LaGuardia bounding box
trip["is_lg_airport"] = (
    ((trip["pickup_latitude"].between(LG_LATITUDE_RANGE[0], LG_LATITUDE_RANGE[1])) &
    (trip["pickup_longitude"].between(LG_LONGITUDE_RANGE[0], LG_LONGITUDE_RANGE[1])))
    |
    ((trip["dropoff_latitude"].between(LG_LATITUDE_RANGE[0], LG_LATITUDE_RANGE[1])) &
    (trip["dropoff_longitude"].between(LG_LONGITUDE_RANGE[0], LG_LONGITUDE_RANGE[1])))
).astype("int")

trip

Unnamed: 0,store_and_fwd_flag,vendor_id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,date,time,pickup_datetime,trip_distance,trip_distance_sqrt,trip_distance_square,trip_distance_cube,log_trip_distance,log_trip_distance_sqrt,log_trip_distance_square,log_trip_distance_cube,is_jfk_airport,is_lg_airport
0,N,1,1,-73.988609,40.748977,-73.992797,40.763408,2016-03-23,02:24,2016-03-23 02:24:00,1.639093,1.280271,2.686627,4.403631,0.970435,0.824294,1.304712,1.687071,0,0


In [74]:
from scipy.stats import hmean

geo_columns = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']

# Convert the coordinate columns to a 2D NumPy array (shape: [n_samples, 4])
geo_array = trip[geo_columns].to_numpy()

# axis=1 applies each operation row-wise (one value per trip)
trip['coord_arithmetic_mean'] = np.mean(geo_array, axis=1)
# hmean requires positive values, so take absolute values of the longitudes
trip['coord_harmonic_mean'] = hmean(np.abs(geo_array), axis=1)
trip['coord_square_sum'] = np.sum(geo_array ** 2, axis=1)

trip

Unnamed: 0,store_and_fwd_flag,vendor_id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,date,time,pickup_datetime,...,trip_distance_cube,log_trip_distance,log_trip_distance_sqrt,log_trip_distance_square,log_trip_distance_cube,is_jfk_airport,is_lg_airport,coord_arithmetic_mean,coord_harmonic_mean,coord_square_sum
0,N,1,1,-73.988609,40.748977,-73.992797,40.763408,2016-03-23,02:24,2016-03-23 02:24:00,...,4.403631,0.970435,0.824294,1.304712,1.687071,0,0,-16.617255,52.560538,14271.382828


In [75]:
trip['month'] = trip.pickup_datetime.dt.month
trip['weekday'] = trip.pickup_datetime.dt.weekday
trip['hour'] = trip.pickup_datetime.dt.hour
trip['minute'] = trip.pickup_datetime.dt.minute

trip

Unnamed: 0,store_and_fwd_flag,vendor_id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,date,time,pickup_datetime,...,log_trip_distance_cube,is_jfk_airport,is_lg_airport,coord_arithmetic_mean,coord_harmonic_mean,coord_square_sum,month,weekday,hour,minute
0,N,1,1,-73.988609,40.748977,-73.992797,40.763408,2016-03-23,02:24,2016-03-23 02:24:00,...,1.687071,0,0,-16.617255,52.560538,14271.382828,3,2,2,24


In [76]:
def get_season(month):
    if month in [12, 1, 2]:
        return 0  # Winter
    elif month in [3, 4, 5]:
        return 1  # Spring
    elif month in [6, 7, 8]:
        return 2  # Summer
    else:
        return 3  # Fall (September, October, November)

trip['season'] = trip['month'].apply(get_season)

In [77]:
trip["is_summer"] = (trip["season"] == 2).astype("int")
trip["is_rush_hour"] = ((trip["hour"].between(7, 9)) | (trip["hour"].between(13, 19))).astype("int")
trip["is_night"] = ((trip["hour"] > 1) & (trip["hour"] < 6)).astype("int")
trip["is_weekend"] =  ((trip["weekday"] // 5) == 1).astype("int")

trip

Unnamed: 0,store_and_fwd_flag,vendor_id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,date,time,pickup_datetime,...,coord_square_sum,month,weekday,hour,minute,season,is_summer,is_rush_hour,is_night,is_weekend
0,N,1,1,-73.988609,40.748977,-73.992797,40.763408,2016-03-23,02:24,2016-03-23 02:24:00,...,14271.382828,3,2,2,24,1,0,0,1,0


In [78]:
BASE_SPEED = 32

trip['virtual_speed'] = BASE_SPEED / (2 ** (
                        (trip['is_jfk_airport'] | trip["is_lg_airport"]).astype("int") + # cast bool to int
                        (trip['is_rush_hour']).astype("int") +
                        (trip['is_summer']).astype("int") + 
                        (trip['store_and_fwd_flag'] == 'Y').astype("int")
                        ))

trip['virtual_time'] = trip['log_trip_distance'] / trip['virtual_speed']

# Same ratio, using the square root of the distance instead of its log
trip["virtual_time_dist_sqrt"] = trip['trip_distance_sqrt'] / trip['virtual_speed']

trip

Unnamed: 0,store_and_fwd_flag,vendor_id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,date,time,pickup_datetime,...,hour,minute,season,is_summer,is_rush_hour,is_night,is_weekend,virtual_speed,virtual_time,virtual_time_dist_sqrt
0,N,1,1,-73.988609,40.748977,-73.992797,40.763408,2016-03-23,02:24,2016-03-23 02:24:00,...,2,24,1,0,0,1,0,32.0,0.030326,0.040008
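
The `virtual_speed` formula above halves `BASE_SPEED` once for every active slowdown flag (airport, rush hour, summer, store-and-forward). The small sketch below makes that arithmetic explicit. The helper `virtual_speed` is a hypothetical name introduced only for illustration.

```python
BASE_SPEED = 32


def virtual_speed(n_active_flags: int) -> float:
    """Speed after halving BASE_SPEED once per active flag (0 to 4 flags)."""
    return BASE_SPEED / (2 ** n_active_flags)


# One entry per possible number of active flags
speeds = [virtual_speed(n) for n in range(5)]

# The example trip has no active flags, so its log distance is divided by 32
sample_virtual_time = 0.970435 / virtual_speed(0)
```

With no flags set, `sample_virtual_time` reproduces the `virtual_time` value of about 0.030326 shown in the output above.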


In [79]:
def drop_col(df, col):
    if col in df.columns:
        df.drop(col, axis=1, inplace=True)
    else:
        print(f"[Warning] Column not found: {col}")
    return df

cols_to_drop = [ 
                    'time', 'date', 'pickup_datetime', 'store_and_fwd_flag',
                    'dropoff_latitude', 'dropoff_longitude', 'pickup_latitude', 'pickup_longitude', 
                    'is_weekend', 'is_rush_hour', 'is_summer', 'is_night', 'virtual_speed', 
                ] 

for col in cols_to_drop:
    trip = drop_col(trip, col)

trip

Unnamed: 0,vendor_id,passenger_count,trip_distance,trip_distance_sqrt,trip_distance_square,trip_distance_cube,log_trip_distance,log_trip_distance_sqrt,log_trip_distance_square,log_trip_distance_cube,...,coord_arithmetic_mean,coord_harmonic_mean,coord_square_sum,month,weekday,hour,minute,season,virtual_time,virtual_time_dist_sqrt
0,1,1,1.639093,1.280271,2.686627,4.403631,0.970435,0.824294,1.304712,1.687071,...,-16.617255,52.560538,14271.382828,3,2,2,24,1,0.030326,0.040008


In [80]:
trip.shape  # expected: (1, 22), i.e. one row with the 22 model features

(1, 22)
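
Checking the shape alone does not catch a missing or misordered column. A stricter check could look like the sketch below. The list `EXPECTED_COLUMNS` reflects the column order produced in this notebook. Whether it matches the order used at training time is an assumption to verify against the pipeline, and `align_features` is a hypothetical helper name.

```python
import pandas as pd

# Column order as produced in this notebook (assumed to match training)
EXPECTED_COLUMNS = [
    "vendor_id", "passenger_count",
    "trip_distance", "trip_distance_sqrt", "trip_distance_square", "trip_distance_cube",
    "log_trip_distance", "log_trip_distance_sqrt",
    "log_trip_distance_square", "log_trip_distance_cube",
    "is_jfk_airport", "is_lg_airport",
    "coord_arithmetic_mean", "coord_harmonic_mean", "coord_square_sum",
    "month", "weekday", "hour", "minute", "season",
    "virtual_time", "virtual_time_dist_sqrt",
]


def align_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return df restricted to the expected columns, in the expected order."""
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing feature columns: {missing}")
    # Selecting by list drops any extra columns and fixes the order
    return df[EXPECTED_COLUMNS]
```

Calling `align_features(trip)` just before `predict` would make the API fail loudly on a schema mismatch instead of passing a malformed frame to the pipeline.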

In [81]:
import joblib

MODEL_PATH = '../models/final_ridge_pipeline.pkl'

def load_model(path):
    saved = joblib.load(path)
    model = saved["model"]
    train_iqr = saved["train_iqr"]
    return model, train_iqr

model_pipeline, _ = load_model(MODEL_PATH)

model_pipeline

In [82]:
log_trip_duration = model_pipeline.predict(trip)[0]
trip_duration = np.expm1(log_trip_duration).round()

trip_duration  # predicted seconds; actual duration: 437 seconds

421.0

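The predicted duration of 421 seconds is within 16 seconds (about 4%) of the actual 437 seconds for this trip. Our model achieved an R² score of 0.69, so some error is expected, and a single example is only a sanity check. Still, it confirms that the transformation steps run end to end and produce input the pipeline accepts. This marks the successful completion of the preparation phase, and we are now ready to move on to building the API and applying the work done here.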