### 🚕 Ride Duration Prediction – Batch Mode

In this notebook, I apply a pre-trained ride duration model to NYC Yellow Taxi trip data for March 2023.  
The workflow is a batch inference run that reuses the key steps from a previous homework,  
with the styling and context adjusted for personal use. All important outputs are retained.

#### 🔧 Environment Check

In [1]:
!pip freeze | findstr scikit-learn

scikit-learn==1.7.0


In [2]:
!python -V

Python 3.10.9


#### 📦 Load Model and Tools

In [3]:
import pickle
import pandas as pd

#### 🧠 Load Trained Model (DictVectorizer + LinearRegression)

In [4]:
with open('model.bin', 'rb') as f_in:
    dv, model = pickle.load(f_in)

#### 📄 Read & Prepare Data

In [6]:
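categorical = ['PULocationID', 'DOLocationID']

def read_data(filename):
    df = pd.read_parquet(filename)
    # Trip duration in minutes, from the pickup and dropoff timestamps
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60
    # Keep only trips lasting between 1 and 60 minutes (inclusive)
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()
    # Encode location IDs as strings (missing -> '-1') so DictVectorizer one-hot encodes them
    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    return df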
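
To see what the duration filter in `read_data` does, here is a minimal sketch on a hand-made frame. The three trips are made up for illustration; only the column names match the taxi data.

```python
import pandas as pd

# Hypothetical toy trips lasting 30 seconds, 15 minutes, and 2 hours.
toy = pd.DataFrame({
    'tpep_pickup_datetime': pd.to_datetime(['2023-03-01 10:00:00'] * 3),
    'tpep_dropoff_datetime': pd.to_datetime([
        '2023-03-01 10:00:30',
        '2023-03-01 10:15:00',
        '2023-03-01 12:00:00',
    ]),
})

# Same steps as read_data: timedelta -> minutes -> keep 1..60 minutes.
toy['duration'] = toy.tpep_dropoff_datetime - toy.tpep_pickup_datetime
toy['duration'] = toy.duration.dt.total_seconds() / 60
kept = toy[(toy.duration >= 1) & (toy.duration <= 60)].copy()

print(kept.duration.tolist())
print(list(kept.index))
```

Only the 15-minute trip survives, and it keeps its original index label (1). That matters later in the notebook, because `ride_id` is built from `df.index`.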

In [7]:
df = read_data('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet')

#### 🧮 Run Predictions

In [8]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = model.predict(X_val)

In [9]:
round(y_pred.std(), 3)

np.float64(6.247)

#### 🆔 Create ride_id and Save Results

In [21]:
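import os

# ride_id format: <year>/<month>_<row index>, e.g. 2023/03_0
df['ride_id'] = (
    df['tpep_pickup_datetime'].dt.year.astype(str).str.zfill(4) + '/' +
    df['tpep_pickup_datetime'].dt.month.astype(str).str.zfill(2) + '_' +
    df.index.astype(str)
)
# Keep only the ID and the predicted duration
df_result = df[['ride_id']].copy()
df_result['duration'] = y_pred
output_file = "output/result.parquet"
# Create the output directory if it does not exist yet
os.makedirs(os.path.dirname(output_file), exist_ok=True)
df_result.to_parquet(output_file, engine='pyarrow', compression=None, index=False)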
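
To make the `ride_id` format concrete, here is a minimal sketch that applies the same expression to a hand-made frame. The two timestamps and the index labels are invented; the gap in the index mimics what a row filter leaves behind.

```python
import pandas as pd

# Hypothetical frame with non-contiguous index labels, as left by a row filter.
toy = pd.DataFrame(
    {'tpep_pickup_datetime': pd.to_datetime(['2023-03-01 00:06:43',
                                             '2023-03-31 23:59:00'])},
    index=[0, 7],
)

# Same expression as in the notebook: zero-padded year and month, then the index label.
toy['ride_id'] = (
    toy['tpep_pickup_datetime'].dt.year.astype(str).str.zfill(4) + '/' +
    toy['tpep_pickup_datetime'].dt.month.astype(str).str.zfill(2) + '_' +
    toy.index.astype(str)
)

ride_ids = toy['ride_id'].tolist()
print(ride_ids)
```

The IDs come out as `2023/03_0` and `2023/03_7`: the suffix is the original row label, so IDs are unique within one month's file but not consecutive after filtering.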

In [22]:
import os
size_bytes = os.path.getsize(output_file)
size_mb = size_bytes / (1024 * 1024)
print(f"File size: {size_bytes:,} bytes ({size_mb:.2f} MB)")

File size: 68,640,820 bytes (65.46 MB)


#### 🧪 Dependency Hash Check

In [49]:
from json import load
with open("Pipfile.lock", "rb") as file_in:
    pipfile = load(file_in)
    print(pipfile["default"]["scikit-learn"]["hashes"][0])

sha256:057b991ac64b3e75c9c04b5f9395eaf19a6179244c089afdebaad98264bff37c
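
The lookup above assumes the package is listed under `default` and has at least one hash. A slightly more defensive sketch is shown below; the helper name `first_hash`, the package name, and the hash strings are placeholders made up for illustration, and the lock content is built in memory instead of being read from `Pipfile.lock`.

```python
import json
from typing import Optional

# Hypothetical lock-file content with placeholder hashes.
lock_text = json.dumps({
    "default": {
        "example-package": {
            "hashes": ["sha256:aaaa", "sha256:bbbb"],
            "version": "==1.0.0",
        }
    }
})

def first_hash(lock: dict, package: str, section: str = "default") -> Optional[str]:
    """Return the first recorded hash for a package, or None if it is absent."""
    hashes = lock.get(section, {}).get(package, {}).get("hashes", [])
    return hashes[0] if hashes else None

lock = json.loads(lock_text)
found = first_hash(lock, "example-package")
missing = first_hash(lock, "not-installed")
print(found, missing)
```

Returning `None` for a missing package avoids a `KeyError` when the dependency lives in another section of the lock file or has not been locked yet.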


#### 🐳 Run Dockerized Prediction Script

In [56]:
!docker build -t ride-duration-prediction:v1 . > /dev/null 2>&1
!docker run --rm ride-duration-prediction:v1

The mean predicted duration is 0.189 minutes
