### 🚕 Ride Duration Prediction – Batch Mode

In this notebook, I apply a pre-trained ride duration model to generate batch predictions for NYC Yellow Taxi trips from **March 2023**.  
The predictions are stored locally as a Parquet file and can optionally be uploaded to cloud storage for further processing.


#### 🔧 Environment Check

In [None]:
# findstr is Windows-specific; on Linux or macOS use: !pip freeze | grep scikit-learn
!pip freeze | findstr scikit-learn
!python -V

#### 📦 Load Model and Preprocessing Tools

In [None]:
import pickle
import pandas as pd

# Load model and DictVectorizer
with open('model.bin', 'rb') as f_in:
    dv, model = pickle.load(f_in)
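
The unpacking above implies that `model.bin` holds a single two-element tuple. A minimal sketch of that round trip follows, assuming the artifact was written with `pickle.dump((dv, model), ...)`. The names `fake_dv` and `fake_model` are hypothetical stand-ins, and a temporary file replaces the real `model.bin`.

```python
import os
import pickle
import tempfile

# Stand-ins for the real DictVectorizer and regression model (assumption:
# the original artifact was written as a single (dv, model) tuple).
fake_dv = {'kind': 'vectorizer'}
fake_model = {'kind': 'regressor'}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.bin')

    # Writing side: both objects go into one pickle as a tuple.
    with open(path, 'wb') as f_out:
        pickle.dump((fake_dv, fake_model), f_out)

    # Reading side: the same unpacking pattern used in this notebook.
    with open(path, 'rb') as f_in:
        loaded_dv, loaded_model = pickle.load(f_in)

print(loaded_dv, loaded_model)
```

scikit-learn does not guarantee that pickled estimators load correctly across library versions, which is why the environment check at the top of the notebook is worth running first.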

#### 🧼 Read & Prepare Input Data

In [None]:
categorical = ['PULocationID', 'DOLocationID']

def read_data(filename):
    df = pd.read_parquet(filename)

    # Calculate trip duration in minutes
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    # Filter trips with duration between 1 and 60 minutes
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # Prepare categorical features
    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')

    return df
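
To see what these steps do without downloading the full month, here is a small sketch that applies the same three transformations to a made-up four-row frame. The `toy` data is invented for illustration; only the column names come from the notebook.

```python
import pandas as pd

categorical = ['PULocationID', 'DOLocationID']

# Made-up trips lasting 30 s, 10 min, 90 min, and 20 min.
# The last trip has a missing pickup zone.
toy = pd.DataFrame({
    'tpep_pickup_datetime': pd.to_datetime([
        '2023-03-01 08:00:00', '2023-03-01 08:00:00',
        '2023-03-01 08:00:00', '2023-03-01 08:00:00',
    ]),
    'tpep_dropoff_datetime': pd.to_datetime([
        '2023-03-01 08:00:30', '2023-03-01 08:10:00',
        '2023-03-01 09:30:00', '2023-03-01 08:20:00',
    ]),
    'PULocationID': [132.0, 161.0, 48.0, None],
    'DOLocationID': [230.0, 236.0, 68.0, 79.0],
})

# Duration in minutes
toy['duration'] = (
    toy.tpep_dropoff_datetime - toy.tpep_pickup_datetime
).dt.total_seconds() / 60

# Keep trips of 1 to 60 minutes: drops the 30 s and 90 min rows
toy = toy[(toy.duration >= 1) & (toy.duration <= 60)].copy()

# Missing zone IDs become the string '-1'; the rest become integer strings
toy[categorical] = toy[categorical].fillna(-1).astype('int').astype('str')

print(toy[['PULocationID', 'DOLocationID', 'duration']])
```

Only the 10-minute and 20-minute trips survive, and they keep their original index labels (1 and 3) because the filter does not reset the index.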

In [None]:
# Load March 2023 trip data
df = read_data('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet')

#### 🧮 Generate Predictions

In [None]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = model.predict(X_val)

In [None]:
# Check standard deviation of predicted durations
round(y_pred.std(), 3)
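
One detail worth keeping in mind when comparing this number with other tools: assuming `y_pred` is a NumPy array, `.std()` computes the population standard deviation (`ddof=0`), whereas a pandas `Series.std()` defaults to the sample version (`ddof=1`). A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Made-up predictions, not output from the real model
toy_pred = np.array([10.0, 12.0, 14.0, 16.0])

np_std = toy_pred.std()               # NumPy default: ddof=0
pd_std = pd.Series(toy_pred).std()    # pandas default: ddof=1

print(round(np_std, 3), round(pd_std, 3))
```

On millions of rows the two values are nearly identical, but on small samples the difference is visible.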

#### 🆔 Add Ride ID

In [None]:
df['ride_id'] = (
    df['tpep_pickup_datetime'].dt.year.astype(str).str.zfill(4) + '/' +
    df['tpep_pickup_datetime'].dt.month.astype(str).str.zfill(2) + '_' +
    df.index.astype(str)
)
df['ride_id'].head()
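
As a quick illustration of the resulting format, the sketch below builds the same ID on a made-up frame whose index has gaps, mimicking what the duration filter leaves behind. The `toy` frame and its timestamps are invented for this example.

```python
import pandas as pd

# Index labels 0, 2, 5 imitate rows that survived filtering
toy = pd.DataFrame(
    {'tpep_pickup_datetime': pd.to_datetime([
        '2023-03-01 00:06:43',
        '2023-03-01 00:08:25',
        '2023-03-01 00:15:04',
    ])},
    index=[0, 2, 5],
)

toy['ride_id'] = (
    toy['tpep_pickup_datetime'].dt.year.astype(str).str.zfill(4) + '/' +
    toy['tpep_pickup_datetime'].dt.month.astype(str).str.zfill(2) + '_' +
    toy.index.astype(str)
)

print(toy['ride_id'].tolist())
print(toy['ride_id'].is_unique)
```

Because the suffix is the original row label, the IDs are not consecutive, but they stay unique as long as the index itself is unique.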

#### 💾 Save Predictions to Parquet

In [None]:
df_result = df[["ride_id"]].copy()
df_result['duration'] = y_pred

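import os

output_file = "output/result.parquet"
# Create the output directory first; to_parquet does not create missing folders
os.makedirs(os.path.dirname(output_file), exist_ok=True)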
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)

In [None]:
import os

size_bytes = os.path.getsize(output_file)
size_mb = size_bytes / (1024 * 1024)

print(f"File size: {size_bytes:,} bytes ({size_mb:.2f} MB)")
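
If the size check is needed for files of very different sizes, the conversion can be wrapped in a small helper. The function `human_size` below is a hypothetical convenience and not part of the original notebook; it uses binary units (1 KB = 1024 bytes), matching the division above.

```python
import os
import tempfile

def human_size(num_bytes):
    """Format a byte count using binary units (1 KB = 1024 bytes)."""
    size = float(num_bytes)
    for unit in ('bytes', 'KB', 'MB', 'GB'):
        # Stop at the first unit that keeps the value below 1024,
        # or at GB, the largest unit handled here.
        if size < 1024 or unit == 'GB':
            return f"{size:.2f} {unit}"
        size /= 1024

# Demonstrate on a throwaway 2048-byte file instead of the real output
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.bin')
    with open(demo_path, 'wb') as f_out:
        f_out.write(b'\0' * 2048)
    demo_size = human_size(os.path.getsize(demo_path))

print(demo_size)
```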