# Homework 4

In [1]:
import pickle
import pandas as pd
import numpy as np
from pathlib import Path

In [2]:
!pip freeze | grep scikit-learn

[01;31m[Kscikit-learn[m[K @ file:///Users/runner/miniforge3/conda-bld/[01;31m[Kscikit-learn[m[K_1716489793468/work/dist/scikit_learn-1.5.0-cp311-cp311-macosx_10_13_x86_64.whl#sha256=8a2a2018571342ef2b1f7b0b62310b5796b5ccfbf7f565a6e689d9acec417aae


In [3]:
!python -V

Python 3.11.8


In [4]:
with open('model.bin', 'rb') as f_in:
    dv, model = pickle.load(f_in)

In [5]:
categorical = ['PULocationID', 'DOLocationID']

def read_data(filename):
    df = pd.read_parquet(filename)
    
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    
    return df

In [6]:
df = read_data('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet')

In [7]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = model.predict(X_val)

## Question 1

What's the standard deviation of the predicted duration for this dataset?

* 1.24
* **6.24**
* 12.28
* 18.28

In [8]:
print(f'The standard deviation is {round(np.std(y_pred), 2)}.')

The standard deviation is 6.25.


## Question 2

What's the size of the output file?

* 36M
* 46M
* 56M
* **66M**

In [9]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee,duration
0,2,2023-03-01 00:06:43,2023-03-01 00:16:43,1.0,0.0,1.0,N,238,42,2,8.6,1.0,0.5,0.0,0.0,1.0,11.1,0.0,0.0,10.0
1,2,2023-03-01 00:08:25,2023-03-01 00:39:30,2.0,12.4,1.0,N,138,231,1,52.7,6.0,0.5,12.54,0.0,1.0,76.49,2.5,1.25,31.083333
2,1,2023-03-01 00:15:04,2023-03-01 00:29:26,0.0,3.3,1.0,N,140,186,1,18.4,3.5,0.5,4.65,0.0,1.0,28.05,2.5,0.0,14.366667
3,1,2023-03-01 00:49:37,2023-03-01 01:01:05,1.0,2.9,1.0,N,140,43,1,15.6,3.5,0.5,4.1,0.0,1.0,24.7,2.5,0.0,11.466667
4,2,2023-03-01 00:08:04,2023-03-01 00:11:06,1.0,1.23,1.0,N,79,137,1,7.2,1.0,0.5,2.44,0.0,1.0,14.64,2.5,0.0,3.033333


In [10]:
year, month = 2023, 3
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')

In [11]:
df_results = pd.concat([df['ride_id'], pd.DataFrame(y_pred)], axis=1)

In [12]:
df_results.to_parquet(
    'output.parquet',
    engine='pyarrow',
    compression=None,
    index=False
)

In [13]:
print(f'''The size of the output file is {round(Path('output.parquet').stat().st_size/(1024*1024), 2)} MB.''')

The size of the output file is 65.46 MB.


In [14]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee,duration,ride_id
0,2,2023-03-01 00:06:43,2023-03-01 00:16:43,1.0,0.0,1.0,N,238,42,2,...,1.0,0.5,0.0,0.0,1.0,11.1,0.0,0.0,10.0,2023/03_0
1,2,2023-03-01 00:08:25,2023-03-01 00:39:30,2.0,12.4,1.0,N,138,231,1,...,6.0,0.5,12.54,0.0,1.0,76.49,2.5,1.25,31.083333,2023/03_1
2,1,2023-03-01 00:15:04,2023-03-01 00:29:26,0.0,3.3,1.0,N,140,186,1,...,3.5,0.5,4.65,0.0,1.0,28.05,2.5,0.0,14.366667,2023/03_2
3,1,2023-03-01 00:49:37,2023-03-01 01:01:05,1.0,2.9,1.0,N,140,43,1,...,3.5,0.5,4.1,0.0,1.0,24.7,2.5,0.0,11.466667,2023/03_3
4,2,2023-03-01 00:08:04,2023-03-01 00:11:06,1.0,1.23,1.0,N,79,137,1,...,1.0,0.5,2.44,0.0,1.0,14.64,2.5,0.0,3.033333,2023/03_4


## Q3. Creating the scoring script

Now let's turn the notebook into a script. Which command you need to execute for that?

The command is: `jupyter nbconvert --to script starter.ipynb`.

## Q4. Virtual environment

Now let's put everything into a virtual environment. We'll use pipenv for that. Install all the required libraries. Pay attention to the Scikit-Learn version: it should be the same as in the starter notebook.

After installing the libraries, pipenv creates two files: `Pipfile` and `Pipfile.lock`. The `Pipfile.lock` file keeps the hashes of the dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency? It is: `sha256:057b991ac64b3e75c9c04b5f9395eaf19a6179244c089afdebaad98264bff37c`.

## Q5. Parametrize the script

Let's now make the script configurable via CLI. We'll create two parameters: year and month. Run the script for April 2023. What's the mean predicted duration?

* 7.29
* **14.29**
* 21.29
* 28.29

Hint: just add a print statement to your script.

The mean prediction is 14.29

## Q6. Docker container

Now run the script with docker. What's the mean predicted duration for May 2023?

* **0.19**
* 7.24
* 14.24
* 21.19

See Dockerfile build with `starter.py` script which uses the model already contained in the Docker build.