In [19]:
import sklearn

In [20]:
sklearn.__version__

'1.0.2'

In [3]:
import pickle
import pandas as pd

In [22]:
pd.__version__

'1.4.2'

In [4]:
with open('model.bin', 'rb') as f_in:
    dv, lr = pickle.load(f_in)

## Question 1:

Run this notebook for the February 2021 FHV data.

In [5]:
categorical = ['PUlocationID', 'DOlocationID']

def read_data(filename):
    df = pd.read_parquet(filename)
    
    df['duration'] = df.dropOff_datetime - df.pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    
    return df

In [6]:
df = read_data('https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2021-02.parquet')

In [9]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = lr.predict(X_val)

In [10]:
y_pred.mean()

16.191691679979066

What's the mean predicted duration for this dataset?

    11.19
    16.19
    21.19
    26.19

> **ANSWER:** 16.19

## Q2. Preparing the output
Like in the course videos, we want to prepare the dataframe with the output.

First, let's create an artificial `ride_id` column:

```py
    df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
```

Next, **write the ride id** and the **predictions to a dataframe with results**.

Save it as parquet:
``` py
    df_result.to_parquet(
        output_file,
        engine='pyarrow',
        compression=None,
        index=False
    )
```

What's the size of the output file?

- 9M
- 19M
- 29M
- 39M

Make sure you use the snippet above for saving the file. It should contain only these two columns. For this question, don't change the dtypes of the columns and use pyarrow, not fastparquet.

In [11]:
df.head()

| | dispatching_base_num | pickup_datetime | dropOff_datetime | PUlocationID | DOlocationID | SR_Flag | Affiliated_base_number | duration |
|---|---|---|---|---|---|---|---|---|
| 1 | B00021 | 2021-02-01 00:55:40 | 2021-02-01 01:06:20 | 173 | 82 | | B00021 | 10.666667 |
| 2 | B00021 | 2021-02-01 00:14:03 | 2021-02-01 00:28:37 | 173 | 56 | | B00021 | 14.566667 |
| 3 | B00021 | 2021-02-01 00:27:48 | 2021-02-01 00:35:45 | 82 | 129 | | B00021 | 7.95 |
| 4 | B00037 | 2021-02-01 00:12:50 | 2021-02-01 00:26:38 | -1 | 225 | | B00037 | 13.8 |
| 5 | B00037 | 2021-02-01 00:00:37 | 2021-02-01 00:09:35 | -1 | 61 | | B00037 | 8.966667 |


We will take the `year` and `month` for the `ride_id` from the `pickup_datetime` column.

In [12]:
df["pickup_datetime"].describe()


count                  990113
unique                 699320
top       2021-02-24 08:00:00
freq                       83
first     2021-02-01 00:00:23
last      2021-02-28 23:59:55
Name: pickup_datetime, dtype: object

In [13]:
year = df["pickup_datetime"].dt.year.astype('int').values[0]
month = df["pickup_datetime"].dt.month.astype('int').values[0]

df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')

In [14]:
df["ride_id"]

1                2021/02_1
2                2021/02_2
3                2021/02_3
4                2021/02_4
5                2021/02_5
                ...       
1037687    2021/02_1037687
1037688    2021/02_1037688
1037689    2021/02_1037689
1037690    2021/02_1037690
1037691    2021/02_1037691
Name: ride_id, Length: 990113, dtype: object

In [15]:
df_result = pd.DataFrame()
df_result["ride_id"] = df["ride_id"]
df_result["predictions"] = y_pred
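
The same steps can be checked on a tiny frame before running them on the full month. This is a minimal sketch, not the notebook's own code: `df_toy`, `y_pred_toy`, and `df_result_toy` are hypothetical names and the prediction values are made up. It assumes, as above, that every row belongs to the same month, so the first row is enough to get `year` and `month`.

```python
import numpy as np
import pandas as pd

# Toy frame whose index has gaps, as it does after the duration filter.
df_toy = pd.DataFrame(
    {"pickup_datetime": pd.to_datetime(["2021-02-01 00:55:40",
                                        "2021-02-01 00:14:03"])},
    index=[1, 4],
)
y_pred_toy = np.array([10.5, 14.2])  # made-up predictions, one per row

# Year and month come from the first row (assumes a single-month file).
year = int(df_toy["pickup_datetime"].dt.year.iloc[0])
month = int(df_toy["pickup_datetime"].dt.month.iloc[0])

# Only the two columns the question asks for.
df_result_toy = pd.DataFrame({
    "ride_id": f"{year:04d}/{month:02d}_" + df_toy.index.astype("str"),
    "predictions": y_pred_toy,
})
print(df_result_toy)
```

Because `y_pred_toy` is a NumPy array, it is matched to the rows by position, so the gaps in the original index do not matter.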

In [75]:
!pip install pyarrow



In [16]:
taxi_type = "fvh"
output_file = f"{taxi_type}-{year}-{month}_prediction.parquet"

df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)

In [18]:
!ls -la ./output

total 19252
drwxrwxrwx 1 bengsoon bengsoon      512 Jun 27 09:54 .
drwxrwxrwx 1 bengsoon bengsoon      512 Jun 27 22:28 ..
-rwxrwxrwx 1 bengsoon bengsoon 19711443 Jun 27 09:54 fvh_2021_02_prediction.parquet


> **ANSWER:** 19.7 MB
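
As a cross-check that does not depend on `ls`, the size can also be read from Python. The sketch below is an illustration only: `file_size_mb` is a hypothetical helper, and the byte count is the one shown in the listing above. It also shows why the answer depends on the unit: the same file is about 19.7 MB in decimal megabytes and about 18.8 MiB in binary units.

```python
import os


def file_size_mb(path):
    """Return the size of a file in megabytes (1 MB = 1,000,000 bytes)."""
    return os.path.getsize(path) / 1_000_000


# Byte count reported by `ls -la` above, converted both ways.
size_bytes = 19711443
print(f"{size_bytes / 1_000_000:.1f} MB")  # decimal megabytes: 19.7 MB
print(f"{size_bytes / 2**20:.1f} MiB")     # binary mebibytes: 18.8 MiB
```

Calling `file_size_mb` with the path of the saved parquet file should give the same decimal figure.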

## Q4. Virtual environment
Now let's put everything into a virtual environment. We'll use pipenv for that.

Install all the required libraries. Pay attention to the Scikit-Learn version: check the starter notebook for details.

After installing the libraries, pipenv creates two files: Pipfile and Pipfile.lock. The Pipfile.lock file keeps the hashes of the dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency?

In [24]:
!ls

Dockerfile  Pipfile  Pipfile.lock  model.bin  output  predict.py  starter.ipynb


> **ANSWER:** sha256:08ef968f6b72033c16c479c966bf37ccd49b06ea91b765e1cc27afefe723920b
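
The notebook does not show how the hash was looked up. Since `Pipfile.lock` is a JSON file, one way to do it is sketched below. `first_hash` is a hypothetical helper, and the sketch assumes the package is listed under the `default` section with a `hashes` list, which is the usual layout for a regular (non-dev) dependency.

```python
import json
import os


def first_hash(lock_path, package, section="default"):
    """Return the first hash listed for `package` in a Pipfile.lock file."""
    with open(lock_path) as f_in:
        lock = json.load(f_in)
    return lock[section][package]["hashes"][0]


# Only runs if a lock file is present in the current directory.
if os.path.exists("Pipfile.lock"):
    print(first_hash("Pipfile.lock", "scikit-learn"))
```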

## Q5. Parametrize the script
Let's now make the script configurable via CLI. We'll create two parameters: year and month.

Run the script for March 2021.

What's the mean predicted duration?

- 11.29
- 16.29
- 21.29
- 26.29

Hint: just add a print statement to your script.

In [27]:
!cat predict.py

#!/usr/bin/env python
# coding: utf-8

import pickle
import pandas as pd
import click


with open('model.bin', 'rb') as f_in:
    dv, lr = pickle.load(f_in)
    

def read_data(filename):
    print(f"Reading data from {filename}")
    categorical = ['PUlocationID', 'DOlocationID']
    df = pd.read_parquet(filename)
    
    df['duration'] = df.dropOff_datetime - df.pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')

    dicts = df[categorical].to_dict(orient='records')
    
    return df, dicts

def save_prediction(df, y_pred):
    year = df["pickup_datetime"].dt.year.astype('int').values[0]
    month = df["pickup_datetime"].dt.month.astype('int').values[0]

    df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')

    print(f"Generating dataframe from prediction with ride-id")
    df_result = pd.DataFrame()
   

In [33]:
%%bash
python predict.py --year 2021 --month 3

Generating dataframe from prediction with ride-id
Saving prediction output to output/fvh-2021-3_prediction.parquet
Reading data from https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2021-03.parquet
Making prediction
Prediction Mean: 16.298821614015107
Generating dataframe from prediction with ride-id
Saving prediction output to output/fvh-2021-3_prediction.parquet


> **ANSWER:** 16.2988
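
The `predict.py` listing above is cut off before the command-line handling, so the part that reads `--year` and `--month` is not visible. The sketch below shows one way those two parameters could be parsed and turned into the input URL. It is an assumption, not the script's actual code: the script imports `click`, while this example uses `argparse` from the standard library so that it runs without extra packages, and `parse_args`, `input_url`, and `URL_TEMPLATE` are hypothetical names.

```python
import argparse

# Same URL pattern as the one printed by the script; the month is zero-padded.
URL_TEMPLATE = (
    "https://nyc-tlc.s3.amazonaws.com/trip+data/"
    "fhv_tripdata_{year:04d}-{month:02d}.parquet"
)


def parse_args(argv=None):
    """Parse --year and --month from the command line (or from `argv`)."""
    parser = argparse.ArgumentParser(description="Score FHV trips for one month.")
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument("--month", type=int, required=True, choices=range(1, 13))
    return parser.parse_args(argv)


def input_url(year, month):
    """Build the parquet URL for the given year and month."""
    return URL_TEMPLATE.format(year=year, month=month)


# Equivalent to: python predict.py --year 2021 --month 3
args = parse_args(["--year", "2021", "--month", "3"])
print(input_url(args.year, args.month))
```

Passing an explicit list to `parse_args` keeps the example runnable inside a notebook; in a script, calling `parse_args()` with no argument reads the real command line.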

## Question 6:

Now run the script with Docker. What's the mean predicted duration for April 2021?

- 9.96
- 16.55
- 25.96
- 36.55

In [34]:
!cat Dockerfile

FROM agrigorev/zoomcamp-model:mlops-3.9.7-slim

RUN pip install -U pip
RUN pip install pipenv

WORKDIR /app

COPY [ "Pipfile", "Pipfile.lock", "./" ]

RUN pipenv install --system --deploy

COPY [ "predict.py", "./" ]



ENTRYPOINT ["python", "predict.py", "--year", "2021", "--month", "4"]

I've already built the Docker image locally under the tag `ride-duration-prediction:v1`, so I will just run it here.

In [35]:
!docker run ride-duration-prediction:v1

Reading data from https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2021-04.parquet
Making prediction
Prediction Mean: 9.967573179784523


> **ANSWER:** 9.96