### Homework

In this homework, we'll deploy the ride duration model in batch mode. Like in homework 1, we'll use the Yellow Taxi Trip Records dataset.

You'll find the starter code in the [homework](homework) directory.

### Q1. Notebook

We'll start with the same notebook we ended up with in homework 1. We cleaned it a little bit and kept only the scoring part. You can find the initial notebook [here](homework/starter.ipynb).

Run this notebook for the March 2023 data.

What's the standard deviation of the predicted duration for this dataset?

* 1.24
* 6.24
* 12.28
* 18.28
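As a minimal sketch of the quantity being asked for, assuming the predictions are available as a sequence of floats (the name `y_pred` and the toy values below are hypothetical), the standard deviation could be computed as follows. It uses the population form (dividing by N), which is also what NumPy's `std` does by default; whether the starter notebook uses that exact call is an assumption.

```python
import statistics

# Hypothetical toy predictions in minutes; in the notebook these would be
# the model's predictions for every ride in the month.
y_pred = [10.0, 12.0, 14.0, 16.0]

# Population standard deviation (divides by N, not N - 1).
std_pred = statistics.pstdev(y_pred)
print(f"The standard deviation of the predicted duration is {std_pred}")
```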

In [1]:
!jupyter nbconvert --log-level 50 --execute --stdout --to markdown --no-input ./homework/starter.ipynb | grep "standard deviation"

    The standard deviation of the predicted duration is 6.247488852238703


### Q2. Preparing the output

Like in the course videos, we want to prepare the dataframe with the output.

First, let's create an artificial `ride_id` column:

```python
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
```
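To see what these ids look like, here is a small plain-Python sketch of the same f-string applied to a few index values. The helper `make_ride_id` is hypothetical and only mirrors the formatting; the homework itself builds the column with pandas as shown above.

```python
def make_ride_id(year: int, month: int, index: int) -> str:
    """Hypothetical helper mirroring the f-string used for the ride_id column."""
    return f"{year:04d}/{month:02d}_{index}"


# Example values: March 2023, first three rows of the dataframe.
year, month = 2023, 3
ride_ids = [make_ride_id(year, month, i) for i in range(3)]
print(ride_ids)  # ['2023/03_0', '2023/03_1', '2023/03_2']
```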

Next, write the ride id and the predictions to a dataframe with results.

Save it as parquet:

```python
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)
```

What's the size of the output file?

* 36M
* 46M
* 56M
* 66M

__Note:__ Make sure you use the snippet above for saving the file. It should contain only these two columns. For this question, don't change the dtypes of the columns and use `pyarrow`, not `fastparquet`.
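One way to check the file size from Python is sketched below. The helper `file_size_mb` is hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption; the demo runs on a throwaway file so the block is runnable on its own.

```python
import os
import tempfile


def file_size_mb(path: str) -> float:
    """Return the size of a file in MB, taking 1 MB = 1024 * 1024 bytes (an assumption)."""
    return os.path.getsize(path) / (1024 * 1024)


# Demo on a temporary 2 MiB file; in the homework you would pass output_file instead.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (2 * 1024 * 1024))
demo_size = file_size_mb(tmp.name)
os.remove(tmp.name)

print(f"The size of the demo file is {demo_size} MB")
```

In the notebook, the same helper could be called as `file_size_mb(output_file)` right after `to_parquet`.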

In [2]:
!jupyter nbconvert --log-level 50 --execute --stdout --to markdown --no-input ./homework/starter.ipynb | grep "size"

    The size of the output file is 65.46185111999512 MB


### Q3. Creating the scoring script

Now let's turn the notebook into a script.

Which command do you need to execute for that?

In [3]:
!jupyter nbconvert --to script ./homework/starter.ipynb --no-prompt --output "prediction"

[NbConvertApp] Converting notebook ./homework/starter.ipynb to script
[NbConvertApp] Writing 1462 bytes to homework/prediction.py


### Q4. Virtual environment

Now let's put everything into a virtual environment. We'll use pipenv for that.

Install all the required libraries. Pay attention to the Scikit-Learn version: it should be the same as in the starter notebook.

After installing the libraries, pipenv creates two files: `Pipfile` and `Pipfile.lock`. The `Pipfile.lock` file keeps the hashes of the dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency?
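`Pipfile.lock` is a JSON file, so the hash can also be read with a few lines of Python. The sketch below assumes the package sits in the `default` section under the key `scikit-learn`; the helper `first_hash` and the demo lock file with made-up hashes are hypothetical and only there to keep the example runnable.

```python
import json
import os
import tempfile


def first_hash(lock_path: str, package: str) -> str:
    """Return the first hash listed for `package` in the lock file's default section."""
    with open(lock_path) as f_in:
        lock = json.load(f_in)
    return lock["default"][package]["hashes"][0]


# Demo with a made-up lock file; the hashes below are placeholders, not real ones.
demo_lock = {
    "default": {
        "scikit-learn": {"hashes": ["sha256:aaa", "sha256:bbb"], "version": "==0.0.0"}
    }
}
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "Pipfile.lock")
    with open(demo_path, "w") as f_out:
        json.dump(demo_lock, f_out)
    demo_hash = first_hash(demo_path, "scikit-learn")

print(demo_hash)
```

Against the real file, the call would be `first_hash("Pipfile.lock", "scikit-learn")`.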

In [4]:
!pipenv lock

requirements.txt found in /home/chishien/projects/mlops/dtc-mlops-zoomcamp/mlops-zoomcamp-2024/04-deployment instead of Pipfile! Converting...
✔ Success! Importing requirements...
We recommend updating your Pipfile to specify the "*" version, instead.
Locking [packages] dependencies...
Building requirements...
Resolving dependencies...
✔ Success! Locking packages...
Locking [dev-packages] dependencies...
Updated Pipfile.lock (f2bbe130ffbae427203c9609ce559bf378f98be367e290d7b1a9bd31bf4ac873)!


In [5]:
print('First hash for the Scikit-Learn dependency:')
!jq '.default."scikit-learn".hashes[0]' Pipfile.lock

First hash for the Scikit-Learn dependency:
"sha256:057b991ac64b3e75c9c04b5f9395eaf19a6179244c089afdebaad98264bff37c"


### Q5. Parametrize the script

Let's now make the script configurable via CLI. We'll create two parameters: year and month.

Run the script for April 2023.

What's the mean predicted duration?

* 7.29
* 14.29
* 21.29
* 28.29

Hint: just add a print statement to your script.
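A minimal sketch of the argument parsing, using the standard `argparse` module and the `--year` and `--month` flags, is shown below. The function name `parse_args` and the input file naming pattern are assumptions for illustration; adapt them to whatever your script already uses.

```python
import argparse


def parse_args(argv=None):
    """Parse the year and month to score; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Score ride durations for one month.")
    parser.add_argument("--year", type=int, required=True, help="four-digit year, e.g. 2023")
    parser.add_argument("--month", type=int, required=True, help="month number, 1-12")
    return parser.parse_args(argv)


# Explicit arguments for demonstration; inside the script, call parse_args()
# with no arguments so the values come from the command line.
args = parse_args(["--year", "2023", "--month", "4"])

# Assumed naming pattern for the monthly input file.
input_name = f"yellow_tripdata_{args.year:04d}-{args.month:02d}.parquet"
print(input_name)
```

After computing the predictions, a single `print` of their mean is enough to answer the question.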

In [6]:
!python homework/prediction_parameterized.py --year 2023 --month 4

Mean predicted duration: 14.292282936862449


### Q6. Docker container 

Finally, we'll package the script in a Docker container.
For that, you'll need to use a base image that we prepared.

This is the Dockerfile of that image:
```
FROM python:3.10.13-slim

WORKDIR /app
COPY [ "model2.bin", "model.bin" ]
```

Note: you don't need to build this image yourself. We have already done it.

It is pushed to [`agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim`](https://hub.docker.com/layers/agrigorev/zoomcamp-model/mlops-2024-3.10.13-slim/images/sha256-f54535b73a8c3ef91967d5588de57d4e251b22addcbbfb6e71304a91c1c7027f?context=repo), which you need to use as your base image.

That is, your Dockerfile should start with:
```docker
FROM agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim

# do stuff here
```

This image already has a pickle file with a dictionary vectorizer and a model. You will need to use them.

Important: don't copy the model to the docker image. You will need to use the pickle file already in the image. 
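Loading the bundled pickle from the script could look like the sketch below. It assumes the file stores a `(dv, model)` tuple and that the script runs from `/app`, so the relative path `model.bin` resolves to the file already in the image. The helper `load_model` is hypothetical, and the demo uses stand-in strings instead of real scikit-learn objects to stay self-contained.

```python
import os
import pickle
import tempfile


def load_model(path: str = "model.bin"):
    """Load the vectorizer and model, assuming the pickle stores a (dv, model) tuple."""
    with open(path, "rb") as f_in:
        dv, model = pickle.load(f_in)
    return dv, model


# Demo with stand-in objects; inside the container, load_model() would read
# the model.bin that the base image already provides.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "model.bin")
    with open(demo_path, "wb") as f_out:
        pickle.dump(("stand-in vectorizer", "stand-in model"), f_out)
    dv, model = load_model(demo_path)

print(dv, model)
```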

Now run the script with docker. What's the mean predicted duration for May 2023? 

* 0.19
* 7.24
* 14.24
* 21.19

In [7]:
!docker build -t homework4 . > /dev/null

[1A[1B[0G[?25l[+] Building 0.0s (0/0)  docker:default
[?25h[1A[0G[?25l[+] Building 0.0s (0/0)  docker:default
[?25h[1A[0G[?25l[+] Building 0.0s (0/0)  docker:default
[?25h[1A[0G[?25l[+] Building 0.0s (0/1)                                          docker:default
[?25h[1A[0G[?25l[+] Building 0.2s (1/2)                                          docker:default
[34m => [internal] load build definition from Dockerfile                       0.0s
[0m[34m => => transferring dockerfile: 323B                                       0.0s
[0m => [internal] load metadata for docker.io/agrigorev/zoomcamp-model:mlops  0.2s
[?25h[1A[1A[1A[1A[0G[?25l[+] Building 0.2s (1/3)                                          docker:default
[34m => [internal] load build definition from Dockerfile                       0.0s
[0m[34m => => transferring dockerfile: 323B                                       0.0s
[0m => [internal] load metadata for docker.io/agrigorev/zoomcamp-model:mlops 

In [8]:
!docker run --rm homework4:latest --year 2023 --month 5

Mean predicted duration: 0.19174419265916945
