# Q1. Notebook

Run this notebook for the February 2021 FHV data.

What's the mean predicted duration for this dataset?

In [4]:
import os
import pickle
import pandas as pd


PATH = "../src/"

with open(os.path.join(PATH, 'model.bin'), 'rb') as f_in:
    dv, lr = pickle.load(f_in)

categorical = ['PUlocationID', 'DOlocationID']

def read_data(filename):
    df = pd.read_parquet(filename)
    
    df['duration'] = df.dropOff_datetime - df.pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    
    return df

df = read_data(os.path.join(PATH, "data/fhv_tripdata_2021-02.parquet"))

dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = lr.predict(X_val)

y_pred.mean()

16.191691679979066

What's the mean predicted duration for this dataset? 

**16.19**
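
As a sanity check, the filtering and encoding done in `read_data` can be reproduced on a tiny synthetic frame. This is only a sketch: the three trips and their location IDs below are made-up values, not rows from the FHV file.

```python
import pandas as pd

# Three synthetic trips (hypothetical values): 30 seconds, 10 minutes, 90 minutes.
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2021-02-01 00:00:00"] * 3),
    "dropOff_datetime": pd.to_datetime([
        "2021-02-01 00:00:30",
        "2021-02-01 00:10:00",
        "2021-02-01 01:30:00",
    ]),
    "PUlocationID": [None, 12.0, 7.0],
    "DOlocationID": [3.0, None, 9.0],
})

# Duration in minutes, as in read_data.
toy["duration"] = (toy.dropOff_datetime - toy.pickup_datetime).dt.total_seconds() / 60

# Keep only trips between 1 and 60 minutes (inclusive).
kept = toy[(toy.duration >= 1) & (toy.duration <= 60)].copy()

# Missing location IDs become the string "-1"; the rest become integer strings.
cols = ["PUlocationID", "DOlocationID"]
kept[cols] = kept[cols].fillna(-1).astype("int").astype("str")

print(kept[["duration"] + cols])
```

Only the 10-minute trip survives the filter, and its missing drop-off ID is encoded as `"-1"`, which is the form the `DictVectorizer` receives.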

# Q2. Preparing the output

In [42]:
year = df["pickup_datetime"].dt.year
month = df["pickup_datetime"].dt.strftime('%m')

In [48]:
df['ride_id'] = year.map(str) + "/" + month.map(str) + "_" + df.index.astype('str')
df['prediction'] = y_pred

In [50]:
df_result = df[['ride_id', 'prediction']]
df_result.head()

| | ride_id | prediction |
|---|---|---|
| 1 | 2021/02_1 | 14.539865 |
| 2 | 2021/02_2 | 13.740422 |
| 3 | 2021/02_3 | 15.593339 |
| 4 | 2021/02_4 | 15.188118 |
| 5 | 2021/02_5 | 13.817206 |
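
The `ride_id` format shown above (`YYYY/MM_index`) can also be built with an f-string, which makes the zero-padding of the month explicit. A minimal sketch on a two-row synthetic frame (the timestamps and index values are made up):

```python
import pandas as pd

# Two synthetic rows with a non-default index (hypothetical values).
toy = pd.DataFrame(
    {"pickup_datetime": pd.to_datetime(["2021-02-01 00:05:00", "2021-02-01 00:20:00"])},
    index=[1, 2],
)

year = toy["pickup_datetime"].dt.year
month = toy["pickup_datetime"].dt.month

# Zero-pad the month so that February becomes "02" rather than "2".
toy["ride_id"] = [
    f"{int(y):04d}/{int(m):02d}_{i}" for y, m, i in zip(year, month, toy.index)
]

print(toy["ride_id"].tolist())
```

The result matches the pattern in the table above, for example `2021/02_1`.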


In [52]:
output_file = os.path.join(PATH, "data", "results.parquet")
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)

In [53]:
file_stats = os.stat(output_file)
print(f'The file {output_file} has a size in Bytes of: {file_stats.st_size}')

The file ../src/data\results.parquet has a size in Bytes of: 19711435


What's the size of the output file?

**19.7 MB**
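
The byte count printed above can be converted to a readable size as follows. The helper name `format_size` is hypothetical; the sketch reports both decimal megabytes (10^6 bytes) and binary mebibytes (2^20 bytes), since the two differ noticeably at this scale.

```python
def format_size(num_bytes: int) -> str:
    """Return the size in decimal MB and binary MiB, one decimal place each."""
    mb = num_bytes / 1_000_000   # decimal megabytes
    mib = num_bytes / 2**20      # binary mebibytes
    return f"{mb:.1f} MB ({mib:.1f} MiB)"

# Byte count reported by os.stat for results.parquet.
print(format_size(19_711_435))
```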

# Q3. Creating the scoring script

In [None]:
! pip install jupyter
! pip install nbconvert

In [57]:
! jupyter nbconvert --to script ../src/starter.ipynb

[NbConvertApp] Converting notebook ../src/starter.ipynb to script
[NbConvertApp] Writing 860 bytes to ..\src\starter.py


# Q4. Virtual environment

In [58]:
! pip install pipenv

Collecting pipenv
  Downloading pipenv-2022.6.7-py2.py3-none-any.whl (3.9 MB)
Collecting virtualenv
  Downloading virtualenv-20.14.1-py2.py3-none-any.whl (8.8 MB)
Collecting virtualenv-clone>=0.2.5
  Downloading virtualenv_clone-0.5.7-py3-none-any.whl (6.6 kB)
Collecting pip>=22.0.4
  Downloading pip-22.1.2-py3-none-any.whl (2.1 MB)
Collecting distlib<1,>=0.3.1
  Downloading distlib-0.3.4-py2.py3-none-any.whl (461 kB)
Collecting platformdirs<3,>=2
  Downloading platformdirs-2.5.2-py3-none-any.whl (14 kB)
Collecting filelock<4,>=3.2
  Downloading filelock-3.7.1-py3-none-any.whl (10 kB)
Installing collected packages: platformdirs, filelock, distlib, virtualenv-clone, virtualenv, pip, pipenv
  Attempting uninstall: pip
    Found existing installation: pip 21.2.2
    Uninstalling pip-21.2.2:
      Successfully uninstalled pip-21.2.2
Successfully installed distlib-0.3.4 filelock-3.7.1 pip-22.1.2 pipenv-2022.6.7 platformdirs-2.5.2 virtualenv-20.14.1 virtualenv-clone-0.5.7


In [60]:
! pipenv install scikit-learn==1.0.2

Installing scikit-learn==1.0.2...
Installing dependencies from Pipfile.lock (9791c1)...
To activate this project's virtualenv, run pipenv shell.
Alternatively, run a command inside the virtualenv with pipenv run.


Creating a virtualenv for this project...

Pipfile: C:\Users\cami1\Escritorio\Github\mlops-zoomcamp\00-homework\04-deployment\notebooks\Pipfile

Using C:/Users/cami1/AppData/Local/Programs/Python/Python310/python.exe (3.10.0) to create virtualenv...


[    ] Creating virtual environment..

In [70]:
import json
  
with open('Pipfile.lock') as f:
    data = json.load(f)

In [73]:
library = "scikit-learn"
# Avoid the name `hash`, which would shadow the Python built-in.
first_hash = data["default"][library]["hashes"][0]
version = data["default"][library]["version"]
print(f"The library {library} with the version {version} has as first hash {first_hash}")

The library scikit-learn with the version ==1.0.2 has as first hash sha256:08ef968f6b72033c16c479c966bf37ccd49b06ea91b765e1cc27afefe723920b


What's the first hash for the Scikit-Learn dependency?

**sha256:08ef968f6b72033c16c479c966bf37ccd49b06ea91b765e1cc27afefe723920b**

# Q5. Parametrize the script

In [79]:
! python ../src/predict.py -y "2021" -m "03"

16.298821614015107


What's the mean predicted duration?

**16.30** (16.2988 before rounding)
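
The contents of `predict.py` are not shown here, so the following is only a sketch of how the `-y` and `-m` options used above could be parsed with `argparse`. The long option names, the helper `input_path`, and the `base` directory are assumptions for illustration, not the actual script.

```python
import argparse


def parse_args(argv=None):
    """Parse the year and month options (long names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Score one month of FHV trips.")
    parser.add_argument("-y", "--year", type=int, required=True)
    parser.add_argument("-m", "--month", type=int, required=True)
    return parser.parse_args(argv)


def input_path(year: int, month: int, base: str = "../src/data") -> str:
    """Build the parquet path for a given month (assumed directory layout)."""
    return f"{base}/fhv_tripdata_{year:04d}-{month:02d}.parquet"


# Same arguments as the command above; "03" is parsed as the integer 3.
args = parse_args(["-y", "2021", "-m", "03"])
print(input_path(args.year, args.month))
```

Parsing the month as an integer and re-padding it in the path means both `-m 3` and `-m "03"` resolve to the same file.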

# Q6. Docker container