# Q1. Refactoring

Before we can start covering our code with tests, we need to 
refactor it. We'll start by getting rid of all the global variables. 

* Let's create a function `main` with two parameters: `year` and
`month`.
* Move all the code (except `read_data`) inside `main`
* Make `categorical` a parameter for `read_data` and pass it inside `main`

Now we need to create the "main" block from which we'll invoke
the main function. What does the `if` statement that we use for
this look like? 


Hint: after refactoring, check that the code still works. Just run
it, e.g. for Feb 2021, and see if it finishes successfully. 

To make it easier to run it, you can write results to your local
filesystem.

In [14]:
! python batch.py "2021" "03"

starting
predicted mean duration: 16.298821614015107
Saved on taxi_type=fhv_year=2021_month=03.parquet
ending


In [15]:
! ls

batch.py
data
homework.ipynb
model.bin
taxi_type=fhv_year=2021_month=03.parquet
tests


As suggested, the `taxi_type=fhv_year=2021_month=03.parquet` file is there after executing the refactored code.

What does the `if` statement that we use for this look like?

`if __name__ == "__main__":`
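
To make the refactoring steps above concrete, here is a minimal sketch of how the script could be laid out afterwards. It is not the actual `batch.py`: `read_data` is a placeholder that loads nothing, the input file name is hypothetical, and the fallback to March 2021 exists only so the sketch runs without arguments. The column names are the ones that show up in the test dataframe in Q3.

```python
import sys


def read_data(filename, categorical):
    """Placeholder: the real function would load a parquet file.

    It only echoes its arguments so the sketch runs without data files.
    """
    return {"filename": filename, "categorical": list(categorical)}


def main(year, month):
    # Former globals now live inside main() and are passed on explicitly.
    categorical = ["PUlocationID", "DOlocationID"]
    # Hypothetical file name, for illustration only.
    input_file = f"fhv_tripdata_{year:04d}-{month:02d}.parquet"
    data = read_data(input_file, categorical)
    print(f"would read {data['filename']} with categorical={data['categorical']}")
    return data


if __name__ == "__main__":
    # Usage: python batch.py 2021 03
    # Falls back to March 2021 when no arguments are given (assumption,
    # only so that the sketch runs on its own).
    args = sys.argv[1:]
    year, month = (int(args[0]), int(args[1])) if len(args) == 2 else (2021, 3)
    main(year, month)
```

With this layout nothing runs on import, which is what lets a test module import `main` and `read_data` without triggering a full batch run.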

# Q2. Installing pytest

In [16]:
! pipenv install --dev pytest

Installing pytest...
Installing dependencies from Pipfile.lock (2e21b6)...
To activate this project's virtualenv, run pipenv shell.
Alternatively, run a command inside the virtualenv with pipenv run.


Creating a virtualenv for this project...

Pipfile: C:\Users\cami1\Escritorio\Github\mlops-zoomcamp\00-homework\06-best-practices\notebooks\Pipfile

Using C:/Users/cami1/AppData/Local/Programs/Python/Python310/python.exe (3.10.0) to create virtualenv...



Next, create a folder `tests` and create two files in it. One will be the file with tests. We can name it `test_batch.py`.

What should be the other file?

Hint: to be able to test batch.py, we need to be able to import it. Without this other file, we won't be able to do it.

In [18]:
! ls tests

__init__.py
test_batch.py


# Q3. Writing first unit test

How many rows should there be in the expected dataframe?

In [10]:
! pytest tests\test_batch.py

platform win32 -- Python 3.10.0, pytest-7.1.2, pluggy-1.0.0
rootdir: c:\Users\cami1\Escritorio\Github\mlops-zoomcamp\00-homework\06-best-practices\notebooks
collected 1 item

tests\test_batch.py .                                                    [100%]



In [12]:
! python tests\test_batch.py

   PUlocationID  DOlocationID     pickup_datetime    dropOff_datetime
2           1.0           1.0 2021-01-01 01:02:00 2021-01-01 01:02:50
0           NaN           NaN 2021-01-01 01:02:00 2021-01-01 01:10:00
1           1.0           1.0 2021-01-01 01:02:00 2021-01-01 01:10:00
3           1.0           1.0 2021-01-01 01:02:00 2021-01-01 02:02:01
# rows: 4


4 rows

# Q4. Mocking S3 with Localstack

With AWS CLI, this is how we create a bucket:

```bash
aws s3 mb s3://nyc-duration
```

Adjust it for Localstack. What does the command look like?

Check that the bucket was successfully created. With AWS, this is how we typically do it:

```bash
aws s3 ls
```

In [None]:
# In an external terminal
! docker-compose up

In [71]:
! aws --endpoint-url=http://localhost:4566 s3 ls

In [72]:
! aws --endpoint-url=http://localhost:4566 s3 mb s3://nyc-duration

make_bucket: nyc-duration


In [73]:
! aws --endpoint-url=http://localhost:4566 s3 ls

2022-07-20 20:06:25 nyc-duration


Adjust it for Localstack. What does the command look like?
```bash
aws --endpoint-url=http://localhost:4566 s3 mb s3://nyc-duration
```

In [13]:
! export INPUT_FILE_PATTERN="s3://nyc-duration/in/{year:04d}-{month:02d}.parquet"
! export OUTPUT_FILE_PATTERN="s3://nyc-duration/out/{year:04d}-{month:02d}.parquet"
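
One caveat with the cell above: in a notebook each `!` line runs in its own subshell, so a variable set with `export` is gone by the time the next cell runs. A minimal sketch of an alternative is to set the patterns from Python with `os.environ`, which persists for the rest of the session and is inherited by later `!` commands. The helper names `get_input_path` and `get_output_path` are hypothetical and only illustrate how the patterns get filled in.

```python
import os

# Set the patterns in the current process so later cells can see them.
os.environ["INPUT_FILE_PATTERN"] = "s3://nyc-duration/in/{year:04d}-{month:02d}.parquet"
os.environ["OUTPUT_FILE_PATTERN"] = "s3://nyc-duration/out/{year:04d}-{month:02d}.parquet"


def get_input_path(year, month):
    # Hypothetical helper: fill the year/month placeholders of the pattern.
    return os.environ["INPUT_FILE_PATTERN"].format(year=year, month=month)


def get_output_path(year, month):
    # Hypothetical helper: same idea for the output location.
    return os.environ["OUTPUT_FILE_PATTERN"].format(year=year, month=month)


print(get_input_path(2021, 1))   # s3://nyc-duration/in/2021-01.parquet
print(get_output_path(2021, 1))  # s3://nyc-duration/out/2021-01.parquet
```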

In [24]:
! pytest

platform linux -- Python 3.8.8, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /media/camilocf/329087059086CEB3/Users/cami1/Escritorio/Github/mlops-zoomcamp/00-homework/06-best-practices/notebooks
collected 1 item

tests/test_batch.py .                                                    [100%]



# Q5. Creating test data

In [126]:
! python tests/integration_test.py

s3://nyc-duration/taxi_type=fhv/year=2021/month=01/predictions.parquet


In [119]:
! aws --endpoint-url=http://localhost:4566 s3 ls s3://nyc-duration/taxi_type=fhv/year=2021/month=01/

2022-07-20 20:25:10       3504 predictions.parquet


In [127]:
! aws --endpoint-url=http://localhost:4566 s3 ls s3://nyc-duration/taxi_type=fhv/year=2021/month=01/

2022-07-20 20:26:41       3504 predictions.parquet


What's the size of the file?

**A/** 3504 bytes (closest to 3512)

# Q6. Finish the integration test

In [135]:
! export INPUT_FILE_PATTERN="s3://nyc-duration/taxi_type=fhv/year=2021/month=01/predictions.parquet"
! export OUTPUT_FILE_PATTERN="s3://nyc-duration/taxi_type=fhv/year=2021/month=01/predictions_out.parquet"

In [None]:
! python batch.py 2021 1

In [None]:
import pandas as pd

# Point the S3 client at Localstack instead of real AWS
S3_ENDPOINT_URL = "http://localhost:4566"
options = {
    'client_kwargs': {
        'endpoint_url': S3_ENDPOINT_URL
    }
}

# Read the parquet file back from the Localstack bucket
file = "s3://nyc-duration/taxi_type=fhv/year=2021/month=01/predictions.parquet"
df = pd.read_parquet(file, storage_options=options)
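
The `options` dictionary above hard-codes the Localstack endpoint. A possible variation, sketched here under the assumption that the endpoint comes from an `S3_ENDPOINT_URL` environment variable, is to build the options only when that variable is set. The helper name `get_storage_options` is hypothetical.

```python
import os


def get_storage_options():
    # Hypothetical helper: return options for Localstack when the endpoint
    # variable is set, otherwise None so the default AWS endpoint is used.
    endpoint_url = os.getenv("S3_ENDPOINT_URL")
    if not endpoint_url:
        return None
    return {"client_kwargs": {"endpoint_url": endpoint_url}}


os.environ["S3_ENDPOINT_URL"] = "http://localhost:4566"
print(get_storage_options())
# {'client_kwargs': {'endpoint_url': 'http://localhost:4566'}}
```

The result can be passed as `storage_options=get_storage_options()` to `pd.read_parquet`, so the same code can run against Localstack in tests and against real S3 otherwise.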

In [None]:
df["predicted_duration"].sum()

What's the sum of predicted durations for the test dataframe?

86.31436303194275