In [21]:
!pip freeze | grep scikit-learn

scikit-learn==1.5.0


In [22]:
!python -V

Python 3.9.6


In [23]:
import pickle
import pandas as pd
import os

### Setting up parametrized input

In [24]:
month = 1
year = 2023

#### Setting up functions

In [25]:
def read_data(filename):
    df = pd.read_parquet(filename)
    
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    
    return df

def load_model(model_name='model.bin'):
    with open(model_name, 'rb') as f_in:
        dv, model = pickle.load(f_in)
    return dv, model

def apply_model(year=2023,month=1,model_name='model.bin'):
    # set up filenames
    taxi_type = 'yellow'
    input_file = f'https://d37ci6vzurychx.cloudfront.net/trip-data/{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet'
    output_file = f'output/{taxi_type}/{year:04d}-{month:02d}.parquet'

    # read datasets
    df = read_data(input_file)
    dv, model = load_model(model_name)

    # applying model
    categorical = ['PULocationID', 'DOLocationID']
    dicts = df[categorical].to_dict(orient='records')
    X_val = dv.transform(dicts)
    y_pred = model.predict(X_val)

    # set up ride id
    df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')

    # saving the results
    save_results(df,y_pred,output_file)

    return y_pred, output_file

def save_results(df: pd.DataFrame,y_pred,output_file):
    # creating the final results df
    df_result = pd.DataFrame()
    df_result['ride_id'] = df['ride_id']
    df_result['prediction'] = y_pred
    # saving as parquet file
    create_outfolder(output_file)
    df_result.to_parquet(
        output_file,
        engine='pyarrow',
        compression=None,
        index=False
    )
    return None

def create_outfolder(output_file):
    # make sure the folder that will hold output_file exists
    path = os.path.dirname(output_file)
    final_directory = os.path.join(os.getcwd(), path)
    os.makedirs(final_directory, exist_ok=True)

### Reading dataset and making a prediction

In [26]:
y_pred, output_file = apply_model(year=year,month=month,model_name='model.bin')

## Q1. Notebook
What's the standard deviation of the predicted duration for this dataset?

In [27]:
print(f'the standard deviation is {y_pred.std():.2f}')

the standard deviation is 6.35


## Q2. Preparing the output
What is the size of the output?

In [28]:
file_stats = os.stat(output_file)
print(f'File Size in MegaBytes is {(file_stats.st_size/(1024*1024)):.1f}')

File Size in MegaBytes is 59.2


## Q3. Creating the scoring script
Now let's turn the notebook into a script. Which command do you need to execute for that?

```bash
jupyter nbconvert --to script starter.ipynb
```

### Q4. Virtual environment

Now let's put everything into a virtual environment. We'll use pipenv for that.

Install all the required libraries. Pay attention to the Scikit-Learn version: it should be the same as in the starter notebook.

After installing the libraries, pipenv creates two files: Pipfile and Pipfile.lock. The Pipfile.lock file keeps the hashes of the dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency?

We generated a pipenv environment via:
```bash
pipenv install scikit-learn==1.5.0 pandas --python=3.9
```
This generated a `Pipfile` and a `Pipfile.lock` in this folder. Note that `pickle` and `os` are part of the Python 3.9 standard library, so they do not need to be installed through pipenv.

#### Answer

The first hash for the Scikit-Learn dependency is `sha256:057b991ac64b3e75c9c04b5f9395eaf19a6179244c089afdebaad98264bff37c`
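
Rather than reading the hash off by eye, it can be pulled out of `Pipfile.lock`, which is a JSON file. The snippet below is a minimal sketch, assuming the usual layout in which regular dependencies sit under the `default` key and each package entry carries a `hashes` list. The helper names `load_lock` and `first_hash` are hypothetical, and the hash strings in `example_lock` are placeholders, not real values.

```python
import json
from pathlib import Path


def load_lock(path="Pipfile.lock"):
    """Parse a Pipfile.lock (plain JSON) into a dict."""
    return json.loads(Path(path).read_text())


def first_hash(lock_data, package, section="default"):
    """Return the first hash listed for `package` in a parsed Pipfile.lock."""
    return lock_data[section][package]["hashes"][0]


# Illustrative stand-in for a parsed Pipfile.lock (placeholder hashes).
example_lock = {
    "default": {
        "scikit-learn": {
            "hashes": ["sha256:aaaa", "sha256:bbbb"],
            "version": "==1.5.0",
        }
    }
}

print(first_hash(example_lock, "scikit-learn"))
```

Against the real file, the call would be `first_hash(load_lock(), "scikit-learn")`, run from the folder that contains `Pipfile.lock`.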

### Q5. Parametrize the script
Let's now make the script configurable via CLI. We'll create two parameters: year and month.

Run the script for April 2023.

What's the mean predicted duration?

#### Answer
We run the following in bash to evaluate that:
```bash
pipenv shell
python starter.py 2023 4
```

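This gives a mean predicted duration of `14.29 minutes` for April 2023. The last notebook cell reports the same statistic for January 2023, the `month` and `year` values set earlier in the notebook.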
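
How `starter.py` reads its two positional arguments is not shown in this notebook. One possible way is sketched below with `argparse` from the standard library. The `parse_args` helper and its help strings are hypothetical, and the parsed values would then be passed on to `apply_model(year=..., month=...)`.

```python
import argparse


def parse_args(argv=None):
    """Parse `year` and `month` as positional integers.

    With argv=None, argparse falls back to the real command line (sys.argv[1:]).
    """
    parser = argparse.ArgumentParser(description="Score taxi trips for one month.")
    parser.add_argument("year", type=int, help="four-digit year, e.g. 2023")
    parser.add_argument(
        "month",
        type=int,
        choices=range(1, 13),  # reject anything outside 1..12
        metavar="month",
        help="month number, 1-12",
    )
    return parser.parse_args(argv)


# Equivalent to running: python starter.py 2023 4
args = parse_args(["2023", "4"])
print(f"{args.year:04d}-{args.month:02d}")
```

Passing an explicit list to `parse_args` keeps the example runnable inside a notebook, where `sys.argv` belongs to the kernel rather than to the script.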

In [29]:
print(f'For {month:02d}/{year:04d} the mean predicted duration is {y_pred.mean():.2f} minutes')


For 01/2023 the mean predicted duration is 14.20 minutes
