# Refactor the Taxi Fare Prediction Problem with a Pipeline

We will refactor the model you built for the Taxi Fare Prediction Problem using:
- Custom encoders for the distance and time features
- `OneHotEncoder` to encode the hour and day-of-week features
- `SimpleImputer` to fill missing values
- A simple linear regression
- A pipeline to put it all together

Then: 
- train this pipeline
- apply the pipeline on test data
- generate predictions and submit these new predictions to Kaggle
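
As a rough illustration of how an imputer and a regression fit together in a pipeline, here is a minimal sketch on toy data. The array values and the names `X_toy`, `y_toy` and `toy_pipe` are invented for this example and are not part of the taxi fare dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy feature matrix with one missing value (illustrative data only)
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_toy = np.array([2.0, 4.0, 6.0, 8.0])

toy_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),  # replace NaN with the column mean
    ("model", LinearRegression()),
])

# fit() runs the imputer first, then trains the regression on the imputed data
toy_pipe.fit(X_toy, y_toy)
toy_preds = toy_pipe.predict(X_toy)

# value the imputer learned for the single column: mean of 1, 2 and 4
imputed_value = toy_pipe.named_steps["imputer"].statistics_[0]
```

The same `fit` / `predict` calls apply to the full pipeline built in this notebook.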

## First pipeline

Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.841610,40.712278,1
1,2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2,2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.761270,-73.991242,40.750562,2
3,2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.987130,40.733143,-73.991567,40.758092,1
4,2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1
...,...,...,...,...,...,...,...,...
995,2011-07-30 07:33:13.0000001,16.9,2011-07-30 07:33:13 UTC,-73.884801,40.755707,-73.980472,40.765556,1
996,2011-09-24 23:20:00.000000112,13.7,2011-09-24 23:20:00 UTC,-73.953603,40.779203,-73.995763,40.726701,3
997,2011-12-22 11:07:00.00000037,29.3,2011-12-22 11:07:00 UTC,-73.942380,40.837712,-73.864372,40.769985,2
998,2014-11-03 12:40:00.00000014,6.5,2014-11-03 12:40:00 UTC,-73.961151,40.774578,-73.972251,40.785640,1


In [9]:
import pandas as pd
df = pd.read_csv('../01-Kaggle-Taxi-Fare/data/train.csv',sep=',',nrows=1_000)


In [55]:
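# work on a copy so that df keeps its raw pickup_datetime column for the pipeline
df_local = df.copy()
df_local["pickup_datetime"] = pd.to_datetime(df_local.pickup_datetime, format="%Y-%m-%d %H:%M:%S %Z").dt.tz_convert("America/New_York")
df_local = df_local.set_index("pickup_datetime")
df_local[['key','fare_amount']]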

Unnamed: 0_level_0,key,fare_amount
pickup_datetime,Unnamed: 1_level_1,Unnamed: 2_level_1
2009-06-15 13:26:21-04:00,2009-06-15 17:26:21.0000001,4.5
2010-01-05 11:52:16-05:00,2010-01-05 16:52:16.0000002,16.9
2011-08-17 20:35:00-04:00,2011-08-18 00:35:00.00000049,5.7
2012-04-21 00:30:42-04:00,2012-04-21 04:30:42.0000001,7.7
2010-03-09 02:51:00-05:00,2010-03-09 07:51:00.000000135,5.3
...,...,...
2011-07-30 03:33:13-04:00,2011-07-30 07:33:13.0000001,16.9
2011-09-24 19:20:00-04:00,2011-09-24 23:20:00.000000112,13.7
2011-12-22 06:07:00-05:00,2011-12-22 11:07:00.00000037,29.3
2014-11-03 07:40:00-05:00,2014-11-03 12:40:00.00000014,6.5
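
The timezone conversion above can be checked on a couple of raw timestamps. This is a small sketch that parses with `utc=True` instead of an explicit `format`, which is an assumption made to keep the example short. The two strings are taken from the first rows of the dataset shown earlier.

```python
import pandas as pd

# Two raw pickup timestamps, as stored in the CSV (UTC)
raw = pd.Series(["2009-06-15 17:26:21 UTC", "2010-01-05 16:52:16 UTC"])

# Parse as timezone-aware UTC, then convert to New York local time
utc_times = pd.to_datetime(raw, utc=True)
local_times = utc_times.dt.tz_convert("America/New_York")

# Local hours differ from the UTC hours by 4 (summer) or 5 (winter)
local_hours = local_times.dt.hour.tolist()
```

The June pickup shifts by four hours and the January pickup by five, which is why the hour feature should be extracted after the conversion.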


In [10]:
# prepare X and y
X = df.drop('fare_amount', axis=1)
y = df.fare_amount
X

Unnamed: 0,key,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2009-06-15 17:26:21.0000001,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.841610,40.712278,1
1,2010-01-05 16:52:16.0000002,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2,2011-08-18 00:35:00.00000049,2011-08-18 00:35:00 UTC,-73.982738,40.761270,-73.991242,40.750562,2
3,2012-04-21 04:30:42.0000001,2012-04-21 04:30:42 UTC,-73.987130,40.733143,-73.991567,40.758092,1
4,2010-03-09 07:51:00.000000135,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1
...,...,...,...,...,...,...,...
995,2011-07-30 07:33:13.0000001,2011-07-30 07:33:13 UTC,-73.884801,40.755707,-73.980472,40.765556,1
996,2011-09-24 23:20:00.000000112,2011-09-24 23:20:00 UTC,-73.953603,40.779203,-73.995763,40.726701,3
997,2011-12-22 11:07:00.00000037,2011-12-22 11:07:00 UTC,-73.942380,40.837712,-73.864372,40.769985,2
998,2014-11-03 12:40:00.00000014,2014-11-03 12:40:00 UTC,-73.961151,40.774578,-73.972251,40.785640,1


In [11]:
# Hold out (train and test split)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [8]:
X_train.head()

Unnamed: 0,key,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
29,2013-08-11 00:52:00.00000026,2013-08-11 00:52:00 UTC,-73.98102,40.73776,-73.980668,40.730497,2
535,2013-01-13 13:38:00.000000119,2013-01-13 13:38:00 UTC,-73.989647,40.74729,-73.985648,40.735408,5
695,2013-11-24 20:54:00.000000162,2013-11-24 20:54:00 UTC,-73.971542,40.750487,-73.988967,40.72994,1
557,2010-02-03 20:51:29.0000003,2010-02-03 20:51:29 UTC,-73.954191,40.764029,-73.918043,40.766876,1
836,2012-03-16 07:52:00.000000155,2012-03-16 07:52:00 UTC,-73.960797,40.818232,-73.953255,40.810163,1


### Custom transformers

With the Taxi Fare Prediction Challenge data, using `BaseEstimator` and `TransformerMixin`, implement:

- a transformer that computes the haversine distance between the pickup and dropoff locations
- a custom encoder that extracts the time features from `pickup_datetime`

In [13]:
import numpy as np

def haversine_vectorized(df, 
                         start_lat="pickup_latitude",
                         start_lon="pickup_longitude",
                         end_lat="dropoff_latitude",
                         end_lon="dropoff_longitude"):
    """ 
        Calculates the great circle distance between two points 
        on the earth (specified in decimal degrees).
        Vectorized version of the haversine distance for pandas df.
        Computes the distance in kms.
    """

    lat_1_rad, lon_1_rad = np.radians(df[start_lat].astype(float)), np.radians(df[start_lon].astype(float))
    lat_2_rad, lon_2_rad = np.radians(df[end_lat].astype(float)), np.radians(df[end_lon].astype(float))
    dlon = lon_2_rad - lon_1_rad
    dlat = lat_2_rad - lat_1_rad

    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat_1_rad) * np.cos(lat_2_rad) * np.sin(dlon / 2.0) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    return 6371 * c
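
A quick sanity check of the formula: two points on the same meridian, one degree of latitude apart, should be roughly 111 km apart, and identical points should be 0 km apart. The sketch below re-implements the same formula in a standalone helper (`haversine_km` is a hypothetical name used only here) so that it runs on its own.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Same haversine formula as above, written for plain columns (illustrative helper)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 6371 * 2 * np.arcsin(np.sqrt(a))

# Row 0: identical points. Row 1: one degree of latitude apart.
checks = pd.DataFrame({
    "pickup_latitude": [40.0, 40.0],
    "pickup_longitude": [-74.0, -74.0],
    "dropoff_latitude": [40.0, 41.0],
    "dropoff_longitude": [-74.0, -74.0],
})

check_km = haversine_km(checks["pickup_latitude"], checks["pickup_longitude"],
                        checks["dropoff_latitude"], checks["dropoff_longitude"])
```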

In [6]:
# create a DistanceTransformer
from sklearn.base import BaseEstimator, TransformerMixin

class DistanceTransformer(BaseEstimator, TransformerMixin):
    """
        Computes the haversine distance between two GPS points.
        Returns a copy of the DataFrame X with only one column: 'distance'.
    """

    def __init__(self,
                 start_lat="pickup_latitude",
                 start_lon="pickup_longitude",
                 end_lat="dropoff_latitude",
                 end_lon="dropoff_longitude"):
        self.start_lat = start_lat
        self.start_lon = start_lon
        self.end_lat = end_lat
        self.end_lon = end_lon


    def fit(self, X, y=None):
        return self


    def transform(self, X, y=None):
        X_ = X.copy()
        X_["distance"] = haversine_vectorized(X_, 
                                    self.start_lat, self.start_lon,
                                    self.end_lat, self.end_lon)
        return X_[['distance']]

In [25]:
dist_trans = DistanceTransformer()
distance = dist_trans.fit_transform(X_train, y_train)
distance.head()

Unnamed: 0,distance
29,0.808153
535,1.363497
695,2.71572
557,3.060721
836,1.099034


In [26]:
distance.loc[286]

distance    21.374788
Name: 286, dtype: float64

In [26]:
# create a TimeFeaturesEncoder
class TimeFeaturesEncoder(BaseEstimator, TransformerMixin):
    """
        Extracts the day of week (dow) and the hour from a time column,
        after converting it to the given timezone.
        Returns a copy of the DataFrame X with only two columns: 'dow' and 'hour',
        indexed by the localized time column.
    """

    def __init__(self, time_col='pickup_datetime', timezone='America/New_York'):
        self.time_col = time_col
        self.timezone = timezone

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X_ = X.copy()
        # parse the UTC timestamps and convert them to local time
        X_[self.time_col] = pd.to_datetime(X_[self.time_col], format="%Y-%m-%d %H:%M:%S %Z").dt.tz_convert(self.timezone)
        X_['dow'] = X_[self.time_col].dt.dayofweek
        X_['hour'] = X_[self.time_col].dt.hour
        X_['month'] = X_[self.time_col].dt.month
        X_['year'] = X_[self.time_col].dt.year
        X_ = X_.set_index(self.time_col)

        # 'month' and 'year' are computed but not returned as features
        return X_[['dow', 'hour']]

In [29]:
time_enc = TimeFeaturesEncoder('pickup_datetime')
time_features = time_enc.fit_transform(X_train, y_train)
time_features.head()

Unnamed: 0_level_0,dow,hour
pickup_datetime,Unnamed: 1_level_1,Unnamed: 2_level_1
2013-08-10 20:52:00-04:00,5,20
2013-01-13 08:38:00-05:00,6,8
2013-11-24 15:54:00-05:00,6,15
2010-02-03 15:51:29-05:00,2,15
2012-03-16 03:52:00-04:00,4,3


### Preprocessing pipeline

In [31]:
# visualizing pipelines in HTML
from sklearn import set_config; set_config(display='diagram')

#### Distance pipeline

Create a pipeline for distances:
- convert the pickup and dropoff coordinates into distances with the DistanceTransformer
- standardize these distances

In [32]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# create distance pipeline dist_pipe
dist_pipe = Pipeline([('dist_transformer',DistanceTransformer()),('std_scaler',StandardScaler())])

# display distance pipeline
dist_pipe

#### Time features pipeline

Create a pipeline for time features:
- extract time features from pickup datetime with the TimeFeaturesEncoder
- encode these categorical time features with the OneHotEncoder

In [33]:
from sklearn.preprocessing import OneHotEncoder
# create time pipeline time_pipe
# time_cat_transformer = OneHotEncoder()
time_pipe = Pipeline([('time_encoder',TimeFeaturesEncoder()),('one_hot',OneHotEncoder(handle_unknown="ignore"))])

# display time pipeline
time_pipe
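
The `handle_unknown="ignore"` option matters when the test set contains an hour or weekday that never appeared in the training set. A minimal sketch of that behaviour on toy values (the arrays are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")

# Pretend only hours 0, 1 and 2 were seen during training
encoder.fit(np.array([[0], [1], [2]]))

# A known category gets a single 1 in its column
seen = encoder.transform(np.array([[1]])).toarray()

# An unknown category is encoded as all zeros instead of raising an error
unseen = encoder.transform(np.array([[5]])).toarray()
```

Without this option, the encoder would raise an error at prediction time on an unseen category.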

#### Preprocessing pipeline

Wrap up the distance pipeline and the time pipeline into a preprocessing pipeline.

In [34]:
df.columns

Index(['key', 'fare_amount', 'pickup_datetime', 'pickup_longitude',
       'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
       'passenger_count'],
      dtype='object')

In [35]:
from sklearn.compose import ColumnTransformer
# create the preprocessing pipeline (named Preprocessor here)
Preprocessor = ColumnTransformer([
    ('time_pipe',time_pipe,['pickup_datetime']),
    ('dist_pipe',dist_pipe,['pickup_longitude', 'pickup_latitude','dropoff_longitude', 'dropoff_latitude'])
    ])
# display preprocessing pipeline
Preprocessor

In [37]:
# shape of the preprocessed feature matrix
Preprocessor.fit_transform(df).shape


(1000, 32)
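
The 32 columns are consistent with 7 one-hot columns for the day of week, 24 for the hour and 1 for the scaled distance, assuming every weekday and every hour occurs at least once in the sample. The sketch below reproduces that count on a synthetic frame. The column names and values are made up for the example, and the custom transformers are replaced by precomputed columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic frame in which every weekday (0-6) and every hour (0-23) appears
n = 7 * 24
toy = pd.DataFrame({
    "dow": np.arange(n) % 7,
    "hour": np.arange(n) % 24,
    "distance": np.linspace(0.5, 20.0, n),
})

toy_preproc = ColumnTransformer([
    ("time", OneHotEncoder(handle_unknown="ignore"), ["dow", "hour"]),
    ("dist", StandardScaler(), ["distance"]),
])

# 7 weekday columns + 24 hour columns + 1 scaled distance column
toy_features = toy_preproc.fit_transform(toy)
```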

In [38]:
pd.DataFrame.sparse.from_spmatrix(Preprocessor.fit_transform(df))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.039712
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.012621
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-0.038402
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.033255
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.036176
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.013783
996,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,-0.018525
997,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.006994
998,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.037835


### Model pipeline

Create a pipeline containing the preprocessing and the regression model of your choice.

In [67]:
from sklearn.linear_model import LinearRegression

# Add the model of your choice to the pipeline named final_pipe
final_pipe = Pipeline([
    ('preprocessing',Preprocessor),
    ('linear_regression',LinearRegression())

])
# display the pipeline with model
final_pipe

### Training and performance

Train the pipelined model and compute the prediction on the test set:

In [80]:
# train the pipelined model
final_pipe_trained = final_pipe.fit(X_train,y_train)


# compute y_pred on the test set
y_test_pred = final_pipe_trained.predict(X_test)

Use the RMSE to evaluate the performance of the model:

In [81]:
def compute_rmse(y_pred, y_true):
    return np.sqrt(((y_pred - y_true) ** 2).mean())

In [82]:
# call compute_rmse
compute_rmse(y_test_pred, y_test)

8.66643237874745

In [84]:
final_pipe_trained.score(X_test, y_test)

-0.08097835832948652
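
For a pipeline ending in a regressor, `score` returns the coefficient of determination R², not the RMSE. A negative value means the model does worse on this test set than a constant prediction equal to the mean of the true values. A small sketch on invented numbers:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([5.0, 10.0, 15.0, 20.0])

# Baseline: always predict the mean of the true values
mean_baseline = np.full_like(y_true, y_true.mean())

# Deliberately poor predictions (the true values in reverse order)
bad_preds = np.array([20.0, 15.0, 10.0, 5.0])

r2_baseline = r2_score(y_true, mean_baseline)  # mean predictor scores 0
r2_bad = r2_score(y_true, bad_preds)           # worse than the mean: negative
```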

## Complete the workflow with a pipeline

Here we will implement the whole workflow for our Taxi Fare Kaggle challenge.

To do that, we will refactor the code into functions for more clarity.

Implement the following functions:
- `get_data()` to fetch the data
- `clean_data()` to clean the data
- `set_pipeline()` to build the pipeline defined earlier
- `train()` to train our model
- `evaluate()` to evaluate our model on test data

## Distance-to-center transformer

In [17]:
def haversine_vectorized_center(df, 
                         start_lat="pickup_latitude",
                         start_lon="pickup_longitude"):
    """ 
        Calculates the great circle distance between each pickup point
        and a fixed reference point (40.730610, -73.935242), used here as the city center.
        Coordinates are specified in decimal degrees.
        Vectorized version of the haversine distance for pandas df.
        Computes the distance in kms.
    """

    lat_1_rad, lon_1_rad = np.radians(df[start_lat].astype(float)), np.radians(df[start_lon].astype(float))
    lat_2_rad, lon_2_rad = np.radians(40.730610), np.radians(-73.935242)
    dlon = lon_2_rad - lon_1_rad
    dlat = lat_2_rad - lat_1_rad

    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat_1_rad) * np.cos(lat_2_rad) * np.sin(dlon / 2.0) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    return 6371 * c

In [18]:
from sklearn.base import BaseEstimator, TransformerMixin

class DistToCenterAdder(BaseEstimator, TransformerMixin):
    """
        Computes the haversine distance between the pickup point and the city center.
        Returns a copy of the DataFrame X with only one column: 'dist_to_center'.
    """

    def __init__(self,
                add_dist_to_center = True,
                start_lat="pickup_latitude",
                start_lon="pickup_longitude"):
        self.add_dist_to_center = add_dist_to_center
        self.start_lat = start_lat
        self.start_lon = start_lon


    def fit(self, X, y=None):
        return self


    def transform(self, X, y=None):
        X_ = X.copy()
        if not self.add_dist_to_center:
            # feature switched off: return no columns instead of failing on a missing one
            return X_[[]]
        X_["dist_to_center"] = haversine_vectorized_center(X_,
                                    self.start_lat, self.start_lon)
        return X_[["dist_to_center"]]


In [19]:
dist_to_center = DistToCenterAdder()
distance = dist_to_center.fit_transform(X_train, y_train)
distance.head()

Unnamed: 0,dist_to_center
29,3.938222
535,4.944724
695,3.773325
557,4.044371
836,9.977929


## Condensed version

In [1]:
# implement get_data() function
def get_data(nrows=10000):
    '''returns a DataFrame with the first nrows rows of the local train.csv'''
    df = pd.read_csv('../01-Kaggle-Taxi-Fare/data/train.csv',sep=',',nrows=nrows)
    return df

In [3]:
# implement clean_data() function
def clean_data(df, test=False):
    '''drops rows with non-positive fares, implausible passenger counts,
    or coordinates outside a bounding box roughly covering the New York City area
    (the test flag is not used yet)'''
    df = df[
        (df.fare_amount > 0) &
        (df.passenger_count <= 8) &
        (df.passenger_count > 0) &
        (df["pickup_latitude"].between(left=40, right=42)) &
        (df["pickup_longitude"].between(left=-74.3, right=-72.9)) &
        (df["dropoff_latitude"].between(left=40, right=42)) &
        (df["dropoff_longitude"].between(left=-74, right=-72.9))
    ]
    return df

In [4]:
# implement set_pipeline() function
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression

def set_pipeline():
    dist_pipe = Pipeline([('dist_transformer',DistanceTransformer()),('std_scaler',StandardScaler())])
    time_pipe = Pipeline([('time_encoder',TimeFeaturesEncoder()),('one_hot',OneHotEncoder(handle_unknown="ignore"))])

    Preprocessor = ColumnTransformer([
        ('time_pipe',time_pipe,['pickup_datetime']),
        ('dist_pipe',dist_pipe,['pickup_longitude', 'pickup_latitude','dropoff_longitude', 'dropoff_latitude'])
        ])

    final_pipe = Pipeline([
        ('preprocessing',Preprocessor),
        ('linear_regression',LinearRegression())
        ])
    
    return final_pipe

In [77]:

def set_pipeline_choose():
    dist_pipe = Pipeline([('dist_transformer',DistanceTransformer()),('std_scaler',StandardScaler())])
    time_pipe = Pipeline([('time_encoder',TimeFeaturesEncoder()),('one_hot',OneHotEncoder(handle_unknown="ignore"))])
    dist_to_center_pipe = Pipeline([('dist_to_center_transformer',DistToCenterAdder()),('std_scaler',StandardScaler())])

    Preprocessor = ColumnTransformer([
        ('time_pipe',time_pipe,['pickup_datetime']),
        ('dist_pipe',dist_pipe,['pickup_longitude', 'pickup_latitude','dropoff_longitude', 'dropoff_latitude']),
        ('dist_to_center',dist_to_center_pipe,['pickup_longitude', 'pickup_latitude'])
        ])

    final_pipe = Pipeline([
        ('preprocessing',Preprocessor),
        ('linear_regression',LinearRegression())
        ])
    
    return final_pipe

In [78]:
# implement train() function
def train(X_train, y_train, pipeline):
    final_pipe_trained = pipeline.fit(X_train,y_train)
    return final_pipe_trained

In [53]:
# implement evaluate() function
def evaluate(X_test, y_test, pipeline):
    y_pred = pipeline.predict(X_test)
    rmse = np.sqrt(((y_pred - y_test) ** 2).mean())
    print(rmse)
    return rmse

### Test the complete workflow

Use the above functions to test the complete workflow.

In [79]:
df = get_data()

# set X and y
y = df["fare_amount"]
X = df.drop("fare_amount", axis=1)

# hold out
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.15)

# build pipeline
pipeline = set_pipeline()

# train the pipeline
train(X_train, y_train, pipeline)

# evaluate the pipeline
rmse = evaluate(X_val, y_val, pipeline)

9.634400346156237


In [80]:
# store the data in a DataFrame
df = get_data()

# set X and y
y = df["fare_amount"]
X = df.drop("fare_amount", axis=1)

# hold out
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.15)

# build pipeline
pipeline = set_pipeline()

# train the pipeline
train(X_train, y_train, pipeline)

# evaluate the pipeline
rmse = evaluate(X_val, y_val, pipeline)

10.880530078144382
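
The two runs above use the same code but report different RMSE values. The most likely reason is that `train_test_split` is called without `random_state`, so each run draws a different hold-out set. A minimal sketch of a reproducible split, using invented arrays (`X_demo`, `y_demo`) and an arbitrary seed of 42:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small synthetic dataset, for illustration only
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Fixing random_state makes the shuffle, and therefore the split, repeatable
split_a = train_test_split(X_demo, y_demo, test_size=0.15, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.15, random_state=42)

# Compare X_train, X_val, y_train, y_val pairwise across the two calls
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
```

With a fixed seed, differences in RMSE between two pipelines can be attributed to the pipelines rather than to the split.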


In [81]:
df = get_data()

# set X and y
y = df["fare_amount"]
X = df.drop("fare_amount", axis=1)

# hold out
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.15)

# build pipeline
pipeline = set_pipeline_choose()
# train the pipeline
train(X_train, y_train, pipeline)

# evaluate the pipeline
rmse = evaluate(X_val, y_val, pipeline)

9.913438998964281


In [65]:
pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'preprocessing', 'linear_regression', 'preprocessing__n_jobs', 'preprocessing__remainder', 'preprocessing__sparse_threshold', 'preprocessing__transformer_weights', 'preprocessing__transformers', 'preprocessing__verbose', 'preprocessing__verbose_feature_names_out', 'preprocessing__time_pipe', 'preprocessing__dist_pipe', 'preprocessing__dist_to_center', 'preprocessing__time_pipe__memory', 'preprocessing__time_pipe__steps', 'preprocessing__time_pipe__verbose', 'preprocessing__time_pipe__time_encoder', 'preprocessing__time_pipe__one_hot', 'preprocessing__time_pipe__time_encoder__time_col', 'preprocessing__time_pipe__time_encoder__timezone', 'preprocessing__time_pipe__one_hot__categories', 'preprocessing__time_pipe__one_hot__drop', 'preprocessing__time_pipe__one_hot__dtype', 'preprocessing__time_pipe__one_hot__handle_unknown', 'preprocessing__time_pipe__one_hot__sparse', 'preprocessing__dist_pipe__memory', 'preprocessing__dist_pipe__steps', 'preproce
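
The keys listed by `get_params()` follow the `step__parameter` naming convention, with one `__` per level of nesting, and the same names can be passed to `set_params()`. The sketch below shows the mechanism on a small stand-in pipeline. The step names `scaler` and `model` and the use of `Ridge` are assumptions made for this example, not part of the taxi fare pipeline.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
])

# Nested parameters are addressed as <step name>__<parameter name>
toy_pipe.set_params(model__alpha=10.0, scaler__with_mean=False)

# get_params() reflects the updated values under the same keys
updated = toy_pipe.get_params()
```

The same key names are what a hyperparameter search expects in its parameter grid.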

### Congrats!

Now we are ready to convert this complete workflow into packaged code 🚀