# Feature Engineering In Advance of HPO

Raw data is often not arranged optimally for training. In this notebook we perform basic dataframe operations such as filtering out outliers, repartitioning, and expanding dates. Everything here should be familiar to a modestly experienced pandas user. The exact steps will vary from dataset to dataset.

We do encourage a few generally applicable optimizations:

-   Categorization
-   Efficient datatypes (like pyarrow strings)
-   Repartitioning
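
As a rough illustration of why categorization helps, the following pandas sketch compares the memory footprint of a repetitive string column stored as Python objects with the same column stored as a category. The column name matches the dataset used below, but the values and the row count are made up for illustration.

```python
import pandas as pd

# Synthetic, highly repetitive string column (values are illustrative only)
s = pd.Series(["HV0003", "HV0005"] * 50_000, name="hvfhs_license_num")

as_object = s.astype(object)
as_category = s.astype("category")

# deep=True also counts the string payloads, not just the 8-byte pointers
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:>10,} bytes")
print(f"category: {category_bytes:>10,} bytes")
```

The category version stores each distinct value once plus a small integer code per row, so the saving grows with the number of repeated values.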

In [None]:
import coiled
import dask.config
import dask.dataframe as dd
from distributed import Client
import pandas as pd

### Start Coiled cluster

In [None]:
dask.config.set({"dataframe.dtype_backend": "pyarrow"})

cluster = coiled.Cluster(
    worker_vm_types=["m6i.xlarge"],
    scheduler_vm_types=["m6i.large"],
    package_sync=True,  # align remote packages to local ones
    n_workers=10,
    backend_options={
        "region": "us-east-2",
        "multizone": True,
        "spot": True,
        "spot_on_demand_fallback": True,
    },
)
client = Client(cluster)

In [None]:
# Temporary workaround to https://github.com/dask/dask/issues/9840
from distributed import WorkerPlugin


class SetPandasOptions(WorkerPlugin):
    def setup(self, worker):
        pd.set_option("string_storage", "pyarrow")


pd.set_option("string_storage", "pyarrow")  # Set on the client
_ = client.register_worker_plugin(SetPandasOptions())  # Set on the workers

### Load data

In [None]:
ddf = dd.read_parquet(
    "s3://coiled-datasets/prefect-dask/nyc-uber-lyft/processed_data.parquet",
    index=False,
    columns=[
        "hvfhs_license_num",
        "PULocationID",
        "DOLocationID",
        "trip_miles",
        "trip_time",
        "tolls",
        "congestion_surcharge",
        "airport_fee",
        "wav_request_flag",
        "on_scene_datetime",
        "pickup_datetime",
    ],
)
ddf.npartitions

In [None]:
ddf.head()

The size of the partitions in the input dataset varies substantially, from 22 to 836 MiB. We want to avoid running the whole pipeline on the biggest chunks, so we first break them down into a more manageable size, at the cost of reading the whole dataset twice.
Note that this won't stop the 836 MiB partitions from being read into memory all at once. However, if we rechunk early, each large partition can be split and released right after it has been loaded, instead of being carried through the entire processing pipeline.
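
To get a feel for the numbers, the sketch below estimates how many pieces a partition would have to be split into to stay under a target size. It is plain arithmetic, not Dask's actual repartitioning algorithm (which may also merge small neighbouring partitions). The helper name `pieces_needed` and the intermediate 300 MiB size are hypothetical; only the 22 and 836 MiB extremes come from the text above, and the sketch assumes "100MB" is read as decimal megabytes.

```python
import math

MIB = 2**20
TARGET_BYTES = 100 * 10**6  # assumption: "100MB" means 100 decimal megabytes


def pieces_needed(partition_bytes: int, target_bytes: int = TARGET_BYTES) -> int:
    """Smallest number of equal pieces that keeps each piece at or under the target."""
    return max(1, math.ceil(partition_bytes / target_bytes))


for size_mib in (22, 300, 836):
    n = pieces_needed(size_mib * MIB)
    print(f"{size_mib:>4} MiB -> {n} piece(s)")
```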

In [None]:
ddf = ddf.repartition(partition_size="100MB")
ddf.npartitions

### Postprocess columns

In [None]:
ddf = ddf.assign(
    accessible_vehicle=ddf.on_scene_datetime.isnull(),
    pickup_month=ddf.pickup_datetime.dt.month,
    pickup_dow=ddf.pickup_datetime.dt.dayofweek,
    pickup_hour=ddf.pickup_datetime.dt.hour,
)
ddf = ddf.drop(columns=["on_scene_datetime", "pickup_datetime"])
ddf["airport_fee"] = ddf["airport_fee"].replace("None", 0).astype(float).fillna(0)

### Filter rows

In [None]:
ddf = ddf.dropna(how="any")

# Remove outliers
# Based on our earlier EDA, we will set the lower bound at zero, which is consistent
# with our domain knowledge that no trip should have a duration less than zero.
# We calculate upper_bound from Q3 and keep only the rows within the bounds.
lower_bound = 0
Q3 = ddf["trip_time"].quantile(0.75)
upper_bound = Q3 + (1.5 * (Q3 - lower_bound))
ddf = ddf.loc[(ddf["trip_time"] >= lower_bound) & (ddf["trip_time"] <= upper_bound)]
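
As a worked example of the cutoff used in the cell above, the sketch below applies the same formula to plain Python numbers. Because the lower bound is fixed at zero rather than at the first quartile, the upper bound works out to 2.5 times Q3, which is wider than the classical Q3 + 1.5 * (Q3 - Q1) fence. The helper name `upper_fence`, the Q3 value, and the sample trip times are all hypothetical.

```python
def upper_fence(q3: float, lower_bound: float = 0.0, k: float = 1.5) -> float:
    """Upper cutoff: Q3 plus k times the distance from lower_bound to Q3."""
    return q3 + k * (q3 - lower_bound)


q3 = 1_200.0  # hypothetical third quartile of trip_time
upper = upper_fence(q3)

trip_times = [-5, 0, 600, 1_200, 3_000, 3_001, 10_000]  # made-up values
kept = [t for t in trip_times if 0 <= t <= upper]

print(upper)  # 3000.0
print(kept)   # [0, 600, 1200, 3000]
```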

### Join with domain information

The "Taxi Zone Lookup Table" (CSV) was downloaded from [here](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

In [None]:
taxi_zone_lookup = pd.read_csv(
    "s3://coiled-datasets/prefect-dask/nyc-uber-lyft/taxi+_zone_lookup.csv",
    usecols=["LocationID", "Borough"],
)
BOROUGH_MAPPING = {
    "Manhattan": "Superborough 1",
    "Bronx": "Superborough 1",
    "EWR": "Superborough 1",
    "Brooklyn": "Superborough 2",
    "Queens": "Superborough 2",
    "Staten Island": "Superborough 3",
    "Unknown": "Unknown",
}

taxi_zone_lookup["Superborough"] = [
    BOROUGH_MAPPING[k] for k in taxi_zone_lookup["Borough"]
]
taxi_zone_lookup = taxi_zone_lookup.astype(
    {"Borough": "string[pyarrow]", "Superborough": "string[pyarrow]"}
)
taxi_zone_lookup

In [None]:
ddf = dd.merge(
    ddf,
    taxi_zone_lookup,
    left_on="PULocationID",
    right_on="LocationID",
    how="inner",
)
ddf = ddf.rename(columns={"Borough": "PUBorough", "Superborough": "PUSuperborough"})
ddf = ddf.drop(columns="LocationID")

ddf = dd.merge(
    ddf,
    taxi_zone_lookup,
    left_on="DOLocationID",
    right_on="LocationID",
    how="inner",
)
ddf = ddf.rename(columns={"Borough": "DOBorough", "Superborough": "DOSuperborough"})
ddf = ddf.drop(columns="LocationID")

ddf["PUSuperborough_DOSuperborough"] = ddf.PUSuperborough.str.cat(
    ddf.DOSuperborough, sep="-"
)
ddf = ddf.drop(columns=["PUSuperborough", "DOSuperborough"])

### Categorize
Convert column data to categories, with homogeneous domains across partitions.

categorize() works in four steps:
1. compute the whole input dataframe
2. collect local categorical domains from each partition and send them back to the client
3. on the client, generate dataframe-wide categorical domains as the union of the local ones
4. return a new graph where each chunk is converted to category using the global domains

This means that, once you compute the output of categorize(), you will have computed the
whole thing twice. To avoid this, we persist() to hide the graph so far from categorize().

This, in turn, has the drawback that the whole dataframe must be held in memory
at once. To reduce the amount of memory this requires, we *locally* categorize
each partition, which drastically reduces its size, before we persist.

Read more: https://github.com/dask/dask/issues/9847
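
The domain-merging steps described above can be sketched in plain pandas. This is a minimal illustration of the idea, not Dask's implementation: the two "partitions" are hypothetical pandas Series, each categorized locally, which are then re-encoded against the union of their categories.

```python
import pandas as pd

# Two hypothetical partitions, each categorized locally: their domains differ
part_a = pd.Series(["Bronx", "Queens", "Bronx"]).astype("category")
part_b = pd.Series(["Queens", "Brooklyn"]).astype("category")

# Steps 2-3: gather the local domains and take their union
global_categories = sorted(set(part_a.cat.categories) | set(part_b.cat.categories))

# Step 4: re-encode every partition against the shared domain
part_a = part_a.cat.set_categories(global_categories)
part_b = part_b.cat.set_categories(global_categories)

print(list(part_a.cat.categories))
print(part_a.dtype == part_b.dtype)
```

Once every partition shares the same categorical dtype, the partitions can be combined or written out without any of them falling back to a non-categorical column.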

In [None]:
categories = [
    "hvfhs_license_num",
    "PULocationID",
    "DOLocationID",
    "wav_request_flag",
    "accessible_vehicle",
    "pickup_month",
    "pickup_dow",
    "pickup_hour",
    "PUBorough",
    "DOBorough",
    "PUSuperborough_DOSuperborough",
]

ddf = ddf.astype(dict.fromkeys(categories, "category"))
ddf = ddf.persist()
# This blocks until the whole workload so far has been persisted
ddf = ddf.categorize(categories)

### Prepare for output
Make all partitions roughly the same size (they no longer are, because of the row filtering earlier on) and define the final partition size.
Again, repartition() would compute its whole input, that is, everything since the latest persist(), twice.
To avoid this, we persist() immediately before repartitioning.

In [None]:
ddf = ddf.persist()
ddf = ddf.repartition(partition_size="100MB")
ddf.npartitions

In [None]:
ddf.head()

## Output

In [None]:
# Temporary workaround to https://github.com/apache/arrow/issues/33727
ddf = ddf.astype(
    {
        col: pd.CategoricalDtype(dt.categories.astype(object))
        for col, dt in ddf.dtypes.items()
        if isinstance(dt, pd.CategoricalDtype)
        and dt.categories.dtype == "string[pyarrow]"
    }
)

In [None]:
ddf.to_parquet(
    "s3://coiled-datasets/prefect-dask/nyc-uber-lyft/feature_table.parquet",
    overwrite=True,
)

In [None]:
client.shutdown()