In [None]:
import sys

sys.path.insert(0, "../")
sys.path.insert(0, "packages")

In [None]:
import os

if os.environ.get("CIRCLECI"):
    default_env = os.environ.get("CONDA_DEFAULT_ENV")
    os.environ["PYSPARK_DRIVER_PYTHON"] = (
        f"/home/circleci/miniconda/envs/{default_env}/bin/python"
    )
    os.environ["PYSPARK_PYTHON"] = (
        f"/home/circleci/miniconda/envs/{default_env}/bin/python"
    )

# Flags

This section outlines how to use the `flags` submodule.

If you have already seen the `tags` module, you will notice that the two are similar.
Historically, the `flags` module was developed first. However, under certain circumstances
the run-time performance of `flags` takes a significant hit where the `tags` module does
not. The reason is that `flags` materialize as one column per flag, whereas all `tags`
are stored in a single array column.

Imagine carrying around a dataframe with 3000 columns versus one with a single array
column. Which do you think is more efficient to carry around?
As it turns out, when translating the unit of analysis (e.g. from event level to
day level), performing a `collect_set` on a single column is much more performant than
a `groupby.agg` on 3000 columns.
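
The difference between the two layouts can be sketched in plain Python. This is only an analogy, not the module's implementation: the event records, the flag names, and the helpers `to_day_level_flags` and `to_day_level_tags` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical event-level records: one row per (id, day, event).
events = [
    {"id": "a", "day": 1, "tags": ["diabetes"]},
    {"id": "a", "day": 1, "tags": ["statin", "diabetes"]},
    {"id": "a", "day": 2, "tags": []},
]
all_flags = ["diabetes", "statin", "insulin"]  # imagine 3000 of these


def to_day_level_flags(rows):
    """Wide layout: one 0/1 value per flag, aggregated flag by flag."""
    out = defaultdict(lambda: {flag: 0 for flag in all_flags})
    for row in rows:
        key = (row["id"], row["day"])
        for flag in all_flags:  # work grows with the number of flag columns
            out[key][flag] = max(out[key][flag], int(flag in row["tags"]))
    return dict(out)


def to_day_level_tags(rows):
    """Array layout: a single collection per key, merged in one pass."""
    out = defaultdict(set)
    for row in rows:
        out[(row["id"], row["day"])].update(row["tags"])
    return dict(out)


wide = to_day_level_flags(events)
narrow = to_day_level_tags(events)
print(wide[("a", 1)])
print(sorted(narrow[("a", 1)]))
```

Both results carry the same information for day 1, but the wide version touches every flag for every row, while the array version only merges what is actually present.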

This raises the question: should we always use `tags` then? Unfortunately, this has to be
determined on a case-by-case basis. Sometimes it is much easier to go with `flags`, as it is
the more direct path. Other times (big data), it might be more performant to go with `tags`
and accept the increased complexity. Ultimately, it is about trade-offs, and so we have
decided to maintain both.

## Available Functions
This section gives the details of node function and parameter names which can be used
while creating flags.


Below are the entry point functions:
```text
+------------------------------+--------------------------------------------+--------------------------------------------------------------------------+
| Function to create features  | Description                                | Function path                                                            |
+------------------------------+--------------------------------------------+--------------------------------------------------------------------------+
| `create_columns_from_config` | Creates basic, derived and custom features | pmpx_pkg/utilities/feature_generation/v1/nodes/features/create_column.py |
+------------------------------+--------------------------------------------+--------------------------------------------------------------------------+
```

Below are the available flag functions within the `flags` module:

In [None]:
from feature_generation.v1.core.features import flags
from types import FunctionType

list_of_flags = [
    x
    for x in dir(flags)
    if isinstance(getattr(flags, x), FunctionType) and not x.startswith("_")
]

In [None]:
import pandas as pd
from tabulate import tabulate

rows = []
for x in list_of_flags:
    # Use the first docstring line as the description; tolerate missing docstrings.
    doc = getattr(flags, x).__doc__ or ""
    x_doc = doc.strip().split("\n")[0]
    rows.append({"function": x, "description": x_doc})

table = pd.DataFrame(rows)
print(tabulate(table, headers=table.columns, tablefmt="psql"))

Each function will have an example of usage in both core API and node API. Core API
generally refers to calling the function in code, while node API generally refers to
calling a function based on parameters.
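
To make the distinction concrete, here is a minimal sketch of how a parameter-driven call could work in general. How the package actually resolves the `object` key is not shown in this section, so `load_object` and `call_from_params` are hypothetical helpers, and `fractions.Fraction` is only a stand-in target from the standard library.

```python
import importlib
from fractions import Fraction


def load_object(path):
    """Resolve a dotted path such as 'fractions.Fraction' to the object it names."""
    module_path, _, attr = path.rpartition(".")
    return getattr(importlib.import_module(module_path), attr)


def call_from_params(params):
    """Call the callable named by 'object', passing the remaining keys as kwargs."""
    kwargs = {k: v for k, v in params.items() if k != "object"}
    return load_object(params["object"])(**kwargs)


# Core-style: import the callable and call it directly in code.
core_result = Fraction(numerator=3, denominator=4)

# Node-style: the same call described purely as parameters (e.g. loaded from YAML).
node_result = call_from_params(
    {"object": "fractions.Fraction", "numerator": 3, "denominator": 4}
)

print(core_result == node_result)  # True
```

The node-style dictionary contains no Python code, which is what allows it to live in a parameters file.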

Note: The functions mentioned below can be used to create features either on `tags`
that have been converted to columns or base columns of a dataframe.

The `create_columns_from_config` function can be considered the base (entry point) function. It
takes an input config and creates both basic and derived features, adding
one new column/feature for every entry in the config.

This function is flexible and lets users choose how each feature is built.
One can choose from:

1. built-in functions
2. any utils available in the `feature_generation` utility
3. a custom function to apply on the input column(s).



## Flags

The purpose of a flag is to identify whether a condition is true.
A flag column takes only two values, one for true and the other for false, generally
represented as `1` and `0`.

The module is called `flags` because it was first created within the Real World Evidence
space, where many flags were created based on diagnosis codes, drug codes and so on.
It was later extended so that columns can also contain continuous numbers instead of just
a 1 or 0.

Flags can be created with the `create_columns_from_config` function introduced above, based on a
config (dictionary/YAML).
Creating flags for features usually requires many `IF...ELSE` or
`CASE WHEN` statements. Providing a config reduces the need for
massive `IF...ELSE` blocks, with the added benefit that a non-technical
person can verify the values corresponding to a given flag. Several default functions are
provided (`rlike`, `isin`, and `rlike_extract`), or the user is free to
provide their own flag function.
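
The benefit of a config over hard-coded branching can be illustrated with a small plain-Python sketch. It does not use the package: `flags_hard_coded` and `flags_from_config` are hypothetical helpers, and the config keys simply mirror the ones used in the examples of this section.

```python
# A few representative gender values.
rows = [{"gender": g} for g in ["male", "m", "female", "f"]]


def flags_hard_coded(row):
    """The IF...ELSE style: the values are buried in the code."""
    if row["gender"] == "m" or row["gender"] == "male":
        return {"gender_is_male": 1, "gender_is_female": 0}
    elif row["gender"] == "f" or row["gender"] == "female":
        return {"gender_is_male": 0, "gender_is_female": 1}
    return {"gender_is_male": 0, "gender_is_female": 0}


# The config style: one reviewable entry per flag.
flag_config = [
    {"output": "gender_is_male", "input": "gender", "values": ["m", "male"]},
    {"output": "gender_is_female", "input": "gender", "values": ["f", "female"]},
]


def flags_from_config(row, config):
    """Apply every config entry to a row; no per-flag branching needed."""
    return {c["output"]: int(row[c["input"]] in c["values"]) for c in config}


# Both styles agree on every row.
for row in rows:
    assert flags_hard_coded(row) == flags_from_config(row, flag_config)
```

Adding a new flag in the config style means adding one entry to `flag_config`, with no change to the code that evaluates it.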


Throughout this section, we will be using the following sample dataframe:

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    ArrayType,
    IntegerType,
    StringType,
    StructField,
    StructType,
)


spark = (
    SparkSession.builder.config("spark.sql.shuffle.partitions", 1)
    .config("spark.ui.showConsoleProgress", False)
    .getOrCreate()
)
schema = StructType(
    [
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
        StructField("gender", StringType(), True),
        StructField("occupation", StringType(), True),
        StructField("house", ArrayType(StringType()), True),
        StructField("number", ArrayType(IntegerType()), True),
    ]
)
data = [
    ("Gendry", 31, "male", "Data Engineer", ["House Baratheon"], [1, 2, 3]),
    ("Jaime", 12, "male", "Data Scientist", ["House Lannister"], [2, 3, 4]),
    ("Tyrion", 65, "m", "Data Analyst", ["House Lannister", "House Stark"], [3, 4, 5]),
    ("Cersei", 40, "female", "Engagement Manager", ["House Lannister"], [5, 6, 7]),
    (
        "Jon",
        31,
        "male",
        "Software Engineer",
        ["House Targaryen", "House Stark"],
        [6, 7, 8],
    ),
    ("Arya", 27, "f", "MLE", ["House Stark"], [7, 8, 9]),
    (
        "Sansa",
        26,
        "f",
        "daata translator",
        ["House Stark", "House Lannister"],
        [8, 9, 10],
    ),
    ("Daenerys", 25, "female", "Mother of Dragons", ["House Targaryen"], [9, 11, 13]),
]

df_sample = spark.createDataFrame(data, schema)

In [None]:
df_sample.show()

### `isin`
This generates a flag column that is true when the value in the input column matches any
of the values in the list. This function can be used for creating basic and derived features.

Core example below:

In [None]:
from feature_generation.v1.core.features.create_column import create_columns_from_config
from feature_generation.v1.core.features.flags import isin

isin_code_config = [
    isin(input="gender", values=["m", "male"], output="gender_is_male"),
    isin(input="gender", values=["f", "female"], output="gender_is_female"),
]

df_isin_code = create_columns_from_config(df_sample, isin_code_config)

df_isin_code.show()

Node example below:

In [None]:
from feature_generation.v1.nodes.features.create_column import (
    create_columns_from_config,
)

isin_config = [
    {
        "object": "feature_generation.v1.core.features.flags.isin",
        "output": "gender_is_male",
        "input": "gender",
        "values": ["m", "male"],
    },
    {
        "object": "feature_generation.v1.core.features.flags.isin",
        "output": "gender_is_female",
        "input": "gender",
        "values": ["f", "female"],
    },
]

df_isin = create_columns_from_config(df_sample, isin_config)

df_isin.show()

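Notice the different import paths: the core example imports `create_columns_from_config`
from `feature_generation.v1.core`, while the node example imports it from
`feature_generation.v1.nodes`.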

### `rlike`
This generates a flag feature if the regex pattern can be found in the column.
This function can be used for creating basic and derived features.
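
As a rough guide to the expected result, the matching logic can be mimicked with Python's `re` module on the `occupation` values of the sample dataframe. This sketch assumes that the entries in `values` are combined into a single alternation pattern and that a partial match anywhere in the string counts, as it does for Spark's `rlike`; how the package combines the values internally is not shown here.

```python
import re

# The `occupation` column of the sample dataframe, in row order.
occupations = [
    "Data Engineer",
    "Data Scientist",
    "Data Analyst",
    "Engagement Manager",
    "Software Engineer",
    "MLE",
    "daata translator",
    "Mother of Dragons",
]
values = ["Data", "daata"]

# Assumption: the values are OR-ed together into one pattern.
pattern = re.compile("|".join(values))

# A substring match anywhere in the string sets the flag.
is_data_job = [int(bool(pattern.search(o))) for o in occupations]
print(is_data_job)  # [1, 1, 1, 0, 0, 0, 1, 0]
```

Note that the match is case-sensitive, which is why the misspelled lowercase `daata` has to be listed explicitly.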

Core example below:

In [None]:
from feature_generation.v1.core.features.create_column import create_columns_from_config
from feature_generation.v1.core.features.flags import rlike

rlike_code_config = [
    rlike(input="occupation", values=["Data", "daata"], output="is_data_job")
]

df_rlike_code = create_columns_from_config(df_sample, rlike_code_config)

df_rlike_code.show()

Node example below:

In [None]:
from feature_generation.v1.nodes.features.create_column import (
    create_columns_from_config,
)

rlike_config = [
    {
        "object": "feature_generation.v1.core.features.flags.rlike",
        "output": "is_data_job",
        "input": "occupation",
        "values": ["Data", "daata"],
    },
]

df_rlike = create_columns_from_config(df_sample, rlike_config)

df_rlike.show()

### `arrays_overlap`
This generates a flag column if any element in the list can be found in the array.
This function can be used for creating basic and derived features.
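
The overlap test itself is a set intersection check. The plain-Python sketch below applies it to three of the `number` arrays from the sample dataframe; it assumes a 1/0 output and does not model how Spark's `arrays_overlap` treats null elements.

```python
# Three `number` arrays taken from the sample dataframe.
numbers = {
    "Gendry": [1, 2, 3],
    "Cersei": [5, 6, 7],
    "Daenerys": [9, 11, 13],
}
values = [2, 4, 6, 8, 10]

# The flag is set when the array shares at least one element with `values`.
overlap = {
    name: int(not set(arr).isdisjoint(values)) for name, arr in numbers.items()
}
print(overlap)  # {'Gendry': 1, 'Cersei': 1, 'Daenerys': 0}
```

A single shared element is enough to set the flag, so a column named `even_numbers` here means "contains at least one of the listed even numbers".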


Core example below:

In [None]:
from feature_generation.v1.core.features.create_column import create_columns_from_config
from feature_generation.v1.core.features.flags import arrays_overlap

arrays_overlap_code_config = [
    arrays_overlap(input="number", values=[2, 4, 6, 8, 10], output="even_numbers")
]

df_arrays_overlap_code = create_columns_from_config(
    df_sample, arrays_overlap_code_config
)

df_arrays_overlap_code.show()

Node example below:

In [None]:
from feature_generation.v1.nodes.features.create_column import (
    create_columns_from_config,
)

arrays_overlap_config = [
    {
        "object": "feature_generation.v1.core.features.flags.arrays_overlap",
        "output": "even_numbers",
        "input": "number",
        "values": [2, 4, 6, 8, 10],
    },
]

df_arrays_overlap = create_columns_from_config(df_sample, arrays_overlap_config)

df_arrays_overlap.show()

### `rlike_extract`
This generates two columns:

1. a flag column indicating whether the regex pattern can be found in the column.
2. a column containing the regex-matched text from the string in the column.

This function can be used for creating basic and derived features.
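
The two outputs can be pictured with a plain-Python sketch using `re`. Several details are assumptions: that the second column is named by appending the suffix to the output name (`is_female_match`), and that a non-match is stored as `None`. The package may use a different name or an empty string instead.

```python
import re

genders = ["male", "m", "female", "f"]
values = ["female"]
pattern = re.compile("|".join(values))

rows = []
for g in genders:
    m = pattern.search(g)
    rows.append(
        {
            "gender": g,
            "is_female": int(m is not None),  # flag column
            "is_female_match": m.group(0) if m else None,  # matched text (assumed name)
        }
    )

for row in rows:
    print(row)
```

With this pattern only the literal string `female` is flagged; the abbreviation `f` is not, because the regex must be found in the value rather than the other way round.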

Core example below:

In [None]:
from feature_generation.v1.core.features.create_column import create_columns_from_config
from feature_generation.v1.core.features.flags import rlike_extract

rlike_extract_code_config = [
    rlike_extract(
        input="gender", values=["female"], output="is_female", suffix="_match"
    )
]

df_rlike_extract_code = create_columns_from_config(df_sample, rlike_extract_code_config)

df_rlike_extract_code.show()

Node example below:

In [None]:
from feature_generation.v1.nodes.features.create_column import (
    create_columns_from_config,
)

rlike_extract_config = [
    {
        "object": "feature_generation.v1.core.features.flags.rlike_extract",
        "output": "is_female",
        "input": "gender",
        "values": ["female"],
        "suffix": "_match",
    },
]

df_rlike_extract = create_columns_from_config(df_sample, rlike_extract_config)

df_rlike_extract.show()

### `regexp_extract`
This generates a column containing the regex-matched text from the string in the column.
This function can be used for creating basic and derived features.

Core example below:

In [None]:
from feature_generation.v1.core.features.create_column import create_columns_from_config
from feature_generation.v1.core.features.flags import regexp_extract

regexp_extract_code_config = [
    regexp_extract(input="name", values=[".*e.*"], output="name_contains_e")
]

df_regexp_extract_code = create_columns_from_config(
    df_sample, regexp_extract_code_config
)

df_regexp_extract_code.show()

Node example below:

In [None]:
from feature_generation.v1.nodes.features.create_column import (
    create_columns_from_config,
)

regexp_extract_config = [
    {
        "object": "feature_generation.v1.core.features.flags.regexp_extract",
        "output": "name_contains_e",
        "input": "name",
        "values": [".*e.*"],
    },
]

df_regexp_extract = create_columns_from_config(df_sample, regexp_extract_config)

df_regexp_extract.show()

### `expr_flag`
This is a wrapper for the `f.expr` function: the column is created from a SQL expression string.
This function can be used for creating basic and derived features.

Core example below:

In [None]:
from feature_generation.v1.core.features.create_column import create_columns_from_config
from feature_generation.v1.core.features.flags import expr_flag

expr_flag_code_config = [expr_flag(expr="age < 30", output="age_lt_30")]

df_expr_flag_code = create_columns_from_config(df_sample, expr_flag_code_config)

df_expr_flag_code.show()

Node example below:

In [None]:
from feature_generation.v1.nodes.features.create_column import (
    create_columns_from_config,
)

expr_flag_config = [
    {
        "object": "feature_generation.v1.core.features.flags.expr_flag",
        "output": "age_lt_30",
        "expr": "age < 30",
    },
]

df_expr_flag = create_columns_from_config(df_sample, expr_flag_config)

df_expr_flag.show()