Various documentation fix #477

Merged
merged 29 commits into from
Aug 2, 2022

Commits
57574ed
Update README.md
xiaoyongzhu Jun 28, 2022
5cf8c41
Merge branch 'main' into xiaoyzhu/doc_fix2
xiaoyongzhu Jul 15, 2022
91533d3
update docs per feedback
xiaoyongzhu Jul 15, 2022
e1c16c6
Update feathr-concepts-for-beginners.md
xiaoyongzhu Jul 15, 2022
1fa08a7
Update feathr-concepts-for-beginners.md
xiaoyongzhu Jul 15, 2022
e1b0aec
update materialization setting doc
xiaoyongzhu Jul 19, 2022
1f26445
Update get-offline-features.md
xiaoyongzhu Jul 19, 2022
c752e01
Update get-offline-features.md
xiaoyongzhu Jul 19, 2022
132e1bd
Update feathr-concepts-for-beginners.md
xiaoyongzhu Jul 19, 2022
85a1c44
resolve comments
xiaoyongzhu Jul 20, 2022
3f26272
Update job_utils.py
xiaoyongzhu Jul 20, 2022
bea2a17
fix typos
xiaoyongzhu Jul 20, 2022
16b5c6d
Update job_utils.py
xiaoyongzhu Jul 20, 2022
f2e06e0
Update client.py
xiaoyongzhu Jul 21, 2022
7d9d488
format doc
xiaoyongzhu Jul 30, 2022
283f06d
Merge branch 'main' into xiaoyzhu/doc_fix2
xiaoyongzhu Jul 30, 2022
f7bdc21
Address comments
xiaoyongzhu Jul 30, 2022
1e12b97
Update WriteToHDFSOutputProcessor.scala
xiaoyongzhu Jul 30, 2022
2bb7369
Update WriteToHDFSOutputProcessor.scala
xiaoyongzhu Jul 30, 2022
f38a903
resolve comments
xiaoyongzhu Aug 1, 2022
d17c5fa
Resolve comments
xiaoyongzhu Aug 1, 2022
e8266da
Merge branch 'main' into xiaoyzhu/doc_fix2
xiaoyongzhu Aug 1, 2022
64760ce
fix test failures and typos
xiaoyongzhu Aug 1, 2022
8916be4
Update job_utils.py
xiaoyongzhu Aug 1, 2022
9a213ba
fix comments and formats/typos
xiaoyongzhu Aug 1, 2022
8edb706
fix typos and test failures
xiaoyongzhu Aug 1, 2022
5a97bcd
update test names
xiaoyongzhu Aug 1, 2022
2f00667
Update test_fixture.py
xiaoyongzhu Aug 1, 2022
e0c7427
Update test_fixture.py
xiaoyongzhu Aug 1, 2022
13 changes: 11 additions & 2 deletions docs/concepts/feathr-concepts-for-beginners.md
@@ -126,9 +126,18 @@ client.get_online_features(feature_table = "agg_features",
## Illustration

An illustration of the concepts and process that we talked about is like this:
![Feature Join Process](../images/observation_data.jpg)
![Observation Data and Feature Query Process](../images/observation_data.jpg)

## Point in time joins and aggregations
## Miscellaneous topics

### A bit more on `Observation Data`

The "Observation Data" is a concept that is a bit confusing for some beginners, and simply think it as an immutable dataset, but this dataset could be enhanced by other dataset. For example, you usually cannot drop a column for your "observation data", but you can add additional columns to it.

### What's the relationship between `Source` and `Anchor`?
An Anchor usually has exactly one source, but one source can be consumed by multiple anchors. Between `Source` and `Anchor` there can be an intermediate step, the "preprocessing" function, which lets you customize the input before features are extracted. A minimal sketch of this relationship is shown below.
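
As a rough illustration - a minimal sketch assuming the standard Feathr Python client API, with a hypothetical source path, column names, and feature expressions - one source can feed multiple anchors, optionally through a preprocessing function:

```python
from pyspark.sql import DataFrame
from feathr import Feature, FeatureAnchor, HdfsSource, FLOAT

# Hypothetical preprocessing step: it receives the source DataFrame and
# returns a modified DataFrame before any features are extracted from it.
def add_fare_with_tip(df: DataFrame) -> DataFrame:
    return df.withColumn("fare_with_tip", df.fare_amount + df.tip_amount)

# One source (path and columns are hypothetical)...
batch_source = HdfsSource(name="nycTaxiBatchSource",
                          path="wasbs://public@azurefeathrstorage.blob.core.windows.net/sample_data/green_tripdata_2020-04.csv",
                          preprocessing=add_fare_with_tip,
                          event_timestamp_column="lpep_dropoff_datetime",
                          timestamp_format="yyyy-MM-dd HH:mm:ss")

# ...consumed by two different anchors.
trip_anchor = FeatureAnchor(name="tripFeatures",
                            source=batch_source,
                            features=[Feature(name="f_trip_distance",
                                              feature_type=FLOAT,
                                              transform="trip_distance")])

fare_anchor = FeatureAnchor(name="fareFeatures",
                            source=batch_source,
                            features=[Feature(name="f_fare_with_tip",
                                              feature_type=FLOAT,
                                              transform="fare_with_tip")])
```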

### Point-in-time joins and aggregations - why do we need them?

Most users are already familiar with "regular" joins such as inner joins and outer joins. In many feature engineering use cases, however, we also care about time.

94 changes: 0 additions & 94 deletions docs/concepts/feature-join.md

This file was deleted.

85 changes: 85 additions & 0 deletions docs/concepts/get-offline-features.md
@@ -0,0 +1,85 @@
---
layout: default
title: Getting Offline Features using Feature Query
parent: Feathr Concepts
---

# Getting Offline Features using Feature Query

## Intuitions

After the feature producers have defined the features (as described in the [Feature Definition](./feature-definition.md) part), the feature consumers may want to consume those features.

For example, consider the dataset below, where there are 3 tables from which feature producers want to extract features: `user_profile_mock_data`, `user_purchase_history_mock_data`, and `product_detail_mock_data`.

Feature consumers will usually use a central dataset ("observation data", `user_observation_mock_data` in this case) which contains a few IDs (`user_id` and `product_id` in this case), timestamps, and other columns. They use this "observation data" to query features from the different feature tables (via the `Feature Query` shown below).

![Feature Flow](https://github.com/linkedin/feathr/blob/main/docs/images/product_recommendation_advanced.jpg?raw=true)

As we can see, the use case for getting offline features with Feathr is straightforward. Feature consumers want to get a few features - for a particular user, what's the gift card balance? What's the total purchase in the last 90 days? Feature consumers can also get features for other entities in the same request; for example, they can query product features such as product quantity and price at the same time.

In this case, Feathr users simply specify the names of the features they want to query and the entity/key to query them for, as shown below. Note that feature consumers don't have to query all the features; they can query just a subset of the features that the feature producers have defined.

```python
user_feature_query = FeatureQuery(
feature_list=["feature_user_age",
"feature_user_tax_rate",
"feature_user_gift_card_balance",
"feature_user_has_valid_credit_card",
"feature_user_total_purchase_in_90days",
"feature_user_purchasing_power"
],
key=user_id)

product_feature_query = FeatureQuery(
feature_list=[
"feature_product_quantity",
"feature_product_price"
],
key=product_id)
```

And specify the location for the observation data:

```python
settings = ObservationSettings(
observation_path="wasbs://public@azurefeathrstorage.blob.core.windows.net/sample_data/product_recommendation_sample/user_observation_mock_data.csv",
event_timestamp_column="event_timestamp",
timestamp_format="yyyy-MM-dd")
```

Finally, pass in the observation settings and feature queries, and trigger the computation:

```python
client.get_offline_features(observation_settings=settings,
feature_query=[user_feature_query, product_feature_query],
output_path=output_path)

```

More details on the above APIs can be found at:

- [ObservationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.ObservationSettings)
- [client.get_offline_feature API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.get_offline_features)

## More on `Observation data`

The observation data serves as the 'spine' of the training dataset that will be created; we call this input 'spine' dataset the 'observation' dataset. Typically, each row of the observation data contains:

1. **Entity ID Column:** Column(s) representing entity id(s), which will be used as the join key to query feature value.

2. **Timestamp Column:** A column representing the event time of the row. By default, Feathr will make sure the feature values queried have a timestamp earlier than the timestamp in observation data, ensuring no data leakage in the resulting training dataset. Refer to [Point in time Joins](./point-in-time-join.md) for more details.

3. **Other columns:** These are simply passed through to the output training dataset and can be treated as immutable columns (see the illustrative sketch after this list).
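
For illustration only - the column names and values below are made up - an observation dataset might combine two entity ID columns, an event timestamp column, and a pass-through label column:

```python
import pandas as pd

# Hypothetical observation data: entity IDs used as join keys, an event
# timestamp, and a pass-through column (a label used later for training).
observation_df = pd.DataFrame({
    "user_id":         [101, 102, 103],
    "product_id":      [7, 7, 9],
    "event_timestamp": ["2022-05-01", "2022-05-02", "2022-05-03"],
    "purchased":       [1, 0, 1],
})
print(observation_df)
```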

## More on `Feature Query`

After you have defined all the features, you may not want to use all of them in a particular program. In that case, instead of putting every feature in the `FeatureQuery`, you can list only the features you need, as in the sketch below. Note that all features in a single `FeatureQuery` must share the same key.
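
For example, a sketch of querying only a subset of the user features defined earlier (assuming the same `user_id` key object used above) could look like this:

```python
from feathr import FeatureQuery

# Query only two of the previously defined user features; all features in a
# single FeatureQuery must share the same key (`user_id` here).
subset_query = FeatureQuery(
    feature_list=["feature_user_age",
                  "feature_user_gift_card_balance"],
    key=user_id)
```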

## Difference between `materialize_features` and `get_offline_features` API

The "getting offline features" described in this document and "[getting materialized features](./materializing-features.md)" are sometimes confused, since both seem to "get features and put them somewhere". However, there are some differences, and you should know when to use which:

1. With the `get_offline_features` API, feature consumers usually have a central `observation data` so they can use a `Feature Query` to query different features for different entities from different tables. With the `materialize_features` API, feature consumers don't have `observation data`, because they don't need to query against existing feature definitions; they only specify, for a specific entity (say `user_id`), which features they want to materialize to an offline or online store. Note that within one feature table in the materialization settings, feature consumers can only materialize features that share the same key.

2. Regarding timestamps: with the `get_offline_features` API, Feathr makes sure the feature values queried have a timestamp earlier than the timestamp in the observation data, ensuring no data leakage in the resulting training dataset. With the `materialize_features` API, Feathr always materializes the latest feature values available in the dataset.
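
To make the contrast concrete, here is a minimal sketch of the `materialize_features` side, assuming the standard Feathr Python client and a previously created `client`; the sink table name and feature names are hypothetical. Note that no observation data is involved:

```python
from feathr import MaterializationSettings, RedisSink

# No observation data here: we only list which features to materialize, and
# they are pushed to the configured online store (Redis in this sketch).
redis_sink = RedisSink(table_name="user_features")

settings = MaterializationSettings(
    name="user_feature_materialization",
    sinks=[redis_sink],
    feature_names=["feature_user_age", "feature_user_gift_card_balance"])

client.materialize_features(settings)
```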
@@ -1,16 +1,16 @@
---
layout: default
title: Feature Generation and Materialization
title: Feature Materialization (also known as feature generation)
parent: Feathr Concepts
---

# Feature Generation and Materialization
# Feature Materialization (also known as feature generation)

Feature generation (also known as feature materialization) is the process to create features from raw source data into a certain persisted storage in either offline store (for further reuse), or online store (for online inference).
Feature materialization (also known as feature generation) is the process of creating features for a certain entity from raw source data and persisting them to storage, either an offline store (for further reuse) or an online store (for online inference).

User can utilize feature generation to pre-compute and materialize pre-defined features to online and/or offline storage. This is desirable when the feature transformation is computation intensive or when the features can be reused (usually in offline setting). Feature generation is also useful in generating embedding features, where those embeddings distill information from large data and is usually more compact.
Users can utilize feature materialization to pre-compute and materialize pre-defined features to online and/or offline storage. This is desirable when the feature transformation is computation intensive or when the features can be reused (usually in an offline setting). Feature materialization is also useful for generating embedding features, where the embeddings distill information from large datasets and are usually more compact. Also, please note that you can only materialize features for a specific entity/key in a single `materialize_features` call.

## Generating Features to Online Store
## Materializing Features to Online Store

When models are served in an online environment, we need to serve the corresponding features in the same online environment as well. Feathr provides APIs to materialize features to online storage for future consumption. For example:

@@ -119,7 +119,7 @@ client.materialize_features(settings, execution_configurations={ "spark.feathr.o
To read those materialized features, Feathr has a convenient helper function called `get_result_df` to help you view the data. For example, you can use the sample code below to read the materialized result from the offline store:

```python

from feathr import get_result_df
path = "abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/20/"
res = get_result_df(client=client, format="parquet", res_url=path)
```
25 changes: 24 additions & 1 deletion docs/quickstart_databricks.md
@@ -11,9 +11,32 @@ For Databricks, you can simply upload [this notebook](./samples/databricks/datab

![Import Notebooks](./images/databricks_quickstart1.png)


2. Paste the [link to Databricks getting started notebook](./samples/databricks/databricks_quickstart_nyc_taxi_driver.ipynb):

![Import Notebooks](./images/databricks_quickstart2.png)

3. Run the whole notebook. It will automatically install Feathr in your cluster and run the feature ingestion jobs.

# Authoring Feathr jobs in a local environment and submitting to a remote Databricks cluster

Not everyone wants to use a Databricks notebook as their main development environment, and the part above is mainly for quick-start purposes. For more serious development, we usually recommend Visual Studio Code, which has [native support for Python and Jupyter Notebooks](https://code.visualstudio.com/docs/datascience/jupyter-notebooks) with many great features such as syntax highlighting and IntelliSense.

In [this notebook](./samples/databricks/databricks_quickstart_nyc_taxi_driver.ipynb), there are a few lines of code like this:

```python
# Get current databricks notebook context
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
host_name = ctx.tags().get("browserHostName").get()
host_token = ctx.apiToken().get()
cluster_id = ctx.tags().get("clusterId").get()
```

This is the only part you need to change to author the Feathr job in a local environment and submit it to a remote Databricks cluster. When running this code in Databricks, Feathr automatically reads the current cluster's host name and authentication token using the lines above, but that is not possible when authoring the job locally. In that case, you need to change those lines to the following:

```python
# Authoring Feathr jobs in local environment and submit to remote Databricks cluster
host_name = 'https://adb-6885802458123232.12.azuredatabricks.net/'
host_token = 'dapi11111111111111111111'
```

And that's it! Feathr will automatically submit the job to the cluster you specified.
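
As a rough sketch of how these values are typically wired into the client when authoring locally (this assumes the usual Feathr configuration flow, where the Databricks access token is read from the `DATABRICKS_WORKSPACE_TOKEN_VALUE` environment variable and the workspace URL and cluster settings come from `feathr_config.yaml`; the exact keys may differ in your setup):

```python
import os
from feathr import FeathrClient

# Assumption: Feathr reads the Databricks access token from this environment
# variable; the workspace URL and cluster settings live in feathr_config.yaml.
os.environ["DATABRICKS_WORKSPACE_TOKEN_VALUE"] = "dapi11111111111111111111"

client = FeathrClient(config_path="./feathr_config.yaml")
# Subsequent calls such as client.get_offline_features(...) will submit Spark
# jobs to the remote Databricks cluster configured above.
```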
2 changes: 1 addition & 1 deletion feathr_project/feathr/definition/transformation.py
@@ -43,7 +43,7 @@ class WindowAggTransformation(Transformation):
agg_func: aggregation function. Available values: `SUM`, `COUNT`, `MAX`, `MIN`, `AVG`, `MAX_POOLING`, `MIN_POOLING`, `AVG_POOLING`, `LATEST`
window: Time window length to apply the aggregation. Supports 4 types of units: d(day), h(hour), m(minute), s(second). Example values are "7d", "5h", "3m", or "1s"
group_by: Feathr expressions applied after the `agg_expr` transformation as groupby field, before aggregation, same as 'group by' in SQL
filter: Feathr expression applied to each row as a filter before aggregation
filter: Feathr expression applied to each row as a filter before aggregation. This should be a string containing a valid Spark SQL expression, for example: filter = 'age > 3'. This is similar to the PySpark filter operation; more details can be found here: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.filter.html
"""
def __init__(self, agg_expr: str, agg_func: str, window: str, group_by: Optional[str] = None, filter: Optional[str] = None, limit: Optional[int] = None) -> None:
super().__init__()
4 changes: 3 additions & 1 deletion feathr_project/test/test_fixture.py
@@ -68,7 +68,9 @@ def basic_test_setup(config_path: str):
feature_type=FLOAT,
transform=WindowAggTransformation(agg_expr="cast_float(fare_amount)",
agg_func="AVG",
window="90d")),
window="90d",
filter="fare_amount > 0",
)),
Feature(name="f_location_max_fare",
key=location_id,
feature_type=FLOAT,
Expand Down