Add docs for input and output format and expected behaviors (feathr-ai#575)

* Create feathr-input-format.md

* update docs

* Update feathr-input-format.md

* address comments

* Update feathr-job-configuration.md

* Update feathr-input-format.md

* Update build-and-push-feathr-registry-docker-image.md

* move file to the right hierarchy

* Update README.md
xiaoyongzhu authored and ahlag committed Aug 26, 2022
1 parent cffa764 commit 90f328d
Showing 6 changed files with 49 additions and 9 deletions.
2 changes: 1 addition & 1 deletion docs/README.md
@@ -164,7 +164,7 @@ Follow the [quick start Jupyter Notebook](./samples/product_recommendation_demo.
| Feature Registry and Governance | Azure Purview, ANSI SQL such as Azure SQL Server |
| Compute Engine | Azure Synapse Spark Pools, Databricks |
| Machine Learning Platform | Azure Machine Learning, Jupyter Notebook, Databricks Notebook |
-| File Format | Parquet, ORC, Avro, JSON, Delta Lake |
+| File Format | Parquet, ORC, Avro, JSON, Delta Lake, CSV |
| Credentials | Azure Key Vault |

## 🚀 Roadmap
@@ -1,7 +1,7 @@
---
layout: default
title: How to build and push feathr registry docker image
-parent: How-to Guides
+parent: Developer Guides
---

# How to build and push feathr registry docker image
@@ -1,12 +1,12 @@
---
layout: default
title: Feathr REST API Deployment
-parent: How-to Guides
+parent: Developer Guides
---

# Feathr REST API

-The API currently supports following functionality
+The REST API currently supports the following functionalities:

1. Get Feature by Qualified Name
2. Get Feature by GUID
20 changes: 20 additions & 0 deletions docs/how-to-guides/feathr-input-format.md
@@ -0,0 +1,20 @@
---
layout: default
title: Input File Format for Feathr
parent: How-to Guides
---

# Input File Format for Feathr

Feathr supports multiple file formats, including Parquet, ORC, Avro, JSON, Delta Lake, and CSV. Formats are recognized in the following order:

1. If the input path has a file extension, it is honored. For example, `wasb://demodata@demodata/user_profile.csv` is recognized as CSV, while `wasb://demodata@demodata/product_id.parquet` is recognized as Parquet. Note that this is per-file behavior.
2. If the input path has no extension, say `wasb://demodata@demodata/user_click_stream`, users can optionally set a parameter to tell Feathr which format to use when reading those files. Refer to the `spark.feathr.inputFormat` setting in [Feathr Job Configuration](./feathr-job-configuration.md) for details on how to set it, with code examples; a sketch also follows this list. Note that this is a global setting that applies to every input whose format is not otherwise recognized.
3. If neither of the above applies, Feathr uses `avro` as the default format.
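
For illustration, a minimal sketch of setting this global input format, assuming a configured `FeathrClient` named `client` and `settings`, `feature_query`, and `output_path` prepared as in the Feathr examples:

```python
from feathr import SparkExecutionConfiguration

# Sketch: read extension-less inputs such as
# wasb://demodata@demodata/user_click_stream as CSV.
# `client`, `settings`, `feature_query`, and `output_path` are assumed
# to be set up elsewhere, as in the Feathr quick-start notebook.
client.get_offline_features(
    observation_settings=settings,
    feature_query=feature_query,
    output_path=output_path,
    execution_configurations=SparkExecutionConfiguration(
        {"spark.feathr.inputFormat": "csv"}
    ),
)
```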

## Special note for Spark outputs

Many Spark users use the Delta Lake format to store results. In those cases, the result folder will look something like this:
![Spark Output](../images/spark-output.png)

Please note that although the files inside are shown as "parquet", you should use the path of the parent folder and read it with the `delta` format.
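
To read such a folder, a minimal PySpark sketch (the output path here is hypothetical, and the Delta Lake package is assumed to be available on the cluster):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-feathr-output").getOrCreate()

# Hypothetical parent folder produced by a Feathr job; point at the
# folder itself, not the individual parquet files inside it.
output_folder = "wasb://demodata@demodata/feathr_output"

df = spark.read.format("delta").load(output_folder)
df.show(10)
```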
30 changes: 25 additions & 5 deletions docs/how-to-guides/feathr-job-configuration.md
Original file line number Diff line number Diff line change
@@ -8,8 +8,28 @@ parent: How-to Guides

Since Feathr uses Spark as the underlying execution engine, Spark configuration can be overridden by passing the `execution_configurations` parameter to `FeathrClient.get_offline_features()`. The complete list of available Spark configurations is in [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html) (though not all of them are honored on cloud-hosted Spark platforms such as Databricks), and a few Feathr-specific ones are documented here:

| Property Name | Default | Meaning | Since Version |
| --- | --- | --- | --- |
| spark.feathr.inputFormat | None | Specify the input format if it cannot be determined automatically. By default, Feathr reads files by parsing the file extension; however, if the file/folder name has no extension, this configuration can be set to tell Feathr which format to use when reading the data. Currently it can only be set to Spark built-in short names, including `json`, `parquet`, `jdbc`, `orc`, `libsvm`, `csv`, and `text`. For more details, see ["Manually Specifying Options"](https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#manually-specifying-options). Additionally, `delta` is supported if users want to read Delta Lake. | 0.2.1 |
| spark.feathr.outputFormat | None | Specify the output format. `avro` is the default if this value is not set. Currently it can only be set to Spark built-in short names, including `json`, `parquet`, `jdbc`, `orc`, `libsvm`, `csv`, and `text`. For more details, see ["Manually Specifying Options"](https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#manually-specifying-options). Additionally, `delta` is supported if users want to write Delta Lake. | 0.2.1 |
| spark.feathr.inputFormat.csvOptions.sep | None | Specify the delimiter, e.g. "," for commas or "\t" for tabs (supports both CSV and TSV). | 0.6.0 |

## Examples of using job configurations

Example of using the above job configurations when getting offline features:

```python
client.get_offline_features(
    observation_settings=settings,
    feature_query=feature_query,
    output_path=output_path,
    execution_configurations=SparkExecutionConfiguration(
        {"spark.feathr.inputFormat": "parquet", "spark.feathr.outputFormat": "parquet"}
    ),
    verbose=True,
)
```

Example of using the above job configurations when materializing features:

```python
client.materialize_features(
    settings,
    execution_configurations=SparkExecutionConfiguration(
        {"spark.feathr.inputFormat": "parquet", "spark.feathr.outputFormat": "parquet"}
    ),
)
```
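
For the CSV delimiter option, a minimal sketch, assuming a tab-separated input whose path has no extension, so the input format and the separator are set together (the surrounding arguments are the same as above):

```python
client.get_offline_features(
    observation_settings=settings,
    feature_query=feature_query,
    output_path=output_path,
    execution_configurations=SparkExecutionConfiguration(
        {
            # Assumption: the observation data is tab-separated and has
            # no file extension, so both settings are needed together.
            "spark.feathr.inputFormat": "csv",
            "spark.feathr.inputFormat.csvOptions.sep": "\t",
        }
    ),
)
```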
Binary file added docs/images/spark-output.png
