Add docs for input and output format and expected behaviors (feathr-ai#575)

* Create feathr-input-format.md

* update docs

* Update feathr-input-format.md

* address comments

* Update feathr-job-configuration.md

* Update feathr-input-format.md

* Update build-and-push-feathr-registry-docker-image.md

* move file to the right hierarchy

* Update README.md
xiaoyongzhu authored and ahlag committed Aug 26, 2022
1 parent cffa764 commit 90f328d
Showing 6 changed files with 49 additions and 9 deletions.
2 changes: 1 addition & 1 deletion docs/README.md
@@ -164,7 +164,7 @@ Follow the [quick start Jupyter Notebook](./samples/product_recommendation_demo.
| Feature Registry and Governance | Azure Purview, ANSI SQL such as Azure SQL Server |
| Compute Engine | Azure Synapse Spark Pools, Databricks |
| Machine Learning Platform | Azure Machine Learning, Jupyter Notebook, Databricks Notebook |
-| File Format | Parquet, ORC, Avro, JSON, Delta Lake |
+| File Format | Parquet, ORC, Avro, JSON, Delta Lake, CSV |
| Credentials | Azure Key Vault |

## 🚀 Roadmap
@@ -1,7 +1,7 @@
---
layout: default
title: How to build and push feathr registry docker image
-parent: How-to Guides
+parent: Developer Guides
---

# How to build and push feathr registry docker image
@@ -1,12 +1,12 @@
---
layout: default
title: Feathr REST API Deployment
-parent: How-to Guides
+parent: Developer Guides
---

# Feathr REST API

-The API currently supports following functionality
+The REST API currently supports the following functionalities:

1. Get Feature by Qualified Name
2. Get Feature by GUID
20 changes: 20 additions & 0 deletions docs/how-to-guides/feathr-input-format.md
@@ -0,0 +1,20 @@
---
layout: default
title: Input File Format for Feathr
parent: How-to Guides
---

# Input File Format for Feathr

Feathr supports multiple file formats, including Parquet, ORC, Avro, JSON, Delta Lake, and CSV. Formats are recognized in the following order:

1. If the input path has a file extension, it is honored. For example, `wasb://demodata@demodata/user_profile.csv` is recognized as CSV, while `wasb://demodata@demodata/product_id.parquet` is recognized as Parquet. Note that this is per-file behavior.
2. If the input path has no extension, say `wasb://demodata@demodata/user_click_stream`, users can optionally set a parameter to tell Feathr which format to use when reading those files. Refer to the `spark.feathr.inputFormat` setting in [Feathr Job Configuration](./feathr-job-configuration.md) for details on how to set it, with code examples; a sketch also follows this list. Note that this is a global setting that applies to every input whose format is not otherwise recognized.
3. If neither of the above applies, Feathr uses `avro` as the default format.
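
For illustration, a minimal sketch of setting this global input format, assuming a configured `FeathrClient` named `client` and `settings`, `feature_query`, and `output_path` prepared as in the Feathr examples:

```python
from feathr import SparkExecutionConfiguration

# Sketch: read extension-less inputs such as
# wasb://demodata@demodata/user_click_stream as CSV.
# `client`, `settings`, `feature_query`, and `output_path` are assumed
# to be set up elsewhere, as in the Feathr quick-start notebook.
client.get_offline_features(
    observation_settings=settings,
    feature_query=feature_query,
    output_path=output_path,
    execution_configurations=SparkExecutionConfiguration(
        {"spark.feathr.inputFormat": "csv"}
    ),
)
```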

## Special note for Spark outputs

Many Spark users use the Delta Lake format to store results. In those cases, the result folder will look something like this:
![Spark Output](../images/spark-output.png)

Please note that although the files inside are shown as "parquet", you should use the path of the parent folder and read it with the `delta` format.
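
To read such a folder, a minimal PySpark sketch (the output path here is hypothetical, and the Delta Lake package is assumed to be available on the cluster):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-feathr-output").getOrCreate()

# Hypothetical parent folder produced by a Feathr job; point at the
# folder itself, not the individual parquet files inside it.
output_folder = "wasb://demodata@demodata/feathr_output"

df = spark.read.format("delta").load(output_folder)
df.show(10)
```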
30 changes: 25 additions & 5 deletions docs/how-to-guides/feathr-job-configuration.md
Original file line number Diff line number Diff line change
@@ -8,8 +8,28 @@ parent: How-to Guides

Since Feathr uses Spark as the underlying execution engine, Spark configuration can be overridden by passing the `execution_configurations` parameter to `FeathrClient.get_offline_features()`. The complete list of available Spark configurations is in [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html) (though not all of them are honored on cloud-hosted Spark platforms such as Databricks), and a few Feathr-specific ones are documented here:

| Property Name | Default | Meaning | Since Version |
| --- | --- | --- | --- |
| spark.feathr.inputFormat | None | Specify the input format if it cannot be determined automatically. By default, Feathr reads files by parsing the file extension; however, if the file/folder name has no extension, this configuration can be set to tell Feathr which format to use when reading the data. Currently it can only be set to Spark built-in short names, including `json`, `parquet`, `jdbc`, `orc`, `libsvm`, `csv`, and `text`. For more details, see ["Manually Specifying Options"](https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#manually-specifying-options). Additionally, `delta` is supported if users want to read Delta Lake. | 0.2.1 |
| spark.feathr.outputFormat | None | Specify the output format. `avro` is the default if this value is not set. Currently it can only be set to Spark built-in short names, including `json`, `parquet`, `jdbc`, `orc`, `libsvm`, `csv`, and `text`. For more details, see ["Manually Specifying Options"](https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#manually-specifying-options). Additionally, `delta` is supported if users want to write Delta Lake. | 0.2.1 |
| spark.feathr.inputFormat.csvOptions.sep | None | Specify the delimiter, e.g. "," for commas or "\t" for tabs (supports both CSV and TSV). | 0.6.0 |

## Examples of using job configurations

Example of using the above job configurations when getting offline features:

```python
client.get_offline_features(
    observation_settings=settings,
    feature_query=feature_query,
    output_path=output_path,
    execution_configurations=SparkExecutionConfiguration(
        {"spark.feathr.inputFormat": "parquet", "spark.feathr.outputFormat": "parquet"}
    ),
    verbose=True,
)
```

Example of using the above job configurations when materializing features:

```python
client.materialize_features(
    settings,
    execution_configurations=SparkExecutionConfiguration(
        {"spark.feathr.inputFormat": "parquet", "spark.feathr.outputFormat": "parquet"}
    ),
)
```
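
For the CSV delimiter option, a minimal sketch, assuming a tab-separated input whose path has no extension, so the input format and the separator are set together (the surrounding arguments are the same as above):

```python
client.get_offline_features(
    observation_settings=settings,
    feature_query=feature_query,
    output_path=output_path,
    execution_configurations=SparkExecutionConfiguration(
        {
            # Assumption: the observation data is tab-separated and has
            # no file extension, so both settings are needed together.
            "spark.feathr.inputFormat": "csv",
            "spark.feathr.inputFormat.csvOptions.sep": "\t",
        }
    ),
)
```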
Binary file added docs/images/spark-output.png
