Various documentation fix #477

Merged
merged 29 commits into from
Aug 2, 2022

Commits
57574ed
Update README.md
xiaoyongzhu Jun 28, 2022
5cf8c41
Merge branch 'main' into xiaoyzhu/doc_fix2
xiaoyongzhu Jul 15, 2022
91533d3
update docs per feedback
xiaoyongzhu Jul 15, 2022
e1c16c6
Update feathr-concepts-for-beginners.md
xiaoyongzhu Jul 15, 2022
1fa08a7
Update feathr-concepts-for-beginners.md
xiaoyongzhu Jul 15, 2022
e1b0aec
update materialization setting doc
xiaoyongzhu Jul 19, 2022
1f26445
Update get-offline-features.md
xiaoyongzhu Jul 19, 2022
c752e01
Update get-offline-features.md
xiaoyongzhu Jul 19, 2022
132e1bd
Update feathr-concepts-for-beginners.md
xiaoyongzhu Jul 19, 2022
85a1c44
resolve comments
xiaoyongzhu Jul 20, 2022
3f26272
Update job_utils.py
xiaoyongzhu Jul 20, 2022
bea2a17
fix typos
xiaoyongzhu Jul 20, 2022
16b5c6d
Update job_utils.py
xiaoyongzhu Jul 20, 2022
f2e06e0
Update client.py
xiaoyongzhu Jul 21, 2022
7d9d488
format doc
xiaoyongzhu Jul 30, 2022
283f06d
Merge branch 'main' into xiaoyzhu/doc_fix2
xiaoyongzhu Jul 30, 2022
f7bdc21
Address comments
xiaoyongzhu Jul 30, 2022
1e12b97
Update WriteToHDFSOutputProcessor.scala
xiaoyongzhu Jul 30, 2022
2bb7369
Update WriteToHDFSOutputProcessor.scala
xiaoyongzhu Jul 30, 2022
f38a903
resolve comments
xiaoyongzhu Aug 1, 2022
d17c5fa
Resolve comments
xiaoyongzhu Aug 1, 2022
e8266da
Merge branch 'main' into xiaoyzhu/doc_fix2
xiaoyongzhu Aug 1, 2022
64760ce
fix test failures and typos
xiaoyongzhu Aug 1, 2022
8916be4
Update job_utils.py
xiaoyongzhu Aug 1, 2022
9a213ba
fix comments and formats/typos
xiaoyongzhu Aug 1, 2022
8edb706
fix typos and test failures
xiaoyongzhu Aug 1, 2022
5a97bcd
update test names
xiaoyongzhu Aug 1, 2022
2f00667
Update test_fixture.py
xiaoyongzhu Aug 1, 2022
e0c7427
Update test_fixture.py
xiaoyongzhu Aug 1, 2022
13 changes: 11 additions & 2 deletions docs/concepts/feathr-concepts-for-beginners.md
@@ -126,9 +126,18 @@ client.get_online_features(feature_table = "agg_features",
## Illustration

An illustration of the concepts and process that we talked about is like this:
![Feature Join Process](../images/observation_data.jpg)
![Observation Data and Feature Query Process](../images/observation_data.jpg)

## Point in time joins and aggregations
## Miscellaneous topics

### A bit more on `Observation Data`

The "Observation Data" is a concept that is a bit confusing for some beginners, and simply think it as an immutable dataset, but this dataset could be enhanced by other dataset. For example, you usually cannot drop a column for your "observation data", but you can add additional columns to it.

### What's the relationship between `Source` and `Anchor`?
An Anchor usually has exactly one source, but one source can be consumed by multiple anchors. Between `Source` and `Anchor` there can be an intermediate step, the "preprocessing" function, which lets you customize the input before features are extracted. A minimal sketch of this relationship is shown below.
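
As a rough illustration - a minimal sketch assuming the standard Feathr Python client API, with a hypothetical source path, column names, and feature expressions - one source can feed multiple anchors, optionally through a preprocessing function:

```python
from pyspark.sql import DataFrame
from feathr import Feature, FeatureAnchor, HdfsSource, FLOAT

# Hypothetical preprocessing step: it receives the source DataFrame and
# returns a modified DataFrame before any features are extracted from it.
def add_fare_with_tip(df: DataFrame) -> DataFrame:
    return df.withColumn("fare_with_tip", df.fare_amount + df.tip_amount)

# One source (path and columns are hypothetical)...
batch_source = HdfsSource(name="nycTaxiBatchSource",
                          path="wasbs://public@azurefeathrstorage.blob.core.windows.net/sample_data/green_tripdata_2020-04.csv",
                          preprocessing=add_fare_with_tip,
                          event_timestamp_column="lpep_dropoff_datetime",
                          timestamp_format="yyyy-MM-dd HH:mm:ss")

# ...consumed by two different anchors.
trip_anchor = FeatureAnchor(name="tripFeatures",
                            source=batch_source,
                            features=[Feature(name="f_trip_distance",
                                              feature_type=FLOAT,
                                              transform="trip_distance")])

fare_anchor = FeatureAnchor(name="fareFeatures",
                            source=batch_source,
                            features=[Feature(name="f_fare_with_tip",
                                              feature_type=FLOAT,
                                              transform="fare_with_tip")])
```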

### Point-in-time joins and aggregations - why do we need them?

Most users are already familiar with "regular" joins such as inner joins and outer joins. In many feature engineering use cases, however, we also care about time.

94 changes: 0 additions & 94 deletions docs/concepts/feature-join.md

This file was deleted.

85 changes: 85 additions & 0 deletions docs/concepts/get-offline-features.md
@@ -0,0 +1,85 @@
---
layout: default
title: Getting Offline Features using Feature Query
parent: Feathr Concepts
---

# Getting Offline Features using Feature Query

## Intuitions

After the feature producers have defined the features (as described in the [Feature Definition](./feature-definition.md) part), the feature consumers may want to consume those features.

For example, consider the dataset below, where there are 3 tables from which feature producers want to extract features: `user_profile_mock_data`, `user_purchase_history_mock_data`, and `product_detail_mock_data`.

Feature consumers will usually use a central dataset ("observation data", `user_observation_mock_data` in this case) which contains a few IDs (`user_id` and `product_id` in this case), timestamps, and other columns. They use this "observation data" to query features from the different feature tables (via the `Feature Query` shown below).

![Feature Flow](https://github.com/linkedin/feathr/blob/main/docs/images/product_recommendation_advanced.jpg?raw=true)

As we can see, the use case for getting offline features with Feathr is straightforward. Feature consumers want to get a few features - for a particular user, what's the gift card balance? What's the total purchase in the last 90 days? Feature consumers can also get features for other entities in the same request; for example, they can query product features such as product quantity and price at the same time.

In this case, Feathr users simply specify the names of the features they want to query and the entity/key to query them for, as shown below. Note that feature consumers don't have to query all the features; they can query just a subset of the features that the feature producers have defined.

```python
user_feature_query = FeatureQuery(
feature_list=["feature_user_age",
"feature_user_tax_rate",
"feature_user_gift_card_balance",
"feature_user_has_valid_credit_card",
"feature_user_total_purchase_in_90days",
"feature_user_purchasing_power"
],
key=user_id)

product_feature_query = FeatureQuery(
feature_list=[
"feature_product_quantity",
"feature_product_price"
],
key=product_id)
```

And specify the location for the observation data:

```python
settings = ObservationSettings(
observation_path="wasbs://public@azurefeathrstorage.blob.core.windows.net/sample_data/product_recommendation_sample/user_observation_mock_data.csv",
event_timestamp_column="event_timestamp",
timestamp_format="yyyy-MM-dd")
```

Finally, pass in the observation settings and feature queries, and trigger the computation:

```python
client.get_offline_features(observation_settings=settings,
feature_query=[user_feature_query, product_feature_query],
output_path=output_path)

```

More details on the above APIs can be found at:

- [ObservationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.ObservationSettings)
- [client.get_offline_feature API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.get_offline_features)

## More on `Observation data`

The observation data serves as the 'spine' of the training dataset that will be created; we call this input 'spine' dataset the 'observation' dataset. Typically, each row of the observation data contains:

1. **Entity ID Column:** Column(s) representing entity id(s), which will be used as the join key to query feature value.

2. **Timestamp Column:** A column representing the event time of the row. By default, Feathr will make sure the feature values queried have a timestamp earlier than the timestamp in observation data, ensuring no data leakage in the resulting training dataset. Refer to [Point in time Joins](./point-in-time-join.md) for more details.

3. **Other columns:** These are simply passed through to the output training dataset and can be treated as immutable columns (see the illustrative sketch after this list).
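
For illustration only - the column names and values below are made up - an observation dataset might combine two entity ID columns, an event timestamp column, and a pass-through label column:

```python
import pandas as pd

# Hypothetical observation data: entity IDs used as join keys, an event
# timestamp, and a pass-through column (a label used later for training).
observation_df = pd.DataFrame({
    "user_id":         [101, 102, 103],
    "product_id":      [7, 7, 9],
    "event_timestamp": ["2022-05-01", "2022-05-02", "2022-05-03"],
    "purchased":       [1, 0, 1],
})
print(observation_df)
```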

## More on `Feature Query`

After you have defined all the features, you may not want to use all of them in a particular program. In that case, instead of putting every feature in the `FeatureQuery`, you can list only the features you need, as in the sketch below. Note that all features in a single `FeatureQuery` must share the same key.
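
For example, a sketch of querying only a subset of the user features defined earlier (assuming the same `user_id` key object used above) could look like this:

```python
from feathr import FeatureQuery

# Query only two of the previously defined user features; all features in a
# single FeatureQuery must share the same key (`user_id` here).
subset_query = FeatureQuery(
    feature_list=["feature_user_age",
                  "feature_user_gift_card_balance"],
    key=user_id)
```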

## Difference between `materialize_features` and `get_offline_features` API

The "getting offline features" described in this document and "[getting materialized features](./materializing-features.md)" are sometimes confused, since both seem to "get features and put them somewhere". However, there are some differences, and you should know when to use which:

1. With the `get_offline_features` API, feature consumers usually have a central `observation data` so they can use a `Feature Query` to query different features for different entities from different tables. With the `materialize_features` API, feature consumers don't have `observation data`, because they don't need to query against existing feature definitions; they only specify, for a specific entity (say `user_id`), which features they want to materialize to an offline or online store. Note that within one feature table in the materialization settings, feature consumers can only materialize features that share the same key.

2. Regarding timestamps: with the `get_offline_features` API, Feathr makes sure the feature values queried have a timestamp earlier than the timestamp in the observation data, ensuring no data leakage in the resulting training dataset. With the `materialize_features` API, Feathr always materializes the latest feature values available in the dataset.
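
To make the contrast concrete, here is a minimal sketch of the `materialize_features` side, assuming the standard Feathr Python client and a previously created `client`; the sink table name and feature names are hypothetical. Note that no observation data is involved:

```python
from feathr import MaterializationSettings, RedisSink

# No observation data here: we only list which features to materialize, and
# they are pushed to the configured online store (Redis in this sketch).
redis_sink = RedisSink(table_name="user_features")

settings = MaterializationSettings(
    name="user_feature_materialization",
    sinks=[redis_sink],
    feature_names=["feature_user_age", "feature_user_gift_card_balance"])

client.materialize_features(settings)
```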
@@ -1,16 +1,16 @@
---
layout: default
title: Feature Generation and Materialization
title: Feature Materialization (also known as feature generation)
parent: Feathr Concepts
---

# Feature Generation and Materialization
# Feature Materialization (also known as feature generation)

Feature generation (also known as feature materialization) is the process to create features from raw source data into a certain persisted storage in either offline store (for further reuse), or online store (for online inference).
Feature materialization (also known as feature generation) is the process of creating features for a certain entity from raw source data and persisting them to storage, either an offline store (for further reuse) or an online store (for online inference).

User can utilize feature generation to pre-compute and materialize pre-defined features to online and/or offline storage. This is desirable when the feature transformation is computation intensive or when the features can be reused (usually in offline setting). Feature generation is also useful in generating embedding features, where those embeddings distill information from large data and is usually more compact.
Users can utilize feature materialization to pre-compute and materialize pre-defined features to online and/or offline storage. This is desirable when the feature transformation is computation intensive or when the features can be reused (usually in an offline setting). Feature materialization is also useful for generating embedding features, where the embeddings distill information from large datasets and are usually more compact. Also, please note that you can only materialize features for a specific entity/key in a single `materialize_features` call.

## Generating Features to Online Store
## Materializing Features to Online Store

When models are served in an online environment, we need to serve the corresponding features in the same online environment as well. Feathr provides APIs to materialize features to online storage for future consumption. For example:

@@ -119,7 +119,7 @@ client.materialize_features(settings, execution_configurations={ "spark.feathr.o
To read those materialized features, Feathr has a convenient helper function called `get_result_df` to help you view the data. For example, you can use the sample code below to read the materialized result from the offline store:

```python

from feathr import get_result_df
path = "abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/20/"
res = get_result_df(client=client, format="parquet", res_url=path)
```
25 changes: 24 additions & 1 deletion docs/quickstart_databricks.md
@@ -11,9 +11,32 @@ For Databricks, you can simply upload [this notebook](./samples/databricks/datab

![Import Notebooks](./images/databricks_quickstart1.png)


2. Paste the [link to Databricks getting started notebook](./samples/databricks/databricks_quickstart_nyc_taxi_driver.ipynb):

![Import Notebooks](./images/databricks_quickstart2.png)

3. Run the whole notebook. It will automatically install Feathr in your cluster and run the feature ingestion jobs.

# Authoring Feathr jobs in a local environment and submitting to a remote Databricks cluster

Not everyone wants to use a Databricks notebook as their main development environment, and the part above is mainly for quick-start purposes. For more serious development, we usually recommend Visual Studio Code, which has [native support for Python and Jupyter Notebooks](https://code.visualstudio.com/docs/datascience/jupyter-notebooks) with many great features such as syntax highlighting and IntelliSense.

In [this notebook](./samples/databricks/databricks_quickstart_nyc_taxi_driver.ipynb), there are a few lines of code like this:

```python
# Get current databricks notebook context
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
host_name = ctx.tags().get("browserHostName").get()
host_token = ctx.apiToken().get()
cluster_id = ctx.tags().get("clusterId").get()
```

This is the only part you need to change to author the Feathr job in a local environment and submit it to a remote Databricks cluster. When running this code in Databricks, Feathr automatically reads the current cluster's host name and authentication token using the lines above, but that is not possible when authoring the job locally. In that case, you need to change those lines to the following:

```python
# Authoring Feathr jobs in local environment and submit to remote Databricks cluster
host_name = 'https://adb-6885802458123232.12.azuredatabricks.net/'
host_token = 'dapi11111111111111111111'
```

And that's it! Feathr will automatically submit the job to the cluster you specified.
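
As a rough sketch of how these values are typically wired into the client when authoring locally (this assumes the usual Feathr configuration flow, where the Databricks access token is read from the `DATABRICKS_WORKSPACE_TOKEN_VALUE` environment variable and the workspace URL and cluster settings come from `feathr_config.yaml`; the exact keys may differ in your setup):

```python
import os
from feathr import FeathrClient

# Assumption: Feathr reads the Databricks access token from this environment
# variable; the workspace URL and cluster settings live in feathr_config.yaml.
os.environ["DATABRICKS_WORKSPACE_TOKEN_VALUE"] = "dapi11111111111111111111"

client = FeathrClient(config_path="./feathr_config.yaml")
# Subsequent calls such as client.get_offline_features(...) will submit Spark
# jobs to the remote Databricks cluster configured above.
```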
2 changes: 1 addition & 1 deletion feathr_project/feathr/definition/transformation.py
@@ -43,7 +43,7 @@ class WindowAggTransformation(Transformation):
agg_func: aggregation function. Available values: `SUM`, `COUNT`, `MAX`, `MIN`, `AVG`, `MAX_POOLING`, `MIN_POOLING`, `AVG_POOLING`, `LATEST`
window: Time window length to apply the aggregation. Supports 4 types of units: d(day), h(hour), m(minute), s(second). Example values are "7d", "5h", "3m", or "1s"
group_by: Feathr expressions applied after the `agg_expr` transformation as groupby field, before aggregation, same as 'group by' in SQL
filter: Feathr expression applied to each row as a filter before aggregation
filter: Feathr expression applied to each row as a filter before aggregation. This should be a string containing a valid Spark SQL expression, for example: filter = 'age > 3'. This is similar to the PySpark filter operation; more details can be found here: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.filter.html
"""
def __init__(self, agg_expr: str, agg_func: str, window: str, group_by: Optional[str] = None, filter: Optional[str] = None, limit: Optional[int] = None) -> None:
super().__init__()
4 changes: 3 additions & 1 deletion feathr_project/test/test_fixture.py
@@ -68,7 +68,9 @@ def basic_test_setup(config_path: str):
feature_type=FLOAT,
transform=WindowAggTransformation(agg_expr="cast_float(fare_amount)",
agg_func="AVG",
window="90d")),
window="90d",
filter="fare_amount > 0",
)),
Feature(name="f_location_max_fare",
key=location_id,
feature_type=FLOAT,
Expand Down