feat: Contrib azure provider with synapse/mssql offline store and Azure registry store #3072

Merged
merged 51 commits into from
Aug 19, 2022
Changes from 46 commits
e1e210d
Broken state
kevjumba Aug 5, 2022
011d1e0
working state
kevjumba Aug 10, 2022
a6a2fce
Fix the lint issues
kevjumba Aug 10, 2022
57b63bb
Semi working state
kevjumba Aug 10, 2022
ae7ed8a
Fix
kevjumba Aug 10, 2022
421645b
Fremove print
kevjumba Aug 10, 2022
07fece5
Fix lint
kevjumba Aug 11, 2022
4062031
Run build-sphinx
kevjumba Aug 11, 2022
cb39329
Add tutorials
kevjumba Aug 11, 2022
554ca1a
Fix
kevjumba Aug 11, 2022
4a969e7
Fix?
kevjumba Aug 11, 2022
116320a
Fix lint
kevjumba Aug 11, 2022
c0b16ef
Fix
kevjumba Aug 11, 2022
44d09d0
Fix lint
kevjumba Aug 12, 2022
b6f0a79
Begin configuring tests
adchia Aug 15, 2022
2b2ff40
Fix
kevjumba Aug 15, 2022
4616366
Working version
kevjumba Aug 16, 2022
c7d9852
Fix
kevjumba Aug 17, 2022
d2e290b
Fix
kevjumba Aug 17, 2022
a726a9a
Fix
kevjumba Aug 17, 2022
32992e3
Fix lint
kevjumba Aug 17, 2022
ebb934b
Fix lint
kevjumba Aug 17, 2022
e456acb
Fix
kevjumba Aug 17, 2022
45f479f
Fix lint
kevjumba Aug 17, 2022
4b8c4a2
Fix
kevjumba Aug 17, 2022
b1bf602
Fix
kevjumba Aug 17, 2022
4586f00
Fix azure
kevjumba Aug 17, 2022
3b88c0b
Fix
kevjumba Aug 17, 2022
9ae8ee3
Fix
kevjumba Aug 17, 2022
1b12e4a
Fix lint and address issues
kevjumba Aug 18, 2022
0ca5048
Fix integration tests
kevjumba Aug 18, 2022
883f314
Fix
kevjumba Aug 18, 2022
ccf8716
Fix lint and address issues
kevjumba Aug 18, 2022
f05288e
Fix
kevjumba Aug 18, 2022
ee30e73
Fix
kevjumba Aug 18, 2022
ab17db9
Fix
kevjumba Aug 18, 2022
be162f5
Revert
kevjumba Aug 18, 2022
f5aa476
Fix
kevjumba Aug 18, 2022
4423dfa
Fix
kevjumba Aug 18, 2022
5806507
Fix
kevjumba Aug 18, 2022
7a4d055
Fix lint
kevjumba Aug 19, 2022
78b74b1
Fix
kevjumba Aug 19, 2022
a9e8119
Fix lint
kevjumba Aug 19, 2022
1341e3e
Fix pyarrow
kevjumba Aug 19, 2022
3d42093
Fix lint
kevjumba Aug 19, 2022
1c591f0
add requirements files
adchia Aug 19, 2022
b4da607
fix name of docs
adchia Aug 19, 2022
c3a0423
fix offline store readme
adchia Aug 19, 2022
576b57e
fix offline store readme
adchia Aug 19, 2022
69940ac
fix
adchia Aug 19, 2022
516ff76
fix
adchia Aug 19, 2022
35 changes: 29 additions & 6 deletions Makefile
Original file line number Diff line number Diff line change
@@ -81,7 +81,8 @@ test-python-integration-local:
 python -m pytest -n 8 --integration \
 -k "not gcs_registry and \
 not s3_registry and \
-not test_lambda_materialization" \
+not test_lambda_materialization and \
+not test_snowflake" \
 sdk/python/tests \
 ) || echo "This script uses Docker, and it isn't running - please start the Docker Daemon and try again!";

@@ -113,7 +114,8 @@ test-python-universal-spark:
 not test_push_features_to_offline_store.py and \
 not gcs_registry and \
 not s3_registry and \
-not test_universal_types" \
+not test_universal_types and \
+not test_snowflake" \
 sdk/python/tests

 test-python-universal-trino:
@@ -136,9 +138,27 @@ test-python-universal-trino:
 not test_push_features_to_offline_store.py and \
 not gcs_registry and \
 not s3_registry and \
-not test_universal_types" \
+not test_universal_types and \
+not test_snowflake" \
 sdk/python/tests

+
+# Note: to use this, you'll need to have Microsoft ODBC 17 installed.
+# See https://docs.microsoft.com/en-us/sql/connect/odbc/linux-mac/install-microsoft-odbc-driver-sql-server-macos?view=sql-server-ver15#17
+test-python-universal-mssql:
+PYTHONPATH='.' \
+FULL_REPO_CONFIGS_MODULE=sdk.python.feast.infra.offline_stores.contrib.mssql_repo_configuration \
+PYTEST_PLUGINS=feast.infra.offline_stores.contrib.mssql_offline_store.tests \
+FEAST_USAGE=False IS_TEST=True \
+FEAST_LOCAL_ONLINE_CONTAINER=True \
+python -m pytest -n 8 --integration \
+-k "not gcs_registry and \
+not s3_registry and \
+not test_lambda_materialization and \
+not test_snowflake" \
+sdk/python/tests
+
+
 #To use Athena as an offline store, you need to create an Athena database and an S3 bucket on AWS. https://docs.aws.amazon.com/athena/latest/ug/getting-started.html
 #Modify environment variables ATHENA_DATA_SOURCE, ATHENA_DATABASE, ATHENA_S3_BUCKET_NAME if you want to change the data source, database, and bucket name of S3 to use.
 #If tests fail with the pytest -n 8 option, change the number to 1.
@@ -161,7 +181,8 @@ test-python-universal-athena:
 not test_historical_features_persisting and \
 not test_historical_retrieval_fails_on_validation and \
 not gcs_registry and \
-not s3_registry" \
+not s3_registry and \
+not test_snowflake" \
 sdk/python/tests

 test-python-universal-postgres-offline:
@@ -203,7 +224,8 @@ test-python-universal-postgres-online:
 not test_push_features_to_offline_store and \
 not gcs_registry and \
 not s3_registry and \
-not test_universal_types" \
+not test_universal_types and \
+not test_snowflake" \
 sdk/python/tests

 test-python-universal-cassandra:
@@ -230,7 +252,8 @@ test-python-universal-cassandra-no-cloud-providers:
 not test_apply_data_source_integration and \
 not test_nullable_online_store and \
 not gcs_registry and \
-not s3_registry" \
+not s3_registry and \
+not test_snowflake" \
 sdk/python/tests

 test-python-universal:
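
The `-k` expressions in the Makefile targets above deselect any test whose name contains one of the listed keywords. As a rough illustration only (this is not pytest's implementation; `is_selected`, `EXCLUDED`, and the sample test IDs are hypothetical names made up for this sketch), the filter in the new `test-python-universal-mssql` target behaves like a conjunction of negated substring matches:

```python
# Keywords excluded by the `-k` expression in the mssql target.
EXCLUDED = [
    "gcs_registry",
    "s3_registry",
    "test_lambda_materialization",
    "test_snowflake",
]


def is_selected(test_id, excluded=EXCLUDED):
    """Return True when no excluded keyword occurs in the test id."""
    lowered = test_id.lower()
    return not any(keyword in lowered for keyword in excluded)


# Hypothetical test ids, used only to exercise the filter.
sample_ids = [
    "test_historical_features[mssql]",
    "test_snowflake_materialization",
    "test_registry[gcs_registry]",
]
selected = [test_id for test_id in sample_ids if is_selected(test_id)]
print(selected)
```

Under these assumptions, only the first sample ID survives the filter, which mirrors how adding `not test_snowflake` keeps Snowflake-specific tests out of the other universal targets.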
4 changes: 2 additions & 2 deletions README.md
@@ -152,7 +152,7 @@ The list below contains the functionality that contributors are planning to deve
 * [x] [Redshift source](https://docs.feast.dev/reference/data-sources/redshift)
 * [x] [BigQuery source](https://docs.feast.dev/reference/data-sources/bigquery)
 * [x] [Parquet file source](https://docs.feast.dev/reference/data-sources/file)
-* [x] [Synapse source (community plugin)](https://github.com/Azure/feast-azure)
+* [x] [Azure Synapse + Azure SQL source (contrib plugin)](https://docs.feast.dev/reference/data-sources/mssql)
 * [x] [Hive (community plugin)](https://github.com/baineng/feast-hive)
 * [x] [Postgres (contrib plugin)](https://docs.feast.dev/reference/data-sources/postgres)
 * [x] [Spark (contrib plugin)](https://docs.feast.dev/reference/data-sources/spark)
@@ -161,7 +161,7 @@ The list below contains the functionality that contributors are planning to deve
 * [x] [Snowflake](https://docs.feast.dev/reference/offline-stores/snowflake)
 * [x] [Redshift](https://docs.feast.dev/reference/offline-stores/redshift)
 * [x] [BigQuery](https://docs.feast.dev/reference/offline-stores/bigquery)
-* [x] [Synapse (community plugin)](https://github.com/Azure/feast-azure)
+* [x] [Azure Synapse + Azure SQL (contrib plugin)](https://docs.feast.dev/reference/offline-stores/mssql.md)
 * [x] [Hive (community plugin)](https://github.com/baineng/feast-hive)
 * [x] [Postgres (contrib plugin)](https://docs.feast.dev/reference/offline-stores/postgres)
 * [x] [Trino (contrib plugin)](https://github.com/Shopify/feast-trino)
3 changes: 3 additions & 0 deletions docs/SUMMARY.md
@@ -71,6 +71,7 @@
 * [Spark (contrib)](reference/data-sources/spark.md)
 * [PostgreSQL (contrib)](reference/data-sources/postgres.md)
 * [Trino (contrib)](reference/data-sources/trino.md)
+* [Azure Synapse + Azure SQL (contrib)](reference/data-sources/mssql.md)
 * [Offline stores](reference/offline-stores/README.md)
 * [Overview](reference/offline-stores/overview.md)
 * [File](reference/offline-stores/file.md)
@@ -80,6 +81,7 @@
 * [Spark (contrib)](reference/offline-stores/spark.md)
 * [PostgreSQL (contrib)](reference/offline-stores/postgres.md)
 * [Trino (contrib)](reference/offline-stores/trino.md)
+* [Azure Synapse + Azure SQL (contrib)](reference/offline-stores/mssql.md)
 * [Online stores](reference/online-stores/README.md)
 * [SQLite](reference/online-stores/sqlite.md)
 * [Snowflake](reference/online-stores/snowflake.md)
@@ -91,6 +93,7 @@
 * [Local](reference/providers/local.md)
 * [Google Cloud Platform](reference/providers/google-cloud-platform.md)
 * [Amazon Web Services](reference/providers/amazon-web-services.md)
+* [Azure](reference/providers/azure.md)
 * [Feature repository](reference/feature-repository/README.md)
 * [feature\_store.yaml](reference/feature-repository/feature-store-yaml.md)
 * [.feastignore](reference/feature-repository/feast-ignore.md)
22 changes: 11 additions & 11 deletions docs/getting-started/concepts/registry.md
@@ -1,15 +1,15 @@
 # Registry

-Feast uses a registry to store all applied Feast objects (e.g. Feature views, entities, etc). The registry exposes
+Feast uses a registry to store all applied Feast objects (e.g. Feature views, entities, etc). The registry exposes
 methods to apply, list, retrieve and delete these objects, and is an abstraction with multiple implementations.

 ### Options for registry implementations

 #### File-based registry
-By default, Feast uses a file-based registry implementation, which stores the protobuf representation of the registry as
-a serialized file. This registry file can be stored in a local file system, or in cloud storage (in, say, S3 or GCS).
+By default, Feast uses a file-based registry implementation, which stores the protobuf representation of the registry as
+a serialized file. This registry file can be stored in a local file system, or in cloud storage (in, say, S3 or GCS, or Azure).

-The quickstart guides that use `feast init` will use a registry on a local file system. To allow Feast to configure
+The quickstart guides that use `feast init` will use a registry on a local file system. To allow Feast to configure
 a remote file registry, you need to create a GCS / S3 bucket that Feast can understand:
 {% tabs %}
 {% tab title="Example S3 file registry" %}
@@ -35,9 +35,9 @@ offline_store:
 {% endtab %}
 {% endtabs %}

-However, there are inherent limitations with a file-based registry, since changing a single field in the registry
-requires re-writing the whole registry file. With multiple concurrent writers, this presents a risk of data loss, or
-bottlenecks writes to the registry since all changes have to be serialized (e.g. when running materialization for
+However, there are inherent limitations with a file-based registry, since changing a single field in the registry
+requires re-writing the whole registry file. With multiple concurrent writers, this presents a risk of data loss, or
+bottlenecks writes to the registry since all changes have to be serialized (e.g. when running materialization for
 multiple feature views or time ranges concurrently).
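
The data-loss risk described in the paragraph above can be reproduced with a small sketch. This is an illustration under stated assumptions, not Feast's registry code: a JSON file stands in for the serialized protobuf, and `read_registry` and `write_registry` are hypothetical helpers. Two writers read the same snapshot, each appends one entry, and the second whole-file rewrite silently discards the first writer's change:

```python
import json
import os
import tempfile


def read_registry(path):
    """Load the whole registry file (JSON stands in for protobuf here)."""
    with open(path) as f:
        return json.load(f)


def write_registry(path, registry):
    """Rewrite the whole file, even if only one field changed."""
    with open(path, "w") as f:
        json.dump(registry, f)


path = os.path.join(tempfile.mkdtemp(), "registry.json")
write_registry(path, {"feature_views": []})

# Two writers read the same snapshot before either one writes.
snapshot_a = read_registry(path)
snapshot_b = read_registry(path)

snapshot_a["feature_views"].append("driver_stats")
write_registry(path, snapshot_a)

snapshot_b["feature_views"].append("customer_stats")
write_registry(path, snapshot_b)  # overwrites writer A's change

final = read_registry(path)
print(final)  # {'feature_views': ['customer_stats']}
```

Avoiding the lost update requires serializing writers, which is the bottleneck the paragraph mentions.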

 #### SQL Registry
@@ -47,14 +47,14 @@ This supports any SQLAlchemy compatible database as a backend. The exact schema

 ### Updating the registry

-We recommend users store their Feast feature definitions in a version controlled repository, which then via CI/CD
-automatically stays synced with the registry. Users will often also want multiple registries to correspond to
-different environments (e.g. dev vs staging vs prod), with staging and production registries with locked down write
+We recommend users store their Feast feature definitions in a version controlled repository, which then via CI/CD
+automatically stays synced with the registry. Users will often also want multiple registries to correspond to
+different environments (e.g. dev vs staging vs prod), with staging and production registries with locked down write
 access since they can impact real user traffic. See [Running Feast in Production](../../how-to-guides/running-feast-in-production.md#1.-automatically-deploying-changes-to-your-feature-definitions) for details on how to set this up.

 ### Accessing the registry from clients

-Users can specify the registry through a `feature_store.yaml` config file, or programmatically. We often see teams
+Users can specify the registry through a `feature_store.yaml` config file, or programmatically. We often see teams
 preferring the programmatic approach because it makes notebook driven development very easy:

 #### Option 1: programmatically specifying the registry
3 changes: 2 additions & 1 deletion docs/how-to-guides/adding-or-reusing-tests.md
@@ -241,7 +241,8 @@ def test_historical_features(environment, universal_data_sources, full_feature_n
 validate_dataframes(
 expected_df,
 table_from_df_entities,
-keys=[event_timestamp, "order_id", "driver_id", "customer_id"],
+sort_by=[event_timestamp, "order_id", "driver_id", "customer_id"],
+event_timestamp = event_timestamp,
 )
 # ... more test code
 ```
8 changes: 6 additions & 2 deletions docs/reference/data-sources/README.md
@@ -35,9 +35,13 @@ Please see [Data Source](../../getting-started/concepts/data-ingestion.md) for a
 {% endcontent-ref %}

 {% content-ref url="postgres.md" %}
-[postgres.md]([postgres].md)
+[postgres.md](postgres.md)
 {% endcontent-ref %}

 {% content-ref url="trino.md" %}
-[trino.md]([trino].md)
+[trino.md](trino.md)
 {% endcontent-ref %}
+
+{% content-ref url="mssql.md" %}
+[mssql.md](mssql.md)
+{% endcontent-ref %}
29 changes: 29 additions & 0 deletions docs/reference/data-sources/mssql.md
@@ -0,0 +1,29 @@
# MsSQL source (contrib)

## Description

MsSQL data sources are Microsoft SQL table sources.
These can be specified either by a table reference or a SQL query.

## Disclaimer

The MsSQL data source does not achieve full test coverage.
Please do not assume complete stability.

## Examples

Defining a MsSQL source:

```python
from feast.infra.offline_stores.contrib.mssql_offline_store.mssqlserver_source import (
    MsSqlServerSource,
)

driver_hourly_table = "driver_hourly"

driver_source = MsSqlServerSource(
    table_ref=driver_hourly_table,
    event_timestamp_column="datetime",
    created_timestamp_column="created",
)
```
59 changes: 59 additions & 0 deletions docs/reference/offline-stores/mssql.md
@@ -0,0 +1,59 @@
# MsSQL/Synapse offline store (contrib)

## Description

The MsSQL offline store provides support for reading [MsSQL Sources](../data-sources/mssql.md). Specifically, it is developed to read from [Synapse SQL](https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/overview-features) on Microsoft Azure.

* Entity dataframes can be provided as a SQL query or can be provided as a Pandas dataframe.

## Disclaimer

The MsSQL offline store does not achieve full test coverage.
Please do not assume complete stability.

## Example

{% code title="feature_store.yaml" %}
```yaml
registry:
    registry_store_type: AzureRegistryStore
    path: ${REGISTRY_PATH} # Environment Variable
project: production
provider: azure
online_store:
    type: redis
    connection_string: ${REDIS_CONN} # Environment Variable
offline_store:
    type: mssql
    connection_string: ${SQL_CONN} # Environment Variable
```
{% endcode %}
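
The `${...}` placeholders in the example above are meant to be filled from environment variables. A minimal sketch of that kind of substitution, using only the standard library's `string.Template`, is shown below. It illustrates the idea and is not a description of how Feast loads its config; `expand_env` and the values in `example_env` are hypothetical placeholders introduced for this sketch.

```python
import os
from string import Template

raw_config = """\
registry:
    registry_store_type: AzureRegistryStore
    path: ${REGISTRY_PATH}
offline_store:
    type: mssql
    connection_string: ${SQL_CONN}
"""


def expand_env(text, env=None):
    """Replace ${NAME} placeholders with values from an environment mapping."""
    env = os.environ if env is None else env
    # substitute() raises KeyError when a placeholder has no value, which
    # surfaces a missing variable early instead of passing it through.
    return Template(text).substitute(env)


# Placeholder values for the sketch only; real values come from the shell.
example_env = {
    "REGISTRY_PATH": "https://example.invalid/registry.db",
    "SQL_CONN": "mssql+pyodbc://user:password@host/db",
}
expanded = expand_env(raw_config, example_env)
print(expanded)
```

Keeping connection strings in environment variables, as the example config does, keeps credentials out of the committed `feature_store.yaml`.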

## Functionality Matrix

The set of functionality supported by offline stores is described in detail [here](overview.md#functionality).
Below is a matrix indicating which functionality is supported by the Spark offline store.

Collaborator: Not spark?

Collaborator Author: fixed

| | MsSql |
| :-------------------------------- | :-- |
| `get_historical_features` (point-in-time correct join) | yes |
| `pull_latest_from_table_or_query` (retrieve latest feature values) | yes |
| `pull_all_from_table_or_query` (retrieve a saved dataset) | yes |
| `offline_write_batch` (persist dataframes to offline store) | no |
| `write_logged_features` (persist logged features to offline store) | no |

Below is a matrix indicating which functionality is supported by `MsSqlServerRetrievalJob`.

| | MsSql |
| --------------------------------- | --- |
| export to dataframe | yes |
| export to arrow table | yes |
| export to arrow batches | no |
| export to SQL | no |
| export to data lake (S3, GCS, etc.) | no |
| export to data warehouse | no |
| local execution of Python-based on-demand transforms | no |
| remote execution of Python-based on-demand transforms | no |
| persist results in the offline store | yes |

To compare this set of functionality against other offline stores, please see the full [functionality matrix](overview.md#functionality-matrix).
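
The "point-in-time correct join" listed for `get_historical_features` means that each entity row is matched with the latest feature value observed at or before that row's timestamp, never a later one. The sketch below shows the idea for a single entity with the standard library only. It is not the SQL the MsSQL offline store generates; `point_in_time_lookup` and the sample rows are made up for illustration, and it ignores details a real offline store handles, such as TTL windows and multiple entities.

```python
from bisect import bisect_right
from datetime import datetime

# Feature rows for one entity, sorted by event timestamp (made-up data).
feature_rows = [
    (datetime(2022, 8, 1, 10), 0.50),
    (datetime(2022, 8, 1, 11), 0.60),
    (datetime(2022, 8, 1, 12), 0.70),
]


def point_in_time_lookup(rows, as_of):
    """Return the latest value whose event timestamp is <= as_of, or None."""
    timestamps = [ts for ts, _ in rows]
    idx = bisect_right(timestamps, as_of)
    return rows[idx - 1][1] if idx else None


# An entity row stamped 11:30 sees the 11:00 value, not the 12:00 one.
value_at_1130 = point_in_time_lookup(feature_rows, datetime(2022, 8, 1, 11, 30))
# An entity row stamped before any feature row gets no value.
value_at_0900 = point_in_time_lookup(feature_rows, datetime(2022, 8, 1, 9))
print(value_at_1130, value_at_0900)  # 0.6 None
```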
2 changes: 2 additions & 0 deletions docs/reference/providers/README.md
@@ -7,3 +7,5 @@ Please see [Provider](../../getting-started/architecture-and-components/provider
 {% page-ref page="google-cloud-platform.md" %}

 {% page-ref page="amazon-web-services.md" %}
+
+{% page-ref page="azure.md" %}
26 changes: 26 additions & 0 deletions docs/reference/providers/azure.md
@@ -0,0 +1,26 @@
# Azure

## Description

* Offline Store: Uses the **MsSql** offline store by default. Also supports File as the offline store.
* Online Store: Uses the **Redis** online store by default. Also supports Sqlite as an online store.

## Disclaimer

The Azure provider does not achieve full test coverage.
Please do not assume complete stability.

## Example

{% code title="feature_store.yaml" %}
```yaml
registry:
    registry_store_type: AzureRegistryStore
    path: ${REGISTRY_PATH} # Environment Variable
project: production
provider: azure
online_store:
    type: redis
    connection_string: ${REDIS_CONN} # Environment Variable
```
{% endcode %}
4 changes: 2 additions & 2 deletions docs/roadmap.md
@@ -10,7 +10,7 @@ The list below contains the functionality that contributors are planning to deve
 * [x] [Redshift source](https://docs.feast.dev/reference/data-sources/redshift)
 * [x] [BigQuery source](https://docs.feast.dev/reference/data-sources/bigquery)
 * [x] [Parquet file source](https://docs.feast.dev/reference/data-sources/file)
-* [x] [Synapse source (community plugin)](https://github.com/Azure/feast-azure)
+* [x] [Azure Synapse + Azure SQL source (contrib plugin)](https://docs.feast.dev/reference/data-sources/mssql)
 * [x] [Hive (community plugin)](https://github.com/baineng/feast-hive)
 * [x] [Postgres (contrib plugin)](https://docs.feast.dev/reference/data-sources/postgres)
 * [x] [Spark (contrib plugin)](https://docs.feast.dev/reference/data-sources/spark)
@@ -19,7 +19,7 @@ The list below contains the functionality that contributors are planning to deve
 * [x] [Snowflake](https://docs.feast.dev/reference/offline-stores/snowflake)
 * [x] [Redshift](https://docs.feast.dev/reference/offline-stores/redshift)
 * [x] [BigQuery](https://docs.feast.dev/reference/offline-stores/bigquery)
-* [x] [Synapse (community plugin)](https://github.com/Azure/feast-azure)
+* [x] [Azure Synapse + Azure SQL (contrib plugin)](https://docs.feast.dev/reference/offline-stores/mssql.md)
 * [x] [Hive (community plugin)](https://github.com/baineng/feast-hive)
 * [x] [Postgres (contrib plugin)](https://docs.feast.dev/reference/offline-stores/postgres)
 * [x] [Trino (contrib plugin)](https://github.com/Shopify/feast-trino)