---
title: Changelog
metaTitle: Chalk Product Changelog
metaDescription: Chalk's latest updates.
description: Updates to Chalk!
---

November 11, 2024

Idempotency in triggered resolver runs

We now provide an idempotency key parameter for triggering resolver runs, so you can ensure that only one job is kicked off per idempotency key.
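As a sketch (the idempotency_key parameter name and the resolver FQN below are illustrative assumptions; see the trigger docs for the exact signature):

from chalk.client import ChalkClient

client = ChalkClient()

# Repeated calls with the same key will only kick off a single run.
# The parameter name `idempotency_key` is an assumption here.
client.trigger_resolver_run(
    resolver_fqn="src.resolvers.ingest_users",
    idempotency_key="ingest-users-2024-11-11",
)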

ChalkClient.check() function for easy integration testing

The ChalkClient now has a check function that enables you to run a query and check whether the query outputs match your expected outputs. This function should be used with pytest for integration testing. To read more about different methods and best practices for integration testing, see our integration test docs.
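As a sketch of what such a test might look like (the parameter names passed to check below, input and assertions, are assumptions; see the integration test docs for the exact signature):

from chalk.client import ChalkClient

client = ChalkClient()

def test_email_domain_feature():
    # Hypothetical feature names; `assertions` maps outputs to expected values.
    client.check(
        input={"user.id": "u_123"},
        assertions={"user.email_domain": "chalk.ai"},
    )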

Underscore expressions support more mathematical and logical functions

This week, we've added mathematical functions floor, ceil, and abs to chalk.functions, along with the logical functions when, then, otherwise, and is_null. We've also added the haversine function for computing the Haversine distance between two points on Earth given their latitude and longitude. These points can be used in underscore expressions to define features with code that can be statically compiled in C++ for faster execution. See the full list of functions you can use in underscore expressions in our API docs.
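For example, a sketch combining a few of these functions in a feature class (the feature names, the haversine argument order, and the when/then/otherwise chaining style shown here are assumptions):

import chalk.functions as F
from chalk.features import _, features

@features
class Trip:
    id: str
    start_lat: float
    start_lon: float
    end_lat: float
    end_lon: float
    fare: float
    # Haversine distance between pickup and dropoff
    distance: float = F.haversine(_.start_lat, _.start_lon, _.end_lat, _.end_lon)
    fare_floor: float = F.floor(_.fare)
    fare_bucket: str = F.when(_.fare >= 50).then("high").otherwise("standard")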

Dashboard improvements for providing more insights into resolver performance and execution

In the dashboard, users can now view the P50, P75, P95, and P99 latencies for resolvers in the table under the Resolver tab of the menu. You can also customize which columns are displayed in the table by clicking the gear icon in the top left corner of the table.

In addition, we've added a SQL Explorer for examining resolver output for queries that are run with the store_plan_stages=True parameter.

chalk healthcheck in CLI

You can now use the chalk healthcheck command in the CLI to view information on the health of Chalk's API server and its services. The healthcheck provides information for the API server based on the active environment and project. To read more about the healthcheck command, see the CLI documentation.


November 4, 2024

Offline Query Specifications for shards and workers

When running an asynchronous offline query, you can now specify num_shards and num_workers as parameters for more granular control over the parallelization of your query execution. To see all of the offline query options, check out the offline query documentation.
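For example, a sketch of sharding an asynchronous offline query (the shard and worker counts are illustrative):

from chalk.client import ChalkClient

client = ChalkClient()
ds = client.offline_query(
    input={"user.id": list(range(100_000))},
    output=["user.num_interactions_l7d"],
    run_asynchronously=True,
    num_shards=8,   # split the input into 8 shards
    num_workers=4,  # run up to 4 shards concurrently
)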

In addition, offline query progress reporting now reports progress by shard, giving developers more insight into where their offline query is in its execution.

ChalkClient can now use the default Git branch

You can now default to using the name of your current Git branch when developing using the ChalkClient. For example, if you have checked out a branch named my-very-own-branch you can now set ChalkClient(branch=True) and all of your client calls will be directed at my-very-own-branch. To read more about how to use ChalkClient, see our API documentation.
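For example:

from chalk.client import ChalkClient

# With `my-very-own-branch` checked out locally, this client
# directs all calls at that branch.
client = ChalkClient(branch=True)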

Underscore expressions support functions for URL parsing, regular expressions, and more

We've added more functions to chalk.functions that can be used in underscore expressions. You can now use regexp_like, regexp_extract, split_part, and regexp_extract_all to do regular expression matching and use url_extract_host, url_extract_path, and url_extract_protocol to parse URLs. In addition, we've added helpful logical functions like if_then_else, map_dict, and cast to broaden the range of features that you can define using underscore expressions. To read more about all of our functions, check out our API documentation.
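Here's a short sketch using a few of these (the feature names and the argument order of regexp_like are assumptions):

import chalk.functions as F
from chalk.features import _, features

@features
class PageView:
    id: str
    url: str
    host: str = F.url_extract_host(_.url)
    path: str = F.url_extract_path(_.url)
    protocol: str = F.url_extract_protocol(_.url)
    is_checkout_page: bool = F.regexp_like(_.url, ".*/checkout.*")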

Deployment build logs for AWS environments

We now provide more detailed build logs for deployments in AWS environments in the dashboard!


October 28, 2024

Run predictions against SageMaker from Chalk, and do so much more in underscore expressions

We've added a new function chalk.functions.sagemaker_predict that allows you to run predictions against a SageMaker endpoint to resolve features. Read more about how to define a SageMaker endpoint, encode your input data, and run predictions in our SageMaker tutorial.

In addition to being able to make SageMaker calls, underscore expressions now support a variety of new functions. With these functions imported from chalk.functions, you can perform encoding, decoding, math, datetime manipulation, string manipulation, and more! For example, say you have a Transaction feature class where you make a SageMaker call to enrich the transaction data with a label, and then parse that label into other features. You can now define all of these enrichment-related features using underscore expressions and Chalk functions in the feature definition:

from datetime import date
import chalk.functions as F
from chalk.features import _, features, Primary

@features
class Transaction:
    id: Primary[str]
    amount: float
    date: date
    day: int = F.day_of_year(_.date)
    month: int = F.month_of_year(_.date)

    sagemaker_input_data: bytes = F.string_to_bytes(_.id, encoding="utf-8")
    transaction_enrichment_label: bytes = F.sagemaker_predict(
        _.sagemaker_input_data,
        endpoint="transaction-enrichment-model_2.0.2024_10_28",
        target_model="model_v2.tar.gz",
        target_variant="production_variant_3"
    )
    transaction_enrichment_label_str: str = F.bytes_to_string(_.transaction_enrichment_label, encoding="utf-8")
    is_rent: bool = F.like(_.transaction_enrichment_label_str, "%rent%")
    is_purchase: bool = F.like(_.transaction_enrichment_label_str, "%purchase%")

Nested materialized windowed aggregation references!

You can now reference other windowed aggregations in your windowed aggregation expressions. To read more about how to define your windowed aggregations, see our example here.

Updated usage dashboard to view CPU and storage requests grouped by pod and namespace

We've updated the Usage Dashboard with a new view under the Pod Resources tab that allows you to view CPU and storage requests by pod as grouped by cluster, environment, namespace, and service! If you have any questions about the usage dashboard, please reach out to the Chalk team.

Dropping support for Python 3.8

As of chalkpy==2.55.0, Chalk is dropping support for Python 3.8, which has reached end-of-life. If you are still using Python 3.8, please upgrade to Python 3.9 or higher.

October 21, 2024

Pub/Sub streaming source

We've enabled support for using Pub/Sub as a streaming source. Read more about how to use Pub/Sub as a streaming source here.

Online/Offline Storage for Offline Queries

You can automatically load offline query outputs to the online and offline store using the boolean parameters store_online and store_offline. Below is an example of how to use these parameters.

from chalk.client import ChalkClient

client = ChalkClient()
ds = client.offline_query(
    input={"user.id": [1, 2, 3, 4, 5]},
    output=["user.num_interactions_l7d", "user.num_interactions_l30d", "user.num_interactions_l90d"],
    store_online=True,
    store_offline=True
)

SQL explorer for query outputs in the dashboard

Customers running gRPC servers can now run SQL queries on the dataset outputs of online and offline queries in the dashboard. To enable this feature for your deployment, please reach out to the team.

SQL Explorer for Query Outputs

Color updates in the dashboard

We've updated our color scheme in the dashboard to more clearly differentiate between successes and failures in metrics graphs!

Red for failures and green for successes

October 14, 2024

SQL explorer in dashboard for datasets

Customers can now run SQL queries on dataset outputs in the dashboard. To use this feature, navigate to the Datasets page in the menu, select a dataset, and click on the Output Explorer tab.

SQL Explorer for Datasets

Optionally evict nulls from your DynamoDB online store

Last week we enabled the option to decide whether to persist null values for features in Redis lightning online stores, and this week we have enabled this feature in DynamoDB online stores. By default, null values are persisted in the online store for features defined as Optional, but you can set cache_nulls=False in the feature method to evict null values. Read more about how to use the cache_nulls parameter here.

Set environment variables and more in the Advanced section of the Cloud Resource Configurations page in the dashboard

You can set cloud resource configurations for your environment by navigating to Settings > Resources in the dashboard. In addition to specifying resource configurations for resource groups like instance counts and CPU, you can now also set environment variables and other settings like Kubernetes Node Selectors. The Kubernetes Node Selector enables you to specify the machine family you would like to use for your deployment. For example, this would map to EC2 Instance Types for AWS deployments or Compute Engine Machine Families for GCP deployments. If you have any questions about how to use any of these settings in the configuration page, please reach out to the team.

Cloud Resource Configurations Advanced Settings

October 7, 2024

Underscore expressions support datetime subtraction and total_seconds

Underscore expressions now support datetime subtraction and the use of a new library function chalk.functions.total_seconds. This allows you to compute the number of seconds in a time duration and define more complex time interval calculations using performant underscore expressions.

For example, to define a feature that computes the difference between two date features in days and weeks, we can use chalk.functions.total_seconds and underscore date expressions together.

import chalk.functions as F
from chalk.features import _, features, Primary
from datetime import date
@features
class User:
    id: Primary[str]
    created_at: date
    last_activity: date
    days_since_last_activity: float = F.total_seconds(date.today() - _.last_activity) / (60 * 60 * 24)
    num_weeks_active: float = F.total_seconds(_.last_activity - _.created_at) / (60 * 60 * 24 * 7)

Optionally evict nulls from your Redis lightning online store

You can now select whether to persist null values for features in the Redis lightning online store using the cache_nulls parameter in the feature method. By default, null values are persisted in the online store for features defined as Optional. If you set cache_nulls=False, null values will not be persisted in the online store.

from typing import Optional

from chalk import feature
from chalk.features import features, Primary

@features
class RestaurantRating:
    id: Primary[str]
    cleanliness_score: Optional[float] = feature(cache_nulls=False) # null values will not be persisted
    service_score: Optional[float] = feature(cache_nulls=True) # null values will be persisted. This is the default behavior.
    overall_score: float # null values are not persisted for required features

Feature value metrics from gRPC server and feature table updates

Customers running the gRPC server can now reach out to enable feature value metrics. Feature value metrics include the number of observations, number of unique values, and percentage of null values over all queries, as well as the running average and maximum of features observed. Please reach out if you'd like to enable feature value metrics.

Feature value metrics

Additionally, the feature table in the dashboard has been updated to allow for customization of columns displayed, which enables viewing request counts over multiple time ranges in the same view.

September 30, 2024

Compute cosine similarity between two vector features

chalk.functions now offers a cosine_similarity function:

import chalk.functions as F
from chalk.features import _, embedding, features, Vector

@features
class Shopper:
      id: str
      preferences_embedding: Vector[1536]

@features
class Product:
      id: str
      description_embedding: Vector[1536]

@features
class ShopperProduct:
      id: str
      shopper_id: Shopper.id
      shopper: Shopper
      product_id: Product.id
      product: Product
+     similarity: float = F.cosine_similarity(_.shopper.preferences_embedding, _.product.description_embedding)

Cosine similarity is useful when handling vector embeddings, which are often used when analyzing unstructured text. You can also use embedding to compute vector embeddings with Chalk.

Dashboard now shows metrics for offline query runs

When looking at an offline query run in the dashboard, you'll now find a new Metrics tab showing query metadata, CPU utilization, and memory utilization.

Configuration option for recomputing features on cache misses

We have a new offline query configuration option for recomputing features only when they are not already available in the offline store. This option is useful for workloads with computationally expensive features that cannot easily be recomputed. Please reach out if you'd like to try this feature.

September 23, 2024

Configurable retry policy for SQL resolvers

Sometimes, a SQL resolver may fail to retrieve data due to temporary unavailability. We've added new options for configuring the number of retry attempts a resolver may make (and how long it should wait between attempts). If you're interested in trying out this new functionality early, please let the team know.

Has-one join keys can be chained

When creating has-one relationships, you can set the primary key of the child feature class to the primary key of the parent feature class. For example, you may model an InsurancePolicy feature class as belonging to exactly one user by setting InsurancePolicy.id's type to Primary[User.id].

Now, we've updated Chalk so that you can chain more of these relationships together. For example, an InsurancePolicy feature class may have an associated InsuranceApplication. The InsuranceApplication may also have an associated CreditReport. Chalk now allows chaining an arbitrary number of has-one relationships. Chalk will also validate these relationships to ensure there are no circular dependencies.

Here's an example with features describing a system where a user has one insurance policy, each policy has one submitted application, and each application has one credit report:

from chalk import Primary
from chalk.features import features

@features
class User:
      id: str
      # :tags: pii
      ssn: str
      policy: "InsurancePolicy"

@features
class InsurancePolicy:
      id: Primary[User.id]
      user: User
+     application: "InsuranceApplication"

+ @features
+ class InsuranceApplication:
+     id: Primary[InsurancePolicy.id]
+     stated_income: float
+     # For the sake of illustrating has-one relationships,
+     # we're assuming exactly one credit report per
+     # application, which may not be realistic. A has-many
+     # relationship may be more accurate here.
+     credit_report: "CreditReport"
+
+ @features
+ class CreditReport:
+     id: Primary[InsuranceApplication.id]
+     fico_score: int
+     application: InsuranceApplication

To query for a user's credit report, you would write:

client.query(
    input={User.id: "123"},
    output=[User.policy.application.credit_report],
)

To write a resolver for one of the dependent feature classes here, such as CreditReport.fico_score, you would still reference the relevant feature class by itself:

@online
def get_fico_score(id: CreditReport.id) -> CreditReport.fico_score:
    ...

As an aside, if your resolver depends on features from other feature classes, such as User.ssn, we instead recommend joining those two feature classes directly for clarity (which was possible prior to this changelog entry):

from chalk import Primary
- from chalk.features import features
+ from chalk.features import features, has_one

@features
class User:
      id: str
      # :tags: pii
      ssn: str
      policy: "InsurancePolicy"
+     credit_report: "CreditReport" = has_one(lambda: User.id == CreditReport.id)

# ... the rest of the feature classes

+ @online
+ def get_fico_score(id: User.id, ssn: User.ssn) -> User.credit_report.fico_score:
+     ...

User permissions page shows roles per user

When you view Users in the Chalk settings page, you will now find a menu for viewing the roles associated with each user, whether those roles are granted directly or via SCIM.

An example user with admin and owner roles

September 16, 2024

New feature and resolver UI in the dashboard

We have shipped a new UI for the Features and Resolvers sections of the dashboard!

The new UI has tables with compact filtering and expanded functionality. You can now filter and sort by various resolver and feature attributes! The tables also provide column resizing for convenient exploration of the feature catalog.

Resolver table with compact filtering and sorting

The features table now includes request counts from the last 5 minutes up to the last 180 days, has built-in sorting, and has a Features as CSV button to download all the feature attributes in your table as a CSV for further analysis.

Feature table with request count and csv export button

New helper functions for feature computation

The new chalk.functions module contains several helper functions for feature computation. For example, if you have a feature representing a GZIP-compressed raw value, you can use gunzip with an underscore reference to create an unzipped feature. The full list of available functions can be found at the bottom of our underscore expression documentation.
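A minimal sketch (feature names are illustrative):

import chalk.functions as F
from chalk.features import _, features

@features
class Document:
    id: str
    compressed_body: bytes               # raw GZIP-compressed value
    body: bytes = F.gunzip(_.compressed_body)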

JSON feature type

You can now define features with JSON as the type after importing JSON from the chalk module. You can then reference the JSON feature in resolver and query definitions. You can also retrieve scalar values from JSON features using the json_value function.
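A sketch of a JSON-typed feature with a scalar extracted via json_value (the feature names, the assumption that json_value is exposed through chalk.functions, and the JSON-path syntax are all illustrative):

import chalk.functions as F
from chalk import JSON
from chalk.features import _, features

@features
class ApiEvent:
    id: str
    payload: JSON
    # The "$." path syntax is an assumption; see the json_value docs.
    user_agent: str = F.json_value(_.payload, "$.headers.user_agent")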

September 9, 2024

Configure Chalk to not cache null feature values

By default, Chalk caches all feature values, including null. To prevent Chalk from caching null values, use the feature method and set cache_nulls to False.

More static execution of certain Python resolvers

We built a way to statically interpret Python resolvers to identify ones that are eligible for faster C++ execution. For now, resolvers are eligible if they consist of simple arithmetic and logical expressions. If you're interested in learning more and seeing whether these new query planner options would apply to your codebase, please reach out!

New tutorial for using Chalk with SageMaker

We have a new tutorial for using Chalk with SageMaker available now. In the tutorial, we show how to use Chalk to generate training datasets from within a SageMaker pipeline for model training and evaluation.

September 3, 2024

Feature catalog shows associated named queries

In the August 19 changelog entry, we announced NamedQuery, a tool for naming your queries so that you can execute them without writing out the full query definition.

This week, we've updated the dashboard's feature catalog so that it shows which named queries reference a given feature as input or output.

Feature catalog showing links to named queries a feature is an input or output of

August 26, 2024

View aggregation backfills in the dashboard

We added a new Aggregations page to the dashboard where you can see the results of aggregate backfill commands. Check it out to see what resolvers were run for a backfill, the backfill's status, and other details that will help you drill down to investigate performance.

For more details on aggregate backfills, see our documentation on managing windowed aggregations.

August 19, 2024

Execute queries by name

Instead of writing out the full definition of your query each time you want to run it, you can now register a name for your query and reference it by the name!

Here's an example of a NamedQuery:

from chalk import NamedQuery
from src.feature_sets import Book, Author

NamedQuery(
    name="book_key_information",
    input=[Book.id],
    output=[
        Book.id,
        Book.title,
        Book.author.name,
        Book.year,
        Book.short_description
    ],
    tags=["team:analytics"],
    staleness={
        Book.short_description: "0s"
    },
    owner="mary.shelley@aol.com",
    description=(
        "Return a condensed view of a book, including its title, author, "
        "year, and short description."
    )
)

After applying this code, you can execute this query by its name:

chalk query --in book.id=1 --query-name book_key_information

To see all named queries defined in your current active deployment, use chalk named-query list.

As Shakespeare once wrote, "What's in a named query? That which we call a query by any other name would execute just as quickly."

Miscellaneous improvements

  • The offline query page of the dashboard now shows which table in your offline store contains the query's output values.

August 12, 2024

Queries can reference multiple feature namespaces

Previously, you could only reference one feature namespace in your queries. Now you can request features from multiple feature namespaces. For example, here's a query for a specific customer and merchant:

client.query(
    input={
        Customer.id: 12345,
        Merchant.id: 98765,
    },
    output=[Customer, Merchant],
)

Dashboard resources view shows allocatable CPU and memory

The resources page of the dashboard now shows the allocatable and total CPU and memory for each of your Kubernetes nodes. Kubernetes reserves some of each machine's resources for internal usage, so you cannot allocate 100% of a machine's stated resources to your system. Now, you can use the allocatable CPU and memory numbers to tune your resource usage with more accuracy.

Performance improvements

We identified an improvement for our query planner’s handling of temporal joins! Our logic for finding the most recent observation for a requested timestamp is now more efficient. Happy time traveling!

August 5, 2024

DynamoDB with PartiQL

We now support DynamoDB as a native accelerated data source! After connecting your AWS credentials, Chalk automatically has access to your DynamoDB instance, which you can query with PartiQL.

Underscore expressions support references to the target window duration

Underscore expressions on windowed features can now include the special expression _.chalk_window to reference the target window duration. Use _.chalk_window in windowed aggregation expressions to define aggregations across multiple window sizes at once:

@features
class Transaction:
    id: int
    user_id: "User.id"
    amount: float

@features
class User:
    id: int
    transactions: DataFrame[Transaction]
    total_spend: Windowed[float] = windowed(
        "30d", "60d", "90d",
        default=0,
        expression=_.transactions[_.amount, _.ts > _.chalk_window].sum(),
        materialization={"bucket_duration": "1d"},
    )

Offline queries allow resource overriding

  • offline_query now supports the resources parameter. resources allows you to override the default resource requests associated with offline queries and cron jobs so that you can control CPU, memory, ephemeral volume size, and ephemeral storage.
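A sketch of overriding resources on an offline query (the shape of the resources argument, shown here as a plain mapping, and its field names are assumptions; see the offline_query reference for the supported fields):

from chalk.client import ChalkClient

ds = ChalkClient().offline_query(
    input={"user.id": [1, 2, 3]},
    output=["user.email"],
    # Hypothetical field names for the resource override.
    resources={
        "cpu": "4",
        "memory": "16Gi",
        "ephemeral_volume_size": "50Gi",
    },
)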

July 26, 2024

Dashboard improvements

  • The offline query page of the dashboard now shows live query progress. After query completion, the query page will also show how long each resolver took to run.
  • The Kubernetes resource page in the dashboard shows which kinds of hardware resources are currently running. It also allows you to group resources by application, component, and other common groupings.

July 19, 2024

Datasets and dataset revisions now support previews and summaries

Datasets and DatasetRevisions have two new methods: preview and summary. preview shows the first few rows of the query output. summary shows summary statistics of the query output. Here's an example of summary output:

     describe  user.id  ...  __index__ shard_id batch_id
0       count      1.0  ...        1.0        0        0
1  null_count      0.0  ...        0.0        0        0
2        mean      1.0  ...        0.0        0        0
3         std      0.0  ...        0.0        0        0
4         min      1.0  ...        0.0        0        0
5         max      1.0  ...        0.0        0        0
6      median      1.0  ...        0.0        0        0

[7 rows x 14 columns]
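A minimal sketch of calling these methods on a dataset returned by offline_query:

from chalk.client import ChalkClient

ds = ChalkClient().offline_query(
    input={"user.id": [1]},
    output=["user.email"],
)
print(ds.preview())   # first few rows of the query output
print(ds.summary())   # summary statistics, like the table above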

Create isolated node pools for your resource groups (in AWS)

Chalk resource groups create separate independent deployments of the query server to prevent resource contention. For example, one team may want to make long-running analytics queries and another may want to make low-latency queries in line with customer requests.

We have updated the Cloud Resource Configuration page! You can now configure resource groups to use completely independent node pools to ensure your workflows run on separate hardware. The configuration page also allows you to specify exactly what kind of hardware will be available in each resource group so you can optimize the balance between cost and performance.

This feature is currently available for customers running Chalk in EKS, but will be available soon for customers using GKE.

Performance improvements

  • We've significantly improved SQL runtime in our query planner by executing eligible queries in C++ instead of SQLAlchemy. Chat with our support team if you'd like to update your query planner options.
  • We improved the performance of some underscore expressions by executing count() operations as native dataframe operations.

Miscellaneous improvements

  • Our feature catalog now lets you filter features by their context (online or offline). Additionally, you can now search features by their name, description, and owner.
  • We fixed an issue where some underscore expressions had incorrect typechecking.

July 10, 2024

Feature catalog

You can now view and filter features in the feature catalog by their tags and owners.

Feature catalog filtering by tag and owner

July 1, 2024

Chalk gRPC

We shipped a gRPC engine for Chalk that improved performance by at least 2x through improved data serialization, efficient data transfer, and a migration to our C++ server. You can now use ChalkGRPCClient to run queries with the gRPC engine and fetch enriched metadata about your feature classes and resolvers through the get_graph method.

Spine SQL query

With ChalkPy v2.38.8 or later, you can now pass spine_sql_query to offline queries. The resulting rows of the SQL query will be used as input to the offline query. Chalk will compute an efficient query plan to retrieve your SQL data without requiring you to load the data and transform it into input before sending it back to Chalk. For more details, check out our documentation.
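A sketch of passing a spine SQL query (the convention of aliasing columns to fully qualified feature names is an assumption):

from chalk.client import ChalkClient

ds = ChalkClient().offline_query(
    spine_sql_query="""
        select id as "user.id"
        from users
        where signup_date > '2024-01-01'
    """,
    output=["user.email", "user.credit_score"],
)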

Static planning of underscore expressions

We shipped static planning of underscore expressions. Underscore expressions enable you to define and resolve features from operations on other features. When you use underscore expressions, we now do static analysis of your feature definition to transform it into performant C++ code.

Underscore expressions currently support basic arithmetic and logical operations, and we continue to build out more functionality! See the code snippet below for some examples of how to use underscore expressions:

@features
class SampleFeatureSet:
    id: int
    feature_1: int
    feature_2: int
    feature_1_2_sum: int = _.feature_1 + _.feature_2
    feature_1_2_diff: int = _.feature_1 - _.feature_2
    feature_1_2_equality: bool = _.feature_1 == _.feature_2

June 28, 2024

Chalk deployment tags

You can now add tags to your deployments. Tags must be unique to each of your environments. If you add an already existing tag to a new deployment, Chalk will remove the tag from your old deployment.

Tags can be added with the --deployment-tag flag in the Chalk CLI:

chalk apply --deployment-tag=latest --deployment-tag=v1.0.4

Resource configuration management in dashboard

We updated our UI for resource configuration management in the dashboard! You can now toggle your view between a GUI or a JSON editor. The GUI exposes all the configuration options available in the JSON editor, including values that aren't set, and allows you to easily adjust your cluster's resources to fit your needs.

resource configuration management

June 19, 2024

New data sources and native drivers

We added integrations for Trino and Spanner as new data sources. We've also added native drivers for Postgres and Spanner, which drastically improves performance for these data sources.

May 29, 2024

Heartbeating

We now have heartbeating to poll the status of long-running queries and resolvers. Hanging runs that stop reporting a heartbeat are marked as "failed" after a certain period of time.

May 14, 2024

Data source and feature-level RBAC

We expanded the functionality of our service tokens to enable role-based access control (RBAC) at both the data source and feature level. At the data source level, you can now restrict a token to only resolve features using data sources with matching tags. At the feature level, you can restrict a token's access to tagged features: either block the token from returning tagged features in query results while still allowing their values to be used in the computation of other features, or block the token from accessing tagged features entirely.

datasource and feature level rbac

May 8, 2024

Incremental Status

We shipped statuses for incremental runs so that users can see the current high-water mark of ingested data.

chalk incremental status  --scheduled_query get_some_data__daily
✓ Fetched resolver progress state
Resolver:                 N/A
Query:                    run_this_query_daily
Environment:              chalk12345
Max Ingested Timestamp:   2024-07-01T16:01:46+00:00
Last Execution Timestamp: 2024-07-01T00:01:27.421873+00:00

April 18, 2024

Miscellaneous improvements

  • Windowed resolvers have expanded to allow for hourly cadences.

April 9, 2024

Miscellaneous improvements

  • SQL resolvers have improved error reporting for failures related to type conversion (e.g., if your resolver selects an int column, but the feature's type is string).

March 29, 2024

Miscellaneous improvements

  • SQL file resolvers have spellcheck (based on Levenshtein distance)
  • Failed annotation parsing raises a type error with a more helpful error message

March 19, 2024

Scheduled Queries

Chalk now supports executing an offline_query on a schedule. Effectively, this extends the existing "scheduled resolver" functionality and allows you to execute more complicated data ingestion or caching workflows without needing to use Airflow or other external schedulers to orchestrate resolver execution.

Here's an example of a scheduled query that caches the number of transactions a user has made in the last 24 hours into the online store:

from chalk import ScheduledQuery

ScheduledQuery(
    name="num_transactions_last_24h",
    output=[User.num_transactions_last_24h],
    schedule="0 0 * * *", # every day at midnight
    store_online=True,    # store the result in the online store
    store_offline=False,  # don't store this value in the offline store
)

Bugfixes and improvements

  • offline_query(...) now accepts sample_features: list[Feature] as an argument. This works in conjunction with recompute_features, and allows you to write something like:
ChalkClient().offline_query(
    input={User.id: [...]},
    output=[User.full_name],
    recompute_features=True,                          # means "recompute all features"
    sample_features=[User.first_name, User.last_name] # but sample these features from the offline store
)

This is useful when you have a large number of features that you want to recompute, but only a few that you want to sample.

  • ChalkClient.offline_query now accepts run_asynchronously: bool to explicitly opt a query into running on an isolated worker.
  • DataSet.to_polars()/.to_pandas() now accept output_ts: str and output_id: str to customize the name of the timestamp and id columns in the output dataframe.
  • Feature and resolver discovery during chalk apply is roughly twice as fast as of chalkpy v2.33.9.
  • Dataset downloads no longer have any dependency on locally registered features, which resolves crashes for certain dataset management workflows.
  • ChalkClient.query now supports request_timeout: float, which is passed to the underlying requests.request call.

March 8, 2024

Bugfixes and improvements

  • A persistent issue with chalk drop has been resolved. Now, chalk drop will allow you to reset a feature whose deletion has been deployed to the active deployment, which will allow you to re-deploy the feature. Previously, it was possible to get into a state that was impossible to recover from without support.

  • tags(...) allows you to extract the tags of a @features class or a property (Feature) of that class.

  • DataSet.to_polars()/.to_pandas() now raise an error if the dataset computation had errors. This prevents the user from accidentally using a dataset that was not computed correctly. If you wish to use the dataset anyway, you can use DataSet.to_polars(ignore_errors=True).

March 1, 2024

Support for custom SQL sampling in offline query

You can now specify a custom SQL sampling query for offline queries. This allows you to use a native SQL query to compute the query's entity spine. This is useful when you have a complicated sampling policy (e.g. class-based sampling). Additional non-primary-key features can be provided as well.

January 22, 2024

required_resolver_tags for queries

You can now specify required_resolver_tags when querying. This allows you to ensure that a query only considers a resolver if it has a certain tag. This is useful for guaranteeing that a query only uses resolvers that are cost-efficient, or for enforcing certain compliance workflows.

In this example:

@offline()
def fetch_credit_scores() -> DataFrame[User.id, User.credit_score]:
    """
    Call bureaus to get credit scores; costs money for each record retrieved.
    """

    return requests.post(...)

@offline(tags=["low-cost"])
def fetch_previously_ingested_credit_scores() -> DataFrame[User.id, User.credit_score]:
    """
    Pull previously retrieved credit scores from Snowflake only
    """

    return snowflake.query_string("select user_id as id, credit_score from ...").all()

Querying with required_resolver_tags can be used to enforce that only 'low-cost' resolvers are executed:

# This query is guaranteed to /never/ run any resolver that isn't tagged "low-cost".

dataset = ChalkClient().offline_query(
    input={User.id:[1,2,3]},
    output=[
        User.credit_score
    ],
    recompute_features=True,
    required_resolver_tags=["low-cost"]
)

October 24, 2023

Support for Python 3.11

You can now use either Python 3.11 or 3.10 on a per-environment basis.

project: my-project-id
environments:
  default:
    runtime: 3.10
  develop:
    runtime: 3.11

See Python Version for more information.


October 23, 2023

Quality of Life Improvements

  • ChalkClient.query_bulk(...) and multi_query no longer require that referenced features be defined as Python classes; string names for inputs and outputs can now be used instead.

October 11, 2023

Alert descriptions

Alerts now support descriptions, which can be used to provide more context about the alert.

from chalk.monitoring import Chart, Series
Chart(name="Request count").with_trigger(
    Series
        .feature_null_ratio_metric()
        .where(feature=User.fico_score) > 0.2,
+   description="""*Debugging*
+
+   When this alert is triggered, we're parsing null values from
+   a lot of our FICO reports. It's likely that Experian is
+   having an outage. Check the <dashboard|https://internal.dashboard.com>.
+   """
)

These descriptions can also be set in the Chalk dashboard via the metric alerts interface.

Alert description interface:


October 5, 2023

query_bulk support for notebooks

The query_bulk method is now available in the ChalkClient class. This method allows you to query for multiple rows of features at once.

This method uses Apache Arrow's Feather format to encode data. This allows the endpoint to transmit data (particularly numeric-heavy data) using roughly 1/10th the bandwidth that is required for the JSON format used by query.

This method has been available in beta for a few months, but is now available for general use, and as part of this release is now supported when querying using notebooks without access to feature schemas.
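A minimal sketch (using string feature names, which is handy in notebooks without access to the feature classes):

from chalk.client import ChalkClient

client = ChalkClient()
result = client.query_bulk(
    input={"user.id": [1, 2, 3]},
    output=["user.email", "user.full_name"],
)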


September 26, 2023

Improve scheduled resolver runs list

The list of scheduled resolvers now shows which resolvers are actually scheduled to run in the current environment, based on the environment argument to @online and @offline.

Scheduled Resolvers List:

Resolvers that are annotated with an environment other than the current environment are labeled with the environment in which they are configured to run.


August 23, 2023

Improved chalk query output

The chalk query command now has improved output for errors. Previously, errors were displayed in a table, which meant that stacktraces were truncated:

> chalk query --in email.normalized=nice@chalk.ai --out email

Errors

Code             Feature  Resolver                        Message
─────────────────────────────────────────────────────────────────────────────
RESOLVER_FAILED           src.resolvers.get_fraud_tags    KeyError: 'tags'

Now, errors are displayed in a more readable format, and stacktraces are not truncated:

> chalk query --in email.normalized=nice@chalk.ai --out email

Errors

Resolver Failed src.resolvers.get_fraud_tags

KeyError: 'tags'
  File "src/resolvers.py", line 30, in get_fraud_tags
      return parsed["tags"]

KeyError('tags')

August 19, 2023

Query plan trace viewer

The query plan viewer now includes a flame graph visualization of the query plan's execution, called the Trace View. Precise trace data is stored for every offline query by default and for online queries when the query is made with the --explain flag.

Trace View:


August 11, 2023

Override now in online query

  • Support now= for .query, --now, etc.

Query plan viewer improvements

  • Redesigned query plan viewer
  • Support viewing execution time per operator
  • Support viewing data processing metrics per operator
  • Query plans saved for all queries by default

No-input online and offline query improvements

  • offline_query now supports running downstream resolvers when no input is provided. Query primary keys will be sampled or computed, depending on the value of recompute_features.
  • online_query now supports running a query without any input. Query primary keys will be computed using an appropriate no-argument resolver that returns a DataFrame[...].

Misc

  • --local for chalk query, combines chalk apply --branch and chalk query --branch
  • The progress indicator in the chalk command line tool is no longer an off-brand magenta.

August 5, 2023

Chalk Python SDK Improvements

Added: .to_polars(), .to_pandas(), and .to_pyarrow() accept prefixed: bool as an argument. prefixed=True is the default behavior and will prefix all column names with the feature namespace. prefixed=False will not prefix column names.

DataFrame({User.name: ["Andy"]}).to_polars(prefixed=False)
# output:
# polars DataFrame with `name` as the sole column.

DataFrame({User.name: ["Andy"]}).to_polars(prefixed=True)
# output:
# polars DataFrame with `user.name` as the sole column.

Added: include_meta on ChalkClient.query(...), which includes .meta on the response object. This metadata object includes useful information about the query execution, at the cost of increased network payload size and a small increase in latency.
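For example (a sketch with illustrative feature names):

from chalk.client import ChalkClient

response = ChalkClient().query(
    input={"user.id": "u_123"},
    output=["user.email"],
    include_meta=True,
)
print(response.meta)  # query execution metadata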


July 25, 2023

Freezing time in unit tests

Chalk now supports freezing time in unit tests. This is useful for testing time-dependent resolvers.

from datetime import timezone, datetime
from chalk.features import DataFrame, after
from chalk.features.filter import freeze_time

df = DataFrame([...])
with freeze_time(at=datetime(2020, 2, 3, tzinfo=timezone.utc)):
    df[after(days_ago=1)]  # Get items after February 2nd

freeze_time also works with resolvers that declare specific time bounds for their aggregation inputs:

@online
def get_num_transactions(txs: Card.transactions[before(days_ago=1)]) -> Card.num_txs:
    return len(txs)

with freeze_time(at=datetime(2020, 9, 14)):
    num_txs = get_num_transactions(txs)  # num transactions before September 13th

July 11, 2023

Explicitly time-dependent resolvers

Chalk now supports resolvers that are explicitly time-dependent. This is useful for performing backfills that compute values depending on something semantically similar to datetime.now().

You can express time-dependency by declaring a dependency on a special feature called Now:

@online
def get_age_in_years(birthday: User.birthday, now: Now) -> User.age_in_years:
    return (now - birthday).days // 365  # approximate age in years

In an online query (i.e. with ChalkClient().query), Now is datetime.now(). In offline query contexts, now will be set to the appropriate input_time value for the calculation. This allows you to backfill a feature for a single entity at many different historical time points:

ChalkClient().offline_query(input={User.id: [1, 1, 1]}, output=[User.age_in_years], input_times=[
    datetime.now() - timedelta(days=100),
    datetime.now() - timedelta(days=50),
    datetime.now() - timedelta(days=0),
])
...

Now can be used in batch resolvers as well:

@online
def batch_get_age_in_years(df: DataFrame[User.id, User.birthday, Now]) -> DataFrame[User.id, User.age_in_years]:
    ...

June 21, 2023

Testing your SQL File Resolvers

SQL file resolvers are Chalk's preferred method of resolving features with SQL queries. Now, you can retrieve a SQL file resolver in Python by its file name. For example, if you have the following SQL file resolver:

-- source: postgres
-- cron: 1h
-- resolves: Person
select id, name, email, building_id from table where id=${person.id}

you can test out your resolver with the following code.

from chalk import get_resolver

resolver = get_resolver('example') # get_resolver('example.chalk.sql') will also work
result = resolver('my_id')

June 15, 2023

Metrics Export Updates

Now, Chalk supports exporting metrics about "named query" execution. These metrics (count, latency) join similar metrics about feature and resolver execution. Contact your Chalk Support representative to configure metrics export if you would like to view metrics about Chalk system execution in your existing metrics dashboards.

Additional updates:

  • synthetic cache resolvers are now excluded
  • query_name is a tag on many metrics

June 14, 2023

Branch deployment performance

Chalk Branch Deployments provide an excellent experience for quick iteration cycles on new features and resolvers. Now, Chalk Branch Deployments automatically use a pool of "standby" workers, so there is less delay before queries can be served against a new deployment. This reduces the time it takes to run query or offline query against a new deployment from ~10-15 seconds to ~1-3 seconds. This impacts customers with more complex feature graphs the most.


June 13, 2023

Expanded support for logical keying in streaming contexts

Stream resolvers support a keys= parameter. This parameter allows you to re-key a stream by a property of the message, rather than relying on the protocol-layer key. This is appropriate if a stream is keyed randomly, or by an entity key like "user", but you want to aggregate along a different axis, e.g. "organization".

Now, keys= supports passing a "dotted string" (e.g. foo.bar) to indicate that Chalk should use a sub-field of your message model. Previously, only root-level fields of the model were supported.

DataFrame unit tests

If you specify projections or filters in DataFrame arguments of resolvers, Chalk will automatically project out columns and filter rows in the input data.

Below, we test a resolver that filters rooms in a house to only the bedrooms:

@features
class Room:
    id: str
    home_id: str
    name: str

@features
class Home:
    id: str
    rooms: DataFrame[Room] = has_many(
        lambda: Room.home_id == Home.id
    )
    num_bedrooms: int

@online
def get_num_bedrooms(
    rooms: Home.rooms[Room.name == 'bedroom']
) -> Home.num_bedrooms:
    return len(rooms)

Now, we may want to write a unit test for this resolver.

def test_get_num_rooms():
    # Rooms is automatically converted to a `DataFrame`
    rooms = [
        Room(id=1, name="bedroom"),
        Room(id=2, name="kitchen"),
        Room(id=3, name="bedroom"),
    ]

    # The kitchen room is filtered out
    assert get_num_bedrooms(rooms) == 2

    # `get_num_bedrooms` also works with a `DataFrame`
    assert get_num_bedrooms(DataFrame(rooms)) == 2

While we could have written this test before, we would have had to manually filter the input data to only include bedrooms. Also note that Chalk will automatically convert our argument to a DataFrame if it is not already one.

June 12, 2023

Query Run Page

Chalk's dashboard shows aggregated logs and metrics about the execution of queries and resolvers. Now, it can also show detailed metrics for a single query. This is useful for debugging and performance tuning.

You can access this page from the "runs" tab on an individual named query page, or from the "all query runs" link on the "queries" page.

You can search the list of previously executed queries by date range, or by "query id". The query id is returned in the "online query" API response object.

May 15, 2023

BigTable Online Storage

Chalk now supports BigTable as an online-storage implementation. BigTable is appropriate for customers with large working sets of online features, as is common with recommendation systems. We have successfully configured BigTable to serve 700,000 feature vectors per second at ~30ms p90 e2e latency.

May 10, 2023

Enhancements to Offline Query

The Offline Query has been enhanced with a new recompute_features parameter. Users can control which features are sampled from the offline store, and which features are recomputed.

  • The default value False will maintain current behavior, returning only samples from the offline store.
  • True will ignore the offline store, and execute @online and @offline resolvers to produce the requested output.
  • If, instead, the user passes in a list of features to recompute_features, those features will be recomputed by running @online and @offline resolvers, and all other feature values - including those needed to recompute the requested features - will be sampled from the offline store.

Recompute Dataset

The 'recompute' capability is also exposed on Dataset. When passed a list of features to recompute, a new Dataset Revision will be generated, and the existing dataset will be used as inputs to recompute the requested features.

Developing in Jupyter

Chalk has introduced a new workflow when working with branches, allowing full iterations to take place directly in any IPython notebook. When a user creates a Chalk Client with a branch in a notebook, subsequent features and resolvers in the notebook will be deployed to that branch. When combined with Recompute Dataset and the enhancements to Offline Query, users have a new development loop available for feature exploration and development:

  1. Take advantage of existing data in Chalk
  2. Explore that data using familiar tools in a notebook
  3. Enrich the data by developing new features and resolvers
  4. Immediately view the results of adjusting features in the dataset
  5. When exploration is complete, features and resolvers can be directly added back to the Chalk project

May 5, 2023

View Deployment Source Code

Deployments now offer the ability to view their source code. By clicking the "View Source" button on the Deployment Detail page, users can view all files included in the deployed code.

April 21, 2023

Improved Deployment Utilities

Users can now "redeploy" any historical deployement with a UI button on the deployment details page. This enables useful workflows including rollbacks. The "download source" button downloads a tarball containing the deployed source to your local machine. Deploy UI Enhancements

April 18, 2023

Resolver error messages for incorrect types include primary keys

When writing resolvers, incorrect typing can be difficult to track down. Now, if a resolver instantiates a feature of an incorrect type, the resolver error message will include the primary key value(s) of the query itself.

April 11, 2023

Online query improvements

The Online Query API can now be used to query DataFrame-typed features. For instance, you can query all of a user's transaction level features in a single query:

chalk query --in user.id=user_1 --out user.transactions

{
  "columns": ["transaction.id", "transaction.user_id", ...],
  "values": [[1, 2, 3, ...], ["user_1", "user_2", "user_3", ...]]
}

More functionality will be added to Online and Offline query APIs to support more advanced query patterns.

April 6, 2023

Branch deployments

When deploying with chalk apply a new flag --branch <branch_name> has been introduced which creates a branch deployment. Users can interact with their branch deployment using a consistent name by passing the branch name to query, upload_features, etc. Chalk clients can also be scoped to a branch by passing the branch in the constructor. Branch deployments are many times faster than other flavors of chalk apply, frequently taking only a few seconds from beginning to end. Branch deployments replace preview deploys, which have been deprecated.

March 31, 2023

Speed improvements for deployments

Deployments via chalk apply are now up to 50% faster in certain cases. If your project's PIP dependencies haven't changed, new deployments will build & become active significantly faster than before.

Deploy Time Comparison:

March 17, 2023

Offline TTL

Introduces a new "offline_ttl" property to features decorator . Now you can control for how long data is valid in the offline_store. Any feature older than the ttl value will not be returned in an offline query.

@features
class MaxOfflineTTLFeatures:
    id: int
    ts: datetime = feature_time()

    no_offline_ttl_feature: int = feature(offline_ttl=timedelta(0))
    one_day_offline_ttl_feature: int = feature(offline_ttl=timedelta(days=1))
    infinite_ttl_feature: int

Strict Feature Validation

Adds the strict property to the feature() function, indicating that any failed validation will throw an error. Invalid features will never be written to the online or offline store if strict is True. Also introduces the validations array to allow differentiated strict and soft validations on the same feature.

@features
class ClassWithValidations:
    id: int
    name: int = feature(max=100, min=0, strict=True)
    feature_with_two_validations: int = feature(
        validations=[
            Validation(min=70, max=100),
            Validation(min=0, max=100, strict=True),
        ]
    )

March 7, 2023

Datasets in Offline Query

The Dataset class is now live! Using the new ChalkClient.offline_query method, we can inspect important metadata about the query and retrieve its output data in a variety of ways.

Simply attach a dataset_name to the query to persist the results.

from datetime import datetime

import pandas as pd

from chalk.client import ChalkClient, Dataset

uids = [1, 2, 3, 4]
at = datetime.now()
dataset: Dataset = ChalkClient().offline_query(
     input={
         User.id: uids,
     },
     input_times=[at] * len(uids),
     output=[
         User.id,
         User.fullname,
         User.email,
         User.name_email_match_score,
     ],
     dataset_name='my_dataset'
)
pandas_df: pd.DataFrame = dataset.data_as_pandas

Check out the documentation here.

February 28, 2023

Deployment Build Logs

Chalk now provides access to build and boot logs through the Deployments page in the dashboard.

Build Logs

February 16, 2023

Resolver timeouts

Computing features associated with third-party services can be unpredictably slow. Chalk helps you manage such uncertainty by specifying a resolver timeout duration.

Now you can set timeouts for resolvers!

@online(timeout="200ms")
def resolve_australian_credit_score(driver_id: User.driver_id_aus) -> User.credit_score_aus:
    return experian_client.get_score(driver_id)

January 26, 2023

SQL File Resolvers

SQL-integrated resolvers can be written completely in SQL files: no Python required! If you have a SQL source as follows:

pg = PostgreSQLSource(name='PG')

You can define a resolver in a .chalk.sql file, with comments that detail important metadata. Chalk will process it upon chalk apply as it would any other Python resolver.

-- type: online
-- resolves: user
-- source: PG
-- count: 1
select email, full_name from user_table where id=${user.id}

Check out the documentation here.

January 12, 2023

Improved Logging

Logging on your dashboard has been improved. You can now scroll through more logs, and the formatting is cleaner and easier to use. This view is available for resolvers and resolver runs.

Logs Viewer

January 9, 2023

Pretty Print Online Query Results

Online Query Response objects now support pretty-printing in any IPython environment.

Pretty Print Query Response

January 8, 2023

Linux docker containers on M1 Macs

chalkpy has always supported running in docker images using M1's native arm64 architecture, and now chalkpy==1.12.0 supports most functionality on M1 Macs when run with AMD64 (64 bit Linux) architecture docker images. This is helpful when testing images built for Linux servers that include chalkpy.

January 6, 2023

Docs Search

Chalk has lots of documentation, and finding content can be difficult.

We've added docs search!

Documentation search

Try it out by typing cmd-K, or clicking the search button at the top of the table of contents.

September 27, 2022

Tags & Owners as Comments

This update makes several improvements to feature discovery.

Tags and owners are now parsed from the comments preceding the feature definition.

@features
class RocketShip:
    # :tags: team:identity, priority:high
    # :owner: katherine.johnson@nasa.gov
    velocity: float
    ...

Prior to this update, owners and tags needed to be set in the feature(...) function:

@features
class RocketShip:
    velocity: float = feature(
        tags=["**team:identity**", "**priority:high**"],
        owner="**katherine.johnson@nasa.gov**"
    )
    ...

Feel free to choose either mechanism!

July 28, 2022

Auto Id Features

It's natural to name the primary feature of a feature class id. So why do you always have to specify it? Until now, you needed to write:

@features
class User:
    id: str = feature(primary=True)
    ...

Now you don't have to! If you have a feature class that does not have a feature with the primary field set, but has a feature called id, it will be assigned primary automatically:

@features
class User:
    id: str
    ...

The functionality from before sticks around: if you use a field as a primary key with a name other than id, you can keep using it as your primary feature:

@features
class User:
    user_id: str = feature(primary=True)
    # Not really the primary key!
    id: str

July 25, 2022

DataFrame Expressions

The Chalk DataFrame now supports boolean expressions! The Chalk team has worked hard to let you express your DataFrame transformations in natural, idiomatic Python:

DataFrame[
  User.first_name == "Eleanor" or (
    User.email == "eleanor@whitehouse.gov" and
    User.email_status not in {"deactivated", "unverified"}
  ) and User.birthdate is not None
]

Python experts will note that or, and, is, is not, not in, and not aren't overloadable. So how did we do this? The answer is AST parsing! A more detailed blog post will follow.

July 22, 2022

Descriptions as Comments

This update makes several improvements to feature discovery.

Descriptions are now parsed from the comments preceding the feature definition. For example, we can document the feature User.fraud_score with a comment above the attribute definition:

@features
class User:
    # 0 to 100 score indicating an identity match.
    # Low scores indicate safer users
    fraud_score: float
    ...

Prior to this update, descriptions needed to be set in the feature(...) function:

@features
class User:
    fraud_score: float = feature(description="""
           0 to 100 score indicating an identity match.
           Low scores indicate safer users
        """)
    ...

The description passed to feature(...) takes precedence over the implicit comment description.

Namespace Metadata

You can now set attributes for all features in a namespace!

Here, we assign the tag group:risk and the owner ravi@chalk.ai to all features on the feature class. Owners specified at the feature level take precedence (so the owner of User.email is the default ravi@chalk.ai whereas the owner of User.flaky_api_result is devops@chalk.ai). Tags aggregate, so email has the tags pii and group:risk.

@features(tags="group:risk", owner="ravi@chalk.ai")
class User:
    email: str = feature(tags="pii")
    flaky_api_result: str = feature(owner="devops@chalk.ai")

July 14, 2022

Self-Serve Slack Integration

You can configure Chalk to post messages to your Slack workspace! You can find the Slack integration tab in the settings page of your dashboard.

Slack integration

Slack can be used as an alert channel or for build notifications.

July 13, 2022

Python 3.8 Support

Chalk's pip package now supports Python 3.8! With this change, you can use the Chalk package to run online and offline queries in a Python environment with version >= 3.8. Note that your features will still be computed on a runtime with Python version 3.10.

July 8, 2022

Named Integrations

Chalk injects environment variables to support data integrations. But what happens when you have two data sources of the same kind? Historically, our recommendation was to create one set of environment variables through an official data source integration, and another set of prefixed environment variables yourself using the generic environment variable support.

With the release of named integrations, you can connect to as many data sources of the same kind as you need! Provide a name at the time of configuring your data source, and reference it in the code directly. Named integrations inject environment variables with the standard names prefixed by the integration name (e.g. RISK_PGPORT). The first integration of a given kind will also create the un-prefixed environment variable (i.e. both PGPORT and RISK_PGPORT).
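For example, referencing a named PostgreSQL integration from code (assuming an integration configured with the name RISK; the chalk.sql import path is an assumption):

from chalk.sql import PostgreSQLSource

# Uses the RISK_-prefixed environment variables (e.g. RISK_PGPORT)
# injected for the integration named "RISK".
risk_pg = PostgreSQLSource(name="RISK")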

June 29, 2022

SOC 2 Report

Chalk is excited to announce the availability of our SOC 2 Type 1 report from Prescient Assurance. Chalk has instituted rigorous controls to ensure the security of customer data and earn the trust of our customers, but we're always looking for more ways to improve our security posture, and to communicate these steps to our customers. This report is one step along our ongoing path of trust and security.

If you're interested in reviewing this report, please contact support@chalk.ai to request a copy.

June 3, 2022

Pandas Integration

You can now convert Chalk's DataFrame to a pandas.DataFrame and back! Use the methods chalk_df.to_pandas() and .from_pandas(pandas_df).
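A quick sketch of the round trip (with an illustrative feature class; from_pandas is shown here as a classmethod on DataFrame, matching the text above):

import pandas as pd
from chalk.features import DataFrame, features

@features
class User:
    id: str
    name: str

chalk_df = DataFrame({User.id: ["u1"], User.name: ["Ada"]})
pandas_df: pd.DataFrame = chalk_df.to_pandas()   # Chalk -> pandas
round_trip = DataFrame.from_pandas(pandas_df)    # pandas -> Chalk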

Migration Sampling

The 1.4.1 release of the CLI added a parameter --sample to chalk migrate. This flag allows migrations to be run targeting specific sample sets.

Feature/Resolver Health

Added spark lines to the feature and resolver tables which show a quick summary of request counts over the past 24 hours. Added status to feature and resolver tables which show any failing checks related to a feature or resolver.