In [1]:
# Import the Chalk Python client, which is used to run queries (these let you
# compute features and build datasets)
from chalk.client import ChalkClient

# Notebook Development In Chalk

Chalk allows you to run queries, make datasets, and write new features directly from a Jupyter Notebook. In this notebook, we'll walk 
through some of the supported features of Chalk Notebook development:

1. Creating a new branch based on your main deployment and loading features defined in your main deployment into your notebook
2. Running online queries
3. Running offline queries
4. Defining new features:
    - Defining Alias Expressions
    - Defining New Feature Classes
    - Defining New Features on Existing Feature Classes
    - Defining SQL Resolvers
    - Defining New Python Resolvers

## 1. Creating a Branch & Loading Features

In Chalk, branch deployments are used to test new functionality and build new features. To begin, we create a new Chalk branch that
copies the features and resolvers from your main Chalk deployment. We then call `client.load_features()`, which imports your feature classes
into the notebook and lists the features your team has already built.

In [2]:
client = ChalkClient(branch="notebook-demo")
client.get_or_create_branch(branch_name="notebook-demo")
client.load_features()

Branch notebook-demo already exists. Client is being updated to point to that branch, but a new deployment is not being created.
Branch set to 'notebook-demo'



## 2. Running Online Queries

The Chalk Python client (`ChalkClient`) can be used to run online queries. To run an online query, specify the output features you want for a given input primary key; Chalk then identifies which resolvers need to run to calculate your outputs.

In [3]:
result = client.query(
    input={User.id: 1},
    output=[
        User,
        User.credit_report.percent_past_due,
    ],
    staleness={
        User.credit_report.percent_past_due: "0s",
    }
)
result

Unnamed: 0,Feature,Value
0,id,1
1,email,nicoleam@nasa.gov
2,name,Nicole Mann
3,dob,1977-08-27
4,email_username,nicoleam
5,domain_name,nasa.gov
6,denylisted,False
7,name_email_match_score,75.0
8,emailage_response,"{""domainAge"": 10200, ""domainname"": ""nasa.gov"",..."
9,email_age_days,2642
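
To make "Chalk identifies which resolvers need to run" more concrete, here is a toy sketch of dependency planning in plain Python. It is not Chalk's query planner: the resolver names and their input/output wiring are hypothetical, and only the feature names are borrowed from the output above.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical resolver registry: name -> (input features, output features).
RESOLVERS: Dict[str, Tuple[Set[str], Set[str]]] = {
    "get_user": ({"user.id"}, {"user.email", "user.name"}),
    "get_email_parts": ({"user.email"}, {"user.email_username", "user.domain_name"}),
    "get_match_score": (
        {"user.name", "user.email_username"},
        {"user.name_email_match_score"},
    ),
    "get_denylisted": ({"user.email"}, {"user.denylisted"}),
}


def plan(known: Set[str], wanted: Set[str]) -> List[str]:
    """Return the resolver names, in execution order, needed to produce `wanted`."""
    producers = {out: name for name, (_, outs) in RESOLVERS.items() for out in outs}
    order: List[str] = []
    visiting: Set[str] = set()

    def need(feature: str) -> None:
        if feature in known:
            return  # supplied as a query input
        if feature not in producers:
            raise KeyError(f"no resolver produces {feature}")
        name = producers[feature]
        if name in order:
            return  # already scheduled
        if name in visiting:
            raise ValueError(f"dependency cycle at {name}")
        visiting.add(name)
        for dep in sorted(RESOLVERS[name][0]):
            need(dep)  # schedule upstream resolvers first
        visiting.discard(name)
        order.append(name)

    for feature in sorted(wanted):
        need(feature)
    return order


print(plan({"user.id"}, {"user.name_email_match_score"}))
```

In this sketch only the resolvers on the path from the input to the requested output are scheduled; `get_denylisted` is skipped because nothing requested depends on it.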


In [4]:
result = client.query(
    input={User.id: 1},
    output=[
        User.transactions
    ],
    staleness={
        User.credit_report.percent_past_due: "0s",
    }
)

# get the transactions from the online query as a pandas dataframe
result.get_feature_value(User.transactions).to_pandas()

Unnamed: 0,transaction.name_memo_sim,transaction.at,transaction.completion,transaction.category,transaction.is_nsf,transaction.is_ach,transaction.id,transaction.amount,transaction.memo,transaction.clean_memo,transaction.user_id
0,0.111111,2024-09-01 08:10:29+00:00,"{""category"": ""transfer"", ""is_nsf"": false, ""cle...",transfer,False,True,172,150.00,ACH TRANSFER 5432109876,ACH TRANSFER,1
1,0.117647,2024-09-01 10:35:45+00:00,"{""category"": ""Food & Drink"", ""is_nsf"": false, ...",Food & Drink,False,False,173,-7.45,Starbucks 56789,Starbucks,1
2,0.142857,2024-09-01 13:45:33+00:00,"{""category"": ""shopping"", ""is_nsf"": false, ""cle...",shopping,False,False,174,-79.56,SQ - Target 07/01,Target,1
3,0.285714,2024-09-02 07:35:16+00:00,"{""category"": ""Food"", ""is_nsf"": false, ""clean_m...",Food,False,False,169,-45.78,SQ - Chipotle 9876,Chipotle,1
4,0.350000,2024-09-02 09:25:44+00:00,"{""category"": ""food"", ""is_nsf"": false, ""clean_m...",food,False,False,170,-12.89,Venmo Payment to Amy / Groceries,Venmo Payment to Amy / Groceries,1
...,...,...,...,...,...,...,...,...,...,...,...
153,0.117647,2024-10-15 08:15:22+00:00,"{""category"": ""Food and Drink"", ""is_nsf"": false...",Food and Drink,False,False,2,-12.45,Starbucks 57892,Starbucks,1
154,0.125000,2024-10-15 09:34:01+00:00,"{""category"": ""Shopping"", ""is_nsf"": false, ""cle...",Shopping,False,False,3,-64.50,SQ - WALMART SUPERCENTER #345,WALMART SUPERCENTER #345,1
155,0.214286,2024-10-15 11:48:16+00:00,"{""category"": ""Transportation"", ""is_nsf"": false...",Transportation,False,False,4,-32.78,Uber ride 08/15,Uber ride,1
156,0.111111,2024-10-15 14:22:45+00:00,"{""category"": ""ACH Transfer"", ""is_nsf"": false, ...",ACH Transfer,False,True,5,150.00,ACH TRANSFER 1234567890,ACH TRANSFER,1


## 3. Running Offline Queries

Chalk offline queries return `Dataset`s, which are lazy references to Parquet files in cloud storage. You can pull a dataset onto your local machine by calling `ds.to_pandas()` or `ds.to_polars()`.


In [5]:
from chalk.client import Dataset, ResourceRequests

# Chalk Offline
dataset: Dataset = client.offline_query(
    output=[
        User.name,
        User.denylisted,
        User.name_email_match_score,
        User.domain_name,
        User.email_username,
        User.total_spend,
        User.credit_report.percent_past_due,
    ],
    recompute_features=[User.credit_report.percent_past_due],
    
    ## Offline queries support horizontal and vertical scaling: they can be sharded and configured to run on their own
    ## separate pods with a specified amount of memory and CPU.
    # run_asynchronously=True,
    # resources=ResourceRequests(cpu=4, memory="16Gi"),
    # num_shards=1,
    
    max_samples=30,
    dataset_name="user_features"
)
dataset

Dataset(name='user_features', version='6')

In [6]:
df = dataset.to_pandas()
df

chalk: The `DatasetRevision` is still being computed. to_pandas() will execute once computation is complete.
Offline Query completed 0:01:16

Unnamed: 0,user.name,user.denylisted,user.name_email_match_score,user.domain_name,user.email_username,user.total_spend,user.credit_report.id,user.id,user.credit_report.percent_past_due
0,Emily Carter,False,75.0,gmail.com,emily_carter21,,123,4,0.0
1,Nicole Mann,False,75.0,nasa.gov,nicoleam,,123,1,0.0


### Reading an Existing Dataset

`client.offline_query` takes a `dataset_name` parameter. If it is supplied, the dataset can be read in a future session, or by another user, through its name. It is also searchable in the Chalk dashboard by that name.

In [4]:
dataset = client.get_dataset(dataset_name="user_features")
dataset

Dataset(name='user_features', version='6')

## 4. Creating New Features

### Alias Expressions

[Chalk Expressions](https://docs.chalk.ai/docs/underscore) can be used to create and alias dynamic features. This is done by requesting an aliased expression as the output for a query:

In [5]:
from chalk.features import _
result = client.query(
    input={User.id: 1},
    output=[
        User.transactions,
        (_.transactions[_.amount].mean()).alias("average_transaction_amt")
    ],
)

# get the t
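
As a rough intuition for the aliased expression, `_.transactions[_.amount].mean()` projects the `amount` column of a user's transactions and averages it. The plain-Python analogue below is only an illustration, not how Chalk evaluates expressions. It uses three amounts copied from the transactions output shown earlier, and `sample_amounts` is a name made up for this sketch.

```python
from statistics import mean

# Three amounts copied from the transactions output for user 1 (not the full list).
sample_amounts = [150.00, -7.45, -79.56]

# Analogue of `_.transactions[_.amount].mean()`:
# select the amount of each transaction, then take the arithmetic mean.
average_transaction_amt = mean(sample_amounts)

print(round(average_transaction_amt, 2))
```

Because this averages only three sample rows, the number it prints will not match what the query returns for the user's full transaction history.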

### New Feature Classes

New feature classes can be defined in notebook cells. [Chalk feature classes](https://docs.chalk.ai/docs/features) resemble pydantic base models or dataclasses: they contain features declared with Python type annotations.

In [6]:
from chalk.features import features

@features
class Brand:
    id: int
    name: str
    industry: str
    url: str
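
For comparison, this is what the same shape looks like as a standard-library dataclass. It is only an analogy for readers who know dataclasses: `BrandRecord` is a hypothetical name, it is not a Chalk feature class, and it has none of Chalk's resolver or query behavior. The sample values are the ones returned for brand 1 later in this notebook.

```python
from dataclasses import dataclass, fields


@dataclass
class BrandRecord:  # hypothetical plain dataclass, NOT a Chalk feature class
    id: int
    name: str
    industry: str
    url: str


nike = BrandRecord(id=1, name="Nike", industry="Sportswear", url="https://www.nike.com")

# Like a feature class, the fields are declared purely through type annotations.
print([f.name for f in fields(BrandRecord)])
```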

### New SQL Resolvers

SQL resolvers can be written for these new feature classes to pull data from data sources which have been linked in your Chalk dashboard. For example, the resolver below pulls features from a brands table in the Postgres datasource (named pg), which was connected in the Chalk dashboard.

In [11]:
%%sql_resolver get_brands
-- resolves: Brand
-- source: postgres
SELECT id, name, industry, url FROM brands

In [12]:
# Let's query our new feature!

client.query(
    input={Brand.id: 1},
    output=[Brand.name, Brand.industry, Brand.url]
) 

Unnamed: 0,Feature,Value
0,name,Nike
1,industry,Sportswear
2,url,https://www.nike.com


### Python Resolvers

Feature classes can be updated, and you can write Python resolvers to derive features from the base features you're pulling from Postgres. Below, we redefine `Brand` with a new `site_html` feature and write a Python resolver that fetches the HTML for a brand's site.

In [13]:
# Update features and create derived features

@features
class Brand:
    id: int
    name: str
    industry: str
    url: str
    site_html: str | None


# create a python online resolver for the new feature

from chalk import online
import requests

@online
def get_brand_site(url: Brand.url) -> Brand.site_html:
    result = requests.get(url)
    if result.status_code == 200:
        return result.content
    else:
        return None

In [14]:
# Test Resolvers as Python Functions

test_result = get_brand_site('https://www.nike.com')
print(test_result[:100], "...")

b'<!DOCTYPE html><html lang="en-US"><head><meta charSet="utf-8"/><meta name="viewport" content="width=' ...
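
Note that the printed value starts with `b'...'`: the resolver returns `result.content`, which is `bytes`, while `site_html` is annotated as `str | None`. With `requests`, `result.text` gives the decoded string instead. The sketch below shows one way the resolver body could be factored so that it returns text and can be tested without network access. The `fetch_site_html` helper, the `HttpGet` signature, and `fake_get` are all hypothetical names for this illustration, not part of Chalk or `requests`.

```python
from typing import Callable, Optional, Tuple

# Hypothetical transport signature: url -> (status_code, raw_body_bytes).
HttpGet = Callable[[str], Tuple[int, bytes]]


def fetch_site_html(url: str, http_get: HttpGet) -> Optional[str]:
    """Return the page body decoded as text, or None on a non-200 response."""
    status_code, body = http_get(url)
    if status_code != 200:
        return None
    # Decode explicitly so the value matches a `str | None` annotation.
    return body.decode("utf-8", errors="replace")


def fake_get(url: str) -> Tuple[int, bytes]:
    # Stand-in for a real HTTP call so the example runs offline.
    if url == "https://www.nike.com":
        return 200, b'<!DOCTYPE html><html lang="en-US">'
    return 404, b""


print(fetch_site_html("https://www.nike.com", fake_get))
print(fetch_site_html("https://example.invalid", fake_get))
```

Injecting the transport keeps the decision logic (status check, decoding) testable as an ordinary Python function, in the same spirit as the test cell above.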


In [16]:
# Run an online query for new features
client.query(
    input={
        Brand.id: 1,
    },
    output=[Brand.site_html]
)


Unnamed: 0,Feature,Value
0,site_html,"<!DOCTYPE html><html lang=""en-US""><head><meta ..."


### Adding New Features to Existing Feature Classes

Notebook-defined features can be added to existing feature classes. For example, a join between Transaction and Brand can be specified based on the transaction's `clean_memo` and the brand's `name`.
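
The Chalk syntax for this join is not shown in this notebook, so the following is only a plain-Python sketch of the matching logic it describes: look up a brand whose name equals the transaction's `clean_memo`. The `clean_memo` values and the Nike row are copied from outputs above; the Starbucks brand row and the helper name `brand_for` are made up for illustration.

```python
from typing import Dict, List, Optional

# Brand rows keyed by name. Only the Nike row appears in this notebook's output;
# the Starbucks row is a made-up example.
brands_by_name: Dict[str, dict] = {
    "Nike": {"id": 1, "industry": "Sportswear"},
    "Starbucks": {"id": 2, "industry": "Food & Drink"},  # hypothetical row
}

# Transaction ids and clean_memo values copied from the transactions output.
transactions: List[dict] = [
    {"id": 173, "clean_memo": "Starbucks"},
    {"id": 174, "clean_memo": "Target"},
]


def brand_for(txn: dict) -> Optional[dict]:
    """Exact-match join: a transaction's brand is the one named like its clean_memo."""
    return brands_by_name.get(txn["clean_memo"])


joined = {t["id"]: brand_for(t) for t in transactions}
print(joined)
```

An exact-match join like this leaves a transaction without a brand whenever the memo and brand name differ, as with transaction 174 here.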

### New DataFrame Taking Resolvers

In [3]:
from chalk import online
from chalk.features import feature, _, DataFrame

User.max_rolling_sum_2days = feature(typ=float)

@online
def get_max_transaction_rolling_2(transactions: User.transactions[_.amount, _.at]) -> User.max_rolling_sum_2days:
    df = transactions.to_pandas()
    return df.sort_values(by="transaction.at")["transaction.amount"].rolling(2).sum().max()

In [4]:
# Test new DataFrame Taking resolver:

from datetime import datetime, timedelta

test_df = DataFrame(
    [
        Transaction(id="1", amount=10, at=datetime.now()),
        Transaction(id="2", amount=1000, at=datetime.now() - timedelta(days=2)),
        Transaction(id="3", amount=26, at=datetime.now() - timedelta(days=5)),
        Transaction(id="4", amount=3000, at=datetime.now() - timedelta(days=3)),
        Transaction(id="5", amount=1000, at=datetime.now() - timedelta(days=1)),
    ]
)


assert get_max_transaction_rolling_2(test_df) == 4000
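
To see why the assertion expects 4000, the arithmetic can be reproduced without pandas or Chalk. In pandas, an integer argument to `rolling` means a window of that many rows, so `rolling(2)` sums each pair of adjacent transactions after sorting by time, regardless of how many days separate them. The sketch below is a plain-Python restatement of that calculation; the fixed `now` value is an assumption made so the example is reproducible.

```python
from datetime import datetime, timedelta

now = datetime(2024, 10, 15, 12, 0, 0)  # fixed timestamp, chosen arbitrarily

# (amount, at) pairs mirroring the test DataFrame.
rows = [
    (10, now),
    (1000, now - timedelta(days=2)),
    (26, now - timedelta(days=5)),
    (3000, now - timedelta(days=3)),
    (1000, now - timedelta(days=1)),
]

# Sort by timestamp, then sum each pair of adjacent rows (a window of 2 rows).
amounts = [amount for amount, _at in sorted(rows, key=lambda row: row[1])]
pair_sums = [a + b for a, b in zip(amounts, amounts[1:])]

print(amounts)
print(pair_sums)
print(max(pair_sums))
```

The largest pair is 3000 + 1000 (the transactions from three and two days ago), which gives 4000.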

In [5]:
# Run an online query for the new rolling-sum feature

client.query(
    input={
        User.id: 1
    },
    output=[User.max_rolling_sum_2days]
)

Unnamed: 0,Feature,Value
0,max_rolling_sum_2days,981.33
