Permalink
Branch: master
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
356 lines (222 sloc) 11.4 KB

Expressions

Expression provides a powerful, simple query language for data manipulation, similar to SQL. You can create/replace columns or drop rows, which can use primitives or existing columns in your dataset, including cross-table filters and aggregations.

What can you do with Expressions?

How to use Expressions

Expression is a transform. Pass a single triplequoted string directive to it. Datamode will parse and execute your directive on the dataset.

Expression(''' your_expression_here '''),

Expression Syntax

Full expression:
    <assignment> | <droprows>
    [where ...]
    [join ...]

assignment:
    <tablecolumn> = <formula> [default <primitive>]

droprows:
    drop rows from <[table.]column>

formula:
    <arbitrary nested infix PEDMAS formula including aggregation functions>

where:
    where <tablecolumn> <comparison-operator> (<tablecolumn> | <primitive>)

join:
    join <tablecolumn> <comparison-operator> <tablecolumn>

tablecolumn:
    [tablename.]columnname

aggregations:
    funcname(<formula>)

primitive:
    integer, bool, string, float

Note

Some things to keep in mind:

  • String literals must be single-quoted.
  • Both the table name and column name can separately be optionally double-quoted to allow spaces, e.g. "my table".mycolumn.
  • All keywords are case insensitive (e.g. aND, WHERE, is null, True, fAlSe, DEFAULT etc).
  • You don't strictly need triplequoted strings but it makes writing longer queries easier.
  • For now, aggregation functions can only take a single column, nothing else (and only works with joins)
  • Don't forget the comma at the end, since Expression is just another Transform.

Actions

Assigning columns

You can create or overwrite a single named column by assigning an arbitrary formula to it, which follows standard PEDMAS order. The formula can have any number of nested math expressions, including functions like sum, average and count for cross-table aggregations. Example:

Expression(''' mytable.mycolumn = foo / bar + (baz * 3) ** 2 '''),

One restriction is that you can't create an infix formula within an aggregation function. This is for performance and complexity reasons. However, if you need this logic, just use two Expressions to create an interim column. That will also make your logic and data easier to debug.

Dropping rows

You can use the single drop rows from <table> directive. You'll probably want to specify a where clause though, or all rows will be dropped. If you use a join, only rows in the specific drop table will be dropped. Example:

Expression(''' drop rows from users
               where user_age < 0 '''),

Filtering with where clauses

This is similar to SQL - you can compare a column to a primitive, two columns in the same table, or two columns cross-table. Cross-table columns require an explicit join. Don't forget you need to use = for assignment, == (or another operator) for comparison.

Examples:

Joining

Datamode currently only supports a single inner join per expression. You can join any two tables on any one column from each. The columns don't need to have the same name. If you need to join more than one table, you should use multiple Expressions. That will help you separate and debug your logic as well.

Examples:

You can also use this as a lookup engine - see the example below on zipcodes.

Table/column handling

For source columns

You can always specify the table, but you don't need to. If you don't, Datamode will lookup the table from the column. If it doesn't exist, or there are multiple columns with the same name, Datamode will raise an error.

For destination columns

Again, you can always specify the table. Just like source columns, Datamode will attempt to lookup the column. If it doesn't exist, it will be created in the table used by the source columns. If the table doesn't exist, it will be created as well.

Examples

Simple math operations

# Simple division
Expression(''' tax = price / 10 '''),

# Exponents and square roots
Expression(''' price_sqrt = price ** 0.5 '''),

# Multicolumn operation
Expression(''' roi = revenue / budget '''),

Nested math operations with aggregation functions

You can create an arbitrarily nested expression, but it's often better to break up your Expression into multiple Expressions, which helps with debugging.

# Arbitrarily complex PEDMAS order
Expression(''' newcol = (foo ** bar + baz / (baz * (3 - 2)) ) ** 2 '''),

Create multiple columns

Note the comma separating the clauses. The actions share any where or join clauses.

# Two assignments, with a cross-column where clause.
Expression(''' column1 = foo + bar, column2 = bar + baz
               where foo > bar '''),

Sum and count total orders per user

Expression can aggregate columns across multiple tables, just like SQL. In this case we're summing and counting the orders for each user, joining the orders and users table. You have to source multiple tables in your transforms, however, e.g. with multiple calls to SourceFile, SourceSql, etc.

Expression(''' users.orders_total = sum(orders.total),
               users.orders_count = count(orders.total)
               join users.user_id == orders.user_id
           '''),

Time between signup and first order

We want to find out how long it took the user to make their first order. To do that, we can get the first order and subtract the user's signupDate. If any row data is null, the result for that row will be null too.

Expression(''' users.time_to_first_order = first(orders.date) - users.signupDate
               join users.user_id == orders.user_id '''),

Lookup a value from another table (one to many)

Say you've collected zipcodes for each user. You have another table with income for that zipcode. You can use Expression to do a lookup on the zipcodes table and store the result in the users table.

Expression(''' users.zipcode_income = zipcodes.income
               join users.zipcode == zipcodes.zipcode '''),

Drop rows

You can remove invalid rows, or really any rows that meet the where expression. You can also join a table to use a cross-table where clause, but rows will only be dropped from the named drop rows from <table> table.

# Drop obviously invalid rows
Expression(''' drop rows from users
               where users.age <= 0 '''),

# Drop rows with a join
# Will only drop rows from users, not zipcodes
Expression(''' drop rows from users
               where zipcode.income < 10000
               join users.zipcode == zipcodes.zipcode '''),

Use spaces or other chars in column or table names

Double quotes are only necessary around table or column names that have spaces or dots.

Expression(''' "my new column" = "my table"."my old column" * 2 '''),

Create boolean columns from existing columns

With this query, any project valued over 100 will get high priority = true, false otherwise. This can be especially helpful when trying to create dependent variables.

Expression(''' projects.high_priority = true default false
               where projects.value > 100 '''),

Simple string manipulation

You can use + for simple string concatenation. It won't work with non-strings yet though.

Expression(''' address = street_line + ', ' + city + ', ' + state + ' ' + zipcode '''),

Note

For better readability with multiple lines, use Python's triplequoted strings in your expression:

Expression(''' mycolumn1 = mycolumn2
               where mycolumn3 > 1 '''),

What's not supported yet

  • Date to string comparisons (coming soon)
  • Sorting, including for aggregations. (coming soon)
  • Groupbys, although every join has an implicit groupby on the joined id. (coming soon)
  • Operating on strings and non-strings in conjunction
  • OR-conjunctions in where clauses
  • Python-style string or date formatting
  • Arbitrary formulas inside aggregation functions
  • Scalar functions like logarithms, etc.
  • Intervals like week, date, day, hour, minute etc
  • Time windows

If you'd like any of the above to exist, please email us at feedback@datamode.com.