# Using the data algebra for Statistics and Data Science

## Introduction

This is an intermediate level example of using the [data algebra](https://github.com/WinVector/data_algebra) to translate a statistical method into an implementation that works both with Pandas and in databases.

This example turns out to be fairly non-trivial, as it involves:

  * Aggregating data.
  * Combining the aggregations through a "join results" strategy.
  * Re-shaping records to move from multi row records to single row records.

This many steps can be daunting. The guiding principle is: decompose into sub-problems, solve those, and then compose the pieces into a working solution. The data algebra supports this strategy, as it is optimized for building data processing pipelines from smaller pieces, and for re-use and testing.

### The problem

We will demonstrate using the data algebra to solve a statistical problem: computing a difference in means in terms of a non-pooled standard deviation. What makes it a challenge is: we want to do this quickly per-group for possibly very many groups.

First we set up some example data in Python using Numpy and Pandas.

In [1]:
# import packages
import numpy.random
import pandas

from data_algebra.data_ops import *
from data_algebra.cdata import *
import data_algebra.BigQuery
import data_algebra.test_util

In [2]:
# build synthetic example data

# seed the pseudo-random generator for repeatability
numpy.random.seed(1999)

# choose our simulated number of observations
n_obs = 1000

d = pandas.DataFrame({
    'group': numpy.random.choice(['a', 'b', 'c'], size=n_obs, replace=True),
    'value': numpy.random.normal(0, 1, size=n_obs),
    'sensor': numpy.random.choice(['s1', 's2'], size=n_obs, replace=True),
})
# make the b group have an actual difference in means of s1 versus s2
group_b_sensor_s2_rows = (d['group'] == 'b') & (d['sensor'] == 's2')
d.loc[group_b_sensor_s2_rows, 'value'] = d.loc[group_b_sensor_s2_rows, 'value']  + 0.5

d.head()

Unnamed: 0,group,value,sensor
0,a,0.051306,s2
1,a,0.700005,s2
2,a,-1.022481,s2
3,b,1.862029,s1
4,c,-1.173817,s2


The data is synthetic. It models taking measurements from different groups using two different sensors.

We want to see, for each group, whether the empirically observed means of the values recorded by sensor `s1` and sensor `s2` differ in an interesting way. By our construction of the synthetic data there is a significant difference in group `b`, and not in any other group.

This is essentially an [ANOVA](https://en.wikipedia.org/wiki/Analysis_of_variance) and [T-test](https://en.wikipedia.org/wiki/Student%27s_t-test) type of question, where we define interesting as the observed difference being rare under a null hypothesis such as the means and variance being shared between `s1` and `s2` sensors per group.


## The Statistical Package Approach

Before using the data algebra, let us do this the standard way: using a pre-packaged solution.

In [3]:
import scipy.stats

groups = list(set(d['group']))
groups.sort()
d_grouped = d.groupby('group')  # scalar key, so get_group(g) accepts a plain label

def f(g):
    d_sub = d_grouped.get_group(g)
    v_s1 = d_sub.loc[d_sub['sensor'] == 's1', 'value']
    v_s2 = d_sub.loc[d_sub['sensor'] == 's2', 'value']
    res_g = scipy.stats.ttest_ind(v_s1, v_s2)
    return pandas.DataFrame({
        'group': [g],
        't': [res_g.statistic],
        'significance': [res_g.pvalue],
    })

group_stats = [f(g) for g in groups]
group_stats = pandas.concat(group_stats).reset_index(inplace=False, drop=True)

group_stats

Unnamed: 0,group,t,significance
0,a,-1.139708,0.255263
1,b,-3.452261,0.00063
2,c,0.467327,0.640554


For our example let's pursue this calculation by hand. Our quantity of interest is going to be: for each group we want to estimate `t = ((s1 estimate) - (s2 estimate)) / (var(s1 estimate) + var(s2 estimate)).sqrt()`, where `(si estimate) = mean(si)`. This estimate uses the fact that, for independent processes, variances are additive.

When `|t|` is large (say 2 or 3), we consider the observed difference to be unlikely under the null hypothesis that the true means or expected values of the `s1` and `s2` sensors are identical per group. The reasoning is that under the null hypothesis (and under fairly mild additional conditions and with enough data), `t` will be nearly [Student-t distributed](https://en.wikipedia.org/wiki/Student%27s_t-distribution), where absolute values as large as 2 or 3 are somewhat rare.
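
The formula above can be sketched in plain Python for a single pair of samples. This is a minimal illustration using only the standard library. The function name `non_pooled_t` and the toy samples are assumptions made for this sketch and are not part of the data algebra.

```python
import math
import statistics


def non_pooled_t(values_s1, values_s2):
    """Difference in means scaled by the non-pooled standard error."""
    # variance of each mean estimate: sample variance divided by sample size
    est_var_s1 = statistics.variance(values_s1) / len(values_s1)
    est_var_s2 = statistics.variance(values_s2) / len(values_s2)
    mean_diff = statistics.mean(values_s1) - statistics.mean(values_s2)
    # for independent estimates the variances add
    return mean_diff / math.sqrt(est_var_s1 + est_var_s2)


# toy samples (hypothetical values, for illustration only)
t_example = non_pooled_t([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

The rest of this note computes the same quantity, but for every group at once and in a form that can also run in a database.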

So let's estimate `t` using the data algebra.

## The Solution

The strategy is to break the calculation down into smaller solvable steps. The data algebra is essentially a coding of [Codd's relational algebra](https://en.wikipedia.org/wiki/Relational_algebra) in Python. This is just the thesis that if one learns a few primary data transforms, then many data processing tasks can be effectively written in terms of these operations. The operations are typically:


   * Adding a column as a function of other columns. That is, for each row, values are combined to create a new value. This is often called an extension.
   * Computing an aggregation such as mean, max in one column controlled by a specification of which rows are to be grouped together. This is typically called projection if we want exactly one row result per group or a "window function" if we want one result row per input row.
   * Joining rows from two data frames that match on particular key columns. This is a powerful method of mapped lookup and cross-product formation.
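
For readers who know Pandas, the operations listed above can be sketched directly on a small data frame. This is only an illustration in plain Pandas, not the data algebra notation. The frame `toy` and its column names are made up for this sketch.

```python
import pandas

toy = pandas.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1.0, 3.0, 10.0, 20.0],
    'y': [2.0, 2.0, 5.0, 5.0],
})

# extension: a new column computed row by row from existing columns
extended = toy.assign(ratio=toy['x'] / toy['y'])

# projection: one result row per group
projected = toy.groupby('group', as_index=False).agg(mean_x=('x', 'mean'))

# window function: a per-group aggregate written back onto every input row
windowed = toy.assign(group_mean_x=toy.groupby('group')['x'].transform('mean'))

# join: look up the per-group value for each original row by key column
joined = toy.merge(projected, on='group', how='left')
```

Note that the window function and the project-then-join route give each input row the same per-group value. This is why a "join results" strategy can stand in for a window function.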

In terms of these operators we want new columns such as `mean(s1 values)`, `mean(s2 values)`, and so on, to be calculated per group. The organizing idea is:

> Imagine some columns such that if these columns were already in your data frame, then the calculation would be easy to finish. Then add these columns to your data frame.

Let's do that using the data algebra.

### Adding Some Columns

First we specify the operations we want to perform. The wish we are trying to satisfy is: "the calculation would be much easier if we already knew the per sensor and group variances, means, and sample sizes". This can be written as a "project" operation grouped by our grouping columns `group` and `sensor`.

Our definition of operators is as follows. We start with `descr()`, which builds a description of the column structure of our data frame `d`. We then call `.project()` on this object to specify the new columns (`group_sensor_var`, `group_sensor_mean`, and `group_sensor_n`) we want produced. The `group_by` argument specifies which set of rows go into each calculation. We then add an `.extend()` step that combines these columns to get the variance of each mean estimate. Notice this later step doesn't need a partition to be specified, as we can safely calculate per row. The rule is: each step uses only values that are available before the step.

This is easiest to see in action.

In [4]:
# define our operators
td = descr(d=d)
ops_var = (
    td
        .project(
            {
                'group_sensor_var': 'value.var()',  # estimate variance of items
                'group_sensor_mean': 'value.mean()',  # estimate mean of items
                'group_sensor_n': '(1).sum()',   # sample sizes
            },
            group_by=['group', 'sensor'])
        .extend(  # get the variance of the mean estimate
            {'group_sensor_est_var': 'group_sensor_var / group_sensor_n'})
    )

The operations we used are:


  * `extend()`: add new columns to current rows. This can work either without a partition, where each calculation is performed among values in each row, or with a partition, where values are aggregated across groups of rows but still written into the original rows. In an `extend()` the calculation is specified as a dictionary mapping new column names to the quoted expressions that calculate the values. The expression grammar is similar to Python/Numpy/Pandas, with a good number of methods available. Each extend can only refer to values that already exist. This is to prevent confusion as to which values are in which columns during calculation.
  * `project()` is a grouped operation where each group of rows is replaced by a single row. The grouping columns are copied into the new row, so they don't have to be specified. Any other columns must be created by calculations.


These types of operators will be familiar to [dplyr](https://CRAN.R-project.org/package=dplyr) users. However, they originally come from Codd, and this style of emphasis on composition was prototyped in [rqdatatable](https://CRAN.R-project.org/package=rqdatatable). A complete method list can be found [here](https://github.com/WinVector/data_algebra/blob/main/Examples/Methods/op_catalog.csv). A general introduction is found [here](https://github.com/WinVector/data_algebra).

Our working values look like the following if we apply the operations we have up to now.

In [5]:
# apply our operators to our data frame d
ops_var.transform(d)


Unnamed: 0,group,sensor,group_sensor_var,group_sensor_mean,group_sensor_n,group_sensor_est_var
0,a,s1,0.966323,-0.103881,134,0.007211
1,a,s2,0.86128,0.018839,187,0.004606
2,b,s1,0.890383,0.097989,161,0.00553
3,b,s2,0.992986,0.470291,163,0.006092
4,c,s1,0.977829,0.069197,166,0.005891
5,c,s2,1.176182,0.017453,189,0.006223


The `transform()` implementation is modular and is intended to eventually support other realizations of Pandas style APIs. Prospects include Dask, datatable, and RAPIDS, but we have not started development on these adapters. To support this prospect there is only one explicit reference to Pandas in the package, and that reference can be overridden by the user.

## Combining Rows to Get Results

We can now see the sensors seem to differ more in group `b` than in the other groups. Let's finish the calculation and quantify this. We now have the issue of needing values from specific pairs of rows. The simplest way to work around this is to get all the values we want into a single row and then work forward.

What we mean is we wish to take a record that looks like the following.

In [6]:
a = pandas.DataFrame({
    'group': ['a', 'a'],
    'sensor': ['s1', 's2'],
    'group_sensor_mean': [-0.103881, 0.018839],
    'group_sensor_est_var': [0.007211, 0.004606],
})

a

Unnamed: 0,group,sensor,group_sensor_mean,group_sensor_est_var
0,a,s1,-0.103881,0.007211
1,a,s2,0.018839,0.004606


And transform it into a single row such as the following.

In [7]:
b = pandas.DataFrame({
    'group': ['a'],
    'group_sensor_mean_s1': [-0.103881],
    'group_sensor_mean_s2': [0.018839],
    'group_sensor_est_var_s1': [0.007211],
    'group_sensor_est_var_s2': [0.004606],
})

b

Unnamed: 0,group,group_sensor_mean_s1,group_sensor_mean_s2,group_sensor_est_var_s1,group_sensor_est_var_s2
0,a,-0.103881,0.018839,0.007211,0.004606


### Coordinatized Data

There are a number of ways to do this. Our preferred method is to use the [coordinatized data methodology](https://github.com/WinVector/data_algebra/blob/main/Examples/cdata/cdata_general_example.ipynb).

In general the methodology works by specifying examples of the incoming and outgoing records, though it does have convenience methods for common tasks such as melting, pivoting, and un-pivoting.

What we do is write down the incoming and outgoing record shapes.

In [8]:
record_in = pandas.DataFrame({
    'sensor': ['s1', 's2'],
    'group_sensor_mean': ['group_sensor_mean_s1', 'group_sensor_mean_s2'],
    'group_sensor_est_var': ['group_sensor_est_var_s1', 'group_sensor_est_var_s2'],
})

record_in


Unnamed: 0,sensor,group_sensor_mean,group_sensor_est_var
0,s1,group_sensor_mean_s1,group_sensor_est_var_s1
1,s2,group_sensor_mean_s2,group_sensor_est_var_s2


Notice these are just the example per-group records with specific values replaced by labels. These examples then specify the transform. The convention is: single row records don't need to be specified, and we draw out how multi row records work as follows.

In [9]:
record_map = RecordMap(
    blocks_in=RecordSpecification(
        control_table=record_in,
        record_keys=['group'],
    ),
)

We confirm that the transform essentially takes `a` to `b`.

In [10]:
a_transform = record_map.transform(a)

a_transform

Unnamed: 0,group,group_sensor_mean_s1,group_sensor_est_var_s1,group_sensor_mean_s2,group_sensor_est_var_s2
0,a,-0.103881,0.007211,0.018839,0.004606


In [11]:
assert data_algebra.test_util.equivalent_frames(a_transform, b)
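
For comparison, a similar reshaping of the two-row record can be sketched with a plain Pandas pivot. This is an alternative shown only for illustration and is not how the record map is implemented. The variable names `a_rows` and `wide` are made up for this sketch.

```python
import pandas

# the two-row record for group 'a', as in the example above
a_rows = pandas.DataFrame({
    'group': ['a', 'a'],
    'sensor': ['s1', 's2'],
    'group_sensor_mean': [-0.103881, 0.018839],
    'group_sensor_est_var': [0.007211, 0.004606],
})

# pivot moves each sensor's values into its own columns, one row per group
wide = a_rows.pivot(
    index='group',
    columns='sensor',
    values=['group_sensor_mean', 'group_sensor_est_var'],
)
# flatten the (measure, sensor) column pairs into names such as group_sensor_mean_s1
wide.columns = [f'{measure}_{sensor}' for measure, sensor in wide.columns]
wide = wide.reset_index()
```

The pivot works only on in-memory data frames. The record map describes the same change of shape as a step that can be placed in a data algebra pipeline.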

We then add this step to our growing operator pipeline.

In [12]:
ops = (
    ops_var
        .convert_records(record_map)
)

ops.transform(d)

Unnamed: 0,group,group_sensor_mean_s1,group_sensor_est_var_s1,group_sensor_mean_s2,group_sensor_est_var_s2
0,a,-0.103881,0.007211,0.018839,0.004606
1,b,0.097989,0.00553,0.470291,0.006092
2,c,0.069197,0.005891,0.017453,0.006223


Notice that instead of applying new operations to our data frame, we append new operations onto our existing operations pipeline `ops_var` to get the longer pipeline `ops`. This is the core of the data algebra: operating on pipelines to produce larger re-usable pipelines.

We can now finish our task as a calculation using the available columns.

## Putting it all Together

In [13]:
ops = (
    ops
        .extend({'mean_diff': 'group_sensor_mean_s1 - group_sensor_mean_s2'})
        .extend({'t': 'mean_diff / (group_sensor_est_var_s1 + group_sensor_est_var_s2).sqrt()'})
        .drop_columns(['group_sensor_mean_s1', 'group_sensor_est_var_s1',
                       'group_sensor_mean_s2', 'group_sensor_est_var_s2'])
        .order_rows(['group'])
)

pandas_res = ops.transform(d)

pandas_res

Unnamed: 0,group,mean_diff,t
0,a,-0.12272,-1.12891
1,b,-0.372302,-3.453425
2,c,0.051744,0.470131


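This result is close to the statistics package result. The small differences are expected: `scipy.stats.ttest_ind()` pools the variance across the two samples by default, while our calculation uses the non-pooled estimate.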


In [14]:
assert data_algebra.test_util.equivalent_frames(
    pandas_res.loc[:, ['group', 't']],
    group_stats.loc[:, ['group', 't']],
    float_tol=0.01)
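
One way to check a non-pooled calculation against SciPy is to pass `equal_var=False` to `scipy.stats.ttest_ind()`, which requests the non-pooled (Welch) form of the test. The sketch below uses freshly simulated samples rather than the notebook data. The seed, sample sizes, and distribution parameters are arbitrary choices made for this illustration.

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(2024)  # arbitrary seed chosen for this sketch
v_s1 = rng.normal(0.0, 1.0, size=150)
v_s2 = rng.normal(0.5, 1.5, size=200)

# non-pooled statistic: variances of the two mean estimates are added
est_var = v_s1.var(ddof=1) / len(v_s1) + v_s2.var(ddof=1) / len(v_s2)
t_by_hand = (v_s1.mean() - v_s2.mean()) / numpy.sqrt(est_var)

# equal_var=False requests the non-pooled (Welch) test
t_welch = scipy.stats.ttest_ind(v_s1, v_s2, equal_var=False).statistic
# the default equal_var=True pools the two sample variances
t_pooled = scipy.stats.ttest_ind(v_s1, v_s2).statistic
```

Here `t_by_hand` and `t_welch` should agree to floating point precision, while `t_pooled` differs because the two samples have different sizes and spreads.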

The new operations we showed here are:

  * `drop_columns()`: remove columns. One could instead use `select_columns()` to specify which columns are retained.
  * `order_rows()`, which re-orders rows. Because the data algebra is designed to work with SQL databases, ordering (unless used to limit rows) is only safe as a last step in a pipeline.

And we have our estimate: we reliably detect that the difference observed in group `b` would be unlikely if there were no per-sensor difference in means for this group. We can re-apply the `ops` transform to any additional data frames that have the expected columns.

## Databases

An additional benefit of the data algebra is: the same operations can be applied in a supported database (currently Google BigQuery, PostgreSQL, SQLite, Spark, and (partially) MySQL).

To do this we build a model of the database connection.

In [15]:
db_handle = data_algebra.BigQuery.example_handle()

We then, for purposes of illustration, insert our data into the database. In real applications the data is usually already in the database.

In [16]:
db_handle.insert_table(d, table_name='d', allow_overwrite=True)

(TableDescription(table_name="d", column_names=["group", "value", "sensor"]))

And now we can build a table that contains the result:

In [17]:
db_handle.execute(f"DROP TABLE IF EXISTS {db_handle.db_model.table_prefix}.res")
db_handle.execute(f"CREATE TABLE {db_handle.db_model.table_prefix}.res AS " + db_handle.to_sql(ops))

It is important to note, at this point all calculations are occurring in the database. No data is round tripping between the database and Python. With the right database, this can achieve a performance and scale that is beyond typical Python data frame tools and packages.
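
The "compute in the database, read back only the result" pattern can be sketched with Python's built-in `sqlite3` module. This is hand-written SQL against an in-memory SQLite database, chosen only so the example is self-contained. It is not SQL generated by the data algebra, and the tiny table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # in-memory database, for illustration only
conn.execute('CREATE TABLE d ("group" TEXT, sensor TEXT, value REAL)')
conn.executemany(
    'INSERT INTO d VALUES (?, ?, ?)',
    [('a', 's1', 1.0), ('a', 's1', 3.0), ('a', 's2', 2.0), ('a', 's2', 6.0)],
)

# the aggregation runs inside the database and lands in a new table;
# no rows are pulled into Python at this step
conn.execute('''
    CREATE TABLE res AS
    SELECT "group", sensor, AVG(value) AS group_sensor_mean, COUNT(*) AS group_sensor_n
    FROM d
    GROUP BY "group", sensor
''')

# only the small result table is read back
rows = conn.execute(
    'SELECT "group", sensor, group_sensor_mean, group_sensor_n FROM res ORDER BY sensor'
).fetchall()
conn.close()
```

With a large table `d`, only the per-group summary rows would cross from the database into Python.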

We can, of course, look at the result.

In [18]:
db_res = db_handle.read_query(f'SELECT * FROM {db_handle.db_model.table_prefix}.res')

db_res

Unnamed: 0,mean_diff,group,t
0,-0.12272,a,-1.12891
1,-0.372302,b,-3.453425
2,0.051744,c,0.470131


Notice the results match the in-memory calculations.

In [19]:
assert data_algebra.test_util.equivalent_frames(pandas_res, db_res, float_tol=1e-3)


## Conclusion

And that is how to translate a statistical calculation into a database using the data algebra. All one needs to start is a list of the allowed operations and expressions. We have links to such documentation [here](https://github.com/WinVector/data_algebra).

The data algebra is optimized to allow one to build up a data processing pipeline piece by piece. This allows one to concentrate on solving sub-problems one at a time. The data algebra emphasizes operators acting on each other through composition. Processing data is the delayed end application.

The resulting data algebra transform pipeline can be re-used on any number of data frames by the `.transform()` method and also used with different databases using the `.to_sql()` method. Data algebra pipelines can be saved using standard Python pickling procedures.


## Appendices

### The Entire Pipeline

In [20]:
print(ops)


(
    TableDescription(table_name="d", column_names=["group", "value", "sensor"])
    .project(
        {
            "group_sensor_var": "value.var()",
            "group_sensor_mean": "value.mean()",
            "group_sensor_n": "(1).sum()",
        },
        group_by=["group", "sensor"],
    )
    .extend({"group_sensor_est_var": "group_sensor_var / group_sensor_n"})
    .convert_records(
        data_algebra.cdata.RecordMap(
            blocks_in=data_algebra.cdata.RecordSpecification(
                record_keys=["group"],
                control_table=pd.DataFrame(
                    {
                        "sensor": ["s1", "s2"],
                        "group_sensor_mean": [
                            "group_sensor_mean_s1",
                            "group_sensor_mean_s2",
                        ],
                        "group_sensor_est_var": [
                            "group_sensor_est_var_s1",
                            "group_sensor_est_var_s2",
            

### The Generated SQL

The generated SQL can be quite long, but remember we were able to build up our pipeline by composition. This allowed us to worry about each stage of operations separately.

But, let's take a look at the produced SQL anyway.

In [21]:
print(db_handle.to_sql(ops))

-- data_algebra SQL https://github.com/WinVector/data_algebra
--  dialect: BigQueryModel
--       string quote: "
--   identifier quote: `
WITH
 `project_0` AS (
  SELECT  -- .project({ 'group_sensor_var': 'value.var()', 'group_sensor_mean': 'value.mean()', 'group_sensor_n': '(1).sum()'}, group_by=['group', 'sensor'])
   VAR_SAMP(`value`) AS `group_sensor_var` ,
   AVG(`value`) AS `group_sensor_mean` ,
   SUM(1) AS `group_sensor_n` ,
   `group` ,
   `sensor`
  FROM
   `data-algebra-test.test_1.d`
  GROUP BY
   `group` ,
   `sensor`
 ) ,
 `extend_1` AS (
  SELECT  -- .extend({ 'group_sensor_est_var': 'group_sensor_var / group_sensor_n'})
   `group` ,
   `sensor` ,
   `group_sensor_var` ,
   `group_sensor_mean` ,
   `group_sensor_n` ,
   `group_sensor_var` / `group_sensor_n` AS `group_sensor_est_var`
  FROM
   `project_0`
 ) ,
 `convert_records_blocks_in_2` AS (
  -- convert records blocks in
  SELECT
     `group` AS `group`,
     MAX(CASE WHEN  ( CAST(`sensor` AS STRING) = "s1" )  THEN 

The data algebra emits different SQL for different SQL dialects. However, adapting to additional databases is a simple task.

## Clean Up

In [22]:
db_handle.drop_table("d")
db_handle.drop_table("res")
db_handle.close()  # clean up