This is a quick note showing how to compare two data frames using
the [data algebra](https://github.com/WinVector/data_algebra). The question is: which rows
are in one data frame and not the other?

First, let's set up our example.

In [1]:

# import packages
import string
import numpy
import numpy.random
import pandas

from data_algebra.data_ops import *
from data_algebra.cdata import *
import data_algebra.BigQuery
import data_algebra.SQLite



In [2]:
# build synthetic example data

# seed the pseudo-random generator for repeatability
numpy.random.seed(1999)

# choose our simulated number of observations
n_obs = 100
symbols = list(string.ascii_lowercase)

d1 = pandas.DataFrame({
    'group': numpy.random.choice(symbols, size=n_obs, replace=True),
})

d2 = pandas.DataFrame({
    'group': numpy.random.choice(symbols, size=n_obs, replace=True),
})

Our example question is: which rows are unique to `d1` and which are unique to `d2`?

Let's define our grouping columns and proceed.

In [3]:
# which columns we consider to be row keys
# can be more than one column
grouping_columns = ['group']

Our plan is simple: we count how many rows each table has
for a given key, and then join the results together for comparison.
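
As a rough illustration of this plan outside the data algebra, the counting and joining steps can be sketched with only the Python standard library. The toy lists `d1_keys` and `d2_keys` are made up for this sketch and are not the data used in this note.

```python
from collections import Counter

# hypothetical toy key columns (assumed for illustration, not this note's data)
d1_keys = ['a', 'a', 'b', 'w']
d2_keys = ['a', 'b', 'b', 'u']

# step 1: count rows per key in each table
d1_counts = Counter(d1_keys)
d2_counts = Counter(d2_keys)

# step 2: "full join" the counts on the key;
# a Counter reports 0 for a key it has never seen
all_keys = sorted(set(d1_counts) | set(d2_counts))
summary = [(k, d1_counts[k], d2_counts[k]) for k in all_keys]

# step 3: keep the keys where one side has no rows
unique_keys = [row for row in summary if row[1] == 0 or row[2] == 0]
print(unique_keys)
```

The data algebra pipeline in this note follows the same three steps, but expresses them as operations that can also be translated to SQL.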

The data algebra notations we will use include:

  * `descr()`. `descr(name=value)` is a notation that builds a description
     of the Pandas data frame "`value`" and refers to this table by the name "`name`".
  * `.project()`. Project is an aggregation that produces one row per distinct combination
     of values in the grouping columns. The grouping columns are copied into the result,
     and we can calculate additional columns, such as the row count `(1).sum()`.
  * `.natural_join()` joins two tables on the keys specified by "`by`".
  * `.extend()` allows us to calculate new columns. In this case we are using `coalesce()`
    to replace missing values produced by the join with zeros. The missing values are
    exactly the key combinations where one table has rows and the other does not.

In [4]:

summary_ops = (
    descr(d1=d1)
        .project(
            {'d1_count': '(1).sum()'},
            group_by=grouping_columns)
        .natural_join(
            b=descr(d2=d2)
                .project(
                    {'d2_count': '(1).sum()'},
                    group_by=grouping_columns),
            by=grouping_columns,
            jointype='full')
        .extend({
            'd1_count': 'd1_count.coalesce(0)',
            'd2_count': 'd2_count.coalesce(0)',
            })
)

Once we have our intended set of operations we can execute them against our tables by
supplying data for each named table using the `.eval()` method.

In [5]:
summary_table = summary_ops.eval({'d1': d1, 'd2': d2})

summary_table

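| | group | d1_count | d2_count |
|---|---|---|---|
| 0 | a | 4.0 | 4.0 |
| 1 | b | 2.0 | 2.0 |
| 2 | c | 2.0 | 4.0 |
| 3 | d | 3.0 | 4.0 |
| 4 | e | 8.0 | 1.0 |
| 5 | f | 1.0 | 2.0 |
| 6 | g | 5.0 | 5.0 |
| 7 | h | 5.0 | 4.0 |
| 8 | i | 4.0 | 3.0 |
| 9 | j | 3.0 | 7.0 |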


(Note: the data algebra can run the exact same command in many databases, translating it
automatically to SQL with the `.to_sql()` method.)

From the resulting summary it is easy to see which keys are unique to one table or the other.
We can zero in on these keys by selecting the rows where one of the counts is zero. New
commands we use in this example include:

  * `data()`. `data()` is a notation that captures a description of a Pandas data frame *and*
    a copy of the data. Notice that data described this way doesn't need a name: because
    we already hold the data, we don't need a name to look it up later.
  * `ex()`. `ex()` is a wrapper that takes a data algebra pipeline and executes it with the
    captured data.
  * `.select_rows()` picks rows matching the logical conditions we specify on the columns.
  * `.order_rows()` sorts the Pandas data frame by the values in the named columns.

In [6]:
ex(
    data(summary_table)
        .select_rows('(d1_count <= 0) | (d2_count <= 0)')
        .order_rows(grouping_columns)
)

| | group | d1_count | d2_count |
|---|---|---|---|
| 0 | u | 0.0 | 3.0 |
| 1 | w | 4.0 | 0.0 |


Notice throughout we used the variable `grouping_columns` instead of explicitly naming the
columns. That means this code is re-usable and could easily be converted into a utility function.
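
As one possible shape for such a utility function, here is a minimal sketch written in plain Pandas rather than the data algebra, so it can serve as an independent cross-check. The function name `rows_unique_to_one_table` and the toy frames are hypothetical, chosen only for this illustration.

```python
import pandas


def rows_unique_to_one_table(d1, d2, grouping_columns):
    """Return the key combinations that have rows in only one of d1, d2.

    Hypothetical helper (plain Pandas, not the data algebra pipeline).
    """
    # count rows per key combination in each table
    c1 = d1.groupby(grouping_columns).size().reset_index(name='d1_count')
    c2 = d2.groupby(grouping_columns).size().reset_index(name='d2_count')
    # full (outer) join on the keys; keys missing on one side get NaN counts
    summary = c1.merge(c2, on=grouping_columns, how='outer')
    # replace the missing counts with zero
    count_columns = ['d1_count', 'd2_count']
    summary[count_columns] = summary[count_columns].fillna(0).astype(int)
    # keep only the keys where one table has no rows
    unique = summary[(summary['d1_count'] <= 0) | (summary['d2_count'] <= 0)]
    return unique.sort_values(grouping_columns).reset_index(drop=True)


# toy data, assumed for illustration only
toy_d1 = pandas.DataFrame({'group': ['a', 'a', 'w']})
toy_d2 = pandas.DataFrame({'group': ['a', 'u', 'u']})
result = rows_unique_to_one_table(toy_d1, toy_d2, ['group'])
print(result)
```

Because the keys are passed as a list, the same function should work unchanged for multi-column keys.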

And that is it. We have worked through how to easily compare two Pandas data frames using the data algebra.

## Appendix: the same query works in BigQuery

For this demo we run the same operations
in the Google BigQuery database. Note: we didn't use
SQLite (as in
our [previous example](https://win-vector.com/2021/10/03/how-to-compare-two-tables-using-the-data-algebra/))
both for variety and because SQLite didn't support full
joins at the time of writing ([ref](https://www.sqlitetutorial.net/sqlite-full-outer-join/)). We have an example of
how to simulate a full join [here](https://github.com/WinVector/data_algebra/blob/main/Examples/GettingStarted/simulating_full_join.ipynb).


In [7]:
db_handle = data_algebra.BigQuery.example_handle()

# inserting just for the example, usually for databases the
# data is already in the database
db_handle.insert_table(d1, table_name='d1', allow_overwrite=True)
db_handle.insert_table(d2, table_name='d2', allow_overwrite=True)

all_ops = (
    summary_ops
        .select_rows('(d1_count <= 0) | (d2_count <= 0)')
        # move order to read-back request, as it isn't needed here
)

db_handle.drop_table('compare_result')
db_handle.execute(
    f'CREATE TABLE {db_handle.db_model.table_prefix}.compare_result AS {db_handle.to_sql(all_ops)}')

read_ops = (
    db_handle.describe_table('compare_result')
        .order_rows(grouping_columns)
)
db_handle.read_query(read_ops)

| | group | d1_count | d2_count |
|---|---|---|---|
| 0 | u | 0 | 3 |
| 1 | w | 4 | 0 |


In [8]:
# clean up
db_handle.close()

## SQLite example

In [None]:
sqlite_handle = data_algebra.SQLite.example_handle()

print(sqlite_handle.to_sql(summary_ops))

In [None]:
# clean up
sqlite_handle.close()
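
To give a feel for what simulating a full join can look like, here is a hand-written sketch using Python's built-in `sqlite3` module. It is not the SQL the data algebra generates; the table contents and the query are assumptions made for this illustration. The idea is to build the union of all keys first, then left join each per-table count onto it.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# toy tables, assumed for illustration only
conn.executescript("""
CREATE TABLE d1 ("group" TEXT);
CREATE TABLE d2 ("group" TEXT);
INSERT INTO d1 VALUES ('a'), ('a'), ('w');
INSERT INTO d2 VALUES ('a'), ('u'), ('u');
""")

# simulate a full join: collect every key seen in either table,
# then left join both count tables onto that key set
query = """
WITH c1 AS (SELECT "group", COUNT(1) AS d1_count FROM d1 GROUP BY "group"),
     c2 AS (SELECT "group", COUNT(1) AS d2_count FROM d2 GROUP BY "group"),
     all_keys AS (SELECT "group" FROM c1 UNION SELECT "group" FROM c2)
SELECT all_keys."group",
       COALESCE(c1.d1_count, 0) AS d1_count,
       COALESCE(c2.d2_count, 0) AS d2_count
FROM all_keys
LEFT JOIN c1 ON all_keys."group" = c1."group"
LEFT JOIN c2 ON all_keys."group" = c2."group"
ORDER BY all_keys."group"
"""
rows = conn.execute(query).fetchall()
conn.close()
print(rows)
```

Newer SQLite releases (3.39.0 and later) accept `FULL OUTER JOIN` directly, so a workaround of this kind is only needed on older versions.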