In [our last note](https://github.com/WinVector/data_algebra/blob/main/Examples/GettingStarted/comparing_two_dataframes.ipynb)
we mentioned that SQLite doesn't currently support full joins.

Let's take a look at that.

First we import our libraries.

In [1]:
import pandas

from data_algebra.data_ops import *
import data_algebra.SQLite
import data_algebra.test_util

We set up our example.

In [2]:
d1 = pandas.DataFrame({
    'g': ['a', 'a', 'b', 'b', 'b'],
    'v1': [1, None, 3, 4, None],
    'v2': [None, 1, None, 7, 8],
})

d2 = pandas.DataFrame({
    'g': ['c', 'b', 'b'],
    'v1': [None, 1, None],
    'v2': [1, None, 2],
})

sqlite_handle = data_algebra.SQLite.example_handle()
sqlite_handle.insert_table(d1, table_name='d1')
sqlite_handle.insert_table(d2, table_name='d2')

(TableDescription(table_name="d2", column_names=["g", "v1", "v2"]))

When we try a full join, we get an exception.

In [3]:
try:
    sqlite_handle.read_query(
        'SELECT * FROM d1 FULL JOIN d2 ON d1.g = d2.g')
except Exception as e:
    print('Caught: ' + str(e))

Caught: Execution failed on sql 'SELECT * FROM d1 FULL JOIN d2 ON d1.g = d2.g': RIGHT and FULL OUTER JOINs are not currently supported
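
Whether this fails depends on the SQLite library your Python is linked against. To our knowledge, SQLite releases from 3.39.0 onward accept `FULL JOIN`, while older ones raise the error above. Here is a minimal sketch that probes this using only the standard library `sqlite3` module. The helper name `try_full_join` is ours, purely for illustration; it is not part of the data algebra.

```python
import sqlite3


def try_full_join(conn):
    """Return ('ok', rows) if FULL JOIN runs, else ('error', message)."""
    try:
        rows = conn.execute(
            "SELECT * FROM d1 FULL JOIN d2 ON d1.g = d2.g").fetchall()
        return "ok", rows
    except sqlite3.OperationalError as e:
        # older SQLite libraries reject the FULL JOIN at parse time
        return "error", str(e)


# rebuild the two example tables in a throw-away in-memory database
conn = sqlite3.connect(":memory:")
tables = {
    "d1": [("a", 1.0, None), ("a", None, 1.0), ("b", 3.0, None),
           ("b", 4.0, 7.0), ("b", None, 8.0)],
    "d2": [("c", None, 1.0), ("b", 1.0, None), ("b", None, 2.0)],
}
for name, rows in tables.items():
    conn.execute(f"CREATE TABLE {name} (g TEXT, v1 REAL, v2 REAL)")
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", rows)

status, detail = try_full_join(conn)
print(sqlite3.sqlite_version, status)
conn.close()
```

Either outcome is fine for the rest of this note: the simulation below does not rely on native full join support.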


In the [data algebra](https://github.com/WinVector/data_algebra)
we would write the query a bit more like the following.

In [4]:
join_columns = ['g']

ops = (
    descr(d1=d1)
        .natural_join(
            b=descr(d2=d2),
            by=join_columns,
            jointype='full')
)

And we have no trouble executing this query in Pandas.

In [5]:
res_pandas = ops.eval({'d1': d1, 'd2': d2})

res_pandas


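|   | g | v1  | v2  |
|---|---|-----|-----|
| 0 | a | 1.0 | NaN |
| 1 | a | NaN | 1.0 |
| 2 | b | 3.0 | NaN |
| 3 | b | 3.0 | 2.0 |
| 4 | b | 4.0 | 7.0 |
| 5 | b | 4.0 | 7.0 |
| 6 | b | 1.0 | 8.0 |
| 7 | b | NaN | 8.0 |
| 8 | c | NaN | 1.0 |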
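
For intuition, here is a rough plain-Pandas sketch of what a full natural join does: an outer merge on the key, followed by coalescing same-named value columns (left value first, right value filling the gaps). The helper `full_natural_join` is a hypothetical name of ours, and it assumes both frames share the same non-key columns. It is meant as an illustration, not as the data algebra's actual implementation.

```python
import pandas

d1 = pandas.DataFrame({
    'g': ['a', 'a', 'b', 'b', 'b'],
    'v1': [1, None, 3, 4, None],
    'v2': [None, 1, None, 7, 8],
})
d2 = pandas.DataFrame({
    'g': ['c', 'b', 'b'],
    'v1': [None, 1, None],
    'v2': [1, None, 2],
})


def full_natural_join(a, b, by):
    """Outer merge on `by`, then coalesce the shared value columns."""
    # shared non-key columns come back as <name>_a and <name>_b
    merged = a.merge(b, on=by, how='outer', suffixes=('_a', '_b'))
    out = merged[by].copy()
    for c in [c for c in a.columns if c not in by]:
        # prefer the left value, fall back to the right value
        out[c] = merged[c + '_a'].combine_first(merged[c + '_b'])
    return out


res = full_natural_join(d1, d2, by=['g'])
print(res)
```

Up to row order, this should agree with the table above: two `a` rows, six `b` rows (three from `d1` times two from `d2`), and one `c` row.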


We *can* simulate a full join using concatenate (or "UNION ALL") and left joins.

This has two disadvantages:

  * The adapting query is a bit long.
  * It refers to each incoming data frame twice, breaking the pipeline nature of such a query
    (the execution pattern becomes a DAG, or directed acyclic graph, instead of a tree).

Let's ignore these issues and write down the query that simulates the full join. Our strategy is:

  * Build up a table with each key from *either* table in exactly one row.
  * Left join `d1`, and then `d2`, onto the key table.

The query looks like this.

In [6]:
ops_simulate = (
    # get shared key set
    descr(d1=d1)
        .project({}, group_by=join_columns)
        .concat_rows(
            b=descr(d2=d2)
                .project({}, group_by=join_columns),
            id_column=None,
            )
        .project({}, group_by=join_columns)
        # simulate full join with left joins
        .natural_join(
            b=descr(d1=d1),
            by=join_columns,
            jointype='left')
        .natural_join(
            b=descr(d2=d2),
            by=join_columns,
            jointype='left')
)

And the result in Pandas is as follows.

In [7]:
res_pandas_2 = ops_simulate.eval({'d1': d1, 'd2': d2})

assert data_algebra.test_util.equivalent_frames(res_pandas_2, res_pandas)

The sole advantage is that the longer `ops_simulate` pipeline can be run in SQLite.

In [8]:
res_sqlite = sqlite_handle.read_query(ops_simulate)

assert data_algebra.test_util.equivalent_frames(res_sqlite, res_pandas)
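
For comparison, here is how the same strategy might look hand-written in SQL and run through the standard library `sqlite3` module. This is a sketch under our own assumptions, not the query the data algebra emits: it uses a de-duplicating `UNION` for the key table where the pipeline above uses concatenate plus `project`, and the `keys` name is ours.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
tables = {
    "d1": [("a", 1.0, None), ("a", None, 1.0), ("b", 3.0, None),
           ("b", 4.0, 7.0), ("b", None, 8.0)],
    "d2": [("c", None, 1.0), ("b", 1.0, None), ("b", None, 2.0)],
}
for name, table_rows in tables.items():
    conn.execute(f"CREATE TABLE {name} (g TEXT, v1 REAL, v2 REAL)")
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", table_rows)

simulated_full_join = """
WITH keys AS (
  SELECT g FROM d1
  UNION              -- de-duplicates: each key appears in exactly one row
  SELECT g FROM d2
)
SELECT
  keys.g AS g,
  COALESCE(d1.v1, d2.v1) AS v1,   -- prefer d1, fall back to d2
  COALESCE(d1.v2, d2.v2) AS v2
FROM keys
LEFT JOIN d1 ON keys.g = d1.g
LEFT JOIN d2 ON keys.g = d2.g
ORDER BY keys.g
"""

rows = conn.execute(simulated_full_join).fetchall()
for row in rows:
    print(row)
conn.close()
```

Note that `d1` and `d2` each appear twice in the query, once in the key table and once in a left join. That is the tree-versus-DAG issue mentioned earlier.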

Some exciting news: the upcoming version of the data algebra
(version `0.8.3` and above) incorporates this simulation into its SQLite
adapter. It performs an operator-tree-to-DAG rewrite and can execute
the original operations directly in the database.

In [9]:
res_sqlite_2 = sqlite_handle.read_query(ops)

assert data_algebra.test_util.equivalent_frames(res_sqlite_2, res_pandas)

The generated query isn't too illegible. And, thanks to the power of the SQL `WITH` clause, it can in fact be used on
non-table sources.
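
To make the `WITH` point concrete, here is a small sketch in which both join inputs are themselves queries rather than tables. Each is named once as a common table expression and then referenced twice, once to build the key set and once in a left join. The table `t`, its columns, and the names `a`, `b`, and `keys` are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (src TEXT, g TEXT, v REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    ("left", "a", 1.0), ("left", "b", 2.0),
    ("right", "b", 20.0), ("right", "c", 30.0),
])

query = """
WITH
  a AS (SELECT g, v FROM t WHERE src = 'left'),    -- derived source, named once
  b AS (SELECT g, v FROM t WHERE src = 'right'),   -- derived source, named once
  keys AS (SELECT g FROM a UNION SELECT g FROM b)  -- first use of a and b
SELECT keys.g, a.v AS v_a, b.v AS v_b
FROM keys
LEFT JOIN a ON keys.g = a.g                        -- second use of a
LEFT JOIN b ON keys.g = b.g                        -- second use of b
ORDER BY keys.g
"""

rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

Without `WITH`, the sub-queries for `a` and `b` would each have to be written out twice. The adapter's generated query, printed next, names its intermediate steps in the same style.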

In [10]:
print(sqlite_handle.to_sql(ops))

-- data_algebra SQL https://github.com/WinVector/data_algebra
--  dialect: SQLiteModel
--       string quote: '
--   identifier quote: "
WITH
 "table_reference_0" AS (
  SELECT
   "g"
  FROM
   "d1"
 ) ,
 "project_1" AS (
  SELECT  -- .project({ }, group_by=['g'])
   "g"
  FROM
   "table_reference_0"
  GROUP BY
   "g"
 ) ,
 "table_reference_2" AS (
  SELECT
   "g"
  FROM
   "d2"
 ) ,
 "project_3" AS (
  SELECT  -- .project({ }, group_by=['g'])
   "g"
  FROM
   "table_reference_2"
  GROUP BY
   "g"
 ) ,
 "concat_rows_4" AS (
  SELECT  -- _0..concat_rows(b= _1, id_column=None, a_name='a', b_name='b')
   "g"
  FROM
  (
   SELECT
    *
   FROM
    "project_1"
  UNION ALL
   SELECT
    *
   FROM
    "project_3"
  ) "concat_rows_4"
 ) ,
 "project_5" AS (
  SELECT  -- .project({ }, group_by=['g'])
   "g"
  FROM
   "concat_rows_4"
  GROUP BY
   "g"
 ) ,
 "natural_join_6" AS (
  SELECT  -- _0..natural_join(b= _1, by=['g'], jointype='LEFT')
   COALESCE("project_5"."g", "d1"."g") AS "g" ,
   "v1"

And that is how to simulate a full join using concatenate and left-join.

In [11]:
# clean up
sqlite_handle.close()