### Natural Joins in Pandas and SQLite

When no single column uniquely identifies a row, you may need to join on more than one column in your pandas or SQL statement. To join on every column the two tables or DataFrames have in common (so that rows match only when their values agree in all of those columns), you can either list the columns explicitly or rely on the default behavior of pandas merge or the NATURAL JOIN clause in SQL.

This workbook illustrates various options. 

#### Create data

We create two identical tables. In neither table does a single column uniquely identify a row.

In [1]:
t_1 = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["B", "B", "C"],
    ["A", "A", "C"],
    ["B", "B", "D"]
]

t_2 = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["B", "B", "C"],
    ["A", "A", "C"],
    ["B", "B", "D"]
]

In [2]:
import pandas as pd

In [7]:
df_1 = pd.DataFrame(t_1)
df_1.columns = ['C1', 'C2', 'C3']

df_2 = pd.DataFrame(t_2)
df_2.columns = ['C1', 'C2', 'C3']

In [8]:
df_1

Unnamed: 0,C1,C2,C3
0,A,B,C
1,A,B,D
2,B,B,C
3,A,A,C
4,B,B,D


In [9]:
df_2

Unnamed: 0,C1,C2,C3
0,A,B,C
1,A,B,D
2,B,B,C
3,A,A,C
4,B,B,D
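
The claim that no single column identifies a row can be checked directly. The following is a minimal sketch that rebuilds the same data under the assumed name `df`; `unique_by_column` and `has_duplicate_rows` are illustrative names, not part of the original notebook.

```python
import pandas as pd

rows = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["B", "B", "C"],
    ["A", "A", "C"],
    ["B", "B", "D"],
]
df = pd.DataFrame(rows, columns=["C1", "C2", "C3"])

# True only if every value in the column appears exactly once.
unique_by_column = {col: bool(df[col].is_unique) for col in df.columns}

# True if any full row (all three columns taken together) is repeated.
has_duplicate_rows = bool(df.duplicated().any())

print(unique_by_column)
print(has_duplicate_rows)
```

Every column contains repeated values, while the combination of all three columns does not repeat, so only the full set of columns works as a row identifier here.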


#### Pandas Merge

Because the two tables are identical and contain no duplicate rows, an inner join on all common columns (the "natural join") reproduces the original table.

This is the default behavior of pandas merge: if you don't specify a join column, it joins on every column name the two DataFrames share, so a row matches only when all of those columns agree.

In [30]:
df_1.merge(df_2)

Unnamed: 0,C1,C2,C3
0,A,B,C
1,A,B,D
2,B,B,C
3,A,A,C
4,B,B,D


If we specify a single join column, each row of df_1 is paired with every row of df_2 that has the same value in that column. Because C1 does not uniquely identify rows in either table, the result has more rows than either input.

In [31]:
df_1.merge(df_2, on='C1')

Unnamed: 0,C1,C2_x,C3_x,C2_y,C3_y
0,A,B,C,B,C
1,A,B,C,B,D
2,A,B,C,A,C
3,A,B,D,B,C
4,A,B,D,B,D
5,A,B,D,A,C
6,A,A,C,B,C
7,A,A,C,B,D
8,A,A,C,A,C
9,B,B,C,B,C
10,B,B,C,B,D
11,B,B,D,B,C
12,B,B,D,B,D
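
The size of the expanded result can be predicted from how often each key occurs on each side. The sketch below rebuilds the data under the assumed names `left` and `right`, computes the expected row count, and then uses the `validate` argument of `merge` to have pandas reject a join whose keys are not unique.

```python
import pandas as pd

rows = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["B", "B", "C"],
    ["A", "A", "C"],
    ["B", "B", "D"],
]
left = pd.DataFrame(rows, columns=["C1", "C2", "C3"])
right = pd.DataFrame(rows, columns=["C1", "C2", "C3"])

# Number of occurrences of each C1 value on each side.
left_counts = left["C1"].value_counts()
right_counts = right["C1"].value_counts()

# In an inner join, each key contributes (left count) * (right count) rows.
# Multiplying the two Series aligns them on the key; a key missing on one
# side becomes NaN and is dropped before summing.
expected_rows = int((left_counts * right_counts).dropna().sum())
actual_rows = len(left.merge(right, on="C1"))
print(expected_rows, actual_rows)

# validate="one_to_one" makes pandas check key uniqueness before merging.
try:
    left.merge(right, on="C1", validate="one_to_one")
    one_to_one = True
except pd.errors.MergeError:
    one_to_one = False
print(one_to_one)
```

Here "A" occurs three times and "B" twice in each table, so the join on C1 yields 3 × 3 + 2 × 2 = 13 rows.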


If we specify all common columns in our query, we will get the same result as the default behavior. 

In [20]:
df_1.merge(df_2, on=['C1','C2','C3'])

Unnamed: 0,C1,C2,C3
0,A,B,C
1,A,B,D
2,B,B,C
3,A,A,C
4,B,B,D
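
If hard-coding the column list is undesirable, the shared columns can be computed instead. This is a small sketch under the assumption that the join should use every column name present in both frames; `left`, `right`, and `shared` are illustrative names.

```python
import pandas as pd

columns = ["C1", "C2", "C3"]
rows = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["B", "B", "C"],
    ["A", "A", "C"],
    ["B", "B", "D"],
]
left = pd.DataFrame(rows, columns=columns)
right = pd.DataFrame(rows, columns=columns)

# Column names present in both frames, kept in the left frame's order.
shared = [col for col in left.columns if col in right.columns]

default_merge = left.merge(right)             # no join column given
explicit_merge = left.merge(right, on=shared)  # shared columns listed

# Both calls should produce the same DataFrame.
same_result = default_merge.equals(explicit_merge)
print(shared, same_result)
```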


#### SQL JOIN

We can replicate the pandas work using SQL. The pandasql package lets us run SQL queries against DataFrames, using SQLite syntax.

In [34]:
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

A NATURAL JOIN in SQL joins on all columns that have the same name in both tables.

In [35]:
pysqldf("""
SELECT
    *
FROM
    df_1
NATURAL JOIN
    df_2
""")

Unnamed: 0,C1,C2,C3
0,A,B,C
1,A,B,D
2,B,B,C
3,A,A,C
4,B,B,D
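
The same query can also be run without pandasql. The following sketch uses Python's built-in `sqlite3` module with an in-memory database; it creates stand-in tables named `df_1` and `df_2` by hand, which is an assumption made for illustration rather than how pandasql loads the DataFrames.

```python
import sqlite3

rows = [
    ("A", "B", "C"),
    ("A", "B", "D"),
    ("B", "B", "C"),
    ("A", "A", "C"),
    ("B", "B", "D"),
]

conn = sqlite3.connect(":memory:")
for name in ("df_1", "df_2"):
    # Stand-in tables with the same columns and rows as the DataFrames.
    conn.execute(f"CREATE TABLE {name} (C1 TEXT, C2 TEXT, C3 TEXT)")
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", rows)

cursor = conn.execute("SELECT * FROM df_1 NATURAL JOIN df_2")
natural_columns = [d[0] for d in cursor.description]
natural_rows = cursor.fetchall()
conn.close()

print(natural_columns)
print(len(natural_rows))
```

The shared columns appear once in the result, and the number of rows equals that of either input table.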


If we join only on the first column, each row of df_1 is paired with every row of df_2 that has the same C1 value. Because C1 is not a unique identifier for each row, the result has more rows than either table.

In [36]:
pysqldf("""
SELECT
    *
FROM
    df_1
JOIN
    df_2
ON
    df_1.C1 = df_2.C1
""")

Unnamed: 0,C1,C2,C3,C1.1,C2.1,C3.1
0,A,B,C,A,A,C
1,A,B,C,A,B,C
2,A,B,C,A,B,D
3,A,B,D,A,A,C
4,A,B,D,A,B,C
5,A,B,D,A,B,D
6,B,B,C,B,B,C
7,B,B,C,B,B,D
8,A,A,C,A,A,C
9,A,A,C,A,B,C
10,A,A,C,A,B,D
11,B,B,D,B,B,C
12,B,B,D,B,B,D
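
Before joining on a single column in SQL, it can help to check whether that column repeats. This sketch again uses the built-in `sqlite3` module with hand-built stand-in tables `df_1` and `df_2` (an assumption for illustration); it lists the C1 values that occur more than once and counts the rows produced by the join on C1.

```python
import sqlite3

rows = [
    ("A", "B", "C"),
    ("A", "B", "D"),
    ("B", "B", "C"),
    ("A", "A", "C"),
    ("B", "B", "D"),
]

conn = sqlite3.connect(":memory:")
for name in ("df_1", "df_2"):
    conn.execute(f"CREATE TABLE {name} (C1 TEXT, C2 TEXT, C3 TEXT)")
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", rows)

# C1 values that appear in more than one row, with their counts.
repeated_keys = conn.execute(
    "SELECT C1, COUNT(*) FROM df_1 "
    "GROUP BY C1 HAVING COUNT(*) > 1 ORDER BY C1"
).fetchall()

# Total number of rows produced by joining on C1 alone.
joined_count = conn.execute(
    "SELECT COUNT(*) FROM df_1 JOIN df_2 ON df_1.C1 = df_2.C1"
).fetchone()[0]
conn.close()

print(repeated_keys)
print(joined_count)
```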


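If we explicitly JOIN on all common columns, we get the same rows as the NATURAL JOIN. Note that SELECT * now returns the join columns from both tables, so each column appears twice.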

In [39]:
pysqldf("""
SELECT
    *
FROM
    df_1
JOIN
    df_2
ON
    df_1.C1 = df_2.C1
    AND
    df_1.C2 = df_2.C2
    AND
    df_1.C3 = df_2.C3 
""")

Unnamed: 0,C1,C2,C3,C1.1,C2.1,C3.1
0,A,B,C,A,B,C
1,A,B,D,A,B,D
2,B,B,C,B,B,C
3,A,A,C,A,A,C
4,B,B,D,B,B,D
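
SQLite also accepts a USING clause, which names the join columns explicitly but, like NATURAL JOIN, returns each of them only once. The sketch below compares the column lists of the three forms; it uses the built-in `sqlite3` module with hand-built stand-in tables `df_1` and `df_2`, and the helper `column_names` is a hypothetical name introduced for this example.

```python
import sqlite3

rows = [
    ("A", "B", "C"),
    ("A", "B", "D"),
    ("B", "B", "C"),
    ("A", "A", "C"),
    ("B", "B", "D"),
]

conn = sqlite3.connect(":memory:")
for name in ("df_1", "df_2"):
    conn.execute(f"CREATE TABLE {name} (C1 TEXT, C2 TEXT, C3 TEXT)")
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", rows)


def column_names(query):
    """Return the result column names of a SELECT statement."""
    return [d[0] for d in conn.execute(query).description]


on_columns = column_names(
    "SELECT * FROM df_1 JOIN df_2 "
    "ON df_1.C1 = df_2.C1 AND df_1.C2 = df_2.C2 AND df_1.C3 = df_2.C3"
)
using_columns = column_names("SELECT * FROM df_1 JOIN df_2 USING (C1, C2, C3)")
natural_columns = column_names("SELECT * FROM df_1 NATURAL JOIN df_2")
conn.close()

print(len(on_columns), using_columns, natural_columns)
```

With ON, SELECT * returns six columns, while USING and NATURAL JOIN each return the three shared columns once.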
