## Compare DataFrames A & B to find rows only in A, only in B, and in both A & B ("difference & overlap")

**Background:** see [this post](https://betterprogramming.pub/a-visual-guide-to-set-comparisons-in-python-6ab7edb9ec41) for good visuals of what these set operations do and what they return.

### Prep & read data lists

In [2]:
import pandas as pd

In [3]:
df_A = pd.read_csv("A.csv")
df_B = pd.read_csv("B.csv")
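If the id column holds zero-padded codes, `read_csv` will by default parse them as numbers and drop the padding. A minimal sketch of reading the ids as strings instead, using `read_csv`'s `dtype` parameter; the in-memory `csv_text` and its columns are made up for illustration and stand in for `A.csv`:

```python
import io

import pandas as pd

# Hypothetical file content standing in for A.csv
csv_text = "id,name\n0000000001234,alpha\n0000000005678,beta\n"

# Default parsing infers an integer column, so the leading zeros are lost
df_default = pd.read_csv(io.StringIO(csv_text))

# Forcing the id column to str keeps the values exactly as written in the file
df_str = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

print(df_default["id"].tolist())
print(df_str["id"].tolist())
```

Whichever representation is chosen, both files need to be read the same way, otherwise no ids will match.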

### Set id column to compare on and do quick checks

In [29]:
id_col = "id"  # Specify the id column to compare on

In [30]:
df_A.equals(df_B)

False

In [31]:
df_A.shape

(238879, 13)

In [32]:
df_B.shape

(237234, 13)

In [33]:
len(pd.unique(df_A[id_col]))

233232

In [34]:
len(pd.unique(df_B[id_col]))

233173
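Both tables have more rows than distinct ids, so some ids appear on more than one row. A small sketch of how the repeated ids could be inspected before comparing; the `toy` DataFrame is invented for illustration:

```python
import pandas as pd

# Hypothetical data: id 11 appears twice, id 12 three times
toy = pd.DataFrame({"id": [10, 11, 11, 12, 12, 12], "val": range(6)})
id_col = "id"

n_rows = len(toy)
n_unique = toy[id_col].nunique()   # note: nunique() ignores NaN by default
n_repeats = n_rows - n_unique      # rows beyond the first occurrence of each id

# keep=False marks every row of a repeated id, not just the later occurrences
dup_rows = toy[toy[id_col].duplicated(keep=False)]

print(n_rows, n_unique, n_repeats)
print(dup_rows)
```

Repeated ids do not affect the set comparison itself, but they explain why the result tables can have more rows than the id sets have members.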

### Cross-compare to find ids of records in A only, in B only and in both A and B

In [35]:
id_A = set(df_A[id_col])
id_B = set(df_B[id_col])

id_A_only = id_A.difference(id_B)       # ids in A but not in B ("A anti-joins B")
id_B_only = id_B.difference(id_A)       # ids in B but not in A ("B anti-joins A")

# Intersection is symmetric, so the two sets below are equal; both names are kept to mirror the A/B result tables
id_A_B_both = id_A.intersection(id_B)   # ids in both A and B
id_B_A_both = id_B.intersection(id_A)   # same ids, computed from B's side
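The same three groups can also be obtained inside pandas. The following is a sketch of an alternative, not the approach used in this notebook: an outer merge on the distinct ids with `indicator=True`, which adds a `_merge` column labelling each id as `left_only`, `right_only` or `both`. The toy DataFrames are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for df_A and df_B
toy_A = pd.DataFrame({"id": [1, 2, 2, 3], "val": list("abcd")})
toy_B = pd.DataFrame({"id": [2, 3, 4], "val": list("xyz")})
id_col = "id"

# Merge only the distinct ids so repeated ids do not multiply rows
ids_A = toy_A[[id_col]].drop_duplicates()
ids_B = toy_B[[id_col]].drop_duplicates()
merged = ids_A.merge(ids_B, on=id_col, how="outer", indicator=True)

ids_left_only = set(merged.loc[merged["_merge"] == "left_only", id_col])
ids_right_only = set(merged.loc[merged["_merge"] == "right_only", id_col])
ids_both = set(merged.loc[merged["_merge"] == "both", id_col])

print(ids_left_only, ids_right_only, ids_both)
```

Plain Python sets are shorter for a single id column; the merge form may be more convenient when the comparison key spans several columns.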

In [36]:
# Optional: convert the ids to zero-padded strings for nicer display. They will then no longer match
# the original DataFrames unless those id columns are converted the same way.
# import numpy as np
# id_A = set(df_A[id_col].astype(np.int64).astype(str).str.zfill(13).unique())
# id_B = set(df_B[id_col].astype(np.int64).astype(str).str.zfill(13).unique())

# Series.unique() also returns the distinct ids, but Python's built-in set() above already deduplicates
# id_A = set(df_A[id_col].unique())
# id_B = set(df_B[id_col].unique())
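One caveat worth checking: the `%whos` output at the end of this notebook shows the ids stored as floats, which often (though not always) means the column contains missing values. NaN never compares equal to itself, so NaN entries do not behave like ordinary ids in set operations. A hedged sketch of dropping them before building the sets, with an invented `ids` Series:

```python
import numpy as np
import pandas as pd

# Hypothetical id column with two missing values
ids = pd.Series([101.0, np.nan, 102.0, np.nan])

# Built directly from the Series, the set can end up holding NaN entries
# that never match anything, including each other
raw = set(ids)

# Dropping missing ids first (and casting back to int) gives a clean set
clean = set(ids.dropna().astype("int64"))

print(clean)
```

Whether rows with a missing id should be excluded or reported separately depends on the data; this only shows the mechanics.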

### Verify the found ids

In [37]:
# Verify the two anti-join result sets are really mutually exclusive - output should be empty sets: (set(), set())
id_A_only.intersection(id_B_only), id_B_only.intersection(id_A_only)

(set(), set())

In [38]:
# Verify the two overlap sets are identical (intersection is symmetric) - output should be True
id_A_B_both == id_B_A_both

True
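A further check that could be added here: the "only" set and the "both" set should together rebuild each original id set, with no id counted twice. A minimal sketch on invented toy sets, using the set operators `-`, `&` and `|` (equivalent to `difference`, `intersection` and `union`):

```python
# Hypothetical id sets standing in for id_A and id_B
toy_id_A = {1, 2, 3, 5}
toy_id_B = {2, 3, 4}

toy_A_only = toy_id_A - toy_id_B
toy_B_only = toy_id_B - toy_id_A
toy_both = toy_id_A & toy_id_B

# Each original set is exactly the union of its "only" part and the overlap
check_A = (toy_A_only | toy_both) == toy_id_A
check_B = (toy_B_only | toy_both) == toy_id_B

# The parts do not share ids, so the sizes add up
check_sizes = len(toy_A_only) + len(toy_both) == len(toy_id_A)

print(check_A, check_B, check_sizes)
```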

### Produce full result tables that cross-reference the original tables

In [39]:
# .copy() returns an independent DataFrame rather than a slice of df_A, which avoids a SettingWithCopyWarning on the assignment below
df_A_only = df_A[df_A[id_col].isin(id_A_only)].copy()
# + 2 turns the zero-based index into a one-based row number and skips the header row, so it matches the row number shown in Excel
# (assumes the default 0..n-1 index produced by read_csv)
df_A_only["Rownum_In_Original"] = df_A_only.index + 2

df_B_only = df_B[df_B[id_col].isin(id_B_only)].copy()
df_B_only["Rownum_In_Original"] = df_B_only.index + 2

In [40]:
df_A_B_both = df_A[df_A[id_col].isin(id_A_B_both)].copy()
df_A_B_both["Rownum_In_Original"] = df_A_B_both.index + 2

df_B_A_both = df_B[df_B[id_col].isin(id_B_A_both)].copy()
df_B_A_both["Rownum_In_Original"] = df_B_A_both.index + 2
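The four result tables above follow the same two-step pattern, so they could be produced by one small helper. This is a sketch only; `rows_with_ids` and its `header_rows` parameter are hypothetical names, and it assumes the DataFrame still has the default 0..n-1 index from `read_csv`:

```python
import pandas as pd


def rows_with_ids(df, ids, id_col, header_rows=1):
    """Return the rows of df whose id is in ids, plus their spreadsheet row number."""
    subset = df[df[id_col].isin(ids)].copy()
    # +1 for one-based numbering, +header_rows to skip the header line(s)
    subset["Rownum_In_Original"] = subset.index + 1 + header_rows
    return subset


# Hypothetical data: ids 6 and 7 sit on spreadsheet rows 3 and 4
toy = pd.DataFrame({"id": [5, 6, 7], "val": ["a", "b", "c"]})
result = rows_with_ids(toy, {6, 7}, "id")
print(result)
```

With such a helper, each table becomes a single call, for example `rows_with_ids(df_A, id_A_only, id_col)`.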

In [41]:
# Optional: the complements, i.e. the rows of df_A whose id is NOT A-only and the rows of df_B whose id is NOT B-only
# df_A_only_not = df_A[~df_A[id_col].isin(id_A_only)].copy()
# df_B_only_not = df_B[~df_B[id_col].isin(id_B_only)].copy()

### Save result lists to CSV files

In [42]:
# index=False drops the (misleading) index column; Rownum_In_Original already shows where each row sits in the original file
df_A_only.to_csv("A_only.csv", index=False)
df_B_only.to_csv("B_only.csv", index=False)
df_A_B_both.to_csv("A_B_both.csv", index=False)
df_B_A_both.to_csv("B_A_both.csv", index=False)

### Double check variables generated in the process

In [43]:
%whos

Variable      Type         Data/Info
------------------------------------
df_A          DataFrame             CustID Contract_<...>238879 rows x 13 columns]
df_A_B_both   DataFrame             CustID Contract_<...>238475 rows x 14 columns]
df_A_only     DataFrame             CustID Contract_<...>\n[404 rows x 14 columns]
df_B          DataFrame             CustID Contract_<...>237234 rows x 13 columns]
df_B_A_both   DataFrame             CustID Contract_<...>236891 rows x 14 columns]
df_B_only     DataFrame             CustID Contract_<...>\n[343 rows x 14 columns]
id_A          set          {20004208641.0, 144126771<...>10334713.0, 1015545854.0}
id_A_B_both   set          {20004208641.0, 101444485<...>10334713.0, 1015545854.0}
id_A_only     set          {30400665617.0, 102605622<...>4201586.0, 20003250173.0}
id_B          set          {20004208641.0, 144126771<...>10334713.0, 1015545854.0}
id_B_A_both   set          {20004208641.0, 101444485<...>10334713.0, 1015545854.0}
id_B_only    