# Study of Full Outer Joins in *pandas*
*Eric Nordstrom* | *eanord4@gmail.com*  
*December 13th, 2019*

## Objective

I recently found that, in addition to the standard *pd.merge* and *pd.DataFrame.merge* functions, *pandas* provides at least one other way to perform a full outer join, which is to create a *pd.DataFrame* instance and feed it *pd.Series* instances as data (see **Introduction**).

This experiment aims to determine the relative speed and scaling between these three ways of joining with *pandas*. The experiment also aims to find how null values affect the speed of joining.

## Introduction

The *pandas* Python module has multiple ways of performing joins between datasets.

1. ***pandas.merge*** is a function taking inputs of two *pandas.DataFrame* instances (datasets) in addition to parameters specifying the type of join, the column(s) on which to join, etc.

2. ***pandas.DataFrame.merge*** is a method of the *pandas.DataFrame* class which considers the instance whose *merge* method was called as the **left** dataset and an input data frame as the **right** dataset. Additional parameters are used similarly to *pandas.merge*. This method likely works in the same way as *pandas.merge*, but this was not confirmed before the experiment.

3. Creating a new data frame by calling the ***pandas.DataFrame*** class (thereby calling its *\__init\__* method) also performs a full outer join if *pandas.Series* instances are provided as columns. The join in this case is assumed to be on the indices of the series provided as illustrated in the following example:

```
In:  pd.DataFrame({
         'A': pd.Series([1,2,3], index=[1,2,3]),
         'B': pd.Series(['a', 'b', 'c'], index=[2, 5, 9])
     })

Out:      A    B
     1  1.0  NaN
     2  2.0    a
     3  3.0  NaN
     5  NaN    b
     9  NaN    c
```
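For comparison, the same two series can be joined with an explicit full outer merge on their indices. The sketch below is only an illustration: the names `a`, `b`, `built`, and `merged` are hypothetical, and the comparison is limited to the row labels and the positions of the missing values.

```python
import pandas as pd

# The two example series (names chosen here for illustration only)
a = pd.Series([1, 2, 3], index=[1, 2, 3], name='A')
b = pd.Series(['a', 'b', 'c'], index=[2, 5, 9], name='B')

# Join by constructing a data frame from the series
built = pd.DataFrame({'A': a, 'B': b})

# Join with an explicit full outer merge on the indices
merged = pd.merge(a.to_frame(), b.to_frame(), how='outer',
                  left_index=True, right_index=True)

# Compare the row labels and the positions of the missing values
same_index = list(built.index) == list(merged.index)
same_nulls = built.isna().equals(merged.isna())
print(same_index, same_nulls)
```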

## Methods

First, functions were defined to systematize the creation of datasets and the way joins were performed. Example datasets and joins are displayed after the function definitions.

Next, experimental data were collected under each case described below. For each case, three trials were performed and recorded.

Once collected, all trials were analyzed using *pandas* plotting methods.

### Variables
* data frame length
* number of null values (resulting from unshared rows)

### Cases
* length ($L$) = 10, 10<sup>2</sup>, 10<sup>4</sup>, 10<sup>7</sup>, 10<sup>11</sup>
    * *It is not yet clear which "length" is related to the time required for the join. See notes below.*
* fraction of unshared rows, which result in null values ($n$) = 10%, 30%, 50%, 80%
* join method 

### Dependencies
The following Python packages were used:
* *pandas* 0.25.3
* *random*
* *timeit*
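
As a rough sketch of how *timeit* might be used to record three trials of a single join, consider the following. The tiny series and the name `trial_times` are assumptions made for illustration and are not the experimental datasets.

```python
from timeit import timeit
import pandas as pd

# Two tiny illustrative series (hypothetical, not the experimental data)
left = pd.Series([0.1, 0.2, 0.3], index=[0, 1, 2], name='LEFT')
right = pd.Series([0.4, 0.5, 0.6], index=[1, 2, 3], name='RIGHT')

# Time one execution of the join per trial, for three trials
trial_times = [
    timeit(lambda: pd.DataFrame({'LEFT': left, 'RIGHT': right}), number=1)
    for _ in range(3)
]
print(trial_times)  # elapsed seconds for each trial
```

Passing `number=1` makes each trial time one join. A larger `number` would return the total time for that many repetitions.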

## Notes
### Notes about joins
Each join will be performed on two similar datasets of the same length. Values will be randomly selected **float**s between 0 and 1. Each dataset will consist of a single column of values and a set of integer indices upon which to perform the join. Each join will be a full outer join on the indices.
### Notes about the "length" variable
It is not yet clear which "length" is related to the time required to join datasets. This would depend on the exact algorithm being used to compare indices. However, a reasonable algorithm for a full outer join might be like the following:
1. Iterate over the left indices. For each index, check whether it is present in the right dataset.
    * If it is, include the matched row in the resulting data frame. **Mark this row to be skipped when iterating over the right dataset.**
    * If it is not, place a *NaN* value in the column representing the right dataset.
2. Iterate over the right indices. **Skip the rows already matched in step 1.** 
    * The remaining rows cannot be present in the left dataset, since it was already iterated over. Therefore, for each row, place a *NaN* value in the column representing the left dataset.
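
The two passes above can be sketched in plain Python. This is only an illustration of the hypothesized algorithm, not the way *pandas* is known to implement joins. The function name `naive_outer_join` and the use of dictionaries as datasets are assumptions made for this sketch.

```python
import math

def naive_outer_join(left, right):
    """Full outer join of two {index: value} dicts using the two-pass hypothesis."""
    result = {}
    used = set()  # right indices already matched during the first pass

    # Pass 1: iterate over the left indices
    for idx, left_value in left.items():
        if idx in right:
            result[idx] = (left_value, right[idx])
            used.add(idx)  # mark so the second pass skips this row
        else:
            result[idx] = (left_value, math.nan)

    # Pass 2: iterate over the right indices, skipping matched rows
    for idx, right_value in right.items():
        if idx in used:
            continue
        result[idx] = (math.nan, right_value)

    # Sort by index so the output resembles the sorted join result
    return dict(sorted(result.items()))

joined = naive_outer_join({1: 1.0, 2: 2.0, 3: 3.0}, {2: 'a', 5: 'b', 9: 'c'})
print(joined)
```

Each index is visited a constant number of times in this sketch, which is what motivates the linear scaling hypothesis that follows.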

Using this algorithm as the hypothesis, the time required for joining should scale as

$$
\begin{aligned}
&O[a(L_\text{left}-N_\text{left}+L_\text{right}-N_\text{right}) + b(N_\text{left}+N_\text{right})]\\
&= O[a(L_\text{left}+L_\text{right}) + (b-a)(N_\text{left}+N_\text{right})]
\end{aligned}
$$

where $L_\text{[side]}$ is the length of the original dataset and $N_\text{[side]}$ is the number of null values resulting on the given side. Constants $a$ and $b$ are unknown and represent the fact that shared indices and unshared indices might require different amounts of time to handle. Using $L$ as the single original dataset length and $n$ as the fraction of rows which are unshared, the expression becomes

$ O[a(2L) + (b-a)(2nL)] $

or

(1) $\qquad O[(\alpha + \beta n)L]$,

where $\alpha$ and $\beta$ are unknown constants and the sign of $\beta$ informs whether the join is sped up or slowed down by null values. From here, a linear model of the timing can be created as follows:

(2) $\qquad \hat{t}(L) = t_0 + \hat{m}L$,

where $\hat{m}$ itself is modeled linearly as follows:

(3) $\qquad \hat{m}(n) = \alpha + \beta n$
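
To make the link between the constants explicit, the substitution behind expression (1) can be written out. It assumes both datasets have the same length $L$ and the same unshared fraction $n$, so that $N_\text{left}+N_\text{right}=2nL$:

$$
\begin{aligned}
a(L_\text{left}+L_\text{right}) + (b-a)(N_\text{left}+N_\text{right})
&= a(2L) + (b-a)(2nL) \\
&= \left[\,2a + 2(b-a)\,n\,\right]L
\end{aligned}
$$

Comparing with (1) gives $\alpha = 2a$ and $\beta = 2(b-a)$. Under the hypothesized algorithm, $\beta > 0$ would therefore indicate that unshared indices take longer to handle than shared ones ($b > a$), and $\beta < 0$ the opposite.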

### Notes about statistical methods

*At this time, it is not known which statistical methods are appropriate to calculate the above models in a way that reflects their influence on each other. This will be researched after data is gathered.*
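
One naive possibility, shown only as a sketch, is a two-stage least-squares fit: first fit model (2) separately for each $n$, then fit model (3) to the resulting slopes. This approach ignores how the two fits influence each other, so it may not be the appropriate method. The timings below are synthetic and noise-free, generated from made-up constants (`t0_true`, `alpha_true`, `beta_true`), and are not experimental results.

```python
import numpy as np

# Synthetic, noise-free timings from made-up constants (NOT measured data)
t0_true, alpha_true, beta_true = 1e-3, 2e-7, 5e-8
L_values = np.array([10, 10**2, 10**4], dtype=float)
n_values = np.array([0.1, 0.3, 0.5, 0.8])

# Stage 1: for each n, fit t = t0 + m*L and keep the slope m
slopes = []
for n in n_values:
    t = t0_true + (alpha_true + beta_true * n) * L_values
    m_hat, t0_hat = np.polyfit(L_values, t, 1)  # returns [slope, intercept]
    slopes.append(m_hat)

# Stage 2: fit m = alpha + beta*n across the fitted slopes
beta_hat, alpha_hat = np.polyfit(n_values, slopes, 1)
print(alpha_hat, beta_hat)
```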

## Procedure

#### Import dependencies

In [1]:
from timeit import timeit
import random as rm
import pandas as pd

#### Define useful functions to produce datasets and joins

In [2]:
def pd_merge(left_series, right_series):
    '''merge the datasets using `pd.merge`'''
    
    return pd.merge(left_series, right_series, how='outer', left_index=True, right_index=True)

def df_merge(left_df, right_series):
    '''merge the datasets using `pd.DataFrame.merge`; requires left dataset to be a data frame'''
    
    return left_df.merge(right_series, how='outer', left_index=True, right_index=True)

def init_join(left_series, right_series):
    '''merge the datasets by calling `pd.DataFrame`'''
    
    return pd.DataFrame({'LEFT': left_series, 'RIGHT': right_series})

def datasets(L, n):
    '''generate two series of the specified length and unshared fraction of rows'''
    
    N_unshared = int(n * L)  # number of unshared rows in each dataset. The total number of null values after joining will be double.
    length_range = range(L)
    all_indices = range(L + N_unshared)
    
    # get values
    left_data = [rm.random() for i in length_range]
    right_data = [rm.random() for i in length_range]
    
    # get left indices
    left_indices = sorted(rm.sample(all_indices, L))
    remaining = sorted(set(all_indices).difference(left_indices))  # sorted list: `random.sample` rejects sets in Python 3.11+
    
    # get right indices
    right_indices = rm.sample(left_indices, L - N_unshared)  # shared indices
    right_indices += rm.sample(remaining, N_unshared)  # unshared indices
    right_indices = sorted(right_indices)  # order of shared and unshared will be random
    
    # construct datasets as series
    left_series = pd.Series(left_data, index=left_indices, name='LEFT')
    right_series = pd.Series(right_data, index=right_indices, name='RIGHT')
    
    return left_series, right_series

#### Test the functions; display example datasets and join results

In [3]:
# Datasets

left_example, right_example = datasets(10, .5)

print('** Left Example **')
print('------------------------')
print(left_example)
print()

print('** Right Example **')
print('------------------------')
print(right_example)

** Left Example **
------------------------
1     0.662601
2     0.023784
3     0.442068
4     0.757046
6     0.939661
8     0.064824
9     0.608298
10    0.463203
12    0.706573
14    0.656467
Name: LEFT, dtype: float64

** Right Example **
------------------------
0     0.284811
1     0.462688
5     0.758086
7     0.051532
8     0.498499
9     0.783367
11    0.017342
12    0.815860
13    0.341990
14    0.057858
Name: RIGHT, dtype: float64


In [7]:
# Join

pd_merge_result = pd_merge(left_example, right_example)
print('** Join Example: L=10, n=50% **')
print('-------------------------------')
print(pd_merge_result)
print()

** Join Example: L=10, n=50% **
-------------------------------
        LEFT     RIGHT
0        NaN  0.284811
1   0.662601  0.462688
2   0.023784       NaN
3   0.442068       NaN
4   0.757046       NaN
5        NaN  0.758086
6   0.939661       NaN
7        NaN  0.051532
8   0.064824  0.498499
9   0.608298  0.783367
10  0.463203       NaN
11       NaN  0.017342
12  0.706573  0.815860
13       NaN  0.341990
14  0.656467  0.057858



In [5]:
# Show that all ways of joining yield the same result.
# `DataFrame.equals` is used because the built-in `all` only iterates over the
# column labels of a data frame, and `==` treats NaN as unequal to itself.

print('`pd.merge` result same as `pd.DataFrame.merge`?')
print('---------------------------------------------------')
print(pd_merge_result.equals( df_merge(left_example.to_frame(), right_example) ))
print()

print('`pd.merge` result same as `pd.DataFrame.__init__`?')
print('---------------------------------------------------')
print(pd_merge_result.equals( init_join(left_example, right_example) ))

`pd.merge` result same as `pd.DataFrame.merge`?
---------------------------------------------------
True

`pd.merge` result same as `pd.DataFrame.__init__`?
---------------------------------------------------
True


#### Initialize experimental data

In [6]:
# Cases to be tried
L_cases = [10, 10 ** 2, 10 ** 4, 10 ** 7, 10 ** 11]
n_cases = [.1, .3, .5, .8]
join_cases = ['pd.merge', 'pd.DataFrame.merge', 'pd.DataFrame.__init__']
join_funcs = [pd_merge, df_merge, init_join]
num_trials = 3

# Initialize
times = pd.DataFrame(
    
    columns=join_cases,
    
    index=pd.MultiIndex.from_product(
        [L_cases, n_cases, range(num_trials)],
        names=['L', 'n', 'Trial']
    )

)

# Show structure of data to be collected
times.head()

               pd.merge  pd.DataFrame.merge  pd.DataFrame.__init__
L   n    Trial
10  0.1  0          NaN                 NaN                    NaN
10  0.1  1          NaN                 NaN                    NaN
10  0.1  2          NaN                 NaN                    NaN
10  0.3  0          NaN                 NaN                    NaN
10  0.3  1          NaN                 NaN                    NaN


#### Create datasets and perform joins