Pandas AutoJoin

With the help of this lightweight framework it is possible to automatically join two or more large dataframes together. Pandas AutoJoin offers the following features:

Simple usage
Automatic primary and foreign key detection
(technical and composite keys)
Detect and solve some data conflicts
(Misspellings, Homonyms, Synonyms, Datatypes...)
Command Line Interface (CLI)

The framework simplifies the automation of data acquisition a part of Data Science Life Cycle. It extends the benefits of AutoML and makes it more accessible to machine learning beginners.

Content

Installation
Usage
Documentation
Command Line Interface (CLI)
Example
Paper (Citation)

Installation

To install package and use it locally:

python -m pip install .

Requirements:

Usage

Let df1 be the reference Pandas dataframe table and df2 other input tables for join. Then the automatic join can perform easliy by:

import pandas as pd
import pandas_auto_join as aj
df1 = pd.read_csv('./example/datasets/flights/flights.csv')
df2 = pd.read_csv('./example/datasets/flights/airlines.csv')
df = aj.join(df1, df2)
print(df)

or as CLI:

python -m pandas_auto_join './example/datasets/flights/flights.csv' './example/datasets/flights/airlines.csv'

Debug Mode: You can change the DEBUG stage in config.py file.

Documentation

def join(
    *args: Tuple[pd.DataFrame], 
    how: str = 'inner',
    strategy: str = 'levenshtein',
    threshold: float = 0.5,
    verbose: int = 0
)

Parameters

*args : Tuple[pd.DataFrame]
Dataframes, at least two for join.
how : str, default = 'inner'
How join should perform inner, left or outer.
strategy : str, default = 'levenshtein'
Similarity strategy for detect string similarity.
1. levenshtein: Levenshtein edit based distance.
  💡 Good for small differences in strings (e.g. Max Stne vs. Max Stone).
2. jaro: Jaro distance.
  💡 Good for strings, that have similar prefix (e.g. Max-Whilem Stone, Max Stone).
3. sets: The best match between any strings in the first set and the second set.
  💡 Good for string sequences, where order is swapped (e.g. Stone, Max vs. Max Stone).
threshold : float, default = 0.5:
How similar should strings be? Value between 0.0 and 1.0.
1. 0.0 = No similarity check
2. 1.0 = Equi-join, strings has to be equal
verbose : int, default = 0:
Print informative messages while run (1=yes, 0=no)

Command Line Interface (CLI)

 PANDAS
     _         _            _       _       
    / \  _   _| |_ ___     | | ___ (_)_ __  
   / _ \| | | | __/ _ \ _  | |/ _ \| | '_ \ 
  / ___ \ |_| | || (_) | |_| | (_) | | | | |
 /_/   \_\__,_|\__\___/ \___/ \___/|_|_| |_|
                                            
          
Usage: python -m pandas_auto_join [OPTIONS] FILES...

  Command to load and auto join two or more dataframes. Actually support
  PARQUET and CSV files.

Options:
  -h, --how [left|inner|outer]    Pandas merge type.  [default: inner]
  -ss, --strategy [levenshtein|jaro|sets]
                                  Algorithm for calculating similarity score.
                                  [default: levenshtein]
  -st, --threshold FLOAT RANGE    Threshold for similiarity of strings. 1 =
                                  Equi-join. 0 = Accept all.  [default: 0.5;
                                  0<=x<=1]
  -o, --output TEXT               Name of output file with file extension.
  -v, --verbose INTEGER RANGE     Print feedback while running.  [default: 0;
                                  0<=x<=1]
  --help                          Show this message and exit.

Example

An example with five datasets and different data conflicts:

Different column names for each table
Detect Flight and Date are composite keys for FLNoand FLDate
FLNo has white spaces and Flight not
FLDate different format than Date
Bag is foreign key (number) for bagid
Airline has misspellings in names Airline

Flights (Reference table)

Flight	Date	Bag	Airline
ABC1234	2024-02-01	43242	Luthansa
ABC1234	2024-02-02	34234	Eurowoing
...	...	...	...

Transcation Data: Flight times

FLNo	FLDate	Time
ABC 1234	02/01/2024	08:00
ABC 1234	02/02/2024	13:00
...	...	...

Transcation Data: Baggage

bagid	bagcount
43242	143
34234	89
...	...

Master data: Airline

Airline	Code
Lufthansa	LH
Eurowings	EW
...	...

✅ Result table

import pandas_auto_join as aj
df = aj.join(flights, flight_times, baggage, airlines)

or

python -m pandas_auto_join ./example/datasets/flights/flights.csv ./example/datasets/flights/flight_times.csv ./example/datasets/flights/baggage.csv ./example/datasets/flights/airlines.csv

Flight	Date	Bag	Airline	Time	bagcount	Code
ABC1234	2024-02-01	43242	Luthansa	08:00	143	LH
ABC1234	2024-02-02	34234	Eurowoing	13:00	89	EW
...	...	...	...	...	...	...

Paper (Citation)

Note

Paper on the framework with benchmarks has not yet been published, but will be submitted later.

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
assets		assets
example		example
src/pandas_auto_join		src/pandas_auto_join
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.cfg		setup.cfg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Pandas AutoJoin

Content

Installation

Usage

Documentation

Command Line Interface (CLI)

Example

Paper (Citation)

About

Releases

Languages

License

bitnulleins/pandas_auto_join

Folders and files

Latest commit

History

Repository files navigation

Pandas AutoJoin

Content

Installation

Usage

Documentation

Command Line Interface (CLI)

Example

Paper (Citation)

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Languages