Skip to content

Pandas Auto-Join Framework for auto joining multiple dataframes.

License

Notifications You must be signed in to change notification settings

bitnulleins/pandas_auto_join

Repository files navigation

Pandas AutoJoin

With the help of this lightweight framework it is possible to automatically join two or more large dataframes together. Pandas AutoJoin offers the following features:

  • Simple usage
  • Automatic primary and foreign key detection
    (technical and composite keys)
  • Detect and solve some data conflicts
    (Misspellings, Homonyms, Synonyms, Datatypes...)
  • Command Line Interface (CLI)

The framework simplifies the automation of data acquisition a part of Data Science Life Cycle. It extends the benefits of AutoML and makes it more accessible to machine learning beginners.

Content

Installation

To install package and use it locally:

python -m pip install .

Requirements:

  • Python >= 3.8
  • pandas
  • click
  • tqdm
  • Levenshtein

Usage

Let df1 be the reference Pandas dataframe table and df2 other input tables for join. Then the automatic join can perform easliy by:

import pandas as pd
import pandas_auto_join as aj
df1 = pd.read_csv('./example/datasets/flights/flights.csv')
df2 = pd.read_csv('./example/datasets/flights/airlines.csv')
df = aj.join(df1, df2)
print(df)

or as CLI:

python -m pandas_auto_join './example/datasets/flights/flights.csv' './example/datasets/flights/airlines.csv'

Debug Mode: You can change the DEBUG stage in config.py file.

Documentation

def join(
    *args: Tuple[pd.DataFrame], 
    how: str = 'inner',
    strategy: str = 'levenshtein',
    threshold: float = 0.5,
    verbose: int = 0
)

Parameters

  • *args : Tuple[pd.DataFrame]
    Dataframes, at least two for join.
  • how : str, default = 'inner'
    How join should perform inner, left or outer.
  • strategy : str, default = 'levenshtein'
    Similarity strategy for detect string similarity.
    1. levenshtein: Levenshtein edit based distance.
      💡 Good for small differences in strings (e.g. Max Stne vs. Max Stone).
    2. jaro: Jaro distance.
      💡 Good for strings, that have similar prefix (e.g. Max-Whilem Stone, Max Stone).
    3. sets: The best match between any strings in the first set and the second set.
      💡 Good for string sequences, where order is swapped (e.g. Stone, Max vs. Max Stone).
  • threshold : float, default = 0.5:
    How similar should strings be? Value between 0.0 and 1.0.
    1. 0.0 = No similarity check
    2. 1.0 = Equi-join, strings has to be equal
  • verbose : int, default = 0:
    Print informative messages while run (1=yes, 0=no)

Command Line Interface (CLI)

 PANDAS
     _         _            _       _       
    / \  _   _| |_ ___     | | ___ (_)_ __  
   / _ \| | | | __/ _ \ _  | |/ _ \| | '_ \ 
  / ___ \ |_| | || (_) | |_| | (_) | | | | |
 /_/   \_\__,_|\__\___/ \___/ \___/|_|_| |_|
                                            
          
Usage: python -m pandas_auto_join [OPTIONS] FILES...

  Command to load and auto join two or more dataframes. Actually support
  PARQUET and CSV files.

Options:
  -h, --how [left|inner|outer]    Pandas merge type.  [default: inner]
  -ss, --strategy [levenshtein|jaro|sets]
                                  Algorithm for calculating similarity score.
                                  [default: levenshtein]
  -st, --threshold FLOAT RANGE    Threshold for similiarity of strings. 1 =
                                  Equi-join. 0 = Accept all.  [default: 0.5;
                                  0<=x<=1]
  -o, --output TEXT               Name of output file with file extension.
  -v, --verbose INTEGER RANGE     Print feedback while running.  [default: 0;
                                  0<=x<=1]
  --help                          Show this message and exit.

Example

An example with five datasets and different data conflicts:

  • Different column names for each table
  • Detect Flight and Date are composite keys for FLNoand FLDate
  • FLNo has white spaces and Flight not
  • FLDate different format than Date
  • Bag is foreign key (number) for bagid
  • Airline has misspellings in names Airline

Flights (Reference table)

Flight Date Bag Airline
ABC1234 2024-02-01 43242 Luthansa
ABC1234 2024-02-02 34234 Eurowoing
... ... ... ...

Transcation Data: Flight times

FLNo FLDate Time
ABC 1234 02/01/2024 08:00
ABC 1234 02/02/2024 13:00
... ... ...

Transcation Data: Baggage

bagid bagcount
43242 143
34234 89
... ...

Master data: Airline

Airline Code
Lufthansa LH
Eurowings EW
... ...

✅ Result table

import pandas_auto_join as aj
df = aj.join(flights, flight_times, baggage, airlines)

or

python -m pandas_auto_join ./example/datasets/flights/flights.csv ./example/datasets/flights/flight_times.csv ./example/datasets/flights/baggage.csv ./example/datasets/flights/airlines.csv
Flight Date Bag Airline Time bagcount Code
ABC1234 2024-02-01 43242 Luthansa 08:00 143 LH
ABC1234 2024-02-02 34234 Eurowoing 13:00 89 EW
... ... ... ... ... ... ...

Paper (Citation)

Note

Paper on the framework with benchmarks has not yet been published, but will be submitted later.

About

Pandas Auto-Join Framework for auto joining multiple dataframes.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Languages