With the help of this lightweight framework it is possible to automatically join two or more large dataframes together. Pandas AutoJoin offers the following features:
- Simple usage
- Automatic primary and foreign key detection
(technical and composite keys) - Detect and solve some data conflicts
(Misspellings, Homonyms, Synonyms, Datatypes...) - Command Line Interface (CLI)
The framework simplifies the automation of data acquisition a part of Data Science Life Cycle. It extends the benefits of AutoML and makes it more accessible to machine learning beginners.
To install package and use it locally:
python -m pip install .
Requirements:
- Python >= 3.8
- pandas
- click
- tqdm
- Levenshtein
Let df1
be the reference Pandas
dataframe table and df2
other input tables for join. Then the automatic join can perform easliy by:
import pandas as pd
import pandas_auto_join as aj
df1 = pd.read_csv('./example/datasets/flights/flights.csv')
df2 = pd.read_csv('./example/datasets/flights/airlines.csv')
df = aj.join(df1, df2)
print(df)
or as CLI:
python -m pandas_auto_join './example/datasets/flights/flights.csv' './example/datasets/flights/airlines.csv'
Debug Mode: You can change the DEBUG
stage in config.py file.
def join(
*args: Tuple[pd.DataFrame],
how: str = 'inner',
strategy: str = 'levenshtein',
threshold: float = 0.5,
verbose: int = 0
)
Parameters
- *args : Tuple[pd.DataFrame]
Dataframes, at least two for join. - how : str, default = 'inner'
How join should perform inner, left or outer. - strategy : str, default = 'levenshtein'
Similarity strategy for detect string similarity.- levenshtein: Levenshtein edit based distance.
💡 Good for small differences in strings (e.g. Max Stne vs. Max Stone). - jaro: Jaro distance.
💡 Good for strings, that have similar prefix (e.g. Max-Whilem Stone, Max Stone). - sets: The best match between any strings in the first set and the second set.
💡 Good for string sequences, where order is swapped (e.g. Stone, Max vs. Max Stone).
- levenshtein: Levenshtein edit based distance.
- threshold : float, default = 0.5:
How similar should strings be? Value between 0.0 and 1.0.- 0.0 = No similarity check
- 1.0 = Equi-join, strings has to be equal
- verbose : int, default = 0:
Print informative messages while run (1=yes, 0=no)
PANDAS
_ _ _ _
/ \ _ _| |_ ___ | | ___ (_)_ __
/ _ \| | | | __/ _ \ _ | |/ _ \| | '_ \
/ ___ \ |_| | || (_) | |_| | (_) | | | | |
/_/ \_\__,_|\__\___/ \___/ \___/|_|_| |_|
Usage: python -m pandas_auto_join [OPTIONS] FILES...
Command to load and auto join two or more dataframes. Actually support
PARQUET and CSV files.
Options:
-h, --how [left|inner|outer] Pandas merge type. [default: inner]
-ss, --strategy [levenshtein|jaro|sets]
Algorithm for calculating similarity score.
[default: levenshtein]
-st, --threshold FLOAT RANGE Threshold for similiarity of strings. 1 =
Equi-join. 0 = Accept all. [default: 0.5;
0<=x<=1]
-o, --output TEXT Name of output file with file extension.
-v, --verbose INTEGER RANGE Print feedback while running. [default: 0;
0<=x<=1]
--help Show this message and exit.
An example with five datasets and different data conflicts:
- Different column names for each table
- Detect
Flight
andDate
are composite keys forFLNo
andFLDate
FLNo
has white spaces andFlight
notFLDate
different format thanDate
Bag
is foreign key (number) forbagid
Airline
has misspellings in namesAirline
Flights (Reference table)
Flight | Date | Bag | Airline |
---|---|---|---|
ABC1234 | 2024-02-01 | 43242 | Luthansa |
ABC1234 | 2024-02-02 | 34234 | Eurowoing |
... | ... | ... | ... |
Transcation Data: Flight times
FLNo | FLDate | Time |
---|---|---|
ABC 1234 | 02/01/2024 | 08:00 |
ABC 1234 | 02/02/2024 | 13:00 |
... | ... | ... |
bagid | bagcount |
---|---|
43242 | 143 |
34234 | 89 |
... | ... |
Airline | Code |
---|---|
Lufthansa | LH |
Eurowings | EW |
... | ... |
✅ Result table
import pandas_auto_join as aj
df = aj.join(flights, flight_times, baggage, airlines)
or
python -m pandas_auto_join ./example/datasets/flights/flights.csv ./example/datasets/flights/flight_times.csv ./example/datasets/flights/baggage.csv ./example/datasets/flights/airlines.csv
Flight | Date | Bag | Airline | Time | bagcount | Code |
---|---|---|---|---|---|---|
ABC1234 | 2024-02-01 | 43242 | Luthansa | 08:00 | 143 | LH |
ABC1234 | 2024-02-02 | 34234 | Eurowoing | 13:00 | 89 | EW |
... | ... | ... | ... | ... | ... | ... |
Note
Paper on the framework with benchmarks has not yet been published, but will be submitted later.