Main Concepts in pipelite

To use pipelite effectively, or to extend its capabilities, you first need to understand how it works and the concepts behind the solution. First of all, pipelite works by building Data Pipelines. The Data Pipeline concept is the most important one and is detailed below.

The Data Pipeline

What is a Data Pipeline?

Each Data Pipeline is a simple or complex data flow that gathers data from different sources, applies any needed transformations (optional), and finally loads the result into one or several other data sources. A pipeline can be very simple, like this one:

```mermaid
graph TD;
    id1[Access to the Data Source 1]-->id2[Read the dataset 1];
    id2[Read the dataset 1]-->id3[Transform dataset 1];
    id3[Transform dataset 1]-->id4[Access to the Data Source 2];
    id4[Access to the Data Source 2]-->id5[Write to the Data Source 2];
```
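For intuition only, here is what these same steps look like as plain pandas code outside pipelite (the file and column names are made up for the example); pipelite performs the equivalent work from its configuration file alone.

```python
import pandas as pd

# Access Data Source 1 and read dataset 1 (extraction).
orders = pd.read_csv("orders.csv")

# Transform dataset 1: keep only shipped orders and add a total column.
shipped = orders[orders["status"] == "shipped"].copy()
shipped["total"] = shipped["quantity"] * shipped["unit_price"]

# Access Data Source 2 and write the result (loading).
shipped.to_csv("shipped_orders.csv", index=False)
```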

Or, by connecting to several data sources and chaining multiple transformations, you can build more complex pipelines like this one:

```mermaid
graph TD;
    id1[Access to the Data Source 1]-->id2[Read the dataset 1];
    id3[Access to the Data Source 2]-->id4[Read the dataset 2];
    id2[Read the dataset 1]-->id5[Transform dataset 1];
    id4[Read the dataset 2]-->id6[Transform dataset 2];
    id6[Transform dataset 2]-->id13[Join or Aggregate Dataset];
    id5[Transform dataset 1]-->id13[Join or Aggregate Dataset];
    id5[Transform dataset 1]-->id12[Write to the Data Source 3];
    id13[Join or Aggregate Dataset]-->id14[Access to the Data Source 3];
    id14[Access to the Data Source 3]-->id15[Write to the Data Source 4];
```

Some Pipeline rules in pipelite

pipelite leverages the data pipeline concept introduced above. That said, keep these rules in mind when configuring a Data Pipeline with pipelite:

  • Each Extractor, Loader and Transformer must have a unique id (in the configuration file).
  • Each ETL object (Extractor, Loader and Transformer) has a specific configuration schema the user must respect (JSON validation; a validation sketch follows this list).
  • An Extractor has only one output.
  • A Loader has only one input.
  • A Transformer can have many inputs (datasets) and also many outputs (datasets).
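As a hedged illustration of the JSON-validation rule, the sketch below checks a hypothetical extractor entry against a hand-written schema using the jsonschema package. The field names (id, classname, parameters) and the class name are assumptions made for the example, not pipelite's actual schema.

```python
import jsonschema

# Hypothetical schema for one ETL object entry; the real pipelite schemas differ.
EXTRACTOR_SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},          # must be unique across the pipeline
        "classname": {"type": "string"},   # which extractor implementation to load
        "parameters": {"type": "object"},  # implementation-specific settings
    },
    "required": ["id", "classname"],
}

candidate = {
    "id": "csv-sales",
    "classname": "csvFileDS",              # illustrative name only
    "parameters": {"filename": "sales.csv"},
}

# Raises jsonschema.ValidationError if the entry does not respect the schema.
jsonschema.validate(instance=candidate, schema=EXTRACTOR_SCHEMA)
```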

The ETL pattern

Separation of duties is very important if you want to build solid and extensible Data Pipelines. That's why it's important to follow the ETL pattern: pipelite by design separates the Extractions (E) from the Transformations (T) and, of course, the Loadings (L).

Extraction and loading address the same kind of asset, a data source, so they are managed by the same object in pipelite. This object can handle both read and write: depending on the configuration, it is called during the extraction (read) or the loading (write).

Notes:

  • The pipelite code separates the data source wrapper classes from the transformers you can use inside a pipeline. To extend pipelite's capabilities, a Python developer may have to create a new class that inherits from the data source or transformer parent class in order to build a customized asset. It's also possible to change the pipeline behavior by inheriting from the pipeline class.
  • Another important point: between the layers (E, T and L), the data are stored in a specific format (managed by the etlDataset class) so as to enable flexibility. By default this format is a Pandas DataFrame (fully encapsulated in the etlDataset class), but it could be extended to other dataset storage back ends (a sketch of the idea follows this list).
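A hedged sketch of the wrapper idea: the class below is not pipelite's actual etlDataset, just an illustration of hiding the Pandas DataFrame behind a small interface so that another storage back end could be swapped in later.

```python
import pandas as pd

class DatasetWrapper:
    """Illustrative stand-in for an etlDataset-like container (not the real class)."""

    def __init__(self, name: str):
        self.name = name
        self._df = pd.DataFrame()        # default back end: a Pandas DataFrame

    def set_content(self, df: pd.DataFrame) -> None:
        # Extractors and transformers fill the wrapper without exposing the back end.
        self._df = df.copy()

    def get_content(self) -> pd.DataFrame:
        return self._df.copy()

    @property
    def count(self) -> int:
        # Row count, independent of the underlying storage.
        return len(self._df)
```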

The configuration file

The configuration file is the heart of pipelite. Its purpose is to describe the Data Pipeline: all the input data sources (extractors), all the transformations needed (transformers) and, of course, all the output data sources (loaders) must be listed, specified and configured in this single file.

Note: The format of this file is JSON

In addition to the pipeline configuration itself, it's also possible to configure several global behaviors of the flow (such as how each step of the flow is logged).
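As a hedged illustration only, the snippet below builds the general shape such a JSON file might take for a simple extract-transform-load flow and writes it to disk. The section names, field names and class names are assumptions made for the example and do not reproduce pipelite's actual configuration schema.

```python
import json

# Hypothetical pipeline description; all keys and class names are illustrative.
pipeline_config = {
    "extractors": [
        {"id": "sales-in", "classname": "csvFileDS",
         "parameters": {"filename": "sales.csv"}},
    ],
    "transformers": [
        {"id": "clean", "classname": "passThroughTR",
         "inputs": ["sales-in"], "outputs": ["sales-clean"]},
    ],
    "loaders": [
        {"id": "sales-out", "classname": "csvFileDS",
         "parameters": {"filename": "sales-clean.csv"},
         "inputs": ["sales-clean"]},
    ],
}

# Written out as a JSON file of the kind pipelite reads.
with open("pipeline.json", "w") as fp:
    json.dump(pipeline_config, fp, indent=2)
```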

The Data Sources

As explained above, the data sources represent the connection to the real data sources (files, databases or any other applications). They have several characteristics:

  • It's possible to extend a Data Source (by inheritance in Python) or to easily create a new one. This requires writing Python code, but pipelite makes that quite easy.
  • A Data Source must be configured (through the configuration file).
  • A Data Source can be available in read, write, or read-and-write mode.
  • A Data Source takes one dataset as input (when writing) and/or produces one dataset as output (when reading); a sketch follows this list.
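A hedged sketch of what a data source looks like conceptually: one read producing exactly one dataset, one write consuming exactly one dataset. The class and method names below are assumptions for illustration, not pipelite's actual extension API.

```python
import pandas as pd

class FolderCSVDataSource:
    """Illustrative data source reading/writing a CSV file in a folder.
    A conceptual sketch only, not pipelite's real base class."""

    def __init__(self, folder: str, filename: str):
        self.path = f"{folder}/{filename}"

    def read(self) -> pd.DataFrame:
        # Extraction side: return exactly one output dataset.
        return pd.read_csv(self.path)

    def write(self, dataset: pd.DataFrame) -> None:
        # Loading side: consume exactly one input dataset.
        dataset.to_csv(self.path, index=False)
```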

The Transformers

The transformers take one or several datasets as input and apply transformations to provide new (transformed) dataset(s) as output. They also have several characteristics:

  • It's possible to extend a Transformer (by inheritance in Python) or to easily create a new customized one. This requires writing Python code, but pipelite makes that quite easy.
  • They can have many datasets as input and/or output (a sketch follows this list).
  • A Transformer must be configured (through the configuration file).
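A hedged sketch of the many-inputs/many-outputs idea: the function below is plain pandas code written for this page, not pipelite's transformer API, and the dataset and column names are assumptions.

```python
import pandas as pd

def join_and_summarize(datasets: dict[str, pd.DataFrame]) -> dict[str, pd.DataFrame]:
    """Illustrative transformer: two input datasets, two output datasets."""
    # Join the two inputs on a shared key (column names are assumptions).
    joined = datasets["sales"].merge(datasets["customers"], on="customer_id")
    # Aggregate the joined data per country.
    summary = joined.groupby("country", as_index=False)["amount"].sum()
    return {"sales-enriched": joined, "sales-by-country": summary}
```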

The Pipelines

The pipeline encapsulates all the flow logic needed to perform the Data Pipeline. By default, pipelite.pipelines.sequentialPL is provided and can be used directly, but, as mentioned above, it's possible to build your own way of managing a data pipeline.

The pipelite.pipelines.sequentialPL pipeline

This Data pipeline has these characteristics:

  • First, it goes through all the transformers to calculate the best order in which to launch the transformations. This ordering is prepared beforehand and used during the execution phase (see the sketch after this list).
  • Then it runs all the transformers, potentially creating new datasets on the fly. Each transformer has a set of input datasets and a set of output datasets. A dataset can come from an extractor or have been generated by a previous transformer in the chain. The transformers are thus launched in sequence and can create new datasets that did not exist in the extractor list.
  • At the end, all the loaders are launched; only the loaders whose named dataset has been filled are run.
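As a hedged illustration of the ordering step (not pipelite's actual implementation), the sketch below orders transformers so that each one runs only after the datasets it needs are available, either from an extractor or from a previously executed transformer. All names are made up for the example.

```python
# Hypothetical description of a pipeline: names and structure are illustrative only.
transformers = {
    "T1": {"inputs": ["ds1"], "outputs": ["ds1-clean"]},
    "T2": {"inputs": ["ds2"], "outputs": ["ds2-clean"]},
    "T3": {"inputs": ["ds1-clean", "ds2-clean"], "outputs": ["joined"]},
}
available = {"ds1", "ds2"}   # datasets produced by the extractors

# Repeatedly pick any transformer whose inputs are all available
# (a simple topological ordering over dataset dependencies).
order = []
remaining = dict(transformers)
while remaining:
    ready = [name for name, t in remaining.items() if set(t["inputs"]) <= available]
    if not ready:
        raise ValueError("Unresolvable dependencies between transformers")
    for name in ready:
        order.append(name)
        available.update(remaining.pop(name)["outputs"])

print(order)   # e.g. ['T1', 'T2', 'T3']
```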