
profileTR Transformer

Purpose

This transformer analyses, and thereby profiles, one or more datasets. It can provide several pieces of information about the dataset(s), such as:

  • All the column names
  • The column types
  • The column patterns
  • The number of null values
  • The number of distinct values
  • The top values
  • The frequency distribution of the values

It works in a very simple way and can accept multiple datasets as input. The generated file can be either a JSON file or an HTML file.
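For illustration, here is a minimal sketch of how such per-column statistics can be gathered with pandas. This is not pipelite's actual implementation; the pattern encoding (runs of digits replaced by N) is an assumption inferred from the JSON example further down, and the input path is hypothetical.

    import re
    import pandas as pd

    def profile_column(series: pd.Series, maxvaluecounts: int = 10) -> dict:
        """Illustrative per-column profile (not pipelite's real code)."""
        non_null = series.dropna()
        # Assumed pattern encoding: runs of digits -> "N" (e.g. "6.0" -> "N.N")
        patterns = non_null.astype(str).map(lambda v: re.sub(r"\d+", "N", v))
        return {
            "name": series.name,
            "type": str(series.dtype),
            "distinct": int(non_null.nunique()),
            "null": int(series.isna().sum()),
            "top values": non_null.value_counts().head(maxvaluecounts).to_dict(),
            "pattern": patterns.value_counts().head(maxvaluecounts).to_dict(),
            # count/mean/std/min/quartiles/max, as in the JSON rendering below
            "stats": series.describe().to_dict()
                     if pd.api.types.is_numeric_dtype(series) else {},
        }

    df = pd.read_csv("tests/data/input.csv")  # hypothetical input file
    profile = [profile_column(df[c]) for c in df.columns]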

Configuration

The specific configuration for this transformer, set in the parameters section of the configuration file, includes the following parameters:

  • directory: directory in which the generated file will be created
  • filename: name of the file to generate
  • maxvaluecounts: maximum number of values used for the Top Values, patterns, and type detection analyses
  • output: rendering format of the generated file, either html or json

Data sources configuration

  • Inputs: one or more inputs
  • Outputs: 1 output

Configuration example:

    "transformers":  [     
    { 
        "id": "profiling",
        "classname": "pipelite.transformers.profileTR",
        "inputs" : [ "Source A", "test" ],
        "parameters": {
            "directory" : "tests/data/out",
            "filename" : "test.html",
            "maxvaluecounts": 10,
            "output": "html"
        }
    }
    ... ] ...

JSON rendering

This is an example of a JSON file generated by this transformer.

{
  "sources": [
    {
      "id": "S",
      "profile": {
        "rows count": 14,
        "columns count": 16,
        "columns names": [
          "id",
          "concept:name",
          [... List here all the columns names ...]
        ],
        "columns": [
          {
            "id": "0",
            "name": "id",
            "type": "float64",
            "inferred": "float64",
            "distinct": 12,
            "nan": 2,
            "null": 2,
            "stats": {
              "count": 12.0,
              "mean": 6.166666666666667,
              "std": 4.323999271157397,
              "min": 0.0,
              "25%": 2.75,
              "50%": 6.0,
              "75%": 9.25,
              "max": 13.0
            },
            "top values": {
              "0.0": 1,
              "1.0": 1,
              [... All the Top Values ...]
            },
            "pattern": {
              "N.N": 9,
              "NN.N": 3,
              [... All the patterns ...]
            },
            "types": {
              "number": 12,
              "null": 2,
              [... All the types ...]
            }
          },
          [... The other columns ...]
        ]
      }
    }
  ]
}

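Because the output is plain JSON, it is easy to post-process. Below is a minimal sketch, assuming the structure shown above and a file path built from the directory and filename parameters of the configuration example (with a json output):

    import json

    # Hypothetical path, built from the "directory" and "filename" parameters
    with open("tests/data/out/test.json") as f:
        report = json.load(f)

    for source in report["sources"]:
        profile = source["profile"]
        print(f"source {source['id']}: {profile['rows count']} rows, "
              f"{profile['columns count']} columns")
        for column in profile["columns"]:
            # Flag columns containing null values
            if column["null"] > 0:
                print(f"  {column['name']}: {column['null']} null values")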
Example

Example with HTML output

graph TD;
    id1[Read Data Source S1]-->id2[Dataset S1];
    id2[Dataset S1]-->id3[Profile S1];
    id11[Read Data Source S2]-->id21[Dataset S2];
    id21[Dataset S2]-->id31[Profile S2];
    id3[Profile S1]-->id4[Combine and create a HTML file];
    id31[Profile S2]-->id4[Combine and create a HTML file];

In this example, two data sources are read and profiled. A single output HTML file is then generated with all the computed results.
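The sketch below mirrors this flow outside of pipelite, using pandas only, to show what "combine and create a HTML file" amounts to. The source file names and the HTML layout are assumptions; pipelite's own HTML rendering is richer.

    import pandas as pd

    # Hypothetical files standing in for "Data Source S1" and "Data Source S2"
    sources = {"S1": "tests/data/s1.csv", "S2": "tests/data/s2.csv"}

    sections = []
    for name, path in sources.items():
        df = pd.read_csv(path)
        # describe(include="all") covers counts, numeric stats and top values
        sections.append(f"<h2>Profile {name}</h2>\n"
                        f"{df.describe(include='all').to_html()}")

    # Combine both profiles into a single HTML report
    with open("tests/data/out/test.html", "w") as f:
        f.write("<html><body>" + "\n".join(sections) + "</body></html>")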

Example with JSON output

With "output": "json", the transformer produces a file structured like the JSON rendering shown above.
