# ETL: Add a New ETL Process
> Add your custom ETL process to the ETL pipeline.

## 🌌 Original ETL Pipeline 
> This is a simple ETL pipeline that loads a Hugging Face dataset.

In [1]:
from omegaconf import OmegaConf

# load from dict
ETL_config = OmegaConf.create({
    'spark': {
        'appname': 'ETL',
        'driver': {'memory': '16g'},
    },
    'etl': [
        {
            'name': 'data_ingestion___huggingface___hf2raw',
            'args': {'name_or_path': ['ai2_arc', 'ARC-Challenge']}
        },
        {
            'name': 'data_save___huggingface___ufl2hf_obj'
        }
    ]
})

print(OmegaConf.to_yaml(ETL_config))

spark:
  appname: ETL
  driver:
    memory: 16g
etl:
- name: data_ingestion___huggingface___hf2raw
  args:
    name_or_path:
    - ai2_arc
    - ARC-Challenge
- name: data_save___huggingface___ufl2hf_obj



In [2]:
from dataverse.etl import ETLPipeline

etl_pipeline = ETLPipeline()

# raw -> hf_obj
spark, dataset = etl_pipeline.run(ETL_config)
dataset

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/14 19:26:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/11/14 19:26:56 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
23/11/14 19:26:56 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Found cached dataset ai2_arc (/root/.cache/huggingface/datasets/ai2_arc/ARC-Challenge/1.0.0/1569c2591ea2683779581d9fb467203d9aa95543bb9b75dcfde5da92529fd7f6)


Downloading and preparing dataset spark/14056872 to /root/.cache/huggingface/datasets/spark/14056872/0.0.0...
Dataset spark downloaded and prepared to /root/.cache/huggingface/datasets/spark/14056872/0.0.0. Subsequent calls will reuse this data.


Dataset({
    features: ['answerKey', 'choices', 'id', 'question'],
    num_rows: 2590
})

In [3]:
dataset[0]

{'answerKey': 'A',
 'choices': {'text': ['loss of electrons.',
   'loss of protons.',
   'gain of electrons.',
   'gain of protons.'],
  'label': ['A', 'B', 'C', 'D']},
 'id': 'Mercury_7029645',
 'question': 'Metal atoms will most likely form ions by the'}

## 🌌 Add Custom ETL Process

1. create your custom ETL process
2. check ETL process is registered
3. wrap it with `register_etl` decorator
4. add your custom ETL process to the ETL config
5. run the ETL pipeline

Here you are going to make a simple ETL process that removes the `choices` key from each data point.

In [4]:
from dataverse.etl import register_etl

### 🌠 1. create your custom ETL process

- the naming convention is `cate___sub-cate___name` (category, sub-category, and name, joined by triple underscores)
    - e.g. `huggingface___dataset___load_dataset`
- because we are using a Hugging Face dataset, the input arrives in `List[Dict]` format

```python
# ai2_arc format
[
    {
        'id': ...,
        'choices': ...,
        'question': ...,
        'answerKey': ...,
    },
    {...},
    ...
]
```
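The triple-underscore naming convention can be made concrete with a small helper. This is only a sketch: `split_etl_name` is a hypothetical function written for illustration and is not part of dataverse.

```python
from typing import Tuple


def split_etl_name(name: str) -> Tuple[str, str, str]:
    """Split 'cate___sub-cate___name' into (category, sub-category, name)."""
    parts = name.split("___")
    # Exactly three non-empty parts are expected by the convention.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'cate___sub-cate___name', got {name!r}")
    category, sub_category, etl_name = parts
    return category, sub_category, etl_name


print(split_etl_name("data_ingestion___huggingface___hf2raw"))
# ('data_ingestion', 'huggingface', 'hf2raw')
```

Single underscores inside a part (such as `data_ingestion`) are left alone, because only the triple underscore acts as a separator.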

Write a Spark process that assumes `List[Dict]`-style input. Here we simply remove the `choices` key from each data point.

In [5]:
def your___custom___etl_process(spark, data, *args, **kwargs):
    # add your custom process here
    # here we are going to simply remove 'choices' key
    data = data.map(lambda x: {k: v for k, v in x.items() if k != 'choices'})

    return data
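To see what that lambda does to a single record without starting Spark, the same dict comprehension can be applied to a plain Python list. This sketch assumes each element is a dict shaped like an `ai2_arc` row; `drop_key` is a hypothetical helper name, and the built-in `map` stands in for the `data.map` call.

```python
def drop_key(record: dict, key: str = "choices") -> dict:
    """Return a copy of record without the given key."""
    return {k: v for k, v in record.items() if k != key}


# One row copied from the ai2_arc output shown earlier in this notebook.
rows = [
    {
        "id": "Mercury_7029645",
        "question": "Metal atoms will most likely form ions by the",
        "choices": {
            "text": ["loss of electrons.", "loss of protons.",
                     "gain of electrons.", "gain of protons."],
            "label": ["A", "B", "C", "D"],
        },
        "answerKey": "A",
    }
]

cleaned = list(map(drop_key, rows))
print(sorted(cleaned[0]))  # ['answerKey', 'id', 'question']
```

The original rows are not modified, since the comprehension builds a new dict for each record.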

### 🌠 2. check ETL process is registered

The ETL pipeline only runs registered ETL processes.

In [6]:
from dataverse.etl import ETLRegistry 

# we can see our custom process is not registered yet
ETLRegistry()

Total [ 43 ]
data_ingestion [ 16 ]
deduplication [ 4 ]
cleaning [ 13 ]
pii [ 2 ]
quality [ 1 ]
data_load [ 4 ]
utils [ 3 ]

### 🌠 3. wrap it with `register_etl` decorator

How do you register your custom ETL process?
Simply wrap it with the `register_etl` decorator:

```python
@register_etl
def your___custom___etl_process(spark, data, *args, **kwargs):
    ...
```

In [7]:
@register_etl
def your___custom___etl_process(spark, data, *args, **kwargs):
    # remove the 'choices' key from every record
    data = data.map(lambda x: {k: v for k, v in x.items() if k != 'choices'})

    return data

In [8]:
# you will see that your custom ETL process is now registered
ETLRegistry()

Total [ 44 ]
data_ingestion [ 16 ]
deduplication [ 4 ]
cleaning [ 13 ]
pii [ 2 ]
quality [ 1 ]
data_load [ 4 ]
utils [ 3 ]
your [ 1 ]
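Conceptually, a registering decorator stores the function under its own name so that a pipeline can look it up from the config later. The following is a toy sketch of that idea and not dataverse's actual implementation: `TOY_REGISTRY`, `toy_register_etl`, and `toy_run` are hypothetical names, and the toy process takes only `data` rather than the real `(spark, data, *args, **kwargs)` signature.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical stand-in for a registry: maps process name -> function.
TOY_REGISTRY: Dict[str, Callable] = {}


def toy_register_etl(func: Callable) -> Callable:
    """Store func under its __name__ and return it unchanged."""
    TOY_REGISTRY[func.__name__] = func
    return func


@toy_register_etl
def your___custom___etl_process(data: List[dict]) -> List[dict]:
    # Drop the 'choices' key from every record.
    return [{k: v for k, v in row.items() if k != "choices"} for row in data]


def toy_run(step_names: List[str], data: List[dict]) -> List[dict]:
    """Run registered steps in order; an unregistered name raises KeyError."""
    for name in step_names:
        data = TOY_REGISTRY[name](data)
    return data


# Count registered processes per category (the part before the first '___').
per_category = Counter(name.split("___")[0] for name in TOY_REGISTRY)
print(dict(per_category))  # {'your': 1}
```

Under this reading, the new `your [ 1 ]` row appears because the category is taken from the first segment of the function name.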

### 🌠 4. add your custom ETL process to the ETL config


In [9]:
from omegaconf import OmegaConf

# load from dict
ETL_config = OmegaConf.create({
    'spark': {
        'appname': 'ETL',
        'driver': {'memory': '16g'},
    },
    'etl': [
        {
            'name': 'data_ingestion___huggingface___hf2raw',
            'args': {'name_or_path': ['ai2_arc', 'ARC-Challenge']}
        },

        # ======== add your custom etl here ========
        {
            'name': 'your___custom___etl_process'
        },
        # ==========================================

        {
            'name': 'data_save___huggingface___ufl2hf_obj'
        }
    ]
})

print(OmegaConf.to_yaml(ETL_config))

spark:
  appname: ETL
  driver:
    memory: 16g
etl:
- name: data_ingestion___huggingface___hf2raw
  args:
    name_or_path:
    - ai2_arc
    - ARC-Challenge
- name: your___custom___etl_process
- name: data_save___huggingface___ufl2hf_obj
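When the step list is built programmatically, the custom step has to land between ingestion and saving. A minimal sketch with plain dicts, assuming the steps run in list order and the save step is the last entry; `insert_before_last` is a hypothetical helper, not a dataverse function.

```python
from typing import Dict, List


def insert_before_last(steps: List[Dict], new_step: Dict) -> List[Dict]:
    """Return a new list with new_step placed just before the final step."""
    if not steps:
        raise ValueError("steps must contain at least one entry")
    return steps[:-1] + [new_step] + steps[-1:]


etl_steps = [
    {
        "name": "data_ingestion___huggingface___hf2raw",
        "args": {"name_or_path": ["ai2_arc", "ARC-Challenge"]},
    },
    {"name": "data_save___huggingface___ufl2hf_obj"},
]

new_steps = insert_before_last(etl_steps, {"name": "your___custom___etl_process"})
print([step["name"] for step in new_steps])
```

The resulting list can be used as the `'etl'` value in the dict passed to `OmegaConf.create`.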



### 🌠 5. run the ETL pipeline

You can verify that the custom ETL process you added works and that `choices` has been removed.

In [10]:
from dataverse.etl import ETLPipeline

etl_pipeline = ETLPipeline()

# raw -> hf_obj
spark, dataset = etl_pipeline.run(ETL_config)
dataset

23/11/14 19:27:13 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
23/11/14 19:27:13 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


Found cached dataset ai2_arc (/root/.cache/huggingface/datasets/ai2_arc/ARC-Challenge/1.0.0/1569c2591ea2683779581d9fb467203d9aa95543bb9b75dcfde5da92529fd7f6)


Downloading and preparing dataset spark/1082445423 to /root/.cache/huggingface/datasets/spark/1082445423/0.0.0...
Dataset spark downloaded and prepared to /root/.cache/huggingface/datasets/spark/1082445423/0.0.0. Subsequent calls will reuse this data.


Dataset({
    features: ['answerKey', 'id', 'question'],
    num_rows: 2590
})

In [11]:
dataset[0]

{'answerKey': 'A',
 'id': 'Mercury_7029645',
 'question': 'Metal atoms will most likely form ions by the'}
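A quick way to confirm the effect is to compare the feature lists from before and after the custom step was added. The lists below are copied from the two `Dataset` outputs in this notebook; in a live session, `dataset.column_names` would supply them instead.

```python
# Feature lists copied from the two Dataset outputs shown above.
features_before = ["answerKey", "choices", "id", "question"]
features_after = ["answerKey", "id", "question"]

removed = set(features_before) - set(features_after)
added = set(features_after) - set(features_before)

print(removed, added)  # {'choices'} set()
```

Only `choices` disappears and nothing new is introduced, while the row count stays at 2590 in both outputs.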