# ETL: Create a New ETL Process
> Create a custom ETL process and add it to the ETL pipeline.

Creating your own ETL process can be tricky at first.
Here is a simple example that shows you where to start.

## 🌌 1. Start from the ETL pipeline you want to extend
> A simple ETL pipeline that loads a Hugging Face dataset

In [1]:
from omegaconf import OmegaConf

# load from dict
ETL_config = OmegaConf.create({
    'spark': {
        'appname': 'ETL',
        'driver': {'memory': '16g'},
    },
    'etl': [
        {
            'name': 'data_ingestion___huggingface___hf2raw',
            'args': {'name_or_path': ['ai2_arc', 'ARC-Challenge']}
        },
        {
            'name': 'data_save___huggingface___ufl2hf_obj'
        }
    ]
})

print(OmegaConf.to_yaml(ETL_config))

spark:
  appname: ETL
  driver:
    memory: 16g
etl:
- name: data_ingestion___huggingface___hf2raw
  args:
    name_or_path:
    - ai2_arc
    - ARC-Challenge
- name: data_save___huggingface___ufl2hf_obj



In [2]:
from dataverse.etl import ETLPipeline

etl_pipeline = ETLPipeline()

# raw -> hf_obj
spark, dataset = etl_pipeline.run(ETL_config)
dataset

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/14 19:02:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/11/14 19:02:29 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
Found cached dataset ai2_arc (/root/.cache/huggingface/datasets/ai2_arc/ARC-Challenge/1.0.0/1569c2591ea2683779581d9fb467203d9aa95543bb9b75dcfde5da92529fd7f6)


Downloading and preparing dataset spark/-1076059055 to /root/.cache/huggingface/datasets/spark/-1076059055/0.0.0...
Dataset spark downloaded and prepared to /root/.cache/huggingface/datasets/spark/-1076059055/0.0.0. Subsequent calls will reuse this data.


Dataset({
    features: ['answerKey', 'choices', 'id', 'question'],
    num_rows: 2590
})

## 🌌 2. Choose where you want to add your own ETL process
> Remove or comment out every ETL process that comes after that point in the config!

In [3]:
from omegaconf import OmegaConf

# load from dict
ETL_config = OmegaConf.create({
    'spark': {
        'appname': 'ETL',
        'driver': {'memory': '16g'},
    },
    'etl': [
        {
            'name': 'data_ingestion___huggingface___hf2raw',
            'args': {'name_or_path': ['ai2_arc', 'ARC-Challenge']}
        },
        
        # TODO: add your own ETL process from here

        # TODO: to do so, remove or comment out the ETL processes
        #       that would otherwise run after this point
        # {
        #     'name': 'data_save___huggingface___ufl2hf_obj'
        # }
    ]
})

print(OmegaConf.to_yaml(ETL_config))

spark:
  appname: ETL
  driver:
    memory: 16g
etl:
- name: data_ingestion___huggingface___hf2raw
  args:
    name_or_path:
    - ai2_arc
    - ARC-Challenge



In [4]:
from dataverse.etl import ETLPipeline

etl_pipeline = ETLPipeline()

# raw -> (spark, data), where data is an RDD or a DataFrame
spark, data = etl_pipeline.run(ETL_config)

23/11/14 19:02:42 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).


Found cached dataset ai2_arc (/root/.cache/huggingface/datasets/ai2_arc/ARC-Challenge/1.0.0/1569c2591ea2683779581d9fb467203d9aa95543bb9b75dcfde5da92529fd7f6)


  0%|          | 0/3 [00:00<?, ?it/s]

In [5]:
spark

In [6]:
data

PythonRDD[13] at RDD at PythonRDD.scala:53

## 🌌 3. Check the data so far!
> Use Spark to inspect what the pipeline has produced so far!
- `collect` pulls the entire dataset onto the driver, which is heavy, so prefer `take`, which returns only the first few records.

In [7]:
data.take(3)

[{'id': 'Mercury_7029645',
  'question': 'Metal atoms will most likely form ions by the',
  'choices': Row(text=['loss of electrons.', 'loss of protons.', 'gain of electrons.', 'gain of protons.'], label=['A', 'B', 'C', 'D']),
  'answerKey': 'A'},
 {'id': 'Mercury_7216598',
  'question': 'Which phrase does not describe asexual reproduction in organisms?',
  'choices': Row(text=['requires two parents', 'little variation in offspring', 'only one type of cell involved', 'duplicates its genetic material'], label=['A', 'B', 'C', 'D']),
  'answerKey': 'A'},
 {'id': 'MDSA_2008_5_40',
  'question': 'A student is investigating changes in the states of matter. The student fills a graduated cylinder with 50 milliliters of packed snow. The graduated cylinder has a mass of 50 grams when empty and 95 grams when filled with the snow. The packed snow changes to liquid water when the snow is put in a warm room. Which statement best describes this process?',
  'choices': Row(text=['Cooling causes the sn

## 🌌 4. Create your own ETL process
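> What do you actually want to do with the data?

Let's say you want to add a `filter` process to the ETL pipeline.
- You want to remove the `choices` key from every record in the dataset (done here with `map`, since it drops a key rather than whole records).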

In [8]:
data = data.map(lambda x: {k: v for k, v in x.items() if k != 'choices'})

In [9]:
data.take(3)

[{'id': 'Mercury_7029645',
  'question': 'Metal atoms will most likely form ions by the',
  'answerKey': 'A'},
 {'id': 'Mercury_7216598',
  'question': 'Which phrase does not describe asexual reproduction in organisms?',
  'answerKey': 'A'},
 {'id': 'MDSA_2008_5_40',
  'question': 'A student is investigating changes in the states of matter. The student fills a graduated cylinder with 50 milliliters of packed snow. The graduated cylinder has a mass of 50 grams when empty and 95 grams when filled with the snow. The packed snow changes to liquid water when the snow is put in a warm room. Which statement best describes this process?',
  'answerKey': 'D'}]

Hey! It's working ;)! The `choices` key has been removed from the dataset!
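
If you want to sanity-check the record-level logic without touching Spark at all, here is a minimal sketch. The helper name `drop_keys` and the shortened sample record are assumptions made up for illustration (the real `choices` value is a Spark `Row`, replaced here by a plain dict); they are not part of dataverse.

```python
def drop_keys(record, keys=("choices",)):
    """Hypothetical helper: return a copy of `record` without the given keys."""
    return {k: v for k, v in record.items() if k not in keys}


# Shortened stand-in for one record of the dataset shown above
sample = {
    "id": "Mercury_7029645",
    "question": "Metal atoms will most likely form ions by the",
    "choices": {"text": ["loss of electrons.", "loss of protons."], "label": ["A", "B"]},
    "answerKey": "A",
}

cleaned = drop_keys(sample)
print(cleaned)

# The original record is left untouched, because a new dict is built
print("choices" in sample)
```

A named function like this should be usable in place of the lambda, e.g. `data.map(drop_keys)`, and is a bit easier to test on its own.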

## 🌌 5. Working? It's time to add it to the ETL Registry
> Working great? It's time to learn how to add it to the ETL Registry!
[ETL_add_new_etl_process.ipynb](https://github.com/UpstageAI/dataverse/blob/main/guideline/etl/ETL_add_new_etl_process.ipynb)

Check out the guideline in the notebook above. As a preview, here is the function template for adding your process to the ETL Registry.

```python
# before
data = data.map(lambda x: {k: v for k, v in x.items() if k != 'choices'})

# after
def your___custom___etl_process(spark, data, *args, **kwargs):
    # add your custom process here
    # here we are going to simply remove 'choices' key
    data = data.map(lambda x: {k: v for k, v in x.items() if k != 'choices'})

    return data
```
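
Before registering the function, it can help to try it on a tiny in-memory stand-in instead of a real Spark job. The sketch below assumes a hypothetical `LocalRDD` class that only mimics the `map` and `take` methods used in this guide; it is not a dataverse or Spark API, and unlike a real RDD its `map` runs eagerly.

```python
class LocalRDD:
    """Hypothetical in-memory stand-in exposing only `map` and `take`."""

    def __init__(self, records):
        self._records = list(records)

    def map(self, fn):
        # Eager, unlike Spark: the function is applied immediately
        return LocalRDD(fn(r) for r in self._records)

    def take(self, n):
        return self._records[:n]


def your___custom___etl_process(spark, data, *args, **kwargs):
    # Same body as the template above: drop the 'choices' key
    data = data.map(lambda x: {k: v for k, v in x.items() if k != 'choices'})
    return data


# Made-up records with the same keys as the dataset in this guide
fake_data = LocalRDD([
    {"id": "a", "question": "q1", "choices": ["x", "y"], "answerKey": "A"},
    {"id": "b", "question": "q2", "choices": ["x", "y"], "answerKey": "B"},
])

# `spark` is unused by this process, so None is passed for the local check
result = your___custom___etl_process(None, fake_data).take(2)
print(result)
```

This only checks the record-level logic; it says nothing about how the function behaves once it is registered and run inside the actual pipeline.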