# ETL: test your ETL process
> When you want `test`(sample) data to quickly test your ETL process, or need `data from a certain point` to test it, this notebook shows how to get it.

## 🌌 Get `test`(sample) data

### 🌠 Get `test`(sample) data `w/o config`
> When you have created an ETL process and don't want to set up a config from scratch, here is a quick way to get sample data.

In [1]:
from dataverse.etl import ETLPipeline

etl_pipeline = ETLPipeline()
spark, data = etl_pipeline.sample()

# by default, sample() returns 100 `ufl` records
print(f"total data # : {data.count()}")
print(f"sample data :")
data.take(1)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/14 19:37:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
                                                                                

total data # : 100
sample data :


[{'id': 'e2ce9284-8691-471b-88e3-ba29a5888fd1',
  'name': 'test_fake_ufl',
  'text': 'Simple toward doctor any. Rich name reality bad family. Gas mind even important stay describe official.\nThere recognize campaign wind on. Drop sport however central read.',
  'meta': '{"name": "Amanda Ross", "age": 60, "address": "302 Rebecca Camp\\nPatrickborough, CT 40755", "job": "Broadcast engineer"}'}]

To increase the sample size, pass the number of records you want. Both forms below are equivalent:
```python
spark, data = etl_pipeline.sample(n=10000)
spark, data = etl_pipeline.sample(10000)
```

In [2]:
spark, data = etl_pipeline.sample(10000)
print(f"total data # : {data.count()}")
print(f"sample data :")
data.take(1)

total data # : 10000
sample data :


[{'id': '79081a73-5c82-432d-bf4a-f7de8bf59d12',
  'name': 'test_fake_ufl',
  'text': 'Serious teacher follow they entire between. Far see issue view throughout order field.\nWant senior sell amount picture. Tree cell low edge.',
  'meta': '{"name": "Jack Yoder", "age": 75, "address": "083 Diana Parkway Suite 438\\nLake Amberport, AS 76996", "job": "Haematologist"}'}]
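
The records in the outputs above carry four keys: `id`, `name`, `text`, and `meta` (a JSON string). A quick sanity check on a few rows can be sketched in plain Python as follows. `EXPECTED_KEYS` and `check_sample_record` are hypothetical names, and the key set is an assumption read off the output shown here, not a documented schema.

```python
import json

# Keys observed in the sample output above (an assumption, not a documented schema).
EXPECTED_KEYS = {"id", "name", "text", "meta"}


def check_sample_record(record):
    """Return a list of problems found in one sample record (empty if none)."""
    problems = []
    missing = EXPECTED_KEYS - set(record)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    meta = record.get("meta")
    if meta is not None:
        try:
            json.loads(meta)  # `meta` is expected to be a JSON-encoded string
        except (TypeError, ValueError):
            problems.append("meta is not a valid JSON string")
    return problems


# One record copied from the output above (text shortened for brevity).
record = {
    "id": "79081a73-5c82-432d-bf4a-f7de8bf59d12",
    "name": "test_fake_ufl",
    "text": "Serious teacher follow they entire between.",
    "meta": '{"name": "Jack Yoder", "age": 75, "address": "083 Diana Parkway Suite 438\\nLake Amberport, AS 76996", "job": "Haematologist"}',
}

print(check_sample_record(record))
```

With the Spark data from the cell above, the same check could be applied to a handful of rows, for example `[check_sample_record(r) for r in data.take(5)]`.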

### 🌠 Get `test`(sample) data `w/ config`
> This might take some time, but you can choose your own data.
- This was also introduced in `ETL_03_create_new_etl_process.ipynb`.

Getting sample data `you want`

In [3]:
from omegaconf import OmegaConf

# load from dict
ETL_config = OmegaConf.create({
    'spark': {
        'appname': 'ETL',
        'driver': {'memory': '16g'},
    },
    'etl': [
        {
            'name': 'data_ingestion___huggingface___hf2raw',
            'args': {'name_or_path': ['ai2_arc', 'ARC-Challenge']}
        },
        {'name': 'utils___sampling___random'}
    ]
})

print(OmegaConf.to_yaml(ETL_config))

spark:
  appname: ETL
  driver:
    memory: 16g
etl:
- name: data_ingestion___huggingface___hf2raw
  args:
    name_or_path:
    - ai2_arc
    - ARC-Challenge
- name: utils___sampling___random



In [4]:
from dataverse.etl import ETLPipeline

etl_pipeline = ETLPipeline()
spark, data = etl_pipeline.run(ETL_config)
print(f"total data # : {data.count()}")
print(f"sample data :")
data.take(1)

23/11/14 19:38:01 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


Found cached dataset ai2_arc (/root/.cache/huggingface/datasets/ai2_arc/ARC-Challenge/1.0.0/1569c2591ea2683779581d9fb467203d9aa95543bb9b75dcfde5da92529fd7f6)


total data # : 280
sample data :


[{'id': 'Mercury_7029645',
  'question': 'Metal atoms will most likely form ions by the',
  'choices': Row(text=['loss of electrons.', 'loss of protons.', 'gain of electrons.', 'gain of protons.'], label=['A', 'B', 'C', 'D']),
  'answerKey': 'A'}]
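
The ETL names used in the config above (`data_ingestion___huggingface___hf2raw`, `utils___sampling___random`) look like three parts joined by triple underscores. Assuming that reading is correct (it is inferred from the names in this notebook, not taken from the library's documentation), a small helper can catch a malformed name before the pipeline runs. `split_etl_name` is a hypothetical helper, not part of `dataverse`.

```python
def split_etl_name(name):
    """Split an ETL name into (category, sub_category, etl_name).

    Assumes the 'category___sub_category___name' pattern seen in this notebook.
    """
    parts = name.split("___")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected 'category___sub_category___name', got {name!r}"
        )
    return tuple(parts)


for etl_name in [
    "data_ingestion___huggingface___hf2raw",
    "utils___sampling___random",
]:
    print(split_etl_name(etl_name))
```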

## 🌌 Test your ETL process
> It's time to test your ETL process with the sample data. Define an ETL process and run it.

In [5]:
from dataverse.etl import ETLPipeline
from dataverse.etl import register_etl

etl_pipeline = ETLPipeline()

# get sample data
spark, data = etl_pipeline.sample()
print(f"total data # : {data.count()}")
print(f"sample data :")
data.take(1)

23/11/14 19:38:06 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


total data # : 100
sample data :


[{'id': 'eec9b075-b786-454c-a398-f69d8cf39739',
  'name': 'test_fake_ufl',
  'text': 'Country toward ago old right.\nNewspaper hotel although short. Hair actually building.\nWe build then blue hundred perform wall.',
  'meta': '{"name": "Michael Aguirre", "age": 18, "address": "8324 Jennings Road Apt. 378\\nLatoyahaven, MT 27716", "job": "Television camera operator"}'}]

In [6]:
@register_etl
def test___your___etl_process(spark, data, *args, **kwargs):
    # add your custom process here
    # here we are going to simply remove 'id' key
    data = data.map(lambda x: {k: v for k, v in x.items() if k != 'id'})

    return data

In [7]:
# test right away
# - successfully removed `id` key
etl = test___your___etl_process
etl()(spark, data).take(1)

[{'name': 'test_fake_ufl',
  'text': 'Country toward ago old right.\nNewspaper hotel although short. Hair actually building.\nWe build then blue hundred perform wall.',
  'meta': '{"name": "Michael Aguirre", "age": 18, "address": "8324 Jennings Road Apt. 378\\nLatoyahaven, MT 27716", "job": "Television camera operator"}'}]

In [8]:
# check that it is registered by fetching it from etl_pipeline
# - successfully removed `id` key
etl = etl_pipeline.get('test___your___etl_process')
etl()(spark, data).take(1)

[{'name': 'test_fake_ufl',
  'text': 'Country toward ago old right.\nNewspaper hotel although short. Hair actually building.\nWe build then blue hundred perform wall.',
  'meta': '{"name": "Michael Aguirre", "age": 18, "address": "8324 Jennings Road Apt. 378\\nLatoyahaven, MT 27716", "job": "Television camera operator"}'}]
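
The per-record logic inside the ETL process is an ordinary Python function, so it can also be checked on a plain list of dicts before Spark is involved. The sketch below assumes records are dicts, as in the outputs above; `drop_id` is a hypothetical name for the same lambda used in the registered process.

```python
def drop_id(record):
    """Return a copy of the record without the 'id' key."""
    return {k: v for k, v in record.items() if k != "id"}


# Hand-written rows with the same shape as the sample data (values shortened).
rows = [
    {"id": "a1", "name": "test_fake_ufl", "text": "first", "meta": "{}"},
    {"id": "b2", "name": "test_fake_ufl", "text": "second", "meta": "{}"},
]

cleaned = list(map(drop_id, rows))

# The transform builds new dicts, so the input rows are left untouched.
assert all("id" not in row for row in cleaned)
assert all("id" in row for row in rows)
print(cleaned[0])
```

Inside the registered process, the same function could be passed to `data.map(drop_id)` in place of the inline lambda.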

## 🌌 Experiments on the data itself
> There is no prescribed way to use this `test`(sample) data; you can do whatever you want with it. Here are some examples.

In [9]:
data.map(lambda x: {**x, 'duck': 'is quarking (physics)'}).take(1)

[{'id': 'eec9b075-b786-454c-a398-f69d8cf39739',
  'name': 'test_fake_ufl',
  'text': 'Country toward ago old right.\nNewspaper hotel although short. Hair actually building.\nWe build then blue hundred perform wall.',
  'meta': '{"name": "Michael Aguirre", "age": 18, "address": "8324 Jennings Road Apt. 378\\nLatoyahaven, MT 27716", "job": "Television camera operator"}',
  'duck': 'is quarking (physics)'}]
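
Another experiment worth trying: the `meta` field in the sample records is a JSON string, so decoding it gives access to the nested values. The sketch below uses only the standard `json` module; `parse_meta` is a hypothetical helper, and the record is copied from the output above with the text shortened.

```python
import json


def parse_meta(record):
    """Return a copy of the record with 'meta' decoded from a JSON string to a dict."""
    return {**record, "meta": json.loads(record["meta"])}


sample = {
    "id": "eec9b075-b786-454c-a398-f69d8cf39739",
    "name": "test_fake_ufl",
    "text": "Country toward ago old right.",
    "meta": '{"name": "Michael Aguirre", "age": 18, "address": "8324 Jennings Road Apt. 378\\nLatoyahaven, MT 27716", "job": "Television camera operator"}',
}

parsed = parse_meta(sample)
print(parsed["meta"]["job"])
```

On the Spark data this could be applied in the same way as the example above, for instance `data.map(parse_meta).take(1)`.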