# How to run ETL
> This notebook shows how to run ETL. There are two steps:

1. Prepare the config
2. Pass the config to `ETLPipeline`

## 1. Prepare the config

In [1]:
from omegaconf import OmegaConf

# E = Extract, T = Transform, L = Load
E_path = "/data/private/dataverse/dataverse/config/etl/sample/data_ingestion___sampling.yaml"
T_path = "/data/private/dataverse/dataverse/config/etl/sample/data_preprocess___dedup.yaml"
L_path = "/data/private/dataverse/dataverse/config/etl/sample/data_load___hf_obj.yaml"

E_config = OmegaConf.load(E_path)
T_config = OmegaConf.load(T_path)
L_config = OmegaConf.load(L_path)


### Extract Config

- load the Hugging Face dataset `RedPajama-Data-1T-Sample`
- convert the Hugging Face dataset to UFL format
    - the UFL format is a `List[Dict]` structure
- sample 1% of the data to reduce the dataset size
- save to parquet at `./sample/pajama_sample_ufl.parquet`
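To make the steps above concrete, here is a minimal pure-Python sketch of a UFL-style `List[Dict]` and a 1% random sample. The field names (`id`, `name`, `text`, `meta`) match the columns of the dataset printed at the end of this notebook. The helpers `make_ufl_rows` and `sample_fraction`, the seed, and the use of `uuid4` are hypothetical and chosen for illustration only; dataverse runs these steps on Spark and its actual implementation may differ.

```python
import random
import uuid
from typing import Dict, List


def make_ufl_rows(texts: List[str], name: str = "red_pajama") -> List[Dict]:
    """Wrap raw texts in UFL-style dicts (illustrative field layout)."""
    return [
        {"id": uuid.uuid4().hex, "name": name, "text": text, "meta": "{}"}
        for text in texts
    ]


def sample_fraction(rows: List[Dict], frac: float, seed: int = 0) -> List[Dict]:
    """Keep each row independently with probability `frac`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < frac]


rows = make_ufl_rows([f"document {i}" for i in range(10_000)])
sample = sample_fraction(rows, frac=0.01)
print(len(rows), len(sample))  # the sample holds roughly 1% of the rows
```

Because each row is kept independently, the sample size is only approximately 1% of the input rather than an exact count.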

In [2]:
print(OmegaConf.to_yaml(E_config))

spark:
  appname: dataverse_etl_sample
  driver:
    memory: 4g
etl:
- name: data_ingestion___red_pajama___hf2ufl
  args:
    name_or_path: togethercomputer/RedPajama-Data-1T-Sample
    repartition: 3
- name: utils___sampling___random
  args:
    sample_n_or_frac: 0.01
- name: data_load___parquet___ufl2parquet
  args:
    save_path: ./sample/pajama_sample_ufl.parquet
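Each entry under `etl` has a `name` made of three parts joined by triple underscores, for example `data_ingestion___red_pajama___hf2ufl`. The sketch below splits such a name into its parts. The labels `category`, `sub_category`, and `name` and the helper `parse_step_name` are assumptions made for illustration; they are not taken from the dataverse API.

```python
from typing import NamedTuple


class StepName(NamedTuple):
    category: str      # e.g. "data_ingestion" (label is an assumption)
    sub_category: str  # e.g. "red_pajama" (label is an assumption)
    name: str          # e.g. "hf2ufl" (label is an assumption)


def parse_step_name(full_name: str) -> StepName:
    """Split a step name of the form a___b___c into its three parts."""
    parts = full_name.split("___")
    if len(parts) != 3:
        raise ValueError(f"expected three parts separated by '___': {full_name!r}")
    return StepName(*parts)


# Step names copied from the Extract config printed above.
for step in (
    "data_ingestion___red_pajama___hf2ufl",
    "utils___sampling___random",
    "data_load___parquet___ufl2parquet",
):
    print(parse_step_name(step))
```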



### Transform Config

- load parquet `./sample/pajama_sample_ufl.parquet`
- deduplicate on the `text` column, using MinHash over 15-grams to estimate Jaccard similarity
- save to parquet `./sample/pajama_preprocess_ufl.parquet`
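For readers unfamiliar with MinHash, the sketch below shows the idea behind the deduplication step: split each text into 15-grams, compute a signature of minimum hash values, and use the fraction of matching signature slots as an estimate of the Jaccard similarity. This is a minimal standard-library illustration, not dataverse's implementation. It assumes character-level 15-grams (the config does not say whether shingles are characters or words), and the function names, the 128 hash functions, and the BLAKE2b-based hashing are all choices made for this example.

```python
import hashlib
from typing import List, Set


def char_ngrams(text: str, n: int = 15) -> Set[str]:
    """Return the set of character n-grams after collapsing whitespace."""
    text = " ".join(text.split())
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Exact Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def minhash_signature(shingles: Set[str], num_perm: int = 128) -> List[int]:
    """For each salted hash function, keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingles
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature slots where both documents share the same minimum."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


a = "The quick brown fox jumps over the lazy dog near the river bank."
b = "The quick brown fox jumps over the lazy dog near the river bend."
sig_a = minhash_signature(char_ngrams(a))
sig_b = minhash_signature(char_ngrams(b))
print(f"exact={jaccard(char_ngrams(a), char_ngrams(b)):.2f} "
      f"estimated={estimated_jaccard(sig_a, sig_b):.2f}")
```

A deduplication pass would then treat pairs whose estimated similarity exceeds some threshold as near-duplicates and keep one of them; the threshold used by the pipeline is not shown in this notebook.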

In [3]:
print(OmegaConf.to_yaml(T_config))

spark:
  appname: dataverse_etl_sample
  driver:
    memory: 16g
etl:
- name: data_ingestion___ufl___parquet2ufl
  args:
    input_paths:
    - ./sample/pajama_sample_ufl.parquet
- name: deduplication___polyglot___minhash
- name: data_load___parquet___ufl2parquet
  args:
    save_path: ./sample/pajama_preprocess_ufl.parquet



### Load Config

- load parquet `./sample/pajama_preprocess_ufl.parquet`
- convert to a Hugging Face dataset and return the object

In [4]:
print(OmegaConf.to_yaml(L_config))

spark:
  appname: dataverse_etl_sample
  driver:
    memory: 16g
etl:
- name: data_ingestion___ufl___parquet2ufl
  args:
    input_paths:
    - ./sample/pajama_preprocess_ufl.parquet
- name: data_load___huggingface___ufl2hf_obj



## 2. Pass the config to ETLPipeline

In [5]:
from dataverse.etl import ETLPipeline

etl_pipeline = ETLPipeline()

In [6]:
# raw -> ufl
etl_pipeline.run(E_config)

# ufl -> dedup -> ufl
etl_pipeline.run(T_config)

# ufl -> hf_obj
dataset = etl_pipeline.run(L_config)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/22 01:19:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found cached dataset red_pajama-data-1_t-sample (/root/.cache/huggingface/datasets/togethercomputer___red_pajama-data-1_t-sample/plain_text/1.0.0/6ea3bc8ec2e84ec6d2df1930942e9028ace8c5b9d9143823cf911c50bbd92039)


Dataset already exists at /root/.cache/dataverse/dataset/huggingface_0213b3219ff7d503.parquet

Downloading and preparing dataset spark/-477983029 to /root/.cache/huggingface/datasets/spark/-477983029/0.0.0...

Dataset spark downloaded and prepared to /root/.cache/huggingface/datasets/spark/-477983029/0.0.0. Subsequent calls will reuse this data.
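The three `run` calls in the cell above are executed in order, and the notebook keeps only the return value of the last one. A minimal sketch of that pattern as a loop follows. The helper `run_stages` and the stand-in `FakePipeline` are hypothetical, introduced so the snippet runs without Spark or dataverse; the only assumption about the real pipeline is that it exposes `run(config)`, as `ETLPipeline` does above.

```python
from typing import Any, Iterable, List, Protocol


class HasRun(Protocol):
    """Anything with a run(config) method, such as ETLPipeline."""

    def run(self, config: Any) -> Any: ...


def run_stages(pipeline: HasRun, configs: Iterable[Any]) -> Any:
    """Run the configs in order and return the result of the last stage."""
    result = None
    for config in configs:
        result = pipeline.run(config)
    return result


class FakePipeline:
    """Stand-in for ETLPipeline that only records which configs it saw."""

    def __init__(self) -> None:
        self.seen: List[Any] = []

    def run(self, config: Any) -> Any:
        self.seen.append(config)
        return f"result of {config}"


pipeline = FakePipeline()
dataset = run_stages(pipeline, ["E_config", "T_config", "L_config"])
print(dataset)  # result of L_config
```

With the real objects from this notebook, the equivalent call would be `run_stages(etl_pipeline, [E_config, T_config, L_config])`.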


In [7]:
dataset

Dataset({
    features: ['id', 'meta', 'name', 'text'],
    num_rows: 9190
})

In [9]:
dataset[777]

{'id': 'ba38ac0a589a11ee8ddfd652e8cce297',
 'meta': '{"pred_label": "__label__cc", "pred_label_prob": 0.5543975830078125, "wiki_prob": 0.4456024169921875, "source": "cc/2020-05/en_middle_0033.json.gz/line427354"}',
 'name': 'red_pajama',
 'text': 'Mechanical drawings of lens mounts: are they “open source”?\nDozens of 3rd party companies manufacture lens adapters from one lens mount to another. Are lens mounts mechanical specifications "open source" or did these companies measure[*] lens mounts dimensions themselves and manufacture their adapters from there? As a side question, if lens mounts specs are not opened, is there any licensing issue in what I would say is "copying" the mounts for commercial purpose?\n[*] measure: unlikely with a vernier caliper and such, but rather with an appropriate 3D scanner, a coordinate measurement machine or a contactless depth probe, etc ...\nlens-mount lens-adapter\ncalocedruscalocedrus\nAre you asking whether the drawings can be shared, or whether th