# How to run ETL
> This notebook shows how to run ETL. There are two steps:

1. Prepare the configs
2. Pass the configs to `ETLPipeline`

## 1. Prepare the configs

In [1]:
from omegaconf import OmegaConf

# E = Extract, T = Transform, L = Load
E_path = "/data/private/dataverse/dataverse/config/etl/sample/data_ingestion___sampling.yaml"
T_path = "/data/private/dataverse/dataverse/config/etl/sample/data_preprocess___dedup.yaml"
L_path = "/data/private/dataverse/dataverse/config/etl/sample/data_load___hf_obj.yaml"

E_config = OmegaConf.load(E_path)
T_config = OmegaConf.load(T_path)
L_config = OmegaConf.load(L_path)
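
The three paths above share one directory, so they can be built from a single base path. The sketch below is only a convenience pattern, not part of dataverse; `CONFIG_DIR`, `CONFIG_FILES`, and `config_paths` are hypothetical names, and the directory is the one used in this notebook, so adjust it to your own checkout.

```python
from pathlib import PurePosixPath

# Directory copied from the cell above; change it to match your environment.
CONFIG_DIR = PurePosixPath("/data/private/dataverse/dataverse/config/etl/sample")

# Stage label -> config file name (file names copied from the cell above).
CONFIG_FILES = {
    "E": "data_ingestion___sampling.yaml",
    "T": "data_preprocess___dedup.yaml",
    "L": "data_load___hf_obj.yaml",
}

# Join the base directory with each file name.
config_paths = {stage: str(CONFIG_DIR / name) for stage, name in CONFIG_FILES.items()}

for stage, path in config_paths.items():
    print(stage, path)
```

Each resulting string can be passed to `OmegaConf.load` exactly like `E_path`, `T_path`, and `L_path`.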


### Extract Config

- Load the Hugging Face dataset `RedPajama-Data-1T-Sample`
- Convert the Hugging Face dataset to UFL format
    - UFL format is a `List[Dict]` structure
- Sample 1% of the data to reduce the dataset size
- Save to parquet at `./sample/pajama_sample_ufl.parquet`
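
To make the `List[Dict]` shape and the 1% sampling concrete, here is a plain-Python sketch. It does not use dataverse or Spark. The field names (`id`, `meta`, `name`, `text`) are taken from the dataset printed at the end of this notebook; the helpers `to_ufl` and `sample_fraction`, the use of `uuid4` for ids, and the exact-count sampling are assumptions made for illustration only.

```python
import random
import uuid


def to_ufl(raw_rows, name):
    """Wrap raw rows as UFL-style records: a list of flat dicts."""
    return [
        {
            "id": uuid.uuid4().hex,             # assumed id scheme, for illustration
            "meta": str(row.get("meta", {})),   # metadata kept as a string
            "name": name,
            "text": row["text"],
        }
        for row in raw_rows
    ]


def sample_fraction(records, frac, seed=0):
    """Return a random subset containing round(len(records) * frac) records."""
    rng = random.Random(seed)
    k = round(len(records) * frac)
    return rng.sample(records, k)


raw = [{"text": f"document {i}", "meta": {"source": "toy"}} for i in range(1000)]
ufl = to_ufl(raw, name="red_pajama")
sampled = sample_fraction(ufl, 0.01)

print(len(ufl), len(sampled))  # 1000 10
```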

In [2]:
print(OmegaConf.to_yaml(E_config))

spark:
  appname: dataverse_etl_sample
  driver:
    memory: 16g
etl:
- name: data_ingestion___red_pajama___hf2ufl
- name: utils___sampling___random
  args:
    sample_n_or_frac: 0.01
- name: data_load___parquet___ufl2parquet
  args:
    save_path: ./sample/pajama_sample_ufl.parquet
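
The step names in this config look like three parts joined by triple underscores, for example `data_ingestion___red_pajama___hf2ufl`. Reading them as `category___sub_category___name` is an inference from the names shown here, not something this notebook states. Under that assumption, a hypothetical helper `split_step_name` could take a name apart like this:

```python
def split_step_name(step_name):
    """Split an ETL step name on '___' into its three assumed parts."""
    parts = step_name.split("___")
    if len(parts) != 3:
        raise ValueError(f"expected 3 parts separated by '___', got {len(parts)}: {step_name!r}")
    category, sub_category, name = parts
    return {"category": category, "sub_category": sub_category, "name": name}


# Step names copied from the Extract config above.
for step in [
    "data_ingestion___red_pajama___hf2ufl",
    "utils___sampling___random",
    "data_load___parquet___ufl2parquet",
]:
    print(split_step_name(step))
```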



### Transform Config

- Load the parquet file `./sample/pajama_sample_ufl.parquet`
- Deduplicate on the `text` column using MinHash over 15-grams (Jaccard similarity)
- Save to parquet at `./sample/pajama_preprocess_ufl.parquet`
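
For readers new to MinHash, the sketch below shows the general idea in plain Python: split each text into 15-grams, compare the sets with Jaccard similarity, and approximate that similarity with a MinHash signature. It is not the code behind `deduplication___polyglot___minhash`. Character-level shingles, `blake2b` as the hash, and 64 hash functions are all assumptions chosen for the example.

```python
import hashlib


def shingles(text, n=15):
    """Return the set of character n-grams in text (the whole text if it is short)."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items, num_perm=64):
    """For each salted hash function, keep the minimum hash value over the set."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = doc_a + "!"  # near-duplicate of doc_a
doc_c = "completely unrelated sentence about parquet files and spark jobs"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))

print("exact a~b:", jaccard(shingles(doc_a), shingles(doc_b)))
print("estimated a~b:", estimate_jaccard(sig_a, sig_b))
print("estimated a~c:", estimate_jaccard(sig_a, sig_c))
```

A deduplication step would then drop one document from each pair whose estimated similarity exceeds a chosen threshold; the threshold used by dataverse is not shown in this notebook.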

In [3]:
print(OmegaConf.to_yaml(T_config))

spark:
  appname: dataverse_etl_sample
  driver:
    memory: 16g
etl:
- name: data_ingestion___ufl___parquet2ufl
  args:
    input_paths:
    - ./sample/pajama_sample_ufl.parquet
- name: deduplication___polyglot___minhash
- name: data_load___parquet___ufl2parquet
  args:
    save_path: ./sample/pajama_preprocess_ufl.parquet



### Load Config

- Load the parquet file `./sample/pajama_preprocess_ufl.parquet`
- Convert to a Hugging Face dataset and return the object

In [4]:
print(OmegaConf.to_yaml(L_config))

spark:
  appname: dataverse_etl_sample
  driver:
    memory: 16g
etl:
- name: data_ingestion___ufl___parquet2ufl
  args:
    input_paths:
    - ./sample/pajama_preprocess_ufl.parquet
- name: data_load___huggingface___ufl2hf_obj



## 2. Pass the configs to ETLPipeline

In [5]:
from dataverse.etl import ETLPipeline

etl_pipeline = ETLPipeline()

In [6]:
# raw -> ufl
etl_pipeline.run(E_config)

# ufl -> dedup -> ufl
etl_pipeline.run(T_config)

# ufl -> hf_obj
dataset = etl_pipeline.run(L_config)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/22 02:31:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found cached dataset red_pajama-data-1_t-sample (/root/.cache/huggingface/datasets/togethercomputer___red_pajama-data-1_t-sample/plain_text/1.0.0/6ea3bc8ec2e84ec6d2df1930942e9028ace8c5b9d9143823cf911c50bbd92039)


Dataset already exists at /root/.cache/dataverse/dataset/huggingface_0213b3219ff7d503.parquet

Downloading and preparing dataset spark/2077475119 to /root/.cache/huggingface/datasets/spark/2077475119/0.0.0...

Dataset spark downloaded and prepared to /root/.cache/huggingface/datasets/spark/2077475119/0.0.0. Subsequent calls will reuse this data.
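
Each config above holds an ordered `etl` list of steps, each with a `name` and optional `args`. As a mental model only, the toy runner below shows how such a list can drive a sequence of functions. It is not how `ETLPipeline` is implemented; `REGISTRY`, `register`, `run`, and the `toy___*` step names are hypothetical.

```python
REGISTRY = {}


def register(name):
    """Decorator that stores a step function under a config-friendly name."""
    def decorator(func):
        REGISTRY[name] = func
        return func
    return decorator


@register("toy___make_rows")
def make_rows(data, n=5):
    # Ingestion-style step: ignores its input and produces rows with repeated texts.
    return [{"text": f"row {i % 3}"} for i in range(n)]


@register("toy___dedup_text")
def dedup_text(data):
    # Keep only the first row for each distinct text.
    seen, unique = set(), []
    for row in data:
        if row["text"] not in seen:
            seen.add(row["text"])
            unique.append(row)
    return unique


def run(config, data=None):
    """Apply each configured step in order, feeding each output to the next step."""
    for step in config["etl"]:
        func = REGISTRY[step["name"]]
        data = func(data, **step.get("args", {}))
    return data


config = {
    "etl": [
        {"name": "toy___make_rows", "args": {"n": 5}},
        {"name": "toy___dedup_text"},
    ]
}

result = run(config)
print(len(result))  # 3
```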


In [7]:
dataset

Dataset({
    features: ['id', 'meta', 'name', 'text'],
    num_rows: 9150
})
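
In the record printed in the next cell, `meta` is a string that contains a Python dict literal rather than an actual dict. If you need the individual fields, one option is `ast.literal_eval` from the standard library. The sketch below uses values copied from that record and leaves out the `text` field for brevity; whether every record's `meta` parses this way is an assumption.

```python
import ast

# Values copied from the sample record in this notebook (text field omitted).
record = {
    "id": "c126e4e658a411eeb745d652e8cce297",
    "meta": "{'timestamp': '2019-04-23T03:54:30Z', 'url': 'https://www.globaltuners.com/forum/thread/1924?page=1', 'language': 'en', 'source': 'c4'}",
    "name": "red_pajama",
}

# literal_eval only evaluates Python literals, so it is safer than eval here.
meta = ast.literal_eval(record["meta"])

print(meta["source"])    # c4
print(meta["language"])  # en
```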

In [8]:
dataset[777]

{'id': 'c126e4e658a411eeb745d652e8cce297',
 'meta': "{'timestamp': '2019-04-23T03:54:30Z', 'url': 'https://www.globaltuners.com/forum/thread/1924?page=1', 'language': 'en', 'source': 'c4'}",
 'name': 'red_pajama',
 'text': "We've been working on a completely new mobile app for GlobalTuners and we have just released the first Android version for public beta testing. A version for iOS will hopefully follow later.\nAt the moment the app provides just the basic functionality to listen to and control the receivers, and we would like your feedback. Do you think it's missing something, something is confusing or just broken?\nA Premium Membership is needed to be able to use the new app. If you do not have a premium subscription and are not sharing a receiver, you can get a premium subscription via the app and your Google Play account at a discounted price (about 4 euros per month).\nIf you're interested in testing the new app, you can opt in for the beta test at https://play.google.com/apps/te