# ETL in one cycle
> ETL normally runs as three separate steps (E, T, L), but it can also run as a single cycle.

This notebook takes the 3 configs from `ETL_how_to_run.ipynb` and merges them into one config file.

## 1. Prepare the config

In [1]:
from omegaconf import OmegaConf

# E = Extract, T = Transform, L = Load
ETL_path = "/data/private/dataverse/dataverse/config/etl/sample/ETL___one_cycle.yaml"

ETL_config = OmegaConf.load(ETL_path)

### ETL Config
> One cycle from raw data to a Hugging Face dataset

- Load the Hugging Face dataset `RedPajama-Data-1T-Sample`
- Convert the Hugging Face dataset to UFL format
    - a UFL dataset is structured as `List[Dict]`
- Sample 1% of the data to reduce the dataset size
- Deduplicate on the `text` column using 15-gram MinHash Jaccard similarity
- Convert back to a Hugging Face dataset and return the object
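To make the sampling and deduplication steps above more concrete, here is a minimal pure-Python sketch on `List[Dict]` rows. It is not dataverse's implementation: the function names, the use of character-level 15-grams, the 64 hash functions, and the 0.8 similarity threshold are all assumptions made for illustration, and the pairwise comparison would not scale the way a Spark job does.

```python
import hashlib
import random
from typing import Dict, List, Set


def sample_fraction(rows: List[Dict], frac: float, seed: int = 42) -> List[Dict]:
    """Keep roughly `frac` of the rows (exact count here, unlike Spark's approximate sampling)."""
    if not rows:
        return []
    rng = random.Random(seed)
    k = max(1, round(len(rows) * frac))
    return rng.sample(rows, k)


def char_ngrams(text: str, n: int = 15) -> Set[str]:
    """Set of character n-grams; texts shorter than n yield a single gram."""
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_signature(text: str, num_perm: int = 64, n: int = 15) -> List[int]:
    """One minimum hash value per salted hash function."""
    grams = char_ngrams(text, n)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")  # a different salt acts as a different hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "little",
                )
                for g in grams
            )
        )
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions estimates the Jaccard similarity of the n-gram sets."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def dedup_by_text(rows: List[Dict], threshold: float = 0.8) -> List[Dict]:
    """Keep a row only if it is not too similar to any row already kept (O(n^2), sketch only)."""
    kept, kept_sigs = [], []
    for row in rows:
        sig = minhash_signature(row["text"])
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(row)
            kept_sigs.append(sig)
    return kept


rows = [
    {"text": "The equal detour point is a triangle center."},
    {"text": "The equal detour point is a triangle center."},
    {"text": "Spark runs every ETL step inside a single session."},
]
deduped = dedup_by_text(rows)
sampled = sample_fraction([{"text": str(i)} for i in range(100)], 0.1)
print(len(deduped), len(sampled))
```

The exact duplicate produces an identical signature and is dropped, while the unrelated sentence shares no 15-grams with the first one and is kept.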

In [2]:
print(OmegaConf.to_yaml(ETL_config))

spark:
  appname: dataverse_etl_sample
  driver:
    memory: 16g
etl:
- name: data_ingestion___red_pajama___hf2ufl
- name: utils___sampling___random
  args:
    sample_n_or_frac: 0.01
- name: deduplication___polyglot___minhash
- name: data_load___huggingface___ufl2hf_obj
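The config printed above could also be assembled in code from separate per-stage configs. The sketch below uses plain dictionaries; how the steps were split across the three original configs is an assumption, and `merge_stage_configs` is a hypothetical helper, not part of dataverse.

```python
import copy
from typing import Dict, List

spark_settings = {"appname": "dataverse_etl_sample", "driver": {"memory": "16g"}}

# Assumed split of the four steps into E, T and L stages.
extract_cfg = {
    "spark": spark_settings,
    "etl": [{"name": "data_ingestion___red_pajama___hf2ufl"}],
}
transform_cfg = {
    "spark": spark_settings,
    "etl": [
        {"name": "utils___sampling___random", "args": {"sample_n_or_frac": 0.01}},
        {"name": "deduplication___polyglot___minhash"},
    ],
}
load_cfg = {
    "spark": spark_settings,
    "etl": [{"name": "data_load___huggingface___ufl2hf_obj"}],
}


def merge_stage_configs(*stage_configs: Dict) -> Dict:
    """Keep the first stage's Spark settings and concatenate the `etl` step lists in order."""
    merged = {"spark": copy.deepcopy(stage_configs[0]["spark"]), "etl": []}
    for cfg in stage_configs:
        merged["etl"].extend(copy.deepcopy(cfg["etl"]))
    return merged


one_cycle = merge_stage_configs(extract_cfg, transform_cfg, load_cfg)
step_names: List[str] = [step["name"] for step in one_cycle["etl"]]
print(step_names)
```

Passing `one_cycle` to `OmegaConf.create` should give a config object with the same structure as the one loaded from the YAML file.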



## 2. Pass the config to ETLPipeline

In [3]:
from dataverse.etl import ETLPipeline

etl_pipeline = ETLPipeline()

In [4]:
# raw -> hf_obj
dataset = etl_pipeline.run(ETL_config)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/22 02:24:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/09/22 02:24:19 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Found cached dataset red_pajama-data-1_t-sample (/root/.cache/huggingface/datasets/togethercomputer___red_pajama-data-1_t-sample/plain_text/1.0.0/6ea3bc8ec2e84ec6d2df1930942e9028ace8c5b9d9143823cf911c50bbd92039)


Dataset already exists at /root/.cache/dataverse/dataset/huggingface_0213b3219ff7d503.parquet

Downloading and preparing dataset spark/-714991180 to /root/.cache/huggingface/datasets/spark/-714991180/0.0.0...

Dataset spark downloaded and prepared to /root/.cache/huggingface/datasets/spark/-714991180/0.0.0. Subsequent calls will reuse this data.
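Conceptually, a single-cycle run walks the `etl` list in order and feeds each step's output into the next step. The toy sketch below illustrates that idea with a hypothetical name-to-function registry; the `toy___...` step names, `register`, and `run_pipeline` are invented for illustration and do not reflect how `ETLPipeline.run` is actually implemented.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical registry mapping a step name to the function that implements it.
REGISTRY: Dict[str, Callable[..., Any]] = {}


def register(name: str) -> Callable:
    """Decorator that stores a function in the registry under `name`."""
    def decorator(func: Callable) -> Callable:
        REGISTRY[name] = func
        return func
    return decorator


@register("toy___ingest___numbers")
def ingest(data: Optional[List[Dict]] = None, n: int = 10) -> List[Dict]:
    # The first step ignores its input and produces rows.
    return [{"id": i, "text": f"row {i}"} for i in range(n)]


@register("toy___filter___even_ids")
def even_ids(data: List[Dict]) -> List[Dict]:
    return [row for row in data if row["id"] % 2 == 0]


def run_pipeline(config: Dict) -> Any:
    """Run each configured step in order, passing the previous result along."""
    data = None
    for step in config["etl"]:
        func = REGISTRY[step["name"]]
        data = func(data, **step.get("args", {}))
    return data


toy_config = {
    "etl": [
        {"name": "toy___ingest___numbers", "args": {"n": 6}},
        {"name": "toy___filter___even_ids"},
    ]
}
result = run_pipeline(toy_config)
print([row["id"] for row in result])
```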


In [5]:
dataset

Dataset({
    features: ['id', 'meta', 'name', 'text'],
    num_rows: 9150
})
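In the record printed in the next cell, the `meta` column is a string that looks like a Python dict rather than a nested object. A minimal sketch of turning it back into a dictionary with the standard library's `ast.literal_eval`, using the `meta` value from that record:

```python
import ast

# `meta` value copied from the record shown in this notebook.
meta_raw = (
    "{'title': 'Equal detour point', "
    "'url': 'https://en.wikipedia.org/wiki/Equal%20detour%20point', "
    "'language': 'en', 'timestamp': '20230320'}"
)

# literal_eval only evaluates Python literals, so it is safer than eval for this.
meta = ast.literal_eval(meta_raw)
print(meta["title"], meta["language"])
```

`json.loads` would reject this string because it uses single quotes, which is why `ast.literal_eval` is used here.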

In [6]:
dataset[777]

{'id': '178591d058a411eeba67d652e8cce297',
 'meta': "{'title': 'Equal detour point', 'url': 'https://en.wikipedia.org/wiki/Equal%20detour%20point', 'language': 'en', 'timestamp': '20230320'}",
 'name': 'red_pajama',
 'text': 'The equal detour point is a triangle center with the Kimberling number X(176). It is characterized by the equal detour property, that is if you travel from any vertex of a triangle  to another by taking a detour through some inner point  then the additional distance travelled is constant. This means the following equation has to hold:\n\nThe equal detour point is the only point with the equal detour property if and only if the following inequality holds for the angles  of the triangle :\n\nIf the inequality does not hold, then the isoperimetric point possesses the equal detour property as well.\n\nThe equal detour point, isoperimetric point, the incenter and the Gergonne point of a triangle are collinear, that is all four points lie on a common line. Furthermore, 