# ETL one cycle
> Normally ETL is processed in 3 separate steps, E, T, L :) but we can also run it in one cycle, ETL.

We are going to take the 3 configs from `ETL_how_to_run.ipynb` and merge them into one config file.

## 🌌 1. prepare config

In [1]:
import os
from pathlib import Path
from dataverse.config import Config 
from omegaconf import OmegaConf

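# E = Extract, T = Transform, L = Load
# repository root, two levels up from the current working directory
main_path = Path(os.path.abspath('../..'))
ETL_path = main_path / "dataverse/config/etl/sample/ETL___one_cycle.yaml"

# load the merged one-cycle config
ETL_config = Config.load(ETL_path)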

### 🌠 ETL Config
> One cycle from raw data to a Hugging Face dataset

- generate fake UFL data as the input
- sample 10% of the total data to reduce the size of the dataset
- deduplicate on the `text` column using 15-gram MinHash Jaccard similarity
- convert to a Hugging Face dataset and return the object
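To make the deduplication step above more concrete, here is a minimal sketch of the quantity that MinHash approximates: the Jaccard similarity between two sets of n-grams. This is not the dataverse implementation. It computes the exact similarity instead of a MinHash estimate, it assumes word-level n-grams, and the helper names `word_ngrams` and `jaccard_similarity` are hypothetical. The example uses `n=3` only because the sample texts are short.

```python
def word_ngrams(text, n):
    """Return the set of word-level n-grams in `text` (assumed tokenization: whitespace)."""
    tokens = text.split()
    if len(tokens) < n:
        # Text shorter than n words: treat the whole text as a single n-gram.
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_similarity(a, b, n=3):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| over the n-gram sets of two texts."""
    set_a, set_b = word_ngrams(a, n), word_ngrams(b, n)
    if not set_a and not set_b:
        return 1.0  # two empty texts are considered identical
    return len(set_a & set_b) / len(set_a | set_b)


same = jaccard_similarity("a b c d", "a b c d")
near = jaccard_similarity("a b c d", "a b c e")
diff = jaccard_similarity("a b c d", "w x y z")
print(same, near, diff)
```

Documents whose similarity exceeds some threshold would be treated as duplicates. MinHash with LSH exists to avoid comparing every pair of documents exactly, which this sketch does not attempt.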

In [2]:
print(OmegaConf.to_yaml(ETL_config))

spark:
  appname: dataverse_etl_sample
  driver:
    memory: 16g
etl:
- name: data_ingestion___test___generate_fake_ufl
- name: utils___sampling___random
  args:
    sample_n_or_frac: 0.1
- name: deduplication___minhash___lsh_jaccard
- name: data_load___huggingface___ufl2hf_obj
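Each step name in the config above appears to consist of three parts joined by a triple underscore. A small sketch for splitting a name into those parts is shown here. The reading of the parts as category, sub-category, and step name is an assumption inferred from the four names in this config, and `split_step_name` is a hypothetical helper, not part of dataverse.

```python
def split_step_name(name):
    """Split an ETL step name on '___' into its three assumed parts."""
    parts = name.split("___")
    if len(parts) != 3:
        raise ValueError(f"expected 3 parts separated by '___', got {len(parts)}: {name!r}")
    category, subcategory, step = parts
    return {"category": category, "subcategory": subcategory, "step": step}


# Step names copied from the config printed above.
step_names = [
    "data_ingestion___test___generate_fake_ufl",
    "utils___sampling___random",
    "deduplication___minhash___lsh_jaccard",
    "data_load___huggingface___ufl2hf_obj",
]

for step_name in step_names:
    print(split_step_name(step_name))
```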



## 🌌 2. pass config to ETLPipeline

In [3]:
from dataverse.etl import ETLPipeline

etl_pipeline = ETLPipeline()

In [4]:
# raw -> hf_obj
spark, dataset = etl_pipeline.run(ETL_config)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/14 19:01:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/11/14 19:01:30 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).

Downloading and preparing dataset spark/-1276752201 to /root/.cache/huggingface/datasets/spark/-1276752201/0.0.0...

Dataset spark downloaded and prepared to /root/.cache/huggingface/datasets/spark/-1276752201/0.0.0. Subsequent calls will reuse this data.


In [5]:
dataset

Dataset({
    features: ['id', 'meta', 'name', 'text'],
    num_rows: 14
})

In [6]:
dataset[0]

{'id': 'f073d6d2-20c4-4244-8c5b-e47799840bc7',
 'meta': '{"name": "Emma Collins", "age": 37, "address": "4980 Thompson Plains\\nSouth Kimberly, NV 38999", "job": "Phytotherapist"}',
 'name': 'test_fake_ufl',
 'text': 'Agency tend rock teacher body collection spend. Thing surface close pretty.'}
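Note that `meta` in the record above is a JSON-encoded string, not a nested dict, so it has to be decoded before its keys can be read. A minimal sketch using only the standard library, with the record literal copied from the output above:

```python
import json

# Record copied from the `dataset[0]` output above.
record = {
    'id': 'f073d6d2-20c4-4244-8c5b-e47799840bc7',
    'meta': '{"name": "Emma Collins", "age": 37, "address": "4980 Thompson Plains\\nSouth Kimberly, NV 38999", "job": "Phytotherapist"}',
    'name': 'test_fake_ufl',
    'text': 'Agency tend rock teacher body collection spend. Thing surface close pretty.',
}

# Decode the JSON string into a regular Python dict.
meta = json.loads(record['meta'])

print(meta['job'])
# The escaped "\n" inside the JSON string becomes a real newline after decoding.
print(meta['address'].splitlines())
```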