# Data Collection for AI Model

This notebook collects reasoning-related datasets from Hugging Face and saves them to a specified directory, using the Hugging Face `datasets` library for the downloads. The datasets to download are:
1. [nvidia/HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2)
2. [Magpie-Align/Magpie-Reasoning-150K](https://huggingface.co/datasets/Magpie-Align/Magpie-Reasoning-150K)
3. [KingNish/reasoning-base-20k](https://huggingface.co/datasets/KingNish/reasoning-base-20k)
4. [SkunkworksAI/reasoning-0.01](https://huggingface.co/datasets/SkunkworksAI/reasoning-0.01)

### Dataset Configuration

This section defines the dataset name, the save path for the raw download, and the output path for the processed files.

- **Dataset Name:** `Magpie-Align/Magpie-Reasoning-150K`
- **Save Path:** `data/raw/Magpie-Align/Magpie-Reasoning-150K`
- **Output Path:** `data/processed/Magpie-Align/Magpie-Reasoning-150K`

These variables will be used in the subsequent steps for downloading and processing the dataset.

In [None]:
dataset_name = "Magpie-Align/Magpie-Reasoning-150K"
save_path = f"data/raw/{dataset_name}"
output_path = f"data/processed/{dataset_name}"
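
Because the dataset name contains a slash (`Magpie-Align/Magpie-Reasoning-150K`), each path above resolves to a nested directory: a namespace folder with the dataset folder inside it. The following is a minimal sketch of deriving the same two paths with `pathlib` and creating them up front. The helper names `build_paths` and `ensure_dirs` and the `root` parameter are hypothetical and are not part of the project.

```python
from pathlib import Path


def build_paths(dataset_name: str, root: str = "data") -> tuple[Path, Path]:
    """Return (raw, processed) directories for a Hugging Face dataset ID.

    Hypothetical helper: mirrors the f-strings in the cell above.
    """
    raw = Path(root) / "raw" / dataset_name
    processed = Path(root) / "processed" / dataset_name
    return raw, processed


def ensure_dirs(*paths: Path) -> None:
    """Create each directory, including the intermediate namespace folder."""
    for path in paths:
        # parents=True is needed because the dataset ID adds two directory levels.
        path.mkdir(parents=True, exist_ok=True)


raw_dir, processed_dir = build_paths("Magpie-Align/Magpie-Reasoning-150K")
```

Calling `ensure_dirs(raw_dir, processed_dir)` before downloading would create both directory trees. Whether `DataCollector` already does this internally is not stated here.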

### Data Collection and Transformation

In this section, we use the `DataCollector` class to manage the dataset. We instantiate it with `dataset_name` and `save_path`, then call its `execute` method twice:

1. First, with the argument `"download"` to download the dataset.
2. Second, with the argument `"convert_parquet"` to transform the dataset into Parquet format and save it to the specified `output_path`.

In [None]:
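from src.preprocessing.data_collection import DataCollector

# Bind the collector to the dataset and its raw download directory.
manager = DataCollector(dataset_name, save_path)

# Step 1: download the dataset.
manager.execute("download")
# Step 2: convert the dataset to Parquet and save it to output_path.
manager.execute("convert_parquet", output_path)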
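
The implementation of `DataCollector` is not shown in this notebook. As a rough illustration of what a two-action collector of this shape might look like, here is a sketch built on the `datasets` library. The class name `SketchCollector`, its private method names, and the one-Parquet-file-per-split layout are all assumptions made for illustration, not the project's actual code.

```python
from pathlib import Path


class SketchCollector:
    """Hypothetical stand-in for DataCollector (assumed design, not the real class)."""

    def __init__(self, dataset_name: str, save_path: str):
        self.dataset_name = dataset_name
        self.save_path = Path(save_path)

    def execute(self, action: str, *args):
        # Dispatch the action string to the private method of the same name.
        handler = getattr(self, f"_{action}", None)
        if handler is None:
            raise ValueError(f"Unknown action: {action!r}")
        return handler(*args)

    def _download(self):
        # Imported lazily so the class can be defined without `datasets` installed.
        from datasets import load_dataset

        dataset = load_dataset(self.dataset_name)  # DatasetDict keyed by split
        self.save_path.mkdir(parents=True, exist_ok=True)
        dataset.save_to_disk(str(self.save_path))
        return self.save_path

    def _convert_parquet(self, output_path: str):
        from datasets import load_from_disk

        out_dir = Path(output_path)
        out_dir.mkdir(parents=True, exist_ok=True)
        dataset = load_from_disk(str(self.save_path))
        written = []
        # Assumed layout: one Parquet file per split, e.g. train.parquet.
        for split, data in dataset.items():
            target = out_dir / f"{split}.parquet"
            data.to_parquet(str(target))
            written.append(target)
        return written
```

With this sketch, `SketchCollector(dataset_name, save_path).execute("download")` followed by `.execute("convert_parquet", output_path)` would mirror the two calls above, and an unrecognized action raises `ValueError` instead of failing silently.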