### **arXiv Instruct Synthetic Dataset**

This Jupyter notebook generates a synthetic instruction-following dataset based on arXiv full-text chunks. It leverages the Together API to process scientific paper segments and create instruction-output pairs for various tasks such as summarization, question answering, and information extraction.

The notebook is designed to efficiently process large amounts of text data and can be easily adapted for different models or instruction types. It also demonstrates how to upload the generated dataset to Hugging Face Hub for easy sharing and distribution.

The notebook includes the following key components:

1. **Setup and Configuration**:
   - Installs necessary libraries (`together`, `datasets`, `tqdm`, `python-dotenv`)
   - Imports required modules and sets up API authentication for Hugging Face and Together AI

2. **Data Loading**:
   - Loads the arXiv full-text chunked dataset from Hugging Face

3. **Data Processing**:
   - Utilizes multiprocessing for efficient chunk processing
   - Applies random selection of instruction templates (summarization, question answering, information extraction)
   - Generates outputs using the Together AI API with the `Mixtral-8x22B-Instruct` model

4. **Output Generation**:
   - Creates a JSONL file containing the synthetic instruction-following dataset

5. **Dataset Upload**:
   - Converts the JSONL data to a Hugging Face Dataset format
   - Uploads the dataset to the Hugging Face Hub

The `utils.py` file contains supporting functions:
- Instruction templates for different tasks
- Functions for interacting with the Together AI API
- A `process_chunk` function that generates a datapoint for each text chunk
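
`utils.py` itself is not reproduced in this notebook, so the following is only a minimal sketch of how such a module could be laid out. The template wording, the `TEMPLATES` list, the `text_key` parameter and the `generate` callable are all assumptions made for illustration. The `process_chunk` used later takes the Together client as its second argument; here a plain callable stands in for it so the sketch runs offline.

```python
import random

# Hypothetical template wording; the real templates live in utils.py.
summarization_template = "Summarize the following passage from a scientific paper."
question_answering_template = "Write a question about the following passage and answer it."
information_extraction_template = "List the key entities and findings in the following passage."

TEMPLATES = [
    summarization_template,
    question_answering_template,
    information_extraction_template,
]


def process_chunk(item, generate, text_key="text", rng=random):
    """Build one instruction/input/output record for a single text chunk.

    `generate` is any callable mapping a prompt string to a completion string.
    `text_key` is an assumed column name for the chunk text.
    """
    instruction = rng.choice(TEMPLATES)  # random selection of the task type
    chunk = item[text_key]
    prompt = f"{instruction}\n\n{chunk}"
    return {"instruction": instruction, "input": chunk, "output": generate(prompt)}


def fake_generate(prompt):
    # Stand-in for the API call so the sketch needs no network access.
    return f"[model output for a {len(prompt)}-character prompt]"


record = process_chunk({"text": "We study sparse attention in transformers."}, fake_generate)
print(sorted(record))
```

In the notebook, `generate` would be replaced by a call to the Together client with the chosen model.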

Author: Amr Achraf  
Created Date: 2024-06-13  
Updated Date: 2024-09-27  
Version: 2.0

In [1]:
!pip install -q together datasets tqdm python-dotenv

In [8]:
import os

from dotenv import load_dotenv
from datasets import load_dataset
from huggingface_hub import login

# Read API tokens from a local .env file into the environment.
load_dotenv()

# The Hugging Face token is expected under the `hf` key.
hfToken = os.getenv('hf')
login(token=hfToken, add_to_git_credential=True, new_session=False)

In [9]:
togetherToken = os.getenv('togetherAPI')

In [10]:
data = load_dataset("amrachraf/arXiv-full-text-chunked", "chunk_4", split="train")

In [11]:
print(data)

In [18]:
import multiprocessing
from concurrent.futures import ProcessPoolExecutor, as_completed
from utils import *
from together import Together
import json
from tqdm import tqdm
import logging

logging.basicConfig(level=logging.INFO)
client = Together(api_key=togetherToken)
templates = [summarization_template, question_answering_template, information_extraction_template]

total_chunks = len(data)
synthetic_data = []

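# One worker process per CPU core; each task turns one chunk into one datapoint.
with ProcessPoolExecutor(max_workers=multiprocessing.cpu_count()) as executor:
    futures = {executor.submit(process_chunk, item, client): item for item in data}

    with tqdm(total=total_chunks, desc="Processing chunks", unit="chunk") as pbar:
        for future in as_completed(futures):
            try:
                # The future is already finished here, so no timeout is needed.
                synthetic_data.append(future.result())
            except Exception as e:
                # Log and skip a failed chunk instead of aborting the whole run.
                logging.error(f"Chunk failed: {e}")
            pbar.update(1)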
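
The cell above spends most of its time waiting on API responses, and `ProcessPoolExecutor` must pickle every argument, including `client`, to send it to a worker process. One possible alternative, which is not what this notebook does, is a thread pool: threads share the client object and suit I/O-bound work. The sketch below is a hedged illustration only; `run_in_threads` and `demo_worker` are hypothetical names, and `demo_worker` stands in for `process_chunk`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_threads(worker, items, max_workers=8):
    """Apply `worker` to every item concurrently; collect results and failures."""
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(worker, item): item for item in items}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception as exc:
                # Keep the failing item so it can be retried or inspected later.
                failures.append((futures[future], exc))
    return results, failures


def demo_worker(item):
    # Stand-in for process_chunk; fails on empty text to show error capture.
    if not item["text"]:
        raise ValueError("empty chunk")
    return {"input": item["text"], "output": item["text"].upper()}


results, failures = run_in_threads(
    demo_worker, [{"text": "a"}, {"text": ""}, {"text": "b"}]
)
print(len(results), len(failures))
```

Results arrive in completion order, not input order, which is also true of the process-based cell.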

In [21]:
with open("synthetic_finetuning_data.jsonl", "w") as f:
    for item in synthetic_data:
        f.write(json.dumps(item) + "\n")
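
Before uploading, it can be worth checking that every record carries the three fields the later cells rely on (`instruction`, `input`, `output`). The following is a minimal sketch under that assumption; `load_valid_records` and `REQUIRED_KEYS` are hypothetical helpers, not part of the notebook or of `utils.py`, and the demo writes to a temporary file, not to `synthetic_finetuning_data.jsonl`.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("instruction", "input", "output")


def load_valid_records(path):
    """Read a JSONL file and keep only records whose required fields are non-empty strings."""
    valid, skipped = [], 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            record = json.loads(line)
            ok = isinstance(record, dict) and all(
                isinstance(record.get(key), str) and record[key].strip()
                for key in REQUIRED_KEYS
            )
            if ok:
                valid.append(record)
            else:
                skipped += 1
    return valid, skipped


# Demo on a temporary file so the real output file is left untouched.
rows = [
    {"instruction": "Summarize.", "input": "Some text.", "output": "A summary."},
    {"instruction": "Summarize.", "input": "Some text.", "output": ""},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for row in rows:
        tmp.write(json.dumps(row) + "\n")

valid, skipped = load_valid_records(tmp.name)
os.remove(tmp.name)
print(len(valid), skipped)
```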

In [73]:
print(synthetic_data[5]['instruction'])

In [40]:
data_file = "synthetic_finetuning_data.jsonl"
hf_username = "amrachraf"
hf_dataset_name = "arXiv-full-text-synthetic-instruct-tune"

with open(data_file, "r") as f:
    data = [json.loads(line) for line in f]

In [41]:
data_jsonl = {
    "instruction": [d["instruction"] for d in data],
    "input": [d["input"] for d in data],
    "output": [d["output"] for d in data],
}

In [44]:
from huggingface_hub import HfApi
from datasets import Dataset, DatasetDict

# The token was already registered by login() in the setup cell,
# so it does not need to be saved again here.

In [45]:
try:
    dataset = Dataset.from_dict(data_jsonl)
    dataset_dict = DatasetDict({"train": dataset})

    repo_name = f'{hf_username}/{hf_dataset_name}'
    hf_api = HfApi()

    hf_api.create_repo(repo_name, repo_type="dataset", exist_ok=True)

    dataset_dict.push_to_hub(repo_name, private=False, token=hfToken)

except Exception as e:
    print(f"Failed to upload dataset {hf_dataset_name}: {e}")