# Full Stack Practice of LLM Training - LLM Data Curation @ RLChina 2024

- Authors: [Cheng Deng](https://www.cdeng.net/) [✉️](mailto:davendw49@gmail.com), [Jun Wang](http://www0.cs.ucl.ac.uk/staff/jun.wang/)

---
## Main Task

In this section, we will set up data preprocessing, including improving data quality and deduplication, in preparation for large language model training. Following this, we will develop the input pipeline by implementing a text tokenizer and creating a custom PyTorch DataLoader specifically designed for our LLM.

![](https://www.cdeng.net/resources/imgs/RLChina24/a.png)

Here is the prerequisite knowledge required:

- Python File Processing
- Regex

First, let's do some of the preparation, including the essential package installation. 😁

In [None]:
!pip install sciparser
!pip install datasets

## 1. Choose Data

We start, as many tutorials do, with how to gather data. Choosing the right data for LLM (Large Language Model) pretraining is crucial for model performance, relevance, and generalization. Pretraining data directly shapes the model's knowledge, linguistic understanding, and contextual awareness, so selecting diverse, high-quality, and appropriately sized datasets is essential.

Here's a guide to help you make informed decisions when choosing data for LLM pretraining:

### Key Criteria for Choosing Data

1. **Domain Relevance**
   - **General Purpose Models** (like GPT-3, BERT): Use diverse, broad-domain datasets covering a wide variety of topics, styles, and genres (e.g., books, websites, news articles).
   - **Domain-Specific Models** (like medical, legal, or geoscience models): Use datasets specific to the target domain, ensuring that the language and concepts relevant to the domain are well-represented.

2. **Size and Diversity**
   - Pretraining LLMs generally requires **massive datasets** (terabytes of data). The data should be diverse in terms of:
     - **Topics**: A wide range of subjects (e.g., science, technology, arts, history).
     - **Genres**: Books, blogs, forums, research papers, news articles, etc.
     - **Languages**: If creating a multilingual model, ensure a balanced mix of languages.
   - **High-Quality Sources**: Collect data from sources known for high-quality language use (e.g., books, academic papers, Wikipedia).

3. **Data Quality**
   - **Clean and Well-Formatted**: Data should be well-structured, with minimal noise (e.g., broken sentences, unrecognized characters, incomplete text).
   - **Balanced**: Avoid over-representation of certain topics, genres, or styles that might bias the model.
   - **Appropriate Length**: Use documents that allow the model to learn from long contexts, such as entire books or articles.

4. **Ethical Considerations**
   - **Bias and Fairness**: Make sure that the dataset isn’t skewed towards certain demographics, cultures, or ideologies. Models trained on biased data can reflect or even amplify these biases.
   - **Sensitive Data**: Avoid including personally identifiable information (PII) or confidential/sensitive data. This is crucial for privacy and security.
   - **Diversity**: Include datasets representing different social groups, cultures, and backgrounds to ensure fairness and inclusivity in the model.

5. **Scalability and Accessibility**
   - Ensure that the data can scale to the size required for pretraining (typically in the range of hundreds of billions of tokens).
   - Use publicly available datasets or data sources that have open licenses, so that the data can be used and redistributed without legal restrictions.

In this notebook, we walk through how to train a model that classifies data by quality. This involves several key steps: defining what "quality" means for your specific use case, gathering labeled training data, and choosing the right machine learning model.


First, we **Define Data Quality Criteria**

Before starting the process, it's essential to define what "quality" means in the context of your dataset. Common criteria for data quality include:

- **Completeness**: Whether the data is missing any critical components.
- **Consistency**: Whether the data follows the same format or structure across all entries.
- **Accuracy**: Whether the data accurately represents the real-world entities it refers to.
- **Validity**: Whether the data conforms to defined rules (e.g., date formats, numerical ranges).
- **Uniqueness**: Whether there are duplicate entries.
- **Bias**: Whether the data exhibits systematic bias that could affect downstream applications.
  
You may need to focus on one or multiple aspects of data quality depending on the requirements of your use case.

Second, we **Collect and Label Training Data**

To build a supervised model, you’ll need a labeled dataset with examples of "high quality" and "low quality" data. You can collect this in two ways:

- **Manual Labeling**: Human annotators evaluate samples of data and label them as "high quality" or "low quality" based on predefined criteria.
  
- **Rule-Based Labeling**: Automatically label data using heuristic rules. For example, missing values, inconsistencies, or outliers can serve as indicators of low quality. Use these rules to label data automatically.

#### Example Data Quality Labeling:

For each sample (e.g., text, numerical data, image), you might assign a quality label:

| Data Sample | Quality Label |
|-------------|---------------|
| "John, Doe, 12/12/1995" | High Quality |
| "Jhn, , 1995/12/12" | Low Quality |
| "J.Do, ?19" | Low Quality |
| "Jane, Doe, 01/01/2000" | High Quality |
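
The rule-based labeling idea can be sketched for records like the ones in the table. The rules below (three non-empty fields, alphabetic names, an `MM/DD/YYYY` date) are assumptions made for this illustration, not a fixed standard; adapt them to your own schema.

```python
import re

# Hypothetical rule set for "first, last, MM/DD/YYYY" records
DATE_PATTERN = re.compile(r"^(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/\d{4}$")

def label_record(record):
    """Return 'high_quality' if the record passes every rule, else 'low_quality'."""
    fields = [field.strip() for field in record.split(",")]
    # Completeness: exactly three non-empty fields
    if len(fields) != 3 or any(not field for field in fields):
        return "low_quality"
    first, last, date = fields
    # Validity: names are alphabetic and the date follows MM/DD/YYYY
    if not (first.isalpha() and last.isalpha()):
        return "low_quality"
    if not DATE_PATTERN.match(date):
        return "low_quality"
    return "high_quality"

samples = [
    "John, Doe, 12/12/1995",
    "Jhn, , 1995/12/12",
    "J.Do, ?19",
    "Jane, Doe, 01/01/2000",
]
labels = [label_record(sample) for sample in samples]
print(labels)
```

Heuristic labels like these are cheap to produce at scale, but they inherit every blind spot of the rules, so a manually labeled sample is still useful for checking them.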

Here we refer to *DCLM-baseline-1.0* to train a model to classify the data based on their quality.

To implement the **fastText OH-2.5 + ELI5** approach from the DCLM paper, we need to train a fastText model on labeled instruction-formatted data drawn from **OpenHermes 2.5 (OH-2.5)** and high-scoring posts from the **r/ExplainLikeImFive (ELI5)** subreddit. Below is the code to prepare, train, and evaluate a fastText model on this dataset. The dataset identifiers and the `label`/`text` field names in the code are placeholders; replace them with your actual loading logic.

In [None]:
import fasttext
from datasets import load_dataset

# Step 1: Prepare the OH-2.5 + ELI5 Dataset

def prepare_fasttext_format(dataset, file_name):
    """
    Convert the dataset into fastText format where each line starts with a label.
    :param dataset: Hugging Face dataset with 'text' and 'label' columns.
    :param file_name: File name to save the formatted data.
    """
    with open(file_name, 'w', encoding='utf-8') as f:
        for example in dataset:
            label = "__label__" + example['label']  # Labels in fastText format (e.g., '__label__high_quality')
            text = example['text'].replace('\n', ' ')  # Replace newlines with spaces
            f.write(f"{label} {text}\n")

# Load OpenHermes 2.5 and ELI5 datasets (replace with actual paths or dataset loading methods)
# For the example, let's assume you have datasets OH2.5 and ELI5 already loaded.
def load_datasets():
    oh25 = load_dataset("openhermes_2_5")  # Replace with actual OH-2.5 loading method
    eli5 = load_dataset("eli5")  # Replace with actual ELI5 loading method
    return oh25['train'], eli5['train']

# Combine OH-2.5 and ELI5 datasets
def combine_datasets(oh25_dataset, eli5_dataset):
    combined_data = []

    # Add OH-2.5 data (assuming 'label' and 'text' fields)
    for example in oh25_dataset:
        combined_data.append({
            'label': example['label'],
            'text': example['text']
        })

    # Add ELI5 data (assuming 'label' and 'text' fields)
    for example in eli5_dataset:
        combined_data.append({
            'label': example['label'],
            'text': example['text']
        })

    return combined_data

# Step 2: Format the Combined Data for fastText

def save_dataset_for_fasttext(oh25, eli5, output_file):
    combined_data = combine_datasets(oh25, eli5)
    prepare_fasttext_format(combined_data, output_file)

# Step 3: Train the fastText Classifier

def train_fasttext_classifier(training_file, model_save_path):
    """
    Train the fastText classifier using instruction-formatted OH-2.5 + ELI5 data.
    :param training_file: Path to the training file in fastText format.
    :param model_save_path: Path to save the trained fastText model.
    """
    # Train the fastText model
    model = fasttext.train_supervised(
        input=training_file,
        epoch=10,  # Number of training epochs
        lr=1.0,  # Learning rate
        wordNgrams=2,  # Use bigrams
        dim=100,  # Dimension of word vectors
        loss='softmax'  # Loss function
    )

    # Save the trained model
    model.save_model(model_save_path)

    return model

# Step 4: Evaluate the Model

def evaluate_model(model, test_file):
    """
    Evaluate the fastText model on a test dataset.
    :param model: Trained fastText model.
    :param test_file: Test file in fastText format.
    """
    # Evaluate the model using fastText's built-in evaluation function
    result = model.test(test_file)

    # fastText returns:
    # (number of examples, precision@1, recall@1)
    print(f"Number of examples: {result[0]}")
    print(f"Precision@1: {result[1]}")
    print(f"Recall@1: {result[2]}")

# Step 5: Main Function to Run Everything

def main():
    # Load datasets
    oh25_data, eli5_data = load_datasets()

    # Prepare data for fastText
    training_file = 'oh25_eli5_train.txt'
    save_dataset_for_fasttext(oh25_data, eli5_data, training_file)

    # Train the fastText classifier
    model_save_path = 'fasttext_oh25_eli5.bin'
    model = train_fasttext_classifier(training_file, model_save_path)

    # Optionally, prepare a test set and evaluate the model
    test_file = 'oh25_eli5_test.txt'  # You should create a similar test set
    evaluate_model(model, test_file)

# Run the process
if __name__ == "__main__":
    main()

The code above covers four main stages:

1. **Dataset Loading**:
   - The `load_datasets` function loads both OpenHermes 2.5 (OH-2.5) and r/ExplainLikeImFive (ELI5) datasets.
   - Replace the `load_dataset("openhermes_2_5")` and `load_dataset("eli5")` with the actual paths or dataset loading methods you are using.

2. **Data Preparation**:
   - We combine OH-2.5 and ELI5 into a single dataset using `combine_datasets`.
   - The `prepare_fasttext_format` function formats the combined dataset into the fastText format, with labels (`__label__high_quality`, `__label__low_quality`) at the beginning of each line.

3. **Training the fastText Model**:
   - The `train_fasttext_classifier` function trains a fastText classifier using the formatted dataset.
   - It trains using **supervised learning** with hyperparameters like `epoch=10`, `wordNgrams=2`, and `lr=1.0`. You can adjust these values to experiment with model performance.

4. **Model Evaluation**:
   - After training, the `evaluate_model` function evaluates the trained model on a test set, printing out the precision and recall metrics.
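
The test file referenced in `main()` has to be created separately. One simple way, sketched here under the assumption that all examples are already formatted as fastText lines, is to shuffle them and hold out a fraction. The helper name `split_fasttext_lines` and the toy lines are hypothetical.

```python
import random

def split_fasttext_lines(lines, test_fraction=0.1, seed=42):
    """Shuffle fastText-formatted lines and split them into train and test lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy lines in fastText format (labels and texts are illustrative)
lines = [f"__label__high_quality example text {i}" for i in range(8)]
lines += [f"__label__low_quality noisy text {i}" for i in range(2)]

train_lines, test_lines = split_fasttext_lines(lines, test_fraction=0.2)
print(len(train_lines), len(test_lines))
```

Each list can then be written to its own file, one line per example, and passed to training and evaluation respectively.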

Next, we move on to cleaning the data itself.

## 2. Data Cleaning

In most cases, the data comes from websites, scanned books, documents stored as PDFs, and other similar formats. For example, the following is a website-style text, which is readable but unclean. We have to remove HTML tags like `<a>`, stray characters like `\`, and non-ASCII characters like `｜`.

*A possible meaning for "Cosmopedia" could be an encyclopedi\<a\> or collection of inform@tion about different cultures, soci\eties, and topics from around the wor|ld, emphasizing diversity and global connec tedness.*

So we have to clean the data: remove the tokens we do not expect and filter out texts that are too short.

### 2.1 **Data Filtering Based on Quality**
It's important to remove low-quality, irrelevant, or harmful content. This could involve filtering out texts that:
- Contain excessive typos or grammatical errors.
- Are too short or too long.
- Contain offensive, biased, or harmful language (important for ethical AI development).

**Techniques:**
- **Language Detection**: Ensure that only data in the target language is included.
- **Readability Scoring**: Use Flesch-Kincaid or other readability scores to filter out overly complex or nonsensical sentences.
- **Bias Detection**: Use automated tools to detect and remove biased or harmful content.
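
Readability scoring can be sketched with the Flesch Reading Ease formula, one of the Flesch-Kincaid readability tests. The syllable counter below is a rough vowel-group heuristic, which is an assumption of this sketch, and any filtering threshold would need tuning on your own data.

```python
import re

def count_syllables(word):
    """Rough syllable count: number of vowel groups, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease score; higher values indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(word) for word in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

easy = flesch_reading_ease("The cat sat on the mat. It was a sunny day.")
hard = flesch_reading_ease(
    "Interdisciplinary epistemological considerations necessitate "
    "comprehensive methodological reconsideration."
)
print(easy > hard)
```

A score like this is best used as one signal among several, since very short or list-like texts can score well without being useful training data.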

In [None]:
def filter_low_quality(texts, min_length=20, max_length=300):
  return [text for text in texts if min_length < len(text.split()) < max_length]

filtered_texts = filter_low_quality(list_of_texts)

### **2.2 Removing HTML Tags**

- Option A: Using Regular Expressions

In [None]:
import re

def remove_html_tags(text):
    """Remove HTML tags using regular expressions."""
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

- Option B: Using BeautifulSoup

In [None]:
from bs4 import BeautifulSoup

def remove_html_tags(text):
    """Remove HTML tags using BeautifulSoup."""
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()


### **2.3 Removing Backslashes**

In [None]:
def remove_backslashes(text):
    """Remove backslashes from the text."""
    return text.replace('\\', '')

### **2.4 Removing Non-ASCII Characters**

- Option A: Using Encoding

In [None]:
def remove_non_ascii(text):
    """Remove non-ASCII characters by encoding and decoding."""
    return text.encode('ascii', 'ignore').decode()

- Option B: Using Regular Expressions

In [None]:
import re

def remove_non_ascii(text):
    """Remove non-ASCII characters using regular expressions."""
    return re.sub(r'[^\x00-\x7F]+', '', text)

- Option C: Using unidecode

In [None]:
import unidecode

def remove_non_ascii(text):
    return unidecode.unidecode(text)

cleaned_text = remove_non_ascii(text)
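
The options above do not behave identically: encoding with `'ignore'` or the regex deletes accented letters outright, while transliteration keeps a base letter. The standard-library sketch below illustrates the difference using Unicode decomposition. It is not a replacement for `unidecode`, which covers far more scripts, and the helper names are chosen for this example only.

```python
import unicodedata

def drop_non_ascii(text):
    """Delete every non-ASCII character (the behaviour of options A and B)."""
    return text.encode("ascii", "ignore").decode()

def strip_accents(text):
    """Decompose characters (NFKD), then drop combining marks and other non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode()

sample = "caf\u00e9 na\u00efve"  # "café naïve" written with escapes
dropped = drop_non_ascii(sample)
stripped = strip_accents(sample)
print(dropped)
print(stripped)
```

Which behaviour is preferable depends on the corpus: for multilingual pretraining data, removing non-ASCII characters at all is usually too aggressive.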

### 2.5 **Fix Encoding Issues**
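   - Text extracted from PDFs or web pages often has encoding problems, such as mojibake where `â€™` appears in place of `’`. Libraries like `ftfy` fix many of these automatically. Look-alike character substitutions (e.g., `ɑ` for `a` or `ԁ` for `d`) are a related problem that may need a separate character mapping.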

In [None]:
from ftfy import fix_text

def fix_encoding_issues(text):
   return fix_text(text)

cleaned_text = fix_encoding_issues(text)

### 2.6 **Remove or Replace Special Characters**
   - Strip out unwanted special characters or replace them with their proper forms. You can use regex to target specific characters that shouldn't be in the text.

In [None]:
import re

def remove_special_chars(text):
   return re.sub(r'[^A-Za-z0-9\s.,!?\'"]', '', text)  # Keeps common punctuation marks

cleaned_text = remove_special_chars(text)

### 2.7 **Normalize Whitespace**
   - Extra spaces, newlines, or tabs may appear in unclean text. Normalize the whitespace to a single space or strip extra spaces.

In [None]:
def normalize_whitespace(text):
   return " ".join(text.split())

cleaned_text = normalize_whitespace(text)

### 2.8 **Fix Spelling and Grammar**
   - Use a spell-checker to fix common spelling mistakes, which are prevalent in noisy data. Libraries like `pyspellchecker` can help.

In [None]:
from spellchecker import SpellChecker

def correct_spelling(text):
   spell = SpellChecker()
   corrected_text = []
   for word in text.split():
       # correction() may return None when no candidate is found; keep the original word then
       corrected_text.append(spell.correction(word) or word)
   return " ".join(corrected_text)

cleaned_text = correct_spelling(text)


### 2.9 **Convert to Lowercase**
   - Converting the text to lowercase ensures uniformity when performing NLP tasks like tokenization.

In [None]:
def to_lowercase(text):
   return text.lower()

cleaned_text = to_lowercase(text)

### 2.10 **Remove Stopwords (Optional)**
   - Depending on the task (e.g., text classification), you might want to remove common stopwords to reduce noise. Libraries like `nltk` or `spaCy` provide predefined stopword lists.



In [None]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
   words = text.split()
   return ' '.join([word for word in words if word.lower() not in stop_words])

cleaned_text = remove_stopwords(cleaned_text)

### 2.11 **Remove URLs, Emails, or Phone Numbers**
   - If the text contains unwanted URLs, emails, or phone numbers, regex can help you target and remove them.

In [None]:
def remove_urls(text):
   return re.sub(r'http\S+|www\S+|https\S+', '', text)

def remove_emails(text):
   return re.sub(r'\S+@\S+', '', text)

cleaned_text = remove_urls(cleaned_text)
cleaned_text = remove_emails(cleaned_text)
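
Phone numbers are mentioned above but not covered by the code. A heuristic sketch follows; the pattern is an assumption made for illustration and will not match every national format, so treat it as a starting point and check it against your own data.

```python
import re

# Assumed pattern: optional "+", then digits and separators, at least 9 characters in total
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def remove_phone_numbers(text):
    """Remove substrings that look like phone numbers (heuristic)."""
    return PHONE_PATTERN.sub("", text)

example = remove_phone_numbers("Call +1 (555) 123-4567 before 2024 ends.")
print(example)
```

Short digit runs such as years are left alone because they do not reach the minimum length, but long numeric identifiers may be removed as false positives.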

### 2.12 **Tokenization and Lemmatization (Optional)**
   - If you are preparing data for NLP tasks, tokenizing and lemmatizing words can be useful to break the text into meaningful units.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

def lemmatize_text(text):
   doc = nlp(text)
   return ' '.join([token.lemma_ for token in doc])

cleaned_text = lemmatize_text(cleaned_text)

### Putting it all together
You can create a preprocessing pipeline by combining these steps:

In [None]:
def clean_text(text):
    text = remove_html_tags(text)
    text = fix_encoding_issues(text)
    # Remove URLs and emails before stripping special characters;
    # otherwise '@', ':' and '/' are gone and the patterns no longer match
    text = remove_urls(text)
    text = remove_emails(text)
    text = remove_non_ascii(text)
    text = remove_special_chars(text)
    text = normalize_whitespace(text)
    text = to_lowercase(text)
    return text

raw_text = 'Your raw, unclean text extracted from PDF or website here.'
cleaned_text = clean_text(raw_text)
print(cleaned_text)

### Tools and Libraries to Use:
- **`ftfy`** for fixing encoding errors.
- **`BeautifulSoup`** for removing HTML tags.
- **`re`** (regex) for custom text replacements.
- **`nltk`/`spaCy`** for tokenization, stopword removal, and lemmatization.
- **`unidecode`** for ASCII normalization.

## 3. Deal with PDF

### 3.1 Common methods

Next, we look at one specific data format: PDF. To extract text from PDFs, you can use libraries like `PyPDF2` or `pdfminer.six`. Here, we share the code for `PyPDF2`:

In [None]:
import PyPDF2

def extract_text_from_pdf(pdf_path):
    """Extract text from a PDF file."""
    text = ''
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += page.extract_text() or ''  # guard against pages with no extractable text
    return text

For scanned documents (images), use OCR tools like `pytesseract`.

In [None]:
import pytesseract
from PIL import Image

def extract_text_from_image(image_path):
    """Extract text from an image using OCR."""
    text = pytesseract.image_to_string(Image.open(image_path))
    return text

The full workflow looks as follows:

In [None]:
def process_document(file_path):
    """Extract and clean text from a document."""
    # Determine file type and extract text accordingly
    if file_path.endswith('.pdf'):
        text = extract_text_from_pdf(file_path)
    elif file_path.endswith(('.png', '.jpg', '.jpeg')):
        text = extract_text_from_image(file_path)
    else:
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()

    # Clean the extracted text
    cleaned_text = clean_text(text)
    return cleaned_text

# Example usage
file_path = 'sample_document.pdf'
cleaned_text = process_document(file_path)
print(cleaned_text)

### 3.2 Using Sciparser for scientific documents

Here we introduce a repo called [sciparser](https://github.com/davendw49/sciparser), a PDF parsing toolkit that converts PDFs to Markdown for text corpus preparation. sciparser 1.0 builds on [PDF Parser ToolKits](https://github.com/Acemap/pdf_parser), which gathers the most commonly used PDF OCR tools for academic papers, and is inspired by `grobid_tei_xml`, an open-source PyPI package. In recent works like [K2](https://github.com/davendw49/k2) and [GeoGalactica](https://github.com/davendw49/geogalactica), we used this tool with an upgraded Grobid backend to pre-process the text corpus. An online demo is also publicly available.

- Try [DEMO](https://sciparser.acemap.info/)

In this repo and demo, we share only the secondary processing solution built on Grobid. We plan to share the multiple-backend combination solution for PDF parsing in the near future.

#### Requirements

```bash
git clone https://github.com/Acemap/pdf_parser.git
cd pdf_parser
pip install -r requirements.txt
python setup.py install

git clone https://github.com/davendw49/sciparser.git
cd sciparser
pip install -r requirements.txt
```

In [None]:
from pipeline import pipeline
data = pipeline('/path/to/your/pdf/')

For [pdf_parser](https://github.com/Acemap/pdf_parser), the table below shows which content types each backend supports:

Backend↓ / Type→ | text | image | reference
:-: | :-: | :-: | :-:
grobid | <font color="#00FF00">√</font> | <font color="#FF0000">×</font> | <font color="#FF0000">×</font>
cermine | <font color="#00FF00">√</font> | <font color="#00FF00">√</font> | <font color="#FF0000">×</font>
scienceparse | <font color="#00FF00">√</font> | <font color="#FF0000">×</font> | <font color="#FF0000">×</font>
pdffigures | <font color="#FF0000">×</font> | <font color="#00FF00">√</font> | <font color="#FF0000">×</font>
pdffigures2 | <font color="#00FF00">√</font> | <font color="#00FF00">√</font> | <font color="#FF0000">×</font>

#### Detailed requirements

Backend↓ / Requirements→ | OS | Java | Other
:-: | :-: | :-: | :-:
grobid | All (Windows/Linux/Mac) | Not needed | None
cermine | All (Windows/Linux/Mac) | Needed | None
scienceparse | All (Windows/Linux/Mac) | Needed | None
pdffigures | Linux/Mac | Not needed | leptonica & poppler (Ubuntu: `sudo apt install libpoppler-private-dev libleptonica-dev`)
pdffigures2 | All (Windows/Linux/Mac) | Needed | None


In [None]:
# The Parser class takes the `backend` parameter to specify the backend to use.
# pdf_parser.Parser(backend='grobid')
# To parse the structural information of all PDF files in the `input_dir` and save the results to `output_dir`, use the following command:
pdf_parser.Parser.parse('text', input_dir, output_dir, n_threads=0)

# To parse the image information of all PDF files in the `input_dir` and save the results to `output_dir`, use the following command:
pdf_parser.Parser.parse('figure', input_dir, output_dir, n_threads=0)
# Note: `n_threads` sets the number of threads used for parsing. The default value is 0, which uses all available CPU cores.

## 4. Data Deduplication

We usually draw on multiple data sources, and duplicates across datasets are almost inevitable, so we need to deduplicate them to keep any single sample from being over-represented.

**Example:**
```python
def deduplicate_data(texts):
    # dict.fromkeys drops exact duplicates while preserving the original order
    return list(dict.fromkeys(texts))

deduplicated_texts = deduplicate_data(list_of_texts)
```

More advanced methods go beyond exact matching and identify near-duplicate content, using techniques like **MinHash** or **TF-IDF** similarity.
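
To make the MinHash idea concrete, here is a minimal sketch that estimates the Jaccard similarity between the word-shingle sets of two texts. The shingle size, the number of hash functions, and the seeded-MD5 hashing scheme are all assumptions made for this illustration; production pipelines typically use a dedicated library together with locality-sensitive hashing.

```python
import hashlib

def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def _seeded_hash(shingle, seed):
    """64-bit hash of a shingle; each seed acts as a different hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

def minhash_signature(text, num_perm=64, k=3):
    """For each hash function, keep the minimum hash over all shingles."""
    shingle_set = shingles(text, k)
    return [min(_seeded_hash(s, seed) for s in shingle_set) for seed in range(num_perm)]

def estimate_jaccard(sig1, sig2):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

a = "the quick brown fox jumps over the lazy dog near the river bank"
b = "the quick brown fox jumps over the lazy dog near the river bend"
c = "large language models need clean and diverse training data to generalize well"

sig_a, sig_b, sig_c = (minhash_signature(t) for t in (a, b, c))
near_dup_score = estimate_jaccard(sig_a, sig_b)
unrelated_score = estimate_jaccard(sig_a, sig_c)
print(near_dup_score, unrelated_score)
```

Texts whose estimated similarity exceeds a chosen threshold can then be treated as near-duplicates; the threshold itself is a tuning choice.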

In most cases, the **SimHash** algorithm is sufficient for this task. Here are the main steps:

- **Step 1**: Load two datasets using Hugging Face's datasets library.
- **Step 2**: Compute SimHash for each text in both datasets.
- **Step 3**: Deduplicate the datasets by comparing SimHashes and removing duplicates based on a Hamming distance threshold.
- **Step 4**: Combine the unique examples from both datasets into a new dataset.
- **Step 5**: Optionally, save the combined dataset to disk for future use.

In [None]:
from datasets import load_dataset, DatasetDict, Dataset
from tqdm import tqdm
import hashlib

### 4.1 SimHash

First, let's look at how SimHash works. SimHash is a hashing technique that captures the essence of a text in a compact fingerprint, so that similar texts get similar fingerprints and are easy to compare.

**Step-by-Step Implementation of SimHash:**

- **Tokenize the Text:** Convert the text into words or tokens.
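- **Hash Each Token:** Use a hash function to convert each token into an integer. The implementation below uses MD5; Python's built-in `hash()` is randomized per process for strings, so its values are not reproducible across runs.
- **Vector Representation:** Treat each hash value as a binary vector and keep one counter per bit position. If the bit at a position is 1, increment the counter at that position; if it is 0, decrement it.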
- **Generate SimHash:** After processing all tokens, create a final fingerprint by taking the sign of each bit position in the vector (positive or negative sum of bit positions).
- **Hamming Distance:** Compare two SimHash values using the Hamming distance (number of differing bits between two binary representations).

In [None]:
class Simhash:
    def __init__(self, f=64):
        self.f = f

    def _hash(self, token):
        return int(hashlib.md5(token.encode('utf-8')).hexdigest(), 16)

    def compute_simhash(self, text):
        tokens = text.split()
        v = [0] * self.f
        for token in tokens:
            h = self._hash(token)
            for i in range(self.f):
                bitmask = 1 << i
                if h & bitmask:
                    v[i] += 1
                else:
                    v[i] -= 1
        fingerprint = 0
        for i in range(self.f):
            if v[i] > 0:
                fingerprint |= 1 << i
        return fingerprint

    def hamming_distance(self, simhash1, simhash2):
        x = simhash1 ^ simhash2
        tot = 0
        while x:
            tot += x & 1
            x >>= 1
        return tot

Now, let's analyze the time and space complexity of the algorithm.

**Time and Space Complexity Analysis of the SimHash**

1. **Time Complexity**:

The text is split into tokens, which takes $O(n)$ time, where $n$ is the number of characters in the input string (assuming whitespace-based splitting). A hash is then computed for each token. Hashing a single token takes time proportional to its length, so hashing all tokens costs $O(n)$ in total (or $O(1)$ per token if token length is treated as bounded). For each of the $m$ tokens, we process every bit of the hash value, which has $f$ bits (e.g., 64), the size of the fingerprint. Therefore, processing all tokens takes $O(m \times f)$ time.
  
**Total Time Complexity**: The overall time complexity is $O(n) + O(m \times f)$, but since $m$ is proportional to $n$, this simplifies to **$O(n \times f)$**.

2. **Space Complexity**:

The counter vector takes $O(f)$ space, where $f$ is the number of bits in the fingerprint (e.g., 64 or 128). Storing the token list takes $O(m)$, where $m$ is the number of tokens in the text.
  
**Total Space Complexity**: The overall space complexity is **$O(m + f)$**, which typically simplifies to **$O(m)$**, since $f$ (the number of bits) is constant (e.g., 64 bits).

3. **Comparison Complexity**:

**Hamming Distance**: Comparing two SimHash fingerprints using Hamming distance involves checking each of the $f$ bits in the two fingerprints. Therefore, the time complexity of Hamming distance computation is **$O(f)$**.
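
As a small worked example of the comparison step, the Hamming distance between two integer fingerprints can be computed with XOR followed by a bit count. The 8-bit fingerprints below are made up for illustration; real SimHash fingerprints here have 64 bits.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integer fingerprints differ."""
    return bin(a ^ b).count("1")

fp1 = 0b10110100
fp2 = 0b10010110

# XOR leaves a 1 exactly where the fingerprints disagree: 0b00100010
distance = hamming_distance(fp1, fp2)
is_near_duplicate = distance < 5  # same convention as the default threshold in deduplicate_datasets
print(distance, is_near_duplicate)
```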


In [None]:
# Load Hugging Face datasets
def load_datasets(dataset1_name, dataset2_name):
    dataset1 = load_dataset(dataset1_name)
    dataset2 = load_dataset(dataset2_name)
    return dataset1, dataset2

# Deduplicate datasets based on SimHash
def deduplicate_datasets(dataset1, dataset2, threshold=5):
    """Keep one copy of each near-duplicate, within and across the two datasets.

    Every record is compared against all records kept so far, so this is
    quadratic in the number of records and only suitable for small datasets.
    """
    simhash = Simhash()
    kept_hashes = []  # fingerprints of all records kept so far

    def keep_unique(dataset, desc):
        unique = []
        for item in tqdm(dataset, desc=desc):
            fingerprint = simhash.compute_simhash(item['text'])
            # A record is a duplicate if it is close to any kept fingerprint
            if any(simhash.hamming_distance(fingerprint, kept) < threshold
                   for kept in kept_hashes):
                continue
            kept_hashes.append(fingerprint)
            unique.append(item)
        return unique

    # dataset1 is processed first, so for a cross-dataset duplicate pair
    # the copy in dataset1 is kept and the copy in dataset2 is dropped
    unique_dataset1 = keep_unique(dataset1, "Processing dataset1")
    unique_dataset2 = keep_unique(dataset2, "Processing dataset2")

    return unique_dataset1, unique_dataset2

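# Combine two datasets into one
def combine_datasets(dataset1, dataset2):
    # Both inputs are lists of row dicts, so concatenate them and build a
    # Hugging Face Dataset with from_list (from_dict expects a dict of columns)
    combined_dataset = dataset1 + dataset2
    return Dataset.from_list(combined_dataset)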

# Main function to load, deduplicate, and combine datasets
def do_deduplication(dataset1_name, dataset2_name):
    # Load the datasets
    dataset1, dataset2 = load_datasets(dataset1_name, dataset2_name)

    # Deduplicate the datasets
    unique_dataset1, unique_dataset2 = deduplicate_datasets(dataset1['train'], dataset2['train'])

    print(f"Unique records in dataset1: {len(unique_dataset1)}")
    print(f"Unique records in dataset2: {len(unique_dataset2)}")

    # Combine the two datasets
    combined_dataset = combine_datasets(unique_dataset1, unique_dataset2)

    # Save or return the combined dataset as necessary
    print(f"Combined dataset length: {len(combined_dataset)}")
    combined_dataset.save_to_disk('combined_dataset')  # Save to disk if needed

Then we can test the deduplication with our homemade SimHash implementation.

In [None]:
dataset1_name = "cosmopedia-test-1"
dataset2_name = "cosmopedia-test-2"
do_deduplication(dataset1_name, dataset2_name)

Summary of Methods:
- **Data augmentation** and **deduplication** ensure diversity and uniqueness in the data.
- **Noise injection** and **dynamic masking** improve robustness.
- **Data filtering** and **balancing** ensure quality and fairness.
- **Custom tokenization and semantic clustering** optimize token handling and coverage.
- **Adversarial training** strengthens the model against challenging input.

## 5. Going Deeper into Data Preprocessing

### 5.1 **Data Augmentation**
   - Augment your dataset by introducing variations or transformations to increase the diversity and size of the training data.
   - **Paraphrasing**: Use paraphrasing techniques to generate multiple variants of the same text while preserving meaning.
   - **Synonym Substitution**: Replace certain words with their synonyms.
   - **Back-translation**: Translate text to another language and back to the original to generate different expressions of the same content.


In [None]:
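from transformers import MarianMTModel, MarianTokenizer

def translate(text, src_lang, tgt_lang):
    """Translate text with a Helsinki-NLP MarianMT model."""
    model_name = f'Helsinki-NLP/opus-mt-{src_lang}-{tgt_lang}'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    batch = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    translated = model.generate(**batch)
    return tokenizer.decode(translated[0], skip_special_tokens=True)

def back_translate(text, src_lang='en', tgt_lang='fr'):
    # Translate to the pivot language, then back to the source language
    pivot_text = translate(text, src_lang, tgt_lang)
    return translate(pivot_text, tgt_lang, src_lang)

back_translated_text = back_translate(original_text)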

### 5.2 **Noise Injection (Robustness Training)**
   - Purposefully injecting noise (e.g., spelling mistakes, missing punctuation, or small semantic errors) into the training data to make the model more robust to noisy input. This helps LLMs generalize better to real-world, uncleaned data.

   **Methods for Noise Injection:**
   - **Spelling perturbations**: Introduce typos or missing characters.
   - **Grammatical errors**: Introduce small grammatical inconsistencies.
   - **Random word swaps**: Swap adjacent words randomly.

   In practice, this is usually done during supervised fine-tuning (SFT), where supervised data teaches the model to produce clean output from noisy input.

In [None]:
import random

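def random_word_swap(text):
    words = text.split()
    if len(words) < 2:
        return text  # nothing to swap
    # Pick a random position and swap it with its right-hand neighbour
    idx = random.randrange(len(words) - 1)
    words[idx], words[idx + 1] = words[idx + 1], words[idx]
    return " ".join(words)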

noisy_text = random_word_swap(original_text)
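
Spelling perturbations, the first method listed above, can be sketched in the same style. The helper below deletes one random character from one random word. The function name and the choice of deletion (rather than substitution or transposition) are assumptions of this example.

```python
import random

def drop_random_char(text, rng=None):
    """Simulate a typo by deleting one randomly chosen character from a random word."""
    rng = rng or random.Random()
    words = text.split()
    # Only words with at least two characters are eligible, so no word disappears
    candidates = [i for i, word in enumerate(words) if len(word) > 1]
    if not candidates:
        return text
    i = rng.choice(candidates)
    pos = rng.randrange(len(words[i]))
    words[i] = words[i][:pos] + words[i][pos + 1:]
    return " ".join(words)

clean = "clean data matters"
noisy = drop_random_char(clean, rng=random.Random(0))
print(noisy)
```

Passing a seeded `random.Random` instance keeps the perturbation reproducible, which helps when comparing training runs.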

### 5.4 **Data Stratification**
   - In order to make sure your model does not overfit to specific patterns (e.g., certain topics or styles dominating the training data), stratify your data by categories like topic, length, or language. This ensures that all categories are evenly represented.


In [None]:
import numpy as np

def stratify_by_length(texts, bins=5):
    text_lengths = [len(text.split()) for text in texts]
    # np.histogram returns bins + 1 edges; the first and last are the min and max length
    bin_edges = np.histogram(text_lengths, bins=bins)[1]
    stratified_data = [[] for _ in range(bins)]

    # Use only the interior edges so the shortest and longest texts
    # fall into the first and last bins instead of being dropped
    bin_ids = np.digitize(text_lengths, bin_edges[1:-1], right=True)
    for text, bin_id in zip(texts, bin_ids):
        stratified_data[bin_id].append(text)

    return stratified_data

stratified_texts = stratify_by_length(list_of_texts)
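
Once the texts are grouped into strata, one simple way to use them is to draw the same number of examples from each group. The sketch below assumes the strata are given as a list of lists, as returned by a length-based stratifier. `sample_evenly`, `per_stratum`, and `seed` are hypothetical names introduced for illustration.

```python
import random

def sample_evenly(strata, per_stratum, seed=None):
    """Draw up to `per_stratum` items from each stratum, then shuffle the result."""
    rng = random.Random(seed)
    sampled = []
    for bucket in strata:
        k = min(per_stratum, len(bucket))  # small strata contribute all they have
        sampled.extend(rng.sample(bucket, k))
    rng.shuffle(sampled)
    return sampled

strata = [["a b", "c d", "e f"], ["one two three four"], []]
balanced = sample_evenly(strata, per_stratum=2, seed=0)
print(len(balanced))  # 3: two from the first bucket, one from the second
```

Strata with fewer than `per_stratum` items are simply used in full. Whether to up-sample them instead is a separate design choice.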

### 5.5 **Data Balancing (Avoiding Bias)**
   - Make sure your dataset is not biased toward certain topics, terms, or cultural contexts. This can be done by balancing categories (e.g., gender, race, profession) within the training corpus.
   - Use topic modeling or clustering to detect if your dataset has an over-representation of certain topics, then down-sample or up-sample as needed.

In [None]:
from collections import Counter
from random import shuffle

def downsample_data_by_category(texts, categories, max_per_category=1000):
    category_count = Counter(categories)
    sampled_texts = []
    for category in category_count:
        texts_in_category = [text for text, cat in zip(texts, categories) if cat == category]
        shuffle(texts_in_category)
        sampled_texts.extend(texts_in_category[:max_per_category])
    return sampled_texts

sampled_texts = downsample_data_by_category(list_of_texts, list_of_categories)

### 5.6 **Semantic Clustering and Oversampling**
   - If certain topics or categories in your data are underrepresented, you can use semantic clustering to identify and oversample these categories to ensure the model doesn't ignore them.


In [None]:
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(list_of_texts)

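# Build a Faiss index for exact L2 nearest-neighbor search
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# For each text, retrieve its k nearest neighbors (the first hit is usually the text itself).
# D holds squared L2 distances and I holds neighbor indices; dense neighborhoods hint at semantic clusters.
D, I = index.search(embeddings, k=5)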
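
The oversampling step itself can be sketched as follows. This illustration assumes each text already has a cluster label from some clustering step run over the embeddings (for example k-means). The `labels` list, the `oversample_clusters` function, and the `seed` parameter are hypothetical.

```python
import random
from collections import defaultdict

def oversample_clusters(texts, labels, seed=None):
    """Resample with replacement so every cluster reaches the size of the largest one."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for text, label in zip(texts, labels):
        clusters[label].append(text)
    if not clusters:
        return []
    target = max(len(members) for members in clusters.values())
    balanced = []
    for members in clusters.values():
        balanced.extend(members)
        # Add random duplicates until the cluster reaches the target size.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced

texts = ["rock a", "rock b", "rock c", "ocean a"]
labels = [0, 0, 0, 1]  # hypothetical cluster ids
print(len(oversample_clusters(texts, labels, seed=0)))  # 6
```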

### 5.7 **Adversarial Data Training**
   - Train your model with adversarial examples to help it become more robust. This can include slight perturbations to the input data or feeding it challenging cases (e.g., purposely ambiguous or contradicting text).


In [None]:
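import random
import string

def add_perturbation(text):
    # Replace each character with a random lowercase letter with probability 0.05
    return ''.join([char if random.random() > 0.05 else random.choice(string.ascii_lowercase) for char in text])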

perturbed_text = add_perturbation(original_text)

## 6. Implement a Text `Tokenizer`

As is well known, there is a significant gap between human language and machine understanding. Therefore, when processing language text with computers, the first step is to convert the raw text into a format that computers can effectively handle. This process is called tokenization. Tokenization involves breaking down text into smaller, more manageable units (called tokens), which can be words, subwords, or even individual characters.

In the vast majority of scenarios, the tokenizers we use include Byte-Pair Encoding (BPE), used by GPT-2; WordPiece, used by BERT; and the SentencePiece tokenizer used by T5, which was developed to address the situation where languages do not use spaces to separate words.

It should be noted that the Tiktoken tokenizer used by OpenAI models has recently gained significant attention.

In this tutorial, we first go through the BPE method and then discuss `Tiktoken`.
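
Before training a real tokenizer, it may help to see the core BPE loop in isolation. The sketch below is a toy, character-level version written for illustration (the function name `learn_bpe_merges` and the tiny corpus are assumptions). It repeatedly counts adjacent symbol pairs and merges the most frequent one. Production tokenizers add byte-level handling, pre-tokenization, and many optimizations on top of this idea.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Represent each distinct word as a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

corpus = ["low", "low", "lower", "newest", "newest", "newest", "widest"]
print(learn_bpe_merges(corpus, num_merges=3))
```

Each learned merge becomes a new vocabulary entry, so the number of merges directly controls the vocabulary size.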

### 6.1 Train a BPE Tokenizer

In [None]:
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

In [None]:
from datasets import load_dataset

ds = load_dataset("Syed-Hasan-8503/cosmopedia-10k")

In [None]:
train_data = ds['train']

In [None]:
def get_training_corpus(dataset):
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]

In [None]:
tokenizer = Tokenizer(models.BPE())

In [None]:
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.StripAccents()] #normalizers.Lowercase()
)

In [None]:
print(tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))

In [None]:
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

In [None]:
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test pre-tokenization!")

In [None]:
trainer = trainers.BpeTrainer(vocab_size=25000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(get_training_corpus(train_data), trainer=trainer)

In [None]:
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

In [None]:
tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)

In [None]:
sentence = "Let's test this tokenizer."
encoding = tokenizer.encode(sentence)
start, end = encoding.offsets[4]
sentence[start:end]

In [None]:
tokenizer.decoder = decoders.ByteLevel()
tokenizer.decode(encoding.ids)

Hooray!🎉 We have trained a tokenizer from scratch.

In most cases, different large language models (LLMs) have different tokenizers. However, some research points out that even though tokenizers vary from model to model—leading to different models having different embedding tables—they result in relatively similar semantic representations [3]. Therefore, exploring this phenomenon is a great research direction.

### 6.2 Deep Dive into `Tiktoken`

Both Llama3 and OpenAI models use `Tiktoken` BPE, while Llama2 uses `sentencepiece` BPE. The big difference between the two is that `Tiktoken` BPE does not always split a word into smaller parts if the whole word is already in the vocabulary. For example, if **"sentence"** is in the vocabulary, it stays as one token instead of being split into **["sent","ence"]**. You can try it out on the [Tiktokenizer web app](https://tiktokenizer.vercel.app/).

First, let's take a look at the basic usage of `Tiktoken` (copied directly from the OpenAI tiktoken [repo](https://github.com/openai/tiktoken)):

In [None]:
!pip install tiktoken

In [None]:
import tiktoken
enc = tiktoken.get_encoding("o200k_base")
assert enc.decode(enc.encode("hello world")) == "hello world"

# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("gpt-4o")

Now, let's take a look at the tokenizer of "gpt-4o", quoting the [source code](https://github.com/openai/tiktoken/blob/63527649963def8c759b0f91f2eb69a40934e468/tiktoken/core.py#L324) of `Tiktoken`:

In [None]:
# List the byte values of all tokens in the vocabulary
gpt4o_vocab = enc.token_byte_values()
# Private attribute: mapping from token bytes to token ID (merge rank)
gpt4o_vocab_dict = enc._mergeable_ranks

In [None]:
# Get the special tokens (private attribute: token string to token ID mapping)
enc._special_tokens

As mentioned in the Llama3 paper [6], Llama3 uses a vocabulary of 128K tokens, combining **100K tokens from the tiktoken tokenizer** with **28K additional tokens to better support non-English languages**. So we can simply start from `Tiktoken` and add our own new tokens (e.g., Chinese tokens); refer to the [extending tiktoken](https://github.com/openai/tiktoken/tree/main?tab=readme-ov-file#extending-tiktoken) section of the README.

> Notice: See [https://huggingface.co/docs/transformers/main/en/tiktoken](https://huggingface.co/docs/transformers/main/en/tiktoken) for more examples using the `transformers` and `tiktoken` packages.


## What's More

Besides what we discuss in this notebook, we also recommend considering the following aspects of data preprocessing.

### Monitoring Data Use in Pretraining

1. **Sampling Strategies**:
   - Use stratified or weighted sampling to ensure the model learns from all data sources proportionally.
   - For domain-specific models, prioritize domain data while also incorporating some general data for broader knowledge.

2. **Data Usage Policies**:
   - Ensure compliance with licenses (e.g., Creative Commons, Open Data, or public domain data).
   - Set clear guidelines on the use of sensitive data (e.g., data involving personal information, confidential data).
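
As a rough illustration of the weighted sampling strategy above, the sketch below draws documents from several sources according to mixture weights. The function `sample_mixture`, the source names, and the 70/30 ratio are hypothetical choices for this example, not recommended values.

```python
import random

def sample_mixture(sources, weights, num_samples, seed=None):
    """Draw (source_name, document) pairs according to per-source mixture weights."""
    rng = random.Random(seed)
    names = list(sources)
    source_weights = [weights[name] for name in names]
    batch = []
    for _ in range(num_samples):
        # Pick a source in proportion to its weight, then a document within it.
        name = rng.choices(names, weights=source_weights, k=1)[0]
        batch.append((name, rng.choice(sources[name])))
    return batch

sources = {
    "domain": ["geoscience doc 1", "geoscience doc 2"],
    "general": ["web doc 1", "web doc 2", "web doc 3"],
}
weights = {"domain": 0.7, "general": 0.3}  # hypothetical mixture ratio
batch = sample_mixture(sources, weights, num_samples=1000, seed=0)
print(sum(name == "domain" for name, _ in batch) / len(batch))
```

Logging the realized share of each source, as the last line does, is one simple way to monitor how the data is actually being used during pretraining.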

### Final Considerations:

- **Balanced Approach**: While diversity in data is crucial for general-purpose models, domain-specific models benefit from a fine balance between general data and specialized data.
- **Efficient Preprocessing**: Spend time ensuring that the data is clean, diverse, and high-quality before pretraining. It’s more efficient to preprocess properly upfront than to deal with performance issues during or after training.
- **Ethical Awareness**: Ensure that the pretraining data aligns with ethical guidelines, and be aware of potential biases introduced by the data.

By carefully selecting and preparing your dataset, you will set a strong foundation for training your LLM, leading to better performance, generalization, and fairness in downstream applications.


## Reference

1. https://github.com/jiangnanboy/llm_corpus_quality
2. https://huggingface.co/blog/cosmopedia
3. [The Platonic Representation Hypothesis](https://arxiv.org/pdf/2405.07987)
4. https://huggingface.co/learn/nlp-course/en/chapter6/2
5. https://huggingface.co/learn/nlp-course/en/chapter6/5
6. [The Llama 3 Herd of Models](https://arxiv.org/pdf/2407.21783)