# Creating a Training Set for Text-Based Models

In this notebook, we build a training dataset by combining an existing dataset from Hugging Face with Python scripts downloaded from GitHub. We then clean the combined data by filtering out short examples, dropping repetitive text, removing duplicate documents, and keeping only English text.

In [1]:
import warnings
warnings.filterwarnings("ignore")

### In this notebook, you will explore two sources of data for training:

1. **Download an existing dataset from Hugging Face**  
2. **Create a dataset of Python scripts sourced from GitHub**

##### In both cases, we will create a Hugging Face **`Dataset`** object.

In [None]:
# 1. Downloading data from Hugging Face.
import datasets
pretraining_dataset = datasets.load_dataset(
    "upstage/Pretraining_Dataset",
    split="train"
)

In [4]:
# check the downloaded dataset
print(pretraining_dataset)

Dataset({
    features: ['text', 'meta'],
    num_rows: 60000
})


In [6]:
# The dataset has two columns, 'text' and 'meta'.
# For now we focus only on the 'text' column.
# Let's look at the first 500 characters of the first example.
print(pretraining_dataset[0]["text"][:500])

In 1793 Zaman Shah, a grandson of Ahmad Shah Durrani, won a brief war of succession to become ruler of Afghanistan. The support of Painda Khan, chief of the Baraksai branch of the Durrani tribe, was decisive in his victory. In the next fifty year., the brothers of Zaman shah and the sons of Painda Khan were to dominate the affairs of Afghanistan. The Durrani tribe was very large with several branches and numerous clans. 1 Abmad Shah and his successors belonged to the Sadozai clan, but other clan


In [7]:
# 2. Create a dataset of Python scripts sourced from GitHub
# We will download a few Python scripts from GitHub and then
# prepare them as a Hugging Face 'Dataset' object to use in training.
import os
import requests

# Path to directory to store python scripts
code_dir = "./code"

In [8]:
# URLs of the Python scripts to download from GitHub. Feel free to add your own.
urls = [
    "https://raw.githubusercontent.com/TheAlgorithms/Python/master/searches/double_linear_search_recursion.py",
    "https://raw.githubusercontent.com/KosingZhu/tensorflow/master/tensorflow/python/tools/module_util.py",
    "https://raw.githubusercontent.com/EricRemmerswaal/tensorflow/master/tensorflow/python/distribute/distribute_coordinator_context.py",
    "https://raw.githubusercontent.com/computationalartist/tensorflow/master/tensorflow/python/ops/numpy_ops/integration_test/benchmarks/numpy_mlp.py",
    "https://raw.githubusercontent.com/Van-an/tensorflow/master/tensorflow/python/distribute/coordinator/values.py",
    "https://raw.githubusercontent.com/nkgwer/tensorflow/master/tensorflow/lite/tools/visualize.py",
    "https://raw.githubusercontent.com/gitblazer/youtube-dl/master/youtube_dl/version.py",
    "https://raw.githubusercontent.com/Joshua-Barawa/My-Photos/master/venv/lib/python3.8/site-packages/django/contrib/messages/__init__.py",
    "https://raw.githubusercontent.com/PaliC/pytorch/master/test/fx/test_subgraph_rewriter.py"
]

In [9]:
# Let's retrieve the Python scripts from the URLs.
# Create the target directory once, before downloading.
os.makedirs(code_dir, exist_ok=True)

for url in urls:
    print(f"Working on url: {url}")
    response = requests.get(url, timeout=30)
    # Skip failed requests so an HTTP error page is never saved as code
    if not response.ok:
        print(f"  Skipping (HTTP {response.status_code})")
        continue
    file_name = os.path.basename(url)
    file_path = os.path.join(code_dir, file_name)

    with open(file_path, "wb") as file:
        file.write(response.content)

Working on url: https://raw.githubusercontent.com/TheAlgorithms/Python/master/searches/double_linear_search_recursion.py
Working on url: https://raw.githubusercontent.com/KosingZhu/tensorflow/master/tensorflow/python/tools/module_util.py
Working on url: https://raw.githubusercontent.com/EricRemmerswaal/tensorflow/master/tensorflow/python/distribute/distribute_coordinator_context.py
Working on url: https://raw.githubusercontent.com/computationalartist/tensorflow/master/tensorflow/python/ops/numpy_ops/integration_test/benchmarks/numpy_mlp.py
Working on url: https://raw.githubusercontent.com/Van-an/tensorflow/master/tensorflow/python/distribute/coordinator/values.py
Working on url: https://raw.githubusercontent.com/nkgwer/tensorflow/master/tensorflow/lite/tools/visualize.py
Working on url: https://raw.githubusercontent.com/gitblazer/youtube-dl/master/youtube_dl/version.py
Working on url: https://raw.githubusercontent.com/Joshua-Barawa/My-Photos/master/venv/lib/python3.8/site-packages/djan
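
Because `os.path.basename(url)` keeps only the last path segment, two URLs that end in the same name (two `__init__.py` files, for example) would overwrite each other in `./code`. The sketch below shows one possible way around this. It assumes a hypothetical helper, `local_name_for`, that is not part of the original notebook and that folds the repository owner and in-repo path into the file name.

```python
from urllib.parse import urlparse

def local_name_for(url: str) -> str:
    """Hypothetical helper: derive a collision-resistant file name from a raw GitHub URL."""
    # The URL path looks like /<owner>/<repo>/<branch>/<path/to/file.py>
    parts = [p for p in urlparse(url).path.split("/") if p]
    owner, path_in_repo = parts[0], parts[3:]
    # Join with "__" so the original folder structure stays readable in a flat directory
    return "__".join([owner, *path_in_repo])

example_urls = [
    "https://raw.githubusercontent.com/TheAlgorithms/Python/master/searches/double_linear_search_recursion.py",
    "https://raw.githubusercontent.com/gitblazer/youtube-dl/master/youtube_dl/version.py",
]
names = [local_name_for(u) for u in example_urls]
print(names)
```

Whether the extra prefix is worth it depends on how many sources you add; for the nine URLs here the base names happen to be unique.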

In [10]:
# list the files in the directory
files = os.listdir(code_dir)
for file in files:
    print(file)

test_subgraph_rewriter.py
numpy_mlp.py
values.py
version.py
double_linear_search_recursion.py
__init__.py
visualize.py
module_util.py
distribute_coordinator_context.py


In [11]:
# As mentioned, we will augment the pretraining data with the Python
# scripts we just downloaded. First, read each script into a list of
# records that each hold a single 'text' field.
code_dataset = []
for file in os.listdir(code_dir):
    # Use a context manager so each file handle is closed after reading
    with open(os.path.join(code_dir, file), 'r', encoding='utf-8') as f:
        code_dataset.append({'text': f.read()})

In [12]:
# convert list to Hugging Face Dataset
code_dataset = datasets.Dataset.from_list(code_dataset)
print(code_dataset)

Dataset({
    features: ['text'],
    num_rows: 9
})
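
The list of `{'text': ...}` records is the only thing `Dataset.from_list` needs, so the reading step can be tested on its own. The following is a minimal sketch, not the notebook's method: it assumes a hypothetical helper `read_code_records` that reads only `.py` files, in sorted order, and tolerates bytes that are not valid UTF-8. It runs against a throwaway directory rather than the real `./code` folder.

```python
import tempfile
from pathlib import Path

def read_code_records(directory):
    """Hypothetical helper: return one {'text': ...} record per .py file, in sorted order."""
    records = []
    for path in sorted(Path(directory).glob("*.py")):
        # errors="replace" keeps going if a file contains invalid UTF-8 bytes
        text = path.read_text(encoding="utf-8", errors="replace")
        records.append({"text": text})
    return records

# Demonstrate on a temporary directory with two scripts and one non-code file
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b.py").write_text("print('b')\n", encoding="utf-8")
    Path(tmp, "a.py").write_text("print('a')\n", encoding="utf-8")
    Path(tmp, "notes.txt").write_text("not code", encoding="utf-8")
    records = read_code_records(tmp)

print(records)
```

Sorting makes the row order reproducible across machines, since `os.listdir` returns entries in arbitrary order.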


In [13]:
# The format above matches the 'text' column of the dataset we downloaded from Hugging Face.
# Now, combine the Python code dataset with the pretraining dataset.
dataset = datasets.concatenate_datasets(
    [pretraining_dataset, code_dataset]
)
print(dataset)

Dataset({
    features: ['text', 'meta'],
    num_rows: 60009
})


### We now have a single dataset containing both the pretraining data and the Python scripts

### Next, we will clean the data in four steps

1. **Filter out samples that are too short**  
2. **Remove repetitions within a single text example**
3. **Remove duplicated documents**
4. **Quality filter to remove non-English texts**
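
Each of these steps is a predicate that decides whether a row is kept, applied one after another. The sketch below illustrates that idea on plain Python lists, using a hypothetical `apply_filters` helper and toy predicates (none of them from the notebook), and records how many rows each stage removes.

```python
def apply_filters(rows, filters):
    """Apply (name, predicate) pairs in order and record how many rows each one removes."""
    report = []
    for name, keep in filters:
        before = len(rows)
        rows = [row for row in rows if keep(row)]
        report.append((name, before - len(rows)))
    return rows, report

# Toy rows and toy predicates, purely for illustration
rows = [{"text": "hello world"}, {"text": ""}, {"text": "hello world"}, {"text": "hi"}]
seen = set()

def not_empty(row):
    return len(row["text"]) > 0

def first_occurrence(row):
    # Stateful predicate: remembers every text it has already accepted
    if row["text"] in seen:
        return False
    seen.add(row["text"])
    return True

kept, report = apply_filters(rows, [("non-empty", not_empty), ("dedup", first_occurrence)])
print(kept)
print(report)
```

Order matters: running the cheap per-row checks first means the later, more expensive stages see fewer rows.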

In [14]:
# the number of rows we start with
dataset.num_rows

60009

In [15]:
# we will remove examples that are too short
import heapq

def paragraph_length_filter(x):
    """Returns False iff the text has fewer than 3 lines or its third-longest line is shorter than 3 characters."""
    lines = x['text'].split('\n')
    if (
        len(lines) < 3
        or min(heapq.nlargest(3, [len(line) for line in lines])) < 3
    ):
        return False
    return True

# use the convenient 'filter' method of 'Dataset' to apply the function above to all rows
dataset = dataset.filter(
    paragraph_length_filter,
    load_from_cache_file=False
)

Filter:   0%|          | 0/60009 [00:00<?, ? examples/s]

In [16]:
# check the number of rows after filtering
dataset.num_rows

52356
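
To see what this rule accepts, note that `heapq.nlargest(3, ...)` picks the three longest line lengths and `min` then returns the third-longest. A row therefore survives only if it has at least three lines and the third-longest one has at least 3 characters. The sample texts below are made up for illustration; the function body repeats the notebook's logic.

```python
import heapq

def paragraph_length_filter(x):
    """Same rule as in the notebook: keep rows with >= 3 lines whose third-longest line has >= 3 chars."""
    lines = x["text"].split("\n")
    if len(lines) < 3 or min(heapq.nlargest(3, [len(line) for line in lines])) < 3:
        return False
    return True

# Made-up samples covering the main cases
samples = {
    "single line": "just one line of text",
    "three tiny lines": "a\nb\nc",
    "three real lines": "first line\nsecond line\nthird",
    "two long, rest blank": "a long line\nanother long line\n\n",
}
results = {name: paragraph_length_filter({"text": text}) for name, text in samples.items()}
print(results)
```

Only "three real lines" passes. The last sample has four lines after splitting, but two of them are empty, so its third-longest line has length 0.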

In [17]:
# Now, we will remove repeated text within training examples
import re

def find_duplicates(paragraphs):
    """
    Count repeated paragraphs. Returns a pair: the number of repeated
    paragraphs and the total number of characters they contain.
    """
    unique_x = set()
    duplicate_chars = 0
    duplicate_elements = 0
    for element in paragraphs:
        if element in unique_x:
            duplicate_chars += len(element)
            duplicate_elements += 1
        else:
            unique_x.add(element)
    return duplicate_elements, duplicate_chars


def paragraph_repetition_filter(x):
    """
    Returns False iff a page has too many repetitions.
    """
    text = x['text']
    paragraphs = re.compile(r"\n{2,}").split(text.strip())                # Split by paragraphs (2 or more newlines)
    paragraphs_duplicates, char_duplicates = find_duplicates(paragraphs)  # Find number of duplicates in paragraphs
    if paragraphs_duplicates / len(paragraphs) > 0.3:
        return False
    if char_duplicates / len(text) > 0.2:
        return False
    return True


dataset = dataset.filter(
    paragraph_repetition_filter,
    load_from_cache_file=False
)

Filter:   0%|          | 0/52356 [00:00<?, ? examples/s]

In [18]:
# check the number of rows after filtering
dataset.num_rows

52326
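
The two thresholds are easier to follow on a small example. The sketch below reuses the notebook's `find_duplicates` logic and wraps the ratio computation in a hypothetical `repetition_stats` helper; the sample text is invented.

```python
import re

def find_duplicates(paragraphs):
    """Count repeated paragraphs and the characters they contain."""
    unique_x = set()
    duplicate_chars = 0
    duplicate_elements = 0
    for element in paragraphs:
        if element in unique_x:
            duplicate_chars += len(element)
            duplicate_elements += 1
        else:
            unique_x.add(element)
    return duplicate_elements, duplicate_chars

def repetition_stats(text):
    """Hypothetical helper: return (duplicate paragraph ratio, duplicate character ratio)."""
    paragraphs = re.split(r"\n{2,}", text.strip())
    dup_paragraphs, dup_chars = find_duplicates(paragraphs)
    return dup_paragraphs / len(paragraphs), dup_chars / len(text)

# Invented sample: one banner repeated three times, then one real paragraph
text = "Subscribe now!\n\nSubscribe now!\n\nSubscribe now!\n\nActual content."
para_ratio, char_ratio = repetition_stats(text)
print(para_ratio, char_ratio)
```

Here two of the four paragraphs are repeats (ratio 0.5, above the 0.3 limit), and they account for 28 of the 63 characters (above the 0.2 limit), so the notebook's filter would drop this row on the first check.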

In [19]:
# Now, you'll remove duplicate examples from the entire dataset, i.e. deduplication.
# This contrasts with the previous step, which only looked for repeated text within each example.
def deduplication(ds):
    def dedup_func(x):
        """Use this function to remove duplicate entries"""
        if x['text'] in unique_text:
            return False
        else:
            unique_text.add(x['text'])
            return True

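    unique_text = set()

    # num_proc=1: 'unique_text' lives in this process, so the filter
    # must not be split across worker processes
    ds = ds.filter(dedup_func, load_from_cache_file=False, num_proc=1)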
    return ds

dataset = deduplication(dataset)

Filter:   0%|          | 0/52326 [00:00<?, ? examples/s]

In [20]:
dataset.num_rows

43597
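
The approach above keeps every unique text in memory. One common variant, shown here as a sketch rather than as the notebook's method, stores a fixed-size SHA-256 digest per row instead. The `dedup_by_hash` helper and the sample list are hypothetical.

```python
import hashlib

def dedup_by_hash(texts):
    """Keep the first occurrence of each text, remembering only a 32-byte digest per row."""
    seen = set()
    kept = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept

# Made-up sample with two repeated entries
texts = ["alpha", "beta", "alpha", "gamma", "beta"]
unique = dedup_by_hash(texts)
print(unique)
```

Like the set-based version, this only catches exact duplicates: two documents that differ by a single character are both kept.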

In [21]:
# now, we'll remove any text examples that are in a language other than English
import os
import urllib.request
from fasttext import load_model

def english_language_filter(ds):
    # Create models directory if it doesn't exist
    os.makedirs('models', exist_ok=True)
    
    # Download the language detection model if not present
    model_path = 'models/lid.176.bin'
    if not os.path.exists(model_path):
        print("Downloading language detection model...")
        urllib.request.urlretrieve(
            'https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin',
            model_path
        )
    
    # Load the model
    model = load_model(model_path)
    
    def is_english(x):
        # fastText predicts one line at a time, so remove newlines first,
        # then predict the language of the text and its probability
        predictions = model.predict(x['text'].replace("\n", ""))
        language = predictions[0][0]  # Get predicted language
        score = predictions[1][0]     # Get confidence score
        
        return score > 0.4 and language == '__label__en'

    ds = ds.filter(is_english, load_from_cache_file=False, num_proc=1)
    return ds

dataset = english_language_filter(dataset)

Downloading language detection model...




Filter:   0%|          | 0/43597 [00:00<?, ? examples/s]

In [22]:
dataset.num_rows

40473
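
The decision rule itself (top label is `__label__en` and its score exceeds 0.4) can be checked without downloading the model. The sketch below separates that rule into a hypothetical `is_english_prediction` function and feeds it from a `StubModel` stand-in. The stub and its scores are made up for illustration and are not real fastText output.

```python
def is_english_prediction(labels, scores, threshold=0.4):
    """Apply the notebook's rule to a (labels, scores) pair shaped like the model's predict output."""
    return labels[0] == "__label__en" and scores[0] > threshold

class StubModel:
    """Hypothetical stand-in for the language model; returns canned predictions."""
    def __init__(self, canned):
        self.canned = canned

    def predict(self, text):
        if "\n" in text:
            # Mirrors the one-line-at-a-time restriction of the real model
            raise ValueError("predict processes one line at a time")
        return self.canned[text]

# Made-up predictions: (labels, scores) for three short texts
model = StubModel({
    "The cat sat on the mat.": (("__label__en",), [0.97]),
    "Le chat est sur le tapis.": (("__label__fr",), [0.99]),
    "ok ok ok": (("__label__en",), [0.31]),
})

decisions = {text: is_english_prediction(*model.predict(text)) for text in model.canned}
print(decisions)
```

The third case shows why the score threshold matters: a text labeled English with low confidence is still rejected.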

In [23]:
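# now, save the dataset to disk
file_path = "./data/preprocessed_dataset.parquet"
# make sure the target directory exists before writing
os.makedirs(os.path.dirname(file_path), exist_ok=True)
dataset.to_parquet(file_path)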

Creating parquet from Arrow format:   0%|          | 0/41 [00:00<?, ?ba/s]

197879689
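
To summarize the effect of each cleaning step, the row counts printed above can be turned into per-stage drops and an overall retention rate. The counts below are copied from the notebook outputs; only the stage labels are shorthand.

```python
# Row counts taken from the outputs of the cells above
stages = [
    ("combined dataset", 60009),
    ("length filter", 52356),
    ("repetition filter", 52326),
    ("deduplication", 43597),
    ("English filter", 40473),
]

drops = []
for (_, before), (name, after) in zip(stages, stages[1:]):
    removed = before - after
    drops.append(removed)
    print(f"{name:<18} removed {removed:>5} rows ({removed / before:.1%}), {after} remain")

retention = stages[-1][1] / stages[0][1]
print(f"overall retention: {retention:.1%}")
```

Deduplication and the length filter account for most of the removed rows, while the repetition filter removes only 30.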