# Lesson 7: Text Data Preprocessing and Preparation

## Introduction (5 minutes)

Welcome to our lesson on text data preprocessing and preparation. In this 90-minute session, we'll explore the crucial steps involved in preparing text data for Large Language Model (LLM) training. We'll cover the common forms of training data, techniques for quality filtering, deduplication, privacy removal, and normalization, and the large-scale sources and public datasets used most often.

## Lesson Objectives

By the end of this lesson, you will:
1. Understand the importance of data preprocessing in LLM training
2. Learn common methods for text data preprocessing
3. Explore techniques for handling large-scale text data
4. Gain familiarity with popular pre-training and fine-tuning datasets

## 1. Introduction to LLM Training Data Forms (10 minutes)

LLMs require vast amounts of text data for training. Common forms of training data include:

1. Web crawled data (e.g., Common Crawl)
2. Books and literature
3. Scientific papers and articles
4. Social media posts
5. Dialogue data (for conversational models)

Example of fetching a single web page and extracting its paragraph text, the basic step that a web crawl repeats at scale:

In [None]:
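import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Large_language_model"
# Hypothetical User-Agent string; some sites reject requests that do not send one
headers = {"User-Agent": "llm-course-example/0.1"}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # Fail early instead of parsing an error page
soup = BeautifulSoup(response.text, 'html.parser')

# Extract text from paragraphs
text = ' '.join(p.get_text() for p in soup.find_all('p'))
print(text[:500])  # Print first 500 characters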

## 2. Common Methods for Text Data Preprocessing (40 minutes)

### 2.1 Low-quality Filtering (10 minutes)

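Low-quality documents such as spam, boilerplate, and machine-generated noise degrade the model trained on them, so they are filtered out before training.

Techniques include:
- Removing duplicate content
- Filtering out non-natural language text (e.g., code, random strings)
- Removing content that a reference language model scores with high perplexity, which suggests the text is unlike the clean data that model was trained on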

Example of removing exact duplicates while preserving the original order:

In [None]:
def remove_duplicates(texts):
    # dict keys keep insertion order (Python 3.7+), so the first occurrence of each text survives
    return list(dict.fromkeys(texts))

texts = ["Hello world", "Hello world", "This is unique", "Hello world"]
unique_texts = remove_duplicates(texts)
print(unique_texts)
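
The second technique in the list above, filtering out non-natural language text, can be approximated with simple character statistics. The following is a minimal sketch rather than a production filter: the function name `looks_like_natural_language` and both thresholds are illustrative assumptions that would need tuning on real data.

```python
def looks_like_natural_language(text, min_alpha_ratio=0.7, max_mean_word_len=12):
    """Heuristic check: mostly letters and spaces, with plausible word lengths.

    Both thresholds are assumed values chosen for this example.
    """
    words = text.split()
    if not words:
        return False
    # Share of characters that are letters or whitespace
    alpha_ratio = sum(ch.isalpha() or ch.isspace() for ch in text) / len(text)
    # Very long "words" usually indicate hashes, keys, or random strings
    mean_word_len = sum(len(w) for w in words) / len(words)
    return alpha_ratio >= min_alpha_ratio and mean_word_len <= max_mean_word_len

samples = [
    "The weather was pleasant, so we walked to the station.",
    "x = {k: v for k, v in zip(a, b)}; print(x[0][1])",
    "asdkjh2398fhq9832hf9q8h3f9q8hf98qh3f",
]
kept = [s for s in samples if looks_like_natural_language(s)]
print(kept)
```

Only the first sample passes: the code snippet contains too many symbols, and the random string is a single overly long token.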

### 2.2 Redundancy Handling (10 minutes)

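Beyond exact duplicates, corpora contain many nearly identical documents, such as mirrored pages, templated text, and lightly edited copies. Redundancy handling identifies this similar content and removes or reduces it, since it adds little value to the training set.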

Techniques include:
- Near-duplicate detection
- Semantic similarity clustering

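Example using cosine similarity over TF-IDF vectors. TF-IDF measures word overlap, not meaning, so it catches near-duplicates but not paraphrases: the two fox sentences below share only two words and fall well under the 0.8 threshold, so no pair is reported for this input.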

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_similar_texts(texts, threshold=0.8):
    """Return index pairs (i, j), i < j, whose TF-IDF cosine similarity exceeds threshold."""
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(texts)
    # Dense n x n matrix of pairwise similarities; practical for small inputs only
    cosine_sim = cosine_similarity(tfidf_matrix)

    similar_pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine_sim[i, j] > threshold:
                similar_pairs.append((i, j))

    return similar_pairs

texts = [
    "The quick brown fox jumps over the lazy dog",
    "A fast brown fox leaps above a sleepy canine",
    "Python is a popular programming language",
    "Java is widely used in enterprise software development"
]

similar_pairs = find_similar_texts(texts)
print("Similar text pairs:", similar_pairs)
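
Near-duplicate detection, the first technique listed above, is often based on comparing sets of overlapping word n-grams ("shingles") with Jaccard similarity. The sketch below illustrates the idea under stated assumptions: the function names, the shingle size `n=3`, and the 0.5 threshold are choices made for this example. At corpus scale, the all-pairs comparison is usually replaced by an approximate method such as MinHash with locality-sensitive hashing.

```python
def shingles(text, n=3):
    """Return the set of word n-grams in a lowercased text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicate_pairs(texts, threshold=0.5, n=3):
    """Return index pairs whose shingle sets overlap at or above the threshold."""
    sets = [shingles(t, n) for t in texts]
    return [
        (i, j)
        for i in range(len(texts))
        for j in range(i + 1, len(texts))
        if jaccard(sets[i], sets[j]) >= threshold
    ]

docs = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the lazy dog today",
    "Python is a popular programming language",
]
pairs = near_duplicate_pairs(docs)
print("Near-duplicate pairs:", pairs)
```

The first two documents share 7 of their 8 distinct trigrams (similarity 0.875), so they are reported as a pair; the third shares none with either.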

### 2.3 Privacy Removal (10 minutes)

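Training data often contains personally identifiable information (PII) such as names, email addresses, and phone numbers, which a model can memorize and reproduce. Privacy removal deletes or anonymizes this information before training.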

Techniques include:
- Named Entity Recognition (NER) for identifying personal information
- Regular expressions for detecting patterns like email addresses or phone numbers

Example using spaCy for NER:

In [None]:
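import spacy

# The model must be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def anonymize_text(text):
    """Replace person, organization, and geopolitical entities with their label in brackets."""
    doc = nlp(text)
    anonymized = text
    # Replace from the end so earlier character offsets stay valid
    for ent in reversed(doc.ents):
        if ent.label_ in ["PERSON", "ORG", "GPE"]:
            anonymized = anonymized[:ent.start_char] + "[" + ent.label_ + "]" + anonymized[ent.end_char:]
    return anonymized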

text = "John Doe works at Google in New York."
anonymized_text = anonymize_text(text)
print("Original:", text)
print("Anonymized:", anonymized_text)
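
The second technique listed above, regular expressions for patterns such as email addresses and phone numbers, could look like the following sketch. The two patterns are simplified assumptions: the phone pattern targets North American style numbers only, and neither is a complete validator, so real pipelines typically combine several patterns with NER.

```python
import re

# Simplified patterns for illustration; they will miss some valid formats
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(
    r"(?<!\w)(?:\+?\d{1,3}[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}(?!\w)"
)

def redact_pii(text):
    """Replace email addresses and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

sample = "Contact jane.doe@example.com or call (555) 123-4567."
redacted = redact_pii(sample)
print("Original:", sample)
print("Redacted:", redacted)
```

Replacing matches with typed placeholders such as `[EMAIL]` keeps the sentence structure intact, which mirrors the bracketed labels used in the NER example.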

### 2.4 Text Normalization (10 minutes)

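Text normalization converts text to a consistent format so that superficial variants in case, spacing, and encoding are treated alike. How aggressive to be depends on the task: lowercasing and punctuation removal suit classical NLP pipelines, while LLM pre-training corpora usually keep case and punctuation, because the model has to learn to produce them.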

Techniques include:
- Lowercasing
- Removing punctuation
- Expanding contractions
- Handling special characters and emojis

Example of basic text normalization (note that stripping punctuation turns "How's" into "hows" rather than expanding the contraction):

In [None]:
import re

def normalize_text(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

text = "Hello, World!   How's it going?"
normalized_text = normalize_text(text)
print("Original:", text)
print("Normalized:", normalized_text)
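
The remaining techniques from the list above, expanding contractions and handling special characters and emojis, can be sketched with the standard library alone. Everything named here is an assumption made for illustration: the `CONTRACTIONS` table is deliberately tiny, some contractions are ambiguous ("it's" can also mean "it has"), and filtering on the Unicode category "So" removes most but not all emoji sequences.

```python
import re
import unicodedata

# Hypothetical, deliberately small lookup table for illustration only
CONTRACTIONS = {
    "how's": "how is",
    "it's": "it is",
    "i'm": "i am",
    "don't": "do not",
    "can't": "cannot",
}
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b"
)

def normalize_with_contractions(text):
    # NFKC folds compatibility characters (e.g., full-width letters) into standard forms
    text = unicodedata.normalize("NFKC", text)
    # Map the typographic apostrophe to ASCII so the lookup table matches
    text = text.replace("\u2019", "'").lower()
    # Expand known contractions before any punctuation is stripped
    text = CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(1)], text)
    # Drop characters in Unicode category "So" (Symbol, other), which includes most emoji
    text = "".join(ch for ch in text if unicodedata.category(ch) != "So")
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", text).strip()

result = normalize_with_contractions("How\u2019s it going? I\u2019m fine \U0001F600")
print(result)
```

Running contraction expansion before punctuation removal matters: once the apostrophe is gone, "hows" can no longer be matched against the table.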

## 3. Introduction to Large-scale Text Data Sources (20 minutes)

Large-scale text data sources are crucial for training LLMs. Let's explore some popular sources:

### 3.1 Common Crawl

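Common Crawl is a freely available repository of web crawl data maintained by a nonprofit organization. It is published as periodic snapshots in WARC format (raw HTTP responses), alongside WAT (metadata) and WET (extracted plain text) files.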

Example of accessing Common Crawl data (conceptual, not runnable here):

In [None]:
import requests
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

def fetch_common_crawl_sample():
    # Common Crawl serves its files over HTTPS from data.commoncrawl.org
    url = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2021-04/segments/1610703495745.19/warc/CC-MAIN-20210115134348-20210115164348-00000.warc.gz"
    response = requests.get(url, stream=True, timeout=30)
    response.raise_for_status()

    # ArchiveIterator decompresses the gzipped WARC stream record by record
    for record in ArchiveIterator(response.raw):
        if record.rec_type == 'response':
            print(record.rec_headers.get_header('WARC-Target-URI'))
            print(record.content_stream().read()[:500])  # First 500 bytes of the payload
            break  # Just print the first record for demonstration

# fetch_common_crawl_sample()  # Uncomment to run (may take a while)

### 3.2 Wikipedia Dumps

Wikipedia provides regular dumps of its content, which are widely used in NLP tasks.

Example of processing a Wikipedia dump (conceptual, not runnable here):

In [None]:
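from gensim.corpora import WikiCorpus

def process_wiki_dump(dump_path):
    # Passing an empty dictionary skips the vocabulary-building pass,
    # which would otherwise scan the entire dump before the first article is returned
    wiki = WikiCorpus(dump_path, dictionary={})
    for tokens in wiki.get_texts():  # Each article is a list of lowercased tokens
        yield ' '.join(tokens)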

# Usage:
# dump_path = "path_to_wikipedia_dump.xml.bz2"
# for article in process_wiki_dump(dump_path):
#     print(article[:100])  # Print first 100 characters of each article

## 4. Introduction to Famous Public Pre-training and Fine-tuning Datasets (10 minutes)

Let's briefly discuss some popular datasets used for pre-training and fine-tuning LLMs:

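1. BookCorpus: A collection of about 11,000 free books by unpublished authors, gathered from the web
2. OpenWebText: An open-source recreation of OpenAI's WebText corpus, built from web pages linked in Reddit posts
3. C4 (Colossal Clean Crawled Corpus): A filtered, deduplicated, English-language subset of Common Crawl, created for training T5
4. GLUE (General Language Understanding Evaluation): A benchmark of nine language-understanding tasks, used for fine-tuning and evaluation rather than pre-training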

Example of loading a dataset using Hugging Face Datasets:

In [None]:
from datasets import load_dataset

# Load the GLUE dataset for the CoLA task
dataset = load_dataset("glue", "cola")
print(dataset)

# Print a sample from the training set
print(dataset['train'][0])

## Conclusion and Q&A (5 minutes)

We've covered various aspects of text data preprocessing and preparation, including data cleaning, normalization, and sources of large-scale text data. Remember, the quality and diversity of your training data significantly impact the performance of your LLM.

Are there any questions about the topics we've covered?

## Additional Resources

1. "Natural Language Processing with Python" by Steven Bird, Ewan Klein, and Edward Loper
2. Common Crawl documentation: https://commoncrawl.org/the-data/get-started/
3. Hugging Face Datasets documentation: https://huggingface.co/docs/datasets/
4. "Data Cleaning for Natural Language Processing: A Research Survey" paper: https://arxiv.org/abs/2103.05028

In our next lesson, we'll dive into the specifics of LLM training, including fine-tuning techniques.