# Lesson 27: Data Preparation and Preprocessing

## Introduction (5 minutes)

Welcome to our lesson on Data Preparation and Preprocessing for RAG systems. In this 75-minute session, we'll work through the steps that turn raw documents into retrievable text: loading and parsing, segmentation, cleaning and normalization, indexing with Elasticsearch, and a few supporting techniques such as keyword extraction and summarization.

## Lesson Objectives

By the end of this lesson, you will be able to:
1. Load and parse different types of documents
2. Implement text segmentation techniques
3. Apply data cleaning and normalization methods
4. Set up and use Elasticsearch for text indexing
5. Understand and implement various data processing techniques

## 1. Text Loading and Parsing (15 minutes)

Let's start by implementing functions to load and parse different types of documents:

In [None]:
import os
import PyPDF2
from bs4 import BeautifulSoup
import requests

def load_text_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

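def load_pdf_file(file_path):
    # Note: PyPDF2 is no longer maintained; its successor `pypdf` exposes the same PdfReader API
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # extract_text() can return None or '' for image-only pages, so fall back to ''
        return ' '.join(page.extract_text() or '' for page in reader.pages)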

def load_html_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'html.parser')
        return soup.get_text()

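def load_web_page(url):
    # A timeout avoids hanging on slow hosts; raise_for_status surfaces HTTP errors (4xx/5xx)
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    return soup.get_text()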

def load_document(path_or_url):
    if path_or_url.startswith(('http://', 'https://')):
        return load_web_page(path_or_url)

    _, file_extension = os.path.splitext(path_or_url)
    file_extension = file_extension.lower()  # Treat '.PDF' and '.pdf' alike
    if file_extension == '.txt':
        return load_text_file(path_or_url)
    elif file_extension == '.pdf':
        return load_pdf_file(path_or_url)
    elif file_extension in ['.html', '.htm']:
        return load_html_file(path_or_url)
    else:
        raise ValueError(f"Unsupported file type: {file_extension}")

# Usage
text = load_document('example.pdf')
print(text[:500])  # Print first 500 characters
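
The dispatch logic in `load_document` can be tried on its own, without any files or network access. The sketch below is a standard-library illustration only: `classify_source` and `EXTENSION_TO_KIND` are hypothetical names that are not part of the lesson code, and the labels they return are assumptions made for this example. It shows two details worth getting right: checking the URL scheme rather than a bare `http` prefix, and comparing extensions case-insensitively.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical mapping from file extension to a loader label (illustration only)
EXTENSION_TO_KIND = {
    ".txt": "text",
    ".pdf": "pdf",
    ".html": "html",
    ".htm": "html",
}

def classify_source(path_or_url):
    """Return 'web', 'text', 'pdf', or 'html' for a path or URL."""
    # Only treat the input as a web page when it has a real http(s) scheme
    scheme = urlparse(path_or_url).scheme.lower()
    if scheme in ("http", "https"):
        return "web"
    # Lowercase the extension so 'Report.PDF' and 'report.pdf' behave alike
    extension = Path(path_or_url).suffix.lower()
    try:
        return EXTENSION_TO_KIND[extension]
    except KeyError:
        raise ValueError(f"Unsupported file type: {extension!r}") from None

print(classify_source("https://example.com/page"))
print(classify_source("Report.PDF"))
print(classify_source("http_notes.txt"))
```

A file named `http_notes.txt` is classified as a text file here because it has no URL scheme, whereas a plain prefix check on `http` would send it to the web loader.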

## 2. Text Segmentation (15 minutes)

Next, let's implement text segmentation to break down large documents into manageable chunks:

In [None]:
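import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')      # Sentence tokenizer models
nltk.download('punkt_tab')  # Needed instead of 'punkt' by NLTK 3.9 and later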

def segment_text(text, max_chunk_size=1000, overlap=100):
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_size = 0

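    for sentence in sentences:
        sentence_size = len(sentence)
        if current_chunk and current_size + sentence_size > max_chunk_size:
            chunks.append(' '.join(current_chunk))
            # Carry trailing sentences into the next chunk. `overlap` is a budget in
            # characters, capped so the carried text plus the new sentence still fits.
            budget = min(overlap, max_chunk_size - sentence_size)
            carried, carried_size = [], 0
            for previous in reversed(current_chunk):
                if carried_size + len(previous) > budget:
                    break
                carried.insert(0, previous)
                carried_size += len(previous)
            current_chunk, current_size = carried, carried_size

        current_chunk.append(sentence)
        current_size += sentence_size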

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks

# Usage
document = load_document('example.pdf')
segments = segment_text(document)
print(f"Number of segments: {len(segments)}")
print(f"First segment: {segments[0][:200]}...")
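
To see how overlapping chunks behave on a small input, here is a minimal sketch that needs only the standard library. It assumes a naive regex sentence splitter as a stand-in for `sent_tokenize` (it is not equivalent and will mis-split abbreviations), and it treats `overlap` as a character budget. The names `split_sentences` and `chunk_sentences` are hypothetical and exist only for this illustration.

```python
import re

def split_sentences(text):
    # Naive splitter: break on whitespace that follows ., !, or ?
    # (a stand-in for nltk's sent_tokenize, not an equivalent)
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def chunk_sentences(sentences, max_chunk_size=1000, overlap=100):
    chunks, current, size = [], [], 0
    for sentence in sentences:
        if current and size + len(sentence) > max_chunk_size:
            chunks.append(' '.join(current))
            # Carry over trailing sentences that fit within the overlap budget,
            # capped so the carried text plus the new sentence stays under the limit
            budget = min(overlap, max_chunk_size - len(sentence))
            carried, carried_size = [], 0
            for previous in reversed(current):
                if carried_size + len(previous) > budget:
                    break
                carried.insert(0, previous)
                carried_size += len(previous)
            current, size = carried, carried_size
        current.append(sentence)
        size += len(sentence)
    if current:
        chunks.append(' '.join(current))
    return chunks

sample = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in chunk_sentences(split_sentences(sample), max_chunk_size=40, overlap=20):
    print(repr(chunk))
# 'One two three. Four five six.'
# 'Four five six. Seven eight nine.'
# 'Seven eight nine. Ten eleven twelve.'
```

Each chunk repeats the last sentence of the previous one, which gives the retriever some shared context across chunk boundaries. Sizes are counted in sentence characters and ignore the joining spaces, so the limit is approximate.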

## 3. Data Cleaning and Normalization (10 minutes)

Let's implement some basic data cleaning and normalization functions:

In [None]:
import re
import unicodedata

def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

def normalize_text(text):
    # Normalize unicode characters
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    
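    # Replace common contractions. This is a rough heuristic: "'s" may also be a
    # possessive and "'d" may mean "had". Irregular forms are listed first so the
    # generic "n't" rule does not mangle them (dicts preserve insertion order).
    contractions = {
        "won't": "will not",
        "can't": "cannot",
        "n't": " not",
        "'s": " is",
        "'m": " am",
        "'re": " are",
        "'ll": " will",
        "'ve": " have",
        "'d": " would"
    }
    for contraction, expansion in contractions.items():
        text = text.replace(contraction, expansion)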
    
    return text

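def preprocess_text(text):
    # Normalize first: clean_text strips apostrophes and accented letters,
    # so running it first would leave nothing for normalize_text to work on
    text = normalize_text(text)
    text = clean_text(text)
    return text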

# Usage
raw_text = "Here's an example text with some noise!!! It's got 123 numbers and special chars."
processed_text = preprocess_text(raw_text)
print(f"Original: {raw_text}")
print(f"Processed: {processed_text}")
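
The order of the two steps changes the result, which is easy to check on a short string. The sketch below uses simplified stand-ins for the lesson's functions (`strip_accents` and `keep_letters` are hypothetical names introduced only for this comparison) and applies them in both orders to text containing accented letters.

```python
import re
import unicodedata

def strip_accents(text):
    # Decompose accented characters (NFKD), then drop the non-ASCII combining marks
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

def keep_letters(text):
    # Lowercase, delete everything except ASCII letters and whitespace, collapse spaces
    text = re.sub(r'[^a-z\s]', '', text.lower())
    return re.sub(r'\s+', ' ', text).strip()

raw = "Caf\u00e9 D\u00e9j\u00e0 Vu"  # "Café Déjà Vu", written with escapes to fix the encoding

clean_then_normalize = strip_accents(keep_letters(raw))
normalize_then_clean = keep_letters(strip_accents(raw))

print(clean_then_normalize)  # caf dj vu
print(normalize_then_clean)  # cafe deja vu
```

Filtering to ASCII letters before normalizing deletes the accented characters outright, while normalizing first keeps their base letters. The same reasoning applies to contractions: they can only be expanded while the apostrophes are still in the text.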

## 4. Setting up Elasticsearch (10 minutes)

Now, let's set up Elasticsearch for efficient text indexing and search:

In [None]:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(["http://localhost:9200"])

def create_index(index_name):
    settings = {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 0
        },
        "mappings": {
            "properties": {
                "content": {"type": "text"},
                "embedding": {"type": "dense_vector", "dims": 384}  # Adjust dims based on your embedding model
            }
        }
    }
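    # Skip creation if the index already exists, so this cell can be re-run safely
    if not es.indices.exists(index=index_name):
        # Keyword style of the 8.x Python client; on 7.x pass body=settings instead
        es.indices.create(
            index=index_name,
            settings=settings["settings"],
            mappings=settings["mappings"],
        )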

def index_documents(index_name, documents):
    actions = [
        {
            "_index": index_name,
            "_source": {
                "content": doc,
                # Add embedding here if you're using vector search
            }
        }
        for doc in documents
    ]
    bulk(es, actions)

# Usage
index_name = "rag_documents"
create_index(index_name)
documents = segment_text(load_document('example.pdf'))
index_documents(index_name, documents)
print(f"Indexed {len(documents)} documents")
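
Once the chunks are indexed, they can be retrieved with a full-text query. The following is a minimal sketch, assuming the keyword-argument style of the 8.x Python client and the `content` field from the mapping above. `build_match_query` and `search_documents` are hypothetical helper names, and the client is passed in as a parameter so the query-building part can be tried without a running cluster.

```python
def build_match_query(query_text):
    # Full-text "match" query against the `content` field defined in the mapping
    return {"match": {"content": query_text}}

def search_documents(es_client, index_name, query_text, top_k=3):
    # Assumes the 8.x client's keyword arguments; on 7.x the query is usually
    # passed as body={"query": ..., "size": ...} instead
    response = es_client.search(
        index=index_name,
        query=build_match_query(query_text),
        size=top_k,
    )
    # Each hit carries a relevance score and the stored document source
    return [
        (hit["_score"], hit["_source"]["content"])
        for hit in response["hits"]["hits"]
    ]

print(build_match_query("data preprocessing"))
# {'match': {'content': 'data preprocessing'}}
```

With the client from this section, a call would look like `search_documents(es, index_name, "data preprocessing")`. Newly indexed documents become searchable only after the index refreshes, so a search issued immediately after `bulk` may return nothing unless `es.indices.refresh(index=index_name)` is called first.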

## 5. Data Processing Methods and Tools (15 minutes)

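Let's explore three additional processing techniques that are often useful when preparing a corpus: keyword extraction, extractive summarization, and language detection.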

In [None]:
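from sklearn.feature_extraction.text import TfidfVectorizer
# gensim.summarization was removed in Gensim 4.0, so this import needs gensim<4.0
# (for example: pip install "gensim<4.0")
from gensim.summarization import summarize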

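def extract_keywords(text, top_n=5):
    # With a single document the IDF factor is the same for every term,
    # so this effectively ranks terms by normalized frequency
    vectorizer = TfidfVectorizer(stop_words='english')
    scores = vectorizer.fit_transform([text]).toarray()[0]
    feature_names = vectorizer.get_feature_names_out()
    # Pair each score with its term and return the highest-scoring terms first
    ranked = sorted(zip(scores, feature_names), reverse=True)
    return [term for _, term in ranked[:top_n]]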

def generate_summary(text, ratio=0.2):
    return summarize(text, ratio=ratio)

def detect_language(text):
    from langdetect import DetectorFactory, detect
    DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded
    return detect(text)

# Usage
sample_text = load_document('example.txt')
keywords = extract_keywords(sample_text)
summary = generate_summary(sample_text)
language = detect_language(sample_text)

print(f"Keywords: {keywords}")
print(f"Summary: {summary[:200]}...")
print(f"Detected Language: {language}")
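
If the Gensim summarizer is not available in your environment, a rough extractive summary can be sketched with the standard library alone. The example below is not Gensim's TextRank-based method. It is a simpler word-frequency heuristic, shown under that assumption, and `simple_summary` and `STOP_WORDS` are hypothetical names introduced for this illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
              "is", "are", "it", "for", "on", "with"}

def _tokens(text):
    # Lowercase alphabetic tokens with stop words removed
    return [w for w in re.findall(r'[a-z]+', text.lower()) if w not in STOP_WORDS]

def simple_summary(text, max_sentences=2):
    # Naive sentence split on whitespace that follows ., !, or ?
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    freq = Counter(_tokens(text))

    def score(sentence):
        tokens = _tokens(sentence)
        # Average word frequency, so long sentences are not favoured automatically
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Pick the highest-scoring sentences, then restore document order
    top = sorted(range(len(sentences)),
                 key=lambda i: score(sentences[i]), reverse=True)[:max_sentences]
    return ' '.join(sentences[i] for i in sorted(top))

text = ("Retrieval systems index documents. "
        "Retrieval quality depends on clean documents. "
        "Lunch was nice.")
print(simple_summary(text, max_sentences=1))
# Retrieval systems index documents.
```

Sentences built from words that recur across the document score highest, which is why the off-topic last sentence is dropped.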

## Conclusion and Q&A (5 minutes)

In this lesson, we've covered essential data preparation and preprocessing techniques for RAG systems. We've learned how to load and parse different document types, segment text, clean and normalize data, set up Elasticsearch for indexing, and implement various data processing methods.

Are there any questions about the data preparation and preprocessing steps we've covered?

## Additional Resources

1. NLTK documentation: https://www.nltk.org/
2. Elasticsearch Python Client: https://elasticsearch-py.readthedocs.io/
3. "Text Preprocessing in Python: Steps, Tools, and Examples" article: https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908
4. Gensim library documentation: https://radimrehurek.com/gensim/

In our next lesson, we'll focus on building the vector database with Milvus for efficient similarity search in our RAG system.