# Lab 02: Reading Markdown Files - Tutorial

## Loading Documents for RAG Systems

---

**Welcome!** In this tutorial, you will learn:
- How to open and read files in Python
- Different methods to read file content
- How to count characters, words, and lines
- How to search for text within files

**Time:** ~60 minutes

---

## Part 1: Why Read Files?

In RAG (retrieval-augmented generation) systems, we need to:
1. **Load** documents from files (this lab!)
2. **Chunk** the text into smaller pieces
3. **Embed** the chunks as vectors
4. **Store** in a vector database

This lab focuses on **Step 1: Loading documents**!

## Part 2: Opening Files with `open()`

Python's `open()` function opens a file for reading or writing.

In [None]:
# The basic syntax: open(filepath, mode, encoding="utf-8")
# - filepath: path to the file
# - mode: "r" for read, "w" for write
# - encoding: pass by keyword (the third positional argument is buffering);
#   "utf-8" handles Thai/international text

# Open file, read content, close file
file = open("../data/rubella.md", "r", encoding="utf-8")
content = file.read()
file.close()  # Don't forget to close!

print(content[:200])  # Print first 200 characters

### Better Way: Using `with` Statement

The `with` statement automatically closes the file when the block ends, even if an error occurs inside it!

In [None]:
# Recommended: using 'with' statement
# The file is automatically closed after the block

with open("../data/rubella.md", "r", encoding="utf-8") as f:
    content = f.read()

# File is now closed automatically
print(content[:200])
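
To see that both approaches really close the file, here is a minimal sketch. It writes a throwaway file first so it runs anywhere; `sample_path` and the sample text are made up for this demonstration and are not part of the lab data.

```python
import os
import tempfile

# Create a small temporary file to read (hypothetical sample content)
with tempfile.NamedTemporaryFile(
    "w", suffix=".md", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("# Rubella\nSample text for the demo.\n")
    sample_path = tmp.name

# Manual version: try/finally makes sure close() runs even if read() fails
manual_file = open(sample_path, "r", encoding="utf-8")
try:
    manual_content = manual_file.read()
finally:
    manual_file.close()

# 'with' version: same guarantee, less code
with open(sample_path, "r", encoding="utf-8") as f:
    with_content = f.read()

print(manual_file.closed)              # True
print(f.closed)                        # True
print(manual_content == with_content)  # True

os.remove(sample_path)  # clean up the temporary file
```

The `closed` attribute is a quick way to check a file object's state when debugging.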

## Part 3: Reading Methods

There are several ways to read file content:
1. `read()` - read entire file as one string
2. `readlines()` - read as list of lines
3. `readline()` - read one line at a time

### 3.1 Using read() - Entire File

In [None]:
# read() returns the entire file as a single string

with open("../data/rubella.md", "r", encoding="utf-8") as f:
    content = f.read()

print(type(content))  # <class 'str'>
print(f"Total length: {len(content)} characters")
print("\nFirst 300 characters:")
print(content[:300])

### 3.2 Using readlines() - List of Lines

In [None]:
# readlines() returns a list where each element is a line

with open("../data/rubella.md", "r", encoding="utf-8") as f:
    lines = f.readlines()

print(type(lines))  # <class 'list'>
print(f"Total lines: {len(lines)}")
print("\nFirst 5 lines:")
for i, line in enumerate(lines[:5]):
    print(f"Line {i+1}: {line.strip()}")  # strip() removes \n

### 3.3 Using a Loop - Memory Efficient

In [None]:
# For large files, iterate line by line
# This doesn't load the entire file into memory

print("Lines containing 'symptom':")
print("-" * 40)

with open("../data/rubella.md", "r", encoding="utf-8") as f:
    for line_num, line in enumerate(f, 1):
        if "symptom" in line.lower():
            print(f"Line {line_num}: {line.strip()}")
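
The list at the start of Part 3 also mentions `readline()`, which returns one line per call. The sketch below uses `io.StringIO` as a stand-in for an open text file so it runs without any file on disk; the three sample lines are made up.

```python
import io

# io.StringIO behaves like a text file opened for reading (sample text is invented)
fake_file = io.StringIO(
    "# Rubella\nRubella is a viral infection.\nSymptoms include a rash.\n"
)

# Each call returns the next line, including its trailing "\n"
first_line = fake_file.readline()
print(repr(first_line))  # '# Rubella\n'

# Keep reading until readline() returns an empty string (end of file)
remaining_lines = []
while True:
    line = fake_file.readline()
    if line == "":
        break
    remaining_lines.append(line.rstrip("\n"))

print(remaining_lines)
```

`readline()` is handy when you only need the first line or two, such as a title. For processing every line, the `for line in f` loop shown above is usually simpler.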

## Part 4: Counting Text

Common operations when processing documents.

### 4.1 Count Characters

In [None]:
with open("../data/rubella.md", "r", encoding="utf-8") as f:
    content = f.read()

# Total characters including spaces and newlines
total_chars = len(content)
print(f"Total characters: {total_chars}")

# Characters excluding spaces and newlines (tabs are still counted)
chars_no_space = len(content.replace(" ", "").replace("\n", ""))
print(f"Characters (no spaces or newlines): {chars_no_space}")
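
Chaining `replace()` only removes the characters you name. If you want to ignore every kind of whitespace (tabs included), one possible approach is to test each character with `str.isspace()`. The sample string below is made up for illustration.

```python
# Invented sample with a tab, two spaces, and two newlines
sample_text = "Rubella\tis mild.\nRest helps.\n"

# Approach from the cell above: removes spaces and newlines only
no_space_newline = len(sample_text.replace(" ", "").replace("\n", ""))

# Alternative: skip anything Python considers whitespace
no_whitespace = sum(1 for ch in sample_text if not ch.isspace())

print(len(sample_text))   # 29
print(no_space_newline)   # 25 (the tab is still counted)
print(no_whitespace)      # 24
```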

### 4.2 Count Lines

In [None]:
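# These methods use `content` from the cell in 4.1.
# Method 1: Count newline characters
# (reports one extra line if the file ends with a newline)
line_count_1 = content.count("\n") + 1
print(f"Line count (method 1): {line_count_1}")

# Method 2: Split by newline (same result as method 1)
line_count_2 = len(content.split("\n"))
print(f"Line count (method 2): {line_count_2}")

# Method 3: Use readlines() (a trailing newline does not add an extra line)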
with open("../data/rubella.md", "r", encoding="utf-8") as f:
    line_count_3 = len(f.readlines())
print(f"Line count (method 3): {line_count_3}")
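
The three methods can disagree by one when the text ends with a newline. Here is a small sketch on an in-memory string (made-up sample text) that shows the difference, along with `str.splitlines()`, which does not produce an extra empty entry for a trailing newline.

```python
# Invented sample: three lines, with a newline at the very end
text = "line one\nline two\nline three\n"

count_newlines = text.count("\n") + 1      # counts the final "\n" as a line break
count_split = len(text.split("\n"))        # last element is an empty string
count_splitlines = len(text.splitlines())  # no empty element at the end

print(count_newlines)    # 4
print(count_split)       # 4
print(count_splitlines)  # 3

print(text.split("\n"))   # ['line one', 'line two', 'line three', '']
print(text.splitlines())  # ['line one', 'line two', 'line three']
```

If the counts for your file differ by exactly one, a trailing newline is the likely cause.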

### 4.3 Count Words

In [None]:
# Split by whitespace to get words
words = content.split()
word_count = len(words)

print(f"Total words: {word_count}")
print(f"\nFirst 10 words: {words[:10]}")
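
`split()` leaves punctuation attached to words, so `"rash."` and `"rash"` count as different words. As a rough sketch, you could strip punctuation and tally the words with `collections.Counter`. The sample sentence is made up for illustration.

```python
from collections import Counter
import string

# Invented sample text
sample_text = "Rubella causes a rash. The rash fades; rubella is usually mild."

# Strip leading/trailing punctuation and lowercase each word
cleaned_words = [w.strip(string.punctuation).lower() for w in sample_text.split()]
cleaned_words = [w for w in cleaned_words if w]  # drop anything that became empty

word_freq = Counter(cleaned_words)

print(len(cleaned_words))        # 11
print(word_freq["rash"])         # 2
print(word_freq.most_common(2))  # [('rubella', 2), ('rash', 2)]
```

Keep in mind that `split()` depends on whitespace. Thai text generally does not put spaces between words, so this approach would count phrases rather than words there, and a dedicated Thai word tokenizer would be needed.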

## Part 5: Searching Text

### 5.1 Check if Text Exists

In [None]:
# Use 'in' operator to check if text exists
if "fever" in content:
    print("Found 'fever' in the document!")
else:
    print("'fever' not found")

# Case-insensitive search
if "RUBELLA" in content.upper():
    print("Found 'rubella' (case-insensitive)!")

### 5.2 Count Occurrences

In [None]:
# Count how many times a substring appears
# (count() matches inside longer words too, so "symptom" also matches "symptoms")
fever_count = content.lower().count("fever")
print(f"'fever' appears {fever_count} times")

symptom_count = content.lower().count("symptom")
print(f"'symptom' appears {symptom_count} times")

treatment_count = content.lower().count("treatment")
print(f"'treatment' appears {treatment_count} times")
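
If you need whole-word matches instead of substring matches, one option is a regular expression with `\b` word boundaries. This is a minimal sketch on a made-up sentence, not the only way to do it.

```python
import re

# Invented sample text
sample_text = "Symptoms vary. One symptom is fever; feverish patients may have other symptoms."

# Substring counting: "fever" also matches inside "feverish"
substring_count = sample_text.lower().count("fever")

# Whole-word counting: \b marks a word boundary on each side
whole_word_count = len(re.findall(r"\bfever\b", sample_text, flags=re.IGNORECASE))

print(substring_count)   # 2
print(whole_word_count)  # 1

# Same comparison for "symptom"
print(sample_text.lower().count("symptom"))                                    # 3
print(len(re.findall(r"\bsymptom\b", sample_text, flags=re.IGNORECASE)))       # 1
```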

### 5.3 Find Lines Containing Text

In [None]:
def find_lines_with_text(filepath, search_text):
    """Find all lines containing the search text."""
    results = []
    
    with open(filepath, "r", encoding="utf-8") as f:
        for line_num, line in enumerate(f, 1):
            if search_text.lower() in line.lower():
                results.append({
                    "line_num": line_num,
                    "text": line.strip()
                })
    
    return results

# Find lines with "vaccine"
results = find_lines_with_text("../data/rubella.md", "vaccine")
print(f"Found {len(results)} lines with 'vaccine':")
for r in results:
    print(f"  Line {r['line_num']}: {r['text']}")

## Part 6: Putting It All Together

Let's create a document loader function!

In [None]:
def load_document(filepath):
    """
    Load a document and return its information.
    
    Args:
        filepath: Path to the file
        
    Returns:
        Dictionary with file info and content
    """
    with open(filepath, "r", encoding="utf-8") as f:
        content = f.read()
    
    return {
        "filepath": filepath,
        "content": content,
        "char_count": len(content),
        "word_count": len(content.split()),
        # splitlines() does not count an extra line for a trailing newline
        "line_count": len(content.splitlines())
    }

# Test with rubella.md
doc = load_document("../data/rubella.md")
print(f"File: {doc['filepath']}")
print(f"Characters: {doc['char_count']}")
print(f"Words: {doc['word_count']}")
print(f"Lines: {doc['line_count']}")

In [None]:
# Load multiple documents
documents = []

files = ["../data/rubella.md", "../data/cholera.md"]
for filepath in files:
    doc = load_document(filepath)
    documents.append(doc)

# Display summary
print("Document Summary")
print("=" * 50)
for doc in documents:
    print(f"\nFile: {doc['filepath']}")
    print(f"  Characters: {doc['char_count']}")
    print(f"  Words: {doc['word_count']}")
    print(f"  Lines: {doc['line_count']}")
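
The loop above stops with a `FileNotFoundError` if any path is wrong. One possible way to make batch loading more forgiving is to catch that error and keep going. The helper name `load_documents_safely` and the temporary files are hypothetical and exist only for this sketch.

```python
import tempfile
from pathlib import Path


def load_documents_safely(filepaths):
    """Hypothetical helper: load the files that exist, report the ones that don't."""
    loaded = []
    missing = []
    for filepath in filepaths:
        try:
            content = Path(filepath).read_text(encoding="utf-8")
        except FileNotFoundError:
            missing.append(str(filepath))
            continue
        loaded.append({
            "filepath": str(filepath),
            "content": content,
            "char_count": len(content),
            "word_count": len(content.split()),
            "line_count": len(content.splitlines()),
        })
    return loaded, missing


# Demo with a temporary folder: one real file, one path that does not exist
with tempfile.TemporaryDirectory() as tmp_dir:
    existing_file = Path(tmp_dir) / "sample.md"
    existing_file.write_text("# Sample\nOne two three.\n", encoding="utf-8")
    missing_file = Path(tmp_dir) / "missing.md"

    loaded, missing = load_documents_safely([existing_file, missing_file])

print(f"Loaded: {len(loaded)}, missing: {len(missing)}")  # Loaded: 1, missing: 1
print(f"Words in first document: {loaded[0]['word_count']}")  # 5
```

Returning the missing paths separately lets you decide later whether to warn, retry, or stop.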

---

## Summary

### What You Learned:

| Concept | Syntax | Example |
|---------|--------|--------|
| Open file | `open(path, mode, encoding=...)` | `open("file.md", "r", encoding="utf-8")` |
| With statement | `with open(...) as f:` | Auto-closes file |
| Read all | `f.read()` | Returns string |
| Read lines | `f.readlines()` | Returns list |
| Count chars | `len(content)` | Total characters |
| Count lines | `len(content.split("\n"))` | Total lines |
| Search | `"text" in content` | Returns True/False |
| Count occurrences | `content.count("text")` | Returns number |

### Next Step:

Now open `exercise/Lab02_Exercise.ipynb` and complete the 4 exercises!

Good luck!