<a href="https://colab.research.google.com/github/Yash4rma/PDF-to-DAPT-Corpus-Construction-Experiment/blob/main/PDF_to_DAPT_Corpus_Construction_Experiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip uninstall -y datatrove
!pip install datatrove[io]==0.2.0

Collecting datatrove==0.2.0 (from datatrove[io]==0.2.0)
  Downloading datatrove-0.2.0-py3-none-any.whl.metadata (22 kB)
Collecting loguru>=0.7.0 (from datatrove==0.2.0->datatrove[io]==0.2.0)
  Downloading loguru-0.7.3-py3-none-any.whl.metadata (22 kB)
Collecting faust-cchardet (from datatrove[io]==0.2.0)
  Downloading faust_cchardet-2.1.19-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.3 kB)
Collecting python-magic (from datatrove[io]==0.2.0)
  Downloading python_magic-0.4.27-py2.py3-none-any.whl.metadata (5.8 kB)
Collecting warcio (from datatrove[io]==0.2.0)
  Downloading warcio-1.7.5-py2.py3-none-any.whl.metadata (16 kB)
Downloading datatrove-0.2.0-py3-none-any.whl (16.6 MB)
Downloading loguru-0.7.3-py3-none-any.whl (61 kB)

CELL 1 – Reset the DataTrove environment and pin the version

This step removes any pre-installed DataTrove version and installs a pinned public release (`datatrove[io]==0.2.0`) to test compatibility with FineWeb-style pipelines.

**Output:**  
DataTrove 0.2.0 is installed successfully, confirming the runtime environment state.

Note: This step later revealed API limitations in the public DataTrove release. The normalization, filtering, and deduplication stages below do not call DataTrove; they are implemented in plain Python and `datasketch`.

In [2]:
!pip install -q \
  pypdf \
  langdetect \
  tqdm

  Preparing metadata (setup.py) ... done
  Building wheel for langdetect (setup.py) ... done


CELL 2 – Install core processing libraries

This step installs the supporting libraries required by the pipeline:
- `pypdf`: extracts text content from PDF files
- `langdetect`: detects document language for quality filtering
- `tqdm`: shows progress during processing

**Output:**  
All auxiliary dependencies are installed and available in the runtime.

In [3]:
from pathlib import Path

BASE = Path("/content/datatrove_experiment")

DIRS = {
    "raw_pdfs": BASE / "01_raw_pdfs",
    "raw_text": BASE / "02_raw_text",
    "jsonl_input": BASE / "03_jsonl_input",
    "normalized": BASE / "04_normalized",
    "filtered": BASE / "05_filtered",
    "deduped": BASE / "06_deduped",
    "final": BASE / "07_final_corpus",
    "logs": BASE / "logs"
}

for d in DIRS.values():
    d.mkdir(parents=True, exist_ok=True)

print("Directory structure ready")

Directory structure ready


CELL 3 – Initialize the directory structure

This step defines and creates a staged directory layout for the corpus pipeline. Each directory represents one processing stage and stores that stage's intermediate artifacts.

**Output:**  
The directory structure is created under `/content/datatrove_experiment`, with folders for raw PDFs, extracted text, JSONL inputs, normalized data, filtered data, deduplicated data, the final corpus, and logs.
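
As a quick check that the staged layout exists before later cells write into it, the folder names can be verified programmatically. The following is a minimal sketch, not part of the original notebook: `STAGES` and `missing_stage_dirs` are hypothetical names, and the demo runs against a temporary directory rather than `/content/datatrove_experiment`.

```python
import tempfile
from pathlib import Path

# Stage folder names, copied from the DIRS mapping in the cell above.
STAGES = [
    "01_raw_pdfs", "02_raw_text", "03_jsonl_input", "04_normalized",
    "05_filtered", "06_deduped", "07_final_corpus", "logs",
]

def missing_stage_dirs(base, stages=STAGES):
    """Return the stage folder names that do not exist under `base`."""
    base = Path(base)
    return [name for name in stages if not (base / name).is_dir()]

# Demo on a throwaway directory in which the "logs" folder is deliberately absent.
with tempfile.TemporaryDirectory() as tmp:
    for name in STAGES[:-1]:
        (Path(tmp) / name).mkdir()
    demo_missing = missing_stage_dirs(tmp)

print(demo_missing)  # ['logs']
```

In the notebook itself, `missing_stage_dirs(BASE)` would be expected to return an empty list once the cell above has run.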



In [4]:
import shutil
from pathlib import Path

uploaded_files = [
    "/content/Pharma-test01.pdf",
    "/content/Pharma-test02.pdf",
    "/content/Finance-test01.pdf",
    "/content/Finance-test02.pdf",
]

for f in uploaded_files:
    shutil.copy(f, DIRS["raw_pdfs"])

# Verify copy
list(DIRS["raw_pdfs"].iterdir())

[PosixPath('/content/datatrove_experiment/01_raw_pdfs/Pharma-test02.pdf'),
 PosixPath('/content/datatrove_experiment/01_raw_pdfs/Pharma-test01.pdf'),
 PosixPath('/content/datatrove_experiment/01_raw_pdfs/Finance-test01.pdf'),
 PosixPath('/content/datatrove_experiment/01_raw_pdfs/Finance-test02.pdf')]

CELL 4 – Copy uploaded PDF files into the pipeline workspace

This step copies the PDF files that were uploaded to the Google Colab runtime into the pipeline's `01_raw_pdfs` directory. Keeping all raw inputs in one pipeline-managed location means that every subsequent stage reads only from the workspace and never from ad hoc upload paths.

**Output:**  
All input PDF files (`Pharma-test01.pdf`, `Pharma-test02.pdf`, `Finance-test01.pdf`, `Finance-test02.pdf`) are copied into the `01_raw_pdfs` directory, and the copy is verified by listing the directory contents.
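
`shutil.copy` raises an error on the first missing file, which can be inconvenient when uploads are incomplete. One possible variant, sketched below, reports missing inputs instead of failing. The helper name `copy_inputs` and the demo file names are illustrative assumptions, not part of the original notebook.

```python
import shutil
import tempfile
from pathlib import Path

def copy_inputs(paths, dest_dir):
    """Copy existing files into dest_dir; return (copied, missing) file names."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied, missing = [], []
    for path in map(Path, paths):
        if path.is_file():
            shutil.copy(path, dest_dir / path.name)
            copied.append(path.name)
        else:
            missing.append(path.name)
    return copied, missing

# Demo: one file exists, one does not.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "uploads"
    src.mkdir()
    (src / "a.pdf").write_bytes(b"%PDF-1.4 demo")
    copied, missing = copy_inputs([src / "a.pdf", src / "b.pdf"], Path(tmp) / "01_raw_pdfs")
    copied_exists = (Path(tmp) / "01_raw_pdfs" / "a.pdf").is_file()

print(copied, missing)  # ['a.pdf'] ['b.pdf']
```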

In [5]:
from pypdf import PdfReader
from tqdm import tqdm

def pdf_to_text(pdf_path):
    reader = PdfReader(pdf_path)
    pages = []
    for page in reader.pages:
        txt = page.extract_text()
        if txt:
            pages.append(txt)
    return "\n".join(pages)

for pdf in tqdm(DIRS["raw_pdfs"].glob("*.pdf")):
    text = pdf_to_text(pdf)
    out = DIRS["raw_text"] / f"{pdf.stem}.txt"
    out.write_text(text, encoding="utf-8")

print("Raw text extraction complete")

4it [00:39,  9.84s/it]

Raw text extraction complete





CELL 5 – Extract raw text from PDF documents

This step extracts text from each PDF file using the `pypdf` library. Pages are processed in order, each page's extractable text is collected, and the pages are joined with newlines. Pages that yield no text (for example, scanned images without a text layer) are skipped silently. No cleaning, filtering, or normalization is applied at this stage, so the output preserves the text as extracted.

**Output:**  
One raw text file per PDF is generated and stored in the `02_raw_text` directory (for example, `Pharma-test01.txt` and `Finance-test02.txt`). These files contain the unprocessed text extracted directly from the PDFs.
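
Because pages without extractable text are skipped, a PDF made up entirely of scanned images would produce an empty `.txt` file without any error. A small report of character counts per output file makes such cases visible. This is a sketch under that assumption; `extraction_report` and `empty_outputs` are hypothetical helper names, and the demo uses a temporary directory in place of `02_raw_text`.

```python
import tempfile
from pathlib import Path

def extraction_report(text_dir):
    """Map each .txt file name to its character count (whitespace stripped)."""
    return {
        path.name: len(path.read_text(encoding="utf-8").strip())
        for path in sorted(Path(text_dir).glob("*.txt"))
    }

def empty_outputs(report):
    """Return the file names whose extracted text is empty."""
    return [name for name, chars in report.items() if chars == 0]

# Demo: one file with text, one empty file standing in for a scanned PDF.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "ok.txt").write_text("hello\n", encoding="utf-8")
    (Path(tmp) / "scan.txt").write_text("", encoding="utf-8")
    report = extraction_report(tmp)

print(report)                 # {'ok.txt': 5, 'scan.txt': 0}
print(empty_outputs(report))  # ['scan.txt']
```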

In [6]:
print((DIRS["raw_text"] / "Pharma-test01.txt").read_text(encoding="utf-8")[:2000])

CENTER FOR DRUG EVALUATION AND 
RESEARCH 
A
PPLICATION NUMBER: 
214787Orig1s000 
CLINICAL PHARMACOLOGY 
REVIEW(S) 
 
1 
 
Date: July 16, 2020   
 
From: Neil Hartman, PhD, Marlene Kim, PhD, Naomi Kruhlak, PhD, Rebecca Racz, PharmD, Division of 
Applied Regulatory Science/Office of Clinical Pharmacology (DARS/OCP) 
 
Through: James Weaver Ph.D., Consult Lead and David Strauss M.D., Ph.D., Director; DARS/OCP  
 
To: Neha Gada, Division of Pharmacovigilance II, Office of Surveillance and Epidemiology 
 
Subject:  In silico Analyses on the Potential Association of Remdesivir with Renal and Hepatic Events 
(NDA 21487) 
 
Executive Summary 
 
Remdesivir is currently approved under an Emergency Use Authorization (EUA) for COVID-19. It is 
closely related to adenosine and the adenosine nucleotides in structure.  Multiple adverse events have been 
reported to the Agency, including acute kidney injury.  Additionally, the Emergency Use Authorization 
describes a known risk of increased transamina

CELL 6 – Validate raw text extraction

This step performs a manual sanity check on the extracted text by reading one text file and displaying its first 2,000 characters. The check verifies that the PDF-to-text extraction step produced readable content that is representative of the source document.

**Output:**  
A preview of the extracted text is displayed in the notebook output. The text is readable, with minor extraction artifacts such as the heading word split into `A` and `PPLICATION` across two lines.

In [7]:
import json

jsonl_path = DIRS["jsonl_input"] / "documents.jsonl"

with open(jsonl_path, "w", encoding="utf-8") as f:
    for txt in DIRS["raw_text"].glob("*.txt"):
        domain = "pharma" if "Pharma" in txt.name else "finance"
        record = {
            "id": txt.stem,
            "text": txt.read_text(encoding="utf-8"),
            "domain": domain,
            "source": "pdf"
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

jsonl_path

PosixPath('/content/datatrove_experiment/03_jsonl_input/documents.jsonl')

CELL 7 – Convert extracted text files into structured JSONL format

This step transforms the raw text files produced by PDF extraction into a structured JSONL dataset. Each document is wrapped as a single JSON object containing an identifier (the file stem), the full text content, a domain label (`pharma` if the filename contains `Pharma`, otherwise `finance`), and the data source. This step establishes the schema that all downstream processing stages rely on.

**Output:**  
A single JSONL file (`documents.jsonl`) is created inside the `03_jsonl_input` directory, where each line represents one document with metadata and text content.
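
Since every later stage assumes this schema, it can be useful to read the file back and confirm that each line parses and carries the four fields. The sketch below is one way to do that; `REQUIRED_KEYS` and `validate_jsonl` are hypothetical names, and the demo file is a small stand-in for `documents.jsonl`.

```python
import json
import tempfile
from pathlib import Path

# Field names taken from the record built in the cell above.
REQUIRED_KEYS = {"id", "text", "domain", "source"}

def validate_jsonl(path):
    """Return a list of (line_number, problem) tuples; empty means the file is valid."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                doc = json.loads(line)
            except json.JSONDecodeError:
                problems.append((lineno, "invalid JSON"))
                continue
            if not isinstance(doc, dict):
                problems.append((lineno, "not a JSON object"))
                continue
            missing = REQUIRED_KEYS - doc.keys()
            if missing:
                problems.append((lineno, f"missing keys: {sorted(missing)}"))
    return problems

# Demo: the second record lacks the "domain" field.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "documents.jsonl"
    rows = [
        {"id": "doc1", "text": "alpha", "domain": "pharma", "source": "pdf"},
        {"id": "doc2", "text": "beta", "source": "pdf"},
    ]
    demo.write_text("".join(json.dumps(r) + "\n" for r in rows), encoding="utf-8")
    problems = validate_jsonl(demo)

print(problems)  # [(2, "missing keys: ['domain']")]
```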

In [8]:
!pip show datatrove


Name: datatrove
Version: 0.2.0
Summary: HuggingFace library to process and filter large amounts of webdata
Home-page: 
Author: 
Author-email: "HuggingFace Inc." <guilherme@huggingface.co>
License: Apache-2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: dill, fsspec, huggingface-hub, humanize, loguru, multiprocess, numpy, tqdm
Required-by: 


CELL 8 – Verify installed DataTrove version

This step checks the installed DataTrove package version using `pip show`. The purpose is to confirm the exact public version available in the Colab runtime and to document the environment state for reproducibility and debugging.

**Output:**  
The DataTrove package metadata is displayed, including the version number (`0.2.0`), installation location, and dependency information.
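
The same check can be done from Python, which allows the notebook to fail early if the pinned version is not the one installed. The sketch below uses the standard-library `importlib.metadata` module; the helper names `installed_version` and `check_pin` are illustrative assumptions.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

def check_pin(package, expected):
    """Return True only if `package` is installed at exactly the `expected` version."""
    return installed_version(package) == expected

# In the Colab runtime described above, this would be expected to print True.
print(check_pin("datatrove", "0.2.0"))
```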

In [9]:
import json
import re
import unicodedata

CELL 9 – Import libraries for text normalization

This step imports standard Python libraries required for text normalization, including JSON handling, regular expressions, and Unicode utilities. These libraries are used to implement explicit and transparent text-cleaning logic without relying on external processing frameworks.

**Output:**  
Required standard libraries are successfully loaded into the runtime, preparing the environment for normalization.

In [10]:
def normalize_text(text):
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

input_file = DIRS["jsonl_input"] / "documents.jsonl"
norm_file = DIRS["normalized"] / "normalized.jsonl"

with open(input_file, encoding="utf-8") as fin, open(norm_file, "w", encoding="utf-8") as fout:
    for line in fin:
        doc = json.loads(line)
        doc["text"] = normalize_text(doc["text"])
        fout.write(json.dumps(doc, ensure_ascii=False) + "\n")

norm_file

PosixPath('/content/datatrove_experiment/04_normalized/normalized.jsonl')

CELL 10 – Normalize document text and generate normalized JSONL

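This step applies text normalization to each document in the JSONL dataset. Unicode normalization (NFC) is applied, every run of whitespace is collapsed to a single space, and leading and trailing spaces are removed. Because newlines count as whitespace, line and paragraph breaks from the PDF are flattened and each document becomes a single line of text. The normalized text replaces the original text field while all metadata is preserved.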

**Output:**  
A new JSONL file (`normalized.jsonl`) is created in the `04_normalized` directory, containing normalized versions of all documents.
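
To make the effect of normalization concrete, the example below applies the same `normalize_text` function to a short, made-up string. The input string is an illustrative assumption: it contains a decomposed accent (`e` followed by the combining character U+0301), repeated spaces, blank lines, and a trailing tab.

```python
import re
import unicodedata

def normalize_text(text):
    # Same logic as the notebook cell: NFC, collapse whitespace runs, strip ends.
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

raw = "Cafe\u0301  review\n\n(draft)\t "
clean = normalize_text(raw)

print(repr(clean))           # 'Café review (draft)'
print(len(raw), len(clean))  # 24 19
```

NFC composes `e` plus the combining accent into the single character `é`, and the blank line between `review` and `(draft)` becomes a single space.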

In [11]:
with open(norm_file, encoding="utf-8") as f:
    sample = json.loads(next(f))
    print(sample["id"])
    print(sample["text"][:600])

Pharma-test01
CENTER FOR DRUG EVALUATION AND RESEARCH A PPLICATION NUMBER: 214787Orig1s000 CLINICAL PHARMACOLOGY REVIEW(S) 1 Date: July 16, 2020 From: Neil Hartman, PhD, Marlene Kim, PhD, Naomi Kruhlak, PhD, Rebecca Racz, PharmD, Division of Applied Regulatory Science/Office of Clinical Pharmacology (DARS/OCP) Through: James Weaver Ph.D., Consult Lead and David Strauss M.D., Ph.D., Director; DARS/OCP To: Neha Gada, Division of Pharmacovigilance II, Office of Surveillance and Epidemiology Subject: In silico Analyses on the Potential Association of Remdesivir with Renal and Hepatic Events (NDA 21487) Executiv


CELL 11 – Validate normalized document output

This step performs a sanity check on the normalized dataset by reading a single document from the normalized JSONL file and printing its identifier and a text preview. This ensures normalization was applied correctly and that document structure is preserved.

**Output:**  
A sample document ID and a snippet of normalized text are displayed in the notebook output.

In [12]:
from langdetect import detect

CELL 12 – Import language detection utility

This step imports the language detection function used to identify the language of document text. This prepares the environment for applying language-based quality filtering in the next stage.

**Output:**  
The language detection library is successfully loaded and ready for use.

In [13]:
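from langdetect.lang_detect_exception import LangDetectException

def is_english(text):
    try:
        return detect(text) == "en"
    except LangDetectException:
        # Raised when the text has no usable features (e.g. empty or digits only)
        return False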

filtered_file = DIRS["filtered"] / "filtered.jsonl"

with open(norm_file, encoding="utf-8") as fin, open(filtered_file, "w", encoding="utf-8") as fout:
    for line in fin:
        doc = json.loads(line)
        text = doc["text"]

        if len(text) < 300:
            continue

        if not is_english(text):
            continue

        fout.write(json.dumps(doc, ensure_ascii=False) + "\n")

filtered_file

PosixPath('/content/datatrove_experiment/05_filtered/filtered.jsonl')

CELL 13 – Apply quality filtering based on length and language

This step filters the normalized documents using two basic heuristics. Documents shorter than 300 characters are dropped, as are documents that `langdetect` does not identify as English (including any for which detection fails). Only documents that pass both checks are retained.

**Output:**  
A filtered JSONL file (`filtered.jsonl`) is created in the `05_filtered` directory, containing only the documents that passed both the length check and the language check.
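
The filtering cell drops documents silently, so it is hard to tell afterwards how many were removed and why. A possible variant that counts drop reasons is sketched below. The function `filter_docs` and the reason labels are hypothetical, and the language check is passed in as a parameter so the demo can use a simple stand-in rather than `langdetect`.

```python
from collections import Counter

def filter_docs(docs, is_english, min_chars=300):
    """Apply the length and language checks; return (kept_docs, drop_reason_counts)."""
    kept, dropped = [], Counter()
    for doc in docs:
        text = doc["text"]
        if len(text) < min_chars:
            dropped["too_short"] += 1
        elif not is_english(text):
            dropped["not_english"] += 1
        else:
            kept.append(doc)
    return kept, dropped

# Stand-in detector for the demo only; the notebook would pass its own is_english.
def fake_is_english(text):
    return text.startswith("The")

docs = [
    {"id": "a", "text": "The " * 100},  # 400 characters, passes both checks
    {"id": "b", "text": "short"},       # fails the length check
    {"id": "c", "text": "Der " * 100},  # long enough, fails the stand-in language check
]
kept, dropped = filter_docs(docs, fake_is_english)

print([d["id"] for d in kept])  # ['a']
print(dict(dropped))            # {'too_short': 1, 'not_english': 1}
```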

In [14]:
!pip install -q datasketch


CELL 14 – Install MinHash-based deduplication dependency

This step installs the `datasketch` library, which provides MinHash and Locality Sensitive Hashing (LSH) implementations. These are required for near-duplicate document detection and removal.

**Output:**  
The `datasketch` library is successfully installed in the Colab environment.

In [15]:
from datasketch import MinHash, MinHashLSH

CELL 15 – Import MinHash and LSH utilities for deduplication

This step imports the MinHash and MinHashLSH classes from the `datasketch` library. These utilities are used to compute similarity fingerprints for documents and identify near-duplicates efficiently.

**Output:**  
MinHash and LSH classes are loaded and available for use in the deduplication stage.

In [16]:
def get_minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in set(text.split()):
        m.update(token.encode("utf8"))
    return m

lsh = MinHashLSH(threshold=0.85, num_perm=128)
dedup_file = DIRS["deduped"] / "deduped.jsonl"

with open(filtered_file, encoding="utf-8") as fin, open(dedup_file, "w", encoding="utf-8") as fout:
    for i, line in enumerate(fin):
        doc = json.loads(line)
        mh = get_minhash(doc["text"])

        if lsh.query(mh):
            continue  # duplicate

        lsh.insert(str(i), mh)
        fout.write(json.dumps(doc, ensure_ascii=False) + "\n")

dedup_file

PosixPath('/content/datatrove_experiment/06_deduped/deduped.jsonl')

CELL 16 – Perform near-duplicate document removal using MinHash and LSH

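This step applies near-duplicate detection to the filtered dataset using MinHash fingerprints and Locality Sensitive Hashing (LSH). Each document’s text is converted into a MinHash signature (128 permutations) built from its set of unique whitespace-delimited tokens, and the signature is queried against an LSH index of the documents retained so far. If the index, configured with a Jaccard similarity threshold of 0.85, returns any candidate, the document is treated as a near-duplicate and skipped; otherwise it is inserted into the index and written out. The first occurrence in file order is therefore the one that is kept.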

**Output:**  
A deduplicated JSONL file (`deduped.jsonl`) is created in the `06_deduped` directory, containing only unique or sufficiently distinct documents.
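
MinHash signatures estimate the Jaccard similarity between token sets, so computing that similarity exactly on small inputs helps in reasoning about what a threshold of 0.85 means. The sketch below is a reference calculation only, not a replacement for the LSH index; the function name `jaccard` and the example sentences are illustrative assumptions.

```python
def jaccard(text_a, text_b):
    """Exact Jaccard similarity between the sets of whitespace-delimited tokens."""
    tokens_a, tokens_b = set(text_a.split()), set(text_b.split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty documents are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Three shared tokens out of five distinct tokens overall.
print(jaccard("the quick brown fox", "the quick brown dog"))  # 0.6

# Word order and repetition are ignored because only the token set is compared.
print(jaccard("a b c", "c b a a"))  # 1.0
```

At a threshold of 0.85, the first pair would not be expected to count as duplicates. Note that LSH lookups are approximate, so documents near the threshold can occasionally be classified either way.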

In [17]:
final_path = DIRS["final"] / "final_corpus.jsonl"

with open(dedup_file, encoding="utf-8") as fin, open(final_path, "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(line)

final_path

PosixPath('/content/datatrove_experiment/07_final_corpus/final_corpus.jsonl')

CELL 17 – Assemble and freeze the final corpus snapshot

This step creates the final corpus snapshot by copying the deduplicated documents, line by line and unchanged, into a single output file. No further processing is applied at this stage, so the resulting file is the training-ready corpus. The copy itself does not write-protect the file; the snapshot stays frozen only as long as later steps do not modify it.

**Output:**  
The final corpus file (`final_corpus.jsonl`) is created in the `07_final_corpus` directory. This file contains the cleaned, normalized, filtered, and deduplicated documents and is ready for downstream use such as domain-adaptive pretraining.
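
One way to make the snapshot verifiable is to record a checksum and document count alongside it, so that any later change to the file can be detected. The following is a minimal sketch of that idea, not part of the original notebook: `corpus_manifest` is a hypothetical helper, and the demo uses a two-line temporary file in place of `final_corpus.jsonl`.

```python
import hashlib
import tempfile
from pathlib import Path

def corpus_manifest(path):
    """Return the file name, number of non-empty lines, and SHA-256 digest of a JSONL file."""
    digest = hashlib.sha256()
    num_docs = 0
    with open(path, "rb") as f:
        for line in f:
            digest.update(line)
            if line.strip():
                num_docs += 1
    return {"file": Path(path).name, "num_docs": num_docs, "sha256": digest.hexdigest()}

# Demo on a small stand-in corpus.
content = b'{"id": "doc1", "text": "alpha"}\n{"id": "doc2", "text": "beta"}\n'
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "final_corpus.jsonl"
    demo.write_bytes(content)
    manifest = corpus_manifest(demo)

print(manifest["file"], manifest["num_docs"])  # final_corpus.jsonl 2
```

Storing the returned dictionary as JSON in the `logs` directory would give a simple record to compare against before training.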