## Setup
Because this Jupyter notebook lives outside the main `./src` directory, we need to add the project root to `sys.path` so that the `src` package can be imported.

In [1]:
%pip install --upgrade pip

%pip install -q -r ../requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [2]:
import sys
from pathlib import Path

# This feels like a hack, but it is necessary to import the module from the parent directory
# when running this notebook in Jupyter Lab or Jupyter Notebook.
# The __file__ variable is not defined in Jupyter notebooks, so we need to handle that case.

# Get the root path relative to the notebook
root_path = (
    Path(__file__).resolve().parent.parent
    if "__file__" in globals()
    else Path().resolve().parent
)

# Add it to sys.path if not already added
if str(root_path) not in sys.path:
    sys.path.append(str(root_path))

In [3]:
# print("Current working directory:", Path().resolve())
# print("Python search paths:")
# for p in sys.path:
#     print("-", p)

## Holy Code Batman
This is where we will do the following:
- Loop through the data source
- Read and extract text
    - This is straightforward, but some things are clearly lost, e.g. tables and context such as headers.
    - Old PDFs are just scans of typed documents, so how reliable is the extracted text?
- Clean the text
    - This is basic and written to be expanded on.
    - How do we know what to clean or remove?
- Create metadata
    - Some basic stuff
- Append to an `<output>.txt` and an `<output>.jsonl` file
    - This is fine and works as expected.
    - Saving to a file is fine for local development.
- Log steps, issues, etc.?
    - **Todo**
        - Better debugging
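One way to approach the "better debugging" todo is to swap `print` for the standard `logging` module and keep a record of which files failed. The following is a minimal sketch, not the notebook's actual implementation: `build_logger` and `process_all` are hypothetical helper names, and `process` stands in for whatever per-file work the loop does.

```python
import logging


def build_logger(name="ingest", level=logging.INFO):
    """Return a logger with a single stream handler (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Notebook cells get re-run; only attach a handler the first time.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def process_all(paths, process, logger):
    """Run `process` on each path and return (ok, failed) lists."""
    ok, failed = [], []
    for path in paths:
        try:
            process(path)
        except Exception:
            # logger.exception records the full traceback, not just the message.
            logger.exception("Failed on %s", path)
            failed.append(path)
        else:
            logger.info("Processed %s", path)
            ok.append(path)
    return ok, failed
```

Returning the `failed` list makes it easy to retry or inspect only the problem files after a long run, rather than scrolling through the output.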


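For the "Clean the text" step in the list above, one option is to build the cleaner from small, named steps so each rule can be added, removed, or tested on its own. This is a sketch under assumptions: it is not the contents of `src.lib.clean_text`, and the three rules shown (Unicode normalization, re-joining words hyphenated across line breaks, whitespace collapsing) are only examples of the kind of artifacts PDF extraction tends to leave behind.

```python
import re
import unicodedata


def normalize_unicode(text: str) -> str:
    # NFKC folds ligatures (e.g. the single "fi" glyph) into plain letters.
    return unicodedata.normalize("NFKC", text)


def join_hyphenated_line_breaks(text: str) -> str:
    # "exam-\nple" -> "example". This also joins genuine hyphenated
    # compounds that happen to break at the hyphen, so it is lossy.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def collapse_whitespace(text: str) -> str:
    text = re.sub(r"[ \t]+", " ", text)      # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)     # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line
    return text.strip()


# Hypothetical default pipeline; order matters (normalize first).
CLEANING_STEPS = [normalize_unicode, join_hyphenated_line_breaks, collapse_whitespace]


def clean_text_sketch(text: str, steps=CLEANING_STEPS) -> str:
    """Apply each cleaning step in order."""
    for step in steps:
        text = step(text)
    return text
```

Keeping the steps in a list also gives a partial answer to "how do we know what to clean?": run the pipeline with one step disabled and diff the output on a few sample documents.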
In [None]:
from pathlib import Path
from src.config import settings

In [None]:
from dataclasses import dataclass  # decorator that auto-generates special methods (e.g. __init__) for classes
from src.lib.text_file import TextFile  # the TextFile class from the text_file module
from src.lib.append_jsonl_entry import append_jsonl_entry
from src.lib.append_to_file import AppendToTextFile
from src.lib.create_meta_data import CreateMetaData
from src.lib.document_loader import DocumentLoader
from src.lib.txt_handler import TxtHandler
from src.lib.pdf_handler import PDFHandler
from src.lib.clean_text import clean_text

import datetime

In [None]:
loader = DocumentLoader()
loader.register_handler(TxtHandler())
loader.register_handler(PDFHandler())

count_docs = 0

for file in settings.DATA_DIR.rglob("*"):  # rglob walks DATA_DIR recursively, including subdirectories
    if not file.is_file():
        continue
    try:
        count_docs += 1
        print(count_docs)
        content = loader.extract_text(file)
        cleaned = clean_text(content)

        # Collect file metadata and build the document record
        stat = file.stat()
        doc = TextFile(
            date=str(datetime.datetime.fromtimestamp(stat.st_mtime)),
            source=str(file.resolve()),
            filename=file.name,
            suffix=file.suffix,
            size=stat.st_size,
            content=cleaned,
        )
        # Append the cleaned text to the processed .txt file
        meta_data = CreateMetaData(doc, separator=True)
        AppendToTextFile(settings.PROCESSED_FILE, meta_data + cleaned)

        # Append the cleaned text to the JSONL file
        meta_data_no_separator = CreateMetaData(doc, separator=False)
        append_jsonl_entry(doc, meta_data_no_separator + cleaned, settings.JSONL_FILE)

        print(f"Processed {file.name}")
    except Exception as e:
        print(f"⚠️ Error with {file.name}: {e}")

1
Processed citation-374847716.txt
2
Processed 19760018356.pdf
3


incorrect startxref pointer(1)
parsing for Object Streams


Processed P10.pdf
4
⚠️ Error with companies_house_document.pdf: Unable to get page count. Is poppler installed and in PATH?
5
Processed Dockerfile.txt
6
Processed LHH Values Audit Exercise.pdf
7
Processed Pandas cheat sheet.pdf
8
Processed 19770020304.pdf
9
⚠️ Error with .DS_Store: No handler for file type: 
10
Processed 19770014315.pdf
11
Processed 19770014316.pdf
12
Processed Simple_Sabotage_Field_Manual.txt
13
Processed Stories_and_Folk-lore _of_West_Cornwall.txt
14
Processed 19750024188.pdf
15
Processed See Important Preparation Requirements.pdf
16
Processed AI with Your Own Data- SCALE 21x presentation.pdf
17
Processed 19750004090.pdf
18
Processed 19790005042.pdf
19
Processed 19780019349.pdf
20
Processed 19780007361.pdf
21
Processed 19780017314.pdf
22
Processed mail-attachment.googleusercontent.com.pdf
23
Processed 19770006294.pdf
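The run above shows two avoidable failures: `.DS_Store` has no handler, and one PDF needed poppler, which was not found on the `PATH`. A small pre-flight step could filter out such files and warn about the missing dependency before the loop starts. This is a sketch under assumptions: `SUPPORTED_SUFFIXES`, `iter_documents`, and `poppler_available` are hypothetical names, the suffix set is a guess based on the two registered handlers, and checking for the `pdfinfo` executable assumes the PDF fallback relies on poppler's command-line tools, as the error message suggests.

```python
import shutil
from pathlib import Path
from typing import Iterator

# Assumption: only the TxtHandler and PDFHandler file types are supported.
SUPPORTED_SUFFIXES = {".txt", ".pdf"}


def iter_documents(data_dir: Path, suffixes=SUPPORTED_SUFFIXES) -> Iterator[Path]:
    """Yield supported, non-hidden files under data_dir in a stable order."""
    for path in sorted(data_dir.rglob("*")):
        if not path.is_file():
            continue
        if path.name.startswith("."):  # skips .DS_Store and other dotfiles
            continue
        if path.suffix.lower() not in suffixes:
            continue
        yield path


def poppler_available() -> bool:
    """True if poppler's `pdfinfo` executable can be found on PATH."""
    return shutil.which("pdfinfo") is not None
```

With this in place, the main loop could iterate over `iter_documents(settings.DATA_DIR)` instead of `rglob("*")` directly, and print a single warning up front when `poppler_available()` returns `False`.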
