<a href="https://colab.research.google.com/github/alex-jk/buyers-ID-manual-project/blob/main/extract_text_from_pdfs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install PyMuPDF

Collecting PyMuPDF
  Downloading pymupdf-1.25.5-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.5-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.25.5


In [2]:
import fitz  # PyMuPDF
import os
import re

In [3]:
def process_pdf_original_text_only(pdf_path):
    """
    Opens a PDF and prints ONLY the ORIGINAL extracted text page by page.
    Waits for user input before proceeding.
    Copy this text and paste into Google Docs. Manual cleanup might be needed there.
    """
    if not os.path.exists(pdf_path):
        print(f"Error: PDF file not found at '{pdf_path}'")
        print("Please make sure the file is uploaded and the path is correct.")
        return

    try:
        doc = fitz.open(pdf_path)
        num_pages = len(doc)
        print(f"Opened '{pdf_path}' with {num_pages} pages.")

        for page_num in range(num_pages):
            current_page_number = page_num + 1
            print(f"\n{'='*20} Page {current_page_number} / {num_pages} {'='*20}")

            page = doc.load_page(page_num)
            # Get the basic text extraction, no cleaning applied here
            original_text = page.get_text("text")

            # --- Print Original Text ---
            if original_text.strip():
                print(original_text)
            else:
                print("[No text found on this page]")

            print(f"\n{'='*20} End of Page {current_page_number} {'='*20}\n")

            # Pause and wait for user input
            try:
                input(f">>> Ready for page {current_page_number}. Copy the ORIGINAL text above. "
                      f"Press Enter to continue to page {current_page_number + 1} (or to finish)...")
            except EOFError:
                print("\nInput stream closed, ending processing.")
                break

        print("\nFinished processing.")

    except Exception as e:
        print(f"\nAn error occurred while processing the PDF: {e}")
    finally:
        # Always close the document if it was opened, even after an error
        if 'doc' in locals() and not doc.is_closed:
            doc.close()

**Script to upload file**

In [6]:
from google.colab import files

print("Waiting for file upload...")

# Prompt user to upload files. Returns a dictionary: {filename: content}
uploaded = files.upload()

# Check if the upload was successful (dictionary is not empty)
if not uploaded:
  raise ValueError("⛔️ File upload cancelled or failed. You MUST upload a file to continue.")
else:
  # Get the filename of the first uploaded file
  # (Assumes only one file is uploaded, which is typical for this prompt)
  uploaded_filename = list(uploaded.keys())[0]
  print(f"\n✅ Successfully uploaded: '{uploaded_filename}'")
  print(f"   The file is now available at: '/content/{uploaded_filename}'")
  print("\n---> You can now use this filename in subsequent cells. <---")

Waiting for file upload...


Saving buyers-manual-article-02.txt to buyers-manual-article-02.txt

✅ Successfully uploaded: 'buyers-manual-article-02.txt'
   The file is now available at: '/content/buyers-manual-article-02.txt'

---> You can now use this filename in subsequent cells. <---


In [5]:
# 1. Upload your PDF file (see the upload cell above).
# 2. Set pdf_filename to the exact name of your uploaded file.
pdf_filename = 'Stemming the Tide_ Social Norms and Child Sex Trafficking.pdf'  # <--- CHANGE THIS FILENAME
pdf_filepath = os.path.join('/content', pdf_filename)

# Run the interactive, page-by-page extraction (prints the original text only)
process_pdf_original_text_only(pdf_filepath)

Opened '/content/Stemming the Tide_ Social Norms and Child Sex Trafficking.pdf' with 59 pages.

Volume 126
Issue 1 
Fall 2021 
Stemming the Tide: Social Norms and Child Sex Trafficking 
Stemming the Tide: Social Norms and Child Sex Trafficking 
Melissa L. Breger 
Albany Law School 
Follow this and additional works at: https://ideas.dickinsonlaw.psu.edu/dlr 
 Part of the Criminal Law Commons, Human Rights Law Commons, International Humanitarian Law 
Commons, International Law Commons, Legal Writing and Research Commons, Public Law and Legal 
Theory Commons, and the Sexuality and the Law Commons 
Recommended Citation 
Recommended Citation 
Melissa L. Breger, Stemming the Tide: Social Norms and Child Sex Trafficking, 126 DICK. L. REV. 135 
(2021). 
Available at: https://ideas.dickinsonlaw.psu.edu/dlr/vol126/iss1/6 
This Article is brought to you for free and open access by the Law Reviews at Dickinson Law IDEAS. It has been 
accepted for inclusion in Dickinson Law Review by an authorized 
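
The extracted text keeps the PDF's hard line breaks, which is one reason manual cleanup may be needed after pasting. Below is a minimal sketch of an automated cleanup step. It assumes that blank lines mark paragraph boundaries and that a word split with a trailing hyphen continues in lowercase on the next line. `rewrap_text` is a hypothetical helper, not part of the original notebook, and it will merge a heading into the following paragraph when no blank line separates them.

```python
import re

def rewrap_text(text: str) -> str:
    """Join hard-wrapped lines into paragraphs (hypothetical helper)."""
    # Drop trailing spaces so "word- \n" and "word-\n" are handled alike
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Rejoin words hyphenated across a line break.
    # Assumption: a lowercase continuation means the hyphen was only a line
    # break artifact; genuine compounds such as "well-\nknown" lose their hyphen.
    text = re.sub(r"(\w)-\n(?=[a-z])", r"\1", text)
    # Assumption: one or more blank lines separate paragraphs
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse the remaining line breaks and runs of spaces inside each paragraph
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)

# Illustrative input, loosely modelled on the page text printed above
sample_page = (
    "Part of the Criminal Law Commons, International Humanitarian Law \n"
    "Commons, and the Sexuality and the Law Commons \n"
    "\n"
    "This Article is brought to you for free and open ac-\n"
    "cess by the Law Reviews.\n"
)
cleaned_page = rewrap_text(sample_page)
print(cleaned_page)
```

In the notebook, the same function could be applied to `original_text` before printing, although the regular expressions would likely need tuning for each PDF layout.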

**Count tokens in the created txt file**

In [7]:
!pip install -q tiktoken

In [9]:
# @title ## Count Tokens in Uploaded Text File

import tiktoken
import os

# 2. Get the filename from the user
#    (Make sure you have uploaded the file in a previous cell)
uploaded_filename = input("Enter the exact filename of your uploaded .txt file (e.g., buyers-manual-article-01.txt): ")

# 3. Construct the full path (Colab uploads usually go to /content/)
file_path = os.path.join('/content/', uploaded_filename)

# 4. Define the tokenizer (cl100k_base is common for GPT-3.5/4 and a good general estimator)
#    For different models, you might use others like 'p50k_base' or 'r50k_base'
try:
    encoding = tiktoken.get_encoding("cl100k_base")
    print("Using tokenizer: cl100k_base")
except Exception as e:
    print(f"Error getting tokenizer: {e}")
    # Fallback or handle error appropriately
    encoding = None

# 5. Read the file and count tokens
token_count = 0
file_content = ""
read_encoding = None  # Text encoding that successfully decoded the file
if encoding:
    print(f"Attempting to read '{file_path}'...")
    try:
        # Try reading with UTF-8 first
        print("Trying UTF-8 encoding...")
        with open(file_path, 'r', encoding='utf-8') as f:
            file_content = f.read()
        read_encoding = 'utf-8'
        print("Successfully read with UTF-8.")

    except UnicodeDecodeError:
        print("UTF-8 decoding failed. Trying latin-1 encoding...")
        try:
            # If UTF-8 fails, try latin-1
            with open(file_path, 'r', encoding='latin-1') as f:
                file_content = f.read()
            read_encoding = 'latin-1'
            print("Successfully read with latin-1.")
        except Exception as e_latin:
            print(f"🚨 Also failed to read with latin-1: {e_latin}")
            file_content = "" # Ensure content is empty if reading failed
    except FileNotFoundError:
        print(f"🚨 Error: File not found at '{file_path}'.")
        print("   Please ensure the filename is correct and the file was uploaded.")
        file_content = "" # Ensure content is empty if reading failed
    except Exception as e:
        print(f"🚨 An unexpected error occurred while reading the file: {e}")
        file_content = "" # Ensure content is empty if reading failed

    # Proceed only if file content was successfully read
    if file_content:
        try:
            # Encode the text into tokens
            tokens = encoding.encode(file_content)

            # Count the tokens
            token_count = len(tokens)

            print("-" * 30)
            print(f"✅ File: '{uploaded_filename}'")
            print(f"✅ Estimated Token Count: {token_count}")
            print("-" * 30)
            print("(Note: This is an estimate based on the 'cl100k_base' tokenizer.)")
            # Warn if the latin-1 fallback was used to read the file.
            # (encoding.name is the tokenizer name, so it cannot be used for this check.)
            if read_encoding == 'latin-1':
                print("(Warning: File was read using latin-1, original characters might differ slightly)")

        except Exception as e_encode:
            print(f"🚨 An error occurred during tokenization: {e_encode}")

else:
    print("🚨 Tokenizer could not be initialized. Cannot count tokens.")

Enter the exact filename of your uploaded .txt file (e.g., buyers-manual-article-01.txt): buyers-manual-article-02.txt
Using tokenizer: cl100k_base
Attempting to read '/content/buyers-manual-article-02.txt'...
Trying UTF-8 encoding...
UTF-8 decoding failed. Trying latin-1 encoding...
Successfully read with latin-1.
------------------------------
✅ File: 'buyers-manual-article-02.txt'
✅ Estimated Token Count: 18795
------------------------------
(Note: This is an estimate based on the 'cl100k_base' tokenizer.)


**Save the txt file in UTF-8 encoding**

In [10]:
original_filename = input("Enter the filename of the uploaded file with encoding issues (e.g., buyers-manual-article-02.txt): ")
original_path = os.path.join('/content/', original_filename)

# --- Create the name for the new UTF-8 file ---
base_name, ext = os.path.splitext(original_filename)
if not ext: # Handle cases with no extension
    ext = '.txt' # Assume .txt if none exists
    original_filename += ext
    original_path += ext
new_filename = f"{base_name}_utf8{ext}"
new_path = os.path.join('/content/', new_filename)
# ---

file_content = None
read_encoding = None

print(f"\nAttempting to read: '{original_path}'")

# --- Try reading the original file ---
try:
    print("Trying UTF-8 encoding...")
    with open(original_path, 'r', encoding='utf-8') as f:
        file_content = f.read()
    read_encoding = 'utf-8'
    print("Successfully read using UTF-8 (Original file might have been okay or UTF-8 compatible).")

except UnicodeDecodeError:
    print("UTF-8 reading failed due to decoding error.")
    print("Trying Latin-1 encoding as a fallback...")
    try:
        with open(original_path, 'r', encoding='latin-1') as f:
            file_content = f.read()
        read_encoding = 'latin-1'
        print("Successfully read using Latin-1.")
    except Exception as e_latin:
        print(f"🚨 Failed to read using Latin-1 as well: {e_latin}")

except FileNotFoundError:
    print(f"🚨 Error: File not found at '{original_path}'.")
    print("   Please ensure the filename is correct and the file was previously uploaded.")
except Exception as e_other:
    print(f"🚨 An unexpected error occurred while reading the original file: {e_other}")

# --- Write the new file if content was successfully read ---
if file_content is not None:
    print(f"\nAttempting to write UTF-8 version to: '{new_path}'")
    try:
        with open(new_path, 'w', encoding='utf-8') as f_out:
            f_out.write(file_content)
        print("-" * 30)
        print(f"✅ Successfully created UTF-8 encoded file: '{new_filename}'")
        if read_encoding == 'latin-1':
            print("   (Warning: Original was read using Latin-1, some special characters might look different than intended if the source wasn't truly Latin-1 compatible)")
        print(f"\n==> You can now use '{new_filename}' in your token counting script or other cells. <==")
        print("-" * 30)
    except Exception as e_write:
        print(f"🚨 An error occurred while writing the new UTF-8 file: {e_write}")
else:
    print("\nSkipping write operation because the original file could not be read.")

Enter the filename of the uploaded file with encoding issues (e.g., buyers-manual-article-02.txt): buyers-manual-article-02.txt

Attempting to read: '/content/buyers-manual-article-02.txt'
Trying UTF-8 encoding...
UTF-8 reading failed due to decoding error.
Trying Latin-1 encoding as a fallback...
Successfully read using Latin-1.

Attempting to write UTF-8 version to: '/content/buyers-manual-article-02_utf8.txt'
------------------------------
✅ Successfully created UTF-8 encoded file: 'buyers-manual-article-02_utf8.txt'

==> You can now use 'buyers-manual-article-02_utf8.txt' in your token counting script or other cells. <==
------------------------------
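
The token-counting cell and the conversion cell repeat the same read-with-fallback logic, and both fall back to Latin-1, which never fails because it maps every byte to a character. The sketch below shows one way to share that logic. It assumes the problem file may actually be Windows-1252 (`cp1252`), which is only a guess about the source and is not confirmed by the notebook. `read_text_with_fallback` and `convert_to_utf8` are hypothetical helpers, and the order of encodings is an assumption.

```python
import os
import tempfile

def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_used), trying each encoding in order.

    Hypothetical helper. latin-1 accepts any byte sequence, so it must be
    last; anything listed after it would never be tried.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {path!r} with any of {encodings}")

def convert_to_utf8(src_path, dst_path):
    """Write a UTF-8 copy of src_path and return the encoding used to read it."""
    text, used = read_text_with_fallback(src_path)
    # newline="" keeps the original line endings untouched
    with open(dst_path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return used

# Bytes 0x93 and 0x94 are curly double quotes in Windows-1252,
# but invalid as UTF-8 and control characters in latin-1.
sample = b"\x93quoted\x94"
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample.txt")
    dst = os.path.join(tmp, "sample_utf8.txt")
    with open(src, "wb") as f:
        f.write(sample)
    text, used = read_text_with_fallback(src)
    convert_to_utf8(src, dst)
    with open(dst, encoding="utf-8") as f:
        roundtrip = f.read()

print(used, ascii(text))                  # cp1252 '\u201cquoted\u201d'
print(ascii(sample.decode("latin-1")))    # '\x93quoted\x94'
```

The second `print` illustrates the warning in the conversion cell: reading Windows-1252 bytes as Latin-1 succeeds silently but turns the curly quotes into invisible control characters, which then get written into the UTF-8 file.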
