<a href="https://colab.research.google.com/github/Zen-Teng/Stylos/blob/main/cleaner_py.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Read a PDF and print raw text to the console.

In [11]:
!pip install pdfplumber
!pip install pandas openpyxl



In [2]:
import pdfplumber

def read_pdf(file_path):
    text = ""

    with pdfplumber.open(file_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            page_text = page.extract_text()
            if page_text:
                text += f"\n--- Page {page_number} ---\n"
                text += page_text

    return text


In [9]:
from google.colab import files
uploaded = files.upload()

pdf_path = next(iter(uploaded))  # gets the uploaded filename
pdf_text = read_pdf(pdf_path)

print(pdf_text)

# To handle multiple uploaded files, loop over every filename instead:
# for pdf_path in uploaded.keys():
#     print(f"Reading {pdf_path}")
#     pdf_text = read_pdf(pdf_path)
#     print(pdf_text)

Saving sample2.pdf to sample2.pdf

--- Page 1 ---
PETER SINGER Famine, Affluence, and
Morality
As I write this, in November Ig7I, people are dying in East Bengal
from lack of food, shelter, and medical care. The suffering and death
that are occurring there now are not inevitable, not unavoidable in
any fatalistic sense of the term. Constant poverty, a cyclone, and a
civil war have turned at least nine million people into destitute refu-
gees; nevertheless, it is not beyond the capacity of the richer nations
to give enough assistance to reduce any further suffering to very small
proportions. The decisions and actions of human beings can prevent
this kind of suffering. Unfortunately, human beings have not made
the necessary decisions. At the individual level, people have, with very
few exceptions, not responded to the situation in any significant way.
Generally speaking, people have not given large sums to relief funds;
they have not written to their parliamentary representatives demand-

Split the text into sentences and return 10 of them at random.

In [10]:
import random
import nltk

# Download the Punkt sentence tokenizer data (only needs to run once per runtime)
nltk.download('punkt_tab')

def get_random_sentences(text, sample_size=10):
    # Split the text with NLTK's sentence tokenizer, which copes with
    # common abbreviations such as "Mr." better than splitting on periods
    sentences = nltk.sent_tokenize(text)

    # Edge case: the text has fewer sentences than requested
    if len(sentences) < sample_size:
        return sentences  # return everything if the text is short

    # Randomly select sample_size sentences without replacement
    return random.sample(sentences, sample_size)

# --- Usage Example ---

# Run the function
sample = get_random_sentences(pdf_text, sample_size=10)
print(sample)

['Unfortunately, human beings have not made\nthe necessary decisions.', 'I shall, however, try to argue for the moral position\ni.', 'The situation is no longer that described above, but this does not affect\nmy argument, as the next paragraph indicates.', 'At the government level, no government has given the sort of massive\naid that would enable the refugees to survive for more than a few days.', "Australia's aid, however, amounts to less than one-\ntwelfth of the cost of Sydney's new opera house.", 'Most of the refugees have now been in the camps for more than six\nmonths.', 'Constant poverty, a cyclone, and a\ncivil war have turned at least nine million people into destitute refu-\ngees; nevertheless, it is not beyond the capacity of the richer nations\nto give enough assistance to reduce any further suffering to very small\nproportions.', 'In arguing for this conclusion I will not, of course, claim to be\nmorally neutral.', 'The Bengal emergency is just the latest and\nmost acute 

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
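
Because `random.sample` draws a different set on every run, the sample printed above cannot be reproduced. The following is a minimal sketch of one way around that, assuming a fixed seed is acceptable for this use. The helper name `sample_sentences_seeded` and its `seed` parameter are hypothetical and not part of the notebook. It takes an already tokenized list of sentences, so it runs without NLTK.

```python
import random

def sample_sentences_seeded(sentences, sample_size=10, seed=0):
    """Return up to sample_size sentences, chosen reproducibly.

    Hypothetical helper: a private random.Random instance is used so the
    global random state is left untouched.
    """
    rng = random.Random(seed)
    if len(sentences) <= sample_size:
        return list(sentences)  # short text: return everything
    return rng.sample(sentences, sample_size)

# Usage with stand-in sentences (not taken from the PDF)
demo = [f"Sentence {i}." for i in range(25)]
first = sample_sentences_seeded(demo, sample_size=5, seed=42)
second = sample_sentences_seeded(demo, sample_size=5, seed=42)
print(first == second)  # the same seed gives the same sample
```

In the notebook this could be called as `sample_sentences_seeded(nltk.sent_tokenize(pdf_text))`.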


Split that text into paragraphs. Save to data.json.

In [4]:
import re

# Method 1: Blank-line based (high confidence)
def split_by_blank_lines(text):
    paragraphs = re.split(r'\n\s*\n', text)
    return [p.strip() for p in paragraphs if p.strip()]


# Method 2: Wrapped-line reconstruction (most common)
def split_wrapped_lines(text):
    paragraphs = []
    current = []

    for line in text.splitlines():
        line = line.strip()

        if not line:
            if current:
                paragraphs.append(" ".join(current))
                current = []
        else:
            current.append(line)

    if current:
        paragraphs.append(" ".join(current))

    return paragraphs


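# Method 3: Heuristic sentence-based (fallback)
# Note: this splits at every sentence boundary, so each returned item is
# roughly one sentence rather than a full paragraph.
def split_by_heuristic(text):
    text = re.sub(r'\n+', ' ', text)  # replace line breaks with spaces
    paragraphs = re.split(
        r'(?<=[.!?])\s+(?=[A-Z])',
        text
    )
    # Drop empty pieces so an empty input yields an empty list
    return [p.strip() for p in paragraphs if p.strip()]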
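
Methods 2 and 3 join wrapped lines with a space, so a word hyphenated across a line break in the PDF (for example `refu-` / `gees` in the sample output) stays split as `refu- gees`. A minimal sketch of a pre-processing step follows. The helper `join_hyphenated_breaks` is hypothetical, and it assumes that a hyphen sitting between two lowercase letters at a line break is a wrap hyphen. That assumption is wrong for genuine hyphenated compounds such as `one-` / `twelfth`, which would lose their hyphen.

```python
import re

def join_hyphenated_breaks(text):
    """Rejoin words that were hyphenated across a line break.

    Assumption: lowercase letter + '-' + newline + lowercase letter is a
    line-wrap hyphen. Real compounds broken at the hyphen are also merged.
    """
    return re.sub(r'(?<=[a-z])-\n(?=[a-z])', '', text)

sample = "destitute refu-\ngees; nevertheless"
print(join_hyphenated_breaks(sample))
```

Running this on `pdf_text` before `split_paragraphs` is one option. A dictionary lookup could decide the ambiguous cases, but that is outside this sketch.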


In [5]:
# Step 1: Analyse the text
def analyze_text(text):
    lines = text.splitlines()

    empty_lines = sum(1 for line in lines if not line.strip())
    avg_line_length = sum(len(line) for line in lines) / max(len(lines), 1)

    return {
        "empty_lines": empty_lines,
        "avg_line_length": avg_line_length
    }


# Step 2: Use the statistics from Step 1 to choose a paragraph-splitting method
def split_paragraphs(text):
    stats = analyze_text(text)

    if stats["empty_lines"] > 2:
        print("Method used: split_by_blank_lines")
        return split_by_blank_lines(text)

    # if stats["avg_line_length"] < 80:
    #     print("Method used: split_wrapped_lines")
    #     return split_wrapped_lines(text)

    print("Method used: split_by_heuristic")
    return split_by_heuristic(text)


# Step 3: Sanity check: is the average length of the split paragraphs within a conventional range?
def is_reasonable(paragraphs):
    if not paragraphs:
        return False
    avg_len = sum(len(p) for p in paragraphs) / len(paragraphs)
    return 50 < avg_len < 2000
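
The sanity check above is defined but not yet used to choose a method. One possible way to combine the two ideas is sketched below: try each splitter in order and keep the first result that passes the check. The names `first_reasonable_split` and `is_reasonable_length` are hypothetical, the length bounds are copied from `is_reasonable`, and the two demo splitters are simplified stand-ins for the notebook's methods.

```python
def is_reasonable_length(paragraphs, low=50, high=2000):
    """Same idea as is_reasonable: the average length must fall in (low, high)."""
    if not paragraphs:
        return False
    avg_len = sum(len(p) for p in paragraphs) / len(paragraphs)
    return low < avg_len < high

def first_reasonable_split(text, splitters):
    """Try each splitter in order; return (name, paragraphs) for the first
    result that passes the check, or (None, [whole text]) if none does."""
    for splitter in splitters:
        paragraphs = splitter(text)
        if is_reasonable_length(paragraphs):
            return splitter.__name__, paragraphs
    stripped = text.strip()
    fallback = [stripped] if stripped else []
    return None, fallback

# Simplified stand-in splitters for the demonstration
def by_blank_lines(text):
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def whole_text(text):
    return [text.strip()] if text.strip() else []

demo_text = ("A" * 60) + "\n\n" + ("B" * 80)
name, paras = first_reasonable_split(demo_text, [by_blank_lines, whole_text])
print(name, len(paras))
```

In the notebook, the list could be `[split_by_blank_lines, split_wrapped_lines, split_by_heuristic]`, in whatever order of preference suits the documents being processed.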


In [6]:
text_stats = analyze_text(pdf_text)
print(text_stats)

paragraphs = split_paragraphs(pdf_text)
print(len(paragraphs))

# A bare expression in the middle of a cell is not displayed, so print it
print(is_reasonable(paragraphs))

print(paragraphs)

{'empty_lines': 1, 'avg_line_length': 77.84782608695652}
Method used: split_by_heuristic
43


todo:
- how to preserve table structure
- how to split into paragraphs
- research: how to extract writing style
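
For the first item, a possible starting point is pdfplumber's `page.extract_tables()`, which returns each table as a list of rows, each row being a list of cell strings (or `None` for empty cells). The sketch below is untested against the sample PDF. The helper names are hypothetical, and it assumes the first row of every table is a header row, which will not hold for every document.

```python
def table_to_records(table):
    """Convert one table (list of rows) into a list of dicts.

    Assumption: the first row holds the column headers.
    Empty cells (None) become empty strings.
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records

def extract_table_records(file_path):
    """Collect records from every table pdfplumber detects in the PDF."""
    import pdfplumber  # imported here so table_to_records works without it

    records = []
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                records.extend(table_to_records(table))
    return records

# Usage with a hand-written table in the shape extract_tables() returns
demo_table = [["Name", "Qty"], ["apple", "3"], [None, " 4 "]]
print(table_to_records(demo_table))
```

Keeping tables as records, separate from the running text, avoids feeding table cells into the sentence and paragraph splitters.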

In [12]:
import pandas as pd

# Upload an Excel file and read its first sheet
uploaded_xlsx = files.upload()

xlsx_path = next(iter(uploaded_xlsx))  # name of the uploaded file
df = pd.read_excel(xlsx_path)

# To read a specific sheet, pass sheet_name
# df = pd.read_excel('filename.xlsx', sheet_name='Sheet1')

print(df.head())

Saving QnA pairs.xlsx to QnA pairs.xlsx
         Type                                              Input  \
0  Expository  Reporting loved ones for dangerous behavior in...   
1  Expository  Focusing on the future boosts the intent to re...   
2  Expository  Incidents of mass violence and extremism in th...   
3   Narrative  An acquaintance of the Las Vegas shooter had s...   
4  Expository  Gatekeepers experience conflict between their ...   

                                              Output  Sample  
0  Reporting a close-other to authorities can sav...     1.0  
1  Notably, a future time perspective increased r...     1.0  
2  Mass shootings and violent extremism in the Un...     1.0  
3  A friend of Steven Paddock, the perpetrator of...     1.0  
4  Gatekeepers’ dilemmas reflect the interplay be...     1.0  


In [14]:
import json

# 1. Initialize the empty list
qa_dataset = []

# 2. Loop through the DataFrame, skipping rows with no Type
# index is the row label, row is the data for that row
for index, row in df.iterrows():
    if pd.notna(row["Type"]):
        # Create the dictionary for this specific row
        entry = {
            "Type": row["Type"],      # Put Type first
            "Input": row["Input"],
            "Output": row["Output"]
        }

        # Add it to the main list
        qa_dataset.append(entry)

# 3. Verify the result
print(qa_dataset)

# 4. Write the data to a file called 'dataset.json'
# 'w' means write mode
# 'indent=4' makes it readable (pretty-printed) instead of one long line
with open("dataset.json", "w", encoding="utf-8") as f:
    json.dump(qa_dataset, f, indent=4)

print("JSON file created successfully!")

JSON file created successfully!
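
The loop above only checks `Type` for missing values. If an `Input` or `Output` cell is empty, pandas yields a float NaN, and `json.dump` writes it as the bare token `NaN`, which is not valid JSON for strict parsers. Below is a minimal sketch of one way to guard against that, assuming missing cells should become `null`. The helpers `clean_value` and `clean_entry` are hypothetical, and the row shown is a made-up stand-in rather than data from the spreadsheet.

```python
import json
import math

def clean_value(value):
    """Map float NaN to None so it is serialized as JSON null."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value

def clean_entry(entry):
    """Apply clean_value to every field of one record."""
    return {key: clean_value(value) for key, value in entry.items()}

# Stand-in row with a missing Output cell
rows = [{"Type": "Expository", "Input": "some text", "Output": float("nan")}]
cleaned = [clean_entry(row) for row in rows]

# allow_nan=False makes json raise ValueError if any NaN slipped through
encoded = json.dumps(cleaned, allow_nan=False)
print(encoded)
```

In the notebook, `qa_dataset.append(clean_entry(entry))` would apply the same cleanup before writing `dataset.json`.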
