PURPOSE: This notebook tests extracting the references from a single PDF and classifying each one as an academic article or grey literature, so that sources can be pulled for content analysis. The same pipeline will then be scaled up to a folder of PDFs that make up the references for our literature review.

In [2]:
!pip install pdfminer.six pandas




In [7]:
from pdfminer.high_level import extract_text
import os
import re
import pandas as pd

pdf_folder = "/Users/ellievance/Documents/WPI/content_analysis_test/pdfs"
#debugging
test_pdf = os.path.join(pdf_folder, sorted([f for f in os.listdir(pdf_folder) if f.lower().endswith(".pdf")])[0])
print("Testing on:", test_pdf)

Testing on: /Users/ellievance/Documents/WPI/content_analysis_test/pdfs/paper1.pdf


In [8]:
#EXTRACT TEXT FROM PDFS
def extract_pdf_text(pdf_path):
    return extract_text(pdf_path)

text = extract_pdf_text(test_pdf)

print("Total characters extracted:", len(text))
print("\n--- FIRST 1200 chars ---\n")
print(text[:1200])


Total characters extracted: 49614

--- FIRST 1200 chars ---

Janatha Aragalaya: 
The People’s Struggle in Sri Lanka 

A.R.M. Imtiyaz 
Delaware Valley University, USA 

Introduction 

On 9 July 2022, tens of thousands of protesters in Sri Lanka’s commercial 
capital Colombo stormed and occupied the Presidential office and official 
residences of both President and Prime Minister. Social media and global 
media  carried  ‘videos  of  protesters  swimming  in  the  president’s  pool, 
resting on his bed, using his gym, and fixing meals in his kitchen—after 
overcoming barricades, tear gas, and beatings’ (DeVotta, 2022a). Studies 
suggest that social unrests or protected protests occur with the presence 
______ 

 
of  solid  socio-economic  vulnerabilities  (Arrow  1951,  Acemoglu  et  al., 
2015, Barro, 2005). Thus, Sri Lanka did not surprise many observers.  
  Major  protests  have  occurred  around  the  world,  including  in  the 
Middle East, with increasing frequency since the sec

In [9]:
#DEBUG REFERENCES IN PDF
# Show every occurrence of the substring "references" (case-insensitive)
matches = list(re.finditer(r"references", text, flags=re.IGNORECASE))

print("Number of 'references' matches:", len(matches))

# Print context around the last few matches (often the real section header is near the end)
for m in matches[-5:]:
    start = max(m.start() - 80, 0)
    end = min(m.end() + 120, len(text))
    print("\n--- CONTEXT ---")
    print(text[start:end].replace("\n", "\\n"))


Number of 'references' matches: 1

--- CONTEXT ---
ese and the \nisland of Sri Lanka, http://lakdiva.org/mahavamsa/chap025.html. \n\nReferences  \n\nAcemoglu, D., Naidu, S., Restrepo, P. & Robinson, J. (2015). Democracy, \nredistribution, and inequality. In A. B. At


In [10]:
#EXTRACT REFERENCES SECTION
def extract_references_section(text):
    """
    Robustly find the start of the references section and return the text after it.
    """
    # Normalize: convert form feeds (page breaks) to newlines
    t = text.replace("\x0c", "\n")
    
    # Candidate headers (add more if needed)
    headers = [
        r"references",
        r"bibliography",
        r"works cited",
        r"references and notes",
        r"notes",
        r"sources",
    ]
    
    # We look for headers that appear as stand-alone-ish lines.
    # This pattern tries to match a header near line boundaries, allowing punctuation.
    header_pattern = r"(?im)^\s*({})\s*[:]*\s*$".format("|".join(headers))
    
    header_matches = list(re.finditer(header_pattern, t))
    
    if not header_matches:
        # Fallback: match header words anywhere, but prefer later occurrences
        fallback_pattern = r"(?im)\b({})\b".format("|".join(headers))
        header_matches = list(re.finditer(fallback_pattern, t))
        if not header_matches:
            return None
    
    # Choose the LAST match (most likely the actual reference section header)
    m = header_matches[-1]
    return t[m.end():]

ref_section = extract_references_section(text)

if ref_section:
    print("References section found. Characters:", len(ref_section))
    print("\n--- FIRST 1200 chars of references section ---\n")
    print(ref_section[:1200])
else:
    print("No references section found.")



References section found. Characters: 7974

--- FIRST 1200 chars of references section ---


Acemoglu, D., Naidu, S., Restrepo, P. & Robinson, J. (2015). Democracy, 
redistribution, and inequality. In A. B. Atkinson & F. Bourguignon (eds.), 
Handbook  of  Income  Distribution  Vol  2  (pp.  1885-1966).  Amsterdam: 
Elsevier. 

protestors  must 

Amnesty  International  (2022)  Sri  Lanka:  Shameful,  brutal  assault  on 
peaceful 
at 
https://www.amnesty.org/en/latest/news/2022/07/sri-lanka-
shameful-brutal-assault-on-peaceful-protestors-must-immediately-
stop/. Accessed 30 July 2022. 

stop.  Available 

immediately 

Al-Jazeera,  May  (2022)  Sri  Lanka  PM  Mahinda  Rajapaksa  resigns  as 
crisis  worsens.  Available 
https://www.aljazeera.com/news/ 
at 
2022/5/9/sri-lanka-pm-mahinda-rajapaksa-offers-to-resign-as-crisis-
worsens. (Accessed 26 July 2022) 

Arrow, K. (1951). Social Choice and Individual Values, New York: John 
Wiley & Sons. 

Bartholomeusz, T. J. & de Silva, C. R. (19
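
The stand-alone-line requirement in `header_pattern` is what keeps an in-text mention of "references" from being mistaken for the section header. A minimal sketch on a made-up string (the `sample` text is hypothetical, not taken from the PDF, and the pattern is a shortened copy of the one above with only two candidate headers):

```python
import re

# Shortened copy of header_pattern with only two of the candidate headers
header_pattern = r"(?im)^\s*(references|bibliography)\s*[:]*\s*$"

# Hypothetical text: one in-sentence mention, then a real header line
sample = (
    "See the references in section 2.\n"
    "\n"
    "References  \n"
    "\n"
    "Arrow, K. (1951). Social Choice and Individual Values.\n"
)

matches = list(re.finditer(header_pattern, sample))
tail = sample[matches[-1].end():]  # text after the last stand-alone header

print(len(matches))   # only the header line matches, not the in-sentence mention
print(tail.strip())
```

The fallback branch of `extract_references_section` drops this line-anchoring, so it should be expected to be noisier than the primary pattern.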

In [11]:
#STOP REFERENCE SECTION BEFORE AUTHOR BIO
def trim_trailing_non_references(ref_text):
    """
    Cut off common post-reference sections like 'About the author', 'Acknowledgements', etc.
    """
    end_markers = [
        r"about the author",
        r"author biography",
        r"biography",
        r"acknowledg(e)?ments",
        r"appendix",
        r"annex",
    ]
    pattern = r"(?im)^\s*({})\b.*$".format("|".join(end_markers))
    m = re.search(pattern, ref_text)
    if m:
        return ref_text[:m.start()]
    return ref_text

ref_section_trimmed = trim_trailing_non_references(ref_section) if ref_section else None

if ref_section_trimmed:
    print("Trimmed references section length:", len(ref_section_trimmed))



Trimmed references section length: 7226


In [12]:
#SPLIT REFERENCES INTO INDIVIDUAL CITATIONS

def split_references(ref_text):
    """
    Split references into chunks, then merge URL-only / continuation fragments into the previous chunk.
    """
    # Normalize whitespace
    t = ref_text.replace("\x0c", "\n")
    t = re.sub(r"[ \t]+", " ", t)
    t = re.sub(r"\n{2,}", "\n", t).strip()

    # Split on likely starts of new references:
    # - "Lastname, I." patterns
    # - OR Organization names (capitalized words) + year soon after
    # - OR lines that start with a capital word and contain a year in parentheses later
    start_pat = r"(?m)^\s*(?=(?:[A-Z][A-Za-z\-']+,\s*[A-Z])|(?:[A-Z][A-Za-z&\-\s]{2,}\(\d{4}[a-z]?\))|(?:[A-Z][A-Za-z&\-\s]{2,}\s+\(\d{4}[a-z]?\)))"
    chunks = re.split(start_pat, t)
    chunks = [c.strip() for c in chunks if c.strip()]

    # Merge continuation chunks (URL-only or “Accessed …” fragments)
    merged = []
    for c in chunks:
        c_one_line = re.sub(r"\s+", " ", c).strip()
        is_urlish = bool(re.fullmatch(r"(https?://\S+|www\.\S+)", c_one_line))
        is_accessed = c_one_line.lower().startswith(("accessed", "available at", "retrieved", "at http", "at https"))
        
        if merged and (is_urlish or is_accessed):
            merged[-1] = merged[-1] + " " + c_one_line
        else:
            merged.append(c_one_line)

    # Filter out very short noise
    merged = [m for m in merged if len(m) > 40]
    return merged

refs = split_references(ref_section_trimmed) if ref_section_trimmed else []
print("Number of split references:", len(refs))
print("\nFirst 3 references:\n")
for r in refs[:3]:
    print("-", r[:220], "...\n")


Number of split references: 32

First 3 references:

- Acemoglu, D., Naidu, S., Restrepo, P. & Robinson, J. (2015). Democracy, redistribution, and inequality. In A. B. Atkinson & F. Bourguignon (eds.), Handbook of Income Distribution Vol 2 (pp. 1885-1966). Amsterdam: Elsevie ...

- Amnesty International (2022) Sri Lanka: Shameful, brutal assault on peaceful at https://www.amnesty.org/en/latest/news/2022/07/sri-lanka- shameful-brutal-assault-on-peaceful-protestors-must-immediately- stop/. Accessed 3 ...

- Al-Jazeera, May (2022) Sri Lanka PM Mahinda Rajapaksa resigns as crisis worsens. Available https://www.aljazeera.com/news/ at 2022/5/9/sri-lanka-pm-mahinda-rajapaksa-offers-to-resign-as-crisis- worsens. (Accessed 26 July ...
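
The Amnesty URL above is broken where the PDF wrapped a line after a hyphen ("sri-lanka- shameful-..."), so a `\S+` match would stop at the first fragment. One possible repair, sketched here as a hypothetical helper `rejoin_hyphen_broken_url` (not part of the notebook), removes whitespace that follows a hyphen inside a URL. It assumes that a hyphen followed by whitespace inside a URL is always a line-wrap artifact, which will not hold for every reference, and it does not repair the Al-Jazeera case, where the word "at" is interleaved into the URL.

```python
import re

# Whitespace that directly follows a hyphen inside a URL
_URL_HYPHEN_BREAK = re.compile(r"(https?://\S*-)\s+(?=\S)")

def rejoin_hyphen_broken_url(s):
    """Remove line-wrap whitespace after hyphens inside URLs (heuristic)."""
    prev = None
    while prev != s:  # repeat until a pass makes no further change
        prev = s
        s = _URL_HYPHEN_BREAK.sub(r"\1", s)
    return s

ref = (
    "Amnesty International (2022) Sri Lanka: Shameful, brutal assault on peaceful at "
    "https://www.amnesty.org/en/latest/news/2022/07/sri-lanka- "
    "shameful-brutal-assault-on-peaceful-protestors-must-immediately- stop/. "
    "Accessed 30 July 2022."
)
fixed = rejoin_hyphen_broken_url(ref)
print(re.search(r"https?://\S+", fixed).group(0).rstrip(").,;"))
```

If adopted, a call like this would most naturally sit in `split_references` right after chunks are collapsed to one line, so that later URL extraction sees the rejoined string.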



In [59]:
#EXTRACT TITLE AND URL
def extract_first_url(s):
    m = re.search(r"(https?://\S+|www\.\S+)", s)
    if m:
        url = m.group(1).rstrip(").,;")
        # normalize www. → https://www.
        if url.startswith("www."):
            url = "https://" + url
        return url
    return ""

def clean_reference_for_title(s):
    # Remove URL and common trailing access text
    s2 = re.sub(r"(https?://\S+|www\.\S+)", "", s)
    s2 = re.sub(r"(?i)\bavailable at\b.*$", "", s2)
    s2 = re.sub(r"(?i)\baccessed\b.*$", "", s2)
    s2 = re.sub(r"\s+", " ", s2).strip(" .;:-")
    return s2

def title_guess_from_reference(s):
    """
    Heuristic: for many refs, the title is after the year and before the next period.
    If that fails, return the cleaned reference.
    """
    s_clean = clean_reference_for_title(s)

    # Try: (YEAR). Title.
    m = re.search(r"\(\d{4}[a-z]?\)\.\s*([^\.]{8,200})\.", s_clean)
    if m:
        return m.group(1).strip()

    # Try: YEAR) Title.
    m = re.search(r"\(\d{4}[a-z]?\)\s*([^\.]{8,200})\.", s_clean)
    if m:
        return m.group(1).strip()

    return s_clean[:220]  # fallback

# Quick test on first 8
for r in refs[:8]:
    print("\nREF:", r[:180], "...")
    print("URL:", extract_first_url(r))
    print("TITLE_GUESS:", title_guess_from_reference(r))

        



REF: Acemoglu, D., Naidu, S., Restrepo, P. & Robinson, J. (2015). Democracy, redistribution, and inequality. In A. B. Atkinson & F. Bourguignon (eds.), Handbook of Income Distribution V ...
URL: 
TITLE_GUESS: Democracy, redistribution, and inequality

REF: Amnesty International (2022) Sri Lanka: Shameful, brutal assault on peaceful at https://www.amnesty.org/en/latest/news/2022/07/sri-lanka- shameful-brutal-assault-on-peaceful-protes ...
URL: https://www.amnesty.org/en/latest/news/2022/07/sri-lanka-
TITLE_GUESS: Amnesty International (2022) Sri Lanka: Shameful, brutal assault on peaceful at shameful-brutal-assault-on-peaceful-protestors-must-immediately- stop/

REF: Al-Jazeera, May (2022) Sri Lanka PM Mahinda Rajapaksa resigns as crisis worsens. Available https://www.aljazeera.com/news/ at 2022/5/9/sri-lanka-pm-mahinda-rajapaksa-offers-to-resi ...
URL: https://www.aljazeera.com/news/
TITLE_GUESS: Sri Lanka PM Mahinda Rajapaksa resigns as crisis worsens

REF: Arrow, K. (1951). Social C
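
Some title guesses fall back to the whole cleaned string. One likely cause is that `clean_reference_for_title` strips trailing periods, so a reference whose title is its last remaining sentence has no closing period left for the `(YEAR). Title.` patterns to match. A hedged sketch of a variant that also accepts the end of the string as the title boundary (the name `title_guess_allowing_end` is hypothetical, and the inputs are already-cleaned strings):

```python
import re

def title_guess_allowing_end(s_clean):
    """Title = text after '(YEAR)' up to the next period or the end of the string."""
    m = re.search(r"\(\d{4}[a-z]?\)\.?\s*([^.]{8,200})(?:\.|$)", s_clean)
    if m:
        return m.group(1).strip()
    return s_clean[:220]  # same fallback as the original heuristic

print(title_guess_allowing_end("CIA (2023). The World Fact Book: Sri Lanka"))
print(title_guess_allowing_end(
    "Acemoglu, D. (2015). Democracy, redistribution, and inequality. In A. B. Atkinson (eds.)"
))
```

For book-style references with no period between title and publisher, this variant would return the title together with the publisher text, so the result is still only a guess.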

In [40]:
#RUN PIPELINE ON PDFS
results = []

for filename in os.listdir(pdf_folder):
    if filename.lower().endswith(".pdf"):  # case-insensitive, matching the other cells
        pdf_path = os.path.join(pdf_folder, filename)
        
        # Extract full text
        text = extract_pdf_text(pdf_path)
        
        # Extract references section
        ref_section = extract_references_section(text)
        
        if ref_section:
            # Split into individual references
            refs = split_references(ref_section)
            
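            # Classify each reference
            # NOTE: classify_reference is not defined in any cell shown here; the classifier
            # defined in this notebook is classify_academic_vs_grey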
            for ref in refs:
                classification = classify_reference(ref)
                results.append({
                    "source_pdf": filename,
                    "reference_text": ref,
                    "classification": classification
                })

#SAVE RESULTS TO CSV
df = pd.DataFrame(results)
df.to_csv("extracted_references.csv", index=False)
df.head()  # preview first 5 rows

Unnamed: 0,source_pdf,reference_text,classification
0,paper1.pdf,"Acemoglu, D., Naidu, S., Restrepo, P. & Robins...",academic
1,paper1.pdf,Handbook of Income Distribution Vol 2 (p...,academic
2,paper1.pdf,Amnesty International (2022) Sri Lanka: S...,grey literature
3,paper1.pdf,https://www.amnesty.org/en/latest/news/2022/07...,grey literature
4,paper1.pdf,https://www.aljazeera.com/news/ at 2022/5/9/...,grey literature


In [13]:
#CLASSIFY ACADEMIC VS GREY LIT
def classify_academic_vs_grey(ref):
    s = ref.lower()

    academic_cues = [
        "doi", "journal", "vol.", "no.", "pp.",
        "springer", "elsevier", "wiley", "oxford", "cambridge",
        "in ", "(eds", "(ed", "handbook", "university press"
    ]

    has_url = bool(re.search(r"(https?://|www\.)", s))

    # If it has strong academic cues, label academic
    if any(cue in s for cue in academic_cues):
        return "academic"

    # If it has a URL and lacks academic cues, very often grey literature
    if has_url:
        return "grey_literature"

    # Neither an academic cue nor a URL: ambiguous. Default to grey literature so that
    # possible grey sources are kept (inclusive) rather than dropped.
    return "grey_literature"
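
Plain substring tests can fire inside unrelated words: `"doi"` is a substring of "doing", and `"in "` matches any word that ends in "in". A minimal sketch of cue matching with regex word boundaries, assuming a smaller and stricter cue list is acceptable (`ACADEMIC_CUE_PATTERNS` and `has_academic_cue` are hypothetical names, and the patterns are illustrative rather than tuned on real data):

```python
import re

# Illustrative cue patterns; \b keeps cues from matching inside longer words
ACADEMIC_CUE_PATTERNS = [
    r"\bdoi\b", r"\bjournal\b", r"\bvol\b", r"\bpp\.",
    r"\(eds?\.?\)", r"\bhandbook\b", r"\buniversity press\b",
]
_CUE_RE = re.compile("|".join(ACADEMIC_CUE_PATTERNS), flags=re.IGNORECASE)

def has_academic_cue(ref):
    """True if any cue pattern occurs in the reference string."""
    return bool(_CUE_RE.search(ref))

print(has_academic_cue("Handbook of Income Distribution Vol 2 (pp. 1885-1966)"))
print(has_academic_cue("What the government is doing in Colombo"))
```

Publisher names are left out of this sketch because a publisher appearing in a reference does not by itself separate academic from grey sources; whether to keep them is a judgment call for the content analysis.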


In [61]:
#RUN TEST PDF AND CSV PREVIEW
rows = []
for r in refs:
    rows.append({
        "source_pdf": os.path.basename(test_pdf),
        "reference_text": r,
        "url": extract_first_url(r),
        "title_guess": title_guess_from_reference(r),
        "class": classify_academic_vs_grey(r),
    })

df_test = pd.DataFrame(rows)
print(df_test["class"].value_counts())
df_test.head(10)


class
academic           16
grey_literature    16
Name: count, dtype: int64


Unnamed: 0,source_pdf,reference_text,url,title_guess,class
0,paper1.pdf,"Acemoglu, D., Naidu, S., Restrepo, P. & Robins...",,"Democracy, redistribution, and inequality",academic
1,paper1.pdf,Amnesty International (2022) Sri Lanka: Shamef...,https://www.amnesty.org/en/latest/news/2022/07...,Amnesty International (2022) Sri Lanka: Shamef...,grey_literature
2,paper1.pdf,"Al-Jazeera, May (2022) Sri Lanka PM Mahinda Ra...",https://www.aljazeera.com/news/,Sri Lanka PM Mahinda Rajapaksa resigns as cris...,grey_literature
3,paper1.pdf,"Arrow, K. (1951). Social Choice and Individual...",,"Arrow, K. (1951). Social Choice and Individual...",academic
4,paper1.pdf,"Bartholomeusz, T. J. & de Silva, C. R. (1998)....",,Buddhist Fundamentalism and Minority Identitie...,academic
5,paper1.pdf,"Barro, R. J. (2000). Inequality and growth in ...",,Inequality and growth in a panel of countries,academic
6,paper1.pdf,CIA (2023). The World Fact Book: Sri Lanka. Av...,https://www.cia.gov/the-world-factbook/countri...,CIA (2023). The World Fact Book: Sri Lanka,grey_literature
7,paper1.pdf,"Colombage, Q. (2022). Police hunt for Sri Lank...",https://www.ucanews.com/news/police-hunt-for-sri-,Police hunt for Sri Lankan priest deplored,grey_literature
8,paper1.pdf,Colombo Telegraph (2015). Ranil Tops All Islan...,https://theconnectionsworld.com/what-lies-behind-,Colombo Telegraph (2015). Ranil Tops All Islan...,grey_literature
9,paper1.pdf,"DeVotta, N. (2022a) Sri Lanka’s Road to Ruin W...",,"DeVotta, N. (2022a) Sri Lanka’s Road to Ruin W...",academic


In [14]:
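#FINAL PDF EXTRACT INTO CSV
# Requires extract_first_url, title_guess_from_reference and classify_academic_vs_grey
# to be defined in the current kernel session (run their cells first).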
def process_one_pdf(pdf_path):
    text = extract_pdf_text(pdf_path)
    ref = extract_references_section(text)
    if not ref:
        return []
    ref = trim_trailing_non_references(ref)
    refs = split_references(ref)

    rows = []
    for r in refs:
        rows.append({
            "source_pdf": os.path.basename(pdf_path),
            "reference_text": r,
            "url": extract_first_url(r),
            "title_guess": title_guess_from_reference(r),
            "class": classify_academic_vs_grey(r),
        })
    return rows

all_rows = []
for filename in os.listdir(pdf_folder):
    if filename.lower().endswith(".pdf"):
        pdf_path = os.path.join(pdf_folder, filename)
        all_rows.extend(process_one_pdf(pdf_path))

df_all = pd.DataFrame(all_rows)
df_all.to_csv("extracted_references_with_urls.csv", index=False)

print("Saved:", "extracted_references_with_urls.csv")
print("Total rows:", len(df_all))
df_all.head()


NameError: name 'extract_first_url' is not defined
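
A `NameError` like this typically means the cell that defines the function has not been run in the current kernel session; the execution counts suggest the `extract_first_url` cell (`In [59]`) dates from an earlier session than `In [14]`. A small, hypothetical pre-flight check (`missing_helpers` is not part of the notebook) can list which helpers are absent before the batch loop starts:

```python
# Helper functions that process_one_pdf relies on
REQUIRED_HELPERS = [
    "extract_pdf_text",
    "extract_references_section",
    "trim_trailing_non_references",
    "split_references",
    "extract_first_url",
    "title_guess_from_reference",
    "classify_academic_vs_grey",
]

def missing_helpers(namespace, required=REQUIRED_HELPERS):
    """Return the names in `required` that are not callables in `namespace`."""
    return [name for name in required if not callable(namespace.get(name))]

# In the notebook this would be called as missing_helpers(globals());
# the dictionary below is a stand-in namespace for demonstration only.
demo_namespace = {"extract_pdf_text": len, "split_references": str.split}
print(missing_helpers(demo_namespace))
```

An empty list means every helper is defined and the loop over `pdf_folder` can run.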