Learning script for logistic regression classifier for document inclusion criteria
1/15/26

"""
SIMPLE LEARNING VERSION (Python)
TF-IDF + Logistic Regression screening for LexisNexis .docx exports

Expected files:
- input/nexis_export_translated.docx
- input/input_labels.csv     columns: article_id,label  where label is include/exclude/blank

Outputs:
- output/screened_ranked_all_articles.csv
"""

OVERVIEW + FOLDER EXPECTATIONS FOR END TO END FORMATTING
This notebook expects folder structure to look like:

fnp_project_directory/
    fnp_vscode/  (this notebook lives here)
    input/
        nexis_export_translated.docx
        input_labels.csv
    output/

This notebook will:
1. read the docx
2. split into articles using the line "End of Document" as the article separator
3. read your labels in the csv
4. train a TF-IDF + logistic regression classifier on labeled rows
5. score all articles and write output to folder output/
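
Steps 3 and 4 depend on how blank labels are treated. The following is a minimal sketch, assuming the labels file has the article_id,label columns described above. The numeric coding (include = 1, exclude = 0), the helper name split_labels, and the sample rows are illustrative assumptions, not taken from the notebook, which would more likely do this with pandas.

```python
from __future__ import annotations

import csv
import io


def split_labels(csv_text: str) -> tuple[dict[int, int], list[int]]:
    """Separate labeled rows (used for training) from blank rows (scored only)."""
    labeled: dict[int, int] = {}
    unlabeled: list[int] = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        article_id = int(row["article_id"])
        # Normalize case and whitespace so "Include " and "include" are treated alike
        label = (row["label"] or "").strip().lower()
        if label == "include":
            labeled[article_id] = 1
        elif label == "exclude":
            labeled[article_id] = 0
        elif label == "":
            unlabeled.append(article_id)
        else:
            raise ValueError(f"Unexpected label for article {article_id}: {label!r}")
    return labeled, unlabeled


# Hypothetical sample rows, not from the real labels file
sample_csv = "article_id,label\n1,include\n2,exclude\n3,\n4,Include\n"
labeled, unlabeled = split_labels(sample_csv)
print("Labeled:", labeled)
print("Unlabeled:", unlabeled)
```

Raising on an unexpected value is one possible design choice; it surfaces typos such as "inlcude" before they silently shrink the training set.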

In [3]:
#Import libraries for paths, regex, and data handling. python-docx reads .docx files; scikit-learn builds the classifier.
#If a package isn't found, run: pip install python-docx scikit-learn pandas numpy in the terminal. This shouldn't happen when the venv is active, because all four are installed there.

from __future__ import annotations

import re
from pathlib import Path

import numpy as np 
import pandas as pd

from docx import Document
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

In [4]:
#Path safe setup for VS Code structure
#Sets the project root (fnp_project_directory) as PROJECT_DIR
#Constructs paths to input files and creates output folder if missing, prints paths so you can see what it's using, and stops early with errors if files are missing
#If you need to hardcode (not running in VS Code), can run: PROJECT_DIR = Path("/Users/ellievance/Documents/fnp_project_directory")

VSCODE_DIR = Path.cwd() #gets current wd as path object

#Check to see if running from fnp_vscode folder
if VSCODE_DIR.name == "fnp_vscode": # if yes, then PROJECT_DIR set to parent directory (fnp_project_directory)
    PROJECT_DIR = VSCODE_DIR.parent
else: 
    PROJECT_DIR = VSCODE_DIR #if no, assumes already in the project root, so PROJECT_DIR stays current directory
#this allows you to run the script from within fnp_vscode folder or from the parent directory 

#Construct paths to input and output subdirectories within project directory
INPUT_DIR = PROJECT_DIR / "input" 
OUTPUT_DIR = PROJECT_DIR / "output"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True) #creates output folder if it doesn't exist 

#Store input file paths so you can reference them later without repeating the entire path structure
IN_DOCX = INPUT_DIR / "nexis_export_translated.docx" #creates path object to Lexis document file
IN_LABELS = INPUT_DIR / "input_labels.csv" #path to csv with in/exclusion labels

print("Working directory:", VSCODE_DIR)
print("Project directory:", PROJECT_DIR)
print("Input directory:", INPUT_DIR)
print("Output directory:", OUTPUT_DIR)
print("Docx path:", IN_DOCX)
print("Labels path:", IN_LABELS)

#Double check that files exist
if not IN_DOCX.exists():
    raise FileNotFoundError(f"Docx not found: {IN_DOCX}")
if not IN_LABELS.exists():
    raise FileNotFoundError(f"Labels CSV not found: {IN_LABELS}")


Working directory: /Users/ellievance/Documents/WPI/fnp_project_directory/fnp_vscode
Project directory: /Users/ellievance/Documents/WPI/fnp_project_directory
Input directory: /Users/ellievance/Documents/WPI/fnp_project_directory/input
Output directory: /Users/ellievance/Documents/WPI/fnp_project_directory/output
Docx path: /Users/ellievance/Documents/WPI/fnp_project_directory/input/nexis_export_translated.docx
Labels path: /Users/ellievance/Documents/WPI/fnp_project_directory/input/input_labels.csv
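
The directory check above can be isolated into a small function to see both branches at once. This is a sketch only: resolve_project_dir and the example paths are hypothetical, and PurePosixPath is used so that nothing touches the real file system.

```python
from pathlib import PurePosixPath


def resolve_project_dir(cwd: PurePosixPath, notebook_folder: str = "fnp_vscode") -> PurePosixPath:
    """Return the parent when cwd is the notebook folder, otherwise cwd itself."""
    return cwd.parent if cwd.name == notebook_folder else cwd


# Hypothetical locations, used only to exercise both branches
from_notebook = resolve_project_dir(PurePosixPath("/home/user/fnp_project_directory/fnp_vscode"))
from_root = resolve_project_dir(PurePosixPath("/home/user/fnp_project_directory"))
print(from_notebook)
print(from_root)
```

Any other working directory is treated as the project root, so the FileNotFoundError checks in the cell are what catch a notebook started from the wrong place.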


In [5]:
#READ THE DOCX AND CONVERT INTO PLAIN TEXT FOR CLASSIFICATION

def read_docx_as_text(docx_path: Path) -> str:
    """Read a .docx file and return all paragraph text as a single newline-joined string."""
    doc = Document(str(docx_path))
    paras = [p.text for p in doc.paragraphs if p.text is not None]
    return "\n".join(paras)

raw_text = read_docx_as_text(IN_DOCX)

#Print total character count and article preview
print("Characters read from docx:", len(raw_text))
print("Preview:", raw_text[:300])


Characters read from docx: 158624
Preview: 
User Name: =
Date and Time: = 2026-01-12
Job Number: = 272569488

Documents (39)
Client/Matter: -None-
Search Terms:
Search Type: nat

1. Mercosur, that necessary yes from Italy The ideas corner The comments

2. Vivaldini and the protection of Made in Brescia products: "Mercosur should avoid low-qu


In [6]:
#Splitting document containing all articles into separate articles, using the "end of document" text as a separator 
def split_into_articles_by_end_marker(text: str, min_chars: int = 500) -> pd.DataFrame:
    """
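    Split LexisNexis-export text into articles by using 'End of Document' as the separator.
    Keeps only chunks that contain 'Load Date:' (a strong sign it's an actual article).
    Note: front matter that precedes the first 'End of Document' lands in the first article's
    chunk, so this filter alone does not remove it.
    """

    # Remove page footers such as "Page 63 of 64"
    text = re.sub(r"(?m)^\s*Page\s*\d+\s*of\s*\d+\s*$\n?", "", text)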

    # Split on lines that are exactly "End of Document"
    # Using MULTILINE so ^ and $ work per-line.
    chunks = re.split(r"(?m)^\s*End of Document\s*$", text)

    cleaned_chunks = []
    for ch in chunks:
        ch = ch.strip()
        if not ch:
            continue

        #Fix whitespace so everything is single spaces
        ch = re.sub(r"\s+", " ", ch).strip()

        # Skip chunks without "Load Date:", which every article contains; this drops stand-alone front matter such as a table of contents
        if "Load Date:" not in ch:
            continue

        if len(ch) >= min_chars:
            cleaned_chunks.append(ch)

    return pd.DataFrame(
        {"article_id": np.arange(1, len(cleaned_chunks) + 1), "text": cleaned_chunks}
    )

articles = split_into_articles_by_end_marker(raw_text, min_chars=500)

print("Articles found:", len(articles))
print("\nFirst article preview:\n", articles.loc[0, "text"][:450])



Articles found: 39

First article preview:
 User Name: = Date and Time: = 2026-01-12 Job Number: = 272569488 Documents (39) Client/Matter: -None- Search Terms: Search Type: nat 1. Mercosur, that necessary yes from Italy The ideas corner The comments 2. Vivaldini and the protection of Made in Brescia products: "Mercosur should avoid low-quality imports." 3. the fear of joining 4. the fear of joining 5. Tractors return to the heart of the EU 6. The EU revises its agricultural policy. Fitto: 
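
The first article preview above begins with the export's front matter. A small synthetic example suggests why: the front matter sits before the first "End of Document" line, so it shares a chunk with article 1, and that chunk passes the "Load Date:" filter. The sample text is made up for illustration and only mimics the layout of the export.

```python
import re

# Synthetic stand-in for a Nexis export: front matter, then two articles
sample = (
    "Documents (2)\n"
    "1. First headline\n"
    "2. Second headline\n"
    "First headline\n"
    "Body of the first article.\n"
    "Load Date: January 1, 2026\n"
    "End of Document\n"
    "Second headline\n"
    "Body of the second article.\n"
    "Load Date: January 2, 2026\n"
    "End of Document\n"
)

# Same separator pattern as the notebook cell
chunks = [c.strip() for c in re.split(r"(?m)^\s*End of Document\s*$", sample) if c.strip()]

# Same filter as the notebook cell: keep chunks that mention "Load Date:"
kept = [c for c in chunks if "Load Date:" in c]

print("Chunks kept:", len(kept))
print("First chunk starts with:", kept[0].splitlines()[0])
```

Because the table of contents and the first article are one chunk, filtering whole chunks cannot separate them; the front matter has to be removed from the text before splitting.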


In [9]:
#DEBUGGING SPLITTING SO THE TOC ISN'T CHUNKED WITH THE FIRST ARTICLE
def split_into_articles_by_end_marker(text: str, min_chars: int = 500) -> pd.DataFrame:
    """
    Split Nexis export text into articles using 'End of Document' as the separator.
    Attempts to remove the 'Documents (N)' table-of-contents block if present.
    Filters out front matter by requiring 'Load Date:'.
    """

    # 0) Remove page header/footer lines like "Page 63 of 64"
    text = re.sub(r"(?m)^\s*Page\s*\d+\s*(?:of)?\s*\d*\s*$\n?", "", text)

    # 1) Try to remove the Nexis "Documents (N)" table of contents block (front matter)
    # The block starts at "Documents (39)" and contains numbered lines like "1. ...", "2. ...", etc.
    # Caveat: the lazy .*? stops at the first following line that begins with an uppercase letter
    # (for example "Client/Matter:"), so this can remove only the "Documents (N)" header line and
    # leave the numbered list in place.
    text = re.sub(
        r"(?ms)^\s*Documents\s*\(\s*\d+\s*\)\s*.*?(?=^\s*[A-Z].+\n|\Z)",
        "",
        text
    )

    # Check the first chunk preview after re-running the split; tighten the pattern if the list is still there.

    # 2) Split on lines that are exactly "End of Document"
    chunks = re.split(r"(?m)^\s*End of Document\s*$", text)

    cleaned_chunks = []
    for ch in chunks:
        ch = ch.strip()
        if not ch:
            continue

        # Keep only chunks that look like real articles
        if "Load Date:" not in ch:
            continue

        # Normalize whitespace after filtering
        ch = re.sub(r"\s+", " ", ch).strip()

        if len(ch) >= min_chars:
            cleaned_chunks.append(ch)

    return pd.DataFrame(
        {"article_id": np.arange(1, len(cleaned_chunks) + 1), "text": cleaned_chunks}
    )
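
The two versions of the page-footer pattern differ in what they accept. The sketch below runs both on made-up lines: the earlier pattern requires "of", while the one in this cell also matches a bare "Page 3". Neither touches a sentence that merely mentions a page, because the match is anchored to the start of the line. The names FOOTER_STRICT and FOOTER_LOOSE are labels chosen for this illustration.

```python
import re

# Pattern from the first splitting cell, with the redundant alternation simplified
FOOTER_STRICT = re.compile(r"(?m)^\s*Page\s*\d+\s*of\s*\d+\s*$\n?")
# Pattern from the debugging cell: "of" and the total page count are optional
FOOTER_LOOSE = re.compile(r"(?m)^\s*Page\s*\d+\s*(?:of)?\s*\d*\s*$\n?")

# Made-up lines, not taken from the real export
sample = (
    "Tractors return to Brussels.\n"
    "Page 2 of 64\n"
    "Page 3\n"
    "See Page 4 of 64 for details.\n"
    "Farmers protested on Tuesday.\n"
)

strict = FOOTER_STRICT.sub("", sample)
loose = FOOTER_LOOSE.sub("", sample)

print(strict)
print(loose)
```

Which variant is safer depends on the export: the loose pattern catches truncated footers but would also delete a genuine one-line paragraph such as "Page 3".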


In [10]:
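#SANITY CHECK FOR DEBUGGING SPLITTING
#NOTE: redefining the function does not update `articles`. Re-run
#articles = split_into_articles_by_end_marker(raw_text, min_chars=500)
#before this cell; otherwise the check below still tests the earlier split.
print("Articles found:", len(articles))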

# The TOC content should not appear inside any chunk now
toc_hits = articles["text"].str.contains(r"Documents\s*\(\s*\d+\s*\)", regex=True).sum()
print("Chunks containing 'Documents (N)' (should be 0):", toc_hits)

print("First chunk preview:\n", articles.loc[0, "text"][:300])


Articles found: 39
Chunks containing 'Documents (N)' (should be 0): 1
First chunk preview:
 User Name: = Date and Time: = 2026-01-12 Job Number: = 272569488 Documents (39) Client/Matter: -None- Search Terms: Search Type: nat 1. Mercosur, that necessary yes from Italy The ideas corner The comments 2. Vivaldini and the protection of Made in Brescia products: "Mercosur should avoid low-qualit
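
The check above still finds the front matter in the first chunk. One possible way to drop it, sketched here on a synthetic string, is to read N from the "Documents (N)" header and cut everything up to the end of the table-of-contents entry numbered N. This rests on two assumptions that have not been verified against the real export: each entry occupies a single line, and the entry numbered N is the last one in the list. The function name strip_front_matter and the sample text are hypothetical.

```python
import re


def strip_front_matter(text: str) -> str:
    """Drop everything up to and including the last numbered TOC entry."""
    header = re.search(r"(?m)^\s*Documents\s*\(\s*(\d+)\s*\)\s*$", text)
    if header is None:
        return text  # no TOC header found; leave the text untouched
    n_docs = int(header.group(1))
    # Assumption: the first line after the header that starts with "<N>." closes the TOC
    last_entry = re.compile(rf"(?m)^\s*{n_docs}\.\s.*$")
    match = last_entry.search(text, header.end())
    if match is None:
        return text  # header present but no matching entry; do nothing rather than guess
    return text[match.end():].lstrip()


# Synthetic stand-in for the export layout
sample = (
    "User Name: =\n"
    "Job Number: = 1\n"
    "\n"
    "Documents (2)\n"
    "Client/Matter: -None-\n"
    "\n"
    "1. First headline\n"
    "\n"
    "2. Second headline\n"
    "\n"
    "First headline\n"
    "Body one.\n"
    "Load Date: January 1, 2026\n"
    "End of Document\n"
    "Second headline\n"
    "Body two.\n"
    "Load Date: January 2, 2026\n"
    "End of Document\n"
)

cleaned = strip_front_matter(sample)
chunks = [c.strip() for c in re.split(r"(?m)^\s*End of Document\s*$", cleaned) if c.strip()]
print("Chunks:", len(chunks))
print("First chunk starts with:", chunks[0].splitlines()[0])
```

Calling such a helper on raw_text before the split would leave the separator and "Load Date:" logic unchanged. If entries in the real export wrap across lines, the cut point would need to move to the end of the wrapped entry.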


In [8]:
# Optional: create an index csv to double check that chunks correspond to correct article IDs
index = articles.copy()
index["preview"] = index["text"].str[:250] #preview column
index_path = OUTPUT_DIR / "article_index_preview.csv" #file path in output folder
index.to_csv(index_path, index=False) #df written to csv

print("Wrote article index preview to:", index_path) #double check where file was written
index.head(5) #preview new file



Wrote article index preview to: /Users/ellievance/Documents/WPI/fnp_project_directory/output/article_index_preview.csv


Unnamed: 0,article_id,text,preview
0,1,User Name: = Date and Time: = 2026-01-12 Job N...,User Name: = Date and Time: = 2026-01-12 Job N...
1,2,Vivaldini and the protection of Made in Bresci...,Vivaldini and the protection of Made in Bresci...
2,3,the fear of joining Corriere della Sera (Italy...,the fear of joining Corriere della Sera (Italy...
3,4,the fear of joining Corriere della Sera (Italy...,the fear of joining Corriere della Sera (Italy...
4,5,Tractors return to the heart of the EU Corrier...,Tractors return to the heart of the EU Corrier...
