# Liquidation Preference Classification & Extraction Pipeline

**Purpose:**  
This notebook trains a classifier to identify which documents mention “liquidation preference,” then extracts and tags the specific sentences containing key properties (Company Name, Date, Document Type, Preferred Stocks, Priority Order, Liquidation Value).

**Inputs:**  
- Labeled CSVs:  
  - `URAP VC Research - [Readable] Batch 1 Main.csv` (has a binary label “Contains Liquidity Preference”)  
  - `URAP VC Research - Batch 1 Details.csv` (additional metadata)  
- Plain-text `.txt` files converted from PDFs in `Batch1_text_readable`.

**Outputs:**  
1. A trained RandomForest classifier with TF-IDF + BERT embeddings.  
2. A DataFrame of test-set predictions and confidence scores.  
3. A fine-tuned Sentence-Transformer model for tagging sentences.  
4. A CSV (`Extracted Sentences - Batch 1.csv`) listing, for each file, the sentences tagged with each property.

---

## Table of Contents

1. [Environment Setup](#setup)  
2. [Paths & Data Loading](#data)  
3. [DataFrame Preparation](#prep)  
4. [Text Preprocessing & Label Loading](#text)  
5. [Document Classification Pipeline](#classify)  
   1. [Vectorization & Embeddings](#vect)  
   2. [Train & Evaluate Classifier](#train)  
6. [Sentence Tagging & Extraction](#tagging)  
   1. [Heuristic Labeling](#heuristic)  
   2. [Build Training Examples](#examples)  
   3. [Fine-Tune BERT](#finetune)  
   4. [Classify Sentences](#sentclass)  
   5. [Batch Extraction](#batchextract)  
7. [Save Results](#save)  
8. [Next Steps & Extensions](#next)


In [12]:
# 1. Environment Setup
from collections import defaultdict
import os
import re
import logging
import numpy as np
import pandas as pd
from datasets import Dataset
# NLP & ML libraries
import nltk
from nltk.tokenize import sent_tokenize
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

import torch
from sentence_transformers import SentenceTransformer, InputExample, losses, util
from torch.utils.data import DataLoader

# Download required NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# Configure logging level
logging.basicConfig(level=logging.WARNING)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Owner\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Owner\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# 2. Paths & Data Loading

# Filepaths to labeled CSVs
batch1_labeled_path = r"D:\vc-research\vc-research\URAP VC Research - [Readable] Batch 1 Main.csv"
batch1_lp_path      = r"D:\vc-research\vc-research\URAP VC Research - Batch 1 Details.csv"

# Folder containing the .txt documents
txt_folder_path     = r"D:\vc-research\vc-research\Batch1_text_readable"

# Load the metadata and labels
batch1_labeled = pd.read_csv(batch1_labeled_path)
batch1_lp      = pd.read_csv(batch1_lp_path)

In [3]:
# 3. DataFrame Preparation

def prepare_multiindex(df, date_col="Date", group_cols=("Company Name",)):
    """
    Convert 'Date' to datetime, set a multi-index of [Company Name, Date],
    and sort by company, then chronologically within each company.
    """
    df = df.copy()  # avoid mutating the caller's DataFrame
    df[date_col] = pd.to_datetime(df[date_col])
    # sort_index on the multi-index orders rows by company first, then by date,
    # so a separate per-company groupby sort is not needed
    return df.set_index(list(group_cols) + [date_col]).sort_index()

# Apply multi-index to both DataFrames
batch1_labeled_mi = prepare_multiindex(batch1_labeled)
batch1_lp_mi      = prepare_multiindex(batch1_lp)

In [4]:
# 4. Text Preprocessing & Label Loading

def clean_text(text: str) -> str:
    """
    Normalize text to lowercase, collapse newlines and extra spaces.
    """
    text = text.lower()
    text = re.sub(r'\n+', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Load each labeled document’s text and its binary label
text_data, labels, doc_names = [], [], []
for _, row in batch1_labeled.iterrows():
    fname = row['File Name']
    label = row['Contains Liquidity Preference']
    path  = os.path.join(txt_folder_path, fname + ".txt")
    if os.path.exists(path):
        # Use a context manager so the file handle is closed promptly
        with open(path, "r", encoding="utf-8") as fh:
            raw = fh.read()
        text_data.append(clean_text(raw))
        labels.append(label)
        doc_names.append(fname)
    else:
        logging.warning(f"File not found: {path}")
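
As a quick check of the normalization step, the sketch below re-declares `clean_text` so that it runs on its own and applies it to a made-up snippet. The sample string is hypothetical and not taken from the corpus.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, then collapse newlines and runs of whitespace."""
    text = text.lower()
    text = re.sub(r'\n+', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Hypothetical snippet resembling PDF-to-text output
sample = "RESTATED CERTIFICATE\nOF INCORPORATION\n\n  of   Example Corp.  "
cleaned = clean_text(sample)
print(cleaned)
```

Note that the document classifier sees this lowercased text, while the sentence-tagging cells later in the notebook read the raw files directly.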

In [5]:
# 5.1 Vectorization & Embeddings

# 1) Split into train/test preserving label proportions
X_train, X_test, y_train, y_test, train_docs, test_docs = train_test_split(
    text_data, labels, doc_names,
    test_size=0.25, stratify=labels, random_state=42
)

# 2) TF-IDF on 1–3 grams, limit to top 2,000 features
tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=2000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf  = tfidf.transform(X_test)

# 3) Sentence-Transformer embeddings (mean pooling of sentence vectors)
bert_model = SentenceTransformer('all-MiniLM-L6-v2')
def embed_docs(docs):
    # Tokenize into sentences, encode each, then average
    return np.vstack([
        bert_model.encode(sent_tokenize(d), convert_to_numpy=True).mean(axis=0)
        for d in docs
    ])

X_train_bert = embed_docs(X_train)
X_test_bert  = embed_docs(X_test)

# 4) Combine TF-IDF and BERT embeddings horizontally
X_train_combined = np.hstack([X_train_tfidf.toarray(), X_train_bert])
X_test_combined  = np.hstack([X_test_tfidf.toarray(), X_test_bert])
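
To make the pooling and concatenation steps concrete, here is a minimal pure-Python sketch with toy numbers. The helper `mean_pool` and all vector values are hypothetical; the notebook itself does the same thing with `numpy` on real TF-IDF rows and sentence embeddings.

```python
from typing import List

def mean_pool(sentence_vectors: List[List[float]]) -> List[float]:
    """Average a list of equal-length vectors element-wise."""
    if not sentence_vectors:
        raise ValueError("document has no sentences to pool")
    n = len(sentence_vectors)
    return [sum(component) / n for component in zip(*sentence_vectors)]

# Toy 3-dimensional "embeddings" for a two-sentence document (hypothetical values)
doc_sentence_vectors = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
doc_embedding = mean_pool(doc_sentence_vectors)

# Toy TF-IDF row for the same document (hypothetical values)
tfidf_row = [0.0, 0.5]

# Per-row equivalent of np.hstack: TF-IDF features first, then the embedding
combined_row = tfidf_row + doc_embedding
print(combined_row)
```

In the notebook the combined row is much wider: up to 2,000 TF-IDF columns followed by the embedding dimensions of the sentence model.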

In [6]:
# 5.2 Train & Evaluate Classifier

# Hyperparameter grid for Random Forest
param_grid = {
    'n_estimators':      [100, 200],
    'max_depth':         [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

# Balanced RandomForest with 5-fold CV
rf_cv = GridSearchCV(
    RandomForestClassifier(random_state=42, class_weight="balanced"),
    param_grid, cv=5, n_jobs=-1
)
rf_cv.fit(X_train_combined, y_train)

# Predict on test set
y_pred      = rf_cv.best_estimator_.predict(X_test_combined)
y_pred_prob = rf_cv.best_estimator_.predict_proba(X_test_combined)[:,1]

# Build results DataFrame
predictions_df = pd.DataFrame({
    'Document':             test_docs,
    'True Label':           y_test,
    'Predicted Label':      y_pred,
    'Confidence (Prob. of 1)': y_pred_prob
})
predictions_df.head()

|   | Document | True Label | Predicted Label | Confidence (Prob. of 1) |
|---|---|---|---|---|
| 0 | 48_2013-12-06_Certificates of Incorporation | 1 | 1 | 0.98 |
| 1 | 27_2004-08-17_Certificates of Incorporation | 0 | 0 | 0.0 |
| 2 | 27_2006-08-30_Certificates of Incorporation | 1 | 1 | 0.95 |
| 3 | 24_2004-12-01_Certificates of Incorporation | 0 | 0 | 0.08 |
| 4 | 16_2015-04-22_Certificates of Incorporation | 0 | 0 | 0.01 |


In [17]:

# Error analysis on the held-out test set (uses predictions_df from the previous cell)

# Count correct predictions
correct = (predictions_df['True Label'] == predictions_df['Predicted Label']).sum()
print(f"Correct predictions: {correct}")

# False Positives: predicted 1 but true is 0
false_positives = ((predictions_df['True Label'] == 0) & (predictions_df['Predicted Label'] == 1)).sum()
# False Negatives: predicted 0 but true is 1
false_negatives = ((predictions_df['True Label'] == 1) & (predictions_df['Predicted Label'] == 0)).sum()

# Total positives and negatives for rate calculation
total_actual_positives = (predictions_df['True Label'] == 1).sum()
total_actual_negatives = (predictions_df['True Label'] == 0).sum()

# Calculate rates
false_positive_rate = false_positives / total_actual_negatives if total_actual_negatives > 0 else float('nan')
false_negative_rate = false_negatives / total_actual_positives if total_actual_positives > 0 else float('nan')

print(f"False Positive Rate: {false_positive_rate:.4f}")
print(f"False Negative Rate: {false_negative_rate:.4f}")



Correct predictions: 22
False Positive Rate: 0.0000
False Negative Rate: 0.0667
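
The false positive and false negative rates relate directly to the more commonly reported precision and recall. The sketch below computes all four from a confusion matrix in plain Python. The label lists are hypothetical and are not the notebook's test set; `confusion_counts` and `safe_div` are illustrative helper names.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def safe_div(numerator, denominator):
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float('nan')

# Hypothetical labels, for illustration only
y_true_demo = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred_demo = [1, 1, 0, 0, 0, 1, 1, 1]

tp, fp, fn, tn = confusion_counts(y_true_demo, y_pred_demo)
precision = safe_div(tp, tp + fp)
recall    = safe_div(tp, tp + fn)   # recall = 1 - false negative rate
fpr       = safe_div(fp, fp + tn)
fnr       = safe_div(fn, fn + tp)

print(f"precision={precision:.2f} recall={recall:.2f} FPR={fpr:.2f} FNR={fnr:.2f}")
```

With a test set this small, a single misclassified document moves each rate by several percentage points, so the figures are best read alongside the raw counts.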


## Sentence Tagging & Extraction

In [7]:
# 6.1 Heuristic Labeling

PROPERTY_TAGS = [
    'Company Name', 'Date', 'Document Type',
    'Preferred Stocks', 'Priority Order', 'Liquidation Value'
]

# Keywords/regex for each tag
KEYWORDS = {
    'Company Name':     ["certificate of incorporation", "incorporated", r"\bcompany name\b"],
    'Date':             ["filed", "effective date", r"\d{2}/\d{2}/\d{4}"],
    'Document Type':    ["certificate of amendment", "articles of incorporation", "restated"],
    'Preferred Stocks': ["preferred stock", "series a", "series b"],
    'Priority Order':   ["prior and in preference", "ranking junior", "paid before"],
    'Liquidation Value':["liquidation preference", "liquidation value", r"\$\d+(?:,\d{3})*(?:\.\d{2})?"]
}

def label_sentences_heuristically(folder_path: str) -> pd.DataFrame:
    """
    Tokenize each document into sentences, then assign all tags whose
    keywords/regex match that sentence. Returns a DataFrame.
    """
    records = []
    for fname in os.listdir(folder_path):
        if not fname.endswith(".txt"):
            continue
        # Context manager closes the file even if reading fails
        with open(os.path.join(folder_path, fname), 'r', encoding='utf-8', errors='ignore') as fh:
            text = fh.read()
        for sent in sent_tokenize(text.replace("\n"," ")):
            tags = [tag for tag, kws in KEYWORDS.items()
                    if any(re.search(kw, sent, re.IGNORECASE) for kw in kws)]
            if tags:
                records.append({
                    'Filename': fname,
                    'Sentence': sent,
                    'Labels':  ", ".join(tags)
                })
    return pd.DataFrame(records)

# Run heuristic labeling
labeled_sent_df = label_sentences_heuristically(txt_folder_path)
labeled_sent_df.head()

|   | Filename | Sentence | Labels |
|---|---|---|---|
| 0 | 100_2007-02-22_Certificates of Incorporation.txt | State of Delaware Secretary of State Division ... | Company Name, Date, Document Type |
| 1 | 100_2007-02-22_Certificates of Incorporation.txt | The corporation was originally incorporated un... | Company Name, Date |
| 2 | 100_2007-02-22_Certificates of Incorporation.txt | B. Pursuant to Sections 242 and 245 of the Gen... | Company Name, Document Type |
| 3 | 100_2007-02-22_Certificates of Incorporation.txt | Cc, The text of the Certificate of Incorporati... | Company Name, Document Type |
| 4 | 100_2007-02-22_Certificates of Incorporation.txt | The corporation is authorized to issue two cla... | Preferred Stocks |
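
The matching rule can be tried on single sentences to see where plain substring keywords over-match. The sketch below uses a subset of `KEYWORDS` and two made-up sentences. `tag_sentence` is a hypothetical helper that mirrors the list comprehension in `label_sentences_heuristically`, and the word-boundary variant is a suggestion rather than something the notebook does.

```python
import re

KEYWORDS_SUBSET = {
    'Preferred Stocks':  ["preferred stock", "series a", "series b"],
    'Priority Order':    ["prior and in preference", "ranking junior", "paid before"],
    'Liquidation Value': ["liquidation preference", "liquidation value",
                          r"\$\d+(?:,\d{3})*(?:\.\d{2})?"],
}

def tag_sentence(sentence, keywords):
    """Return every tag with at least one keyword/regex hit in the sentence."""
    return [tag for tag, patterns in keywords.items()
            if any(re.search(p, sentence, re.IGNORECASE) for p in patterns)]

# Hypothetical sentences, for illustration only
s1 = "The Series A Preferred Stock shall have a liquidation preference of $1.50 per share."
s2 = "Shares may be issued in one or more series as the Board determines."

tags_s1 = tag_sentence(s1, KEYWORDS_SUBSET)
tags_s2 = tag_sentence(s2, KEYWORDS_SUBSET)   # "series as" contains "series a"

# Assumed refinement: anchor the series keywords on word boundaries
STRICTER = dict(KEYWORDS_SUBSET)
STRICTER['Preferred Stocks'] = ["preferred stock", r"\bseries a\b", r"\bseries b\b"]
tags_s2_strict = tag_sentence(s2, STRICTER)

print(tags_s1, tags_s2, tags_s2_strict)
```

Because these heuristic labels become the training signal for the fine-tuning step, false positives such as the second sentence carry over into the fine-tuned model.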


In [8]:
# 6.2 Build Training Examples

def build_training_examples(df: pd.DataFrame) -> list:
    """
    Create InputExample pairs ([sentence, tag], label) for fine-tuning.
    Label = 1.0 if sentence was tagged, else 0.0.
    """
    examples = []
    for _, row in df.iterrows():
        sent  = row['Sentence']
        present_tags = set(row['Labels'].split(', '))
        for tag in PROPERTY_TAGS:
            label = 1.0 if tag in present_tags else 0.0
            examples.append(InputExample(texts=[sent, tag], label=label))
    return examples

training_examples = build_training_examples(labeled_sent_df)
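
To show what `build_training_examples` produces for one row, the sketch below expands a single sentence into (sentence, tag, label) tuples. Plain tuples stand in for `InputExample` so the snippet needs no extra libraries. The sentence, its label string, and the helper name `expand_to_pairs` are hypothetical.

```python
PROPERTY_TAGS = [
    'Company Name', 'Date', 'Document Type',
    'Preferred Stocks', 'Priority Order', 'Liquidation Value'
]

def expand_to_pairs(sentence, labels_field):
    """One (sentence, tag, label) tuple per property tag."""
    present = set(labels_field.split(', '))
    return [(sentence, tag, 1.0 if tag in present else 0.0)
            for tag in PROPERTY_TAGS]

# Hypothetical heuristic output for one sentence
pairs = expand_to_pairs(
    "Each share of Series A Preferred Stock ranks prior to the Common Stock.",
    "Preferred Stocks, Priority Order",
)
n_positive = sum(1 for _, _, label in pairs if label == 1.0)
print(len(pairs), n_positive)
```

Each sentence always yields six pairs, and most carry a 0.0 label, so the training set leans toward negatives. That imbalance is worth keeping in mind when choosing the similarity threshold later.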

In [9]:
# 6.3 Fine-Tune BERT

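def fine_tune_bert_model(base_model, examples, epochs=1, warmup_steps=5):
    """
    Fine-tune the SentenceTransformer on our tagging examples
    using a cosine-similarity loss.

    Note: fit() updates base_model in place, so the returned model is the
    same object that was passed in (here, bert_model).
    """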
    dataloader = DataLoader(examples, shuffle=True, batch_size=32)
    loss_fn    = losses.CosineSimilarityLoss(model=base_model)
    base_model.fit(
        train_objectives=[(dataloader, loss_fn)],
        epochs=epochs,
        warmup_steps=warmup_steps,
        show_progress_bar=True
    )
    return base_model

fine_tuned_model = fine_tune_bert_model(bert_model, training_examples)


| Step | Training Loss |
|---|---|
| 500 | 0.0348 |
| 1000 | 0.0233 |


In [13]:
# 6.4 Classify Sentences

def classify_sentences(sentences: list, model, threshold: float=0.5) -> dict:
    """
    Compute cosine similarity between each sentence and each tag.
    Return dict[tag] = list of (sentence, score) at or above threshold.
    """
    out = defaultdict(list)
    if not sentences:
        return out
    # Encode all sentences and all tags once, in batches, rather than
    # re-encoding every tag for every sentence
    sent_vecs = model.encode(sentences, convert_to_tensor=True)
    tag_vecs  = model.encode(PROPERTY_TAGS, convert_to_tensor=True)
    scores    = util.cos_sim(sent_vecs, tag_vecs)  # shape: (n_sentences, n_tags)
    for i, sent in enumerate(sentences):
        for j, tag in enumerate(PROPERTY_TAGS):
            score = scores[i][j].item()
            if score >= threshold:
                out[tag].append((sent, score))
    return out
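
The thresholding logic can be checked without loading a model. The sketch below applies the same rule to toy 2-dimensional vectors. The vectors, the 0.7 threshold, and the `cosine` helper are all hypothetical; real embeddings come from `model.encode`.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy tag and sentence vectors (hypothetical values)
tag_vectors = {
    'Preferred Stocks':  [1.0, 0.0],
    'Liquidation Value': [0.0, 1.0],
}
sentence_vector = [0.8, 0.6]
demo_threshold = 0.7

scores = {tag: cosine(sentence_vector, vec) for tag, vec in tag_vectors.items()}
kept = [tag for tag, score in scores.items() if score >= demo_threshold]
print(scores, kept)
```

Here the sentence scores roughly 0.8 against the first tag and 0.6 against the second, so only the first passes. Lowering the threshold to 0.5 would keep both, which is why the threshold deserves validation against hand-labeled sentences.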


In [14]:
# 6.5 Batch Extraction of Tagged Sentences

def process_directory_with_model(folder_path: str, model, threshold: float=0.5) -> pd.DataFrame:
    """
    For each text file in folder, extract all sentences that the model
    tags for each property. Returns a DataFrame of one row per document.
    """
    rows = []
    for fname in os.listdir(folder_path):
        if not fname.endswith(".txt"):
            continue
        # Context manager closes the file even if reading fails
        with open(os.path.join(folder_path, fname), 'r', encoding='utf-8', errors='ignore') as fh:
            text = fh.read()
        sentences = sent_tokenize(text.replace("\n"," "))
        tagged    = classify_sentences(sentences, model, threshold)
        row       = {'Filename': fname}
        # Join all matching sentences per tag
        for tag in PROPERTY_TAGS:
            row[tag] = "; ".join([s for s, _ in tagged.get(tag, [])])
        rows.append(row)
    return pd.DataFrame(rows)

# Run the full extraction
extracted_df = process_directory_with_model(txt_folder_path, fine_tuned_model)
extracted_df.head()

|   | Filename | Company Name | Date | Document Type | Preferred Stocks | Priority Order | Liquidation Value |
|---|---|---|---|---|---|---|---|
| 0 | 100_2007-02-22_Certificates of Incorporation.txt | State of Delaware Secretary of State Division ... | The corporation was originally incorporated un... | State of Delaware Secretary of State Division ... | The corporation is authorized to issue two cla... |  | The total number of shares of Common Stock whi... |
| 1 | 100_2008-12-03_Certificates of Incorporation.txt | State of Delaware Secre of State Division of ... | The corporation was originally incorporated un... | State of Delaware Secre of State Division of ... | The corporation is authorized to issue two cla... |  | The total number of shares of Common Stock whi... |
| 2 | 16_2003-07-03_Certificates of Incorporation.txt | State of Delaware Secretary of State Division ... | State of Delaware Secretary of State Division ... | State of Delaware Secretary of State Division ... | FOURTH: The total number of shares of stock wh... |  |  |
| 3 | 16_2004-01-22_Certificates of Incorporation.txt | State of Delaware Secretary of State Division ... | State of Delaware Secretary of State Division ... | State of Delaware Secretary of State Division ... | SECOND: The corporation has not received any p... |  | ARTICLE IV This corporation is authorized to ... |
| 4 | 16_2004-07-14_Certificates of Incorporation.txt | RESTATED CERTIFICATE OF INCORPORATION OF 3-D M... | RESTATED CERTIFICATE OF INCORPORATION OF 3-D M... | RESTATED CERTIFICATE OF INCORPORATION OF 3-D M... | Authorization of Stock.; This corporation is a... |  | The total number of GDSVF&H\567653.3 shares ... |


In [15]:
# 7. Save Results

# Save the tagged sentences for manual review and downstream use
out_csv = 'Extracted Sentences - Batch 1.csv'
extracted_df.to_csv(out_csv, index=False)
print(f"✅ Saved extracted sentences to {out_csv}")

✅ Saved extracted sentences to Extracted Sentences - Batch 1.csv


## Next Steps & Extensions

- **Error Logging:** capture parsing or model errors to a log file.  
- **Cross-Validation:** validate sentence tagging threshold and model performance.  
- **Parallel Processing:** speed up embedding & extraction with multiprocessing.  
- **Edge Cases:** refine KEYWORDS and regex for better coverage of rare phrasing.  
- **Integration:** combine document-level classification and sentence-level extraction into a unified pipeline or API.
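
As a starting point for the error-logging item, the sketch below routes read failures to a log file instead of the console, using only the standard library. The logger name, log file name, and the helpers `make_file_logger` and `read_text_or_log` are hypothetical choices, not part of the notebook.

```python
import logging
import os
import tempfile

def make_file_logger(log_path):
    """Create (or reuse) a logger that appends warnings to log_path."""
    logger = logging.getLogger("pipeline_errors")  # hypothetical logger name
    logger.setLevel(logging.WARNING)
    logger.propagate = False  # keep these messages out of the root logger
    if not logger.handlers:
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger

def read_text_or_log(path, logger):
    """Return the file's text, or None after logging why it could not be read."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return fh.read()
    except (OSError, UnicodeDecodeError) as exc:
        logger.warning("Could not read %s: %s", path, exc)
        return None

# Demo: a missing file produces a log entry rather than an exception
log_path = os.path.join(tempfile.mkdtemp(), "pipeline_errors.log")
logger = make_file_logger(log_path)
result = read_text_or_log("does_not_exist.txt", logger)
print(result, log_path)
```

A helper like this could replace the bare `open(...).read()` calls in the loading and extraction loops, so that one unreadable file does not stop a batch run.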


## ChatGPT Prompt:

Based on the strings in each of the cells, isolate just the desired information as described below:
File Name: Do not modify values in this column
Company Name: Identify and extract the company's name as a string type (Example: "3VR Security INC.", "The 41st Parameter INC.", etc.)
Date: Identify and extract the date when the article was filed as a datetime type (Example: "FILED 10:43 AM 06/28/2007", "FILED 05:05 PM 06/10/2010", etc.)
Document Type: Identify and extract the type of document that was submitted as a string type (Example: "Certificate of Incorporation", "Amended and Restated Certificate of Incorporation", etc.) 
Preferred Stock: Identify and extract the unique types of preferred shares as a list of strings (Example: ['Series A', 'Series B', 'Series C', 'Series D'])
Liquidation Value: Identify and extract the dollar liquidation amount for each preferred stock as a list of floats; the length of the list should be the same length as the list for preferred stocks; if the liquidation preference is the original issue price use that value (Example: [0.431469, 0.624136, 0.474550, 0.152430])

Return the result after this extraction in the form of a dataframe and then export as a CSV
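
For the Date field, the two example strings in the prompt follow a regular pattern, so a deterministic parser can serve as a cross-check on whatever the prompt-based extraction returns. The sketch below assumes the stamp always reads `FILED hh:mm AM/PM MM/DD/YYYY`, which is an assumption drawn from those two examples only. `FILED_PATTERN` and `parse_filed_timestamp` are hypothetical names.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed stamp layout: "FILED 10:43 AM 06/28/2007"
FILED_PATTERN = re.compile(
    r"FILED\s+(\d{1,2}:\d{2}\s+[AP]M)\s+(\d{2}/\d{2}/\d{4})",
    re.IGNORECASE,
)

def parse_filed_timestamp(text: str) -> Optional[datetime]:
    """Return the first FILED timestamp found in text, or None if absent."""
    match = FILED_PATTERN.search(text)
    if match is None:
        return None
    time_part, date_part = match.groups()
    time_part = " ".join(time_part.split()).upper()  # normalize spacing and case
    return datetime.strptime(f"{time_part} {date_part}", "%I:%M %p %m/%d/%Y")

first  = parse_filed_timestamp("FILED 10:43 AM 06/28/2007")
second = parse_filed_timestamp("State of Delaware ... FILED 05:05 PM 06/10/2010")
absent = parse_filed_timestamp("No filing stamp in this sentence.")
print(first, second, absent)
```

Sentences where the parser returns `None` are the ones that still need the prompt-based or manual route.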