# Time Series Information Extraction Pipeline

**Purpose:**  
This notebook automates the extraction of key fields from “readable” text documents (converted from PDFs).  
It reads each text file and pulls out:
- **Company Name** (e.g. “Acme Corp. Inc.”)  
- **Date** (e.g. “January 1, 2020”)  
- **Document Type** (e.g. “Certificate of Incorporation”)  
- **Preferred Stocks** (unique class names; extractor not yet implemented)  
- **Priority Order** (e.g. “First”, “Second”; extractor not yet implemented)  
- **Liquidation Value** (numeric; extractor not yet implemented)

**Inputs:**  
A directory of plain-text `.txt` files.

**Outputs:**  
A `pandas.DataFrame` with one row per document: the source file name plus the six extracted fields.

---

## Table of Contents

1. [Environment Setup](#setup)  
2. [Constants & Configuration](#constants)  
3. [Utility Functions](#utils)  
4. [Extraction Functions](#extraction)  
5. [File Processing Pipeline](#processing)  
6. [Batch Execution & Data Assembly](#batch)  
7. [Saving Results](#save)  
8. [Next Steps & Extensions](#next)

In [2]:
# 1. Environment Setup

import os
from pathlib import Path
import re
from datetime import datetime

import spacy               # for Named Entity Recognition
import pandas as pd        # for DataFrame operations

# Configure pandas to display all rows/columns when inspecting DataFrames
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Load SpaCy English model for NER
nlp = spacy.load('en_core_web_sm')

In [3]:
# 2. Constants & Configuration

# Path to the folder containing converted .txt files
TEXT_DIR = Path('/Users/alexchen/Downloads/Projects/vc-research/Batch1_text_readable')

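# Regex pattern for company names ending with "Inc."
# The name may start with a digit (e.g. "3VR Security, Inc."). A trailing \b would
# only match when a word character follows the period, so a lookahead is used instead.
COMPANY_NAME_REGEX = r'\b[A-Z0-9][A-Za-z0-9&\-,\s]+Inc\.(?!\w)'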

# Date extraction patterns covering multiple formats
DATE_PATTERNS = [
    # e.g. "August 10, 2020"
    r'\b(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},\s+\d{4}\b',
    # e.g. "08/10/2020" or "8/10/2020"
    r'\b\d{1,2}/\d{1,2}/\d{4}\b',
    # e.g. "08-10-2020"
    r'\b\d{1,2}-\d{1,2}-\d{4}\b',
    # e.g. "2020-08-10"
    r'\b\d{4}-\d{2}-\d{2}\b'
]

# Phrases that often precede a date in legal documents
DATE_CONTEXT_PHRASES = [
    'Filed on',
    'Dated',
    'Effective as of',
    'Executed on',
    'Signed this'
]

# Known certificate types, sorted longest-first to avoid partial matches
CERTIFICATE_TYPES = sorted([
    "Certificate of Incorporation",
    "Restated Certificate of Incorporation",
    "Certificate of Amendment",
    "Certificate of Merger",
    "Certificate of Conversion",
    "Certificate of Cancellation",
    "Amended and Restated Certificate of Incorporation",
    "Articles of Incorporation",
    "Amended and Restated Articles of Incorporation",
    "Certificate of Correction"
], key=len, reverse=True)
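
The longest-first sort matters because shorter type names are substrings of longer ones. The sketch below illustrates this with a two-item list and a made-up sample heading; `first_match` is a hypothetical helper written for this illustration, not a function from the notebook.

```python
import re

# Illustrative subset of certificate types; the sample heading is made up.
types = [
    "Certificate of Incorporation",
    "Amended and Restated Certificate of Incorporation",
]
sample = "AMENDED AND RESTATED CERTIFICATE OF INCORPORATION OF EXAMPLE CO."


def first_match(candidates, text):
    """Return the first candidate found in text (case-insensitive), else None."""
    for cand in candidates:
        if re.search(rf'\b{re.escape(cand)}\b', text, flags=re.IGNORECASE):
            return cand
    return None


# Checking the short name first reports the wrong, less specific type.
unsorted_hit = first_match(types, sample)
# Sorting longest-first lets the more specific name win.
sorted_hit = first_match(sorted(types, key=len, reverse=True), sample)

print(unsorted_hit)  # Certificate of Incorporation
print(sorted_hit)    # Amended and Restated Certificate of Incorporation
```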

In [4]:
# 3. Utility Functions

def format_date(date_str: str) -> pd.Timestamp:
    """
    Parse a date string into a pandas Timestamp.
    Supports multiple date formats; returns pd.NaT on failure.
    """
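    # Dash-separated dates are read month-first, like the slash format
    # ("08-10-2020" is August 10, matching the DATE_PATTERNS examples).
    for fmt in ('%B %d, %Y', '%m/%d/%Y', '%m-%d-%Y', '%Y-%m-%d'):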
        try:
            # first convert via datetime, then wrap in pandas Timestamp
            return pd.to_datetime(datetime.strptime(date_str, fmt))
        except (ValueError, TypeError):
            continue
    return pd.NaT
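
Dash-separated dates such as `08-10-2020` can be read month-first or day-first, and the format string used in `format_date` decides which reading wins. The standalone sketch below shows the difference; `parse_with` is a hypothetical helper that uses only the standard library and is not part of the notebook.

```python
from datetime import datetime


def parse_with(date_str, fmt):
    """Parse date_str with a single strptime format; return None if it does not fit."""
    try:
        return datetime.strptime(date_str, fmt).date()
    except ValueError:
        return None


ambiguous = "08-10-2020"
month_first = parse_with(ambiguous, "%m-%d-%Y")  # read as August 10, 2020
day_first = parse_with(ambiguous, "%d-%m-%Y")    # read as 8 October 2020

print(month_first)  # 2020-08-10
print(day_first)    # 2020-10-08

# A day above 12 is unambiguous: only one of the two formats accepts it.
rejected = parse_with("25-12-2020", "%m-%d-%Y")
accepted = parse_with("25-12-2020", "%d-%m-%Y")
print(rejected)  # None
print(accepted)  # 2020-12-25
```

If the corpus mixes both conventions, a single format tuple cannot resolve the ambiguity, and such dates may need manual review.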

In [5]:
# 4. Extraction Functions

def extract_company_name(text: str) -> str:
    """
    Extract the first organization entity ending with 'Inc.' via SpaCy NER.
    Fallback to regex if NER yields nothing. Returns 'N/A' if all fail.
    """
    # 1) Try SpaCy NER
    for ent in nlp(text).ents:
        name = ent.text.strip()
        if ent.label_ == 'ORG' and name.endswith('Inc.'):
            return name

    # 2) Fallback to simple regex search
    match = re.search(COMPANY_NAME_REGEX, text)
    if match:
        return match.group().strip()

    return 'N/A'


def extract_date(text: str) -> pd.Timestamp:
    """
    Locate and parse the document date using context phrases and regex patterns.
    Returns pandas Timestamp or pd.NaT if none found.
    """
    # 1) Look for dates immediately following context phrases
    for phrase in DATE_CONTEXT_PHRASES:
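        # e.g. "Filed on August 10, 2020". Capture the rest of the line; stopping at
        # the first comma would cut "August 10, 2020" down to "August 10".
        pattern = rf'{re.escape(phrase)}\s+([^\n;]*)'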
        m = re.search(pattern, text, flags=re.IGNORECASE)
        if m:
            date_segment = m.group(1).strip()
            for pat in DATE_PATTERNS:
                dm = re.search(pat, date_segment)
                if dm:
                    return format_date(dm.group())

    # 2) Fallback: search entire text
    for pat in DATE_PATTERNS:
        m = re.search(pat, text)
        if m:
            return format_date(m.group())

    return pd.NaT


def extract_certificate_type(text: str) -> str:
    """
    Identify certificate type by matching known types (longest-first).
    Returns matched type or 'N/A'.
    """
    for cert in CERTIFICATE_TYPES:
        if re.search(rf'\b{re.escape(cert)}\b', text, flags=re.IGNORECASE):
            return cert
    return 'N/A'


# (Placeholders for remaining extractors – implement as needed)
def extract_preferred_stocks(text: str):
    """TODO: Extract unique preferred stock class names."""
    return 'N/A'


def extract_priority_order(text: str):
    """TODO: Extract priority order (e.g. 'First', 'Second')."""
    return 'N/A'


def extract_liquidation_value(text: str):
    """TODO: Extract numeric liquidation value."""
    return 'N/A'
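
One possible starting point for the `extract_preferred_stocks` placeholder is a regex over class names. This is only a sketch: the pattern assumes classes are written like "Series A Preferred Stock" or "Series B-1 Preferred Stock", which real filings may not follow, and the name `sketch_preferred_stocks`, the pattern, and the sample text are all made up for illustration. Unlike the placeholder, it returns a list rather than `'N/A'`.

```python
import re

# Assumption: class names look like "Series A Preferred Stock" or
# "Series B-1 Preferred Stock". The keywords are matched case-insensitively,
# but the class letter must be uppercase so "series of Preferred" is skipped.
PREFERRED_RE = re.compile(
    r'\b(?i:Series)\s+([A-Z]{1,3}(?:-\d+)?)\s+(?i:Preferred)(?:\s+(?i:Stock))?\b'
)


def sketch_preferred_stocks(text: str) -> list:
    """Return unique 'Series X Preferred' labels in order of first appearance."""
    seen = []
    for m in PREFERRED_RE.finditer(text):
        label = f"Series {m.group(1)} Preferred"
        if label not in seen:
            seen.append(label)
    return seen


# Made-up sample text for illustration only.
sample = (
    "The Series A Preferred Stock and the Series B-1 Preferred Stock rank "
    "senior to Common Stock. Holders of SERIES A PREFERRED STOCK are paid "
    "before any series of Preferred Stock created later."
)
classes = sketch_preferred_stocks(sample)
print(classes)  # ['Series A Preferred', 'Series B-1 Preferred']
```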

In [6]:
# 5. File Processing Pipeline

def process_text_file(file_path: Path) -> dict:
    """
    Open a .txt file, run all extractors, and return a result dict.
    Returns None on error.
    """
    try:
        text = file_path.read_text(encoding='utf-8')

        return {
            "Company Name":     extract_company_name(text),
            "Date":             extract_date(text),
            "File Name":        file_path.name,
            "Document Type":    extract_certificate_type(text),
            "Preferred Stocks": extract_preferred_stocks(text),
            "Priority Order":   extract_priority_order(text),
            "Liquidation Value":extract_liquidation_value(text)
        }

    except Exception as e:
        print(f"⚠️ Error processing {file_path.name}: {e}")
        return None


def process_all_files(directory: Path) -> pd.DataFrame:
    """
    Iterate over every .txt in the directory, process it,
    and assemble a clean DataFrame. Normalizes dates to YYYY-MM-DD or 'N/A'.
    """
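    records = []
    # Sort so files are processed in a reproducible order
    for txt_file in sorted(directory.glob('*.txt')):
        res = process_text_file(txt_file)
        if res:
            records.append(res)

    # Fail early with a clear message; an empty DataFrame has no "Date" column below
    if not records:
        raise FileNotFoundError(f"No processable .txt files found in {directory}")

    df = pd.DataFrame(records)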

    # Normalize the Date column to string format; fill missing
    df["Date"] = (
        pd.to_datetime(df["Date"], errors="coerce")
          .dt.strftime('%Y-%m-%d')
          .fillna("N/A")
    )
    return df

In [7]:
# 6. Batch Execution & Data Assembly

# Run the pipeline over all text files
final_df = process_all_files(TEXT_DIR)

# Organize by Company Name → Date for readability
final_df.set_index(["Company Name", "Date"], inplace=True)
final_df.sort_index(inplace=True)

# Display the resulting DataFrame
final_df


| Company Name | Date | File Name | Document Type | Preferred Stocks | Priority Order | Liquidation Value |
|---|---|---|---|---|---|---|
| 3POINTS, Inc. | | 24_2004-12-01_Certificates of Incorporation.txt | Certificate of Incorporation | | | |
| 3VR Security, Inc. | 2006-08-23 | 27_2006-08-23_Certificates of Incorporation.txt | Amended and Restated Articles of Incorporation | | | |
| 3VR Security, Inc. | 2013-09-19 | 27_2013-09-26_Certificates of Incorporation.txt | Amended and Restated Articles of Incorporation | | | |
| 3VR Security, Inc. | | 27_2005-12-22_Certificates of Incorporation.txt | Amended and Restated Certificate of Incorporation | | | |
| 3VR Security, Inc. | | 27_2006-08-30_Certificates of Incorporation.txt | Amended and Restated Certificate of Incorporation | | | |
| 3VR Security, Inc. | | 27_2009-05-15_Certificates of Incorporation.txt | Amended and Restated Articles of Incorporation | | | |
| 3VR Security, Inc. | | 27_2010-09-16_Certificates of Incorporation.txt | Articles of Incorporation | | | |
| 3VR Security, Inc. | | 27_2010-10-10_Certificates of Incorporation.txt | Amended and Restated Articles of Incorporation | | | |
| 3VR Security, Inc. | | 27_2008-07-31_Certificates of Incorporation.txt | Amended and Restated Articles of Incorporation | | | |
| 3jam, Inc. | 2006-04-03 | 21_2006-04-21_Certificates of Incorporation.txt | Amended and Restated Certificate of Incorporation | | | |

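The table of contents lists a "Saving Results" step, but no cell for it appears here. The sketch below shows one way it might look, assuming CSV output. The file name `extracted_fields.csv` is hypothetical, and the small `demo_df` with made-up rows stands in for `final_df`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for final_df: two made-up rows indexed like the notebook's result.
demo_df = (
    pd.DataFrame(
        {
            "Company Name": ["Example One, Inc.", "Example Two, Inc."],
            "Date": ["2020-08-10", "N/A"],
            "Document Type": ["Certificate of Incorporation", "N/A"],
        }
    )
    .set_index(["Company Name", "Date"])
    .sort_index()
)

# Hypothetical output location; a temporary directory keeps the sketch side-effect free.
out_path = Path(tempfile.mkdtemp()) / "extracted_fields.csv"

# By default to_csv writes both index levels as ordinary leading columns.
demo_df.to_csv(out_path)

# keep_default_na=False stops pandas from turning the literal "N/A" into NaN on reload.
reloaded = pd.read_csv(out_path, keep_default_na=False)
print(reloaded.columns.tolist())
print(reloaded["Date"].tolist())
```

Without `keep_default_na=False`, `read_csv` treats the string "N/A" as missing, so the placeholder values would come back as `NaN`.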

## Next Steps & Extensions

- **Error Logging**: capture failed files in a log for manual review.  
- **Unit Tests**: write pytest cases for each extractor.  
- **Parallelism**: speed up with `concurrent.futures`.  
- **Advanced NLP**: fine-tune an LLM to handle edge-case phrasing.  
- **Additional Fields**: complete `extract_preferred_stocks`, `extract_priority_order`, and `extract_liquidation_value`.
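
The error-logging and parallelism items above can be combined in one pattern. The sketch below is a rough illustration, not the notebook's implementation: `process_one` is a hypothetical stand-in for `process_text_file` that lets exceptions propagate instead of returning `None`, and the logger name and worker count are arbitrary choices.

```python
import logging
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("extraction")  # hypothetical logger name


def process_one(path: Path) -> dict:
    """Stand-in for process_text_file: read the file and return a small record."""
    text = path.read_text(encoding="utf-8")  # raises on undecodable files
    return {"File Name": path.name, "Characters": len(text)}


def process_many(paths, max_workers=4):
    """Run process_one concurrently; return (records, failed_file_names)."""
    records, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, p): p for p in paths}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                records.append(fut.result())
            except Exception as exc:
                # Log the failure and remember the file for manual review.
                log.warning("Failed on %s: %s", path.name, exc)
                failed.append(path.name)
    return records, failed


# Demo on a temporary directory: one valid file and one with invalid UTF-8 bytes.
tmp = Path(tempfile.mkdtemp())
(tmp / "good.txt").write_text("Example, Inc.", encoding="utf-8")
(tmp / "bad.txt").write_bytes(b"\xff\xfe\xfa")
records, failed = process_many(sorted(tmp.glob("*.txt")))
print(len(records), failed)
```

Threads mainly help with file I/O. The SpaCy NER step is CPU-bound, so batching documents through `nlp.pipe` may be a better fit for that part of the work.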