# Named Entity Recognition for Contract Analysis

## Business Context: JPMorgan's COIN System

JPMorgan Chase developed Contract Intelligence (COIN), an NLP system that:
- Reviews and interprets commercial loan agreements
- Processes 12,000+ agreements annually
- Extracts important data points and clauses in seconds
- Saves a reported 360,000+ hours of legal work each year
- Minimizes human error in contract interpretation

In this notebook, we'll build a simplified version of a contract analysis system using Named Entity Recognition (NER) to automatically extract key information from legal documents.

## 1. Setup and Installation

First, let's install and import the necessary libraries. We'll use spaCy, a popular NLP library with built-in NER capabilities.

In [None]:
# Install required libraries (if not already installed)
# !pip install spacy
# !python -m spacy download en_core_web_sm

# Import libraries
import spacy
import pandas as pd
import matplotlib.pyplot as plt
import re
from spacy import displacy
from collections import Counter

# Load spaCy's pre-trained English model
nlp = spacy.load("en_core_web_sm")

## 2. Sample Contract Data

Let's create some sample contract clauses. In a real-world scenario, these would be extracted from actual loan agreements or other legal documents.

In [None]:
# Sample loan agreement clauses
loan_agreement = """
LOAN AGREEMENT

This Loan Agreement (the "Agreement") is made and entered into on March 15, 2023 (the "Effective Date"), by and between:

First National Bank, a financial institution with its principal place of business at 123 Finance Street, New York, NY 10004 (the "Lender"), and

ABC Corporation, a Delaware corporation with its principal place of business at 456 Business Avenue, Chicago, IL 60601 (the "Borrower").

1. LOAN AMOUNT AND TERM
   The Lender agrees to provide the Borrower with a term loan in the amount of $2,500,000.00 (Two Million Five Hundred Thousand Dollars) (the "Loan"). The Loan shall be repaid over a period of 60 months (5 years) from the date of disbursement.

2. INTEREST RATE
   The Loan shall bear interest at a fixed annual rate of 6.75% (six point seven five percent).

3. REPAYMENT SCHEDULE
   The Borrower shall repay the Loan in 60 equal monthly installments of $49,125.87 (Forty-Nine Thousand One Hundred Twenty-Five Dollars and Eighty-Seven Cents), due on the 1st day of each month beginning on May 1, 2023.

4. PREPAYMENT
   The Borrower may prepay the Loan in whole or in part at any time without penalty.

5. DEFAULT
   If the Borrower fails to make any payment within 15 days of the due date, the entire outstanding balance shall become immediately due and payable, and additional interest at a rate of 12% per annum shall accrue on the outstanding balance until paid in full.
"""

# Display the sample contract
print(loan_agreement)

## 3. Basic Named Entity Recognition

Now, let's use spaCy's pre-trained NER model to identify entities in our contract. This is the first step in automatically extracting key information.

In [None]:
# Process the text with spaCy
doc = nlp(loan_agreement)

# Extract all entities
entities = [(ent.text, ent.label_) for ent in doc.ents]

# Display entities in a DataFrame
entities_df = pd.DataFrame(entities, columns=['Text', 'Entity Type'])
entities_df

In [None]:
# Visualize entity recognition
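# Select a smaller section for better visualization.
# Blocks are separated by blank lines: index 0 is the title, 1-3 are the
# preamble and the two party paragraphs, and 4 is "1. LOAN AMOUNT AND TERM".
section = loan_agreement.split('\n\n')[4]  # Loan amount section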
section_doc = nlp(section)

# Display entities in this section visually
displacy.render(section_doc, style="ent", jupyter=True)

### Analyzing Entity Types

Let's examine the distribution of entity types in our contract. This helps us understand what kind of information spaCy automatically identifies.

In [None]:
# Count the frequency of each entity type
entity_counts = entities_df['Entity Type'].value_counts()

# Create a bar chart
plt.figure(figsize=(10, 6))
entity_counts.plot(kind='bar')
plt.title('Entity Types Found in Loan Agreement')
plt.xlabel('Entity Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
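
The label abbreviations on the chart's x-axis (for example `ORG`, `GPE`, or `CARDINAL`) can be cryptic. A small sketch using spaCy's built-in `spacy.explain` helper prints a short description of each. The list of labels below is an illustrative selection, not the actual output of the model on our contract.

```python
import spacy

# Labels commonly produced by spaCy's English models (illustrative selection)
labels = ["ORG", "GPE", "DATE", "MONEY", "PERCENT", "CARDINAL"]

# spacy.explain returns a short description for a known label, or None otherwise
label_descriptions = {label: spacy.explain(label) for label in labels}

for label, description in label_descriptions.items():
    print(f"{label:<10} {description}")
```

In the notebook, `entity_counts.index` could be passed in place of the hard-coded list.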

## 4. Advanced Extraction: Key Contract Terms

Let's extract specific financial and temporal information from our contract using a combination of NER and regular expressions. This is a simplified illustration of the kind of key-term extraction that systems like JPMorgan's COIN perform.

In [None]:
def extract_loan_details(text):
    """Extract key loan details from contract text"""
    
    # Process with spaCy
    doc = nlp(text)
    
    # Initialize dictionary to store extracted information
    loan_details = {
        'loan_amount': None,
        'interest_rate': None,
        'term_months': None,
        'monthly_payment': None,
        'start_date': None,
        'parties': {'lender': None, 'borrower': None}
    }
    
    # Extract loan amount: a dollar figure that follows the phrase "loan ... amount"
    # on the same line ("." does not match newlines, so headings are skipped)
    loan_amount_match = re.search(r'loan.*?amount.*?\$(\d[\d,]*(?:\.\d{2})?)', text, re.IGNORECASE)
    if loan_amount_match:
        loan_details['loan_amount'] = '$' + loan_amount_match.group(1)
    
    # Extract interest rate: a whole or decimal number followed by "%" or "percent"
    # (the first such figure in the text is taken as the loan's rate)
    interest_pattern = r'(\d+(?:\.\d+)?)\s*(?:%|percent)'
    interest_match = re.search(interest_pattern, text, re.IGNORECASE)
    if interest_match:
        loan_details['interest_rate'] = float(interest_match.group(1))
    
    # Extract loan term
    term_pattern = r'(\d+)\s*(?:months|month)|(\d+)\s*(?:years|year)'
    term_match = re.search(term_pattern, text, re.IGNORECASE)
    if term_match:
        if term_match.group(1):  # If months
            loan_details['term_months'] = int(term_match.group(1))
        elif term_match.group(2):  # If years
            loan_details['term_months'] = int(term_match.group(2)) * 12
    
    # Extract monthly payment (kept in the same "$1,234.56" format as the loan amount)
    payment_pattern = r'installments of \$(\d[\d,]*(?:\.\d{2})?)'
    payment_match = re.search(payment_pattern, text, re.IGNORECASE)
    if payment_match:
        loan_details['monthly_payment'] = '$' + payment_match.group(1)
    
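    # Extract start date: the first DATE entity that directly follows "beginning".
    # Only the preceding text is checked, so a date phrase that merely comes
    # before "beginning" (e.g. "the 1st day of each month") is not picked up.
    for ent in doc.ents:
        preceding_text = text[max(0, ent.start_char - 30):ent.start_char].lower()
        if ent.label_ == 'DATE' and 'beginning' in preceding_text:
            loan_details['start_date'] = ent.text
            break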
    
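    # Extract parties (lender and borrower). Each party paragraph starts with the
    # party's name, followed by a comma and a description, and ends with the
    # defined role, e.g. 'First National Bank, a financial institution ... (the "Lender")'.
    # The name is everything before the first comma; (?m) lets ^ match at each line start.
    lender_pattern = r'(?m)^\s*([^,\n]+),[^\n]*\(the "Lender"\)'
    borrower_pattern = r'(?m)^\s*([^,\n]+),[^\n]*\(the "Borrower"\)'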
    
    lender_match = re.search(lender_pattern, text)
    if lender_match:
        loan_details['parties']['lender'] = lender_match.group(1).strip()
        
    borrower_match = re.search(borrower_pattern, text)
    if borrower_match:
        loan_details['parties']['borrower'] = borrower_match.group(1).strip()
    
    return loan_details

# Extract loan details from our sample contract
extracted_details = extract_loan_details(loan_agreement)

# Display the extracted information in a structured format
print("EXTRACTED LOAN DETAILS:")
print(f"Lender: {extracted_details['parties']['lender']}")
print(f"Borrower: {extracted_details['parties']['borrower']}")
print(f"Loan Amount: {extracted_details['loan_amount']}")
print(f"Interest Rate: {extracted_details['interest_rate']}%")
print(f"Term: {extracted_details['term_months']} months")
print(f"Monthly Payment: {extracted_details['monthly_payment']}")
print(f"Start Date: {extracted_details['start_date']}")
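
The extracted values are still strings such as `'$2,500,000.00'` and `'May 1, 2023'`. Before storing them or doing arithmetic, they would typically be converted to numeric and date types. The helpers below are a minimal sketch using only the standard library. `parse_money` and `parse_contract_date` are hypothetical names, and the date format assumes the "Month D, YYYY" style used in our sample contract.

```python
from datetime import date, datetime
from decimal import Decimal


def parse_money(value: str) -> Decimal:
    """Convert a string like '$2,500,000.00' to an exact Decimal."""
    return Decimal(value.replace("$", "").replace(",", "").strip())


def parse_contract_date(value: str) -> date:
    """Convert a string like 'May 1, 2023' to a date (assumes English month names)."""
    return datetime.strptime(value.strip(), "%B %d, %Y").date()


loan_amount = parse_money("$2,500,000.00")
monthly_payment = parse_money("$49,125.87")
start_date = parse_contract_date("May 1, 2023")

# Total of all 60 installments, computed without floating-point rounding
total_repaid = monthly_payment * 60

print(loan_amount, total_repaid, start_date.isoformat())
```

`Decimal` is used rather than `float` so that monetary amounts keep their exact cents.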

## 5. Custom Entity Recognition: Contract-Specific Entities

spaCy's pre-trained model doesn't specifically target all legal or financial concepts we might care about in contracts. In real-world applications like COIN, custom entities would be defined and models would be trained to recognize them.

Let's demonstrate a simple rule-based approach to identify contract-specific entities using spaCy's pattern matcher.

In [None]:
from spacy.matcher import Matcher

# Create a matcher object
matcher = Matcher(nlp.vocab)

# Define patterns for contract-specific terms
# Pattern for "prepayment clause"
prepayment_pattern = [
    {"LOWER": "prepayment"}, 
    {"IS_PUNCT": True, "OP": "?"},
    {"OP": "*", "IS_ALPHA": True}
]

# Pattern for "default clause"
default_pattern = [
    {"LOWER": "default"}, 
    {"IS_PUNCT": True, "OP": "?"},
    {"OP": "*", "IS_ALPHA": True}
]

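# Add patterns to the matcher. greedy="LONGEST" keeps only the longest span
# when the "*" operator would otherwise return several overlapping matches.
matcher.add("PREPAYMENT_CLAUSE", [prepayment_pattern], greedy="LONGEST")
matcher.add("DEFAULT_CLAUSE", [default_pattern], greedy="LONGEST")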

# Process the loan agreement
doc = nlp(loan_agreement)

# Find matches
matches = matcher(doc)

# Extract and display custom entities
print("CUSTOM CONTRACT ENTITIES:")
for match_id, start, end in matches:
    # Get the matched span
    span = doc[start:end]
    print(f"Entity Type: {nlp.vocab.strings[match_id]}, Text: {span.text}")
    
    # Get the paragraph containing this entity
    # Find the beginning of the paragraph
    paragraph_start = loan_agreement[:span.start_char].rfind('\n\n')
    if paragraph_start == -1:  # If not found, start from the beginning
        paragraph_start = 0
    else:
        paragraph_start += 2  # Skip the newline characters
        
    # Find the end of the paragraph
    paragraph_end = loan_agreement[span.end_char:].find('\n\n')
    if paragraph_end == -1:  # If not found, go to the end
        paragraph_end = len(loan_agreement) - span.end_char
    paragraph_end += span.end_char
    
    # Extract the paragraph
    paragraph = loan_agreement[paragraph_start:paragraph_end].strip()
    print(f"Context: {paragraph}\n")
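
The `Matcher` returns spans but does not add them to `doc.ents`. One way to make rule-based matches show up as named entities is spaCy's `EntityRuler` component. The sketch below uses a blank English pipeline so that it runs without a downloaded model. The labels `PARTY_ROLE` and `INTEREST_RATE` and the sample sentence are illustrative assumptions, not part of spaCy or of COIN.

```python
import spacy

# A blank pipeline has a tokenizer but no statistical components
blank_nlp = spacy.blank("en")

# The entity ruler assigns entity labels based on token patterns
ruler = blank_nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Defined party roles (hypothetical label)
    {"label": "PARTY_ROLE", "pattern": [{"LOWER": {"IN": ["lender", "borrower"]}}]},
    # A number followed by a percent sign, e.g. "6.75%" (hypothetical label)
    {"label": "INTEREST_RATE", "pattern": [{"LIKE_NUM": True}, {"ORTH": "%"}]},
])

sample = "The Lender shall charge the Borrower interest at 6.75% per annum."
custom_entities = [(ent.text, ent.label_) for ent in blank_nlp(sample).ents]

for text, label in custom_entities:
    print(f"{label:<15} {text}")
```

In a pipeline that already has a statistical `ner` component, the ruler could be added before it (`before="ner"`) so that the rule-based entities take precedence.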

## 6. Extracting Contract Structure

Another important aspect of contract analysis is understanding the document's structure. Let's identify the main sections of our loan agreement.

In [None]:
def extract_contract_sections(text):
    """Extract the main sections of a contract"""
    
    # Look for numbered, all-caps headings on their own line (e.g., "1. LOAN AMOUNT").
    # [ \t] is used instead of \s so a heading cannot spill onto the next line.
    section_pattern = r'^[ \t]*(\d+\.[ \t]+[A-Z][A-Z \t]*[A-Z])[ \t]*$'
    headings = list(re.finditer(section_pattern, text, re.MULTILINE))
    
    # Extract content for each section
    section_contents = {}
    
    for i, heading in enumerate(headings):
        # The section ends where the next heading starts (or at the end of the text)
        if i < len(headings) - 1:
            section_end = headings[i + 1].start()
        else:
            section_end = len(text)
        
        # Extract the content, including the heading line itself
        content = text[heading.start():section_end].strip()
        
        # Store in our dictionary, keyed by the heading text
        section_contents[heading.group(1)] = content
    
    return section_contents

# Extract sections
contract_sections = extract_contract_sections(loan_agreement)

# Display the structure
print("CONTRACT STRUCTURE:")
for section, content in contract_sections.items():
    print(f"\n=== {section} ===")
    # Print a shortened version of the content
    content_preview = content[:100] + '...' if len(content) > 100 else content
    print(content_preview)
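
Once headings are available, a simple structural check is to compare them against a list of sections a reviewer expects to see. The sketch below is self-contained and uses a shortened, made-up contract. The `REQUIRED_SECTIONS` set is a hypothetical checklist, not a legal standard.

```python
import re

# Hypothetical checklist of sections a reviewer expects in a loan agreement
REQUIRED_SECTIONS = {"LOAN AMOUNT AND TERM", "INTEREST RATE", "DEFAULT", "GOVERNING LAW"}

short_contract = """
1. LOAN AMOUNT AND TERM
   The Lender agrees to provide the Borrower with a term loan.

2. INTEREST RATE
   The Loan shall bear interest at a fixed annual rate of 6.75%.

3. DEFAULT
   Late payments accrue additional interest.
"""

# Numbered, all-caps headings that occupy a whole line
heading_pattern = r'^[ \t]*\d+\.[ \t]+([A-Z][A-Z \t]*[A-Z])[ \t]*$'
found_sections = set(re.findall(heading_pattern, short_contract, re.MULTILINE))

missing_sections = sorted(REQUIRED_SECTIONS - found_sections)
print("Found:", sorted(found_sections))
print("Missing:", missing_sections)
```

A missing section does not necessarily mean the contract is defective, since the clause may appear under a different heading, so a result like this is best treated as a prompt for review.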

## 7. Business Value and Applications

The techniques demonstrated in this notebook have significant business value in the financial and legal sectors:

### Business Benefits
1. **Efficiency**: Automatically extract key information from contracts in seconds vs. hours of manual review
2. **Consistency**: Reduce human variability in contract interpretation
3. **Scalability**: Process thousands of contracts with the same level of scrutiny
4. **Risk Management**: Easily identify problematic clauses or missing information
5. **Knowledge Management**: Create structured databases of contract terms for analysis

### Real-world Applications
- **Due Diligence**: Quickly analyze contracts during mergers and acquisitions
- **Compliance**: Flag contracts that don't meet regulatory requirements
- **Portfolio Management**: Aggregate terms across thousands of agreements
- **Risk Assessment**: Identify unusual terms or conditions that require review
- **Contract Creation**: Auto-generate standard contracts with customized terms

### Implementation Considerations
A full-scale implementation like JPMorgan's COIN would include:
- Custom-trained NER models for financial and legal entities
- Integration with document OCR for scanning paper contracts
- Validation workflows for human review of uncertain extractions
- Database storage of extracted terms for analytics
- Integration with contract lifecycle management systems
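
As one illustration of the validation workflow listed above, extracted fields can be checked for missing or implausible values and routed to a human reviewer. This is a minimal sketch: `flag_for_review`, the field names, and the plausibility thresholds are assumptions chosen for the example, not rules taken from COIN.

```python
def flag_for_review(details: dict) -> list:
    """Return human-readable reasons why an extraction needs manual review."""
    reasons = []

    # Any field the extractor could not fill is sent to a reviewer
    for field in ("loan_amount", "interest_rate", "term_months", "start_date"):
        if details.get(field) is None:
            reasons.append(f"missing value: {field}")

    # Simple plausibility ranges (illustrative thresholds only)
    rate = details.get("interest_rate")
    if rate is not None and not 0 < rate < 30:
        reasons.append(f"implausible interest rate: {rate}%")

    term = details.get("term_months")
    if term is not None and not 1 <= term <= 480:
        reasons.append(f"implausible term: {term} months")

    return reasons


complete = {"loan_amount": "$2,500,000.00", "interest_rate": 6.75,
            "term_months": 60, "start_date": "May 1, 2023"}
suspicious = {"loan_amount": None, "interest_rate": 675.0,
              "term_months": 60, "start_date": "May 1, 2023"}

print(flag_for_review(complete))
print(flag_for_review(suspicious))
```

An empty list means the extraction passed these basic checks. Anything else would be queued for a person to confirm.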

## 8. Learning Challenge

Now it's your turn to apply what you've learned about Named Entity Recognition for contract analysis!

### Challenge: Analyze a Lease Agreement

Below is a sample lease agreement snippet. Your tasks are:

1. Extract the security deposit amount using regex
2. Identify the lease expiration date
3. Create a custom pattern to match the "maintenance and repairs" clause

Use the code below as a starting point:

In [None]:
# Sample lease agreement
lease_agreement = """
RESIDENTIAL LEASE AGREEMENT

This Residential Lease Agreement ("Agreement") is made on April 10, 2023, between Green Property Management ("Landlord") and John Smith ("Tenant").

1. PROPERTY
   Landlord leases to Tenant the residential property located at 789 Oak Street, Apt 3B, Boston, MA 02108 (the "Property").

2. TERM
   The lease term begins on May 1, 2023, and ends on April 30, 2024, unless terminated earlier as provided in this Agreement.

3. RENT
   Tenant shall pay rent in the amount of $2,200.00 per month, due on the 1st day of each month.

4. SECURITY DEPOSIT
   Tenant shall pay a security deposit of $3,300.00 upon signing this Agreement, to be returned within 30 days after Tenant vacates the Property, less any deductions for damages beyond normal wear and tear.

5. MAINTENANCE AND REPAIRS
   Tenant shall keep the Property in a clean and sanitary condition. Landlord shall be responsible for repairs not due to Tenant's negligence or misuse. Tenant must report any needed repairs promptly to Landlord.
"""

# Your code here to extract key information from the lease agreement
# 1. Process with spaCy
# 2. Extract security deposit
# 3. Find lease expiration date
# 4. Create a pattern for the maintenance clause

## Conclusion

In this notebook, we've explored how Named Entity Recognition and rule-based extraction can automatically analyze legal contracts, similar to JPMorgan's COIN system. We've demonstrated:

1. Using pre-trained NER to identify people, organizations, dates, and monetary amounts
2. Extracting specific loan details using a combination of NER and regex
3. Creating custom patterns to identify contract-specific clauses
4. Analyzing document structure to understand the organization of a contract

These techniques form the foundation of automated contract analysis systems that save time, reduce errors, and enable large-scale contract processing that would be impossible to do manually.

While our demonstration used simplified examples, the same principles apply to sophisticated systems processing thousands of complex legal documents. By combining machine learning with domain expertise, organizations can transform unstructured contract text into structured, actionable business intelligence.