# Full Pipeline


This notebook demonstrates the complete production pipeline for the patent novelty assessment system.

**Pipeline Steps:**
1. Load models and data
2. Process user input (patent title, abstract, claims)
3. Retrieve similar patents (FAISS + Online search)
4. Extract features for each patent pair
5. Score similarity using PyTorch Neural Network
6. Generate novelty assessment
7. Create LLM explanation

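**Usage:** Run all cells in order to see the full pipeline in action. Online search reads `SERPAPI_KEY` from the environment, and the explanation step uses Phi-3 through Ollama.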


In [11]:
import sys
from pathlib import Path
import numpy as np
import json
import os
from typing import Dict, List, Optional

# Resolve the project root: step up one level when run from notebooks/,
# otherwise assume the current directory is already the project root.
current_path = Path().resolve()
project_root = current_path.parent if current_path.name == 'notebooks' else current_path

sys.path.insert(0, str(project_root))
os.chdir(project_root)
print(f"Working directory: {os.getcwd()}")

# Get SerpAPI key from environment
serpapi_key = os.getenv('SERPAPI_KEY', '')
if serpapi_key:
    print(f"SerpAPI key configured ({len(serpapi_key)} characters)")
else:
    print("SerpAPI key not found in environment")

from src.app.patent_analyzer import PatentAnalyzer
from src.app.pytorch_classifier import PyTorchPatentClassifier
from src.features.feature_extract import FeatureExtractor
from src.embeddings.patent_sberta import PatentEmbedder

print("Imports successful")

Working directory: /Users/abhinavmeduri/projects/CS372-final-project
SerpAPI key configured (64 characters)
Imports successful


## Step 1: Initialize the Patent Analyzer

Load all required models and data.


In [12]:
# Get SerpAPI key from environment
serpapi_key = os.getenv('SERPAPI_KEY', '')
analyzer = PatentAnalyzer(
    use_online_search=True,
    use_llm_keywords=True,
    serpapi_key=serpapi_key
)

analyzer.load()

print("Patent Analyzer initialized")
print("Embedding model: PatentSBERTa")
print("Classification model: PyTorch Neural Network")
print("LLM explainer: Phi-3 (Ollama)")
print("Online search: Enabled (SerpAPI)")
print(f"Patent database: {len(analyzer.patents)} patents")

INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: mps
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: AI-Growth-Lab/PatentSBERTa


  Loading Patent Analyzer resources...
  Loading embeddings...
  Loaded 200,000 embeddings
  Loading PatentSBERTa model...


INFO:data.api.online_search:[OK] Using SerpAPI for Google Patents search (millions of patents)
INFO:data.api.online_search:  API key configured: 729754ca...
INFO:src.app.pytorch_classifier:Initialized (device: mps)
INFO:src.app.pytorch_classifier:Loaded from models/pytorch_nn


  PatentSBERTa model loaded
  Initializing Phi-3 explainer...
Phi-3 Ollama Explainer initialized
  Model: phi3
  KV caching: Enabled (built-in)
  Metal acceleration: Enabled
  Phi-3 explainer ready
  Initializing LLM keyword extractor...
  LLM Keyword Extractor ready
  Initializing online search...
  Online Patent Search ready (SerpAPI)
  Loading PyTorch model...
  PyTorch model and feature extractor ready
  Ready! (Components loaded on-demand)
Patent Analyzer initialized
Embedding model: PatentSBERTa
Classification model: PyTorch Neural Network
LLM explainer: Phi-3 (Ollama)
Online search: Enabled (SerpAPI)
Patent database: 0 patents


## Step 2: Define Example Patent

Input a patent application to assess for novelty.
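
The analyzer takes a single text string, so the title, abstract, and claims are joined into one block before analysis. A minimal sketch of such a helper, assuming the labelled layout used in the later cells of this notebook; `format_patent_input` is a hypothetical name and is not part of the project code.

```python
from typing import Dict, List, Union


def format_patent_input(patent: Dict[str, Union[str, List[str]]]) -> str:
    """Join title, abstract, and claims into one labelled text block."""
    # Claims are kept one per line, in the order given.
    claims = "\n".join(patent.get("claims", []))
    return (
        f"Title: {patent.get('title', '')}\n\n"
        f"Abstract: {patent.get('abstract', '')}\n\n"
        f"Claims:\n{claims}\n"
    )
```

With a helper like this, every cell could build its input the same way, for example `analyzer.analyze(format_patent_input(example_patent))`.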

In [None]:
example_patent = {
    "title": "Real-time video recommendation system using sequence-aware user embeddings",
    "abstract": "A recommendation system for a video streaming platform that generates personalized suggestions by learning embeddings from users' watch histories. The system updates embeddings in real time and combines short-term and long-term viewing patterns to improve recommendation accuracy.",
    "claims": [
        "1. A video recommendation system comprising: a watch history tracker configured to log sequences of videos watched by a user; an embedding generator configured to produce low-dimensional embeddings representing a user's viewing preferences based on the sequence of watched videos; a model updater configured to adjust embeddings and predictive model parameters in real time as new videos are watched; and a recommendation engine configured to rank and suggest videos by combining user embeddings with video metadata features such as genre, duration, and popularity.",
        "2. The system of claim 1, wherein the embedding generator uses a recurrent neural network or transformer-based model to capture temporal patterns in user watch behavior.",
        "3. The system of claim 1, further comprising a short-term preference module that emphasizes recently watched videos to adjust immediate recommendations.",
        "4. The system of claim 1, wherein the recommendation engine filters suggested videos to prioritize content available in the user's region and compatible with their device type."
    ]
}

## Step 3: Run Novelty Assessment

Execute the complete pipeline.
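
The run logs report a count of unique patents across all search terms, which implies that per-term results are merged and de-duplicated. The sketch below shows one way that merge could work, assuming each hit is a dict carrying a `patent_id` key; `merge_unique` is a hypothetical helper, not the project's actual implementation.

```python
from typing import Dict, List


def merge_unique(results_by_term: Dict[str, List[dict]]) -> List[dict]:
    """Flatten per-term result lists, keeping the first hit for each patent_id."""
    seen = set()
    merged = []
    for hits in results_by_term.values():
        for hit in hits:
            pid = hit.get("patent_id")
            # Skip hits without an ID and hits already collected from another term.
            if pid is None or pid in seen:
                continue
            seen.add(pid)
            merged.append(hit)
    return merged
```

Keeping the first occurrence preserves the order in which the search terms were issued, so earlier terms take priority when two terms return the same patent.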


In [4]:
input_text = f"{example_patent['title']}\n\n{example_patent['abstract']}\n\nClaims:\n" + "\n".join(example_patent['claims'])

result = analyzer.analyze(input_text)


INFO:data.api.online_search:Generated 5 search terms via LLM


Generated 5 search terms: ['real-time recommendation system AND user watch history', 'sequence learning for video suggestions OR embeddings based recommendations', 'video streaming platform personalization AND dynamic user embeddings']...


INFO:data.api.online_search:Searching 5 terms with max_per_term=10
INFO:data.api.online_search:Using SerpAPI: True, API key present: True
INFO:data.api.online_search:[1/5] Searching term: 'real-time recommendation system AND user watch history...'
INFO:data.api.online_search:Searching SerpAPI with query: real-time recommendation system AND user watch history...
INFO:data.api.online_search:API key present: True (first 8 chars: 729754ca...)
INFO:data.api.online_search:Making SerpAPI request with params: engine=google_patents, q=real-time recommendation system AND user watch his..., num=10 (requested 10)


Calling online_searcher.search_multiple_terms with 5 terms
Online searcher type: <class 'data.api.online_search.GooglePatentsSearch'>
Online searcher use_serpapi: True


INFO:data.api.online_search:SerpAPI response keys: ['search_metadata', 'search_parameters', 'search_information', 'organic_results', 'summary', 'pagination', 'serpapi_pagination']
INFO:data.api.online_search:Found 10 organic results
INFO:data.api.online_search:SerpAPI found 10 patents for: real-time recommendation system AND user watch his...
INFO:data.api.online_search:  'real-time recommendation system AND user watch his...': 10 patents found
INFO:data.api.online_search:    First result: US10362940B2 - Personal emergency response (PER) system...
INFO:data.api.online_search:[2/5] Searching term: 'sequence learning for video suggestions OR embeddings based recommendations...'
INFO:data.api.online_search:Searching SerpAPI with query: sequence learning for video suggestions OR embeddings based recommendations...
INFO:data.api.online_search:API key present: True (first 8 chars: 729754ca...)
INFO:data.api.online_search:Making SerpAPI request with params: engine=google_patents, q=sequence l

search_multiple_terms returned 5 terms with results
Term 'real-time recommendation system AND user watch his...': 10 results (type: <class 'list'>)
Term 'sequence learning for video suggestions OR embeddi...': 10 results (type: <class 'list'>)
Term 'video streaming platform personalization AND dynam...': 10 results (type: <class 'list'>)
Term '(realtime) AND (sensor error correction) replaced ...': 10 results (type: <class 'list'>)
Term 'long short-term memory networks for recommendation...': 10 results (type: <class 'list'>)
Online search found 50 unique patents across 5 terms
Ranking-based assessment: scored 20 patents, mean similarity=0.740, novelty=0.260
Generating explanation via Ollama (with KV caching)...
[OK] Generated 650 tokens in 46.4s (14.0 tok/s)


## Step 4: Display Results

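Show the novelty assessment results. This cell re-runs the analysis on a labelled version of the input, so the score can differ from Step 3: the LLM generates different search terms on each run, which changes the set of retrieved patents.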


In [10]:
input_text = f"""Title: {example_patent["title"]}

Abstract: {example_patent["abstract"]}

Claims:
{chr(10).join(example_patent["claims"])}
"""

result = analyzer.analyze(input_text)

# Display results in frontend-like format
print("Patent Novelty Assessment Results")

print(f"\nNovelty Score: {result.novelty_score:.3f}")
if result.novelty_score > 0.5:
    print("   Likely Novel")
else:
    print("   Potential Prior Art Found")

if result.similar_patents:
    print(f"\nFound {len(result.similar_patents)} Similar Patents:")
    for idx, patent in enumerate(result.similar_patents[:5], 1):
        print(f"\n   {idx}. Patent ID: {patent.get('patent_id', 'N/A')}")
        print(f"      Model Similarity: {patent.get('model_similarity', 0):.3f}")
        print(f"      Model Novelty: {patent.get('model_novelty', 0):.3f}")
        if 'title' in patent:
            print(f"      Title: {patent['title'][:60]}...")
        if 'year' in patent:
            print(f"      Year: {patent['year']}")

if result.explanation:
    print(f"\nExplanation:")
    explanation_lines = result.explanation.split('\n')
    for line in explanation_lines[:5]:  # Show first 5 lines
        print(f"   {line}")
    if len(explanation_lines) > 5:
        print(f"   ... ({len(explanation_lines) - 5} more lines)")



INFO:data.api.online_search:Generated 5 search terms via LLM


Generated 5 search terms: ['recommendation system AND personalized suggestions OR user watch history embeddings', '(real-time updates) AND (embedding learning) OR short/long term viewing patterns', 'video streaming platform recommendation algorithms AND pattern recognition']...


INFO:data.api.online_search:Searching 5 terms with max_per_term=10
INFO:data.api.online_search:Using SerpAPI: True, API key present: True
INFO:data.api.online_search:[1/5] Searching term: 'recommendation system AND personalized suggestions OR user watch history embeddings...'
INFO:data.api.online_search:Searching SerpAPI with query: recommendation system AND personalized suggestions OR user watch history embeddings...
INFO:data.api.online_search:API key present: True (first 8 chars: 729754ca...)
INFO:data.api.online_search:Making SerpAPI request with params: engine=google_patents, q=recommendation system AND personalized suggestions..., num=10 (requested 10)


Calling online_searcher.search_multiple_terms with 5 terms
Online searcher type: <class 'data.api.online_search.GooglePatentsSearch'>
Online searcher use_serpapi: True


INFO:data.api.online_search:SerpAPI response keys: ['search_metadata', 'search_parameters', 'search_information', 'organic_results', 'summary', 'pagination', 'serpapi_pagination']
INFO:data.api.online_search:Found 10 organic results
INFO:data.api.online_search:SerpAPI found 10 patents for: recommendation system AND personalized suggestions...
INFO:data.api.online_search:  'recommendation system AND personalized suggestions...': 10 patents found
INFO:data.api.online_search:    First result: US11122333B2 - User feature generation method and apparatus, devi...
INFO:data.api.online_search:[2/5] Searching term: '(real-time updates) AND (embedding learning) OR short/long term viewing patterns...'
INFO:data.api.online_search:Searching SerpAPI with query: (real-time updates) AND (embedding learning) OR short/long term viewing patterns...
INFO:data.api.online_search:API key present: True (first 8 chars: 729754ca...)
INFO:data.api.online_search:Making SerpAPI request with params: engine=google_p

search_multiple_terms returned 5 terms with results
Term 'recommendation system AND personalized suggestions...': 10 results (type: <class 'list'>)
Term '(real-time updates) AND (embedding learning) OR sh...': 10 results (type: <class 'list'>)
Term 'video streaming platform recommendation algorithms...': 10 results (type: <class 'list'>)
Term 'sensor error correction techniques in real time sy...': 10 results (type: <class 'list'>)
Term '(personalized recommendations) AND (embedding gene...': 10 results (type: <class 'list'>)
Online search found 49 unique patents across 5 terms
Ranking-based assessment: scored 20 patents, mean similarity=0.672, novelty=0.328
Generating explanation via Ollama (with KV caching)...
[OK] Generated 819 tokens in 34.0s (24.1 tok/s)
Patent Novelty Assessment Results

Novelty Score: 0.328
   Potential Prior Art Found

Found 59 Similar Patents:

   1. Patent ID: 12288236
      Model Similarity: 0.000
      Model Novelty: 1.000
      Title: This invention relate

## Step 5: Pipeline Breakdown

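Re-run the pipeline and report the result in more detail: the rank percentile, a ranked list of similar patents, and a longer excerpt of the explanation.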
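
The cell below labels the score as high, moderate, or low novelty using cut-offs of 0.7 and 0.5. The same mapping can be written as a small function, which is easier to read and test than an inline conditional; `novelty_band` is a hypothetical name introduced here for illustration.

```python
def novelty_band(score: float) -> str:
    """Map a novelty score in [0, 1] to the label used in this notebook."""
    if score > 0.7:
        return "high"
    if score > 0.5:
        return "moderate"
    return "low"
```

Both cut-offs are exclusive, so a score of exactly 0.5 is labelled low and a score of exactly 0.7 is labelled moderate.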


In [None]:
# Format input text
input_text = f"""Title: {example_patent["title"]}

Abstract: {example_patent["abstract"]}

Claims:
{chr(10).join(example_patent["claims"])}
"""

# Run the complete pipeline
result = analyzer.analyze(input_text)

print("Patent Novelty Assessment Results")

novelty_score = result.novelty_score if result.novelty_score is not None else 0.5
rank_percentile = result.search_metadata.get('rank_percentile', None) if result.search_metadata else None
top_k_scored = result.search_metadata.get('top_k_scored', 0) if result.search_metadata else 0

print(f"\nNovelty Score: {novelty_score:.3f}")
if rank_percentile is not None:
    print(f"Rank Percentile: {rank_percentile:.1f}%")
    print(f"Interpretation: Ranks in top {rank_percentile:.1f}% most novel among {top_k_scored} analyzed patents")
else:
    print(f"Interpretation: Continuous score (0.0 = Not Novel, 1.0 = Highly Novel)")

print(f"\nScale: 0.0 = Not Novel (high similarity) | 1.0 = Highly Novel (low similarity)")
print(f"Current Score: {novelty_score:.3f} indicates {'high' if novelty_score > 0.7 else 'moderate' if novelty_score > 0.5 else 'low'} novelty")

if result.similar_patents:
    print(f"\nTop {min(10, len(result.similar_patents))} Ranked Similar Patents:")
    
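    # Most similar first: sort by model similarity, falling back to the
    # embedding similarity when the model score is missing or zero.
    sorted_patents = sorted(
        result.similar_patents[:10],
        key=lambda p: p.get('model_similarity') or p.get('similarity', 0),
        reverse=True
    )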
    
    for idx, patent in enumerate(sorted_patents, 1):
        rank = patent.get('rank', idx)
        # Treat a missing or zero model score as "not scored" and fall back to
        # the embedding similarity, so both printed values come from one source.
        model_sim = patent.get('model_similarity') or patent.get('similarity', 0)
        model_novelty = 1 - model_sim
        
        print(f"\n#{rank} - Patent ID: {patent.get('patent_id', 'N/A')}")
        print(f"   Model Similarity: {model_sim:.3f} | Model Novelty: {model_novelty:.3f}")
        if 'title' in patent:
            print(f"   Title: {patent['title'][:70]}...")
        if 'year' in patent:
            print(f"   Year: {patent['year']}")
        if 'source' in patent:
            print(f"   Source: {patent['source']}")

if result.explanation:
    print(f"\n\nDetailed Explanation:")
    explanation_lines = result.explanation.split('\n')
    for line in explanation_lines[:8]:
        print(f"   {line}")
    if len(explanation_lines) > 8:
        print(f"   ... ({len(explanation_lines) - 8} more lines)")

INFO:data.api.online_search:Generated 5 search terms via LLM


Generated 5 search terms: ['recommendation system AND video streaming platform OR personalized suggestions', '(embeddings learning) AND (short-term patterns) AND (long-term viewing habits)', 'real-time updates AND sensor error correction in recommendation systems']...


INFO:data.api.online_search:Searching 5 terms with max_per_term=10
INFO:data.api.online_search:Using SerpAPI: True, API key present: True
INFO:data.api.online_search:[1/5] Searching term: 'recommendation system AND video streaming platform OR personalized suggestions...'
INFO:data.api.online_search:Searching SerpAPI with query: recommendation system AND video streaming platform OR personalized suggestions...
INFO:data.api.online_search:API key present: True (first 8 chars: 6e0db2ea...)
INFO:data.api.online_search:Making SerpAPI request with params: engine=google_patents, q=recommendation system AND video streaming platform..., num=10 (requested 10)
INFO:data.api.online_search:SerpAPI response keys: ['search_metadata', 'search_parameters', 'search_information']
INFO:data.api.online_search:Found 0 organic results
INFO:data.api.online_search:SerpAPI found 0 patents for: recommendation system AND video streaming platform...
INFO:data.api.online_search:  'recommendation system AND video str

Calling online_searcher.search_multiple_terms with 5 terms
Online searcher type: <class 'data.api.online_search.GooglePatentsSearch'>
Online searcher use_serpapi: True


KeyboardInterrupt: 

## Summary

1. **Input Processing**: Parse patent title, abstract, and claims
2. **Retrieval**: Find similar patents using FAISS (local) and optionally online search
3. **Feature Extraction**: Compute 10 features for each patent pair (reduced from 13 via ablation study)
4. **Classification**: Score similarity using PyTorch Neural Network (91.73% accuracy, architecture: [256, 128])
5. **Ranking-Based Assessment**: Score top-K candidates (top 20) and derive novelty from rank distribution
6. **Explanation**: Generate detailed explanation using Phi-3 LLM
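
The logged runs are consistent with a novelty score equal to one minus the mean similarity of the scored candidates (mean similarity 0.740 gives novelty 0.260, and 0.672 gives 0.328). The sketch below shows that arithmetic under the assumption that this is the rule the analyzer applies; `novelty_from_similarities` is a hypothetical name, and the real implementation may weight candidates differently.

```python
from statistics import fmean
from typing import Sequence


def novelty_from_similarities(similarities: Sequence[float]) -> float:
    """Return 1 - mean(similarity) over the scored top-K candidates."""
    if not similarities:
        raise ValueError("need at least one similarity score")
    return 1.0 - fmean(similarities)
```

Under this rule, a single near-duplicate among twenty otherwise unrelated candidates moves the score only slightly, which is one reason to read the ranked list alongside the aggregate score.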

**Key Components:**
- PatentSBERTa embeddings for semantic similarity
- PyTorch Neural Network for classification
- Phi-3 (Ollama) for natural language explanations
- Feature engineering (10 features: cosine similarity, embeddings, text overlap, metadata)
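
Two of the feature families named above, cosine similarity and text overlap, can be illustrated in a few lines. This is a sketch only: it uses plain Python lists in place of PatentSBERTa vectors, uses Jaccard overlap of lowercase tokens as one possible text-overlap measure, and the function names are hypothetical rather than taken from `FeatureExtractor`.

```python
import math
import re
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_overlap(text_a: str, text_b: str) -> float:
    """Share of distinct lowercase alphanumeric tokens that appear in both texts."""
    tokens_a = set(re.findall(r"[a-z0-9]+", text_a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", text_b.lower()))
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)
```

For example, `jaccard_overlap("video recommendation system", "video streaming system")` shares two of four distinct tokens.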

**Output:**
- Novelty score (0-1, continuous scale, higher = more novel)
- Rank percentile (how novel compared to top-K analyzed patents)
- Ranked list of similar prior art patents with model similarity/novelty scores
- Detailed explanation of the assessment
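
The cells above read four attributes from the analysis result: `novelty_score`, `similar_patents`, `explanation`, and `search_metadata`. The sketch below models that shape as a dataclass so the output can be handled without the full analyzer; `AssessmentResult` and `top_similar` are hypothetical names, and the project's real result class may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class AssessmentResult:
    """Stand-in for the object returned by the analyzer (assumed shape)."""
    novelty_score: Optional[float] = None
    similar_patents: List[Dict[str, Any]] = field(default_factory=list)
    explanation: str = ""
    search_metadata: Dict[str, Any] = field(default_factory=dict)


def top_similar(result: AssessmentResult, k: int = 5) -> List[Tuple[str, float]]:
    """Return up to k (patent_id, model_similarity) pairs in stored order."""
    return [
        (p.get("patent_id", "N/A"), p.get("model_similarity", 0.0))
        for p in result.similar_patents[:k]
    ]
```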
