# App: AI Patent Advisor
### Dataset: BigPatent (https://huggingface.co/datasets/NortheasternUniversity/big_patent)

### Features:
**Simplified Prior Art Search:** Instead of complex keyword searches, users can describe their invention in natural language. A RAG system retrieves relevant patents from BigPatent based on semantic similarity and provides them as context to an LLM. A prompt like, "Given this invention description [user input] and these similar patents [retrieved patents], summarize the most relevant prior art and potential novelty issues" can generate a concise report. This simplifies the search process and makes it accessible to non-patent experts.

**Competitive Technology Monitoring:** Users define a specific technology area or competitor. The system automatically retrieves newly published patents from BigPatent within that domain. A prompt such as, "Summarize the key innovations disclosed in these recently published patents [retrieved patents] related to [technology area/competitor]" provides a quick overview of competitive activity.

**Patent Claim Analysis & Comparison:** Input two or more patent claims. The system retrieves relevant contextual information from the full patent text within BigPatent. A prompt like, "Compare and contrast the scope of these patent claims [input claims] considering their full patent specifications [retrieved patent text]. Identify key differences and potential areas of overlap" facilitates detailed claim analysis.

**Automated Patent Summary Generation:** Input a full patent text. The system uses a prompt like, "Generate a concise, non-legal summary of this patent [input patent] highlighting the key innovation and potential applications." This quickly generates summaries suitable for business audiences or technical teams, saving time and resources.

**Patent Landscape Overview by CPC Class:** Users specify a CPC classification code. The system retrieves a sample of patents from BigPatent within that class. A prompt like, "Based on these patents [retrieved patents] within CPC class [input code], summarize the current state of the art, key players, and emerging trends" provides a quick overview of a specific technology domain. The structured nature of CPC codes simplifies retrieval and analysis.
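
The prompts quoted above share one pattern: a fixed instruction with slots for the user's input and the retrieved patents. As a minimal sketch of that assembly in plain Python, the helper below fills the prior art prompt. The name `build_prior_art_prompt` and the numbered-excerpt layout are illustrative assumptions, not part of the notebook's code.

```python
PRIOR_ART_TEMPLATE = (
    "Given this invention description:\n{invention}\n\n"
    "and these similar patents:\n{patents}\n\n"
    "summarize the most relevant prior art and potential novelty issues."
)

def build_prior_art_prompt(invention, retrieved_patents):
    """Fill the prior-art prompt with the user's text and retrieved excerpts."""
    # Number each excerpt so the model can refer back to a specific one.
    numbered = "\n".join(
        f"[{i}] {text.strip()}" for i, text in enumerate(retrieved_patents, start=1)
    )
    return PRIOR_ART_TEMPLATE.format(invention=invention.strip(), patents=numbered)

# Hypothetical inputs, for illustration only.
prompt = build_prior_art_prompt(
    "A drone that maps crop moisture with a thermal camera.",
    ["Aerial thermal imaging of farmland.", "Soil moisture sensor network."],
)
print(prompt)
```

The other four features would follow the same shape with a different instruction string and different slots.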

In [None]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "False"

In [None]:
!pip -q install gradio datasets transformers langchain faiss-cpu sentence_transformers langchain-community openai

In [None]:
!pip -q install bert_score
!pip -q install langchain_openai

In [None]:
!pip -q install tqdm

In [None]:
!pip -q install -U langchain

In [None]:
!pip -q install pandas

In [None]:
import os

import gradio as gr
import pandas as pd
from datasets import load_dataset
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferWindowMemory
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Embeddings and vector stores live in langchain_community (installed above)
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI
from tqdm import tqdm

In [None]:
from google.colab import userdata
# Hugging face API Token
HUGGINGFACEHUB_API_TOKEN = userdata.get('HF_KEY')
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN
os.environ["HF_TOKEN"] = HUGGINGFACEHUB_API_TOKEN
# OpenAI API key
OPENAI_API_KEY = userdata.get('OPENAI_KEY')
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

In [None]:
# Load a smaller subset of the 'g' CPC code patents for initial development (100 samples)
dataset = load_dataset("big_patent", "g", trust_remote_code=True, split="train[:100]")
texts = dataset["description"]


In [None]:
# The Hugging Face cache lives under the home directory, not the working directory
! rm -rf ~/.cache/huggingface/hub/datasets--big_patent

In [None]:
#embeddings = SentenceTransformerEmbeddings(model_name='allenai-specter')  # Make sure to install sentence-transformers
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

In [None]:
def create_vectordb(texts, chunk_size=500, chunk_overlap=50):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    docs = []
    for text in tqdm(texts, desc="Creating chunks"):
        chunks = text_splitter.split_text(text)
        docs.extend(chunks)

    # Embed every chunk with the global `embeddings` model and index the vectors in FAISS
    db = FAISS.from_texts(docs, embeddings)
    return db

# Example usage (assuming 'texts' is already defined):
db = create_vectordb(texts)
print(f"FAISS VectorDB created with {len(db.index_to_docstore_id)} vectors.")

Creating chunks: 100%|██████████| 100/100 [00:00<00:00, 205.11it/s]


FAISS VectorDB created with 8843 vectors.
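
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that cuts text into fixed-size windows sharing a fixed number of characters. It is only an illustration: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences, so its chunks will not match these windows exactly. The helper name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

# A synthetic 1200-character string stands in for a patent description.
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
```

Each window starts 450 characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next. That overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.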


In [None]:
# prompt: write python code to save the vectordb. Then load it for use

# FAISS vector stores are persisted with save_local/load_local rather than pickle.
db.save_local("faiss_index")

# load_local unpickles the docstore, so only load index files you created yourself.
loaded_db = FAISS.load_local(
    "faiss_index", embeddings, allow_dangerous_deserialization=True
)

print(f"Loaded FAISS index with {len(loaded_db.index_to_docstore_id)} vectors.")

In [None]:
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

In [None]:
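def get_qa_chain(memory):
  """Build a conversational RAG chain over the global FAISS index `db`."""
  # MMR fetches the 5 nearest chunks, then keeps 2 that balance relevance and diversity.
  retriever = db.as_retriever(search_type="mmr",
      search_kwargs={"k": 2, "fetch_k": 5}
  )
  # The system message must keep the {context} placeholder; without it the
  # retrieved chunks never reach the model.
  sys_prompt = (
      "Answer the question based on the context provided.\n"
      "----------------\n"
      "{context}"
  )
  qa_prompt = ChatPromptTemplate.from_messages([
      SystemMessagePromptTemplate.from_template(sys_prompt),
      HumanMessagePromptTemplate.from_template("{question}"),
  ])
  qa_chain = ConversationalRetrievalChain.from_llm(
        llm=ChatOpenAI(model_name="gpt-4o-mini", temperature=0.2, max_tokens=1000),
        retriever=retriever,
        memory=memory,
        return_source_documents=True,
        combine_docs_chain_kwargs={"prompt": qa_prompt},
    )
  return qa_chain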
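
The retriever uses maximal marginal relevance (MMR) with `k=2` and `fetch_k=5`. The sketch below shows the idea in plain Python under simplifying assumptions: cosine similarity on small hand-made vectors and an equal weighting of relevance and redundancy (`lambda_mult=0.5`). The function names are illustrative, and this is not LangChain's internal implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=2, fetch_k=5, lambda_mult=0.5):
    """Return indices of k documents chosen by maximal marginal relevance."""
    # Step 1: shortlist the fetch_k documents most similar to the query.
    by_relevance = sorted(
        range(len(doc_vecs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True
    )
    candidates = by_relevance[:fetch_k]
    selected = []
    # Step 2: repeatedly add the candidate with the best trade-off between
    # relevance to the query and redundancy with what is already selected.
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical 2-D "embeddings": documents 0 and 1 are near-duplicates.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [1.0, -1.0]]
print(mmr_select(query, docs, k=2, fetch_k=3))
```

With these toy vectors MMR picks documents 0 and 2, skipping the near-duplicate document 1 that a plain top-2 similarity search would return. A larger `fetch_k` gives the diversity step more candidates to choose from at the cost of more similarity computations.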

In [None]:
# prompt: write python code to quantitatively compare human written summary with LLM generated summary using BERTScore

from bert_score import score

def compare_summaries(human_summary, llm_summary):
  """
  Quantitatively compares a human-written summary with an LLM-generated summary using BERTScore.

  Args:
    human_summary: The human-written summary (string).
    llm_summary: The LLM-generated summary (string).

  Returns:
    A dictionary containing the BERTScore precision, recall, and F1 score.
  """
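  # bert_score.score expects candidates first, then references:
  # the LLM summary is the candidate and the human summary is the reference.
  P, R, F1 = score([llm_summary], [human_summary], lang="en", verbose=True)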
  return {"precision": P.mean().item(), "recall": R.mean().item(), "f1": F1.mean().item()}

# Example usage
human_summary = "This is a great summary written by a human."
llm_summary = "This is a good summary generated by an LLM."

results = compare_summaries(human_summary, llm_summary)
results

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


  0%|          | 0/1 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/1 [00:00<?, ?it/s]

done in 0.09 seconds, 10.60 sentences/sec


{'precision': 0.9228496551513672,
 'recall': 0.9183753728866577,
 'f1': 0.9206070303916931}
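
For intuition about what these three numbers mean, the following sketch reproduces the greedy-matching step of BERTScore on hand-made token vectors. It assumes no IDF weighting and no baseline rescaling, and the 2-D vectors are hypothetical stand-ins for the contextual embeddings that `bert_score` obtains from `roberta-large`.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def greedy_match_scores(cand_vecs, ref_vecs):
    """Precision, recall and F1 from greedy token matching."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy example: the candidate has one token the reference lacks.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0]]
p, r, f1 = greedy_match_scores(candidate, reference)
print(p, r, f1)
```

Here the extra candidate token lowers precision while recall stays perfect. Swapping the two arguments swaps precision and recall and leaves F1 unchanged, which is why the argument order passed to `score` matters when the two metrics are reported separately.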

In [None]:
def generate_summary(text):
    llm = ChatOpenAI(temperature=0.2, model_name="gpt-4o-mini", max_tokens=128) # or gpt-4 if available

    prompt_template = """Summarize: {text}
    """
    prompt = PromptTemplate(template=prompt_template, input_variables=["text"])

    # Calling the model object directly is deprecated; invoke() returns an AIMessage.
    summary = llm.invoke(prompt.format(text=text))
    return summary.content

def generate_and_compare_summaries(dataset, num_samples=10):
    """
    Generates summaries for the first `num_samples` patent descriptions with GPT-4o-mini,
    compares each one with the patent's abstract using BERTScore, and returns the
    per-sample scores as a DataFrame.
    """

    results = []
    for i in tqdm(range(num_samples), desc="Generating and comparing summaries"):
        description = dataset["description"][i]
        abstract = dataset["abstract"][i]

        try:
            # Generate the LLM summary of the full description
            llm_summary = generate_summary(description)

            # Score it against the human-written abstract
            metrics = compare_summaries(abstract, llm_summary)
            results.append({"description_id": i, **metrics})
        except Exception as e:
            print(f"Error processing sample {i}: {e}")
            # Record the failure with None so the row still appears in the output
            results.append({
                "description_id": i,
                "precision": None,
                "recall": None,
                "f1": None
            })
    return pd.DataFrame(results)


# Example usage:
summary_results = generate_and_compare_summaries(dataset)

print(summary_results.describe()) # Prints statistics for precision, recall, and F1 score

Generating and comparing summaries:   0%|          | 0/10 [00:00<?, ?it/s]Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


  0%|          | 0/1 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/1 [00:00<?, ?it/s]

Generating and comparing summaries:  10%|█         | 1/10 [00:02<00:21,  2.37s/it]

done in 0.07 seconds, 13.37 sentences/sec


Generating and comparing summaries: 100%|██████████| 10/10 [00:25<00:00,  2.54s/it]

done in 0.07 seconds, 14.66 sentences/sec
       description_id  precision     recall         f1
count        10.00000  10.000000  10.000000  10.000000
mean          4.50000   0.853120   0.850752   0.851839
std           3.02765   0.024358   0.013394   0.017225
min           0.00000   0.809282   0.829802   0.819934
25%           2.25000   0.850576   0.843224   0.852082
50%           4.50000   0.859265   0.855828   0.858666
75%           6.75000   0.863572   0.858161   0.860647
max           9.00000   0.883153   0.871202   0.869998





In [None]:
def get_retrieved_docs_metadata(result):
    # Get metadata of retrieved documents
    retrieved_docs_metadata = ""
    if 'source_documents' in result:
        for doc in result['source_documents']:
            if hasattr(doc, 'metadata'):
                retrieved_docs_metadata += f"Metadata: {doc.metadata}\n"
    return retrieved_docs_metadata
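
Because `create_vectordb` indexes bare strings, each retrieved document carries an empty `metadata` dict, so the function above has nothing useful to print. One possible remedy, sketched here with a hypothetical helper `build_chunk_metadata`, is to record where every chunk came from and pass that list along when building the index.

```python
def build_chunk_metadata(chunks_per_patent):
    """Flatten per-patent chunk lists and record the origin of each chunk."""
    flat_chunks, metadatas = [], []
    for patent_index, chunks in enumerate(chunks_per_patent):
        for chunk_index, chunk in enumerate(chunks):
            flat_chunks.append(chunk)
            metadatas.append({"patent_index": patent_index, "chunk_index": chunk_index})
    return flat_chunks, metadatas

# Two toy "patents": the first splits into two chunks, the second into one.
flat_chunks, metadatas = build_chunk_metadata([["a1", "a2"], ["b1"]])
print(metadatas)

# The parallel lists could then be indexed together, for example:
# db = FAISS.from_texts(flat_chunks, embeddings, metadatas=metadatas)
```

With metadata attached, the "Document Metadata" box in the UI could show which patent in the loaded subset each retrieved chunk belongs to.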

In [None]:
prior_art_memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    return_messages=True,
    k=3,
    output_key='answer'
)

def prior_art_search(query):
    qa_chain = get_qa_chain(prior_art_memory)
    result = qa_chain({"question": query})
    retrieved_docs_metadata = get_retrieved_docs_metadata(result)
    print("result: ", result)
    return (result["answer"], retrieved_docs_metadata)
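
Each feature keeps its own window memory with `k=3`, meaning only the three most recent exchanges are fed back into the chain. The behaviour can be sketched with a bounded deque. `WindowMemory` is a hypothetical stand-in written for illustration, not the LangChain class.

```python
from collections import deque

class WindowMemory:
    """Keep only the last k question/answer exchanges."""

    def __init__(self, k=3):
        # A deque with maxlen drops the oldest entry automatically.
        self.turns = deque(maxlen=k)

    def save(self, question, answer):
        self.turns.append((question, answer))

    def history(self):
        return list(self.turns)

    def clear(self):
        self.turns.clear()

memory = WindowMemory(k=3)
for n in range(1, 6):
    memory.save(f"question {n}", f"answer {n}")
print(memory.history())
```

After five exchanges only exchanges 3 to 5 remain, which bounds the prompt length while still letting follow-up questions refer to recent answers.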

In [None]:
patent_summarization_memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    return_messages=True,
    k=3,
    output_key='answer'
)

def patent_summarization(patent_text):
    qa_chain = get_qa_chain(patent_summarization_memory)
    result = qa_chain({"question": f"Summarize this patent:\n{patent_text}"})
    retrieved_docs_metadata = get_retrieved_docs_metadata(result)
    return (result["answer"], retrieved_docs_metadata)

In [None]:
competitive_monitoring_memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    return_messages=True,
    k=3,
    output_key='answer'
)

def competitive_monitoring(technology_area):
    # Placeholder for filtering by technology area (requires further implementation)
    # In a real application, you would filter the dataset or vector database
    # based on the provided technology area before retrieval.
    qa_chain = get_qa_chain(competitive_monitoring_memory)
    result = qa_chain({"question": f"Summarize recent patents in {technology_area}"})
    retrieved_docs_metadata = get_retrieved_docs_metadata(result)
    return (result["answer"], retrieved_docs_metadata)
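
The filtering step left as a placeholder above could start out as a crude keyword pre-filter applied to the descriptions before they are chunked and indexed. The sketch below is one such assumption-laden version: `filter_by_technology_area`, the four-character minimum keyword length, and the `min_hits` threshold are all illustrative choices, and a production system would more likely filter on structured metadata.

```python
import re

def tokenize(text):
    """Lower-case a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def filter_by_technology_area(texts, technology_area, min_hits=1):
    """Keep texts that mention at least min_hits keywords from the technology area."""
    # Ignore very short words such as "for" or "in".
    keywords = {w for w in tokenize(technology_area) if len(w) > 3}
    return [t for t in texts if len(keywords & tokenize(t)) >= min_hits]

# Hypothetical patent snippets, for illustration only.
patents = [
    "An acoustic sensor array for ultrasound imaging.",
    "A gear train for a bicycle transmission.",
]
print(filter_by_technology_area(patents, "acoustic sensing for medical imaging"))
```

Only the first snippet survives, because it shares the keywords "acoustic" and "imaging" with the query. The filtered list could then be handed to `create_vectordb` in place of the full `texts` list.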

In [None]:
claim_analysis_memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    return_messages=True,
    k=3,
    output_key='answer'
)

def claim_analysis(claim1, claim2):
    qa_chain = get_qa_chain(claim_analysis_memory)
    result = qa_chain(
        {
            "question": f"Compare and contrast these two claims:\nClaim 1: {claim1}\nClaim 2: {claim2}"
        }
    )
    retrieved_docs_metadata = get_retrieved_docs_metadata(result)
    return (result["answer"], retrieved_docs_metadata)

In [None]:
landscape_overview_memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    return_messages=True,
    k=3,
    output_key='answer'
)

def landscape_overview(cpc_code):
    # Placeholder: retrieval is not yet filtered by CPC code. The input is overridden
    # with 'g' because only the 'g' subset of BigPatent is loaded and indexed above.
    cpc_code = 'g'
    qa_chain = get_qa_chain(landscape_overview_memory)
    result = qa_chain({"question": f"Overview of patent landscape for CPC code {cpc_code}"})
    retrieved_docs_metadata = get_retrieved_docs_metadata(result)
    return (result["answer"], retrieved_docs_metadata)
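
BigPatent is organised into one configuration per CPC section letter, and this notebook loads only `"g"`. A first step toward real CPC filtering could therefore be to parse the user's input and map it to the matching configuration. The sketch below assumes the configuration names are the lower-case section letters `a` to `h` plus `y`, and `parse_cpc_code` is a hypothetical helper.

```python
import re

def parse_cpc_code(user_input):
    """Return (cpc_code, bigpatent_config) parsed from free-form input."""
    # Drop an optional "CPC code:" prefix, as used in the example suggestions.
    text = re.sub(r"(?i)^\s*cpc\s+code\s*:\s*", "", user_input).strip().upper()
    # Section letter, optionally followed by a two-digit class and a subclass letter.
    match = re.match(r"([A-HY])(\d{2}[A-Z]?)?\b", text)
    if match is None:
        raise ValueError(f"Not a recognizable CPC code: {user_input!r}")
    return match.group(0), match.group(1).lower()

print(parse_cpc_code("CPC code: G01N"))
print(parse_cpc_code("h04"))
```

The second element of the result is the value that could be passed as the configuration name to `load_dataset`. Narrowing retrieval to a subclass such as `G01N` would still need per-patent classification data from another source.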

In [None]:
def clear_memory():

    prior_art_memory.clear()
    patent_summarization_memory.clear()
    competitive_monitoring_memory.clear()
    claim_analysis_memory.clear()
    landscape_overview_memory.clear()

    return "Memory Cleared !", ""

In [None]:
# Query Examples (replace with your actual examples if needed)
prior_art_suggestions = [
    "Method for measuring the gravitational field of a celestial body",
    "System for generating high-intensity magnetic fields",
    "Device for controlling the flow of a fluid in a microfluidic channel"
]

monitoring_suggestions = [  # G-specific examples
    "Advancements in optical metrology for semiconductor manufacturing",
    "Trends in nuclear fusion reactor design",
    "Developments in acoustic sensing for medical imaging"
]

claim_suggestions = [ # G-specific examples
    # claim_analysis adds the "Claim 1:" / "Claim 2:" labels itself, so the examples omit them
    ["A method for measuring the speed of light using interferometry.", "A method for determining the refractive index of a material using interferometry."],
    ["A device for generating acoustic waves using piezoelectric materials.", "A device for generating acoustic waves using magnetostrictive materials."],
    ["A system for controlling the temperature of a superconducting magnet.", "A system for controlling the magnetic field strength of a superconducting magnet."]
]


summary_suggestions = [ # G-specific examples
    "Patent on a new type of optical sensor for detecting gravitational waves",
    "Patent on an improved design for a nuclear fusion reactor",
    "Patent on a novel method for measuring the viscosity of fluids using microfluidic devices"
]

landscape_suggestions = [ # G-specific examples
    "CPC code: G01N (Investigating or analysing materials by determining their chemical or physical properties)",
    "CPC code: G21B (Fusion reactors)",
    "CPC code: G02B (Optical elements, systems or apparatus)"
]

In [None]:
with gr.Blocks() as demo:
    gr.Markdown("# BigPatent Multi-Use-Case App")

    with gr.Tab("Prior Art Search"):
        prior_art_inputs = gr.Textbox(lines=2, placeholder="Enter your invention description...")

        prior_art_examples = gr.Examples(prior_art_suggestions, prior_art_inputs) # Query suggestions
        prior_art_outputs = [gr.Textbox(lines=5, label="Potential Prior Art"),
                     gr.Textbox(label="Document Metadata")
            ]
        prior_art_iface = gr.Interface(
            fn=prior_art_search,
            inputs=prior_art_inputs,
            outputs=prior_art_outputs
        )
        clear_button_prior_art = gr.Button("Clear History")
        clear_button_prior_art.click(clear_memory, [], prior_art_outputs) # Update outputs

    with gr.Tab("Competitive Monitoring"):
        monitoring_inputs = gr.Textbox(placeholder="Enter technology area")
        monitoring_examples = gr.Examples(monitoring_suggestions, monitoring_inputs) # Query suggestions
        monitoring_outputs = [gr.Textbox(lines=5, label="Competitive Landscape"),
                     gr.Textbox(label="Document Metadata")
            ]
        monitoring_iface = gr.Interface(
            fn=competitive_monitoring,
            inputs=monitoring_inputs,
            outputs=monitoring_outputs
        )
        clear_button_monitoring = gr.Button("Clear History")
        clear_button_monitoring.click(clear_memory, [], monitoring_outputs) # Update outputs

    with gr.Tab("Claim Analysis"):
        claim_inputs = [
                gr.Textbox(lines=2, placeholder="Enter claim 1"),
                gr.Textbox(lines=2, placeholder="Enter claim 2"),
            ]
        claim_examples = gr.Examples(claim_suggestions, claim_inputs) # Query suggestions
        claim_outputs = [gr.Textbox(lines=5, label="Claim Comparison"),
                     gr.Textbox(label="Document Metadata")
            ]
        claim_iface = gr.Interface(
            fn=claim_analysis,
            inputs=claim_inputs,
            outputs=claim_outputs
        )
        clear_button_claim = gr.Button("Clear History")
        clear_button_claim.click(clear_memory, [], claim_outputs) # Update outputs

    with gr.Tab("Patent Summarization"):
        summary_inputs = gr.Textbox(lines=5, placeholder="Enter patent text")
        summary_examples = gr.Examples(summary_suggestions, summary_inputs) # Query suggestions
        summary_outputs = [gr.Textbox(lines=3, label="Patent Summary"),
                     gr.Textbox(label="Document Metadata")
            ]
        summary_iface = gr.Interface(
            fn=patent_summarization,
            inputs=summary_inputs,
            outputs=summary_outputs
        )
        clear_button_summary = gr.Button("Clear History")
        clear_button_summary.click(clear_memory, [], summary_outputs) # Update outputs

    with gr.Tab("Landscape Overview"):
        landscape_inputs = gr.Textbox(placeholder="Enter CPC code")
        landscape_examples = gr.Examples(landscape_suggestions, landscape_inputs) # Query suggestions
        landscape_outputs = [gr.Textbox(lines=5, label="Landscape Overview"),
                     gr.Textbox(label="Document Metadata")
            ]
        landscape_iface = gr.Interface(
            fn=landscape_overview,
            inputs=landscape_inputs,
            outputs=landscape_outputs
        )
        clear_button_landscape = gr.Button("Clear History")
        clear_button_landscape.click(clear_memory, [], landscape_outputs) # Update outputs

demo.launch(share=True, debug=True)


