<a href="https://www.kaggle.com/code/awsaf49/uspto-kerasnlp-starter?scriptVersionId=182496391" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<center><img src="https://keras.io/img/logo-small.png" alt="Keras logo" width="100"><br/>
This starter notebook is provided by the Keras team.</center>

# USPTO - Explainable AI for Patent Professionals with [KerasNLP](https://github.com/keras-team/keras-nlp) and [Keras](https://github.com/keras-team/keras)

<div align="center">
    <img src="https://upload.wikimedia.org/wikipedia/commons/7/75/Seal_of_the_United_States_Patent_and_Trademark_Office.svg" width="150">
</div>

In this competition, our aim is to enable patent professionals to interpret the results of AI-powered searches in familiar language and syntax through Boolean search queries. These queries, which effectively characterize groups of patents, are well known to patent professionals. Specifically, we need to create a model that generates a Boolean search query from a patent and its $50$ neighbor patents, such that the query returns the same set of $50$ neighbor patents when searched in the USPTO database. The problem can be described visually as follows:

<div align="center"><img src="https://i.postimg.cc/q7jYCHLY/USPTO-problem.png" width="550"></div>

In this notebook, we will demonstrate how to leverage Large Language Models (LLMs) to approach this problem with an iterative comparison method to extract effective queries. Specifically, this notebook uses the **Gemma 1.1 2B** instruction-tuned model to perform query extraction from the patents.

**Did you know**: This notebook is backend-agnostic, which means it supports TensorFlow, PyTorch, and JAX backends. However, the best performance can be achieved with `JAX`. KerasNLP and Keras enable the choice of the preferred backend. Explore further details on [Keras](https://keras.io/keras_3/).

**Note**: For a deeper understanding of KerasNLP, refer to the [KerasNLP guides](https://keras.io/keras_nlp/).


# 🛠️ | Install Libraries

We need to install the `whoosh` and `whoosh_utils` libraries, which will be used for processing our model output. As we don't have internet access during inference, we will install these libraries from local files.


In [1]:
# Install whoosh and whoosh_utils library
!pip install /kaggle/input/uspto-whoosh-reloaded-2-7-5-patched/Whoosh_Reloaded-2.7.5-py2.py3-none-any.whl
!sed 's:/kaggle/input/whoosh-wheel-2-7-4/Whoosh-2.7.4-py2.py3-none-any.whl:whoosh-reloaded==2.7.5:g' /kaggle/usr/lib/whoosh_utils/whoosh_utils.py > whoosh_utils.py

Processing /kaggle/input/uspto-whoosh-reloaded-2-7-5-patched/Whoosh_Reloaded-2.7.5-py2.py3-none-any.whl
Installing collected packages: Whoosh-Reloaded
Successfully installed Whoosh-Reloaded-2.7.5


# 📚 | Import Libraries 

In [2]:
import os
os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow" or "torch"

import keras_nlp
import keras

import numpy as np
import pandas as pd
from tqdm import tqdm
import gc

import re
import whoosh_utils
import whoosh





## Library Version

In [3]:
print("Keras:", keras.__version__)
print("KerasNLP:", keras_nlp.__version__)

Keras: 3.3.3
KerasNLP: 0.12.1


# ⚙️ | Configuration

In [4]:
class CFG:
    seed = 42
    dataset_path = "/kaggle/input/uspto-explainable-ai"
    preset = "gemma_1.1_instruct_2b_en" # name of pretrained Gemma
    input_length = 1024 # max size of input sequence (prompt) for the model
    output_length = 1200 # max size of output sequence
    num_neighbors = 2 # how many neighbour patents to consider

# ♻️ | Reproducibility 

We set the random seed so that each run produces similar results.

In [5]:
keras.utils.set_random_seed(CFG.seed)

# 📖 | Metadata

The competition dataset provides different information about patents and their neighboring patents. In this notebook, we will use the `Title` and the `Abstract` of the patents to extract search queries.

## Files

### **patent_metadata.parquet** → Metadata for the most recent patents
- `publication_number` - The patent identifier.
- `publication_date` - The date the patent was published.
- `filing_date` - The date the patent was filed.
- `family_id` - An identifier for the patent family.
- `cpc_codes` - A list of the patent classification codes covering the patent.

### **test.csv** → Nearest Neighbor Patents of the 2,500 test patents
- `publication_number` - Only patents published on or after 1975 were included in this column.
- `target_[N]` - The Nth nearest neighbor of the patent, based on embeddings from the Google Patents Research Data BigQuery dataset.

### **patent_data/[year_month].parquet** → Patent text from Google Patents Public Data
- `publication_number` - The patent identifier.
- `title` - The text of the patent's title.
- `abstract` - The text of the patent's abstract.
- `claims` - The text of the patent's claims.
- `description` - The text of the patent's full description.

### **all_patents.parquet** → Title and Abstract of all the patents compiled from **patent_data/[year_month].parquet** (by @aerdem4)
- `publication_number` - The patent identifier.
- `title` - Title of the patent.
- `abstract` - Abstract of the patent.

> **Note:** In this notebook, we will not use the `description` text of the patents, to keep the runtime small. We are also not using the `patent_data/[year_month].parquet` files, as the `all_patents.parquet` file contains the titles and abstracts of all the patents. You are welcome to experiment with all the available patent information.

In [6]:
# Read the CSV file into a DataFrame with specific columns
test_df = pd.read_csv(f"{CFG.dataset_path}/test.csv")
test_df = test_df.iloc[:, :CFG.num_neighbors+1]
target_cols = list(test_df.columns[1:])

# Merge metadata of the patents
meta_df = pd.read_parquet(f"{CFG.dataset_path}/patent_metadata.parquet")
test_df = test_df.merge(meta_df, on="publication_number", how="left")

# Merge Title and Abstract of the patents
patent_df = pd.read_parquet("/kaggle/input/uspto-all-patents-after-1975/all_patents.parquet")
test_df = test_df.merge(patent_df, on="publication_number")

# Fill NaN values
test_df["title"] = test_df["title"].fillna("")
test_df["abstract"] = test_df["abstract"].fillna("")

# Merge Title and Abstract of the neighbour patents
for i in range(CFG.num_neighbors):
    test_df = test_df.merge(
        patent_df,
        left_on=target_cols[i],
        right_on="publication_number",
        how="left",
        suffixes=("", f"_{i}"),
    )

    # Fill NaN values
    test_df[f"title_{i}"] = test_df[f"title_{i}"].fillna("")
    test_df[f"abstract_{i}"] = test_df[f"abstract_{i}"].fillna("")

    # Drop extra publication_number column from merges
    test_df = test_df.drop(columns=[f"publication_number_{i}"])

# Reset index order as it will be used later for iteration
test_df = test_df.reset_index(drop=True)

# Clean up memory
del meta_df, patent_df
gc.collect()

30

In [7]:
test_df.head()

Unnamed: 0,publication_number,target_0,target_1,publication_date,filing_date,family_id,cpc_codes,title,abstract,title_0,abstract_0,title_1,abstract_1
0,US-2017082634-A1,US-2020225242-A1,US-2010137151-A1,2017-03-23,2016-07-21,58277081.0,"[C12Q1/485, G01N2570/00, G01N33/6845, G01N33/6...",Multiplexed Proteomics and Phosphoproteomics,The disclosure features methods of identifying...,Multiplexed proteomics and predictive drug can...,The disclosure features methods of identifying...,Protein Expression Profile Database,This invention describes the use of peptide pr...
1,US-2017180470-A1,US-2017171349-A1,US-2017171301-A1,2017-06-22,2016-08-25,59067267.0,"[H04L67/1063, H04L67/1074, H04L67/1097, H04L67...",Method and electronic device for sending CDN a...,Disclosed are a method and an electronic devic...,"Method, Device and System for Transmitting Data","Disclosed are a method, an electronic device a...","Method, device and system for load balancing c...","Disclosed are a method, a device and a system ..."
2,US-2018029544-A1,US-9597956-B2,US-2015197150-A1,2018-02-01,2016-07-26,60950985.0,"[B60J7/043, B60J7/053, B60L8/003, B60R11/00, B...",Roof support structure for solar panel module,A support structure for a vehicle roof panel i...,"Vehicle roof structure and vehicle, and method...",A vehicle roof structure includes a solar cell...,Vehicle solar cell panel,A vehicle solar cell panel that is mounted to ...
3,US-2022408153-A1,US-7698238-B2,US-10593167-B2,2022-12-22,2020-11-27,76221613.0,"[G06F2203/011, G06F3/011, H04N21/23412, H04N21...","Information processing device, information pro...",Provided is an information processing device c...,Emotion controlled system for processing multi...,The present invention relates to an emotion co...,Crowd-based haptics,A system produces haptic effects. The system r...
4,US-3881203-A,US-2777179-A,US-3898717-A,1975-05-06,1974-01-24,23731585.0,[B42F9/008],Tool for inserting paper into spines,"A simple, small tool is provided for the inser...",,,Releasable paper clip,A paper clip is made of sheet metal folded ove...


# 📋 | Solution Overview

Before we dive deep into the problem, let's have a look at the overview of our solution.

![Solution Overview](https://i.postimg.cc/vHpTVpB0/USPTO-overview.png)

Our approach can be summarized in the following steps:

1. **Generate Prompts**: For each patent, we will generate prompts from all its (patent, neighbor patent) pairs. This involves taking a given patent and pairing it with each of its neighbor patents to form pairs like `[(patent_0, neighbor_0), (patent_0, neighbor_1), ... (patent_0, neighbor_N)]`.

2. **Extract Keywords**: Using a Large Language Model (LLM), we will iteratively extract keywords from each patent pair. This process involves iteratively feeding the LLM with the generated prompts and obtaining relevant keywords from the responses.

3. **Create Queries**: After extracting keywords for each patent, we will create search queries using Boolean operators. These queries are the targets of this competition, which we will submit in `submission.csv`.
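
The three steps above can be sketched in plain Python. This is a minimal illustration rather than the notebook's implementation: `extract_keywords` is a hypothetical stand-in for the LLM call (here it simply intersects title words), and the patent dictionaries are made-up examples.

```python
def make_pairs(patent, neighbors):
    """Step 1: pair the patent with each of its neighbors."""
    return [(patent, neighbor) for neighbor in neighbors]


def extract_keywords(patent, neighbor):
    """Step 2 (hypothetical stand-in for the LLM): words shared by both titles."""
    words_a = set(patent["title"].lower().split())
    words_b = set(neighbor["title"].lower().split())
    return sorted(words_a & words_b)


def build_query(keywords, field="detd"):
    """Step 3: join the keywords with OR inside one search field."""
    return f"{field}:({' OR '.join(keywords)})" if keywords else ""


# Made-up example data
patent = {"title": "Vehicle solar roof panel"}
neighbors = [
    {"title": "Solar panel for vehicle"},
    {"title": "Roof mounted solar cell"},
]

keywords = []
for a, b in make_pairs(patent, neighbors):
    for kw in extract_keywords(a, b):
        if kw not in keywords:  # keep first-seen order, drop duplicates
            keywords.append(kw)

print(build_query(keywords))
```

In the notebook itself, step 2 is handled by Gemma, and step 3 additionally combines the keywords with CPC codes.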

# 🔧 | Prompt Engineering

We will be using the simple prompt template below, wrapped in Gemma's chat template, to extract queries/keywords from each patent and one of its neighbors. We will repeat this process over the (patent, neighbor) pairs to extract more effective queries. You are welcome to explore more advanced prompt templates for better results.

**Prompt Template:**
```
Task:
Analyze and compare the given two patents and identify the common and similar query keywords that should yield these two patents when searched in the United States Patent and Trademark Office (USPTO) database.

Instructions:
1. Carefully read and understand the provided 'Patent 1' and 'Patent 2' below. Pay attention to their titles and abstracts.
2. Identify the key terms, concepts, and components that are common or similar in both patents, that are likely to be used when searching for these patents.
3. After the 'Keywords' section below, write only the keywords, each separated with a semicolon (';') and a space (' '). For example, 'keyword1; keyword2; keyword3_1 keyword3_2'.
4. You are not allowed to add any narratives or text before or after your response.

Patent 1:
* Title: ...
* Abstract: ...

Patent 2:
* Title: ...
* Abstract: ...

Keywords:

```

**Chat Template:**

```
<start_of_turn>user
Prompt.....<end_of_turn>
<start_of_turn>model
```

In [8]:
prompt_template = "Task:\nAnalyze and compare the given two patent abstracts and titles, and identify the common or similar query keywords that should yield these two patents when searched in the United States Patent and Trademark Office (USPTO) database.\n\nInstructions:\n1. Carefully read and understand the provided 'Patent 1' and 'Patent 2' titles and abstracts below.\n2. Identify the key terms, concepts, and components that are either common or similar in both patent titles and abstracts.\n3. In the 'Keywords' section below, write the common or similar keywords, separating each keyword with a semicolon (';') and a space (' '). Here is an example response, 'keyword1; keyword2; keyword3_1 keyword3_2'.\n4. Do not add any additional narratives or text before or after the keywords.\n\nPatent 1:\n* Title: {title_a}\n* Abstract: {abstract_a}\n\nPatent 2:\n* Title: {title_b}\n* Abstract: {abstract_b}\n\nKeywords:"

chat_template = f"<start_of_turn>user\n{prompt_template}<end_of_turn>\n<start_of_turn>model\n"

In [9]:
print(prompt_template)

Task:
Analyze and compare the given two patent abstracts and titles, and identify the common or similar query keywords that should yield these two patents when searched in the United States Patent and Trademark Office (USPTO) database.

Instructions:
1. Carefully read and understand the provided 'Patent 1' and 'Patent 2' titles and abstracts below.
2. Identify the key terms, concepts, and components that are either common or similar in both patent titles and abstracts.
3. In the 'Keywords' section below, write the common or similar keywords, separating each keyword with a semicolon (';') and a space (' '). Here is an example response, 'keyword1; keyword2; keyword3_1 keyword3_2'.
4. Do not add any additional narratives or text before or after the keywords.

Patent 1:
* Title: {title_a}
* Abstract: {abstract_a}

Patent 2:
* Title: {title_b}
* Abstract: {abstract_b}

Keywords:


In [10]:
def create_prompt(row, neighbor_idx):
    prompt = chat_template.format(title_a=row["title"], abstract_a=row["abstract"],
                                  title_b=row[f"title_{neighbor_idx}"], abstract_b=row[f"abstract_{neighbor_idx}"])
    return prompt

## Check Sample Prompt

In [11]:
prompt_sample = create_prompt(test_df.iloc[2], 0)
print(prompt_sample)

<start_of_turn>user
Task:
Analyze and compare the given two patent abstracts and titles, and identify the common or similar query keywords that should yield these two patents when searched in the United States Patent and Trademark Office (USPTO) database.

Instructions:
1. Carefully read and understand the provided 'Patent 1' and 'Patent 2' titles and abstracts below.
2. Identify the key terms, concepts, and components that are either common or similar in both patent titles and abstracts.
3. In the 'Keywords' section below, write the common or similar keywords, separating each keyword with a semicolon (';') and a space (' '). Here is an example response, 'keyword1; keyword2; keyword3_1 keyword3_2'.
4. Do not add any additional narratives or text before or after the keywords.

Patent 1:
* Title: Roof support structure for solar panel module
* Abstract: A support structure for a vehicle roof panel includes a solar panel module. The solar panel module is disposed within an opening defined

# 🤖 | Modeling

To extract query keywords from patents, we could use simple text processing or filtering, but it is very difficult to generalize such filters. Alternatively, we could use Named Entity Recognition (NER) to identify the key terms in a patent, but this competition provides no ground-truth query keywords, so it is not possible to train an NER model for this task. Instead, we use a large language model (LLM), which can follow the instructions in our prompt to extract query keywords from patents.

In this notebook, we will be using the **Gemma 1.1 2B instruction-tuned** model from **KerasNLP**. KerasNLP also offers other recent and powerful pretrained LLMs, such as `Phi3`, `Llama3`, and `Mistral`. You are welcome to try them out. You can check the available pretrained LLMs on the [KerasNLP models page](https://keras.io/api/keras_nlp/models/).

> We are using the "Instruction tuned" model instead of the "Pretrained" one because this notebook prompts the model directly, without fine-tuning, and therefore relies on the model following the instructions in the prompt.

In [12]:
# Load the pretrained model named in the config
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset(CFG.preset)

# Keep the input length small to limit memory and latency cost
gemma_lm.preprocessor.sequence_length = CFG.input_length

Attaching 'metadata.json' from model 'keras/gemma/keras/gemma_1.1_instruct_2b_en/3' to your Kaggle notebook...
Attaching 'task.json' from model 'keras/gemma/keras/gemma_1.1_instruct_2b_en/3' to your Kaggle notebook...
Attaching 'config.json' from model 'keras/gemma/keras/gemma_1.1_instruct_2b_en/3' to your Kaggle notebook...
Attaching 'model.weights.h5' from model 'keras/gemma/keras/gemma_1.1_instruct_2b_en/3' to your Kaggle notebook...
Attachin

# 🔑 | Keyword Generation

The following code extracts query keywords from each patent and one of its neighbors. First, it uses the Gemma LLM to analyze the two patents and identify their common query keywords. Then, it applies text processing to extract the keywords from the model's output. This process is repeated once per neighbor, as set by `num_neighbors` in `CFG`. Finally, the keywords from all the (patent, neighbor) pairs are merged. In this notebook, we use $2$ neighbors. You are welcome to use more neighbor patents.

> **Note:** The code for extracting keywords from the model output includes several text processing steps that may seem unnecessary. They guard against pitfalls that could result in `submission scoring errors`: LLMs sometimes fail to follow instructions and produce badly formatted output, and these steps ensure that we still recover the query keywords in that case.
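
To make the note above concrete, here is a small standalone sketch of defensive parsing. It is not the notebook's `decode_output`; `parse_keywords` is a hypothetical helper that uses a single regex split, and the sample model output is invented for illustration.

```python
import re


def parse_keywords(answer, max_len=40):
    """Split raw model text into a clean, de-duplicated keyword list."""
    # Drop any "Keywords:" label and markdown bold markers
    answer = re.sub(r"(?i)keywords\s*:", "", answer).replace("**", "")

    # Treat semicolons, commas, and newlines as delimiters
    parts = re.split(r"[;,\n]+", answer)

    keywords = []
    for part in parts:
        # Strip bullet markers, brackets, periods, and surrounding spaces
        kw = part.strip(" \t-*[].")
        if kw and len(kw) < max_len and kw not in keywords:
            keywords.append(kw)
    return keywords


# Invented example of output that ignores the requested format
raw = "**Keywords:**\n- solar panel;\n- vehicle roof;\n- solar panel"
print(parse_keywords(raw))
```

The idea is the same as in the notebook's code: accept several delimiter styles, remove decoration the model may add, and discard anything too long to be a keyword.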


In [13]:
def generate_keyword(row, neighbor_idx):
    # Skip generation only if every title and abstract is an empty string
    fields = [
        "title",
        "abstract",
        f"title_{neighbor_idx}",
        f"abstract_{neighbor_idx}",
    ]
    if all(row[field] == "" for field in fields):
        return [""]

    # Create prompt
    prompt = create_prompt(row, neighbor_idx)

    try:
        # Generate output from model
        output = gemma_lm.generate(prompt, max_length=CFG.output_length)

        # Extract keyword from model output
        keyword = decode_output(output, prompt)
    except Exception:
        keyword = [""]
        
    return keyword


def decode_output(output, prompt):
    # Remove input prompt from model output
    answer = output.replace(prompt, "").strip()

    # Avoid edge case when output_max_length < model_output
    if "Title:" in answer and "Abstract:" in answer:
        return [""]

    # Filter out possible unwanted output text
    for x in ["Keywords:", "solution:", "Solution", "**", "\n\n", "[", "]"]:
        answer = answer.replace(x, "").strip()

    # Create list of keywords using the first matching delimiter;
    # if no delimiter is present, treat the whole answer as one keyword
    parts = [answer]
    for sep in [";\n-", ",\n-", "\n-", ";\n", ";\n*", ",\n*", ",\n", "\n*", ",", ";"]:
        if sep in answer:
            parts = answer.strip(sep).strip().split(sep + " ")
            break

    # Final filtering: remove '*' and '.', and drop keywords of 40+ characters
    keywords = [x.replace("*", "").replace(".", "") for x in set(parts) if len(x) < 40]

    # If no keywords are found, use an empty string as the only keyword
    if not len(keywords):
        keywords = [""]

    return keywords


Let's generate the query keywords from the patents.

In [14]:
# Keywords for all patents
keywords_all = []

for _, row in tqdm(test_df.iterrows(), total=test_df.shape[0]):
    # Keywords for one patent
    keywords = []

    # Iteratively create keywords for each (patent, neighbour) pair
    for neighbor_idx in range(CFG.num_neighbors):
        keywords += generate_keyword(row, neighbor_idx=neighbor_idx)
        
    # Remove duplicate keywords
    keywords = list(set(keywords))
    
    # Merge keywords
    keywords_all.append(keywords)

100%|██████████| 6/6 [00:25<00:00,  4.21s/it]


Let's check some of the keywords that we extracted from patents.

In [15]:
_ = [print(f"Keywords {i}: {q}", end="\n\n") for i, q in  enumerate(keywords_all[:3])]

Keywords 0: ['Protein deregulation', 'Protein-protein interaction network', 'Biological samples', 'Comparative analysis', 'Drug candidate assessment', 'Disease treatment', 'Peptide profiling', 'Mass spectrometry']

Keywords 1: ['P2P connections', 'Load balancing configuration', 'Common or Similar Query keyword1', 'cache server', 'Common or Similar Query  CDN address', 'IP address', 'server information', 'data transmission path', 'Server information', 'CDN address']

Keywords 2: ['solar array', 'transparency', 'Common or Similar Query keyword1', 'support structure', 'resin', 'solar panel module', 'vehicle roof', 'V-shaped ribs']



# 🚫 | Whoosh Filter

Now that we have the keywords, we will soon create search queries from them. During submission, these queries will be used in the patent search database emulator, built with [Whoosh](https://whoosh.readthedocs.io/en/latest/intro.html) for searching patents. The returned patents from the search will be used for scoring. In simple terms, we can consider Whoosh as a search engine like Google, where given an input query, it returns the patents that match those queries. However, unlike Google, we can't input just anything into Whoosh; we need to maintain a specific format. To ensure our input works properly in Whoosh, we need to filter the query. The following code will remove common words (stopwords) and numbers (e.g., `123`, `1,234`, or `1.234`) from the keywords and make them searchable in Whoosh.

You can learn more about the Whoosh filter [here](https://www.kaggle.com/competitions/uspto-explainable-ai/discussion/499582). If you want to learn more about how search in Whoosh works, check out this [demo](https://www.kaggle.com/code/sohier/basic-whoosh-search-demo).


In [16]:
BRS_STOPWORDS = ['an', 'are', 'by', 'for', 'if', 'into', 'is', 'no', 'not', 'of', 'on', 'such',
        'that', 'the', 'their', 'then', 'there', 'these', 'they', 'this', 'to', 'was', 'will', 'and', 'or']
NUMBER_REGEX = re.compile(r'^(\d+|\d{1,3}(,\d{3})*)(\.\d+)?$')

class NumberFilter(whoosh.analysis.Filter):
    def __call__(self, tokens):
        for t in tokens:
            if not NUMBER_REGEX.match(t.text):
                yield t

custom_analyzer = whoosh.analysis.StandardAnalyzer(stoplist=BRS_STOPWORDS) | NumberFilter()

Let's see this filter in action. It is apparent that this filter removes numbers and stopwords.


In [17]:
it = custom_analyzer("device, 1.023, machine, that, learning, there")
[token.text for token in it]

['device', 'machine', 'learning']
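
To see what the number pattern alone accepts, the sketch below tests `NUMBER_REGEX` in isolation, outside Whoosh's tokenizer. The regex is copied from the cell above; the sample strings are arbitrary choices for illustration.

```python
import re

# Same pattern as NUMBER_REGEX in the notebook
NUMBER_REGEX = re.compile(r"^(\d+|\d{1,3}(,\d{3})*)(\.\d+)?$")

# Arbitrary sample strings: plain, comma-grouped, decimal, and mixed tokens
samples = ["123", "1,234", "1.234", "p2p", "3d", "12,34"]
for text in samples:
    is_number = bool(NUMBER_REGEX.match(text))
    print(f"{text!r}: {'dropped' if is_number else 'kept'}")
```

Tokens that mix letters and digits, such as `p2p`, do not match the pattern and are therefore kept as searchable keywords.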

In the final filtering step, we use `QueryValidator()` from the `whoosh_utils` library to check that the query is fit to be used in search.

> **Note:** The Whoosh library itself is a bit complicated, so the competition provides `whoosh_utils`, which makes it easier to search and validate queries.

In [18]:
query_validator = whoosh_utils.QueryValidator()

def validate_query(query):
    # Fall back to a default query if the input is not a non-empty string
    if not isinstance(query, str) or not len(query):
        query = "ti:device"
    try:
        query_validator.validate_query(query)
    except Exception:
        query = "ti:device"
    return query

In [19]:
validate_query("device OR machine") # query is valid

'device OR machine'

In [20]:
validate_query("(device OR machine") # query is invalid due to missing ')' thus returns default query

'ti:device'

# 🔍 | Query Creation

## What are the Queries?

The search queries are the target of this competition. Specifically, they are Boolean queries: keywords combined with Boolean logic operators, so that each patent either matches the query or does not. These queries allow for complex search criteria, refining search results effectively.

We combine keywords using different Boolean or Proximity operators:
- Boolean: `OR`, `AND`, `NOT`, `XOR`
- Proximity: `ADJ`, `NEAR` (Note: Proximity operators aren't supported for the CPC field.)

Wildcards can also be used to include variations of the keywords:
- Wildcards: `*`, `?`, `$` (Note: Wildcards are incompatible with proximity operators and may impact performance.)

The following fields of the patents are searchable using the query:
- `ti`: title
- `ab`: abstract
- `clm`: claims
- `detd`: description
- `cpc`: CPC codes
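
As a rough illustration of how fields and operators combine, the sketch below builds query strings with two hypothetical helpers, `field_query` and `combine`. These are not part of `whoosh_utils`, and whether a particular combination is accepted should still be checked with the competition's query validator.

```python
def field_query(field, terms, op="OR"):
    """Wrap terms in one field, joined by a single Boolean operator."""
    joined = f" {op} ".join(terms)
    return f"{field}:({joined})"


def combine(queries, op="AND"):
    """Join non-empty sub-queries with a Boolean operator."""
    return f" {op} ".join(q for q in queries if q)


# Illustrative terms and CPC codes
title_part = field_query("ti", ["solar", "panel"], op="AND")
cpc_part = field_query("cpc", ["B60J7/043", "B60L8/003"])
print(combine([title_part, cpc_part]))
```

Keeping the per-field part and the combining step separate makes it easy to try other operators or fields later.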

## How are queries created?

In the following code, we first create a query from the CPC codes (up to 15) using Boolean operators. Then, we create a query from the keywords that the LLM extracted from the patents. We merge the keyword and CPC queries using Boolean operators, specifying which query belongs to which field (`ti`, `detd`, or `cpc`). Finally, we check the query token count, as the competition metric allows only 50 tokens, and validate the query to ensure everything is in order.

Example queries may look like:
- `"detd:(device OR machine)"`
- `"cpc:(C12Q1/485 OR G01N2570/00)"`
- `"detd:(device OR machine) AND cpc:(C12Q1/485 OR G01N2570/00)"`
- `"ti:device"`

> **Note**: Although query keywords are extracted from titles and abstracts of patents, we search in the description of patents as titles and abstracts might not always contain the keyword. Moreover, keywords appearing in titles and abstracts are very likely to appear in the description of the patent.
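
The token-budget step can be sketched without Whoosh as follows. This is a simplified illustration under stated assumptions: `approx_token_count` is a hypothetical counter that treats every whitespace-separated piece other than an operator as one token, which may not match how `whoosh_utils.count_query_tokens` counts.

```python
OPERATORS = {"OR", "AND", "NOT", "XOR", "ADJ", "NEAR"}


def approx_token_count(query):
    """Hypothetical counter: whitespace-separated pieces that are not operators."""
    return sum(1 for piece in query.split() if piece not in OPERATORS)


def fit_to_budget(tokens, query_cpc, budget=50):
    """Drop keywords from the end until the combined query is under the budget."""
    tokens = list(tokens)
    while tokens:
        query = f"detd:({' OR '.join(tokens)})"
        if query_cpc:
            query += f" AND {query_cpc}"
        if approx_token_count(query) < budget:
            return query
        tokens.pop()
    # No keyword fits: fall back to the CPC-only query
    return query_cpc


# A tiny budget is used here only to make the trimming visible
cpc = "cpc:(B60J7/043 OR B60L8/003)"
print(fit_to_budget(["solar", "roof", "resin", "module"], cpc, budget=5))
```

Because keywords are dropped from the end of the list, ordering them by importance beforehand would decide which ones survive the trimming.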

In [21]:
queries = []

for i, row in tqdm(test_df.iterrows(), total=len(test_df)):
    # Create query from cpc_codes
    cpc = row["cpc_codes"]
    query_cpc = f"cpc:({' OR '.join(cpc[:15])})" if len(cpc) else ""

    # Default to the CPC-only query so `query` is always defined for this row
    query = query_cpc

    try:
        # Analyze the keywords
        keywords_str = ", ".join(keywords_all[i])
        tokens = list(set([token.text for token in custom_analyzer(keywords_str)]))

        # Reduce the keywords while the number of query tokens is too high
        while len(tokens):
            # Create query from keywords
            query_keywords = f"({' OR '.join(tokens)})"

            # Merge queries from keywords and cpc_codes
            query_check = f"detd:{query_keywords}" + (f" AND {query_cpc}" if len(query_cpc) else "")

            # Keep the query if the number of query tokens is okay
            if whoosh_utils.count_query_tokens(query_check) < 50:
                query = query_check
                break

            # Drop one keyword if the query has too many tokens
            tokens.pop()
    except Exception:
        query = query_cpc

    # Final query validation
    query = validate_query(query)

    queries.append(query)

100%|██████████| 6/6 [00:00<00:00, 1517.20it/s]


Let's check some of the queries we created.

In [22]:
_ = [print(f"Query {i}: {q}", end="\n\n") for i, q in  enumerate(queries[:3])]

Query 0: detd:(candidate OR protein OR deregulation OR profiling OR spectrometry OR network OR analysis OR drug OR treatment OR biological OR interaction OR mass OR assessment OR peptide OR disease OR samples) AND cpc:(C12Q1/485 OR G01N2570/00 OR G01N33/6845 OR G01N33/6848 OR G06F19/20 OR G16B25/00 OR G16B25/10 OR G16B5/00)

Query 1: detd:(connections OR keyword1 OR data OR balancing OR transmission OR cdn OR query OR ip OR path OR p2p OR common OR server OR information OR configuration OR similar OR cache OR address OR load) AND cpc:(H04L67/1063 OR H04L67/1074 OR H04L67/1097 OR H04L67/18)

Query 2: detd:(keyword1 OR transparency OR roof OR module OR solar OR query OR resin OR support OR shaped) AND cpc:(B60J7/043 OR B60J7/053 OR B60L8/003 OR B60R11/00 OR B60R13/06 OR B60R13/0869 OR B60R16/0307 OR B60R16/033 OR B60R2011/004 OR B60R2011/0045 OR B60R2011/0084 OR H01L31/02167 OR H01L31/024 OR H01L31/0481 OR H01L31/049)



# 📬 | Submission

The following code prepares the submission file.

In [23]:
test_df["query"] = queries
pred_df = test_df[["publication_number", "query"]]
pred_df.to_csv("submission.csv", index=False)
pred_df.head()

Unnamed: 0,publication_number,query
0,US-2017082634-A1,detd:(candidate OR protein OR deregulation OR ...
1,US-2017180470-A1,detd:(connections OR keyword1 OR data OR balan...
2,US-2018029544-A1,detd:(keyword1 OR transparency OR roof OR modu...
3,US-2022408153-A1,detd:(data OR recognition OR event OR multimed...
4,US-3881203-A,detd:(insertion OR capacity OR slit OR spines ...


This competition differs from most NLP competitions because of its weakly supervised labels, so it helps to understand how the submission process works.

Once submitted, the `whoosh_utils` library will search patents using queries from our `submission.csv`. Once it has retrieved the patents, they will be evaluated using mean average precision at 50 (`mAP@50`) between the retrieved patents and the ground truth patents. So, unlike other Kaggle competitions where the submission file is directly used for scoring, here the Whoosh library will be used before scoring.
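
For intuition about the scoring step, here is a minimal sketch of one common definition of average precision at $k$. The exact formula used by the competition metric is an assumption here and may differ in details such as normalization; the function names and the example identifiers are hypothetical.

```python
def average_precision_at_k(retrieved, relevant, k=50):
    """One common AP@k definition; the official metric may differ in details."""
    relevant = set(relevant)
    if not relevant:
        return 0.0

    hits = 0
    score = 0.0
    seen = set()
    for rank, doc in enumerate(retrieved[:k], start=1):
        # Count each relevant document once, at the rank it first appears
        if doc in relevant and doc not in seen:
            hits += 1
            score += hits / rank  # precision at this rank
        seen.add(doc)
    return score / min(len(relevant), k)


def mean_average_precision_at_k(all_retrieved, all_relevant, k=50):
    """Average AP@k over all queries."""
    scores = [average_precision_at_k(r, t, k) for r, t in zip(all_retrieved, all_relevant)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical identifiers: "A" and "B" are true neighbors, "X" is not
print(average_precision_at_k(["A", "X", "B"], ["A", "B"]))
```

Under this definition, a query scores higher when the true neighbor patents appear early in the retrieved list and few unrelated patents are returned before them.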


# 🔭 | Future Directions

Looking forward, we can further improve this notebook by:

1. Trying different models like `Phi3`, `Llama3`, or `Mistral`.
2. Increasing `num_neighbors` in `CFG` to extract more keywords.
3. Using fewer `cpc` codes in the query (currently up to 15) to include more query keywords.
4. Experimenting with different operator combinations. Currently, we are using simple `OR` and `AND`.
5. Utilizing more advanced prompt engineering.


# 📌 | Reference

* [USPTO LLM Solution [Prompt Eng]](https://www.kaggle.com/code/aerdem4/uspto-llm-solution-prompt-eng)
* [USPTO: whoosh==2.7.5 Offline Use](https://www.kaggle.com/code/seshurajup/uspto-whoosh-2-7-5-offline-use)