*Licensed under the MIT License. See LICENSE-CODE in the repository root for details.*

*Copyright (c) 2025 Eleni Kamateri*

### Summarization Test Set Creation (Ground Truth)

This script facilitates the creation of summarization test sets (i.e., ground truth), using a CSV file as input. The CSV file contains essential data for patents within a specific core vertical (e.g., EP).

#### Overview

The script is divided into four key parts:

- Part I: Selects patents meeting specific criteria for inclusion in the test set.
- Part II: Retrieves textual data from the original WPI collection.
- Part III: Extracts key textual sections, including the brief description, the summary segment of the description and the first claim.
- (Optionally) Part IV: Filters patents based on additional selection criteria.

This methodology ensures a structured, high-quality dataset for training and evaluating summarization models on patent data.

### Selection Criteria for Test Set (Part I)

Patents included in the test sets must meet the following criteria:

1. Complete textual fields: The patent must include an abstract, description, and claims.
2. B kind code selection: The patent must have a B kind code (e.g., B1, B2, B3, B6, B8, B9).
3. Date restriction: The patent must have been submitted in the last quarter of 2015 (i.e., on or after October 1, 2015).

### Textual Data Retrieval (Part II)
To create a summarization dataset that is ready for use, we need to retrieve key textual sections from the original WPI collection:

1. Abstract – Serves as the reference summary.
2. Description – Typically used as input for generating summaries.
3. Claims – Typically included in the input text, as they define the scope of the invention.
4. Title – Provides additional context and may help improve summarization quality.

By ensuring these sections are included, the dataset will be structured and comprehensive, supporting high-quality summarization tasks.
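
As a rough illustration of how these four sections could be combined into one summarization example, the sketch below assembles an input text and a reference summary from a single record. The column names follow the DataFrame built later in this notebook; the function name `build_example` and the output keys `input_text` and `reference_summary` are hypothetical and chosen only for this example.

```python
def build_example(record, include_title=True):
    """Assemble one (input, reference) pair from a patent record (illustrative only)."""
    parts = []
    # The title is optional context, so it is only prepended when present
    if include_title and record.get("title_lang_en"):
        parts.append(record["title_lang_en"])
    # Description and claims form the body of the input text
    parts.append(record["description_lang_en"])
    parts.append(record["claims_lang_en"])
    return {
        "input_text": "\n".join(parts),
        # The abstract serves as the reference summary
        "reference_summary": record["abstract_lang_en"],
    }


record = {
    "title_lang_en": "Widget",
    "abstract_lang_en": "A widget is disclosed.",
    "description_lang_en": "The invention relates to widgets.",
    "claims_lang_en": "1. A widget.",
}
example = build_example(record)
print(example["input_text"])
```

Whether the title and claims belong in the input is a design choice that depends on the summarization model being evaluated.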

<div class="alert alert-block alert-info">
<b>Tip for Part III:</b> To ensure proper extraction of patent text in Part III, the `getText` call in Part II should separate the text fragments it retrieves with a line separator ('\n ').
</div>
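
The effect of the separator can be sketched with the standard library. The example below uses `xml.etree.ElementTree` as a stand-in for BeautifulSoup's `getText(separator=...)`, and the tag names `heading` and `p` are assumptions about the XML layout, not taken from the WPI schema. It only aims to show why joining fragments with `'\n '` keeps each heading on its own line, which the heading-based extraction in Part III relies on.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified description element
xml_snippet = (
    '<description lang="EN">'
    "<heading>SUMMARY OF THE INVENTION</heading>"
    "<p>A faster widget is provided.</p>"
    "</description>"
)

element = ET.fromstring(xml_snippet)
fragments = list(element.itertext())

# Without a separator, the heading runs into the following paragraph
joined_without_separator = "".join(fragments)

# With '\n ' as separator, the heading stays on a line of its own
joined_with_separator = "\n ".join(fragments)

print(repr(joined_without_separator))
print(repr(joined_with_separator))
```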

### Brief Description, Summary, and First Claim Extraction (Part III)
This step extracts key sections from each patent:

- Brief description
- Summary section
- First claim
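
The summary-segment detection can be sketched as a small standalone function. It is a minimal sketch that mirrors the logic used in the Part III cell: collection starts at a line whose normalized text matches a known summary heading and stops at the next upper-case heading containing the word "description". The function name `extract_summary` and the heading set are hypothetical; the notebook loads its headings from a CSV file.

```python
import re


def extract_summary(description, summary_headings):
    """Return the summary segment of a description, or '' if no heading matches."""
    collecting = False
    summary_lines = []
    for line in description.split("\n"):
        collapsed = " ".join(line.split())
        # Normalize: lower-case and keep only letters, digits, and whitespace
        normalized = re.sub(r"[^a-zA-Z0-9\s]", "", collapsed.lower())
        if not collecting:
            if normalized in summary_headings:
                collecting = True
                summary_lines.append(collapsed)
        else:
            # An upper-case run containing "description" closes the summary
            match = re.search(r"[A-Z][^a-z]* ", line)
            if match and "description" in match.group(0).lower():
                break
            summary_lines.append(collapsed)
    return " ".join(summary_lines)


headings = {"summary", "summary of the invention"}  # assumed heading list
description = (
    "BACKGROUND\n Prior art is slow.\n SUMMARY OF THE INVENTION\n"
    " A faster widget is provided.\n BRIEF DESCRIPTION OF THE DRAWINGS\n"
    " Fig. 1 shows the widget."
)
summary = extract_summary(description, headings)
print(summary)
```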

<div class="alert alert-block alert-info">
<b>Tip:</b> To ensure proper extraction of patent text:
    
- The description section must include author-annotated headings; otherwise the summary segment cannot be located.
    
- The claims section must follow a standard numbering format (e.g., starting from "1."); otherwise the algorithm may not detect the first claim.
</div>
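
To illustrate why the numbering matters, here is a minimal sketch of a numbering-based first-claim extractor. It is not the notebook's exact logic, which splits the claims text at the first `' 2.'`; the function name `extract_first_claim` and the regular expression are assumptions made for this example. Decimal numbers inside a claim (e.g., "0.1 2.5") could still confuse a pattern like this.

```python
import re


def extract_first_claim(claims_text):
    """Return the text of claim 1, or '' when no '1.' marker is found."""
    # Collapse line breaks and repeated spaces before matching
    text = " ".join(claims_text.split())
    # Capture everything after "1." up to the " 2. " marker or the end of text
    match = re.search(r"(?:^|\s)1\.\s*(.*?)(?=\s2\.\s|$)", text)
    return match.group(1).strip() if match else ""


numbered = (
    "1. A widget comprising a frame.\n"
    " 2. The widget of claim 1, wherein the frame is steel."
)
unnumbered = "A widget comprising a frame."

first_claim = extract_first_claim(numbered)
missing_claim = extract_first_claim(unnumbered)
print(repr(first_claim), repr(missing_claim))
```

When the claims text carries no explicit numbers, as in the second input, nothing is extracted, which is the failure mode the tip above warns about.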

### Additional Criteria for Test Set (Part IV) (Optional)
Further refinement of the summarization test set can be applied by:

1. Filtering out patents without a distinct summary segment
2. Removing patents whose abstract has low similarity (a cosine similarity of 0.70 or below) to either the description or the summary segment

These additional filters help maintain high-quality and coherent summarization data.
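
The similarity filter can be sketched with plain Python vectors. The toy vectors below are made-up stand-ins for the BERT `[CLS]` embeddings computed later in the notebook, and the helper names `cosine` and `keep_patent` are hypothetical; only the 0.70 threshold and the requirement that both similarities exceed it come from the notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keep_patent(abstract_vec, summary_vec, description_vec, threshold=0.70):
    """Keep a patent only if the abstract is similar to BOTH other texts."""
    return (
        cosine(abstract_vec, summary_vec) > threshold
        and cosine(abstract_vec, description_vec) > threshold
    )


abstract_vec = [1.0, 0.0]
summary_vec = [1.0, 0.1]

kept = keep_patent(abstract_vec, summary_vec, [1.0, 1.0])     # both above 0.70
dropped = keep_patent(abstract_vec, summary_vec, [0.0, 1.0])  # description unrelated
print(kept, dropped)
```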

### Important Note

Virtual patents were created by merging different kind codes of the same patent, retaining the latest information for each field.
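
A minimal sketch of this merging step, using plain dictionaries instead of the pandas `groupby(...).agg('last')` call the notebook relies on. It assumes that sorting by kind code (A1 before B1) reflects recency, which is an assumption of this example; the notebook itself keeps the last non-empty value in DataFrame row order. The function name `merge_kind_codes` is hypothetical.

```python
def merge_kind_codes(records, fields):
    """Merge the kind-code documents of one patent into a single virtual patent."""
    merged = {}
    # Process documents from oldest to newest kind code (assumed ordering)
    for record in sorted(records, key=lambda r: r["kind_code"]):
        for field in fields:
            value = record.get(field)
            if value:  # empty strings and None never overwrite earlier values
                merged[field] = value
    return merged


records = [
    {"kind_code": "B1", "title": "Widget", "abstract": "", "claims": "1. A widget."},
    {"kind_code": "A1", "title": "Widget (draft)",
     "abstract": "A widget is disclosed.", "claims": "1. A widget or gadget."},
]
virtual_patent = merge_kind_codes(records, ["kind_code", "title", "abstract", "claims"])
print(virtual_patent)
```

Here the B1 document supplies the title and claims, while the abstract, which is empty in B1, is carried over from the A1 document.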

### Configurable Parameters

Researchers can modify the following parameters to customize the test set generation:

**csv_file_path** – Path to the CSV file containing essential data for analyzing the specific vertical.

**vertical_origin_path** – Path to the core vertical of the WPI dataset, containing the extracted files to be parsed for CSV creation. 

        Example: "/YOUR_PATH/WPI-Dataset/EP/". 
        
**destination_path** – Path to the folder where the generated files will be stored. 

**sep** – Defines the separator used in the CSV file:

        0: Semicolon (;)
        
        1: Comma (,)
              
**kind_code_selection** – Selects the type of documents for the test set:
        
        0: B documents (default).
        
        1: A documents (for cases like the WO core vertical, which only has A documents).
        
**date_selection** – Allows modification of the date criteria for selected documents.

        20151000 (default).
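
The numeric options above can be checked before any data is loaded. The sketch below is one possible way to do this and is not part of the original script; the function name `resolve_settings` and the returned keys are hypothetical, while the accepted values mirror the parameter list above.

```python
CSV_DELIMITERS = {0: ";", 1: ","}      # sep -> CSV delimiter
KIND_CODE_LETTERS = {0: "B", 1: "A"}   # kind_code_selection -> kind code letter


def resolve_settings(sep, kind_code_selection, date_selection):
    """Translate the numeric options into concrete values, failing early on bad input."""
    if sep not in CSV_DELIMITERS:
        raise ValueError("sep must be 0 (semicolon) or 1 (comma)")
    if kind_code_selection not in KIND_CODE_LETTERS:
        raise ValueError("kind_code_selection must be 0 (B) or 1 (A)")
    # Dates are compared as eight-digit YYYYMMDD integers
    if not 10000000 <= int(date_selection) <= 99999999:
        raise ValueError("date_selection must be an eight-digit YYYYMMDD integer")
    return {
        "delimiter": CSV_DELIMITERS[sep],
        "kind_code_letter": KIND_CODE_LETTERS[kind_code_selection],
        "date_selection": int(date_selection),
    }


settings = resolve_settings(sep=0, kind_code_selection=0, date_selection=20151000)
print(settings)
```

Failing early with a `ValueError` avoids a later `NameError` when an invalid option leaves a DataFrame undefined.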
                
The code below creates summarization test sets for the EP core vertical.

### Set the required parameters for the script

In [15]:
csv_file_path='/YOUR_PATH/EP_csv_file_for_wpi_analysis.csv'
vertical_origin_path="/YOUR_PATH/WPI-Dataset/EP/"
destination_path="/YOUR_PATH/WPI-Dataset/"
filename1="1"
filename2="2"
filename3="3"
sep=0
kind_code_selection=0
date_selection=20151000

### Import all required libraries for the script

In [2]:
import pandas as pd
import numpy as np
import os
from bs4 import BeautifulSoup
import time

### Import the CSV file and load its data into a DataFrame

In [3]:
# Read the CSV with the delimiter selected by `sep` (0: semicolon, 1: comma)
if sep==0:
    DF = pd.read_csv(csv_file_path, header=0, delimiter=";") #, nrows=1000)
elif sep==1:
    DF = pd.read_csv(csv_file_path, header=0) #, nrows=1000)
else:
    raise ValueError("Please provide a valid value for sep (0 or 1)")

print(DF.shape)
DF.head(1)

(552439, 11)


Unnamed: 0.1,Unnamed: 0,xml_file_name,ucid,date,main_classification,further_classification,classification_ipcr,classification_cpc,abstract_lang_en_exist,description_lang_en_exist,claims_lang_en_exist
0,0,EP-2677851-A1.xml,EP-2677851-A1,20140101,,,A01B 79/02 20060101AFI20120911BHEP ...,A01B 79/005 20130101 LI20150420BHEP ...,1.0,1.0,1.0


### Identify the patent number and kind code, and append these fields to the initial DataFrame

In [4]:
DF['patent_number']=DF['xml_file_name'].str.split(".").str[0]
DF['patent_number']=DF['patent_number'].str.split("-").str[1:2]
DF['patent_number']=DF['patent_number'].str.join('')

DF['kind_code']=DF['xml_file_name'].str.split(".").str[0]
DF['kind_code']=DF['kind_code'].str.split("-").str[2:3]
DF['kind_code']=DF['kind_code'].str.join('')

DF['kind_code_letter']=DF['kind_code'].str[0]
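
The string operations above can be sketched on a single file name. This is a minimal pure-Python equivalent that assumes file names of the form `CC-number-kind.xml`, as in the sample row shown earlier; the function name `parse_xml_file_name` is hypothetical.

```python
def parse_xml_file_name(xml_file_name):
    """Split e.g. 'EP-2677851-A1.xml' into (patent_number, kind_code, kind_code_letter)."""
    stem = xml_file_name.split(".")[0]   # drop the extension
    parts = stem.split("-")              # [country code, number, kind code]
    patent_number = parts[1] if len(parts) > 1 else ""
    kind_code = parts[2] if len(parts) > 2 else ""
    return patent_number, kind_code, kind_code[:1]


parsed = parse_xml_file_name("EP-2677851-A1.xml")
print(parsed)
```

Names that do not follow the pattern yield empty strings here instead of raising an error.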

# Selection Criteria for Test Set (Part I)

### Filter patents based on the first and second criteria
Retain only the patents that:
1. Have all textual fields completed (i.e., abstract, description, and claims).
2. Belong to a B kind code (i.e., B1, B2, B3, B6, B8, B9).

In [5]:
# Find patent_numbers being "A" or "B" kind code
if kind_code_selection==0:
    DF_1=DF[DF['kind_code_letter']=='B']
elif kind_code_selection==1:
    DF_1=DF[DF['kind_code_letter']=='A']
else:
    raise ValueError("Please provide a valid value for kind_code_selection (0 or 1)")
    
DF_1 =DF_1.loc[:, ['patent_number']]

# Find patent_numbers having all textual fields completed in any of the kind code documents
DF_2=DF.copy()
DF_2['adc_exist']= None
DF_2['abstract_lang_en_exist'] = DF_2['abstract_lang_en_exist'].replace({0:np.nan})
DF_2['description_lang_en_exist'] = DF_2['description_lang_en_exist'].replace({0:np.nan})
DF_2['claims_lang_en_exist'] = DF_2['claims_lang_en_exist'].replace({0:np.nan})
DF_2=DF_2.groupby('patent_number').agg({'abstract_lang_en_exist':'last', 'description_lang_en_exist': 'last', \
                                     'claims_lang_en_exist':'last', 'patent_number':'last'})
DF_2 = DF_2.reset_index(drop=True)
DF_2['adc_exist']=DF_2['abstract_lang_en_exist']+DF_2['description_lang_en_exist']+DF_2['claims_lang_en_exist']
DF_2=DF_2[DF_2['adc_exist']==3]
DF_2 =DF_2.loc[:, ['patent_number']]

# Find patent_numbers satisfying both above criteria
# Specifically, merge the two dataframes and since a patent_number may appear more than one time, delete duplicates.
DF_12 = pd.merge(DF_1, DF_2, on=['patent_number'])
DF_12 = DF_12.drop_duplicates(subset = ["patent_number"])
DF_12=DF_12.sort_values(by = 'patent_number', ascending=True)

# Create a list with the detected patent_numbers and remove from the initial dataframe the patent_numbers not listed in this list
doc_number_list=DF_12['patent_number'].tolist()
DF_subpart = DF[DF['patent_number'].isin(doc_number_list)]

# In case of more than one kind codes for a patent, group them and keep the latest non empty field
DF_subpart_merged=DF_subpart.groupby('patent_number').agg({'xml_file_name':'last', 'ucid':'last', 'date':'last', 'patent_number':'last'})
DF_subpart_merged = DF_subpart_merged.reset_index(drop=True)
print("Number of patent documents:",DF_subpart.shape, "Number of single patents:", DF_subpart_merged.shape)

Number of patent documents: (19313, 14) Number of single patents: (9261, 4)


### Filter patents based on the third criterion
Retain only the patent documents that:
- Were submitted in the last quarter of 2015 (i.e., on or after October 1, 2015).

In [6]:
DF_subpart_2=DF_subpart[DF_subpart['date']>date_selection]
DF_subpart_2_merged=DF_subpart_merged[DF_subpart_merged['date']>date_selection]
print("Number of patent documents:",DF_subpart_2.shape, "Number of single patents:", DF_subpart_2_merged.shape)

Number of patent documents: (2855, 14) Number of single patents: (2847, 4)
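
The date comparison above works on eight-digit `YYYYMMDD` integers. The sketch below, with hypothetical helper names, shows why the default threshold `20151000` combined with a strict `>` keeps documents dated October 1, 2015 onward. Note that the notebook applies no upper bound to the date.

```python
from datetime import date


def passes_date_filter(date_value, date_selection=20151000):
    """Mirror the notebook's strict comparison on YYYYMMDD integers."""
    return int(date_value) > date_selection


def to_calendar_date(date_value):
    """Convert a YYYYMMDD integer or string into a datetime.date for readability."""
    text = str(date_value)
    return date(int(text[:4]), int(text[4:6]), int(text[6:8]))


for candidate in (20150930, 20151001, 20151231):
    print(to_calendar_date(candidate), passes_date_filter(candidate))
```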


# Textual Data Retrieval (Part II)

Once we have the list of patent numbers that satisfy the selection criteria in Part I, the next step is to retrieve the corresponding full-text data (abstract, description, claims, title) from the original WPI collection. 
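
The traversal can be sketched compactly with `os.walk` and a set lookup. This is an alternative to the explicit nested loops in the cell below and not the notebook's own code; it assumes that every XML file sits somewhere beneath the vertical's root folder. The function name `find_matching_files` and the folder names in the demo are hypothetical.

```python
import os
import tempfile


def find_matching_files(root, wanted_numbers):
    """Return the paths of all files whose patent number is in wanted_numbers."""
    wanted = set(wanted_numbers)  # set membership is O(1), unlike a list
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            parts = name.split(".")[0].split("-")
            if len(parts) >= 2 and parts[1] in wanted:
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


# Demo on a throwaway directory tree with made-up folder names
with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, "EP", "000001", "78", "84", "55")
    os.makedirs(nested)
    for name in ("EP-1788455-A1.xml", "EP-1788455-B1.xml", "EP-9999999-B1.xml"):
        open(os.path.join(nested, name), "w").close()
    found = [os.path.basename(p) for p in find_matching_files(root, ["1788455"])]

print(found)
```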

In [7]:
# Count the time
start_time = time.time()
df_sum_test = pd.DataFrame()
counter_sum=0

DF_patent_number_list=DF_subpart_2_merged['patent_number'].tolist()

for folder_level_1 in os.listdir(vertical_origin_path): #CC
    for folder_level_2 in os.listdir(vertical_origin_path+"/"+folder_level_1): #nnnnnn
        for folder_level_3 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2): #nn
            for folder_level_4 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3): #nn
                for folder_level_5 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4): #nn                                        
                    for folder_level_6 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4+"/"+folder_level_5): #nn                                        
                        for files in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4+"/"+folder_level_5+"/"+folder_level_6): #nn                                        

                            counter_sum=counter_sum+1                    
                            if counter_sum%100000==0:
                                print(counter_sum)
                            files_proc=files.split(".")[0]
                            
                            try:
                                doc_number_proc=files_proc.split("-")[1]
                            except Exception:
                                doc_number_proc=''
                                print("Exception with doc-number", files)
                                
                            if doc_number_proc in DF_patent_number_list:                                                        
                                
                                title_en_text  = ''     
                                abstract_en_text   = ''                                        
                                description_en_text = ''                                                   
                                claims_en_text = ''

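                                # Read the XML file and close the handle as soon as the content is loaded
                                with open(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4+"/"+folder_level_5+"/"+folder_level_6+"/"+files,'r',encoding='utf-8') as xml_file:
                                    content = xml_file.read()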
                                soup = BeautifulSoup(content, 'xml')
                                document_info = soup.find_all("patent-document")  

                                try:
                                    ucid=document_info[0]['ucid']
                                except Exception:
                                    ucid=''
                                    print("Exception 1, ucid does not exist", files)
                                                       
                                try:
                                    date=document_info[0]['date']
                                except Exception:
                                    date=''
                                    print("Exception 2, date does not exist", files)
                                                                
                                title_en=soup.find('invention-title', attrs={'lang':'EN'})
                                if title_en is not None:
                                    title_en_text=title_en.getText(separator='\n ')

                                abstract_en=soup.find('abstract', attrs={'lang':'EN'})
                                if abstract_en is not None:
                                    abstract_en_text=abstract_en.getText(separator='\n ')

                                description_en = soup.find('description', attrs={'lang':'EN'})
                                if description_en is not None:
                                    description_en_text=description_en.getText(separator='\n ')

                                claims_en = soup.find('claims', attrs={'lang':'EN'})
                                if claims_en is not None:
                                    claims_en_text=claims_en.getText(separator='\n ')
                                        
                                df_sum_test.loc[counter_sum-1, 'xml_file_name']=files
                                df_sum_test.loc[counter_sum-1, 'ucid']=ucid
                                df_sum_test.loc[counter_sum-1, 'date']=date
                                df_sum_test.loc[counter_sum-1, 'title_lang_en']=title_en_text
                                df_sum_test.loc[counter_sum-1, 'abstract_lang_en']=abstract_en_text
                                df_sum_test.loc[counter_sum-1, 'description_lang_en']=description_en_text
                                df_sum_test.loc[counter_sum-1, 'claims_lang_en']=claims_en_text

100000
200000
300000
400000
500000


### Identify the patent number and kind code, and append these fields to the retrieved DataFrame

In [8]:
df_sum_test['patent_number']=df_sum_test['xml_file_name'].str.split(".").str[0]
df_sum_test['patent_number']=df_sum_test['patent_number'].str.split("-").str[1:2]
df_sum_test['patent_number']=df_sum_test['patent_number'].str.join('')

df_sum_test['kind_code']=df_sum_test['xml_file_name'].str.split(".").str[0]
df_sum_test['kind_code']=df_sum_test['kind_code'].str.split("-").str[2:3]
df_sum_test['kind_code']=df_sum_test['kind_code'].str.join('')

df_sum_test['kind_code_letter']=df_sum_test['kind_code'].str[0]

### When multiple kind codes exist for a single patent, we merge the most updated information for the selected fields.

In [9]:
df_sum_test= df_sum_test.replace('', pd.NA)
df_sum_test=df_sum_test.groupby('patent_number').agg({'patent_number':'last', 'kind_code':'last', 'title_lang_en':'last', \
                            'abstract_lang_en':'last', 'description_lang_en':'last', 'claims_lang_en': 'last'})
df_sum_test = df_sum_test.reset_index(drop=True)

# Brief Description, Summary, and First Claim Extraction (Part III)

In [13]:
import re

def keep_only_letters_numbers(text):
    # Drop every character that is not a letter, digit, or whitespace
    return re.sub(r"[^a-zA-Z0-9\s]", "", text)

#Load the HUPD headings file
summary_headings_file='F:/data/exports/summary_headings.csv'
summary_headings = pd.read_csv(summary_headings_file, header=0)
summary_headings=summary_headings['summary_headings'].tolist()

# Initialize lists for storing extracted parts
brief_list, summary_list, claim_list = [], [], []

# Iterate through each row in the DataFrame
for i, description in enumerate(df_sum_test['description_lang_en']):
    summary_heading_flag = 0
    brief_help, summary_help = [], []

    # Split description into lines and process each line
    for line in description.split("\n"):
        brief_help.append(line)

        stripped_line = " ".join(line.split()).lower()
        stripped_line = keep_only_letters_numbers(stripped_line)
        #summary flag equal to 0 and still checking for the summary section
        if summary_heading_flag == 0:
            if stripped_line in summary_headings:
                summary_heading_flag = 1
                summary_help.append(" ".join(line.split()))

        #summary flag equal to 1, so we have found the summary section and search to find where it closes
        else:
            match = re.search(r"[A-Z][^a-z]* ", line)
            if match and "description" in match.group(0).strip().lower():
                summary_heading_flag = 0
                break
            else:
                summary_help.append(" ".join(line.split()))
            

    # Store results for this row
    brief_list.append(" ".join(brief_help))
    summary_list.append(" ".join(summary_help))

    # Extract first claim from claims column
    claim = df_sum_test['claims_lang_en'][i].split(' 2.')[0] if ' 2.' in df_sum_test['claims_lang_en'][i] else ''
    claim_list.append(claim)

# Add extracted data to DataFrame
df_sum_test['brief_description'] = brief_list
df_sum_test['summary'] = summary_list
df_sum_test['1st_claim'] = claim_list

In [6]:
print("Out of", df_sum_test.shape[0], "single patents, we extracted", df_sum_test[df_sum_test['brief_description']!=""].shape[0],\
      "brief description segments,", df_sum_test[df_sum_test['summary']!=""].shape[0], "summary segments and", \
      df_sum_test[df_sum_test['1st_claim']!=""].shape[0], "first claims") 

Out of 2847 single patents, we extracted 2847 brief description segments, 1819 summary segments and 132 first claims


### Store single patents belonging to the test set 1

In [17]:
#Single patents
SMTSname="SMTSep_VP_" 
suffix=".csv"
df_sum_test.to_csv(destination_path+SMTSname+filename1+suffix, sep =';')

#  Additional Criteria for Test Set (Part IV) (Optional)

## Filtering patents without a distinct summary segment

In [19]:
df=df_sum_test[df_sum_test['summary']!=""]
df=df.reset_index(drop=True)
df.shape

(1819, 9)

### Store single patents belonging to the test set 2

In [20]:
#Single patents
SMTSname="SMTSep_VP_" 
suffix=".csv"
df.to_csv(destination_path+SMTSname+filename2+suffix, sep =';')

## Removing patents where the abstract has low similarity with the description and the summary segment

In [30]:
df1=df.copy()
df1['number_of_words']=None

for i in range(0, df1.shape[0]):
    summary_valid=df1['summary'][i]
    df1.loc[i, 'number_of_words']=len(summary_valid.split())
df1.shape

(1819, 10)

In [None]:
#!pip install bert-extractive-summarizer
#!pip install -U sentence-transformers

In [22]:
from sklearn.metrics.pairwise import cosine_similarity
import transformers
from transformers import AutoTokenizer, BertModel
import torch
from summarizer import Summarizer
from summarizer.sbert import SBertSummarizer

tokenizer = AutoTokenizer.from_pretrained('anferico/bert-for-patents')
model = BertModel.from_pretrained('anferico/bert-for-patents')

model2 = SBertSummarizer('paraphrase-MiniLM-L6-v2')



In [31]:
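# Cap the description at its first 350 words when the summary segment exceeds 350 words.
# Note: number_of_words counts the words of the summary segment, not of the description.
for i in range(df1.shape[0]):
    if df1["number_of_words"][i]>350:
        #extract_summary=model2(df1['description_lang_en'][i])
        extract_summary=df1['description_lang_en'][i].split()[:350]
        extract_summary=" ".join(extract_summary)
    else:
        extract_summary=df1['description_lang_en'][i]
        
    df1.loc[i, 'processed_description']=extract_summary
df1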

Unnamed: 0,patent_number,kind_code,title_lang_en,abstract_lang_en,description_lang_en,claims_lang_en,brief_description,summary,1st_claim,number_of_words,processed_description
0,1788455,B1,Method and system for improved control of xero...,A system changes the setpoint of a digital rep...,BACKGROUND AND SUMMARY\n Digital reprographic ...,A system to control image quality for a laser ...,BACKGROUND AND SUMMARY Digital reprographic s...,BACKGROUND AND SUMMARY Digital reprographic sy...,,653,BACKGROUND AND SUMMARY Digital reprographic sy...
1,1798380,B1,Turbine nozzle with spline seal,A method for assembling a gas turbine engine (...,BACKGROUND OF THE INVENTION\n This invention r...,A turbine nozzle assembly (202) for a gas turb...,BACKGROUND OF THE INVENTION This invention re...,BRIEF SUMMARY OF THE INVENTION In a further as...,,26,BACKGROUND OF THE INVENTION\n This invention r...
2,1847773,B1,INTEGRATED FLUIDIZED BED ASH COOLER,An integrated fluidized bed ash cooler for a f...,FIELD OF THE INVENTION\n The present invention...,A fluidized bed ash cooler (100) for cooling b...,FIELD OF THE INVENTION The present invention ...,SUMMARY OF THE INVENTION Aspects of the invent...,,460,FIELD OF THE INVENTION The present invention r...
3,1898530,B1,"Communication system, communication apparatus,...",A communication system includes the following ...,CROSS REFERENCES TO RELATED APPLICATIONS\n The...,An electric-field-coupling antenna used in com...,CROSS REFERENCES TO RELATED APPLICATIONS The ...,SUMMARY OF THE INVENTION It is desirable to pr...,,2146,CROSS REFERENCES TO RELATED APPLICATIONS The p...
4,1918964,B1,Method for measuring information transfer limi...,A crystal thin film is adopted as a specimen f...,BACKGROUND OF THE INVENTION\n (1) Field of the...,A method for measuring an information transfer...,BACKGROUND OF THE INVENTION (1) Field of the ...,"SUMMARY OF THE INVENTION Accordingly, an objec...",,2619,BACKGROUND OF THE INVENTION (1) Field of the I...
...,...,...,...,...,...,...,...,...,...,...,...
1814,2888825,B1,BEAMFORMING,The embodiments herein relate to a method in a...,TECHNICAL FIELD\n Embodiments herein relate ge...,A method in a transmitter (201) for transmitti...,TECHNICAL FIELD Embodiments herein relate gen...,SUMMARY An objective of embodiments herein is ...,,786,TECHNICAL FIELD Embodiments herein relate gene...
1815,2890616,B1,A CARTON FOR PACKING AND A METHOD FOR PACKING ...,The invention relates to a carton for packing ...,FIELD OF INVENTION\n The present invention rel...,"A carton (1) for packing that, when is in a fl...",FIELD OF INVENTION The present invention rela...,SUMMARY OF THE INVENTION The aim of the presen...,,463,FIELD OF INVENTION The present invention relat...
1816,2897464,B1,EDIBLE WATER-IN-OIL EMULSION AND A PROCESS FOR...,The invention relates to an edible water-in-oi...,Field of invention\n The present invention rel...,A process for the manufacture of an edible wat...,Field of invention The present invention rela...,Summary of the invention The inventors have fo...,A process for the manufacture of an edible wat...,7857,Field of invention The present invention relat...
1817,2904757,B1,THROTTLING A MEDIA STREAM FOR TRANSMISSION VIA...,"A method of throttling a media stream, compris...",Technical field\n The invention relates to a m...,A method (700) of throttling a media stream (3...,Technical field The invention relates to a me...,Summary It is an object of the invention to pr...,,7821,Technical field The invention relates to a met...


In [None]:
df1['abstract_lang_en']=df1['abstract_lang_en'].fillna("")

similarity_list1 = []
similarity_list2 = []

for k in range(df1.shape[0]):
    print(k)
    summary=df1['summary'][k].split()
    summary=" ".join(summary)

    abstract=df1['abstract_lang_en'][k].split()
    abstract=" ".join(abstract)

    descr=df1['processed_description'][k]
    
    tokens1 = tokenizer.tokenize(summary)
    tokens1 = ['[CLS]'] + tokens1[0:500] + ['[SEP]']
    tokens2 = tokenizer.tokenize(abstract)
    tokens2 = ['[CLS]'] + tokens2[0:500] + ['[SEP]']
    tokens3 = tokenizer.tokenize(descr)
    tokens3 = ['[CLS]'] + tokens3[0:500] + ['[SEP]']

    # Convert tokens to input IDs
    input_ids1 = torch.tensor(tokenizer.convert_tokens_to_ids(tokens1)).unsqueeze(0)  # Batch size 1
    input_ids2 = torch.tensor(tokenizer.convert_tokens_to_ids(tokens2)).unsqueeze(0)  # Batch size 1
    input_ids3 = torch.tensor(tokenizer.convert_tokens_to_ids(tokens3)).unsqueeze(0)  # Batch size 1

    # Obtain the BERT embeddings
    with torch.no_grad():
        outputs1 = model(input_ids1)
        outputs2 = model(input_ids2)
        outputs3 = model(input_ids3)

        embeddings1 = outputs1.last_hidden_state[:, 0, :]  # [CLS] token
        embeddings2 = outputs2.last_hidden_state[:, 0, :]  # [CLS] token           
        embeddings3 = outputs3.last_hidden_state[:, 0, :]  # [CLS] token

    # Calculate similarity
    similarity_score1 = cosine_similarity(embeddings2, embeddings1)
    similarity_score2 = cosine_similarity(embeddings2, embeddings3)

    #print(similarity_score,k, df.shape[0])
    similarity_list1.append(similarity_score1[0][0])
    similarity_list2.append(similarity_score2[0][0])

df1['similarity_abs_sum']=similarity_list1
df1['similarity_abs_descr']=similarity_list2

0


In [None]:
df1=df1[(df1['similarity_abs_sum']>0.70)&(df1['similarity_abs_descr']>0.70)]
df1.shape

### Store single patents belonging to the test set 3

In [None]:
#Single patents
SMTSname="SMTSep_VP_" 
suffix=".csv"
# Store the similarity-filtered frame (df1); df still holds test set 2
df1.to_csv(destination_path+SMTSname+filename3+suffix, sep =';')