*Licensed under the MIT License. See LICENSE-CODE in the repository root for details.*

*Copyright (c) 2025 Eleni Kamateri*

### Generating Testing Datasets for Classification Test Sets

This script creates testing datasets corresponding to the classification test sets (CLTS) by retrieving and structuring patent data from the WPI collection.

#### Process Overview

1. **Load the Classification Test Set:**
Load patent numbers of the CLTS of interest.

2. **Retrieve Full-Text Data from the WPI Collection:**
Retrieve the full-text patent data from the original WPI collection for the patent numbers listed in the CLTS. The extracted full-text fields are:

- Abstract
- Description
- Claims
- Title
    
(Optional) Depending on the use case, the script may also extract additional metadata, such as:

- Application Date
- Publication Date
- Patent Inventor/Applicant
- Kind Code
      
3. **Merge and Structure the Dataset:**
This step merges different kind codes of the same patent, retaining the latest information for each field.

4. **Process the final CLTS dataset for the classification experiment:**
This step includes column renaming, label formatting, and truncation. Labels are converted to subclass format (e.g., "G06F 17/30" → "G06F"), and the Description and Claims sections are truncated to their first 300 words.
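The label conversion in step 4 can be sketched as a small helper. This is a minimal illustration, not the notebook's exact code: the function name `to_subclasses` is hypothetical, and it assumes the labels arrive as one comma-separated string of IPC codes. It sorts the result so that the output order is reproducible across runs.

```python
def to_subclasses(label_field: str) -> str:
    """Reduce a comma-separated string of IPC codes to unique 4-character subclasses.

    Hypothetical helper: assumes each code starts with its subclass, e.g. "G06F 17/30".
    """
    subclasses = {
        code.strip()[:4]              # "G06F 17/30" -> "G06F"
        for code in label_field.split(",")
        if code.strip()               # ignore empty fragments
    }
    # Sort so the joined string does not depend on set iteration order
    return ",".join(sorted(subclasses))


print(to_subclasses("G06F 17/30, G06F 3/048, H04L 29/06"))
```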

#### Configurable Parameters

Researchers can modify the following parameters to customize the test set generation:

**classification_test_set_path** – Path to the classification test set file. 

**vertical_origin_path** – Path to the core vertical of the WPI dataset. 

        Example: "/YOUR_PATH/WPI-Dataset/EP/". 

**destination_path** – Path to the folder where the generated files will be stored. 

**sep** – Defines the separator used in the CSV file:

        0: Semicolon (;)
        
        1: Comma (,)
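One way the numeric `sep` option could be translated into an actual separator character is sketched below. The names `SEPARATORS` and `resolve_separator` are hypothetical and do not appear in the notebook; only the mapping 0 → semicolon and 1 → comma comes from the parameter description above.

```python
# Mapping taken from the parameter description: 0 = semicolon, 1 = comma
SEPARATORS = {0: ";", 1: ","}


def resolve_separator(sep: int) -> str:
    """Return the CSV separator character for a numeric `sep` option (hypothetical helper)."""
    try:
        return SEPARATORS[sep]
    except KeyError:
        raise ValueError(f"sep must be 0 (semicolon) or 1 (comma), got {sep!r}") from None


print(resolve_separator(0))
```

The returned character could then be passed as the `sep` argument of `DataFrame.to_csv`.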
        
        
The code below creates the dataset for the classification test set 1 for the #EP core vertical and IPCR labels, referred to as [#CLTSep_VP_ipcr_1.csv](https://github.com/cs1msa/WPIplus/blob/main/Ground%20Truths/Classification/%23CLTSep/%23CLTSep-ipcr%7Bfilter%3A%20B%2C%20all%2C%2020151001%7D/CLTSep_VP_ipcr_1.csv).

### Set the required parameters for the script

In [15]:
classification_test_set_path="/YOUR_PATH/CLTSep_VP_ipcr_1.csv"
vertical_origin_path="/YOUR_PATH/WPI-Dataset/EP/"
destination_path="/YOUR_PATH/WPI-Dataset/"

### Import all required libraries for the script

In [16]:
import numpy as np
import pandas as pd
import os
from bs4 import BeautifulSoup
import time

### Textual Data Retrieval 

In [17]:
# Count the time
start_time = time.time()
df_class_test = pd.DataFrame()
counter_class=0

DF = pd.read_csv(classification_test_set_path, header=0)
DF_doc_number_list=set(DF['patent_number'].tolist())  # a set gives constant-time membership tests in the loop below

for folder_level_1 in os.listdir(vertical_origin_path): #CC
    for folder_level_2 in os.listdir(vertical_origin_path+"/"+folder_level_1): #nnnnnn
        for folder_level_3 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2): #nn
            for folder_level_4 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3): #nn
                for folder_level_5 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4): #nn                                        
                    for folder_level_6 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4+"/"+folder_level_5): #nn                                        
                        for files in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4+"/"+folder_level_5+"/"+folder_level_6): #nn                                        
                            
                            counter_class=counter_class+1
                            
                            if counter_class%100000==0:
                                print(counter_class)

                            files_proc=files.split(".")[0]
                            
                            try:
                                doc_number_proc=files_proc.split("-")[1]
                                doc_number_proc=int(doc_number_proc)                                                            
                            except Exception:
                                doc_number_proc=''
                                print("Exception with doc-number", files)
                                                                                                                
                            if doc_number_proc in DF_doc_number_list:                                                        
                                
                                title_en_text  = ''     
                                abstract_en_text   = ''                                        
                                description_en_text = ''                                                   
                                claims_en_text = ''

                                with open(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4+"/"+folder_level_5+"/"+folder_level_6+"/"+files,'r',encoding='utf-8') as xml_file:
                                    content = xml_file.read()  # the file is closed when the block exits
                                soup = BeautifulSoup(content, 'xml')
                                document_info = soup.find_all("patent-document")  

                                try:
                                    ucid=document_info[0]['ucid']
                                except Exception:
                                    ucid=''
                                    print("Exception 1, ucid does not exist")
                                                       
                                try:
                                    date=document_info[0]['date']
                                except Exception:
                                    date=''
                                    print("Exception 2, date does not exist")
                                
                                code_main=''
                                for main_classification in soup.find_all('main-classification'):
                                    code_main=main_classification.getText()      

                                # Collect unique classification codes, preserving first-seen order
                                further_group_help=[]
                                for further_classification in soup.find_all('further-classification'):
                                    code_further=further_classification.getText()
                                    if code_further not in further_group_help:
                                        further_group_help.append(code_further)
                                further_group_list = ", ".join(further_group_help)

                                ipcr_group_help=[]
                                for classification_ipcr in soup.find_all('classification-ipcr'):
                                    code_ipcr=classification_ipcr.getText()
                                    if code_ipcr not in ipcr_group_help:
                                        ipcr_group_help.append(code_ipcr)
                                ipcr_group_list = ", ".join(ipcr_group_help)

                                cpc_group_help=[]
                                for classification_cpc in soup.find_all('classification-cpc'):
                                    code_cpc=classification_cpc.getText()
                                    if code_cpc not in cpc_group_help:
                                        cpc_group_help.append(code_cpc)
                                cpc_group_list = ", ".join(cpc_group_help)
                                
                                title_en=soup.find('invention-title', attrs={'lang':'EN'})
                                if title_en is not None:
                                    title_en=title_en.getText().replace('"', ' ')
                                    title_en=" ".join(title_en.split())
                                    title_en_text=title_en

                                abstract_en=soup.find('abstract', attrs={'lang':'EN'})
                                if abstract_en is not None:
                                    abstract_en=abstract_en.getText().replace('"', ' ')
                                    abstract_en=" ".join(abstract_en.split())
                                    abstract_en_text=abstract_en
                                                
                                description_en = soup.find('description', attrs={'lang':'EN'})
                                if description_en is not None:
                                    description_en=description_en.getText().replace('"', ' ')
                                    description_en=" ".join(description_en.split())
                                    description_en_text=description_en
                                                                
                                claims_en = soup.find('claims', attrs={'lang':'EN'})
                                if claims_en is not None:
                                    claims_en=claims_en.getText().replace('"', ' ')
                                    claims_en=" ".join(claims_en.split())                          
                                    claims_en_text=claims_en
                                
                                df_class_test.loc[counter_class-1, 'xml_file_name']=files
                                df_class_test.loc[counter_class-1, 'ucid']=ucid
                                df_class_test.loc[counter_class-1, 'date']=date
                                df_class_test.loc[counter_class-1, 'main_classification']=code_main                                    
                                df_class_test.loc[counter_class-1, 'further_classification']=further_group_list                                        
                                df_class_test.loc[counter_class-1, 'classification_ipcr']=ipcr_group_list
                                df_class_test.loc[counter_class-1, 'classification_cpc']=cpc_group_list 
                                df_class_test.loc[counter_class-1, 'title_lang_en']=title_en_text
                                df_class_test.loc[counter_class-1, 'abstract_lang_en']=abstract_en_text
                                df_class_test.loc[counter_class-1, 'description_lang_en']=description_en_text
                                df_class_test.loc[counter_class-1, 'claims_lang_en']=claims_en_text                                                       

100000
200000
300000
400000
500000
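The nested directory loops in the retrieval cell could also be written with `os.walk`, which visits every sub-folder regardless of depth. The sketch below is an alternative under stated assumptions, not the notebook's method: the function name `iter_matching_files` is hypothetical, and, as in the original code, the document number is taken to be the second hyphen-separated field of the file name (the file name `EP-1628142-A1.xml` in the comment is only an illustrative example of that pattern). Unlike the original loops, it does not require exactly six folder levels.

```python
import os


def iter_matching_files(root, wanted_numbers):
    """Yield paths of files under `root` whose document number is in `wanted_numbers`.

    Hypothetical helper: expects names shaped like "EP-1628142-A1.xml".
    """
    wanted = set(wanted_numbers)  # set membership tests are constant time
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            parts = name.split(".")[0].split("-")
            try:
                doc_number = int(parts[1])
            except (IndexError, ValueError):
                continue  # skip files that do not follow the expected pattern
            if doc_number in wanted:
                yield os.path.join(dirpath, name)
```

Each yielded path could then be opened and parsed with BeautifulSoup exactly as in the cell above.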


In [18]:
df_class_test.shape

(6181, 11)

In [None]:
df_class_test['patent_number']=df_class_test['xml_file_name'].str.split(".").str[0]
df_class_test['patent_number']=df_class_test['patent_number'].str.split("-").str[1:2]
df_class_test['patent_number']=df_class_test['patent_number'].str.join('')

### When multiple kind codes exist for a single patent, merge the most recent information for the selected fields

In [20]:
df_class_test= df_class_test.replace('', pd.NA)
df_class_test=df_class_test.groupby('patent_number').agg({'patent_number':'last', 'classification_ipcr':'last', 'title_lang_en':'last', \
                            'abstract_lang_en':'last', 'description_lang_en':'last', 'claims_lang_en': 'last'})
df_class_test = df_class_test.reset_index(drop=True)

In [21]:
df_class_test.shape

(2847, 6)
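The merge relies on the `'last'` aggregation, which returns the last non-missing value in each group, so a field that is empty in one kind code falls back to the value from another. A minimal sketch on toy data follows; the patent numbers, kind codes, and texts are invented for illustration. Because `'last'` depends on row order, and `os.listdir` returns entries in arbitrary order, the sketch sorts by kind code first, under the assumption that a lexically later kind code (e.g. B1 after A1) is the more recent publication.

```python
import pandas as pd

# Toy rows: two kind codes for one patent, one kind code for another (invented data)
rows = pd.DataFrame({
    "patent_number": ["1000001", "1000001", "1000002"],
    "kind_code": ["B1", "A1", "A1"],
    "title_lang_en": ["Granted title", "Published title", "Only title"],
    "abstract_lang_en": ["", "Published abstract", "Only abstract"],
})

# Treat empty strings as missing so that 'last' can skip them
rows = rows.replace("", pd.NA)

# Assumption: sorting by kind code puts the most recent publication last
rows = rows.sort_values(["patent_number", "kind_code"])

merged = rows.groupby("patent_number", as_index=False).agg(
    {"title_lang_en": "last", "abstract_lang_en": "last"}
)
print(merged)
```

For patent `1000001`, the title comes from the B1 row, while the abstract, which is missing in B1, comes from the A1 row.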

### Process the final dataset

In [8]:
# Rename
df_class_test=df_class_test.rename(columns={'classification_ipcr': 'labels'})

# Split the 'labels' column by commas
df_class_test['labels'] = df_class_test['labels'].str.split(',')

# Extract the first part of each label after splitting by '/'
df_class_test['labels'] = df_class_test['labels'].apply(lambda x: [item.split('/')[0] for item in x])
df_class_test['labels'] = df_class_test['labels'].apply(lambda x: [item.replace(" ", "") for item in x])
df_class_test['labels'] = df_class_test['labels'].apply(lambda x: [item.strip() for item in x])
df_class_test['labels'] = df_class_test['labels'].apply(lambda x: [item[0:4]for item in x])
df_class_test['labels'] = df_class_test['labels'].apply(lambda x: set(x))
df_class_test['labels'] = df_class_test['labels'].apply(lambda x: ','.join(x))
df_class_test

| | patent_number | labels | title_lang_en | abstract_lang_en | description_lang_en | claims_lang_en |
|---|---|---|---|---|---|---|
| 0 | 1628142 | G01T,C09K,G21K | SCINTILLATOR COMPOSITIONS, RELATED PROCESSES, ... | A scintillator composition of a halide perovsk... | The invention relates generally to materials a... | A scintillator composition, comprising the fol... |
| 1 | 1712735 | F01D,B23P | Method of repairing spline and seal teeth of a... | A method of repairing spline and seal teeth (2... | The present invention relates generally to rep... | A method of repairing a spline (20) of a mated... |
| 2 | 1777792 | H02K,H02M,H02H,H02P | Control and protection methodologies for a mot... | A motor control module designed to control ope... | Example embodiments in general relate to contr... | A power tool (10), comprising: a tool motor (1... |
| 3 | 1788455 | G03G | Method and system for improved control of xero... | A system changes the setpoint of a digital rep... | BACKGROUND AND SUMMARYDigital reprographic sys... | A system to control image quality for a laser ... |
| 4 | 1798380 | F01D | Turbine nozzle with spline seal | A method for assembling a gas turbine engine (... | BACKGROUND OF THE INVENTIONThis invention rela... | A turbine nozzle assembly (202) for a gas turb... |
| ... | ... | ... | ... | ... | ... | ... |
| 2842 | 2891613 | A47J,B65D | CAPSULE FOR PRODUCING A BEVERAGE | Beverage-making capsule comprising a lid (8) t... | OBJECT OF THE INVENTIONThe invention consists ... | Beverage-making capsule: - comprising one lid ... |
| 2843 | 2891615 | A47J,B65D | CAPSULE FOR A BEVERAGE MAKER | Capsule for beverage-making machines , compris... | OBJECT OF THE INVENTIONThe present invention, ... | Capsule for beverage-making machines comprisin... |
| 2844 | 2897464 | A23D | EDIBLE WATER-IN-OIL EMULSION AND A PROCESS FOR... | The invention relates to an edible water-in-oi... | Field of inventionThe present invention relate... | A process for the manufacture of an edible wat... |
| 2845 | 2904757 | H04L | THROTTLING A MEDIA STREAM FOR TRANSMISSION VIA... | A method of throttling a media stream, compris... | Technical fieldThe invention relates to a meth... | A method (700) of throttling a media stream (3... |


In [9]:
# Keep the first 300 words of the Description and Claims sections

num=df_class_test.shape[0]

for i in range(num):
    if i%1000==0:
        print("i:", i)
    help_para=df_class_test.loc[i, 'description_lang_en'].split()
    help_para=help_para[0:300]
    help_para=' '.join(help_para)
    # Write back to df_class_test (not DF, which holds the original test set)
    df_class_test.loc[i, 'description_lang_en']=help_para
for j in range(num):
    if j%1000==0:
        print("j:", j)
    help_para=df_class_test.loc[j, 'claims_lang_en'].split()
    help_para=help_para[0:300]
    help_para=' '.join(help_para)
    df_class_test.loc[j, 'claims_lang_en']=help_para

i: 0
i: 1000
i: 2000
j: 0
j: 1000
j: 2000
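The 300-word truncation can also be expressed as a helper applied to a whole column. This is a sketch rather than the notebook's code: the name `first_n_words` is hypothetical, and returning an empty string for missing values is an assumption made here so that patents without an English description or claims do not raise an error.

```python
import pandas as pd


def first_n_words(text, n=300):
    """Return the first `n` whitespace-separated words of `text` (hypothetical helper).

    Assumption: missing or non-string values are mapped to an empty string.
    """
    if not isinstance(text, str):
        return ""
    return " ".join(text.split()[:n])


# Toy column with one missing value (invented data)
sample = pd.Series(["one two three four five", pd.NA], dtype="object")
print(sample.apply(first_n_words, n=3).tolist())
```

Applied to the dataset, this would look like `df_class_test['claims_lang_en'].apply(first_n_words)`.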


In [13]:
df_class_test.to_csv(destination_path+"CLTSep_VP_ipcr_1_text_dataset.csv", sep =';')