*Licensed under the MIT License. See LICENSE-CODE in the repository root for details.*

*Copyright (c) 2025 Eleni Kamateri*

### Generating Training Datasets for Classification Test Sets

This script creates training datasets corresponding to the classification test sets (CLTS) by retrieving and structuring patent data from the WPI collection. The process mirrors the [CLTS Testing Dataset Creation script](https://github.com/cs1msa/WPIplus/blob/main/UsingWPI%2B/An%20example%20of%20a%20classification%20experiment%20workflow/Source%20Code/CLTS%20Testing%20Dataset%20Creation.ipynb), with one key difference: instead of retrieving the patents included in the CLTS, this script retrieves all patents that are NOT part of the CLTS.

#### Process Overview

1. **Load the Classification Test Set:**
Load patent numbers of the CLTS of interest.

2. **Retrieve Full-Text Data from WPI collection:**
This step involves retrieving the corresponding full-text patent data from the original WPI collection based on the patent numbers NOT identified in the CLTS. Extracted full-text data include:

- Abstract
- Description
- Claims
- Title
    
(Optional) Depending on the use case, the script may also extract additional metadata, such as:

- Application Date
- Publication Date
- Patent Inventor/Applicant
- Kind Code
      
3. **Merge and Structure the Dataset:**
This step merges different kind codes of the same patent, retaining the latest information for each field.

4. **Process the final CLTS dataset for the classification experiment:**
First, only patents with all textual fields populated are retained. The step then renames columns, formats labels, and truncates text: labels are converted to subclass format (e.g., "G06F 17/30" → "G06F"), and the Description and Claims sections are truncated to their first 300 words.
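
The filtering, label formatting, and truncation described in step 4 can be sketched in plain Python. This is a minimal illustration rather than the notebook's own implementation: the helper names `has_all_text`, `to_subclass`, and `first_n_words` and the short field names are hypothetical, and the sketch assumes that the subclass is the first four characters of a symbol such as "G06F 17/30".

```python
def has_all_text(record, fields=("title", "abstract", "description", "claims")):
    """True when every textual field is present and non-empty."""
    return all(record.get(field) for field in fields)


def to_subclass(label):
    """Reduce a classification symbol such as "G06F 17/30" to "G06F"."""
    return label.strip()[:4]


def first_n_words(text, n=300):
    """Keep the first n whitespace-separated words (line breaks become spaces)."""
    return " ".join(text.split()[:n])


labels = ["G06F 17/30", "H04W 16/18"]
print([to_subclass(label) for label in labels])  # ['G06F', 'H04W']
print(first_n_words("one two three four", n=2))  # one two
```

Splitting on whitespace is only one possible definition of a "word"; a tokenizer-based cut-off would give slightly different boundaries.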

#### Configurable Parameters

Researchers can modify the following parameters to customize the training set generation:

**classification_test_set_path** – Path to the classification test set file. 

**vertical_origin_path** – Path to the core vertical of the WPI dataset. 

        Example: "/YOUR_PATH/WPI-Dataset/EP/". 

**destination_path** – Path to the folder where the generated files will be stored. 

**sep** – Defines the separator used in the CSV file:

        0: Semicolon (;)
        
        1: Comma (,)
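
One way to turn the numeric `sep` option into an actual separator character is a small lookup, sketched here. The names `SEPARATORS` and `resolve_separator` are hypothetical and do not appear in the notebook.

```python
# Mapping of the documented option values to separator characters.
SEPARATORS = {0: ";", 1: ","}


def resolve_separator(sep):
    """Translate the numeric option (0 or 1) into the CSV separator."""
    try:
        return SEPARATORS[sep]
    except KeyError:
        raise ValueError(f"sep must be 0 (semicolon) or 1 (comma), got {sep!r}") from None


print(resolve_separator(0))  # ;
print(resolve_separator(1))  # ,
```

The returned character could then be passed as the `sep` argument of `DataFrame.to_csv`.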
        
        
The code below creates the training dataset for the classification test set 1 for the #EP core vertical and IPCR labels, referred to as [#CLTSep_VP_ipcr_1.csv](https://github.com/cs1msa/WPIplus/blob/main/Ground%20Truths/Classification/%23CLTSep/%23CLTSep-ipcr%7Bfilter%3A%20B%2C%20all%2C%2020151001%7D/CLTSep_VP_ipcr_1.csv).

### Set the required parameters for the script

In [1]:
classification_test_set_path="/PATH/WPI-Dataset/EP_US_WO_all_cpc_thres20_portion5_take1strategy.csv"
vertical_origin_path1="/PATH/WPI-Dataset/EP"
vertical_origin_path2="/PATH/WPI-Dataset/US"
vertical_origin_path3="/PATH/WPI-Dataset/WO"
destination_path="/PATH/WPI-Dataset/"

### Import all required libraries for the script

In [2]:
import numpy as np
import pandas as pd
import os
from bs4 import BeautifulSoup
import time

### Textual Data Retrieval 

In [3]:
df_class_train = pd.DataFrame()
counter_class=0

DF = pd.read_csv(classification_test_set_path, header=0)
# A set makes the per-file membership test in the loops below O(1) instead of O(n).
DF_doc_number_list=set(DF['patent_number'].tolist())

In [4]:
vertical_origin_path=vertical_origin_path1

for folder_level_1 in os.listdir(vertical_origin_path): #CC
    for folder_level_2 in os.listdir(vertical_origin_path+"/"+folder_level_1): #nnnnnn
        for folder_level_3 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2): #nn
            for folder_level_4 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3): #nn
                for folder_level_5 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4): #nn                                        
                    for folder_level_6 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4+"/"+folder_level_5): #nn                                        
                        for files in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4+"/"+folder_level_5+"/"+folder_level_6): #nn                                        

                            counter_class=counter_class+1
                            
                            if counter_class%1000==0:
                                print(counter_class)

                            files_proc=files.split(".")[0]
                            
                            try:
                                doc_number_proc=files_proc.split("-")[1]
                                doc_number_proc=int(doc_number_proc)                                
                            except Exception:
                                doc_number_proc=''
                                print("Exception with doc-number", files)
                                                                                                                
                            if doc_number_proc in DF_doc_number_list:                                                                                                                                                                                
                                
                                title_en_text  = ''     
                                abstract_en_text   = ''                                        
                                description_en_text = ''                                                   
                                claims_en_text = ''

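                                # Read the XML file and close the handle as soon as the content is loaded.
                                with open(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4+"/"+folder_level_5+"/"+folder_level_6+"/"+files,'r',encoding='utf-8') as xml_file:
                                    content = xml_file.read()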
                                soup = BeautifulSoup(content, 'xml')
                                document_info = soup.find_all("patent-document")  

                                try:
                                    ucid=document_info[0]['ucid']
                                except Exception:
                                    ucid=''
                                    print("Exception 1, ucid does not exist")
                                                       
                                try:
                                    date=document_info[0]['date']
                                except Exception:
                                    date=''
                                    print("Exception 2, date does not exist")
                                
                                #code_main=''
                                #for main_classification in soup.find_all('main-classification'):
                                    #code_main=main_classification.getText()      

                                #further_group_help=[]
                                #further_group_list=[]
                                #for further_classification in soup.find_all('further-classification'):
                                    #code_further=further_classification.getText()
                                    #further_group_help.append(code_further) if code_further not in further_group_help else further_group_help
                                #further_group_list = ", ".join(further_group_help)

                                #ipcr_group_help=[]
                                #ipcr_group_list=[]
                                #for classification_ipcr in soup.find_all('classification-ipcr'):
                                    #code_ipcr=classification_ipcr.getText()
                                    #ipcr_group_help.append(code_ipcr) if code_ipcr not in ipcr_group_help else ipcr_group_help
                                #ipcr_group_list = ", ".join(ipcr_group_help)
    
                                cpc_group_help=[]
                                cpc_group_list=[]
                                for classification_cpc in soup.find_all('classification-cpc'):
                                    code_cpc=classification_cpc.getText()
                                    cpc_group_help.append(code_cpc) if code_cpc not in cpc_group_help else cpc_group_help
                                cpc_group_list = ", ".join(cpc_group_help)
                                
                                title_en=soup.find('invention-title', attrs={'lang':'EN'})
                                if title_en is not None:
                                    title_en_text=title_en.getText('\n ')

                                abstract_en=soup.find('abstract', attrs={'lang':'EN'})
                                if abstract_en is not None:
                                    abstract_en_text=abstract_en.getText('\n ')

                                description_en = soup.find('description', attrs={'lang':'EN'})
                                if description_en is not None:
                                    description_en_text=description_en.getText('\n ')

                                claims_en = soup.find('claims', attrs={'lang':'EN'})
                                if claims_en is not None:
                                    claims_en_text=claims_en.getText('\n ')
                                
                                df_class_train.loc[counter_class-1, 'xml_file_name']=files
                                df_class_train.loc[counter_class-1, 'ucid']=ucid
                                df_class_train.loc[counter_class-1, 'date']=date
                                #df_class_train.loc[counter_class-1, 'main_classification']=code_main                                    
                                #df_class_train.loc[counter_class-1, 'further_classification']=further_group_list                                        
                                #df_class_train.loc[counter_class-1, 'classification_ipcr']=ipcr_group_list
                                df_class_train.loc[counter_class-1, 'classification_cpc']=cpc_group_list 
                                df_class_train.loc[counter_class-1, 'title_lang_en']=title_en_text
                                df_class_train.loc[counter_class-1, 'abstract_lang_en']=abstract_en_text
                                df_class_train.loc[counter_class-1, 'description_lang_en']=description_en_text
                                df_class_train.loc[counter_class-1, 'claims_lang_en']=claims_en_text                            

1000
2000
...
158000

In [5]:
df_class_train.shape

(15848, 8)
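
The seven nested `os.listdir` loops used for the traversal can be expressed more compactly with `os.walk`. The sketch below is an alternative, not the notebook's code: the generator name `iter_patent_files` is hypothetical, it visits files at every depth rather than only at the sixth folder level, and it assumes that only `.xml` files named like `EP-2677901-A1.xml` are of interest.

```python
import os
import tempfile


def iter_patent_files(root):
    """Yield (full_path, doc_number) for every XML file below root.

    doc_number is the integer between the first and second hyphen of the
    file name; files that do not follow that pattern are skipped.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".xml"):
                continue
            try:
                doc_number = int(name.split(".")[0].split("-")[1])
            except (IndexError, ValueError):
                continue
            yield os.path.join(dirpath, name), doc_number


# Throwaway directory tree standing in for the real collection (hypothetical layout).
with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, "EP", "002677", "90", "1")
    os.makedirs(nested)
    for name in ("EP-2677901-A1.xml", "EP-2677901-B1.xml", "notes.txt"):
        open(os.path.join(nested, name), "w").close()

    wanted = {2677901}  # a set gives constant-time membership tests
    found = sorted(
        os.path.basename(path)
        for path, number in iter_patent_files(root)
        if number in wanted
    )

print(found)  # ['EP-2677901-A1.xml', 'EP-2677901-B1.xml']
```

With such a generator, the same loop body could serve the EP, US, and WO verticals by iterating over the three root paths instead of repeating the cell.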

In [6]:
vertical_origin_path=vertical_origin_path2

for folder_level_1 in os.listdir(vertical_origin_path): #CC
    for folder_level_2 in os.listdir(vertical_origin_path+"/"+folder_level_1): #nnnnnn
        for folder_level_3 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2): #nn
            for folder_level_4 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3): #nn
                for folder_level_5 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4): #nn                                        
                    for folder_level_6 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4+"/"+folder_level_5): #nn                                        
                        for files in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4+"/"+folder_level_5+"/"+folder_level_6): #nn                                        

                            counter_class=counter_class+1
                            
                            if counter_class%1000==0:
                                print(counter_class)

                            files_proc=files.split(".")[0]
                            
                            try:
                                doc_number_proc=files_proc.split("-")[1]
                                doc_number_proc=int(doc_number_proc)                                
                            except Exception:
                                doc_number_proc=''
                                print("Exception with doc-number", files)
                                                                                                                
                            if doc_number_proc in DF_doc_number_list:                                                                                                                                                                                
                                
                                title_en_text  = ''     
                                abstract_en_text   = ''                                        
                                description_en_text = ''                                                   
                                claims_en_text = ''

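                                # Read the XML file and close the handle as soon as the content is loaded.
                                with open(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4+"/"+folder_level_5+"/"+folder_level_6+"/"+files,'r',encoding='utf-8') as xml_file:
                                    content = xml_file.read()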
                                soup = BeautifulSoup(content, 'xml')
                                document_info = soup.find_all("patent-document")  

                                try:
                                    ucid=document_info[0]['ucid']
                                except Exception:
                                    ucid=''
                                    print("Exception 1, ucid does not exist")
                                                       
                                try:
                                    date=document_info[0]['date']
                                except Exception:
                                    date=''
                                    print("Exception 2, date does not exist")
                                
                                #code_main=''
                                #for main_classification in soup.find_all('main-classification'):
                                    #code_main=main_classification.getText()      

                                #further_group_help=[]
                                #further_group_list=[]
                                #for further_classification in soup.find_all('further-classification'):
                                    #code_further=further_classification.getText()
                                    #further_group_help.append(code_further) if code_further not in further_group_help else further_group_help
                                #further_group_list = ", ".join(further_group_help)

                                #ipcr_group_help=[]
                                #ipcr_group_list=[]
                                #for classification_ipcr in soup.find_all('classification-ipcr'):
                                    #code_ipcr=classification_ipcr.getText()
                                    #ipcr_group_help.append(code_ipcr) if code_ipcr not in ipcr_group_help else ipcr_group_help
                                #ipcr_group_list = ", ".join(ipcr_group_help)
    
                                cpc_group_help=[]
                                cpc_group_list=[]
                                for classification_cpc in soup.find_all('classification-cpc'):
                                    code_cpc=classification_cpc.getText()
                                    cpc_group_help.append(code_cpc) if code_cpc not in cpc_group_help else cpc_group_help
                                cpc_group_list = ", ".join(cpc_group_help)
                                
                                title_en=soup.find('invention-title', attrs={'lang':'EN'})
                                if title_en is not None:
                                    title_en_text=title_en.getText('\n ')

                                abstract_en=soup.find('abstract', attrs={'lang':'EN'})
                                if abstract_en is not None:
                                    abstract_en_text=abstract_en.getText('\n ')

                                description_en = soup.find('description', attrs={'lang':'EN'})
                                if description_en is not None:
                                    description_en_text=description_en.getText('\n ')

                                claims_en = soup.find('claims', attrs={'lang':'EN'})
                                if claims_en is not None:
                                    claims_en_text=claims_en.getText('\n ')
                                
                                df_class_train.loc[counter_class-1, 'xml_file_name']=files
                                df_class_train.loc[counter_class-1, 'ucid']=ucid
                                df_class_train.loc[counter_class-1, 'date']=date
                                #df_class_train.loc[counter_class-1, 'main_classification']=code_main                                    
                                #df_class_train.loc[counter_class-1, 'further_classification']=further_group_list                                        
                                #df_class_train.loc[counter_class-1, 'classification_ipcr']=ipcr_group_list
                                df_class_train.loc[counter_class-1, 'classification_cpc']=cpc_group_list 
                                df_class_train.loc[counter_class-1, 'title_lang_en']=title_en_text
                                df_class_train.loc[counter_class-1, 'abstract_lang_en']=abstract_en_text
                                df_class_train.loc[counter_class-1, 'description_lang_en']=description_en_text
                                df_class_train.loc[counter_class-1, 'claims_lang_en']=claims_en_text                            

553000
554000
...
1757000


In [7]:
df_class_train.shape

(88957, 8)
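
The field extraction performed inside the loop can be illustrated on a single record. The sketch below swaps in the standard-library `xml.etree.ElementTree` parser in place of BeautifulSoup so that it runs without extra dependencies; the `SAMPLE` document and the helper `english_text` are hypothetical, and whitespace is normalised differently from the notebook's `getText('\n ')` call.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened record; real WPI files are far larger.
SAMPLE = """<patent-document ucid="EP-0000001-A1" date="20140101">
  <invention-title lang="DE">Beispiel</invention-title>
  <invention-title lang="EN">Example title</invention-title>
  <abstract lang="EN"><p>An example <b>abstract</b>.</p></abstract>
  <claims lang="EN"><claim>1. An example claim.</claim></claims>
</patent-document>"""


def english_text(root, tag):
    """Return the text of the first <tag lang="EN"> element, or '' if absent."""
    element = root.find(f".//{tag}[@lang='EN']")
    if element is None:
        return ""
    # Concatenate nested text nodes and collapse runs of whitespace.
    return " ".join("".join(element.itertext()).split())


root = ET.fromstring(SAMPLE)
record = {
    "ucid": root.get("ucid", ""),
    "date": root.get("date", ""),
    "title_lang_en": english_text(root, "invention-title"),
    "abstract_lang_en": english_text(root, "abstract"),
    "description_lang_en": english_text(root, "description"),
    "claims_lang_en": english_text(root, "claims"),
}
print(record["title_lang_en"])     # Example title
print(record["abstract_lang_en"])  # An example abstract.
```

Fields that are missing in the source document, such as the description in this sample, come back as empty strings, matching the defaults the notebook assigns before parsing.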

In [8]:
vertical_origin_path=vertical_origin_path3

for folder_level_1 in os.listdir(vertical_origin_path): #CC
    for folder_level_2 in os.listdir(vertical_origin_path+"/"+folder_level_1): #nnnnnn
        for folder_level_3 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2): #nn
            for folder_level_4 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3): #nn
                for folder_level_5 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4): #nn                                        
                    for folder_level_6 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4+"/"+folder_level_5): #nn                                        
                        for files in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4+"/"+folder_level_5+"/"+folder_level_6): #nn                                        

                            counter_class=counter_class+1
                            
                            if counter_class%1000==0:
                                print(counter_class)

                            files_proc=files.split(".")[0]
                            
                            try:
                                doc_number_proc=files_proc.split("-")[1]
                                doc_number_proc=int(doc_number_proc)                                
                            except Exception:
                                doc_number_proc=''
                                print("Exception with doc-number", files)
                                                                                                                
                            if doc_number_proc in DF_doc_number_list:                                                                                                                                                                                
                                
                                title_en_text  = ''     
                                abstract_en_text   = ''                                        
                                description_en_text = ''                                                   
                                claims_en_text = ''

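                                # Read the XML file and close the handle as soon as the content is loaded.
                                with open(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4+"/"+folder_level_5+"/"+folder_level_6+"/"+files,'r',encoding='utf-8') as xml_file:
                                    content = xml_file.read()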
                                soup = BeautifulSoup(content, 'xml')
                                document_info = soup.find_all("patent-document")  

                                try:
                                    ucid=document_info[0]['ucid']
                                except Exception:
                                    ucid=''
                                    print("Exception 1, ucid does not exist")
                                                       
                                try:
                                    date=document_info[0]['date']
                                except Exception:
                                    date=''
                                    print("Exception 2, date does not exist")
                                
                                #code_main=''
                                #for main_classification in soup.find_all('main-classification'):
                                    #code_main=main_classification.getText()      

                                #further_group_help=[]
                                #further_group_list=[]
                                #for further_classification in soup.find_all('further-classification'):
                                    #code_further=further_classification.getText()
                                    #further_group_help.append(code_further) if code_further not in further_group_help else further_group_help
                                #further_group_list = ", ".join(further_group_help)

                                #ipcr_group_help=[]
                                #ipcr_group_list=[]
                                #for classification_ipcr in soup.find_all('classification-ipcr'):
                                    #code_ipcr=classification_ipcr.getText()
                                    #ipcr_group_help.append(code_ipcr) if code_ipcr not in ipcr_group_help else ipcr_group_help
                                #ipcr_group_list = ", ".join(ipcr_group_help)
    
                                cpc_group_help=[]
                                cpc_group_list=[]
                                for classification_cpc in soup.find_all('classification-cpc'):
                                    code_cpc=classification_cpc.getText()
                                    cpc_group_help.append(code_cpc) if code_cpc not in cpc_group_help else cpc_group_help
                                cpc_group_list = ", ".join(cpc_group_help)
                                
                                title_en=soup.find('invention-title', attrs={'lang':'EN'})
                                if title_en is not None:
                                    title_en_text=title_en.getText('\n ')

                                abstract_en=soup.find('abstract', attrs={'lang':'EN'})
                                if abstract_en is not None:
                                    abstract_en_text=abstract_en.getText('\n ')

                                description_en = soup.find('description', attrs={'lang':'EN'})
                                if description_en is not None:
                                    description_en_text=description_en.getText('\n ')

                                claims_en = soup.find('claims', attrs={'lang':'EN'})
                                if claims_en is not None:
                                    claims_en_text=claims_en.getText('\n ')
                                
                                df_class_train.loc[counter_class-1, 'xml_file_name']=files
                                df_class_train.loc[counter_class-1, 'ucid']=ucid
                                df_class_train.loc[counter_class-1, 'date']=date
                                #df_class_train.loc[counter_class-1, 'main_classification']=code_main                                    
                                #df_class_train.loc[counter_class-1, 'further_classification']=further_group_list                                        
                                #df_class_train.loc[counter_class-1, 'classification_ipcr']=ipcr_group_list
                                df_class_train.loc[counter_class-1, 'classification_cpc']=cpc_group_list 
                                df_class_train.loc[counter_class-1, 'title_lang_en']=title_en_text
                                df_class_train.loc[counter_class-1, 'abstract_lang_en']=abstract_en_text
                                df_class_train.loc[counter_class-1, 'description_lang_en']=description_en_text
                                df_class_train.loc[counter_class-1, 'claims_lang_en']=claims_en_text                            

1951000
1952000
...
2075000


In [9]:
df_class_train.shape

(102388, 8)
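
The retrieval cells collect CPC codes by appending each code to a helper list only if it is not already present. An equivalent order-preserving deduplication can be sketched with `dict.fromkeys`; the function name `join_unique` and the sample codes are hypothetical.

```python
def join_unique(codes):
    """Join codes with ', ' after dropping repeats, keeping first-seen order."""
    return ", ".join(dict.fromkeys(codes))


codes = ["A47F 1/04", "B65D 83/00", "A47F 1/04"]  # hypothetical sample values
print(join_unique(codes))  # A47F 1/04, B65D 83/00
```

Because dictionaries preserve insertion order, the first occurrence of each code determines its position in the joined string.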

In [10]:
df_class_train['patent_number']=df_class_train['xml_file_name'].str.split(".").str[0]
df_class_train['patent_number']=df_class_train['patent_number'].str.split("-").str[1:2]
df_class_train['patent_number']=df_class_train['patent_number'].str.join('')

df_class_train['patent_office']=df_class_train['xml_file_name'].str[:2]

df_class_train.head(1)

| | xml_file_name | ucid | date | classification_cpc | title_lang_en | abstract_lang_en | description_lang_en | claims_lang_en | patent_number | patent_office |
|---|---|---|---|---|---|---|---|---|---|---|
| 40 | EP-2677901-A1.xml | EP-2677901-A1 | 20140101 | A47F 1/04 20130101 LA20151107BHEP ... | PRODUCT DISPENSING SYSTEM WITH PANEL GUIDE | A product dispensing system (10) is disclosed ... | PRODUCT DISPENSING SYSTEM WITH PANEL GUIDE \n... | 1. A product dispensing system comprising: \n ... | 2677901 | EP |


In [12]:
df_class_train= df_class_train.replace('', pd.NA)
df_class_train_=df_class_train.groupby('patent_number').agg({'ucid': lambda x: ','.join(map(str, x)), 'date':'first', \
                                    'classification_cpc':'last','title_lang_en':'last', 'abstract_lang_en':'last', \
                                    'description_lang_en': 'last', 'claims_lang_en':'last', \
                                    'patent_number':'last', 'patent_office': 'last'})
df_class_train_ = df_class_train_.reset_index(drop=True)



In [13]:
df_class_train_.shape, df_class_train_.dtypes

((97063, 9),
 ucid                   object
 date                   object
 classification_cpc     object
 title_lang_en          object
 abstract_lang_en       object
 description_lang_en    object
 claims_lang_en         object
 patent_number          object
 patent_office          object
 dtype: object)
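
The merging step relies on the pandas `groupby(...).agg(...)` call with `'first'` and `'last'`, which skip missing values; this is why empty strings are replaced by `pd.NA` beforehand. The same idea can be sketched without pandas. The function `merge_kind_codes` and the sample rows are hypothetical and only illustrate how the latest non-empty value of each field is retained across kind codes.

```python
from collections import defaultdict


def merge_kind_codes(rows, text_fields):
    """Merge rows sharing a patent_number, keeping the last non-empty value per field."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["patent_number"]].append(row)

    merged = []
    for number, group in grouped.items():
        record = {
            "patent_number": number,
            # All kind codes of the patent, joined in their original order.
            "ucid": ",".join(r["ucid"] for r in group),
            # First non-empty date, mirroring the 'first' aggregation.
            "date": next((r["date"] for r in group if r.get("date")), None),
        }
        for field in text_fields:
            values = [r[field] for r in group if r.get(field)]
            record[field] = values[-1] if values else None
        merged.append(record)
    return merged


rows = [  # hypothetical sample rows
    {"patent_number": "0000001", "ucid": "EP-0000001-A1", "date": "20130101",
     "abstract_lang_en": "Early abstract", "claims_lang_en": ""},
    {"patent_number": "0000001", "ucid": "EP-0000001-B1", "date": "20140101",
     "abstract_lang_en": "", "claims_lang_en": "1. A claim."},
]
result = merge_kind_codes(rows, text_fields=("abstract_lang_en", "claims_lang_en"))
print(result[0]["ucid"])              # EP-0000001-A1,EP-0000001-B1
print(result[0]["abstract_lang_en"])  # Early abstract
```

In this sample the B1 publication has no abstract, so the abstract of the A1 publication is kept, while the claims come from B1.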

In [None]:
#Load the parsing file

In [14]:
DF2_combined = pd.read_csv(classification_test_set_path, header=0)
DF2_combined.shape, DF2_combined.dtypes

((97063, 8),
 ucid                          object
 date                         float64
 labels                        object
 abstract_lang_en_exist       float64
 description_lang_en_exist    float64
 claims_lang_en_exist         float64
 patent_number                  int64
 patent_office                 object
 dtype: object)

In [15]:
df_class_train_['patent_number']=df_class_train_['patent_number'].astype(str)
DF2_combined['patent_number']=DF2_combined['patent_number'].astype(str)

DF = pd.merge(DF2_combined, df_class_train_, on=['patent_number'])

In [16]:
# Drop the duplicated and helper columns produced by the merge, then restore the original column names.
DF = DF.drop(columns=['ucid_y', 'date_y', 'classification_cpc', 'patent_office_x', 'patent_office_y',
                      'abstract_lang_en_exist', 'description_lang_en_exist', 'claims_lang_en_exist'])
DF = DF.rename(columns={'ucid_x': 'ucid', 'date_x': 'date'})
DF

| | ucid | date | labels | patent_number | title_lang_en | abstract_lang_en | description_lang_en | claims_lang_en |
|---|---|---|---|---|---|---|---|---|
| 0 | EP-1629683-B1 | 20140129.0 | H04W16 | 1629683 | A METHOD AND AN APPARATUS FOR CELL PLANNING | Measuring in a cellular telecommunication syst... | Technical field\n The invention relates to ban... | A method in a cellular mobile telecommunicatio... |
| 1 | EP-1682573-B1 | 20140101.0 | G01N33,C12N15,C12Q1,G01N2500,C07K14,A61K39,A61... | 1682573 | THE USE OF EUKARYOTIC GENES AFFECTING CELL CYC... | The present invention relates to the significa... | The present invention relates to the use of ag... | An isolated nucleic acid molecule comprising a... |
| 2 | EP-1693184-A3,EP-1693184-B1 | 20140910.0 | B31B70,B31B2160,B65B11,B31B2170 | 1693184 | Method and system for creating mailpieces from... | A method for creating mailpieces from a single... | The present invention relates generally to a m... | A method for creating mailpieces from a single... |
| 3 | EP-1830222-B1 | 20140129.0 | G02C13,G02C7 | 1830222 | Method for the determination of a progressive ... | A method for the determination by optical opti... | The present invention relates to a method for ... | Method for the determination of a personalized... |
| 4 | EP-1881160-A3,EP-1881160-B1 | 20140129.0 | F01D11 | 1881160 | Seal for an annulus filler between fan blades | A seal for an annulus filler (326) for a gas t... | This invention relates to gas turbine engines,... | A seal arrangement for an annulus filler (326)... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 97058 | US-9226362-B2 | 20151229.0 | H05B33 | 9226362 | Transparent inorganic thin-film electrolumines... | An inorganic, transparent thin film electrolum... | TECHNICAL FIELD\n This disclosure relates to t... | The invention claimed is: \n 1. An inorganic, ... |
| 97059 | US-9226393-B2 | 20151229.0 | B41J2,H05K3,H05K2201,H05K1,G06K19 | 9226393 | Tear-proof circuit | A circuit including a flexible substrate and a... | BACKGROUND\n 1. Technical Field\n The present ... | The invention claimed is:\n 1. A circuit, comp... |
| 97060 | US-9226410-B2 | 20151229.0 | H05K2201,H05K2203,B32B15,B32B7,B32B2457,B32B22... | 9226410 | Method of making a flexible circuit | A method of manufacturing a multilayer flexibl... | FIELD OF THE INVENTION\n The present invention... | The invention claimed is:\n 1. A method of man... |
| 97061 | US-9226411-B2 | 20151229.0 | G06F3,G01N27,H05K3,G06F2203,B05D5 | 9226411 | Making multi-layer micro-wire structure | A method of making a multi-layer micro-wire st... | CROSS REFERENCE TO RELATED APPLICATIONS\n This... | The invention claimed is:\n 1. A method of mak... |


In [17]:
DF.to_csv("/PATH/sample_dataset_cpc.csv",  sep =';')