*Licensed under the MIT License. See LICENSE-CODE in the repository root for details.*

*Copyright (c) 2025 Eleni Kamateri*

### Generating Training Datasets for Classification Test Sets

This script creates the training datasets that correspond to the classification test sets (CLTS) by retrieving and structuring patent data from the WPI collection. The process mirrors the [CLTS Testing Dataset Creation script](https://github.com/cs1msa/WPIplus/blob/main/UsingWPI%2B/An%20example%20of%20a%20classification%20experiment%20workflow/Source%20Code/CLTS%20Testing%20Dataset%20Creation.ipynb), with one key difference: instead of retrieving the patents included in the CLTS, this script retrieves every patent that is NOT part of the CLTS.

#### Process Overview

1. **Load the Classification Test Set:**
Load patent numbers of the CLTS of interest.

2. **Retrieve Full-Text Data from the WPI collection:**
This step retrieves full-text patent data from the original WPI collection for every patent whose number does NOT appear in the CLTS. The extracted full-text fields are:

- Abstract
- Description
- Claims
- Title
    
(Optional) Depending on the use case, the script may also extract additional metadata, such as:

- Application Date
- Publication Date
- Patent Inventor/Applicant
- Kind Code
      
3. **Merge and Structure the Dataset:**
This step merges different kind codes of the same patent, retaining the latest information for each field.

4. **Process the final training dataset for the classification experiment:**
First, only patents with all textual fields populated are retained. Then columns are renamed, labels are formatted, and long text is truncated. Labels are converted to subclass format (e.g., "G06F 17/30" → "G06F"), and the Description and Claims sections are truncated to their first 300 words.

#### Configurable Parameters

Researchers can modify the following parameters to customize the training dataset generation:

**classification_test_set_path** – Path to the classification test set file. 

**vertical_origin_path** – Path to the core vertical of the WPI dataset. 

        Example: "/YOUR_PATH/WPI-Dataset/EP/". 

**destination_path** – Path to the folder where the generated files will be stored. 

**sep** – Defines the separator used in the CSV file:

        0: Semicolon (;)
        
        1: Comma (,)
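
The numeric `sep` code can be translated into an actual delimiter before the CSV is written. The following is a minimal sketch, assuming a hypothetical helper named `resolve_separator`; it is not part of the original script, which passes `sep=';'` directly to `to_csv`.

```python
# Mapping taken from the parameter description: 0 -> semicolon, 1 -> comma.
SEPARATORS = {0: ";", 1: ","}


def resolve_separator(sep_code):
    """Map the numeric `sep` parameter to a delimiter (hypothetical helper)."""
    try:
        return SEPARATORS[sep_code]
    except KeyError:
        raise ValueError(
            f"sep must be 0 (semicolon) or 1 (comma), got {sep_code!r}"
        ) from None


delimiter = resolve_separator(0)
print(repr(delimiter))  # ';'
```

The resulting `delimiter` could then be passed as the `sep` argument of `DataFrame.to_csv`.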
        
        
The code below creates the training dataset for the classification test set 1 for the #EP core vertical and IPCR labels, referred to as [#CLTSep_VP_ipcr_1.csv](https://github.com/cs1msa/WPIplus/blob/main/Ground%20Truths/Classification/%23CLTSep/%23CLTSep-ipcr%7Bfilter%3A%20B%2C%20all%2C%2020151001%7D/CLTSep_VP_ipcr_1.csv).

### Set the required parameters for the script

In [15]:
classification_test_set_path="/YOUR_PATH/CLTSep_VP_ipcr_1.csv"
vertical_origin_path="/YOUR_PATH/WPI-Dataset/EP/"
destination_path="/YOUR_PATH/WPI-Dataset/"

### Import all required libraries for the script

In [13]:
import numpy as np
import pandas as pd
import os
from bs4 import BeautifulSoup
import time

### Textual Data Retrieval 

In [None]:
# Count the time
start_time = time.time()
df_class_train = pd.DataFrame()
counter_class=0

# Load the test set; a set gives constant-time membership checks inside the loop below
DF = pd.read_csv(classification_test_set_path, header=0)
DF_doc_number_list=set(DF['patent_number'].tolist())

for folder_level_1 in os.listdir(vertical_origin_path): #CC
    for folder_level_2 in os.listdir(vertical_origin_path+"/"+folder_level_1): #nnnnnn
        for folder_level_3 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2): #nn
            for folder_level_4 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3): #nn
                for folder_level_5 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4): #nn                                        
                    for folder_level_6 in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4+"/"+folder_level_5): #nn                                        
                        for files in os.listdir(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4+"/"+folder_level_5+"/"+folder_level_6): #nn                                        

                            counter_class=counter_class+1
                            
                            if counter_class%1000==0:
                                print(counter_class)

                            files_proc=files.split(".")[0]
                            
                            try:
                                doc_number_proc=files_proc.split("-")[1]
                                doc_number_proc=int(doc_number_proc)                                
                            except Exception:
                                doc_number_proc=''
                                print("Exception with doc-number", files)
                                                                                                                
                            if doc_number_proc not in DF_doc_number_list:                                                                                                                                                                                
                                
                                title_en_text  = ''     
                                abstract_en_text   = ''                                        
                                description_en_text = ''                                                   
                                claims_en_text = ''

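                                # Read the XML file and close the handle as soon as the content is loaded
                                with open(vertical_origin_path+"/"+folder_level_1+"/"+folder_level_2+"/"+folder_level_3+"/"+folder_level_4+"/"+folder_level_5+"/"+folder_level_6+"/"+files,'r',encoding='utf-8') as xml_file:
                                    content = xml_file.read()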
                                soup = BeautifulSoup(content, 'xml')
                                document_info = soup.find_all("patent-document")  

                                try:
                                    ucid=document_info[0]['ucid']
                                except Exception:
                                    ucid=''
                                    print("Exception 1, ucid does not exist")
                                                       
                                try:
                                    date=document_info[0]['date']
                                except Exception:
                                    date=''
                                    print("Exception 2, date does not exist")
                                
                                code_main=''
                                for main_classification in soup.find_all('main-classification'):
                                    code_main=main_classification.getText()      

                                further_group_help=[]
                                further_group_list=[]
                                for further_classification in soup.find_all('further-classification'):
                                    code_further=further_classification.getText()
                                    further_group_help.append(code_further) if code_further not in further_group_help else further_group_help
                                further_group_list = ", ".join(further_group_help)

                                ipcr_group_help=[]
                                ipcr_group_list=[]
                                for classification_ipcr in soup.find_all('classification-ipcr'):
                                    code_ipcr=classification_ipcr.getText()
                                    ipcr_group_help.append(code_ipcr) if code_ipcr not in ipcr_group_help else ipcr_group_help
                                ipcr_group_list = ", ".join(ipcr_group_help)
    
                                cpc_group_help=[]
                                cpc_group_list=[]
                                for classification_cpc in soup.find_all('classification-cpc'):
                                    code_cpc=classification_cpc.getText()
                                    cpc_group_help.append(code_cpc) if code_cpc not in cpc_group_help else cpc_group_help
                                cpc_group_list = ", ".join(cpc_group_help)
                                
                                title_en=soup.find('invention-title', attrs={'lang':'EN'})
                                if title_en != None:
                                    title_en=title_en.getText().replace('"', ' ')
                                    title_en=" ".join(title_en.split())
                                    title_en_text=title_en

                                abstract_en=soup.find('abstract', attrs={'lang':'EN'})
                                if abstract_en != None:
                                    abstract_en=abstract_en.getText().replace('"', ' ')
                                    abstract_en=" ".join(abstract_en.split())
                                    abstract_en_text=abstract_en
                                                
                                description_en = soup.find('description', attrs={'lang':'EN'})
                                if description_en != None:
                                    description_en=description_en.getText().replace('"', ' ')
                                    description_en=" ".join(description_en.split())
                                    description_en_text=description_en
                                                                
                                claims_en = soup.find('claims', attrs={'lang':'EN'})
                                if claims_en != None:
                                    claims_en=claims_en.getText().replace('"', ' ')
                                    claims_en=" ".join(claims_en.split())                          
                                    claims_en_text=claims_en
                                
                                df_class_train.loc[counter_class-1, 'xml_file_name']=files
                                df_class_train.loc[counter_class-1, 'ucid']=ucid
                                df_class_train.loc[counter_class-1, 'date']=date
                                df_class_train.loc[counter_class-1, 'main_classification']=code_main                                    
                                df_class_train.loc[counter_class-1, 'further_classification']=further_group_list                                        
                                df_class_train.loc[counter_class-1, 'classification_ipcr']=ipcr_group_list
                                df_class_train.loc[counter_class-1, 'classification_cpc']=cpc_group_list 
                                df_class_train.loc[counter_class-1, 'title_lang_en']=title_en_text
                                df_class_train.loc[counter_class-1, 'abstract_lang_en']=abstract_en_text
                                df_class_train.loc[counter_class-1, 'description_lang_en']=description_en_text
                                df_class_train.loc[counter_class-1, 'claims_lang_en']=claims_en_text                            
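
The seven nested `os.listdir` loops above visit every file under the vertical. A more compact way to express the same traversal is `os.walk`. The sketch below is an illustration under stated assumptions, not the author's code: the helper names `iter_files` and `doc_number_from_name` are hypothetical, and the file name `EP-1037926-A1.xml` is an assumed example of the `<country>-<number>-<kind>` pattern implied by the `split("-")[1]` call.

```python
import os


def iter_files(root):
    """Yield (full_path, file_name) for every file below root, at any depth."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            yield os.path.join(dirpath, name), name


def doc_number_from_name(file_name):
    """Return the integer document number from a file name, or None if it cannot be parsed."""
    stem = file_name.split(".")[0]
    try:
        return int(stem.split("-")[1])
    except (IndexError, ValueError):
        return None


# Membership test against a set of test-set patent numbers (hypothetical value).
test_set_numbers = {1037926}
print(doc_number_from_name("EP-1037926-A1.xml") in test_set_numbers)  # True
```

Returning `None` for unparseable names makes the failure case explicit, whereas the original loop assigns an empty string and still processes the file.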

In [None]:
df_class_train

In [19]:
df_class_train['patent_number']=df_class_train['xml_file_name'].str.split(".").str[0]
df_class_train['patent_number']=df_class_train['patent_number'].str.split("-").str[1:2]
df_class_train['patent_number']=df_class_train['patent_number'].str.join('')

In [13]:
df_class_train.shape

(546258, 13)

Row counts at this stage:

- df_class_train: 546,258
- df_class_test: 6,181
- Total: 552,439 (546,258 + 6,181; confirmed)

### When multiple kind codes exist for a single patent, we merge the most updated information for the selected fields.

In [None]:
df_class_train= df_class_train.replace('', pd.NA)
df_class_train=df_class_train.groupby('patent_number').agg({'patent_number':'last', 'classification_ipcr':'last', 'title_lang_en':'last', \
                            'abstract_lang_en':'last', 'description_lang_en':'last', 'claims_lang_en': 'last'})
df_class_train = df_class_train.reset_index(drop=True)
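
The merge relies on the fact that the pandas `last` aggregation returns the last non-null value in each group, which is why empty strings are first replaced with `pd.NA`. A minimal sketch on made-up rows (the patent numbers and texts below are illustrative, not taken from the WPI collection):

```python
import pandas as pd

# Two records for the same patent (e.g. two kind codes) and one single-record patent.
rows = pd.DataFrame({
    "patent_number": ["1000001", "1000001", "1000002"],
    "title_lang_en": ["Old title", "New title", "Only title"],
    "abstract_lang_en": ["An abstract", "", ""],
})

# Empty strings become missing values so that `last` can skip them.
rows = rows.replace("", pd.NA)

# One row per patent: each field keeps its last non-missing value.
merged = rows.groupby("patent_number", as_index=False).last()
print(merged)
```

For patent `1000001`, the title comes from the second record while the abstract falls back to the first one, because the second record has no abstract. Note that "last" follows row order, so the result depends on the order in which the files were read.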

In [18]:
df_class_train

| | patent_number | classification_ipcr | title_lang_en | abstract_lang_en | description_lang_en | claims_lang_en |
|---|---|---|---|---|---|---|
| 0 | 0440147 | G01N 33/53 20060101ALI20040803BHEP ... | Preparation and use of a human antibody gene b... | | | A process for preparing a human-antibody libra... |
| 1 | 0562003 | A23L 27/00 20160101AFI20160226RMEP ... | IMPROVED SACCHARIFICATION OF CELLULOSE BY CLON... | | BACKGROUND OF THE INVENTION1. Field of the Inv... | A process for modifying the expression of extr... |
| 2 | 0611167 | B65D 47/08 20060101AFI19940602BHEP ... | Closure device | | This invention is concerned with devices for c... | A receptacle for receiving sharps and medical ... |
| 3 | 0629632 | C08F 4/60 20060101A I20060521RMEP ... | Polypropylene | | The present invention relates to a propylene h... | A propylene copolymer having the following pro... |
| 4 | 0630170 | H05B 3/06 20060101ALI19940928BHEP ... | Electrical connection for window | | | Glazing pane equipped with at least one access... |
| ... | ... | ... | ... | ... | ... | ... |
| 466950 | 2961250 | F21V 29/00 20150101ALI20151109BHEP ... | ASSEMBLY FOR EMITTING LIGHT WITH LED AND SUPPO... | | | |
| 466951 | 2961251 | F16B 5/00 20060101ALN20151126BHEP ... | CONNECTION STRUCTURE FOR CONNECTING ABUTTING S... | | | |
| 466952 | 2961252 | H05K 7/20 20060101AFI20151028BHEP ... | SYSTEMS AND METHODS FOR PASSIVE COOLING OF COM... | A system includes an electrical enclosure, a f... | BACKGROUNDThe subject matter disclosed herein ... | A system, comprising: an electrical enclosure ... |
| 466953 | 2961253 | H05K 13/02 20060101AFI20160718BHEP ... | COMPONENT MOUNTING SYSTEM AND BULK COMPONENT D... | To provide a component mounting system that ef... | Technical FieldThe present invention relates t... | A component mounting system comprising: a prin... |


### Process the final dataset

#### We retain only patents that have all textual fields completed

In [20]:
df_class_train = df_class_train[~df_class_train.isnull().any(axis=1)]
df_class_train = df_class_train.reset_index(drop=True)
df_class_train

| | patent_number | classification_ipcr | title_lang_en | abstract_lang_en | description_lang_en | claims_lang_en |
|---|---|---|---|---|---|---|
| 0 | 1037926 | A61K 39/395 20060101ALI20130819BHEP ... | TREATMENT WITH ANTI-ErbB2 ANTIBODIES | The present invention concerns the treatment o... | Field of the InventionThe present invention co... | Use of an anti-ErbB2 antibody in the preparati... |
| 1 | 1069454 | G02B 3/00 20060101ALN20130621BHEP ... | THREE-DIMENSIONAL IMAGE DISPLAY | The present invention provides a three-dimensi... | TECHNICAL FIELDThe present invention relates t... | A three-dimensional image display for displayi... |
| 2 | 1093259 | H04L 12/861 20130101AFI20130826BHEP ... | Method for flow control | A communications node admits the receive data ... | Field Of The InventionThis invention relates t... | A method for controlling data flow in a store-... |
| 3 | 1124348 | H04B 7/26 20060101ALI20031126BHEP ... | Uplink timing synchronization and access control | Timing and access signals, to be transmitted i... | Related ApplicationThe present invention is re... | A method for use in a wireless communication s... |
| 4 | 1168759 | H04L 1/18 20060101ALI20130830BHEP ... | Method of generating protocol data units in sp... | A method of generating protocol data units in ... | BACKGROUND OF THE INVENTIONField of the Invent... | A method of transmitting signals in a mobile c... |
| ... | ... | ... | ... | ... | ... | ... |
| 224373 | 2961247 | H05B 33/08 20060101AFI20151119BHEP ... | METHOD AND DEVICE OF SWITCHING MODE REGULATION... | A method of regulating in switching mode compr... | TECHNICAL FIELDThis disclosure relates to powe... | A method of regulating in switching mode lumin... |
| 224374 | 2961248 | G09G 3/34 20060101ALI20160121BHEP ... | METHOD AND APPARATUS FOR AUTOMATICALLY CONTROL... | Provided are a method and device for automatic... | Technical FieldThe present invention relates t... | A method for automatically controlling a state... |
| 224375 | 2961252 | H05K 7/20 20060101AFI20151028BHEP ... | SYSTEMS AND METHODS FOR PASSIVE COOLING OF COM... | A system includes an electrical enclosure, a f... | BACKGROUNDThe subject matter disclosed herein ... | A system, comprising: an electrical enclosure ... |
| 224376 | 2961253 | H05K 13/02 20060101AFI20160718BHEP ... | COMPONENT MOUNTING SYSTEM AND BULK COMPONENT D... | To provide a component mounting system that ef... | Technical FieldThe present invention relates t... | A component mounting system comprising: a prin... |


#### We convert labels to subclass format

In [4]:
# Rename
df_class_train=df_class_train.rename(columns={'classification_ipcr': 'labels'})

# Split the 'labels' column by commas
df_class_train['labels'] = df_class_train['labels'].str.split(',')

# Extract the first part of each label after splitting by '/'
df_class_train['labels'] = df_class_train['labels'].apply(lambda x: [item.split('/')[0] for item in x])
df_class_train['labels'] = df_class_train['labels'].apply(lambda x: [item.replace(" ", "") for item in x])
df_class_train['labels'] = df_class_train['labels'].apply(lambda x: [item.strip() for item in x])
df_class_train['labels'] = df_class_train['labels'].apply(lambda x: [item[0:4]for item in x])
df_class_train['labels'] = df_class_train['labels'].apply(lambda x: set(x))
df_class_train['labels'] = df_class_train['labels'].apply(lambda x: ','.join(x))
df_class_train

| | patent_number | labels | title_lang_en | abstract_lang_en | description_lang_en | claims_lang_en |
|---|---|---|---|---|---|---|
| 0 | 1037926 | C07K,A61P,A61K,C12N,C12P | TREATMENT WITH ANTI-ErbB2 ANTIBODIES | The present invention concerns the treatment o... | Field of the InventionThe present invention co... | Use of an anti-ErbB2 antibody in the preparati... |
| 1 | 1069454 | G03B,G02B | THREE-DIMENSIONAL IMAGE DISPLAY | The present invention provides a three-dimensi... | TECHNICAL FIELDThe present invention relates t... | A three-dimensional image display for displayi... |
| 2 | 1093259 | H04L | Method for flow control | A communications node admits the receive data ... | Field Of The InventionThis invention relates t... | A method for controlling data flow in a store-... |
| 3 | 1124348 | H04B,H04L,H04J | Uplink timing synchronization and access control | Timing and access signals, to be transmitted i... | Related ApplicationThe present invention is re... | A method for use in a wireless communication s... |
| 4 | 1168759 | H04B,H04L | Method of generating protocol data units in sp... | A method of generating protocol data units in ... | BACKGROUND OF THE INVENTIONField of the Invent... | A method of transmitting signals in a mobile c... |
| ... | ... | ... | ... | ... | ... | ... |
| 224373 | 2961247 | H05B | METHOD AND DEVICE OF SWITCHING MODE REGULATION... | A method of regulating in switching mode compr... | TECHNICAL FIELDThis disclosure relates to powe... | A method of regulating in switching mode lumin... |
| 224374 | 2961248 | H05B,G09G,G09F | METHOD AND APPARATUS FOR AUTOMATICALLY CONTROL... | Provided are a method and device for automatic... | Technical FieldThe present invention relates t... | A method for automatically controlling a state... |
| 224375 | 2961252 | H05K | SYSTEMS AND METHODS FOR PASSIVE COOLING OF COM... | A system includes an electrical enclosure, a f... | BACKGROUNDThe subject matter disclosed herein ... | A system, comprising: an electrical enclosure ... |
| 224376 | 2961253 | H05K | COMPONENT MOUNTING SYSTEM AND BULK COMPONENT D... | To provide a component mounting system that ef... | Technical FieldThe present invention relates t... | A component mounting system comprising: a prin... |
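
The label conversion above can also be expressed as a single function. This is a minimal sketch, assuming a hypothetical helper named `ipcr_to_subclasses` and a made-up input string in the same layout as the `classification_ipcr` column. Unlike the cell above, it sorts the subclasses, so the output order is deterministic rather than dependent on set iteration order.

```python
def ipcr_to_subclasses(ipcr_codes):
    """Reduce a comma-separated IPCR string to its unique four-character subclasses."""
    subclasses = set()
    for code in ipcr_codes.split(","):
        # Keep the part before '/', drop spaces, then keep the first four characters.
        compact = code.split("/")[0].replace(" ", "")
        if compact:
            subclasses.add(compact[:4])
    return ",".join(sorted(subclasses))


# Illustrative input, not copied from the dataset.
example = (
    "H04B 7/26 20060101ALI20031126BHEP, "
    "H04L 5/00 20060101AFI20031126BHEP, "
    "H04B 7/204 20060101ALI20031126BHEP"
)
print(ipcr_to_subclasses(example))  # H04B,H04L
```

Such a function could be applied to the whole column with `df_class_train['labels'].apply(ipcr_to_subclasses)`.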


#### Description and Claims sections are filtered to retain the first 300 words

In [5]:
# Keep the first 300 words 
num=df_class_train.shape[0]

for i in range(num):
    if i%100==0:
        print("i:", i)
    help_para=df_class_train['description_lang_en'][i].split()
    help_para=help_para[0:300]
    help_para=' '.join(help_para)
    df_class_train.loc[i, 'description_lang_en']=help_para
for j in range(num):
    if j%100==0:
        print("j:", j)
    help_para=df_class_train['claims_lang_en'][j].split()
    help_para=help_para[0:300]
    help_para=' '.join(help_para)
    df_class_train.loc[j, 'claims_lang_en']=help_para
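
The two row-by-row loops above apply the same truncation to each column. As an alternative sketch (the helper name `first_n_words` is hypothetical and not part of the original script), the rule can be isolated in one function:

```python
def first_n_words(text, n=300):
    """Return the first n whitespace-separated words of text, joined by single spaces."""
    return " ".join(text.split()[:n])


sample = "one  two three\nfour five"
print(first_n_words(sample, n=3))  # one two three
```

Applied per column, for example `df_class_train['description_lang_en'].apply(first_n_words)`, this should avoid the per-row `.loc` assignments, which tend to be slow on a DataFrame of this size.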

In [6]:
df_class_train.to_csv(destination_path+"CLTSep_VP_ipcr_1_train_dataset.csv",  sep =';')

In [7]:
df_class_train['description_lang_en'][0]

'Field of the InventionThe present invention concerns the treatment of disorders characterized by the overexpression of with malignant breast ErbB2. More specifically, the invention concerns the treatment of human patients with malignant breast cancer overexpressing ErbB2 with a combination of an anti-ErbB2 antibody and a chemotherapeutic agent that is a taxoid, in the absence of an anthracycline, e.g. doxorubicin or epirubicin.Background of the InventionProto-oncogenes that encode growth factors and growth factor receptors have been identified to play important roles in the pathogenesis of various human malignancies, including breast cancer. It has been found that the human ErbB2 gene (erbB2, also known as her2, or c-erbB-2), which encodes a 185-kd transmembrane glycoprotein receptor (p185HER2) related to the epidermal growth factor receptor (EGFR), is overexpressed in about 25% to 30% of human breast cancer (Slamon et al., Science 235:177-182 [1987]; Slamon et al., Science 244:707-71