*Licensed under the MIT License. See LICENSE-CODE in the repository root for details.*

*Copyright (c) 2025 Eleni Kamateri*

### WPI Analysis

This script processes a CSV file containing essential data for analyzing the patent documents of a specific vertical (e.g., EP) and generates five output files describing the virtual patents. Each file name combines the vertical name with the file type:

1. **\[VerticalName\]_PatDocs.csv** – A list of all patent documents in the specific vertical, sorted by patent number (fields: 'xml_file_name', 'ucid', 'patent_number').
2. **\[VerticalName\]_Pat.csv** – A list of all patent numbers in the specific vertical (field: 'patent_number').
3. **\[VerticalName\]_ClassInfoIPC.csv** – Contains the IPC classification codes for all patents in the specific vertical, sorted by patent number (fields: 'patent_number', 'main_classification', 'further_classification').
4. **\[VerticalName\]_ClassInfoIPCR.csv** – Contains the IPCR classification codes for all patents in the specific vertical, sorted by patent number (fields: 'patent_number', 'classification_ipcr').
5. **\[VerticalName\]_ClassInfoCPC.csv** – Contains the CPC classification codes for all patents in the specific vertical, sorted by patent number (fields: 'patent_number', 'classification_cpc').

#### Requirements

1. This script should be applied to the extracted patent documents, which are organized into separate folders. It requires that the script "*7z Files Extraction and Organization by Vertical.ipynb*" has been run first.
2. This script requires the CSV file generated by the "*CSV File Creation for Patent Document Analysis.ipynb*". 
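
Before running the cells below, it can help to confirm that the input CSV actually carries the columns this notebook reads. The following is a minimal sketch rather than part of the original pipeline: the helper `missing_columns` and the in-memory CSV text are hypothetical, and the column list is taken from the sample output shown later in this notebook.

```python
import io
import pandas as pd

# Columns this notebook reads from the input CSV.
REQUIRED_COLUMNS = [
    "xml_file_name", "ucid", "main_classification",
    "further_classification", "classification_ipcr", "classification_cpc",
]


def missing_columns(df):
    """Return the required columns that are absent from df (hypothetical helper)."""
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]


# Tiny in-memory stand-in for the real CSV file (hypothetical content).
csv_text = (
    "xml_file_name;ucid;classification_ipcr\n"
    "EP-2677851-A1.xml;EP-2677851-A1;A01B 79/02\n"
)
demo = pd.read_csv(io.StringIO(csv_text), delimiter=";")
print(missing_columns(demo))
```

An empty list means every expected column is present.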

#### Configurable Parameters

Researchers can customize the process using the following parameters:

**destination_path** – Path to the folder where the generated files will be stored. 
        

**csv_file_path** – Path to the CSV file containing data for the specific vertical.
        

**sep** – Defines the separator used in the CSV file:
        
        0: semicolon
        1: comma


**vertical** – Defines the vertical code used for filename creation:
       
        0: EP
        1: WO
        2: US
        3: CN
        4: JP
        5: KR
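
The numeric codes above can also be expressed as lookup tables, which keeps the mapping in one place and rejects unknown codes early. This is only a sketch: `SEPARATORS`, `VERTICALS` and `resolve_parameters` are hypothetical names, and the `VP` prefix with a lower-case vertical name mirrors the way the notebook builds its output path.

```python
# Hypothetical lookup tables mirroring the parameter codes listed above.
SEPARATORS = {0: ";", 1: ","}
VERTICALS = {0: "EP", 1: "WO", 2: "US", 3: "CN", 4: "JP", 5: "KR"}


def resolve_parameters(sep, vertical):
    """Return (delimiter, vertical_name); raise ValueError for unknown codes."""
    if sep not in SEPARATORS:
        raise ValueError(f"Unknown sep code: {sep!r}")
    if vertical not in VERTICALS:
        raise ValueError(f"Unknown vertical code: {vertical!r}")
    return SEPARATORS[sep], VERTICALS[vertical]


delimiter, vertical_name = resolve_parameters(sep=0, vertical=0)

# The notebook prefixes output files with "VP" plus the lower-case vertical name.
file_prefix = "VP" + vertical_name.lower()
print(delimiter, vertical_name, file_prefix)
```

Raising an error instead of printing a message stops the run before any file is written under an incomplete path.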

### Set parameters

In [1]:
filename1="PatDocs"
filename2="Pat"
filename3="ClassInfoIPC"
filename4="ClassInfoIPCR"
filename5="ClassInfoCPC"
destination_path="/YOUR_PATH/WPI-Dataset/"
csv_file_path="/YOUR_PATH/EP_csv_file_for_wpi_analysis.csv"
sep=0
vertical=0

In [2]:
# Append the virtual-patent file prefix of the selected vertical to the destination path
if vertical==0:
    destination_path=destination_path+"VPep"
elif vertical==1:
    destination_path=destination_path+"VPwo"
elif vertical==2:
    destination_path=destination_path+"VPus"
elif vertical==3:
    destination_path=destination_path+"VPcn"
elif vertical==4:
    destination_path=destination_path+"VPjp"
elif vertical==5:
    destination_path=destination_path+"VPkr"
else:
    print("Provide a valid vertical number")

### Import all required libraries for the script

In [3]:
import pandas as pd
import numpy as np

In [4]:
# Set the display options: show full column contents and all rows
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", None)

### Import the CSV file and load its data into a DataFrame

In [5]:
if sep==0:
    DF = pd.read_csv(csv_file_path, header=0, delimiter=";") #, nrows=1000)
elif sep==1:
    DF = pd.read_csv(csv_file_path, header=0) #, nrows=1000)
else:
    print("Please provide a valid value for sep")

print(DF.shape)
DF.head(1)

(552439, 11)


Unnamed: 0.1,Unnamed: 0,xml_file_name,ucid,date,main_classification,further_classification,classification_ipcr,classification_cpc,abstract_lang_en_exist,description_lang_en_exist,claims_lang_en_exist
0,0,EP-2677851-A1.xml,EP-2677851-A1,20140101,,,"A01B 79/02 20060101AFI20120911BHEP , A01B 69/00 20060101ALI20120911BHEP , A01C 7/00 20060101ALI20120911BHEP , A01C 21/00 20060101ALI20120911BHEP","A01B 79/005 20130101 LI20150420BHEP , A01C 21/005 20130101 LI20150420BHEP , A01B 69/007 20130101 LI20150420BHEP , A01C 7/00 20130101 LI20131205BHEP , A01C 7/06 20130101 FI20131205BHEP , A01C 21/00 20130101 LI20131205BHEP , A01C 15/00 20130101 LI20131205BHEP",1.0,1.0,1.0


### Identify the patent number and kind code, and append these two fields to the initial DataFrame

In [6]:
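# File names follow the pattern <vertical>-<patent number>-<kind code>.xml
# Patent number: second dash-separated part of the file name without its extension
DF['patent_number']=DF['xml_file_name'].str.split(".").str[0]
DF['patent_number']=DF['patent_number'].str.split("-").str[1:2]
DF['patent_number']=DF['patent_number'].str.join('')

# Kind code: third dash-separated part (e.g., A1, B2)
DF['kind_code']=DF['xml_file_name'].str.split(".").str[0]
DF['kind_code']=DF['kind_code'].str.split("-").str[2:3]
DF['kind_code']=DF['kind_code'].str.join('')

# First character of the kind code (e.g., A or B)
DF['kind_code_letter']=DF['kind_code'].str[0]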
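
To illustrate the parsing step, the sketch below applies the same idea to two file names taken from the outputs in this notebook. It uses `expand=True` to split each name into columns in one pass, which is an alternative formulation and not the code used above; the DataFrame name `sample` is hypothetical.

```python
import pandas as pd

# Two file names taken from the sample outputs shown in this notebook.
sample = pd.DataFrame({"xml_file_name": ["EP-2677851-A1.xml", "EP-0440147-B2.xml"]})

# Drop the ".xml" extension, then split "EP-2677851-A1" into its three parts.
parts = sample["xml_file_name"].str.split(".").str[0].str.split("-", expand=True)
sample["patent_number"] = parts[1]
sample["kind_code"] = parts[2]
sample["kind_code_letter"] = sample["kind_code"].str[0]

print(sample)
```

In this sketch the patent numbers remain strings, so a leading zero such as the one in `0440147` is kept in the DataFrame.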

### Create PatDocs file

In [7]:
DF1=DF.sort_values(by = 'patent_number', ascending=True)
DF1=DF1.groupby('patent_number').agg({'xml_file_name':'last', 'ucid':'last', 'patent_number':'last' })
DF1=DF1.reset_index(drop=True)
DF1.to_csv(destination_path+"_"+filename1+".csv", index=False)

In [8]:
DF1.head(10)

Unnamed: 0,xml_file_name,ucid,patent_number
0,EP-0440147-B2.xml,EP-0440147-B2,440147
1,EP-0562003-B2.xml,EP-0562003-B2,562003
2,EP-0611167-B3.xml,EP-0611167-B3,611167
3,EP-0629632-B2.xml,EP-0629632-B2,629632
4,EP-0630170-B2.xml,EP-0630170-B2,630170
5,EP-0639563-B2.xml,EP-0639563-B2,639563
6,EP-0656786-B2.xml,EP-0656786-B2,656786
7,EP-0663146-B2.xml,EP-0663146-B2,663146
8,EP-0667102-B2.xml,EP-0667102-B2,667102
9,EP-0668080-B2.xml,EP-0668080-B2,668080
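
The PatDocs step keeps one document per patent number by taking the `last` row of each group. Which row comes last depends on the row order within a group, so one possible way to make the choice explicit is to sort by kind code as well. The sketch below assumes the desired document is the one whose kind code sorts last alphabetically; that rule, the toy rows, and the names `toy` and `one_per_patent` are assumptions for illustration only.

```python
import pandas as pd

# Toy data: patent 0440147 appears with two kind codes (hypothetical rows).
toy = pd.DataFrame(
    {
        "xml_file_name": ["EP-0440147-B2.xml", "EP-0440147-A1.xml", "EP-0562003-B2.xml"],
        "ucid": ["EP-0440147-B2", "EP-0440147-A1", "EP-0562003-B2"],
        "patent_number": ["0440147", "0440147", "0562003"],
        "kind_code": ["B2", "A1", "B2"],
    }
)

# Sorting by kind_code as well fixes which row is "last" within each patent.
toy_sorted = toy.sort_values(by=["patent_number", "kind_code"])

# as_index=False keeps patent_number as a regular column in the result.
one_per_patent = toy_sorted.groupby("patent_number", as_index=False).agg(
    {"xml_file_name": "last", "ucid": "last"}
)
print(one_per_patent)
```

Note that the pandas `last` aggregation returns the last non-null value of each column, so for columns with missing values the result of a group can combine values from different rows.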


### Create Pat file

In [9]:
DF2=DF.sort_values(by = 'patent_number', ascending=True)
DF2=DF2[['patent_number']]
DF2 = DF2.drop_duplicates(subset = ["patent_number"])
DF2=DF2.reset_index(drop=True)
DF2.to_csv(destination_path+"_"+filename2+".csv", index=False)

In [10]:
DF2.head(10)

Unnamed: 0,patent_number
0,440147
1,562003
2,611167
3,629632
4,630170
5,639563
6,656786
7,663146
8,667102
9,668080


### Create ClassInfoIPC file

In [11]:
DF3=DF.sort_values(by = 'patent_number', ascending=True)
DF3=DF3.groupby('patent_number').agg({'patent_number':'last', 'main_classification':'last', 'further_classification':'last'})
DF3=DF3.reset_index(drop=True)
DF3.to_csv(destination_path+"_"+filename3+".csv", index=False)

In [12]:
DF3.head(10)

Unnamed: 0,patent_number,main_classification,further_classification
0,440147,,
1,562003,,
2,611167,,
3,629632,,
4,630170,,
5,639563,,
6,656786,,
7,663146,,
8,667102,,
9,668080,,


### Create ClassInfoIPCR file

In [13]:
DF4=DF.sort_values(by = 'patent_number', ascending=True)
DF4=DF4.groupby('patent_number').agg({'patent_number':'last', 'classification_ipcr':'last'})
DF4=DF4.reset_index(drop=True)
DF4.to_csv(destination_path+"_"+filename4+".csv", index=False)

In [14]:
DF4.head(10)

Unnamed: 0,patent_number,classification_ipcr
0,440147,"G01N 33/53 20060101ALI20040803BHEP , C12Q 1/68 20060101ALI20040803BHEP , C12N 15/13 20060101AFI20040803BHEP , C12P 21/08 20060101ALI20040803BHEP , C12N 15/09 20060101AFI20051220RMJP , G01N 33/531 20060101A I20060722RMEP , C07K 16/44 20060101A I20051008RMEP , C12N 15/65 20060101A I20051008RMEP , G01N 33/68 20060101A I20060722RMEP , C12N 15/63 20060101A I20051110RMEP , C12N 15/10 20060101A I20051008RMEP , C12N 15/70 20060101A I20051008RMEP , C12R 1/91 20060101ALN20051220RMJP , C07K 7/06 20060101A I20051008RMEP , C07K 16/06 20060101A I20051110RMEP , C40B 30/04 20060101A I20070721RMEP , C12R 1/19 20060101ALN20051220RMJP , C07K 16/00 20060101A I20051008RMEP , C12N 9/88 20060101A I20051008RMEP"
1,562003,"A23L 27/00 20160101AFI20160226RMEP , C12N 15/00 20060101ALI20051220RMJP , C12R 1/885 20060101ALN20051220RMJP , C12N 9/42 20060101A I20051008RMEP , C12N 1/15 20060101ALI20051220RMJP , C12N 9/26 20060101A I20051008RMEP , C12P 19/00 20060101A I20051008RMEP , C12N 15/09 20060101ALI20051220RMJP , C12N 15/55 20060101A I20060521RMUS , C12G 1/02 20060101A I20051008RMEP , C12G 1/00 20060101A I20051008RMEP , C12Q 1/68 20060101ALI19940630BHEP , C12P 19/14 20060101ALI19940630BHEP , C12N 15/80 20060101ALI19940630BHEP , C12N 15/56 20060101AFI19940630BHEP , C12N 9/24 20060101ALI19940630BHEP , C11D 3/386 20060101ALI19940630BHEP"
2,611167,"B65D 47/08 20060101AFI19940602BHEP , A61J 1/00 20060101ALI20051220RMJP , A61G 12/00 20060101AFI20051220RMJP , C04B 28/04 20060101A I20051008RMEP , A61M 5/32 20060101ALI20051220RMJP , B65D 43/22 20060101ALI20051220RMJP"
3,629632,"C08F 4/60 20060101A I20060521RMEP , C08F 4/659 20060101A N20051008RMEP , C08F 4/6592 20060101A N20051008RMEP , C08F 4/64 20060101A I20060521RMUS , C07C 45/46 20060101A I20051008RMEP , C07C 49/67 20060101A I20051008RMEP , C08F 4/6192 20060101A N20051008RMEP , C08F 4/619 20060101A N20051008RMEP , C08F 10/06 20060101A I20051008RMEP , C08F 210/16 20060101A I20051008RMEP , C08F 4/642 20060101A I20060521RMEP , C07C 49/697 20060101A I20051008RMEP , C08F 210/06 20060101ALI19950113BHEP , C08F 110/06 20060101ALI19950113BHEP , C08F 10/00 20060101ALI19950113BHEP , C08F 4/602 20060101ALI19950113BHEP , C07F 17/00 20060101AFI19940921BHEP"
4,630170,"H05B 3/06 20060101ALI19940928BHEP , H05B 3/84 20060101AFI19940928BHEP , H02G 3/30 20060101ALI20060310RMJP , B60J 1/20 20060101ALI20060310RMJP , B60J 1/00 20060101AFI20060310RMJP"
5,639563,"A61K 9/00 20060101A I20080531RMEP , A61K 31/085 20060101A I20051110RMEP , A61K 45/06 20060101A I20051008RMEP , A61P 27/02 20060101ALI20051220RMJP , A61K 31/558 20060101A I20051008RMEP , A61K 31/5575 20060101A I20051008RMEP , C07C 405/00 20060101AFI20040525BHEP , A61P 27/06 20060101ALI20040525BHEP , A61K 31/557 20060101ALI20040525BHEP"
6,656786,"A61K 36/48 20060101AFI20131202BHEP , A61K 9/20 20060101ALI20051220RMJP , A61K 38/22 20060101A I20070602RMWO , A61K 9/48 20060101ALI20051220RMJP , A61K 45/06 20060101A I20070602RMEP , A61K 31/35 20060101A I20051008RMEP , A61K 31/70 20060101ALI20051220RMJP , A61K 31/565 20060101ALI20051220RMJP , A61P 15/00 20060101ALI20051220RMJP , A61P 35/00 20060101ALI20051220RMJP , A61K 31/7048 20060101A I20070602RMUS , A61K 31/704 20060101ALI20051220RMJP , A61P 13/08 20060101ALI20051220RMJP , A61K 31/353 20060101A I20070602RMUS , A61P 5/00 20060101ALI20051220RMJP , A61K 36/00 20060101A I20070602RMWO , A61P 1/00 20060101ALI20051220RMJP , A61P 15/12 20060101A I20051110RMEP , A61P 3/06 20060101ALI20051220RMJP , A23L 1/20 20060101A I20051008RMEP , A61K 31/352 20060101ALI20051220RMJP , A61P 13/02 20060101ALI20051220RMJP , A23L 1/30 20060101A I20051008RMEP , A61K 36/185 20060101A I20080531RMEP , A61K 38/08 20060101A I20051110RMEP"
7,663146,"A01J 7/00 20060101AFI19950318BHEP , A01J 5/017 20060101A I20051008RMEP"
8,667102,"A23L 13/70 20160101A I20160226RMEP , A22C 5/00 20060101ALI19950513BHEP , A22C 9/00 20060101ALI19950513BHEP , A23B 4/06 20060101AFI19950513BHEP , F25D 13/06 20060101ALI19950513BHEP"
9,668080,"A61L 15/24 20060101A I20051008RMEP , C08J 3/24 20060101A I20051008RMEP , A61L 15/60 20060101AFI19950628BHEP"


### Create ClassInfoCPC file

In [16]:
DF5=DF.sort_values(by = 'patent_number', ascending=True)
DF5=DF5.groupby('patent_number').agg({'patent_number':'last', 'classification_cpc':'last'})
DF5=DF5.reset_index(drop=True)
DF5.to_csv(destination_path+"_"+filename5+".csv", index=False)

In [17]:
DF5.head(10)

Unnamed: 0,patent_number,classification_cpc
0,440147,"C12N 9/88 20130101 LI20130101BHEP , C07K2319/00 20130101 LA20130101BHEP , C07K 16/44 20130101 LI20130101BHEP , G01N 33/6845 20130101 LI20130101BHEP , G01N 33/531 20130101 LI20130101BHEP , G01N 33/6857 20130101 LI20130101BHEP , C07K2317/21 20130101 LA20130101BHEP , C12N 15/65 20130101 LI20130101BHEP , C12N 15/70 20130101 LI20130101BHEP , C07K 7/06 20130101 LI20130101BHEP , C12N 15/1093 20130101 FI20130101BHEP , C07K 16/00 20130101 LI20130101BHEP , C07K2319/02 20130101 LA20130101BHEP , C40B 30/04 20130101 LI20130101BHEP"
1,562003,"C12N 9/2445 20130101 LI20130605BHEP , C12P 19/00 20130101 LI20130905BHEP , Y02E 50/16 20130101 LA20130905BHEP , C11D 3/38645 20130101 LI20130905BHEP , C12R 1/885 20130101 FI20130905BHEP , C12P 19/14 20130101 LI20130905BHEP , C12Y 302/01021 20130101 LI20130905BHEP , C12N 9/2408 20130101 LI20130905BHEP , C12G 1/02 20130101 LI20130905BHEP , C12G 1/00 20130101 LI20130905BHEP"
2,611167,"Y10S 220/908 20130101 LA20130518BHEP , B65D 47/0847 20130101 FI20140616BHEP , B65D2251/1066 20130101 LA20140616BHEP"
3,629632,"C08F 210/06 20130101 LA20130821BHEP , C08F 4/65916 20130101 LA20130821BHEP , C08F 4/61912 20130101 LA20130821BHEP , Y10S 526/904 20130101 LA20130518BHEP , C07C 49/67 20130101 LI20130821BHEP , C08F 10/06 20130101 LI20130821BHEP , C07C 45/46 20130101 LI20130821BHEP , C07C 49/697 20130101 LI20130821BHEP , C08F 4/61927 20130101 LA20130821BHEP , Y10S 526/943 20130101 LA20130518BHEP , C08F 4/65912 20130101 LA20130821BHEP , C08F 4/65908 20130101 LA20130821BHEP , C08F 110/06 20130101 LI20130821BHEP , C08F 10/00 20130101 LI20130821BHEP , C08F 210/16 20130101 FI20130821BHEP , C08F 4/65927 20130101 LA20130821BHEP , C07F 17/00 20130101 LI20130821BHEP"
4,630170,"H05B2203/016 20130101 LA20130101BHEP , H05B 3/84 20130101 FI20130101BHEP"
5,639563,"A61K 31/5575 20130101 LI20130101BHEP , A61K 9/0048 20130101 FI20130101BHEP , A61K 31/558 20130101 LI20130101BHEP , A61K 45/06 20130101 LI20130101BHEP , C07C 405/0025 20130101 LI20130101BHEP , A61K 31/557 20130101 LI20130101BHEP , C07C 405/00 20130101 LI20130101BHEP"
6,656786,"A61K 36/48 20130101 LI20160804BHEP , A61K 31/35 20130101 FI20160804BHEP , A23L 11/07 20160801 LI20160801BHEP , A61K 31/353 20130101 LI20160804BHEP , A23L 33/11 20160801 LI20160801BHEP , A61K 38/08 20130101 LI20160804BHEP , A61K 31/7048 20130101 LI20160804BHEP , A61K 36/185 20130101 LI20160804BHEP , A23L 11/05 20160801 LI20160801BHEP"
7,663146,A01J 5/0175 20130101 FI20130101BHEP
8,667102,"F25D 13/06 20130101 LI20160804BHEP , A23B 4/064 20130101 LI20160804BHEP , F25D2400/28 20130101 LA20160804BHEP , A23B 4/06 20130101 LI20160804BHEP , A22C 21/00 20130101 FI20160804BHEP , A22B 5/0076 20130101 LI20160804BHEP , A23L 13/76 20160801 LI20160801BHEP"
9,668080,"A61L 15/24 20130101 LI20160707BHEP , C08J2300/14 20130101 LA20160707BHEP , C08J 3/245 20130101 FI20160707BHEP , C08F 220/06 20130101 LI20160707BHEP , A61L 15/60 20130101 LI20160707BHEP"
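
Each classification field stores several entries in one string, separated by `" , "`. For downstream analysis it may be convenient to split these into individual symbols. The following is a rough sketch, not part of the original notebook: it assumes that the classification symbol is everything before the first 8-digit version date in an entry, which matches the sample rows above but has not been checked against the full dataset.

```python
import pandas as pd

# One CPC string copied from the sample output above (patent 611167).
cpc = pd.Series(
    [
        "Y10S 220/908 20130101 LA20130518BHEP , "
        "B65D 47/0847 20130101 FI20140616BHEP , "
        "B65D2251/1066 20130101 LA20140616BHEP"
    ]
)

# Split on the comma, put each entry on its own row, and trim whitespace.
entries = cpc.str.split(",").explode().str.strip()

# Assumption: the symbol is the text before the first 8-digit version date.
symbols = entries.str.extract(r"^(.*?)\s+\d{8}", expand=False).str.strip()
print(symbols.tolist())
```

Because `explode` repeats the original index for every entry, the symbols can still be traced back to the row they came from.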
