## Experiment 2: Multi-label classification based on patent abstracts

### Load and inspect data

First, I load the complete set of labeled abstracts from the database.

In [1]:
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('postgresql+psycopg2://cdrc1103:typ95yeah@localhost:5432/Thesis', echo=False)

dataset = pd.read_sql("abstract", con=engine, index_col="patentid")

Let's look at how the labels are distributed.

In [2]:
# collect the set of distinct labels
unique_label = set()
for label_set in dataset["level1labels"]:
    unique_label.update(label_set)
unique_label

{'Active ingredients',
 'Artificial Intelligence (AI)',
 'Cleansing',
 'Decorative cosmetic',
 'Deo',
 'Devices',
 'Hair care',
 'Health care',
 'IP7 Beiersdorf',
 'Lip care',
 'Manufacturing technology',
 'Non woven',
 'Packaging',
 'Perfume',
 'Personalization',
 'Shaving',
 'Skin care',
 'Sun',
 'Sustainability',
 'no follow up'}

In [3]:
# now count the frequency of each class
label_dict = {lbl: 0 for lbl in unique_label}
for label_set in dataset["level1labels"]:
    for lbl in label_set:
        label_dict[lbl] += 1
pd.Series(label_dict).sort_values()

IP7 Beiersdorf                      2
Artificial Intelligence (AI)        6
no follow up                        6
Personalization                   111
Sustainability                    356
Shaving                           735
Devices                          1614
Manufacturing technology         1699
Lip care                         1803
Decorative cosmetic              2014
Non woven                        3423
Deo                              3681
Perfume                          4015
Sun                              6461
Cleansing                       10819
Health care                     11045
Packaging                       14135
Hair care                       17794
Skin care                       27438
Active ingredients              40706
dtype: int64
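
The two counting cells above can also be written in a single pandas expression. The following is a minimal sketch on toy data; the `labels` series is a hypothetical stand-in for `dataset["level1labels"]` and assumes that each entry is a Python list of label strings.

```python
import pandas as pd

# Toy stand-in for dataset["level1labels"]: one list of label strings per patent.
labels = pd.Series([
    ["Skin care"],
    ["Active ingredients", "Sun"],
    ["Active ingredients"],
    ["Packaging", "Skin care"],
])

# explode() yields one row per (patent, label) pair; value_counts() tallies them.
label_counts = labels.explode().value_counts().sort_values()
print(label_counts)
```

The index of `label_counts` doubles as the set of distinct labels, so the separate `unique_label` loop would not be needed with this variant.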

Instances that carry any of the labels "IP7 Beiersdorf", "no follow up" or
"Artificial Intelligence (AI)" are removed, because these labels occur too rarely
(2, 6 and 6 instances) to learn from.

In [4]:
drop_list = ["IP7 Beiersdorf", "no follow up", "Artificial Intelligence (AI)"]
drop_idx = []
for i, label_list in enumerate(dataset["level1labels"]):
    # remember the position of every row that carries at least one dropped label
    if any(lbl in label_list for lbl in drop_list):
        drop_idx.append(i)
new_dataset = dataset.copy().reset_index()
new_dataset = new_dataset.drop(index=drop_idx).reset_index()
# number of removed instances
print(len(dataset) - len(new_dataset))

14
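
The same filter can be expressed as a boolean mask instead of a list of row positions. This is only a sketch on toy data: the `toy` frame, its patent ids and the names `drop_labels`, `keep_mask` and `filtered` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `dataset`; the column names follow the notebook.
toy = pd.DataFrame({
    "patentid": ["A1", "B2", "C3", "D4"],
    "level1labels": [
        ["Skin care"],
        ["no follow up", "Sun"],
        ["Packaging", "Deo"],
        ["IP7 Beiersdorf", "no follow up"],
    ],
})
drop_labels = {"IP7 Beiersdorf", "no follow up", "Artificial Intelligence (AI)"}

# True for rows whose label list shares no element with drop_labels.
keep_mask = toy["level1labels"].apply(drop_labels.isdisjoint)
filtered = toy[keep_mask].reset_index(drop=True)
print(len(toy) - len(filtered))
```

A row that carries two of the dropped labels (like "D4" here) is removed once, and `reset_index(drop=True)` avoids carrying the old index along as an extra column.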


In [5]:
# map label strings to integer ids
updated_unique_label = set()
for label_set in new_dataset["level1labels"]:
    updated_unique_label.update(label_set)

label_mapping = {lbl: idx for idx, lbl in enumerate(sorted(updated_unique_label))}
encodings = []
for label_set in new_dataset["level1labels"]:
    encodings.append([label_mapping[lbl] for lbl in label_set])
new_dataset["encodings"] = encodings
new_dataset

| | index | patentid | abstract | level1labels | encodings |
|---|---|---|---|---|---|
| 0 | 0 | AP1665A | Disclosed is an oral dosage form comprising (i... | [Skin care] | [14] |
| 1 | 1 | AP1682A | A method of enhancing health through the gener... | [Active ingredients] | [0] |
| 2 | 2 | AP1904A | Light-converting material comprises a europium... | [Hair care] | [5] |
| 3 | 3 | AP1937A | A flexible container (1) for holding a liquid ... | [Packaging] | [10] |
| 4 | 4 | AP2011006030A0 | The invention of the present Application provi... | [Active ingredients, Sun] | [0, 15] |
| ... | ... | ... | ... | ... | ... |
| 107383 | 107397 | YU62004A | Kod jednog postupka za izradu čekinje od termi... | [Packaging] | [10] |
| 107384 | 107398 | YU75102A | A low residue, stable antiperspirant and/or de... | [Non woven, Packaging, Skin care, Deo] | [9, 10, 14, 3] |
| 107385 | 107399 | YU75202A | Slabo otiruća, stabilna antiperspirant i/ili d... | [Non woven, Packaging, Skin care, Deo] | [9, 10, 14, 3] |
| 107386 | 107400 | YU82803A | Dvofazni, sa kuglicom za nanošenje antiperspir... | [Packaging, Deo] | [10, 3] |


Next, I turn the label ids into multi-hot vectors: one row per patent and one column per label,
with a 1 wherever the label applies.

In [6]:
import numpy as np
# multi-hot encode: one row per patent, one column per label id
label_encoded = np.zeros([len(new_dataset), len(updated_unique_label)], dtype=np.int32)
for i, lbl_list in enumerate(new_dataset["encodings"]):
    label_encoded[i, lbl_list] = 1
print(label_encoded[0:10])

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0]]
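
To read such rows, it helps to map the column positions back to label names. The sketch below uses a hypothetical four-label `label_mapping` rather than the 17-label mapping built from the data, and the helpers `encode` and `decode` are illustrative names that do not appear in the notebook.

```python
import numpy as np

# Hypothetical miniature mapping; the notebook derives `label_mapping` from the data.
label_mapping = {"Active ingredients": 0, "Deo": 1, "Skin care": 2, "Sun": 3}
id_to_label = {idx: lbl for lbl, idx in label_mapping.items()}

def encode(label_ids, n_labels):
    """Return a multi-hot vector with ones at the given label ids."""
    vec = np.zeros(n_labels, dtype=np.int32)
    vec[label_ids] = 1
    return vec

def decode(vec):
    """Return the label names whose positions are set in a multi-hot vector."""
    return [id_to_label[int(i)] for i in np.flatnonzero(vec)]

row = encode([0, 3], n_labels=len(label_mapping))
print(row, decode(row))
```

Decoding a handful of encoded rows and comparing them with the original label lists is a cheap sanity check before training.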


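Now that the labels are encoded, I split the dataset into subsets. The cell below produces the
training set (75 %) and the test set (25 %); a validation set is not split off at this point.
The split is stratified so that every label occurs with approximately the same relative frequency
in each subset.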

In [26]:
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
mskf = MultilabelStratifiedShuffleSplit(n_splits=2, test_size=0.25, random_state=0)

# the loop overwrites the variables on each pass, so only the last of the two splits is kept
for train_index, test_index in mskf.split(new_dataset["abstract"], label_encoded):
    X_train, X_test = new_dataset["abstract"][train_index], new_dataset["abstract"][test_index]
    y_train, y_test = label_encoded[train_index], label_encoded[test_index]



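Check whether the stratification worked: the per-label fractions of the training set (first array)
and the test set (second array) should be nearly identical.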

In [27]:
print(len(new_dataset))
print(len(X_train))
print(len(X_test))
print(np.sum(y_train, axis=0)/len(y_train))
print(np.sum(y_test, axis=0)/len(y_test))

107388
80591
26797
[0.37878919 0.10064399 0.01874899 0.03425941 0.01500168 0.16560162
 0.10279063 0.01677607 0.01580822 0.03185219 0.13154074 0.03736149
 0.00101748 0.00683699 0.25531387 0.06013078 0.00331303]
[0.379744   0.10090682 0.01877076 0.0343322  0.015039   0.16598873
 0.10303392 0.01683024 0.01585998 0.03194387 0.13188043 0.03746688
 0.00100758 0.00686644 0.25596149 0.06026794 0.00332127]
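
In the output above, the two arrays differ by roughly 0.001 at most (column 0: 0.3788 vs. 0.3797). Rather than comparing the arrays by eye, the check can be reduced to a single number. The helper `label_fraction_gap` below is a hypothetical name, and the toy arrays merely stand in for `y_train` and `y_test`.

```python
import numpy as np

def label_fraction_gap(y_a, y_b):
    """Largest absolute difference in per-label frequency between two multi-hot arrays."""
    frac_a = np.asarray(y_a).mean(axis=0)
    frac_b = np.asarray(y_b).mean(axis=0)
    return float(np.abs(frac_a - frac_b).max())

# Toy multi-hot arrays standing in for y_train and y_test.
y_a = np.array([[1, 0], [1, 1], [0, 0], [1, 0]])
y_b = np.array([[1, 0], [0, 1]])
gap = label_fraction_gap(y_a, y_b)
print(gap)
```

A value close to zero indicates that both subsets have the same label distribution; the toy arrays are deliberately unbalanced and give a much larger gap.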


In [37]:
train = X_train.reset_index(drop=True)
train = pd.concat([train, pd.DataFrame(y_train)], axis=1)
train

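| | abstract | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Disclosed is an oral dosage form comprising (i... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | A method of enhancing health through the gener... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Light-converting material comprises a europium... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | A flexible container (1) for holding a liquid ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Storage stable, topical lotion compositions fo... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 80586 | A deodorant composition comprising a magnesium... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 80587 | Ovaj pronalazak obuhvata jednofaznu kozmetičku... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 80588 | A low residue, stable antiperspirant and/or de... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 80589 | Slabo otiruća, stabilna antiperspirant i/ili d... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |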


In [38]:
test = X_test.reset_index(drop=True)
test = pd.concat([test, pd.DataFrame(y_test)], axis=1)
test

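| | abstract | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The invention of the present Application provi... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | An antimicrobial composition is disclosed that... | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | A composition for lubricant formulations is di... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | There is described a method of killing an inse... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | The disclosure relates to a three part, vegeta... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26792 | Cosmetic compositions for skin include anhydro... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 26793 | A bandage includes a top film, an absorbent la... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 26794 | System for operting a blind in a glass-enclose... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 26795 | Kod jednog postupka za izradu čekinje od termi... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |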


In [39]:
train.to_csv("train_ml_untouched.csv")
test.to_csv("test_ml_untouched.csv")
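
Because `to_csv` writes the row index as an unnamed first column, reading the files back needs a little care. The sketch below round-trips a toy frame through an in-memory buffer instead of a file; `frame`, `buffer` and `restored` are illustrative names.

```python
import io
import pandas as pd

# Toy frame shaped like `train`: text plus integer-named indicator columns.
frame = pd.DataFrame({"abstract": ["a", "b"], 0: [1, 0], 1: [0, 1]})

buffer = io.StringIO()
frame.to_csv(buffer)   # the row index is written as an unnamed first column
buffer.seek(0)

# index_col=0 turns that first column back into the index.
restored = pd.read_csv(buffer, index_col=0)
print(list(restored.columns))
```

After the round trip the label columns are named by strings ("0", "1") rather than integers, which matters when they are selected by name later on.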

### Correct the class imbalance
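To keep the neural network from focusing only on the frequent classes, I over-sample the rare ones.
Every class must reach a minimum number of training instances; a class below that threshold is
replaced by a sample of exactly that size, drawn with replacement, so its instances are duplicated.
Class frequencies only matter during training, so I rebalance the training set only and leave the
test set untouched.

Note that the cell below selects rows through a single integer "label" column, whereas the train
frame built above stores the labels in 17 indicator columns (0 to 16). The printed counts list
16 classes (0 to 15).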

In [16]:
subset_list = []
min_n_sample = 3000  # minimum number of instances per class
n_labels = train["label"].nunique()
for label in range(n_labels):
    subset = train[train["label"] == label]
    if len(subset) < min_n_sample:
        # draw min_n_sample rows with replacement, which duplicates rare instances
        subset = subset.sample(n=min_n_sample, random_state=1000, replace=True)
    subset_list.append(subset)
train_res = pd.concat(subset_list, axis=0)
print(train_res["label"].value_counts())

0     18212
13     9054
10     8411
5      6399
6      5827
1      3007
2      3000
3      3000
4      3000
7      3000
8      3000
9      3000
11     3000
12     3000
14     3000
15     3000
Name: label, dtype: int64
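
For a frame with one indicator column per label, the same idea could look as follows. This is a sketch under stated assumptions and not the procedure used above: the function `oversample_rare_labels` and its parameters are hypothetical, and it appends duplicates to the original rows instead of replacing a class subset.

```python
import pandas as pd

def oversample_rare_labels(frame, label_cols, min_n, seed=0):
    """Append duplicated rows until every label column sums to at least min_n."""
    parts = [frame]
    for col in label_cols:
        positives = frame[frame[col] == 1]
        shortfall = min_n - len(positives)
        if shortfall > 0 and len(positives) > 0:
            # Duplicate rows carrying this label; their other labels are copied too.
            parts.append(positives.sample(n=shortfall, replace=True, random_state=seed))
    return pd.concat(parts, axis=0, ignore_index=True)

toy = pd.DataFrame({
    "abstract": ["a", "b", "c", "d", "e"],
    0: [1, 1, 1, 1, 0],
    1: [0, 0, 0, 1, 1],
    2: [0, 0, 0, 0, 1],
})
balanced = oversample_rare_labels(toy, label_cols=[0, 1, 2], min_n=3)
print(balanced[[0, 1, 2]].sum())
```

Since one row can carry several labels, duplicating it for a rare label also raises the counts of its co-occurring labels. The threshold is therefore a lower bound per label, not an exact target.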


### Export

In [17]:
train_res.to_csv("train_v3.csv")
test.to_csv("test_v3.csv")
