## Experiment 1: Single-label classification based on patent abstracts

### Load and inspect data

In [1]:
import pandas as pd

labels = pd.read_csv("level1_labels.csv", index_col=0)
features = pd.read_csv("feature_stats.csv", index_col=0)
feature_stats = pd.concat([features, labels], axis=1)
feature_stats.head()

Unnamed: 0,path,abstract,description,claims,label,level1labels
AP1605A,E:\MLData\thesis\Datasets\LexisNexis\AP1605A.xml,False,False,False,0,
AP1665A,E:\MLData\thesis\Datasets\LexisNexis\AP1665A.xml,True,False,False,1,Skin care
AP1682A,E:\MLData\thesis\Datasets\LexisNexis\AP1682A.xml,True,False,False,1,Active ingredients
AP1904A,E:\MLData\thesis\Datasets\LexisNexis\AP1904A.xml,True,False,False,1,Hair care
AP1937A,E:\MLData\thesis\Datasets\LexisNexis\AP1937A.xml,True,False,False,1,Packaging


In [2]:
feature_stats = feature_stats[feature_stats["level1labels"].notna()] # drop unlabeled patents
feature_stats = feature_stats[feature_stats["abstract"] == 1] # drop patents that don't contain an abstract
print(f"Number of examples: {len(feature_stats)}")
print(feature_stats["level1labels"].value_counts())

Number of examples: 196574
Active ingredients              57205
Skin care                       30095
Packaging                       25063
Health care                     21890
Hair care                       21171
Cleansing                       10824
Sun                              8092
Perfume                          5989
Deo                              4243
Non woven                        3185
Devices                          1899
Lip care                         1861
Decorative cosmetic              1828
Manufacturing technology         1737
Shaving                           919
Sustainability                    477
Personalization                    86
Artificial Intelligence (AI)        5
no follow up                        3
IP7 Beiersdorf                      2
Name: level1labels, dtype: int64


The classes are heavily imbalanced: the largest class has 57205 examples, the smallest only 2.
I decide to drop the three smallest classes (5, 3 and 2 examples) because it is
unlikely that the NN will be able to make any reliable predictions when it has been trained
on only a handful of examples of the respective class.

In [3]:
# exclude the three smallest classes (5 or fewer examples each)
exclude_list = ["Artificial Intelligence (AI)", "no follow up", "IP7 Beiersdorf"]
mask = feature_stats['level1labels'].isin(exclude_list)
red_feature_stats = feature_stats[~mask].copy()
print(f"Number of examples: {len(red_feature_stats)}")
print(red_feature_stats["level1labels"].value_counts())

Number of examples: 196564
Active ingredients          57205
Skin care                   30095
Packaging                   25063
Health care                 21890
Hair care                   21171
Cleansing                   10824
Sun                          8092
Perfume                      5989
Deo                          4243
Non woven                    3185
Devices                      1899
Lip care                     1861
Decorative cosmetic          1828
Manufacturing technology     1737
Shaving                       919
Sustainability                477
Personalization                86
Name: level1labels, dtype: int64
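
The exclusion list above is written by hand. As a minimal sketch, the same list could be derived from a count threshold instead. The threshold `min_count` and the toy label column are assumptions for illustration and are not part of the original experiment. With the class counts printed above, any threshold between 6 and 86 would select the same three classes.

```python
import pandas as pd

# Toy label column (hypothetical data, for illustration only)
labels = pd.Series(
    ["Skin care"] * 6 + ["Sun"] * 4 + ["no follow up"] * 2 + ["IP7"] * 1,
    name="level1labels",
)

min_count = 3  # assumed threshold: classes with fewer examples are excluded

# Count examples per class and collect the classes below the threshold
counts = labels.value_counts()
exclude_list = counts[counts < min_count].index.tolist()

# Keep only rows whose class is not excluded
kept = labels[~labels.isin(exclude_list)]
print(sorted(exclude_list))
print(kept.value_counts())
```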


### Parse Patent Files
Parse patent files for their abstract if they're labeled with one of the remaining classes.

In [27]:
# parse xml files to get features
from PipelineBricks.parse_feature import process_files
dataset = pd.DataFrame()  # empty placeholder (an instance, not the class itself)
if __name__ == "__main__":
    feature_list = ['abstract']
    dataset = process_files(red_feature_stats, feature_list)

100%|██████████| 196564/196564 [19:33<00:00, 167.55it/s]


Convert the class names to categorical codes and remove instances where the abstract is a None value
(because the text is too short). In addition, some patents have different names but identical content,
so duplicates are dropped as well.

In [28]:
from pprint import pprint
# add labels to dataset (aligned on the patent ID index)
dataset["label"] = red_feature_stats["level1labels"]
# encode labels as integer category codes
dataset["label"] = dataset["label"].astype("category")
pprint(dict(enumerate(dataset["label"].cat.categories)))  # code -> class name
dataset["label"] = dataset["label"].cat.codes
dataset = dataset[["abstract", "label"]]
dataset.head()


{0: 'Active ingredients',
 1: 'Cleansing',
 2: 'Decorative cosmetic',
 3: 'Deo',
 4: 'Devices',
 5: 'Hair care',
 6: 'Health care',
 7: 'Lip care',
 8: 'Manufacturing technology',
 9: 'Non woven',
 10: 'Packaging',
 11: 'Perfume',
 12: 'Personalization',
 13: 'Shaving',
 14: 'Skin care',
 15: 'Sun',
 16: 'Sustainability'}


Unnamed: 0,abstract,label
AP1665A,Disclosed is an oral dosage form comprising (i...,14
AP1682A,A method of enhancing health through the gener...,0
AP1904A,Light-converting material comprises a europium...,5
AP1937A,A flexible container (1) for holding a liquid ...,10
AP2011006030A0,The invention of the present Application provi...,0
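
For string labels, `cat.codes` numbers the categories in sorted order, which is why 'Active ingredients' receives code 0 above. The following sketch, on a hypothetical toy frame, shows how the code-to-name mapping could be kept so that predicted codes can be decoded later. The names `toy`, `code_to_label` and `decoded` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({"label": ["Sun", "Deo", "Sun", "Hair care"]})
toy["label"] = toy["label"].astype("category")

# Mapping from integer code to class name (categories are sorted alphabetically)
code_to_label = dict(enumerate(toy["label"].cat.categories))

# Integer codes for each row
toy["label_encoded"] = toy["label"].cat.codes

# Decode the integer codes back to class names
decoded = toy["label_encoded"].map(code_to_label)
print(code_to_label)
print(decoded.tolist())
```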


In [29]:
# drop nan values
dataset = dataset.dropna(axis=0)
print(f"Number of examples: {len(dataset)}")

Number of examples: 194967


In [30]:
# drop duplicates (rows with identical abstract and label; the index is ignored)
dataset = dataset.drop_duplicates()
print(f"Number of examples: {len(dataset)}")


Number of examples: 116191
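
Because `drop_duplicates()` compares both columns, an abstract that occurs with two different labels survives as two rows. Whether such cases exist in the patent data is not shown in the output above. A minimal sketch on hypothetical toy data of how they could be detected:

```python
import pandas as pd

# Hypothetical data: "text a" is a true duplicate, "text b" has conflicting labels
toy = pd.DataFrame(
    {
        "abstract": ["text a", "text a", "text b", "text b", "text c"],
        "label": [0, 0, 1, 2, 3],
    },
    index=["P1", "P2", "P3", "P4", "P5"],
)

# Same call as in the cell above: compares abstract AND label
deduped = toy.drop_duplicates()

# Abstracts that still occur with more than one label after deduplication
labels_per_abstract = deduped.groupby("abstract")["label"].nunique()
conflicting = labels_per_abstract[labels_per_abstract > 1].index.tolist()
print(len(deduped), conflicting)
```

If conflicting abstracts turn up, they could end up in both the training and the test set, so it may be worth inspecting them before splitting.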


In [31]:
# split into train, validation and test dataset
from sklearn.model_selection import train_test_split

train, test = train_test_split(dataset, test_size=0.25,
                               random_state=1000, stratify=dataset["label"])
train, val = train_test_split(train, test_size=0.1, random_state=1000, stratify=train["label"])

value_counts = pd.concat([train["label"].value_counts(),
                          test["label"].value_counts(),
                          val["label"].value_counts()],
                         axis=1,
                         keys=["train", "test", "val"])
value_counts

Unnamed: 0,train,test,val
0,22681,8401,2520
14,13830,5122,1537
5,9259,3430,1029
10,8652,3204,961
6,6545,2424,727
1,5144,1906,572
15,3400,1259,378
11,2120,785,236
3,1663,616,185
9,1399,518,155
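
The two chained splits result in roughly 67.5 % training, 7.5 % validation and 25 % test data, because the validation set is taken from the 75 % that remain after the first split. The arithmetic can be sketched as follows; the check uses the counts reported for class 0 in the table above.

```python
test_size = 0.25  # first split, as in the cell above
val_size = 0.10   # second split, applied to the remaining training data

# Effective fractions of the full data set
test_frac = test_size
val_frac = (1 - test_size) * val_size
train_frac = (1 - test_size) * (1 - val_size)
print(f"train={train_frac:.3f}, val={val_frac:.3f}, test={test_frac:.3f}")

# Compare with the observed counts for class 0 (stratified split)
train_n, test_n, val_n = 22681, 8401, 2520
total = train_n + test_n + val_n
print(f"train={train_n / total:.3f}, val={val_n / total:.3f}, test={test_n / total:.3f}")
```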


### Correct the class imbalance
To prevent the NN from focusing only on the more frequent classes,
I oversample the less frequent classes: I set a minimum number of instances
that each class should have. If a class falls below this threshold, I draw that many instances from it with replacement, which creates duplicates.
Since the class frequency is only relevant during training, I adapt the frequencies only in the
training data set.

In [32]:
subset_list = []
min_n_sample = 3000  # each class should have at least n instances
n_labels = len(train["label"].unique())
# labels are the contiguous category codes 0 .. n_labels - 1
for label in range(n_labels):
    subset = train[train["label"] == label]
    if len(subset) < min_n_sample:
        # oversample: draw min_n_sample instances with replacement
        subset = subset.sample(n=min_n_sample, random_state=1000, replace=True)
    subset_list.append(subset)
train_res = pd.concat(subset_list, axis=0)
print(train_res["label"].value_counts())

0     22681
14    13830
5      9259
10     8652
6      6545
1      5144
15     3400
13     3000
12     3000
11     3000
8      3000
9      3000
7      3000
4      3000
3      3000
2      3000
16     3000
Name: label, dtype: int64
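
The same oversampling step can be written as a small reusable function. This is a sketch on hypothetical toy data. The function name `oversample_to_minimum` and its parameters are assumptions for illustration, not part of the original pipeline.

```python
import pandas as pd


def oversample_to_minimum(df, label_col, min_n, seed=1000):
    """Resample every class with fewer than min_n rows up to exactly min_n rows."""
    parts = []
    for _, subset in df.groupby(label_col):
        if len(subset) < min_n:
            # Draw with replacement, so rows of small classes are duplicated
            subset = subset.sample(n=min_n, replace=True, random_state=seed)
        parts.append(subset)
    return pd.concat(parts, axis=0)


# Hypothetical training data: class 0 has 5 rows, class 1 has only 2
toy = pd.DataFrame(
    {
        "abstract": [f"text {i}" for i in range(7)],
        "label": [0, 0, 0, 0, 0, 1, 1],
    }
)

balanced = oversample_to_minimum(toy, "label", min_n=4)
counts = balanced["label"].value_counts().to_dict()
print(counts)
```

Classes that already meet the threshold are left untouched, so the result is still imbalanced, only less so. This matches the counts printed above, where class 0 keeps its 22681 instances.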


### Export

In [33]:
train_res.to_csv("train.csv")
val.to_csv("val.csv")
test.to_csv("test.csv")
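
Since the patent IDs are stored in the index, `to_csv` writes them as the first, unnamed column. When reading the files back, `index_col=0` restores them as the index. A minimal round-trip sketch with hypothetical toy rows and an in-memory buffer in place of a file:

```python
import io

import pandas as pd

# Hypothetical rows with patent IDs as index
toy = pd.DataFrame(
    {"abstract": ["text a", "text b"], "label": [14, 0]},
    index=["AP1665A", "AP1682A"],
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns the first column back into the index
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```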
