NER & Extraction from product-data

The goal of this task is to extract product information such as BRAND NAME, SIZE, COLOR, GENDER, AGE, VOLUME, and WEIGHT from product titles and descriptions supplied by 119 unique providers. The dataset contains 670k observations, and the texts are a mixture of English and Finnish. The brand column has no missing data, so it can be used to annotate Brand Name (branding.py). The meta column is also quite useful for labeling size, color, gender, and age, although some of its information is missing (datagenerator.py).
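As a rough illustration of how a fully populated brand column can drive annotation, the sketch below tags brand tokens in a title with BIO labels. It is not the code in branding.py: the function names, the `\w+` tokenizer, the `B-BRAND`/`I-BRAND` tag names, and the sample titles are all assumptions made for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware, so ä/ö survive)."""
    return re.findall(r"\w+", text.lower())


def tag_brand(title, brand):
    """Return (tokens, tags), marking every occurrence of the brand with BIO tags.

    Hypothetical helper: assumes the brand string comes from the brand column
    and appears in the title with the same spelling (case-insensitive).
    """
    tokens = tokenize(title)
    brand_tokens = tokenize(brand)
    tags = ["O"] * len(tokens)
    n = len(brand_tokens)
    if n == 0:
        return tokens, tags

    i = 0
    while i <= len(tokens) - n:
        if tokens[i:i + n] == brand_tokens:
            tags[i] = "B-BRAND"
            for j in range(i + 1, i + n):
                tags[j] = "I-BRAND"
            i += n  # skip past the match so occurrences never overlap
        else:
            i += 1
    return tokens, tags
```

For a made-up row with title "Under Armour miesten t-paita" and brand "Under Armour", this yields `B-BRAND`, `I-BRAND` for the first two tokens and `O` for the rest. Titles where the brand is spelled differently from the brand column would need fuzzy matching, which this sketch does not attempt.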

This is what the data looks like:

Brand Names

Because the texts mix Finnish and English, and many brand names are rarely used words or newly coined industry terms, existing word embedding methods such as GloVe or BERT cannot embed them properly. If these words cannot be embedded properly in a NER model, that is a serious problem.

The solution is to train a custom word embedding model on this specialized corpus using the gensim and nltk libraries (train_custom_w2v.py). Built on this custom embedding model, the NER model extracts Brand Names quite effectively, with a test set accuracy of approximately 99.5% (train_and_evaluate_model.py).
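A minimal sketch of training such a custom embedding is shown below. It is not the contents of train_custom_w2v.py: it assumes gensim 4.x (where the dimension parameter is called `vector_size`), substitutes a simple regex tokenizer for nltk's so that no tokenizer data has to be downloaded, and the hyperparameters and function names are placeholders chosen for illustration.

```python
import re


def tokenize(text):
    """Lowercase and split into word tokens; a stand-in for an nltk tokenizer."""
    return re.findall(r"\w+", text.lower())


def build_corpus(texts):
    """Turn raw titles/descriptions into the list of token lists that Word2Vec expects.

    Texts that contain no tokens at all (for example empty strings) are dropped.
    """
    return [tokens for tokens in (tokenize(t) for t in texts) if tokens]


def train_w2v(sentences, vector_size=100):
    """Train a skip-gram Word2Vec model on the tokenized corpus (assumed settings)."""
    # Imported here so the corpus preparation above can run without gensim installed.
    from gensim.models import Word2Vec

    return Word2Vec(
        sentences=sentences,
        vector_size=vector_size,  # embedding dimension
        window=5,                 # context window on each side
        min_count=1,              # keep rare tokens, since brand names are often rare
        sg=1,                     # 1 = skip-gram, 0 = CBOW
        workers=1,
        seed=0,
    )
```

Calling `train_w2v(build_corpus(texts))` returns a model whose `model.wv[token]` lookup gives the vector for a token seen during training. Keeping `min_count` low matters here because a higher threshold would drop exactly the rare brand tokens the custom model is meant to cover.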

COLOR, VOLUME, WEIGHT, GENDER, AGE

Color, volume, weight, gender, and age are labeled using regex and information from the meta column.
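The regex side of this labeling might look roughly like the sketch below. The patterns, the short Finnish/English word lists, and the `label_spans` helper are assumptions made for illustration (the actual rules are in datagenerator.py and are not reproduced here), and AGE and the meta column lookup are left out for brevity.

```python
import re

# A number with an optional decimal part written with "." or "," (e.g. 250, 0,3, 1.5).
NUM = r"\d+(?:[.,]\d+)?"

# Illustrative patterns only; real unit and word lists would be much longer.
PATTERNS = {
    "VOLUME": re.compile(rf"\b{NUM}\s?(?:ml|cl|dl|l)\b", re.IGNORECASE),
    "WEIGHT": re.compile(rf"\b{NUM}\s?(?:mg|kg|g)\b", re.IGNORECASE),
    "COLOR": re.compile(
        r"\b(?:musta|valkoinen|punainen|sininen|black|white|red|blue)\b",
        re.IGNORECASE,
    ),
    "GENDER": re.compile(r"\b(?:miesten|naisten|men|women)\b", re.IGNORECASE),
}


def label_spans(text):
    """Return (start, end, label, matched_text) tuples sorted by position in the text."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label, match.group()))
    return sorted(spans)
```

On a made-up description such as "Naisten shampoo 250 ml, musta pullo 0,3kg", this returns one span each for GENDER, VOLUME, COLOR, and WEIGHT, in that order. The character offsets make it straightforward to convert the spans into token-level tags for training.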

The NER model also performs quite well at extracting color, volume, and weight in addition to brand name: test set accuracy is 99.65% (train_and_evaluate_model.py).

SIZE relevant information

Size-relevant information takes many forms:

"100cm"
"3.0cm - 5.0cm"
"95x220cm"
"0,5 x 1,8 x 49cm"
"kokoja: s, m, l, xl, xxl"
"kokoja: us 34 us 35 us 36"
"etupituus: 56cm, takapituus: 70cm"
"R14"
"kokoja: 3, 3 1/2, 4, 4 1/2"
"kokoja: yksi koko"
"koko: standard"
......

NER is not a good choice for extracting this kind of information (accuracy was approximately 50% in our experiment), whereas regex is quite effective and efficient. The final information extraction strategy is therefore a combination of regex and the NER model (test_example.py).
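To give a feel for the regex side, here is a small sketch that covers the example forms listed above. It is not the code in test_example.py: the three patterns, the `extract_sizes` name, and the choice to split size lists on commas are assumptions, and a production rule set would need more units and more keys than "koko"/"kokoja".

```python
import re

NUM = r"\d+(?:[.,]\d+)?"    # 100, 3.0, 0,5
UNIT = r"(?:mm|cm|m)"       # length units assumed for this sketch

# One or more numbers joined by "x" or "-", ending in a number with a unit:
# "100cm", "3.0cm - 5.0cm", "95x220cm", "0,5 x 1,8 x 49cm".
DIMENSION = re.compile(
    rf"\b(?:{NUM}(?:\s?{UNIT})?\s?[x-]\s?)*{NUM}\s?{UNIT}\b", re.IGNORECASE
)

# Rim-style sizes such as "R14".
RIM = re.compile(r"\bR\d{2}\b", re.IGNORECASE)

# Everything after "koko:" or "kokoja:" up to the end of the line.
SIZE_LIST = re.compile(r"\bkoko(?:ja)?\s*:\s*(.+)", re.IGNORECASE)


def extract_sizes(text):
    """Collect size-like strings from text, without duplicates, in discovery order."""
    found = [m.group() for m in DIMENSION.finditer(text)]
    found += [m.group() for m in RIM.finditer(text)]

    size_list = SIZE_LIST.search(text)
    if size_list:
        # Splitting on commas would break decimal commas inside a list;
        # acceptable for this sketch, since the listed examples have none.
        found += [item.strip() for item in size_list.group(1).split(",") if item.strip()]

    return list(dict.fromkeys(found))
```

For instance, `extract_sizes("etupituus: 56cm, takapituus: 70cm")` picks out both measurements, and `extract_sizes("kokoja: 3, 3 1/2, 4, 4 1/2")` returns the four listed sizes. In a combined pipeline, output like this would be merged with the entities predicted by the NER model for the other fields.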

Workflow

NER model

A BiLSTM neural network is built to perform NER and extraction.

The architecture looks like this:

Test with unseen text example

test_example.py

YuTian8328/NER-Multilingual-Product
