
BiLSTM model to extract product information such as brand name, color, or size from multilingual texts (mainly Finnish and English)


YuTian8328/NER-Multilingual-Product


NER & Extraction from product-data

The goal of this task is to extract product information such as BRAND NAME, SIZE, COLOR, GENDER, AGE, VOLUME, and WEIGHT from product titles and descriptions supplied by 119 unique providers. The dataset contains 670k observations, and the texts are a mixture of English and Finnish. The brand column has no missing data, so it can be used to annotate Brand Name (branding.py). The meta column is also useful for labeling size, color, gender, and age, although some of that information is missing (datagenerator.py).
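As an illustration of how a fully populated brand column can drive annotation, here is a minimal sketch. It is not the code in branding.py: the function names, the simple regex tokenizer, and the B-/I-/O tag names are assumptions made for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())

def annotate_brand(title, brand):
    """Tag each title token as B-BRAND, I-BRAND or O by matching the known brand value."""
    tokens = tokenize(title)
    brand_tokens = tokenize(brand)
    tags = ["O"] * len(tokens)
    n = len(brand_tokens)
    if n == 0:
        return tokens, tags
    # Slide a window over the title and mark every exact occurrence of the brand.
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == brand_tokens:
            tags[i] = "B-BRAND"
            for j in range(i + 1, i + n):
                tags[j] = "I-BRAND"
    return tokens, tags

tokens, tags = annotate_brand("Uusi The North Face takki", "The North Face")
print(list(zip(tokens, tags)))
```

Exact token matching is the simplest choice; titles that abbreviate or misspell the brand would need a fuzzier match.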

Here is what a sample of the data looks like:

Brand Names

Because the texts are a mixture of Finnish and English, and many brand names are rarely used words or newly coined industry terms, existing word embedding methods such as GloVe or BERT cannot embed them properly. Words without a useful embedding are a serious problem for a NER model.

The solution is to train a custom word embedding model on this special corpus using the gensim and nltk libraries (train_custom_w2v.py). With this custom embedding, the NER model extracts Brand Names quite effectively: test set accuracy is approximately 99.5% (train_and_evaluate_model.py).
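The following sketch shows roughly what training a custom embedding on the product corpus could look like. It is not the contents of train_custom_w2v.py: the helper names and hyperparameters are assumptions, and a plain regex tokenizer stands in for nltk so that the corpus preparation runs without extra dependencies. gensim is imported only inside the training helper.

```python
import re

def to_sentences(texts):
    """Turn raw product texts into lists of lowercase tokens, one list per text."""
    sentences = []
    for text in texts:
        tokens = re.findall(r"\w+", text.lower())
        if tokens:  # skip empty titles or descriptions
            sentences.append(tokens)
    return sentences

def train_embeddings(sentences, vector_size=100):
    """Train Word2Vec on the tokenized corpus (requires gensim 4 or later)."""
    from gensim.models import Word2Vec  # lazy import: to_sentences works without gensim
    # min_count=1 keeps rare brand names in the vocabulary (an assumed setting).
    return Word2Vec(sentences=sentences, vector_size=vector_size,
                    window=5, min_count=1, workers=1)

corpus = to_sentences(["Fiskars sakset 21cm", "", "Marimekko Unikko -tyyny"])
print(corpus)
```

Keeping `min_count` low matters here because the motivation for the custom model is precisely the rare and coined words that generic embeddings miss.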

COLOR, VOLUME, WEIGHT, GENDER, AGE

Color, volume, weight, gender, and age are labeled using regex and information from the meta column.
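A minimal sketch of this kind of labeling is shown below. The patterns, the unit lists, and the `label_spans` helper are hypothetical; the `colors` argument stands in for color values that would be read from the meta column.

```python
import re

# Hypothetical patterns; the number allows a decimal comma or point.
NUMBER = r"\d+(?:[.,]\d+)?"
PATTERNS = {
    "VOLUME": re.compile(rf"\b{NUMBER}\s?(?:ml|cl|dl|l)\b", re.IGNORECASE),
    "WEIGHT": re.compile(rf"\b{NUMBER}\s?(?:mg|kg|g)\b", re.IGNORECASE),
}

def label_spans(text, colors=()):
    """Return (label, matched_text) pairs found in text, ordered by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((m.start(), label, m.group()))
    # Colors come from a known list (for example, values seen in the meta column).
    for color in colors:
        for m in re.finditer(rf"\b{re.escape(color)}\b", text, re.IGNORECASE):
            found.append((m.start(), "COLOR", m.group()))
    return [(label, value) for _, label, value in sorted(found)]

print(label_spans("Shampoo 250 ml, musta pullo, paino 0,3 kg", colors=["musta"]))
```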

The NER model also performs well at extracting color, volume, and weight alongside brand name: test set accuracy is 99.65% (train_and_evaluate_model.py).

SIZE relevant information

Size-relevant information takes many forms:

  • "100cm"
  • "3.0cm - 5.0cm"
  • "95x220cm"
  • "0,5 x 1,8 x 49cm"
  • "kokoja: s, m, l, xl, xxl"
  • "kokoja: us 34 us 35 us 36"
  • "etupituus: 56cm, takapituus: 70cm"
  • "R14"
  • "kokoja: 3, 3 1/2, 4, 4 1/2"
  • "kokoja: yksi koko"
  • "koko: standard"
  • ......

NER is not a good choice for extracting this kind of information (accuracy was approximately 50% in our experiment), whereas regex is both effective and efficient. The final information extraction strategy is therefore a combination of regex and the NER model (test_example.py).
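To make the regex approach concrete, here is a sketch that covers several of the size forms listed above. It is not the pattern set used in test_example.py: the unit list, the alternatives, and the `extract_sizes` helper are assumptions, and real data would need more cases.

```python
import re

NUM = r"\d+(?:[.,]\d+)?"   # integer or decimal with "." or ","
UNIT = r"(?:mm|cm|m)\b"    # assumed set of length units

# Alternatives are ordered from most to least specific so that
# "95x220cm" is not cut down to "220cm".
SIZE_RE = re.compile(
    rf"""
    \b(?:
        {NUM}(?:\s?x\s?{NUM})+\s?{UNIT}          # 95x220cm, 0,5 x 1,8 x 49cm
      | {NUM}\s?{UNIT}\s?-\s?{NUM}\s?{UNIT}      # 3.0cm - 5.0cm
      | {NUM}\s?{UNIT}                           # 100cm
      | R\d\d\b                                  # R14
      | koko(?:ja)?:\s*[^.;\n]+                  # koko: standard, kokoja: s, m, l
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)

def extract_sizes(text):
    """Return all non-overlapping size expressions found in text."""
    return [m.group().strip() for m in SIZE_RE.finditer(text)]

for example in ["matto 95x220cm", "etupituus: 56cm, takapituus: 70cm",
                "kokoja: s, m, l, xl, xxl"]:
    print(example, "->", extract_sizes(example))
```

Putting every form into one ordered alternation lets `finditer` return non-overlapping matches, which avoids reconciling results from several separate patterns.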

Workflow

NER model

A BiLSTM neural network is built to perform NER and extraction.
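A sequence tagger such as a BiLSTM emits one label per token, so the labels still have to be grouped into entities. The sketch below shows one common way to do that, assuming a BIO tagging scheme (B- starts an entity, I- continues it, O is outside). The scheme and the `decode_entities` helper are assumptions for illustration and are not taken from the repository.

```python
def decode_entities(tokens, tags):
    """Group BIO-tagged tokens into (label, text) entities."""
    entities = []
    current_label, current_tokens = None, []

    def flush():
        # Store the entity collected so far, if any.
        if current_tokens:
            entities.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            flush()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O" closes any open entity
            flush()
            current_label, current_tokens = None, []
    flush()
    return entities

tokens = ["the", "north", "face", "takki", "musta"]
tags = ["B-BRAND", "I-BRAND", "I-BRAND", "O", "B-COLOR"]
print(decode_entities(tokens, tags))
```

An I- tag whose label differs from the open entity is treated as the start of a new entity, which keeps the decoder tolerant of imperfect model output.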

The architecture looks like this:

Test with unseen text example

test_example.py
