This repository has been archived by the owner on Jun 2, 2021. It is now read-only.

Zalo AI Challenge 2020: News Summarization - Runner-up solution


btcnhung1299/zaloai-2020-news-summarization

 
 


Zalo AI Challenge 2020 - News Summarization

This is the solution write-up of team NKT for the News Summarization Track of Zalo AI Challenge 2020.

Our team members:

Approach

Pre-processing

The training dataset was created from a subset of the fields in the provided dataset:

  • match_summary
  • html_annotation
  • train_id (for file identity only)

We follow the format of ACE2004, a dataset for entity/relation extraction, and introduce new entity and relation types for this problem.

| Entity | Meaning | Entity's Vietnamese name |
| --- | --- | --- |
| CLU | Team | câu lạc bộ |
| SCO | Team goals | số bàn thắng |
| PSC | Player who scored | cầu thủ ghi bàn |
| PCA | Player who received a card | cầu thủ nhận thẻ |
| PSI | Substitute | cầu thủ thay thế |
| PSO | Player who was replaced | cầu thủ ra sân |
| TSC | Time when a player scored | thời gian ghi bàn |
| TCA | Time when a card is shown | thời gian nhận thẻ |
| TSI | Time when substitution happened | thời gian vào sân |

| Relation | Meaning | Relation's Vietnamese name |
| --- | --- | --- |
| COMP(CLU, CLU) | compete against | đấu với |
| SCOC(CLU, SCO) | score | đạt được bởi |
| SCOP(PSC, CLU) | score for | có pha lập công của |
| SCOT(PSC, TSC) | score at | là thời điểm lập công của |
| CARP(PCA, CLU) | carded as a player of | có thẻ phạt từ |
| CART(PCA, TCA) | carded at | là thời điểm nhận thẻ phạt của |
| SUBP(PSI, PSO) | substitute for | bị thế chỗ bởi |
| SUBT(PSI, TSI) | substitute at | là thời điểm bắt đầu của |

We use a heuristic matching method to find the positions of entities, both inside and outside relations:

  • SCO: We prioritize text inside a score pattern score - score (e.g., 4 - 2, 2 - 3) for matching.
  • For all other entities, we use exact matching to find positions. To handle short forms (e.g., Pogba for Paul Pogba), we fall back to subword matching in the referenced events if no exact match is found.
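The two matching rules can be sketched as follows. This is a minimal illustration, not the repository's code: the regular expression, the function names `find_score` and `find_entity`, and the choice to try longer subwords first are all assumptions.

```python
import re

# Hypothetical pattern for "score - score" strings such as "4 - 2".
SCORE_PATTERN = re.compile(r"(\d+)\s*-\s*(\d+)")


def find_score(text, score):
    """Return the character offset of `score`, preferring a score pattern."""
    for match in SCORE_PATTERN.finditer(text):
        for group in (1, 2):
            if match.group(group) == score:
                return match.start(group)
    # No score pattern contains the value: fall back to the first
    # plain occurrence (-1 if the value is absent).
    return text.find(score)


def find_entity(text, name):
    """Return (offset, length) using exact match, then subword fallback."""
    pos = text.find(name)
    if pos != -1:
        return pos, len(name)
    # Assumption: try longer subwords first, so "Pogba" is tried
    # before "Paul" for the name "Paul Pogba".
    for word in sorted(name.split(), key=len, reverse=True):
        pos = text.find(word)
        if pos != -1:
            return pos, len(word)
    return -1, 0
```

For instance, with the text `"Phút 4 , MU dẫn 4 - 2"`, `find_score` skips the minute "4" and returns the offset of the "4" inside "4 - 2".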

Text in html_annotation was word-segmented using the VnCoreNLP toolkit and concatenated into a single sample. Next, we encoded the information in match_summary as entities and relations using the matching method explained above. Note that entity positions found within referenced events were converted to absolute positions within the full content.
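The position conversion can be illustrated with a small sketch. It assumes token-level positions with inclusive end indices and that each referenced event is one already segmented token list; `to_absolute` is a hypothetical helper name, not a function from the repository.

```python
def to_absolute(segments, seg_index, rel_start, rel_end):
    """Map a token span inside segments[seg_index] to the full sample.

    `segments` is a list of token lists that are concatenated, in order,
    to form one training sample.
    """
    offset = sum(len(seg) for seg in segments[:seg_index])
    return offset + rel_start, offset + rel_end


segments = [
    ["MU", "thắng", "Arsenal"],
    ["Pogba", "ghi_bàn", "phút", "10"],
]
tokens = [tok for seg in segments for tok in seg]
start, end = to_absolute(segments, 1, 0, 0)
```

Here `tokens[start]` is the first token of the second event, so a span found relative to that event now indexes into the concatenated sample.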

We only use the data created following this procedure to train our model and do not use any external data.

Model

Our approach is based primarily on the paper "Entity-Relation Extraction as Multi-Turn Question Answering". To extract specific information from a sports report, we cast the task as joint entity-relation extraction, which is solved indirectly as a question answering problem:

  • To extract entities, we ask for them in a single question.
  • To extract multiple entities in a relation, we ask questions over multiple turns. Due to time constraints, we only consider relations that can be extracted in two question-answer turns.

We used PhoBERT as the feature extractor, followed by a classification head. Each token is classified into one of five tags (B, I, O, E, S), as in a typical sequence tagging problem.

First, the segmented content is fed to the model, which assigns a tag to each token. An entity consists of all tokens from a B tag to the matching E tag. These entities serve as head entities of relations and are inserted into the appropriate template to form questions. The answers to these questions are the tail entities of the relations.
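A minimal sketch of turning predicted tags into entity strings is shown below. It assumes the usual BIOES convention, in which S marks a single-token entity, and it simply drops malformed sequences; `decode_bioes` is an illustrative name rather than the repository's function.

```python
def decode_bioes(tokens, tags):
    """Collect entity strings from per-token B/I/O/E/S tags."""
    entities = []
    start = None  # index of the currently open B tag, if any
    for i, tag in enumerate(tags):
        if tag == "S":
            # Single-token entity.
            entities.append(tokens[i])
            start = None
        elif tag == "B":
            start = i
        elif tag == "E" and start is not None:
            # Close the span opened by the most recent B tag.
            entities.append(" ".join(tokens[start:i + 1]))
            start = None
        elif tag == "O":
            # An O tag discards any span that was never closed.
            start = None
    return entities
```

For the tokens `["U23", "Việt_Nam", "gặp", "Chelsea"]` with tags `["B", "E", "O", "S"]`, this yields two entities, one spanning the first two tokens and one for the last token.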

More specifically, in the first turn we extract all entities belonging to the 9 entity types above. The question template for this turn is

Liệt kê tất cả [Entity's Vietnamese name].

Example: The question for extracting all CLU entities is "Liệt kê tất cả [câu lạc bộ]".

The extracted entities will then be used as the head entities in the question template for the second turn.

[Tail entity's Vietnamese name] nào [relation's Vietnamese name] [head entity's Vietnamese name] [extracted head entity's value]

Example: To extract the tail entity club (CLU) that competes against the head entity U23 Việt Nam (CLU), the question for the corresponding relation COMP(CLU, CLU) is "[Câu lạc bộ] nào [đấu với] [câu lạc bộ] [U23 Việt Nam]".
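The two templates can be filled programmatically, as in the sketch below. The dictionaries cover only a few of the types from the tables, the helper names are hypothetical, and the square brackets are treated as placeholder markers and left out of the generated strings, which is an assumption about the actual input format.

```python
# Subset of the entity and relation tables, for illustration only.
ENTITY_NAMES = {
    "CLU": "câu lạc bộ",
    "PSC": "cầu thủ ghi bàn",
    "TSC": "thời gian ghi bàn",
}
# relation -> (head type, tail type, relation's Vietnamese name)
RELATIONS = {
    "COMP": ("CLU", "CLU", "đấu với"),
    "SCOT": ("PSC", "TSC", "là thời điểm lập công của"),
}


def first_turn_question(entity_type):
    """Question that asks for every entity of one type."""
    return f"Liệt kê tất cả {ENTITY_NAMES[entity_type]}."


def second_turn_question(relation, head_value):
    """Question that asks for the tail entity of `relation`."""
    head_type, tail_type, rel_name = RELATIONS[relation]
    tail_name = ENTITY_NAMES[tail_type]
    tail_name = tail_name[0].upper() + tail_name[1:]
    return f"{tail_name} nào {rel_name} {ENTITY_NAMES[head_type]} {head_value}"
```

Calling `second_turn_question("COMP", "U23 Việt Nam")` produces the COMP question from the example, without the brackets.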

To reflect the joint extraction method, the loss function combines the losses of the entity extraction and relation extraction phases:

$$\mathcal{L} = (1 - \lambda)\,\mathcal{L}_{\text{head-entity}} + \lambda\,\mathcal{L}_{\text{tail-entity, rel}}$$

where $\lambda \in [0, 1]$ is the parameter that controls the trade-off between extracting the right head entities and accurately extracting the tail entities in their corresponding relations. Please refer to the paper for more details.
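As a rough sketch, and assuming the weighted-sum form used in the referenced paper, the combination of the two phase losses could look like this. `joint_loss` and its default weight of 0.5 are illustrative assumptions; the value we actually used is not stated here.

```python
def joint_loss(head_loss, tail_loss, lam=0.5):
    """Weighted sum of the head-entity loss and the tail-entity loss.

    `lam` plays the role of the trade-off parameter: 0 trains only the
    first turn (head entities), 1 trains only the second turn.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * head_loss + lam * tail_loss
```

The function works on plain floats, and the same expression applies unchanged to scalar loss tensors in a deep learning framework.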

The code for the original model is available at the official repository of the paper.

Post-processing

The output of the model for each football article is written in a text file where each line contains an entity or a relation prediction. For example:

CLU Chelsea
CLU Arsenal
COMP Chelsea Arsenal

To combine the extracted entities and relations into the final prediction, we apply several heuristics that filter the system output and convert it to the submission format. The required fields are filled in the following order:

  1. Teams
  2. Score board
  3. Score list
  4. Card list
  5. Substitution list

Teams

As many extracted entities and relations contain team names, we prioritize the sources from which team names are taken. The highest priority goes to the COMP relation, where both team names may appear. The second priority goes to SCOP, SCOC and CARP, where the club is paired with a player, score or card. Standalone CLU entities get the lowest priority, since a football article may mention several club names (even an unrelated one, for comparison).

To handle aliases and inconsistent occurrences of a team name, we use the longest common subsequence (LCS) algorithm to compare names and keep the longer one.
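One way to apply LCS to alias handling is sketched below. It compares names word by word and treats a name as an alias when all of its words appear, in order, in a longer name that is already kept. The word-level granularity and that criterion are assumptions; the repository may compare characters or use a different threshold.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def merge_aliases(names):
    """Drop names whose words all occur, in order, in a longer kept name."""
    kept = []
    for name in sorted(names, key=len, reverse=True):
        words = name.lower().split()
        is_alias = any(
            lcs_length(words, other.lower().split()) == len(words)
            for other in kept
        )
        if not is_alias:
            kept.append(name)
    return kept
```

With this criterion "United" collapses into "Manchester United", while an abbreviation such as "MU" is kept as a separate name, so abbreviations would need extra handling.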

Score board

We use the score information from the SCOP and SCOC relations and the SCO entities extracted by our model. With SCOP, we group players by team and assume each team's score equals the number of its players. We then extract score information from SCOC, and only use SCO to handle draws.
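The SCOP grouping step can be sketched as below. How SCOC is combined with the SCOP counts is not spelled out above, so the fallback used here (take the SCOC value only when a team has no SCOP entries) is an assumption, and the SCO draw handling is omitted. `build_score_board` is a hypothetical name.

```python
from collections import Counter


def build_score_board(teams, scop_pairs, scoc_scores):
    """Estimate each team's score.

    teams:       the team names chosen in the previous step
    scop_pairs:  (player, club) pairs from SCOP relations
    scoc_scores: club -> score taken from SCOC relations
    """
    counts = Counter(club for _player, club in scop_pairs)
    board = {}
    for team in teams:
        if counts[team] > 0:
            # One SCOP relation is counted as one goal for the club.
            board[team] = counts[team]
        else:
            # Assumed fallback: use SCOC, or 0 if nothing is known.
            board[team] = scoc_scores.get(team, 0)
    return board
```

For example, two SCOP pairs for one team and a SCOC score of 1 for the other give a 2 and 1 score board.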

Score list

We include all goal information from the SCOT and SCOP relations extracted by our model. We use SCOT to fill in the player_name and time fields, then use SCOP to map each player_name to a team when this information is available; otherwise, we set team to an empty string. We also add every remaining SCOP player, with club information, who does not appear in any SCOT relation. Note that we do not use the standalone PSC and TSC entities, as doing so hurt our performance on the public test set.
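The merge of SCOT and SCOP can be sketched as follows. The field names player_name, time and team come from the description above; the function name, the tuple-based inputs, and the empty time for players found only in SCOP are assumptions.

```python
def build_score_list(scot_pairs, scop_pairs):
    """Merge (player, time) and (player, club) pairs into goal records."""
    club_of = dict(scop_pairs)
    score_list = [
        {"player_name": player, "time": time, "team": club_of.get(player, "")}
        for player, time in scot_pairs
    ]
    timed_players = {player for player, _time in scot_pairs}
    # Add scorers known only through SCOP; their goal time is unknown.
    for player, club in scop_pairs:
        if player not in timed_players:
            score_list.append({"player_name": player, "time": "", "team": club})
    return score_list
```

The same function shape would apply to the card list by passing CART and CARP pairs instead.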

Card list

The card list is constructed from the information in the CART and CARP relations. We follow the same procedure as that for the score list.

Substitution list

To create the substitution list, we iterate through all SUBP(PSI, PSO) relations in the output file to get the player_in and player_out information, and use the SUBT relation to look up the corresponding substitution time (time is set to an empty string if this information is not available). We then append every player_in that appears only in SUBT, together with its substitution time. Again, we do not use the standalone PSI, PSO and TSI entities when constructing the list.
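A sketch of this procedure, under the assumption that the relations have already been parsed into tuples, is given below. `build_substitution_list` is a hypothetical name, and leaving player_out empty for players found only in SUBT is an assumption.

```python
def build_substitution_list(subp_pairs, subt_pairs):
    """Combine (player_in, player_out) and (player_in, time) pairs."""
    time_of = dict(subt_pairs)
    substitutions = [
        {"player_in": p_in, "player_out": p_out, "time": time_of.get(p_in, "")}
        for p_in, p_out in subp_pairs
    ]
    paired_players = {p_in for p_in, _p_out in subp_pairs}
    # Players that appear only in SUBT have no known player_out.
    for p_in, time in subt_pairs:
        if p_in not in paired_players:
            substitutions.append({"player_in": p_in, "player_out": "", "time": time})
    return substitutions
```

Keeping the SUBP entries first preserves the order in which the relations appear in the output file.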

Code usage

First, clone this repository.

Install Requirements

pip install -r requirements.txt

Preprocessing

  • Put the training data into ./data/raw
  • Go to the scripts folder: cd scripts
  • Run: sh prepare_zalo.sh

Training

  • Go to the scripts folder: cd scripts
  • Run: sh train_zalo.sh
  • After training, the model is saved in ./checkpoint

Inference

  • If you only want to run prediction, download our Docker image
  • Load the image: docker load < nkt_image.tar.gz
  • Run: docker run -v [path to test data]:/data -v [path to result folder]:/result nkt_image:final /bin/bash /model/predict.sh


Languages

  • Python 57.8%
  • Jupyter Notebook 39.7%
  • Shell 2.5%