GitHub - amazon-science/contextual-product-qa

Contextual Product Question Answering

Contextual product question answering (CPQA) is an active topic in e-commerce applications. Given a question asked in the context of a product, a PQA system searches the product page and provides an instant answer, so that customers do not need to read through the page themselves or seek help from human agents. This can greatly improve the online shopping experience while significantly reducing the cost of customer service support. This repository holds datasets supporting the contextual product question answering task.

Right now it includes four datasets:

semiPQA: Question answering from semi-structured data described in this paper.
hetPQA: Question answering from heterogeneous data described in this paper.
ePQA: An updated version of hetPQA (less noise, finer-grained labels, and candidate context taken into account), described in this paper.
xPQA: Cross-lingual Product Question answering in 12 languages, described in this paper.

SemiPQA

Current research mainly focuses on finding answers from either unstructured text, like product descriptions and user reviews, or structured knowledge bases with pre-defined schemas. Apart from these two sources, a lot of product information is represented in a semi-structured way, e.g., key-value pairs, lists, tables, and JSON and XML files. Such semi-structured data can be a valuable answer source because it is better organized than free text while being easier to construct than structured knowledge bases. SemiPQA is a dataset for benchmarking PQA over semi-structured data. It contains 10,949 written questions about JSON-formatted data covering 258 unique attribute types (the numbers are slightly lower than those reported in the paper because some sensitive attributes were removed). Each data point is paired with manually annotated text that describes its contents, so that a neural answer presenter can be trained to present the data in a natural way.

The semiPQA folder contains the dataset. It has two sub-folders, one for attribute ranking and one for data-to-text generation. Each sub-folder contains four CSV files: train, dev, testseen and testunseen, where the latter two correspond to seen and unseen attributes as described in the paper. Fields in the CSV files are separated by tabs ("\t").

The attribute ranking files contain the following columns:

qid: question id
qa_pair_id: question-candidate pair id
question: question text
candidate: candidate text (JSON-formatted attributes converted into strings)
label: 1 means that the candidate is relevant to the question, and 0 otherwise.
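
As a minimal sketch of how these columns might be consumed, the snippet below reads tab-separated rows with Python's standard `csv` module and groups the candidates by `qid`. The sample rows and the helper name `group_candidates` are invented for illustration and are not taken from the dataset; it also assumes the files start with a header row carrying the column names listed above.

```python
import csv
import io
from collections import defaultdict

# Invented rows in the documented column layout (not taken from the dataset).
SAMPLE_TSV = (
    "qid\tqa_pair_id\tquestion\tcandidate\tlabel\n"
    "q1\tq1-0\tWhat color is it?\tcolor: red\t1\n"
    "q1\tq1-1\tWhat color is it?\tweight: 2 lb\t0\n"
    "q2\tq2-0\tHow heavy is it?\tweight: 2 lb\t1\n"
)


def group_candidates(tsv_file):
    """Map each qid to a list of (candidate, label) pairs."""
    # Assumes a header row with the column names described above.
    reader = csv.DictReader(tsv_file, delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["qid"]].append((row["candidate"], int(row["label"])))
    return dict(grouped)


grouped = group_candidates(io.StringIO(SAMPLE_TSV))
for qid, candidates in grouped.items():
    print(qid, candidates)
```

To try this on a real file, pass a handle opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample, where `path` points to one of the attribute ranking files.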

The data-to-text files contain the following columns:

data: data text (JSON-formatted attributes converted into strings)
text: manually written text describing the content of the data

HetPQA

It is of great value to answer product questions based on the heterogeneous information sources available on web product pages, e.g., semi-structured attributes, text descriptions, and user-provided content. hetPQA is a large-scale benchmark dataset for answering product questions from 6 heterogeneous sources:

semi-structured attribute: Product attributes in JSON format, as in the semiPQA dataset.
bullet point: Product summaries in the form of bullet points from the product page.
product description: Product descriptions from the manufacturer and Amazon.
on-site publication: Manually written publications about products (for example here).
user review: User reviews written for the product.
community answer: Top-voted community answers. Answers directly replying to questions in our question set are discarded.

The hetPQA dataset has three main features: (1) it provides clear annotations for both evidence ranking and answer generation, enabling in-depth evaluation of these two components separately; (2) it covers a mix of 6 heterogeneous sources, ranging from semi-structured specifications (JSON) to free-text sentences; and (3) it contains naturally occurring questions, unlike previous collections that elicited questions by showing answers explicitly. All questions in the hetPQA dataset are about the "toys and games" product category. The number of instances in the released version differs slightly from the number reported in the original paper because some sensitive content was filtered out.

The hetPQA folder contains the dataset. It has two sub-folders, one for evidence ranking and one for answer generation. Each sub-folder contains three CSV files: train, dev and test. Fields in the CSV files are separated by tabs ("\t").

The evidence ranking files contain the following columns:

qid: question id
ASIN: ASIN number of the corresponding product
qa_pair_id: question-candidate pair id
question: question text
candidate: candidate text
label: 1 means that the candidate is relevant to the question, and 0 otherwise
source: the source of the candidate, can be one of the 6 information sources included in the dataset
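
Because each candidate carries both a `label` and a `source`, one simple way to inspect the data is to count how many candidates from each source are marked relevant. The following is a sketch only: the rows are invented, the helper name `relevance_by_source` is hypothetical, and the source strings are placeholders whose exact spelling in the released files may differ.

```python
from collections import Counter

# Invented rows; the source strings are placeholders and may not match
# the exact values used in the released files.
rows = [
    {"qid": "q1", "source": "user review", "label": "1"},
    {"qid": "q1", "source": "bullet point", "label": "0"},
    {"qid": "q2", "source": "user review", "label": "0"},
    {"qid": "q2", "source": "community answer", "label": "1"},
]


def relevance_by_source(rows):
    """Return {source: (relevant_count, total_count)} for binary labels."""
    total = Counter()
    relevant = Counter()
    for row in rows:
        total[row["source"]] += 1
        relevant[row["source"]] += int(row["label"])  # label is 0 or 1
    return {source: (relevant[source], total[source]) for source in total}


stats = relevance_by_source(rows)
for source, (hits, count) in stats.items():
    print(f"{source}: {hits}/{count} relevant")
```

The `rows` list has the same shape as what `csv.DictReader` yields with `delimiter="\t"`, so the function can be applied to a parsed file directly.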

The answer generation files contain the following columns:

ASIN: ASIN number of the corresponding product
question: question text
candidate: candidate text
answer: manually written natural-sounding answer given the question and information contained in the candidate
source: the source of the candidate, can be one of the 6 information sources included in the dataset

ePQA

ePQA is a cleaner version of hetPQA with the following differences:

It has higher annotation quality, achieved through several rounds of verification. In our in-house annotation, the error rate is less than 5%.
It does not restrict the product categories, while the hetPQA dataset focuses only on the toys and games product domain
It defines finer-grained 3-class labels for each candidate, while the hetPQA dataset contains only binary labels
Every candidate is checked with its context (surrounding sentences) to make sure the label is correct, while the hetPQA dataset does not check the context.
For questions in the test set where none of the top-5 candidates fully answers the question, we further ask annotators to actively search the product page for the answer.

The files contain the following columns:

ASIN: ASIN number of the corresponding product
question: question text
qid: question id
candidate: candidate text
qa_pair_id: question-answer id
source: original source of the candidate
context: surrounding sentences of the candidate
label: 2 means fully answering, 1 means helpful but not fully answering, 0 means irrelevant
answer: manually written natural-sounding answer given the question and information contained in the candidate (if the candidate is fully answering)
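
When comparing against the binary labels of hetPQA, the 3-class labels have to be collapsed to two classes. The sketch below shows one possible mapping; the names `Relevance` and `to_binary` and the `helpful_counts` switch are hypothetical, and whether "helpful but not fully answering" should count as positive is an assumption left to the user, not something the dataset prescribes.

```python
from enum import IntEnum


class Relevance(IntEnum):
    """The three label values described above."""
    IRRELEVANT = 0
    HELPFUL = 1          # helpful but not fully answering
    FULLY_ANSWERING = 2


def to_binary(label, helpful_counts=False):
    """Collapse a 3-class label to 0/1.

    By default only fully answering candidates are positive; set
    helpful_counts=True to treat helpful candidates as positive too.
    Raises ValueError for values outside 0, 1, 2.
    """
    threshold = Relevance.HELPFUL if helpful_counts else Relevance.FULLY_ANSWERING
    return int(Relevance(int(label)) >= threshold)


print(to_binary("2"), to_binary("1"), to_binary("1", helpful_counts=True))
```

Labels read from a CSV file arrive as strings, which is why the function converts with `int` before comparing.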

xPQA

xPQA is a large-scale annotated cross-lingual PQA dataset. It contains product questions asked in the following 12 languages across 9 branches:

Language	Branch	Script	Market
German	Germanic	Latin	Germany
Italian	Romance	Latin	Italy
French	Romance	Latin	France
Spanish	Romance	Latin	Spain
Portuguese	Romance	Latin	Brazil
Polish	Balto-Slavic	Latin	Poland
Arabic	Semitic	Arabic	Saudi Arabia
Hindi	Indo-Aryan	Devanagari	India
Tamil	Dravidian	Tamil	India
Chinese	Sinitic	Chinese	China
Japanese	Japonic	Kanji;Kana	Japan
Korean	Han	Hangul	United States

Each language contains 500 questions for train/dev and 1000 questions for testing. Every question is annotated with at least 5 relevance labels plus manually written answers. The files contain the following columns:

ASIN: ASIN number of the corresponding product
question: question text
question_en: the translated English question
qid: question id
candidate: candidate text
qa_id: question-answer id
source: original source of the candidate
context: surrounding sentences of the candidate
label: 2 means fully answering, 1 means helpful but not fully answering, 0 means irrelevant
answer: manually written natural-sounding answer given the question and information contained in the candidate (if the candidate is fully answering)

"test_answerable_corrected.csv" contains only answerable questions in the "test.csv". The translations for Tamil and German are manually corrected translations instead of machine translations.
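
To get a rough idea of which questions have an answer among their candidates, the rows can be grouped by `qid` and checked for a label of 2. This is a sketch under a stated assumption: it treats a question as answerable when at least one of its candidates is labelled 2, which may not be the exact criterion used to build "test_answerable_corrected.csv". The rows are invented and the helper name `answerable_qids` is hypothetical.

```python
# Invented rows with only the columns this sketch needs.
rows = [
    {"qid": "q1", "label": "0"},
    {"qid": "q1", "label": "2"},
    {"qid": "q2", "label": "1"},
    {"qid": "q2", "label": "0"},
    {"qid": "q3", "label": "2"},
]


def answerable_qids(rows):
    """Return qids with at least one candidate labelled 2, in first-seen order.

    Assumption: "answerable" means some candidate is fully answering.
    """
    has_full_answer = {}
    for row in rows:
        qid = row["qid"]
        fully_answering = int(row["label"]) == 2
        has_full_answer[qid] = has_full_answer.get(qid, False) or fully_answering
    return [qid for qid, ok in has_full_answer.items() if ok]


print(answerable_qids(rows))
```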

If you use these datasets, please cite our papers:

@inproceedings{shen-etal-2022-semipqa,
    title = "semi{PQA}: A Study on Product Question Answering over Semi-structured Data",
    author = "Shen, Xiaoyu  and
      Barlacchi, Gianni  and
      Del Tredici, Marco  and
      Cheng, Weiwei  and
      Gispert, Adri{\`a}",
    booktitle = "Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.ecnlp-1.14",
    doi = "10.18653/v1/2022.ecnlp-1.14",
    pages = "111--120",
}

@inproceedings{shen-etal-2022-product,
    title = "Product Answer Generation from Heterogeneous Sources: A New Benchmark and Best Practices",
    author = "Shen, Xiaoyu  and
      Barlacchi, Gianni  and
      Del Tredici, Marco  and
      Cheng, Weiwei  and
      Byrne, Bill  and
      Gispert, Adri{\`a}",
    booktitle = "Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.ecnlp-1.13",
    doi = "10.18653/v1/2022.ecnlp-1.13",
    pages = "99--110",
}

@article{shen2023xpqa,
  title={xPQA: Cross-Lingual Product Question Answering across 12 Languages},
  author={Shen, Xiaoyu and Asai, Akari and Byrne, Bill and de Gispert, Adri{\`a}},
  journal={arXiv preprint arXiv:2305.09249},
  year={2023}
}

License

This library is licensed under the CDLA 1.0 Sharing License.
