WatsonQASystem

Building (a part of) IBM’s Watson Question Answering (QA) system

CSC 583 Final Project (Spring 2024)
Authors: Junfeng Xu and Chia-Lin Ko

Motivation

IBM’s Watson is a Question Answering (QA) system that “can compete at the human champion level in real time on the TV quiz show, Jeopardy.” This, as we will see in class, is a complex undertaking. However, the answers to many of the Jeopardy questions are actually titles of Wikipedia pages. For example, the answer to the clue “This woman who won consecutive heptathlons at the Olympics went to UCLA on a basketball scholarship” is “Jackie Joyner-Kersee”, who has a Wikipedia page with the same title: http://en.wikipedia.org/wiki/Jackie_Joyner-Kersee. In these situations, the task reduces to the classification of Wikipedia pages, that is, finding which page is the most likely answer to the given clue. This is the focus of this project.

What does the code do?

Indexing and Retrieval
Measuring Performance
Error Analysis
Improved Implementation

How to run the code

Clone the git repository https://github.com/astrochialinko/WatsonQASystem
Download all 4 index files from https://drive.google.com/drive/folders/1G-6E7y7_5KKqEu-CcnfOB4f7YBQtIDkK?usp=sharing and unzip them into the root of the cloned repository.
Change into the repository directory. There are 5 tests you can run:

TestWastonStd: runs the queries against the index built with stop-word removal, a lowercase filter, and the standard tokenizer.
TestWastonLemma: runs the queries against the index built with stop-word removal, a lowercase filter, the standard tokenizer, and lemmatization with OpenNLPLemmatizerFilter.
TestWastonWiki: runs the queries against the index built with stop-word removal, a lowercase filter, and the Wikipedia tokenizer.
TestWastonStem: runs the queries against the index built with stop-word removal, a lowercase filter, the standard tokenizer, and Porter stemming.
TestWastonStemChat: runs the queries against the index built with stop-word removal, a lowercase filter, the standard tokenizer, and Porter stemming, then reranks the top 10/100 results with ChatGPT.
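To make the shared analysis steps concrete, here is a minimal plain-Java sketch of what tokenization, lowercasing, and stop-word removal do to a piece of text. It deliberately does not use Lucene: the class name `SimpleAnalysisSketch`, the tiny stop-word set, and the split-on-non-alphanumerics rule are assumptions made for illustration, and the project's actual analyzers may tokenize differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a plain-Java stand-in for a tokenizer + lowercase filter
// + stop-word filter chain. This is NOT the project's Lucene analyzer.
final class SimpleAnalysisSketch {

    // Hypothetical, deliberately tiny stop-word list (assumption).
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "at", "in", "is", "it", "of", "on", "the", "to");

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on any run of characters that are neither letters nor digits.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            // Skip empty fragments (leading delimiters) and stop words.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Washington Post"));
    }
}
```

Stemming or lemmatization would be an extra step applied to each surviving token, which is where the Stem and Lemma configurations differ from the Std one.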

To run one of the tests above, issue $ mvn -Dtest=<TestName> test. Note: to run the TestWastonStemChat test, first open QueryEngine.java and set the apiKey field to your ChatGPT secret key.
The output shows the performance results, including Precision at 1, Mean Reciprocal Rank, etc., for each of the 5 similarity formulas below:

BM25Similarity
BooleanSimilarity
ClassicSimilarity
LMDirichletSimilarity
LMJelinekMercerSimilarity
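Precision at 1 and Mean Reciprocal Rank can both be computed from the 1-based rank at which the correct page appears for each question. The sketch below is an illustration under stated assumptions rather than the project's own code: the class `RankMetricsSketch`, its method names, and the convention that rank 0 means "correct page not retrieved" are all hypothetical.

```java
import java.util.List;

// Illustrative metric helpers; names and the rank-0 convention are assumptions.
final class RankMetricsSketch {

    // Fraction of questions whose correct page is ranked first.
    static double precisionAt1(List<Integer> ranks) {
        if (ranks.isEmpty()) {
            return 0.0;
        }
        long hits = ranks.stream().filter(r -> r == 1).count();
        return (double) hits / ranks.size();
    }

    // Mean of 1/rank over all questions; a rank of 0 (not retrieved) contributes 0.
    static double meanReciprocalRank(List<Integer> ranks) {
        if (ranks.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (int rank : ranks) {
            if (rank > 0) {
                sum += 1.0 / rank;
            }
        }
        return sum / ranks.size();
    }

    public static void main(String[] args) {
        // Four questions: answered at rank 1, rank 2, not found, rank 4.
        List<Integer> ranks = List.of(1, 2, 0, 4);
        System.out.println("P@1 = " + precisionAt1(ranks));
        System.out.println("MRR = " + meanReciprocalRank(ranks));
    }
}
```

Precision at 1 rewards only exact top hits, while MRR also gives partial credit when the correct page lands near the top, so the two can rank the similarity formulas differently.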

Dataset

100 questions from previous Jeopardy games, whose answers appear as Wikipedia pages. The questions are listed in a single file, with 4 lines per question, in the following format: CATEGORY CLUE ANSWER NEWLINE. For example:
```
NEWSPAPERS
The dominant paper in our nation’s capital, it’s among the top 10 U.S. papers in circulation
The Washington Post
```
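The four-line question format can be read with a small parser. The following is a hedged sketch, not the code in QueryEngine.java: the class `QuestionParserSketch`, the `Question` holder, and the choice to silently skip blocks that do not have exactly three non-blank lines are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative parser for CATEGORY / CLUE / ANSWER blocks separated by blank lines.
final class QuestionParserSketch {

    // Hypothetical holder for one question block.
    static final class Question {
        final String category;
        final String clue;
        final String answer;

        Question(String category, String clue, String answer) {
            this.category = category;
            this.clue = clue;
            this.answer = answer;
        }
    }

    static List<Question> parse(List<String> lines) {
        List<Question> questions = new ArrayList<>();
        List<String> block = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                addIfComplete(block, questions); // a blank line closes a block
                block.clear();
            } else {
                block.add(line.trim());
            }
        }
        addIfComplete(block, questions); // the last block may lack a trailing blank line
        return questions;
    }

    // Assumption: blocks without exactly three lines are skipped, not reported.
    private static void addIfComplete(List<String> block, List<Question> out) {
        if (block.size() == 3) {
            out.add(new Question(block.get(0), block.get(1), block.get(2)));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java QuestionParserSketch <questions-file>");
            return;
        }
        List<Question> questions = parse(Files.readAllLines(Path.of(args[0])));
        System.out.println("Parsed " + questions.size() + " questions");
    }
}
```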
A collection of approximately 280,000 Wikipedia pages, which include the correct answers for the above 100 questions. The pages are stored in 80 files (thus each file contains several thousand pages). Each page starts with its title, encased in double square brackets. For example, BBC’s page starts with “[[BBC]]”.
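Because every page begins with its title in double square brackets, a collection file can be split into (title, body) pairs by watching for title lines. This is a minimal sketch under assumptions, not the logic in BuildIndex.java: the class `WikiPageSplitterSketch` is hypothetical, and it assumes a title occupies a whole line of the form `[[Title]]`, so a body line that consists solely of a bracketed link would be mistaken for a new page.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative splitter: groups the lines of one collection file by page title.
final class WikiPageSplitterSketch {

    // Assumption: a title line is exactly "[[...]]" with no inner brackets.
    private static final Pattern TITLE_LINE =
            Pattern.compile("\\[\\[([^\\[\\]]+)\\]\\]");

    static Map<String, String> splitPages(List<String> lines) {
        Map<String, String> pages = new LinkedHashMap<>(); // keeps file order
        String currentTitle = null;
        StringBuilder body = new StringBuilder();
        for (String line : lines) {
            Matcher m = TITLE_LINE.matcher(line.trim());
            if (m.matches()) {
                // A new title closes the previous page, if any.
                if (currentTitle != null) {
                    pages.put(currentTitle, body.toString().trim());
                }
                currentTitle = m.group(1);
                body.setLength(0);
            } else if (currentTitle != null) {
                body.append(line).append('\n');
            }
        }
        if (currentTitle != null) {
            pages.put(currentTitle, body.toString().trim());
        }
        return pages;
    }

    public static void main(String[] args) {
        Map<String, String> pages =
                splitPages(List.of("[[BBC]]", "The BBC is a broadcaster."));
        System.out.println(pages.keySet());
    }
}
```

Each resulting pair would then correspond to one indexed document, with the title being the candidate answer returned for a clue.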

File structures

WatsonQASystem
├── README.md
├── en-lemmatizer.dict.txt
├── index-file-lemma/
├── index-file-std/
├── index-file-stem/
├── index-file-wiki/
├── pom.xml
├── notebooks
│   └── plot_analysis.ipynb
├── src
│   ├── main/java/edu/arizona/cs
│   │   │               ├── BuildIndex.java
│   │   │               ├── LemmaAnalyzer.java
│   │   │               ├── MainWatson.java
│   │   │               ├── QueryEngine.java
│   │   │               ├── ResultClass.java
│   │   │               └── WikipediaAnalyzer.java
│   │   └── resources
│   │       ├── questions.txt
│   │       └── wiki-folder
│   └── test/java/edu/arizona/cs
│                       ├── TestWastonLemma.java
│                       ├── TestWastonStd.java
│                       ├── TestWastonStem.java
│                       ├── TestWastonWiki.java
│                       └── TestWastonStemChat.java
└── target/

References

https://lucene.apache.org/

https://platform.openai.com/docs/api-reference

Acknowledgements

We would like to acknowledge Prof. Mihai Surdeanu and TA Haris Riaz for their guidance and support throughout the semester.
