Connected to E2E pipeline (#85)

* added RetrieveSentences.py

* removed index

* connect to E2E pipeline

* updated code

* Some modification

* clean up

* config

* documentation

* added documentation

* update documentation

* update doc

* update doc

* add js

* updated documentation

* changed doc

* changed requirements.txt
meng-f authored and rosequ committed Nov 25, 2017
1 parent 61c8c0e commit ad69d3abc24e22608c14c332ec81e07bbe99a124
@@ -1,31 +1,56 @@
## Retrieve Sentences
## Setup Retrieve Sentences and end2end QA pipeline

#### 1. Clone [Anserini](https://github.com/castorini/Anserini.git) and [Castor](https://github.com/castorini/Castor.git)
#### 1. Clone [Anserini](https://github.com/castorini/Anserini.git), [Castor](https://github.com/castorini/Castor.git), [data](https://github.com/castorini/data.git), and [models](https://github.com/castorini/models.git):
```bash
git clone https://github.com/castorini/Anserini.git
git clone https://github.com/castorini/Castor.git
git clone https://github.com/castorini/data.git
git clone https://github.com/castorini/models.git
```

Your directory structure should look like
```
.
├── Anserini
└── Castor
├── Castor
├── data
└── models
```

#### 2. Compile Anserini

```bash
cd Anserini
mvn package
cd ..
```

This creates `anserini-0.0.1-SNAPSHOT.jar` in `Anserini/target`.

We highly recommend using [virtualenv](https://virtualenv.pypa.io/en/stable/), as the dependencies
are subject to frequent change.

Install the dependency packages:

```
cd Castor
pip3 install -r requirements.txt
```
Make sure that you have PyTorch installed. For more help, follow [these](https://github.com/castorini/Castor) steps.

#### 3. Download Dependencies
- Download the TrecQA lucene index
- Download the Google word2vec file from [here](https://drive.google.com/drive/folders/0B2u_nClt6NbzNWJkWExmaklYNTA?usp=sharing)

#### 4. Run the following command
#### 4. Additional files for pipeline:
As some of the files are too large to be uploaded onto GitHub, please download the following files from
[here](https://drive.google.com/drive/folders/0B2u_nClt6NbzNm1LdjlwUFdzQVE?usp=sharing) and place them
in the appropriate locations:

- Copy the contents of the `word2vec` directory to `data/word2vec`
- Copy `word2dfs.p` to `data/TrecQA/`

### To run RetrieveSentences:

```bash
python ./anserini_dependency/RetrieveSentences.py
@@ -44,3 +69,60 @@ Possible parameters are:
| `-k` | [1, inf) | 1 | top-k passages to be retrieved |

Note: Either a query or a topic must be passed in as an argument; they can't be both empty.
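
The rule in the note above can be checked before the Java retriever is ever invoked. The following is a minimal sketch, not the code in `RetrieveSentences.py`: the helper names `build_parser` and `check_query_or_topics` are hypothetical, and only a subset of the flags is reproduced.

```python
import argparse


def build_parser():
    """Build a parser with a subset of the RetrieveSentences flags."""
    parser = argparse.ArgumentParser(description="Retrieve Sentences")
    parser.add_argument("-index", help="Lucene index", required=True)
    parser.add_argument("-topics", help="topics file", default="")
    parser.add_argument("-query", help="a single query", default="")
    parser.add_argument("-k", help="top-k passages to be retrieved", type=int, default=1)
    return parser


def check_query_or_topics(args):
    """Hypothetical helper: reject argument sets with neither a query nor a topics file."""
    if not args.query and not args.topics:
        raise ValueError("Either -query or -topics must be provided")
    return args


if __name__ == "__main__":
    parsed = build_parser().parse_args(["-index", "path/to/index", "-query", "who is Einstein"])
    print(check_query_or_topics(parsed).query)
```

Failing early in Python gives a clearer error message than letting an empty query reach the retriever.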


__NB:__ The speech UI cannot be run in Ubuntu. To test the pipeline in Ubuntu, make the following changes:
- Comment out the JavaScript part and run the Bash script
- Make a REST API query to the endpoint using Postman, curl, etc.

### To set up the demo

#### 1. Installing libraries for demo

```sh
cd anserini_dependency/js
npm install
cd ../..
```

#### 2. Flask

- Flask is used as the server for the API
- Copy `config.cfg.example` to `config.cfg` and make necessary changes, such as setting the index path and API keys.
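
The server code added in this commit reads `index`, `host`, and `port` from a `Flask` section and `witai_api_secret` from a `Frontend` section. The sketch below shows what such a file might look like and how it parses. All values are placeholders, the `load_config` helper is hypothetical, and the actual `config.cfg.example` may contain other keys.

```python
import configparser

# Placeholder values only; replace them with your own paths and keys.
EXAMPLE_CFG = """
[Flask]
host = 0.0.0.0
port = 5000
index = /path/to/lucene-index

[Frontend]
witai_api_secret = YOUR_WIT_AI_SECRET
"""


def load_config(text):
    """Parse config text into a plain dict of sections (DEFAULT is skipped)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {name: dict(parser.items(name)) for name in parser.sections()}


if __name__ == "__main__":
    settings = load_config(EXAMPLE_CFG)
    # configparser returns every value as a string, so cast the port explicitly.
    print(settings["Flask"]["host"], int(settings["Flask"]["port"]))
```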


#### 3. Run the Demo

```sh
./run_ui.sh
```

### Additional Notes
- The API call below sends a question to the model and returns the predicted answers.
- The request body fields are `question` (required), `num_hits` (optional), and `k` (optional). If omitted, the server defaults to `num_hits` = 30 and `k` = 20.
```
# REQUEST:
HTTP Method: POST
Endpoint: [host]:[port]/answer
Content-Type: application/json
text of body in raw format:
{
  "question": "What is the birthdate of Einstein?",
  "num_hits": 50,
  "k": 30
}
```

- The response body contains `answers`, a list of objects with two fields: `passage` and `score`.
```
# RESPONSE:
Content-Type: application/json
text of body in raw format:
{
  "answers": [
    {"passage": "Einstein was born in the 1800s", "score": 0.976},
    {"passage": "Einstein was a physicist", "score": 0.524}
  ]
}
```
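
As an alternative to Postman or curl, the request described above can be built with the Python standard library. This is a minimal sketch: the helper names `build_answer_request` and `top_passages` are hypothetical, and the host and port are placeholders that must match your `config.cfg`.

```python
import json
import urllib.request


def build_answer_request(base_url, question, num_hits=None, k=None):
    """Build a POST request for the /answer endpoint; optional fields are omitted when None."""
    body = {"question": question}
    if num_hits is not None:
        body["num_hits"] = num_hits
    if k is not None:
        body["k"] = k
    return urllib.request.Request(
        base_url.rstrip("/") + "/answer",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def top_passages(response_text):
    """Return (passage, score) pairs, or an empty list if the server sent an error body."""
    payload = json.loads(response_text)
    return [(a["passage"], a["score"]) for a in payload.get("answers", [])]


if __name__ == "__main__":
    # Placeholder address; use the host and port from your config.cfg.
    req = build_answer_request("http://localhost:5000", "What is the birthdate of Einstein?", 50, 30)
    print(req.full_url)
    # Uncomment once the server is running:
    # with urllib.request.urlopen(req) as resp:
    #     print(top_passages(resp.read().decode("utf-8")))
```

When the server fails to answer, it returns an `error` field instead of `answers`, which is why `top_passages` falls back to an empty list.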
@@ -5,7 +5,7 @@
from jnius import autoclass


class CallRetrieveSentences:
class RetrieveSentences:
    """Python class built to call RetrieveSentences
    Attributes
    ----------
@@ -27,28 +27,51 @@ def __init__(self, args):
        """
        RetrieveSentences = autoclass("io.anserini.qa.RetrieveSentences")
        Args = autoclass("io.anserini.qa.RetrieveSentences$Args")
        String = autoclass("java.lang.String")
        self.String = autoclass("java.lang.String")

        self.args = Args()
        index = String(args.index)
        index = self.String(args.index)
        self.args.index = index
        embeddings = String(args.embeddings)
        embeddings = self.String(args.embeddings)
        self.args.embeddings = embeddings
        topics = String(args.topics)
        topics = self.String(args.topics)
        self.args.topics = topics
        query = String(args.query)
        query = self.String(args.query)
        self.args.query = query
        self.args.hits = int(args.hits)
        scorer = String(args.scorer)
        scorer = self.String(args.scorer)
        self.args.scorer = scorer
        self.args.k = int(args.k)
        self.rs = RetrieveSentences(self.args)

    def getRankedPassages(self):
    def getRankedPassages(self, query, index, hits, k):
        """
        Call RetrieveSentneces.getRankedPassages
        Calls RetrieveSentences.getRankedPassages
        Parameters
        ----------
        query : str
            The query to be searched in the index
        index: str
            The index
        hits: str
            The number of document IDs to be returned
        k: str
            The number of passages to be returned
        """

        scorer = self.rs.getRankedPassagesList(query, index, int(hits), int(k))
        candidate_passages_scores = []
        for i in range(0, scorer.size()):
            candidate_passages_scores.append(scorer.get(i))

        return candidate_passages_scores

    def getTermIdfJSON(self):
        """
        Calls RetrieveSentences.getTermIdfJSON
        """
        self.rs.getRankedPassages(self.args)
        return self.rs.getTermIdfJSON()

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Retrieve Sentences')
@@ -61,9 +84,7 @@ def getRankedPassages(self):
    parser.add_argument("-k", help="top-k passages to be retrieved", default=1)

    args_raw = parser.parse_args()
    rs = CallRetrieveSentences(args_raw)
    rs.getRankedPassages()



    rs = RetrieveSentences(args_raw)
    sc = rs.getRankedPassages(args_raw.query, args_raw.index, args_raw.hits, args_raw.k)

@@ -0,0 +1,122 @@
import argparse
import configparser
import os
import sys

from flask import Flask, jsonify, request
# FIXME: separate this out to a classifier class where we can switch out the models

from RetrieveSentences import RetrieveSentences
from sm_cnn.bridge import SMModelBridge

app = Flask(__name__)
rs = None

@app.route("/", methods=['GET'])
def hello():
    return "Hello! The server is working properly... :)"

@app.route('/answer', methods=['POST'])
def answer():
    try:
        req = request.get_json(force=True)
        question = req["question"]
        num_hits = req.get('num_hits', 30)
        k = req.get('k', 20)
        print("Question: {}".format(question))
        # FIXME: get the answer from the PyTorch model here
        answers = get_answers(question, num_hits, k)
        answer_dict = {"answers": answers}
        return jsonify(answer_dict)
    except Exception as e:
        print(e)
        error_dict = {"error": "ERROR - could not parse the question or get answer. "}
        return jsonify(error_dict)

@app.route('/wit_ai_config', methods=['GET'])
def wit_ai_config():
    return jsonify({'WITAI_API_SECRET': app.config['Frontend']['witai_api_secret']})

# FIXME: separate this out to a classifier class where we can switch out the models
def get_answers(question, num_hits, k):

    parser = argparse.ArgumentParser(description='Retrieve Sentences')
    parser.add_argument("-index", help="Lucene index", required=True)
    parser.add_argument("-embeddings", help="Path of the word2vec index", default="")
    parser.add_argument("-topics", help="topics file", default="")
    parser.add_argument("-query", help="a single query", default="")
    parser.add_argument("-hits", help="max number of hits to return", default=100)
    parser.add_argument("-scorer", help="passage scores", default="Idf")
    parser.add_argument("-k", help="top-k passages to be retrieved", default=1)
    args_raw = parser.parse_args(["-query", question, "-hits", str(num_hits), "-scorer",
                                  "Idf", "-k", str(k), "-index", app.config['Flask']['index']])

    # Create the retriever once and reuse it across requests
    global rs
    if rs is None:
        rs = RetrieveSentences(args_raw)
    candidate_passages_scores = rs.getRankedPassages(question, app.config['Flask']['index'], num_hits, k)

    candidate_sent_scores = []
    candidate_passages_sm = []

    # Each result is a tab-separated "passage<TAB>score" string
    for ps in candidate_passages_scores:
        ps_split = ps.split('\t')
        candidate_passages_sm.append(ps_split[0])
        candidate_sent_scores.append((float(ps_split[1]), ps_split[0]))

    if app.config['Flask']['model'] == "sm":
        path_to_castorini = os.getcwd() + "/.."
        model = SMModelBridge(path_to_castorini + '/models/sm_model/sm_model.fixed_ext_feats_paper.puncts_stay',
                              path_to_castorini + '/data/word2vec/aquaint+wiki.txt.gz.ndim=50.cache',
                              app.config['Flask']['index'])

        idf_json = rs.getTermIdfJSON()
        flags = {
            "punctuation": "",  # ignoring for now you can {keep|remove} punctuation
            "dash_words": ""  # ignoring for now. you can {keep|split} words-with-hyphens
        }
        answers_list = model.rerank_candidate_answers(question, candidate_passages_sm, idf_json, flags)
        sorted_answers = sorted(answers_list, key=lambda x: x[0], reverse=True)
    else:
        # the re-ranking model chosen is idf
        sorted_answers = list(candidate_sent_scores)

        print("in idf:{}".format(sorted_answers))
    answers = []
    for score, sent in sorted_answers:
        answers.append({'passage': sent, 'score': score})

    return answers

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Start the Flask API at the specified host, port')
    parser.add_argument('--config', help='config to use', required=False, type=str, default='config.cfg')
    parser.add_argument("--debug", help="print debug info", action="store_true")
    parser.add_argument("--model", help="[idf|sm]", default="idf")
    args = parser.parse_args()

    if not os.path.isfile(args.config):
        print("The configuration file ({}) does not exist!".format(args.config))
        sys.exit(1)

    config = configparser.ConfigParser()
    config.read(args.config)

    for name, section in config.items():
        if name == 'DEFAULT':
            continue

        app.config[name] = {}
        for key, value in config.items(name):
            app.config[name][key] = value

    app.config['Flask']['model'] = args.model

    print("Config: {}".format(args.config))
    print("Index: {}".format(app.config['Flask']['index']))
    print("Host: {}".format(app.config['Flask']['host']))
    print("Port: {}".format(app.config['Flask']['port']))
    print("Re-ranking Model: {}".format(app.config['Flask']['model']))
    print("Debug info: {}".format(args.debug))

    app.run(debug=args.debug, host=app.config['Flask']['host'], port=int(app.config['Flask']['port']))
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,71 @@
<html>
<head>
<title>Anserini Speech Demo</title>
<script>if (typeof module === 'object') {window.module = module; module = undefined;}</script>
<script src="https://code.jquery.com/jquery-3.1.1.min.js" integrity="sha256-hVVnYaiADRTO2PzUGmuLJr8BLUSjGIZsDYGmIJLv2b8=" crossorigin="anonymous"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script>
<script src="https://use.fontawesome.com/b2b2989db9.js"></script>
<script src="recorder.js"></script>
<script src="speech.js"></script>
<script>if (window.module) module = window.module;</script>
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous">
<style>
html {
overflow: hidden;
}
body {
display: flex;
flex-direction: column;
margin: 10px 0;
height: 100%;
}
#analyser {
background: #f9f9f9;
flex: 0 1 auto;
max-height: 70px;
}
#viz {
flex: 1 1 auto;
display: flex;
flex-direction: column;
align-items: center;
}
#controls {
flex: 0 1 50px;
margin: 5px 5px 10px 5px;
}
#controls > span {
display: table;
margin: 0 auto;
}
#record.recording {
text-shadow: 0px 0px 10px red;
}
#question {
font-weight: bold;
font-size: larger;
flex: 1 0 auto;
margin: 5px;
text-align: center;
}
#answer {
flex: 1 1 auto;
overflow-y: auto;
margin: 5px;
}
</style>
</head>
<body>
<canvas id="analyser"></canvas>
<div id="viz">
<p class="lead" id="question">What can I help you with?</p>
<p id="answer"></p>
</div>
<div id="controls">
<span id="record" class="fa-stack fa-lg fa-2x" onclick="toggleRecording(this);">
<i class="fa fa-circle fa-stack-2x"></i>
<i class="fa fa-microphone fa-stack-1x fa-inverse"></i>
</span>
</div>
</body>
</html>