# DensePhrases Demo

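<em>DensePhrases</em> is a phrase-level English text matching (retrieval) model developed jointly by Korea University and Princeton University for the open-domain question answering and reading comprehension tasks in NLP. The project [paper](https://arxiv.org/abs/2012.12624) was accepted at ACL 2021. You can learn more about the model from its [GitHub repository](https://github.com/princeton-nlp/DensePhrases), or try it first-hand through the [Demo](http://densephrases.korea.ac.kr) trained on Wikipedia (2018.12.20) data.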

#### Last updated
**\*\*\*\*\* 2021.09.30 \*\*\*\*\***

**\*\*\*\*\* by lilingwei (lilingwei20@mails.ucas.ac.cn) \*\*\*\*\***

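## Setting up Colab
First, enable the GPU: click 'Edit' in the top toolbar, open 'Notebook Settings', and set the hardware accelerator to GPU.
Note that a single GPU session generally cannot exceed 24 hours, and the more you use the GPU, the shorter the time slice you are given each time.
When the experiment is finished, remember to switch the setting back to None before closing the notebook.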

In [65]:
# Check your GPU here
%tensorflow_version 2.x
import tensorflow as tf
if __name__ == '__main__':
  print(tf.test.gpu_device_name())
  with tf.device('/device:GPU:0'):
    print(tf.test.gpu_device_name())

/device:GPU:0
/device:GPU:0
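If TensorFlow is not available in your runtime, a rough alternative is to look for the NVIDIA driver tool. The sketch below uses only the standard library. The helper name `gpu_summary` is hypothetical, and it assumes that `nvidia-smi` is on the PATH whenever a GPU runtime is attached, which is usual on Colab but not guaranteed.

```python
import shutil
import subprocess


def gpu_summary():
    """Return the `nvidia-smi -L` listing, or None if no GPU tool responds."""
    # Assumption: a GPU runtime exposes the `nvidia-smi` executable.
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    summary = gpu_summary()
    print(summary if summary else "No GPU detected; check the Notebook Settings.")
```

A `None` result only means the tool was not found or failed; it is a hint to recheck the hardware setting rather than proof that no GPU exists.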


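Next, mount Google Drive here so that the experiment files are saved. You need a Google account with its associated Google Drive set up in advance.
To mount it, run the code below, click the URL that is displayed, sign in to your Google account, paste the verification code into the input box that appears, and press Enter.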

In [2]:
# Mount your google drive here
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive
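Before installing anything into Drive, it can help to confirm that the mount succeeded. This is a minimal sketch with a hypothetical helper, `drive_is_mounted`; the default path is the `MyDrive` folder that the later cells change into.

```python
import os


def drive_is_mounted(root="/content/gdrive/MyDrive"):
    """Return True if the Drive folder exists and is writable."""
    return os.path.isdir(root) and os.access(root, os.W_OK)


if __name__ == "__main__":
    if drive_is_mounted():
        print("Google Drive is mounted.")
    else:
        print("Google Drive is not mounted; rerun drive.mount().")
```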


Create the DensePhrases project and install the required packages.

In [2]:
# Install apex
%cd /content/gdrive/MyDrive/
!wget https://github.com/NVIDIA/apex/archive/refs/heads/master.zip
!unzip master.zip
%cd apex-master/
!python setup.py install
%cd ..

/content/gdrive/MyDrive
--2021-10-08 11:05:21--  https://github.com/NVIDIA/apex/archive/refs/heads/master.zip
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/NVIDIA/apex/zip/refs/heads/master [following]
--2021-10-08 11:05:21--  https://codeload.github.com/NVIDIA/apex/zip/refs/heads/master
Resolving codeload.github.com (codeload.github.com)... 140.82.121.10
Connecting to codeload.github.com (codeload.github.com)|140.82.121.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘master.zip.1’

master.zip.1            [ <=>                ] 861.65K  --.-KB/s    in 0.1s    

2021-10-08 11:05:21 (6.82 MB/s) - ‘master.zip.1’ saved [882329]

Archive:  master.zip
3ad9db2adb968c67dd509c5408eeca0884f6ab3f
   creating: apex-master/
  inflating: apex-master/.gitignore  
  infla

In [2]:
# Install the DensePhrases fork used for the UCAS course
!wget https://github.com/blackli7/DensePhrases/archive/refs/heads/main.zip
!unzip main.zip
# Install other toolkits
%cd DensePhrases-main
!pip install -r requirements.txt
!python setup.py develop

--2021-10-08 11:20:54--  https://github.com/blackli7/DensePhrases/archive/refs/heads/main.zip
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/blackli7/DensePhrases/zip/refs/heads/main [following]
--2021-10-08 11:20:54--  https://codeload.github.com/blackli7/DensePhrases/zip/refs/heads/main
Resolving codeload.github.com (codeload.github.com)... 140.82.121.9
Connecting to codeload.github.com (codeload.github.com)|140.82.121.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘main.zip’

main.zip                [        <=>         ]  43.62M  5.62MB/s    in 7.8s    

2021-10-08 11:21:02 (5.62 MB/s) - ‘main.zip’ saved [45745017]

Archive:  main.zip
a5a8b2c364306d94155839346671affb387bc4a1
   creating: DensePhrases-main/
  inflating: DensePhrases-main/.DS_Store  
  infl

Set the relevant environment variables.

In [4]:
# Running config.sh will set the following three environment variables:
# DATA_DIR: for datasets
# SAVE_DIR: for pre-trained models or index; new models and index will also be saved here
# CACHE_DIR: for cache files from huggingface transformers
!source config.sh


Environment variables are set as follows:
DATA_DIR=.//densephrases-data
SAVE_DIR=.//outputs
CACHE_DIR=.//cache
Add to ~/.bashrc (recommended)? [yes/no]: yes
--2021-10-08 11:06:39--  https://nlp.cs.princeton.edu/projects/densephrases/models/densephrases-multi.tar.gz
Resolving nlp.cs.princeton.edu (nlp.cs.princeton.edu)... 128.112.136.61
Connecting to nlp.cs.princeton.edu (nlp.cs.princeton.edu)|128.112.136.61|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1114266598 (1.0G) [application/x-gzip]
Saving to: ‘.//outputs/densephrases-multi.tar.gz’


2021-10-08 11:07:11 (33.6 MB/s) - ‘.//outputs/densephrases-multi.tar.gz’ saved [1114266598/1114266598]

densephrases-multi/special_tokens_map.json
densephrases-multi/vocab.txt
densephrases-multi/config.json
densephrases-multi/tokenizer_config.json
densephrases-multi/training_args.bin
densephrases-multi/pytorch_model.bin


In [5]:
import os
os.environ['DATA_DIR']='./densephrases-data'
os.environ['CACHE_DIR']='./cache'
os.environ['SAVE_DIR']='./outputs'
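Each `!` command runs in its own subshell, so variables exported by `source config.sh` are not visible to later cells; this is presumably why they are set again through `os.environ`. The sketch below does the same thing slightly more defensively: it keeps any value that is already set and makes sure each directory exists. The helper name `ensure_env_dirs` is hypothetical, and creating the directories up front is an assumption, not something the project requires.

```python
import os

# Paths mirror the values printed by config.sh.
DEFAULTS = {
    "DATA_DIR": "./densephrases-data",
    "SAVE_DIR": "./outputs",
    "CACHE_DIR": "./cache",
}


def ensure_env_dirs(defaults=DEFAULTS):
    """Set each variable if unset, create its directory, and return the paths."""
    paths = {}
    for name, default in defaults.items():
        # setdefault keeps an existing value and returns whichever one is in effect.
        path = os.environ.setdefault(name, default)
        os.makedirs(path, exist_ok=True)
        paths[name] = path
    return paths


if __name__ == "__main__":
    for name, path in ensure_env_dirs().items():
        print(f"{name}={path}")
```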

Verify that everything was installed correctly.

In [6]:
# Check downloads
!pip list

Package                       Version        Location
----------------------------- -------------- ------------------------------------------
absl-py                       0.12.0
alabaster                     0.7.12
albumentations                0.1.12
altair                        4.1.0
apex                          0.1
appdirs                       1.4.4
argcomplete                   1.12.3
argon2-cffi                   21.1.0
arviz                         0.11.4
asgiref                       3.4.1
astor                         0.8.1
astropy                       4.3.1
astunparse                    1.6.3
atari-py                      0.2.9
atomicwrites                  1.4.0
attrs                         21.2.0
audioread                     2.1.9
autograd                      1.3
Babel                         2.9.1
backcall                      0.2.0
beautifulsoup4                4.6.3
bleach                        4.1.0
blis                          0.4.1
blosc                      
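The full `pip list` output is long, so a targeted check can be easier to read. The sketch below looks up a few packages by name using `importlib.metadata`, which requires Python 3.8 or later. The helper name `installed_versions` is hypothetical, and the choice of `apex`, `torch`, and `transformers` is an assumption about which packages matter most for this notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumption: these are the packages worth checking for this notebook.
    for name, version in installed_versions(["apex", "torch", "transformers"]).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```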

## Training the Demo
Run the following command to build the demo model.

In [None]:
# generate phrase vectors
# build phrase index
# evaluate phrase retrieval
# (retry if something goes wrong.)
!make step1

## Testing the Demo
Test the demo model by typing a question at the command line.

In [8]:
# evaluate phrase retrieval with an input question
# prints the answer and writes the details to 'sample/step1_question_test_out.json'

!make step1_test

python step1_test_with_question.py \
	--dump_dir ./outputs/densephrases-physics/dump \
    	--index_dir start/32_flat_OPQ96 \
    	--query_encoder_path ./outputs/densephrases-multi
10/08/2021 11:10:30 - INFO - densephrases.utils.file_utils -   PyTorch version 1.7.1 available.
10/08/2021 11:10:30 - INFO - densephrases.utils.file_utils -   TensorFlow version 2.6.0 available.
just input what you want to ask (relevant to the sample articles) : 
what is polarization?
Query encoder will be loaded from ./outputs/densephrases-multi
10/08/2021 11:10:44 - INFO - transformers.configuration_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/SpanBERT/spanbert-base-cased/config.json from cache at /root/.cache/torch/transformers/f657841af52c33ad17850f7918b10fbfc48e447d49d35bee4081df30d7b54545.e736d0f2e9459c34485cfbdd4c15e2b18d74c1c4359ee99166a419eaaab2994b
10/08/2021 11:10:44 - INFO - transformers.configuration_utils -   Model config BertConfig {
  "attention_pro

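Going further, you can wrap the model for interactive input and output by running the web demo in the `web_demo_django` folder, or one that you write yourself.

Copy the trained model to your local machine before running Django; see the project README for details.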
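The comment in the test cell says that `make step1_test` writes its details to `sample/step1_question_test_out.json`. The sketch below loads that file and prints it. The helper name `load_details` is hypothetical, and because the file's schema is not described here, the code makes no assumption about its fields and simply pretty-prints whatever JSON it finds.

```python
import json
import os


def load_details(path="sample/step1_question_test_out.json"):
    """Load the details file, or return None if it does not exist yet."""
    if not os.path.isfile(path):
        return None
    with open(path, encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    details = load_details()
    if details is None:
        print("No details file found; run `make step1_test` first.")
    else:
        # The schema is not documented here, so print the parsed JSON as-is,
        # truncated to keep the cell output short.
        print(json.dumps(details, indent=2, ensure_ascii=False)[:2000])
```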

## Reference
Please cite the paper if you use DensePhrases in your work:
```bibtex
@inproceedings{lee2021learning,
   title={Learning Dense Representations of Phrases at Scale},
   author={Lee, Jinhyuk and Sung, Mujeen and Kang, Jaewoo and Chen, Danqi},
   booktitle={Association for Computational Linguistics (ACL)},
   year={2021}
}
```

## License
Please see LICENSE for details.

[demo]: http://densephrases.korea.ac.kr