# Data preparation/downloading/processing

### First, we create the root collection directory and point the environment variable `COLLECT_ROOT` to it

In [1]:
!mkdir -p ~/Desktop/flexneuart_collections

In [2]:
%env COLLECT_ROOT=/Users/yangchen/Desktop/flexneuart_collections

env: COLLECT_ROOT=/Users/yangchen/Desktop/flexneuart_collections


In [3]:
!bash -c "echo $COLLECT_ROOT"

/Users/yangchen/Desktop/flexneuart_collections
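
Later cells rely on `COLLECT_ROOT` being set, so it can be worth checking it from Python before continuing. The following is a minimal sketch that uses only the standard library; the helper name `resolve_collect_root` is hypothetical and is not part of FlexNeuART.

```python
import os
from pathlib import Path


def resolve_collect_root(env=None):
    """Return the collection root as an absolute Path, creating it if needed.

    Illustrative helper only (not a FlexNeuART function).
    """
    env = os.environ if env is None else env
    root = env.get("COLLECT_ROOT")
    if not root:
        raise RuntimeError("COLLECT_ROOT is not set; run the %env cell first")
    # Expand a leading ~ in case the variable holds a home-relative path.
    path = Path(root).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Calling `resolve_collect_root()` without arguments reads the variable from the notebook's environment.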


## Downloading preprocessed data

This notebook works with a sub-sample of the Natural Questions collection (__Wikipedia DPR__) prepared by [Karpukhin et al.](https://github.com/facebookresearch/DPR). The subset includes all the questions from __Wikipedia DPR__, but only a sample of the passages (about one million).

The generation of this subset is briefly described below, but for your convenience we provide archives with already processed data.

Change the directory, then download and unpack the data:

In [4]:
cd ~/Desktop/flexneuart_collections

/Users/yangchen/Desktop/flexneuart_collections



In [5]:
!wget boytsov.info/datasets/wikipedia_dpr_nq_sample_conf_2023-01-17.tar.bz2

--2024-04-05 13:38:24--  http://boytsov.info/datasets/wikipedia_dpr_nq_sample_conf_2023-01-17.tar.bz2
Resolving boytsov.info (boytsov.info)... 69.60.127.165
Connecting to boytsov.info (boytsov.info)|69.60.127.165|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2692 (2.6K) [application/x-bzip2]
Saving to: ‘wikipedia_dpr_nq_sample_conf_2023-01-17.tar.bz2’


2024-04-05 13:38:24 (52.4 MB/s) - ‘wikipedia_dpr_nq_sample_conf_2023-01-17.tar.bz2’ saved [2692/2692]



In [6]:
!wget boytsov.info/datasets/wikipedia_dpr_nq_sample_models_2021-09-15.tar.bz2

--2024-04-05 13:38:27--  http://boytsov.info/datasets/wikipedia_dpr_nq_sample_models_2021-09-15.tar.bz2
Resolving boytsov.info (boytsov.info)... 69.60.127.165
Connecting to boytsov.info (boytsov.info)|69.60.127.165|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 414972906 (396M) [application/x-bzip2]
Saving to: ‘wikipedia_dpr_nq_sample_models_2021-09-15.tar.bz2’


2024-04-05 13:41:10 (2.43 MB/s) - ‘wikipedia_dpr_nq_sample_models_2021-09-15.tar.bz2’ saved [414972906/414972906]



In [7]:
!wget boytsov.info/datasets/wikipedia_dpr_nq_sample_data_2021-09-15.tar.bz2

--2024-04-05 13:45:02--  http://boytsov.info/datasets/wikipedia_dpr_nq_sample_data_2021-09-15.tar.bz2
Resolving boytsov.info (boytsov.info)... 69.60.127.165
Connecting to boytsov.info (boytsov.info)|69.60.127.165|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2722927168 (2.5G) [application/x-bzip2]
Saving to: ‘wikipedia_dpr_nq_sample_data_2021-09-15.tar.bz2’


2024-04-05 13:51:54 (6.30 MB/s) - ‘wikipedia_dpr_nq_sample_data_2021-09-15.tar.bz2’ saved [2722927168/2722927168]



In [8]:
!wget boytsov.info/datasets/wikipedia_dpr_nq_sample_bitext_2021-09-15.tar.bz2  

--2024-04-05 13:55:19--  http://boytsov.info/datasets/wikipedia_dpr_nq_sample_bitext_2021-09-15.tar.bz2
Resolving boytsov.info (boytsov.info)... 69.60.127.165
Connecting to boytsov.info (boytsov.info)|69.60.127.165|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 43284754 (41M) [application/x-bzip2]
Saving to: ‘wikipedia_dpr_nq_sample_bitext_2021-09-15.tar.bz2’


2024-04-05 13:55:42 (1.89 MB/s) - ‘wikipedia_dpr_nq_sample_bitext_2021-09-15.tar.bz2’ saved [43284754/43284754]



In [9]:
!wget boytsov.info/datasets/wikipedia_dpr_nq_sample_embed_2021-09-15.tar.bz2  

--2024-04-05 13:56:30--  http://boytsov.info/datasets/wikipedia_dpr_nq_sample_embed_2021-09-15.tar.bz2
Resolving boytsov.info (boytsov.info)... 69.60.127.165
Connecting to boytsov.info (boytsov.info)|69.60.127.165|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 55835230 (53M) [application/x-bzip2]
Saving to: ‘wikipedia_dpr_nq_sample_embed_2021-09-15.tar.bz2’


2024-04-05 13:56:43 (4.19 MB/s) - ‘wikipedia_dpr_nq_sample_embed_2021-09-15.tar.bz2’ saved [55835230/55835230]
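
Before unpacking, it may be useful to confirm that each archive was downloaded completely. The sketch below compares on-disk sizes with the byte lengths that `wget` reported in the outputs above. The helper name `find_size_mismatches` is hypothetical, and the expected sizes are valid only as long as the archives on the server are unchanged.

```python
from pathlib import Path

# Byte sizes taken from the wget length lines in the download outputs above.
EXPECTED_SIZES = {
    "wikipedia_dpr_nq_sample_conf_2023-01-17.tar.bz2": 2692,
    "wikipedia_dpr_nq_sample_models_2021-09-15.tar.bz2": 414972906,
    "wikipedia_dpr_nq_sample_data_2021-09-15.tar.bz2": 2722927168,
    "wikipedia_dpr_nq_sample_bitext_2021-09-15.tar.bz2": 43284754,
    "wikipedia_dpr_nq_sample_embed_2021-09-15.tar.bz2": 55835230,
}


def find_size_mismatches(directory, expected=EXPECTED_SIZES):
    """Return {file name: actual size, or None if missing} for problem files."""
    problems = {}
    for name, size in expected.items():
        path = Path(directory) / name
        actual = path.stat().st_size if path.is_file() else None
        if actual != size:
            problems[name] = actual
    return problems
```

An empty dictionary returned by `find_size_mismatches(".")` means that all five archives are present with the expected sizes.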



In [10]:
!tar jxvf wikipedia_dpr_nq_sample_data_2021-09-15.tar.bz2

x wikipedia_dpr_nq_sample/
x wikipedia_dpr_nq_sample/input_data/
x wikipedia_dpr_nq_sample/input_data/train_fusion/
x wikipedia_dpr_nq_sample/input_data/train_fusion/QuestionFields.jsonl
x wikipedia_dpr_nq_sample/input_data/train_fusion/qrels.txt
x wikipedia_dpr_nq_sample/input_data/train_fusion/QuestionFields.bin
x wikipedia_dpr_nq_sample/input_data/dev/
x wikipedia_dpr_nq_sample/input_data/dev/QuestionFields.jsonl
x wikipedia_dpr_nq_sample/input_data/dev/qrels.txt
x wikipedia_dpr_nq_sample/input_data/dev/QuestionFields.bin
x wikipedia_dpr_nq_sample/input_data/bitext/
x wikipedia_dpr_nq_sample/input_data/bitext/QuestionFields.jsonl
x wikipedia_dpr_nq_sample/input_data/bitext/qrels.txt
x wikipedia_dpr_nq_sample/input_data/pass_sample/
x wikipedia_dpr_nq_sample/input_data/pass_sample/AnswerFields.bin
x wikipedia_dpr_nq_sample/input_data/pass_sample/AnswerFields.jsonl.gz
x wikipedia_dpr_nq_sample/input_data/dev_official/
x wikipedia_dpr_nq_sample/input_data/dev_official/QuestionFields.js

In [11]:
!tar jxvf wikipedia_dpr_nq_sample_conf_2023-01-17.tar.bz2

x wikipedia_dpr_nq_sample/
x wikipedia_dpr_nq_sample/model_conf/
x wikipedia_dpr_nq_sample/model_conf/vanilla_bert.json
x wikipedia_dpr_nq_sample/model_conf/vanilla_bert_with_scores.json
x wikipedia_dpr_nq_sample/exper_desc.best/
x wikipedia_dpr_nq_sample/exper_desc.best/extractors/
x wikipedia_dpr_nq_sample/exper_desc.best/extractors/avgembed.json
x wikipedia_dpr_nq_sample/exper_desc.best/extractors/cedr8080.json
x wikipedia_dpr_nq_sample/exper_desc.best/extractors/bm25_ance_exported_sparse.json
x wikipedia_dpr_nq_sample/exper_desc.best/extractors/bm25=text+model1=text_bert_tok+lambda=0.3+probSelfTran=0.35.json
x wikipedia_dpr_nq_sample/exper_desc.best/extractors/bm25_ance.json
x wikipedia_dpr_nq_sample/exper_desc.best/extractors/bm25_cedr8080.json
x wikipedia_dpr_nq_sample/exper_desc.best/extractors/bm25_avgembed.json
x wikipedia_dpr_nq_sample/exper_desc.best/extractors/bm25.json
x wikipedia_dpr_nq_sample/exper_desc.best/extractors/ance.json
x wikipedia_dpr_nq_sample/exper_desc.best/

In [12]:
!tar jxvf wikipedia_dpr_nq_sample_bitext_2021-09-15.tar.bz2  

x wikipedia_dpr_nq_sample/derived_data/bitext/
x wikipedia_dpr_nq_sample/derived_data/bitext/answer_text_unlemm
x wikipedia_dpr_nq_sample/derived_data/bitext/question_text_bert_tok
x wikipedia_dpr_nq_sample/derived_data/bitext/answer_text_bert_tok
x wikipedia_dpr_nq_sample/derived_data/bitext/question_text_unlemm


In [13]:
!tar jxvf wikipedia_dpr_nq_sample_models_2021-09-15.tar.bz2

x wikipedia_dpr_nq_sample/derived_data/ir_models/vanilla_bert/model.best


In [14]:
!tar jxvf wikipedia_dpr_nq_sample_embed_2021-09-15.tar.bz2  

x wikipedia_dpr_nq_sample/derived_data/embeddings/
x wikipedia_dpr_nq_sample/derived_data/embeddings/glove/
x wikipedia_dpr_nq_sample/derived_data/embeddings/glove/glove.6B.50d.txt.bz2
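
The five `tar jxvf` calls above can also be expressed as one loop in Python. This is a sketch using the standard `tarfile` module; the helper name `extract_archives` is hypothetical, and it should only be used on archives from a trusted source because `extractall` writes whatever paths the archive contains.

```python
import tarfile
from pathlib import Path


def extract_archives(directory):
    """Extract every *.tar.bz2 archive found in `directory` into that directory."""
    directory = Path(directory)
    extracted = []
    for archive in sorted(directory.glob("*.tar.bz2")):
        # Mode "r:bz2" opens a bzip2-compressed tar file, like `tar jxf`.
        with tarfile.open(archive, "r:bz2") as tar:
            tar.extractall(path=directory)
        extracted.append(archive.name)
    return extracted
```

The function returns the names of the archives it processed, which makes it easy to see whether any download is missing.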


#### For all the following experiments we use scripts installed via `flexneuart_install_extra.sh`. They must be called from their respective installation directory:

In [19]:
cd ~/Desktop/flexneuart

/Users/yangchen/Desktop/flexneuart


#### Carry out a basic sanity check:

In [20]:
!report/get_basic_collect_stat.sh wikipedia_dpr_nq_sample

zsh:1: no such file or directory: report/get_basic_collect_stat.sh


## Preprocessing in more detail: this is for information purposes only because the downloaded data is already preprocessed

The download and conversion scripts can be found in the directory `data_convert/wikipedia_dpr`.

In [23]:
!mkdir -p $COLLECT_ROOT/wikipedia_dpr_nq_sample/input_raw

### Converting passages and queries

In [24]:
!data_convert/wikipedia_dpr/download_dpr_passages.sh $COLLECT_ROOT/wikipedia_dpr_nq_sample/input_raw

--2024-04-05 14:13:28--  https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 108.156.184.22, 108.156.184.129, 108.156.184.78, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|108.156.184.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4694541059 (4.4G) [application/gzip]
Saving to: ‘psgs_w100.tsv.gz’


2024-04-05 14:18:17 (15.5 MB/s) - ‘psgs_w100.tsv.gz’ saved [4694541059/4694541059]



In [33]:
!data_convert/wikipedia_dpr/download_dpr_queries.sh nq $COLLECT_ROOT/wikipedia_dpr_nq_sample/input_raw

--2024-04-05 14:26:47--  https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-train.json.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 108.156.184.100, 108.156.184.78, 108.156.184.22, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|108.156.184.100|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2314892908 (2.2G) [application/gzip]
Saving to: ‘nq_train.json.gz’


2024-04-05 14:29:54 (11.8 MB/s) - ‘nq_train.json.gz’ saved [2314892908/2314892908]

--2024-04-05 14:29:54--  https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-dev.json.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 108.156.184.78, 108.156.184.22, 108.156.184.129, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|108.156.184.78|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 256239282 (244M) [application/gzip]
Saving to: ‘nq_dev.json.gz’


2024-04-05 14:30:08 (18.2 MB/s) - ‘nq_dev.json.gz’ saved [256239282/256239282]



### Randomly split the training set into new training and development sets. This script also converts the data into the FlexNeuART format

In [41]:
!data_convert/wikipedia_dpr/split_and_convert_dpr_queries.sh \
    wikipedia_dpr_nq_sample \
    $COLLECT_ROOT/wikipedia_dpr_nq_sample/input_raw \
    nq \
    -partition_sizes ,5000,2500 

Using collection root: /Users/yangchen/Desktop/flexneuart_collections
Namespace(seed=0, partitions_names='bitext,train_fusion,dev', partitions_sizes=',5000,2500', src_file='/Users/yangchen/Desktop/flexneuart_collections/wikipedia_dpr_nq_sample/input_raw/nq_train.json.gz', dst_file_pref='/Users/yangchen/Desktop/flexneuart_collections/wikipedia_dpr_nq_sample/input_raw/nq')
Reading input files...
58880it [01:46, 554.78it/s]
Shuffled query IDs using sid 0
Final partitions sizes: [('bitext', 51380), ('train_fusion', 5000), ('dev', 2500)]
Actually splitting data
58880it [09:04, 108.23it/s]
Disabled Spacy components:  ['ner', 'parser']
BERT-tokenizing input into the field: text_bert_tok
tokenizer_config.json: 100%|█████████████████| 48.0/48.0 [00:00<00:00, 18.2kB/s]
config.json: 100%|██████████████████████████████| 570/570 [00:00<00:00, 754kB/s]
vocab.txt: 100%|█████████████████████████████| 232k/232k [00:00<00:00, 5.91MB/s]
tokenizer.json: 100%|████████████████████████| 466k/466k [00:00<00:0
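
The partition sizes in the log follow from simple arithmetic: the size specification `,5000,2500` leaves the first entry empty, and the log shows that `bitext` receives the remaining 58880 - 5000 - 2500 = 51380 queries. The sketch below reproduces only this arithmetic. It is not the script's actual implementation, the function name `resolve_partition_sizes` is hypothetical, and it assumes that a single empty entry means "everything that is left".

```python
def resolve_partition_sizes(names, sizes_spec, total):
    """Give the single empty size entry whatever remains after the fixed ones."""
    names = names.split(",")
    raw = sizes_spec.split(",")
    if len(names) != len(raw):
        raise ValueError("need one size entry per partition name")
    if raw.count("") > 1:
        raise ValueError("at most one partition size may be left empty")
    fixed = sum(int(s) for s in raw if s)
    if fixed > total:
        raise ValueError("fixed partition sizes exceed the number of queries")
    return [(n, int(s) if s else total - fixed) for n, s in zip(names, raw)]


print(resolve_partition_sizes("bitext,train_fusion,dev", ",5000,2500", 58880))
# [('bitext', 51380), ('train_fusion', 5000), ('dev', 2500)]
```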

### The split & convert script produces outputs of two types:
1. The set of questions in JSONL format. These questions are divided into several subsets:

In [42]:
!ls $COLLECT_ROOT/wikipedia_dpr_nq_sample/input_data

[1m[36mbitext[m[m       [1m[36mdev[m[m          [1m[36mdev_official[m[m [1m[36mpass_sample[m[m  [1m[36mtrain_fusion[m[m


The `bitext` and `train_fusion` subsets are both intended for training. The difference is that `train_fusion` is a smaller subset that can be used to create fusion models, whereas `bitext` can be used to train, e.g., neural models.
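
To see how many questions each subset holds, one can count the entries in the `QuestionFields.jsonl` files. This is a minimal sketch that assumes one JSON object per line, as the JSONL extension suggests; the helper name `count_questions` is hypothetical.

```python
import json
from pathlib import Path


def count_questions(input_data_dir, file_name="QuestionFields.jsonl"):
    """Return {subset name: number of JSON lines} for subsets that have the file."""
    counts = {}
    for subset in sorted(Path(input_data_dir).iterdir()):
        path = subset / file_name
        if not path.is_file():
            continue  # e.g., pass_sample holds passages rather than questions
        n = 0
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    json.loads(line)  # raises ValueError on a malformed entry
                    n += 1
        counts[subset.name] = n
    return counts
```

It would be called on the `input_data` directory of the collection, i.e., `$COLLECT_ROOT/wikipedia_dpr_nq_sample/input_data`.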

For the queries from the `bitext` set, the conversion script creates parallel data (bitext) where questions are aligned with the respective answer-bearing sentences. We create three parallel corpora that correspond to three ways to lemmatize and tokenize the input: lemmas and original tokens, each with stopwords removed, and BERT-tokenized text. They are stored in the `derived_data/bitext` subdirectory:

In [43]:
!ls $COLLECT_ROOT/wikipedia_dpr_nq_sample/derived_data/bitext

answer_text            answer_title_unlemm    question_text_unlemm
answer_text_bert_tok   question_text          question_title_unlemm
answer_text_unlemm     question_text_bert_tok
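
Because the corpora are parallel, each `question_*` file should have as many entries as its `answer_*` counterpart. The sketch below checks this for one field, under the assumption (not confirmed by this notebook) that the files are uncompressed text with one entry per line. The helper names are hypothetical.

```python
from pathlib import Path


def count_lines(path):
    """Count lines without decoding the file."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def check_bitext_alignment(bitext_dir, field="text_bert_tok"):
    """Return (question lines, answer lines, whether they match) for one field."""
    bitext_dir = Path(bitext_dir)
    n_question = count_lines(bitext_dir / f"question_{field}")
    n_answer = count_lines(bitext_dir / f"answer_{field}")
    return n_question, n_answer, n_question == n_answer
```

The `field` argument corresponds to the file name suffixes listed above, e.g., `text`, `text_unlemm`, or `text_bert_tok`.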


### Embedding documents and queries (ANCE, Sentence BERT)

1. We already __ship__ data with documents and queries (except for the bitext part) embedded using an [ANCE Wikipedia model](https://github.com/microsoft/ANCE). This is done using the scripts in the `data_convert/biencoder/ance` directory.
2. A much more diverse set of embeddings (provided by [Sentence BERT](https://www.sbert.net/)) is available if you use the script `data_convert/biencoder/sbert/embed.py`.
3. First, one needs to download the models using the script `data_convert/biencoder/ance/download_ance_models.sh`.
4. Then, one can embed documents using a command like this one:

```
data_convert/biencoder/ance/embed.py \
    --input $COLLECT_ROOT/wikipedia_dpr_nq_sample/input_raw/psgs_w100.tsv.gz \
    --output $COLLECT_ROOT/wikipedia_dpr_nq_sample/input_data/pass_sample/AnswerFields.bin \
    --field_name dense  \
    --model_dir <model download directory> \
    --data_type dpr_nq \
    --doc_ids $COLLECT_ROOT/wikipedia_dpr_nq_sample/input_raw/nq_selected_psg_ids.npy
```

5. ... and queries using a command like this one (note we specify __the binary field name__):

```
for part in train_fusion dev dev_official ; do \
    data_convert/biencoder/ance/embed.py \
        --input $COLLECT_ROOT/wikipedia_dpr_nq_sample/input_data/$part/QuestionFields.jsonl \
        --output $COLLECT_ROOT/wikipedia_dpr_nq_sample/input_data/$part/QuestionFields.bin \
        --field_name dense  \
        --model_dir <model download directory> \
        --data_type dpr_nq 
done
```
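
The same loop can be prepared from Python, which makes it easy to inspect the commands before running them. This sketch only builds the argument lists, using the flags shown in the shell loop above; the helper name `build_query_embed_commands` is hypothetical, and `model_dir` stands for the directory where the ANCE models were downloaded.

```python
from pathlib import Path


def build_query_embed_commands(collect_root, model_dir,
                               parts=("train_fusion", "dev", "dev_official")):
    """Return one argument list per query subset, mirroring the shell loop."""
    input_data = Path(collect_root) / "wikipedia_dpr_nq_sample" / "input_data"
    commands = []
    for part in parts:
        commands.append([
            "data_convert/biencoder/ance/embed.py",
            "--input", str(input_data / part / "QuestionFields.jsonl"),
            "--output", str(input_data / part / "QuestionFields.bin"),
            "--field_name", "dense",
            "--model_dir", str(model_dir),
            "--data_type", "dpr_nq",
        ])
    return commands
```

Each list can then be passed to `subprocess.run(cmd, check=True)` from the FlexNeuART installation directory.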