Cross-lingual Phrase Retriever

This repository contains the code and pre-trained models for our paper XPR: Cross-lingual Phrase Retriever.

**************************** Updates ****************************

Overview

We propose XPR, a cross-lingual phrase retriever that extracts phrase representations from unlabeled example sentences.

Dataset

We also create a large-scale cross-lingual phrase retrieval dataset, which contains 65K bilingual phrase pairs and 4.2M example sentences in 8 English-centric language pairs.

Getting Started

The following sections describe how to use XPR.

Requirements

  • First, install PyTorch by following the instructions from the official website. To faithfully reproduce our results, please use torch==1.8.1+cu111, or the 1.8.1 build that matches your platform and CUDA version. PyTorch versions higher than 1.8.1 should also work.
  • Then, run the following script to fetch the repo and install the remaining dependencies.
git clone git@github.com:cwszz/XPR.git
cd XPR
pip install -r requirements.txt
mkdir data
mkdir model
mkdir result
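
As a quick check of the PyTorch requirement above, the following sketch compares an installed version string against 1.8.1. The helper names `parse_version` and `meets_minimum` are illustrative and not part of this repository; the check only looks at the first three numeric components and ignores build suffixes such as `+cu111`.

```python
import re


def parse_version(version: str) -> tuple:
    """Turn a string like '1.8.1+cu111' into (1, 8, 1).

    Hypothetical helper: the local build suffix after '+' is dropped and
    only the first three numeric components are kept.
    """
    public = version.split("+", 1)[0]
    parts = [int(n) for n in re.findall(r"\d+", public)[:3]]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(version: str, minimum=(1, 8, 1)) -> bool:
    """Return True if the version is at least the minimum (1.8.1 by default)."""
    return parse_version(version) >= minimum


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        status = "ok" if meets_minimum(torch.__version__) else "older than 1.8.1"
        print("torch", torch.__version__, status)
        print("CUDA available:", torch.cuda.is_available())
```

Running the file directly prints the installed version and whether CUDA is visible. It does not verify that the build matches the local CUDA toolkit.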

Dataset

Before using XPR, please process the dataset by following the steps below.

  • Download our dataset here: link

  • Unzip the dataset and move it into the data folder. (Make sure the dataset path in the bash scripts points to it.)

Checkpoint

Before using XPR, please process the checkpoint by following the steps below.

  • Download our checkpoint here: link

  • Move the checkpoint files from the downloaded repo into the model folder.
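
Before training or evaluating, it can help to confirm that the dataset and checkpoint actually landed in the data and model folders. The sketch below is a generic check, not a script from this repository. The function name `empty_or_missing` is hypothetical, and it only tests that each folder exists and is non-empty, since the exact file layout is not described here.

```python
from pathlib import Path


def empty_or_missing(root, names=("data", "model")):
    """Return the sub-folders of root that are missing or contain no entries."""
    problems = []
    for name in names:
        folder = Path(root) / name
        # A folder counts as a problem if it does not exist or has nothing in it.
        if not folder.is_dir() or not any(folder.iterdir()):
            problems.append(name)
    return problems


if __name__ == "__main__":
    # Run from the repository root.
    for name in empty_or_missing("."):
        print(f"{name}/ is missing or empty")
```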

Train XPR

bash train.sh

Evaluation

Test our method:

  • Download the XPR checkpoint from Huggingface: [link]
  • Make sure the model path and dataset path in test.sh are correct
  • The output log can be found in the log folder

Here is an example of evaluating XPR:

bash test.sh

or

export CUDA_VISIBLE_DEVICES='0'
python3 predict.py \
--lg $lg \
--test_lg $test_lg \
--dataset_path ./data/ \
--load_model_path ./model/pytorch_model.bin \
--queue_length 0 \
--unsupervised 0 \
--wo_projection 0 \
--layer_id 12 \
> log/test-${lg}-${test_lg}-32.log 2>&1
  • $lg: The language on which the model was trained
  • $test_lg: The language on which the model is tested
  • --dataset_path: The path to the dataset folder
  • --load_model_path: The path to the checkpoint
  • --queue_length: The length of the memory queue
  • --unsupervised: Unsupervised mode
  • --wo_projection: Run without the SimCLR projection head
  • --layer_id: The layer used to represent phrases
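
To evaluate several language pairs in a row, the command above can be assembled from Python instead of edited by hand. This is a minimal sketch, not a script shipped with the repository: `build_predict_command` and `run_evaluation` are hypothetical helpers. They use only the flags listed above and mirror the log file naming from the example, and the language codes and paths must be supplied by the caller.

```python
import os
import subprocess
from pathlib import Path


def build_predict_command(lg, test_lg, dataset_path, model_path,
                          queue_length=0, unsupervised=0,
                          wo_projection=0, layer_id=12):
    """Build the predict.py argument list; each flag and its value are separate items."""
    return [
        "python3", "predict.py",
        "--lg", lg,
        "--test_lg", test_lg,
        "--dataset_path", str(dataset_path),
        "--load_model_path", str(model_path),
        "--queue_length", str(queue_length),
        "--unsupervised", str(unsupervised),
        "--wo_projection", str(wo_projection),
        "--layer_id", str(layer_id),
    ]


def run_evaluation(lg, test_lg, dataset_path, model_path,
                   log_dir="log", gpu="0", **options):
    """Run one evaluation and write stdout and stderr to log/test-<lg>-<test_lg>-32.log."""
    log_dir = Path(log_dir)
    log_dir.mkdir(exist_ok=True)  # the shell redirect in the example assumes this folder exists
    log_file = log_dir / f"test-{lg}-{test_lg}-32.log"
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    command = build_predict_command(lg, test_lg, dataset_path, model_path, **options)
    with open(log_file, "w") as log:
        return subprocess.run(command, stdout=log, stderr=subprocess.STDOUT, env=env).returncode


# Usage (fill in your own language codes and paths):
# run_evaluation(lg, test_lg, dataset_path, model_path)
```

Passing the arguments as a list avoids shell quoting issues, and a non-zero return code signals that the run failed and the log should be inspected.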

References

Please cite this paper if you find the resources in this repository useful.
