Cross-Lingual Contextual Embedding Space Mapping

Figure: example of sense-level mapping. 'bank' is split into two sense embeddings, which are mapped to German 'Bank' (financial institution) and 'Ufer' (shore), respectively. The same holds for 'hard', whose two sense vectors are mapped to German 'schwer' (difficult) and 'hart' (solid).

Prerequisites

First, create the virtual environment and install the required packages.

conda create --name cmap python=3.7
conda activate cmap
pip install -r requirements.txt

Use Pre-Trained Embeddings and Reproduce the Numbers in the Paper

The following pre-trained aligned embeddings are available for download:
En-De (word-level)
En-De (sense-level)
En-Ar (word-level)
En-Ar (sense-level)
En-Nl (word-level)
En-Nl (sense-level)

Bilingual Dictionary Induction

An example of evaluating the English-German mapping through BDI. The isotropy, isometry, and isomorphism results are also printed.

python mapping.py --tgt de --emb_path $EMB/PATH --if_iter_norm True
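
For intuition, BDI evaluation retrieves each source word's nearest target neighbour in the shared space and checks it against a gold dictionary. Below is a minimal precision@1 sketch; it is an illustration only, not mapping.py itself, and the variable names and dictionary format are assumptions:

import numpy as np

def bdi_precision_at_1(src_emb, tgt_emb, test_pairs, src_index, tgt_index):
    # src_emb: (n_src, d) and tgt_emb: (n_tgt, d), both already mapped into
    # the shared space and L2-normalised, so a dot product is cosine similarity.
    # test_pairs: list of (src_word, tgt_word) gold translations.
    # src_index / tgt_index: word -> row maps for the two matrices.
    correct = 0
    for s, t in test_pairs:
        sims = tgt_emb @ src_emb[src_index[s]]   # similarity to every target word
        if np.argmax(sims) == tgt_index[t]:      # nearest neighbour is the gold word
            correct += 1
    return correct / len(test_pairs)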

To Create Your Own Aligned Embeddings:

Alternatively, if you want to build your own customized aligned embeddings, follow the instructions below.

1. Install fast_align

Please install the fast_align toolkit by following the instructions in its README (clone the repository and build with cmake/make).

2. Download and Preprocess Parallel Corpora

Parallel corpora are downloaded from ParaCrawl. A preprocessing script is provided, so this step is a single command. An example of downloading and preprocessing the En-De parallel corpus:

wget https://s3.amazonaws.com/web-language-models/paracrawl/release6/en-de.txt.gz
gunzip ./en-de.txt.gz
./data/preprocess.sh de YOUR/PATH/FOR/PARALLEL/CORPUS YOUR/PATH/FOR/FAST/ALIGN
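
Among other files, preprocessing produces forward_align.txt, where each line holds fast_align's links for one sentence pair in the standard 'i-j' format (source token i aligned to target token j). A minimal reader, for reference (this helper is ours, not part of the repo):

def read_alignments(path):
    # Parse fast_align output: one line per sentence pair, links like "0-0 1-2".
    alignments = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            links = []
            for pair in line.split():
                i, j = pair.split("-")
                links.append((int(i), int(j)))
            alignments.append(links)
    return alignments

links = read_alignments("forward_align.txt")  # e.g. [[(0, 0), (1, 2), ...], ...]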

For the En-Ar parallel corpus, all preprocessed files can be found here.

3. Obtain Contextual Embeddings

Continuing the above example, we run getwordvectorsfrombert.py to obtain aligned contextual embeddings.

path=YOUR/PATH/FOR/PARALLEL/CORPUS
python getwordvectorsfrombert.py --src en --tgt de \
    --open_src_file ${path}en_token.txt \
    --open_tgt_file ${path}de_token.txt \
    --open_align_file ${path}forward_align.txt \
    --write_vectors_path ${path}vectors/ \
    --max_num_word 10000 --batch_size 256 --max_seq_length 150

The extracted vectors are written to ${path}vectors/.
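
For intuition about what this step computes: a contextual word vector can be obtained by running BERT over a sentence and averaging the hidden states of the word's subword pieces. A minimal sketch using Hugging Face transformers (the checkpoint name and the averaging choice are assumptions; the script's actual internals may differ):

import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")

words = "the boat reached the bank".split()
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]     # (seq_len, hidden_dim)

word_id = 4                                        # position of "bank"
subword_rows = [i for i, w in enumerate(enc.word_ids()) if w == word_id]
bank_vec = hidden[subword_rows].mean(dim=0)        # contextual vector for "bank"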

4. Cluster Aligned Embeddings

To obtain sense-level embeddings:

python cluster_vector.py --input_file ${path}vectors/ --write_file $output --stopwords ./data/stopwords/en.txt --min_threshold 100 --min_num_words 5 

To obtain word-level embeddings, simply increase the clustering threshold (a word's vectors are clustered only if its occurrence count is higher than the threshold) to a large number (>10000).

python cluster_vector.py --input_file ${path}vectors/ --write_file $output --stopwords ./data/stopwords/en.txt --min_threshold 100000 --min_num_words 5 
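
Conceptually, the sense-level step groups a frequent word's contextual vectors into clusters and keeps one vector per cluster, while infrequent words fall back to a single averaged (word-level) vector. A k-means sketch of this idea (the actual algorithm, cluster count, and defaults in cluster_vector.py may differ):

import numpy as np
from sklearn.cluster import KMeans

def sense_vectors(vectors, n_senses=2, min_threshold=100):
    # vectors: (n_occurrences, d) contextual embeddings of one word.
    # Below the occurrence threshold, keep a single word-level mean vector.
    if len(vectors) < min_threshold:
        return vectors.mean(axis=0, keepdims=True)
    km = KMeans(n_clusters=n_senses, n_init=10).fit(vectors)
    return km.cluster_centers_                     # one embedding per sense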

The output file of aligned contextual embeddings contains six columns:

0 - source word (sense)
1 - translated word (sense) on the target side
2 - occurrence count of the source word (not used in this paper)
3 - entropy of the cluster (not used in this paper)
4 - source word (sense) embedding
5 - target word (sense) embedding
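
As an example of consuming this file, assuming tab-separated columns with space-separated floats in the embedding fields (both the separator and the file name aligned_embeddings.txt are assumptions; adjust to your actual output):

import numpy as np

src_words, tgt_words, src_vecs, tgt_vecs = [], [], [], []
with open("aligned_embeddings.txt", encoding="utf-8") as f:
    for line in f:
        cols = line.rstrip("\n").split("\t")
        src_words.append(cols[0])                          # source word (sense)
        tgt_words.append(cols[1])                          # target translation
        src_vecs.append(np.array(cols[4].split(), dtype=float))
        tgt_vecs.append(np.array(cols[5].split(), dtype=float))
src_emb = np.vstack(src_vecs)                              # (n_pairs, d)
tgt_emb = np.vstack(tgt_vecs)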
