This repository contains code accompanying the paper: Same Neurons, Different Languages: Probing Morphosyntax in Multilingual Pre-trained Models (Stańczak et al., NAACL 2022).
These instructions assume that conda is already installed on your system.
- Clone this repository. NOTE: We recommend keeping the default folder name when cloning.
- First run `conda env create -f environment.yml`.
- Activate the environment with `conda activate multilingual-typology-probing`.
- (Optional) Set up wandb if you want live logging of your runs.
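If you opt into wandb, the setup usually amounts to a one-time login; a minimal sketch (`wandb login` is interactive, and `WANDB_MODE` is wandb's standard environment switch for sync behaviour — neither is specific to this repo):

```shell
# One-time, interactive: authenticate this machine with your wandb account.
# wandb login

# wandb honours the WANDB_MODE environment variable; use it to control
# whether runs sync to the server while you experiment.
export WANDB_MODE=offline   # "online" to sync live, "disabled" to turn off
```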
You will also need to generate the data.
- First run `mkdir unimorph && cd unimorph && wget https://raw.githubusercontent.com/unimorph/um-canonicalize/master/um_canonicalize/tags.yaml`.
- Download the UD 2.1 treebanks and put them in `data/ud/ud-treebanks-v2.1`.
- Clone the modified UD converter into this repo's parent folder, then convert the treebank annotations to the UniMorph schema with `./scripts/ud_to_um.sh`.
- Run `./scripts/preprocess_bert.sh`, `./scripts/preprocess_xlmr_base.sh`, and `./scripts/preprocess_xlmr_large.sh` to preprocess all the relevant treebanks with the corresponding embeddings. This may take a while.
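Put together, the data-generation steps above amount to the conversion script followed by the three preprocessing scripts. A dry-run sketch (it only prints the commands; drop the `echo` to execute for real once the treebanks are in place):

```shell
# Dry run of the data-generation pipeline; the script paths come from the
# steps above. Remove "echo" once data/ud/ud-treebanks-v2.1 is populated.
for step in ./scripts/ud_to_um.sh \
            ./scripts/preprocess_bert.sh \
            ./scripts/preprocess_xlmr_base.sh \
            ./scripts/preprocess_xlmr_large.sh; do
  echo "would run: $step"
done
```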
The `run.py` script can be used to invoke the experiments. Commands have the format `python run.py [ARGS] MODE [MODE-ARGS]`.
First, in our paper we employ the latent-variable probe presented in A Latent-Variable Model for Intrinsic Probing (Stańczak et al., 2022) to identify the subset of neurons relevant to each morphosyntactic attribute in each language. We opt for a Poisson sampling scheme and solve the optimization problem with greedy search, using mutual information as the performance measure.
Hence, we run `python run.py --language $language --attribute $attribute --trainer poisson --gpu --embedding $embedding greedy --selection-size 50 --selection-criterion mi` for each analysed language–attribute pair and each of the three probed language models: m-BERT, XLM-R-base, and XLM-R-large.
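The per-pair invocations can be scripted as a simple loop; a dry-run sketch with illustrative values (the language codes, attributes, and embedding name below are placeholders, not necessarily the exact identifiers the repo expects):

```shell
# Placeholder values; substitute the languages/attributes analysed in the
# paper and the embedding identifiers accepted by run.py.
embedding="bert-base-multilingual-cased"
for language in eng rus; do
  for attribute in Number Gender; do
    # "echo" makes this a dry run; remove it to launch the probes.
    echo python run.py --language "$language" --attribute "$attribute" \
      --trainer poisson --gpu --embedding "$embedding" \
      greedy --selection-size 50 --selection-criterion mi
  done
done
```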
Alternatively, you can run `make 01_bert_ALL`, `make 01_xlmr_base_ALL`, and `make 01_xlmr_large_ALL`, which run the above command for all the chosen languages and attributes and generate the appropriate files.
Next, you can run exploratory analysis on the generated results. The plots presented in the paper were generated with the following scripts: `01_neuron_overlap.py`, `02_lang_similarity_no_attr.py`, and `03_genus_similarity.py`.
If this code or the paper was useful to you, please consider citing it:
@inproceedings{stanczak-etal-2022-same,
title = "Same Neurons, Different Languages: Probing Morphosyntax in Multilingual Pre-trained Models",
author = "Stańczak, Karolina and
Ponti, Edoardo and
Torroba Hennigen, Lucas and
Cotterell, Ryan and
Augenstein, Isabelle",
booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
year = "2022",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/2205.02023",
}
To ask questions or report problems, please open an issue.