This repository contains the code for our knowledge-transfer approach to unsupervised chunking.
Before running the code, install the libraries listed in requirements.txt, e.g. via pip (see below):
```
transformers==2.2.0
numpy==1.15.4
tqdm==4.26.0
torch==1.3.1
nltk==3.6
matplotlib==2.2.3
```
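For example, assuming pip and a Python 3 environment:

```
pip install -r requirements.txt
```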
All the experiments in this repository are run on the CoNLL-2000 dataset. Both the original and the processed dataset files are provided in the `data_conll` directory.
All output files, formatted for the CoNLL evaluation script, are provided in the `output_chunks` directory. The reported test-set results for CPCFG, LM, and HRNN can be replicated by running `eval_conll2000_updated.pl` on these files (an updated version of the official Perl evaluation script that prints only overall phrase-level F1 and tag-level accuracy).
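For example, assuming the updated script, like the official conlleval, reads CoNLL-format predictions from standard input (the output file name here is a placeholder):

```
perl eval_conll2000_updated.pl < output_chunks/<chunk_output_file>
```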
Language model for unsupervised chunking
- `run.py` extracts features from different layers and different attention heads
- `evaluate_distances.py` evaluates the distances produced by `run.py` and creates a file ready for the evaluation script
- `evaluate_heuristics.py` evaluates the maximal left-branching and maximal right-branching heuristics on top of the LM parser
Hierarchical RNN for unsupervised chunking
- `model4_hrnn.py` trains the student HRNN model
- `test_model4_hrnn.py` tests the trained HRNN model and creates a file ready for the evaluation script
(Teacher model 1) Language model
- `python3 run.py [--data-path PATH] [--result-path PATH]`
- `python3 evaluate_distances.py [--train-test-val train] [--dist-measure avg_hellinger] [--model bert-base-cased] [--distpath PATH] [--gt-tag-path PATH]`
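A minimal sketch of the full teacher pipeline, using the default flag values shown above; the result and tag paths are hypothetical:

```
# Step 1: extract features from the LM's layers and attention heads
python3 run.py --data-path data_conll/ --result-path lm_results/

# Step 2: turn the resulting distances into an eval-ready chunk file
python3 evaluate_distances.py --train-test-val train --dist-measure avg_hellinger \
    --model bert-base-cased --distpath lm_results/ --gt-tag-path data_conll/
```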
(Teacher model 2) Compound PCFG
- Refer to the repo https://github.com/Anup-Deshmukh/CompoundPCFG-Chunker
(Student model) Hierarchical RNN
- `python3 model4_hrnn.py [--is-training 1] [--dire SAVE_PATH]`
- `python3 test_model4_hrnn.py`
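For example (the checkpoint directory is a placeholder):

```
# Train the student HRNN, saving checkpoints to SAVE_PATH
python3 model4_hrnn.py --is-training 1 --dire hrnn_checkpoints/

# Test the trained model and write an eval-ready chunk file
python3 test_model4_hrnn.py
```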
If you find this code helpful, please cite our EMNLP 2021 Findings paper: "Unsupervised Chunking as Syntactic Structure Induction with a Knowledge-Transfer Approach" (https://aclanthology.org/2021.findings-emnlp.307/)
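A BibTeX entry (fields transcribed from the ACL Anthology page; please double-check them there):

```bibtex
@inproceedings{deshmukh-etal-2021-unsupervised,
    title = "Unsupervised Chunking as Syntactic Structure Induction with a Knowledge-Transfer Approach",
    author = "Deshmukh, Anup Anand and Zhang, Qianqiu and Li, Ming and Lin, Jimmy and Mou, Lili",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    year = "2021",
    url = "https://aclanthology.org/2021.findings-emnlp.307/",
}
```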
Utility functions in the "utils" folder and the datasets used in this repo originally come from the source code for:
- "Are Pre-trained Language Models Aware of Phrases? Simple but Strong Baselines for Grammar Induction" (T. Kim et al., ICLR 2020).
- For more details, visit the original repo.