Multidimensional Backoff

Author: Anna Currey
Created: March 2015
Hosted on: https://github.com/annacurrey/multidimensional-backoff/tree/master/text-process

Overview

This project is an implementation of multidimensional backoff for factored language models.

Current status

There is a script for creating a multidimensional backoff language model for bigrams with three clusters. There are also text processing scripts for preparing training and test data for use in multidimensional backoff models.

To dos

Querying program for the language model
Script for creating language model for trigrams with two clusters (including the word)
Implement better discounting in language model
Generalize scripts to n-grams with m clusters
Check calculations of probabilities and backoffs

Text processing

The text-process directory contains scripts for converting training and test data into the correct format.

add-factors.py

Adds two sets of factors to a data set for use in a factored language model.

Usage: ./add-factors.py [infile] [factor1] [label1] [factor2] [label2] > outfile

Input:
* data file to which factors should be added
  * one sentence per line
  * words separated by spaces
  * no additional tags
* file containing the first set of factors
  * one word per line
  * format: "[word] [factor label]"
* label for the first set of factors (one letter)
* file containing the second set of factors (same format as other factors file) (map from first set to second set)
* label for the second set of factors
* desired outfile name

Output: file identical to infile, except that factors have been added. Words are of the format W-word|A-factor1|B-factor2, where A and B are the factor labels.
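The input and output formats above can be illustrated with a minimal sketch. This is not the code in add-factors.py. The function names and the "UNK" placeholder for missing factors are hypothetical, and the sketch assumes the second factor file maps first-set factors to second-set factors, which is one reading of "map from first set to second set".

```python
def load_factors(lines):
    """Parse '[key] [factor]' lines into a dict, skipping malformed lines."""
    factors = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            factors[parts[0]] = parts[1]
    return factors


def add_factors(sentence, factors1, label1, factors2, label2, unknown="UNK"):
    """Rewrite each word as W-word|<label1>-factor1|<label2>-factor2.

    Assumption: factors2 is keyed by first-set factors, not by words.
    Assumption: missing entries get the hypothetical placeholder `unknown`.
    """
    out = []
    for word in sentence.split():
        f1 = factors1.get(word, unknown)
        f2 = factors2.get(f1, unknown)
        out.append(f"W-{word}|{label1}-{f1}|{label2}-{f2}")
    return " ".join(out)


if __name__ == "__main__":
    small = load_factors(["dog 12", "cat 12", "runs 7"])
    large = load_factors(["12 3", "7 1"])
    print(add_factors("dog runs", small, "S", large, "L"))
```

If the second factor file is keyed by words instead, the lookup `factors2.get(f1, unknown)` would become `factors2.get(word, unknown)`.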

Notes: In the SRILM implementation of factored language models, factors need to be separated with “:”. Here, I separate them with “|”.
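If the output needs to be read by SRILM, the separator can be swapped afterwards. The following is a small sketch, not part of this repository. The function name is hypothetical, and it assumes that no word or factor value contains "|" or ":" itself.

```python
def to_srilm_separators(line):
    """Replace the '|' factor separator with ':' inside each token.

    Assumption: '|' only ever appears as a factor separator.
    """
    return " ".join(token.replace("|", ":") for token in line.split())


if __name__ == "__main__":
    print(to_srilm_separators("W-dog|S-12|L-3 W-runs|S-7|L-1"))
```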

word2cluster.py

Replaces words in a file with their corresponding clusters.

Usage: ./word2cluster.py infile clusterfile > outfile

Input file format:
* one sentence per line
* words separated by spaces

Cluster file format:
* one word per line
* line format: word cluster
* no other information in file
* one cluster per word

Output file format:
* identical to input file, but words replaced with their clusters
* words without clusters are not replaced
* note we don't replace unclustered words with -1 here because we don't want them to cluster with other words if we run the word clustering program again

Notes: Use this script when creating the larger set of clusters if you want to force the smaller clusters to be proper subsets of the larger clusters.
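The replacement behavior described above can be sketched as follows. This is an illustration of the file formats, not the code in word2cluster.py, and the function names are hypothetical.

```python
def load_clusters(lines):
    """Parse 'word cluster' lines into a dict (one cluster per word)."""
    clusters = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            clusters[parts[0]] = parts[1]
    return clusters


def words_to_clusters(sentence, clusters):
    """Replace each clustered word with its cluster; keep other words as-is."""
    return " ".join(clusters.get(word, word) for word in sentence.split())


if __name__ == "__main__":
    clusters = load_clusters(["dog 12", "cat 12"])
    print(words_to_clusters("the dog sees the cat", clusters))
```

Words missing from the cluster file pass through unchanged, matching the note that unclustered words are not replaced with -1.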

Creating the language model

The program create-lm_2g3c.py creates a language model for multidimensional backoff for bigrams with three clusters (including the word itself).

Usage: ./create-lm_2g3c.py training_file > output_file

Training file format:
* one sentence per line
* words separated by spaces
* words in the format W-word|S-small_cluster|L-large_cluster
* assume small clusters are subsets of large clusters (i.e. given the small cluster, the large cluster is known)
* also assume word-cluster mapping is 1-1

Output file format:
* similar to ARPA file format
* cannot use traditional ARPA format because the backoffs are in a different dimension
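To make the training file format concrete, here is a minimal sketch that parses factored tokens and collects bigram counts at three levels of granularity. It is not the code in create-lm_2g3c.py. The function names are hypothetical, and coarsening the history (rather than the predicted word) is an assumption made for illustration. The sketch also ignores sentence boundaries and discounting.

```python
from collections import Counter


def parse_token(token):
    """Split 'W-word|S-small|L-large' into (word, small, large)."""
    fields = {}
    for part in token.split("|"):
        # Split on the first hyphen only, so hyphenated words survive.
        label, _, value = part.partition("-")
        fields[label] = value
    return fields["W"], fields["S"], fields["L"]


def count_bigrams(sentences):
    """Count (history, word) pairs with the history at each granularity.

    Assumption: the history is coarsened (word, small cluster, large
    cluster) while the predicted item stays a word.
    """
    counts = {"W": Counter(), "S": Counter(), "L": Counter()}
    for sentence in sentences:
        tokens = [parse_token(t) for t in sentence.split()]
        for (prev_w, prev_s, prev_l), (word, _, _) in zip(tokens, tokens[1:]):
            counts["W"][(prev_w, word)] += 1
            counts["S"][(prev_s, word)] += 1
            counts["L"][(prev_l, word)] += 1
    return counts


if __name__ == "__main__":
    data = [
        "W-the|S-1|L-0 W-dog|S-12|L-3 W-runs|S-7|L-1",
        "W-the|S-1|L-0 W-cat|S-12|L-3 W-runs|S-7|L-1",
    ]
    for level, counter in count_bigrams(data).items():
        print(level, dict(counter))
```

In this toy data, "dog" and "cat" share a small cluster, so the cluster-level history pools their counts while the word-level counts stay separate.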

About multidimensional backoff

Multidimensional backoff is used to adapt factored language models for use with word vectors. For more information on multidimensional backoff, see the paper Currey_multidimensional-backoff.pdf in this repository.
