Alex Rudnick edited this page Nov 22, 2013 · 1 revision

From MtmlNotes.

day 4: Thu 27 Jan 2011

Kemal Oflazer

http://cl.haifa.ac.il/MT/pres/kemal.pptx

Turkish. Altaic language. Super-agglutinative, very productive, and has lots of derivational morphology.

Some frozen forms, but it's mostly productive. Fairly regular, but not entirely. Hundreds of inflections are possible for a given root. Observed, in 490 million words, about 50k combinations of suffixes.

"existing at the time we caused something to become strong" -- one word in Turkish. "saglamlastidigimizdaki" (but with diacritics). How to generate words like that, given English phrases?

"after we had caused them to become pretty".

Finnish numerals are apparently pretty rough too. "twenty-eighth" -- every part of the numeral inflects to agree in case, which might be genitive or something more unusual.

"I never said that I wanted to go to Paris." -- one word in Inuktitut.

Breaking words in Turkish gets you sentences with like 100 tokens.

So the insight here is: Hey! Why not unchop the English instead of chopping the Turkish? Factored representation mapping to Turkish morphology. It's an agglutinativizationismic approach.

Transform dependency parses in English into big agglutinative words in pseudo-English; this dramatically shortens the English sentences without losing information. Using Joakim Nivre's parser.
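A minimal sketch of the idea (not Oflazer's actual system; the dependency labels and the set of "glue" relations here are made-up assumptions, just to show the token-count reduction):

```python
# Toy sketch: attach function words to their content-word heads to form
# agglutinated "pseudo-English" tokens, mirroring Turkish morphology.
# The relation labels (case, det, aux, mark) are hypothetical choices.

def agglutinate(tokens, heads, deps, glue={"aux", "case", "mark", "det"}):
    """tokens: words; heads: head index per token (-1 = root);
    deps: dependency label per token. Function words get absorbed."""
    attached = {i: [] for i in range(len(tokens))}
    for i, (h, d) in enumerate(zip(heads, deps)):
        if d in glue and h >= 0:
            attached[h].append(tokens[i])
    out = []
    for i, tok in enumerate(tokens):
        if deps[i] in glue and heads[i] >= 0:
            continue  # this word was absorbed into its head
        out.append("+".join([tok] + attached[i]))
    return out

# "to the house" collapses to one token, like Turkish "eve":
print(agglutinate(["to", "the", "house"], [2, 2, -1], ["case", "det", "root"]))
# -> ['house+to+the']
```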

For the translation, use MosesStatMt. Dual-path. Upshot seems to be that a reduction in the number of tokens makes it easier.

"The green is the BLEU..."

Tried reordering on the syntactic level too, and that didn't help much. Turkish->English seems easier. Is it possible to learn the rules for agglutinating English? How about doing Finnish? ...

Other groups seem to be using SuperTaggers. Chris points out that it's rare to enrich the source side and have it work out well.

Mermer

http://cl.haifa.ac.il/MT/pres/mermer.ppt

"Unsupervised Turkish Morphological Segmentation"

  • no human involvement
  • language independent
  • auto-optimize by believing the data

Maybe better than using an MA (morphological analyzer)! So let's learn the morphology from the corpus: search for the MAP segmentation.

Look up [Creutz and Lagus]. Find the most probable model given the corpus we observe.
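A toy MDL-flavored sketch of the idea, in the Creutz-and-Lagus spirit: score a segmentation by the cost of the morph lexicon plus the cost of encoding the corpus with it, and search for a low-cost (high-posterior) segmentation. The cost function and coordinate-ascent search here are simplified assumptions, not their actual model:

```python
import math
from itertools import combinations

def segmentations(word):
    """Yield every way to split a word at internal positions."""
    for k in range(len(word)):
        for cuts in combinations(range(1, len(word)), k):
            pieces, prev = [], 0
            for c in cuts:
                pieces.append(word[prev:c]); prev = c
            pieces.append(word[prev:])
            yield pieces

def cost(corpus_segs):
    """MDL-style cost: lexicon cost (letters per morph type) plus
    corpus cost (-log2 p of each morph token under a unigram MLE)."""
    tokens = [m for segs in corpus_segs for m in segs]
    types = set(tokens)
    lexicon = sum(len(m) + 1 for m in types)
    counts = {m: tokens.count(m) for m in types}
    total = len(tokens)
    data = -sum(math.log2(counts[m] / total) for m in tokens)
    return lexicon + data

def map_segment(corpus, passes=3):
    """Coordinate ascent: re-segment each word holding the rest fixed."""
    segs = {w: [w] for w in corpus}  # start unsegmented
    for _ in range(passes):
        for w in corpus:
            segs[w] = min(segmentations(w),
                          key=lambda s: cost([s if v == w else segs[v]
                                              for v in corpus]))
    return segs
```

On a tiny corpus like `["walked", "walking", "talked", "talking"]`, shared substrings get a chance to be reused as morphs because reuse lowers both the lexicon and the data cost.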

Okita and Way

http://cl.haifa.ac.il/MT/pres/okita.pdf

EN-JP patent corpus. 3.2 million sentence pairs. Tried a bunch of different SuperTaggers. "Conventional Pitman-Yor process..."

Jason Eisner

http://cl.haifa.ac.il/MT/pres/jason1.ppt

http://cl.haifa.ac.il/MT/pres/jason2.pdf

Nonparametric Bayesian approach to inflectional morphology!

Joint work with Markus Dreyer.

"Only thing standing between you and lunch."

Not assuming concatenative morphology. "How is morphology like clustering?" Put things together if they're commonly explained. Dirichlet mixture process. Infinitely many Gaussians?

"The vegetable snozzcumber: repulsive and disgustable." (weaselosity)

"phlim, phlam, phlum"

Predict "singed" from "sing" -- fails, so back off to "sang". Jurafsky had some work to do this? Want our system to act like a "reasonable linguist". Can we do this jointly with how to translate? "GibbsSampling and other MCMC methods..."

Alignments of string pairs... you can apparently learn which letters are vowels, unsupervised.

"faithfulness constraints"

[Dreyer, Smith, Eisner 2008]

Really, really good at Irish.

"In general, joint prediction is a good idea; it's just difficult on strings."

Infinitive: brechen. Get "brachen", etc. And then the forms inform each other.

Build a MarkovRandomFields! Variables are these complicated things; it's a graphical model on strings. Approximate algorithm: LoopyBeliefPropagation. Forward-backward is a special case of this.
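Forward-backward really is sum-product message passing specialized to a chain: the forward and backward recursions are the left-to-right and right-to-left messages. A small sketch with made-up HMM parameters:

```python
# Sketch: forward-backward as sum-product belief propagation on a chain.
# The HMM parameters used below are invented for illustration.

def forward_backward(init, trans, emit, obs):
    """init[s], trans[s][t], emit[s][o]: toy HMM tables.
    Returns per-position posteriors p(state | all observations)."""
    S, T = len(init), len(obs)
    fwd = [[0.0] * S for _ in range(T)]
    bwd = [[1.0] * S for _ in range(T)]
    for s in range(S):                       # left-to-right messages
        fwd[0][s] = init[s] * emit[s][obs[0]]
    for t in range(1, T):
        for s in range(S):
            fwd[t][s] = emit[s][obs[t]] * sum(
                fwd[t - 1][r] * trans[r][s] for r in range(S))
    for t in range(T - 2, -1, -1):           # right-to-left messages
        for s in range(S):
            bwd[t][s] = sum(trans[s][r] * emit[r][obs[t + 1]] * bwd[t + 1][r]
                            for r in range(S))
    posteriors = []
    for t in range(T):                       # combine messages, normalize
        un = [fwd[t][s] * bwd[t][s] for s in range(S)]
        z = sum(un)
        posteriors.append([u / z for u in un])
    return posteriors
```

On a general (loopy) graph the same message updates no longer terminate with exact marginals, which is where loopy belief propagation's approximation comes in.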

"It has to be infinite, because the problem is undecidable."

David MacKay's textbook?

3. Text and paradigms... point is: you have to explain everything in the corpus.

Nonparametric: infinitely many paradigms. Stick-breaking process (somehow equivalent to a Chinese restaurant process).
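A quick sketch of the stick-breaking construction: repeatedly break off a Beta(1, alpha) fraction of the remaining stick, so each piece is a mixture weight. Truncated here for illustration; the real construction is infinite:

```python
import random

def stick_breaking(alpha, n_sticks, seed=0):
    """Truncated stick-breaking draw from a Dirichlet process prior.
    Larger alpha spreads mass over more (smaller) sticks."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for _ in range(n_sticks):
        frac = rng.betavariate(1.0, alpha)   # fraction of what's left
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    return weights  # sums to < 1; the tail mass is the unbroken stick
```

The Chinese restaurant process gives the same distribution over partitions when you integrate the weights out, which is the (rough) equivalence the notes mention.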

"Here we're doing two-best instead of thousand best..."

Children often think of go/went as different verbs.
