Initially, I selected 68 French words that occur 10 times in news2014.fr (many more could be selected) and crawled them with Linguee. This gave me 736 URL pairs, along with aligned sentence fragments. The goal of the study is to find out how many of these we recover with our baseline pipeline, so we get a sense of where we lose the most, and hence where better methods could bring the largest gain.
ssh syn
cd /home/pkoehn/statmt/project/crawl/linguee
./crawl-lingue.perl LANGUAGE < LANGUAGE-unknown-freq10.word-list
./get-urls-from-linguee.perl LANGUAGE > LANGUAGE-unknown-freq10.info
Runs:
- 68 French words, 736 URL pairs
- 660 French words, 6424 URL pairs, 2171 unique web domains
245 of the 736 URLs point to PDF files. These are not in CommonCrawl.
- Total number of URLs: 736 (646 unique)
- Loss: 33%
- Remaining: 491 (459 unique)
This analysis is based on the first run with 68 French words, resulting in 736 URL pairs.
Use the fancy querying interface http://statmt.org:8030/query_domain?domain=URL
- Starting point: 456 URL pairs (note: not all have been processed yet)
- Found English URLs: 36 URLs (92% loss)
- Found French URLs: 23 URLs (95% loss)
- Found both URLs: 13 URLs (97% loss)
So, that's not so good.
This is currently the main bottleneck. We probably have to crawl ourselves and use CommonCrawl only as a means to find promising sites.
- Starting point: 491 French URLs (246 unique web domains)
- Loss: 15%
- Found domain in CommonCrawl: 417 URLs (184 unique web domains)
The baseline pipeline here is URL matching.
- Starting point: 490 (458 unique, 165 web domains)
- Loss: 67%
- Remaining: 164 URLs (50? unique web domains)
For now, the starting point is the set of URLs from Linguee. We first check whether these are still alive by crawling them. For the ones where we downloaded HTML documents, we check whether the French page contains the matched word - a basic but reliable sanity check, since the French words are rare. We lose some URL pairs because the pages are Latin-1 encoded, so grepping for the matched word (which is UTF-8) fails.
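The encoding failure above can be sidestepped by decoding with a fallback before searching. A minimal sketch (the helper name and the fallback order are assumptions, not part of the pipeline):

```python
def contains_keyword(raw_bytes: bytes, keyword: str) -> bool:
    """Check whether a downloaded page contains the matched French word,
    trying UTF-8 first and falling back to Latin-1 so that accented
    characters in Latin-1 pages are not missed."""
    for encoding in ("utf-8", "latin-1"):
        try:
            text = raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
        if keyword in text:
            return True
    return False

# A Latin-1 page containing "déjà" defeats a byte-level UTF-8 grep:
page = "Il l'a déjà vu.".encode("latin-1")
assert "déjà".encode("utf-8") not in page   # the naive grep misses it
assert contains_keyword(page, "déjà")       # the fallback finds it
```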
download-urls.perl LANGUAGE
check-downloaded-html-for-key-word.perl LANGUAGE
(creates LANGUAGE-*.crawl-check)
- URL pairs: 6424 (4917 unique)
- URL pairs that are not PDF: 4473 (3573 unique, 1496 unique web domains)
- Crawled with non-empty response: 2974 URL pairs (2325 unique, 879 unique web domains)
- Successfully crawled: 1761 URL pairs (1375 unique, 448 unique web domains)
- Loss: 61% (62%, 71%)
For these 1761 URLs, we completely crawl the 448 web domains (using Bitextor / httrack).
Of these 448 unique domains:
- we excluded some since they are way too big: Canadian Parliament (265 URLs) and Europarl/europa.eu (253 URLs).
- we excluded some because the French and the English have different domains
This leaves 389 domains.
Most, yes - over 90% - and the rest are similar.
- Total number of URLs: 1790 URL pairs
- Web pages under same domain: 1591 URL pairs
- Not on same domain: www.fifa.com ≠ fr.fifa.com, eng.royalcanin.com ≠ www.royalcanin.com, www.uefa.com ≠ fr.uefa.com, voilesnews.fr ≠ www.voilesnews.fr, www.nintendo.fr ≠ www.nintendo.co.uk, www.nintendo.fr ≠ www.nintendo.pt, www.marcmaison.com ≠ www.marcmaison.fr
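The removal step above compares hosts strictly, so pairs like www.fifa.com / fr.fifa.com are dropped. A relaxed check that strips common language subdomains would keep those while still separating genuinely different sites; a sketch (the subdomain list is an assumption):

```python
from urllib.parse import urlsplit

# Assumed language/neutral subdomain prefixes to ignore when comparing hosts.
LANG_SUBDOMAINS = {"www", "fr", "en", "eng"}

def base_domain(url: str) -> str:
    """Host of the URL with a leading language subdomain stripped."""
    host = urlsplit(url).netloc.lower()
    parts = host.split(".")
    if parts and parts[0] in LANG_SUBDOMAINS:
        parts = parts[1:]
    return ".".join(parts)

def same_site(url_fr: str, url_en: str) -> bool:
    return base_domain(url_fr) == base_domain(url_en)

# Kept under the relaxed check, dropped under strict host equality:
assert same_site("http://fr.fifa.com/x", "http://www.fifa.com/y")
# Still correctly treated as different sites:
assert not same_site("http://www.marcmaison.fr/", "http://www.marcmaison.com/")
```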
The task of document alignment is to find the URL pair (that we know from Linguee) in the full crawl of the web site.
crawl-single-domain-linguee-matches.perl LANGUAGE
check-linguee-matches-in-site-downloads.perl LANGUAGE > result-check-linguee-matches-in-site-downloads-LANGUAGE
- Stage 1 - Starting point: 1761 URL pairs (1375 unique, 448 unique web domains)
- Stage 2 - Different web domains for source and target removed: 1591 URL pairs (1245 unique, 395 unique web domains)
- Stage 3 - Big web domains removed: 881 URL pairs (741 unique, 389 unique web domains)
- Stage 4 - Crawling completed: 872 URL pairs (temporary loss; see grep -v ^CH result-check-linguee-matches-in-site-downloads-LANGUAGE | wc)
- Both URLs found in domain: 490 URL pairs
run-bitextor-on-all-domains.perl LANGUAGE
- Bitextor finished document alignment: 446 URL pairs --- temporary loss
Task definition:
- Given the site crawls in /home/pkoehn/statmt/project/crawl/data/site-crawls (Valhalla)
- Align the web pages for each site
- Answer key: grep -v ^C /home/pkoehn/statmt/project/crawl/data/result-check-linguee-matches-in-site-downloads-LANGUAGE
Bitextor performance:
evaluate-bitextor-document-align.perl
- Correct: 192 (43%)
- Wrongly aligned: 52 (12%)
- Not aligned: 202 (45%)
Bitextor misses obvious URL patterns:
- www.fasska.com/fr/Home004f.html - www.fasska.com/en/Home004f.html
- bugadacargnel.com/fr/pages/artistes73ef.html - bugadacargnel.com/en/pages/artistes73ef.html
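Pairs like these could be caught before Bitextor runs by substituting language markers in the URL; a minimal sketch (the marker list is an illustrative assumption):

```python
import re

def english_candidates(fr_url: str):
    """Generate candidate English URLs by swapping common French
    language markers in path or query, e.g. /fr/ -> /en/."""
    substitutions = [(r"/fr/", "/en/"), (r"lang=fr", "lang=en"), (r"_fr\.", "_en.")]
    candidates = []
    for pattern, replacement in substitutions:
        candidate = re.sub(pattern, replacement, fr_url)
        if candidate != fr_url:
            candidates.append(candidate)
    return candidates

# The missed pair above falls out directly:
assert english_candidates("www.fasska.com/fr/Home004f.html") == \
    ["www.fasska.com/en/Home004f.html"]
```

Any candidate that actually exists in the English side of the crawl could then be accepted as an alignment before falling back to content-based matching.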
The 'not aligned' cases may be due to boilerplate removal, which sometimes strips everything from a page; de-duplication then collapses the resulting empty pages.
Only documents that are detected to be authored in different languages may be aligned.
Setup: extract text from both sites (the only preprocessing is Unicode normalization/sanitization), classify text spans, remove spans that are not en/fr, and take the most common language as the 'document language'. If both pages are classified as the same language, that is a loss; otherwise a win.
- One document classified as EN the other as FR: 474 (96.7%)
- Both in the same language: 16 (3.3%)
Note: Bitextor might be better or worse due to tika/boilerpipe and document level language classification as opposed to classifying spans. We can get that information from the .lett files.
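The span-level setup can be sketched as follows, assuming a span classifier `classify_span` as a hypothetical stand-in for the real model:

```python
from collections import Counter

TARGET_LANGS = {"en", "fr"}

def document_language(spans, classify_span):
    """Classify each text span, drop spans outside en/fr, and take the
    most common remaining label as the document language."""
    labels = [classify_span(s) for s in spans]
    labels = [lab for lab in labels if lab in TARGET_LANGS]
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]

# Toy word-based classifier, for illustration only.
def toy_classify(span):
    return "fr" if "le" in span.split() else "en"

spans = ["le chat dort", "the cat sleeps", "le chien court"]
assert document_language(spans, toy_classify) == "fr"
```

A pair is then a "win" whenever the two documents' majority labels differ.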
Using exactly the same URL stripping as in the "Dirt Cheap" paper, we match 162 (33%) pairs; after also removing "//" from paths, 174 (35.5%). The latter correctly matches cases such as bla.com/index and bla.com/fr/index.
Attention: this matching uses the original URLs, which are kept in a comment at the end of the downloaded HTML files.
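The stripping-based matching can be sketched like this; the real marker set in the "Dirt Cheap" paper is richer, so treat the pattern below as an illustrative assumption:

```python
import re

def strip_url(url: str) -> str:
    """Remove language markers from a URL path and collapse the '//'
    left behind, so bla.com/index and bla.com/fr/index map to the same
    key. (A scheme like 'http://' would need to be excluded first.)"""
    u = url.lower()
    u = re.sub(r"(?<=[/._=-])(french|english|fr|en)(?=[/._=-]|$)", "", u)
    u = re.sub(r"/{2,}", "/", u)  # collapse '//' left behind in paths
    return u

# The case mentioned above now matches:
assert strip_url("bla.com/fr/index") == strip_url("bla.com/index")
```

Documents whose stripped URLs collide are then taken as an aligned pair.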
Given document pairs from Linguee, can we extract the same sentence pair?
The starting point is the set of URLs for which we crawled valid HTML pages from the web. On these we run the Bitextor sentence alignment pipeline.
/home/pkoehn/statmt/project/crawl/linguee/evaluate-bitextor-sentence-aligment.perl
Evaluation gives full credit for cases where sentence fragments are partial matches, e.g.,
- Linguee: The big man has a funny nose.
- Bitextor: The big man has a funny nose. Really.
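The partial-match credit can be implemented as a substring test after light normalization; a sketch (the normalization itself is an assumption):

```python
def fragment_matches(linguee_fragment: str, bitextor_sentence: str) -> bool:
    """Give full credit when the Linguee fragment occurs inside the
    Bitextor-aligned sentence, after whitespace and case normalization."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return norm(linguee_fragment) in norm(bitextor_sentence)

# The example above counts as correct:
assert fragment_matches("The big man has a funny nose.",
                        "The big man has a funny nose. Really.")
```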
Bitextor performance:
- Correctly aligned: 136/212 (64%)
Pipeline:
1. Collect all files from a crawl directory and extract those that are HTML files. Similar to Bitextor's webdir2ett:
find mammusique.com -type f -exec file -N --mime-type --mime-encoding {} + | /bin/grep -E "(text|html|xml)" > mammusique.com.files
2. Determine the main language of each document by parsing the HTML, applying UTF-8 normalization/sanitization, extracting spans in different languages, and picking the most common one if it is in the list of languages we are looking for.
python /home/buck/net/build/mtma_bitext/baseline/checklang.py -annotate mammusique.com.files mammusique.com.languages
This produces a file of the format `filename<TAB>lang`. This file is used to determine the two sides of the bipartite matching graph.
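Reading that file into the two sides of the matching graph is straightforward; a sketch (the function name is an assumption, the field order is as stated above):

```python
def read_sides(lines, src_lang="en", tgt_lang="fr"):
    """Split a filename<TAB>lang listing into the two sides of the
    bipartite document-matching graph; other languages are dropped."""
    english, french = [], []
    for line in lines:
        filename, lang = line.rstrip("\n").split("\t")
        if lang == src_lang:
            english.append(filename)
        elif lang == tgt_lang:
            french.append(filename)
    return english, french

en_side, fr_side = read_sides(["a.html\ten", "b.html\tfr", "c.html\tde"])
assert en_side == ["a.html"] and fr_side == ["b.html"]
```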
3. Extract the English text from each English and each French file using the pipeline from 2), and from each French file the French text that is to be translated. This step already performs sentence splitting and, for English, normalization and tokenization using the Moses scripts. Since this is more efficient to do later for the French part, tokenization and normalization are deactivated during French text extraction.
cat mammusique.com.files | python ~/net/build/mtma_bitext/baseline/extract_foreign_text.py -o mammusique.com.keep -prefix=/fs/gna0/buck/cc/linguee/site-crawls/mammusique.com/ -lang en
cat mammusique.com.files | python /home/buck/net/build/mtma_bitext/baseline/extract_foreign_text.py -o mammusique.com.translate -prefix=/fs/gna0/buck/cc/linguee/site-crawls/mammusique.com/ -lang fr -tokenizer="" -normalizer=""
The idea here is that even the French pages will contain some English that we want to use in matching. It is most likely boilerplate, but it may help when comparing pages from very different parts of a website.
The file format is:
`filename<TAB>sentence`
4. Copy the DOMAIN.translate file to the CLSP cluster and translate with Moses:
cd /home/cbuck/b07/en-fr
cut -f 2 mammusique.com.translate | /home/pkoehn/moses/scripts/tokenizer/normalize-punctuation.perl fr | /home/pkoehn/moses/scripts/tokenizer/tokenizer.perl -l fr | /home/pkoehn/moses/scripts/recaser/truecase.perl --model /home/pkoehn/experiment/crawltest-fr-en/truecaser/truecase-model.3.fr | /home/pkoehn/moses/bin/moses.2015-03-23 -f moses.tuned.ini.7 -threads 30 | /home/pkoehn/moses/scripts/recaser/detruecase.perl > mammusique.com.translated
Note that we don't detokenize.
5. Copy the file DOMAIN.translated back to Edinburgh, add the first column (the filenames) again, and extract n-grams:
paste <(cut -f 1 mammusique.com.translate) mammusique.com.translated | python /home/buck/net/build/DataCollection/baseline/ngrams.py -n 4 > mammusique.com.tngrams
These are the 'translated n-grams', i.e. those generated by translation. This example uses 4-grams, but we should use bigrams as in the Google paper.
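A sketch of the per-sentence n-gram extraction that ngrams.py presumably performs (the real script's counting and output details are assumptions):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the big man has a funny nose".split()
assert ngrams(tokens, 2)[:2] == [("the", "big"), ("big", "man")]
assert len(ngrams(tokens, 4)) == 4  # 7 tokens yield 7 - 4 + 1 = 4 four-grams
```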
6. Extract n-grams for the English segments as well (make sure the data is tokenized in the same way as the translated data):
cat mammusique.com.keep | python /home/buck/net/build/DataCollection/baseline/ngrams.py -n 4 | sort > mammusique.com.engrams
7. Compute the idf-weighted cosine distance between all source and target documents (we skip the 5-gram based matching step for now):
python /home/buck/net/build/DataCollection/baseline/score_ngrams.py mammusique.com.engrams mammusique.com.tngrams mammusique.com.languages -outfile mammusique.com.matches
The output format is:
`source_file<TAB>target_file<TAB>cosine_distance`
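The scoring step can be sketched as idf-weighted cosine similarity over n-gram counts; this is a simplified stand-in for score_ngrams.py, not its actual implementation:

```python
import math
from collections import Counter

def idf_weights(docs):
    """Compute idf over a collection of documents, each a Counter of n-grams."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return {g: math.log(n_docs / df[g]) for g in df}

def cosine(a, b, idf):
    """Idf-weighted cosine similarity between two n-gram Counters."""
    def norm(v):
        return math.sqrt(sum((v[g] * idf.get(g, 0.0)) ** 2 for g in v))
    dot = sum(a[g] * b[g] * idf.get(g, 0.0) ** 2 for g in a if g in b)
    if norm(a) == 0 or norm(b) == 0:
        return 0.0
    return dot / (norm(a) * norm(b))

english = Counter({("big", "man"): 1, ("funny", "nose"): 1})
translated = Counter({("big", "man"): 1, ("funny", "nose"): 1})
unrelated = Counter({("terms", "of"): 3})
idf = idf_weights([english, translated, unrelated])
assert cosine(english, translated, idf) > cosine(english, unrelated, idf)
```

Each English document is then paired with the translated-French document that maximizes this score.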