GitHub - accurat-toolkit/LEXACC: Fast parallel sentence mining from comparable corpora


--param seg=true => the texts are already sentence segmented and tokenized (default false)
--param maxrep=<integer> => <integer>:<integer> alignments are allowed (that is, with e.g. maxrep=100, one source sentence may be aligned to the first 100 candidates, and vice versa) (default 1)
--param kif=true => keep the intermediary files (default false)
--param t=<float> => the output threshold (default is 0.2)
--param filter=false => do not execute the pre-filtering step after searching for candidates (default true)
--input <file> => (first occurrence) the document list file for the source collection; if --docalign is specified, this argument MUST NOT be given
--input <file> => (second occurrence) the document list file for the target collection; if --docalign is specified, this argument MUST NOT be given
--docalign <file> => the document alignment file; format: source document <TAB> target document <TAB> score <NEWLINE>; if this is given then --input MUST NOT be given
--source <lang> => en, ro, lt, lv, ... the source language
--target <lang> => en, ro, lt, lv, ... the target language
--output <file> => the name of the file to output the results to
--test <file> => the program output to test, i.e. a file previously written via --output
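
The --docalign format described above (source document, TAB, target document, TAB, score, one pair per line) can be produced and checked with a few lines of Python. This is only a sketch and not part of LEXACC: the helper names write_docalign and read_docalign are hypothetical, and UTF-8 encoding with "\n" line endings is an assumption, since the README does not state either.

```python
def write_docalign(path, pairs):
    """Write (source_doc, target_doc, score) triples, one TAB-separated pair per line.

    Assumption: UTF-8 text with "\n" line endings; the README does not specify these.
    """
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for source_doc, target_doc, score in pairs:
            # A TAB inside a path would break the three-column layout.
            if "\t" in source_doc or "\t" in target_doc:
                raise ValueError("document paths must not contain TAB characters")
            handle.write(f"{source_doc}\t{target_doc}\t{score}\n")


def read_docalign(path):
    """Read a document alignment file back into (source_doc, target_doc, score) triples."""
    triples = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.rstrip("\r\n")
            if not line:
                continue  # tolerate blank lines
            fields = line.split("\t")
            if len(fields) != 3:
                raise ValueError(
                    f"line {line_number}: expected 3 TAB-separated fields, got {len(fields)}"
                )
            source_doc, target_doc, score = fields
            triples.append((source_doc, target_doc, float(score)))
    return triples
```

Reading the file back before passing it to --docalign is a cheap way to catch a missing column or a non-numeric score early.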

Example

lexacc.exe 
	--input en_de_enList.txt
	--input en_de_deList.txt
	--source en
	--target de
	--output results_en_de.txt
	--param seg=true
	--param kif=false
	--param t=0.1
	--param maxrep=3

or

lexacc.exe 
	--docalign en-de-docs-align.txt
	--source en
	--target de
	--output results_en_de.txt
	--param kif=false
	--param t=0.1
	--param maxrep=2

or

lexacc.exe 
	--source en --target de \
	--test results_en_de.txt \
	en_de_en-100-parallel.txt en_de_de-100-parallel.txt
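
When LEXACC is driven from a script, the rule that --input and --docalign are mutually exclusive is easy to violate by accident. The sketch below assembles an argument list from the options documented above and enforces that rule. The function build_lexacc_args and its parameter names are hypothetical; only the command-line flags themselves come from this README, and the executable name defaults to the one used in the examples.

```python
def build_lexacc_args(source, target, output, inputs=None, docalign=None,
                      params=None, exe="lexacc.exe"):
    """Build a LEXACC argument list (hypothetical helper, not shipped with LEXACC).

    inputs   -- [source_list_file, target_list_file], passed as two --input options
    docalign -- document alignment file, passed as --docalign
    params   -- dict of --param settings, e.g. {"t": 0.1, "maxrep": 3}
    """
    # Exactly one of the two input modes must be chosen.
    if (inputs is None) == (docalign is None):
        raise ValueError("give either two --input list files or one --docalign file")

    args = [exe]
    if docalign is not None:
        args += ["--docalign", docalign]
    else:
        if len(inputs) != 2:
            raise ValueError("--input needs a source list file and a target list file")
        for list_file in inputs:
            args += ["--input", list_file]

    args += ["--source", source, "--target", target, "--output", output]

    for name, value in (params or {}).items():
        # The README writes booleans in lowercase (seg=true, kif=false).
        if isinstance(value, bool):
            value = "true" if value else "false"
        args += ["--param", f"{name}={value}"]
    return args
```

The resulting list can be handed to Python's subprocess.run(args, check=True), which avoids shell quoting and line-continuation issues altogether.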