abumatran/spidextor

Spidextor

Spidextor is a simple glue tool for running Bitextor (https://sourceforge.net/projects/bitextor/), a bitext extraction tool, on the output of SpiderLing (http://corpus.tools/wiki/SpiderLing), a crawler focused on text.

Dependencies

The two dependencies are Python (version 2.7 or earlier) and make.

Running Spidextor

You should first edit the config.py file and define your own parameters. Each parameter is followed by an explanatory comment.

Spidextor is run via the spidextor.py script. To see the help, run ./spidextor.py -h.

The input format for Spidextor is the "prevert" format from SpiderLing (the result of physical deduplication obtained with SpiderLing's util/remove_duplicates.py). It can be given either as an argument or fed to STDIN.
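
The "argument or STDIN" convention can be sketched in Python as follows. This only illustrates the pattern and is not the code from spidextor.py: the function name open_input is hypothetical, and the UTF-8 encoding is an assumption.

```python
import io
import sys


def open_input(argv):
    """Return a text stream for the input (hypothetical helper).

    If a file name is given as the first argument, open that file;
    otherwise fall back to STDIN.
    """
    if len(argv) > 1:
        # Assumption: the prevert file is UTF-8 encoded.
        return io.open(argv[1], encoding="utf-8")
    return sys.stdin
```

A caller would typically use it as open_input(sys.argv) and iterate over the returned stream line by line.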

What Spidextor does is the following:

  • It organises the data by domain, encodes it in the Bitextor format (.lett.gz files), and keeps only those domains in which data in both languages was found
  • It generates a Makefile and runs it with the predefined number of jobs
  • Each job runs the Bitextor pipeline, domain by domain (producing .lettr, .idx, .ridx1 and .ridx2, .ridx1df and .ridx2df, .align, .aseg files)
  • The results can be found in the output directory, organised by domain in the predefined formats
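
The first two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not Spidextor's actual implementation: the documents are assumed to be already parsed into (url, lang, text) tuples, all function names are hypothetical, and the Makefile recipe is a placeholder echo rather than the real Bitextor commands.

```python
from collections import defaultdict

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def group_by_domain(docs):
    """Map host name -> language -> list of (url, text) pairs.

    Assumption: docs is an iterable of (url, lang, text) tuples.
    """
    domains = defaultdict(lambda: defaultdict(list))
    for url, lang, text in docs:
        host = urlparse(url).hostname
        if host is None:
            continue  # skip URLs without a recognisable host
        domains[host][lang].append((url, text))
    return domains


def bilingual_domains(domains, lang1, lang2):
    """Return the sorted domains that contain data in both languages."""
    return sorted(
        host for host, by_lang in domains.items()
        if lang1 in by_lang and lang2 in by_lang
    )


def makefile_text(domain_names):
    """Build a Makefile with one phony target per domain.

    The recipe is a placeholder; a real one would call the pipeline.
    """
    names = " ".join(domain_names)
    lines = [".PHONY: all " + names, "all: " + names, ""]
    for name in domain_names:
        lines += [name + ":", "\t@echo processing " + name, ""]
    return "\n".join(lines)
```

Because every domain is an independent target, a file written from makefile_text can be run in parallel with make's -j option, which matches the "predefined number of jobs" described above.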
