Filtering and Language-identification for URL Crawling Seeds (FLUCS) a.k.a. FLUX-Toolchain


FLUX has already been used in production (see publications).

The scripts have been tested on UNIX (Debian flavors); they should work on other UNIX-like systems provided the required modules are installed (see installation).

Copyright (C) Adrien Barbaresi, 2012-2015.


Recommendations for Debian/Ubuntu systems (probably useful for other Linux distributions):

  • Make sure you have the following packages installed (Perl modules): libhtml-clean-perl libhtml-strip-perl libtime-piece-perl libtry-tiny-perl libdevel-size-perl

  • A few scripts can use either the default library (LWP, possibly slower) or Furl, a faster alternative. This Perl module is not installed by default (install Furl from CPAN). The scripts detect which module is available.

  • Perl and Python versions: FLUX should work with Perl 5.10 but will work better with 5.14 or 5.16 (mainly because of Unicode support). The scripts were written with Python 2.6 and 2.7 in mind. As is, they won't work with Python 3.

The language-identification scripts are to be used with the language identification system.

Using FLUX

Server configuration

The server can be started as follows:

python -s
python -s --host=localhost &> langid-log &	# as a background process on localhost

Check a list of URLs for redirections

Sends an HTTP HEAD request to see where each link leads.

perl --timeout 10 --all FILE
perl -h			# display all the options

Prints a report on STDOUT and creates X files.
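The redirection check boils down to issuing a HEAD request and reading the final URL. The actual script is written in Perl; the following is an illustrative Python sketch of the same technique, and `HeadRequest` and `resolve_redirect` are hypothetical names, not part of FLUX:

```python
import urllib.request

class HeadRequest(urllib.request.Request):
    """Request subclass that issues HEAD instead of the default GET."""
    def get_method(self):
        return "HEAD"

def resolve_redirect(url, timeout=10):
    """Follow redirections with a HEAD request and return the final URL.

    urlopen's default handlers transparently follow 3xx responses,
    so geturl() yields the address the link ultimately points to.
    """
    req = HeadRequest(url)
    with urllib.request.urlopen(req, timeout=timeout) as response:
        return response.geturl()
```

A HEAD request avoids downloading the page body, which matters when checking large URL lists.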

Clean the list of URLs

Removes non-HTTP protocols, images, PDFs, audio and video files, ad banners, feeds and unwanted hostnames such as google.something.

python -h				# for help

It is also possible to use a blacklist of URLs or domain names as input; such a list can be retrieved using the provided script. The script focuses on a particular subset of spam categories. For licensing issues, please refer to the original license conditions.
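The filtering step described above can be sketched as a simple predicate over parsed URLs. The patterns below are illustrative stand-ins, not the actual filter lists used by the script:

```python
import re
from urllib.parse import urlparse

# Hypothetical filter patterns; the real script uses more extensive lists.
MEDIA_SUFFIXES = re.compile(r"\.(?:jpe?g|png|gif|pdf|mp3|mp4|avi|flv)$", re.I)
UNWANTED_HOSTS = re.compile(r"(?:^|\.)google\.[a-z.]+$", re.I)

def keep_url(url):
    """Return True if the URL passes the basic filters."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # discard non-HTTP protocols
    if MEDIA_SUFFIXES.search(parsed.path):
        return False  # discard images, PDFs, audio and video files
    if UNWANTED_HOSTS.search(parsed.netloc):
        return False  # discard unwanted hostnames
    return True
```

Filtering on the parsed components (scheme, hostname, path) rather than the raw string avoids false matches inside query parameters.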

Fetch the pages, clean them and send them as a PUT request to the server

This Perl script fetches the web pages of a list, strips the HTML code, sends the raw text to a server instance and retrieves the answer. Usage: the script takes the number of links to analyze as an argument. Example (provided there is a list named LINKS_TODO):

perl 200
perl -h		# display all the options

Prints a report on STDOUT and creates X files.

Sampling approach (option --hostreduce): if several URLs appear to share the same hostname, only one of them is picked at random.
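The hostreduce sampling amounts to grouping URLs by hostname and drawing one per group. A minimal sketch in Python (`hostreduce` is an illustrative name; the option belongs to the Perl script):

```python
import random
from collections import defaultdict
from urllib.parse import urlparse

def hostreduce(urls, seed=None):
    """Keep a single URL per hostname, chosen at random."""
    rng = random.Random(seed)
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlparse(url).netloc].append(url)  # group by hostname
    return [rng.choice(candidates) for candidates in by_host.values()]
```

This keeps the sample from being dominated by a few large hosts while still covering every hostname in the list.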


Parallel threads are implemented: the bash script starts several instances of the scripts, then merges and saves the results.

Syntax: filename + number of links to check + number of threads (+ source if needed)

Resolve redirections:

 bash FILE 100000 10 &> rr.log &

Fetch and send the pages to lang-id:

  • Expects the langid-server to run on port 9008.

  • Expects the python script (in order to avoid crawler traps).

  • Results already collected can be skipped (not required)

    (bash FILE 100000 8 SOURCE1 &> fs.log &) # as a detached background process; "SOURCE" is a word or a code, so that the results can be linked to it
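The wrapper's split-and-merge logic can be sketched as follows. The actual wrapper is a bash script; this Python fragment only illustrates the round-robin splitting of the input and the merging of per-thread results, with hypothetical function names:

```python
from itertools import chain

def split_for_threads(lines, n_threads):
    """Distribute input lines round-robin over n worker chunks."""
    return [lines[i::n_threads] for i in range(n_threads)]

def merge_results(chunks):
    """Merge per-thread result lists back into a single list."""
    return list(chain.from_iterable(chunks))
```

Round-robin splitting keeps chunk sizes balanced even when the input length is not a multiple of the thread count.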

Get statistics and interesting links

The list written by the Perl script can be examined using a Python script, which prints a summary of the languages found (language code, number of links and percentage). It also allows gathering a selection of links by choosing relevant language codes.

Usage: [options]

Getting the statistics of a list named RESULTS_langid:

python --input-file=RESULTS_langid

Getting the statistics as well as a prompt of the languages to select and store them in a file:

python -l --input-file=... --output-file=...

Wiki-friendly output: -w option.
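The summary step can be sketched as counting language codes and computing their shares, then filtering the list by the selected codes. The functions below are illustrative, not the script's actual interface:

```python
from collections import Counter

def language_stats(results):
    """results: iterable of (url, language_code) pairs.

    Returns {code: (count, percentage)}, ordered by frequency.
    """
    counts = Counter(code for _, code in results)
    total = sum(counts.values())
    return {code: (n, 100.0 * n / total) for code, n in counts.most_common()}

def select_links(results, wanted_codes):
    """Gather the URLs whose detected language is in wanted_codes."""
    return [url for url, code in results if code in wanted_codes]
```

Keeping the (url, code) pairs together makes it straightforward to go from the statistics back to the links in the relevant languages.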

The script shows how to extract and group specific information from the URL directory.

Extraction of words from the Wiktionary

The script allows for the extraction of discourse and temporal markers in multiple languages from the Wiktionary. This feature is still experimental, but it can be used by FLUX to get more targeted information about the content.


Related Projects

For upstream applications: