THE BATCH MODE: LARGE WIKI ANALYSIS MADE POSSIBLE
=================================================
In general, to use WikiTrust on a wiki dump you have two choices.
One is to install the wiki as usual, install the WikiTrust extension,
use mwdumper to load the wiki dump into the database, and use
eval_online_wiki as explained in the top-level README file. The
advantage of this approach is that it is simple. On the other hand,
if you have many revisions (say, more than 100,000) in the wiki, this
approach causes quite a bit of database traffic and is, overall, not
very efficient. You should expect a speed of 20 to 60 revisions per
second, depending on the disks available to the database.
The other approach consists of analyzing the dump file first, and then
loading the results into the database once the analysis is complete.
This second approach is the one covered in this README file.
PREREQUISITES:
=============
See the main README file for how to build WikiTrust.
You need to perform step 2) of the installation procedures.
You also need to have a wiki dump file.
We recommend running all these commands under "screen", so that the
commands do not terminate if you get disconnected from the host -- for
large wikis they can take quite a long time.
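For instance, a typical session (the session name "wikitrust" is just an
illustrative choice) looks like:
$ screen -S wikitrust     # start a named screen session and work inside it
$ screen -r wikitrust     # later, reattach to the session after a disconnection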
PROCESSING THE DUMP
===================
Processing the dump used to be a complex process, composed of several
steps. To facilitate the processing, we have written a wrapper script,
util/batch_process.py, which takes care of performing all the steps
optimally, using the available multi-processing capabilities of your
machine. All you need to do is:
$ cd util
$ ./batch_process.py --cmd_dir <path>/WikiTrust/analysis --dir <process_dir> <dump_file_name.xml.7z>
Where:
<path> is the path to your WikiTrust installation, and
<process_dir> is the name of a working directory used to store the
intermediate files. This directory needs to be sufficiently large; as of
September 2009, the processing of the Italian Wikipedia uses about 250 GB
of space.
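For instance, a concrete invocation might look as follows (the WikiTrust
path, the working directory, and the dump file name are purely
illustrative):
$ ./batch_process.py --cmd_dir /home/user/WikiTrust/analysis \
    --dir /data/wiki_process itwiki-pages-meta-history.xml.7z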
Notes:
The command batch_process.py has many options, which also allow you to do
the processing in a step-by-step fashion; do
$ ./batch_process.py --help
for more information. In particular, batch_process performs the
following phases:
--do_split: splits the input file into chunks
--do_compute_stats: computes the statistics files
--do_sort_stats: sorts the statistics files
--do_compute_rep: computes the user reputations
--do_compute_trust: computes text trust
These phases need to be performed in the above order.
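If you prefer to run the phases one at a time (for instance, to resume
after a failure), a plausible sequence is the following; we are assuming
here that each phase flag is combined with the same --cmd_dir, --dir, and
dump file arguments shown above (check ./batch_process.py --help for the
exact syntax):
$ ./batch_process.py --do_split --cmd_dir <path>/WikiTrust/analysis --dir <process_dir> <dump_file_name.xml.7z>
$ ./batch_process.py --do_compute_stats --cmd_dir <path>/WikiTrust/analysis --dir <process_dir> <dump_file_name.xml.7z>
$ ./batch_process.py --do_sort_stats --cmd_dir <path>/WikiTrust/analysis --dir <process_dir> <dump_file_name.xml.7z>
$ ./batch_process.py --do_compute_rep --cmd_dir <path>/WikiTrust/analysis --dir <process_dir> <dump_file_name.xml.7z>
$ ./batch_process.py --do_compute_trust --cmd_dir <path>/WikiTrust/analysis --dir <process_dir> <dump_file_name.xml.7z>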
Processing of the Italian Wikipedia as of September 2009 takes about 1
day on an 8-CPU-core machine.
For the English Wikipedia, it is advisable to use a cluster; for
cluster processing, you can adapt do_all_it_stats.py and
do_all_it_revs.py to your needs.
Load the data into the database:
--------------------------------
First, we need to load the Wikipedia data.
* For a single file, do:
$ cd ../test-scripts
$ cp db_access_data.ini.sample db_access_data.ini
Edit db_access_data.ini to reflect the database information for the
wiki you are using. Then, load the revisions into the wiki:
$ python load_data.py <wiki-xml-file>
where <wiki-xml-file> is the uncompressed Wikipedia dump.
If the wiki is not empty, you can use the --clear_db option to instruct
load_data.py to erase any previous data in the wiki (see also the example
after this list).
The above command uses mwdumper; see http://www.mediawiki.org/wiki/MWDumper
* For loading many files, see the command util/load_all_files.sh
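As an example, a single-file load that first clears the wiki might look as
follows (the dump file name is illustrative, and we are assuming that
--clear_db is a plain flag that takes no value):
$ python load_data.py --clear_db itwiki-pages-meta-history.xml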
Once that has been done, you need to load the MySQL code generated by
the batch analysis. You can do that via:
$ ./load_db.sh <process_dir>/sql <wiki_user> <wiki_db> <wiki_password>
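For instance, with an illustrative working directory and illustrative
database credentials, the invocation could look like:
$ ./load_db.sh /data/wiki_process/sql wikiuser wikidb wikipass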
Next, we have to load the user reputations. Do it with the following
command:
$ cd ../analysis
$ cat <process_dir>/user_reputations.txt | \
./load_reputations -db_user <db_user> -db_pass <db_pass> -db_name <db_name>
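Again with illustrative values for the working directory and the database
credentials, this could look like:
$ cat /data/wiki_process/user_reputations.txt | \
  ./load_reputations -db_user wikiuser -db_pass wikipass -db_name wikidb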
Finally, you need to tell the WikiTrust.php MediaWiki extension where
the annotated revisions are. They are stored in the filesystem,
rather than in the database, to improve performance. You need to set
the following variable in LocalSettings.php in the MediaWiki installation:
$wgWikiTrustBlobPath = "<process_dir>/buckets";
where <process_dir> is the directory you selected when running
util/batch_process.py.
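For example, if you ran the batch processing with --dir /data/wiki_process
(an illustrative path), the setting would read:
$wgWikiTrustBlobPath = "/data/wiki_process/buckets";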
...AND FINALLY...
=================
Check that the WikiTrust files are executable by Apache (see the
installation instructions), and ... fire it up! It should work.