Skip to content

adamwulf/wikipedia2text

Repository files navigation

# Sample command to convert wiki-markup to XML
php ../mediawiki-1.28.2/maintenance/parse.php  ../../data/wikipedia-articles/af/af/Wikipedia%3AToday%27s_second_feature%2FJanuary_1%2C_2006.txt > sample


# Generate 8 scripts that will convert all wiki files to xml files
java -jar sleep.jar into8.sl ../../data/wikipedia-xml/en.files &

# Launch all 8 scripts to generate XML from wiki-markup

./launch-all.sh


# Find XML files and concat paths into single file
find ../../data/wikipedia-articles -type f -name *.xml > ../../data/wikipedia-xml/en.xml.files

# Split list of XML files into 8 lists of XML files, so we can generate 1 corpus per list
java -jar sleep.jar into8xml.sl ../../data/wikipedia-xml/en.xml.files &


# Sample command to generate corpus file from XML files

./build-corpus.sh en.xml.files corpus-out/

# the build-corpus.sh can be stopped and restarted. to stop the running process, create a file called 'stop.command'
touch stop.command

#to allow the command to run again next time, just remove or rename that file
rm stop.command

# build-corpus.sh also outputs log and progress files next to its output. if the process is stopped, it'll output state files to its corpus output directory so that it can continue later.


# Run top and filter by process name (also, press 'c' to show full process path)
top -p $(pgrep -d',' process-name)

# If I need to restart generating the plain corpus from the xml, then I should also remove all the *.plain.txt from wikipedia-articles.
# the scripts to do this can be generated with:
java -jar sleep.jar removeplain8.sl removeplain8.sl ../../data/wikipedia-xml/en.files
# and then these scripts can be run directly from ../../wikipedia-xml/ directory

About

Tools to convert wikipedia to text

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published