doc2vec

This repository contains Python scripts to train doc2vec models using Gensim.

Details about the doc2vec algorithm can be found in the paper Distributed Representations of Sentences and Documents.

Create a DeWiki dataset

Doc2vec is an unsupervised learning algorithm, so a model can be trained on any set of documents. A document can be anything from a short 140-character tweet or a single paragraph such as an article abstract to a full news article or a book.
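
As a toy illustration of this (not the repository's pipeline), Gensim's Doc2Vec can be trained directly on any iterable of tagged documents; the two-document corpus below is made up for the example:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Any collection of tokenized texts works; each document needs a unique tag.
corpus = [
    TaggedDocument(words=['machine', 'learning', 'with', 'python'], tags=[0]),
    TaggedDocument(words=['deep', 'learning', 'for', 'text'], tags=[1]),
]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=20)

# Infer a vector for an unseen document.
vector = model.infer_vector(['learning', 'from', 'text'])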

For German, a good baseline is to train a model on the German Wikipedia.

Download the latest DeWiki dump:

wget http://download.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles.xml.bz2

Extract the content:

wget http://medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py
python WikiExtractor.py -c -b 25M -o extracted dewiki-latest-pages-articles.xml.bz2
find extracted -name '*bz2' \! -exec bzip2 -k -c -d {} \; > dewiki.xml
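
The find/bzip2 step simply decompresses every chunk and concatenates the results. If those tools are unavailable, a small Python script does the same (a sketch, assuming WikiExtractor wrote its *.bz2 chunks under extracted/):

import bz2
import glob

# Decompress every chunk WikiExtractor produced and concatenate them
# into a single XML file, mirroring the find/bzip2 pipeline above.
with open('dewiki.xml', 'wb') as out:
    for path in sorted(glob.glob('extracted/**/*.bz2', recursive=True)):
        with bz2.open(path, 'rb') as chunk:
            out.write(chunk.read())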

Clean up XML and extract the text:

cat dewiki.xml|tr '\n' ' '|sed 's/<\/doc>/<\/doc>\n/g' > dewiki1
grep -v 'title="Datei:' dewiki1 > dewiki2
grep -v 'title="Wikipedia:' dewiki2 > dewiki3
cat dewiki3|sed 's/<[^>]*>//g' > dewiki4
grep -v "^\s\sKategorie:" dewiki4 > dewiki.txt
rm dewiki1 dewiki2 dewiki3 dewiki4
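
The same cleanup can be expressed in Python. This is only an approximate equivalent of the pipeline above (it loads the whole dump into memory, while the shell version streams), writing one document per line to dewiki.txt:

import re

# Join everything into one string, split it back into one document per
# line at each closing </doc> tag, drop file and meta pages, strip the
# remaining XML tags, and skip category-only documents.
with open('dewiki.xml') as f:
    text = f.read().replace('\n', ' ')
with open('dewiki.txt', 'w') as out:
    for doc in text.split('</doc>'):
        if 'title="Datei:' in doc or 'title="Wikipedia:' in doc:
            continue
        doc = re.sub(r'<[^>]*>', '', doc).strip()
        if doc and not doc.startswith('Kategorie:'):
            out.write(doc + '\n')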

Train a model from the DeWiki dataset

Preprocess the training data:

python -u preprocess.py dewiki.txt dewiki-preprocessed.txt
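
preprocess.py ships with the repository; its exact behavior is not reproduced here, but a typical preprocessing step for doc2vec (an assumed sketch, not the script's actual code) lowercases and tokenizes one document per line:

import sys
from gensim.utils import simple_preprocess

# Hypothetical stand-in for preprocess.py: reads one document per line,
# lowercases and tokenizes it, and writes the tokens back space-separated.
infile, outfile = sys.argv[1], sys.argv[2]
with open(infile) as src, open(outfile, 'w') as dst:
    for line in src:
        tokens = simple_preprocess(line)
        if tokens:
            dst.write(' '.join(tokens) + '\n')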

Train the model:

python -u train.py --epochs 40 --vocab-min-count 10 data/stopwords_german.txt dewiki-preprocessed.txt /tmp/models/doc2vec-dewiki
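
train.py implements the actual training, including the stop-word handling from the command above. As a minimal Gensim equivalent (a sketch: the vector size and worker count are assumptions, and stop-word filtering is omitted), the preprocessed file can be fed in with one document per line:

from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument

# TaggedLineDocument reads one document per line and tags each document
# with its line number.
docs = TaggedLineDocument('dewiki-preprocessed.txt')
model = Doc2Vec(docs, vector_size=300, min_count=10, epochs=40, workers=4)
model.save('/tmp/models/doc2vec-dewiki')

# Documents most similar to document 0 (Gensim 4.x API).
print(model.dv.most_similar(0))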
