This repository has been archived by the owner on Jan 31, 2023. It is now read-only.

Easy aozorabunko Word2Vec Builder - Word2Vec builder for all Aozora Bunko (青空文庫) books, plus a prebuilt model

eggplants/aovec

aovec

Requirements

How to use

Make *.model file

# Install from pypi
pip install aovec

# Clone aozorabunko/aozorabunko (>20GB)
aovec clone

# Parse html files and write results to novels/
aovec parse

# Make word2vec and write to aozora_model.model
aovec mkvec
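
The three steps above can also be driven from a script. The following is a minimal sketch and not part of aovec itself: `run_pipeline` is a hypothetical helper that calls the `aovec` subcommands shown above in order and stops at the first failure. It assumes `aovec` is installed and on `PATH`; with `dry_run=True` (the default here) nothing is executed and the commands are only returned.

```python
import subprocess

# Subcommands exactly as listed above, in the order they must run.
STEPS = [
    ["aovec", "clone"],  # clone aozorabunko/aozorabunko (>20GB)
    ["aovec", "parse"],  # parse html files and write results to novels/
    ["aovec", "mkvec"],  # make word2vec and write aozora_model.model
]


def run_pipeline(dry_run=True):
    """Handle each step in order and return the commands as strings.

    Hypothetical helper: with dry_run=True no command is executed.
    """
    handled = []
    for cmd in STEPS:
        if not dry_run:
            # check=True raises CalledProcessError when a step fails,
            # so later steps never run on incomplete data.
            subprocess.run(cmd, check=True)
        handled.append(" ".join(cmd))
    return handled


if __name__ == "__main__":
    for line in run_pipeline(dry_run=True):
        print(line)
```

Call `run_pipeline(dry_run=False)` to actually run the steps; the clone step alone downloads more than 20GB.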

Use the built model from Python (see the official documentation)

  • *.model
from gensim.models import Word2Vec

model = Word2Vec.load('aozora_model.model')
  • *.kv
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('aozora_model.kv')
  • *.kv.bin (fastest)
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('aozora_model.kv.bin',
                                          binary=True,
                                          unicode_errors='ignore')
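
To show what the `*.kv` file holds, here is a rough sketch of reading it without gensim. It assumes the file is in the plain-text word2vec format that `load_word2vec_format` reads: a header line with the vocabulary size and vector size, then one word and its vector per line. `read_word2vec_text` and `cosine` are illustrative names, not gensim APIs, and the sample data is made up; for real use, prefer the gensim calls above.

```python
import io
import math


def read_word2vec_text(stream):
    """Parse word2vec text format into a dict of word -> list of floats."""
    header = stream.readline().split()
    vocab_size, vector_size = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        parts = line.rstrip("\n ").split(" ")
        if len(parts) != vector_size + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header vocabulary size does not match file body")
    return vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-in for a real file (assumption: the real vocabulary is
# Japanese and the vectors are much longer than two dimensions).
SAMPLE = "3 2\ncat 1.0 0.0\ndog 0.8 0.6\nbook 0.0 1.0\n"

if __name__ == "__main__":
    vecs = read_word2vec_text(io.StringIO(SAMPLE))
    print(cosine(vecs["cat"], vecs["dog"]))
```

With a real file, pass `open('aozora_model.kv', encoding='utf-8')` instead of the `StringIO` sample. The binary `*.kv.bin` variant uses a different layout and is not handled by this sketch.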

(Optional) Set up mecab-ipadic-neologd on Ubuntu

Download and install

sudo apt install build-essential
git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd neologd && cd $_
sudo bin/install-mecab-ipadic-neologd -y
sudo mv /usr/lib/*/mecab/dic/mecab-ipadic-neologd /var/lib/mecab/dic

Update /etc/mecabrc

sudo cp /etc/mecabrc /etc/mecabrc.bak
sudo sed -i 's_^dicdir.*_; &\'$'\ndicdir = /var/lib/mecab/dic/mecab-ipadic-neologd_' /etc/mecabrc
--- /etc/mecabrc.bak
+++ /etc/mecabrc
@@ -3,7 +3,8 @@
 ;
 ; $Id: mecabrc.in,v 1.3 2006/05/29 15:36:08 taku-ku Exp $;
 ;
-dicdir = /var/lib/mecab/dic/debian
+; dicdir = /var/lib/mecab/dic/debian
+dicdir = /var/lib/mecab/dic/mecab-ipadic-neologd

 ; userdic = /home/foo/bar/user.dic
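
The `sed` one-liner is dense, so here is the same edit sketched in Python for readers who want to see what it does. `switch_dicdir` is a hypothetical helper that works on the file text only: it comments out every line starting with `dicdir` and adds the NEologd `dicdir` line right after it, as in the diff above. Writing the result back to `/etc/mecabrc` (with root privileges) is left out on purpose.

```python
NEOLOGD_DIR = "/var/lib/mecab/dic/mecab-ipadic-neologd"


def switch_dicdir(text, new_dir=NEOLOGD_DIR):
    """Return mecabrc text with the old dicdir commented out and new_dir set."""
    out = []
    for line in text.splitlines():
        if line.startswith("dicdir"):
            out.append("; " + line)            # keep the old setting as a comment
            out.append("dicdir = " + new_dir)  # point MeCab at the new dictionary
        else:
            out.append(line)
    # Preserve the trailing newline only if the input had one.
    tail = "\n" if text.endswith("\n") else ""
    return "\n".join(out) + tail


if __name__ == "__main__":
    before = (
        ";\n"
        "dicdir = /var/lib/mecab/dic/debian\n"
        "\n"
        "; userdic = /home/foo/bar/user.dic\n"
    )
    print(switch_dicdir(before), end="")
```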

Help

$ aovec -h
usage: aovec [-h] [-V] {clone,c,parse,p,mkvec,m} ...

Make Word2Vec from aozorabunko/aozorabunko

positional arguments:
  {clone,c,parse,p,mkvec,m}
    clone (c)           clone aozorabunko/aozorabunko (>20GB)
    parse (p)           parse html files and write to results
    mkvec (m)           make word2vec and write to *.model

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
$ aovec clone -h
usage: aovec clone [-h]

optional arguments:
  -h, --help  show this help message and exit
$ aovec parse -h
usage: aovec parse [-h] [-d DIR]

optional arguments:
  -h, --help            show this help message and exit
  -d DIR, --savedir DIR
                        directory name of saving results (default: novels)
$ aovec mkvec -h
usage: aovec mkvec [-h] [-d DIR] [-o NAME] [-e INT] [-v INT] [-m INT] [-w INT]
                   [-p INT] [-b] [--both]

optional arguments:
  -h, --help            show this help message and exit
  -d DIR, --parsedir DIR
                        directory name of saved parsing results (default:
                        novels)
  -o NAME, --model NAME
                        name of word2vec model (default: aozora_model)
  -e INT, --epochs INT  number of word2vec epochs (default: 5)
  -v INT, --vector_size INT
                        dimensionality of the word vectors (default: 1000)
  -m INT, --min_count INT
                        ignore words total frequency lower than this (default:
                        5)
  -w INT, --window INT  window size of words before and for learning (default:
                        5)
  -p INT, --workers INT
                        worker threads (default: 3)
  -b, --binary          save model files as one binary (default: False)
  --both                save model files as both row data and binary (default:
                        False)
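
When calling `aovec mkvec` from a script, the options in the help text above map directly onto an argument list. This is a minimal sketch that assumes only the flags and defaults shown in that help output; `mkvec_args` is an illustrative helper, not part of aovec.

```python
def mkvec_args(parsedir="novels", model="aozora_model", epochs=5,
               vector_size=1000, min_count=5, window=5, workers=3,
               binary=False, both=False):
    """Build the argv list for `aovec mkvec` from keyword options."""
    args = [
        "aovec", "mkvec",
        "--parsedir", parsedir,
        "--model", model,
        "--epochs", str(epochs),
        "--vector_size", str(vector_size),
        "--min_count", str(min_count),
        "--window", str(window),
        "--workers", str(workers),
    ]
    # The two output-format switches take no value.
    if binary:
        args.append("--binary")
    if both:
        args.append("--both")
    return args


if __name__ == "__main__":
    # Example: smaller vectors, saved as one binary file.
    print(" ".join(mkvec_args(vector_size=300, binary=True)))
```

The returned list can be passed to `subprocess.run(args, check=True)` once `aovec` is installed.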

License

MIT