# How to create the vector space models

Author: Stephan <stephan@bayesimpact.org>

Skip the run test because it would take too much time to train the models.

This notebook crashes when I run it in Docker, I suspect because it runs out of memory. It works when I execute it directly on my machine.

This notebook explains how to create the vector space models from Wikipedia used in the [job description to skill analysis](./job_description_to_skill.ipynb). We use the [Gensim](http://radimrehurek.com/gensim/) Python library to do all the heavy lifting.

First, download the latest French Wikipedia dump:

    ftp://wikipedia.c3sl.ufpr.br/wikipedia/frwiki/

The file to download is the one whose name ends with `-pages-articles.xml.bz2`.
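
To pick the right file out of a directory listing, a simple suffix filter is enough. This is only a minimal sketch: `pick_dump_files` is a hypothetical helper, and the file names in `listing` are made up for illustration rather than copied from the mirror.

```python
DUMP_SUFFIX = "-pages-articles.xml.bz2"


def pick_dump_files(filenames):
    """Keep only the names that end with the dump suffix mentioned above."""
    return [name for name in filenames if name.endswith(DUMP_SUFFIX)]


# Hypothetical directory listing, for illustration only.
listing = [
    "frwiki-latest-abstract.xml.gz",
    "frwiki-latest-pages-articles.xml.bz2",
    "frwiki-latest-pages-articles-multistream.xml.bz2",
    "frwiki-latest-pages-meta-current.xml.bz2",
]

candidates = pick_dump_files(listing)
print(candidates)
```

Matching on the full suffix rather than on `pages-articles` alone leaves out similarly named variants of the dump.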

Gensim comes with a handy script that operates on exactly this dump. You can even leave the dump bzip2-compressed to save disk space. To create a [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) representation of the text and its associated dictionary, run the following command (watch out, this takes **around 6 hours**):

    python -m gensim.scripts.make_wiki DUMPNAME OUTPUT_PREFIX
    
I used `frwiki` as the `OUTPUT_PREFIX` and stored both the dump and the script's output in `bob_emploi/data/wiki`. Other notebooks rely on this location and naming convention.
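
Since the script runs for hours, it can be worth checking that its output landed where the cells below expect it. The following is a minimal sketch under that naming convention: `expected_wiki_files` and `missing_wiki_files` are hypothetical helpers, and they only cover the two files this notebook actually loads (`<prefix>_wordids.txt.bz2` and `<prefix>_tfidf.mm`).

```python
from pathlib import Path


def expected_wiki_files(data_dir, prefix="frwiki"):
    """Return the make_wiki output files that this notebook loads."""
    data_dir = Path(data_dir)
    return {
        "dictionary": data_dir / f"{prefix}_wordids.txt.bz2",
        "tfidf_corpus": data_dir / f"{prefix}_tfidf.mm",
    }


def missing_wiki_files(data_dir, prefix="frwiki"):
    """List the names of the expected files that do not exist yet."""
    files = expected_wiki_files(data_dir, prefix)
    return [name for name, path in files.items() if not path.exists()]
```

For example, `missing_wiki_files('../../data/wiki')` returns an empty list once both files are in place.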

To train the LSA model, simply run the following cells. This took **about 4 hours** on my MacBook Pro.

In [1]:
from __future__ import division
import logging
import pickle
import pandas as pd
from gensim.models.lsimodel import LsiModel
from gensim.corpora import WikiCorpus, MmCorpus, Dictionary

dict_path = '../../data/wiki/frwiki_wordids.txt.bz2'
corpus_path = '../../data/wiki/frwiki_tfidf.mm'
lsi_model_path = '../../data/wiki/frwiki_lsi'
title2id_path = '../../data/wiki/title2id_mapping.pckl'
skills_wiki_path = '../../data/linkedin_skills_wiki.csv'
skills_corpus_path = '../../wiki/data/skills_corpus.json'

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
# Load the dictionary and the TF-IDF corpus produced by make_wiki.
dictionary = Dictionary.load_from_text(dict_path)
tfidf_corpus = MmCorpus(corpus_path)

In [3]:
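# Log at DEBUG level to follow the training progress in detail.
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

# Train a 400-topic LSI (LSA) model on the TF-IDF corpus and save it to disk.
lsi_model = LsiModel(corpus=tfidf_corpus, id2word=dictionary, num_topics=400)
lsi_model.save(lsi_model_path)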

INFO:gensim.models.lsimodel:using serial LSI version on this node
INFO:gensim.models.lsimodel:updating model with new documents
INFO:gensim.models.lsimodel:preparing a new chunk of documents
DEBUG:gensim.models.lsimodel:converting corpus to csc format
INFO:gensim.models.lsimodel:using 100 extra samples and 2 power iterations
INFO:gensim.models.lsimodel:1st phase: constructing (100000, 500) action matrix
INFO:gensim.models.lsimodel:orthonormalizing (100000, 500) action matrix
DEBUG:gensim.matutils:computing QR of (100000, 500) dense matrix
DEBUG:gensim.models.lsimodel:running 2 power iterations
DEBUG:gensim.matutils:computing QR of (100000, 500) dense matrix
DEBUG:gensim.matutils:computing QR of (100000, 500) dense matrix
INFO:gensim.models.lsimodel:2nd phase: running dense svd on (500, 20000) matrix
INFO:gensim.models.lsimodel:computing the final decomposition
INFO:gensim.models.lsimodel:keeping 400 factors (discarding 7.013% of energy spectrum)
INFO:gensim.models.lsimodel:processed do