Random Vector Accumulator [Google Summer of Code 2017]
`python WikiDetector.py enwiki-20170520-pages-articles.xml` modifies the dump by replacing surface forms with their corresponding entities and saves the result to the file 'updatedWiki'.
`python WikiExtractor.py -o output updatedWiki` strips the XML markup and saves the plain text files to the output/ directory.
`python WikiTrainer.py output` trains Word2Vec embeddings on the plain text articles.
`python RVA.py output` generates embeddings for the entities from the plain text articles in the output directory (uses the multiprocessing module).
`python RVA_single.py output` generates embeddings for the entities from the plain text articles in the output directory.
- WikiDetector.py (replaceAnchorText)
..1. For each article in the Wikipedia dump, a 'local' dictionary is generated with the entity as the key and a list of its surface forms as the value.
Dictionary[entity] = [surfaceForm1, surfaceForm2, ...]
Trailing parentheses are removed from the surface forms; each entity name is capitalized and its spaces are replaced by underscores.
..2. Within the same article, the occurrences of all surface forms held by the dictionary are replaced by their corresponding entity names. For example, if the article contains the following text:
"These are often described as [[stateless society|stateless societies]] and it is in a stateless society..."
..it is rewritten as shown below. While substituting the entity name, the string 'resource/' is prepended to it.
"These are often described as [[stateless society|resource/Stateless_society]] and it is in a resource/Stateless_society..."
- WikiDetector.py (replaceSurfaceForms)
..1. replaceSurfaceForms generates a 'global' dictionary using all the anchor text in the Wikipedia Dump file.
..2. A list of all the linked entities for a specific article is generated.
..3. All the surface forms in the 'global' dictionary are replaced by their corresponding entities. For example, the text:
"[[Barack Obama]] is the president of the... When Obama did..."
..is rewritten as shown below. The replacement is possible because, in some other article, resource/Barack_Obama has the anchor text 'Obama', and this pair was stored in the 'global' dictionary.
"resource/Barack_Obama is the president of the... When resource/Barack_Obama did..."
- WikiExtractor.py: clears the XML markup from the Wikipedia dump and retains clean text.
- WikiTrainer.py: trains Word2Vec embeddings on the plain text generated by WikiExtractor.py.
- RVA.py: generates RVA embeddings for entities from the plain text generated by WikiExtractor.py.
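The two WikiDetector.py passes described above (replaceAnchorText and replaceSurfaceForms) can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names, the link regex, the longest-form-first matching and the handling of links without a pipe are all assumptions made for this sketch, as is the reading that the second pass only replaces surface forms of entities linked from the article being processed. The second article in `dump` is made up so that 'Obama' appears as anchor text somewhere.

```python
import re

# Matches [[target]] and [[target|anchor text]] wiki links.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def to_entity(target):
    """Capitalise the first letter, use underscores, prepend 'resource/'."""
    target = target.strip()
    return "resource/" + (target[:1].upper() + target[1:]).replace(" ", "_")


def clean_surface_form(form):
    """Drop a trailing parenthetical, e.g. 'Mercury (planet)' -> 'Mercury'."""
    return re.sub(r"\s*\([^()]*\)$", "", form.strip())


def collect_surface_forms(text, dictionary=None):
    """Add entity -> [surface forms] pairs found in the links of `text`."""
    dictionary = {} if dictionary is None else dictionary
    for target, anchor in LINK_RE.findall(text):
        forms = dictionary.setdefault(to_entity(target), [])
        for form in (clean_surface_form(target), clean_surface_form(anchor)):
            if form and form not in forms:
                forms.append(form)
    return dictionary


def replace_plain(segment, dictionary, entities):
    """Replace surface forms of `entities` in link-free text, longest first."""
    form_to_entity = {f: e for e in entities for f in dictionary.get(e, [])}
    if not form_to_entity:
        return segment
    longest_first = sorted(form_to_entity, key=len, reverse=True)
    # The lookarounds keep already substituted 'resource/...' names intact.
    pattern = re.compile(
        r"(?<![\w/])(" + "|".join(map(re.escape, longest_first)) + r")(?!\w)")
    return pattern.sub(lambda m: form_to_entity[m.group(1)], segment)


def replace_anchor_text(article):
    """Pass 1: 'local' dictionary; anchors and plain mentions become entities."""
    local = collect_surface_forms(article)
    out, pos = [], 0
    for link in LINK_RE.finditer(article):
        out.append(replace_plain(article[pos:link.start()], local, local))
        # Assumption: the link target is kept and only the anchor is replaced.
        out.append("[[" + link.group(1) + "|" + to_entity(link.group(1)) + "]]")
        pos = link.end()
    out.append(replace_plain(article[pos:], local, local))
    return "".join(out)


def replace_surface_forms(article, global_dict):
    """Pass 2: 'global' dictionary, limited to entities linked in this article."""
    linked = {to_entity(target) for target, _ in LINK_RE.findall(article)}
    without_links = LINK_RE.sub(lambda m: to_entity(m.group(1)), article)
    return replace_plain(without_links, global_dict, linked)


article = ("These are often described as [[stateless society|stateless societies]] "
           "and it is in a stateless society...")
print(replace_anchor_text(article))

dump = ["[[Barack Obama]] is the president of the... When Obama did...",
        "In another article, [[Barack Obama|Obama]] is the anchor text."]
global_dict = {}
for text in dump:
    collect_surface_forms(text, global_dict)
print(replace_surface_forms(dump[0], global_dict))
```

Matching the longest surface form first and replacing in a single regex pass avoids rewriting text that has already been turned into an entity name; the real script may resolve such overlaps differently.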
Random Vector Accumulator
The RVA is an attempt at generating scalable word vectors for Wikipedia entities. Two sets of vectors are defined for each word: an index vector, which is a hyperdimensional sparse random vector, and a lexical memory vector, which is a weighted sum of the index vectors of the words occurring in the context of the entity associated with that lexical memory vector. For example, in the sentence
"resource/Church_of_England_parish_church of resource/Saint_Peter dates in part from the 12th century"
the lexical memory vector of 'resource/Saint_Peter' is the sum of the index vectors of the words appearing within a window, which is defined as the context of the word. If the window size is 2, then
lexical_memory_vector['resource/Saint_Peter'] = index_vector['of'] + index_vector['dates'].
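The window sum above can be reproduced with a few lines of plain Python. This is only a sketch under stated assumptions: index vectors are taken to be sparse ternary vectors (a handful of +1/-1 entries, stored as a position-to-sign dictionary), `NONZERO` is a made-up value, and a window of 2 is read as one token on each side of the entity, which is the reading that matches the example.

```python
import random

DIM = 500      # dimensionality; 500 is the value used in the timing run below
NONZERO = 10   # hypothetical number of non-zero entries per index vector
rng = random.Random(0)


def make_index_vector():
    """Sparse random index vector stored as {position: +1 or -1}."""
    positions = rng.sample(range(DIM), NONZERO)
    return {position: rng.choice((-1, 1)) for position in positions}


def add_into(memory, index, weight=1.0):
    """Accumulate a weighted sparse index vector into a dense memory vector."""
    for position, sign in index.items():
        memory[position] += weight * sign


def word_context_sum(tokens, target, window, index_vector):
    """Sum the index vectors of the tokens within `window` around `target`."""
    half = window // 2
    memory = [0.0] * DIM
    for i in range(max(0, target - half), min(len(tokens), target + half + 1)):
        if i != target:
            add_into(memory, index_vector[tokens[i]])
    return memory


tokens = ("resource/Church_of_England_parish_church of resource/Saint_Peter "
          "dates in part from the 12th century").split()
index_vector = {token: make_index_vector() for token in dict.fromkeys(tokens)}

lexical_memory_vector = {
    "resource/Saint_Peter": word_context_sum(
        tokens, tokens.index("resource/Saint_Peter"), 2, index_vector)
}
```

Storing only the non-zero positions keeps each update proportional to `NONZERO` rather than to `DIM`.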
The index vector of the title entity is also added to the lexical memory vector. In the above example, the line was taken from the article 'Burnham, Buckinghamshire', so the embedding is updated as
lexical_memory_vector['resource/Saint_Peter'] += weight*index_vector['Burnham, Buckinghamshire'].
Additionally, a larger context window is used to capture the semantic information of the entities appearing in context. If the entity context window is 4, for instance, then the embedding is updated as
lexical_memory_vector['resource/Saint_Peter'] += weight*index_vector['resource/Church_of_England_parish_church'], since 'resource/Church_of_England_parish_church' is the only other entity appearing in context with 'resource/Saint_Peter'.
As new words appear, their corresponding index vectors and embeddings are generated dynamically.
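Putting the three contributions together (word window, article title and entity window), a single-process accumulator might look like the sketch below. The class name, its parameters, the default weights and the choice to skip entities inside the word window are assumptions made for illustration; RVA.py may organise this differently. Index vectors are created the first time a token is seen, which mirrors the dynamic generation mentioned above.

```python
import random


class RandomVectorAccumulator:
    """Illustrative accumulator; names and defaults are hypothetical."""

    def __init__(self, dim=500, nonzero=10, word_window=2, entity_window=10,
                 title_weight=1.0, entity_weight=1.0, seed=0):
        self.dim, self.nonzero = dim, nonzero
        self.word_window, self.entity_window = word_window, entity_window
        self.title_weight, self.entity_weight = title_weight, entity_weight
        self.rng = random.Random(seed)
        self.index_vector = {}            # token -> {position: +1 or -1}
        self.lexical_memory_vector = {}   # entity -> dense list of floats

    def _index(self, token):
        """Return the token's index vector, creating it on first use."""
        if token not in self.index_vector:
            positions = self.rng.sample(range(self.dim), self.nonzero)
            self.index_vector[token] = {
                p: self.rng.choice((-1, 1)) for p in positions}
        return self.index_vector[token]

    def _accumulate(self, entity, token, weight):
        memory = self.lexical_memory_vector.setdefault(
            entity, [0.0] * self.dim)
        for position, sign in self._index(token).items():
            memory[position] += weight * sign

    @staticmethod
    def _neighbours(i, n, window):
        """Positions within window // 2 tokens of i, excluding i itself."""
        half = window // 2
        return [j for j in range(max(0, i - half), min(n, i + half + 1))
                if j != i]

    def update(self, tokens, title):
        """Update the memory vector of every entity in one tokenised line."""
        n = len(tokens)
        for i, token in enumerate(tokens):
            if not token.startswith("resource/"):
                continue
            # 1. Plain words in the small window (entities skipped: assumption).
            for j in self._neighbours(i, n, self.word_window):
                if not tokens[j].startswith("resource/"):
                    self._accumulate(token, tokens[j], 1.0)
            # 2. Title of the article the line was taken from.
            self._accumulate(token, title, self.title_weight)
            # 3. Other entities in the larger entity window.
            for j in self._neighbours(i, n, self.entity_window):
                if tokens[j].startswith("resource/"):
                    self._accumulate(token, tokens[j], self.entity_weight)


rva = RandomVectorAccumulator(entity_window=4, title_weight=0.5,
                              entity_weight=0.5)
line = ("resource/Church_of_England_parish_church of resource/Saint_Peter "
        "dates in part from the 12th century")
rva.update(line.split(), title="Burnham, Buckinghamshire")
```

With these settings the vector for 'resource/Saint_Peter' receives 'of' and 'dates' at full weight, and the title and 'resource/Church_of_England_parish_church' at the chosen weight of 0.5.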
To test the quality of embeddings generated by the RVA, lexical memory vectors of locations were generated and tested on a modified subset of the Google Analogies Test Set.
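Analogy questions of the form "a is to b as c is to d" are commonly scored by looking for the vector closest to b - a + c. The sketch below shows that scoring scheme on toy three-dimensional vectors; the vectors are invented for the example and are not results, and whether this project scored its test set in exactly this way is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the entity d that best completes 'a is to b as c is to d'."""
    query = [vb - va + vc
             for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [name for name in vectors if name not in (a, b, c)]
    return max(candidates, key=lambda name: cosine(vectors[name], query))


def accuracy(vectors, questions):
    """Fraction of (a, b, c, d) questions answered with the expected d."""
    correct = sum(solve_analogy(vectors, a, b, c) == d
                  for a, b, c, d in questions)
    return correct / len(questions)


# Toy vectors, made up purely to exercise the functions above.
toy_vectors = {
    "resource/Athens": [1.0, 0.0, 1.0],
    "resource/Greece": [1.0, 0.0, 0.0],
    "resource/Oslo":   [0.0, 1.0, 1.0],
    "resource/Norway": [0.0, 1.0, 0.0],
    "resource/Paris":  [1.0, 1.0, 1.0],
}
questions = [
    ("resource/Athens", "resource/Greece", "resource/Oslo", "resource/Norway"),
    ("resource/Oslo", "resource/Norway", "resource/Athens", "resource/Greece"),
]
print(accuracy(toy_vectors, questions))
```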
In terms of generation time, the RVA embeddings are generated in 25 seconds on 0.67% of the corpus with dimension 500, a context window of size 2 and an entity context window of size 10. On the same machine, the Word2Vec embeddings are generated in 200 seconds on 0.67% of the corpus with dimension 300 and window size 10.