In [1]:
%matplotlib inline


Corpora and Vector Spaces
=========================

In [2]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

From Strings to Vectors
------------------------
First, let’s create a small corpus of nine short documents, each a single sentence represented as a string.




In [3]:
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

* Tokenize the documents, then remove common words (using a toy stoplist) and words that appear only once in the corpus.



In [4]:
from pprint import pprint  # pretty-printer
from collections import defaultdict

stoplist = set('for a of the and to in'.split())

texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

frequency = defaultdict(int)

for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

pprint(texts)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


* This simple setup mimics the experiment done in Deerwester et al.'s original LSA article [1].

* Use a bag-of-words document representation to convert documents to vectors. In this representation, each document is represented by one vector, where each vector element represents a question-answer pair, in the style of:

- Question: How many times does the word `system` appear in the document?
- Answer: Once.

* It helps to represent questions by their (integer) ids. The mapping between the questions and ids is called a dictionary.
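
As a rough illustration of the question-answer idea, the sketch below counts words with plain Python. The three-word `vocabulary` is a made-up stand-in for a dictionary (an assumption for this example), and the token list is the fourth document of the toy corpus after preprocessing.

```python
from collections import Counter

# Hypothetical three-word vocabulary; each position is one "question".
vocabulary = ["eps", "human", "system"]

# Fourth document of the toy corpus after preprocessing.
tokens = ["system", "human", "system", "eps"]

counts = Counter(tokens)

# One answer per question: how many times does this word occur?
bow_vector = [counts[word] for word in vocabulary]

print(bow_vector)  # [1, 1, 2]
```

The position of each number in `bow_vector` plays the role of the integer question id.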



In [5]:
from gensim import corpora
dictionary = corpora.Dictionary(texts)
dictionary.save('/tmp/deerwester.dict')  # store the dictionary, for future reference
print(dictionary)

2020-04-26 17:23:42,691 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-04-26 17:23:42,691 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)
2020-04-26 17:23:42,692 : INFO : saving Dictionary object under /tmp/deerwester.dict, separately None
2020-04-26 17:23:42,709 : INFO : saved /tmp/deerwester.dict


Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)


In [6]:
!ls /tmp/*dict

/tmp/deerwester.dict


* We assigned a unique integer id to each word appearing in the corpus with `gensim.corpora.dictionary.Dictionary`. This class sweeps across the texts, collecting word counts and relevant statistics.
* In the end, we see 12 distinct words in the processed corpus, which means each document will be represented by twelve numbers (i.e., by a 12-D vector). To see the mapping between words and their ids:



In [7]:
pprint(dictionary.token2id)

{'computer': 0,
 'eps': 8,
 'graph': 10,
 'human': 1,
 'interface': 2,
 'minors': 11,
 'response': 3,
 'survey': 4,
 'system': 5,
 'time': 6,
 'trees': 9,
 'user': 7}
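
To make the word-to-id mapping less opaque, here is a minimal pure-Python sketch that assigns ids by hand. The helper `build_token2id` is hypothetical and only for illustration; it is not how gensim is implemented, although for this toy corpus it happens to produce the same ids as printed above.

```python
def build_token2id(texts):
    """Assign consecutive integer ids to tokens as documents are scanned."""
    token2id = {}
    for text in texts:
        # Within one document, unseen tokens are numbered alphabetically.
        for token in sorted(set(text)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


# The preprocessed toy corpus from earlier in this notebook.
toy_texts = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["eps", "user", "interface", "system"],
    ["system", "human", "system", "eps"],
    ["user", "response", "time"],
    ["trees"],
    ["graph", "trees"],
    ["graph", "minors", "trees"],
    ["graph", "minors", "survey"],
]

toy_token2id = build_token2id(toy_texts)
print(toy_token2id["computer"], toy_token2id["minors"], len(toy_token2id))  # 0 11 12
```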


* To convert tokenized documents to vectors:



In [8]:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())

pprint(new_vec)  # the word "interaction" does not appear in the dictionary and is ignored

[(0, 1), (1, 1)]


* _doc2bow_ counts the number of occurrences of each distinct word, converts the word to its integer id, and returns the result as a sparse vector.
* _[(0, 1), (1, 1)]_ therefore reads: in the document `"Human computer interaction"`, the words `computer` (0) and `human` (1) appear once; the other ten dictionary words appear (implicitly) zero times.
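
The behaviour described above can be sketched in a few lines of plain Python. `sketch_doc2bow` and `sparse_to_dense` are hypothetical helpers written for this example, and `partial_token2id` is a hand-typed subset of the mapping shown earlier; none of this is gensim's own code.

```python
from collections import Counter


def sketch_doc2bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token for token in tokens if token in token2id)
    return sorted((token2id[token], n) for token, n in counts.items())


def sparse_to_dense(sparse_vector, num_terms):
    """Expand (token_id, count) pairs into a full-length list of counts."""
    dense = [0] * num_terms
    for token_id, count in sparse_vector:
        dense[token_id] = count
    return dense


# Subset of the word-to-id mapping printed earlier in the notebook.
partial_token2id = {"computer": 0, "human": 1, "interface": 2}

sparse = sketch_doc2bow("human computer interaction".split(), partial_token2id)
print(sparse)  # [(0, 1), (1, 1)]

# The implicit zeros become explicit in the dense 12-D form.
print(sparse_to_dense(sparse, num_terms=12))  # [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

The sparse form stores only the two non-zero answers, which is why it scales to large vocabularies.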



In [9]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus)  # store to disk, for later use

pprint(corpus)

2020-04-26 17:23:55,254 : INFO : storing corpus in Matrix Market format to /tmp/deerwester.mm
2020-04-26 17:23:55,256 : INFO : saving sparse matrix to /tmp/deerwester.mm
2020-04-26 17:23:55,256 : INFO : PROGRESS: saving document #0
2020-04-26 17:23:55,258 : INFO : saved 9x12 matrix, density=25.926% (28/108)
2020-04-26 17:23:55,259 : INFO : saving MmCorpus index to /tmp/deerwester.mm.index


[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]


In [10]:
!ls /tmp/deerwester*

/tmp/deerwester.dict  /tmp/deerwester.mm  /tmp/deerwester.mm.index


Corpus Streaming -- One Document at a Time
-------------------------------------------

* Assume there are millions of documents in the corpus. We can't store them in RAM, so let's assume they are stored in a file on disk, one document per line. Gensim only requires that a corpus be able to return one document vector at a time.




In [11]:
!ls ..

auto_examples_jupyter.zip  notebooks-20200423  src
notebooks		   notebooks-bjp       Untitled.ipynb


In [15]:
from smart_open import open  # for transparently opening remote files

class MyCorpus(object):
    def __iter__(self):
        #for line in open('https://radimrehurek.com/gensim/mycorpus.txt'):
        for line in open('file://Marcus-Aurelius-Meditations.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

* A corpus doesn't have to be a list, a NumPy array, or a Pandas dataframe. Gensim *accepts any object that, when iterated over, successively yields documents*.
* This allows you to create your own corpus classes to be streamed from wherever they originate.



* The assumption that each document occupies one line in a single file is not important; you can mold
the `__iter__` function to fit your input format, whatever it is.
* Whether you are walking directories, parsing XML, or accessing the network, just parse your input to retrieve a clean list of tokens in each document, then convert the tokens via a dictionary to their ids and yield the resulting sparse vector inside `__iter__`.
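
As one possible variation on this idea, the sketch below walks a directory tree and treats every `.txt` file as one document. The class name `DirectoryCorpus`, the `.txt` convention, and the plain `token2id` dict are all assumptions made for this example; with gensim you would more likely pass a `Dictionary` and call its `doc2bow` method.

```python
import io
import os
from collections import Counter


class DirectoryCorpus:
    """Stream one bag-of-words vector per .txt file found under `root`."""

    def __init__(self, root, token2id):
        self.root = root
        self.token2id = token2id  # plain dict: token -> integer id

    def __iter__(self):
        for dirpath, dirnames, filenames in os.walk(self.root):
            dirnames.sort()  # visit subdirectories in a stable order
            for name in sorted(filenames):
                if not name.endswith(".txt"):
                    continue
                # io.open is the built-in open; spelled out because this
                # notebook rebinds the name `open` to smart_open's version.
                path = os.path.join(dirpath, name)
                with io.open(path, encoding="utf-8") as handle:
                    tokens = handle.read().lower().split()
                counts = Counter(t for t in tokens if t in self.token2id)
                yield sorted((self.token2id[t], n) for t, n in counts.items())
```

Because the work happens inside `__iter__`, only one file's tokens are held in memory at a time, and the object can be iterated more than once.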



In [16]:
corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
pprint(corpus_memory_friendly)

<__main__.MyCorpus object at 0x7f0823fc1c18>


* The corpus is now an object. We didn't define any way to print it, so `print` just outputs the address of the object in memory. To see its contents, iterate over the corpus and print each document vector (one at a time):



In [20]:
#for vector in corpus_memory_friendly:  # load one vector into memory at a time
#    pprint(vector)

* Although each vector looks the same as it would in a plain Python list, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. Your corpus can now be as large as you want.
* Similarly, to construct the dictionary without loading all texts into memory:



In [22]:
# collect statistics about all tokens
dictionary = corpora.Dictionary(line.lower().split() for line in open('file://Marcus-Aurelius-Meditations.txt'))

# ids of the stop words that made it into the dictionary
stop_ids = [
    dictionary.token2id[stopword]
    for stopword in stoplist
    if stopword in dictionary.token2id
]
# ids of words that appear in only one document (dfs maps token id -> document frequency)
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]

dictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear in only one document
dictionary.compactify()  # remove gaps in id sequence after words that were removed
pprint(dictionary)

2020-04-26 17:26:01,850 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-04-26 17:26:01,965 : INFO : built Dictionary(9104 unique tokens: ['meditations', 'aurelius', 'marcus', '180', 'published:']...) from 3806 documents (total 65661 corpus positions)


<gensim.corpora.dictionary.Dictionary object at 0x7f0823f679e8>
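
The same filtering logic can be sketched without gensim, which may help show what "document frequency" means here. The helper `build_filtered_vocabulary` and the three toy lines are assumptions for this example, and numbering the surviving tokens alphabetically is an arbitrary choice, not what `compactify()` does.

```python
from collections import Counter


def build_filtered_vocabulary(lines, stoplist):
    """One pass over a stream of lines; keep tokens found in 2+ documents."""
    doc_freq = Counter()
    for line in lines:
        # set() ensures a token is counted once per document, however
        # many times it is repeated inside that document.
        doc_freq.update(set(line.lower().split()))
    kept = sorted(
        token for token, df in doc_freq.items()
        if df > 1 and token not in stoplist
    )
    # Renumber the surviving tokens so that ids have no gaps.
    return {token: new_id for new_id, token in enumerate(kept)}


# A generator stands in for a file that is read line by line.
toy_lines = (line for line in [
    "The graph of trees",
    "A graph survey of trees",
    "minors minors everywhere",
])

toy_vocab = build_filtered_vocabulary(toy_lines, stoplist={"a", "of", "the"})
print(toy_vocab)  # {'graph': 0, 'trees': 1}
```

Note that `minors` is dropped even though it occurs twice, because both occurrences are in the same document.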


Corpus Formats
---------------

* There are several file formats for serializing a Vector Space corpus (a sequence of vectors) to disk. Gensim implements them via the *streaming corpus interface* mentioned earlier: documents are read from (or stored to) disk in a lazy fashion, one document at a time, without the whole corpus being read into main memory at once.

* One example: [Matrix Market format](http://math.nist.gov/MatrixMarket/formats.html).
* To save a corpus in the Matrix Market format, create a toy corpus of 2 documents, as a plain Python list:



In [23]:
corpus = [[(1, 0.5)], []]  # make one document empty, for the heck of it

corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)

2020-04-26 17:26:53,220 : INFO : storing corpus in Matrix Market format to /tmp/corpus.mm
2020-04-26 17:26:53,222 : INFO : saving sparse matrix to /tmp/corpus.mm
2020-04-26 17:26:53,222 : INFO : PROGRESS: saving document #0
2020-04-26 17:26:53,223 : INFO : saved 2x2 matrix, density=25.000% (1/4)
2020-04-26 17:26:53,224 : INFO : saving MmCorpus index to /tmp/corpus.mm.index


In [24]:
!ls /tmp/*corpus*

/tmp/corpus.mm	/tmp/corpus.mm.index
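
For a feel of what a Matrix Market file contains, the sketch below renders a small corpus in the coordinate text layout: a banner line, a size line (rows, columns, non-zero entries), then one 1-based `row column value` triple per non-zero entry. The helper `corpus_to_matrix_market` is hypothetical, it treats documents as rows to match the log above, and the exact bytes gensim writes may differ. Unlike gensim's writer, it also holds all entries in memory.

```python
def corpus_to_matrix_market(corpus, num_terms):
    """Render a list of sparse documents as Matrix Market coordinate text."""
    entries = []
    for doc_no, doc in enumerate(corpus):
        for term_id, weight in doc:
            # Matrix Market indices start at 1, not 0.
            entries.append(f"{doc_no + 1} {term_id + 1} {weight}")
    banner = "%%MatrixMarket matrix coordinate real general"
    size_line = f"{len(corpus)} {num_terms} {len(entries)}"
    return "\n".join([banner, size_line] + entries) + "\n"


mm_toy_corpus = [[(1, 0.5)], []]  # same two documents as above
mm_text = corpus_to_matrix_market(mm_toy_corpus, num_terms=2)
print(mm_text, end="")
# %%MatrixMarket matrix coordinate real general
# 2 2 1
# 1 2 0.5
```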


Other formats:
* [SVMlight](http://svmlight.joachims.org/)
* [LDA-C](http://www.cs.princeton.edu/~blei/lda-c/)
* [GibbsLDA++](http://gibbslda.sourceforge.net/)



In [25]:
corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)
corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)

2020-04-26 17:26:56,868 : INFO : converting corpus to SVMlight format: /tmp/corpus.svmlight
2020-04-26 17:26:56,870 : INFO : saving SvmLightCorpus index to /tmp/corpus.svmlight.index
2020-04-26 17:26:56,871 : INFO : no word id mapping provided; initializing from corpus
2020-04-26 17:26:56,871 : INFO : storing corpus in Blei's LDA-C format into /tmp/corpus.lda-c
2020-04-26 17:26:56,873 : INFO : saving vocabulary of 2 words to /tmp/corpus.lda-c.vocab
2020-04-26 17:26:56,874 : INFO : saving BleiCorpus index to /tmp/corpus.lda-c.index
2020-04-26 17:26:56,875 : INFO : no word id mapping provided; initializing from corpus
2020-04-26 17:26:56,875 : INFO : storing corpus in List-Of-Words format into /tmp/corpus.low
2020-04-26 17:26:56,877 : INFO : saving LowCorpus index to /tmp/corpus.low.index
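
As an illustration of one of these formats, SVMlight stores one document per line as a target label followed by `feature:value` pairs with 1-based feature numbers. The sketch below is a hypothetical helper, not gensim's writer; the label `0` is a placeholder assumption, since a bag-of-words corpus carries no class labels.

```python
def doc_to_svmlight_line(doc, label=0):
    """Format one sparse document as an SVMlight-style text line."""
    # SVMlight feature numbers start at 1 and are listed in increasing order.
    pairs = " ".join(f"{term_id + 1}:{weight}" for term_id, weight in sorted(doc))
    return f"{label} {pairs}".rstrip()


svm_toy_corpus = [[(1, 0.5)], []]
svm_lines = [doc_to_svmlight_line(doc) for doc in svm_toy_corpus]
print(svm_lines)  # ['0 2:0.5', '0']
```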


* To load a corpus iterator from a Matrix Market file:



In [26]:
corpus = corpora.MmCorpus('/tmp/corpus.mm')

2020-04-26 17:27:01,677 : INFO : loaded corpus index from /tmp/corpus.mm.index
2020-04-26 17:27:01,679 : INFO : initializing cython corpus reader from /tmp/corpus.mm
2020-04-26 17:27:01,682 : INFO : accepted corpus with 2 documents, 2 features, 1 non-zero entries


* Corpus objects are streams - typically you can't print them directly:



In [27]:
print(corpus)

MmCorpus(2 documents, 2 features, 1 non-zero entries)


* Instead, to view the contents of a corpus, call list() on it first, which converts any sequence to a plain Python list:



In [28]:
print(list(corpus))  

[[(1, 0.5)], []]


* Or print one document at a time, making use of the streaming interface:



In [29]:
for doc in corpus:
    print(doc)

[(1, 0.5)]
[]


* For testing and development purposes, nothing beats the simplicity of calling _list(corpus)_.

* To save the same Matrix Market document stream in Blei's LDA-C format:



In [30]:
corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)

2020-04-26 17:27:12,897 : INFO : no word id mapping provided; initializing from corpus
2020-04-26 17:27:12,899 : INFO : storing corpus in Blei's LDA-C format into /tmp/corpus.lda-c
2020-04-26 17:27:12,900 : INFO : saving vocabulary of 2 words to /tmp/corpus.lda-c.vocab
2020-04-26 17:27:12,901 : INFO : saving BleiCorpus index to /tmp/corpus.lda-c.index


* This way, gensim can also be used as a memory-efficient I/O format conversion tool - just load a document stream using one format and immediately save it in another format.
* Adding formats is easy. See the [SVMlight example](https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/svmlightcorpus.py).

### [NumPy and SciPy compatibility functions](http://radimrehurek.com/gensim/matutils.html)



In [31]:
import gensim
import numpy as np
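numpy_matrix = np.random.randint(10, size=[5, 2])  # random matrix as an example: 5 terms (rows) x 2 documents (columns)
corpus = gensim.matutils.Dense2Corpus(numpy_matrix)  # by default, each column is treated as one document

number_of_corpus_features = 10  # larger than the 5 terms above, so the result is zero-padded to 10 rows
numpy_matrix = gensim.matutils.corpus2dense(corpus, num_terms=number_of_corpus_features)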
pprint(numpy_matrix)

array([[7., 1.],
       [1., 6.],
       [0., 1.],
       [4., 0.],
       [8., 8.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]], dtype=float32)
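
To see why the result has ten rows, the round trip can be approximated with nested lists. Both helpers below are hypothetical stand-ins written for this example, assuming that each column is one document; the input values are copied from the first five rows of the output above.

```python
def dense_columns_to_corpus(matrix):
    """Yield one sparse document per column of a row-major nested list."""
    num_docs = len(matrix[0]) if matrix else 0
    for doc_no in range(num_docs):
        # Keep only the non-zero entries, as (term_id, value) pairs.
        yield [
            (term_id, row[doc_no])
            for term_id, row in enumerate(matrix)
            if row[doc_no] != 0
        ]


def corpus_to_dense_columns(corpus, num_terms):
    """Build a num_terms x num_docs nested list, padding with zeros."""
    docs = list(corpus)
    dense = [[0.0] * len(docs) for _ in range(num_terms)]
    for doc_no, doc in enumerate(docs):
        for term_id, weight in doc:
            dense[term_id][doc_no] = float(weight)
    return dense


dense_toy = [[7, 1], [1, 6], [0, 1], [4, 0], [8, 8]]  # 5 terms x 2 documents
sparse_docs = list(dense_columns_to_corpus(dense_toy))
print(sparse_docs)
# [[(0, 7), (1, 1), (3, 4), (4, 8)], [(0, 1), (1, 6), (2, 1), (4, 8)]]

padded = corpus_to_dense_columns(sparse_docs, num_terms=10)
print(len(padded), len(padded[0]))  # 10 2
```

Asking for ten terms when only five exist simply appends five all-zero rows.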


In [32]:
import scipy.sparse
scipy_sparse_matrix = scipy.sparse.random(5, 2)  # random sparse matrix as example; the default density (0.01) leaves this small matrix with no stored elements
corpus              = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix)
scipy_csc_matrix    = gensim.matutils.corpus2csc(corpus)

pprint(scipy_csc_matrix)

<0x2 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Column format>
