In [1]:
%matplotlib inline

Core Concepts - Documents, Corpora, Vectors & Models
====================================================


In [4]:
import pprint

The core concepts of ``gensim`` are:

1. Document: some text.
2. Corpus: a collection of documents.
3. Vector: a mathematically convenient representation of a document.
4. Model: an algorithm for transforming vectors from one representation to another.

Document
--------
* In Gensim, a *document* is an object of the [Python text sequence type](https://docs.python.org/3.7/library/stdtypes.html#text-sequence-type-str), i.e. a `str`. It can be anything from a 140-character tweet or a single paragraph to a news article or a book.


In [5]:
document = "Human machine interface for lab abc computer applications"


Corpus
------

A *corpus* is a collection of documents. Corpora serve two purposes in Gensim:

1. Input for training a model. During training, the model uses this *training corpus* to look for common themes and topics, initializing its internal parameters. Gensim focuses on *unsupervised* models, so no human intervention (such as annotating or tagging documents by hand) is required.

2. Documents to organize: after training, a topic model can extract topics from new documents.

In [6]:
text_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

* This example loads the entire corpus into memory. In practice, corpora may be very large, so loading them into memory may be impossible. Gensim handles these cases by *streaming* them one document at a time.
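
* A minimal sketch of the streaming idea, assuming the corpus is stored as a plain-text file with one document per line. The file layout and the helper name `stream_documents` are illustrative assumptions, not part of Gensim. The generator yields one tokenized document at a time, so only the current line needs to be held in memory:

```python
import os
import tempfile


def stream_documents(path):
    """Yield one tokenized document per line, without loading the whole file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tokens = line.lower().split()
            if tokens:  # skip blank lines
                yield tokens


# Write a tiny sample corpus to a temporary file (hypothetical data layout).
sample_lines = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("\n".join(sample_lines))
    corpus_path = tmp.name

# The list() call is only for the demonstration; real code would loop lazily.
streamed = list(stream_documents(corpus_path))
os.remove(corpus_path)
print(streamed[0])
```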

* We then remove common English words (such as 'the') and words that occur only once in the corpus. In the process of doing so, we'll tokenize our data. Tokenization breaks up the documents into words (in this case using space as a delimiter).


In [7]:
# Create a set of frequent words
stoplist = set('for a of the and to in'.split(' '))

# Lowercase each document, split it by white space and filter out stopwords
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in text_corpus]

# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)

for text in texts:
    for token in text:
        frequency[token] += 1

# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
pprint.pprint(processed_corpus)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


* Next, we associate each word in the corpus with a unique integer ID, using the `gensim.corpora.Dictionary` class. This dictionary defines the vocabulary of all words that our model knows about.

In [8]:
from gensim import corpora

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)


Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)


* Because our corpus is small, there are only 12 different tokens in this dictionary. Larger dictionaries with hundreds of thousands of tokens are quite common.

Vector
------

* We need to represent documents mathematically. One approach is to represent each document as a vector of *features*. A single feature may be thought of as a question-answer pair:

1. How many times does the word *splonge* appear in the document? (Zero.)
2. How many paragraphs does the document consist of? (Two.)
3. How many fonts does the document use? (Five.)

* Each question is identified by an integer ID (here 1, 2 and 3), so this document becomes a series of (ID, answer) pairs: *(1, 0.0), (2, 2.0), (3, 5.0)*. This *dense vector* contains an explicit answer to each question.

* If we already know all the questions, we may leave them implicit and represent the document as *(0, 2, 5)*. This is the **vector** for our document.

* Vectors usually contain many zero values. To save memory, Gensim omits all vector elements with value 0.0, so our example becomes *(2, 2.0), (3, 5.0)*. This is known as a *sparse vector* or *bag-of-words (BoW) vector*. Any feature missing from this representation can be unambiguously resolved to zero.
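
* The dense-to-sparse step can be sketched in plain Python. The helper names `dense_to_sparse` and `lookup` are hypothetical and only illustrate the idea; they are not Gensim functions:

```python
def dense_to_sparse(pairs):
    """Keep only the (feature_id, value) pairs whose value is non-zero."""
    return [(feature_id, value) for feature_id, value in pairs if value != 0.0]


def lookup(sparse, feature_id):
    """Read a feature from a sparse vector; missing IDs resolve to 0.0."""
    return dict(sparse).get(feature_id, 0.0)


dense = [(1, 0.0), (2, 2.0), (3, 5.0)]
sparse = dense_to_sparse(dense)
print(sparse)             # [(2, 2.0), (3, 5.0)]
print(lookup(sparse, 1))  # 0.0
```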

* Assuming identical questions, we can compare the vectors of two documents. For example, assume we are given two vectors (0.0, 2.0, 5.0) and (0.1, 1.9, 4.9). Because the vectors are similar to each other, we can conclude the corresponding documents are also similar.
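
* One common way to put a number on "similar" is cosine similarity, which is close to 1 when two vectors point in nearly the same direction. The choice of measure is an assumption here, since the paragraph above does not name one. A minimal sketch:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


score = cosine_similarity((0.0, 2.0, 5.0), (0.1, 1.9, 4.9))
print(score)  # very close to 1, so the two documents look alike
```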

* Another approach is to represent a document as a vector containing the frequency count of each word in the dictionary. For example, assume we have a dictionary containing the words ['coffee', 'milk', 'sugar', 'spoon']. A document consisting of the string "coffee milk coffee" would be represented by the vector [2, 1, 0, 0], where the entries are (in order) the occurrences of "coffee", "milk", "sugar" and "spoon" in the document. The length of the vector is the number of entries in the dictionary.

* One of the main properties of this model is that it completely ignores the order of the tokens in the document - hence the 'bag-of-words' name.
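
* The coffee example can be checked with a few lines of plain Python. The helper `count_vector` is a hypothetical name used only for this illustration. Reordering the words leaves the vector unchanged:

```python
vocabulary = ["coffee", "milk", "sugar", "spoon"]


def count_vector(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]


original = count_vector("coffee milk coffee", vocabulary)
shuffled = count_vector("milk coffee coffee", vocabulary)
print(original)  # [2, 1, 0, 0]
print(shuffled)  # [2, 1, 0, 0], word order does not matter
```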

* Our corpus has 12 unique words in it, so each document will be represented by a 12-dimensional vector. We can use the dictionary to turn tokenized documents into these 12-dimensional vectors.

In [11]:
pprint.pprint(dictionary.token2id)

{'computer': 0,
 'eps': 8,
 'graph': 10,
 'human': 1,
 'interface': 2,
 'minors': 11,
 'response': 3,
 'survey': 4,
 'system': 5,
 'time': 6,
 'trees': 9,
 'user': 7}
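
* The mapping above is consistent with assigning IDs document by document, giving each previously unseen token the next free integer in alphabetical order within its document. The sketch below works under that assumption; `build_token2id` is a hypothetical helper and should not be read as Gensim's implementation:

```python
def build_token2id(tokenized_docs):
    """Assign the next free integer ID to each new token, document by document."""
    token2id = {}
    for doc in tokenized_docs:
        for token in sorted(set(doc)):  # alphabetical order within one document
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


processed_corpus = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["eps", "user", "interface", "system"],
    ["system", "human", "system", "eps"],
    ["user", "response", "time"],
    ["trees"],
    ["graph", "trees"],
    ["graph", "minors", "trees"],
    ["graph", "minors", "survey"],
]
token2id = build_token2id(processed_corpus)
print(token2id["computer"], token2id["eps"], token2id["minors"])  # 0 8 11
```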


* Suppose we wanted to vectorize the phrase "Human computer interaction" (Note: this phrase was not in our original corpus). We can create the BoW representation for a document using the **doc2bow** method - it returns a sparse representation of the word counts.

In [12]:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)

[(0, 1), (1, 1)]


* The first entry in each tuple is the token ID in the dictionary.
* The second entry in each tuple is the count of this token.

* Note: "interaction" was not in the original corpus, so it was not included in the vectorization. Also, this vector only contains entries for words that actually appeared in the document. Because any given document contains only a few of the many words in the dictionary, words that do not appear in the vectorization are implicitly zero, which saves space.
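
* To make the behaviour concrete, here is a simplified stand-in for the counting step. `simple_doc2bow` is a hypothetical helper written for this illustration, not Gensim's `doc2bow`. It drops tokens that are missing from the mapping and returns sorted (ID, count) pairs:

```python
from collections import Counter

# Token-to-ID mapping copied from the dictionary output above.
token2id = {
    "computer": 0, "human": 1, "interface": 2, "response": 3,
    "survey": 4, "system": 5, "time": 6, "user": 7,
    "eps": 8, "trees": 9, "graph": 10, "minors": 11,
}


def simple_doc2bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())


vec = simple_doc2bow("human computer interaction".split(), token2id)
print(vec)  # [(0, 1), (1, 1)]; "interaction" is unknown and is skipped
```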

* We can convert our entire original corpus to a list of vectors:

In [13]:
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(bow_corpus)

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]


* While this list lives entirely in memory, in most applications you will want a more scalable solution. **gensim** accepts any iterable that yields a single document vector at a time.
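
* A sketch of such a corpus object, assuming a hypothetical class `StreamedBowCorpus` that converts documents to sparse vectors on the fly. Defining `__iter__` means every pass starts from the beginning, which matters for code that reads the corpus more than once; a plain one-shot iterator would be empty on the second pass. In practice the document source would itself be lazy (for example a file reader) rather than the small in-memory list used here:

```python
class StreamedBowCorpus:
    """Re-iterable corpus: each pass yields one sparse vector at a time."""

    def __init__(self, tokenized_docs, token2id):
        self.tokenized_docs = tokenized_docs
        self.token2id = token2id

    def __iter__(self):
        for tokens in self.tokenized_docs:
            counts = {}
            for token in tokens:
                if token in self.token2id:  # ignore out-of-vocabulary tokens
                    token_id = self.token2id[token]
                    counts[token_id] = counts.get(token_id, 0) + 1
            yield sorted(counts.items())


token2id = {"computer": 0, "human": 1, "interface": 2}
docs = [["human", "interface", "computer"], ["human", "human", "unknown"]]

corpus = StreamedBowCorpus(docs, token2id)
first_pass = list(corpus)
second_pass = list(corpus)  # a second pass yields the same vectors

one_shot = iter(corpus)     # a bare iterator, by contrast, can be consumed only once
consumed = list(one_shot)
leftover = list(one_shot)   # nothing left on the second attempt
print(first_pass)
```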

Model
-----

* In **gensim**, documents are represented as vectors, so a model can be thought of as a transformation from one vector space to another. The model learns the details of this transformation during training, when it reads the training corpus.

* One example is [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf). It transforms vectors from the BoW representation to a vector space where the frequency counts are weighted according to the relative rarity of each word in the corpus.

* Here's a simple example. Let's initialize the tf-idf model, train it on our corpus and transform the string "system minors":

In [14]:
from gensim import models

# train the model
tfidf = models.TfidfModel(bow_corpus)

# transform the "system minors" string
words = "system minors".lower().split()
print(tfidf[dictionary.doc2bow(words)])

[(5, 0.5898341626740045), (11, 0.8075244024440723)]


* The **tfidf** model returns a list of (ID, weight) tuples. Note that the ID corresponding to "system" (which occurred 4 times in the original corpus, across 3 documents) has been weighted lower than the ID corresponding to "minors" (which occurred only twice, in 2 documents).
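
* The two weights can be reproduced by hand. The arithmetic below assumes the weighting that matches the numbers above: term count times log2(N / df), followed by scaling the vector to unit Euclidean length. The document frequencies are counted from the nine-document corpus, and other weighting schemes would give different values:

```python
import math

num_docs = 9                           # documents in the training corpus
doc_freq = {"system": 3, "minors": 2}  # number of documents containing each word

# Inverse document frequency: rarer words receive a larger weight.
idf = {word: math.log2(num_docs / df) for word, df in doc_freq.items()}

# Each query word occurs once, so its raw weight is 1 * idf.
# Scaling to unit length gives the final weights.
norm = math.sqrt(sum(value ** 2 for value in idf.values()))
weights = {word: value / norm for word, value in idf.items()}
print(weights)  # "system" is about 0.590, "minors" about 0.808
```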

* Once you've created the model, you can do all sorts of cool stuff with it. For example, to transform the whole corpus via tf-idf and index it in preparation for similarity queries:

In [15]:
from gensim import similarities

index = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=12)

* Find the similarity of our query document against every document in the corpus:

In [17]:
query_document = 'system engineering'.split()
query_bow = dictionary.doc2bow(query_document)
sims = index[tfidf[query_bow]]
pprint.pprint(list(enumerate(sims)))

[(0, 0.0),
 (1, 0.32448703),
 (2, 0.41707572),
 (3, 0.7184812),
 (4, 0.0),
 (5, 0.0),
 (6, 0.0),
 (7, 0.0),
 (8, 0.0)]


* Document 3 has a similarity score of 0.718 (about 72%), and document 2 has a score of about 42%. Sorting the scores makes the ranking easier to read:

In [18]:
pprint.pprint(sorted(enumerate(sims), key=lambda x: x[1], reverse=True))

[(3, 0.7184812),
 (2, 0.41707572),
 (1, 0.32448703),
 (0, 0.0),
 (4, 0.0),
 (5, 0.0),
 (6, 0.0),
 (7, 0.0),
 (8, 0.0)]
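
* The non-zero scores above can be reproduced without Gensim, which helps show what the index computes. The sketch assumes the same tf-idf weighting as before (count times log2(N / df), scaled to unit length) and uses the dot product of unit-length vectors as the cosine similarity. The helpers `to_unit_tfidf` and `sparse_dot` are hypothetical names for this illustration:

```python
import math

bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]

# Document frequency: in how many documents does each token ID appear?
doc_freq = {}
for doc in bow_corpus:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1


def to_unit_tfidf(bow):
    """Return {token_id: weight}, weighted by log2(N / df) and L2-normalized."""
    raw = {
        token_id: count * math.log2(len(bow_corpus) / doc_freq[token_id])
        for token_id, count in bow
        if token_id in doc_freq
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {token_id: w / norm for token_id, w in raw.items()} if norm else {}


def sparse_dot(a, b):
    """Dot product over the token IDs that both sparse vectors share."""
    return sum(weight * b[token_id] for token_id, weight in a.items() if token_id in b)


query = to_unit_tfidf([(5, 1)])  # "system"; "engineering" is not in the dictionary
sims = [sparse_dot(query, to_unit_tfidf(doc)) for doc in bow_corpus]
ranked = sorted(enumerate(sims), key=lambda pair: pair[1], reverse=True)
print(ranked[:3])  # documents 3, 2 and 1, in that order
```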
