# Tutorial 1: Corpora and Vector Spaces
See this *gensim* tutorial on the web [here](https://radimrehurek.com/gensim/tut1.html).

Version adapted for the Text Mining classes at ISCTE-IUL 

Don’t forget to set:

In [None]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

if you want to see logging events.

## From Strings to Vectors

This time, let’s start from documents represented as strings:

In [None]:
logging.info("Let's check whether gensim is installed")
from gensim import corpora, models, similarities

In [None]:
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

This is a tiny corpus of nine documents, each consisting of only a single sentence.

First, let’s tokenize the documents, remove common words (using a toy stoplist) as well as words that only appear once in the corpus:

In [None]:
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1] for text in texts]

print(texts)

Your way of processing the documents will likely vary; here, I only lowercase each document and split it on whitespace to tokenize. In fact, I use this particular (simplistic and inefficient) setup to mimic the experiment done in [Deerwester et al.’s original LSA article](http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf) (Table 2).

The ways to process documents are so varied and application- and language-dependent that I decided to not constrain them by any interface. Instead, a document is represented by the features extracted from it, not by its “surface” string form: how you get to the features is up to you. Below I describe one common, general-purpose approach (called bag-of-words), but keep in mind that different application domains call for different features, and, as always, it’s [garbage in, garbage out](https://en.wikipedia.org/wiki/Garbage_in,_garbage_out)...

To convert documents to vectors, we’ll use a document representation called [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model). In this representation, each document is represented by one vector where each vector element represents a question-answer pair, in the style of:

"How many times does the word *system* appear in the document? Once."

It is advantageous to represent the questions only by their (integer) ids. The mapping between the questions and ids is called a dictionary:

In [None]:
dictionary = corpora.Dictionary(texts)
print(dictionary)

Here we assigned a unique integer id to every word appearing in the corpus with the [gensim.corpora.dictionary.Dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary) class. This sweeps across the texts, collecting word counts and relevant statistics. In the end, we see there are twelve distinct words in the processed corpus, which means each document will be represented by twelve numbers (i.e., by a 12-D vector). To see the mapping between words and their ids:

In [None]:
print(dictionary.token2id)
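
As a rough sketch of what such a mapping looks like, the snippet below builds a token-to-id dictionary in plain Python for a small, hypothetical list of tokenized texts. The helper name `build_token2id` and the alphabetical id order are assumptions made for illustration; gensim's `Dictionary` may assign ids in a different order.

```python
def build_token2id(tokenized_texts):
    """Assign a unique integer id to every distinct token (illustrative helper, not gensim API)."""
    vocabulary = sorted({token for text in tokenized_texts for token in text})
    return {token: token_id for token_id, token in enumerate(vocabulary)}


# Hypothetical toy input, not the nine-document corpus used in this notebook.
toy_texts = [["human", "interface", "computer"],
             ["survey", "user", "computer", "system"]]
toy_token2id = build_token2id(toy_texts)
print(toy_token2id)
```

Whatever order the ids come in, the only property that matters is that each distinct word gets exactly one id and no id is reused.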

To actually convert tokenized documents to vectors:

In [None]:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())

# the word "interaction" does not appear in the dictionary and is ignored
print(new_vec)  

The function `doc2bow()` simply counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a sparse vector. The sparse vector `[(word_id, 1), (word_id, 1)]` therefore reads: in the document *“Human computer interaction”*, the words *“computer”* and *“human”*, identified by the integer ids assigned by the dictionary we built, appear once; the other ten dictionary words appear (implicitly) zero times. Check their ids in the dictionary printed above and confirm that they match.
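
To make the counting step concrete, here is a minimal plain-Python sketch of the same idea. It is not gensim's implementation: the function name `doc2bow_sketch` and the three-word mapping are hypothetical, and the ids in your own dictionary will most likely differ.

```python
from collections import Counter


def doc2bow_sketch(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens missing from token2id."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())


# Hypothetical mapping used only for this illustration.
example_token2id = {"computer": 0, "human": 1, "system": 2}
example_bow = doc2bow_sketch("human computer interaction".split(), example_token2id)
print(example_bow)  # [(0, 1), (1, 1)]
```

The word "interaction" is silently dropped because it has no id, which mirrors what happens to unknown words in the cell above.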

In [None]:
corpus = [dictionary.doc2bow(text) for text in texts]
for c in corpus:
    print(c)

By now it should be clear that the vector feature with `id=10` stands for the question “How many times does the word *graph* appear in the document?” and that the answer is “zero” for the first six documents and “one” for the remaining three. As a matter of fact, we have arrived at exactly the same corpus of vectors as in the [Quick Example](https://radimrehurek.com/gensim/tutorial.html#first-example). If you are running this notebook on your own, the word ids may differ, but you should still be able to check the consistency between documents by comparing their vectors.

## What is the most related document?
First we will convert the existing sparse representation of the documents into a dense matrix, with one row per document and one column per dictionary word.

In [None]:
print("Corpus contains %d documents. Vocabulary of %d words" % ( len(corpus), len(dictionary)))

In [None]:
import numpy as np
# dense matrix of zeros: one row per document, one column per dictionary word
dados = np.zeros([len(corpus), len(dictionary)], dtype=float)
dados

In [None]:
# copy each sparse (word_id, count) pair into row i of the dense matrix
for i, doc in enumerate(corpus):
    for word_id, count in doc:
        dados[i, word_id] = count
dados

In [None]:
# dense vector for the query document, built the same way
teste = np.zeros([len(dictionary)], dtype=float)
for word_id, count in new_vec:
    teste[word_id] = count
teste

In [None]:
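from scipy.spatial import distance
# cosine distance between the query vector and each document row
# (0 = same direction, 1 = no words in common)
for i in range(len(dados)):
    print("distance(%d)=%f" % (i, distance.cosine(teste, dados[i])))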
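
The cosine distance used above is one minus the cosine of the angle between two vectors, $1 - \frac{u \cdot v}{\|u\|\,\|v\|}$. The following plain-Python sketch spells that formula out on toy vectors; the function `cosine_distance` and the three example vectors are illustrative assumptions, and the sketch does not handle all-zero vectors.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


query = [1.0, 1.0, 0.0]       # toy count vector
same_words = [2.0, 2.0, 0.0]  # same direction as the query, just longer
no_overlap = [0.0, 0.0, 3.0]  # shares no words with the query

print(cosine_distance(query, same_words))  # approximately 0
print(cosine_distance(query, no_overlap))  # 1.0
```

Because only the direction of the vectors matters, a document that repeats the query's words more often is still at distance (approximately) zero, and the document with the smallest distance is the most related one.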

In [None]:
for i, d in enumerate(documents):
    print(f"{i} => {d}")
print(new_doc)

## Now let's try to use TF-IDF

In [None]:
for c in corpus:
    print(c)
print("==>", new_vec)

In [None]:
tfidf = models.TfidfModel(corpus)

In [None]:
print(tfidf[new_vec])
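
To make the transformation less of a black box, here is a minimal plain-Python sketch of one common TF-IDF variant: the raw term count multiplied by log2(N / document frequency), followed by scaling the vector to unit length. As far as I know this is close to `TfidfModel`'s default settings, but treat that as an assumption and check the gensim documentation. The helper `tfidf_sketch` and the toy corpus are hypothetical.

```python
import math


def tfidf_sketch(bow, doc_freq, num_docs):
    """Weight (term_id, count) pairs by count * log2(num_docs / df), then scale to unit length."""
    weighted = [(term_id, count * math.log2(num_docs / doc_freq[term_id]))
                for term_id, count in bow]
    # Drop terms whose weight is zero (they occur in every document).
    weighted = [(term_id, w) for term_id, w in weighted if w > 0.0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(term_id, w / norm) for term_id, w in weighted] if norm else []


# Toy corpus of 4 documents in bag-of-words form (hypothetical ids and counts).
toy_corpus = [[(0, 1), (1, 1)], [(0, 1), (2, 2)], [(0, 1)], [(0, 1), (1, 1)]]

# Document frequency: in how many documents does each term id occur?
toy_doc_freq = {}
for doc in toy_corpus:
    for term_id, _ in doc:
        toy_doc_freq[term_id] = toy_doc_freq.get(term_id, 0) + 1

toy_weights = tfidf_sketch([(0, 1), (1, 1), (2, 1)], toy_doc_freq, len(toy_corpus))
print(toy_weights)
```

In this toy corpus, term 0 occurs in every document, so it carries no information and drops out, while the rarer term 2 ends up with a larger weight than term 1.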

In [None]:
# num_features must equal the vocabulary size (12 for this corpus)
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

In [None]:
print(index)

In [None]:
sims = index[tfidf[new_vec]]
sims

In [None]:
list(enumerate(sims))
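
A list of `(document_index, score)` pairs is easier to read once it is sorted. The sketch below ranks documents from most to least similar; the scores in `example_sims` are made-up placeholders, not the output of the cells above.

```python
# Placeholder similarity scores, one per document; NOT output of the cells above.
example_sims = [0.10, 0.75, 0.00, 0.42]

# Sort (document_index, score) pairs from most to least similar.
ranking = sorted(enumerate(example_sims), key=lambda pair: pair[1], reverse=True)
for doc_index, score in ranking:
    print(f"document {doc_index}: similarity {score:.2f}")
```

Note the direction: these are similarities, so higher means more related, which is the opposite of the cosine distances printed in the previous section.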