# Gensim – Vectorizing Text and Transformations and n-grams



## Bag of words

In [2]:
vocab = ['dog', 'sat', 'mat', 'love', 'cat']
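
As a minimal sketch of how this vocabulary turns a sentence into a vector, the snippet below counts how often each vocabulary word occurs in a tokenized sentence. The two example sentences and the helper name `bag_of_words` are illustrative assumptions, not part of the original notebook.

```python
vocab = ['dog', 'sat', 'mat', 'love', 'cat']

def bag_of_words(tokens, vocab):
    """Return one count per vocabulary word, in vocabulary order."""
    return [tokens.count(word) for word in vocab]

# Hypothetical sentences, already lemmatized and stripped of stop words.
sentence_1 = ['dog', 'sat', 'mat']
sentence_2 = ['cat', 'love', 'dog']

print(bag_of_words(sentence_1, vocab))  # [1, 1, 1, 0, 0]
print(bag_of_words(sentence_2, vocab))  # [1, 0, 0, 1, 1]
```

Each position in the vector corresponds to one vocabulary word, so word order within the sentence is lost; only the counts remain.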

## TF-IDF

TF-IDF is short for term frequency-inverse document frequency. Widely used in search engines to rank documents by relevance to a query, it is an intuitive way to convert sentences into vectors: a word's weight grows with how often it appears in a document and shrinks with the number of documents that contain it.

```
TF(t) = (number of times term t appears in a document) / (total number of terms in the document)

IDF(t) = log_e (total number of documents / number of documents with term t in it)
```
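
The two formulas can be sketched in plain Python as follows. The toy documents and function names are assumptions made for illustration; the code follows the definitions above directly (natural logarithm, no smoothing or normalization).

```python
import math

# Hypothetical tokenized documents, used only to illustrate the formulas.
docs = [
    ['dog', 'sat', 'mat'],
    ['cat', 'love', 'dog'],
    ['cat', 'sat', 'mat', 'mat'],
]

def tf(term, doc):
    """Occurrences of term in doc divided by the total number of terms in doc."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Natural log of (number of documents / number of documents containing term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    """Product of the TF and IDF scores."""
    return tf(term, doc) * idf(term, docs)

print(tf('mat', docs[2]))                      # 0.5
print(round(idf('mat', docs), 4))              # 0.4055
print(round(tf_idf('mat', docs[2], docs), 4))  # 0.2027
```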

## Vector transformation in Gensim

In [3]:
from gensim import corpora
documents = [u"Football club Arsenal defeat local rivals this weekend.", u"Weekend football frenzy takes over London.", u"Bank open for takeover bids after losing millions.", u"London football clubs bid to move to Wembley stadium.", u"Arsenal bid 50 million pounds for striker Kane.", u"Financial troubles result in loss of millions for bank.", u"Western bank files for bankruptcy after financial losses.", u"London football club is taken over by oil millionaire from Russia.", u"Banking on finances not working for Russia."]


In [4]:
documents

['Football club Arsenal defeat local rivals this weekend.',
 'Weekend football frenzy takes over London.',
 'Bank open for takeover bids after losing millions.',
 'London football clubs bid to move to Wembley stadium.',
 'Arsenal bid 50 million pounds for striker Kane.',
 'Financial troubles result in loss of millions for bank.',
 'Western bank files for bankruptcy after financial losses.',
 'London football club is taken over by oil millionaire from Russia.',
 'Banking on finances not working for Russia.']

In [10]:
import spacy

nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut no longer works in spaCy 3+

texts = []
for document in documents:
    text = []
    doc = nlp(document)
    for w in doc:
        if not w.is_stop and not w.is_punct and not w.like_num:
            text.append(w.lemma_)
    texts.append(text)

In [11]:
texts

[['football', 'club', 'arsenal', 'defeat', 'local', 'rival', 'weekend'],
 ['weekend', 'football', 'frenzy', 'take', 'london'],
 ['bank', 'open', 'takeover', 'bid', 'lose', 'million'],
 ['london', 'football', 'club', 'bid', 'wembley', 'stadium'],
 ['arsenal', 'bid', 'pound', 'striker', 'kane'],
 ['financial', 'trouble', 'result', 'loss', 'million', 'bank'],
 ['western', 'bank', 'file', 'bankruptcy', 'financial', 'loss'],
 ['london', 'football', 'club', 'take', 'oil', 'millionaire', 'russia'],
 ['bank', 'finance', 'work', 'russia']]

Let's start by setting up a bag-of-words representation for our mini-corpus. Gensim allows us to do this very conveniently.

In [12]:
dictionary = corpora.Dictionary(texts)
print(dictionary.token2id)

{'arsenal': 0, 'club': 1, 'defeat': 2, 'football': 3, 'local': 4, 'rival': 5, 'weekend': 6, 'frenzy': 7, 'london': 8, 'take': 9, 'bank': 10, 'bid': 11, 'lose': 12, 'million': 13, 'open': 14, 'takeover': 15, 'stadium': 16, 'wembley': 17, 'kane': 18, 'pound': 19, 'striker': 20, 'financial': 21, 'loss': 22, 'result': 23, 'trouble': 24, 'bankruptcy': 25, 'file': 26, 'western': 27, 'millionaire': 28, 'oil': 29, 'russia': 30, 'finance': 31, 'work': 32}


There are 33 unique words in our corpus (IDs 0 through 32), each of which is assigned an index by the dictionary. When we refer to a word's word_id from here on, we mean the integer ID that the dictionary maps the word to.

We will be using the doc2bow method, which, as the name suggests, converts a document to its bag-of-words representation.

In [13]:
corpus = [dictionary.doc2bow(text) for text in texts]

In [14]:
print(corpus)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)], [(3, 1), (6, 1), (7, 1), (8, 1), (9, 1)], [(10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)], [(1, 1), (3, 1), (8, 1), (11, 1), (16, 1), (17, 1)], [(0, 1), (11, 1), (18, 1), (19, 1), (20, 1)], [(10, 1), (13, 1), (21, 1), (22, 1), (23, 1), (24, 1)], [(10, 1), (21, 1), (22, 1), (25, 1), (26, 1), (27, 1)], [(1, 1), (3, 1), (8, 1), (9, 1), (28, 1), (29, 1), (30, 1)], [(10, 1), (30, 1), (31, 1), (32, 1)]]


This is a list of lists, where each inner list is one document's bag-of-words representation. A reminder: you might see different numbers in your list, because the ID assigned to each word can differ whenever the dictionary is rebuilt. Unlike the earlier example, where a missing word was recorded as 0, Gensim stores only the words that are present, as (word_id, word_count) tuples. We can verify this by taking an original sentence, mapping each word to its integer ID, and reconstructing the list. Notice also that no document here contains any word more than once; this tends to happen in small corpora.
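
To make the (word_id, word_count) format concrete, here is a rough plain-Python sketch of what `doc2bow` produces for a single document, and of how to map the IDs back to words. The `token2id` subset is copied from the dictionary printed above; the helper `to_bow` is a hypothetical stand-in, not Gensim's implementation.

```python
from collections import Counter

# Subset of the token2id mapping printed earlier (words of the second document).
token2id = {'football': 3, 'weekend': 6, 'frenzy': 7, 'london': 8, 'take': 9}
id2token = {word_id: token for token, word_id in token2id.items()}

def to_bow(tokens, token2id):
    """Hypothetical stand-in for doc2bow: sorted (word_id, word_count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

text = ['weekend', 'football', 'frenzy', 'take', 'london']
bow = to_bow(text, token2id)
print(bow)  # [(3, 1), (6, 1), (7, 1), (8, 1), (9, 1)]

# Map each word_id back to its token to check the representation.
decoded = [(id2token[word_id], count) for word_id, count in bow]
print(decoded)
```

The printed list matches the second entry of the corpus shown above, which confirms that each tuple is simply a word's ID paired with its count in that document.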

We can start by storing the corpus, once it is created, to disk. One way to do this is as follows:

```python
# Write the bag-of-words corpus to disk in Matrix Market format.
corpora.MmCorpus.serialize('/tmp/example.mm', corpus)
```



In [16]:
from gensim import models
tfidf = models.TfidfModel(corpus)

So, what does a TF-IDF representation of our corpus look like? All we have to do is this:

In [17]:
for document in tfidf[corpus]:
    print(document)

[(0, 0.3292179861221233), (1, 0.24046829370585296), (2, 0.4809365874117059), (3, 0.1774993848325406), (4, 0.4809365874117059), (5, 0.4809365874117059), (6, 0.3292179861221233)]
[(3, 0.24212967666975266), (6, 0.4490913847888623), (7, 0.6560530929079719), (8, 0.32802654645398593), (9, 0.4490913847888623)]
[(10, 0.18797844084016113), (11, 0.25466485399352906), (12, 0.5093297079870581), (13, 0.3486540744136096), (14, 0.5093297079870581), (15, 0.5093297079870581)]
[(1, 0.29431054749542984), (3, 0.21724253258131512), (8, 0.29431054749542984), (11, 0.29431054749542984), (16, 0.5886210949908597), (17, 0.5886210949908597)]
[(0, 0.354982288765831), (11, 0.25928712547209604), (18, 0.5185742509441921), (19, 0.5185742509441921), (20, 0.5185742509441921)]
[(10, 0.19610384738673725), (13, 0.3637247180792822), (21, 0.3637247180792822), (22, 0.3637247180792822), (23, 0.5313455887718271), (24, 0.5313455887718271)]
[(10, 0.18286519950508276), (21, 0.3391702611796705), (22, 0.3391702611796705), (25, 0.495

Recall what we said about TF-IDF: the float next to each word_id is that word's TF-IDF weight, the product of its TF and IDF scores (normalized per document under Gensim's default settings), rather than the plain word count we had before. The higher the score, the more important the word is to that document.
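
The weights printed above can be reproduced by hand. With its default settings, Gensim's `TfidfModel` uses the raw term count as TF, a base-2 logarithm for IDF, and then scales each document vector to unit length, so the values differ from a direct application of the natural-log formula given earlier. The sketch below recomputes the first document's weights under those settings; the document frequencies are counted from the bag-of-words corpus printed above.

```python
import math

num_docs = 9
# Document frequency of each word_id in the first document,
# counted from the bag-of-words corpus printed earlier.
doc_freq = {0: 2, 1: 3, 2: 1, 3: 4, 4: 1, 5: 1, 6: 2}
first_doc = [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)]

# Raw count times base-2 IDF for every term in the document.
raw = [(word_id, count * math.log2(num_docs / doc_freq[word_id]))
       for word_id, count in first_doc]

# Scale the vector to unit (L2) length.
norm = math.sqrt(sum(weight ** 2 for _, weight in raw))
weights = [(word_id, weight / norm) for word_id, weight in raw]

for word_id, weight in weights:
    print(word_id, round(weight, 4))
```

Words that occur in only one document (IDs 2, 4 and 5) receive the largest weight, while "football" (ID 3), which appears in four of the nine documents, receives the smallest.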

## n-grams and some more preprocessing

In [18]:
import gensim
bigram = gensim.models.Phrases(texts)

In [19]:
texts = [bigram[line] for line in texts]
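
`Phrases` scans the token lists for adjacent word pairs that co-occur often enough to be treated as a single token, and joins them with an underscore. With a corpus this small, the default thresholds are unlikely to promote any pair. The plain-Python sketch below illustrates the general idea with a simple count threshold; it is a simplified stand-in with hypothetical names and toy data, not Gensim's scoring function.

```python
from collections import Counter

# Hypothetical toy corpus in which 'new york' appears repeatedly.
toy_texts = [
    ['new', 'york', 'bank', 'open'],
    ['football', 'club', 'new', 'york'],
    ['new', 'york', 'stadium'],
]

def merge_frequent_bigrams(texts, min_count=2, delimiter='_'):
    """Join adjacent token pairs that occur at least min_count times."""
    pair_counts = Counter(
        pair for text in texts for pair in zip(text, text[1:])
    )
    merged_texts = []
    for text in texts:
        merged, i = [], 0
        while i < len(text):
            pair = tuple(text[i:i + 2])
            if len(pair) == 2 and pair_counts[pair] >= min_count:
                merged.append(delimiter.join(pair))
                i += 2  # skip both tokens of the merged pair
            else:
                merged.append(text[i])
                i += 1
        merged_texts.append(merged)
    return merged_texts

merged = merge_frequent_bigrams(toy_texts)
print(merged)
# [['new_york', 'bank', 'open'], ['football', 'club', 'new_york'], ['new_york', 'stadium']]
```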



In [20]:
dictionary = corpora.Dictionary(texts)

corpus = [dictionary.doc2bow(text) for text in texts]

In [21]:
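# Keep tokens found in at least 20 documents and in at most 50% of them; rebuild the corpus afterwards, since IDs are reassigned.
dictionary.filter_extremes(no_below=20, no_above=0.5)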
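
`filter_extremes` prunes the dictionary by document frequency: `no_below` is an absolute minimum number of documents, while `no_above` is a maximum fraction of the corpus. The sketch below approximates that rule in plain Python on the lemmatized texts from earlier. The function name and the gentler thresholds in the second call are assumptions for illustration, and Gensim's additional `keep_n` cap is omitted. With only nine documents, `no_below=20` leaves nothing, so thresholds like these only make sense on a much larger corpus.

```python
from collections import Counter

texts = [
    ['football', 'club', 'arsenal', 'defeat', 'local', 'rival', 'weekend'],
    ['weekend', 'football', 'frenzy', 'take', 'london'],
    ['bank', 'open', 'takeover', 'bid', 'lose', 'million'],
    ['london', 'football', 'club', 'bid', 'wembley', 'stadium'],
    ['arsenal', 'bid', 'pound', 'striker', 'kane'],
    ['financial', 'trouble', 'result', 'loss', 'million', 'bank'],
    ['western', 'bank', 'file', 'bankruptcy', 'financial', 'loss'],
    ['london', 'football', 'club', 'take', 'oil', 'millionaire', 'russia'],
    ['bank', 'finance', 'work', 'russia'],
]

def surviving_tokens(texts, no_below, no_above):
    """Tokens whose document frequency passes both thresholds (approximation)."""
    # Count each token once per document it appears in.
    doc_freq = Counter(token for text in texts for token in set(text))
    max_docs = no_above * len(texts)
    return sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)

print(surviving_tokens(texts, no_below=20, no_above=0.5))  # []
print(surviving_tokens(texts, no_below=3, no_above=0.5))
# ['bank', 'bid', 'club', 'football', 'london']
```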