# Gensim – Vectorizing Text and Transformations and n-grams

## Intro
- vector spaces
- bag-of-words
- TF-IDF (term frequency-inverse doc frequency)
- LSI (latent semantic indexing)
- word2vec

---
Primary features of Gensim are its memory-independent nature and its multicore implementations of latent semantic analysis, latent Dirichlet allocation, and random projection.

## Vectors and why we need them

- we are expected to pass vectors as input to IR algorithms (e.g. LDA, LSI) because they involve matrices
- these models are called **Vector Space Models**

- ML algorithms use these vectors to make predictions.
- their purpose is to learn from the provided data by decreasing the error of their predictions.


## Bag-of-words
- most straightforward form of representing a sentence as a vector

```
S1:"The dog sat by the mat."
S2:"The cat loves the dog."
```

After basic preprocessing (stop words removed, words lemmatized):

```
S1:"dog sat mat."
S2:"cat love dog."
```

As a Python list:

```python
S1 = ['dog', 'sat', 'mat']
S2 = ['cat', 'love', 'dog']
```

If we want to represent this as a vector, we first need to construct our vocabulary, which would be the unique words found in the sentences.

```python
Vocab = ['dog', 'sat', 'mat', 'love', 'cat']
```
- each sentence is represented as a vector with a length of $5$
- in other words, our vectors have $5$ dimensions

The bag-of-words model uses word frequencies to construct our vectors.

```
S1:[1, 1, 1, 0, 0]
S2:[1, 0, 0, 1, 1]
```
- there is 1 occurrence of `dog` in the first sentence
- there are 0 occurrences of `love` in the first sentence
- if the first sentence had 2 occurrences of the word `dog`, it would be represented as:
```
S1: [2, 1, 1, 0, 0]
```
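
The counting step can be sketched in plain Python. This is a minimal illustration, not Gensim code; `bag_of_words` is a hypothetical helper name, and the vocabulary order is the one used in these notes.

```python
from collections import Counter

# Vocabulary and preprocessed sentences copied from the notes above.
vocab = ['dog', 'sat', 'mat', 'love', 'cat']
s1 = ['dog', 'sat', 'mat']
s2 = ['cat', 'love', 'dog']


def bag_of_words(tokens, vocab):
    """Return a count vector with one slot per vocabulary word (hypothetical helper)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


s1_vec = bag_of_words(s1, vocab)
s2_vec = bag_of_words(s2, vocab)
s1_twice = bag_of_words(['dog', 'dog', 'sat', 'mat'], vocab)
print(s1_vec)    # [1, 1, 1, 0, 0]
print(s2_vec)    # [1, 0, 0, 1, 1]
print(s1_twice)  # [2, 1, 1, 0, 0]
```

Note that word order is lost: only the counts per vocabulary slot survive, which is where the name "bag" comes from.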



## TF-IDF

*term frequency-inverse document frequency*

- used in search engines to find relevant docs based on a query
- tries to encode two different kinds of information
    - term frequency: **TF** is the number of times a word appears in a document (divided by the total number of terms in that document)
    - inverse document frequency: **IDF** helps us understand the importance of a word in a document by calculating the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of docs containing the term, and then taking the log of that quotient)
    
$$
TF(t) = \frac{\text{number of times term } t \text{ appears in a doc}}{\text{total number of terms in the doc}}
$$

$$
IDF(t) = \log_e\frac{\text{total num of docs}}{\text{num of docs with term } t \text{ in it}}
$$

**TF-IDF** is the product of these two factors

- makes rare words more prominent and effectively ignores common words such as *is, of, that*, which may appear many times but carry little importance.
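
As a worked example, the two formulas can be applied to the toy sentences from the bag-of-words section. This is a minimal plain-Python sketch of the definitions as written in these notes; the helper names `tf`, `idf`, and `tf_idf` are hypothetical.

```python
import math

# Preprocessed toy sentences from the bag-of-words section.
docs = [
    ['dog', 'sat', 'mat'],
    ['cat', 'love', 'dog'],
]


def tf(term, doc):
    """Occurrences of `term` divided by the total number of terms in `doc`."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Natural log of (total docs / docs containing `term`).

    Assumes `term` occurs in at least one document.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    """Product of the two factors."""
    return tf(term, doc) * idf(term, docs)


dog_score = tf_idf('dog', docs[0], docs)  # 'dog' is in every doc, so IDF = ln(2/2) = 0
sat_score = tf_idf('sat', docs[0], docs)  # (1/3) * ln(2/1)
print(dog_score, sat_score)
```

`dog` appears in both documents, so its score drops to zero, while `sat` appears in only one and keeps a positive weight.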

## Vector transformations in Gensim
- a corpus is a collection of documents

```python
from gensim import corpora

documents = [
    "Football club Arsenal defeat local rivals this weekend.",
    "Weekend football frenzy takes over London.",
    "Bank open for takeover bids after losing millions.",
    "London football clubs bid to move to Wembley stadium.",
    "Arsenal bid 50 million pounds for striker Kane.",
    "Financial troubles result in loss of millions for bank.",
    "Western bank files for bankruptcy after financial losses.",
    "London football club is taken over by oil millionaire from Russia.",
    "Banking on finances not working for Russia.",
]
```


```python
import spacy

# 'en' is a spaCy 2.x shortcut; with spaCy 3.x use a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')

texts = []

for document in documents:
    text = []
    doc = nlp(document)
    for w in doc:
        # keep the lemma of every token that is not a stop word, punctuation, or number-like
        if not w.is_stop and not w.is_punct and not w.like_num:
            text.append(w.lemma_)
    texts.append(text)
texts
```

```
[['football', 'club', 'arsenal', 'defeat', 'local', 'rival', 'weekend'],
 ['weekend', 'football', 'frenzy', 'take', 'london'],
 ['bank', 'open', 'takeover', 'bid', 'lose', 'million'],
 ['london', 'football', 'club', 'bid', 'wembley', 'stadium'],
 ['arsenal', 'bid', 'pound', 'striker', 'kane'],
 ['financial', 'trouble', 'result', 'loss', 'million', 'bank'],
 ['western', 'bank', 'file', 'bankruptcy', 'financial', 'loss'],
 ['london', 'football', 'club', 'take', 'oil', 'millionaire', 'russia'],
 ['bank', 'finance', 'work', 'russia']]
```

```python
documents
```

```
['Football club Arsenal defeat local rivals this weekend.',
 'Weekend football frenzy takes over London.',
 'Bank open for takeover bids after losing millions.',
 'London football clubs bid to move to Wembley stadium.',
 'Arsenal bid 50 million pounds for striker Kane.',
 'Financial troubles result in loss of millions for bank.',
 'Western bank files for bankruptcy after financial losses.',
 'London football club is taken over by oil millionaire from Russia.',
 'Banking on finances not working for Russia.']
```

- create a bag-of-words representation for our mini-corpus.
- Gensim allows us to do this through its `corpora.Dictionary` class

```python
dictionary = corpora.Dictionary(texts)
len(dictionary.token2id)
```

```
33
```

```python
dictionary.token2id
```

```
{'arsenal': 0,
 'club': 1,
 'defeat': 2,
 'football': 3,
 'local': 4,
 'rival': 5,
 'weekend': 6,
 'frenzy': 7,
 'london': 8,
 'take': 9,
 'bank': 10,
 'bid': 11,
 'lose': 12,
 'million': 13,
 'open': 14,
 'takeover': 15,
 'stadium': 16,
 'wembley': 17,
 'kane': 18,
 'pound': 19,
 'striker': 20,
 'financial': 21,
 'loss': 22,
 'result': 23,
 'trouble': 24,
 'bankruptcy': 25,
 'file': 26,
 'western': 27,
 'millionaire': 28,
 'oil': 29,
 'russia': 30,
 'finance': 31,
 'work': 32}
```

- 33 unique words in our corpus (ids 0 through 32)
- all represented in our dictionary with a value
- a word-to-`integer-id` mapping

- we will use the `doc2bow` method to convert each document to a bag-of-words

```python
corpus = [dictionary.doc2bow(text) for text in texts]
corpus
```

```
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)],
 [(3, 1), (6, 1), (7, 1), (8, 1), (9, 1)],
 [(10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)],
 [(1, 1), (3, 1), (8, 1), (11, 1), (16, 1), (17, 1)],
 [(0, 1), (11, 1), (18, 1), (19, 1), (20, 1)],
 [(10, 1), (13, 1), (21, 1), (22, 1), (23, 1), (24, 1)],
 [(10, 1), (21, 1), (22, 1), (25, 1), (26, 1), (27, 1)],
 [(1, 1), (3, 1), (8, 1), (9, 1), (28, 1), (29, 1), (30, 1)],
 [(10, 1), (30, 1), (31, 1), (32, 1)]]
```

- this is a list of lists, where each inner list is the bag-of-words representation of one document.
- each entry is a tuple `(word_id, word_count)`; only words that occur in the document are listed
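
To see where these pairs come from, here is a minimal plain-Python sketch of a word-to-id mapping and a sparse bag-of-words conversion. It is not Gensim's implementation: `build_token2id` and `to_bow` are hypothetical helpers, and the id ordering (document by document, new tokens sorted alphabetically) is an assumption chosen because it matches the ids printed above.

```python
from collections import Counter

# First two preprocessed documents from the corpus above.
texts = [
    ['football', 'club', 'arsenal', 'defeat', 'local', 'rival', 'weekend'],
    ['weekend', 'football', 'frenzy', 'take', 'london'],
]


def build_token2id(texts):
    """Assign integer ids document by document, new tokens in sorted order.

    The ordering is an assumption chosen to match the ids shown in the notes.
    """
    token2id = {}
    for text in texts:
        for token in sorted(set(text)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(text, token2id):
    """Return sorted (word_id, word_count) pairs for tokens known to the mapping."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())


token2id = build_token2id(texts)
sparse_corpus = [to_bow(text, token2id) for text in texts]
print(sparse_corpus[1])  # [(3, 1), (6, 1), (7, 1), (8, 1), (9, 1)]
```

Unlike the dense vectors in the bag-of-words section, this sparse form stores nothing for words with a count of zero, which keeps each document small even when the vocabulary is large.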

We previously mentioned how Gensim is powerful because it uses streaming corpora. But in this case, the entire list is loaded into RAM. This is not a problem for us because it is a toy example, but in real-world cases it might cause problems. How do we get past this?

We can start by storing the corpus, once it is created, to disk. One way to do this is as follows:



```python
# the target directory (docs/tmp/) must already exist
corpora.MmCorpus.serialize('docs/tmp/example.mm', corpus)
```

By storing the corpus to disk and then later loading it from disk, we are far more memory efficient, because at most one vector resides in RAM at a time. The Gensim tutorial [13] on corpora and vector spaces covers a little more than what we discussed so far and may be useful for some readers.
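
The "one vector in RAM at a time" idea can be sketched with a plain-Python iterable that reads one document per line. `StreamedCorpus` and the small `token2id` mapping are hypothetical stand-ins for illustration, not Gensim classes; Gensim's own `corpora.MmCorpus(path)` reads a serialized corpus back in a similarly lazy way.

```python
import os
import tempfile

# Hypothetical mapping; in the notes this role is played by dictionary.token2id.
token2id = {'bank': 0, 'finance': 1, 'russia': 2, 'work': 3}


class StreamedCorpus:
    """Yield one bag-of-words vector at a time from a file with one document per line."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        with open(self.path, encoding='utf-8') as handle:
            for line in handle:  # only the current line is held in memory
                counts = {}
                for token in line.split():
                    if token in self.token2id:
                        word_id = self.token2id[token]
                        counts[word_id] = counts.get(word_id, 0) + 1
                yield sorted(counts.items())


# Write a tiny two-document file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write('bank finance work russia\nbank bank russia\n')

streamed = StreamedCorpus(tmp.name, token2id)
vectors = list(streamed)  # materialized here only to show the result
os.remove(tmp.name)
print(vectors)  # [[(0, 1), (1, 1), (2, 1), (3, 1)], [(0, 2), (2, 1)]]
```

In real use you would loop over `streamed` directly instead of calling `list()` on it, so that the full corpus never sits in memory.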

---

Converting a bag-of-words representation into TF-IDF is easy with Gensim:
- choose the model/representation we want from the `gensim.models` package

```python
from gensim import models

tfidf = models.TfidfModel(corpus)
```

This means that `tfidf` now represents a **TF-IDF** table *trained* on our corpus.

- in the case of **TF-IDF**, the *training* consists simply of going through the supplied corpus once and computing document frequencies of all its features.

So what does a **TF-IDF** representation of our corpus look like?

```python
for document in tfidf[corpus]:
    print(document)
```

```
[(0, 0.3292179861221233), (1, 0.24046829370585296), (2, 0.4809365874117059), (3, 0.1774993848325406), (4, 0.4809365874117059), (5, 0.4809365874117059), (6, 0.3292179861221233)]
[(3, 0.24212967666975266), (6, 0.4490913847888623), (7, 0.6560530929079719), (8, 0.32802654645398593), (9, 0.4490913847888623)]
[(10, 0.18797844084016113), (11, 0.25466485399352906), (12, 0.5093297079870581), (13, 0.3486540744136096), (14, 0.5093297079870581), (15, 0.5093297079870581)]
[(1, 0.29431054749542984), (3, 0.21724253258131512), (8, 0.29431054749542984), (11, 0.29431054749542984), (16, 0.5886210949908597), (17, 0.5886210949908597)]
[(0, 0.354982288765831), (11, 0.25928712547209604), (18, 0.5185742509441921), (19, 0.5185742509441921), (20, 0.5185742509441921)]
[(10, 0.19610384738673725), (13, 0.3637247180792822), (21, 0.3637247180792822), (22, 0.3637247180792822), (23, 0.5313455887718271), (24, 0.5313455887718271)]
[(10, 0.18286519950508276), (21, 0.3391702611796705), (22, 0.3391702611796705), (25, 0.495
```

- each entry is (`word_id`, `product of the TF and IDF score for this word`)
- the higher the score, the more important the word in the document.
- we can use this representation as input for our ML algorithms as well
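
The printed scores do not follow the $\log_e$ formula from the TF-IDF section exactly. Assuming Gensim's default weighting is the raw term count times a base-2 log IDF, followed by scaling each document vector to unit Euclidean length, the plain-Python sketch below reproduces the first row of the output. `tfidf_vector` is a hypothetical helper, not a Gensim function.

```python
import math

# Preprocessed documents from the notes above.
texts = [
    ['football', 'club', 'arsenal', 'defeat', 'local', 'rival', 'weekend'],
    ['weekend', 'football', 'frenzy', 'take', 'london'],
    ['bank', 'open', 'takeover', 'bid', 'lose', 'million'],
    ['london', 'football', 'club', 'bid', 'wembley', 'stadium'],
    ['arsenal', 'bid', 'pound', 'striker', 'kane'],
    ['financial', 'trouble', 'result', 'loss', 'million', 'bank'],
    ['western', 'bank', 'file', 'bankruptcy', 'financial', 'loss'],
    ['london', 'football', 'club', 'take', 'oil', 'millionaire', 'russia'],
    ['bank', 'finance', 'work', 'russia'],
]

num_docs = len(texts)

# Document frequency: in how many documents each token occurs.
doc_freq = {}
for text in texts:
    for token in set(text):
        doc_freq[token] = doc_freq.get(token, 0) + 1


def tfidf_vector(text):
    """Raw count * log2(N / df), then scaled to unit Euclidean length (assumed defaults)."""
    weights = {t: text.count(t) * math.log2(num_docs / doc_freq[t]) for t in set(text)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


first = tfidf_vector(texts[0])
# approximately 0.3292 0.2405 0.4809, matching word ids 0, 1, 2 in the first printed row
print(round(first['arsenal'], 4), round(first['club'], 4), round(first['defeat'], 4))
```

The normalization step explains why the scores in each row are comparable across documents of different lengths.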

## n-grams and some more preprocessing

- an $n$-gram is a contiguous sequence of $n$ items in the text; in our case, the *items* are words
- in other cases the items could be letters, syllables, or sometimes, in the case of speech, phonemes
- a `bi-gram` is when $n=2$
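
The definition translates directly into a sliding window over a token list. A minimal sketch, with `ngrams` as a hypothetical helper name:

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ['london', 'football', 'club', 'bid']
bigram_list = ngrams(tokens, 2)
trigram_list = ngrams(tokens, 3)
print(bigram_list)   # [('london', 'football'), ('football', 'club'), ('club', 'bid')]
print(trigram_list)  # [('london', 'football', 'club'), ('football', 'club', 'bid')]
```

A list of $k$ tokens yields $k - n + 1$ n-grams, and none at all when $n > k$.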

### bigram calculations
- a bigram model uses the conditional probability of a token given the preceding token
- it works by picking out words that appear next to each other
- pairs of words that are more likely to appear together are called collocations
    - ex: New York
    - machine learning
    - if we identify with high probability that the word York follows the word New, it is worth considering "New York" as one entity
- we **must get rid of stop words before running a bi-gram model on our corpus**, as there could be meaningless bi-grams formed.

The Gensim **bi-gram** model is basically an implementation of collocation identification.
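
To make collocation identification concrete, here is a minimal plain-Python sketch. The scoring function is modeled on the formula Gensim documents as its default for `Phrases`; treat the exact form as an assumption. The sentences, `MIN_COUNT`, `THRESHOLD`, and the helper names are all hypothetical and chosen so that the toy data produces a result.

```python
from collections import Counter

# Hypothetical tokenized sentences, not taken from the corpus in these notes.
sentences = [
    ['new', 'york', 'weather', 'cold'],
    ['visit', 'new', 'york', 'museum'],
    ['new', 'york', 'pizza', 'cold'],
    ['buy', 'new', 'car'],
]

MIN_COUNT = 1    # assumption: tiny value so the toy data can pass
THRESHOLD = 1.0  # assumption: tiny value so the toy data can pass

unigrams = Counter(tok for sent in sentences for tok in sent)
bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
vocab_size = len(unigrams)


def score(a, b):
    """Collocation score: (count(a, b) - min_count) * vocab_size / (count(a) * count(b))."""
    return (bigrams[(a, b)] - MIN_COUNT) * vocab_size / (unigrams[a] * unigrams[b])


def merge_collocations(sent):
    """Join adjacent tokens whose score exceeds THRESHOLD with an underscore."""
    merged, i = [], 0
    while i < len(sent):
        if i + 1 < len(sent) and score(sent[i], sent[i + 1]) > THRESHOLD:
            merged.append(sent[i] + '_' + sent[i + 1])
            i += 2
        else:
            merged.append(sent[i])
            i += 1
    return merged


merged_sentences = [merge_collocations(sent) for sent in sentences]
print(merged_sentences[0])  # ['new_york', 'weather', 'cold']
```

Here `new york` scores $(3 - 1) \times 9 / (4 \times 3) = 1.5$ and is merged, while pairs seen only once score 0. Gensim's defaults are `min_count=5` and `threshold=10.0`, so on a corpus as small as the nine sentences in these notes few or no bigrams are likely to pass.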

- Gensim approaches bigrams by simply combining the two high-probability tokens with an underscore.
- the tokens `new` and `york` will now become `new_york` instead
- as with the TF-IDF model, bigrams can be created using another Gensim model, **Phrases**

```python
import gensim
bigram = gensim.models.Phrases(texts)
```

- we now have a trained bi-gram model for our corpus.
- we can perform our *transformation* on the text the same way we did with TF-IDF
- we re-create our tokenized texts like this

```python
texts = [bigram[line] for line in texts]
```



- each line will now have all of its detected bi-grams merged into single tokens

- Since creating new phrases adds words to our dictionary, this step must be done before we create our dictionary. We would have to run this:

```python
dictionary = corpora.Dictionary(texts)

corpus = [dictionary.doc2bow(text) for text in texts]
```

One popular preprocessing technique involves removing both high-frequency and low-frequency words. We can do this in Gensim with the dictionary module. Let's say we would like to get rid of words that occur in fewer than 20 documents, or in more than 50% of the documents; we would add the following:

```python
dictionary.filter_extremes(no_below=20, no_above=0.5)
```
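
The filtering rule itself can be sketched in plain Python. This is an illustration of the document-frequency test only, not Gensim's implementation (which has further options, such as capping the vocabulary size); the corpus and the standalone `filter_extremes` function below are hypothetical.

```python
# Hypothetical tokenized corpus of five short documents.
texts = [
    ['bank', 'loss', 'million'],
    ['bank', 'football', 'club'],
    ['bank', 'football', 'london'],
    ['bank', 'loss', 'russia'],
    ['bank', 'club', 'oil'],
]


def filter_extremes(texts, no_below, no_above):
    """Keep tokens found in at least `no_below` docs and at most `no_above` (a fraction) of docs."""
    num_docs = len(texts)
    doc_freq = {}
    for text in texts:
        for token in set(text):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {
        token for token, df in doc_freq.items()
        if df >= no_below and df <= no_above * num_docs
    }


kept = filter_extremes(texts, no_below=2, no_above=0.5)
print(sorted(kept))  # ['club', 'football', 'loss']
```

`bank` is dropped for being too common (5 of 5 documents) and the one-off words for being too rare. Note that `no_below` is an absolute document count: with `no_below=20`, a nine-document corpus like the one in these notes would lose every token.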


## Summary
We've seen in this chapter why it makes sense to change our representation of text from words to numbers, and why this is the only language a computer understands. There are different ways computers can interpret words, and TF-IDF and bag of words are two such vector representations. Gensim is a Python package that offers us ways to generate such vector representations, which are later used as inputs into various machine learning and information retrieval algorithms.

There are further preprocessing techniques such as creating n-grams, collocations and removing low-frequency words, which can help us arrive at better results. The concepts of vectors form a basis in natural language processing and we can now get back to using spaCy's pipelines; indeed, Chapter 5, POS-Tagging and Its Applications, Chapter 6, NER-Tagging and Its Applications, and Chapter 7, Dependency Parsing, all showcase the power of spaCy, and we will start with POS-tagging algorithms using spaCy.