![BTS](img/Logo-BTS.jpg)

# Session 5: Text Mining (II)

### Juan Luis Cano Rodríguez <juan.cano@bts.tech> - Data Science Foundations (2018-10-19)

Open this notebook in Google Colaboratory: [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Juanlu001/bts-mbds-data-science-foundations/blob/master/sessions/05-Text-Mining-II.ipynb)

In [5]:
# Source: http://billchambers.me/tutorials/2014/12/21/tf-idf-explained-in-python.html

t0 = "China has a strong economy that is growing at a rapid pace. However politically it differs greatly from the US Economy."
t1 = "At last, China seems serious about confronting an endemic problem: domestic violence and corruption."
t2 = "Japan's prime minister, Shinzo Abe, is working towards healing the economic turmoil in his own country for his view on the future of his people."
t3 = "Vladimir Putin is working hard to fix the economy in Russia as the Ruble has tumbled."
t4 = "What's the future of Abenomics? We asked Shinzo Abe for his views"
t5 = "Obama has eased sanctions on Cuba while accelerating those against the Russian Economy, even as the Ruble's value falls almost daily."
t6 = "Vladimir Putin was found to be riding a horse, again, without a shirt on while hunting deer. Vladimir Putin always seems so serious about things - even riding horses."

## Exercise 1: Jaccard similarity

1. Write a function `lemmatize` that receives a spaCy `Doc` and returns a list of lemmas as strings, removing stop words, punctuation marks and whitespace
2. Write a function that receives two spaCy `Doc`s and returns a floating point number representing their Jaccard similarity, using the formula below (hint: use [`set`s](https://docs.python.org/3/library/stdtypes.html#set))
3. Compute the Jaccard similarity between `t0` and `t1`
4. Create a pandas `DataFrame` that holds the Jaccard similarity of all the text combinations from `t0` to `t6` (hint: use [`enumerate`](http://book.pythontips.com/en/latest/enumerate.html#enumerate))

$$ J(A,B) = {{|A \cap B|}\over{|A \cup B|}} $$
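
To make the formula concrete, here is a minimal sketch on plain Python sets. The two sets are made-up examples rather than lemmas produced by spaCy, `jaccard_sets` is a hypothetical helper name, and returning 0.0 for two empty sets is an assumption of this sketch (the formula itself is undefined in that case).

```python
def jaccard_sets(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:
        # Assumption: two empty sets are given similarity 0.0
        return 0.0
    return len(a & b) / len(union)


# Made-up example sets
a = {"china", "economy", "grow"}
b = {"china", "problem", "violence"}

# 1 shared element out of 5 distinct ones
print(jaccard_sets(a, b))  # 0.2
```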

In [6]:
import spacy

In [7]:
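nlp = spacy.load("en")  # spaCy 2.x shortcut; spaCy 3+ needs the full model name, e.g. "en_core_web_sm"
doc0, doc1, doc2 = nlp(t0), nlp(t1), nlp(t2)  # parsed texts used in the cells below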

In [8]:
from spacy.lang.en.stop_words import STOP_WORDS

In [24]:
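def lemmatize(doc):
    # Keep the lemmas of content tokens only
    return [
        token.lemma_ for token in doc
        if not token.is_punct and not token.is_space
        # "US" lowercases to the stop word "us", so keep it explicitly
        and (token.text == "US" or token.lower_ not in STOP_WORDS)
        and token.tag_ != "POS"  # drop the possessive marker "'s"
    ]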

lemmatize(doc0)

['china',
 'strong',
 'economy',
 'grow',
 'rapid',
 'pace',
 'politically',
 'differ',
 'greatly',
 'us',
 'economy']

In [32]:
type(doc0)

spacy.tokens.doc.Doc

In [18]:
doc0

China has a strong economy that is growing at a rapid pace. However politically it differs greatly from the US Economy.

In [31]:
doc0[-3]

US

In [28]:
doc0[-3].text

'US'

In [27]:
type(doc0[-3].text)

str

In [17]:
lemmatize(doc0)

['china',
 'strong',
 'economy',
 'grow',
 'rapid',
 'pace',
 'politically',
 'differ',
 'greatly',
 'economy']

In [14]:
doc2[1].pos_

'PART'

In [15]:
doc2[1].tag_

'POS'

In [39]:
l1 = set(lemmatize(doc1))
l2 = set(lemmatize(doc2))

In [41]:
print(l1)

{'violence', 'confront', 'corruption', 'problem', 'endemic', 'domestic', 'china'}


In [42]:
print(l2)

{'turmoil', 'japan', 'economic', 'minister', 'view', 'country', 'shinzo', 'heal', 'prime', 'people', 'abe', 'work', 'future'}


In [46]:
l1 & l2

set()

In [45]:
l1 - l2

{'china',
 'confront',
 'corruption',
 'domestic',
 'endemic',
 'problem',
 'violence'}

In [44]:
l1 | l2

{'abe',
 'china',
 'confront',
 'corruption',
 'country',
 'domestic',
 'economic',
 'endemic',
 'future',
 'heal',
 'japan',
 'minister',
 'people',
 'prime',
 'problem',
 'shinzo',
 'turmoil',
 'view',
 'violence',
 'work'}

In [47]:
len([1, 2, 3])

3

In [48]:
len({1, 2, 3, 4, 5, 5, 5, 5})

5

In [38]:
def jaccard(doc1, doc2):
    lemmas1 = set(lemmatize(doc1))
    lemmas2 = set(lemmatize(doc2))
    return len(lemmas1.intersection(lemmas2)) / len(lemmas1.union(lemmas2))

jaccard(doc0, doc1)

0.0625
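
As a check on this value, using the outputs shown earlier: `lemmatize(doc0)` (the version that keeps `'us'`) holds 10 distinct lemmas, since `'economy'` appears twice, `set(lemmatize(doc1))` holds 7, and the only lemma they share is `'china'`. The union therefore has 10 + 7 - 1 = 16 elements:

$$ J(t_0, t_1) = {{1}\over{10 + 7 - 1}} = {{1}\over{16}} = 0.0625 $$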

In [36]:
texts = [t0, t1, t2, t3, t4, t5, t6]

In [49]:
docs = [nlp(text) for text in texts]

In [55]:
letters = ['a', 'b', 'c']

for position, letter in enumerate(letters):
    print(position, letter)

0 a
1 b
2 c


In [57]:
list(enumerate(letters))

[(0, 'a'), (1, 'b'), (2, 'c')]

In [59]:
import pandas as pd

In [63]:
df = pd.DataFrame(index=range(7), columns=range(7), dtype=float)

for ii, doc_a in enumerate(docs):
    for jj, doc_b in enumerate(docs):
        df.loc[ii, jj] = jaccard(doc_a, doc_b)
        #print("text ", ii, "against text ", jj, ": ", jaccard(doc_a, doc_b))

df

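|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0625 | 0.0 | 0.055556 | 0.0 | 0.05 | 0.0 |
| 1 | 0.0625 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.047619 | 0.25 | 0.0 | 0.0 |
| 3 | 0.055556 | 0.0 | 0.047619 | 1.0 | 0.0 | 0.111111 | 0.125 |
| 4 | 0.0 | 0.0 | 0.25 | 0.0 | 1.0 | 0.0 | 0.0 |
| 5 | 0.05 | 0.0 | 0.0 | 0.111111 | 0.0 | 1.0 | 0.0 |
| 6 | 0.0 | 0.0 | 0.0 | 0.125 | 0.0 | 0.0 | 1.0 |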


In [69]:
for ii, doc_a in enumerate(docs):
    for jj, doc_b in enumerate(docs):
        print("text ", ii, "against text ", jj, ": ", jaccard(doc_a, doc_b))


text  0 against text  0 :  1.0
text  0 against text  1 :  0.0625
text  0 against text  2 :  0.0
text  0 against text  3 :  0.05555555555555555
text  0 against text  4 :  0.0
text  0 against text  5 :  0.05
text  0 against text  6 :  0.0
text  1 against text  0 :  0.0625
text  1 against text  1 :  1.0
text  1 against text  2 :  0.0
text  1 against text  3 :  0.0
text  1 against text  4 :  0.0
text  1 against text  5 :  0.0
text  1 against text  6 :  0.0
text  2 against text  0 :  0.0
text  2 against text  1 :  0.0
text  2 against text  2 :  1.0
text  2 against text  3 :  0.047619047619047616
text  2 against text  4 :  0.25
text  2 against text  5 :  0.0
text  2 against text  6 :  0.0
text  3 against text  0 :  0.05555555555555555
text  3 against text  1 :  0.0
text  3 against text  2 :  0.047619047619047616
text  3 against text  3 :  1.0
text  3 against text  4 :  0.0
text  3 against text  5 :  0.1111111111111111
text  3 against text  6 :  0.125
text  4 against text  0 :  0.0
text  4 ag

In [68]:
data = []
for ii, doc_a in enumerate(docs):
    row = []
    for jj, doc_b in enumerate(docs):
        row.append(jaccard(doc_a, doc_b))
        #print("text ", ii, "against text ", jj, ": ", jaccard(doc_a, doc_b))

    data.append(row)

df = pd.DataFrame(data)
df.index = list("abcdefg")
df

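|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| a | 1.0 | 0.0625 | 0.0 | 0.055556 | 0.0 | 0.05 | 0.0 |
| b | 0.0625 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| c | 0.0 | 0.0 | 1.0 | 0.047619 | 0.25 | 0.0 | 0.0 |
| d | 0.055556 | 0.0 | 0.047619 | 1.0 | 0.0 | 0.111111 | 0.125 |
| e | 0.0 | 0.0 | 0.25 | 0.0 | 1.0 | 0.0 | 0.0 |
| f | 0.05 | 0.0 | 0.0 | 0.111111 | 0.0 | 1.0 | 0.0 |
| g | 0.0 | 0.0 | 0.0 | 0.125 | 0.0 | 0.0 | 1.0 |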
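
The matrix is symmetric, because J(A, B) = J(B, A), and its diagonal is 1.0 for any non-empty set, so only the pairs with i < j need to be computed. The following is a minimal sketch of that idea with `itertools.combinations`. It uses small made-up lemma sets in place of `lemmatize`d `Doc`s so that it runs without spaCy, and the names `lemma_sets` and `similarity_matrix` are illustrative, not part of the notebook.

```python
from itertools import combinations


def jaccard_sets(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def similarity_matrix(sets):
    """Build the full symmetric matrix from the upper triangle only."""
    n = len(sets)
    # Diagonal is 1.0 because every non-empty set is identical to itself
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        # Compute each unordered pair once and mirror it
        matrix[i][j] = matrix[j][i] = jaccard_sets(sets[i], sets[j])
    return matrix


# Made-up stand-ins for set(lemmatize(doc)); not the real lemmas
lemma_sets = [
    {"china", "economy"},
    {"china", "problem"},
    {"putin", "economy", "ruble"},
]

for row in similarity_matrix(lemma_sets):
    print(row)
```

The resulting list of lists can be passed to `pd.DataFrame` in the same way as `data` above.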


## Exercise 2: TF-IDF

1. Write a function `tf` that receives a string and a spaCy `Doc` and returns the number of times the word appears in the `lemmatize`d `Doc`
2. Write a function `idf` that receives a string and a list of spaCy `Doc`s and returns _the inverse of_ the number of docs that contain the word
3. Write a function `tf_idf` that receives a string, a spaCy `Doc` and a list of spaCy `Doc`s and returns the product `tf(t, d) · idf(t, D)`.
4. Write a function `all_lemmas` that receives a list of `Doc`s and returns a `set` of all available `lemma`s
5. Write a function `tf_idf_doc` that receives a `Doc` and a list of `Doc`s and returns a dictionary of `{lemma: TF-IDF value}` with one entry for each of the lemmas of all the available documents
6. Write a function `tf_idf_scores` that receives a list of `Doc`s and returns a `DataFrame` displaying the lemmas in the columns and the documents in the rows.
7. Visualize the TF-IDF, like this:

![TF-IDF](img/tf-idf.png)
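
One possible starting point for steps 1 to 5, sketched on plain lists of lemmas rather than spaCy `Doc`s so that it runs without a language model. The `idf` here follows the definition in step 2 (the inverse of the number of documents that contain the word), which is simpler than the logarithmic form found in many references. Returning 0.0 for a word that appears in no document is an assumption of this sketch, and the toy `corpus` is made up.

```python
def tf(word, lemmas):
    """Number of times `word` appears in a list of lemmas."""
    return lemmas.count(word)


def idf(word, corpus):
    """Inverse of the number of lemma lists that contain `word`."""
    n_containing = sum(1 for lemmas in corpus if word in lemmas)
    # Assumption: a word absent from every document gets 0.0
    return 1 / n_containing if n_containing else 0.0


def tf_idf(word, lemmas, corpus):
    """Product of the term frequency and the inverse document frequency."""
    return tf(word, lemmas) * idf(word, corpus)


def all_lemmas(corpus):
    """Set of every lemma seen in any document."""
    return {lemma for lemmas in corpus for lemma in lemmas}


def tf_idf_doc(lemmas, corpus):
    """Map every lemma in the corpus to its TF-IDF value for one document."""
    return {word: tf_idf(word, lemmas, corpus) for word in all_lemmas(corpus)}


# Made-up toy corpus standing in for [lemmatize(doc) for doc in docs]
corpus = [
    ["china", "economy", "economy"],
    ["china", "problem"],
    ["putin", "economy"],
]

scores = tf_idf_doc(corpus[0], corpus)
print(sorted(scores.items()))
```

With spaCy, each list would be `lemmatize(doc)`. For step 6, a list of such dictionaries, one per document, can be passed to `pd.DataFrame` to get documents in the rows and lemmas in the columns.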