# DSCI 521: Methods for analysis and interpretation <br>Chapter 2: Feature engineering and language processing

## Exercises
Note: the exercise numbers refer to the main notes.

#### 2.1.1.3 Exercise: Regex phone numbers
Read the file `phone-numbers.txt`, which contains one phone number per line. \[Hint: use something like `lines = open("file.txt", "r").readlines()`.\] Store only the phone numbers with the area code "215" in a list and print it. Use regex-based pattern matching rather than any other method that occurs to you.

In [3]:
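import re

# read the whole file into one string; each line holds one phone number
with open("./data/phone-numbers.txt", "r") as f:
    document = f.read()

# match numbers of the form 215-ddd-dddd
numbers = re.findall(r'215-[0-9]{3}-[0-9]{4}', document)
numbers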

['215-345-3463', '215-756-8273']
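
If the file is read line by line, as the hint suggests, each line can be checked on its own. The sketch below assumes every number has the form `ddd-ddd-dddd`; the `lines` list is made up for illustration and stands in for the result of `readlines()`.

```python
import re

# Hypothetical stand-in for open("./data/phone-numbers.txt", "r").readlines()
lines = ["215-345-3463\n", "610-215-9999\n", "215-756-8273\n", "267-215-0000\n"]

# fullmatch requires the whole (stripped) line to fit the pattern,
# so "215" has to be the area code and not a later group of digits
pattern = re.compile(r"215-\d{3}-\d{4}")
numbers = [line.strip() for line in lines if pattern.fullmatch(line.strip())]
print(numbers)
```

Checking whole lines this way avoids matching a "215" that appears in the middle of a number.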

#### 2.1.1.8 Exercise: Names of the gods
The cell below contains an extract from [A Clash of Kings](https://www.goodreads.com/book/show/10572.A_Clash_of_Kings) in which a character prays to some fictional gods. Use regex to extract the names of these gods. Your output should be a list that looks something like `["the Father", "the Mother", "the Warrior"]`.

In [5]:
text = 'Lost and weary, Catelyn Stark gave herself over to her gods. She knelt before the Smith, who fixed things that were broken, and asked that he give her sweet Bran his protection. She went to the Maid and beseeched her to lend her courage to Arya and Sansa, to guard them in their innocence. To the Father, she prayed for justice, the strength to seek it and the wisdom to know it, and she asked the Warrior to keep Robb strong and shield him in his battles. Lastly she turned to the Crone, whose statues often showed her with a lamp in one hand. "Guide me, wise lady," she prayed. "Show me the path I must walk, and do not let me stumble in the dark places that lie ahead."'

# each god is named as "the" followed by a single capitalized word
gods = re.findall(r"the [A-Z][a-z]+", text)

gods

['the Smith', 'the Maid', 'the Father', 'the Warrior', 'the Crone']
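
To return only the names, without the leading article, a capture group can be added. This is a minimal sketch on a shortened, made-up sentence rather than the full extract. When a pattern contains exactly one group, `re.findall` returns the group's contents instead of the whole match.

```python
import re

# Shortened, made-up sample in the style of the extract
sample = "She knelt before the Smith, went to the Maid, and asked the Warrior for strength."

# \b keeps "the" from matching at the end of a longer word such as "bathe"
with_article = re.findall(r"\bthe [A-Z][a-z]+", sample)

# The parentheses capture only the capitalized name
names_only = re.findall(r"\bthe ([A-Z][a-z]+)", sample)

print(with_article)
print(names_only)
```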

#### 2.1.2.4 Exercise: Improving a regex-based sentence tokenizer
First, write a few sentences in a complex (but grammatically acceptable) way so that the regex-based tokenizer from the main notes breaks. Then fix the pattern so that the tokenizer handles your text appropriately.

In [12]:
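## regex-based sentence tokenizer
sentences = "With all due resp., I don't think this is a very good tokenization! Here's another one!"
## split after sentence-final punctuation, unless a letter, digit, or comma follows it;
## the lookbehind also matches just past the whitespace, so drop the empty piece this creates
pieces = re.split(r"\s*(?<=[.?!][^a-zA-Z0-9,])\s*", sentences)
sentences_tokenized = [piece for piece in pieces if piece]
sentences_tokenized

["With all due resp., I don't think this is a very good tokenization!",
 "Here's another one!"]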

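One possible way to make a regex tokenizer more robust is to split only on whitespace that follows sentence-final punctuation and precedes a capital letter, while skipping a short list of known abbreviations. The sketch below is one such variant, not the pattern from the main notes. The sample text and the abbreviation list (`Dr.`, `Mr.`, `Mrs.`) are assumptions chosen for illustration.

```python
import re

text = "Dr. Lee wrote the notes, e.g. this one. Are they clear? Mostly!"

# Illustrative abbreviation list (an assumption; extend as needed)
sentence_boundary = re.compile(
    r"(?<=[.?!])"                    # just after sentence-final punctuation
    r"(?<!Dr\.)(?<!Mr\.)(?<!Mrs\.)"  # unless that punctuation ends a listed abbreviation
    r"\s+"                           # the whitespace that gets removed by the split
    r"(?=[A-Z])"                     # the next sentence starts with a capital letter
)

tokenized = sentence_boundary.split(text)
print(tokenized)
```

Here "Dr." is protected by the abbreviation list and "e.g." by the capital-letter requirement. A sentence that begins with a lowercase word or a quotation mark would still be missed.
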
#### 2.1.3.2 Exercise: POS tagging 
Apply POS tagging to a sentence of your choosing and filter for only verbs and nouns.

In [8]:
import spacy

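# the "en" shortcut works in spaCy 2.x; spaCy 3 and later need a full model name, e.g. spacy.load("en_core_web_sm")
nlp = spacy.load("en")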

running_sentence = "Use some of our test sentences; Joey's not very smart, nor charming."
doc = nlp(running_sentence)

print("token\tcoarse\tfine")
for token in doc:
    # keep verbs and nouns, counting proper nouns as nouns
    if token.pos_ in {"NOUN", "VERB", "PROPN"}:
        print(token.text + "\t" + token.pos_ + "\t" + token.tag_)

token	coarse	fine
Use	VERB	VB
test	NOUN	NN
sentences	NOUN	NNS
Joey	PROPN	NNP
's	VERB	VBZ


#### 2.1.3.5 Exercise: using grammar for information extraction
Apply spaCy's grammatical (dependency) parsing and extract any subject-verb token pairs.

In [9]:
running_sentence = "Let's use another one. Anything else? Happy hour is tomorrow at 5:30 at Tap House where we will all meet up and say hi."
doc = nlp(running_sentence)

print("subject\tverb")
for token in doc:
    # a nominal subject (nsubj) whose head is a verb gives one subject-verb pair
    if token.dep_ == "nsubj" and token.head.pos_ == "VERB":
        print(token.text + "\t" + token.head.text)

subject	verb
's	use
hour	is
we	meet


#### 2.1.4.4 Exercise: improved word frequency representation
Build a stop word list and lemmatization strategy (potentially using POS tags) to compute 'better' word frequencies, as you see fit.

In [21]:
from collections import Counter

text = """Word frequencies are probably the first and easiest 
numerical representation of text to compute. In some communities, 
this is referred to as the bag of words (BOW) model. 
Put simply, the BOW model simply counts up the 
number of times each word appears in a document. 
This of course depends on a few things, e.g., case and lemmatization. 
However, constructing a basic BOW model is quite straightforward, especially using `Counter`. 
Let's use this very paragraph as our example text for the BOW model."""

# In addition to excluding stop words, also exclude specific parts of speech:
# determiners, particles, punctuation, and adpositions.

stop_words = {'\n', ',', '.', '`', 'the', 'and', 'of'}
excluded_pos = {"DET", "PART", "PUNCT", "ADP"}

doc = nlp(text)
word_counts = Counter()

for word in doc:
    # count lemmas so that inflected forms of a word share one entry
    if word.lemma_ not in stop_words and word.pos_ not in excluded_pos:
        word_counts[word.lemma_] += 1

word_counts.most_common(25)

[('BOW', 4),
 ('model', 4),
 ('word', 3),
 ('be', 3),
 ('text', 2),
 ('simply', 2),
 ('use', 2),
 ('frequency', 1),
 ('probably', 1),
 ('first', 1),
 ('easy', 1),
 ('numerical', 1),
 ('representation', 1),
 ('compute', 1),
 ('community', 1),
 ('refer', 1),
 ('bag', 1),
 ('put', 1),
 ('count', 1),
 ('number', 1),
 ('time', 1),
 ('appear', 1),
 ('document', 1),
 ('course', 1),
 ('depend', 1)]
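
For comparison, the same idea can be sketched without spaCy. Everything below is an assumption made for illustration: the sample text, the tiny stop list, and the hand-written `lemmas` lookup that stands in for a real lemmatizer. Unlike the solution above, this version lowercases every token, so "BOW" is counted as "bow".

```python
import re
from collections import Counter

sample = "The BOW model counts words. A BOW model ignores the order of the words."

stop_words = {"the", "a", "of"}  # illustrative stop list
# Hand-made lemma lookup standing in for a real lemmatizer
lemmas = {"words": "word", "counts": "count", "ignores": "ignore"}

# Lowercase, keep alphabetic tokens only, drop stop words, then map to lemmas
tokens = re.findall(r"[a-z]+", sample.lower())
counts = Counter(lemmas.get(tok, tok) for tok in tokens if tok not in stop_words)

print(counts.most_common())
```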

#### 2.1.6.5 Exercise: exploring TF-IDF
For each document in the example TF-IDF matrix, rank the words by TF-IDF value from high to low and interpret the kinds of words that have high TF-IDF values, i.e., are 'more important'. What kinds of words have low values?

In [29]:
import numpy as np

def count_words(sentence):
    frequency = Counter()
    for word in sentence:
        frequency[word.text.lower()] += 1
    return frequency

text = '''Lost and weary, Catelyn Stark gave herself over to her gods. 
She knelt before the Smith, who fixed things that were broken, 
and asked that he give her sweet Bran his protection. 
She went to the Maid and beseeched her to lend her courage to Arya and Sansa, 
to guard them in their innocence. 
To the Father, she prayed for justice, the strength to seek it and the wisdom to know it, 
and she asked the Warrior to keep Robb strong and shield him in his battles. 
Lastly she turned to the Crone, whose statues often showed her with a lamp in one hand. 
"Guide me, wise lady," she prayed. 
"Show me the path I must walk, and do not let me stumble in the dark places that lie ahead."
'''

doc = nlp(text)
    
## the 'master' set, keeps track of the words in all documents
all_words = set()

## store the word frequencies by document (here, each sentence is treated as a document)
all_doc_frequencies = {}

## loop over the sentences
for j, sentence in enumerate(doc.sents):
    frequency = count_words(sentence)
    all_doc_frequencies[j] = frequency
    doc_words = set(frequency.keys())
    all_words = all_words.union(doc_words)
    
## create a matrix of zeros: (words) x (documents)
TDM = np.zeros((len(all_words),len(all_doc_frequencies)))
## fix a word ordering for the rows
all_words = sorted(list(all_words))
## loop over the (sorted) document numbers and (ordered) words; fill in matrix
for j in all_doc_frequencies:
    for i, word in enumerate(all_words):
        TDM[i,j] = all_doc_frequencies[j][word]

num_docs = TDM.shape[1]

## start off with a copy of our TDM (frequencies)
TFIDF = np.array(TDM)
## loop over words
for i, word in enumerate(all_words):
    ## count docs containing the word
    num_docs_containing_word = len([x for x in TDM[i] if x])
    ## compute the inverse document frequency (IDF) of this word
    IDF = -np.log2(num_docs_containing_word/num_docs)
    ## multiply this row by the IDF to transform it to TFIDF
    TFIDF[i,] = TFIDF[i,]*IDF

In [30]:
for j in range(TFIDF.shape[1]):
    doc_vals = TFIDF[:,j]
    
    # pair each word with its TF-IDF value, sort the pairs from high to low,
    # and keep only the words with a non-zero TF-IDF value
    
    words_and_vals = [(word, val) for word, val in sorted(zip(all_words, doc_vals), key = lambda x: x[1], reverse = True) if val]
    print("For document #" + str(j) + ", words ranked according to TF-IDF are:\n")
    for word, val in words_and_vals:
        print(word + "\t" + str(round(val, 4)))
    print()

For document #0, words ranked according to TF-IDF are:

catelyn	2.8074
gave	2.8074
gods	2.8074
herself	2.8074
lost	2.8074
over	2.8074
stark	2.8074
weary	2.8074
her	0.8074
to	0.8074
and	0.4854

For document #1, words ranked according to TF-IDF are:

that	3.6147
before	2.8074
bran	2.8074
broken	2.8074
fixed	2.8074
give	2.8074
he	2.8074
knelt	2.8074
protection	2.8074
smith	2.8074
sweet	2.8074
things	2.8074
were	2.8074
who	2.8074
asked	1.8074
his	1.8074
her	0.8074
and	0.4854
she	0.4854
the	0.4854

For document #2, words ranked according to TF-IDF are:

to	3.2294
arya	2.8074
beseeched	2.8074
courage	2.8074
guard	2.8074
innocence	2.8074
lend	2.8074
maid	2.8074
sansa	2.8074
their	2.8074
them	2.8074
went	2.8074
her	1.6147
and	0.9709
in	0.8074
she	0.4854
the	0.4854

For document #3, words ranked according to TF-IDF are:

it	5.6147
to	3.2294
battles	2.8074
father	2.8074
for	2.8074
him	2.8074
justice	2.8074
keep	2.8074
know	2.8074
robb	2.8074
seek	2.8074
shield	2.8074
strength	2.8074
strong	2.8074

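Each sentence is treated as a document here. Words that occur in only one sentence, such as names and content words ('catelyn', 'smith', 'justice'), get the highest TF-IDF values, and repeating such a word within a sentence raises its value further ('it' in document #3). Words that occur in most sentences, such as the function words 'and', 'she', and 'the', get the lowest values: the more documents a word appears in, the lower its IDF.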
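
The same pattern shows up in a much smaller example. The sketch below uses three made-up documents and the weighting from the cell above, term frequency times `log2(num_docs / num_docs_containing_word)`, which is the same quantity as `-log2(num_docs_containing_word / num_docs)`. The documents and the helper name `tfidf` are assumptions for illustration.

```python
import math
from collections import Counter

# Three made-up documents, already tokenized
docs = [
    "she prayed to the father".split(),
    "she asked the warrior".split(),
    "the crone held a lamp".split(),
]

num_docs = len(docs)
# Number of documents that contain each word
doc_freq = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Return {word: tf * log2(N / df)} for one tokenized document."""
    tf = Counter(doc)
    return {w: tf[w] * math.log2(num_docs / doc_freq[w]) for w in tf}

scores = tfidf(docs[0])
for word, val in sorted(scores.items(), key=lambda x: x[1], reverse=True):
    print(word + "\t" + str(round(val, 4)))
```

In the first document, "the" occurs in all three documents and so scores 0, "she" occurs in two and scores low, and the words unique to that document score highest.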