# DSCI 521: Methods for analysis and interpretation <br>Chapter 2: Feature engineering and language processing

## Exercises
Note: numberings refer to the main notes.

#### 2.1.1.3 Exercise: Regex phone numbers
Read the file `phone-numbers.txt`, which contains one phone number per line. \[Hint: use something like `lines = open("file.txt", "r").readlines()`\] Store only the phone numbers with the area code "215" in a list and print it out. Use regex-based pattern matching rather than any other method that occurs to you.

In [1]:
import re
# read all lines; the context manager closes the file afterwards
with open("./data/phone-numbers.txt", "r") as f:
    lines = f.readlines()

numbers = []
for l in lines:
    if re.search('215-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]', l):
        numbers.append(l.strip())
numbers

['215-345-3463', '215-756-8273']

In [2]:
import re
# read the whole file as one string; the context manager closes the file afterwards
with open("./data/phone-numbers.txt", "r") as f:
    document = f.read()

numbers = re.findall('215-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]', 
                     document)
numbers

['215-345-3463', '215-756-8273']
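
The same pattern can be written more compactly with the `\d{3}` quantifier, and `re.fullmatch` rejects lines that carry extra digits around an otherwise valid number. The following is a minimal sketch, not part of the original solutions; the list `sample_lines` is a hypothetical stand-in for the file contents, and only the two "215" numbers in it come from the output above.

```python
import re

# Hypothetical sample standing in for the lines of phone-numbers.txt.
sample_lines = ["215-345-3463\n", "610-555-0100\n", "215-756-8273\n", "1215-345-34630\n"]

# \d{3} means "exactly three digits"; fullmatch requires the whole string to fit.
pattern = re.compile(r"215-\d{3}-\d{4}")

numbers_215 = [line.strip() for line in sample_lines if pattern.fullmatch(line.strip())]
print(numbers_215)  # ['215-345-3463', '215-756-8273']
```

With `re.search` and no anchors, the last sample line would also be accepted, because it contains a valid number as a substring.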

#### 2.1.1.8 Exercise: Names of the gods
The cell below contains some text: an extract from [A Clash of Kings](https://www.goodreads.com/book/show/10572.A_Clash_of_Kings) describing a character's prayer to some fictional gods. Use regex to extract the names of these gods. Your output should be a list that looks something like `["the Father", "the Mother", "the Warrior"]`.

In [3]:
text = 'Lost and weary, Catelyn Stark gave herself over to her gods. She knelt before the Smith, who fixed things that were broken, and asked that he give her sweet Bran his protection. She went to the Maid and beseeched her to lend her courage to Arya and Sansa, to guard them in their innocence. To the Father, she prayed for justice, the strength to seek it and the wisdom to know it, and she asked the Warrior to keep Robb strong and shield him in his battles. Lastly she turned to the Crone, whose statues often showed her with a lamp in one hand. "Guide me, wise lady," she prayed. "Show me the path I must walk, and do not let me stumble in the dark places that lie ahead."'

# code goes here
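
One possible approach, offered as a sketch rather than the intended solution: in this passage every god's name is the word "the" followed by a single capitalized word, so a pattern built on that assumption is enough. The `text` below is a shortened copy of the passage (the closing dialogue is omitted), and the pattern would over-match in text where "the" precedes other capitalized words.

```python
import re

# Shortened copy of the passage above; the closing dialogue is omitted.
text = (
    'Lost and weary, Catelyn Stark gave herself over to her gods. '
    'She knelt before the Smith, who fixed things that were broken, '
    'and asked that he give her sweet Bran his protection. '
    'She went to the Maid and beseeched her to lend her courage to Arya and Sansa, '
    'to guard them in their innocence. '
    'To the Father, she prayed for justice, the strength to seek it and the wisdom to know it, '
    'and she asked the Warrior to keep Robb strong and shield him in his battles. '
    'Lastly she turned to the Crone, whose statues often showed her with a lamp in one hand.'
)

# \bthe : the literal word "the" at a word boundary
# [A-Z][a-z]+ : one capitalized word
gods = re.findall(r"\bthe [A-Z][a-z]+", text)
print(gods)  # ['the Smith', 'the Maid', 'the Father', 'the Warrior', 'the Crone']
```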

#### 2.1.2.4 Exercise: Improving a regex-based sentence tokenizer
First, write a few sentences in a complex (but grammatically acceptable) way so that the (above) regex-based tokenizer breaks. Then, fix the pattern so that the tokenizer can handle your text appropriately.

In [12]:
## regex-based sentence tokenizer
sentences = "With all due resp., I don't think this is a very good tokenization! Here's another one!"
# raw string so that \s, \. etc. reach the regex engine without invalid-escape warnings
sentences_tokenized = re.split(r"\s*(?<=[\.\?\!][^a-zA-Z0-9,])\s*", sentences)
sentences_tokenized

["With all due resp., I don't think this is a very good tokenization!",
 '',
 "Here's another one!"]
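
The empty string in the output above comes from the pattern matching a second time, with zero width, right after it has consumed the space. A variant that requires at least one whitespace character, and looks ahead for an uppercase letter, avoids this. It is a sketch under the assumption that sentences start with a capital letter; an abbreviation followed by a capitalized word would still be split.

```python
import re

sentences = "With all due resp., I don't think this is a very good tokenization! Here's another one!"

# Split on whitespace that is preceded by ., ? or ! and followed by an uppercase letter.
boundary = re.compile(r"(?<=[.?!])\s+(?=[A-Z])")
sentences_split = boundary.split(sentences)
print(sentences_split)
```

Here "resp., I" is left intact because the period is followed by a comma rather than whitespace, and the result contains exactly two sentences with no empty entry.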

#### 2.1.3.2 Exercise: POS tagging 
Apply POS tagging to a sentence of your choosing and filter for only verbs and nouns.

In [17]:
import spacy

# the "en" shortcut works in spaCy 2.x only; in spaCy 3+, load "en_core_web_sm" instead
nlp = spacy.load("en")

running_sentence = "Use some of our test sentences; Joey's not very smart, nor charming."
doc = nlp(running_sentence)

print("token\tcoarse\tfine")
for token in doc:
    if token.pos_ in {"NOUN", "VERB", "PROPN"}:
        print(token.text + "\t" + token.pos_ + "\t" + token.tag_)

token	coarse	fine
Use	VERB	VB
test	NOUN	NN
sentences	NOUN	NNS
Joey	PROPN	NNP
's	VERB	VBZ


#### 2.1.3.5 Exercise: using grammar for information extraction
Apply the spacy grammatical parsing and extract any subject-verb token pairs.

In [22]:
running_sentence = "Let's use another one. Anything else? Happy hour is tomorrow at 5:30 at Tap House where we will all meet up and say hi."
doc = nlp(running_sentence)

print("subject verb")
for token in doc:
    # keep nominal subjects whose syntactic head is a verb
    if token.dep_ == "nsubj" and token.head.pos_ == "VERB":
        print(token.text + " " + token.head.text)

subject verb
's use
hour is
we meet


#### 2.1.4.4 Exercise: improved word frequency representation
Build a stop word list and lemmatization strategy (potentially using POS tags) to compute 'better' word frequencies, as you see fit.

In [7]:
## code here
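
As a starting point, the idea can be sketched in plain Python. The stop word set and the lemma lookup table below are small, hypothetical examples chosen for the sample sentence; a fuller solution would likely take lemmas from a tagger-backed source such as spaCy's `token.lemma_` instead of a hand-written table.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list and lemma table.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "were", "was"}
LEMMAS = {"gods": "god", "prayed": "pray", "prays": "pray", "statues": "statue"}

def word_frequencies(text):
    """Count lemmas in text after lowercasing and dropping stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    lemmas = [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]
    return Counter(lemmas)

sample = "She prayed to the gods, and the gods were silent. He prays to a god of war."
freqs = word_frequencies(sample)
print(freqs.most_common(2))  # [('god', 3), ('pray', 2)]
```

Without the lemma table, "gods" and "god" (and "prayed" and "prays") would be counted separately, and "the" would top the list.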

#### 2.1.6.5 Exercise: exploring TF-IDF
Rank each of the example TF-IDF matrix's columns by TF-IDF value from high to low, and interpret the kinds of words that have high TF-IDF values, i.e., are 'more important'. What kinds of words have low values?

In [8]:
## code here
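
The ranking step can be illustrated on a toy corpus. This sketch assumes that each column of the matrix corresponds to one document, and it uses raw term counts with an unsmoothed `log(N / df)` IDF, which may differ from the weighting in the main notes. The three documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus; each document plays the role of one matrix column.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog sat on the log".split(),
    "d3": "the cat chased the dog".split(),
}
N = len(docs)

# Document frequency: in how many documents does each term occur?
df = Counter(term for tokens in docs.values() for term in set(tokens))

def rank_terms(name):
    """Return (term, tf-idf) pairs for one document, highest value first."""
    tf = Counter(docs[name])
    scores = {t: tf[t] * math.log(N / df[t]) for t in tf}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

for name in docs:
    print(name, [(t, round(v, 2)) for t, v in rank_terms(name)])
```

In this corpus, a word that occurs in only one document ("mat", "log", "chased") ranks first in its column, while "the", which occurs in every document, scores exactly zero.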