In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
nbdir = "/content/gdrive/My Drive/DSCI521/Colab/02-textual/"

Mounted at /content/gdrive


In [None]:
%cd /content/gdrive/My\ Drive/DSCI521/Colab/02-textual/

/content/gdrive/My Drive/DSCI521/Colab/02-textual


# DSCI 521: Methods for analysis and interpretation <br>Chapter 2: Feature engineering and language processing

## Exercises
Note: exercise numbers refer to the main notes.

#### 2.1.1.3 Exercise: Regex phone numbers
Read the file `phone-numbers.txt`. It contains one phone number per line. \[Hint: use something like `lines = open("file.txt", "r").readlines()`\] Store only the phone numbers with the area code "215" in a list and print it out. Use regex-based pattern matching rather than any other method that occurs to you.
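As a minimal sketch of the pattern-matching idea (not a definitive solution): the formatting of `phone-numbers.txt` is not shown here, so the sample lines and the separators allowed by the pattern below are assumptions that may need adjusting for the real file.

```python
import re

## hypothetical sample lines; the real file's formatting may differ
lines = ["215-555-0123\n", "(215) 555-0188\n", "610-555-0142\n", "1-215-555-0101\n"]

## optional leading "1" (country code), optional parentheses around the
## area code 215, then seven more digits with optional separators
pattern = re.compile(r"^\s*(?:\+?1[\s.-]?)?\(?215\)?[\s.-]?\d{3}[\s.-]?\d{4}\s*$")

philly_numbers = [line.strip() for line in lines if pattern.match(line)]
print(philly_numbers)
```

Anchoring the pattern with `^` and `$` keeps it from matching "215" when it appears in the middle of a number rather than as the area code.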

In [None]:
## code here

#### 2.1.1.8 Exercise: Names of the gods
In the cell below is some text. It's an extract from [A Clash of Kings](https://www.goodreads.com/book/show/10572.A_Clash_of_Kings), specifically, about a character's prayer to some fictional gods. Use regex to extract the names of these gods. Your output should be a list that looks something like `["the Father", "the Mother", "the Warrior"]`.
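One possible starting point, shown on a shortened excerpt of the passage: match the word "the" followed by a single capitalized word. This is a sketch that relies on how this particular passage is written; it would also pick up any other "the" + capitalized word, so check the output against the text.

```python
import re

## shortened excerpt of the exercise text, used only for illustration
excerpt = ("She knelt before the Smith, who fixed things that were broken. "
           "To the Father, she prayed for justice, and she asked the Warrior "
           "to keep Robb strong.")

## "the" as a whole word, a space, then one capitalized word
gods = re.findall(r"\bthe [A-Z][a-z]+\b", excerpt)
print(gods)
```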

In [None]:
text = 'Lost and weary, Catelyn Stark gave herself over to her gods. She knelt before the Smith, who fixed things that were broken, and asked that he give her sweet Bran his protection. She went to the Maid and beseeched her to lend her courage to Arya and Sansa, to guard them in their innocence. To the Father, she prayed for justice, the strength to seek it and the wisdom to know it, and she asked the Warrior to keep Robb strong and shield him in his battles. Lastly she turned to the Crone, whose statues often showed her with a lamp in one hand. "Guide me, wise lady," she prayed. "Show me the path I must walk, and do not let me stumble in the dark places that lie ahead."'

## code here

#### 2.1.2.4 Exercise: Improving a regex-based sentence tokenizer
First, write a few sentences in a complex (but grammatically acceptable) way so that the (above) regex-based tokenizer breaks. Then, fix the pattern so that the tokenizer can handle your text appropriately.

In [None]:
## code here

#### 2.1.3.2 Exercise: POS tagging 
Apply POS tagging to a sentence of your choosing and filter for only verbs and nouns.

In [None]:
## code here

#### 2.1.3.5 Exercise: using grammar for information extraction
Apply spaCy's grammatical (dependency) parsing to a sentence and extract any subject-verb token pairs.

In [None]:
## code here

#### 2.1.4.4 Exercise: improved word frequency representation
Build a stop word list and lemmatization strategy (potentially using POS tags) to compute 'better' word frequencies, as you see fit.
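To illustrate the general shape of such a pipeline, here is a rough sketch in plain Python. The stop word list, the `crude_lemma` helper, and the sample sentence are all hypothetical and deliberately tiny; a POS-aware lemmatizer would handle far more cases than stripping a plural "s".

```python
import re
from collections import Counter

## hypothetical, deliberately small stop word list
STOP_WORDS = {"the", "a", "of", "is", "to", "in", "and", "this", "as", "up"}

def crude_lemma(word):
    ## strip a trailing "s" from longer words; a stand-in for real lemmatization
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

sample = "The BOW model counts the number of times each word appears. Words are counted in documents."
tokens = re.findall(r"[a-z']+", sample.lower())
counts = Counter(crude_lemma(t) for t in tokens if t not in STOP_WORDS)
print(counts.most_common(3))
```

With lowercasing and the suffix rule, "word" and "Words" are counted together, which is the kind of merging the exercise is after.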

In [None]:
text = """Word frequencies are probably the first and easiest 
numerical representation of text to compute. In some communities, 
this is referred to as the bag of words (BOW) model. 
Put simply, the BOW model simply counts up the 
number of times each word appears in a document. 
This of course depends on a few things, e.g., case and lemmatization. 
However, constructing a basic BOW model is quite straightforward, especially using `Counter`. 
Let's use this very paragraph as our example text for the BOW model."""
## code here

#### 2.1.6.5 Exercise: exploring TF-IDF
Rank each of the example TF-IDF matrix's rows by TF-IDF values from high-to-low and interpret the kinds of words that have high TF-IDF values, i.e., are 'more important'. What about the low values, what kinds of words are these?
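One way to read this exercise is to rank the words (rows) within each document (column). The sketch below does that with `np.argsort` on a small made-up matrix; `toy_words` and `toy_tfidf` are hypothetical stand-ins for the `all_words` and `TFIDF` objects built in the cell that follows.

```python
import numpy as np

## hypothetical toy inputs: 4 words (rows) x 2 documents (columns)
toy_words = ["and", "gods", "she", "the"]
toy_tfidf = np.array([
    [0.05, 0.3],
    [2.0, 0.0],
    [0.0, 0.5],
    [0.1, 0.1],
])

rankings = []
for j in range(toy_tfidf.shape[1]):
    ## negate so that argsort (ascending) yields highest TF-IDF first
    order = np.argsort(-toy_tfidf[:, j])
    rankings.append([(toy_words[i], float(toy_tfidf[i, j])) for i in order])

for j, ranked in enumerate(rankings):
    print(j, ranked)
```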

In [None]:
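import numpy as np
import spacy
from collections import Counter

## assumption: the small English spaCy model; substitute whichever model the main notes load
nlp = spacy.load("en_core_web_sm")

def count_words(sentence):
    ## count lowercased token frequencies within one spaCy sentence
    frequency = Counter()
    for word in sentence:
        frequency[word.text.lower()] += 1
    return frequency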

text = '''Lost and weary, Catelyn Stark gave herself over to her gods. 
She knelt before the Smith, who fixed things that were broken, 
and asked that he give her sweet Bran his protection. 
She went to the Maid and beseeched her to lend her courage to Arya and Sansa, 
to guard them in their innocence. 
To the Father, she prayed for justice, the strength to seek it and the wisdom to know it, 
and she asked the Warrior to keep Robb strong and shield him in his battles. 
Lastly she turned to the Crone, whose statues often showed her with a lamp in one hand. 
"Guide me, wise lady," she prayed. 
"Show me the path I must walk, and do not let me stumble in the dark places that lie ahead."
'''

doc = nlp(text)
    
## the 'master' set, keeps track of the words in all documents
all_words = set()

## store the word frequencies by sentence (each sentence is treated as a document)
all_doc_frequencies = {}

## loop over the sentences
for j, sentence in enumerate(doc.sents):
    frequency = count_words(sentence)
    all_doc_frequencies[j] = frequency
    doc_words = set(frequency.keys())
    all_words = all_words.union(doc_words)
    
## create a matrix of zeros: (words) x (documents)
TDM = np.zeros((len(all_words),len(all_doc_frequencies)))
## fix a word ordering for the rows
all_words = sorted(list(all_words))
## loop over the (sorted) document numbers and (ordered) words; fill in matrix
for j in all_doc_frequencies:
    for i, word in enumerate(all_words):
        TDM[i,j] = all_doc_frequencies[j][word]

num_docs = TDM.shape[1]

## start off with a copy of our TDM (frequencies)
TFIDF = np.array(TDM)
## loop over words
for i, word in enumerate(all_words):
    ## count docs containing the word
    num_docs_containing_word = len([x for x in TDM[i] if x])
    ## compute the inverse document frequency of this word
    IDF = -np.log2(num_docs_containing_word/num_docs)
    ## multiply this row by the IDF to transform it to TFIDF
    TFIDF[i,] = TFIDF[i,]*IDF

In [None]:
## code here

## Additional In-depth Exercises

### A. Constructing co-occurrence matrix statistics

#### A.1 Build a tokenizer
To start, build a tokenization function, called as `tokens = tokenize(text, space = False)`, that accepts a string called `text` and a boolean argument called `space`. When `space` is `True`, whitespace characters should be stored as part of the output list `tokens`; when it is `False`, they should be dropped.

For this part of the exercise, use the character class `[0-9a-zA-Z'-]` (or its complementary character class) to split the text at non-word characters, but be sure to capture all portions of the text that are 'split' using a grouping mechanism. Likewise, ensure that all non-word-type tokens are completely resolved, e.g., there _shouldn't_ be any tokens that consist of multiple punctuation characters, such as `".\""`, which should be subdivided into multiple tokens.

Likewise, be sure to collapse any multiple whitespace `" "` characters down to just one as an initial pre-processing step to the `text`.
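A minimal sketch of one way to meet this specification, under the assumption that only runs of the literal space character `" "` are collapsed (newlines are left alone). Splitting on a single non-word character inside a capturing group keeps every delimiter as its own token.

```python
import re

def tokenize(text, space=False):
    ## collapse runs of spaces down to one space as a pre-processing step
    text = re.sub(r" +", " ", text)
    ## split at each single non-word character; the group keeps the delimiter
    pieces = re.split(r"([^0-9a-zA-Z'-])", text)
    ## adjacent delimiters leave empty strings behind, so drop those
    tokens = [p for p in pieces if p != ""]
    if not space:
        tokens = [t for t in tokens if not t.isspace()]
    return tokens

print(tokenize('He said, "Stop."  Now'))
print(tokenize('He said, "Stop."  Now', space=True))
```

Because the split pattern matches exactly one character at a time, a run such as `."` is resolved into two tokens rather than one.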

In [None]:
## code here

#### A.2 Build a word-sentence tokenizer
Here, the goal is to produce a two-level tokenization utility similar to what spaCy produces:

In [None]:
for s in doc.sents:
    print([w.text for w in s])

['Lost', 'and', 'weary', ',', 'Catelyn', 'Stark', 'gave', 'herself', 'over', 'to', 'her', 'gods', '.', '\n']
['She', 'knelt', 'before', 'the', 'Smith', ',', 'who', 'fixed', 'things', 'that', 'were', 'broken', ',', '\n', 'and', 'asked', 'that', 'he', 'give', 'her', 'sweet', 'Bran', 'his', 'protection', '.', '\n']
['She', 'went', 'to', 'the', 'Maid', 'and', 'beseeched', 'her', 'to', 'lend', 'her', 'courage', 'to', 'Arya', 'and', 'Sansa', ',', '\n', 'to', 'guard', 'them', 'in', 'their', 'innocence', '.', '\n']
['To', 'the', 'Father', ',', 'she', 'prayed', 'for', 'justice', ',', 'the', 'strength', 'to', 'seek', 'it', 'and', 'the', 'wisdom', 'to', 'know', 'it', ',', '\n', 'and', 'she', 'asked', 'the', 'Warrior', 'to', 'keep', 'Robb', 'strong', 'and', 'shield', 'him', 'in', 'his', 'battles', '.', '\n']
['Lastly', 'she', 'turned', 'to', 'the', 'Crone', ',', 'whose', 'statues', 'often', 'showed', 'her', 'with', 'a', 'lamp', 'in', 'one', 'hand', '.', '\n']
['"', 'Guide', 'me', ',', 'wise', 'lad

with the caveat that we use our own tokenization utility (which can be flagged to retain space characters).
Since this requires a sentence tokenizer, install `nltk` (if you haven't already) and use its `sent_tokenize()` function.

In [None]:
## code here

#### A.3 Try to re-construct the document
Now that we have a two-stage tokenizer that can retain space characters, let's try to reconstruct a document from its tokenization, with and without `space=True`.

In particular, consider how to re-join the elements of the two-level list (sentences) of lists (words) of strings by a delimiter so as to re-construct the document.
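The difference between the two cases can be sketched with hand-written token lists (hypothetical, not the output of any particular tokenizer). When spaces were retained, joining on the empty string recovers the text; when they were dropped, joining on `" "` is only an approximation.

```python
## hypothetical two-level tokenizations of the same short document
with_space = [["Lost", " ", "and", " ", "weary", ".", " "], ["She", " ", "knelt", "."]]
without_space = [["Lost", "and", "weary", "."], ["She", "knelt", "."]]

## spaces are tokens already, so no delimiter is needed at either level
rebuilt_exact = "".join("".join(sentence) for sentence in with_space)

## spaces were discarded, so we must guess a delimiter; punctuation gets detached
rebuilt_approx = " ".join(" ".join(sentence) for sentence in without_space)

print(repr(rebuilt_exact))
print(repr(rebuilt_approx))
```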

In [None]:
## code here with space=True

In [None]:
## code here with space=False

#### A.4 Write a function that loads/processes a document from file
Write a function called `load_data(path, space = False)` which accepts a `path` string to identify the direct location of a text file. Upon loading the specified file, construct an (output) dictionary called `data` with three key-value pairs:

- `'sentences'`: output of word_sentence_tokenize applied to document,
- `'counts'`: a dictionary of integer counts of all tokens in the document,
- `'type_index'`: a dictionary linking tokens to indices for their order of appearance.

Test this code on the books in the local `'./data/books/'` directory, e.g., `'./data/books/84.txt'` is a copy of "Frankenstein..." (other metadata can be found in `'./data/books/metadata.json'`).
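The `'counts'` and `'type_index'` entries can both be built in a single pass over the tokenized sentences. The sketch below shows only that bookkeeping; the helper name `summarize` is hypothetical, and file reading and tokenization are left out, so the input is a hand-written list of lists.

```python
from collections import Counter

def summarize(sentences):
    counts = Counter()
    type_index = {}
    for sentence in sentences:
        for token in sentence:
            counts[token] += 1
            ## setdefault assigns the next free index only on first appearance
            type_index.setdefault(token, len(type_index))
    return {"sentences": sentences, "counts": dict(counts), "type_index": type_index}

data = summarize([["the", "cat"], ["the", "dog"]])
print(data["counts"])
print(data["type_index"])
```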

In [None]:
## code here

#### A.5 Build a context generator
Now write a function called `get_context(i, sentence, m = 0, weight = 0)` to produce a 'sliding-window' context (list of surrounding tokens) for the token of index `i` in an already tokenized `sentence` (a list of strings). Optional non-negative arguments `m` (an integer) and `weight` (a float) specify the size of the context window and the relative weights of context elements.

Specifically, `m` tokens should be taken to both the left and right of token `i` (all should be taken when the default `m=0` is set).

Finally, `weight` determines the contents of a list named `weights`, which should be numeric and of the same length as the `context`. Each entry of `weights` should be the reciprocal of the absolute distance to the center token, i.e., the token of index `i`, _raised to the power given by `weight`_. Note: this ensures that setting `weight=0` 'turns off' the weights, since every entry is then 1.
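A minimal sketch of the windowing and weighting logic described above. Returning the pair `(context, weights)` is an assumption, since the exercise does not fix the return shape.

```python
def get_context(i, sentence, m=0, weight=0):
    n = len(sentence)
    ## m = 0 means "use the whole sentence"; otherwise clip the window to the sentence
    lo, hi = (0, n) if m == 0 else (max(0, i - m), min(n, i + m + 1))
    positions = [j for j in range(lo, hi) if j != i]
    context = [sentence[j] for j in positions]
    ## reciprocal of the distance to the center, raised to the power `weight`
    weights = [1 / abs(j - i) ** weight for j in positions]
    return context, weights

sentence = ["a", "b", "c", "d", "e"]
print(get_context(2, sentence, m=1, weight=1))
print(get_context(0, sentence, weight=1))
```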

In [None]:
## code here

#### A.6 Compute a co-occurrence matrix
Finally, we'll utilize our context model and two-stage tokenizer to build a co-occurrence matrix with weighted contexts.

In particular, build a function called `compute_co_occurrence_matrix(data, m = 0, weight = 0)` that accepts the `data` output from `load_data()` and constructs `X`, an `N`-by-`N` matrix (where `N` is the vocabulary size). Each row corresponds to a 'center' token and each column to a context token, and each entry holds the _total `weight`_ with which that context token appears in the `m`-context windows of that center token.

Note: the rows and columns of `X` should be in the order specified by `data['type_index']`.
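As a rough sketch of the accumulation step, the function below inlines the window logic rather than calling `get_context`, so that it runs on its own. It uses a dense NumPy array, which is an assumption that only suits small vocabularies; the example `data` dictionary is hand-written.

```python
import numpy as np

def compute_co_occurrence_matrix(data, m=0, weight=0):
    index = data["type_index"]
    X = np.zeros((len(index), len(index)))
    for sentence in data["sentences"]:
        n = len(sentence)
        for i, center in enumerate(sentence):
            ## m = 0 means the whole sentence is the context window
            lo = 0 if m == 0 else max(0, i - m)
            hi = n if m == 0 else min(n, i + m + 1)
            for j in range(lo, hi):
                if j != i:
                    ## row = center token, column = context token
                    X[index[center], index[sentence[j]]] += 1 / abs(j - i) ** weight
    return X

## hypothetical miniature input in the shape produced by load_data()
data = {"sentences": [["a", "b", "a"]], "type_index": {"a": 0, "b": 1}}
print(compute_co_occurrence_matrix(data, m=1))
```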

In [None]:
## code here

#### A.7 Build a similarity function to sanity check our model
Here, we should build a cosine-similarity comparer, `most_similar(t, type_index, X, top=10)`, that accepts a token `t` and the `type_index` (from `data['type_index']`), which links each token string to its row/column of `X`. The final argument `top` specifies how many results the function should return. This output should (as in Chapter 1) consist of a list of `(token, similarity)` tuples sorted from high to low similarity.
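A minimal sketch of a cosine-similarity ranking over the rows of `X`. It makes two assumptions that the exercise leaves open: the query token itself is kept in the results (with similarity close to 1), and all-zero rows are given similarity 0 rather than raising a division warning.

```python
import numpy as np

def most_similar(t, type_index, X, top=10):
    v = X[type_index[t]]
    ## product of norms for every row against the query row
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(v)
    ## leave similarity at 0 wherever a norm is 0 (cosine is undefined there)
    sims = np.divide(X @ v, norms, out=np.zeros(len(X)), where=norms > 0)
    ranked = sorted(type_index, key=lambda tok: sims[type_index[tok]], reverse=True)
    return [(tok, float(sims[type_index[tok]])) for tok in ranked[:top]]

## hypothetical miniature vocabulary and matrix
type_index = {"a": 0, "b": 1, "c": 2}
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
result = most_similar("a", type_index, X)
print(result)
```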

In [None]:
## code here