# Assignment group 1: Textual feature extraction and numerical comparison

## Module B _(35 points)_ Key word in context

Key word in context (KWiC) is a common format for concordance lines, i.e., contextualized instances of principal words used in a book. More generally, KWiC is essentially the concept behind the utility of 'find in page' on document viewers and web browsers. This module builds up a KWiC utility for finding key word-containing sentences, and 'most relevant' paragraphs, quickly.

__B1.__ _(3 points)_ Start by writing a function called `load_book` that reads in a book based on a provided `book_id` string and returns a list of `paragraphs` from the book. When book data is loaded, you should remove the whitespace characters at the beginning and end of the text (e.g., using `strip()`). Then, to split books into paragraphs, use the `re.split()` function to split the input wherever there are two or more consecutive newlines. Note that books are in the provided `data/books` directory.

Note: this module is not focused on text pre-processing beyond a split into paragraphs; you do _not_ need to remove markup or non-substantive content.
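
The strip-then-split step can be tried on a small in-memory string before touching any file. This is only a sketch: the `split_paragraphs` helper and the `sample_text` value are made up for illustration and are not part of the assignment's data.

```python
import re

def split_paragraphs(text):
    """Strip outer whitespace, then split wherever two or more newlines occur."""
    return re.split(r"\n{2,}", text.strip())

# Hypothetical sample text, not taken from any book in data/books.
sample_text = "\n\nFirst paragraph,\nstill first.\n\nSecond paragraph.\n\n\n\nThird paragraph.\n"

sample_paragraphs = split_paragraphs(sample_text)
print(len(sample_paragraphs))  # 3
print(sample_paragraphs[0])    # single newlines inside a paragraph are kept
```

Stripping first matters: without it, leading or trailing blank lines would produce empty strings at the edges of the list.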

In [1]:
# B1:Function(3/3)

import re
import sys

def load_book(book_id):
    
    paragraphs = []

    try:
        with open("./data/books/" + book_id + ".txt") as file:
            book_data = file.read()
            # Drop leading/trailing whitespace so the split yields no empty edge entries.
            trimmed_book_data = book_data.strip()
            # A run of two or more newlines marks a paragraph boundary.
            paragraphs = re.split(r'\n{2,}', trimmed_book_data)

    except FileNotFoundError:
        print("Error, file not found.")
        sys.exit()
    
    return paragraphs

To test your function, let's apply it to look at a few paragraphs from book 84.

In [2]:
# B1:SanityCheck
paragraphs = load_book('84')
print(len(paragraphs))
print(paragraphs[10])

723
These reflections have dispelled the agitation with which I began my
letter, and I feel my heart glow with an enthusiasm which elevates me
to heaven, for nothing contributes so much to tranquillize the mind as
a steady purpose--a point on which the soul may fix its intellectual
eye.  This expedition has been the favourite dream of my early years. I
have read with ardour the accounts of the various voyages which have
been made in the prospect of arriving at the North Pacific Ocean
through the seas which surround the pole.  You may remember that a
history of all the voyages made for purposes of discovery composed the
whole of our good Uncle Thomas' library.  My education was neglected,
yet I was passionately fond of reading.  These volumes were my study
day and night, and my familiarity with them increased that regret which
I had felt, as a child, on learning that my father's dying injunction
had forbidden my uncle to allow me to embark in a seafaring life.


__B2.__ _(10 points)_ Next, write a function called `kwic(paragraphs, search_terms)` that accepts a list of string `paragraphs` and a set of `search_terms` strings. The function should:

1. initialize `data` as a `defaultdict` of lists
2. loop over the `paragraphs` and apply `spacy`'s processing to produce a `doc` for each;
3. loop over the `doc.sents` resulting from each `paragraph`;
4. loop over the words in each `sentence`;
5. check: `if` a `word` is `in` the `search_terms` set;
6. `if` (5), then `.append()` the reference under `data[word]` as a list: `[[i, j, k], sentence]`, where `i`, `j`, and `k` refer to the paragraph-in-book, sentence-in-paragraph, and word-in-sentence indices, respectively.

Your output, `data`, should then be a default dictionary of lists of the format:
```
data['word'] = [[[i, j, k], ["These", "are", "sentences", "containing", "the", "word", "'word'", "."]],
                ...,]
```
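
The nested loops in steps 2 to 6 can be sketched without `spacy` by assuming the paragraphs are already split into sentences and words. The `kwic_sketch` name and the `toy` input below are hypothetical stand-ins for `doc.sents`, used only to show how `enumerate` yields zero-based `[i, j, k]` indices.

```python
from collections import defaultdict

def kwic_sketch(tokenized_paragraphs, search_terms):
    """Collect [[i, j, k], sentence] entries for every matching word."""
    data = defaultdict(list)
    for i, paragraph in enumerate(tokenized_paragraphs):  # paragraph-in-book
        for j, sentence in enumerate(paragraph):          # sentence-in-paragraph
            for k, word in enumerate(sentence):           # word-in-sentence
                if word in search_terms:
                    data[word].append([[i, j, k], sentence])
    return data

# Hypothetical pre-tokenized input standing in for spaCy's doc.sents.
toy = [
    [["The", "sea", "was", "calm", "."]],
    [["We", "sailed", "."], ["The", "sea", "rose", "."]],
]

hits = kwic_sketch(toy, {"sea"})
print(hits["sea"])
# [[[0, 0, 1], ['The', 'sea', 'was', 'calm', '.']], [[1, 1, 1], ['The', 'sea', 'rose', '.']]]
```

Because `enumerate` restarts at zero for every inner loop, `j` counts sentences within the current paragraph and `k` counts words within the current sentence.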

Note, we have imported spacy and set it up to use the `"en"` model. This will require you to install spacy by running `pip install spacy` and downloading the `"en"` model by running the command `python -m spacy download en`.

In [3]:
# B2:Function(10/10)

from collections import defaultdict
import spacy
nlp = spacy.load("en")

def def_value():
    tmp_list = []
    return tmp_list

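def kwic(paragraphs, search_terms=frozenset()):
    
    # Maps each matched term to a list of [[i, j, k], sentence_tokens] entries.
    data = defaultdict(def_value)
    idx_paragraph = 0
    idx_sentence = 0

    for para in paragraphs:
        # Incremented before use, so recorded paragraph numbers are 1-based.
        idx_paragraph += 1
        doc = nlp(para)

        for sent in doc.sents:
            # Never reset, so this is a running sentence count over the whole book.
            idx_sentence += 1

            # Token strings of the sentence, shared by every match found in it.
            word_list = [token.text for token in sent]

            for token in sent:
                if token.text in search_terms:
                    # token.idx is the token's character offset within the paragraph.
                    tmp_list = [[idx_paragraph, idx_sentence, token.idx], word_list]
                    data[token.text].append(tmp_list)

    return data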

Now, let's test your function using the paragraphs from your `load_book` function.

In [4]:
# B2:SanityCheck
kwic(paragraphs, {'Ocean', 'seas'})

defaultdict(<function __main__.def_value()>,
            {'Ocean': [[[11, 28, 479],
               ['I',
                '\n',
                'have',
                'read',
                'with',
                'ardour',
                'the',
                'accounts',
                'of',
                'the',
                'various',
                'voyages',
                'which',
                'have',
                '\n',
                'been',
                'made',
                'in',
                'the',
                'prospect',
                'of',
                'arriving',
                'at',
                'the',
                'North',
                'Pacific',
                'Ocean',
                '\n',
                'through',
                'the',
                'seas',
                'which',
                'surround',
                'the',
                'pole',
                '.',
                ' ']]],
             'seas':

__B3.__ _(2 points)_ Let's test your `kwic` search function's utility using the pre-processed `paragraphs` from book `84` for the key words `Frankenstein` and `monster` in context. Answer the inline questions about these tests.

In [5]:
# B3:SanityCheck
results = kwic(paragraphs, {"Frankenstein", "monster"})

print("# of sentences 'Frankenstein' appears in: {}".format(len(results['Frankenstein'])))
print("# of sentences 'monster' appears in: {}".format(len(results['monster'])))
print()

print(" ".join(results['Frankenstein'][7][1]))
print()
print(" ".join(results['monster'][0][1]))

# of sentences 'Frankenstein' appears in: 27
# of sentences 'monster' appears in: 31

She nursed Madame 
 Frankenstein , my aunt , in her last illness , with the greatest affection 
 and care and afterwards attended her own mother during a tedious 
 illness , in a manner that excited the admiration of all who knew her , 
 after which she again lived in my uncle 's house , where she was beloved 
 by all the family .  

I started from my sleep with horror ; a cold dew covered my forehead , my 
 teeth chattered , and every limb became convulsed ; when , by the dim and 
 yellow light of the moon , as it forced its way through the window 
 shutters , I beheld the wretch -- the miserable monster whom I had 
 created .  


In [6]:
# B3:Inline(1/2)

# Is the kwic function fast or slow? Print "Fast" or "Slow"
print()
print("SLOOOW")


SLOOOW
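
One way to back a fast-or-slow judgement with a number is to time the call. The sketch below uses `time.perf_counter` from the standard library; the `time_call` helper and the `slow_sum` workload are hypothetical stand-ins, and the same wrapper could be applied to a call such as `kwic(paragraphs, {...})`.

```python
import time

def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Hypothetical stand-in workload; replace with the function being measured.
def slow_sum(n):
    return sum(i * i for i in range(n))

total, seconds = time_call(slow_sum, 100_000)
print("elapsed: {:.4f} s".format(seconds))
```

A single run is noisy, so repeating the measurement a few times gives a steadier picture.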


In [7]:
# B3:Inline(1/2)

# How many sentences does the word Frankenstein appear in? Print the integer (0 is just a placeholder).
# print(0)
len(results['Frankenstein'])

27

__B4.__ _(10 pts)_ The cost of _indexing_ a given book turns out to be the limiting factor here for kwic. Presently, we have our pre-processing `load_book` function just splitting a document into paragraphs. Rewrite the `load_book` function to do some additional preprocessing. Specifically, this function should be modified to:

1. split a `book` into paragraphs and loop over them, but
2. process each paragraph with `spacy`;
3. store the `document` as a triple-nested list, so that each word _string_ is reachable via three indices: `word = document[i][j][k]`;
4. record an `index = defaultdict(list)` containing a list of `[i,j,k]` lists for each word; and
5. `return document, index`

Pre-computing the `index` will allow us to efficiently look up the locations of each word's instance in `document`, and the triple-list format of our document will allow us fast access to extract the sentence for KWiC. 
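
The layout described in steps 3 and 4 can be sketched on pre-tokenized toy data, assuming the sentence and word splitting has already been done (here by hand rather than by `spacy`). The `build_index` name and the `toy` input are hypothetical.

```python
from collections import defaultdict

def build_index(tokenized_paragraphs):
    """Return (document, index) where document[i][j][k] is a word string
    and index[word] lists every [i, j, k] position of that word."""
    document = []
    index = defaultdict(list)
    for i, paragraph in enumerate(tokenized_paragraphs):
        document.append([])
        for j, sentence in enumerate(paragraph):
            document[i].append(list(sentence))
            for k, word in enumerate(sentence):
                index[word].append([i, j, k])
    return document, index

# Hypothetical pre-tokenized input standing in for spaCy output.
toy = [
    [["The", "sea", "was", "calm", "."]],
    [["We", "sailed", "."], ["The", "sea", "rose", "."]],
]

toy_document, toy_index = build_index(toy)
print(toy_index["sea"])       # [[0, 0, 1], [1, 1, 1]]
print(toy_document[1][1][1])  # sea
```

Every position stored in the index can be fed straight back into the document, which is what makes the later lookup cheap.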

In [8]:
# B4:Function(10/10)

def load_book_2(book_id):
    
    idx_paragraph = 0
    idx_sentence = 0

    # document[i] is [paragraph_text, [sentence_text, tokens], ...], so sentence j
    # of paragraph i sits at document[i][j + 1] and its tokens at document[i][j + 1][1].
    document = []
    # index[word] is a list of zero-based [i, j, k] positions.
    index = defaultdict(def_value)

    try:
        with open("./data/books/" + book_id + ".txt") as file:
            book_data = file.read()
            trimmed_book_data = book_data.strip()
            paragraphs = re.split(r'\n{2,}', trimmed_book_data)

            for para in paragraphs:
                doc = nlp(para)
                # Slot 0 of every paragraph entry holds the raw paragraph text.
                document.append([doc.text])

                for sent in doc.sents:
                    word_list = []
                    idx_word = 0

                    for token in sent:
                        word_list.append(token.text)
                        index[token.text].append([idx_paragraph, idx_sentence, idx_word])
                        idx_word += 1

                    # Each sentence entry pairs the raw sentence text with its tokens.
                    document[idx_paragraph].append([sent.text, word_list])
                    idx_sentence += 1

                idx_paragraph += 1
                idx_sentence = 0

    except FileNotFoundError:
        print("Error, file not found.")
        sys.exit()

    return document, index

Now, let's test your new function on `book_id` = `'84'`. We'll use the returned document to access a particular sentence and print out the `[i,j,k]` locations of the word `'monster'` from `index`.

In [10]:
# B4:SanityCheck

# load the book
document, index = load_book_2("84")

In [11]:
# B4:SanityCheck

# Output entry 10 of paragraph 124 (slot 0 holds the paragraph text, so this is sentence 9)
document[124][10]

['the hue of death; her features appeared to change, and I thought that I\nheld the corpse of my dead mother in my arms; a shroud enveloped her\nform, and I saw the grave-worms crawling in the folds of the flannel.\n',
 ['the',
  'hue',
  'of',
  'death',
  ';',
  'her',
  'features',
  'appeared',
  'to',
  'change',
  ',',
  'and',
  'I',
  'thought',
  'that',
  'I',
  '\n',
  'held',
  'the',
  'corpse',
  'of',
  'my',
  'dead',
  'mother',
  'in',
  'my',
  'arms',
  ';',
  'a',
  'shroud',
  'enveloped',
  'her',
  '\n',
  'form',
  ',',
  'and',
  'I',
  'saw',
  'the',
  'grave',
  '-',
  'worms',
  'crawling',
  'in',
  'the',
  'folds',
  'of',
  'the',
  'flannel',
  '.',
  '\n']]

In [12]:
# B4:SanityCheck

# Output the indices for monster
print(index['monster'])
print()
print()
print(index['Frankenstein'])

[[124, 10, 57], [136, 3, 6], [139, 3, 4], [142, 1, 4], [243, 3, 29], [261, 3, 18], [280, 0, 2], [321, 1, 35], [345, 9, 6], [380, 13, 5], [397, 1, 46], [437, 0, 16], [439, 0, 3], [477, 7, 8], [478, 7, 6], [510, 1, 8], [527, 0, 1], [538, 19, 22], [560, 3, 31], [585, 4, 43], [587, 11, 72], [606, 3, 2], [615, 2, 11], [633, 10, 9], [639, 1, 21], [644, 4, 17], [653, 8, 5], [663, 5, 2], [673, 0, 39], [673, 2, 2], [709, 11, 1]]


[[1, 0, 0], [103, 3, 11], [131, 5, 4], [134, 2, 4], [165, 8, 15], [184, 0, 11], [187, 0, 3], [232, 6, 4], [253, 5, 0], [283, 8, 2], [285, 3, 3], [285, 21, 5], [439, 2, 10], [440, 0, 2], [485, 7, 5], [673, 4, 6], [673, 10, 0], [686, 2, 0], [688, 3, 8], [695, 4, 0], [706, 3, 25], [709, 2, 2], [710, 1, 28], [712, 1, 2], [713, 0, 21], [715, 0, 5], [720, 2, 2]]


__B5.__ _(5 pts)_ Finally, make a new function called `fast_kwic` that takes a `document` and `index` from our new `load_book` function as well as a provided set of `search_terms` (just like our original kwic function). The function should loop through all specified `search_terms`, look up the indices in `index[word]` for the key word-containing sentences, and use them to extract these sentences from `document` into the same data structure as output by __B2__:
```
data['word'] = [[[i, j, k], ["These", "are", "sentences", "containing", "the", "word", "'word'", "."]],
                ...,]
```
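
The lookup itself needs no text processing at all. The sketch below assumes the plain `document[i][j][k]` layout from the task description and a hand-written toy index; `fast_kwic_sketch`, `toy_document`, and `toy_index` are hypothetical names.

```python
from collections import defaultdict

def fast_kwic_sketch(document, index, search_terms):
    """Look up each term's positions and pull the sentences out of document."""
    data = defaultdict(list)
    for term in search_terms:
        # .get avoids creating empty entries when index is a defaultdict.
        for i, j, k in index.get(term, []):
            data[term].append([[i, j, k], document[i][j]])
    return data

# Hypothetical toy data in the document[i][j][k] layout.
toy_document = [
    [["The", "sea", "was", "calm", "."]],
    [["We", "sailed", "."], ["The", "sea", "rose", "."]],
]
toy_index = {"sea": [[0, 0, 1], [1, 1, 1]], "sailed": [[1, 0, 1]]}

hits = fast_kwic_sketch(toy_document, toy_index, {"sea", "storm"})
print(hits["sea"][1])   # [[1, 1, 1], ['The', 'sea', 'rose', '.']]
print("storm" in hits)  # False
```

The work per query is proportional to the number of occurrences of the search terms, not to the length of the book.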

In [13]:
# B5:Function(5/5)

def fast_kwic(document, index, search_terms=frozenset()):

    data = defaultdict(def_value)

    for term in search_terms:
        # Membership test first: indexing a defaultdict would create an empty entry.
        if term in index:
            for occurrence in index[term]:
                i, j, _ = occurrence
                # Slot 0 of document[i] is the paragraph text, so sentence j sits at
                # j + 1; each sentence entry is [sentence_text, tokens].
                sentence_words = document[i][j + 1][1]
                data[term].append([occurrence, sentence_words])
    
    return data

To test our new function, let's try it on the same keywords as before: `Frankenstein` and `monster`. Note that the output from this sanity check should be the same as the one from **B3**.

In [14]:
# B5:SanityCheck

fast_results = fast_kwic(document, index, {'Frankenstein', 'monster'})

print("# of sentences 'Frankenstein' appears in: {}".format(len(fast_results['Frankenstein'])))
print("# of sentences 'monster' appears in: {}".format(len(fast_results['monster'])))
print()

print(" ".join(fast_results['Frankenstein'][7][1]))
print()
print(" ".join(fast_results['monster'][0][1]))

# of sentences 'Frankenstein' appears in: 27
# of sentences 'monster' appears in: 31

She nursed Madame 
 Frankenstein , my aunt , in her last illness , with the greatest affection 
 and care and afterwards attended her own mother during a tedious 
 illness , in a manner that excited the admiration of all who knew her , 
 after which she again lived in my uncle 's house , where she was beloved 
 by all the family .  

I started from my sleep with horror ; a cold dew covered my forehead , my 
 teeth chattered , and every limb became convulsed ; when , by the dim and 
 yellow light of the moon , as it forced its way through the window 
 shutters , I beheld the wretch -- the miserable monster whom I had 
 created .  


__B6.__ _(5 pts)_ Your goal here is to modify the pre-processing in `load_book` one more time! Make a small modification to the input: `load_book(book_id, pos = True, lemma = True):`, to accept two boolean arguments, `pos` and `lemma`, specifying how to identify each word as a key term. In particular, each word will now be represented in both the `document` and `index` as a tuple: `heading = (text, tag)`, where `text` contains the `word.text` attribute from `spacy` if `lemma = False`, and the `word.lemma_` attribute if `lemma = True`. Similarly, `tag` should be left empty as `""` if `pos = False` and otherwise contain `word.pos_`.

Note that this function's output should still consist of a `document` and `index` in the same format, aside from the replacement of `word` with `heading`. This allows the same use of the output in `fast_kwic`, with key terms now specified more precisely by their textual features.
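
The `heading` construction can be sketched with plain tuples, assuming each token is given as a hypothetical `(text, lemma, part-of-speech)` triple rather than a `spacy` token. The `make_heading` helper and the `tokens` list are made up for illustration.

```python
def make_heading(text, lemma_text, pos_tag, pos=True, lemma=True):
    """Build the (text, tag) key for one word."""
    return (lemma_text if lemma else text, pos_tag if pos else "")

# Hypothetical (text, lemma, part-of-speech) triples standing in for spaCy tokens.
tokens = [("The", "the", "DET"), ("winds", "wind", "NOUN"),
          ("were", "be", "AUX"), ("cold", "cold", "ADJ")]

headings = [make_heading(t, l, p) for t, l, p in tokens]
print(headings[1])  # ('wind', 'NOUN')

# Headings are tuples, so join on the text component only;
# str.join accepts strings, not tuples.
print(" ".join(text for text, tag in headings))  # the wind be cold

surface_only = [make_heading(t, l, p, pos=False, lemma=False) for t, l, p in tokens]
print(surface_only[1])  # ('winds', '')
```

Since tuples are hashable, a `heading` works as a dictionary key in `index` exactly as a plain word string does.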

In [15]:
# B6:Function(5/5)

def load_book_3(book_id, pos = True, lemma = True):
    
    idx_paragraph = 0
    idx_sentence = 0

    # Same layout as load_book_2, but every word is a (text, tag) heading:
    # document[i] is [paragraph_text, [sentence_text, headings], ...].
    document = []
    # index[heading] is a list of zero-based [i, j, k] positions.
    index = defaultdict(def_value)

    try:
        with open("./data/books/" + book_id + ".txt") as file:
            book_data = file.read()
            trimmed_book_data = book_data.strip()
            paragraphs = re.split(r'\n{2,}', trimmed_book_data)

            for para in paragraphs:
                doc = nlp(para)
                # Slot 0 of every paragraph entry holds the raw paragraph text.
                document.append([doc.text])

                for sent in doc.sents:
                    word_list = []
                    idx_word = 0

                    for token in sent:
                        # Lemma or surface form, depending on the lemma flag.
                        text = token.lemma_ if lemma else token.text
                        # Coarse part-of-speech tag, or "" when pos is off.
                        tag = token.pos_ if pos else ""
                        heading = (text, tag)

                        word_list.append(heading)
                        index[heading].append([idx_paragraph, idx_sentence, idx_word])
                        idx_word += 1

                    document[idx_paragraph].append([sent.text, word_list])
                    idx_sentence += 1

                idx_paragraph += 1
                idx_sentence = 0

    except FileNotFoundError:
        print("Error, file not found.")
        sys.exit()
    
    return document, index

In [16]:
# B6:SanityCheck
document, index = load_book_3("84", pos = True, lemma = True)

In [17]:
# B6:SanityCheck
print("Sentence with ('cold', 'NOUN'):")
" ".join(fast_kwic(document, index, search_terms = {('cold', 'NOUN')})[('cold', 'NOUN')][0][1])

Sentence with ('cold', 'NOUN'):


TypeError: sequence item 0: expected str instance, tuple found

In [20]:
# B6:SanityCheck
print("Sentence with ('cold', 'NOUN'):")
" ".join(fast_kwic(document, index, search_terms = {('cold', 'NOUN')}))

Sentence with ('cold', 'NOUN'):


TypeError: sequence item 0: expected str instance, tuple found

In [18]:
print(fast_kwic(document, index, search_terms = {('cold', 'NOUN')}))

defaultdict(<function def_value at 0x7fc862fed820>, {('cold', 'NOUN'): [[[13, 2, 2], [('the', 'DET'), ('\n', 'SPACE'), ('cold', 'NOUN'), ('be', 'AUX'), ('not', 'PART'), ('excessive', 'ADJ'), (',', 'PUNCT'), ('if', 'SCONJ'), ('-PRON-', 'PRON'), ('be', 'AUX'), ('wrap', 'VERB'), ('in', 'ADP'), ('fur', 'NOUN'), ('--', 'PUNCT'), ('a', 'DET'), ('dress', 'NOUN'), ('which', 'DET'), ('-PRON-', 'PRON'), ('have', 'AUX'), ('\n', 'SPACE'), ('already', 'ADV'), ('adopt', 'VERB'), (',', 'PUNCT'), ('for', 'ADP'), ('there', 'PRON'), ('be', 'AUX'), ('a', 'DET'), ('great', 'ADJ'), ('difference', 'NOUN'), ('between', 'ADP'), ('walk', 'VERB'), ('the', 'DET'), ('\n', 'SPACE'), ('deck', 'NOUN'), ('and', 'CCONJ'), ('remaining', 'ADJ'), ('seat', 'VERB'), ('motionless', 'NOUN'), ('for', 'ADP'), ('hour', 'NOUN'), (',', 'PUNCT'), ('when', 'ADV'), ('no', 'DET'), ('exercise', 'NOUN'), ('\n', 'SPACE'), ('prevent', 'VERB'), ('the', 'DET'), ('blood', 'NOUN'), ('from', 'ADP'), ('actually', 'ADV'), ('freeze', 'VERB'), ('

In [None]:
# B6:SanityCheck
print("Sentence with ('cold', 'ADJ'):")
# Each word is a (text, tag) tuple, so join on the text component only.
" ".join(text for text, _ in fast_kwic(document, index, search_terms = {('cold', 'ADJ')})[('cold', 'ADJ')][0][1])

In [19]:
print(fast_kwic(document, index, search_terms = {('cold', 'ADJ')}))

defaultdict(<function def_value at 0x7fc862fed820>, {('cold', 'ADJ'): [[[9, 0, 22], [('-PRON-', 'PRON'), ('be', 'AUX'), ('already', 'ADV'), ('far', 'ADV'), ('north', 'ADV'), ('of', 'ADP'), ('London', 'PROPN'), (',', 'PUNCT'), ('and', 'CCONJ'), ('as', 'SCONJ'), ('-PRON-', 'PRON'), ('walk', 'VERB'), ('in', 'ADP'), ('the', 'DET'), ('street', 'NOUN'), ('of', 'ADP'), ('\n', 'SPACE'), ('Petersburgh', 'PROPN'), (',', 'PUNCT'), ('-PRON-', 'PRON'), ('feel', 'VERB'), ('a', 'DET'), ('cold', 'ADJ'), ('northern', 'ADJ'), ('breeze', 'NOUN'), ('play', 'NOUN'), ('upon', 'SCONJ'), ('-PRON-', 'DET'), ('cheek', 'NOUN'), (',', 'PUNCT'), ('which', 'DET'), ('\n', 'SPACE'), ('brace', 'VERB'), ('-PRON-', 'DET'), ('nerve', 'NOUN'), ('and', 'CCONJ'), ('fill', 'VERB'), ('-PRON-', 'PRON'), ('with', 'ADP'), ('delight', 'NOUN'), ('.', 'PUNCT'), (' ', 'SPACE')]], [[12, 3, 19], [('i', 'NOUN'), ('\n', 'SPACE'), ('accompany', 'VERB'), ('the', 'DET'), ('whale', 'NOUN'), ('-', 'PUNCT'), ('fisher', 'NOUN'), ('on', 'ADP'),