# Question 1

In this question, you'll be doing some basic processing on four books: *Moby Dick*, *War and Peace*, *The King James Bible*, and *The Complete Works of William Shakespeare*. Each of these texts are available for free on the [Project Gutenberg](https://www.gutenberg.org/ebooks/) website.

### Part A

Write a function, `read_book`, which takes the name of the text file containing the book's content, and returns a single string containing the content.

 - input: a single string indicating the name of the text file with the book
 - output: a single string of the entire book
 
Your function should be able to handle file-related errors gracefully (this means a `try`/`except` block!). If an error occurs, just return `None`.

In [286]:
def read_book(filename):
    try:
        file=open(filename, 'r')    #opening file in read mode
        string=file.read()          #assigning file data to variable string
        file.close()                #closing file
        return string               #return file data as a string
    except FileNotFoundError:       #error handling in case invalid filename
        return None

In [282]:
assert read_book("queen_jean_bible.txt") is None

In [283]:
assert read_book("complete_shakspeare.txt") is None

In [284]:
book1 = read_book("moby_dick.txt")
print(len(book1))

1238567


In [285]:
book2 = read_book("war_and_peace.txt")
print(len(book2))

3224780


### Part B

Write a function, `word_counts`, which takes a single string as input (containing an entire book), and returns as output a dictionary of word counts.

Don't worry about handling punctuation, but definitely handle whitespace (spaces, tabs, newlines). Also make sure to handle capitalization, and throw out any words with a length of 2 or less. No other "preprocessing" requirements outside these.

You are welcome to use the [`collections.defaultdict` dictionary](https://docs.python.org/3/library/collections.html#collections.defaultdict) for tracking word counts, but no other built-in Python packages or functions for counting. `defaultdict` dictionaries are nice because if you try to update an element that does not yet exist in the dictionary, it automatically creates it and initializes it to a "default" value (hence, the name); this differs from the standard built-in Python dictionaries, in that this operation would cause your program to crash!

If you want to use defaultdictionaries but aren't sure how, **please ask on Slack!**

In [257]:
from collections import defaultdict
def word_counts(string):
    try:
        string=string.lower()                #text preprocessing: making string all lowercase
        skips = [".", ", ", ":", ";", "'", '"'," ","\n","\r"]   #text preprocessing; characters for removal
        for ch in skips:
            stri = string.replace(ch, "")    #loop to replace skipped characters in string
        wordfreq = {}                        #initializing dictionary to store count
        for word in stri.split():            #looping through split string
            wordfreq[word] = wordfreq.setdefault(word, 0) + 1   #adding count to freq dictionary
        return wordfreq                      #return dictionary with word counts
    except:
        return None                          #error handling


In [258]:
assert 0 == len(word_counts("").keys())

In [259]:
assert 1 == word_counts("hi there")["there"]

In [260]:
kj = word_counts(open("king_james_bible.txt", "r").read())
assert 23 == kj["devil"]
assert 4 == kj["leviathan"]

In [261]:
wp = word_counts(open("war_and_peace.txt", "r").read())
assert 30 == wp["devil"]
assert 86 == wp["soul"]

### Part C

Write a function, `total_words`, which takes as input a string containing the contents of a book, and returns as output the integer count of the *total number of words* (this is NOT unique words, but *total* words).

Same rules apply as in Part B with respect to what constitutes a "word" (capitalization, punctuation, splitting, etc), but you are welcome to use your Part B solution in answering this question!

In [262]:
def total_words(string):
    dic=word_counts(string)   #using part b function to generate dictionary
    vals=dic.values()         #accessing values of freq dictionary: number of counts per words
    total=sum(vals)           #summing word counts
    return total              #returning sum

In [263]:
try:
    words = total_words("")
except:
    assert False
else:
    assert words == 0

In [264]:
assert 11 == total_words("The brown fox jumped over the lazy cat.\nTwice.\nMMyep. Twice.")

In [265]:
assert 824146 == total_words(open("king_james_bible.txt", "r").read())

In [266]:
assert 904061 == total_words(open("complete_shakespeare.txt", "r").read())

### Part D

Write a function, `unique_words`, which takes as input a string containing the full contents of a book, and returns an integer count of the number of *unique* words in the book.

Same rules apply as in Part B with respect to what constitutes a "word" (capitalization, punctuation, splitting, etc), but you are welcome to use your Part B solution in answering this question!

In [267]:
def unique_words(string):
    words=word_counts(string)   #converting to single string
    word=words.keys()           #accessing the keys of word counts (all the words)
    uniq=set(word)              #casting words as a set (no repeats)
    count=0                     #initializing variable for count
    for i in uniq:              #iterating through set
        count+=1                #counts unique words in set
    return count
        

In [268]:
try:
    words = total_words("")
except:
    assert False
else:
    assert words == 0

In [269]:
assert 9 == unique_words("The brown fox jumped over the lazy cat.\nTwice.\nMMyep. Twice.")

In [270]:
print(unique_words(open("moby_dick.txt", "r").read()))

31683


In [271]:
print(unique_words(open("war_and_peace.txt", "r").read()))

40186


### Part E

Write a function, `global_vocabulary`, which takes a *list of strings* as its single input argument. Each string element of the list contains the contents of an entire book. The output of the function should be a list or set of *unique words* that comprise the full vocabulary of terms present across all the books that are passed to the function.

For example, if I have the following code:

```
book1 = "This is the entire content of a book."
book2 = "Here's another book."
book3 = "What is this?"
books = [book1, book2, book3]  # list of strings

vocabulary = global_vocabulary(books)
```

this should return a list or set containing the words:

```
{'a',
 'another',
 'book.',
 'content',
 'entire',
 "here's",
 'is',
 'of',
 'the',
 'this',
 'this?',
 'what'}
```

The words should be in increasing lexicographic order (aka, standard alphabetical order), and all the preprocessing steps required in previous sections should be used. As such, you are welcome to use your `word_counts` function from Part B.

In [272]:
def global_vocabulary(strings):
    total_vocabulary = set()                #initializing set (no repeats)
    for string in strings:                  #looping through list of strings
        total_vocabulary.update(list(filter(lambda item: len(item)>2, string.lower().split())))
        #updating initiliazed set to include the unique words from each string, 
        #with text preprocessing to lowercase and split words in string
        #filter adds words repeated more twice to the set
    return sorted(list(total_vocabulary))   #returns casted list of vocab sorted alphabetically

In [273]:
doc1 = "This is a sentence."
doc2 = "This is another sentence."
doc3 = "What is this?"

assert set(["another", "sentence.", "this", "this?", "what"]) == set(global_vocabulary([doc1, doc2, doc3]))

In [274]:
print(len(global_vocabulary([open("moby_dick.txt", "r").read()])))

31586


In [275]:
print(len(global_vocabulary([open("war_and_peace.txt", "r").read()])))

40021


In [276]:
kj = open("king_james_bible.txt", "r").read()
wp = open("war_and_peace.txt", "r").read()
md = open("moby_dick.txt", "r").read()
cs = open("complete_shakespeare.txt", "r").read()

print(len(global_vocabulary([kj, wp, md, cs])))

118503


### Part F

Write a function, `featurize`, which takes a *list of strings* as its lone input argument: each element is a string with the contents of an entire book. The output of this function is a 2D NumPy array of counts, where the rows are the documents/books (i.e., one row per element!) and the columns are the counts for all the words in the global vocabulary.

For instance, if I pass a two-string list to `featurize` that collectively have 50 unique words between the two strings, the output matrix should have shape `(2, 50)`: the first row will be the respective counts of the words in that document, and same with the second row.

The rows (documents) should be in the same ordering as they're given in the function's argument list, and the columns (words) should be in increasing lexicographic order (aka alphabetic order). You are welcome to use your function from Part B, and from Part E (Part E's function would be particularly useful here, since that pretty much does all the heavy lifting; you'd just need to convert that dictionary to a matrix).

In [292]:
def featurize(strings):
    from collections import Counter
    import numpy as np
    vocabulary = global_vocabulary(strings)    #using global vocab fxn to access global vocab of strings                 
    matrix=[]                                  #initializing matrix as list
    for string in strings:                     #looping through string in input strings list
        word_count_string=Counter(string.lower().split()) #text preprocessing
        string_count=[]                        #initializing list to store vocab counts
        for word in vocabulary:                #looping through words in global vocab
            string_count.append(word_count_string[word])
                                               #adding global vocab count to string count list
        matrix.append(string_count)            #adding string count to matrix list
    for row in matrix:                         #for document within matrix
        print(row)                             #returns row=document within input strings
    matrix=np.array(matrix)                    #casting matrix list as a numpy array
    print(matrix.shape)                        #returning the shape

In [293]:
kj = open("king_james_bible.txt", "r").read()
wp = open("war_and_peace.txt", "r").read()

matrix = featurize([kj, wp])
assert 2 == matrix.shape[0]
print(matrix.shape[1])
print(int(matrix[:, 836].sum()))
print(int(matrix[:, 62655].sum()))

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

AttributeError: 'NoneType' object has no attribute 'shape'

In [294]:
kj = open("king_james_bible.txt", "r").read()
wp = open("war_and_peace.txt", "r").read()
md = open("moby_dick.txt", "r").read()
cs = open("complete_shakespeare.txt", "r").read()

matrix = featurize([kj, wp, md, cs])
print(matrix.shape[0])
print(matrix.shape[1])
print(int(matrix[:, 103817].sum()))
print(int(matrix[:, 71100].sum()))

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

AttributeError: 'NoneType' object has no attribute 'shape'

### Part G

Write a function, `probability`, which takes three arguments:
 - an integer, indicating the index of the word we're interested in computing the probability for (i.e., column number)
 - a 2D NumPy matrix of word counts, where the rows are the documents, and the columns are the words
 - an *optional* integer, indicating the specific document in which we want to compute the probability of the word (default is all documents)
 
This function is the implementation of $P(w)$ for some word $w$. By default, this is probability of word $w$ over our entire dataset. However, by specifying an optional integer, we can specify a conditional probability $P(w | d)$. In this case, we're asking for the probability of word $w$ *given* some specific document $d$.

Your function should return the probability, a floating-point value between 0 and 1. It should be able to handle the case where the specified word index is out of bounds (resulting probability of 0), as well as the case where the document index is out of bounds (also a probability of 0).

In [278]:
def probability(index, counts, doc='all'):
    if (index >= len(counts[0])):              #accounts for case where index is out of bounds
        return 0
    if (doc !='all'):                          #if document isn't default
        if (doc >= len(counts)):               #accounts for case where doc index is out of bounds
            return 0
    if (doc =='all'):                          #if doc is default
        s=sum([sum(x) for x in counts])        #sum of all the words
        w=sum([x[index] for x in counts])      #sum of words counts/occurences w/n input doc
        return w/s
    ans=(counts[doc][index])/(sum(counts[doc])) #calc prob: (num of word count in doc)/(doc word count) 
    return ans
 
    

In [279]:
import numpy as np
matrix = np.load("lut.npy")

np.testing.assert_allclose(0.068569417725812987, probability(104088, matrix))
np.testing.assert_allclose(0.012485067486917144, probability(54096, matrix))
np.testing.assert_allclose(0.0073786475907416712, probability(21668, matrix))
np.testing.assert_allclose(0.0, probability(66535, matrix), rtol = 1e-5)

AssertionError: 
Not equal to tolerance rtol=1e-05, atol=0

Mismatched elements: 1 / 1 (100%)
Max absolute difference: 4.87393328e-07
Max relative difference: 1.
 x: array(0.)
 y: array(4.873933e-07)

In [280]:
import numpy as np
matrix = np.load("lut.npy")

np.testing.assert_allclose(0.012404288801202555, probability(54096, matrix, 0))
np.testing.assert_allclose(0.0077914081371666744, probability(21668, matrix, 1))
np.testing.assert_allclose(0.0094279749592546449, probability(117297, matrix, 3))

### Part H

Let's assume the four books you've analyzed are now going to constitute your "background" data. "Background" data is a concept that, in theory, allows you to identify important words: if you analyze a *new* book, you can compare its word counts to those in your "background" dataset. Any words in the new book that occur a lot more or a lot less frequently than in the background data could be considered "important" in some sense.

Let's say you receive a new book: [*Guns of the South*](https://www.amazon.com/Guns-South-Harry-Turtledove/dp/0345384687). You want to compare its word counts to those in your "background". However, you quickly run into a problem: there are words in *Guns of the South* that **do not exist at all** in your background dataset--words like "Abraham" and "Lincoln" only show up in the new book, but never in your "background". This is extremely problematic, since now you'd potentially be dividing by 0 to gauge the relative importance of the words in *Guns of the South*.

Can you suggest a preprocessing step that might help? 

Within your code that constitutes the background data, you could write a piece of code to "flag" words from the new book if words aren't within the existing background data. Similar things within text data analysis are 'part of speech tags' which can be used to recognize new proper nouns within a text (i.e., "Abraham Lincoln"). Using the part of speech tags along with a function that can flag new proper nouns could possibly help to accomodate for this issue. 