# Question 1

In this question, you'll be doing some basic processing on four books: *Moby Dick*, *War and Peace*, *The King James Bible*, and *The Complete Works of William Shakespeare*. Each of these texts is available for free on the [Project Gutenberg](https://www.gutenberg.org/ebooks/) website.

### Part A

Write a function, `read_book`, which takes the name of the text file containing the book's content, and returns a single string containing the content.

 - input: a single string indicating the name of the text file with the book
 - output: a single string of the entire book
 
Your function should be able to handle file-related errors gracefully (this means a `try`/`except` block!). If an error occurs, just return `None`.
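As a rough illustration of what "file-related errors" look like in practice, the sketch below tries to open a file that does not exist and catches the resulting exception. The path `no_such_book.txt` and the names `missing`, `contents`, and `error_name` are made up for this example and are not part of the assignment.

```python
import os
import tempfile

# Hypothetical path inside a fresh temporary directory, so it cannot exist yet
missing = os.path.join(tempfile.mkdtemp(), "no_such_book.txt")

try:
    with open(missing, "r") as f:
        contents = f.read()
except OSError as err:
    # FileNotFoundError and PermissionError are both subclasses of OSError
    contents = None
    error_name = type(err).__name__

print(contents, error_name)  # None FileNotFoundError
```

Catching `OSError` rather than using a bare `except:` is one possible design choice: it covers missing files and permission problems without hiding unrelated bugs.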

In [1]:
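def read_book(fname):
    try:
        # The with-block closes the file even if read() fails
        with open(fname, 'r') as file:
            return file.read()
    except (OSError, UnicodeError):
        # Missing file, bad permissions, or undecodable bytes
        return None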

In [2]:
assert read_book("queen_jean_bible.txt") is None

In [3]:
assert read_book("complete_shakspeare.txt") is None

In [4]:
book1 = read_book("moby_dick.txt")
assert len(book1) == 1238567

In [5]:
book2 = read_book("war_and_peace.txt")
assert len(book2) == 3224780

### Part B

Write a function, `word_counts`, which takes a single string as input (containing an entire book), and returns as output a dictionary of word counts.

Don't worry about handling punctuation, but definitely handle whitespace (spaces, tabs, newlines). Also make sure to handle capitalization (treat "The" and "the" as the same word), and throw out any words with a length of 2 or less. There are no other "preprocessing" requirements beyond these.
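The preprocessing rules above can be sketched on a tiny string. The sample text and the names `text`, `tokens`, and `kept` are hypothetical and only meant to show the three steps (split on whitespace, lowercase, drop short words).

```python
text = "The  Whale\tis\nat sea. THE whale!"  # hypothetical sample text

# split() with no arguments splits on any run of spaces, tabs, or newlines
tokens = text.split()

# Lowercase each token, then keep only those longer than 2 characters
kept = [t.lower() for t in tokens if len(t) > 2]

print(tokens)  # ['The', 'Whale', 'is', 'at', 'sea.', 'THE', 'whale!']
print(kept)    # ['the', 'whale', 'sea.', 'the', 'whale!']
```

Because punctuation is left alone, `whale` and `whale!` remain two different words here.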

You are welcome to use the [`collections.defaultdict` dictionary](https://docs.python.org/3/library/collections.html#collections.defaultdict) for tracking word counts, but no other built-in Python packages or functions for counting. `defaultdict` dictionaries are nice because if you try to update an element that does not yet exist in the dictionary, it automatically creates it and initializes it to a "default" value (hence the name); this differs from standard built-in Python dictionaries, where the same operation raises a `KeyError` and crashes your program unless you handle it!
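To make the difference concrete, here is a minimal sketch comparing the two dictionary types. The sample words and the names `sample`, `plain`, and `auto` are made up for illustration.

```python
from collections import defaultdict

sample = ["whale", "ship", "whale"]  # hypothetical input, for illustration only

# A plain dict raises KeyError when you update a key that is not there yet
plain = {}
try:
    plain["whale"] += 1
except KeyError:
    plain["whale"] = 1  # we have to create the entry ourselves

# defaultdict(int) creates missing entries on first access, starting at int() == 0
auto = defaultdict(int)
for word in sample:
    auto[word] += 1

print(dict(auto))  # {'whale': 2, 'ship': 1}
```

With a plain dictionary, `d.get(word, 0) + 1` is a common way to get the same effect without the `try`/`except`.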



In [6]:
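def word_counts(s):
    counts = {}
    # Lowercase first so "The" and "the" count as the same word;
    # split() with no arguments splits on any run of whitespace
    for word in s.lower().split():
        if len(word) > 2:  # throw out words of length 2 or less
            counts[word] = counts.get(word, 0) + 1
    return counts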


In [7]:
assert 0 == len(word_counts("").keys())

In [8]:
assert 1 == word_counts("hi there")["there"]

In [9]:
kj = word_counts(open("king_james_bible.txt", "r").read())
assert 23 == kj["devil"]
assert 4 == kj["leviathan"]

In [10]:
wp = word_counts(open("war_and_peace.txt", "r").read())
assert 30 == wp["devil"]
assert 86 == wp["soul"]

AssertionError: 

### Part C

Write a function, `total_words`, which takes as input a string containing the contents of a book, and returns as output the integer count of the *total number of words* (this is NOT unique words, but *total* words).

Same rules apply as in Part B with respect to what constitutes a "word" (capitalization, punctuation, splitting, etc), but you are welcome to use your Part B solution in answering this question!
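One way to see how a Part B style dictionary relates to the total is the sketch below. The dictionary `counts` is a hand-written, hypothetical example rather than output from a real book.

```python
# Hypothetical word counts, as a Part B style dictionary might hold them
counts = {"the": 2, "brown": 1, "fox": 1}

# Each value is how many times that word occurred,
# so the values add up to the total number of words counted
total = sum(counts.values())

print(total)  # 4
```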

In [11]:
def total_words(data):
    count = 0
    # split() handles spaces, tabs, newlines, and leading/trailing whitespace
    for word in data.split():
        if len(word) > 2:  # same rule as Part B: skip words of length 2 or less
            count += 1
    return count

In [12]:
try:
    words = total_words("")
except:
    assert False
else:
    assert words == 0

In [13]:
assert 11 == total_words("The brown fox jumped over the lazy cat.\nTwice.\nMMyep. Twice.")

In [14]:
assert 681216 == total_words(open("king_james_bible.txt", "r").read())

AssertionError: 

In [None]:
assert 729531 == total_words(open("complete_shakespeare.txt", "r").read())

### Part D

Write a function, `unique_words`, which takes as input a string containing the full contents of a book, and returns an integer count of the number of *unique* words in the book.

Same rules apply as in Part B with respect to what constitutes a "word" (capitalization, punctuation, splitting, etc), but you are welcome to use your Part B solution in answering this question!
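As a small sketch of what "unique" means here, the example below counts distinct words two ways: with a set, and with the keys of a counts dictionary. The list `words` is hypothetical and assumed to be already lowercased and filtered.

```python
# Hypothetical list of words, already lowercased and filtered
words = ["the", "brown", "fox", "the"]

# A set keeps one copy of each distinct word
unique_from_set = len(set(words))

# The keys of a counts dictionary are also the distinct words
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1
unique_from_counts = len(counts)

print(unique_from_set, unique_from_counts)  # 3 3
```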

In [15]:
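def unique_words(s):
    unique = set()  # a set stores each distinct word only once
    for word in s.lower().split():
        if len(word) > 2:  # same rule as Part B: skip words of length 2 or less
            unique.add(word)
    return len(unique)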

In [16]:
try:
    words = unique_words("")
except:
    assert False
else:
    assert words == 0

In [17]:
assert 9 == unique_words("The brown fox jumped over the lazy cat.\nTwice.\nMMyep. Twice.")

In [18]:
assert 31586 == unique_words(open("moby_dick.txt", "r").read())

AssertionError: 

In [19]:
assert 40021 == unique_words(open("war_and_peace.txt", "r").read())

AssertionError: 