# Assignment 2 
## Chapters 6-12: containers, loops, functions

**Questions? Drop 'em in the Slack under #questions**

By now, we have acquired a toolkit of Python objects and methods that
enable us to do really powerful things with texts. With containers, loops,
and functions, you have learned some of the most important parts of Python.

These new skills mean that we can also begin to work more directly with
the text you've placed in the `/BYOT` folder. It also means that we can begin
to think more about how to operationalize linguistic questions into Pythonic
procedures. 

# Warmup

<div class="alert alert-success">
    
Assign a list to a variable and populate it with five of your favorite authors.
    
</div>

<div class="alert alert-success">
    
Assign another list to a variable and populate it with one book by each author in the list above. Make sure the order matches.
    
</div>

<div class="alert alert-success">
    
Run the code below and consider what it does.
    
</div>

In [None]:
items = ['gameboy', 'N64', 'PC'] 
prices = [65, 100, 700]

list(zip(items, prices))

In [None]:
help(zip)

<div class="alert alert-success">
    
Use `zip` to combine your authors list with your books list. Assign it to a variable.
    
</div>

<div class="alert alert-success">
    
Select the second author/book from the new list.
    
</div>

<div class="alert alert-success">
    
Print the object type of the last author/book pairing from the author/book list.
    
</div>

<div class="alert alert-success">
    
Isolate the first author from the author/book list and assign it to a variable. Print it.
    
</div>

<div class="alert alert-success">
    
Write a function that takes a single argument, `author_book`, which is a list of author/book two-tuples. The function should iterate through the author/book list and print each pairing in the following format: <br>

   `The author {author} wrote {book_name}`

Execute the function by feeding it your author/book list as the argument.

</div>

<div class="alert alert-success">
    
Assign an empty set to a variable.
    
</div>

<div class="alert alert-success">
    
Assign an empty dictionary to a variable. Use both methods for initializing a new dictionary. How does this differ from the empty set you made above?
    
</div>

# BYOT 

Now that we have some more sophisticated tools for storing, looping, and writing code, we can begin to do some interesting real analysis on the text you've placed in the `BYOT` folder. 

Note that some of the following exercises will require a bit of **creativity**.

<div class="alert alert-warning">
    
Tips for completing the exercises:
* Use the Chapter material for reference! Learning to code is just as much learning when to look up how to do something!
* Use the Jupyter notebook code cells to experiment. Don't feel like you should only write perfect code in a code cell. A strength of Jupyter notebooks is that you can add arbitrary cells to experiment with code. Don't remember exactly how to write a `dict`? Add a cell and experiment with what you do remember. You can then copy/paste and clean up your "scratch" code when you're done.
* Break the problem up into steps. Write them out if you need to.
* Add notes (`#`) and write your code with extra spacing so it is more readable. Think of your code segments as paragraphs. You can even give them notes as headings that simply tell what a chunk does.
* [Toggle line numbers](https://stackoverflow.com/questions/10979667/showing-line-numbers-in-ipython-jupyter-notebooks) to help you interpret error messages.
* Practice the 15 minute rule! Don't stay stuck longer than 15 minutes. Ask for help.
    
</div>

In [None]:
from get_byot import your_text

## Text-Setup

Remember that `your_text` is a large string which contains all of the text from the `BYOT` text.
Most texts contain front-matter, and some contain back-matter. Use string slices to isolate
the main content of `your_text` from the rest. Look back on Assignment 1 if you need to, where 
we did the same thing. Assign this portion to a new, descriptive variable.
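
As a reminder of how slicing can isolate a span of a string, here is a minimal sketch on a toy string. The marker strings and the variable names (`sample_text`, `start_marker`, `end_marker`) are invented for illustration; your own text will have different boundaries, and you may prefer to find the indices by inspection.

```python
# Toy stand-in for a full text; the marker strings are hypothetical.
sample_text = 'LICENSE NOTES *** START *** Call me Ishmael. *** END *** DONATE'

start_marker = '*** START ***'
end_marker = '*** END ***'

# str.find returns the index where the marker begins (or -1 if it is absent)
start = sample_text.find(start_marker) + len(start_marker)
end = sample_text.find(end_marker)

# slice between the two indices and trim the surrounding whitespace
main_content = sample_text[start:end].strip()
print(main_content)  # Call me Ishmael.
```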

## Word Counts Revisited 
*concepts: string methods, lists, built-in functions*


In Assignment 1 we made approximate word counts using spaces. Now we will use a slightly more 
sophisticated approach with `str.split`. This process is known as ["tokenization"](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html).
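
To get a feel for what this kind of tokenization produces, here is a small sketch on a made-up sentence (the `sample` string is an assumption, not part of your text). Notice what happens to the punctuation.

```python
# A made-up sentence used only for illustration
sample = 'Hello, world! It is 2 p.m.'

# split() with no argument breaks the string into a list of tokens
sample_tokens = sample.split()

print(sample_tokens)       # ['Hello,', 'world!', 'It', 'is', '2', 'p.m.']
print(len(sample_tokens))  # 6
```

The tokens are not all clean "words": punctuation stays attached and numbers count as tokens too.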

<div class="alert alert-success">
   
Split your text into tokens with `.split()` (NB don't add an argument). Assign it to a variable `tokens`.
    
</div>

<div class="alert alert-success">
   
How many tokens are in `tokens`? How does this compare with the word count from Assignment 1? What might explain the discrepancy? (Hint: use `help` to look at the default behavior of `split`.)
    
</div>

## Rough Lexicon
*concepts: sets, built-in functions, lists*

Note that `tokens` probably contains a lot of repeated items. 

<div class="alert alert-success">

Remove all duplicate items from `tokens` using a single line of code. Assign the result to `lexicon`.
    
</div>

<div class="alert alert-success">

How long is `lexicon`?
    
</div>

<div class="alert alert-success">

Arrange `lexicon` into an ordered object and sort it.
    
</div>

<div class="alert alert-success">

Examine the first 50 items at the top of the sorted object.
    
</div>

## Lexicon Frequencies

*concepts: loops, dictionaries, conditionals*

The lexicon tells us all the unique items in `tokens`, but it doesn't give us any idea
how often each token occurs in the text. We want to know how often each item occurs.
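
The general counting pattern with a loop, a conditional, and a dictionary can be sketched on something much smaller than a text. The example below counts letters in a single word; the names `letters` and `letter_counts` are made up for this illustration.

```python
# Toy data: the letters of one word
letters = list('banana')

letter_counts = {}

for letter in letters:
    if letter in letter_counts:
        # seen before: add one to the running count
        letter_counts[letter] += 1
    else:
        # first occurrence: start the count at one
        letter_counts[letter] = 1

print(letter_counts)  # {'b': 1, 'a': 3, 'n': 2}
```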

<div class="alert alert-success">

Count how often each token occurs in `tokens`. Assign it to `lex_freq`. Do not use additional tools like `collections`.
    
</div>

<div class="alert alert-success">

Write a small function that takes two parameters: (1) a frequency dict, (2) an integer.
The function should use the frequency dict to **return** the [N](http://mathcentral.uregina.ca/QQ/database/QQ.09.04/alex1.html)
most frequent items, where N is the integer. Thus if the `integer` argument is 10, the function will return
the top 10 items in the dict. Be sure to use a descriptive docstring. Run your function 
on your frequency dict above and show the top 25 most frequent items.
    
</div>

## Scrub-a-dub-dub: cleaning text data

*concepts: lists/sets, loops, string methods, conditionals & booleans*

You've *probably* noticed at this point that `tokens` and `lex_freq` contain strings that are 
not technically words, such as punctuation or numbers. Or perhaps you've noticed that some word
strings contain punctuation or other marks. This kind of "noise" is very common in text-mining
tasks. 

We can use the Python skills we've learned so far to clean out some of this extraneous data. Note
that some methods are more efficient than others. For now, we will use only what we've learned up
till now. 

<div class="alert alert-success">

Isolate the top 100 most-frequent tokens using your function from above.
Put them in a list. Examine them and see which ones are pure punctuation and which
ones are text mixed with punctuation. Think about these marks as a set.
    
</div>

<div class="alert alert-success">

Write a function that takes a single parameter, a string. Call
it `clean_token`. The function should remove any punctuation marks from a string 
but leave behind valid material. Return the result. If a string 
consists entirely of punctuation, remove it anyway and return an empty
string. You can use a set to manually construct a group of these 
items and use the set to test the string's membership (see `in`).
    
</div>

Note the truth value of an empty string:

In [None]:
bool('')

or:

In [None]:
if '':
    print('string 1 is True')
elif 'a string':
    print('string 2 is True')

You can use the `False` truth value of an empty string to filter spurious strings using `clean_token`.
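
Here is a minimal sketch of filtering on truthiness, using a hand-written list rather than real output from `clean_token` (the names `raw` and `kept` are made up for this illustration).

```python
# Hand-written stand-in for a list of cleaned strings, some of them empty
raw = ['cat', '', 'dog', '', 'emu']

kept = []

for string in raw:
    # an empty string is falsy, so this test skips it
    if string:
        kept.append(string)

print(kept)  # ['cat', 'dog', 'emu']
```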

<div class="alert alert-success">

Write code that uses `clean_token` to filter the top 100 most-common items (from above) and 
add the filtered strings to a new list `filtered_tokens`. Do not add empty strings to the list.
    
</div>

<div class="alert alert-success">

How many items from the original 100 are left in `filtered_tokens`?
    
</div>

## Baby Verb Parser

*concepts: lists, dicts, loops, string methods, conditionals, functions*

Let's write a simple verb parser. Keep in mind that this parser will
be very basic and only work on a few types of patterns. It will also
probably produce some false positives/negatives. That's ok!

In most languages, verbs can be recognized with distinctive morphology at
either the beginning or the end of a word. For instance, in English the
ending "-ing" typically marks a present participle or a gerund:

    running
    eating
    getting
    
Similarly the ending "-ed" indicates simple past tense verbs. 

We can store data like this in a dictionary where the keys are
distinctive endings/beginnings and the value is a parsing value:

In [None]:
parser = {
    'ly': 'adverb',
    'able': 'adjective',
}

test = 'capable'

for pattern, parse in parser.items():
    if test.endswith(pattern):
        print(f'It\'s a(n) {parse}!')

<div class="alert alert-success">

Write a dictionary where the keys are strings that 
indicate a beginning or ending verb morphology and
the values are a given parsing for the language contained
in your text. Write at least 5 patterns, but feel free
to be more detailed if you'd like.
    
</div>

<div class="alert alert-success">

Write a function `verb_parser` that takes two parameters: (1) a string, (2) a
dictionary with `pattern:parsing`. The function should use the dictionary 
to parse the string. If the pattern is not found, the function should return
`None`. Otherwise, it should return the parsing value.
</div>

<div class="alert alert-success">

Run `verb_parser` in a loop over `tokens`. Store 
positive matches as a tuple of `(string, parsing)`
in a list: `parsed_tokens`. 
    
</div>

## Text Similarity

*concepts: string methods, sets, built-in functions, functions*

Most texts contain some kind of major section headings (i.e. chapters). 
We want to measure the similarity of two given sections in the text. We will
use sets to do this.
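
As a rough illustration of the idea, the sketch below compares two tiny hand-made sets of words. The sets and their names (`section_a`, `section_b`) are invented for this example. Dividing the overlap by the size of the union is one common way to normalize it, shown here as an optional extra rather than a required step.

```python
# Two tiny hand-made "sections", already reduced to sets of words
section_a = {'the', 'whale', 'sea', 'ship'}
section_b = {'the', 'sea', 'storm'}

# words that occur in both sets
shared = section_a.intersection(section_b)

print(sorted(shared))  # ['sea', 'the']
print(len(shared))     # 2

# optional: overlap as a share of all distinct words in either set
overlap_ratio = len(shared) / len(section_a.union(section_b))
print(overlap_ratio)   # 0.4
```

A raw overlap count tends to grow with the length of the sections, which is why a ratio can be easier to compare across pairs.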

<div class="alert alert-success">

Look at the body of `your_text` and consider where section markers occur. Isolate 3 sections / chapters in your text into three separate variables.
    
</div>

Note that we can loop twice over the same list to compare its items pairwise:

In [None]:
values = [1, 10, 9]

for v1 in values:
    for v2 in values:
        print(f'{v1}-{v2} = {v1-v2}')

<div class="alert alert-success">

In a double loop, tokenize each chapter and compare the overlap of each one with every other chapter 
in the dataset. Hint: see `.intersection` on sets. Store the comparisons in a list. Can you find
which two chapters are most similar? Ignore comparisons of a chapter with itself.
    
</div>