# Project 2d: Goals and Deliverables

The goals of this assignment are:
* To review basic NLP tasks for words using dictionaries, lists and file read/write. We will focus especially on *counting* things.
* To make some basic corpus-level visualizations.

Here are the steps you should do to successfully complete this project:
1. From Moodle, accept the assignment. Open and set up a code space (install a python kernel and select it).
2. Complete the notebook and commit it to GitHub. Make sure to answer all questions, and to commit the notebook in a "run" state!
3. I wrote the comments; you write the code! Complete and run `spacy_on_corpus.py` following the instructions in this notebook.
4. Edit the README.md file. Provide your name, your class year, links to/descriptions of any extensions and a list of resources. 
5. Commit your code often. We will take the last commit before the deadline as your submission of the project.

Possible extensions (from least points to most points):
* Make word count plots for the top 100 words and entities. Look at the labels on the y axis of each plot. Where do you think spaCy is making mistakes?
* Augment the `wordcount` functionality so that it displays relative frequencies of entity label pairs and token part of speech pairs.
* Learn about the useful python collections package, especially the [Counter data type](https://docs.python.org/3/library/collections.html#collections.Counter). Copy spacy_on_corpus.py and name the copy spacy_on_corpus-counter.py. Change `get_token_counts` and `get_entity_counts` to use counters. 
* Add in the analyses from project 2c as functions `make_doc_markdown`, `make_doc_tables` and `make_doc_stats`; make sure to ask the user for a document before running any of these!
* Your other ideas are welcome! If you'd like to discuss one with Dr Stent, feel free.
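
For the Counter extension listed above, here is a minimal sketch of how `collections.Counter` tallies items. The token list is made up for illustration and is not spaCy output; how a counter fits into `get_token_counts` is left to you.

```python
from collections import Counter

# A made-up list of tokens (an assumption for illustration, not spaCy output)
tokens = ['the', 'college', 'in', 'the', 'city', 'in', 'the', 'state']

# A Counter behaves like a dictionary that maps each item to its count
counts = Counter(tokens)

# Counts can be updated like ordinary dictionary values
counts['college'] += 1

print(counts['the'])      # 3
print(counts['college'])  # 2
print(counts['missing'])  # 0: an item never seen has a count of zero, not a KeyError
```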

# Setup

## Install Our Packages

On the command line (in the terminal), type:

% `pip install -r requirements.txt`

## Upload Our Data

From Moodle, download `files.zip`. 

Then, upload `files.zip` to the code space.

## Make Sure We Can Work With .py Files We Are Editing

Run the code cell below.

In [None]:
# Automatically reload your external source code
%load_ext autoreload
%autoreload 2

# Writing Better Python Code: Functions

Last week we wrote a really long python script (many lines of code)! But what if we want to reuse some code somewhere else? Then we have to copy it, and if it has a mistake we'll have to fix the bug in both places. Also, really long python scripts are very hard to understand.

To write code that is easier to understand and more reusable, we use **encapsulation**. Specifically, in this project we will start to write and use our own python **functions**. We will write functions to analyze a whole corpus.

Here is a sample python function:
```
def length(input_text):
   print(f'length is {len(input_text)}')
   return len(input_text) 
```

A function **definition** in python assigns a variable name to a code block:
* `def` tells you we are defining a function here
* `length` is the variable we will assign to the code block
* `input_text` is an **argument** (or **parameter**) to this function
* the code block inside this function is two lines long; it prints the length, and then it returns the length
* this function **returns** a value, the length of the input text; functions do not have to return a value but many do

Once you have defined a function you can **call** (or use) it in other places. 
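
As a small illustration of defining a function and then calling it, here is a sketch with a made-up function, `count_words`, which is not part of this project's files:

```python
def count_words(input_text):
    # split() breaks the text on whitespace and returns a list of strings
    return len(input_text.split())

# Call the function and keep the value it returns
n = count_words('I wrote the comments')
print(n)  # 4
```

Notice that the call uses the function's name followed by the argument in parentheses, and that the returned value can be assigned to a variable.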

Every function you have learned about so far this semester is defined either in 'core' python, or in a python package.

We do want to comment our functions. We do this using a **docstring**. The docstring tells readers what the function does, what the arguments mean, and what the return value should look like (if any). The docstring is set off with three double quotes (""") at start and end. 

```
def length(input_text):
   """Calculates the length of input_text.
      :param input_text: some text
      :type input_text: string
      :returns: the length of input_text
      :rtype: int
   """
   print(f'length is {len(input_text)}')
   return len(input_text)
```

We will write a docstring in this format for every function we define. Sometimes, this may mean we do not need to write a comment for every line of code! (However, for project 2d I give you a comment for almost every line of code.)

You can learn more about python docstrings [here](https://realpython.com/documenting-python-code/).
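
A docstring is more than a comment: python stores it on the function itself. The sketch below uses a hypothetical function, `shout`, that is not part of the project, and prints its docstring through the `__doc__` attribute.

```python
def shout(input_text):
    """Converts input_text to upper case.
       :param input_text: some text
       :type input_text: string
       :returns: input_text in upper case
       :rtype: string
    """
    return input_text.upper()

# The docstring is stored on the function object
print(shout.__doc__)

# The function itself works as usual
result = shout('hello')
print(result)  # HELLO
```

In a notebook you can also run `help(shout)` to see the signature together with the docstring.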

# Functions in Notebooks

You can define a python function in a notebook code cell, and then use it. In the code cell below, paste the `length` function definition from above.

In [None]:
# paste the definition of the length() function

Now use it.

In [None]:
# call length on the text 'This is a test!'


You should get this output:
```
length is 15
15
```

# Functions in Python Files

More usually, we store functions in `.py` files. Then we can import the `.py` file and call the functions in it.

Create a new file called `test_functions.py`. Paste the definition of the `length` function in it. 

In the code cell below, import `test_functions.py` like this: `import test_functions` (you don't need to specify `.py`). 

Then call the function `length` as defined in `test_functions`. You do this like: `test_functions.length`.
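
The same `module.function` pattern applies to modules that ship with python. Here is a short sketch using the standard `math` module, chosen only as an example; it is not needed for this project.

```python
import math

# Call the function sqrt as defined in the module math
root = math.sqrt(16)
print(root)  # 4.0
```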

In [None]:
# import from test_functions.py

# call length on the text 'This is another test!'


You should get this output:
```
length is 21
21
```

# Testing the Functions in spacy_on_corpus.py

For this project, I have given you a file `spacy_on_corpus.py`. You will fill in the functions and test them in this section.

First, we will need a test corpus. I give you one here (the text for each document comes from the Wikipedia page for the named college or university).

In [None]:
# import spacy

# import pprint


# make a spacy engine

# make a corpus
corpus = {'doc1': {'text': 'Colby College is a private liberal arts college in Waterville, Maine. Founded in 1813 as the Maine Literary and Theological Institution, it was renamed Waterville College in 1821. The donations of Christian philanthropist Gardner Colby saw the institution renamed again to Colby University before settling on its current title, reflecting its liberal arts college curriculum, in 1899. Approximately 2,000 students from more than 60 countries are enrolled annually. The college offers 54 major fields of study and 30 minors. Located in central Maine, the 714-acre Neo-Georgian campus sits atop Mayflower Hill and overlooks downtown Waterville and the Kennebec River Valley. Along with fellow Maine institutions Bates College and Bowdoin College, Colby competes in the New England Small College Athletic Conference (NESCAC) and the Colby-Bates-Bowdoin Consortium.'},
          'doc2': {'text': 'Columbia University, officially titled as Columbia University in the City of New York, is a private Ivy League research university in New York City. Established in 1754 as King\'s College on the grounds of Trinity Church in Manhattan, it is the oldest institution of higher education in New York and the fifth-oldest in the United States.'}}

# run spacy on each text in the corpus
for key in corpus:
    corpus[key]['doc'] = nlp(corpus[key]['text'])

# print the corpus


## Test `get_token_counts`

Complete the implementation of `get_token_counts` in `spacy_on_corpus.py`. 

Then, in the code cell below import `spacy_on_corpus` and run `get_token_counts` on the provided corpus.

In [None]:
#import spacy_on_corpus


# call get_token_counts on corpus


The output should start with:
```
[('Colby', 5),
 ('College', 6),
 ('is', 3),
 ('a', 2),
 ('private', 2),
 ('liberal', 2),
 ('arts', 2),
 ('college', 3),
 ('in', 12),
 ('Waterville', 3),
 ('Maine', 4),
 ('Founded', 1),
 ```

This function has some **optional arguments**. Look at the function **signature** (the line that starts with `def`). See that there are two arguments, but one of them has a value assigned to it already (using `=`). That means, if you don't want to say what the tags to exclude should be, you can take the default ones specified in the signature. 
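
To see how a default value behaves, here is a sketch with a hypothetical function, `greet`. It is only an illustration; the signature to study for this project is the one in `spacy_on_corpus.py`.

```python
def greet(name, greeting='Hello'):
    # greeting is optional: if the caller leaves it out, 'Hello' is used
    return f'{greeting}, {name}!'

# Take the default value for greeting
print(greet('Colby'))                      # Hello, Colby!

# Override the default by naming the argument
print(greet('Colby', greeting='Welcome'))  # Welcome, Colby!
```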

Let's try changing this. Let's make *no* tags excluded.

In the code cell below, run `get_token_counts` on the corpus provided, also specifying `tags_to_exclude = []` (the empty list).

In [None]:
# call get_token_counts with no tags to exclude


The output should start with:
```
[('Colby', 5),
 ('College', 6),
 ('is', 3),
 ('a', 2),
 ('private', 2),
 ('liberal', 2),
 ('arts', 2),
 ('college', 3),
 ('in', 12),
 ('Waterville', 3),
 (',', 9),
 ('Maine', 4),
 ('.', 9),
 ```

Now, referring to [the coarse-grained tag list](https://universaldependencies.org/u/pos/all.html), what do you have to do to make `get_token_counts` *only* give you counts of (proper or regular) nouns, verbs, adjectives and adverbs?

In [None]:
# call get_token_counts excluding all tags but those corresponding to nouns, verbs, adjectives and adverbs


The output should start with:
```
[('Colby', 5),
 ('College', 6),
 ('private', 2),
 ('liberal', 2),
 ('arts', 2),
 ('college', 3),
 ('Waterville', 3),
 ('Maine', 4),
 ('Founded', 1),
 ```

## Test `get_entity_counts`

Complete the implementation of `get_entity_counts` in `spacy_on_corpus.py`. 

Then, in the code cell below import `spacy_on_corpus` and run `get_entity_counts` on the provided corpus.

In [None]:
# import

# call get_entity_counts


The output should start with:
```
[('Colby College', 1),
 ('Waterville', 2),
 ('Maine', 3),
 ('1813', 1),
 ('the Maine Literary and Theological Institution', 1),
 ('Waterville College', 1),
 ('1821', 1),
 ```

Now, referring to [the spaCy model docs](https://spacy.io/models/en), what do you have to do to make `get_entity_counts` *only* give you organizations, persons and locations?

In [None]:
# call get_entity_counts so as to get only organizations, persons and locations


The output should start with:
```
[('Colby College', 1),
 ('Waterville', 2),
 ('Maine', 3),
 ('the Maine Literary and Theological Institution', 1),
 ('Waterville College', 1),
 ```

## Test `reduce_to_top_k`

Complete the implementation of `reduce_to_top_k` in `spacy_on_corpus.py`. 

Then, in the code cell below import `spacy_on_corpus` and run `reduce_to_top_k` on the output of `get_token_counts` on the provided corpus.

In [None]:
# import

# get the token counts on corpus; assign token_counts to the returned result
token_counts = 

# call reduce_to_top_k on token_counts to get the top 5


The output should look like:
```
[('of', 5), ('College', 6), ('and', 7), ('the', 11), ('in', 12)]
```

## Test `load_textfile`

Complete the implementation of `load_textfile` in `spacy_on_corpus.py`. 

Then, in the code cell below import `spacy_on_corpus` and run `load_textfile` on 'colby_college.txt'.
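
If you need a reminder of how reading a text file usually looks in python, here is a sketch. It first writes a small temporary file so that it runs on its own; the file name and contents are made up, and the comments in `spacy_on_corpus.py` remain the guide for what `load_textfile` itself should do.

```python
import os
import tempfile

# Write a small file to read back (made-up name and contents)
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('Colby College is in Waterville.\n')

# 'with' closes the file automatically when the block ends
with open(path, 'r', encoding='utf-8') as f:
    text = f.read()

print(text)
```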

In [None]:
# import

# initialize corpus to the empty dictionary

# call load_textfile

# print corpus


The output should look like:
```
{'colby_college.txt': {'doc': Colby College is a private liberal arts college in Waterville, Maine. Founded in 1813 as the Maine Literary and Theological Institution, it was renamed Waterville College in 1821. The donations of Christian philanthropist Gardner Colby saw the institution renamed again to Colby University before settling on its current title, reflecting its liberal arts college curriculum, in 1899. Approximately 2,000 students from more than 60 countries are enrolled annually. The college offers 54 major fields of study and 30 minors.
 
 Located in central Maine, the 714-acre Neo-Georgian campus sits atop Mayflower Hill and overlooks downtown Waterville and the Kennebec River Valley. Along with fellow Maine institutions Bates College and Bowdoin College, Colby competes in the New England Small College Athletic Conference (NESCAC) and the Colby-Bates-Bowdoin Consortium.
}}
```

## Test `load_compressed`

Complete the implementation of `load_compressed` in `spacy_on_corpus.py`. 

Then, in the code cell below import `spacy_on_corpus` and run `load_compressed` on 'files.zip'. (This may take a little while!)
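
If you have not worked with zip archives in python before, the standard `zipfile` module can list and read the files inside one. The sketch below builds a tiny archive in memory so that it runs on its own. The member names and contents are made up, and this is not necessarily how `load_compressed` is meant to be written; follow the comments in `spacy_on_corpus.py` for that.

```python
import io
import zipfile

# Build a small zip archive in memory (made-up file names and contents)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('a.txt', 'first document')
    archive.writestr('b.txt', 'second document')

# Reopen the archive for reading, then list and read its members
texts = {}
with zipfile.ZipFile(buffer) as archive:
    for name in archive.namelist():
        # read() returns bytes, so decode them into a string
        texts[name] = archive.read(name).decode('utf-8')

print(texts)
```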

In [None]:
# import

# initialize corpus to the empty dictionary

# call load_compressed

# print corpus keys


The output should look like:
```
dict_keys(['temp/ark:__27927_pjb5s37cx32', 'temp/ark:__27927_phx1wcjq0tm', 'temp/ark:__27927_phzmmfj893c', 'temp/ark:__27927_phzkfzqzs41', 'temp/ark:__27927_phzq8c34ggp', 'temp/ark:__27927_pjb3ptfm8xd', 'temp/ark:__27927_phz8qhfbxzm', 'temp/ark:__27927_pjb1wn175cv', 'temp/ark:__27927_phznswfkrxz', 'temp/ark:__27927_pjb65xt4m6r', 'temp/ark:__27927_phzq26wnjzn', 'temp/ark:__27927_phzbjns29gn', 'temp/ark:__27927_phzpdcpvdnb', 'temp/ark:__27927_pjb1z8505hp', 'temp/ark:__27927_phz35174v0z', 'temp/ark:__27927_phzjj6kfdxp', 'temp/ark:__27927_pjb16g9m9r7', 'temp/ark:__27927_pjb1z5xzrx7'])
```

## Test `build_corpus`

Complete the implementation of `build_corpus` in `spacy_on_corpus.py`. 

Then, in the code cell below import `spacy_on_corpus` and run `build_corpus` on the pattern 'c*.txt'.

Note, `build_corpus` _returns a corpus_, so you want to assign that return value to a variable (like `my_corpus`).
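
A pattern like 'c*.txt' is a wildcard pattern, in which `*` stands for any run of characters. As a sketch of what such a pattern matches, assuming the usual shell-style matching rules, the standard `fnmatch` module can test file names against it. The third file name below is made up.

```python
import fnmatch

# Two file names from this project plus one made-up name
filenames = ['colby_college.txt', 'columbia_university.txt', 'bates_college.txt']

# Keep only the names that start with 'c' and end with '.txt'
matches = fnmatch.filter(filenames, 'c*.txt')
print(matches)  # ['colby_college.txt', 'columbia_university.txt']
```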

In [None]:
# import

# call build_corpus; assign the returned corpus to my_corpus

# print the keys of my_corpus


The output should look like:
```
dict_keys(['colby_college.txt', 'columbia_university.txt'])
```

## Test `get_basic_statistics`

Complete the implementation of `get_basic_statistics` in `spacy_on_corpus.py`. 

Then, in the code cell below import `spacy_on_corpus` and run `get_basic_statistics` on the given corpus.

In [None]:
# import

# call load_compressed

# call get_basic_statistics on my_corpus


Your output should look like:
```
Documents: 2

Tokens: 187

Unique tokens: 115

Entities: 187

Unique entities: 33
```

## Test `plot_word_entity_frequencies`

Complete the implementation of `plot_word_entity_frequencies` in `spacy_on_corpus.py`. 

Then, in the code cell below import `spacy_on_corpus` and run `plot_word_entity_frequencies` on the given corpus.

In [None]:
# import

# call load_compressed

# call plot_word_entity_frequencies on my_corpus


The resulting file `token_counts.png` should look like:

![token_counts.png](answer_token_counts.png)

The resulting file `entity_counts.png` should look like:

![entity_counts.png](answer_entity_counts.png)

## Test `plot_word_cloud`

Complete the implementation of `plot_word_cloud` in `spacy_on_corpus.py`. 

Then, in the code cell below import `spacy_on_corpus` and run `plot_word_cloud` on the given corpus.

In [None]:
# import

# call load_compressed

# call plot_word_cloud on my_corpus


The resulting file `token_wordcloud.png` should look like:

![token_wordcloud.png](answer_token_wordcloud.png)

# Running `spacy_on_corpus.py` from the Terminal

Complete the implementation of `main` in `spacy_on_corpus.py`. 

Now run this in the terminal:
% `python spacy_on_corpus.py`

Give it `files.zip` as the pattern. Get all of 'statistics', 'wordcount' and 'wordcloud'.

Insert the 'wordcount' images and 'wordcloud' image generated when you run it.

## Token count plot


## Entity count plot


## Word cloud

# Questions

1. *Name a core python function that _does not_ return a value*:
2. *Name a core python function that _does_ return a value*:
3. *What is the structure of each entry in `corpus`?* For example, is it a list of sets of strings, or a dict of lists of ints, or...?
4. *How many tokens and unique tokens are in the corpus defined by `files.zip`?*
5. *How many entities and unique entities are in the corpus defined by `files.zip`?*
6. *How many arguments does `build_corpus` have?*
7. *What is the structure of the return value from `reduce_to_top_k`?*
8. *How can you make `sorted` reverse sort (from largest to smallest)?* You can refer to the python documentation.
9. *How can you modify the code so that the word cloud that is generated doesn't just contain uninteresting words like 'a' and 'and'?*
10. *Once you have answered question 9, from looking at the word cloud what do you think are some themes of this corpus?*

**Isn't the code in spacy_on_corpus.py easier to follow than the code from project 2c??**