# Project 3c: Goals and Deliverables

The goals of this assignment are:
* To develop an object-oriented version of our corpus code, focusing on writing the most concise, readable code possible.

Follow these steps to complete this project:
1. From Moodle, accept the assignment. Open and set up a code space (install a Python kernel and select it).
2. Complete the notebook and commit it to GitHub. Make sure to answer all questions, and to commit the notebook in a "run" state!
3. I wrote the comments; you write the code! Complete and run `spacy_on_corpus.py` following the instructions in this notebook.
4. Edit the README.md file. Provide your name, your class year, links to or descriptions of any extensions, and a list of resources.
5. Commit your code often. We will take the last commit before the deadline as your submission of the project.

Possible extensions (from least points to most points):

* Add one other type of corpus or document visualization or statistic.
* Modify the token, entity, and noun chunk functions so they count only lowercased tokens, entities and noun chunks.
* The `collections` module in Python has a `Counter` class. Copy `spacy_on_corpus.py` to `spacy_on_corpus_collections.py` and use that class instead of your own. Time the execution of the test code for the counter in the section **Test the Counter Class** below. Which is faster: your counter class, or the one in `collections`?
* Copy `spacy_on_corpus.py` to `spacy_on_corpus_document.py`. In the new file, make a document class. Move the methods `render_doc_markdown`, `render_doc_table` and `render_doc_statistics` to the document class. Test it.
* We can currently represent tokens, entities and sentences. Use the span class we developed this week in class to implement a representation of paragraphs. Add a list of paragraph spans to each document. Add summary statistics about paragraphs to `get_basic_statistics`.
* Your other ideas are welcome! If you'd like to discuss one with Dr. Stent, feel free.
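
For the timing extension, the standard-library `timeit` module is one way to compare two counting approaches. The sketch below is only an illustration: `count_with_dict` is a hypothetical stand-in for a hand-written counter (not the class in `spacy_on_corpus.py`), and the repeated token list is invented to make the run long enough to measure.

```python
import timeit
from collections import Counter

# Invented sample data: a short token list repeated to get measurable timings
tokens = ['Mary', 'had', 'a', 'little', 'lamb', '.'] * 1000


def count_with_dict(items):
    """Hypothetical stand-in for a hand-written counter: a plain dict loop."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts


def count_with_collections(items):
    """Count the same items with the standard-library Counter."""
    return Counter(items)


# timeit.timeit calls each function `number` times and returns total seconds
dict_seconds = timeit.timeit(lambda: count_with_dict(tokens), number=100)
collections_seconds = timeit.timeit(lambda: count_with_collections(tokens), number=100)

print(f"dict loop:           {dict_seconds:.4f} s")
print(f"collections.Counter: {collections_seconds:.4f} s")
```

Timings vary from machine to machine and run to run, so it helps to repeat the measurement a few times before drawing a conclusion.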

# Setup

## Install Our Packages

On the command line (in the terminal), type:

% `pip install -r requirements.txt`

## Upload Our Data

From Moodle, download `files.jsonl.zip`. 

Then, upload `files.jsonl.zip` to the code space.

## Make Sure We Can Work With .py Files We Are Editing

Run the code cell below.

```python
# Automatically reload your external source code
%load_ext autoreload
%autoreload 2
```

# The Counter Class

First, we implement a counter class to abstract away the counting code we keep writing.

The counter class has a constructor, a setter (`add_item`), a getter (`get_counts`), and one other method, `reduce_to_top_k`.

Complete the implementation of the methods in this class using the docstrings and comments provided.

## Test the Counter Class

In the code cell below:

1. make a counter from this list of items: `['Mary', 'had', 'a', 'little', 'lamb', '.', 'It', '\'s', 'fleece', 'was', 'white', 'as', 'snow', '.', 'And', 'everywhere', 'that', 'Mary', 'went', ',', 'the', 'little', 'lamb', 'would', 'go', '.']`
2. add these items to the counter: `['Little', 'lambs', 'are', 'cute', '.']`
3. print the counts
4. reduce the counts to the top 5
5. print the counts

```python
from spacy_on_corpus import counter
```

Your output should look like:
```
[('Mary', 2), ('had', 1), ('a', 1), ('little', 2), ('lamb', 2), ('.', 4), ('It', 1), ("'s", 1), ('fleece', 1), ('was', 1), ('white', 1), ('as', 1), ('snow', 1), ('And', 1), ('everywhere', 1), ('that', 1), ('went', 1), (',', 1), ('the', 1), ('would', 1), ('go', 1), ('Little', 1), ('lambs', 1), ('are', 1), ('cute', 1)]
[('Mary', 2), ('little', 2), ('lamb', 2), ('.', 4)]
```
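
Note that reducing to the top 5 leaves only four items in the expected output above. One reading that reproduces this is that items tied at the cutoff count are dropped along with everything below it. The sketch below illustrates that reading only. `SketchCounter` is a hypothetical name, the `dict` base class and the tie rule are assumptions, and the docstrings in `spacy_on_corpus.py` take precedence.

```python
class SketchCounter(dict):
    """Hypothetical stand-in; not the counter class in spacy_on_corpus.py."""

    def __init__(self, items=None):
        super().__init__()
        for item in items or []:
            self.add_item(item)

    def add_item(self, item):
        # dict.get returns 0 the first time an item is seen
        self[item] = self.get(item, 0) + 1

    def get_counts(self):
        # Pairs come back in first-seen order
        return list(self.items())

    def reduce_to_top_k(self, k):
        # Assumption: k is a positive integer, and every item whose count is
        # not strictly above the k-th largest count is removed (ties included)
        if k >= len(self):
            return
        cutoff = sorted(self.values(), reverse=True)[k - 1]
        for item in [i for i, n in self.items() if n <= cutoff]:
            del self[item]


demo = SketchCounter(['a', 'b', 'a', 'c', 'b', 'a'])
demo.add_item('d')
print(demo.get_counts())   # [('a', 3), ('b', 2), ('c', 1), ('d', 1)]
demo.reduce_to_top_k(3)
print(demo.get_counts())   # [('a', 3), ('b', 2)]
```

In the small demo, the third-largest count is 1, so both items with a count of 1 are removed and only two items remain.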

# The Corpus Class

Next, we implement a corpus class. 

The corpus class has a constructor, several getters and setters, and methods corresponding to each function from Project 3b.

Complete the implementation of the methods in this class using the docstrings and comments provided.
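
As a rough picture of the overall shape, the sketch below shows a class with one shared class attribute and per-instance storage. Everything in it is an assumption made for illustration: `SketchCorpus`, `whitespace_nlp` and `from_jsonl_lines` are hypothetical names, a whitespace splitter stands in for the spaCy pipeline, and the `id` and `fullText` keys are taken from the sample metadata shown in this notebook. The real method signatures are the ones documented in `spacy_on_corpus.py`.

```python
import json


def whitespace_nlp(text):
    """Stand-in for a spaCy pipeline: splits on whitespace only."""
    return text.split()


class SketchCorpus:
    """Hypothetical outline; not the corpus class in spacy_on_corpus.py."""

    # Class attribute: every corpus shares this one pipeline object
    nlp = staticmethod(whitespace_nlp)

    def __init__(self, name):
        # Instance attributes: each corpus gets its own name and storage
        self.name = name
        self.documents = {}   # id -> processed document
        self.metadatas = {}   # id -> metadata dict

    def add_document(self, doc_id, doc, metadata=None):
        # Store an already-processed document and its metadata under one id
        self.documents[doc_id] = doc
        self.metadatas[doc_id] = metadata if metadata is not None else {}

    def get_document(self, doc_id):
        return self.documents[doc_id]

    def get_documents(self):
        return list(self.documents.values())

    @classmethod
    def from_jsonl_lines(cls, name, lines):
        # Assumes one JSON object per line with 'id' and 'fullText' keys
        corpus = cls(name)
        for line in lines:
            record = json.loads(line)
            corpus.add_document(record['id'], cls.nlp(record['fullText']), record)
        return corpus


lines = ['{"id": "1", "fullText": "Autumn in Maine is really beautiful."}']
corpus = SketchCorpus.from_jsonl_lines('test.jsonl', lines)
corpus.add_document('3', SketchCorpus.nlp('This is a third document!'))
print(corpus.name)
print('documents: ', len(corpus.get_documents()))
```

Keeping the documents and metadata in two dictionaries keyed by the same id makes lookups by id direct, at the cost of having to keep the two in sync whenever a document is added.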




## Test the Corpus Class Getters and Setters

In the code cell below:

1. make a corpus from `test.jsonl`
2. print the corpus name
3. use `get_document` to get the document corresponding to id `1`; print it
4. use `get_metadata` to get the metadata corresponding to id `1`; print it
5. add a document to the corpus using `add_document`; provide id `'3'` and a spaCy document with the text 'This is a third document!'
6. print the number of documents in this corpus, using `get_documents` to get the documents
7. print the number of metadatas in this corpus, using `get_metadatas` to get the metadatas


```python
from spacy_on_corpus import corpus
```

Your output should look like:
```
test.jsonl
Autumn in Maine is really beautiful. The leaves fall from the trees, exposing the sky and the iconic Maine rocks. People go leaf peeping or apple picking, or sit around a campfire in the evenings.
{'author': 'Amanda Stent', 'fullText': 'Autumn in Maine is really beautiful. The leaves fall from the trees, exposing the sky and the iconic Maine rocks. People go leaf peeping or apple picking, or sit around a campfire in the evenings.', 'publicationYear': 2023, 'pageCount': 1, 'docType': 'paragraph', 'id': '1'}
documents:  3
metadatas:  3
```

## Test the Corpus Class Methods for Getting Token, Entity, Noun Chunk and Metadata Counts

In the code cell below:

1. print the number of unique tokens in this corpus
2. print the number of unique entities in this corpus
3. print the number of unique noun chunks in this corpus
4. print the number of unique page counts in this corpus
5. print the number of unique publication years in this corpus

Your output should look like:
```
token counts:  56
entity counts:  4
noun chunk counts:  20
pageCount counts:  1
publicationYear counts:  1
```
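
The metadata counts can be thought of as counting the distinct values stored under one key across all documents. The sketch below uses the standard-library `collections.Counter` on invented sample metadata. The `metadata_counts` helper and the contents of `sample_metadatas` are hypothetical, and it assumes that a document added without metadata simply lacks the key.

```python
from collections import Counter

# Invented sample: metadata dicts keyed by document id
sample_metadatas = {
    '1': {'publicationYear': 2023, 'pageCount': 1},
    '2': {'publicationYear': 2023, 'pageCount': 1},
    '3': {},  # a document added without metadata
}


def metadata_counts(metadatas, key):
    """Count how often each value of `key` occurs, skipping documents without it."""
    return Counter(m[key] for m in metadatas.values() if key in m)


page_counts = metadata_counts(sample_metadatas, 'pageCount')
year_counts = metadata_counts(sample_metadatas, 'publicationYear')

# The number of unique values is the number of keys in each Counter
print('pageCount counts: ', len(page_counts))
print('publicationYear counts: ', len(year_counts))
```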

## Test the Corpus Class Statistical Analysis and Visualization Methods

In the code cell below, call these methods using the default tags to exclude:

1. `get_basic_statistics`
2. `plot_token_frequencies`, `plot_entity_frequencies`, `plot_noun_chunk_frequencies`, `plot_metadata_frequencies` (use the key 'pageCount')
3. `plot_token_cloud`, `plot_entity_cloud`, `plot_noun_chunk_cloud`

Your output should look like:
```
Documents: 3

Sentences: 8

Tokens: 83

Unique tokens: 56

Entities: 6

Unique entities: 4

Chunks: 22

Unique chunks: 20

Publication year range: 2023-2023

Page count year range: 1-1
```

You can compare your output plots to those in the answer files.

## Test the Corpus Class Document Visualization Methods

In the code cell below, call these methods on the document with id `1`:

1. `render_doc_markdown`
2. `render_doc_table`
3. `render_doc_statistics`

Compare your output to that in the answer files.

# Test Main

Update the `main` function to create a corpus object and use the corpus methods.

On the command line, run `spacy_on_corpus.py` on `files.jsonl.zip`. Get all outputs:

* basic statistics: *paste here*
* token, entity, noun chunk and metadata counts (for publication year)
* token, entity and noun chunk word clouds
* document markdown, table and statistics for the document with id `ark://27927/phz35174v0z`

# Questions

Answer these questions.

1. *How many attributes are defined in the corpus class definition?*
2. *Which are instance attributes and which are method attributes?*
3. *Where does the corpus itself get stored?*
4. *List the class methods defined for the corpus class.*
5. *What is the type of all the other methods in the corpus class?*
6. *How does the counter class help us implement the corpus class?*
7. *What methods does counter have that it inherits from dict? (use `help`)*
8. *In `spacy_on_corpus.py`, redefine one of the methods that counter has that it inherits from dict. Tell us which one here:*
9. *Why is it good for `nlp` to be shared among all the corpora?*
10. *List one thing about object-oriented programming that you find clear and useful, and one thing you find unclear or not useful:*
    * *clear and useful:*
    * *unclear or not useful:*