# Project 3b: Goals and Deliverables

The goals of this assignment are:
* To analyze corpora, especially focusing on sentences.
* To learn more about NLP.

Here are the steps you should do to successfully complete this project:
1. From Moodle, accept the assignment. Open and set up a code space (install a Python kernel and select it).
2. Complete the notebook and commit it to GitHub. Make sure to answer all questions, and to commit the notebook in a "run" state!
3. I wrote the comments; you write the code! Complete and run `spacy_on_corpus.py` following the instructions in this notebook.
4. Edit the README.md file. Provide your name, your class year, links to/descriptions of any extensions and a list of resources. 
5. Commit your code often. We will take the last commit before the deadline as your submission of the project.

Possible extensions (from least points to most points):
* Augment the function `render_doc_statistics` so that it also includes some sentence and noun chunk statistics.
* Look at the displaCy docs (https://spacy.io/api/top-level#displacy_options). Make the visualizations of the dependency parses more elegant or functional.
* Allow the user to choose an individual document. Save all the visualizations of the sentence parses into a file.
* Your other ideas are welcome! If you'd like to discuss one with Dr Stent, feel free.

# Setup

## Install Our Packages

On the command line (in the terminal), type:

% `pip install -r requirements.txt`

## Upload Our Data

From Moodle, download `files.jsonl.zip`. 

Then, upload `files.jsonl.zip` to the code space.

## Make Sure We Can Work With .py Files We Are Editing

Run the code cell below.

In [None]:
# Automatically reload your external source code
%load_ext autoreload
%autoreload 2

# Doing Things with Sentences

So far our NLP experience has involved **tokens**. 

This week we will look at some things NLP can do with **sentences**.

# Sentences

What is a sentence? The Oxford Dictionary of Linguistics says the term **sentence** is *Usually conceived, explicitly or implicitly, as the largest unit of grammar, or the largest unit over which a rule of grammar can operate*.

What does this mean?

1. A sentence can stand on its own.
2. Parts of a sentence may also be able to stand on their own, in particular if a sentence is composed of **independent clauses**.




# Sentence Segmentation 

Let's use spaCy to get the sentences from a corpus.

In [None]:
# import spacy
import spacy

# import spacy_on_corpus
import spacy_on_corpus

# we will use a slightly larger spaCy model for this project
nlp = spacy.load('en_core_web_md')

# make a corpus from c*.txt
my_corpus = spacy_on_corpus.build_corpus('c*.txt', {}, nlp)

# extract a list of all the sentences in the corpus
my_sentences = []
for doc_id in my_corpus:
    my_sentences.extend(list(my_corpus[doc_id]['doc'].sents))

# print the sentences
for sentence in my_sentences:
    print(sentence)


Now let's inspect one sentence. What attributes does a sentence have?

In [None]:
# does a sentence have start, end and label attributes like an entity?
for sentence in my_sentences:
    print(sentence.start, sentence.end, sentence.label)

# Parses

The structure of a token is described by its *morphology*.

The structure of a sentence is described by its *syntax*. spaCy supports a particular model of syntax, called *dependency parsing*.
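
To make the idea concrete before looking at real parses, here is a minimal sketch of a dependency parse as a plain data structure. The sentence, the labels and the helper `describe_arcs` are written by hand for illustration; they are assumptions and not output from spaCy. Each token points to one *head* token, and the root of the sentence is treated as its own head.

```python
# A hand-annotated dependency parse of "The snow melts." (not spaCy output).
# Each entry is (token text, index of its head token, dependency label).
# In this sketch the root token is its own head.
toy_parse = [
    ("The", 1, "det"),      # "The" modifies "snow"
    ("snow", 2, "nsubj"),   # "snow" is the subject of "melts"
    ("melts", 2, "ROOT"),   # the main verb is the root of the sentence
    (".", 2, "punct"),      # punctuation attaches to the root
]


def describe_arcs(parse):
    """Return one 'head --label--> dependent' string per non-root token."""
    arcs = []
    for index, (text, head_index, label) in enumerate(parse):
        if head_index == index:
            continue  # skip the root, which has no incoming arc
        head_text = parse[head_index][0]
        arcs.append(f"{head_text} --{label}--> {text}")
    return arcs


for arc in describe_arcs(toy_parse):
    print(arc)
```

Every token except the root has exactly one incoming arc, which is why a dependency parse can be drawn as a tree.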

Let's take a look at the dependency parse for some of these sentences.

In [None]:
from spacy import displacy

displacy.render(my_sentences, style="dep")

# Chunks

As you can tell, looking at the parses for *all* the sentences in a document can be overwhelming! Are there condensed representations of sentence structure, or phrase structure, that are easier to summarize or visualize? Yes!

Let's look at how we can extract noun phrase **chunks** from spaCy sentence parses.

In [None]:
# make a dictionary of chunk counts
my_chunks = {}

# get all the chunks from the corpus
for doc_id in my_corpus:
    for chunk in my_corpus[doc_id]['doc'].noun_chunks:
        if chunk.text not in my_chunks:
            my_chunks[chunk.text] = 0
        my_chunks[chunk.text] += 1

# print the chunks
print(my_chunks)
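
The same counting pattern can be sketched with `collections.Counter` from the Python standard library. The list `example_chunk_texts` below is made up for illustration and stands in for the `chunk.text` values you would get from `noun_chunks`; it is not data from the corpus.

```python
from collections import Counter

# Hypothetical chunk texts, standing in for the `chunk.text` values
# produced by iterating over a document's noun chunks.
example_chunk_texts = ["Maine", "the trees", "People", "Maine", "People", "the sky"]

# Counter adds one to a key's count each time that key appears,
# starting from zero for keys it has not seen before.
chunk_counts = Counter(example_chunk_texts)

# most_common() returns (text, count) pairs, highest count first.
for text, count in chunk_counts.most_common():
    print(text, count)
```

Note that the keys are strings. Two different chunk objects with the same text count as the same key only because we count their text.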

# Sentence Similarities

So far, we have only looked at the lexical (surface) and syntactic (structural) properties of tokens, entities and sentences. What about their meanings?

In NLP, we represent the *meaning in context* of a text using a **vector**, or list of numbers, learned from data. 

We can use the vectors for a pair of texts to calculate their **semantic** similarity (as opposed to just, say, the number of characters or words they have in common). We get a floating point number out when we do this.
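
A common way to turn two vectors into one similarity number is *cosine similarity*, which is also what spaCy's `similarity` method uses by default. The sketch below computes it by hand on tiny made-up vectors; the names `snow`, `ice` and `campfire` and their values are assumptions chosen for illustration, not vectors from any spaCy model.

```python
import math


def cosine_similarity(vector_a, vector_b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot_product = sum(a * b for a, b in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(a * a for a in vector_a))
    norm_b = math.sqrt(sum(b * b for b in vector_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen in this sketch for all-zero vectors
    return dot_product / (norm_a * norm_b)


# Made-up 3-dimensional vectors; real models use hundreds of dimensions.
snow = [0.9, 0.1, 0.0]
ice = [0.8, 0.2, 0.1]
campfire = [0.0, 0.2, 0.9]

print(cosine_similarity(snow, ice))       # vectors pointing in similar directions
print(cosine_similarity(snow, campfire))  # vectors pointing in different directions
```

Vectors that point in nearly the same direction score close to 1, and vectors that share little score close to 0.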

These vectors are quite large so they use a lot of memory. The spaCy `en_core_web_sm` model doesn't come with the vectors, but the `en_core_web_md` one does.

Let's compare the vector similarity of all pairs of our sentences.

In [None]:
for sentence in my_sentences:
    for sentence2 in my_sentences:
        print(sentence, "<->", sentence2, sentence.similarity(sentence2))

Do the output similarities match your intuitions about the similarity or difference between the sentence pairs?

Let's look a little more at this notion of vectors-as-meanings in NLP: https://jalammar.github.io/illustrated-word2vec/.

# Testing the Functions in spacy_on_corpus.py

For this project, you will be extending your code in `spacy_on_corpus.py`. You will fill in the functions and test them in this section.

## Test `build_corpus`

In the code cell below import `spacy_on_corpus` and run `build_corpus` on `test.jsonl`. Assign the result to `my_corpus`.

In [None]:
import spacy_on_corpus

my_corpus = spacy_on_corpus.build_corpus('test.jsonl', {}, nlp)

print(my_corpus)

## Test `get_noun_chunk_counts`

Complete the implementation of `get_noun_chunk_counts` in `spacy_on_corpus.py`. This will look a lot like `get_token_counts` or `get_entity_counts`. However, there are no tags to exclude.

Then, in the code cell below import `spacy_on_corpus` and run `get_noun_chunk_counts` on `my_corpus`. 

In [None]:
import spacy_on_corpus

spacy_on_corpus.get_noun_chunk_counts(my_corpus)

Your answer should look like:
```
[('Autumn', 1),
 ('Maine', 2),
 ('The leaves', 1),
 ('the trees', 1),
 ('the sky', 1),
 ('the iconic Maine rocks', 1),
 ('People', 2),
 ('leaf peeping', 1),
 ('apple picking', 1),
 ('a campfire', 1),
 ('the evenings', 1),
 ('Winter', 1),
 ('The snow', 1),
 ('all the deer ticks', 1),
 ('their skis', 1),
 ('snow shoes', 1),
 ('snowmobiles', 1),
 ('ice fishing', 1)]
```

## Test `get_basic_statistics`

Complete the implementation of `get_basic_statistics` in `spacy_on_corpus.py`. 

Then, in the code cell below import `spacy_on_corpus` and run `get_basic_statistics` on `my_corpus`.

In [None]:
# import
from spacy_on_corpus import get_basic_statistics

# call get_basic_statistics on my_corpus
get_basic_statistics(my_corpus)

Your output should look like:
```
Documents: 2

Sentences: 7

Tokens: 67

Unique tokens: 50

Entities: 5

Unique entities: 3

Chunks: 20

Unique chunks: 18

Publication year range: 2023-2023

Page count year range: 1-1
```

## Test `plot_token_frequencies`, `plot_entity_frequencies` and `plot_chunk_frequencies`

Complete the implementation of these three functions in `spacy_on_corpus.py`. Make sure to exclude tokens that are 'uninformative'.

Then, in the code cell below import `spacy_on_corpus` and run each of them on `my_corpus`.

In [None]:
# import
import spacy_on_corpus

spacy_on_corpus.plot_token_frequencies(my_corpus)
spacy_on_corpus.plot_entity_frequencies(my_corpus)
spacy_on_corpus.plot_chunk_frequencies(my_corpus)

The resulting file `token_counts.png` should look like:

![token_counts.png](answer_token_counts.png)

The resulting file `entity_counts.png` should look like:

![entity_counts.png](answer_entity_counts.png)

The resulting file `chunk_counts.png` should look like:

![chunk_counts.png](answer_chunk_counts.png)

## Test `plot_token_cloud`, `plot_entity_cloud` and `plot_chunk_cloud`

Complete the implementation of these functions in `spacy_on_corpus.py`. Make sure to exclude tokens that are 'uninformative'.

Then, in the code cell below import `spacy_on_corpus` and run each of these functions on `my_corpus`.

In [None]:
# import
import spacy_on_corpus
# call the three cloud-plotting functions on my_corpus
spacy_on_corpus.plot_token_cloud(my_corpus)
spacy_on_corpus.plot_entity_cloud(my_corpus)
spacy_on_corpus.plot_chunk_cloud(my_corpus)

The resulting file `token_wordcloud.png` should look like:

![token_wordcloud.png](answer_token_wordcloud.png)

The resulting file `entity_wordcloud.png` should look like:

![entity_wordcloud.png](answer_entity_wordcloud.png)

The resulting file `chunk_wordcloud.png` should look like:

![chunk_wordcloud.png](answer_chunk_wordcloud.png)

# Running `spacy_on_corpus.py` from the Terminal

Complete the implementation of `main` in `spacy_on_corpus.py`. 

Now run this in the terminal:

% `python spacy_on_corpus.py`

Give it `files.jsonl.zip` as the pattern. For 'corpus', get all of 'statistics', 'count' and 'cloud'. Then, for any one document id, get 'statistics', 'markdown' and 'table' for that document.

Copy the published statistics:
```
here
```

Insert the images generated when you run it.

### Token count plot


### Entity count plot

### Chunk count plot


### Token cloud

### Entity cloud

### Chunk cloud


### Publication year plot


### Page count plot


# Questions

Answer these questions.

1. *What are some simple rules for sentence segmentation (splitting text into sentences)?*
2. *What is an example sentence where your simple rules don't work?*
3. *Why might I care what the syntax of a sentence is?*
4. *Why might I care about the 'vector similarity' between two sentences?*
5. *What is one difference between `en_core_web_sm` and `en_core_web_md`?*