# Project 2: Goals and Deliverables

The goals of this assignment are:
* To introduce you to some basic NLP tasks: tasks that operate over **words**.
* To introduce you to some of the ways NLP has been done over time, and how NLP illustrates the historical development of AI.

Here are the steps you should do to successfully complete this project:
1. From Moodle, accept the assignment.
2. Open and set up a Codespace (install a python kernel and select it).
3. Complete the notebook and commit it to GitHub. Make sure to answer all questions, and to commit the notebook in a "run" state!
4. Edit the README.md file. Provide your name, your class year and what you hope to get out of this course. Make sure to include the output from running second_nlp.py!
5. Commit your code often. We will take the last commit before the deadline as your submission of the project.

For extra credit:
* Read the spaCy docs (https://spacy.io/usage/models). Figure out how to make spaCy work for another language. Add a mercury drop-down menu (https://runmercury.com/docs/input-widgets/select/) for choosing a language. Include a screenshot of your modified web app running.
* Read the spaCy docs for token-based matching (https://spacy.io/usage/rule-based-matching). Write a rule matcher for recognizing iPhone models (iPhone, iphone, iPhone 6, etc.).
* ChatGPT vs spaCy! Pick a text, and run it through spaCy to get the tokens, lemmas, parts of speech and named entities. Then ask ChatGPT to give you the tokens, lemmas, parts of speech and named entities for the same text. Make a table listing the output from spaCy, the output from ChatGPT, and your analysis of each output.
* Your other ideas are welcome! If you'd like to discuss one with Dr Stent, feel free.

# Sources

This notebook uses content from:
* The [spaCy docs](https://spacy.io/usage/linguistic-features)
* The [Mercury tutorials](https://runmercury.com/tutorials/web-app-python-jupyter-notebook/)

# Getting Started

Go to Moodle. Click on the link under the second lab session.

Click to create/open a Codespace.

Click on the notebook filename to open it (it ends with .ipynb).

The first time you run it, it will ask you to install python extensions and set the kernel type. We will walk through this in lab.

**Be Sure To Save Your Work!!!**

In Visual Studio Code, use the little blue dot on the left-hand side to "commit" and "sync" your work.

# Review: Getting Started with NLP in Python

You will do Step 0 for *each new Codespace* (every time you have to install a python kernel).

You will do Steps 1-2 *every time* you do some NLP with spaCy.

## Step 0: Install and import spaCy

First we have to **install** the spacy python package.

(The ! tells Jupyter to execute this code as a regular Unix command, not python code. You will learn how to use Unix commands soon!)

In [None]:
!pip install spacy

spaCy is an ML-driven NLP library, so we also have to install a spaCy **model**.

By the way, there are lots of spaCy models. To see a full list, click [here](https://spacy.io/usage/models/).

In [None]:
!python -m spacy download en_core_web_sm

## Step 1: Set up spaCy

Now we must **import** (add) the [spaCy NLP](https://spacy.io/) library.

In the cell below, type:

```import spacy```

In [None]:
import spacy

Now we must set up spaCy in python. We make a spaCy NLP **engine** and give it the **model**.

In the cell below, type:

```nlp = spacy.load("en_core_web_sm")```

In [None]:
# Let's make a spaCy engine!
nlp = spacy.load("en_core_web_sm")

## Step 2: Create a document and run spaCy on it

Now we will create a document and run the NLP on it.

In the cell below, type:

```text = " Cathy and Amanda went for a drive to Portland, ME last June. It was 146.0 miles round-trip. We visited the Roux Institute and learned about AI-related graduate school programs."```

```doc = nlp(text)```

In [None]:
# Let's make a document!

# Doing Things with Words

## Tokenization

### spaCy Tokenization

When spaCy processes a document, the first thing it does is split it into **tokens**. 

A spaCy **token** is an example of a specialized data type in python called an **object** (an instance of a **class**). We will learn more about how to create these specialized data types later in the semester.

For now, just notice that:
1. Each spaCy token contains (has an attribute for) text: to get the text, we say ```token.text```.
2. spaCy does *not* make tokens for the single spaces that separate words.

In [None]:
token_texts = [token.text for token in doc]
# Print the token_texts



### What is a Word? What is a Token?

We are used to talking about language in terms of **words**. *The Concise Oxford Dictionary of Linguistics* defines a "word" as follows:
*"Traditionally the smallest of the units that make up a sentence, and marked as such in writing...."* (We will talk about sentences in a few weeks!)

As you can tell, this definition is not very precise! For computer software (even AI) we need *precise* definitions of things. 

All sorts of interesting questions come up when we try to talk about words:
* Is a punctuation mark a word?
* What about a number?
* Sometimes hyphenated "words" are one word and sometimes more than one. Who decides?

When NLP software like spaCy tries to split text into words, we call that **tokenization**. We call the results **tokens**. Most **tokens** are words, but sometimes the tokenization will be wrong from a linguistic perspective.

NLP researchers often spend a lot of time creating [guidelines](https://catalog.ldc.upenn.edu/docs/LDC2011T03/treebank/english-translation-treebank-guidelines.pdf) (see also [these guidelines](https://universaldependencies.org/u/overview/tokenization.html)) for tokenization, since tokens are the *most basic unit* of analysis for many NLP tasks.
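
To see why these questions matter, here is a minimal sketch of a rule-based tokenizer written with python's built-in `re` module. It is *not* how spaCy tokenizes. The pattern and the names `sample` and `toy_tokens` are assumptions made up for this illustration.

```python
import re

sample = "It was 146.0 miles round-trip."

# Toy rule: a token is either a run of word characters (letters, digits, _)
# or a single character that is neither a word character nor white space.
toy_tokens = re.findall(r"\w+|[^\w\s]", sample)
print(toy_tokens)
```

Notice what this toy rule does to the number and to the hyphenated word. Every tokenizer has to make a decision about cases like these, which is why written guidelines are useful.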

### White Space Tokenization

A super-simple tokenizer would just split text on "white space". Let's compare that with spaCy's tokenization.
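
Before you run the cell below, here is a small sketch of two ways python can split on white space. The variable names are made up for this illustration, and the sample is just the first sentence of our document (including its leading space).

```python
sample = " Cathy and Amanda went for a drive to Portland, ME last June."

# split(' ') cuts at every single space character, so a leading space
# produces an empty string at the start of the list
by_single_space = sample.split(' ')

# split() with no argument cuts at runs of white space and drops empty strings
by_any_whitespace = sample.split()

print(by_single_space)
print(by_any_whitespace)
```

Either way, punctuation stays attached to the neighboring word, which is one thing to look for when you compare against spaCy.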


In [None]:
whitespace_tokens = text.split(' ')
# Print the whitespace_tokens

Questions:
1. *List two of the words from the document where you think the white space tokenizer tokenized correctly, and two where you think it didn't.*
2. *List two of the words from the document where you think spaCy tokenized correctly, and two where you think it didn't.*

**Important note**: For now, we will assume that we *know* the language of the input text. Later in the semester, we will look at how to use NLP to *find out* the language of the input text.

## Part of Speech Tagging


Quite often, we want to group words together by how they *behave*. For example, we have talked about:
* nouns
* verbs

We call these groups **word classes** and the labels of the groups **part of speech tags**. *The Concise Oxford Dictionary of Linguistics* defines word class as follows: *"Any class of word established by similarities in syntax or in grammar generally."*

When a spaCy NLP engine runs on a document, it attaches a part of speech tag to each token. Actually, for our document it attaches two! 
* First, tags from a coarse-grained set of word classes described [here](https://universaldependencies.org/u/pos/index.html) and available via the token attribute `pos_`
* Second, for English, tags from a fine-grained set of word classes described [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) and available via the token attribute `tag_`

The second set of tags was developed as part of a famous project called the *Penn Treebank*; however, it was only for English. 

The first set of tags was developed as part of a famous project called *Universal Dependencies*, which aims to create consistently *annotated* natural language data sets (called **corpora**) for all natural languages. Most of spaCy's non-English models are trained on Universal Dependencies data.

Let's look at the tags for each of our tokens.

In [None]:
# These tags represent a small set of word classes
token_coarse_tags = [token.pos_ for token in doc]
# Print the tags

In [None]:
# These tags represent a larger set of word classes
token_fine_tags = [token.tag_ for token in doc]
# Print the tags

These tags (especially the fine-grained ones) are not very descriptive! If you want more information about a tag, you can use spaCy's ```explain()``` function.

In the cell below, type ```spacy.explain("VBD")```. (The tag name is a string, so it needs quotation marks.)
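
If you want to look up several labels at once, a loop helps. This is a minimal sketch. The three labels are chosen only for illustration (a coarse tag, a fine tag and a named entity label), and it assumes spaCy is installed as in Step 0.

```python
import spacy

# Labels chosen for illustration: a coarse tag, a fine tag, an entity label
for label in ["PROPN", "VBD", "GPE"]:
    # spacy.explain() returns a short description, or None for unknown labels
    print(label, "->", spacy.explain(label))
```

`spacy.explain()` reads from a glossary that ships with the spaCy package, so it works even before you load a model.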

In [None]:
# Let's get more information about a part of speech tag

Questions:

3. *Using the ```explain()``` function, look up the coarse and fine-grained part of speech tags and fill in this table:*

| Word class | Coarse | Fine |
| ---------- | ------ | ---- |
| Noun       |        |      |
| Verb       |        |      |
| Adjective  |        |      |
| Adverb     |        |      |

4. *Compare "NN" and "NNS". What is the difference?*

5. *Which of the tokens that spaCy tags as NOUN is not actually a noun?*
6. *The part of speech tagger runs after the tokenizer. If the tokenizer is wrong, what does that mean for the part of speech tagger?*

## Lemmatization and Morphology

We know that words have internal structure. For example, in your earliest English classes you learned about *prefixes* and *suffixes* like:

* 'un' - reverses the meaning of an adjective or verb (as in *unhappy* or *undo*)
* 'ed' - puts a regular verb in the past tense

The field of linguistics that looks at the structure of words is called **morphology**. *The Concise Oxford Dictionary of Linguistics* defines morphology as: *"The study of the grammatical structure of words and the categories realized by them."* It may be weird to think of a word as having (inside itself) a grammatical structure, but words definitely do. In English (and many other languages) this structure is realized through:

* prefixes
* suffixes
* infixes

When it processes an English document, spaCy attaches two types of information to each token that are related to its structure:

1. A **lemma** - a "quick and dirty" approximation of the root form of the word, available via the token attribute `lemma_`
2. A **morphological analysis**  - a more complete analysis of the structure of the word, available via the token attribute `morph`
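
To see why finding a lemma takes more than chopping off a suffix, here is a toy sketch. The helper `strip_ed` is hypothetical and written only for this illustration. It is not spaCy's lemmatizer.

```python
def strip_ed(word):
    """Naively remove a trailing 'ed' (a toy rule, not spaCy's lemmatizer)."""
    # Only strip when the word is longer than three letters, so "red" is kept
    if word.endswith("ed") and len(word) > 3:
        return word[:-2]
    return word

# Words taken from (or related to) our example document
for word in ["visited", "learned", "went", "related", "red"]:
    print(word, "->", strip_ed(word))
```

The rule works for regular verbs like *visited*, but it misses the irregular verb *went* and it damages *related*. Compare these results with the lemmas spaCy gives you below.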

In the code cell below, print the lemmas for the tokens in this document.

In [None]:
token_lemmas = [token.lemma_ for token in doc]
# Print the lemmas

Now print the morphological analyses for the tokens in this document.

In [None]:
token_morphs = [str(token.morph) for token in doc]
# Print the morphological analyses

There are many other token attributes spaCy provides that are related to morphology. A full description is [here](https://spacy.io/api/token/). 
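
As a small taste, here is a sketch that prints three of those attributes. It assumes spaCy is installed, and it uses `spacy.blank("en")`, which builds an English pipeline with only a tokenizer (no model download). The names `blank_nlp` and `blank_doc` are made up for this illustration.

```python
import spacy

# A tokenizer-only English pipeline: no tagger, lemmatizer or entity recognizer
blank_nlp = spacy.blank("en")
blank_doc = blank_nlp("It was 146.0 miles round-trip.")

for token in blank_doc:
    # shape_ abstracts the characters: X = uppercase, x = lowercase, d = digit
    # is_punct and like_num are True/False flags computed from the token text
    print(token.text, token.shape_, token.is_punct, token.like_num)
```

Because these attributes depend only on the text of the token, they are available even without a trained model. Attributes like `pos_` and `lemma_` need the full `en_core_web_sm` pipeline from Step 1.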

Questions:

7. *What is one other type of token attribute you would like to use?*
8. *What is one type of token attribute that we can just get via python string methods?*

## Named Entities

Some tokens (or multi-token units) are special because they *name* things. For example:

* Names of people (like *Cathy Fan*)
* Names of organizations (like *Colby College*)
* Names of locations (like *Waterville*)

spaCy attaches information about named entities to the document. In the code cell below, print the named entities for this document.

In [None]:
entity_texts = [ent.text for ent in doc.ents]
entity_types = [ent.label_ for ent in doc.ents]
# Print the entity_texts

# Print the entity_types


# What's In a Word? Putting It All Together

Now we will practice our pretty printing, using the token attributes. We want to print a table, in python, with columns like this:

| Token | Lemma | Coarse-grained Part of speech | Fine-grained Part of speech | Token shape | Token is punctuation? | Token morphology |
| ----- | ----- | ----------------------------- | --------------------------- | ----------- | --------------------- | ---------------- |
|       |       |                               |                             |             |                       |                  |

I will give you the structure, which is a **for loop** (we will learn about this in the coming week!), and you write the print statement below.
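
One way to line up columns is an f-string with a width after the colon. This is only a sketch of the formatting idea. The rows below are hand-written stand-ins (hypothetical values, not spaCy output), and you may choose different widths or columns.

```python
# Hypothetical rows standing in for (token text, lemma, is punctuation?) values
rows = [("went", "go", False), ("miles", "mile", False), (".", ".", True)]

# :<10 left-aligns a value in a column that is 10 characters wide
header = f"{'Token':<10}{'Lemma':<10}{'Punct?':<8}"
print(header)

for text_value, lemma_value, is_punct in rows:
    # str() turns True/False into text so the width specifier pads it like a string
    print(f"{text_value:<10}{lemma_value:<10}{str(is_punct):<8}")
```

In your own loop, the values come from the token attributes instead of a hand-written list.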

In [None]:
for token in doc:
    # You pretty print the token attributes in order using the table above


Let's also make a table of the entities, with the text and label of each entity.

In [None]:
for ent in doc.ents:
    # Pretty print the entity texts and labels 

Now let's look at our first web app! This app uses mercury, another python package. In order to use the web app, we have to install mercury.

In [None]:
!pip install mercury

Now you can go to your **Terminal** and type `mercury run`. The web app will open in another tab. The file containing the web app is `project2_webapp.ipynb`.

## Step 5: Finish the project

* Complete the file `second_nlp.py` following the comments in the file. Refer back to week1-2 as necessary.
* In the cell below, type: ```!python second_nlp.py```. Remember that you will enter your input in the text box that pops down from the top of the browser window.

## Questions

9. *How many tokens are in the sentence "Natural language processing is a subfield of Artificial Intelligence!"?*
10. *Look at the [spaCy 101 guide](https://spacy.io/usage/spacy-101). What is one type of NLP other than tokenization that you would find useful?*