# Project 2: Goals and Deliverables

The goals of this assignment are:
* To introduce you to some basic NLP tasks: tasks that operate over characters and words.
* To introduce you to some of the ways NLP has been done over time, and how NLP illustrates the historical development of AI.

Here are the steps you should do to successfully complete this project:
1. From Moodle, accept the assignment.
2. Open and set up a Codespace (install a Python kernel and select it).
3. Complete the notebook and commit it to GitHub. Make sure to answer all questions, and to commit the notebook in a "run" state!
4. Edit the README.md file. Provide your name, your class year and what you hope to get out of this course. Make sure to include the output from running second_nlp.py!
5. Commit your code often. We will take the last commit before the deadline as your submission of the project.

For extra credit:
* Read the spaCy docs (https://spacy.io/usage/models). Figure out how to make spaCy work for another language. Add a Mercury drop-down menu (https://runmercury.com/docs/input-widgets/select/) for choosing a language. Include a screenshot of your modified web app running.
* Read the spaCy docs for token-based matching (https://spacy.io/usage/rule-based-matching). Write a rule matcher for recognizing models of iPhone (iPhone, iphone, iPhone 6, etc). 
* Your other ideas are welcome! If you'd like to discuss one with Dr Stent, feel free.

# Sources

This notebook uses content from:
* The [Mercury tutorials](https://runmercury.com/tutorials/web-app-python-jupyter-notebook/)
* The [spaCy docs](https://spacy.io/usage/linguistic-features)

# Getting Started

Go to Moodle. Click on the link under the second lab session.

Click to create/open a Codespace.

Click on the notebook filename to open it (it ends with .ipynb).

The first time you run it, you will be asked to install Python extensions and set the kernel type. We will walk through this in lab.

**Be Sure To Save Your Work!!!**

In Visual Studio Code, use the little blue dot on the left-hand side to "commit" and "sync" your work.

# Review: Getting Started with NLP in Python

You will do steps 0-2 *every time* you do some NLP with spaCy.

## Step 0: Install and import spaCy

First we have to **install** the spacy python package.

(The ! tells Jupyter to execute this code as a regular Unix command, not python code. You will learn how to use Unix commands soon!)

In [25]:
!pip install spacy



spaCy is an ML-driven NLP library, so we also have to install a spaCy **model**.

In [26]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 48.9 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


Now we must **import** (add) the [spaCy NLP](https://spacy.io/) library.

In the cell below, type:

```import spacy```

In [27]:
import spacy

## Step 1: Set up spaCy

Now we must set up spaCy in python. We make a spaCy NLP **engine** and give it the **model**.

In the cell below, type:

```nlp = spacy.load("en_core_web_sm")```

In [28]:
# Let's make a spaCy engine!

## Step 2: Create a document and run spaCy on it

Now we will create a document and run the NLP on it.

In the cell below, type:

```text = "I went for a drive recently. It was 23.2 miles round-trip. The road at the end was dirt and very narrow, so I had to make a sharp U-turn."```

```doc = nlp(text)```

In [29]:
# Let's make a document!

# Words

## spaCy Tokenization

When spaCy processes a document, the first thing it does is split it into **tokens**. 

A spaCy **token** is an example of a specialized data type in python called a **class**. We will learn more about how to create these specialized data types later in the semester. 

For now, just notice that:
1. Each spaCy token contains text: to get the text, we say ```token.text```.
2. spaCy does *not* make tokens for white space.
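
If the idea of a **class** feels abstract, here is a minimal sketch of one. `SimpleToken` is a made-up name for illustration only; it is not part of spaCy, and real spaCy tokens carry far more information than this.

```python
class SimpleToken:
    """A toy stand-in for a spaCy token (hypothetical, for illustration only)."""

    def __init__(self, text):
        # Each token object remembers its own text
        self.text = text


# Build a few toy tokens from a short phrase
toy_tokens = [SimpleToken(piece) for piece in "I went for a drive".split()]

# Just like with spaCy, we read the text with token.text
toy_texts = [token.text for token in toy_tokens]
print(' # '.join(toy_texts))
```

The point is only that `token.text` means "ask this token object for its text".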

In [30]:
token_texts = [token.text for token in doc]
print(' # '.join(token_texts))

I # went # for # a # drive # recently # . # It # was # 23.2 # miles # round # - # trip # . # The # road # at # the # end # was # dirt # and # very # narrow # , # so # I # had # to # make # a # sharp # U # - # turn # .



## What is a Word? What is a Token?

We are used to talking about language in terms of **words**. *The Concise Oxford Dictionary of Linguistics* defines a "word" as follows:
""

As you can tell, this definition is not very precise! For computer software (even AI) we need *precise* definitions of things. 

All sorts of interesting questions come up when we try to talk about words:
* Is a punctuation mark a word?
* What about a number?
* Sometimes hyphenated "words" are one word and sometimes more than one. Who decides?

When NLP software such as spaCy splits text into words, we call that **tokenization**. We call the results **tokens**. Most **tokens** are words, but sometimes the tokenization will be wrong from a linguistic perspective.

NLP researchers often spend a lot of time creating [guidelines](https://catalog.ldc.upenn.edu/docs/LDC2011T03/treebank/english-translation-treebank-guidelines.pdf) (see also [these guidelines](https://universaldependencies.org/u/overview/tokenization.html)) for tokenization, since tokens are the *most basic unit* of analysis for many NLP tasks.

## White Space Tokenization

A super-simple tokenizer would just split text on "white space". Let's compare that with spaCy's tokenization.


In [31]:
text = "I went for a drive recently. It was 23.2 miles round-trip. The road at the end was dirt and very narrow, so I had to make a sharp U-turn."
whitespace_tokens = text.split(' ')
print(' # '.join(whitespace_tokens))

I # went # for # a # drive # recently. # It # was # 23.2 # miles # round-trip. # The # road # at # the # end # was # dirt # and # very # narrow, # so # I # had # to # make # a # sharp # U-turn.
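
A step up from white space splitting is a rule-based tokenizer that also splits off punctuation. The sketch below uses Python's built-in `re` module. The pattern and the name `regex_tokenize` are assumptions made up for this example; this is not how spaCy tokenizes, and rules this simple break quickly on other text (for example, "don't" would come out as three pieces).

```python
import re

text = "I went for a drive recently. It was 23.2 miles round-trip. The road at the end was dirt and very narrow, so I had to make a sharp U-turn."

# Hypothetical pattern, tried left to right:
#   \d+\.\d+   a number with a decimal point, kept in one piece
#   \w+        a run of letters, digits or underscores
#   [^\w\s]    any single character that is not a word character or white space
TOKEN_PATTERN = re.compile(r"\d+\.\d+|\w+|[^\w\s]")


def regex_tokenize(text):
    """Return every non-overlapping match of TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)


whitespace_tokens = text.split(' ')
regex_tokens = regex_tokenize(text)

print(len(whitespace_tokens), 'white space tokens')
print(len(regex_tokens), 'regex tokens')
print(' # '.join(regex_tokens))
```

Try changing the order of the alternatives in the pattern and see what happens to "23.2".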


Questions:
1. *List two of the words from the document where you think the white space tokenizer tokenized correctly, and two where you think it didn't.*
2. *List two of the words from the document where you think spaCy tokenized correctly, and two where you think it didn't.*

**Important note**: For now, we will assume that we *know* the language of the input text. Later in the semester, we will look at how to use NLP to *find out* the language of the input text.

# Word Classes

Quite often, we want to group words together by how they *behave*. For example, we have talked about:
* nouns
* verbs

**Dictionary**

We call these groups **word classes** and the labels of the groups **part of speech tags**. 

When a spaCy NLP engine runs on a document, it attaches a part of speech tag to each token. Actually, for our document it attaches two! 
* First, tags from a coarse-grained set of word classes described [here](https://universaldependencies.org/u/pos/index.html)
* Second, for English, tags from a fine-grained set of word classes described [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

The second set of tags was developed as part of a famous project called the *Penn Treebank*; however, it was only for English. 

The first set of tags was developed as part of a famous project called *Universal Dependencies*, which aims to create consistently *annotated* natural language data sets (called **corpora**) for all natural languages. Most of spaCy's non-English models are trained on Universal Dependencies data.

Let's look at the tags for each of our tokens.

In [32]:
# These tags represent a small set of word classes
token_coarse_tags = [token.pos_ for token in doc]
print(' # '.join(token_coarse_tags))

PRON # VERB # ADP # DET # NOUN # ADV # PUNCT # PRON # AUX # NUM # NOUN # ADJ # PUNCT # NOUN # PUNCT # DET # NOUN # ADP # DET # NOUN # AUX # NOUN # CCONJ # ADV # ADJ # PUNCT # CCONJ # PRON # VERB # PART # VERB # DET # ADJ # NOUN # NOUN # NOUN # PUNCT


In [33]:
# These tags represent a larger set of word classes
token_fine_tags = [token.tag_ for token in doc]
print(' # '.join(token_fine_tags))

PRP # VBD # IN # DT # NN # RB # . # PRP # VBD # CD # NNS # JJ # HYPH # NN # . # DT # NN # IN # DT # NN # VBD # NN # CC # RB # JJ # , # CC # PRP # VBD # TO # VB # DT # JJ # NN # NN # NN # .
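
The two tag lists line up with the tokens one for one, which makes them easy to summarize. Here is a small sketch that works directly on the output strings printed above (copied in by hand, so it runs without spaCy). The variable names are made up for this example.

```python
from collections import Counter

# Output strings copied from the notebook cells above
token_output = "I # went # for # a # drive # recently # . # It # was # 23.2 # miles # round # - # trip # . # The # road # at # the # end # was # dirt # and # very # narrow # , # so # I # had # to # make # a # sharp # U # - # turn # ."
coarse_output = "PRON # VERB # ADP # DET # NOUN # ADV # PUNCT # PRON # AUX # NUM # NOUN # ADJ # PUNCT # NOUN # PUNCT # DET # NOUN # ADP # DET # NOUN # AUX # NOUN # CCONJ # ADV # ADJ # PUNCT # CCONJ # PRON # VERB # PART # VERB # DET # ADJ # NOUN # NOUN # NOUN # PUNCT"
fine_output = "PRP # VBD # IN # DT # NN # RB # . # PRP # VBD # CD # NNS # JJ # HYPH # NN # . # DT # NN # IN # DT # NN # VBD # NN # CC # RB # JJ # , # CC # PRP # VBD # TO # VB # DT # JJ # NN # NN # NN # ."

token_texts = token_output.split(' # ')
coarse_tags = coarse_output.split(' # ')
fine_tags = fine_output.split(' # ')

# There is exactly one tag of each kind per token
print(len(token_texts), len(coarse_tags), len(fine_tags))

# How often does each coarse-grained tag occur?
coarse_counts = Counter(coarse_tags)
print(coarse_counts.most_common(3))

# Which fine-grained tags appear under the coarse tag VERB in this document?
verb_fine_tags = sorted({fine for coarse, fine in zip(coarse_tags, fine_tags) if coarse == 'VERB'})
print(verb_fine_tags)
```

Notice that one coarse-grained tag can cover several fine-grained tags.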


These tags (especially the fine-grained ones) are not very descriptive! If you want more information about a tag, you can use spaCy's ```explain()``` function.

In the cell below, type ```spacy.explain("VBD")```. (The tag is a string, so it needs quotation marks.)

In [34]:
# Let's get more information about a part of speech tag

Questions:

3. *Using the ```explain()``` function, look up "NN" and "NNS". What is the difference?*
4. *Let's say I only wanted to print the text for the tokens that have part of speech tag NOUN. In the code cell below, write the code to do that. Run your code.*

In [35]:
for token in doc:
    if token.pos_ == 'NOUN':
        print(token.text)

drive
miles
trip
road
end
dirt
U
-
turn


5. *Looking at the output of the code cell above, which of the tokens that spaCy says is a NOUN is not a noun?*
6. *The part of speech tagger runs after the tokenizer. If the tokenizer is wrong, what does that mean for the part of speech tagger?*

# Structure of Words

We know that words have internal structure. For example, in your earliest English classes you learned about *prefixes* and *suffixes* like:

* 'un' - reverses the meaning of an adjective or verb (for example, "unhappy" or "undo")
* 'ed' - puts a regular verb in the past tense

The field of linguistics that looks at the structure of words is called **morphology**.

**Dictionary**

When it processes an English document, spaCy attaches two types of information to each token that are related to its structure:

1. A **lemma** - a "quick and dirty" approximation of the root form of the word
2. A **morphological analysis**  - a more complete analysis of the structure of the word

In [36]:
# Print the lemma of each token
token_lemmas = [token.lemma_ for token in doc]
print(' # '.join(token_lemmas))

I # go # for # a # drive # recently # . # it # be # 23.2 # mile # round # - # trip # . # the # road # at # the # end # be # dirt # and # very # narrow # , # so # I # have # to # make # a # sharp # u # - # turn # .
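
Why do we need a lemmatizer instead of just chopping off suffixes? Here is a deliberately naive sketch to compare against the lemmas above. `strip_suffix` is a made-up helper for this example (it is not part of spaCy), and the two suffixes it knows about are an assumption chosen to keep it short.

```python
def strip_suffix(word):
    """Hypothetical, deliberately naive: remove a trailing 'ed' or 's'."""
    for suffix in ('ed', 's'):
        # Only strip when at least three letters would be left over
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


# Words from our document, paired with the lemmas spaCy printed above
spacy_lemmas = {'miles': 'mile', 'went': 'go', 'was': 'be', 'had': 'have'}

for word, lemma in spacy_lemmas.items():
    print(f'{word:>6}  suffix stripping: {strip_suffix(word):<6} spaCy lemma: {lemma}')
```

Suffix stripping agrees with spaCy on "miles", but it has no way to get from "went" to "go". Irregular forms like that are one reason lemmatizers need more than a handful of rules.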


In [37]:
# Print the morphological analysis of each token
token_morphs = [str(token.morph) for token in doc]
print(' # '.join(token_morphs))

Case=Nom|Number=Sing|Person=1|PronType=Prs # Tense=Past|VerbForm=Fin #  # Definite=Ind|PronType=Art # Number=Sing #  # PunctType=Peri # Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs # Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin # NumType=Card # Number=Plur # Degree=Pos # PunctType=Dash # Number=Sing # PunctType=Peri # Definite=Def|PronType=Art # Number=Sing #  # Definite=Def|PronType=Art # Number=Sing # Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin # Number=Sing # ConjType=Cmp #  # Degree=Pos # PunctType=Comm # ConjType=Cmp # Case=Nom|Number=Sing|Person=1|PronType=Prs # Tense=Past|VerbForm=Fin #  # VerbForm=Inf # Definite=Ind|PronType=Art # Degree=Pos # Number=Sing # Number=Sing # Number=Sing # PunctType=Peri
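
Each analysis in the output above is a string of `Feature=Value` pairs joined with `|` (and it is empty when spaCy has nothing to say about a token). Here is a small sketch of how you could pull such a string apart in plain Python. `parse_morph` is a made-up helper for this example, not a spaCy function.

```python
def parse_morph(morph_string):
    """Turn 'Tense=Past|VerbForm=Fin' into {'Tense': 'Past', 'VerbForm': 'Fin'}."""
    if not morph_string:
        # Tokens such as "for" have an empty analysis
        return {}
    # Split into pairs on '|', then split each pair once on '='
    return dict(pair.split('=', 1) for pair in morph_string.split('|'))


# Two analyses copied from the output above: the first "I" and "went"
i_features = parse_morph("Case=Nom|Number=Sing|Person=1|PronType=Prs")
went_features = parse_morph("Tense=Past|VerbForm=Fin")

print(i_features)
print(went_features['Tense'])
print(parse_morph(""))
```

Inside spaCy you would not normally need this: as far as I know, `token.morph` has its own `to_dict()` method. The sketch is only meant to show what the string contains.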


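spaCy tokens also carry other attributes, such as ```is_alpha``` (is the token made up only of letters?) and ```is_stop``` (is the token a very common "stop word"?).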

# What's In a Word? Putting It All Together

In [38]:
for token in doc:
    print('%10s %10s %6s %6s %6s %6s %6s %s' % (token.text, token.lemma_, token.pos_, token.tag_, token.shape_, token.is_alpha, token.is_stop, token.morph))

         I          I   PRON    PRP      X   True   True Case=Nom|Number=Sing|Person=1|PronType=Prs
      went         go   VERB    VBD   xxxx   True  False Tense=Past|VerbForm=Fin
       for        for    ADP     IN    xxx   True   True 
         a          a    DET     DT      x   True   True Definite=Ind|PronType=Art
     drive      drive   NOUN     NN   xxxx   True  False Number=Sing
  recently   recently    ADV     RB   xxxx   True  False 
         .          .  PUNCT      .      .  False  False PunctType=Peri
        It         it   PRON    PRP     Xx   True   True Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs
       was         be    AUX    VBD    xxx   True   True Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
      23.2       23.2    NUM     CD   dd.d  False  False NumType=Card
     miles       mile   NOUN    NNS   xxxx   True  False Number=Plur
     round      round    ADJ     JJ   xxxx   True  False Degree=Pos
         -          -  PUNCT   HYPH      -  False
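
The fifth column in the table above is the token's **shape**. Judging from the output, capital letters become `X`, lowercase letters become `x`, digits become `d`, and long runs of the same symbol are cut off after four. The sketch below is an approximation based on that output; `word_shape` is a made-up name for this example and is not guaranteed to match spaCy's own code in every case.

```python
def word_shape(text):
    """Approximate a spaCy-style shape string (an assumption based on the output above)."""
    shape = []
    run_symbol, run_length = '', 0
    for ch in text:
        # Map each character to a shape symbol
        if ch.isalpha():
            symbol = 'X' if ch.isupper() else 'x'
        elif ch.isdigit():
            symbol = 'd'
        else:
            symbol = ch
        # Count how many times in a row we have seen this symbol
        if symbol == run_symbol:
            run_length += 1
        else:
            run_symbol, run_length = symbol, 1
        # Keep at most four of the same symbol in a row
        if run_length <= 4:
            shape.append(symbol)
    return ''.join(shape)


for word in ['I', 'It', 'recently', '23.2', '-']:
    print('%10s %6s' % (word, word_shape(word)))
```

Shapes are handy when you care about what a token looks like (a capitalized word, a number) rather than what it says.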

Make a mercury app

and second_nlp.py

guppy puppy monkey gorilla

## Step 5: Finish the project

* Make a file called 'first_nlp.py'. (To do this, click on the top left icon in Visual Studio Code, then on the Document+ icon.) Copy all the code from this section (Getting Started with NLP in Python) into that file.
* Change the text in the line that starts with 'doc = ' to be ```doc = nlp("Natural language processing is a subfield of Artificial Intelligence!")```.
* Make sure to comment your code. (For now, aim for one comment per line of code.)
* Save the file.
* In the cell below, type: ```!python first_nlp.py```.

## Questions

8. *How many tokens are in the sentence "Natural language processing is a subfield of Artificial Intelligence!"?*
9. *Look at the [spaCy quickstart](https://spacy.io/usage/spacy-101). What is one type of NLP other than tokenization that you would find useful?*
10. *Write down one question you have after learning the basics of Jupyter notebooks, GitHub and spaCy.*