# LIN 353C: Introduction to Computational Linguistics, Fall 2020, Erk

# Homework 7: Chunking, evaluation, and context-free grammars

## Due: Thursday November 11, end of day

## Your name: Eloragh Espie
## Your EID: eae2273

This homework comes with the following files:

* Introcl_homework_7.ipynb: this notebook, which has the homework problems. **Please put your answers into this same notebook.**


Please record all your answers in the appropriate place in this notebook, and **do not forget to put your name and EID at the top of this notebook**.

For the part of the homework that requires you to write Python code,
we need to see the code.
You can omit statements that
produced an error or that did not form part of the eventual solution,
but please include all the Python code that formed part of your
solution. 

Please use comments to explain what your code does. Any code that seems complicated to you, or goes on for more than 2 lines, can probably use a comment. Just practice commenting more than you think the code needs. As you will see once you pull out an old piece of code you wrote and try to figure out what you were doing, code always needs more comments than you think.

### Important note: Please hit the fast-forward button on this notebook, and confirm "Restart and Run all cells", so the code included in this notebook will be executed on your machine. However, there is one command below (loading a gensim space) that may take a while; please plan for that.


**If any of these instructions do not make sense to you, please get in
 touch with the instructor right away.**


A perfect solution to this homework will be worth *100* points. 



# Problem 1: Chunking (30 points)

For this problem, you will create a noun phrase chunker and evaluate it. 

You can use any of the chunking methods described in the NLTK book at
https://www.nltk.org/book_1ed/ch07.html

Call

`nltk.app.chunkparser()`

to enter the interactive chunker development and analysis platform that NLTK offers.

**Important: Please check soon that the NLTK chunk parser app works on your machine.**

Note: This app uses the *conll2000* data. If you haven't downloaded the data, you may get an error that mentions
"OSError: No such file or directory:...conll2000/train.txt". In that case, please run the command

```nltk.download('conll2000')```

If you still cannot get the chunk parser app to work, check the notebook `Chunking without the app.ipynb`, which shows you how to do the homework without the app.

For your chunker, use at least 5 rules that are not used in the NLTK book and that are not the rule
`{ <DT><JJS?>*<NNP?> }`
that we used in class. You are also allowed to use a single rule that extends the NLTK book rules, or the in-class rule above, in 5 ways.

Note that you will need to put curly brackets around your rules.
Copy your chunker rules into the box below. Also report the precision and recall that your chunker achieves. (They are listed at the bottom of the chunker app window.)

In addition, describe, in the box below, at least two examples that your chunker mis-analyzes. Discuss what the problem is and what you might do to address it. You can view additional sentences by clicking the “Next example” button, or using Control-n.


*space for your text answer here*

```{<DT>?<JJS?>*<NNP?S?>+}```

- Precision: 69.41%
- Recall: 62.78%
    
```{<DT>?<PRP$>?<JJS?>?<CD>?<NNS?P?>+}``` 

- Precision: 76.21%
- Recall: 68.93%

```{<DT>?<JJS?>?<NNP?S?><CC>?<NNP?S?>*}```

- Precision: 68.58%
- Recall: 60.68%



In [7]:
import nltk
nltk.download('conll2000')

nltk.app.chunkparser()

[nltk_data] Downloading package conll2000 to
[nltk_data]     /Users/eloraghespie/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!


# Problem 2: Evaluation with Precision and Recall (30 points)

The chunker that you used in the previous problem was evaluated
using Precision and Recall. They are computed as follows.

* True Positives (TP) are word sequences that are actually noun
phrases, and where the chunker says they are noun phrases.
* False Positives (FP) are word sequences that the chunker says are noun phrases but that are not.
* False Negatives (FN) are actual noun phrases that the chunker misses.
* True Negatives (TN) are word sequences that are not noun phrases and that the chunker does not consider noun phrases.

Then Precision is the fraction of actual noun phrases among the sequences that the chunker labels as noun phrases:

$Precision = \frac{TP}{TP+FP}$

Recall is the fraction of actual noun phrases that the chunker recognizes as noun phrases:

$Recall = \frac{TP}{TP+FN}$

F-score combines Precision and Recall:

$Fscore = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$
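To make the definitions above concrete, here is a minimal sketch that counts TP, FP, and FN by comparing two sets of chunk spans and then applies the three formulas. The function name `chunk_scores`, the `(start, end)` span representation, and the example spans are all hypothetical, and the sketch assumes exact-match scoring (a predicted chunk counts as correct only if both boundaries agree with an actual noun phrase). Note that True Negatives are not needed for any of the three measures.

```python
def chunk_scores(gold_spans, predicted_spans):
    """Return (TP, FP, FN, precision, recall, F-score) for two sets of (start, end) spans."""
    gold, predicted = set(gold_spans), set(predicted_spans)

    tp = len(gold & predicted)   # predicted chunks that are actual noun phrases
    fp = len(predicted - gold)   # predicted chunks that are not noun phrases
    fn = len(gold - predicted)   # actual noun phrases the chunker missed

    # Guard against division by zero when a set is empty
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return tp, fp, fn, precision, recall, f_score


# Hypothetical token spans for one sentence (not taken from any corpus)
gold_example = {(0, 2), (3, 5), (6, 9)}
predicted_example = {(0, 2), (3, 4), (6, 9), (10, 11)}

n_tp, n_fp, n_fn, p, r, f = chunk_scores(gold_example, predicted_example)
print(f"TP={n_tp} FP={n_fp} FN={n_fn}")
print(f"Precision={p:.3f} Recall={r:.3f} F-score={f:.3f}")
```

In this toy example the chunker gets two spans exactly right, proposes two spans that are not noun phrases, and misses one actual noun phrase.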

The (hypothetical) XYZ-Chunker achieves the following results on the (hypothetical) ABC-corpus:

* True Positives: 511
* False Positives: 83
* False Negatives: 302 
* True Negatives: 2508

What are Precision, Recall, and F-score of the XYZ-Chunker? Show your work. (You don’t need to use Python to solve this problem, but you can if you like. Still, show your work.)



*space for your text answer here*

In [6]:
# or space for an answer in Python, if you like

# set the positives and negatives to variables

tp, fp, fn, tn = 511, 83, 302, 2508

# create functions for precision, recall, and f_score
# each one simply returns the value of the corresponding formula

def precision(tp, fp):

    return (tp/(tp + fp))

def recall(tp, fn):

    return (tp/(tp+fn))

def f_score(precision, recall):
    
    return (2 * precision * recall / (precision + recall))  

# calculate using the functions in the print statement
# the final function is calculated with the precision and recall functions

print(f"Precision:{precision(tp, fp): .3%} \n\
Recall:{recall(tp, fn): .3%} \n\
F-score:{f_score(precision(tp,fp), recall(tp,fn)): .3%}")  

Precision: 86.027% 
Recall: 62.854% 
F-score: 72.637%
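
Written out by hand, the same computation looks as follows (values rounded to four decimal places). The last line substitutes the definitions of Precision and Recall into the F-score formula, which simplifies to an expression in TP, FP, and FN only; the True Negatives are not used.

$Precision = \frac{511}{511 + 83} = \frac{511}{594} \approx 0.8603$

$Recall = \frac{511}{511 + 302} = \frac{511}{813} \approx 0.6285$

$Fscore = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} = \frac{1022}{1407} \approx 0.7264$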


# Problem 3: Context-free grammar (40 points)

English has an impoverished system for marking case that only shows up on pronouns. For example, *I*, *he*, and *they* have nominative case, while *me*, *him*, and *them* have accusative case. 
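One way to encode such distinctions in a plain context-free grammar is to split a category into several more specific ones, for example a nominative and an accusative noun phrase. The sketch below only illustrates that idea and is not a full solution: it covers pronoun subjects and objects with the verbs *see* and *know*, and it has no full noun phrases or relative clauses. The category names, the dictionary `GRAMMAR`, and the helper functions `parse_ends` and `accepts` are hypothetical, and the recognizer is a naive top-down one that only terminates for grammars without left recursion.

```python
# Each nonterminal maps to a list of right-hand sides.
# Any symbol that is not a key of GRAMMAR is treated as a word (terminal).
GRAMMAR = {
    "S": [("NP_NOM_3SG", "V_3SG", "NP_ACC"),
          ("NP_NOM_NON3SG", "V_NON3SG", "NP_ACC")],
    "NP_NOM_3SG": [("she",), ("he",)],       # nominative, third person singular
    "NP_NOM_NON3SG": [("I",), ("they",)],    # nominative, everything else
    "NP_ACC": [("me",), ("him",), ("her",), ("them",)],  # accusative
    "V_3SG": [("sees",), ("knows",)],
    "V_NON3SG": [("see",), ("know",)],
}


def parse_ends(symbol, tokens, start):
    """Return all positions where a derivation of `symbol` starting at `start` can end."""
    if symbol not in GRAMMAR:  # terminal: must match the next word exactly
        if start < len(tokens) and tokens[start] == symbol:
            return {start + 1}
        return set()
    ends = set()
    for rhs in GRAMMAR[symbol]:
        positions = {start}
        for part in rhs:  # thread the possible positions through the right-hand side
            positions = {e for p in positions for e in parse_ends(part, tokens, p)}
        ends |= positions
    return ends


def accepts(sentence):
    """True if the whole sentence can be derived from S."""
    tokens = sentence.split()
    return len(tokens) in parse_ends("S", tokens, 0)


for sentence in ["she sees him", "they see him", "I know her",
                 "she sees he", "they sees him", "me know her"]:
    print(f"{sentence!r}: {accepts(sentence)}")
```

Because case and agreement are built into the category names, a nominative pronoun can never appear in object position and a third-person-singular subject can never combine with a verb form like *see*. The cost of this design is that the number of categories grows with every distinction the grammar has to track.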

Write a context-free grammar that accepts the sentences in (a) but not those in (b):

## (a) Sentences to accept:

* i. she sees him
* ii. they see him
* iii. they see the woman
* iv. they see the women
* v. I know her
* vi. I know the man whom she sees
* vii. I know the woman who sees him
* viii. the woman who sees him walks (that is, the woman is doing the walking)
* ix. the women who see him walk (that is, the women are doing the walking)
    
## (b) Sentences to not accept: 

(I am putting a star in front of each sentence to indicate that it is not grammatical.)

* i. *she sees he
* ii. *they sees him
* iii. *me know her
* iv. *I know the woman whom sees him
* v. *the woman who sees him walk 
* vi. *the woman who see him walks 
* vii. *the women who sees him walk
* viii. *the women who see him walks

Please put your context-free grammar below.

Additionally, draw the tree structure for the sentence 
    I know the man whom she sees.
You can either do that here using ASCII art, or do it on paper and submit a clearly readable photo of the tree structure.


*space for your text answer here*