# Assignment a - Linguistics

## Boise State University NLP - Dr. Kennington

### Instructions and Hints:

* For this assignment, we will be looking at tokenization, morphology, and syntax. 
* This follows the same approach as the notebook we did in class, though it will be a bit more work. 
* Answer each question (or, in some cases, follow the command).
* Follow the instructions on the corresponding assignment Trello card for submitting your assignment.

#### We will be using **[Tamarian](https://www.youtube.com/watch?v=ANvlLcOTy6M)** as our example language: 

In [1]:
sentences = [
    'Sinda his face black his eyes red',
    'Tamak',
    'The river Tamak in winter',
    'Darmok and Jalad at Tanagra',
    'Darmok and Jalad on the ocean',
    'Socath his eyes opened',
#    'The beast of Tanagra Usani his army Jakka when the walls fell', # don't worry about this one
    'Picard and Dathan at Eladrel',
    'Marab with sails unfurled',
    'Timba his arms open',
    'Timba at rest'
]

### 1. Tokenize the sentences 

* you will need to make sure everything is lower case
* you will need to represent the sentences as a 2D array of words

In [2]:
sentences = list(map(lambda x: x.lower().split(), sentences))

### 2. Use a stemmer or lemmatizer 

- (NLTK has several) 
-  You will know your stemmer/lemmatizer did its job because plural words will no longer be plural (e.g., 'eyes' -> 'eye') and past-tense words will no longer be past-tense (e.g. 'unfurled' -> 'unfurl')


In [9]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
sentences = list(map(lambda x: [stemmer.stem(j) for j in x], sentences))
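
A stemmer applies suffix-stripping rules to every token, so it can also change words you did not intend to change (for example, `'his'` becomes `'hi'`). The sketch below shows one possible way to guard against that; the `PROTECTED` set and the `stem_token` helper are hypothetical names introduced for illustration, not part of NLTK.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical set of function words to leave unstemmed (an assumption for this sketch).
PROTECTED = {"his"}

def stem_token(token):
    """Return the token unchanged if it is protected, otherwise its Porter stem."""
    return token if token in PROTECTED else stemmer.stem(token)

tokens = "socath his eyes opened".split()
plain = [stemmer.stem(t) for t in tokens]    # stems every token, including 'his'
guarded = [stem_token(t) for t in tokens]    # keeps protected tokens as they are

print(plain)
print(guarded)
```

Whichever variant you choose, the terminals in the grammar have to match the tokens exactly as they come out of this step.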

### 3. Write a grammar that can parse all of the sentences

* Try to write as few grammar rules as possible
* Use recursion where you can
* Use `S` as the start symbol
* All terminals need to be in quotes
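
As a small illustration of the hints above, here is a sketch of a toy grammar (not a solution to this question) with `S` as the start symbol, quoted terminals, and one recursive rule. The rule `NP -> NP CC NP` lets a single production cover coordinations of any length.

```python
import nltk

# Toy grammar for illustration only: one recursive NP rule handles
# "x and y", "x and y and z", and so on.
toy_grammar = nltk.CFG.fromstring("""
    S  -> NP
    NP -> N | NP CC NP
    N  -> 'darmok' | 'jalad' | 'picard'
    CC -> 'and'
""")

toy_parser = nltk.ChartParser(toy_grammar)
trees = list(toy_parser.parse("darmok and jalad".split()))
for tree in trees:
    print(tree)
```

Recursion keeps the rule count low, but it also tends to introduce ambiguity: a longer coordination such as `darmok and jalad and picard` can be bracketed in more than one way.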


In [10]:
sentences

[['sinda', 'hi', 'face', 'black', 'hi', 'eye', 'red'],
 ['tamak'],
 ['the', 'river', 'tamak', 'in', 'winter'],
 ['darmok', 'and', 'jalad', 'at', 'tanagra'],
 ['darmok', 'and', 'jalad', 'on', 'the', 'ocean'],
 ['socath', 'hi', 'eye', 'open'],
 ['picard', 'and', 'dathan', 'at', 'eladrel'],
 ['marab', 'with', 'sail', 'unfurl'],
 ['timba', 'hi', 'arm', 'open'],
 ['timba', 'at', 'rest']]

In [65]:
import nltk

# Note: 'rest' is not a terminal in this grammar, so 'timba at rest' is not covered.
tamarian_grammar = nltk.CFG.fromstring("""
 S -> NP PP | ADJP | V | VP | N NP | NP
 NP -> N CC | NP N | NP PP | PP NP | N P | DT N | NP NP
 PP -> P N | N P | CC PP
 ADJP -> NP JJ | PP JJ | ADJP ADJP
 VP -> NP V | VP PP
 N -> 'face'|'sinda'|'darmok'|'jalad'|'tanagra'|'eye'|'river'|'winter'|'ocean'|'socath'|'picard'|'dathan'|'eladrel'|'marab'|'sail'|'timba'|'arm'
 P -> 'hi'|'at'|'in'|'on'|'with'
 CC -> 'and'
 JJ -> 'black'|'red'
 V -> 'tamak'|'open'|'unfurl'
 DT -> 'the'
""")
parser = nltk.ChartParser(tamarian_grammar)

# Single test sentence; the loop below uses its own variable so `s` is not overwritten.
s = 'darmok and jalad at tanagra'.split()

# Parse the first nine sentences only; the tenth contains the uncovered word 'rest'.
for sentence in sentences[0:9]:
    print(sentence)
    for tree in parser.parse(sentence):
        print(tree)
    print('---------')

['sinda', 'hi', 'face', 'black', 'hi', 'eye', 'red']
(S
  (ADJP
    (ADJP (NP (NP (N sinda) (P hi)) (N face)) (JJ black))
    (ADJP (PP (P hi) (N eye)) (JJ red))))
---------
['tamak']
(S (V tamak))
---------
['the', 'river', 'tamak', 'in', 'winter']
(S
  (VP (VP (NP (DT the) (N river)) (V tamak)) (PP (P in) (N winter))))
---------
['darmok', 'and', 'jalad', 'at', 'tanagra']
(S
  (NP
    (NP (N darmok) (CC and))
    (NP (NP (N jalad) (P at)) (N tanagra))))
(S
  (NP
    (NP (NP (N darmok) (CC and)) (PP (N jalad) (P at)))
    (N tanagra)))
(S
  (NP
    (NP (NP (N darmok) (CC and)) (NP (N jalad) (P at)))
    (N tanagra)))
(S
  (NP
    (NP (NP (N darmok) (CC and)) (N jalad))
    (PP (P at) (N tanagra))))
(S (NP (NP (N darmok) (CC and)) (N jalad)) (PP (P at) (N tanagra)))
---------
['darmok', 'and', 'jalad', 'on', 'the', 'ocean']
(S
  (NP
    (NP (NP (N darmok) (CC and)) (PP (N jalad) (P on)))
    (NP (DT the) (N ocean))))
(S
  (NP
    (NP (NP (N darmok) (CC and)) (NP (N jalad) (P on)))
    

ValueError: Grammar does not cover some of the input words: "'timba', 'arm'".

## 4. Show that your grammar parses all of the sentences

* Use a parser that can use a CFG (NLTK has several) 
* Make sure there is a parse tree for each of the sentences

In [26]:
parser = nltk.ChartParser(tamarian_grammar)
for tree in parser.parse(s):
    print(tree)

In [22]:
parser = nltk.ChartParser(tamarian_grammar)
for s in sentences:
    print(s)
    for tree in parser.parse(s):
        print(tree)
    print()
    print('------------')

['sinda', 'hi', 'face', 'black', 'hi', 'eye', 'red']

------------
['tamak']

------------
['the', 'river', 'tamak', 'in', 'winter']

------------
['darmok', 'and', 'jalad', 'at', 'tanagra']

------------
['darmok', 'and', 'jalad', 'on', 'the', 'ocean']

------------
['socath', 'hi', 'eye', 'open']

------------
['picard', 'and', 'dathan', 'at', 'eladrel']

------------
['marab', 'with', 'sail', 'unfurl']

------------
['timba', 'hi', 'arm', 'open']

------------
['timba', 'at', 'rest']

------------
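
One way to make this check less dependent on reading printed output is to count the trees per sentence. The sketch below uses a toy grammar and a hypothetical helper, `count_parses`, introduced only for illustration. A count of `0` means the words are known but no tree exists, a count above `1` means the sentence is ambiguous under the grammar, and `None` marks a sentence with words the grammar does not cover (NLTK raises `ValueError` in that case).

```python
import nltk

def count_parses(grammar, token_lists):
    """Map each sentence to its number of parse trees, or None if a word is not covered."""
    parser = nltk.ChartParser(grammar)
    counts = {}
    for tokens in token_lists:
        key = " ".join(tokens)
        try:
            counts[key] = len(list(parser.parse(tokens)))
        except ValueError:
            # Raised when the grammar has no terminal for some input word.
            counts[key] = None
    return counts

# Toy grammar for illustration only (not the assignment grammar).
demo_grammar = nltk.CFG.fromstring("""
    S  -> NP | NP PP
    NP -> N | NP CC NP
    PP -> P N
    N  -> 'darmok' | 'jalad' | 'picard' | 'tanagra'
    CC -> 'and'
    P  -> 'at'
""")

demo_sentences = [
    ['darmok', 'at', 'tanagra'],
    ['darmok', 'and', 'jalad', 'and', 'picard'],
    ['at', 'tanagra'],
    ['timba', 'at', 'rest'],
]

parse_counts = count_parses(demo_grammar, demo_sentences)
for sentence, n in parse_counts.items():
    print(f"{sentence!r}: {n}")
```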


For questions 5-7, just answer in markdown/raw text. No code necessary.

## 5. Does your parser have full coverage?

(answer here)

## 6. Does your parser over-generate?

(answer here)
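
Code is not required for this question, but if you want to inspect over-generation directly, one option is to enumerate what a grammar can produce and compare it with the sentences you actually observed. The sketch below does this for a small, non-recursive toy grammar using `nltk.parse.generate.generate`; the `observed` set is an assumption for illustration. For a recursive grammar you would likely need to pass a `depth` limit to keep the enumeration finite.

```python
import nltk
from nltk.parse.generate import generate

# Small non-recursive toy grammar (illustration only), so enumeration terminates.
small_grammar = nltk.CFG.fromstring("""
    S  -> NP PP
    NP -> N | N CC N
    PP -> P N
    N  -> 'darmok' | 'jalad' | 'tanagra'
    CC -> 'and'
    P  -> 'at'
""")

# Sentences assumed to be attested in the language (hypothetical corpus of one).
observed = {"darmok and jalad at tanagra"}

# Every sentence the grammar licenses, joined back into strings.
generated = {" ".join(words) for words in generate(small_grammar)}

# Sentences the grammar accepts that were never observed: candidates for over-generation.
extra = sorted(generated - observed)
print(len(generated), "sentences generated;", len(extra), "not observed")
print(extra[:5])
```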

## 7. Which sentences are ambiguous? How do you know?

(answer here)

## 8. Parse this sentence:

* If you wrote your grammar right, this should be covered. If this isn't covered, then you'll need to go back and change your grammar.

In [90]:
s = ['timba', 'his', 'face', 'red', 'his', 'eye', 'back', 'in', 'winter']

## 9. Was your result in Question 8 ambiguous?

* Answer in markdown or raw text, no code necessary

(answer here)

## 10. How expressive is your language?

* Answer in markdown or raw text, no code necessary

(answer here)

## 11. Make the grammar more general by treating POS tags as the terminals

In [94]:
tamarian_grammar = nltk.CFG.fromstring("""
    S   -> 
""")
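
To illustrate the idea on a toy scale (this is a sketch, not the expected answer), the grammar below has POS tags as its quoted terminals instead of words. A sentence is first mapped to a tag sequence and that sequence is parsed. The tag names other than `'PN'` and the `TAG_OF` lookup table are assumptions made for this example.

```python
import nltk

# Hypothetical word-to-tag lookup; in practice this could come from a tagger or a hand-built lexicon.
TAG_OF = {
    'darmok': 'PN', 'jalad': 'PN', 'tanagra': 'PN',
    'and': 'CC',
    'at': 'P',
}

# The terminals are now POS tags, so new words need only a tag, not a new grammar rule.
pos_grammar = nltk.CFG.fromstring("""
    S  -> NP PP
    NP -> 'PN' | NP 'CC' NP
    PP -> 'P' 'PN'
""")

words = ['darmok', 'and', 'jalad', 'at', 'tanagra']
tags = [TAG_OF[w] for w in words]

pos_trees = list(nltk.ChartParser(pos_grammar).parse(tags))
print(tags)
for tree in pos_trees:
    print(tree)
```

The design trade-off is that the grammar becomes smaller and more general, while the lexical knowledge moves into the word-to-tag mapping.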

## 12. What is your set of POS tags?

* Show the list of strings (e.g., ['Adj', ...])



## 13. Make a list for the POS tags that correspond to the sentence `s` below:

In [2]:
s = ['timba', 'his', 'face', 'red', 'his', 'eye', 'back', 'in', 'winter']
# p = ['PN',  ??, ... ]

## 14. Parse the sentence (represented as POS tags)

## Extra Credit! Do all of the above questions again, but add the sentence:

'The beast of Tanagra Usani his army Jakka when the walls fell'

*Done!*

## Submit

In [None]:
from client.api.notebook import Notebook
ok = Notebook('a4.ok')
ok.auth(inline=True)

In [None]:
ok.submit()