# Assignment a - Linguistics

## Boise State University NLP - Dr. Kennington

### Instructions and Hints:

* For this assignment, we will be looking at tokenization, morphology, and syntax. 
* This follows the same pattern as the notebook we did in class, though it will be a bit more work.
* Answer each question (or, in some cases, follow the command).
* Follow the instructions on the corresponding assignment Trello card for submitting your assignment.

#### We will be using **[Tamarian](https://www.youtube.com/watch?v=ANvlLcOTy6M)** as our example language: 

In [34]:
sentences = [
    'Sinda his face black his eyes red',
    'Tamak',
    'The river Tamak in winter',
    'Darmok and Jalad at Tanagra',
    'Darmok and Jalad on the ocean',
    'Socath his eyes opened',
    'The beast of Tanagra Usani his army Jakka when the walls fell', # don't worry about this one
    'Picard and Dathan at Eladrel',
    'Marab with sails unfurled',
    'Timba his arms open',
    'Timba at rest'
]
poses = [
    ['N', 'A', 'N', 'A', 'N', 'A'],
    ['N'],
    ['D', 'A', 'N', 'P', 'N'],
    ['N', 'C', 'N', 'P', 'N'],
    ['N', 'C', 'N', 'P', 'D', 'N'],
    ['N', 'A', 'N', 'V'],
    ['D', 'N', 'P', 'N', 'N', 'A', 'N', 'N', 'P', 'N', 'V'],
    ['N', 'C', 'N', 'P', 'N'],
    ['N', 'P', 'N', 'A'],
    ['N', 'A', 'N', 'A'],
    ['N', 'P', 'N']
]

### 1. Tokenize the sentences 

* you will need to make sure everything is lower case
* you will need to represent the sentences as a 2D array of words

In [35]:
split_list = [x.lower().split() for x in sentences]
split_list

[['sinda', 'his', 'face', 'black', 'his', 'eyes', 'red'],
 ['tamak'],
 ['the', 'river', 'tamak', 'in', 'winter'],
 ['darmok', 'and', 'jalad', 'at', 'tanagra'],
 ['darmok', 'and', 'jalad', 'on', 'the', 'ocean'],
 ['socath', 'his', 'eyes', 'opened'],
 ['the',
  'beast',
  'of',
  'tanagra',
  'usani',
  'his',
  'army',
  'jakka',
  'when',
  'the',
  'walls',
  'fell'],
 ['picard', 'and', 'dathan', 'at', 'eladrel'],
 ['marab', 'with', 'sails', 'unfurled'],
 ['timba', 'his', 'arms', 'open'],
 ['timba', 'at', 'rest']]
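
Each entry in `sentences` is paired with a hand-written tag list in `poses`, so after tokenizing it is worth checking that the two line up. The sketch below repeats the data from the first cell so it runs on its own; `tokenized` and `mismatches` are names introduced here for illustration.

```python
# Sentences and POS tag lists copied from the first cell of this notebook.
sentences = [
    'Sinda his face black his eyes red',
    'Tamak',
    'The river Tamak in winter',
    'Darmok and Jalad at Tanagra',
    'Darmok and Jalad on the ocean',
    'Socath his eyes opened',
    'The beast of Tanagra Usani his army Jakka when the walls fell',
    'Picard and Dathan at Eladrel',
    'Marab with sails unfurled',
    'Timba his arms open',
    'Timba at rest',
]
poses = [
    ['N', 'A', 'N', 'A', 'N', 'A'],
    ['N'],
    ['D', 'A', 'N', 'P', 'N'],
    ['N', 'C', 'N', 'P', 'N'],
    ['N', 'C', 'N', 'P', 'D', 'N'],
    ['N', 'A', 'N', 'V'],
    ['D', 'N', 'P', 'N', 'N', 'A', 'N', 'N', 'P', 'N', 'V'],
    ['N', 'C', 'N', 'P', 'N'],
    ['N', 'P', 'N', 'A'],
    ['N', 'A', 'N', 'A'],
    ['N', 'P', 'N'],
]

# Lower-case and split on whitespace, as in the tokenization cell.
tokenized = [s.lower().split() for s in sentences]

# Collect (index, token count, tag count) wherever the two lengths disagree.
mismatches = [
    (i, len(tokens), len(tags))
    for i, (tokens, tags) in enumerate(zip(tokenized, poses))
    if len(tokens) != len(tags)
]
for i, n_tokens, n_tags in mismatches:
    print(f"sentence {i}: {n_tokens} tokens but {n_tags} tags -> {tokenized[i]}")
```

With the data as given, this flags sentence 0 (seven tokens, six tags) and sentence 6 (twelve tokens, eleven tags), the one marked "don't worry about this one".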

### 2. Use a stemmer or lemmatizer 

- (NLTK has several) 
- You will know your stemmer/lemmatizer did its job because plural words will no longer be plural (e.g., 'eyes' -> 'eye') and past-tense words will no longer be past-tense (e.g., 'unfurled' -> 'unfurl').


In [36]:
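import nltk
import nltk.stem.snowball as stem

# EnglishStemmer is a Snowball stemmer (suffix stripping), not a lemmatizer.
stemmer = stem.EnglishStemmer()
lemma_sentences = [[stemmer.stem(t) for t in s] for s in split_list]
# Also include the already-stemmed sentence from question 8, so its words end up in the vocabulary.
lemma_sentences.append(['timba', 'his', 'face', 'red', 'his', 'eye', 'black', 'in', 'winter'])
# Vocabulary: every distinct stem; these become the terminals of the grammar.
words = set()
for s in lemma_sentences:
    words |= set(s)
words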

{'and',
 'arm',
 'armi',
 'at',
 'beast',
 'black',
 'darmok',
 'dathan',
 'eladrel',
 'eye',
 'face',
 'fell',
 'his',
 'in',
 'jakka',
 'jalad',
 'marab',
 'ocean',
 'of',
 'on',
 'open',
 'picard',
 'red',
 'rest',
 'river',
 'sail',
 'sinda',
 'socath',
 'tamak',
 'tanagra',
 'the',
 'timba',
 'unfurl',
 'usani',
 'wall',
 'when',
 'winter',
 'with'}
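
The vocabulary above contains 'armi', which is not a typo. A stemmer only strips and rewrites suffixes, so its output need not be a dictionary word; here the final 'y' of 'army' becomes 'i'. A quick check, assuming NLTK's `SnowballStemmer("english")` is available (it needs no extra data downloads with its default settings):

```python
from nltk.stem.snowball import SnowballStemmer

# The English Snowball stemmer; stopword handling is left at its default (off).
stemmer = SnowballStemmer("english")

# Map a few surface forms from the sentences to their stems.
stems = {word: stemmer.stem(word) for word in ["eyes", "unfurled", "walls", "army"]}
for word, stem_form in stems.items():
    print(word, "->", stem_form)
```

A lemmatizer would return dictionary forms instead, at the cost of needing a lexicon and, usually, the part of speech of each word.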

### 3. Write a grammar that can parse all of the sentences

* Try to write as few grammar rules as possible
* Use recursion where you can
* Use `S` as the start symbol
* All terminals need to be in quotes


In [37]:
import nltk

tamarian_grammar = nltk.CFG.fromstring("""
 S -> M S | M
 M -> NP V | NP | P NP
 NP -> D A NL A
 NL -> N C NL | N
 V -> 'open' | 'fell'
 N -> 'arm' | 'armi' | 'beast' | 'darmok' | 'dathan' | 'eladrel' | 'eye' | 'face' | 'fell' | 'jakka' | 'jalad' | 'marab' | 'ocean' | 'picard' | 'rest' | 'sail' | 'sinda' | 'socath' | 'tamak' | 'tanagra' | 'timba' | 'usani' | 'wall' | 'winter'
 A -> 'black' | 'his' | 'red' | 'river' | 'unfurl' |
 D -> 'the' |
 C -> 'and'
 P -> 'at' | 'in' | 'of' | 'on' | 'when' | 'with'
""")

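The trailing `|` in the `A` and `D` rules is what makes those categories optional: NLTK reads an alternative with nothing after the bar as an empty production. A toy grammar (not the assignment grammar; `toy` and `empty_rules` are names introduced here) makes this visible:

```python
import nltk

# Toy grammar: the trailing '|' gives A an empty alternative,
# so an S can be "his eye" or just "eye".
toy = nltk.CFG.fromstring("""
 S -> A N
 A -> 'his' |
 N -> 'eye'
""")

# Productions whose right-hand side is empty.
empty_rules = [p for p in toy.productions() if len(p.rhs()) == 0]
print(empty_rules)

# Count the parse trees for an input with and without the optional word.
toy_parser = nltk.ChartParser(toy)
tree_counts = {
    ' '.join(tokens): len(list(toy_parser.parse(tokens)))
    for tokens in (['his', 'eye'], ['eye'])
}
print(tree_counts)
```

This is also why the parse trees printed in this notebook contain nodes such as `(D )` and `(A )`: they are the empty alternatives being used.
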
## 4. Show that your grammar parses all of the sentences

* Use a parser that can use a CFG (NLTK has several) 
* Make sure there is a parse tree for each of the sentences

In [38]:
parser = nltk.ChartParser(tamarian_grammar)
for s in lemma_sentences:
    print(s)
    for tree in parser.parse(s):
        print(tree)
    print()
    print()

['sinda', 'his', 'face', 'black', 'his', 'eye', 'red']
(S
  (M (NP (D ) (A ) (NL (N sinda)) (A his)))
  (S
    (M (NP (D ) (A ) (NL (N face)) (A black)))
    (S (M (NP (D ) (A his) (NL (N eye)) (A red))))))
(S
  (M (NP (D ) (A ) (NL (N sinda)) (A )))
  (S
    (M (NP (D ) (A his) (NL (N face)) (A black)))
    (S (M (NP (D ) (A his) (NL (N eye)) (A red))))))


['tamak']
(S (M (NP (D ) (A ) (NL (N tamak)) (A ))))


['the', 'river', 'tamak', 'in', 'winter']
(S
  (M (NP (D the) (A river) (NL (N tamak)) (A )))
  (S (M (P in) (NP (D ) (A ) (NL (N winter)) (A )))))


['darmok', 'and', 'jalad', 'at', 'tanagra']
(S
  (M (NP (D ) (A ) (NL (N darmok) (C and) (NL (N jalad))) (A )))
  (S (M (P at) (NP (D ) (A ) (NL (N tanagra)) (A )))))


['darmok', 'and', 'jalad', 'on', 'the', 'ocean']
(S
  (M (NP (D ) (A ) (NL (N darmok) (C and) (NL (N jalad))) (A )))
  (S (M (P on) (NP (D the) (A ) (NL (N ocean)) (A )))))


['socath', 'his', 'eye', 'open']
(S
  (M (NP (D ) (A ) (NL (N socath)) (A )))
  (S (M (NP 

For questions 5-7, just answer in markdown/raw text. No code necessary.

## 5. Does your parser have full coverage?

## 6. Does your parser over-generate?

## 7. Which sentences are ambiguous? How do you know?
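
A sentence is ambiguous under a grammar when the parser returns more than one tree for it, and it is not covered when the parser returns none. Counting trees is therefore one way to approach questions 5 and 7. A minimal sketch that reproduces the grammar from question 3 so it runs on its own; `examples` and `counts` are names introduced here:

```python
import nltk

# The word-level grammar from question 3, copied so this block is self-contained.
grammar = nltk.CFG.fromstring("""
 S -> M S | M
 M -> NP V | NP | P NP
 NP -> D A NL A
 NL -> N C NL | N
 V -> 'open' | 'fell'
 N -> 'arm' | 'armi' | 'beast' | 'darmok' | 'dathan' | 'eladrel' | 'eye' | 'face' | 'fell' | 'jakka' | 'jalad' | 'marab' | 'ocean' | 'picard' | 'rest' | 'sail' | 'sinda' | 'socath' | 'tamak' | 'tanagra' | 'timba' | 'usani' | 'wall' | 'winter'
 A -> 'black' | 'his' | 'red' | 'river' | 'unfurl' |
 D -> 'the' |
 C -> 'and'
 P -> 'at' | 'in' | 'of' | 'on' | 'when' | 'with'
""")
parser = nltk.ChartParser(grammar)

# A few of the stemmed sentences from this notebook.
examples = [
    ['tamak'],
    ['sinda', 'his', 'face', 'black', 'his', 'eye', 'red'],
    ['timba', 'at', 'rest'],
]

# Number of distinct parse trees per sentence.
counts = {' '.join(s): len(list(parser.parse(s))) for s in examples}
for sentence, n in counts.items():
    status = 'no parse' if n == 0 else 'unambiguous' if n == 1 else 'ambiguous'
    print(f"{n} tree(s), {status}: {sentence}")
```

In the ambiguous case the two trees differ only in whether the first 'his' attaches to the noun before it or the noun after it.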

## 8. Parse this sentence:

* If you wrote your grammar right, this should be covered. If this isn't covered, then you'll need to go back and change your grammar.

In [90]:
s = ['timba', 'his', 'face', 'red', 'his', 'eye', 'black', 'in', 'winter']

In [27]:
for tree in parser.parse(s):
    print(tree)

(S
  (M (NP (D ) (A ) (NL (N timba)) (A his)))
  (S
    (M (NP (D ) (A ) (NL (N face)) (A red)))
    (S
      (M (NP (D ) (A his) (NL (N eye)) (A black)))
      (S (M (P in) (NP (D ) (A ) (NL (N winter)) (A )))))))
(S
  (M (NP (D ) (A ) (NL (N timba)) (A )))
  (S
    (M (NP (D ) (A his) (NL (N face)) (A red)))
    (S
      (M (NP (D ) (A his) (NL (N eye)) (A black)))
      (S (M (P in) (NP (D ) (A ) (NL (N winter)) (A )))))))


## 9. Was your result in Question 8 ambiguous?

* Answer in markdown or raw text, no code necessary

## 10. How expressive is your language?

* Answer in markdown or raw text, no code necessary

I made it as limited as possible: a noun can have at most one adjective before it and at most one after it. This also decreases the ambiguity of the grammar.
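
One way to make this limit concrete: over single-letter POS tags like those in `poses`, the tag-level version of this grammar (optional determiner, at most one adjective on each side, nouns joined by conjunctions, then either an optional verb or a leading preposition) has no nesting, so a regular expression accepts the same tag strings. A minimal sketch; `NP`, `M`, `SENTENCE`, and `accepts` are names introduced here, not part of the assignment:

```python
import re

# NP: optional determiner, optional adjective, noun list joined by C, optional adjective.
NP = r"D?A?N(?:CN)*A?"
# M: a noun phrase with an optional verb, or a preposition followed by a noun phrase.
M = rf"(?:{NP}V?|P{NP})"
# S: one or more M constituents in a row.
SENTENCE = re.compile(rf"(?:{M})+")

def accepts(tags):
    """Return True if the whole tag sequence matches the pattern."""
    return SENTENCE.fullmatch("".join(tags)) is not None

print(accepts(['N', 'C', 'N', 'P', 'N']))      # tags for "darmok and jalad at tanagra"
print(accepts(['N', 'A', 'A', 'N']))           # one adjective after, one before: allowed
print(accepts(['N', 'A', 'A', 'A', 'N']))      # three adjectives in a row: rejected
```

The third call shows the restriction at work: the middle adjective has no noun left to attach to.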

## 11. Make the grammar more general by treating POS tags as the terminals

In [29]:
tamarian_grammar = nltk.CFG.fromstring("""
 S -> M S | M
 M -> NP 'V' | NP | 'P' NP
 NP -> D A NL A
 NL -> 'N' 'C' NL | 'N'
 D -> 'D' |
 A -> 'A' |
""")

## 12. What is your set of POS tags?

* show the list of strings (e.g., ['Adj', ...])



In [None]:
['D', 'A', 'N', 'C', 'V', 'P']
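
The same set can be derived from the tag lists rather than typed by hand. A small sketch using three rows of `poses` that together contain every tag (`poses_sample` and `pos_tags` are names introduced here):

```python
# Three of the tag lists from `poses`; between them they use every tag.
poses_sample = [
    ['D', 'A', 'N', 'P', 'N'],
    ['N', 'C', 'N', 'P', 'N'],
    ['N', 'A', 'N', 'V'],
]

# Flatten, de-duplicate with a set, and sort for a stable order.
pos_tags = sorted({tag for tags in poses_sample for tag in tags})
print(pos_tags)
```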

## 13. Make a list for the POS tags that correspond to the sentence `s` below:

In [32]:
s = ['timba', 'his', 'face', 'red', 'his', 'eye', 'black', 'in', 'winter']
# One POS tag per word above; `s` is rebound so the parser in question 14 receives tags.
s = ['N', 'A', 'N', 'A', 'A', 'N', 'A', 'P', 'N']
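
Writing the tags by hand is error-prone for longer sentences. A sketch of the same mapping done by lookup, assuming each word has exactly one tag (the `lexicon` dict below is a hypothetical helper that mirrors the lexical rules of the word-level grammar for just these words):

```python
# Hypothetical word -> tag lookup covering only the words in this sentence.
lexicon = {
    'timba': 'N', 'face': 'N', 'eye': 'N', 'winter': 'N',
    'his': 'A', 'red': 'A', 'black': 'A',
    'in': 'P',
}

words_s = ['timba', 'his', 'face', 'red', 'his', 'eye', 'black', 'in', 'winter']

# Look up each word; a KeyError here would mean a word is missing from the lexicon.
tags_s = [lexicon[w] for w in words_s]
print(tags_s)
```

A lookup like this breaks down for words with more than one possible tag, such as 'fell', which the word-level grammar lists as both a noun and a verb.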

## 14. Parse the sentence (represented as POS tags)

In [33]:
parser = nltk.ChartParser(tamarian_grammar)
for tree in parser.parse(s):
    print(tree)

(S
  (M (NP (D ) (A ) (NL N) (A A)))
  (S
    (M (NP (D ) (A ) (NL N) (A A)))
    (S
      (M (NP (D ) (A A) (NL N) (A A)))
      (S (M P (NP (D ) (A ) (NL N) (A )))))))
(S
  (M (NP (D ) (A ) (NL N) (A )))
  (S
    (M (NP (D ) (A A) (NL N) (A A)))
    (S
      (M (NP (D ) (A A) (NL N) (A A)))
      (S (M P (NP (D ) (A ) (NL N) (A )))))))


## Extra Credit! Do all of the above questions again, but add the sentence:

'The beast of Tanagra Usani his army Jakka when the walls fell'

*Done!*

## Submit

In [None]:
from client.api.notebook import Notebook
ok = Notebook('a4.ok')
ok.auth(inline=True)

In [None]:
ok.submit()