### Setup

In [1]:
#!pip install eng-to-ipa
#!pip install -U spacy
#!python -m spacy download en_core_web_sm

In [None]:
pip freeze | grep -E "spacy|ipa|nltk"

en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl
eng-to-ipa==0.0.2
nltk==3.2.5
spacy==3.2.1
spacy-legacy==3.0.8
spacy-loggers==1.0.1


In [None]:
import spacy
import eng_to_ipa as ipa
import nltk

## Phonetics
Convert English text into phonetic transcription using eng-to-ipa.

The **eng-to-ipa** library uses the Carnegie Mellon University (CMU) Pronouncing Dictionary to convert English text into the International Phonetic Alphabet.

**IPA:** International Phonetic Alphabet

* **`convert()`** converts a string into IPA.
* **`ipa_list()`** returns a list with one entry per word, each entry being a list of all possible transcriptions of that word.
* **`get_rhymes()`** returns a list of rhymes for a word or set of words.
* **`syllable_count()`** returns an integer corresponding to the number of syllables in a word, or a list of syllable counts if the input string contains more than one word.

---

### Learn & play:

> https://en.wikipedia.org/wiki/International_Phonetic_Alphabet

> https://pypi.org/project/eng-to-ipa/

In [None]:
# Convert to phonetic transcript (IPA)
ipa.convert('apple')

'ˈæpəl'

In [None]:
ipa.convert('go home')

'goʊ hoʊm'

In [None]:
# Check for different pronunciations
ipa.convert('natural language processing', retrieve_all=True )

['ˈnæʧrəl ˈlæŋgwəʤ ˈprɑsɛsɪŋ',
 'ˈnæʧrəl ˈlæŋgwɪʤ ˈprɑsɛsɪŋ',
 'ˈnæʧərəl ˈlæŋgwəʤ ˈprɑsɛsɪŋ',
 'ˈnæʧərəl ˈlæŋgwɪʤ ˈprɑsɛsɪŋ']

In [None]:
ipa.ipa_list('natural language processing')

[['ˈnæʧrəl', 'ˈnæʧərəl'], ['ˈlæŋgwəʤ', 'ˈlæŋgwɪʤ'], ['ˈprɑsɛsɪŋ']]
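
The four strings returned with `retrieve_all=True` are the combinations of the per-word variants returned by `ipa_list()`. The sketch below illustrates that relationship with a plain dictionary lookup. `TOY_DICT`, `toy_ipa_list` and `all_transcriptions` are hypothetical stand-ins filled with the transcriptions shown above; this is an assumption about how the pieces fit together, not eng-to-ipa's actual implementation.

```python
from itertools import product

# Hypothetical mini pronouncing dictionary, copied from the outputs above
TOY_DICT = {
    "natural": ["ˈnæʧrəl", "ˈnæʧərəl"],
    "language": ["ˈlæŋgwəʤ", "ˈlæŋgwɪʤ"],
    "processing": ["ˈprɑsɛsɪŋ"],
}

def toy_ipa_list(text):
    """One list of candidate transcriptions per word."""
    return [TOY_DICT[word] for word in text.lower().split()]

def all_transcriptions(text):
    """Every combination of per-word variants, joined into one string."""
    return [" ".join(combo) for combo in product(*toy_ipa_list(text))]

variants = all_transcriptions("natural language processing")
print(len(variants))  # 2 * 2 * 1 combinations
for v in variants:
    print(v)
```

Words with several dictionary entries multiply the number of full transcriptions, which is why long inputs can return many candidates.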

In [None]:
# Find rhymes to a word
ipa.get_rhymes('data')

['abeyta',
 'argueta',
 'beta',
 'beteta',
 'gazeta',
 'goizueta',
 'ireta',
 'mineta',
 'peseta',
 'placeta',
 'prieta',
 'saitta',
 'seita',
 'sustaita',
 'theta',
 'ysleta',
 'zeta']

In [None]:
ipa.get_rhymes('science')

['alliance',
 'appliance',
 'bioscience',
 'compliance',
 'defiance',
 'non-compliance',
 'noncompliance',
 'pseudoscience',
 'reliance']

In [None]:
# Calculate the number of syllables per token
ipa.syllable_count('interesting Wednesday business')

[4, 2, 2]

## Morphology
Tokenization, lemmatization, and stemming using spaCy and NLTK.

spaCy provides a number of pre-trained model packages that you can download using the `spacy download` command:

`python -m spacy download en_core_web_sm`

For example, the "en_core_web_sm" package is a small English model that supports all core capabilities and is trained on web text.

The `spacy.load` method loads a model package by name and returns an `nlp` object. The package provides the binary weights that enable spaCy to make predictions. It also includes the vocabulary, and meta information to tell spaCy which language class to use and how to configure the processing pipeline.

**Tokenization**

`Token` objects represent the tokens in a document – for example, a word or a punctuation character. To get a token at a specific position, you can index into the doc.
`Token` objects also provide various attributes that let you access more information about the tokens:

**`.text`** returns the verbatim token text.

**`.like_num`** checks whether a token resembles a number.

**`.i`** is the index of the token within its parent document, so `doc[token.i + 1]` gets the token following the current one.


**Stemming**


---

Learn & play more:

> https://spacy.io/





In [None]:
# Load the spaCy language model, and create the nlp object.
nlp = spacy.load('en_core_web_sm') 

In [None]:
# Process text. spaCy creates a Doc object when you process a text with the nlp object.
# The Doc lets you access information about the text in a structured way, and no information is lost.
doc = nlp("I studied 3 languages.")

In [None]:
# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

I
studied
3
languages
.


In [None]:
# Index
studied = doc[1]

In [None]:
# Output the string for a given element
doc[0].text

'I'

In [None]:
# Slice the Doc to drop the final period
doc[0:4]

I studied 3 languages

In [None]:
# Get the token for "3"
number = doc[2]
number

3

In [None]:
number.like_num

True

In [None]:
studied.like_num

False

In [None]:
number.like_email

False

In [None]:
# ID of the verbatim text content.
number.orth

602994839685422785

In [None]:
# Get the parent document
number.doc

I studied 3 languages.

In [None]:
# ID of the base form of the token, with no inflectional suffixes.
studied.lemma

4251533498015236010

In [None]:
# Base form of the token, with no inflectional suffixes.
studied.lemma_

'study'

In [None]:
len(studied)

7

In [None]:
studied.text.capitalize()

'Studied'

In [None]:
studied.text.count('d')

2

In [None]:
# Loop over the tokens in the Doc object
for token in doc:
    # Print the results of morphological analysis for each token
    print(token.morph)

Case=Nom|Number=Sing|Person=1|PronType=Prs
Tense=Past|VerbForm=Fin
NumType=Card
Number=Plur
PunctType=Peri


In [None]:
studied.morph

Tense=Past|VerbForm=Fin

In [None]:
studied.morph.get('Tense')

['Past']
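
The printed morphology follows the Universal Dependencies FEATS notation: `Feature=Value` pairs joined by `|`. As a rough illustration of how such a string maps to the list returned by `morph.get('Tense')`, here is a small parser in plain Python. `parse_feats` is a hypothetical helper, not part of spaCy, and the comma handling assumes the UD convention that one feature may carry several values.

```python
def parse_feats(feats):
    """Parse a UD-style FEATS string into {feature: [values]}."""
    if not feats or feats == "_":
        return {}
    parsed = {}
    for pair in feats.split("|"):
        name, _, values = pair.partition("=")
        # A feature may hold several comma-separated values, hence a list
        parsed[name] = values.split(",")
    return parsed

studied_feats = parse_feats("Tense=Past|VerbForm=Fin")
print(studied_feats["Tense"])
print(parse_feats("Case=Nom|Number=Sing|Person=1|PronType=Prs"))
```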

In [None]:
number.morph

NumType=Card

In [None]:
# Let's try some stemming (import only the class we need instead of `*`)
from nltk.stem.porter import PorterStemmer

In [None]:
stemmer = PorterStemmer()

In [None]:
tokens = ['compute', 'computer', 'computed', 'computing']


In [None]:
for token in tokens:
    print(token + ' --> ' + stemmer.stem(token))


compute --> comput
computer --> comput
computed --> comput
computing --> comput
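
To see why all four forms collapse to the non-word `comput`, it helps to think of stemming as rule-based suffix stripping. The sketch below is a deliberately naive suffix stripper; it is NOT the Porter algorithm (which applies several ordered rule phases with extra conditions), and both the `SUFFIXES` tuple and the `naive_stem` function are hypothetical names chosen for illustration.

```python
# Toy suffix list for illustration only; the real Porter rules are far richer
SUFFIXES = ("ing", "ed", "er", "e")

def naive_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

stems = {w: naive_stem(w) for w in ["compute", "computer", "computed", "computing"]}
for word, stem in stems.items():
    print(word, "-->", stem)
```

Unlike a lemma, a stem is just a truncated string shared by related word forms, so it does not have to be a dictionary word.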


Your turn: Compare lemma and stem for the same words.

## Syntax

POS:
```
    "ADJ": "adjective",
    "ADP": "adposition", <-- prepositions and postpositions together
    "ADV": "adverb",
    "AUX": "auxiliary",
    "CONJ": "conjunction",
    "CCONJ": "coordinating conjunction",
    "DET": "determiner",
    "INTJ": "interjection",
    "NOUN": "noun",
    "NUM": "numeral",
    "PART": "particle",
    "PRON": "pronoun",
    "PROPN": "proper noun",
    "PUNCT": "punctuation",
    "SCONJ": "subordinating conjunction",
    "SYM": "symbol",
    "VERB": "verb",
    "X": "other",
    "EOL": "end of line",
    "SPACE": "space"
```

Phrase (chunk) labels:
```
    "NP": "noun phrase",
    "PP": "prepositional phrase",
    "VP": "verb phrase",
    "ADVP": "adverb phrase",
    "ADJP": "adjective phrase",
    "SBAR": "subordinating conjunction",
    "PRT": "particle",
    "PNP": "prepositional noun phrase"
```

Dependency labels (selected):

```
    "advmod": "adverbial modifier",
    "amod": "adjectival modifier",
    "attr": "attribute",
    "aux": "auxiliary",
    "auxpass": "auxiliary (passive)",
    "case": "case marking",
    "conj": "conjunct",
    "csubj": "clausal subject",
    "det": "determiner",
    "dobj": "direct object",
    "expl": "expletive",
    "hyph": "hyphen",
    "infmod": "infinitival modifier",
    "meta": "meta modifier",
    "neg": "negation modifier",
    "nmod": "modifier of nominal",
    "nn": "noun compound modifier",
    "nsubj": "nominal subject",
    "nsubjpass": "nominal subject (passive)",
    "nounmod": "modifier of nominal",
    "npmod": "noun phrase as adverbial modifier",
    "num": "number modifier",
    "number": "number compound modifier",
    "nummod": "numeric modifier",
    "obj": "object",
    "pcomp": "complement of preposition",
    "pobj": "object of preposition",
    "poss": "possession modifier",
    "possessive": "possessive modifier",
    "prep": "prepositional modifier",
    
```
---


Learn & Play

Full list of tags: 
> https://github.com/explosion/spaCy/blob/master/spacy/glossary.py

Displacy options:
> https://spacy.io/api/top-level#displacy_options

In [None]:
from spacy import displacy

In [None]:
doc = nlp('Lucerne is a beautiful town in Switzerland.')

In [None]:
for token in doc:
  print (token.text + '\n\t-->' + token.pos_)

Lucerne
	-->PROPN
is
	-->AUX
a
	-->DET
beautiful
	-->ADJ
town
	-->NOUN
in
	-->ADP
Switzerland
	-->PROPN
.
	-->PUNCT


In [None]:
# Dependency parse with coarse-grained POS tags (token.pos_)
displacy.render(doc, style="dep", jupyter=True)

In [None]:
# Dependency parse with fine-grained POS tags (token.tag_)
displacy.render(doc, style="dep", jupyter=True, options={'fine_grained': True})

In [None]:
# Merge noun phrases into one token.
displacy.render(doc, style="dep", jupyter=True, options={'collapse_phrases': True})

In [None]:
# Get a description for a given POS tag, dependency label or entity type
spacy.explain('DET')

'determiner'

In [None]:
# Stop words
stopwords = nlp.Defaults.stop_words

# Default stop word list in spaCy
print(len(stopwords))
print(stopwords)

326
{'doing', 'seemed', 'side', "'m", 'before', 'together', 'unless', 'anyhow', 'afterwards', 'out', 'indeed', 'any', 'more', 'would', 'some', 'bottom', 'hence', 'anyway', 'whence', 'can', 'became', 'wherein', 'between', 'into', 'all', 'latterly', 'done', 'has', 'off', 'thus', '’s', 'elsewhere', 'how', 'below', 'then', 'nor', 'had', 'it', 'against', 'whoever', 'both', 'still', 'her', 'enough', 'nobody', 'per', 'from', 'each', 'themselves', 'amongst', 'former', 'herein', 'those', 'get', 'she', 'keep', 'two', 'about', 'within', '‘ve', 'empty', 'everywhere', 'there', '‘s', 'quite', 'few', 'whose', 'yours', 'he', 'not', 'if', 'except', 'whereupon', 'nine', 'cannot', 'being', 'put', 'seeming', 'further', 'its', 'ca', 'no', 'himself', 'make', 'whole', 'ever', 'herself', 'could', 'was', 'yourselves', 'none', 'either', 'hereafter', 'regarding', 'someone', 'alone', 'does', 'towards', 'among', 'whither', 'beside', 'due', 'during', 'anything', 'one', 'above', 'up', 'thru', 'might', 'us', 'been', 

In [None]:
# Check stop words in document
for token in doc:
    print(token.text, '\n-->' , token.is_stop)

Lucerne 
--> False
is 
--> True
a 
--> True
beautiful 
--> False
town 
--> False
in 
--> True
Switzerland 
--> False
. 
--> False


In [None]:
# Add stop words
nlp.Defaults.stop_words |= {"newstopword1","newstopword2"}
nwords = [i for i in stopwords if(i.startswith('n'))]
print(nwords)

['newstopword2', 'nor', 'nobody', 'not', 'nine', 'no', 'none', "n't", 'nowhere', 'n‘t', 'newstopword1', 'next', 'nevertheless', 'neither', 'now', 'nothing', 'name', 'n’t', 'never', 'namely', 'noone']


In [None]:
# Remove stop words from default list
nlp.Defaults.stop_words -= {"newstopword1","newstopword2"}
nwords = [i for i in stopwords if(i.startswith('n'))]
print(nwords)

['nor', 'nobody', 'not', 'nine', 'no', 'none', "n't", 'nowhere', 'n‘t', 'next', 'nevertheless', 'neither', 'now', 'nothing', 'name', 'n’t', 'never', 'namely', 'noone']


In [None]:
# Remove stop words from text (stop words are mostly function words; what remains are content words)
sentence_wo_stops = []
for token in doc:
  if token.text in stopwords:
    print('stop word \t-->', token)
  else:
    print('content word \t-->', token)
    sentence_wo_stops.append(token.text)

print('\nThe sentence after stop word removal contains only content words:', ' '.join(sentence_wo_stops))

content word 	--> Lucerne
stop word 	--> is
stop word 	--> a
content word 	--> beautiful
content word 	--> town
stop word 	--> in
content word 	--> Switzerland
content word 	--> .

The sentence after stop word removal contains only content words: Lucerne beautiful town Switzerland .
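
One caveat with the membership test `token.text in stopwords`: it is case-sensitive. Assuming the stop list holds only lowercase forms, as the default list printed earlier appears to, a capitalised stop word at the start of a sentence would slip through. The plain-Python sketch below shows the effect with a hypothetical mini stop list (`TOY_STOPS`) and a hand-tokenised sentence.

```python
# Hypothetical mini stop list (lowercase only) and a hand-made token list
TOY_STOPS = {"the", "is", "a", "in"}
tokens = ["The", "town", "is", "in", "Switzerland", "."]

# Case-sensitive membership test: "The" is not recognised as a stop word
case_sensitive = [t for t in tokens if t not in TOY_STOPS]

# Lowercasing before the lookup catches capitalised stop words too
case_insensitive = [t for t in tokens if t.lower() not in TOY_STOPS]

print(case_sensitive)
print(case_insensitive)
```

In spaCy itself, `token.is_stop` (used in an earlier cell) is an alternative to the manual lookup and should handle capitalised forms in recent versions; it is worth checking this for the version you have installed.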


## Semantics
**WordNet**

WordNet is a semantically-oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets. We'll begin by looking at synonyms and how they are accessed in WordNet.

**Synset:** a set of synonyms that share a common meaning.

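Consider the sentence *Benz is credited with the invention of the motorcar.* If we replace *motorcar* with *automobile*, the meaning of the sentence stays pretty much the same. Since everything else in the sentence has remained unchanged, we can conclude that the words *motorcar* and *automobile* have the same meaning, i.e. they are synonyms. We can explore these words with the help of WordNet: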

**Senses**


**Similarity measures**

**Path similarity:**
> Hierarchy-based similarity: returns a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. 
The score is in the range 0 to 1. By default, a fake root node is now added for verbs, so cases where previously no path could be found (and None was returned) should now return a value. 
The old behavior can be restored by setting `simulate_root` to `False`. A score of 1 represents identity, i.e. comparing a sense with itself returns 1.

**Leacock-Chodorow Similarity:**

> Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur.

**Wu-Palmer Similarity:** 
> Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). 
Note that at this time the scores given do _not_ always agree with those given by Pedersen's Perl implementation of Wordnet Similarity.
The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. 
Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, 
the longer path is used for the purposes of the calculation.


For Leacock-Chodorow similarity, the relationship is given as `-log(p/2d)`, where `p` is the shortest path length and `d` is the taxonomy depth.
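
To make the three measures described above concrete, here is a minimal sketch on a tiny hand-made taxonomy. Everything in it is an assumption for illustration: the `PARENT` tree is hypothetical and far smaller than WordNet, each node has a single parent, depths are counted in nodes (root = 1), and the path length `p` in the Leacock-Chodorow formula is taken as edges + 1. The resulting numbers will therefore not match NLTK's.

```python
import math

# Hypothetical toy is-a taxonomy (child -> parent); NOT the real WordNet graph
PARENT = {
    "entity": None,
    "food": "entity",
    "beverage": "food",
    "coffee": "beverage",
    "espresso": "coffee",
    "cappuccino": "coffee",
    "tea": "beverage",
    "green_tea": "tea",
    "hyson": "green_tea",
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def lcs(a, b):
    """Least common subsumer: the deepest ancestor shared by a and b."""
    ancestors_of_b = set(path_to_root(b))
    return next(n for n in path_to_root(a) if n in ancestors_of_b)

def edge_distance(a, b):
    """Number of edges on the shortest path between a and b (via their LCS)."""
    common = depth(lcs(a, b))
    return (depth(a) - common) + (depth(b) - common)

MAX_DEPTH = max(depth(n) for n in PARENT)  # taxonomy depth d

def path_similarity(a, b):
    return 1 / (1 + edge_distance(a, b))

def lch_similarity(a, b):
    p = edge_distance(a, b) + 1  # assumption: p counted in nodes, avoids log(0)
    return -math.log(p / (2 * MAX_DEPTH))

def wup_similarity(a, b):
    return 2 * depth(lcs(a, b)) / (depth(a) + depth(b))

for other in ("espresso", "hyson"):
    print(other,
          path_similarity(other, "cappuccino"),
          lch_similarity(other, "cappuccino"),
          wup_similarity(other, "cappuccino"))
```

All three measures rank espresso as closer to cappuccino than hyson; they differ only in how the path length and the depth enter the score.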


In [None]:
s1_apple = nlp('She worked at Apple.')
s2_apple = nlp('She ate an apple.')

In [None]:
# WSD with POS
displacy.render(s1_apple, style="dep", jupyter=True, options={'fine_grained': True})

In [None]:
# WSD with POS
displacy.render(s2_apple, style="dep", jupyter=True, options={'fine_grained': True})

In [None]:
s1_java = nlp('She worked in Java.')
s2_java = nlp('She codes in Java.')

In [None]:
displacy.render(s1_java, style="dep", jupyter=True, options={'fine_grained': True})

In [None]:
displacy.render(s2_java, style="dep", jupyter=True, options={'fine_grained': True})

In [None]:
# WSD with WordNet
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [None]:
# The word 'java' is ambiguous, and has three synsets:
wn.synsets('java')

[Synset('java.n.01'), Synset('coffee.n.01'), Synset('java.n.03')]

In [None]:
# Inspect synsets' lemma names and definitions
for synset in wn.synsets('java'):
  print(synset.lemma_names(), synset.definition())

['Java'] an island in Indonesia to the south of Borneo; one of the world's most densely populated regions
['coffee', 'java'] a beverage consisting of an infusion of ground coffee beans
['Java'] a platform-independent object-oriented programming language


In [None]:
# Hypernyms
for synset in wn.synsets('java'):
  print(synset.lemma_names(), synset.hypernyms())

['Java'] []
['coffee', 'java'] [Synset('beverage.n.01')]
['Java'] [Synset('object-oriented_programming_language.n.01')]


In [None]:
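# All hypernyms: `_all_hypernyms` is a private cache attribute that stays None until NLTK fills it,
# hence the output below. To walk the full chain, use e.g. synset.closure(lambda s: s.hypernyms()).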
for synset in wn.synsets('java'):
  print(synset.lemma_names(), synset._all_hypernyms)

['Java'] None
['coffee', 'java'] None
['Java'] None


In [None]:
# Hyponyms
for synset in wn.synsets('java'):
  print(synset.lemma_names(), synset.hyponyms())

['Java'] []
['coffee', 'java'] [Synset('cafe_au_lait.n.01'), Synset('cafe_noir.n.01'), Synset('cafe_royale.n.01'), Synset('cappuccino.n.01'), Synset('coffee_substitute.n.01'), Synset('decaffeinated_coffee.n.01'), Synset('drip_coffee.n.01'), Synset('espresso.n.01'), Synset('iced_coffee.n.01'), Synset('instant_coffee.n.01'), Synset('irish_coffee.n.01'), Synset('mocha.n.03'), Synset('turkish_coffee.n.01')]
['Java'] []


In [None]:
# Path similarity
cappuccino = wn.synset('cappuccino.n.01')
espresso = wn.synset('espresso.n.01')
hyson = wn.synset('hyson.n.01')                  #(a Chinese green tea with twisted leaves)

In [None]:
# Tell me how similar hyson is to cappuccino and how similar espresso is to cappuccino. Please.
print(hyson.path_similarity(cappuccino), espresso.path_similarity(cappuccino))

0.09090909090909091 0.3333333333333333


In [None]:
# Entities are always most similar to themselves. Almost true...
print(espresso.path_similarity(espresso))

1.0


In [None]:
# Leacock-Chodorow Similarity
print(hyson.lch_similarity(cappuccino), espresso.lch_similarity(cappuccino))

1.2396908869280152 2.538973871058276


In [None]:
# Wu-Palmer Similarity
print(hyson.wup_similarity(cappuccino), espresso.wup_similarity(cappuccino))

0.5 0.9


We will talk more about similarity in Sessions 3 and 4. For now, just remember that you have already used it for comparing word senses.

If you want to understand Kornelia's comment on "Almost true..." watch: https://youtu.be/JQVmkDUkZT4