In [2]:
import spacy
import numpy as np
import pandas as pd

In [3]:
# Create the NLP object (load the installed model)
nlp = spacy.load("en_core_web_sm")

In [4]:
type(nlp)

spacy.lang.en.English

# 1 - Introduction

### What is Natural Language Processing
- Natural Language Processing is a subfield of artificial intelligence that tries to **process and analyze natural language data.**
  
NOTE: Natural language is a language that developed naturally through use.

The idea: Since we already know the semantic and grammatical rules of a human language, we can build applications that programmatically understand utterances in that language.

### How can Computers Understand Language
- Since computers only understand numbers, we need to convert words into numbers. This process is called **Word Embedding**.
- Word Embedding concept: **Mapping words to vectors of real numbers that distribute the meaning of each word** across the coordinates of the corresponding word vector. NOTE:
    - Two words are similar (from the machine's perspective) if their word vectors are close together.
    - Two words end up close together in the vector space based on **the contextual similarity** of their usage in a large corpus of text.
      
      NOTE: The key factors are (1) co-occurrence in similar contexts and (2) frequency of co-occurrence.
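The notion of "nearby" vectors can be made concrete with cosine similarity. The sketch below is a minimal illustration that assumes made-up 3-dimensional vectors; real embeddings have hundreds of dimensions and are learned from a corpus, so the words and numbers here are purely illustrative.

```python
import math

# Toy 3-dimensional "embeddings"; the values are invented for illustration.
toy_vectors = {
    "apple":  [0.9, 0.8, 0.1],
    "mango":  [0.8, 0.9, 0.2],
    "laptop": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

fruit_pair = cosine_similarity(toy_vectors["apple"], toy_vectors["mango"])
mixed_pair = cosine_similarity(toy_vectors["apple"], toy_vectors["laptop"])
print(f"apple vs mango:  {fruit_pair:.3f}")
print(f"apple vs laptop: {mixed_pair:.3f}")
```

Under these assumed values, the two fruit words score higher than the fruit/device pair, which is the behaviour a trained embedding is expected to show for words used in similar contexts.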

### Dependency Grammars vs Phrase Structure Grammars
> spaCy primarily uses dependency grammar for syntactic parsing. It builds dependency trees that represent the relationships between words in a sentence, showing how each word depends on another.

**Phrase Structure Grammars**
- Based on how words combine to form constituents in a sentence ==> Focus on the **relations between constituents**
  
    NOTE: A constituent is a group of words that functions as a single unit in a sentence, e.g. a morpheme, phrase, or clause.

- Concept: The rules decompose a sentence hierarchically into its constituent parts until they reach single words.


Example:

![Images](data_spacy/images/ex-parse-structure-grammar.png)
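The hierarchical decomposition can be sketched with nested tuples. The sentence and its bracketing below are hand-written assumptions (the labels S, NP, VP, PP follow common textbook convention); they are not the output of any parser.

```python
# A constituent is (label, child, child, ...); a leaf is (label, word).
tree = ("S",
        ("NP", ("DET", "The"), ("N", "cat")),
        ("VP", ("V", "sat"),
               ("PP", ("P", "on"),
                      ("NP", ("DET", "the"), ("N", "mat")))))

def leaves(node):
    """Collect the words of a constituent from left to right."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]
    words = []
    for child in children:
        words.extend(leaves(child))
    return words

def show(node, depth=0):
    """Print each constituent with the words it spans, indented by depth."""
    print("  " * depth + f"{node[0]}: {' '.join(leaves(node))}")
    for child in node[1:]:
        if not isinstance(child, str):
            show(child, depth + 1)

show(tree)
```

Each level of indentation corresponds to one decomposition step, ending at single words.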

**Dependency Grammars**
- Based on the **relations between individual words.**
- Concept: Determine the root, which is the main word of the sentence (usually a verb, called the main verb). All other words depend on this root, directly or indirectly.
    NOTE:
      (1) In each relation, the governing word is called the "HEAD" and the dependent word the "CHILD".
      (2) Each word in a sentence is connected to exactly one HEAD (the ROOT is connected to itself). A word may have zero, one, or several CHILDREN.

Example:

![Images](data_spacy/images/ex-dep-structure-grammar.png)

Explanation:
- "sat" is ROOT
- "sat" is HEAD of "cat", "on", and "mat"
- "cat" is HEAD of "the"
- "mat" is HEAD of "the"
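The head/child bookkeeping from this explanation can be sketched with a plain dictionary. The arcs are copied from the list above; the sentence "The cat sat on the mat" and the token positions are assumptions, introduced so that the two occurrences of "the" can be told apart.

```python
from collections import defaultdict

words = ["The", "cat", "sat", "on", "the", "mat"]
# head_of[i] is the position of token i's HEAD; the ROOT points to itself.
head_of = {0: 1, 1: 2, 2: 2, 3: 2, 4: 5, 5: 2}

# Every token has exactly one head, so the ROOT is the only self-loop.
roots = [i for i, h in head_of.items() if i == h]

# Invert the mapping to list the CHILDREN of each head.
children = defaultdict(list)
for child, head in head_of.items():
    if child != head:
        children[head].append(child)

print("ROOT:", words[roots[0]])
for head in sorted(children):
    print(words[head], "->", [words[c] for c in children[head]])
```

Storing one head per token is enough to recover the whole tree, which is why a word can have many children but only one head.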

### Common Grammar Concepts

**Transitive Verbs and Direct Objects**
- A direct object is a noun (or a noun phrase) that **directly receives the action** of the verb.
- A transitive verb is an **action verb that needs something (or someone) to receive the action**. This "something" that receives the action is the **direct object**.

For example:

    "She wrote a letter" ==> Direct object is "a letter" and Transitive verb is "wrote".

**Prepositional Objects**
- A preposition **connects noun phrases (or noun, pronoun) with other words in a sentence**.
    Example: "in", "above", "at", "to", "of", etc.
- Object of a Preposition: A noun (or pronoun, noun phrase) that follows a preposition.

Example:

    "I wrote a series of articles." ==> Preposition is "of" and Object of preposition is "articles".

NOTE:
In some questions, extracting the object of the preposition might give us the most informative word or phrase for finding the answer.

Example:

    "What can be done about climate change?" ==> The phrase “climate change” is the key phrase.

**Modal Auxiliary Verbs**
- Modal auxiliary verbs are **special verbs used alongside a main verb** to express various moods or modalities. Example: "may", "might", "can", etc.


**Personal Pronoun**
- A personal pronoun refers to a specific person, object, or to multiple people or objects.

Forms according to their grammatical role in a sentence:
- The nominative form (I, you, he, she, it, we, they) is typically used as the nominal subject of a verb.
- The accusative form (me, you, him, her, it, us, them) is typically used as the object of a verb or preposition.
- The reflexive form (myself, yourself/yourselves, himself, herself, itself, ourselves, themselves) typically refers back to the subject specified within the same clause.
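The three forms can be kept in a small lookup table. The sketch below is only an illustration built from the lists above; the function name `possible_roles` is hypothetical.

```python
# Each row: (nominative, accusative, reflexive), copied from the lists above.
# "yourselves" (the plural reflexive of "you") is left out to keep one row per pronoun.
PRONOUN_FORMS = [
    ("I", "me", "myself"),
    ("you", "you", "yourself"),
    ("he", "him", "himself"),
    ("she", "her", "herself"),
    ("it", "it", "itself"),
    ("we", "us", "ourselves"),
    ("they", "them", "themselves"),
]

def possible_roles(pronoun):
    """Return the set of grammatical forms a pronoun string can represent."""
    names = ("nominative", "accusative", "reflexive")
    roles = set()
    for row in PRONOUN_FORMS:
        for name, form in zip(names, row):
            if form.lower() == pronoun.lower():
                roles.add(name)
    return roles

print(possible_roles("him"))
print(possible_roles("you"))
```

"you" and "it" look the same in the nominative and accusative, so the word form alone cannot tell a subject from an object; the syntactic role has to come from the sentence structure.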

# 2 - Basic NLP Operations with spaCy

### Tokenization

> Parsing the text input into tokens.

- By default, spaCy tokenizes text into word-level tokens.

In [5]:
text = "I am flying to Frisco"
doc = nlp(text)

for idx, token in enumerate(doc):
    print(f"{idx + 1}. {token}")

1. I
2. am
3. flying
4. to
5. Frisco


### Lemmatization
> Process of reducing word forms to their lemma.

NOTE: A lemma is the base form of a token. For example, the lemma of "flying" is "fly".

In [6]:
text = "This product integrates both libraries for downloading and applying patches"
doc = nlp(text)
print("Text", "Lemma")
for token in doc:
    print(token.text, token.lemma_)

Text Lemma
This this
product product
integrates integrate
both both
libraries library
for for
downloading download
and and
applying apply
patches patch


**Case Application of Lemmatization: Meaning Recognition.**

Suppose we have an NLP application that interacts with an online system providing an API for booking tickets for trips.

The application processes a customer's request, extracts the necessary information from it, and passes that information on to the underlying API.

Two pieces of information need to be extracted:
1. Whether the customer wants an air ticket, a railway ticket, or a bus ticket.
2. The destination.

Ideas:
1. Tokenization
2. Lemmatization
3. Match each token against a predefined list of keywords that maps it to the kind of ticket the customer needs.

NOTE: Lemmatization lets the programmer keep the predefined keyword list simple, because one lemma covers all inflected forms of a word.

![Images](data_spacy/images/ex-app-lemma.png)
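The three ideas can be sketched without spaCy to show why lemmas keep the keyword list short. Everything in this block is an assumption for illustration: `TOY_LEMMAS` is a hand-written stand-in for a real lemmatizer, and the keyword-to-ticket mapping in `TICKET_KEYWORDS` is hypothetical.

```python
# Hypothetical stand-in for a lemmatizer: maps inflected forms to lemmas.
TOY_LEMMAS = {"flying": "fly", "flew": "fly", "flown": "fly",
              "driving": "drive", "riding": "ride", "am": "be"}

# Hypothetical keyword list: one lemma per ticket type is enough
# because every inflected form collapses to the same lemma.
TICKET_KEYWORDS = {"fly": "air ticket", "ride": "railway ticket",
                   "drive": "bus ticket"}

def ticket_type(sentence):
    """Return the ticket type implied by the first matching keyword lemma."""
    for word in sentence.split():                           # 1. tokenization (naive)
        lemma = TOY_LEMMAS.get(word.lower(), word.lower())  # 2. lemmatization
        if lemma in TICKET_KEYWORDS:                        # 3. keyword matching
            return TICKET_KEYWORDS[lemma]
    return None

print(ticket_type("I am flying to Frisco"))
```

Without the lemma step, the keyword list would need a separate entry for "fly", "flying", "flew", and "flown".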

In [7]:
from spacy.symbols import ORTH, NORM
from spacy.language import Language

# Example simple program

input_text = "I am flying to Frisco"

# Modeling
nlp = spacy.load("en_core_web_sm")

# Add special case rules (for special token)
special_case = [{ORTH: "Frisco", NORM: "San Francisco"}]
nlp.tokenizer.add_special_case("Frisco", special_case)

# Custom pipeline component to adjust lemmas
@Language.component("custom_lemma_setter")
def custom_lemma_setter(doc):
    for token in doc:
        if token.text == "Frisco":
            token.lemma_ = "San Francisco"  # Set custom lemma
    return doc

# Add the component after the default lemmatizer in the pipeline
nlp.add_pipe("custom_lemma_setter", after="lemmatizer")

doc = nlp(input_text)

print("Text", "Lemma")
for token in doc:
    print(token.text, token.lemma_)

Text Lemma
I I
am be
flying fly
to to
Frisco San Francisco


### Part-of-Speech Tagging
> Telling part-of-speech of a given word in a given sentence (noun, verb, and so on).

In spaCy, there are two types of part-of-speech tags:
1. Coarse-grained parts of speech
> Using the Token.pos (int) and Token.pos_ (str) attributes. 

2. Fine-grained parts of speech
> Using the Token.tag (int) and Token.tag_ (str) attributes.

NOTE: In English, the core parts of speech include noun, pronoun, determiner, adjective, verb, adverb, preposition, conjunction, and interjection.


Detail about pos tagging: https://v2.spacy.io/api/annotation#pos-tagging

In [8]:
# Example

text = "The United States is a country primarily located in North America"
doc = nlp(text)

for idx, token in enumerate(doc):
    print(f"{idx + 1}. {token.text}", token.pos_, token.tag_)

1. The DET DT
2. United PROPN NNP
3. States PROPN NNP
4. is AUX VBZ
5. a DET DT
6. country NOUN NN
7. primarily ADV RB
8. located VERB VBN
9. in ADP IN
10. North PROPN NNP
11. America PROPN NNP


**Case Application POS Tags: Find Relevant Verbs.**

Continuing our online ticket system case: since a verb can appear in several forms (for example past, present, or future), we need to filter for the forms we expect. For example:

- I flew to LA.
- I have flown to LA.
- I need to fly to LA.
- I am flying to LA.
- I will fly to LA.

Notice that although all of these sentences would include the “fly to LA” combination if reduced to lemmas, only some of them imply the customer’s intent to book a plane ticket to LA. The first two definitely aren’t suitable.

In [9]:
# According to the tag table, we can select tokens whose tag_ is 'VBG' or 'VB'
# The location is expected to be recognized as PROPN

input_text = "I have flown to LA. Now I am flying to Frisco."
doc = nlp(input_text)
print("Text", "Lemma")
current_idx = 0
for token in doc:
    if token.tag_ == "VBG" or token.tag_ == 'VB':
        print(token.text, '->', token.lemma_)
        current_idx = token.i
    if token.pos_ == "PROPN" and current_idx < token.i and current_idx != 0:
        print(token.text, '->', token.lemma_)

Text Lemma
flying -> fly
Frisco -> San Francisco


NOTE: We need to improve our model by taking the context of the input text into account.

For example “I’m already in the sky, flying to LA.” or “I’m going to fly to LA.” When submitted to the ticket booking NLP application, the application should interpret only one of these sentences as “I need an air ticket to LA.” 

### Syntactic Relations: Dependency Parser

> Dependency parser helps discover syntactic relations between individual tokens in a sentence and connects syntactically related pairs of words with a single arc.


Concept:
- Head and Child ==> Since a dependency describes a syntactic relation between two words, one word is called the head (or parent) and the other the child (or dependent).

  NOTE:
  - Each word in a sentence has exactly one head. Consequently, a word can be a child of only one head.
  - If a token's head is the token itself, it is labeled ROOT.
  - A dependency label is always assigned to the child (in the graphical representation, the arrow always starts at the head and points to the child).

NOTE:
- Every complete sentence should have a **verb** with the **ROOT label** and a subject with the **nsubj label**. The other elements are optional. 
  

Detail info about Dependency: https://v2.spacy.io/api/annotation#dependency-parsing

In [10]:
# NOTES:
# - Use Token.head to access the head token object.
# - Use Token.dep_ to access the token's dependency label.

# Example
text = "I need a plane ticket"

doc = nlp(text)

print("Token", "Dependency", "Head")
for token in doc:
    print(token, token.dep_, token.head)

Token Dependency Head
I nsubj need
need ROOT need
a det ticket
plane compound ticket
ticket dobj need
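The printed rows already contain enough information to pull out a subject-verb-object triple. The sketch below works on the rows copied from the output above rather than on spaCy objects; matching heads by their text is a simplification that only works because every token in this sentence is unique, and the helper name `first_with_label` is hypothetical.

```python
# (token, dependency label, head) rows copied from the output above.
rows = [
    ("I", "nsubj", "need"),
    ("need", "ROOT", "need"),
    ("a", "det", "ticket"),
    ("plane", "compound", "ticket"),
    ("ticket", "dobj", "need"),
]

def first_with_label(rows, label, head=None):
    """Return the first token carrying `label` (optionally under `head`)."""
    for token, dep, token_head in rows:
        if dep == label and (head is None or token_head == head):
            return token
    return None

verb = first_with_label(rows, "ROOT")
subject = first_with_label(rows, "nsubj", head=verb)
direct_object = first_with_label(rows, "dobj", head=verb)
print(subject, verb, direct_object)
```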


In [11]:
# Visualize it
from spacy import displacy
# Style as dependency
displacy.render(doc, style='dep')

**Case Application Syntactic Relations: Find Relevant Verbs.**

Continuing our online ticket system case: since the input text may combine two or more sentences, we need to improve our filter. For example:

"I have flown to LA. Now I am flying to Frisco"

NOTE: In this case, we need a ticket to San Francisco, not LA.

In [12]:
# Concept: From the pattern of the text, the ROOT verb indicates which kind of
#  transportation the customer needs, and the object of the preposition gives the location.

text = "I have flown to LA. Now I am flying to Frisco"
doc = nlp(text)


# Tokenize into sentence level
sentences = list(doc.sents)
for sent in sentences:
    storage = []
    for word in sent:
        if word.dep_ == 'ROOT' or word.dep_ == 'pobj':
            storage.append(word)
    print(storage)

[flown, LA]
[flying, Frisco]


### Named Entity Recognition
> A named entity is a real object that you can refer to by a proper name.

In [13]:
# NOTES:
# - Use Token.ent_type_ to get the entity type of a token (str)
# - Use Token.ent_type to get its integer ID

text = "I have flown to LA. Now I am flying to Frisco"
doc = nlp(text)

print("Token", "Entity")
for token in doc:
    if token.ent_type != 0:
        print(token.text, token.ent_type_)

Token Entity
LA GPE
Frisco PERSON
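Token-level labels give one entity type per token, so multi-word names have to be regrouped. The sketch below shows that grouping on hypothetical (text, label) pairs in plain Python. It merges any run of consecutive tokens with the same label, which is a simplification: two different entities of the same type standing next to each other would be merged. In spaCy itself, `doc.ents` already provides entity spans.

```python
from itertools import groupby

# Hypothetical (text, entity label) pairs; "" means "not part of an entity".
tagged = [("I", ""), ("flew", ""), ("from", ""),
          ("New", "GPE"), ("York", "GPE"), ("to", ""),
          ("San", "GPE"), ("Francisco", "GPE"), (".", "")]

def group_entities(tagged_tokens):
    """Merge runs of consecutive tokens that share a non-empty label."""
    entities = []
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label:
            entities.append((" ".join(text for text, _ in run), label))
    return entities

print(group_entities(tagged))
```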


# 3 - Working with Container Objects and Customizing spaCy

### Container Objects

- The main objects of spaCy:
    1. Container objects: objects that group multiple elements into a single unit. A container can be a collection of objects (like tokens or sentences) or a set of annotations. 
    2. Pipeline components (such as the part-of-speech tagger and the named entity recognizer). 

In [14]:
# Create Doc object

## Explicit way
from spacy.tokens.doc import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["Hi", "there"])
# NOTES:
#  - Vocab object: Storage container that provides vocabulary data, such as lexical types
#      (adjective, verb, noun, and so on).
#  - words argument: list of tokens to add to the Doc object being created.

print("Explicit way: ", doc)

## By default, a spaCy model returns a Doc object when we pass text to the nlp object
doc = nlp("Hi there")
print("Return from Spacy NLP Model: ", doc)

Explicit way:  Hi there 
Return from Spacy NLP Model:  Hi there


In [15]:
# Iterating over token

doc = nlp("I want a green apple")
for token in doc:
    print(f"{token.i}. {token.text}")

0. I
1. want
2. a
3. green
4. apple


In [101]:
# Get children
# NOTE: By default, spaCy uses dependency grammar parsing.
# - Token.lefts ==> Returns a generator over the children positioned to the left of the current token
# - Token.children ==> Returns a generator over all children of the current token

doc = nlp("I want a green apple")
print(f"The children of {doc[4]}: {list(doc[4].lefts)}")
print(f"The children of {doc[4]}: {list(doc[4].children)}")


from spacy import displacy
# Style as dependency
displacy.render(doc, style='dep')

The children of apple: [a, green]
The children of apple: [a, green]


In [17]:
# Get sentence object
# Doc.sents ==> Returns a generator; each element is a Span object

doc = nlp("A severe storm hit the beach. It started to rain.")
for idx, s in enumerate(doc.sents):
    print(idx + 1, s.text)

print("Type element Doc.sents: ", type(s))

# Example: check whether the first word in the second sentence of the text being processed is a pronoun
for idx, s in enumerate(doc.sents):
    if s[0].pos_ == 'PRON' and idx == 1:
        print("The second sentence begins with a pronoun.")

# Example: find out how many sentences in the text end with a verb
# (the last token is the punctuation mark, so we check the second-to-last token)
count = 0
for idx, s in enumerate(doc.sents):
    if s[len(s) - 2].pos_ == 'VERB':
        count += 1
print("Number of sentences end with a verb: ", count)

1 A severe storm hit the beach.
2 It started to rain.
Type element Doc.sents:  <class 'spacy.tokens.span.Span'>
The second sentence begins with a pronoun.
Number of sentences end with a verb:  1


In [18]:
# A noun chunk is a phrase that consists of a noun and its immediate dependents, such as adjectives,
#  determiners, or other words that modify the noun.

# Doc.noun_chunks ==> yields Span objects.

doc = nlp("The quick brown fox jumped over the lazy dog.")

print("Chunk noun: ")
for s in doc.noun_chunks:
    print(s.text)

# Manual code for noun_chunks
store = []
for token in doc:
    if token.pos_ == "NOUN":
        chunk = ""
        for w in token.children:
            if w.pos_ == "DET" or w.pos_ == "ADJ":
                chunk = chunk + w.text + " "
        chunk = chunk + token.text
        store.append(chunk)

print("\nManual way: ")
for s in store:
    print(s)
# NOTE: 
# - Noun chunks can also include other parts of speech, for example adverbs.
# - In these examples, the words that modify a noun (determiners and adjectives) are leftward syntactic
#    children of the noun, so the `w.pos_ == "DET" or w.pos_ == "ADJ"` check can be dropped.
# - The manual code returns strings, but doc.noun_chunks yields Span objects.

Chunk noun: 
The quick brown fox
the lazy dog

Manual way: 
The quick brown fox
the lazy dog


In [19]:
# Modification manual code that follows the previous notes.
store = []
for token in doc:
    if token.pos_ == 'NOUN':
        chunk = ""
        for w in token.lefts:
            chunk = chunk + w.text + " "

        chunk = chunk + token.text
        store.append(chunk)
print(store)

['The quick brown fox', 'the lazy dog']


In [20]:
# Span object: a slice of a Doc object.
#  Examples of Span objects: (1) a Doc slice: doc[start:end], (2) an element of Doc.sents, (3) an element of Doc.noun_chunks.

### Pipeline Object

In [21]:
# Check what pipeline components are available
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'custom_lemma_setter', 'ner']


In [22]:
# Disabling pipeline components (when loading a model that contains them)

nlp = spacy.load("en_core_web_sm", disable=['parser']) # Disable dependency parser
print(nlp.pipe_names)

# Try to get the dependency label of each token (it will be empty because the parser is disabled)
doc = nlp("I want a green apple.")

print("Text", "POS", "Dependency (disable)")
for token in doc:
    print(token.text, token.pos_, token.dep_)

['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'ner']
Text POS Dependency (disable)
I PRON 
want VERB 
a DET 
green ADJ 
apple NOUN 
. PUNCT 


# 4 - Extracting and Using Linguistic Features

### Extracting and Generating Text with Part-of-Speech Tags

In [23]:
# Numeric, Symbolic, and Punctuation Tags
nlp = spacy.load("en_core_web_sm")
doc = nlp("The firm earned $1.5 milion in 2017.")
## Intro: Extract coarse-grained part-of-speech features
print("Text", "Pos", "Explain")
for token in doc:
    print(token.text, token.pos_, spacy.explain(token.pos_))

# NOTE:
# - spacy.explain returns a description for a given linguistic feature.
# - The coarse-grained tagger treats numerals, symbols, and punctuation marks as
#     separate categories. It can also recognize a spelled-out "million" as a numeral,
#     but the misspelled "milion" in this input is tagged as a noun.

print()
## Intro: Extract Fine-grained part-of-speech features
print("Text", "Pos", "Tag", "Explain Tag")
for token in doc:
    print(token.text, token.pos_, token.tag_, spacy.explain(token.tag_))

# NOTE:
# - Fine-grained tagging divides each category into subcategories.
# - The coarse-grained category "SYM" has 3 subcategories: (1) $ for currency,
#     (2) # for the number sign, and (3) SYM for all other symbols (such as +,
#     -, x, =).


Text Pos Explain
The DET determiner
firm NOUN noun
earned VERB verb
$ SYM symbol
1.5 NUM numeral
milion NOUN noun
in ADP adposition
2017 NUM numeral
. PUNCT punctuation

Text Pos Tag Explain Tag
The DET DT determiner
firm NOUN NN noun, singular or mass
earned VERB VBD verb, past tense
$ SYM $ symbol, currency
1.5 NUM CD cardinal number
milion NOUN NN noun, singular or mass
in ADP IN conjunction, subordinating or preposition
2017 NUM CD cardinal number
. PUNCT . punctuation mark, sentence closer


**Study Case: Extracting Descriptions of Money**

Suppose we are interested in phrases that refer to an amount of money and start with a currency symbol. For example, our script should pick out the whole phrase "$1.5 million" from the sentence, not just "1.5".

In [24]:
# Open file
with open("data_spacy/wiki_us.txt", "r") as f:
    text = f.read()

print(text)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [25]:
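# Only the second assignment is used; the first shows a simpler input.
input_text = "The firm earned $1.5 million in 2017"
input_text = "The firm earned $1.5 million in 2017, in comparison with $1.2 million in 2016."

doc = nlp(input_text)
storage = []
for token in doc:
    phrase = ""
    if token.tag_ == '$':
        phrase = token.text
        i = token.i + 1
        # Collect the cardinal numbers that follow the currency symbol;
        # the bounds check avoids an IndexError if the text ends with a number.
        while i < len(doc) and doc[i].tag_ == 'CD':
            phrase += doc[i].text + " "
            i += 1
    phrase = phrase.rstrip()
    if len(phrase) > 0:
        storage.append(phrase)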

print(storage)

['$1.5 million', '$1.2 million']
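As a cross-check, the same phrases can be found with a regular expression. This is a different, pattern-based technique rather than the tag-based approach used in the cell above, and the list of magnitude words in the pattern is an assumption.

```python
import re

text = ("The firm earned $1.5 million in 2017, "
        "in comparison with $1.2 million in 2016.")

# "$" + number + optional magnitude word. The magnitude list is an assumption.
MONEY = re.compile(r"\$\d+(?:[.,]\d+)*(?:\s(?:thousand|million|billion))?")

print(MONEY.findall(text))
```

The regular expression only knows the magnitude words it is given, whereas the tag-based loop accepts any token the tagger labels as a cardinal number.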


**Study Case: Turning Statements into Questions**

Suppose your NLP application must be able to generate a question from a submitted statement. 

In this case the input:
> I can promise it is worth your time.

The output:
> Can you really promise it is worth my time?

The algorithm:
1. Change the order of words in the original sentence from “subject + modal auxiliary verb + infinitive verb” to “modal auxiliary verb + subject + infinitive verb.”
2. Replace the personal pronoun “I” (the sentence’s subject) with “you.”
3. Replace the possessive pronoun “your” with “my.”
4. Place the adverbial modifier “really” before the verb “promise” to emphasize the latter.
5. Replace the punctuation mark “.” with “?” at the end of the sentence.

In [26]:
# Analyze
doc = nlp("I can promise it is worth your time.")
data = {'text': [],
        'pos': [],
        'tag': [],
        'explain': []}

for token in doc:
    data['text'].append(token.text)
    data['pos'].append(token.pos_)
    data['tag'].append(token.tag_)
    data['explain'].append(spacy.explain(token.tag_))

pd.DataFrame.from_dict(data)

Unnamed: 0,text,pos,tag,explain
0,I,PRON,PRP,"pronoun, personal"
1,can,AUX,MD,"verb, modal auxiliary"
2,promise,VERB,VB,"verb, base form"
3,it,PRON,PRP,"pronoun, personal"
4,is,AUX,VBZ,"verb, 3rd person singular present"
5,worth,ADJ,JJ,"adjective (English), other noun-modifier (Chin..."
6,your,PRON,PRP$,"pronoun, possessive"
7,time,NOUN,NN,"noun, singular or mass"
8,.,PUNCT,.,"punctuation mark, sentence closer"


In [27]:
doc = nlp("I can promise it is worth your time.")
sent = ''

# Inversion Auxiliary Verb with Personal Pronoun
for i, token in enumerate(doc):
    if token.tag_ == 'PRP' and doc[i+1].tag_ == 'MD' and doc[i+2].tag_ == 'VB':
        sent = doc[i+1].text.capitalize() + " " + doc[i].text + " " + doc[i+2:].text
        break
doc = nlp(sent)

# Replace Personal Pronouns: I into you
for i, token in enumerate(doc):
    if token.tag_ == 'PRP' and token.text == 'I':
        sent = doc[:i].text + ' you ' + doc[i+1:].text
        break
doc = nlp(sent)

# Replace Possessive Pronouns: your into my
for i, token in enumerate(doc):
    if token.tag_ == 'PRP$' and token.text == 'your':
        sent = doc[:i].text + ' my ' + doc[i+1:].text
        break
doc = nlp(sent)

# Adding "really" in the middle
for i, token in enumerate(doc):
    if token.tag_ == "VB":
        sent = doc[:i].text + ' really ' + doc[i:].text
        break
doc = nlp(sent)

# Adding "?" at the end
sent = doc[:len(doc)-1].text + '?'

print(sent)

Can you really promise it is worth my time?


NOTE:

This script is a good start, but it won’t work with every submitted statement. For example, the statement might contain a personal pronoun other than “I,” but our script doesn’t explicitly check for that. Also, some sentences don’t contain auxiliary verbs, like the sentence “I love eating ice cream.” In those cases, we’d have to use the word “do” to form the question instead of a word like “can” or “should,” like this: “Do you really love eating ice cream?” But if the sentence contains the verb “to be,” as in the sentence “I am sleepy,” we’d have to move that verb to the front, like this: “Are you sleepy?”
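The three cases described in this note (modal verb, "to be", and everything else) can be sketched on whitespace-separated words. This is a toy illustration under strong assumptions: the word lists are hand-written, only statements starting with "I" are handled, tense is ignored, and the "really" insertion and the "your" to "my" swap from the script above are left out.

```python
MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}
# First-person forms of "to be" and their second-person counterparts.
BE_FORMS = {"am": "are", "was": "were"}

def to_question(statement):
    """Turn 'I <verb> ...' statements into yes/no questions (toy rules)."""
    words = statement.rstrip(".").split()
    if len(words) < 2 or words[0] != "I":
        raise ValueError("this sketch only handles statements starting with 'I'")
    verb, rest = words[1], words[2:]
    if verb in MODALS:                      # "I can ..."  -> "Can you ...?"
        parts = [verb.capitalize(), "you"] + rest
    elif verb in BE_FORMS:                  # "I am ..."   -> "Are you ...?"
        parts = [BE_FORMS[verb].capitalize(), "you"] + rest
    else:                                   # "I love ..." -> "Do you love ...?"
        parts = ["Do", "you", verb] + rest
    return " ".join(parts) + "?"

for s in ["I can promise it.", "I am sleepy.", "I love eating ice cream."]:
    print(to_question(s))
```

A fuller version would take the verb category from the tagger (for example the `MD` tag for modals) instead of fixed word lists.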

### Using Syntactic Dependency Labels in Text Processing

In [28]:
# Example: understanding tokens more deeply by looking at their dependency labels
doc = nlp("I know you. You know me.")

data = {
    'text': [],
    'pos': [],
    'tag': [],
    'dependency': [],
    'explain': [],
}
for token in doc:
    data['text'].append(token.text)
    data['pos'].append(token.pos_)
    data['tag'].append(token.tag_)
    data['dependency'].append(token.dep_)
    data['explain'].append(spacy.explain(token.dep_))

pd.DataFrame.from_dict(data)

Unnamed: 0,text,pos,tag,dependency,explain
0,I,PRON,PRP,nsubj,nominal subject
1,know,VERB,VBP,ROOT,root
2,you,PRON,PRP,dobj,direct object
3,.,PUNCT,.,punct,punctuation
4,You,PRON,PRP,nsubj,nominal subject
5,know,VERB,VBP,ROOT,root
6,me,PRON,PRP,dobj,direct object
7,.,PUNCT,.,punct,punctuation


NOTE: In some cases, we need to look at the dependency label of each token to find the token we really need. In this example, the POS tag alone is not enough to extract the "you" that acts as the direct object, because both occurrences of "you" are tagged PRON/PRP.

**Study Case: Deciding What Question a Chatbot Should Ask**

Suppose we want to create a chatbot that decides whether to reply with a yes/no question or an information question. The flow of the program is shown below:

![images](data_spacy/images/chatbot-green-apple.png)

In [29]:
# Analyze
doc = nlp("I want a green apple")

data = {
    'text': [],
    'pos': [],
    'tag': [],
    'dependency': [],
    'explain': [],
}
for token in doc:
    data['text'].append(token.text)
    data['pos'].append(token.pos_)
    data['tag'].append(token.tag_)
    data['dependency'].append(token.dep_)
    data['explain'].append(spacy.explain(token.dep_))

pd.DataFrame.from_dict(data)

Unnamed: 0,text,pos,tag,dependency,explain
0,I,PRON,PRP,nsubj,nominal subject
1,want,VERB,VBP,ROOT,root
2,a,DET,DT,det,determiner
3,green,ADJ,JJ,amod,adjectival modifier
4,apple,NOUN,NN,dobj,direct object


In [33]:
# Find chunk
def find_chunk(doc):
    chunk = ''
    for i, token in enumerate(doc):
        if token.dep_ == 'dobj':
            shift = len([w for w in token.children])
            # Slicing from leftmost child into current token.
            chunk = doc[i-shift:i+1]
            break
    return chunk

def determine_question_type(chunk):
    """
    Determine whether to ask a yes/no question or an information question.

    If the chunk contains an adjectival modifier ("amod"), the
      question is of the information type. Otherwise it is yes/no.
    """
    question_type = 'yesno'
    for token in chunk:
        if token.dep_ == 'amod':
            question_type = 'info'
    return question_type

def generate_question(doc, question_type):
    sent = ''
    for i, token in enumerate(doc):
        if token.tag_ == 'PRP' and doc[i+1].tag_ == 'VBP':
            sent = 'do ' + doc[i].text
            sent = sent + ' ' + doc[i+1:].text
            break
    doc = nlp(sent)

    for i, token in enumerate(doc):
        if token.tag_ == 'PRP' and token.text == 'I':
            sent = doc[:i].text + ' you ' + doc[i+1:].text
            break
    doc = nlp(sent)
    
    if question_type == 'info':
        for i, token in enumerate(doc):
            if token.dep_ == 'dobj':
                sent = 'why ' + doc[:i].text + ' one ' + doc[i+1:].text
                break
    if question_type == 'yesno':
        for i, token in enumerate(doc):
            if token.dep_ == 'dobj':
                sent = doc[:i-1].text + ' a red ' + doc[i:].text
                break

    doc = nlp(sent)
    sent = doc[0].text.capitalize() + ' ' + doc[1:len(doc)].text + '?'

    return sent


In [35]:
# text_input = input()
text_input = "I want a green apple"

doc = nlp(text_input)
if len(doc) == 0:
    # An empty input produces an empty Doc.
    print('You did not submit a sentence!')
else:
    chunk = find_chunk(doc)
    if str(chunk) == '':
        print('The sentence does not contain a direct object.')
    else:
        question_type = determine_question_type(chunk)
        question = generate_question(doc, question_type)
        print(question)

# Try input:
# I want a green apple => Why do you want a green one?
# I want an apple => Do you want a red apple?
# I want... => The sentence does not contain a direct object.
# empty string => You did not submit a sentence!

Why do you want a green one?


# 5 - Working with Word Vectors

- A word vector space can be imagined as a cloud in which the vectors of words with similar meanings are located close together.
- A word vector space uses the distance between vectors to quantify and categorize semantic similarities.

  NOTE: recall that the main factors that influence the position of a word vector are:
  
        - Co-occurrence in similar contexts during model training.
        - Frequency of co-occurrence during model training.

- To compare a single token with an entire sentence, spaCy averages the sentence’s word vectors to generate an entirely new vector.
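The averaging step can be sketched in plain Python. The 2-dimensional vectors below are invented for illustration, and the helper names `mean_vector` and `cosine` are hypothetical; the sketch only mirrors the idea in the last bullet and is not spaCy's actual code.

```python
import math

# Invented 2-dimensional vectors; real models use hundreds of dimensions.
toy_vectors = {
    "citrus": [0.9, 0.1],
    "sales":  [0.1, 0.9],
    "year":   [0.2, 0.8],
    "fruit":  [1.0, 0.0],
}

def mean_vector(words):
    """Average the vectors of `words` component by component."""
    vectors = [toy_vectors[w] for w in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity of two 2-dimensional vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

sentence_vector = mean_vector(["sales", "citrus", "year"])
word_score = cosine(toy_vectors["fruit"], toy_vectors["citrus"])
sentence_score = cosine(toy_vectors["fruit"], sentence_vector)
print(f"fruit vs 'citrus':       {word_score:.3f}")
print(f"fruit vs whole sentence: {sentence_score:.3f}")
```

With these assumed values the sentence scores lower than its most relevant word, because the unrelated words pull the average away from "fruit".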

In [36]:
# Using the Doc.similarity method (also available on Span and Token)
#  similarity(other) ==> returns a float; other can be a Doc, Span, or Token.
nlp = spacy.load("en_core_web_md")
# NOTE: en_core_web_sm does not ship with real word vectors, so its similarity
#        scores are unreliable; we use the medium model, which includes them.

# Example Doc
doc = nlp('I want a green apple')
doc1 = nlp("I want a yellow manggo")
print(doc, "<->", doc1, doc.similarity(doc1))

# Example Span
span = doc[3:]
span1 = doc1[3:]
print(span, "<->", span1, span.similarity(span1))

# Example Token
token = doc[3]
token1 = doc1[3]
print(token, "<->", token1, token.similarity(token1))

# NOTE: 
#   - The similarity method calculates semantic similarity for you, 
#       but for the results to be useful, 
#       you need to choose the right keywords to compare.
#   - We can extract keywords first and then check their similarity to the search phrases.

I want a green apple <-> I want a yellow manggo 0.9698261546195504
green apple <-> yellow manggo 0.6900869011878967
green <-> yellow 0.7846790552139282


**Study Case: Semantic Similarity for Categorization Tasks**


Suppose we want to know whether a sentence is related to "fruit". The input is:


"I want to buy this beautiful book at the end of the week. Sales of citrus have increased over the last year. How much do you know about this type of tree?"



In [37]:
doc = nlp("I want to buy this beautiful book at the end of the week. " 
          "Sales of citrus have increased over the last year. " 
          "How much do you know about this type of tree?")
token = nlp('fruit')[0]

for sent in doc.sents:
    print(sent.text)
    print('similarity to', token.text, 'is', token.similarity(sent), '\n')

I want to buy this beautiful book at the end of the week.
similarity to fruit is 0.26244229078292847 

Sales of citrus have increased over the last year.
similarity to fruit is 0.2754635810852051 

How much do you know about this type of tree?
similarity to fruit is 0.3160061538219452 



NOTE:
- Remember that spaCy builds the sentence vector by averaging the vectors of its words.
    This can lead to a problem: if the text being averaged is long,
    the most important words might have little to no effect on the semantic
    similarity value.
- To get more accurate results, we can extract the important parts of a sentence and compare only those.

**Study Case: Extracting Nouns as a Preprocessing Step**

In [38]:
token = nlp('fruit')[0]
doc = nlp("I want to buy this beautiful book at the end of the week. " 
          "Sales of citrus have increased over the last year. " 
          "How much do you know about this type of tree?")

similarity = {}
for i, sent in enumerate(doc.sents):
    noun_span_list = [sent[j].text for j in range(len(sent)) if sent[j].pos_ == 'NOUN']
    noun_span_str = ' '.join(noun_span_list)
    noun_span_doc = nlp(noun_span_str)
    similarity.update({i: token.similarity(noun_span_doc)})
print(similarity)

{0: 0.15416097214471128, 1: 0.32236731321399387, 2: 0.49644641541472323}


NOTES:
- This time the similarity to the word "fruit" drops for the first sentence (about 0.15 versus 0.26) and rises for the other two, so the sentences are separated more clearly. The overall ranking is unchanged: the first sentence scores lowest and the last one highest.

In [39]:
# Try using the highest level of similarity of each extracted noun.

## Extract noun
words = []
similarity = {}
for i, sent in enumerate(doc.sents):
    noun_span_list = [t.text for t in sent if t.pos_ == 'NOUN']
    words.append((i, noun_span_list))

for i, ws in words:
    max_value = 0
    current = 0

    for w in ws:
        w_token = nlp(w)
        current = token.similarity(w_token)
        print(w_token, "<->", token, current)
        if current > max_value:
            max_value = current

    similarity.update({i: max_value})

print(similarity)
print(words)

book <-> fruit 0.10111276732078336
end <-> fruit 0.12615228452479088
week <-> fruit 0.07047647225810715
Sales <-> fruit 0.041778367824821624
citrus <-> fruit 0.7326242077139979
year <-> fruit 0.07115306346492278
type <-> fruit 0.2448628913815218
tree <-> fruit 0.5153199440186008
{0: 0.12615228452479088, 1: 0.7326242077139979, 2: 0.5153199440186008}
[(0, ['book', 'end', 'week']), (1, ['Sales', 'citrus', 'year']), (2, ['type', 'tree'])]


**Study Case: Extracting and Comparing Named Entities**

The idea is to extract only the words marked as named entities and compare the documents based on those words.

In [40]:
# Get sample data 1
text1 = "Google Search, often referred to as simply Google, is the most" +\
        " used search engine nowadays. It handles a huge number of searches each day."
doc1 = nlp(text1)

# Get sample data 2
text2 = "Microsoft Windows is a family of proprietary operating systems" +\
        " developed and sold by Microsoft. The company also produces a wide range of" +\
        " other software for desktops and servers."
doc2 = nlp(text2)

# Get sample data 3
text3 = "Titicaca is a large, deep, mountain lake in the Andes." +\
        " It is known as the highest navigable lake in the world."
doc3 = nlp(text3)

docs = [doc1, doc2, doc3]
spans = {}

# Extract the named-entity tokens from each document
for j, doc in enumerate(docs):
    named_entity_span = [t.text for t in doc if t.ent_type != 0]
    print(named_entity_span)
    named_entity_span = " ".join(named_entity_span)
    named_entity_span = nlp(named_entity_span)
    spans.update({j: named_entity_span})

# Similarity calculation
print('doc1 is similar to doc2:',spans[0].similarity(spans[1]))
print('doc1 is similar to doc3:',spans[0].similarity(spans[2]))
print('doc2 is similar to doc3:',spans[1].similarity(spans[2]))

['Google', 'Search', 'Google', 'each', 'day']
['Microsoft', 'Windows', 'Microsoft']
['Titicaca', 'Andes']
doc1 is similar to doc2: 0.5329812636743593
doc1 is similar to doc3: 0.12925641941282406
doc2 is similar to doc3: 0.001502677465755575


NOTE:
- This is probably because the words “Google” and “Microsoft” appear together in the same texts of the training corpus more often than they appear alongside the words “Titicaca” and “Andes.”

# 6 - Finding Patterns and Walking Dependency Trees

### Word Sequence Patterns

> By searching for word sequence patterns, we recognize word sequences with similar linguistic features, making it possible to categorize input and handle it properly.

Why is it important?

Since a text is composed of many different sentences, it's impractical to write code that processes each sentence specifically. However, sentences that look completely different might follow the same word sequence pattern.

In [43]:
# Introduction (example - Fine-grained)
# Sentence 1: We can overtake them ==> nsubj + aux + verb + dobj
# Sentence 2: You must specify it ==> nsubj + aux + verb + dobj

doc1 = nlp("We can overtake them.")
doc2 = nlp("You must specify it.")

print("Fine-grained:")
for i in range(len(doc1) - 1):
    if doc1[i].dep_ == doc2[i].dep_:
        print(doc1[i].text, doc2[i].text, doc1[i].dep_, spacy.explain(doc1[i].dep_))

# (example - Coarse-grained)
print("\nCoarse-grained:")
for i in range(len(doc1) - 1):
    if doc1[i].pos_ == doc2[i].pos_:
        print(doc1[i].text, doc2[i].text, doc1[i].pos_, spacy.explain(doc1[i].pos_))

Fine-grained:
We You nsubj nominal subject
can must aux auxiliary
overtake specify ROOT root
them it dobj direct object

Coarse-grained:
We You PRON pronoun
can must AUX auxiliary
overtake specify VERB verb
them it PRON pronoun


**Checking an Utterance for a Pattern (Manual Function)**

Suppose that we are trying to find utterances in user input that express one of the following: ability, possibility, permission, or obligation (as opposed to utterances that describe real actions that have occurred, are occurring, or occur regularly). For instance, we want to find “I can do it.” but not “I’ve done it.”

In this case, we might check whether an utterance satisfies the following pattern: “subject + auxiliary + verb + . . . + direct object . . .”.

NOTE:
The ellipses indicate that the direct object isn’t necessarily located immediately after the verb, making this pattern a little different from the one in the preceding example.

Here is an example illustration:

![images](data_spacy/images/check-utt-pattern.png)

In [46]:
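def dep_pattern(doc):
    # Stop two tokens before the end so that doc[i+1] and doc[i+2] always exist
    for i in range(len(doc) - 2):
        if doc[i].dep_ == 'nsubj' and doc[i+1].dep_ == 'aux' and doc[i+2].dep_ == 'ROOT':
            # The direct object only has to be a child of the verb, not its neighbor
            for token in doc[i+2].children:
                if token.dep_ == 'dobj':
                    return True
    return False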

# text = "I might send them a card as a reminder."
text = "We can overtake them."
doc = nlp(text)
if dep_pattern(doc):
    print('Found')
else:
    print('Not found')

Found


**SpaCy Matcher**

- Matcher is a tool designed specifically to find sequences of tokens based on pattern rules.

- Matcher lets us find a pattern in a text without explicitly iterating over the text’s tokens, hiding the implementation details from us.

- As a result, we obtain the start and end positions of the words that make up each sequence satisfying the specified pattern.

NOTE:

Matcher describes contiguous token sequences. A pattern with a gap, such as “subject + auxiliary + verb + . . . + direct object . . .”, is harder to express this way: wildcard tokens can bridge the gap, but the pattern cannot check that the direct object is a child of the verb.
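To make the idea of "start and end positions" concrete, here is a library-free sketch of what contiguous sequence matching amounts to. It is not spaCy's implementation: `find_sequence` is a hypothetical helper, and the dependency labels are written by hand for illustration rather than produced by a parser.

```python
def find_sequence(labels, pattern):
    """Return (start, end) index pairs where `pattern` occurs contiguously in `labels`."""
    matches = []
    width = len(pattern)
    for start in range(len(labels) - width + 1):
        if labels[start:start + width] == pattern:
            matches.append((start, start + width))
    return matches

# Hand-written labels for "We can overtake them ." (assumed, not parser output)
words = ["We", "can", "overtake", "them", "."]
labels = ["nsubj", "aux", "ROOT", "dobj", "punct"]

for start, end in find_sequence(labels, ["nsubj", "aux", "ROOT"]):
    print("Span:", " ".join(words[start:end]), "| positions:", start, "-", end)
```

As with spaCy's spans, the end index is exclusive, so `words[start:end]` yields exactly the matched words.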

In [52]:
from spacy.matcher import Matcher

# Suppose we want to find pattern: subject + auxiliary + verb

# Generate matcher object
matcher = Matcher(nlp.vocab)
print(type(matcher))

pattern = [[{'DEP': 'nsubj'}, {'DEP': 'aux'}, {'DEP': 'ROOT'}]]
# NOTE: the 'DEP' key matches on a token's dependency label.
matcher.add("NsubjAuxRoot", pattern)
doc = nlp("We can overtake them.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print('Span:', span.text)
    print('The positions in the doc are:', start, '-', end)

<class 'spacy.matcher.matcher.Matcher'>
Span: We can overtake
The positions in the doc are: 0 - 3


**Applying Several Patterns**

We can make our pattern more specific by combining more than one kind of linguistic feature.

In this example, suppose that we want to find the pattern “subject + auxiliary + verb + . . . + direct object . . .” and make sure that both the subject and the direct object of the utterance are personal pronouns.

Here is an example illustration:

![images](data_spacy/images/app-sev-pattern.png)

In [53]:
def pos_pattern(doc):
    for token in doc:
        if token.dep_ == 'nsubj' and token.tag_ != 'PRP':
            return False
        if token.dep_ == 'aux' and token.tag_ != 'MD':
            return False
        if token.dep_ == 'ROOT' and token.tag_ != 'VB':
            return False
        if token.dep_ == 'dobj' and token.tag_ != 'PRP':
            return False
    # Return True only after every token has been checked
    return True


doc = nlp(u'We can overtake them.')
if dep_pattern(doc) and pos_pattern(doc):
    print('Found')
else:
    print('Not found')

Found


**Creating Patterns Based on Customized Features**

Suppose that we want to recognize a pattern that distinguishes pronouns according to number (singular or plural).

How do we recognize it?

> We can look at the direct object pronoun of the sentence (or of the next sentence), then determine whether the pronoun is plural or singular.

Here's an example:

Consider the following discourse,

> "The trucks are traveling slowly. We can overtake them."

    If we can establish that the direct object "them" in the second sentence is a plural pronoun, we will have reason to believe that it refers to the plural noun "trucks" in the first sentence.

    NOTE: We often use this technique to recognize a pronoun's meaning based on the context.

In [54]:
def pron_pattern(doc):
    plural = ['we', 'us', 'they', 'them']
    for token in doc:
        if token.dep_ == 'dobj' and token.tag_ == 'PRP':
            if token.text in plural:
                return 'plural'
            else:
                return 'singular'
    return "not found"


doc = nlp('We can overtake them.')
if dep_pattern(doc) and pos_pattern(doc):
    print('Found:', 'the pronoun in position of direct object is', pron_pattern(doc))
else:
    print('Not Found')


Found: the pronoun in position of direct object is plural


**Study Case: Using Word Sequence Patterns in Chatbots to Generate Statements**

We want a chatbot to understand a user's input and then generate a proper response to it. Suppose we want our chatbot to reply to this input:

Input:
> The symbols are clearly distinguishable. I can recognize them promptly.

Output:
> I can recognize symbols promptly too.

Algorithm:
1. Check the conversational input against the dep_pattern and pos_pattern functions defined previously to find an utterance that follows the “subject + auxiliary + verb + . . . + direct object . . .” and “pronoun + modal auxiliary verb + base form verb + . . . + pronoun . . .” patterns, respectively.
2. Check the utterance found in step 1 against the pron_pattern pattern to determine whether the direct object personal pronoun is plural or singular.
3. Find the noun that gives its meaning to the pronoun by searching for a noun that has the same number as the personal pronoun.
4. Replace the pronoun that acts as the direct object in the sentence located in step 1 with the noun found in step 3.
5. Append the word “too” to the end of the generated utterance.

In [61]:
# Define dep_pattern, pos_pattern, and pron_pattern
def dep_pattern(doc):
    # Stop two tokens before the end so that doc[i+1] and doc[i+2] always exist
    for i in range(len(doc) - 2):
        if doc[i].dep_ == 'nsubj' and doc[i+1].dep_ == 'aux' and doc[i+2].dep_ == 'ROOT':
            for token in doc[i+2].children:
                if token.dep_ == 'dobj':
                    return True
    return False

def pos_pattern(doc):
    for token in doc:
        if token.dep_ == 'nsubj' and token.tag_ != 'PRP':
            return False
        if token.dep_ == 'aux' and token.tag_ != 'MD':
            return False
        if token.dep_ == 'ROOT' and token.tag_ != 'VB':
            return False
        if token.dep_ == 'dobj' and token.tag_ != 'PRP':
            return False
    # Return True only after every token has been checked
    return True

def pron_pattern(doc):
    plural = ['we', 'us', 'they', 'them']
    for token in doc:
        if token.dep_ == 'dobj' and token.tag_ == 'PRP':
            if token.text in plural:
                return 'plural'
            else:
                return 'singular'
    return "not found"


# Define function to find noun
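def find_noun(sents, num):
    """
    sents = list of sentences that precede the utterance
    num = 'plural' or 'singular'
    """
    if num == 'plural':
        taglist = ['NNS', 'NNPS']
    elif num == 'singular':
        taglist = ['NN', 'NNP']
    else:
        # pron_pattern found no pronoun, so there is nothing to look for
        return 'Noun not found'
    # Search the closest preceding sentence first
    for sent in reversed(sents):
        for token in sent:
            if token.tag_ in taglist:
                return token.text
    return 'Noun not found'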

# Define a function that generates a relevant statement from the utterance.
def gen_utterance(doc, noun):
    # Add the article 'the' if the noun does not already start with one
    first_word = noun.split()[0]
    if first_word.lower() not in ['a', 'an', 'the']:
        noun = 'the ' + noun

    for i, token in enumerate(doc):
        if token.dep_ == 'dobj' and token.tag_ == 'PRP':
            # Keep everything except the pronoun and the final punctuation mark
            parts = [doc[:i].text, noun, doc[i+1:len(doc)-1].text, 'too.']
            return ' '.join(part for part in parts if part)
    return 'Failed to generate an utterance'

doc = nlp(u'The symbols are clearly distinguishable. I can recognize them promptly.')
sents = list(doc.sents)
response = ''
noun = ''
for i, sent in enumerate(sents):
    if dep_pattern(sent) and pos_pattern(sent):
        noun = find_noun(sents[:i], pron_pattern(sent))
        if noun != 'Noun not found':
            response = gen_utterance(sents[i], noun)
        break
print(response)

I can recognize the symbols promptly too.


### Extracting Keywords from Syntactic Dependency Trees

> We walk through the dependency tree of the sentence to obtain the necessary pieces of information.

NOTE: Not necessarily from the first token to the last one.

IDEA: If two words are semantically connected, they have a HEAD-CHILD relation, either directly or indirectly.
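A direct or indirect head-child relation can be pictured as a chain of heads. The sketch below walks such a chain over a plain list of head indices. The head assignments are written by hand for illustration (an assumption, not parser output), and `head_chain` is a hypothetical helper name.

```python
def head_chain(heads, index):
    """Return the indices visited when walking from `index` up to the root."""
    chain = []
    while heads[index] != index:  # by convention here, the root is its own head
        index = heads[index]
        chain.append(index)
    return chain

words = ["I", "am", "going", "to", "the", "conference", "in", "Berlin", "."]
# heads[i] is the index of the head of words[i] (hand-assigned for this sketch)
heads = [2, 2, 2, 2, 5, 3, 5, 6, 2]

chain = head_chain(heads, words.index("Berlin"))
print([words[i] for i in chain])
```

In this toy tree "Berlin" is not a direct child of "to", but "to" appears among its ancestors, which is the kind of indirect connection the idea above relies on.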

**Study Case**

Suppose that in the ticket-booking application, a user might submit a sentence like this:
> I need an air ticket to Berlin

We can search for the pattern "to + GPE", where GPE is the named entity label for countries, cities, and states. But what if the user submits one of these instead:

> I am going to the conference in Berlin. I need an air ticket.

> I am going to the conference, which will be held in Berlin. I would like to book an air ticket.

> I want to book a ticket on a direct flight without landing in Berlin.

As you can see, the “to + GPE” pattern wouldn’t find the destination in any of these examples.
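The limitation of a purely adjacent “to + GPE” check can be sketched with hand-labelled tokens. Everything here is illustrative: the entity labels are typed in manually rather than predicted by a model, and `adjacent_to_gpe` is a hypothetical helper.

```python
def adjacent_to_gpe(tokens):
    """tokens: list of (text, entity_label) pairs.

    Return the GPE that directly follows the word 'to', or None if there is none.
    """
    for (text, _), (next_text, next_label) in zip(tokens, tokens[1:]):
        if text.lower() == "to" and next_label == "GPE":
            return next_text
    return None

direct = [("I", ""), ("need", ""), ("an", ""), ("air", ""), ("ticket", ""),
          ("to", ""), ("Berlin", "GPE"), (".", "")]
indirect = [("I", ""), ("am", ""), ("going", ""), ("to", ""), ("the", ""),
            ("conference", ""), ("in", ""), ("Berlin", "GPE"), (".", "")]

print(adjacent_to_gpe(direct))
print(adjacent_to_gpe(indirect))
```

The second call finds nothing because "to" and "Berlin" are separated by other words, which motivates walking the dependency tree instead of relying on adjacency.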

In [64]:
from spacy import displacy

# Analyze the sentences

# sentence 1
doc1 = nlp("I need an air ticket to Berlin.")
displacy.render(doc1, style='dep')

# sentence 2
doc2 = nlp("I am going to the conference in Berlin. I need an air ticket.")
displacy.render(doc2, style='dep')

# sentence 3
doc3 = nlp("I am going to the conference, which will be held in Berlin. I would like to book an air ticket.")
displacy.render(doc3, style='dep')

# sentence 4
doc4 = nlp("I want to book a ticket on a direct flight without landing in Berlin.")
displacy.render(doc4, style='dep')


Notice that if we walk through the dependency tree starting from “to”, moving at each step to the child on the immediate right, we eventually reach “Berlin.” This shows that there’s a semantic connection between “to” and “Berlin” in this sentence.

In [68]:
def det_destination(doc):
    for token in doc:
        if token.ent_type_ == 'GPE':
            # Walk up the heads from the GPE token until we meet "to" or reach the root
            current = token
            while current.head != current:
                current = current.head
                if current.text == 'to':
                    return token.text
    return 'Failed to determine'

print("Sentence 1:", doc1)
print(det_destination(doc1))

print("Sentence 2:", doc2)
print(det_destination(doc2))

print("Sentence 3:", doc3)
print(det_destination(doc3))

print("Sentence 4:", doc4)
print(det_destination(doc4))

Sentence 1: I need an air ticket to Berlin.
Berlin
Sentence 2: I am going to the conference in Berlin. I need an air ticket.
Berlin
Sentence 3: I am going to the conference, which will be held in Berlin. I would like to book an air ticket.
Berlin
Sentence 4: I want to book a ticket on a direct flight without landing in Berlin.
Failed to determine


**Study Case: Condensing a Text Using Dependency Trees**

Suppose that we need to develop a report-processing application that condenses retail reports by extracting only the most important information from them.

As a quick example, consider the following sentence:
> The product sales hit a new record in the first quarter, with 18.6 million units sold.

After processing, it should look like this:
> The product sales hit 18.6 million units sold.


The algorithm:
1. Extract the entire phrase containing the number (it’s 18.6 in this example) by walking the heads of tokens, starting from the token containing the number and moving from left to right.
2. Walk the dependency tree from the main word of the extracted phrase (the one whose head is out of the phrase) to the main verb of the sentence, iterating over the heads and picking them up to be used in a new sentence.
3. Pick up the main verb’s subject, along with its leftward children, which typically include a determiner and possibly some other modifiers.

This is illustrated below:
![images](data_spacy/images/cond-sent.png)

In [87]:
# Analysis

doc = nlp('The product sales hit a new record in the first quarter, with 18.6 million units sold.')

displacy.render(doc, style='dep')

In [100]:
doc = nlp('The product sales hit a new record in the first quarter, with 18.6 million units sold.')

# 1. Extract phrase containing the number
phrase = ''
for token in doc:
    if token.pos_ == 'NUM':
        while True:
            # print(token)
            # print(token.head)
            # print(list(token.head.head.lefts))
            phrase = phrase + ' ' + token.text
            token = token.head
            # if token == token.head:
            #     break
            if token not in list(token.head.lefts):
                # To make sure the head of the next token is to the right of that token, 
                #   we check whether the token is in the list of its head’s left children
                phrase = phrase + ' ' + token.text
                break
        break

print(phrase.strip())

# 2. Walk the dependency tree from the main word of the extracted phrase to the main verb of the sentence
while True:
    token = doc[token.i].head
    if token.pos_ != 'ADP':
        phrase = token.text + phrase
    if token.dep_ == 'ROOT':
        break

print(phrase.strip())

# 3. Pick up the main verb’s subject, along with its leftward children, 
#      which typically include a determiner and possibly some other modifiers.
for t in token.lefts:
    if t.dep_ == 'nsubj':
        phrase = ' '.join([left.text for left in t.lefts]) + ' ' + t.text + ' ' + phrase
        break

print(phrase)

18.6 million units
hit 18.6 million units
The product sales hit 18.6 million units


**Study Case: Condensing a Text Using Dependency Trees II**

Write a script that:
- Extracts only those sentences that contain phrases referring to an amount of money.
- Condenses the selected sentences so they include only the subject, the main verb, the phrase referring to an amount of money, and the tokens you can pick up when walking the heads starting from the main word of the money phrase up to the main verb of the sentence.

Example:
> The company, whose profits reached a record high this year, largely attributed to changes in management, earned a total revenue of $4.26 million.

Output:
> The company earned revenue of $4.26 million.
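The first requirement, keeping only the sentences that mention an amount of money, can be sketched with a regular expression. Note that this swaps in a regex for illustration and is not a spaCy-based technique. The pattern is an assumption that covers only amounts written like "$4.26 million" or "$300", and `sentences_with_money` is a hypothetical helper.

```python
import re

# Assumed money format: a dollar sign, a number, and an optional magnitude word
MONEY = re.compile(r"\$\d+(?:\.\d+)?(?:\s(?:thousand|million|billion))?")

def sentences_with_money(sentences):
    """Keep only the sentences that contain something that looks like a money amount."""
    return [s for s in sentences if MONEY.search(s)]

report = [
    "The company earned a total revenue of $4.26 million.",
    "The product sales hit a new record in the first quarter.",
]

selected = sentences_with_money(report)
print(selected)
```

A real report would likely need more formats (other currencies, thousands separators), so treat the pattern as a starting point only.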

In [102]:
# Analysis
doc = nlp("The company, whose profits reached a record high this year, largely attributed to changes in management, earned a total revenue of $4.26 million.")

displacy.render(doc, style='dep')

In [198]:
# Extract number
doc = nlp("The company, whose profits reached a record high this year, largely attributed to changes in management, earned a total revenue of $4.26 million.")

phrase = ""
for token in doc:
    if token.tag_ == '$':
        phrase = token.text
        i = token.i + 1
        # Collect the cardinal numbers that follow the currency symbol
        while i < len(doc) and doc[i].tag_ == 'CD':
            phrase += doc[i].text + " "
            # Remember the last token of the money phrase
            token = doc[i]
            i += 1
        phrase = phrase.rstrip()
        break

print(phrase)

# Walk to the root
while True:
    token = token.head
    if token.dep_ == 'ROOT':
        phrase = token.text + ' ' + phrase
        break
    phrase = token.text + ' ' + phrase

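# Get the subject of the main verb, together with its leftward children
for t in token.lefts:
    if t.dep_ == 'nsubj':
        # Join the left children in their original order, then the subject itself
        left_words = [c.text for c in t.lefts]
        phrase = ' '.join(left_words + [t.text, phrase])
        break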
        

print(phrase)

$4.26 million
The company earned revenue of $4.26 million


### Using Context to Improve the Ticket-Booking Chatbot

Problem: There is no general code that solves all intelligent text-processing tasks. One way to make these scripts more useful is to take context into account to determine an appropriate response.

Suppose that we want to increase the functionality of the ticket-booking script so it can handle a wider set of user input, including utterances that don’t contain a “to + GPE” pair in any combination.

Let's look at a concrete situation. Consider the following utterance:
> I am attending the conference in Berlin.

Here, the user has expressed an intention to go to Berlin without “to.” Only the GPE entity “Berlin” is in the sentence. In such cases, it would be reasonable for a chatbot to ask a confirmatory question, such as the following:
> You want a ticket to Berlin, right?

The improved ticket-booking chatbot should produce different outputs based on three different situations:
- The user expresses a clear intention to book a ticket to a certain destination.
- It’s not immediately clear whether the user wants a ticket to the destination mentioned.
- The user doesn’t mention any destination.


Here is an illustration:

![images](data_spacy/images/improve-ticket-booking.png)

In [200]:
# Extracting destination from pattern "to" + GPE
def det_destination(doc):
    for token in doc:
        if token.ent_type_ == 'GPE':
            # Walk up the heads from the GPE token until we meet "to" or reach the root
            current = token
            while current.head != current:
                current = current.head
                if current.text == 'to':
                    return token.text
    return 'Failed to determine'

# Guessing destination function
def guess_destination(doc):
    for token in doc:
        if token.ent_type != 0 and token.ent_type_ == 'GPE':
            return token.text
    return 'Failed to determine'

# Generate response function
def gen_response(doc):
    # Try to extract "to" + GPE pattern
    dest = det_destination(doc)
    if dest != 'Failed to determine':
        return 'When do you need to be in ' + dest + '?'
    # Try to guess the destination by extracting any GPE
    dest = guess_destination(doc)
    if dest != 'Failed to determine':
        return 'You want a ticket to ' + dest +', right?'
    # Fail to extract any GPE
    return 'Are you flying somewhere?'

# Testing
doc = nlp(u'I am going to the conference in Berlin.') 
print(gen_response(doc))

When do you need to be in Berlin?
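The test above exercises only the first of the three situations. The branching itself can be sketched independently of spaCy. In the block below `choose_response` is a hypothetical helper, and using `None` for "nothing found" is an assumption made for this sketch (the notebook code uses the string 'Failed to determine' instead).

```python
def choose_response(confirmed_dest, guessed_dest):
    """Pick a reply from the two extraction results; None means 'nothing found'."""
    if confirmed_dest is not None:
        # Situation 1: a clear "to + GPE" intention
        return 'When do you need to be in ' + confirmed_dest + '?'
    if guessed_dest is not None:
        # Situation 2: a GPE is mentioned, but the intention is unclear
        return 'You want a ticket to ' + guessed_dest + ', right?'
    # Situation 3: no destination mentioned at all
    return 'Are you flying somewhere?'

print(choose_response('Berlin', 'Berlin'))
print(choose_response(None, 'Berlin'))
print(choose_response(None, None))
```

Returning `None` instead of a sentinel string keeps a failed lookup from being mistaken for a real place name, at the cost of an explicit `is not None` check.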


### Making a Smarter Chatbot by Finding Proper Modifier

> If a chatbot can recognize modifiers, it can use a suitable modifier in its response to the user.

    For example,

    If the user says:
        > “I’d like to read a book,”

    The response:
        > “Would you like a fiction book?”

- A modifier is an optional element in a phrase or a clause used to change the meaning of a noun.
- Removing a modifier does not typically change the basic meaning of the sentence, but it does make it less specific.
- There are two types of modifier:
    1. Pre-modifier: a word or phrase that **comes before the noun it modifies**, usually adding descriptive detail. Pre-modifiers are typically adjectives, adverbs, or noun phrases.
       
           Example:
               - Adjective as pre-modifier:
                   The red car sped down the street. (Here, "red" is a pre-modifier describing "car".)
       
               - Noun as pre-modifier:
                   The kitchen table is clean. (Here, "kitchen" is a noun acting as a pre-modifier for "table".)

       
    2. Post-modifier: a word, phrase, or clause that **comes after the noun it modifies**, providing additional detail or clarification.

           Example:
               - Prepositional phrase as post-modifier:
                   The car with the red paint sped down the street. (Here, "with the red paint" is a post-modifier providing more detail about "car".)
       
               - Relative clause as post-modifier:
                   The car that I bought yesterday is fast. (Here, "that I bought yesterday" is a relative clause modifying "car".)
       
               - Participle phrase as post-modifier:
                    The car parked outside is mine. (Here, "parked outside" is a participle phrase acting as a post-modifier of "car".)

- Consider the following example:
> That exotic fruit from Africa.

    The dependency visualization:
    
    ![images](data_spacy/images/ex-modifier.png)
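In a dependency tree, pre-modifiers show up as children to the left of the noun and post-modifiers as children to its right. A minimal sketch of that split over a plain list of head indices follows. The heads are assigned by hand for this phrase (an assumption, not parser output), and `left_children` / `right_children` are hypothetical helpers.

```python
def left_children(heads, index):
    """Indices of tokens whose head is `index` and that precede it."""
    return [i for i, head in enumerate(heads) if head == index and i < index]

def right_children(heads, index):
    """Indices of tokens whose head is `index` and that follow it."""
    return [i for i, head in enumerate(heads) if head == index and i > index]

words = ["That", "exotic", "fruit", "from", "Africa"]
# heads[i] is the index of the head of words[i]; "fruit" is the root here
heads = [2, 2, 2, 2, 3]

noun = words.index("fruit")
print("pre-modifiers:", [words[i] for i in left_children(heads, noun)])
print("post-modifiers:", [words[i] for i in right_children(heads, noun)])
```

"Africa" is not listed as a direct child of "fruit" because it hangs off "from", so reaching it takes one more step down the tree.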

**Study Case**

Suppose you want to determine possible adjectival modifiers for the word “fruit.” (Adjectival modifiers are always premodifiers.) Also, you want to look at what GPE entities you can find in the postmodifiers of this same word. This information could later help you generate an utterance during a conversation on fruits.

In [201]:
doc = nlp("Kiwono has jelly-like flesh with a refreshingly fruity taste. This is a nice exotic fruit from Africa. It is definitely worth trying.")

fruit_adjectives = []
fruit_origins = []
for token in doc:
    if token.text == 'fruit':
        # Extract adjective from premodifiers
        fruit_adjectives = fruit_adjectives + [modifier.text for modifier in token.lefts if modifier.pos_ == 'ADJ']
        # Extract location from postmodifiers
        fruit_origins = fruit_origins + [doc[modifier.i + 1].text for modifier 
                                         in token.rights if modifier.text == 'from' and doc[modifier.i + 1].ent_type != 0]

print('The list of adjectival modifiers for word fruit:', fruit_adjectives)
print('The list of GPE names applicable to word fruit as postmodifiers:', 
      fruit_origins)

The list of adjectival modifiers for word fruit: ['nice', 'exotic']
The list of GPE names applicable to word fruit as postmodifiers: ['Africa']


# 7 - Visualization

**Visualizing Dependency Parsing**

In [None]:
from spacy import displacy

doc = nlp(u"I want a Greek pizza.")
displacy.render(doc, style='dep')

**Visualizing Named Entity Recognizer**

In [None]:
from spacy import displacy

doc = nlp(u"I want a Greek pizza.")
displacy.render(doc, style='ent')

# 8 - Intent Recognition