## Introduction to Computational Linguistics and Dependency Trees in data science

https://www.analyticsvidhya.com/blog/2017/12/introduction-computational-linguistics-dependency-trees/?utm_source=blog&utm_medium=introduction-information-extraction-python-spacy

For example, a conversation system trained using a recurrent neural network produces the following results in two scenarios:

User: Hi, I took a horrible picture in a museum, can you tell where is it located?
Bot Reply 1: The museum is located at a horrible place

User: Hi, I took a horrible picture in a museum, can you tell where is it located?
Bot Reply 2: The horrible museum is located at this place

The two responses contain virtually the same tokens but differ in structure, which completely changes the meaning. In this article, I will discuss the interdisciplinary field of Computational Linguistics, which deals with the structural aspects of text that are used to solve common text-related problems. Some examples are named entity extraction, coreference resolution, and machine translation.

Computational linguistics

Computational Linguistics often overlaps with the field of Natural Language Processing, as most tasks are common to both fields. While Natural Language Processing focuses on the tokens and tags and uses them as predictors in machine learning models, Computational Linguistics digs deeper into the relationships and links among them.

Structural aspects of the text refer to the organization of tokens in a sentence and how the contexts among them are interrelated. This organization is often depicted by word-to-word grammar relationships, which are also known as dependencies. Dependency is the notion that syntactic units (words) are connected to each other by directed links that describe the relationship between the connected words.

These dependencies map directly onto a directed graph representation, in which the words in the sentence are nodes and the grammatical relations are edge labels. This directed graph representation is also called the dependency tree. For example, the dependency tree of the sentence “AnalyticsVidhya is the largest community of data scientists and provides best resources for understanding data and analytics” is shown in the figure below:

Another way to represent this tree is the following:

-> community-NN (root)
    -> AnalyticsVidhya-NNP (nsubj)
    -> is-VBZ (cop)
    -> the-DT (det)
    -> largest-JJS (amod)
    -> scientists-NNS (pobj)
        -> of-IN (prep)
        -> data-NNS (case)
    -> and-CC (cc)
    -> provides-VBZ (conj)
        -> resources-NNS (dobj)
            -> best-JJS (amod)
            -> understanding-VBG (pcomp)
                -> for-IN (mark)
                -> data-NNS (dobj)
                    -> and-CC (cc)
                    -> analytics-NNS (conj)

    In this representation of the sentence, each term follows the pattern “-> Element_A-Element_B (Element_C)”.
    Element_A is the word, Element_B is the part-of-speech tag of the word, Element_C is the grammar relation between the word and its parent node, and the indentation before the symbol “->” gives the level of the word in the tree. Here is the reference list to understand the dependency relations.
    The tree shows that the term “community” is the structural centre of the sentence and is the root, linked to 7 nodes (“AnalyticsVidhya”, “is”, “the”, “largest”, “and”, “scientists”, “provides”).
    Of these 7 connected nodes, the terms “scientists” and “provides” are the root nodes of two other subtrees.
    Each subtree is itself a dependency tree, with relations such as (“provides” <-> “resources” <by> “dobj” relation) and (“resources” <-> “best” <by> “amod” relation).
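The graph view described above can be sketched in plain Python. This is a minimal illustration, not the output of any parser: it lists only the links that the text spells out explicitly, and the helper names (`build_children`, `find_root`, `subtree`) are hypothetical. Words are used as node keys for brevity; a real parse would key nodes by token position, since a word can occur more than once in a sentence.

```python
from collections import defaultdict

# (head, dependent, relation) edges taken from the example tree in the text.
# Only the links spelled out in the prose are listed here.
EDGES = [
    ("community", "AnalyticsVidhya", "nsubj"),
    ("community", "is", "cop"),
    ("community", "the", "det"),
    ("community", "largest", "amod"),
    ("community", "scientists", "pobj"),
    ("community", "and", "cc"),
    ("community", "provides", "conj"),
    ("provides", "resources", "dobj"),
    ("resources", "best", "amod"),
]


def build_children(edges):
    """Map each head word to its (dependent, relation) pairs, in input order."""
    children = defaultdict(list)
    for head, dependent, relation in edges:
        children[head].append((dependent, relation))
    return children


def find_root(edges):
    """The root is the only word that never appears as a dependent."""
    heads = {head for head, _, _ in edges}
    dependents = {dependent for _, dependent, _ in edges}
    (root,) = heads - dependents
    return root


def subtree(children, word):
    """Return the word followed by all of its descendants, depth first."""
    words = [word]
    for dependent, _ in children.get(word, []):
        words.extend(subtree(children, dependent))
    return words


children = build_children(EDGES)
root = find_root(EDGES)
print(root, len(children[root]))      # community 7
print(subtree(children, "provides"))  # ['provides', 'resources', 'best']
```

Storing the tree as head-to-dependent edges makes both observations from the text directly checkable: the root has 7 dependents, and “provides” heads its own subtree.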

These trees can be generated in Python using libraries such as NLTK, spaCy, or Stanford CoreNLP, and can be used to obtain subject-verb-object triplets, noun and verb phrases, grammar dependency relationships, and part-of-speech tags.
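As a rough sketch of how subject-verb-object triplets fall out of a dependency parse, the snippet below works on a hand-annotated parse rather than real parser output. The sentence, its labels, and the function name `svo_triplets` are assumptions made for illustration; the labels follow the `nsubj`/`dobj` naming used elsewhere in this article, and the root token points to itself as its own head.

```python
# Hand-annotated parse of "The team won the match" (an illustrative assumption,
# not parser output): (text, dependency label, index of the head token).
PARSE = [
    ("The", "det", 1),
    ("team", "nsubj", 2),
    ("won", "ROOT", 2),
    ("the", "det", 4),
    ("match", "dobj", 2),
]


def svo_triplets(parse):
    """Pair every nsubj with each dobj that shares the same head verb."""
    triplets = []
    for subj_text, subj_dep, verb_idx in parse:
        if subj_dep != "nsubj":
            continue
        verb_text = parse[verb_idx][0]
        for obj_text, obj_dep, obj_head in parse:
            if obj_dep == "dobj" and obj_head == verb_idx:
                triplets.append((subj_text, verb_text, obj_text))
    return triplets


print(svo_triplets(PARSE))  # [('team', 'won', 'match')]
```

This simple pairing ignores passives, conjunctions, and clausal objects, so it should be read as a starting point rather than a complete extractor.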

Applications of Dependency Trees

Named Entity Recognition

Coreference Resolution or Anaphora Resolution

Question Answering

Machine Translation

Text Summarization

Natural Language Generation

Natural Language Understanding

Speech to Text


Dependency Trees using spaCy

In [4]:
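import spacy

# Load the small English pipeline (it must be downloaded once beforehand)
nlp = spacy.load("en_core_web_sm")
doc = nlp("The team is not performing well in the match")

# Print each token with its lemma, coarse part-of-speech tag, and dependency label
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)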

The the DET det
team team NOUN nsubj
is be AUX aux
not not PART neg
performing perform VERB ROOT
well well ADV advmod
in in ADP prep
the the DET det
match match NOUN pobj
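The flat listing above names each token's relation but not its parent. The sketch below shows one way the indented “-> word-TAG (relation)” view could be produced by walking a tree recursively. To stay runnable without a model download, it uses a small stand-in class, `Node` (hypothetical), that exposes the same attribute names as a spaCy token (`text`, `pos_`, `dep_`, `children`). The labels are copied from the output above, while the head assignments are made by hand and are an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in exposing the same attribute names as a spaCy Token."""
    text: str
    pos_: str
    dep_: str
    children: list = field(default_factory=list)


def render(token, level=0, indent="    "):
    """Return the '-> word-POS (relation)' lines for a token and its subtree."""
    lines = [f"{indent * level}-> {token.text}-{token.pos_} ({token.dep_})"]
    for child in token.children:
        lines.extend(render(child, level + 1, indent))
    return lines


# Heads are assigned by hand here (an assumption); labels match the output above.
match = Node("match", "NOUN", "pobj", [Node("the", "DET", "det")])
root = Node("performing", "VERB", "ROOT", [
    Node("team", "NOUN", "nsubj", [Node("The", "DET", "det")]),
    Node("is", "AUX", "aux"),
    Node("not", "PART", "neg"),
    Node("well", "ADV", "advmod"),
    Node("in", "ADP", "prep", [match]),
])

print("\n".join(render(root)))
```

Because `render` only reads `text`, `pos_`, `dep_`, and `children`, it should also accept a real spaCy token, for example the root found with `[t for t in doc if t.dep_ == "ROOT"][0]`.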


In [None]:
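from spacy import displacy
# Starts a local web server that draws the dependency tree and blocks until stopped;
# inside a notebook, displacy.render(doc, style="dep") draws it inline instead
displacy.serve(doc, style="dep")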