<br>
<u>Notebook Four</u> | 
<a href=https://github.com/andrealeone/NLP target=_blank>Repository</a>
<br><br>
<b>Linguistic features</b><br><br>
Andrea Leone<br>
University of Trento<br>
February 2022
<br><br>

In [1]:
import project 

import spacy
import numpy  as np
import pandas as pd
import pickle
import os

from tqdm.notebook import tqdm

In [2]:
records = project.sql_query(""" 
    SELECT * FROM talks
    WHERE transcript IS NOT NULL
    ORDER BY slug ASC;
""")

df = project.create_dataframe_from(records) 

<br>

Load the "classic" English `nlp` [pre-trained language processing pipeline](https://spacy.io/models/en#en_core_web_lg), optimised for CPU. As in the first notebook, the pipeline ingests a raw string, i.e. the talk transcript, and returns a spaCy `Doc` object that holds the results of the individual processing steps.
<br><img src="https://spacy.io/pipeline-design-46d249f6f048cda4c8a8f8147d332bb5.svg" alt="SpaCy NLP pipeline" width="75%"><br>

<b>Features</b>
 * <b>Tokenisation</b>: Segmenting text into words, punctuation marks etc.
 * <b>Part-of-speech (POS) Tagging</b>: Assigning word types to tokens, like verb or noun.
 * <b>Dependency Parsing</b>: Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
 * <b>Lemmatization</b>: Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “mice” is “mouse”.
 * <b>Sentence Boundary Detection (SBD)</b>: Finding and segmenting individual sentences.
 * <b>Named Entity Recognition (NER)</b>: Labelling named “real-world” objects, like persons, companies or locations.
 * <b>Entity Linking (EL)</b>: Disambiguating textual entities to unique identifiers in a knowledge base.
 * <b>Similarity</b>: Comparing words, text spans and documents and how similar they are to each other.
 * <b>Text Classification</b>: Assigning categories or labels to a whole document, or parts of a document.
 * <b>Rule-based Matching</b>: Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.
 * <b>Training</b>: Updating and improving a statistical model’s predictions.
 * <b>Serialization</b>: Saving objects to files or byte strings.
<br>

<b>Statistical models</b>
While some of spaCy’s features work independently, others require trained pipelines to be loaded, enabling spaCy to predict linguistic annotations. A trained pipeline can consist of multiple components that use a statistical model trained on labeled data, as is the case with `en_core_web_lg`. Such a pipeline package typically includes the following:
 * <b>Binary weights</b> for the part-of-speech tagger, dependency parser and named entity recognizer to predict those annotations in context.
 * <b>Lexical entries</b> in the vocabulary, i.e. words and their context-independent attributes like the shape or spelling.
 * <b>Data files</b> like lemmatization rules and lookup tables.
 * <b>Word vectors</b>, i.e. multi-dimensional meaning representations of words that let you determine how similar they are to each other.
 * <b>Configuration options</b>, like the language and processing pipeline settings and model implementations to use, to put spaCy in the correct state when you load the pipeline.

<br>
Load the pipeline.

In [3]:
nlp = spacy.load("en_core_web_lg") 

<br>
Process all transcripts.

In [4]:
docs_pkl = "data/docs.v4.pkl" 

if not os.path.exists( docs_pkl ): 

    docs = list() 

    for _,record in tqdm( list(df.iterrows()) ):
        docs.append( nlp( record["transcript"] ) )

    with open( docs_pkl, "wb" ) as file: 
        pickle.dump(docs, file)

else: 

    with open( docs_pkl, "rb" ) as file: 
        docs = pickle.load(file)

  0%|          | 0/4828 [00:00<?, ?it/s]
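
For a corpus of this size, spaCy's `nlp.pipe` processes texts in batches, which is usually faster than calling `nlp` once per transcript, and `DocBin` is spaCy's own container for serialising collections of `Doc` objects. The sketch below is only an illustration, not what this notebook ran: it assumes a blank English pipeline and three made-up strings, so that it runs without downloading `en_core_web_lg`.

```python
import spacy
from spacy.tokens import DocBin

# Assumption: a blank English pipeline (tokenizer only) stands in for
# en_core_web_lg so the sketch runs without any model download.
blank_nlp = spacy.blank("en")

# Hypothetical stand-ins for the talk transcripts.
sample_texts = [
    "Everyone around you has a story.",
    "We simply take the time to listen.",
    "Words can carry poetry and wisdom.",
]

# nlp.pipe streams the texts through the pipeline in batches.
sample_docs = list(blank_nlp.pipe(sample_texts, batch_size=2))

# DocBin packs many Doc objects into a single bytes payload...
payload = DocBin(docs=sample_docs).to_bytes()

# ...and restores them against a vocabulary.
restored = list(DocBin().from_bytes(payload).get_docs(blank_nlp.vocab))
print([d.text for d in restored])
```

With the full pipeline, the same pattern would apply to `df["transcript"]`; a suitable batch size depends on the available memory.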

<br>

Try the pipeline on an excerpt of [Dave Isay's talk](https://www.ted.com/talks/dave_isay_everyone_around_you_has_a_story_the_world_needs_to_hear).

In [5]:
doc = nlp(""" 
    I have learned about the poetry and the wisdom and the grace 
    that can be found in the words of people all around us 
    when we simply take the time to listen.
""")

<br>

<b>Tokenisation</b>

The first step in the process is tokenising the text: spaCy segments it into words, punctuation and so on. This is done by applying rules specific to each language. First, the raw text is split on whitespace characters, similar to `text.split(' ')`. Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:
 1. <b>Does the substring match a tokenizer exception rule?</b> For example, “don’t” does not contain whitespace, but should be split into two tokens, “do” and “n’t”, while “U.K.” should always remain one token.
 2. <b>Can a prefix, suffix or infix be split off?</b> For example punctuation like commas, periods, hyphens or quotes.

If there is a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.<br>
<img src="https://spacy.io/tokenization-9b27c0f6fe98dcb26239eba4d3ba1f3d.svg" alt="SpaCy Tokenization" width="50%">
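
The loop described above can be sketched in a few lines of plain Python. This is a toy illustration, not spaCy's implementation: the exception table and the prefix/suffix patterns are hypothetical and tiny, and infix splitting is left out.

```python
import re

# Hypothetical, heavily simplified rule tables (spaCy's real ones are far larger).
EXCEPTIONS = {"don't": ["do", "n't"], "U.K.": ["U.K."]}
PREFIX = re.compile(r"""^[(\["']""")
SUFFIX = re.compile(r"""[)\]"'.,!?]$""")

def toy_tokenize(text):
    tokens = []
    # Step 0: split the raw text on whitespace.
    for substring in text.split():
        suffixes = []
        while substring:
            if substring in EXCEPTIONS:
                # Check 1: a tokenizer exception rule matches.
                tokens.extend(EXCEPTIONS[substring])
                substring = ""
            elif PREFIX.search(substring):
                # Check 2a: split off one prefix character and loop again.
                tokens.append(substring[0])
                substring = substring[1:]
            elif SUFFIX.search(substring):
                # Check 2b: split off one suffix character and loop again.
                suffixes.insert(0, substring[-1])
                substring = substring[:-1]
            else:
                # Nothing left to split: keep the remainder as one token.
                tokens.append(substring)
                substring = ""
        tokens.extend(suffixes)
    return tokens

toy_tokens = toy_tokenize('"We don\'t live in the U.K.!"')
print(toy_tokens)
```

Because the exception check runs again after every split, "U.K." is recognised once the trailing quote and exclamation mark have been peeled off.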

<br>

<b>Linguistic annotations</b>

spaCy provides a variety of linguistic annotations to give insights into a text’s grammatical structure. This includes the word types, like the parts of speech, and how the words are related to each other. For example, if we are analyzing text, it makes a huge difference whether a noun is the subject of a sentence, or the object – or whether “google” is used as a verb, or refers to the website or company in a specific context.

Extract and show the annotations for each token in the sample.

In [6]:
ladf = pd.DataFrame([ 
    [
        token.text, token.pos_, token.tag_, token.dep_,
        token.shape_, token.lemma_, token.is_stop
    ] for token in doc if token.pos_ not in ["SPACE", "PUNCT"]
], columns = [
    "", "Part-of-Speech", "Tag", "Dependency", "Shape", "Lemma", "Stop"
])

ladf = ladf.set_index("")
# Repeated tokens (e.g. "the") are kept so the table follows the sentence order.
print(ladf)

        Part-of-Speech  Tag Dependency Shape   Lemma   Stop
                                                           
I                 PRON  PRP      nsubj     X       I   True
have               AUX  VBP        aux  xxxx    have   True
learned           VERB  VBN       ROOT  xxxx   learn  False
about              ADP   IN       prep  xxxx   about   True
the                DET   DT        det   xxx     the   True
poetry            NOUN   NN       pobj  xxxx  poetry  False
and              CCONJ   CC         cc   xxx     and   True
the                DET   DT        det   xxx     the   True
wisdom            NOUN   NN       conj  xxxx  wisdom  False
and              CCONJ   CC         cc   xxx     and   True
the                DET   DT        det   xxx     the   True
grace             NOUN   NN       conj  xxxx   grace  False
that              PRON  WDT  nsubjpass  xxxx    that   True
can                AUX   MD        aux   xxx     can   True
be                 AUX   VB    auxpass  
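
The abbreviations in the Part-of-Speech, Tag and Dependency columns can be looked up with `spacy.explain`, which returns a short description for labels found in spaCy's glossary. A small sketch; the selection of labels is arbitrary:

```python
import spacy

# A few labels taken from the table above.
labels = ["PRON", "VBN", "nsubjpass", "pobj"]

# spacy.explain returns a description string, or None for unknown labels.
glossary = {label: spacy.explain(label) for label in labels}

for label, description in glossary.items():
    print(f"{label:<10} {description}")
```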

<br>

<b>Morphological Features</b>

Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part-of-speech.

In [7]:
mfdf = pd.DataFrame([ 
    [
        token.text, str(token.morph).replace("|", ", ")
    ] for token in doc
    if token.pos_ not in ["SPACE", "PUNCT"] and token.morph.to_dict()
], columns = [ "", "Morphological features" ])

mfdf = mfdf.set_index("")
print(mfdf)

                                Morphological features
                                                      
I        Case=Nom, Number=Sing, Person=1, PronType=Prs
have                Mood=Ind, Tense=Pres, VerbForm=Fin
learned         Aspect=Perf, Tense=Past, VerbForm=Part
the                         Definite=Def, PronType=Art
poetry                                     Number=Sing
and                                       ConjType=Cmp
the                         Definite=Def, PronType=Art
wisdom                                     Number=Sing
and                                       ConjType=Cmp
the                         Definite=Def, PronType=Art
grace                                      Number=Sing
that                                      PronType=Rel
can                                       VerbForm=Fin
be                                        VerbForm=Inf
found           Aspect=Perf, Tense=Past, VerbForm=Part
the                         Definite=Def, PronType=Art
words     
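
Individual features can also be queried directly instead of being printed as a string. The sketch below assumes a hand-built `Doc` whose morphological annotations are copied from the table above, so it runs without the trained pipeline; with `en_core_web_lg` loaded, the same calls work on any token of `doc`.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Assumption: annotations are set by hand rather than predicted by a model.
morph_doc = Doc(
    Vocab(),
    words=["I", "learned"],
    morphs=[
        "Case=Nom|Number=Sing|Person=1|PronType=Prs",
        "Aspect=Perf|Tense=Past|VerbForm=Part",
    ],
)

learned = morph_doc[1]

# get() returns the list of values for one feature (empty if it is absent).
tense = list(learned.morph.get("Tense"))

# to_dict() returns all features of the token as a plain dictionary.
features = learned.morph.to_dict()

print(tense, features)
```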

<br>

<b>Linguistic dependencies</b>

Visualizing a dependency parse or named entities in a text is not only a fun NLP demo; it can also be very helpful for speeding up development and for debugging your code and training process. With `jupyter=True`, `displacy.render` displays the visualisation inline in the notebook; it is the separate `displacy.serve` function that spins up a simple web server.

In [8]:
spacy.displacy.render( 
    doc[1:-2], style="dep", jupyter=True,
    options={"bg":"transparent", "arrow_width": 5, "arrow_spacing": 30}
) 
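
Outside a notebook, `displacy.render` can return the markup as a string, which can then be written to an `.svg` file. A minimal sketch, assuming a hand-annotated three-token `Doc` in place of the parser output so that no trained pipeline is needed:

```python
from spacy import displacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Assumption: heads and dependency labels are set by hand for illustration.
manual_doc = Doc(
    Vocab(),
    words=["we", "simply", "listen"],
    pos=["PRON", "ADV", "VERB"],
    heads=[2, 2, 2],                    # absolute index of each token's head
    deps=["nsubj", "advmod", "ROOT"],   # the root points to itself
)

# jupyter=False makes render() return the SVG markup instead of displaying it.
svg = displacy.render(manual_doc, style="dep", jupyter=False)
print(svg[:60])
```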

<br><br>

<b>Vectors and Semantic Similarity</b>

Similarity is determined by comparing word vectors or “word embeddings”, multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec.

In [9]:
def compute_similarity(w1, w2):
    return nlp(w1).similarity( nlp(w2) )

<br>

Get similarity scores for a few word pairs.

In [10]:
compute_similarity ( "creativity", "innovation" ) 

0.6643053677545637

In [11]:
compute_similarity ( "creativity", "kindness" ) 

0.4634827277856967

In [12]:
compute_similarity ( "creativity", "intelligence" ) 

0.4278424226247513

In [13]:
compute_similarity ( "fairness", "justice" ) 

0.5730289299368784

<br/>

Under the hood, the score is the cosine similarity between the two word vectors.
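
Cosine similarity is the dot product of two vectors divided by the product of their norms. The NumPy sketch below uses small made-up vectors rather than the 300-dimensional ones from the pipeline; the function name `cosine_similarity` is ours.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: (u . v) / (|u| * |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction as a
c = np.array([-3.0, 0.0, 1.0])   # orthogonal to a

print(cosine_similarity(a, b))   # vectors pointing the same way score 1
print(cosine_similarity(a, c))   # orthogonal vectors score 0
```

The score depends only on the direction of the vectors, not on their length, which is why `a` and `b` count as identical here.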

In [14]:
nlp("justice").vector

array([-3.4381e-01,  3.4376e-01,  4.9534e-01, -3.5717e-02, -8.8070e-02,
       -4.7633e-01, -3.4325e-02,  5.9654e-01, -1.9067e-01,  3.3019e+00,
       -4.3143e-01, -5.7032e-01,  3.1515e-01,  7.7676e-02, -5.9649e-01,
        1.9499e-01, -2.4696e-02,  5.1572e-02,  2.2789e-02,  2.9140e-01,
        7.3238e-02, -7.5725e-02,  2.8049e-01, -7.1488e-02,  6.4580e-01,
       -3.2782e-01,  3.5153e-01,  1.2905e-01,  1.2300e-01,  3.3861e-01,
       -1.1412e-01,  1.3384e-01, -3.7455e-02, -3.4492e-02, -1.3803e-01,
       -5.2819e-01, -3.8170e-01, -5.6324e-01, -2.7196e-01,  3.6408e-01,
        1.9637e-01, -3.9655e-02,  1.0042e-01, -7.8966e-02, -1.3110e-01,
        1.0679e-02, -4.2275e-01,  2.0671e-01, -2.8614e-01, -3.1247e-01,
        3.3054e-02,  6.1553e-01,  2.1825e-01,  2.9132e-01,  7.1687e-02,
        4.5740e-01, -8.6655e-02,  1.2583e-01, -2.0430e-01, -1.9350e-01,
       -3.3587e-01, -1.8023e-01, -5.6906e-04, -5.0594e-01,  1.8982e-01,
       -2.2953e-01, -8.4605e-01, -1.0636e-01,  6.0445e-01, -9.34

Computing similarity scores can be helpful in many situations, but it’s also important to maintain realistic expectations about what information it can provide. Words can be related to each other in many ways, so a single “similarity” score will always be a mix of different signals, and vectors trained on different data can produce very different results that may not be useful for your purpose. Here are some important considerations to keep in mind:

 * There’s no objective definition of similarity. Whether “I like burgers” and “I like pasta” are similar depends on your application. Both talk about food preferences, which makes them very similar – but if you’re analyzing mentions of food, those sentences are pretty dissimilar, because they talk about very different foods.
 * The similarity of `Doc` and `Span` objects defaults to the average of the token vectors. This means that the vector for “fast food” is the average of the vectors for “fast” and “food”, which isn’t necessarily representative of the phrase “fast food”.
 * Vector averaging means that the vector of multiple tokens is insensitive to the order of the words. Two documents expressing the same meaning with dissimilar wording will return a lower similarity score than two documents that happen to contain the same words while expressing different meanings.
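
The order-insensitivity of averaged vectors is easy to check by hand. The sketch below uses a made-up three-word vocabulary with two-dimensional vectors (both hypothetical) instead of the pipeline's vectors.

```python
import numpy as np

# Hypothetical toy word vectors.
toy_vectors = {
    "dog":   np.array([1.0, 0.0]),
    "bites": np.array([0.0, 1.0]),
    "man":   np.array([1.0, 1.0]),
}

def average_vector(words):
    """Document vector as the plain average of its word vectors."""
    return np.mean([toy_vectors[w] for w in words], axis=0)

v1 = average_vector(["dog", "bites", "man"])
v2 = average_vector(["man", "bites", "dog"])

# Same words, opposite meaning, identical document vector.
print(np.allclose(v1, v2))
```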

<br/>