# Local Homology NLP Use Cases: unsupervised word disambiguation

In this tutorial we apply **local homology** to study natural language processing (NLP) data. 

## The context

Most modern machine learning techniques dealing with NLP data need to preprocess text and transform it into more standard objects: arrays. The process of transforming a word (or token, i.e. a word stripped of its ending) into an array is called *word embedding*. 

There are of course more general techniques than simply transforming the single tokens: sentence embedding is one of the generalisations. However, for the purpose of this notebook, we will stick to word embeddings and in particular we will use a technique called [word2vec](https://en.wikipedia.org/wiki/Word2vec).

The disadvantage of any word embedding technique is that the same token is always mapped to the same array. Hence, if two words have the same written form but different meanings (i.e., *homographic words*), they will still be mapped to the same array!
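
The collision can be seen with a toy lookup table. This is only a sketch: `toy_embedding` and `embed` are made-up stand-ins for a trained word2vec model, and the numbers carry no meaning.

```python
# Toy lookup table standing in for a trained word embedding (values are made up).
toy_embedding = {
    "play": (0.5, 0.1, -0.3),
    "note": (0.2, -0.1, 0.7),
    "piano": (0.9, 0.3, 0.1),
    "pay": (-0.6, 0.4, 0.0),
    "bank": (-0.4, 0.8, 0.2),
}


def embed(sentence):
    """Map each token of a whitespace-separated sentence to its vector."""
    return [toy_embedding[token] for token in sentence.split()]


musical = embed("play note piano")   # "note" as a musical note
financial = embed("pay note bank")   # "note" as money

# The embedding is a per-token lookup, so the context cannot change the vector.
same_vector = musical[1] == financial[1]
print(same_vector)  # True
```

Whatever distinguishes the two meanings must therefore come from the surrounding words, not from the vector of "note" itself.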

## The task 

Given the above introduction, our task is to *disambiguate words*, i.e. to find a way to differentiate homographic words by their meaning. Our approach consists of analysing the whole sentence in which the word appears and trying to deduce the meaning of the word from its context (i.e. the neighbouring words). The core idea of our proposal is based on the algebro-topological description of the space of word embeddings.

## The main idea

Our analysis rests on the assumptions that are clearly explained in [this paper](https://arxiv.org/pdf/2011.09413.pdf). In short, the idea is that a word with multiple meanings sits on the singular loci of the stratified space of word embeddings. Strictly speaking, this statement is not formally correct, as there is not yet a clear notion of a canonical topology on the word embedding space; nonetheless, the intuition behind these concepts can be explained pictorially:

![sing](images/local_singularity.png)



This picture represents the local shape of the word-embedding space: the context words lie on the four cones whose common apex is the word `mole` (the singularity). Hence, a sentence from a thriller containing the word `mole` would most probably be located on the north-west branch, and so on.

## The goal of our exploration

We would first like to stress that this notebook is merely exploratory and does not aim to be a fully fledged ML pipeline for word disambiguation. 
We would like to understand whether the shape of the cones around the singularity (i.e. the `mole` point) can be distinguished with **local homology**: if that is the case, then the geometric shape of a sentence lying on the stratified word embedding space correlates with the meaning of a word! This would suggest a new disambiguation technique.
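
To make the intuition concrete, here is a toy sketch in plain Python. It is not how `KNeighborsLocalVietorisRips` works internally: it only counts the connected components of the points lying in an annulus around a chosen point, which is a crude proxy for the number of branches meeting there. The helper names (`star`, `annulus_components`) and all thresholds are made up for this illustration.

```python
import math


def star(n_branches, n_steps=10, step=0.1):
    """Sample points on `n_branches` straight branches meeting at the origin."""
    points = [(0.0, 0.0)]
    for k in range(n_branches):
        angle = 2 * math.pi * k / n_branches
        for s in range(1, n_steps + 1):
            points.append((s * step * math.cos(angle), s * step * math.sin(angle)))
    return points


def annulus_components(points, center, r_inner, r_outer, link_dist):
    """Count connected components of the points in an annulus around `center`.

    Two points are linked when they are at most `link_dist` apart.
    """
    ring = [p for p in points if r_inner <= math.dist(p, center) <= r_outer]
    parent = list(range(len(ring)))

    def find(a):
        # Union-find root lookup with path halving
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a in range(len(ring)):
        for b in range(a + 1, len(ring)):
            if math.dist(ring[a], ring[b]) <= link_dist:
                parent[find(a)] = find(b)
    return len({find(a) for a in range(len(ring))})


origin = (0.0, 0.0)
four_branches = annulus_components(star(4), origin, 0.35, 0.85, 0.15)
two_branches = annulus_components(star(2), origin, 0.35, 0.85, 0.15)
print(four_branches, two_branches)  # 4 2
```

A point where four branches meet is thus told apart from an ordinary point on a curve, where only two branches meet. Local homology captures this kind of difference in a more principled way.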

In [None]:
# Import needed libraries
from gtda.plotting import plot_point_cloud
from gtda.local_homology.simplicial import KNeighborsLocalVietorisRips

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from gensim.models import Word2Vec
from gensim.test.utils import common_texts
from gensim.parsing.preprocessing import remove_stopwords, stem

In this notebook we showcase how to use local homology to disambiguate words. In particular, we focus on disambiguating the word "note", which can mean a musical frequency, a short text, a verb, or money.

## The dataset and its preprocessing

The dataset contains many occurrences of the word "note" with the meanings stated above. The format is `.xml`, out of which we extract the plain text and do the standard preprocessing:

In [None]:
# Preprocess the data
with open("data/note.n.xml","r") as f:
    content = f.read()
    
    # split the sentences
    temp_list = content.split("<note.n.")  
    
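    # remove stopwords
    list_of_text = list(map(remove_stopwords, temp_list))
    
    # stem each word (gensim's `stem` also lower-cases the text);
    # apply it to the stopword-free text so that the previous step is not discarded
    list_of_text = list(map(stem, list_of_text))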
    refined_list = [list_of_text[i][1+len(str(i)):-12 - len(str(i))] for i in range(2,len(list_of_text))]
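
The slicing by string length above depends on the exact layout of the tags. As an alternative sketch, the sentence bodies can be pulled out with a regular expression. We assume here that each sentence is wrapped as `<note.n.K> ... </note.n.K>`; this is inferred from the split string and the real layout of `data/note.n.xml` may differ. The `sample` string is invented for the illustration.

```python
import re

# Hypothetical excerpt: the real layout of data/note.n.xml may differ.
sample = (
    "<note.n.1>\nShe played a high note on the violin .\n</note.n.1>\n"
    "<note.n.2>\nHe left a note on the table .\n</note.n.2>\n"
)

# Capture the text between <note.n.K> and the matching </note.n.K>;
# the back-reference \1 forces the closing tag to carry the same index K.
pattern = re.compile(r"<note\.n\.(\d+)>\s*(.*?)\s*</note\.n\.\1>", re.DOTALL)
sentences = [text for _, text in pattern.findall(sample)]
print(sentences)
```

The extracted strings can then be passed through `remove_stopwords` and `stem` exactly as in the cell above.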

In the next cell we use the `word2vec` technique to vectorise each word.

In [None]:
# extract the words from the sentences
from sklearn.base import BaseEstimator, TransformerMixin

class PreprocessingText(BaseEstimator, TransformerMixin):
    """A basic class to transform a list of sentences (strings)
    into a list of arrays, each with 2 dimensions: (n_words, dim_emb_space).
    
    The first dimension is the number of words in a sentence, while the
    second dimension is the word embedding dimension.
    """
    def __init__(self, vector_size=30, window=5, min_count=1, 
                 workers=4):
        self.vector_size = vector_size
        self.window = window
        self.min_count = min_count
        self.workers = workers
    
    def fit(self, X, y=None):
        all_words_in_sentences = list(map(str.split, X))
        self.word2vec = Word2Vec(sentences=all_words_in_sentences, 
                            vector_size=self.vector_size, 
                            window=self.window, 
                            min_count=self.min_count, 
                            workers=self.workers,
                            seed=11
                           )
        return self
    
    def transform(self, X, y=None):
        all_words_in_sentences = list(map(str.split, X))
        list_of_vect_sentences = [self.word2vec.wv[all_words_in_sentences[i]] for i in range(len(all_words_in_sentences))]
        return list_of_vect_sentences

pt = PreprocessingText()
pt.fit(refined_list)
list_of_vect_sentences = pt.transform(refined_list)  # all vectorized sentences (pt is already fitted)

## The exploratory results

Below are a couple of sentences containing the word "note" with the different meanings described above.

We display each sentence together with the modified persistence entropy of the local homology around "note" in that sentence, and then visualize the embeddings with UMAP.

To make the persistence diagrams easier to interpret, we introduce a helper class:

In [None]:
from gtda.diagrams.features import PersistenceEntropy

class ModifiedPersistenceEntropy(BaseEstimator, TransformerMixin):
    """This class respects the sklearn paradigm and is useful to 
    vectorize the persistence diagrams coming out of the local
    homology class
    """
    def __init__(self):
        self.pe = PersistenceEntropy()
    
    def fit(self, X, y=None):
        self.pe.fit(X)
        return self

    def transform(self, X, y=None):
        return 2**self.pe.transform(X)
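
The reason for exponentiating the entropy can be checked by hand. The sketch below assumes that the entropy is the base-2 Shannon entropy of the bar lifetimes normalised by their sum; the function `modified_persistence_entropy` is a plain-Python stand-in written for this illustration, not part of giotto-tda. With k bars of equal length the entropy is log2(k), so raising 2 to the entropy recovers k, which is why the output can be read as an approximate Betti number.

```python
import math


def modified_persistence_entropy(lifetimes):
    """Return 2**H, with H the base-2 Shannon entropy of the normalised lifetimes."""
    total = sum(lifetimes)
    probabilities = [life / total for life in lifetimes if life > 0]
    entropy = -sum(p * math.log2(p) for p in probabilities)
    return 2 ** entropy


# Three equally persistent bars: the value is (numerically) 3.
three_equal_bars = modified_persistence_entropy([1.0, 1.0, 1.0])

# One long bar plus two very short ones: the value stays close to 1,
# so short-lived (noisy) bars contribute little to the count.
one_dominant_bar = modified_persistence_entropy([1.0, 0.01, 0.01])

print(three_equal_bars, one_dominant_bar)
```

In this sense the modified entropy acts as a noise-tolerant count of the bars in each homology dimension.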

In [None]:
# example of a preprocessed sentence where "note" is used as a verb
print(refined_list[0])

lh = KNeighborsLocalVietorisRips(n_neighbors=(5, 15),
                                 homology_dimensions=(0, 1),
                                 collapse_edges=True, 
                                 n_jobs = -1)
lh.fit(list_of_vect_sentences[0])
ModifiedPersistenceEntropy().fit_transform(lh.transform(pt.word2vec.wv["note"].reshape(1, -1)))

In [None]:
# example of preprocessed sentence where "note" refers to music
print(refined_list[1])

lh = KNeighborsLocalVietorisRips(n_neighbors=(5, 15),
                                 homology_dimensions=(0, 1),
                                 collapse_edges=True, 
                                 n_jobs = -1)
lh.fit(list_of_vect_sentences[1])
ModifiedPersistenceEntropy().fit_transform(lh.transform(pt.word2vec.wv["note"].reshape(1, -1)))

We now visualize the word2vec embedding of two different sentences, using umap dimensionality reduction:

In [None]:
# Imports that will help us visualize the data
from sklearn.preprocessing import StandardScaler
import umap
def plotting_the_embedding(i, string):
    """this function displays the word embedding space reduced
    to two dimensions by the UMAP algorithm. In yellow the word
    `note` is highlighted. 
    """
    # Mark every occurrence of "note" in the sentence (1 = "note", 0 = any other word);
    # if "note" is absent, nothing is highlighted
    note_vector = pt.word2vec.wv["note"]
    temp = np.all(list_of_vect_sentences[i] == note_vector, axis=1).astype(float)

    reducer = umap.UMAP()

    scaled_point_cloud = StandardScaler().fit_transform(list_of_vect_sentences[i])

    embedding = reducer.fit_transform(scaled_point_cloud)

    plt.scatter(
        embedding[:, 0],
        embedding[:, 1], c = temp)
    plt.gca().set_aspect('equal', 'datalim')
    plt.title('Use of "note" as a ' + string, fontsize=24)
    
    # Example of sentence with the word "note"
    print("Preprocessed sentence: ")
    print(refined_list[i])
    lh.fit(list_of_vect_sentences[i])

    print("Estimated 0th and 1st local Betti numbers:")
    print(ModifiedPersistenceEntropy().fit_transform(lh.transform(np.array(pt.transform(("note",))[0], dtype=float))))

In [None]:
# Note as a verb
i=0 
plotting_the_embedding(i, "verb")

In [None]:
# Musical note
i=1
plotting_the_embedding(i, "musical note")

The technique seems promising: for these two sentences, the results above show a clear distinction between the local shapes of the neighborhoods of the word "note" with different meanings.

## A small scale statistical exploration

We now look at 30 sentences in which the word "note" takes different meanings, which we have labelled by hand.
For each of the 30 sentences, we compute the 0-dimensional and 1-dimensional local homology around "note" and plot the resulting 2-dimensional point cloud, coloured by the meaning of "note" in the sentence.

In [None]:
note_emb = pt.word2vec.wv["note"]
note_loc_hom = []
for i in range(30):
    lh = KNeighborsLocalVietorisRips(n_neighbors=(5, 15),
                                 homology_dimensions=(0, 1),
                                 collapse_edges=True, 
                                 n_jobs = -1)
    sentence_emb = list_of_vect_sentences[i]
    lh.fit(sentence_emb)
    note_loc_hom.append(ModifiedPersistenceEntropy().fit_transform(lh.transform(note_emb.reshape(1, -1)))[0])

note_loc_hom = np.array(note_loc_hom)

plt.scatter(
    note_loc_hom[:, 0],
    note_loc_hom[:, 1],
    c = [0, 1, 2, 2, 1, 1, 2, 2, 1, 3, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 3, 3, 3, 2, 1, 1, 3, 2, 2, 1])
plt.gca().set_aspect('equal', 'datalim')
plt.title('Local dimension around "note"', fontsize=24)

## Conclusion

The method seems promising! In particular, the meaning "money" (in yellow) seems to have widely varying local dimensions, whereas the other classes seem to be more tightly clustered. However, a lot more work is needed, especially in systematising the analysis and finding the proper vectorisation of local homology.
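
One small step towards systematising the comparison is to measure the spread of each class instead of judging it by eye. The sketch below uses made-up features and labels, not results from the dataset, and the helper `per_class_spread` is hypothetical.

```python
from collections import defaultdict
from statistics import pstdev

# Made-up 2D features and labels, for illustration only.
features = [(1.0, 1.0), (1.2, 0.9), (3.0, 2.0), (5.0, 0.5), (1.1, 1.1), (0.5, 4.0)]
labels = [1, 1, 3, 3, 1, 3]


def per_class_spread(features, labels):
    """Return {label: (std of 1st coordinate, std of 2nd coordinate)}."""
    grouped = defaultdict(list)
    for point, label in zip(features, labels):
        grouped[label].append(point)
    return {
        label: tuple(pstdev(coordinate) for coordinate in zip(*points))
        for label, points in grouped.items()
    }


spread = per_class_spread(features, labels)
for label, (std_0, std_1) in sorted(spread.items()):
    print(label, round(std_0, 3), round(std_1, 3))
```

Applied to `note_loc_hom` and the hand-made labels, a table like this would show whether one class really varies more than the others.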

We really hope that this notebook will pique your interest and suggest new, relevant research directions.

## Appendix: A single pipeline

All the steps above can be merged into a single `Pipeline`. Note, however, that text embeddings have non-standard dimensions: `PreprocessingText` returns one array per sentence rather than a single 2D array. With the next cell we conclude our notebook.

In [None]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([("preprocessing", PreprocessingText()),
                 ("local homology", KNeighborsLocalVietorisRips(n_neighbors=(5, 15),
                                 homology_dimensions=(1,2),
                                 collapse_edges=True, 
                                 n_jobs = -1)), 
                ("vectorizer", ModifiedPersistenceEntropy())])

# The training dataset: the list of preprocessed sentences,
# which the first pipeline step turns into word embeddings.
X_train = refined_list

# this is a simple example to test the disambiguation of the word "note"
X_test = ("note", )  

pipe.fit(X_train)

pipe.transform(X_test)