# Lexical Clustering

**(C) 2016-2020 by [Damir Cavar](http://damir.cavar.me/) <<dcavar@iu.edu>>**

**Version:** 1.2, September 2020

**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-notebooks).

**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))

This notebook provides simple examples of the vectorization of distributional properties of lexical items using Python 3.x. The applied examples show how lexical properties can be derived by applying common clustering methods to the resulting distributional vector space. This material is used in my graduate classes on Corpus Linguistics and Computational Linguistics at Indiana University at Bloomington.

# Vectorization of Distributional Properties

In the following we map out the distributional properties of lexical items. By distributional properties we mean various kinds of positional or contextual features of words in a text.

## Loading a Text into Memory

We will use the collection of fairy tales "A House of Pomegranates" by Oscar Wilde. The following code reads the text into memory. We open a file, read from it, and close the file again:

In [8]:
ifile = open("data/HOPG.txt", mode='r', encoding='utf-8')
text = ifile.read()
ifile.close()
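
The same open, read, and close steps can also be written with a `with` statement, which closes the file automatically. The following is a minimal sketch, not part of the original notebook: the helper `read_text` and the temporary demo file are assumptions made so that the example runs without `data/HOPG.txt`.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Return the full contents of a text file as one string."""
    # The with-statement closes the file even if read() raises an error.
    with open(path, mode="r", encoding=encoding) as ifile:
        return ifile.read()


# Demonstration on a temporary file, so the sketch does not depend on data/HOPG.txt.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.txt")
    with open(demo_path, mode="w", encoding="utf-8") as ofile:
        ofile.write("It was the night before the day fixed for his coronation.")
    demo_text = read_text(demo_path)

print(demo_text)
```

With the actual data file in place, `text = read_text("data/HOPG.txt")` would be equivalent to the cell above.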

## Using NLTK

We will use the [NLTK](https://www.nltk.org/) module to generate frequency profiles and [n-gram models](https://en.wikipedia.org/wiki/N-gram) using the tokens in the text.

In [11]:
import nltk

We will use the tokenization and lemmatization modules from [NLTK](https://www.nltk.org/). These are not the most accurate or best-performing components. For more efficient lemmatizers, consider using Python modules like [spaCy](https://spacy.io/).
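
To make the notions of a frequency profile and an n-gram model concrete, the following sketch counts tokens and bigrams using only the Python standard library rather than NLTK. The token list `demo_tokens` is made up for this illustration; it is not taken from the text.

```python
from collections import Counter

# A made-up token list for illustration only.
demo_tokens = ["the", "young", "king", "and", "the", "young", "fisherman"]

# Frequency profile: how often each token type occurs.
frequencies = Counter(demo_tokens)

# Bigrams: all pairs of adjacent tokens, and how often each pair occurs.
bigrams = list(zip(demo_tokens, demo_tokens[1:]))
bigram_frequencies = Counter(bigrams)

print(frequencies.most_common(2))
print(bigram_frequencies.most_common(1))
```

A list of n tokens yields n - 1 bigrams; longer n-grams can be built the same way by zipping further shifted copies of the list.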

## Tokenization

We need the tokens from the text, that is, the individual words and punctuation marks as separate elements of a token list:

In [12]:
tokens = nltk.word_tokenize(text)
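
To illustrate what such a token list looks like, here is a rough regular-expression approximation of tokenization. It is a simplified stand-in and not the algorithm behind `nltk.word_tokenize`; the function name `simple_tokenize` is an assumption for this sketch.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


sample = "His courtiers had all taken their leave of him, bowing their heads."
sample_tokens = simple_tokenize(sample)
print(sample_tokens)
```

This simple pattern would split a hyphenated word such as *Star-child* into three tokens, whereas the NLTK tokenizer keeps it as one token, as the lemma list further below shows.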

The list `tokens` contains all tokens as they occur in the text. This means that the token list will contain a *the*, a *The*, and maybe even a *THE*. Ideally, all these variants of "the" would be conflated to a single token representation *the*. In the next section we look at how far lemmatization takes us toward that goal.

## Lemmatization for Dimensionality Reduction

NLTK provides a WordNet-based lemmatizer. In the following we import the NLTK *WordNetLemmatizer* class:

In [13]:
from nltk.stem import WordNetLemmatizer

We instantiate a lemmatizer:

In [14]:
lemmatizer = WordNetLemmatizer()

The lemmatizer correctly converts the plural form *dogs* to the lemmatized form, as shown in the example below:

In [20]:
print(lemmatizer.lemmatize("dogs"))

dog


Unfortunately, the lemmatizer does not normalize a capitalized *The*:

In [21]:
print(lemmatizer.lemmatize("The"))

The
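
One common workaround, which is not part of the original notebook, is to lowercase the tokens before any further processing. The sketch below assumes a hypothetical helper `fold_case` and a small made-up token list.

```python
def fold_case(tokens):
    """Return a new list with every token lowercased."""
    return [token.lower() for token in tokens]


# Made-up variants for illustration only.
variants = ["the", "The", "THE", "King", "king"]
folded = fold_case(variants)

# After case folding only two distinct token types remain.
print(sorted(set(folded)))
```

The price of this normalization is that capitalization no longer distinguishes, for example, a title like *King* from the common noun *king*.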


Independent of this problem, we can apply the lemmatizer to all tokens, which reduces tokens with inflectional morphology to their base forms:

In [22]:
lemmas = [ lemmatizer.lemmatize(token) for token in tokens ]

We can print out the first 100 lemmas:

In [23]:
print(lemmas[0:100])

['A', 'HOUSE', 'OF', 'POMEGRANATES', 'Contents', ':', 'The', 'Young', 'King', 'The', 'Birthday', 'of', 'the', 'Infanta', 'The', 'Fisherman', 'and', 'his', 'Soul', 'The', 'Star-child', 'THE', 'YOUNG', 'KING', '[', 'TO', 'MARGARET', 'LADY', 'BROOKE', '--', 'THE', 'RANEE', 'OF', 'SARAWAK', ']', 'It', 'wa', 'the', 'night', 'before', 'the', 'day', 'fixed', 'for', 'his', 'coronation', ',', 'and', 'the', 'young', 'King', 'wa', 'sitting', 'alone', 'in', 'his', 'beautiful', 'chamber', '.', 'His', 'courtier', 'had', 'all', 'taken', 'their', 'leave', 'of', 'him', ',', 'bowing', 'their', 'head', 'to', 'the', 'ground', ',', 'according', 'to', 'the', 'ceremonious', 'usage', 'of', 'the', 'day', ',', 'and', 'had', 'retired', 'to', 'the', 'Great', 'Hall', 'of', 'the', 'Palace', ',', 'to', 'receive', 'a', 'few']

Note that *was* appears as *wa* in this list. Without a part-of-speech argument, the lemmatizer treats every token as a noun, so *was* is reduced as if it were a plural form.


## Using Functional Items as Distributional Features

Distributional properties of lexical items can be associated with various contextual cues. In a *Distributional Semantics* approach the core hypothesis is that the meaning of a specific word is determined by the meaning of the words in its context. Imagine the two different uses of *bats*:

*The bats were flying out of the cave.*

*The bats were made of solid wood.*

A baseball bat is more likely to be made of solid wood than to fly out of a cave. The mammals of the order Chiroptera, on the other hand, live in caves and fly in and out of them.

The general idea in Distributional Semantics is that the meaning of *bat* can be determined by the words in its context. If a word had only one specific meaning, that meaning could in principle be defined by the words that frequently occur in its context. We could also think of it in another way: the meaning of a word could be defined as a probability function that predicts the words in its context. This is a common interpretation in word-embedding approaches. It is obviously an oversimplification and conceptually wrong, but it is an approximation that has proven helpful in some NLP applications and models.
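
The idea of describing a word by the words in its context can be sketched as a simple count over a window around each occurrence. The function `context_counts` and its `window` parameter are assumptions for this illustration, and the two example sentences are lowercased and split on whitespace for simplicity.

```python
from collections import Counter


def context_counts(tokens, target, window=2):
    """Count the tokens occurring within `window` positions of each `target`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        counts.update(left + right)
    return counts


cave_tokens = "the bats were flying out of the cave".split()
wood_tokens = "the bats were made of solid wood".split()

# A window of 6 covers each of these short sentences completely.
cave_vector = context_counts(cave_tokens, "bats", window=6)
wood_vector = context_counts(wood_tokens, "bats", window=6)

print(cave_vector)
print(wood_vector)
```

The two counters share function words such as *the*, *were*, and *of*, but differ in content words such as *cave* and *wood*. Over a large corpus, such counts form the distributional vector of a word.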

The core problem is, of course, that *bat* can refer to more than one thing, and that the context can help us determine which meaning is most appropriate in a given use.
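
As a toy illustration of how context can select a meaning, the following sketch picks the sense whose cue words overlap most with the words in a sentence. The sense labels and the cue sets in `SENSE_CUES` are hypothetical and hand-written for this example; a real system would have to derive such information from a corpus.

```python
# Hypothetical, hand-written cue words per sense (illustration only).
SENSE_CUES = {
    "animal": {"flying", "cave", "wings", "nocturnal"},
    "baseball": {"wood", "solid", "swing", "hit"},
}


def guess_sense(context_tokens, sense_cues=SENSE_CUES):
    """Return the sense whose cue set overlaps most with the context, or None."""
    context = set(context_tokens)
    scores = {sense: len(cues & context) for sense, cues in sense_cues.items()}
    best_sense = max(scores, key=scores.get)
    # Without any overlapping cue word, no decision is made.
    return best_sense if scores[best_sense] > 0 else None


print(guess_sense("the bats were flying out of the cave".split()))
print(guess_sense("the bats were made of solid wood".split()))
```

A context that contains none of the cue words yields `None`, which reflects that the context does not always provide enough evidence to decide.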