# Lab2.5: Using the Natural Language Toolkit (NLTK) to create concordances for polysemous words

Concordancers are tools that reveal patterns in language by showing every occurrence of a word together with its surrounding context. In this lab session, you will learn how to create a concordance, also known as a Key Word In Context (KWIC) index.

The goals of this notebook are to show:

* how different meanings of a word are reflected by its contexts
* how to process a series of text files with NLTK
* how to build a concordance or KWIC (Key Word In Context) index using NLTK


## Loading the text content from a number of files in a folder

We are going to build a corpus of texts from two different domains and use it to obtain concordances for the word *mouse*. We collected 5 texts that use *mouse* in a computer domain and 5 texts that use it in an animal domain. These texts are in this GitHub repository.


In order to load the data, you need to know where you are running this notebook and what the path to your data is. For this we use the following code:

In [1]:
import os
os.getcwd()

'/Users/piek/Downloads/ma-hlt-labs/lab2.word_meaning'

This should show the full path of where this notebook resides on your disk.

Now we can specify the path to our folder 'mouse_texts', which holds the mouse text data, and list the files it contains.

In [2]:
from pathlib import Path

# The path to the folder with the text files. 
# Here I use a relative path from where I run the notebook
# Adapt the path accordingly to where your data is and/or where you run your notebook
# You can also specify the absolute path

mousepath = Path.cwd() / 'mouse_texts'
print(mousepath)
files_in_mousepath = mousepath.iterdir()
print(files_in_mousepath)
for item in files_in_mousepath:
    if item.is_file():  # check that the item is not a subdirectory
        print(item.name)

/Users/piek/Downloads/ma-hlt-labs/lab2.word_meaning/mouse_texts
<generator object Path.iterdir at 0x7fe5701e2b30>
comp_mouse3.txt
comp_mouse2.txt
comp_mouse1.txt
comp_mouse5.txt
comp_mouse4.txt
animal_mouse3.txt
animal_mouse2.txt
animal_mouse1.txt
animal_mouse5.txt
animal_mouse4.txt
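
The files above are listed in no particular order, because iterdir() yields directory entries in arbitrary order. A minimal sketch of a sorted listing, assuming the same 'mouse_texts' folder; the helper name list_text_files is hypothetical and not part of the lab code:

```python
from pathlib import Path


def list_text_files(folder):
    """Return the names of the .txt files directly inside folder, sorted."""
    # glob("*.txt") only looks at direct children whose name ends in .txt;
    # is_file() filters out a directory that happens to be named like that.
    return sorted(p.name for p in Path(folder).glob("*.txt") if p.is_file())


# Path.cwd() / "mouse_texts" points to the same folder as os.getcwd() + '/mouse_texts'.
mousepath = Path.cwd() / "mouse_texts"
if mousepath.is_dir():
    for name in list_text_files(mousepath):
        print(name)
```

A sorted listing makes the order in which files are later concatenated reproducible from run to run.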


In the next cell, we iterate over all the files in the folder and read them one by one. If the filename starts with "animal_", we add the text to the 'animal_corpus'. If it starts with "comp_", the text goes to the 'computer_corpus'. All other files are ignored.

In [3]:
animal_corpus = ""
computer_corpus = ""
files_in_mousepath = mousepath.iterdir()
for item in files_in_mousepath:
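    if item.is_file():  # check that the item is not a subdirectory
        # The texts contain non-ASCII characters such as ’, so we state the
        # encoding explicitly (assuming the files are UTF-8 encoded)
        with open(item, encoding="utf-8") as infile: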
            if item.name.startswith("animal_"):
                animal_corpus += infile.read()
            elif item.name.startswith("comp_"):
                computer_corpus += infile.read()

 
print('animal_corpus:', len(animal_corpus))
print('computer_corpus:', len(computer_corpus))

animal_corpus: 1985
computer_corpus: 2990


If you carried out the instructions correctly, the variables "animal_corpus" and "computer_corpus" now each contain the concatenated content of all files for one meaning of the word "mouse".

You can create your own corpus in a similar way: put the texts from each domain in a separate folder and read all the text from each folder into its own variable.
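
A minimal sketch of such a folder-per-domain setup. The folder names 'animal_texts' and 'computer_texts' and the helper read_folder are hypothetical, and the files are assumed to be UTF-8 encoded:

```python
from pathlib import Path


def read_folder(folder, encoding="utf-8"):
    """Concatenate the text of every file directly inside folder."""
    texts = []
    for item in sorted(Path(folder).iterdir()):
        if item.is_file():  # skip subdirectories
            texts.append(item.read_text(encoding=encoding))
    # Joining with a newline keeps the last word of one file from fusing
    # with the first word of the next file.
    return "\n".join(texts)


# Hypothetical folder names; adapt them to your own data.
corpus_folders = {"animal": Path("animal_texts"), "computer": Path("computer_texts")}
corpora = {
    domain: read_folder(folder)
    for domain, folder in corpus_folders.items()
    if folder.is_dir()
}
for domain, text in corpora.items():
    print(domain, len(text))
```

Because the texts are joined with a newline here, the character counts will differ slightly from a plain "+=" concatenation of the same files.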

### Get the separate word tokens from the content

In [5]:
import nltk
animal_tokens=nltk.word_tokenize(animal_corpus)
computer_tokens=nltk.word_tokenize(computer_corpus)

The "word_tokenize" function takes a full text as a single string and splits it into tokens, using spaces and punctuation to separate the words. Note that we did not split the text into sentences this time, because we want to create concordances that can cross sentence boundaries.

Tokenization is not trivial, but it is essential for doing anything with a text. To see what the function produces, we print the first ten tokens of each token list.

In [6]:
print(animal_tokens[:10])
print(computer_tokens[:10])

['Mouse', ',', '(', 'genus', 'Mus', ')', ',', 'the', 'common', 'name']
['Some', 'tablet', 'and', 'laptop', 'manufacturers', 'may', 'say', 'that', 'the', 'best']
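
To get a feel for what splitting on spaces and punctuation means, here is a rough approximation that only uses the standard library. It is not the algorithm behind nltk.word_tokenize and it will behave differently on contractions, abbreviations and hyphenated words; simple_tokenize is a hypothetical helper, and the example string is written by hand to resemble the start of the animal corpus.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("Mouse, (genus Mus), the common name"))
# → ['Mouse', ',', '(', 'genus', 'Mus', ')', ',', 'the', 'common', 'name']
```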


NLTK has specific functions to process and analyse a text. To use these functions, it is necessary to create an NLTK text object from a list of tokens. We do that as follows:

In [7]:
animal_text_object=nltk.Text(animal_tokens)
computer_text_object=nltk.Text(computer_tokens)

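Type the name of a text object followed by a dot (for example "animal_text_object.") and press TAB to see which functions are available. Use the NLTK book to learn more: http://www.nltk.org/book/ch01.html. Some examples, where "text_object" stands for either of the two objects created above: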

* text_object.collocations()
* lexicon=text_object.vocab()
* lexicon.keys()
* lexicon.items()
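
A small sketch of these calls on a toy token list. The tokens are made up for illustration and are not taken from the mouse texts. The vocab() function returns an NLTK frequency distribution, which behaves like a dictionary from token to frequency:

```python
import nltk

# Toy tokens for illustration only.
toy_tokens = ["the", "house", "mouse", "is", "a", "small", "mouse", "."]
toy_text = nltk.Text(toy_tokens)

lexicon = toy_text.vocab()        # frequency distribution over the tokens
print(lexicon["mouse"])           # how often 'mouse' occurs
print(lexicon.most_common(1))     # the most frequent token with its count
print(toy_text.count("mouse"))    # the same count, asked of the text object
print(sorted(lexicon.keys()))     # the distinct tokens in alphabetical order
```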

One of these functions is the concordance function, which we are going to use next.

### Call the NLTK concordance function

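The concordance function takes a word as input and shows every occurrence of that word in the text together with its surrounding context. We are going to use the word 'mouse'. As the output shows, matching is case-insensitive, so 'Mouse' is found as well.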

In [8]:
animal_text_object.concordance('mouse')

Displaying 8 of 8 matches:
 Mouse , ( genus Mus ) , the common name ge
s ) long . In a scientific context , mouse refers to any of the 38 species in t
us Mus , which is the Latin word for mouse . The house mouse ( Mus musculus ) ,
the Latin word for mouse . The house mouse ( Mus musculus ) , native to Central
 . Their research was conducted in a mouse model of the disease.A mouse , plura
ed in a mouse model of the disease.A mouse , plural mice , is a small rodent ch
 high breeding rate . The best known mouse species is the common house mouse ( 
wn mouse species is the common house mouse ( Mus musculus ) . It is also a popu


As shown in the NLTK book (http://www.nltk.org/book/ch01.html), the concordance function shows the lines from the text in which the target word occurs, together with its left and right context. Centering each line on the target word gives a clear overview of the contexts.
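
The idea behind such a display can be sketched in a few lines of plain Python. This is a simplified illustration, not NLTK's implementation; the function kwic, its width parameter and the toy tokens are all assumptions made for this example.

```python
def kwic(tokens, keyword, width=30):
    """Return one line per match, with the keyword starting in a fixed column.

    width is the number of context characters kept on each side (at least 1).
    """
    lines = []
    target = keyword.lower()
    for i, token in enumerate(tokens):
        if token.lower() != target:  # case-insensitive comparison
            continue
        # Keep the last `width` characters before and the first `width` after.
        left = " ".join(tokens[:i])[-width:]
        right = " ".join(tokens[i + 1:])[:width]
        # Right-align the left context so every keyword lines up vertically.
        lines.append(f"{left:>{width}} {token} {right}")
    return lines


# Toy tokens for illustration only.
toy_tokens = "The house mouse is a small rodent . A mouse pad sits under the mouse .".split()
for line in kwic(toy_tokens, "mouse", width=20):
    print(line)
```

Rebuilding the left and right strings for every match is wasteful on a large corpus, but it keeps the sketch short and easy to follow.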

In [9]:
computer_text_object.concordance('mouse')

Displaying 15 of 15 matches:
 manufacturers may say that the best mouse is a finger tap , but we just don ’ 
There ’ s nothing quite like using a mouse to navigate your PC . No matter whet
r as unruly as you prefer.A computer mouse is a handheld hardware input device 
olders . For desktop computers , the mouse is placed on a flat surface such as 
s placed on a flat surface such as a mouse pad or a desk and is placed in front
 is an example of a desktop computer mouse with two buttons and a wheel.A compu
h two buttons and a wheel.A computer mouse is a hand-held pointing device that 
 The first public demonstration of a mouse controlling a computer system was in
itive games . The new Logitech G Pro mouse retains most of the design cues of t
able ) , which is really light for a mouse . To put it into context , the Steel
woops around the bottom curve of the mouse , and the the Logitech G logo still 
lights up , as well.With Apple Magic Mouse 2 Space Gray you swipe through web p
multi-touch

Concordances show how we use words in different contexts. Because we divided the texts by domain beforehand, we can now see how the same word *mouse* takes on different meanings in different contexts.
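
One way to make this comparison more systematic is to count which words occur near *mouse* in each corpus. The following is a minimal sketch with toy token lists; context_counts and its window parameter are hypothetical names, and in the notebook you would pass "animal_tokens" and "computer_tokens" instead.

```python
from collections import Counter


def context_counts(tokens, keyword, window=3):
    """Count the alphabetic tokens within `window` positions of keyword."""
    counts = Counter()
    target = keyword.lower()
    for i, token in enumerate(tokens):
        if token.lower() == target:
            start = max(0, i - window)
            # Tokens before and after the match, excluding the match itself.
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(t.lower() for t in neighbours if t.isalpha())
    return counts


# Toy token lists for illustration only.
toy_animal = "the house mouse is a small rodent".split()
toy_computer = "a computer mouse is a pointing device".split()
print(context_counts(toy_animal, "mouse").most_common(3))
print(context_counts(toy_computer, "mouse").most_common(3))
```

On real texts, frequent function words such as "the" and "a" will probably dominate these counts, so filtering them out with a stop word list is likely to give a clearer picture.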

# End of this Notebook