# Tokenising Text

Questions
- What is tokenisation?
- How can a string of raw text be tokenised?

Objectives

- Learn how to tokenise text

Key Points

- Tokenisation means splitting a string into separate words and punctuation marks, for example so that they can be counted.
- Text can be tokenised using a tokeniser, e.g. the punkt tokeniser in NLTK.

### Catch-up on syntax:

In [2]:
my_letters = ["A","B","C","D","E"] # create a list
print(my_letters[0]) # first item
print(my_letters[:-2]) # slice of items: from the beginning up to, but not including, the second from the end
print([letter.lower() for letter in my_letters]) # 'comprehend' a list as its lowercase values 

A
['A', 'B', 'C']
['a', 'b', 'c', 'd', 'e']


## Tokenising text

### But first … importing packages

Python comes with a selection of pre-written code that you can use: built-in functions and a library of packages and modules. We have already used the built-in functions `print()` and `len()`. Built-in functions are available as soon as you start Python.

There are also bundles of other useful functions that you can **IMPORT** into your code and use. These collections of useful code are called libraries, packages or modules, and they give us access to very powerful features very quickly.

For this course we need to `import` a few libraries into Python. We will need to do it once in every notebook.

Sometimes we will only import a particular function from a package with `from some_package import some_method`, and sometimes we will give a package a nickname with `import something_long as nickname`. But do not worry, we will guide you through it.
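As a small sketch of these import styles, here is an example that uses only standard-library modules. The choice of `math`, `collections` and `statistics` is ours for illustration and is not something the course requires.

```python
import math                        # import a whole module
from collections import Counter    # import one name from a module
import statistics as stats         # import a module under a nickname

print(math.sqrt(16))               # use the module name as a prefix

letter_counts = Counter("banana")  # use the imported name directly
print(letter_counts["a"])

mean_value = stats.mean([1, 2, 3, 4])  # use the nickname as a prefix
print(mean_value)
```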

### What we'll need:

We'll import 4 very popular libraries: nltk, numpy, string and matplotlib.pyplot

**NLTK** is the tool which we’ll be using to do much of the text processing in this workshop, so we need to run `import nltk`. We will also use `numpy` to represent information in arrays and matrices, `string` to process some strings and `matplotlib` to visualise the output.

If there is a problem importing any of these modules you may need to revisit the appropriate install in the prerequisites list.

In [None]:
# run this cell now, but also we will copy it into every notebook we work on:

import nltk
import numpy
import string
import matplotlib.pyplot as plt

### Tokenising a string: Splitting it into individual Tokens (words etc.)

In order to process text we need to break it down into tokens. As we explained at the start, a token is a letter, word, number, or punctuation which is contained in a string.
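To make the idea concrete, here is a rough sketch that splits a string into word and punctuation tokens with a regular expression. It is a simplified stand-in for illustration only, not how NLTK tokenises, and the function name `simple_tokenise` is our own.

```python
import re

def simple_tokenise(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither of those nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

example_tokens = simple_tokenise("Jack fell down, and broke his crown.")
print(example_tokens)
```

A pattern this simple would break a contraction such as "couldn't" at the apostrophe, which is one reason to prefer a dedicated tokeniser.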

To tokenise we first need to import the `word_tokenize` function from NLTK's `tokenize` package, which allows us to do this without writing the code ourselves.

We will also download a specific tokeniser that NLTK uses as default. There are different ways of tokenising text and today we will use NLTK’s in-built punkt tokeniser by calling:

In [None]:
# run this cell now. (it's fine if you see some pink warnings underneath it)

from nltk.tokenize import word_tokenize
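nltk.download('punkt')
# newer NLTK releases may ask for 'punkt_tab' instead: if word_tokenize raises a LookupError, run nltk.download('punkt_tab')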

Now we can assign text to a string variable and tokenise it. We will save the tokenised output as a list in the `humpty_tokens` variable, which we can inspect by printing it.

In [None]:
humpty_string = "Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall; All the king's horses and all the king's men couldn't put Humpty together again."
humpty_tokens = word_tokenize(humpty_string)
print(humpty_tokens)

In [None]:
# Show first 10 entries of the tokens list
print(humpty_tokens[0:10])

As you can see, some of the words are uppercase and some are lowercase. To further analyse the data, for example counting the occurrences of a word, we need to normalise the data and make it all lowercase.
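The effect of lowercasing on counts can be sketched with the standard library's `collections.Counter`. The short token list below is typed in by hand for illustration; it is not the output of the tokeniser.

```python
from collections import Counter

# A hand-written token list with mixed case (illustrative only)
sample_tokens = ["All", "the", "king", "and", "all", "the", "men"]

raw_counts = Counter(sample_tokens)
lower_counts = Counter(token.lower() for token in sample_tokens)

print(raw_counts["all"])    # counts only the lowercase form
print(lower_counts["all"])  # counts "All" and "all" together
```

Without normalisation, "All" and "all" are treated as two different tokens, so the count for either form is too low.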

### Task: changing a list of tokens to their lowercase versions

Hints:

- You can lowercase the strings in the list by calling `.lower()` on each entry
- You can do this with a list comprehension, a compact form of the for loop: `[output for item in items]`

In [None]:
# solution

lower_humpty_tokens = [word.lower() for word in humpty_tokens]
lower_humpty_tokens[0:6]
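For comparison, the list comprehension in the solution does the same job as an explicit `for` loop that appends to a new list. Here is a sketch with a small hand-written list (the variable names are ours):

```python
words = ["Humpty", "Dumpty", "SAT"]

# Explicit loop: build the list one item at a time
lower_loop = []
for word in words:
    lower_loop.append(word.lower())

# List comprehension: the same result in one line
lower_comprehension = [word.lower() for word in words]

print(lower_loop == lower_comprehension)
```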

# Pre-processing Data Collections 

Questions

- How can I load a file and tokenise it?
- How can I load a text collection made up of multiple text files and tokenise them?

Objectives

- Learn how to tokenise a text file and a collection of text files

Key Points

- To open and read a file on your computer, the `open()` and `read()` functions can be used.
- To read an entire collection of text files you can use the `PlaintextCorpusReader` class provided by NLTK and its `words()` function to extract all the words from the text in the collection.

## Data Preparation 

Text data comes in different forms. You might want to analyse a document in one file or an entire collection of documents (a corpus) stored in multiple files. In this part of the lesson we will show you how to load a single document and how to load the text of an entire corpus into Python for further analysis.

### Download some data

Firstly, please download a dataset and make a note of where it is saved on your computer. We need the path to the dataset in order to load and read it for further processing.

We will use the Medical History of British India collection provided by the National Library of Scotland as an example:

This dataset forms the first half of the Medical History of British India collection, which itself is part of the broader India Papers collection held by the Library. A Medical History of British India consists of official publications varying from short reports to multi-volume histories related to disease, public health and medical research between circa 1850 to 1950. These are historical sources for a period which witnessed the transition from a humoral to a biochemical tradition, which was based on laboratorial science and document the important breakthroughs in bacteriology, parasitology and the developments of vaccines in a colonial context.

This collection has been made available as part of NLS’s DataFoundry platform which provides access to a number of their digitised collections.

We are only interested in the text of the Medical History of British India collection for this course, so at the bottom of the website, download the “Just the text” data or download it directly here.

Note that this dataset requires approx. 120 MB of free file space on your computer once it has been unzipped. Most computers automatically uncompress .zip files such as the one you have downloaded. If your computer does not do that, right-click on the file and click on uncompress or unzip.

You should be left with a folder called nls-text-indiaPapers containing all the .txt files for this collection. Please check that you have that on your computer and find out what its path is. In my case it is /Users/balex/Downloads/nls-text-indiaPapers/.

### Loading and tokenising a single document

You can use the `open()` function to open one file in the Medical History of British India corpus. You need to specify the path to a file in the downloaded dataset and the mode of opening it (`'r'` for read). The path will be different from the one below depending on where you saved the data on your computer.

The read() function is used to read the file. The file’s content (the text) is then stored as a string variable called india_raw.

You can then tokenise the text and convert it to lowercase. You can check it has worked by printing out a slice of the list lower_india_tokens.

In [None]:
# replace the path with the one on your computer or on Noteable
india_path = '/Users/balex/Downloads/nls-text-indiaPapers/74457530.txt'

# the with-block closes the file automatically once it has been read
with open(india_path, 'r') as file:
    india_raw = file.read()

india_tokens = word_tokenize(india_raw)                       # split into tokens
lower_india_tokens = [word.lower() for word in india_tokens]  # lowercase each token
lower_india_tokens[0:10]
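If you want to try the open, read and tokenise steps without the dataset, the sketch below writes a tiny file to a temporary folder and reads it back. The file name and its contents are made up, and `str.split()` is used as a crude whitespace tokeniser so that the example runs without NLTK.

```python
import os
import tempfile

# Create a throwaway folder and a small text file (made-up content)
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt")
with open(sample_path, "w", encoding="latin1") as out_file:
    out_file.write("A Small Test Document")

# Open the file for reading; the with-block closes it automatically
with open(sample_path, "r", encoding="latin1") as in_file:
    sample_raw = in_file.read()

sample_tokens = sample_raw.split()  # whitespace tokenisation
lower_sample_tokens = [token.lower() for token in sample_tokens]
print(lower_sample_tokens)
```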

### Loading and tokenising a corpus

We can do the same for an entire collection of documents (a corpus). Here we choose a collection of raw text documents in a given directory. We will use the entire Medical History of British India collection as our dataset.

To read the text files in this collection we can use the PlaintextCorpusReader class provided in the corpus package of NLTK. You need to specify the collection directory name and a wildcard for which files to read in the directory (e.g. .* for all files) and the text encoding of the files (in this case latin1). Using the words() method provided by NLTK, the text is automatically tokenised and stored in a list of words. As before, we can then lowercase the words in the list.

In [None]:
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/Users/balex/Downloads/nls-text-indiaPapers/' 
wordlists = PlaintextCorpusReader(corpus_root, '.*', encoding='latin1') 
corpus_tokens = wordlists.words() 
print(corpus_tokens[:10]) 

In [None]:
lower_corpus_tokens = [str(word).lower() for word in corpus_tokens] 
lower_corpus_tokens[0:10] 
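Conceptually, a corpus reader walks over every matching file in a folder and joins their tokens into one sequence. The sketch below imitates that idea with `pathlib` and a regular expression on two made-up files. It is not NLTK's implementation, and the helper name `read_corpus_tokens` is hypothetical.

```python
import re
import tempfile
from pathlib import Path

def read_corpus_tokens(root, pattern="*.txt", encoding="latin1"):
    """Return the tokens of every matching file under root, in file-name order."""
    tokens = []
    for path in sorted(Path(root).glob(pattern)):
        text = path.read_text(encoding=encoding)
        # words and single punctuation marks, as a crude tokeniser
        tokens.extend(re.findall(r"\w+|[^\w\s]", text))
    return tokens

# Build a tiny made-up corpus of two files in a temporary folder
demo_root = Path(tempfile.mkdtemp())
(demo_root / "a.txt").write_text("First file.", encoding="latin1")
(demo_root / "b.txt").write_text("Second file!", encoding="latin1")

demo_tokens = read_corpus_tokens(demo_root)
print(demo_tokens)
```

Note that `glob` takes a shell-style pattern (`*.txt`), whereas `PlaintextCorpusReader` takes a regular expression (`.*`).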

### Task 1: Print slice of tokens in list

Print out a larger slice of the list of tokens in the Medical History of British India collection, e.g. the first 30 tokens.

Answer:

In [None]:
print(corpus_tokens[:30])

### Task 2: Print slice of lowercase tokens in list

Print out the same slice but for the lower-cased version.

Answer:

In [None]:
print(lower_corpus_tokens[0:30])