

# Badge 1: Tokenising Text and Text in Context <img src="../images/badge01.png" width=140 height=140 style="float: right;" />

In this notebook, you will learn how to:

1. [**Load a Text File** in Your Notebook](#1)
2. [**Tokenise** Strings: Splitting Them into Tokens (Words, etc.)](#2)
3. [Create **Concordance** Lists (tokens in context)](#3)
4. [Search Through Text Using **Regular Expressions** (wildcards syntax)](#4)

### Recap of syntax:

In [None]:
# Create a list, get its first item and get its length.
my_letters = ["A","B","C","D","E"] 
print(my_letters[0])
print(len(my_letters))

In [None]:
# Slice of items: from the first item up to (not including) the second-to-last item
print(my_letters[:-2])

In [None]:
# 'Comprehend' a list as its lowercase values:
my_letters_lowercased = [letter.lower() for letter in my_letters]
print(my_letters_lowercased) 

In [1]:
# Also: run this cell now. It's the usual imports of text mining libraries.
import nltk
import numpy
import string
import matplotlib.pyplot as plt

<a id="1"></a>
# 1. Loading a Text File in Your Notebook

#### Questions & Objectives

- How can I load a text file from my hard drive?

#### Key Points

- To open and read a file on your computer, we use `open()`, `read()` and `close()`
- Once the file is opened, you can store its contents in a variable

Broadly speaking there are two contexts in which we load a text file for analysis:

- Local file:  you have your file on your [virtualized] computer or hard drive because you created or downloaded it earlier
- Remote file: you access the file directly from some website, 'on the fly', processing it with your code but never really saving it as your own (e.g., for copyright or convenience reasons)

We only cover the first case in this badge, because we provide the transcript file to you.

### Loading a local file:

First, let's load a file from your 'hard drive'. Because we are working inside Noteable, Noteable acts as your hard drive. The file you downloaded, `Babysitting-transcript.txt`, is in the data folder, so we reference it with a path relative to the notebook's folder: `../data/Babysitting-transcript.txt` (the `../` means 'up one folder from this one').

In [21]:
file_name = "../data/Babysitting-transcript.txt"
my_file = open(file_name) # open the file
transcript = my_file.read()   # read contents of the file and put them in a variable
my_file.close()           # close the file

# After that you have access to the file as text in the transcript variable you created.
print("number of characters:", len(transcript)) 
print(transcript[:955])       # first 955 characters
print("------------")
print(transcript[-955:])      # last 955 characters

number of characters: 47819
**Babysitting
*Prologue

@Mother	All right, guys. We're here. Don't forget your stuff, OK? And Dylan, grab your snow pants.

@Male Child	OK.

@Ira Glass	Here's a ritual that happens in millions of American families every day, parents dropping off kids at the babysitter's.

@Cristiana	Good morning.

@Mother	Good morning.

@Cristiana	Hi, sweetie. I haven't seen you guys in such a long time.

@Ira Glass	Sarah, age 9, and Dylan, who's 6, are being left at a friend's house where there are two other kids, Elliott and Emma, and their regular babysitter, Cristiana, who meets them at the door, who hasn't seen them since before Christmas.These kids have known Cristiana longer than they've known almost anyone. Four years she's been their sitter, an eternity. Cristiana takes care of them after school every day. Cristiana knows everything about them. And their such old pros at being left with the sitter, they don't think twice about it. Mom leaves, no te
------------

@I
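
A common alternative to calling `close()` yourself is Python's `with` statement, which closes the file automatically when the block ends. The sketch below is only an illustration: it writes a tiny, made-up stand-in file (`demo-transcript.txt`, a hypothetical name) to a temporary folder so that it runs anywhere, rather than reading the real transcript.

```python
import os
import tempfile

# Hypothetical stand-in file, created only so this sketch runs on any machine
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo-transcript.txt")

    # Write one made-up line in the same "@Speaker<TAB>text" style
    with open(demo_path, "w", encoding="utf-8") as demo_file:
        demo_file.write("@Mother\tGood morning.\n")

    # Read it back; the file is closed automatically at the end of the block
    with open(demo_path, encoding="utf-8") as demo_file:
        demo_text = demo_file.read()

print("file closed:", demo_file.closed)
print("number of characters:", len(demo_text))
```

The same pattern would apply to `../data/Babysitting-transcript.txt`; the explicit `open()`, `read()`, `close()` version above does the same job.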

### So now we have a long string ... what's next?

As we can see it is not particularly useful to operate on **characters** as the main measure of length and to access parts of text. It would be more meaningful to ask for the first 10 words or last 10 words. Indeed, we might want to consider punctuation and symbols too.
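
To make the contrast concrete, here is a small sketch using a made-up sentence (not the transcript). Slicing a string counts characters, while Python's built-in `str.split()` gives a rough word-level view by splitting on whitespace. Notice that punctuation stays attached to the words, which is one reason a proper tokeniser is preferable.

```python
# Made-up sample sentence, used only for illustration
sample = "Good morning. Hi, sweetie."

first_ten_characters = sample[:10]   # slicing a string works on characters
rough_words = sample.split()         # naive split on whitespace

print(first_ten_characters)
print(rough_words[:2])
```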

This is where tokens come in:

<a id="2"></a>
# 2. Tokenising Strings: Splitting Text into Tokens (Words, etc.)

#### Questions & Objectives

- What is tokenisation?
- How can a string of raw text be tokenised?

#### Key Points

- Tokenisation means to split a string into separate words and punctuation marks, to be able to, for example, count them.
- Text can be tokenised using a tokeniser, e.g., the `punkt` tokeniser in NLTK.

In order to process text we need to break it down into tokens. As we explained at the start, a token is a letter, word, number, or punctuation mark which is contained in a string.

To tokenise we first need to import the `word_tokenize` function from the `tokenize` package of NLTK, which allows us to tokenise text without writing the code ourselves.

We will also download a specific tokeniser that NLTK uses as default. There are different ways of tokenising text and today we will use NLTK’s built-in `punkt` tokeniser by calling:

In [22]:
# Run this cell now (it's fine if you see some pink messages underneath it).

from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/balex/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

As an example, let's tokenise (split into tokens) the nursery rhyme "Humpty Dumpty".

We will save the tokenised output in a list using the `humpty_tokens` variable so we can inspect it.

In [23]:
humpty_string = "Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall; All the king's horses and all the king's men couldn't put Humpty together again."
humpty_tokens = word_tokenize(humpty_string)
print(humpty_tokens) # print all tokens

['Humpty', 'Dumpty', 'sat', 'on', 'a', 'wall', ',', 'Humpty', 'Dumpty', 'had', 'a', 'great', 'fall', ';', 'All', 'the', 'king', "'s", 'horses', 'and', 'all', 'the', 'king', "'s", 'men', 'could', "n't", 'put', 'Humpty', 'together', 'again', '.']


In [24]:
# Let's print just a few of them to have a closer look:
print(humpty_tokens[0:10])

['Humpty', 'Dumpty', 'sat', 'on', 'a', 'wall', ',', 'Humpty', 'Dumpty', 'had']
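
Once text is a list of tokens, counting becomes straightforward. As a small sketch (one possible approach, not something NLTK requires), the standard library's `collections.Counter` can tally the tokens printed above. The token list is copied in by hand here so that the example runs on its own.

```python
from collections import Counter

# Tokens copied from the word_tokenize output shown above
humpty_tokens_copy = ['Humpty', 'Dumpty', 'sat', 'on', 'a', 'wall', ',',
                      'Humpty', 'Dumpty', 'had', 'a', 'great', 'fall', ';',
                      'All', 'the', 'king', "'s", 'horses', 'and', 'all',
                      'the', 'king', "'s", 'men', 'could', "n't", 'put',
                      'Humpty', 'together', 'again', '.']

token_counts = Counter(humpty_tokens_copy)

print("number of tokens:", len(humpty_tokens_copy))
print("occurrences of 'Humpty':", token_counts['Humpty'])
```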


### 🐛 Minitask

Write some Python code to tokenise the transcript document.  Remember you have already saved that in the ```transcript``` variable.

In [None]:
# Write your answer here:

In [None]:
<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

    ### BEGIN SOLUTION

    ...
    
    ### END SOLUTION
    
</details>

### Unifying and cleaning up the text

To further analyse the data, we'll first learn how to perform some clean-up tasks. 

As you can see in the above example, some of the words are uppercase and some are lowercase. But Python is case-sensitive, which means that 'hope' and 'Hope' are considered two completely different strings.

For example, when searching for a word or counting the occurrences of a word, we most likely will want to consider both the lowercase and uppercase versions of the word (e.g., `company` and `Company`). That's why, to simplify the analysis, we often normalise the data by making it all lowercase. This way, both of the above words simply become `company`, so they are matched and counted as the same word.

Since our list of tokens is a list of strings (words and punctuation) we can apply the `list comprehension loop` we learned about before to transform our list of mixed-case words into a list of lowercase words. 

As you might remember, the syntax for such a loop is `[output_format for item in items]` where:

- `output_format` is some operation we perform on item, like `item.lower()` or `len(item)`
- `items` is the list with all the elements we want to transform
- `item` is a temporary name we give to each element of `items`, for the purposes of using that name inside of `output_format`
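
As a quick illustration of these three parts, the sketch below uses a short made-up list and shows a comprehension next to the ordinary `for` loop that does the same kind of work.

```python
items = ["Humpty", "Dumpty", "sat"]

# List comprehension: output_format is len(item)
lengths = [len(item) for item in items]

# The equivalent ordinary loop, here with item.lower() as the operation
lowered = []
for item in items:
    lowered.append(item.lower())

print(lengths)
print(lowered)
```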

Let's modify the Humpty Dumpty example above so that we only work with lowercased tokens of the nursery rhyme:

In [None]:
humpty_string = "Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall; All the king's horses and all the king's men couldn't put Humpty together again."
humpty_tokens = word_tokenize(humpty_string)
lowercase_tokens = [token.lower() for token in humpty_tokens]
print(lowercase_tokens)
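
To see why lowercasing matters for counting, here is a minimal sketch using a hand-copied part of the rhyme's tokens (so it runs without NLTK). It counts the word 'all' before and after lowercasing with the built-in `list.count()` method.

```python
# A hand-copied slice of the nursery rhyme tokens
some_tokens = ['All', 'the', 'king', "'s", 'horses', 'and',
               'all', 'the', 'king', "'s", 'men']

count_before = some_tokens.count('all')   # 'All' is not matched

some_tokens_lower = [token.lower() for token in some_tokens]
count_after = some_tokens_lower.count('all')   # both forms now match

print("before lowercasing:", count_before)
print("after lowercasing:", count_after)
```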

### Preparing Your Tokens for Analysis (e.g., making all words lowercase)
### A few words about overwhelming your computer with processing: DO NOT PANIC!

Note that this dataset is quite large: it contains almost 500 files, 30,000,000 words, and 130,000,000 characters! It's 120MB of data.

Still, loading all the files and turning text into tokens takes about 1-2 seconds!

However, when we lowercase all of the words in this corpus, we run a piece of code which turns every single character of the 130,000,000 characters into its lowercased equivalent.

`lower_corpus_tokens = [word.lower() for word in corpus_tokens]` 

This is not going to happen instantly. Run the below code cell now, and then keep reading.

You'll need to be patient; it might take a minute or more to run. It will look as if your notebook has FROZEN (it might stop responding), but it's just busy at work. You will know that the cell is running because there will be an `In [*]` at the top left of that busy cell, and (in some browsers) the icon on the browser tab will turn into an hourglass.

If you have not done it yet, run the below cell now and see what happens.

In [None]:
lower_corpus_tokens = [word.lower() for word in corpus_tokens] 
print(lower_corpus_tokens[400:450]) 

Here are some things you can do to not have to wait all the time for your code to finish:

**We try to do the hardest processing only once, and store the result in a variable, which we use later** (instead of re-doing the processing all the time).

This is exactly what we do above: once the above line of code has finished running, the variable `lower_corpus_tokens` contains all of your tokens as lowercase strings.

Accessing the tokens in this variable will take a very short time now, because all the time-consuming processing (making things lowercase) is already done, and all the tokens are now lowercased.  Moving forward, we'll just want to view them.

In [None]:
# This will execute almost immediately now, because all the processing 
# (changing millions of characters) is done, and we are just requesting a subset of a (very large) list:

print(lower_corpus_tokens[400:450]) 

Overall, when processing needs to be done, it needs to be done. So don't be scared of it, and when you see the `In [*]` indicator that a cell is processing, just get a cup of tea :)
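
The "process once, reuse many times" idea can be sketched with a made-up stand-in corpus (the token list below is hypothetical and far smaller than the real one). The timings will differ from machine to machine, so treat the printed numbers as indicative only.

```python
import time

# Hypothetical stand-in corpus: 200,000 made-up tokens
demo_tokens = ["Hope", "AND", "Change", "people"] * 50_000

# The expensive step: lowercase every token, once
start = time.perf_counter()
demo_lower = [token.lower() for token in demo_tokens]
processing_seconds = time.perf_counter() - start

# The cheap step: look at a small part of the stored result
start = time.perf_counter()
subset = demo_lower[400:410]
lookup_seconds = time.perf_counter() - start

print(f"lowercasing took {processing_seconds:.4f}s, slicing took {lookup_seconds:.6f}s")
print(subset)
```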

<a id="3"></a>
# 3. Concordance Lists (tokens in context)

#### Questions & Objectives

- How can I see all the contexts in which a particular token appears in a text?

#### Key Points

- A concordance list is a list of all contexts in which a particular token appears in a corpus or text.
- A concordance list can be created using the `concordance()` method of the `Text` class in NLTK.

Words make a lot of sense in context. A concordance list is a fantastic way to glimpse into the text and see how a particular token is used.
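
To show the idea in plain Python, here is a minimal sketch of a concordance. The helper `simple_concordance` and its `window` parameter are hypothetical names made up for this illustration; this is not how NLTK implements its concordance, just the underlying concept: find each occurrence of a token and show a few tokens on either side.

```python
def simple_concordance(tokens, target, window=3):
    """Return one line per occurrence of target, with up to `window` tokens on each side."""
    lines = []
    for position, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, position - window):position]
            right = tokens[position + 1:position + 1 + window]
            lines.append(" ".join(left) + " [" + token + "] " + " ".join(right))
    return lines

# Lowercased tokens from the start of the nursery rhyme, copied by hand
demo_tokens = ['humpty', 'dumpty', 'sat', 'on', 'a', 'wall', ',',
               'humpty', 'dumpty', 'had', 'a', 'great', 'fall']

concordance_lines = simple_concordance(demo_tokens, 'dumpty')
for line in concordance_lines:
    print(line)
```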

Next, we will display concordances for a particular token, i.e., all contexts a particular token appears in. We can do this using the `Text` class in NLTK’s `text` package. We can represent our list of lowercased tokens in the document collection loaded previously using the `Text` class. The concordance list of a token can be displayed using the `.concordance()` method on this class as shown below.

Note that the process of acquiring concordance data will take about 10 seconds, depending on how busy your machine currently is.

In [None]:
from nltk.text import Text

In [None]:
text_of_the_corpus = Text(lower_corpus_tokens)
# concordance() prints its results itself and returns None, so no print() is needed
text_of_the_corpus.concordance('woman')

In the output for the next bit of code, which creates a concordance list for the word 'he', we can see that there are many more results in the list than displayed on screen ('Displaying 25 of 22830 matches'). The `concordance()` method only prints the first 25 results by default (or fewer if there are fewer matches).

In [None]:
text_of_the_corpus.concordance('he')

In [None]:
# You can specify the number of lines using 
# an additional lines parameter, e.g.,
text_of_the_corpus.concordance('he', lines=170)

# Notice that when the result of a cell is too long, it will become its own little scrollable window.

#### 🐛Minitask: combining what we learned so far

This task will require you to copy-paste and adjust various lines of code from this notebook. We will load and analyse a collection of Barack Obama speeches. 

- Load a new corpus: a few selected speeches of Barack Obama located in the folder `./data/barack_obama_speeches`. 
- Turn corpus data into tokens (with `.words()`)
- Lowercase all the tokens using a list comprehension loop (`[output for something in list_of_somethings]`) and `.lower()`
- Create a Text object from the lowercased tokens with  `Text( your_lowercased_tokens )`
- Create concordance lists for some words that you find interesting, e.g., 'hope', 'can', 'people'

In [None]:
# Write your answer here:

<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

    ### BEGIN SOLUTION
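    from nltk.corpus import PlaintextCorpusReader  # reads a folder of plain-text files
    corpus_root = "./data/barack_obama_speeches"
    corpus_data = PlaintextCorpusReader(corpus_root, '.*', encoding='latin1') 
    corpus_tokens = corpus_data.words()
    # Lowercase every token before building the Text object
    lower_corpus_tokens = [token.lower() for token in corpus_tokens]
    corpus_text_object = Text(lower_corpus_tokens)
    print(corpus_text_object[:50])
    # concordance() prints its results itself, so no print() is needed
    corpus_text_object.concordance('people')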
    ### END SOLUTION

</details>