# Text Mining

## Objective

After completing this lab you will understand several basic techniques for analysing text data to find useful information.

---
---

### Why Do Text-Mining By Programming?

You could do a lot of text mining manually, by searching and counting through text by eye, or using the 'find and replace' function in a text editor. You could count and input numbers on a spreadsheet and do your analysis with formulae. However, with a large corpus of many thousands or millions of words these tasks are error-prone, boring and mind-boggling. It may even be impossible in the time available.

You could use some of the specialist software tools out there for cleaning and exploring a corpus. These are definitely an option, and they can be used in combination with manual and programming techniques, but there are several advantages to programming your text-mining:

* Automation: coding automates boring and difficult tasks that are hard for humans to do.
* Reproducibility: code both executes and unambiguously documents the steps to your results.
* Clarity: coding forces you to understand exactly what you are doing with your text, promoting deep knowledge of the techniques you are using.
* Bespoke: coding your own solution means you can design it to meet exactly what your research questions demand.
* Advanced: coding may be the only way to do certain advanced analysis techniques or analyse extremely large datasets.

---
---

## Simple String Manipulation in Python

This section introduces some basic things you can do in Python to create and manipulate strings. A string is a simple *sequence of characters*, for example, the string `coffee` is a sequence of the individual characters `c` `o` `f` `f` `e` `e`. Strings are the way that Python (and most programming languages) deal with text.

### Creating and Storing Strings with Names
Strings are simple to create in Python. You can simply write some characters in quote marks (either single `'` or double `"` is fine in general).

In [None]:
'Butterflies are important as pollinators.'

In order to do something useful with this string, other than print it out, we need to store it by using the assignment operator `=` (equals sign). Whatever is on the right-hand side of the `=` is stored with the _name_ on the left-hand side.

In [None]:
my_sentence = 'Butterflies are important as pollinators.'

*Notice that nothing is printed to the screen.*

That's because the string is stored with the name `my_sentence` rather than being printed out. In order to see what is 'inside' `my_sentence` we can simply write `my_sentence` in a code cell, run it, and the interpreter will print it out for us.

In [None]:
my_sentence

### Slicing Bits of Strings

#### Accessing Individual Characters
A string is just a sequence of characters, each with a numbered position called its *index*. You can access **individual characters** in a string by giving the index of the one you want in square brackets.

In [None]:
my_sentence[1]

**Hang on a minute!** Did you notice something unexpected?

Why did it give us `u` instead of `B`?

In programming, everything tends to be *zero indexed*, which means that things are counted from 0 rather than 1. Thus, in the example above, `1` gives us the *second* character in the string, not the first like you might expect.

If you want the first character in the string, you need to specify the index `0`! 

In [None]:
my_sentence[0]

#### Accessing a Range of Characters

You can also pick out a **range of characters** from within a string, by giving the *start index* followed by the *end index* with a colon (`:`) in between.

The example below gives us the character at index `0` all the way up to, *but not including*, the character at index `20`.

In [None]:
my_sentence[0:20]
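A few more slicing variations may be useful here. The sketch below uses a short made-up string (`drink`, which is not part of the lab data) so that the results are easy to check by eye. Leaving out the start or end index, and using negative indices, are standard Python features.

```python
# 'drink' is a made-up example string, used only for illustration
drink = 'coffee'

first_three = drink[0:3]   # indices 0, 1 and 2 (index 3 is not included)
from_start = drink[:3]     # leaving out the start index means "from the beginning"
to_end = drink[3:]         # leaving out the end index means "to the end"
last_char = drink[-1]      # negative indices count backwards from the end

print(first_three, from_start, to_end, last_char)
```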

### Changing Whole Strings with Methods
Python strings have some built-in *methods* that allow you to change a whole string at once. You can change all characters to lowercase or uppercase:

In [None]:
my_sentence.lower()

In [None]:
my_sentence.upper()

NB: These methods do not change the original string but create a new one. Our original string is still the same as it was before:

In [None]:
my_sentence
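To keep the changed version, store the result under a name of its own. A minimal sketch, using a made-up string rather than `my_sentence`:

```python
# Made-up example string, for illustration only
greeting = 'Hello World'

# lower() returns a new string; we store it with a new name
greeting_lower = greeting.lower()

print(greeting_lower)  # the new, lowercased string
print(greeting)        # the original is unchanged
```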

### Testing Strings with Methods

You can also test a string to see if it passes some test, e.g. does the string contain alphabetic characters only?

In [None]:
my_sentence.isalpha()

Why does this produce this particular result?

Here's another. Does the string have the letter `p` in it?

In [None]:
'p' in my_sentence

---

#### Going Further with Python Documentation

Everything you can do in Python is well-documented online. It is a skill and art to read code documentation, and you should start to learn it as soon as you can on your code journey.

Here is a link to all the methods you can use with strings: 
https://docs.python.org/3/library/stdtypes.html#string-methods

Why not try a method we have not used here so far?
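As a starting point, here is a small sketch of three methods from that page: `count()`, `replace()` and `split()`. The example string is made up for illustration.

```python
phrase = 'the cat sat on the mat'

# count() returns how many times a substring occurs
the_count = phrase.count('the')

# replace() returns a new string with every match swapped
swapped = phrase.replace('cat', 'dog')

# split() with no arguments breaks the string on whitespace and returns a list
parts = phrase.split()

print(the_count)
print(swapped)
print(parts)
```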

---
---

## 5 Steps of Text-Mining
There is no set way to do text-mining, but typically a workflow will involve steps like these:
1. Choosing and collecting your text
2. Cleaning and preparing your text
3. Exploring your data
4. Analysing your data
5. Presenting the results of your analysis

You may go through these steps more than once to refine your data and results, and frequently steps may be merged together. The important thing to realise is that steps 1-2 are critical in ensuring your data is capable of actually answering your research questions. You are likely to spend significant time on cleaning and preparing your text.

> **Rubbish in = rubbish out**

---
---
## Step 1: Choosing and Collecting Your Text
No matter your research subject, you need to be aware of the many issues of electronic data collection. We cannot cover them all here, but you should ask yourself some questions as you start to collect data, such as:
* What sort of data do I need to answer my research questions?
* What data is available?
* What is the quality of the data?
* How can I get the data?
* Am I allowed to use it for text-mining?

### A Simple Example: Top Words Used in Homer's Iliad

Our research question will be:

> What are the top 10 words used in Homer's Iliad in English translation?

#### What sort of data do I need to answer my research questions?

I need a copy of Homer's Iliad in English translation. In this instance, I am not bothered by which translation.

#### What data is available?

[Project Gutenberg](http://www.gutenberg.org/) is the first provider of free electronic books and has over 58,000. "You will find the world's great literature here, with focus on older works for which U.S. copyright has expired. Thousands of volunteers digitized and diligently proofread the eBooks, for enjoyment and education."

Here is Homer's Iliad, translated by Alexander Pope, in an edition dated 1899: http://www.gutenberg.org/ebooks/6130

#### What is the quality of the data?

Potentially variable. When some books are digitised by OCR ([Optical Character Recognition](https://en.wikipedia.org/wiki/Optical_character_recognition)) they don't get corrected before being published online, but a quick look at this file shows that it is excellent quality.

#### How can I get the data?

We can access the text file at:
http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/6/1/3/6130/6130-0.txt

#### Am I allowed to use it for text-mining?

Project Gutenberg says in their [Permission: How To](http://www.gutenberg.org/wiki/Gutenberg:Permission_How-To) that "The vast majority of Project Gutenberg eBooks are in the public domain in the US." However, since UK copyright is different from US copyright, we still have to check for ourselves. This is a complicated area, but broadly we can say that UK copyright expires 70 years after the death of the author. Since [Alexander Pope](https://en.wikipedia.org/wiki/Alexander_Pope) died in 1744, we are probably ok to use his work.

### Getting a Copy of the Homer's Iliad Text
We can use a Python library called `requests` to get the content of web pages. We can therefore get a copy of the text file like this:

In [None]:
import requests
iliad_url = 'http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/6/1/3/6130/6130-0.txt'
response = requests.get(iliad_url)
iliad = response.text
iliad[18254:18621]

We can find out how many characters the file has by using the `len()` function.

In [None]:
len(iliad)

We can search for a particular string in the file. The string method `find()` returns the index of the _first_ match it finds, or `-1` if there is no match.

In [None]:
word = 'shield'
iliad.find(word)
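`find()` also accepts an optional second argument giving the index to start searching from. The sketch below uses this, together with the `-1` "not found" result, to collect every position of a word. The helper `find_all` and the sample line are hypothetical, written only for this illustration.

```python
def find_all(text, word):
    """Return a list of every index at which word starts in text."""
    positions = []
    index = text.find(word)
    while index != -1:                       # -1 means "no more matches"
        positions.append(index)
        index = text.find(word, index + 1)   # carry on searching after this match
    return positions

# Made-up sample line, standing in for the full text
sample = 'his shield rang against the shield of his foe'

print(find_all(sample, 'shield'))
print(sample.find('spear'))
```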

---
---

## Steps 2 and 3: Cleaning and Exploring Your Data
We are going to combine these two steps in this workshop.
### Inspecting and Preparing the Text
The first thing to do is inspect the text and see what might need sorting out. Looking again at the text by eye (http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/6/1/3/6130/6130-0.txt) you can see that the book starts with a load of front matter we don't want.

The book actually starts after the text "`***START OF THE PROJECT GUTENBERG EBOOK THE ILIAD OF HOMER***`":

In [None]:
iliad[894:1045]

There is also unwanted matter at the end after "`***END OF THE PROJECT GUTENBERG EBOOK THE ILIAD OF HOMER***`" that we should get rid of too.

Why does the text have all these `\r` and `\n` in them?

### Creating and Preparing a Local Copy

It is not very efficient to keep making Web requests to the webpage, especially with a very large corpus. I have therefore downloaded a copy for us (`Iliad.txt`). We will use this local copy instead from now on.

I have also taken some steps to prepare the file on your behalf, to save us some time. In the spirit of full transparency and documentation here is what I have done:

* Removed the unwanted Gutenberg-related matter at the front and back of the book.
* Converted the character encoding from 'ISO 8859-1' to 'UTF-8'.

You don't need to worry about the details of _character encoding_ for this workshop. You only need to know that Python works most easily with UTF-8 files and so we must have the file in that encoding to avoid problems.
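The front and back matter could also be removed in code rather than by hand. The following is only a sketch of one way to do it, not a record of how `Iliad.txt` was actually prepared: the helper `strip_gutenberg` is hypothetical, and the short `sample` string stands in for the real file.

```python
START_MARKER = '***START OF THE PROJECT GUTENBERG EBOOK THE ILIAD OF HOMER***'
END_MARKER = '***END OF THE PROJECT GUTENBERG EBOOK THE ILIAD OF HOMER***'

def strip_gutenberg(raw_text):
    """Return only the text between the start and end markers."""
    start = raw_text.find(START_MARKER)
    end = raw_text.find(END_MARKER)
    if start == -1 or end == -1:
        # Markers not found: hand the text back untouched rather than guess
        return raw_text
    return raw_text[start + len(START_MARKER):end].strip()

# A tiny stand-in for the downloaded file
sample = ('Front matter\r\n' + START_MARKER + '\r\n'
          'The body of the book\r\n' + END_MARKER + '\r\nLicence text')

print(strip_gutenberg(sample))
```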

### Tokenising the Text
Now we are ready to start preparing and exploring our text. _Tokenising_ means splitting a text into meaningful elements, such as **words, sentences, or symbols**.
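To get a feel for what tokenising involves, here is a deliberately naive sketch that uses only built-in Python. It is not how NLTK tokenises text; it simply shows why splitting on spaces alone is not enough. The sample sentence is made up.

```python
import re

sentence = "The poet sang, and the crowd listened."

# Splitting on whitespace leaves punctuation stuck to the words
naive_tokens = sentence.split()

# A simple regular expression: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(naive_tokens)
print(regex_tokens)
```

Real text contains contractions, abbreviations and quotation marks that simple rules like these handle badly, which is why a purpose-built tokeniser is worth using.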

To do this we use a simple facility provided by the Natural Language Toolkit (NLTK) to read in the file and a function to do the tokenising for us. The code example below takes a single file and tokenises it. Remember NLTK is a library we need to `import` in order to use it in our code.

> **Important!**

> **The following code is the hardest code that will be presented in this notebook. You do not need to understand everything here so please don't lose heart at this point! 💖**

In [None]:
!pip install nltk

In [None]:
import nltk

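# Download the tokeniser data
nltk.download('punkt')
# Note: recent NLTK releases may ask for the 'punkt_tab' resource instead;
# if word_tokenize reports it as missing, run nltk.download('punkt_tab')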

In [None]:
# Get a plain text reader
from nltk.corpus.reader import PlaintextCorpusReader
reader = PlaintextCorpusReader('.', '')

# Read the text file
iliad_file = 'Iliad.txt'
text = reader.raw(iliad_file)
text[0:700]

In [None]:
# Import the tokeniser
from nltk import word_tokenize

# Tokenise the text and print the first 20 tokens
tokens = word_tokenize(text)
tokens[0:20]

You can also `import` and use the sentence tokeniser `sent_tokenize` instead. Try this yourself.

In [None]:
# Write your code here
from nltk import sent_tokenize

# Tokenise the text and print the first 20 sentences
sent_tokens = sent_tokenize(text)
sent_tokens[0:20]

There are a number of problems with these tokens: the capitalisation of the words has been preserved, and some of the tokens have unwanted special characters or comprise single items of punctuation.

### Normalising to Lowercase
Normalising all words to lowercase ensures that the same word in different cases can be recognised as the same word, e.g. we want 'Shield', 'shield' and 'SHIELD' to be recognised as the same word.

However, whether you choose to do this depends on the nature of your corpus and the questions you are investigating. For example, in another case, you may not want the word 'Conservative' to be conflated with the word 'conservative'.

In our case, we will lowercase the whole file immediately before tokenising it:

In [None]:
text_lower = text.lower()
tokens = word_tokenize(text_lower)
tokens[0:20]

### Removing Punctuation
Punctuation such as commas, full stops and apostrophes can complicate processing a corpus. For example, if punctuation is left in, the words "poet" and "poet," might be considered to be different words.

This is a complicated matter, however, and what you choose to do would vary depending on the nature of your corpus and what questions you wish to ask.

It may be appropriate to remove punctuation at different stages of processing. In our case we are going to remove it *after* the text has been tokenised.

We will replace *all* punctuation with the empty string ''.

In [None]:
# Import a module that helps with string processing
import string

# Make a table that 'translates' all punctuation to None (i.e. empty) 
table = str.maketrans('', '', string.punctuation)
punc_table = {chr(key):value for (key, value) in table.items()}
punc_table

In [None]:
tokens_nopunct = [token.translate(table) for token in tokens]
tokens_nopunct[0:20]
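The same kind of translation table can be tried on a handful of made-up tokens to see exactly what it does. Note that a token consisting only of punctuation becomes the empty string `''`.

```python
import string

# Three-argument form: every character in the third argument is deleted
table = str.maketrans('', '', string.punctuation)

# Made-up tokens, for illustration only
sample_tokens = ['poet', 'poet,', "o'er", '!', 'well-known']
cleaned = [token.translate(table) for token in sample_tokens]

print(cleaned)
```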

### Removing Non-Word Tokens

We are still left with some problematic tokens that are not useful words, such as empty tokens `''` and tokens that may be chapter numbers:

In [None]:
tokens_empty = [word for word in tokens_nopunct if word == '']
tokens_empty[0:10]

In [None]:
tokens_nonwords = [word for word in tokens_nopunct if word.isnumeric()]
tokens_nonwords[0:10]

In [None]:
words = [word for word in tokens_nopunct if word.isalpha()]
words[0:20]
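To see how these three filters differ, the sketch below applies the same tests to a small made-up list. An empty string fails both `isalpha()` and `isnumeric()`, and so does a mixed token such as `'book1'`, so keeping only the `isalpha()` tokens drops all of them.

```python
# Made-up tokens, for illustration only
sample = ['achilles', '', '24', 'book1', 'troy']

empty = [tok for tok in sample if tok == '']
numeric = [tok for tok in sample if tok.isnumeric()]
alphabetic = [tok for tok in sample if tok.isalpha()]

print(empty)
print(numeric)
print(alphabetic)
```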

In [None]:
# Store the clean words in a file
clean_text = " ".join(words)

# The 'with' block closes the file automatically once writing is finished
with open("iliad_clean.txt", "w") as iliad_clean_file:
    iliad_clean_file.write(clean_text)

---
---
## Steps 4 and 5: Analysing Data and Visualising Results Using Python

In [None]:
# Loading data from file
text_file = 'iliad_clean.txt'
words = []

# Open the text file and append all the words to a list of words
with open(text_file) as file:
    for word in file.read().split():
        words.append(word)

words[0:20]

---
---
## Step 4: Analysing your Data with Frequency Analysis
Well done on making it this far! Let's take a moment to remember our research question:

> What are the top 10 words used in Homer's Iliad in English translation?

In order to answer this question we need to _count_ how many times _each unique word_ occurs in the text. Then we can see which 10 words are the most frequent. The resulting table of words and their counts is called a **frequency distribution**.
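The idea can be sketched on a tiny made-up list using `collections.Counter` from Python's standard library. This is a stand-in for illustration rather than the NLTK tool used later in this lab.

```python
from collections import Counter

# Made-up word list, for illustration only
sample_words = ['shield', 'spear', 'shield', 'ship', 'spear', 'shield']

# Count how many times each unique word occurs
counts = Counter(sample_words)

# most_common(n) lists the n most frequent items, highest count first
print(counts.most_common(2))
```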

### English Stopwords
Before we start, we need to take a moment to think about what sort of words we are actually interested in counting. 

We are not interested in common words in English that carry little meaning, such as 'the', 'a' and 'its'. These are called **stopwords**. There is no definitive list of stopwords, but a commonly-used list is provided by the Natural Language Toolkit (NLTK).

Let's do this in 4 steps:

1. We start by downloading the NLTK list of all stopwords:

In [None]:
import nltk

nltk.download('stopwords')

2. Then we import the list of stopwords we just downloaded, and get just the English stopwords.

In [None]:
from nltk.corpus import stopwords

english_stops = stopwords.words('english')

sorted_english_stops = sorted(english_stops)
sorted_english_stops[0:20]

3. Before using the stopwords, we will also remove all the punctuation from them so that they match the text we already cleaned:

In [None]:
import string

# Make a table that 'translates' all punctuation to None (i.e. empty) 
table = str.maketrans('', '', string.punctuation)

english_stops_nopunct = {stopword.translate(table) for stopword in english_stops}
english_stops_nopunct

4. Finally, we filter out all the English stopwords from the tokens:

In [None]:
words_nostops = [word for word in words if word not in english_stops_nopunct]
words_nostops[:20]

**Beautiful!** 😃
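The filtering in steps 3 and 4 can be tried out without NLTK on a hand-made example. The three stopwords below are a hypothetical mini-list chosen for illustration; the real NLTK list is much longer.

```python
import string

# Hypothetical mini stopword list, for illustration only
mini_stops = ['the', "don't", 'of']

# Strip punctuation so that "don't" matches the cleaned token 'dont'
table = str.maketrans('', '', string.punctuation)
mini_stops_nopunct = {stopword.translate(table) for stopword in mini_stops}

# Made-up cleaned tokens
sample_words = ['the', 'wrath', 'of', 'achilles', 'dont', 'yield']
kept = [word for word in sample_words if word not in mini_stops_nopunct]

print(kept)
```

Storing the stopwords in a set (the curly braces) makes each `not in` check fast, which matters when filtering a long list of words.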

### Creating a Frequency Distribution
At last, we are ready to create a frequency distribution. We will use another NLTK facility called `FreqDist` to count the frequency of each unique word in the text.

First, we create a frequency distribution:

In [None]:
from nltk.probability import FreqDist
freqdist = FreqDist(words_nostops)

Here are the top 10 most frequent words (the numbers are the absolute word count):

In [None]:
freqdist.most_common(10)

> Rather amazingly, that is it! We have now answered our research question and can submit our report. Congratulations! 🎉

---
---
## Step 5: Presenting the Results of Your Analysis Visually
But wait, we need a pretty graph for the examiners! Let's display our results as a simple line plot using the library [Matplotlib](https://matplotlib.org/).

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [15, 10]
plt.style.use('fivethirtyeight')
plt.suptitle("Top 10 Words used in Homer's Iliad in English translation")

freqdist.plot(10)
plt.show()

Now that you have seen the data and graph we have generated, no doubt you can see many ways we should improve. What immediately jumps out at you?

Text-mining a corpus (or an individual text) is an iterative process. As you clean and explore the data, you will go back over your workflow again and again, from the collection stage through to cleaning, analysis and presentation.

**Fortunately, as you have done all your text-mining in code, you know exactly what you did and can rerun and modify the process.**

---
### Going Further: Libraries Libraries Libraries

By now, you will be getting the idea that much of what you want to do in Python involves importing libraries to help you. Remember, libraries are _just code that someone else has written_.

As a reminder, here are some of the useful libraries we have used or mentioned in this lab:
* [Requests](http://docs.python-requests.org/en/master/) - HTTP (web) requests library
* [Natural Language Tool Kit (NLTK)](http://www.nltk.org/) - natural language analysis library
* [Matplotlib](https://matplotlib.org/) - 2D plotting library

---
---
## Summary

Finally, we have achieved text-mining nirvana! Let's recap.

We have: 

* Loaded our clean text data from a file into a list
* Removed English stopwords from the list of tokens
* Created a frequency distribution and found the 10 most frequent words
* Visualised the frequency distribution in a line plot

🎉🎉🎉