# Loading and Tokenising Text

In this notebook:

1. Loading a text file from your notebook or from a website
2. Tokenising strings - splitting it into tokens (words etc.)

### Recap of syntax:

In [None]:
# create a list and get the first item and its length
my_letters = ["A","B","C","D","E"] 
print(len(my_letters))
print(my_letters[0])

In [None]:
# slice of items: from the beginning, leaving out the last two items
print(my_letters[:-2]) 

In [None]:
# 'comprehend' a list as its lowercase values 
my_letters_lowercased = [letter.lower() for letter in my_letters]
print(my_letters_lowercased) 

In [None]:
# also: run this cell now. It's the usual imports of text mining libraries

import nltk
import numpy
import string
import matplotlib.pyplot as plt

# 1. Loading a text file from your notebook or from a website

#### Questions & Objectives:

- How can I load a text file from my hard drive or a website?

#### Key Points

- To open and read a file on your computer, we use `open()`, `read()` and `close()`
- To open and read a file from the internet, we use `urllib.request.urlopen()` and `.read().decode('utf-8')`
- Once the file is opened, you can store its contents in a variable

Broadly speaking there are two contexts in which we load a text file for analysis:

- Local file: you have your file on your 'virtual computer/hard drive' because you created or downloaded it earlier
- Remote file: you access the file directly from some website, 'on the fly', processing it with your code, but never really saving it as your own (e.g. for copyright or convenience reasons)

### Loading a local file:

First let's load a file from your 'hard drive'. Because we are working inside Noteable, it acts as your hard drive. There's a file you downloaded earlier, and the cell below refers to it by its path relative to this notebook: `./data/barack_obama_speeches/inaugural_speech.txt` (the leading `./` means 'same folder as this notebook').

In [None]:
file_name = "./data/barack_obama_speeches/inaugural_speech.txt"
my_file = open(file_name) # open the file. 
speech = my_file.read() # read content of it and put them in a variable
my_file.close() # close the file

# after that you have access to that file as text in the speech variable you created
print("number of characters:", len(speech)) 
print(speech[:50]) # first 50 characters
print(speech[-50:]) # last 50 characters

### Loading a remote (online) file:
To read the same file from an online source (for example from the White House website) we need to import a URL-handling library, `urllib`, but the process is very similar.

In [None]:
import urllib.request # you only have to do this once; plain `import urllib` does not reliably load the `request` submodule
link = "https://raw.githubusercontent.com/drpawelo/efi_text_mining_bootcamp/master/data/inaugural_speech_obama.txt"
my_file = urllib.request.urlopen(link) # download the file (no separate open step needed)
speech = my_file.read().decode('utf-8') # read and decode content and save it

# after that you have access to that file as text.
print(len(speech)) # how long is it?
print(speech[:50]) # first 50 characters
print(speech[-50:]) # last 50 characters
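
The download-and-decode steps above can also be wrapped in a small helper. This is only a sketch: the function name `load_remote_text` and the sample text are made up for illustration. To keep it runnable without an internet connection, it reads a temporary file through a `file://` address, which `urlopen()` treats like any other link; in practice you would pass in a web link such as the one above.

```python
import tempfile
import urllib.request
from pathlib import Path


def load_remote_text(link, encoding="utf-8"):
    """Fetch the file behind `link` and return its content as a string."""
    # the 'with' block closes the connection once reading is done
    with urllib.request.urlopen(link) as response:
        return response.read().decode(encoding)


# offline demonstration: write a small file, then address it with a file:// link
with tempfile.TemporaryDirectory() as folder:
    sample_path = Path(folder) / "sample.txt"
    sample_path.write_text("My fellow citizens", encoding="utf-8")
    sample_text = load_remote_text(sample_path.as_uri())

print(len(sample_text))   # how long is it?
print(sample_text[:9])    # first 9 characters
```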

### What's similar/different:

Notice the similarities and differences between the two methods:

**GET LIBRARIES/TOOLS** on top of what python already gives you. You do this only once per notebook.

- local: nothing extra is needed, `open()` is built into Python
- remote: `import urllib.request`

**OPEN**: Both methods need a name/address of the file (in folders, or on the website)

- local:  `open(file_name)`
- remote: `urllib.request.urlopen(link)`

**READ**: In both methods, once you have access to the file, you need to READ its content and put it in a string variable. A remote file arrives as raw bytes, which can be in various 'encodings' (ways of representing characters, including special characters and punctuation), so we decode them, usually as `UTF-8` (Unicode Transformation Format, 8-bit), the most common encoding on the web. Another common one is `latin1`.

- local: `my_file.read()`
- remote: `my_file.read().decode('utf-8')`
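
To see why the encoding matters, here is a small sketch that decodes the same bytes in two ways. The word "café" is just an illustrative example (it is not taken from the speech); the point is that picking the wrong encoding silently garbles special characters.

```python
# a download gives you bytes, not text; here we create some bytes ourselves
raw_bytes = "café".encode("utf-8")

decoded_utf8 = raw_bytes.decode("utf-8")      # the matching encoding
decoded_latin1 = raw_bytes.decode("latin1")   # the wrong encoding for these bytes

print(raw_bytes)
print(decoded_utf8)     # café
print(decoded_latin1)   # cafÃ©  (the accented letter turned into two characters)
print(len(decoded_utf8), len(decoded_latin1))
```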

**CLOSE**: in the code above, only the local file is closed explicitly once we've read it. Closing tells the operating system that we are done with the file, which frees the resources it was holding and avoids problems if another script or user wants to work with the file later.

- local: `my_file.close()`
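
As an alternative to calling `close()` yourself, Python's `with` statement closes the file automatically when the indented block ends. The sketch below is self-contained, so it first creates a small temporary file; the file name `example_speech.txt` and its content are made up for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    # create a small example file so this cell runs on its own
    file_name = Path(folder) / "example_speech.txt"
    file_name.write_text("Thank you. God bless you.", encoding="utf-8")

    # open, read, and close in one step: the file is closed when the block ends
    with open(file_name, encoding="utf-8") as my_file:
        speech_sample = my_file.read()

    print(my_file.closed)   # True: no explicit close() was needed

print(speech_sample)
```

Passing `encoding="utf-8"` to `open()` plays the same role for local files as `.decode('utf-8')` does for remote ones.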

### 🐛Minitask

Write some Python to open the following online file and display the characters between indices 42380 and 42869 in that file (don't peek at what's in the file). Do you recognise which play this text is from?

http://www.gutenberg.org/files/1513/1513-0.txt

In [None]:
# write your answer here:





### So now we have a long string... what's next?

As we can see, **characters** are not a particularly useful unit for measuring the length of a text or for accessing parts of it. It would be more meaningful to ask for the first 10 words, or the last 10 words. Indeed, we might want to include punctuation and symbols too.

This is where tokens come in:

# 2. Tokenising strings - splitting it into tokens (words etc.)

#### Questions & Objectives:

- What is tokenisation?
- How can a string of raw text be tokenised?

#### Key Points

- Tokenisation means to split a string into separate words and punctuation, for example to be able to count them.
- Text can be tokenised using a tokeniser, e.g. the punkt tokeniser in NLTK.

In order to process text we need to break it down into tokens. As we explained at the start, a token is a letter, word, number, or punctuation which is contained in a string.
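
To get a feel for what a token is before using NLTK, here is a rough sketch in plain Python. The regular expression is a simplified stand-in chosen for illustration; it is not how NLTK's tokeniser works and it will split some words (such as "king's") differently.

```python
import re

rhyme_line = "Humpty Dumpty sat on a wall,"

# splitting on spaces leaves punctuation glued to the word before it
split_on_spaces = rhyme_line.split()

# rough tokens: runs of letters/digits, or single punctuation marks
rough_tokens = re.findall(r"\w+|[^\w\s]", rhyme_line)

print(split_on_spaces)   # the last item is 'wall,'
print(rough_tokens)      # 'wall' and ',' are separate tokens
```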

To tokenise, we first need to import the `word_tokenize` function from NLTK's `tokenize` module, which allows us to do this without writing the code ourselves.

We will also download a specific tokeniser that NLTK uses as default. There are different ways of tokenising text and today we will use NLTK’s in-built punkt tokeniser by calling:

In [None]:
# run this cell now. (it's fine if you see some pink warnings underneath it)

from nltk.tokenize import word_tokenize
nltk.download('punkt')

Let's tokenise (split into tokens) a nursery rhyme "Humpty Dumpty".

We will save the tokenised output in a list using the `humpty_tokens` variable and can inspect it.

In [None]:
humpty_string = "Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall; All the king's horses and all the king's men couldn't put Humpty together again."
humpty_tokens = word_tokenize(humpty_string)
print(humpty_tokens) # print all tokens

In [None]:
# let's print just a few of them to have a closer look:
print(humpty_tokens[0:10])

### Unifying and cleaning up the text

To further analyse the data, we'll first learn how to perform some cleanup tasks.

As you can see in the above example, some of the words are uppercase and some are lowercase. But Python is case-sensitive, which means that 'hope' and 'Hope' are considered to be two completely different strings.

For example, when searching for a word or counting its occurrences, we will most likely want to find both the lowercase and the uppercase versions of the word (e.g. `company` and `Company`). That's why, to simplify the analysis, we often normalise the data and make it all lowercase. This way both of the above words simply become `company`, which makes the text easier to analyse.
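
A quick sketch of what case sensitivity means in practice. The example sentence is made up for illustration.

```python
# the same word with different capitalisation is a different string
same_before = ("Hope" == "hope")
same_after = ("Hope".lower() == "hope")
print(same_before)   # False
print(same_after)    # True

# counting is affected in the same way
sentence = "Company policy says the company pays."
count_before = sentence.count("company")
count_after = sentence.lower().count("company")
print(count_before)  # 1
print(count_after)   # 2
```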

Since our list of tokens is basically a list of strings (words), we can apply the list comprehension we learned about before to transform our list of mixed-case words into a list of lowercase words.

As you might remember, the syntax for such a loop is `[output_format for item in items]` where:

- 'output_format' is some operation we perform on item, like `item.lower()` or `len(item)`
- 'items' is the List with all the elements we want to transform
- 'item' is a temporary name we give to each element of items, for the purposes of using that name inside of output_format
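
The three parts described above can be seen in this small sketch, which uses `len(item)` as the output format and compares the comprehension with an ordinary `for` loop that does the same job. The list of words is made up for illustration.

```python
items = ["Humpty", "Dumpty", "sat", "on", "a", "wall"]

# list comprehension: [output_format for item in items]
lengths = [len(item) for item in items]

# the same result written as an ordinary for loop
lengths_with_loop = []
for item in items:
    lengths_with_loop.append(len(item))

print(lengths)             # [6, 6, 3, 2, 1, 4]
print(lengths_with_loop)   # the same list
```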

Let's modify the above example so that we only get lowercase tokens of the nursery rhyme:

In [None]:
humpty_string = "Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall; All the king's horses and all the king's men couldn't put Humpty together again."
humpty_tokens = word_tokenize(humpty_string)
lowercase_tokens = [token.lower() for token in humpty_tokens]
print(lowercase_tokens)


### 🖇💬Buddy discussion: What would be the coolest text dataset to analyse?

#### Ask your buddy now if they reached the **BUDDY TASK**. Once you both did, complete this task:

Can each of you come up with ONE EXAMPLE of a text source that you would LOVE to have access to and analyse? Don't worry if it would be very hard (or impossible) to acquire, just imagine you have a magic wand.

- e.g. all the chats in Edinburgh taxis
- e.g. 1000 most popular recipes for apple pie
- e.g. transcripts of all job interviews for academic jobs in the UK this year

Don't spend too much time on this (max 2 mins) but take note of your favourite idea.

### 🐛Minitask

Let's do our first, very simple piece of analysis. Do you think there were more mentions of "We" or "They" in the inaugural speech we looked at before?

Let's try to re-use some pieces of code we wrote before and do our first very simple analysis:

First without the lowercasing:

- copy-paste your code from before to load the speech of president Obama.
- use `word_tokenize()` on that variable, to turn it into a list of tokens.
- count all occurrences of the word 'we'. You can use the `a_list.count( a_word )` method like this: `how_many_we = speech_tokens.count('we')`.
- print how many there were.
- also count occurrences of 'they'

What is the proportion of the usage of these words?

- Now add a list comprehension that changes the list items into their lowercase equivalents. Do this after you tokenise the string, but before you do the counting.

Now which word is more frequent?
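
If you want to check how `count()` behaves before running it on the whole speech, you can try it on a toy list first. The tokens below are made up for illustration and are not taken from the speech.

```python
toy_tokens = ["We", "know", "that", "we", "can", "because", "they", "said", "We", "can"]

# count() only matches items that are exactly equal, including their case
we_before = toy_tokens.count("we")

# lowercase every token first, then count again
toy_tokens_lower = [token.lower() for token in toy_tokens]
we_after = toy_tokens_lower.count("we")
they_after = toy_tokens_lower.count("they")

print(we_before)              # 1
print(we_after, they_after)   # 3 1
```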

In [None]:
# write your solution here


### 🦋 Extra task (optional): if you have finished everything else already:

What other words could you look for? Do you think you could create a list of words, like `['hope', 'fear' ,'can', 'cannot']` and use a for loop to print counts of all of these words in the speech?

You can try to illustrate a particular point using data.