# In this notebook:

1. Simple **Counting Tokens**
2. Visualising **Frequency Distributions** (but first, **cleaning up the data**)
3. Advanced visualisation: **Wordclouds**

# 1. Counting Tokens

#### Questions & Objectives:

- How can I count tokens in text?

#### Key Points

- To count tokens, one can make use of NLTK’s FreqDist class from the probability package. The N() method can then be used to count how many tokens a text or corpus contains.
- Counts for a specific token can be obtained using fdist["token"].

In [None]:
# run this cell now. It's the usual imports of text mining libraries

import nltk
import numpy
import string
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize
nltk.download('punkt')

In [None]:
#let's load the India corpus and lowercase it. this will take a minute to run

from nltk.corpus import PlaintextCorpusReader
corpus_root = "./data/Medical_History_of_British_India"
corpus_reader = PlaintextCorpusReader(corpus_root, '.*', encoding='latin1') 
corpus_tokens = corpus_reader.words() 
print("loaded tokens:", len(corpus_tokens) )

corpus_tokens = [word.lower() for word in corpus_tokens] 
print("finished lowercasing")

## Counting tokens in text

You can also do other useful things like count the number of tokens in a text, determine the number and percentage count of particular tokens and plot the count distributions as a graph. To do this we have to import the FreqDist class from the NLTK probability package. When calling this class, a list of tokens from a text or corpus needs to be specified as a parameter in brackets.

In [None]:
# this will take a minute too
from nltk.probability import FreqDist
fdist = FreqDist(corpus_tokens)
print(fdist)

You can print the counts of most common tokens with `freq_dist_object.most_common( how_many )`

eg. `fdist.most_common(100)` for the most common 100 words

The results are sorted from the most frequent token downwards, and each token is paired with its frequency count.

In [None]:
print(fdist.most_common(100))

# fun fact: notice that this is a list of tuples [('word1',233), ('word2', 2324), ...]

In [None]:
# instead of printing, you can just return the value from the cell,
# it will be easier to read, but very long:
fdist.most_common(100)
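
The list of tuples returned by `most_common()` can be unpacked in a loop to print one token per line. Here is a minimal sketch on a made-up token list. It uses `collections.Counter` from the standard library (NLTK's `FreqDist` is a subclass of `Counter`, so `most_common()` behaves the same way); `toy_tokens` is a hypothetical stand-in for `corpus_tokens`:

```python
from collections import Counter

# hypothetical toy data standing in for corpus_tokens
toy_tokens = ["the", "hospital", "the", "report", "the", "hospital"]
toy_counts = Counter(toy_tokens)

# most_common(n) returns a list of (token, count) tuples, highest count first
top_two = toy_counts.most_common(2)

# unpack each tuple into two variables and print one token per line
for token, count in top_two:
    print(f"{token:<10} {count}")
```

The same loop should work on `fdist.most_common(100)` if you prefer a column of words to one long list.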

With `frequency_distribution.N()` we can count the total number of tokens in a corpus.

eg. `fdist.N()`

In [None]:
print(fdist.N())

With `fdist[ your_word ]` you can count the number of times a token appears in a corpus:

eg. `fdist['hospital']` returns `28280` which means that `word 'hospital' appears 28280 times in reports`

In [None]:
print(fdist['hospital'])
print(fdist['he'])
print(fdist['she'])

With `fdist.freq( your_word )` you can also determine the relative frequency of a token in a corpus, so what % of the corpus a term is:

eg. `fdist.freq('hospital')` returns `0.000997673635341749` which means that `about 0.1% of all words are 'hospital'`


In [None]:
print(fdist.freq('hospital'))
print(fdist.freq('he'))
print(fdist.freq('she')) # notice this fraction is so small that it goes into 'scientific notation e-05'
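
To make the relationship between these three numbers concrete, here is a small sketch of what they compute, written with the standard library's `Counter` on a made-up list (`toy_tokens` is a hypothetical stand-in for `corpus_tokens`). The relative frequency is simply the count of one token divided by the total number of tokens:

```python
from collections import Counter

# hypothetical toy data standing in for corpus_tokens
toy_tokens = ["the", "hospital", "the", "report", "the", "hospital"]
toy_counts = Counter(toy_tokens)

total_tokens = sum(toy_counts.values())          # what fdist.N() reports
hospital_count = toy_counts["hospital"]          # same idea as fdist['hospital']
hospital_share = hospital_count / total_tokens   # what fdist.freq('hospital') reports

print(total_tokens, hospital_count, hospital_share)

# format the fraction as a percentage, which is easier to read than e-05 notation
print(f"{hospital_share:.1%} of the toy tokens are 'hospital'")

# a token that never occurs has a count of 0 rather than raising an error
print(toy_counts["clinic"])
```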

#### Note on counting tokens that match a Regular Expression

If you have a list of tokens created using regular expression matching as in the previous section and you’d like to count them then you can also simply count the length of the list:

In [None]:
# this will take a minute
import re
womaen_strings = [word for word in corpus_tokens if re.search('^wom[ae]n$', word)]
print(len(womaen_strings))
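
The same idea on a tiny made-up list makes it easier to see which tokens match. This sketch assumes a hypothetical `toy_tokens` list and uses `re.fullmatch()`, which requires the whole token to match and so has the same effect here as wrapping the pattern in `^` and `$`:

```python
import re

# hypothetical toy data standing in for corpus_tokens
toy_tokens = ["woman", "women", "womanly", "man", "women", "washerwoman"]

# compile the pattern once, then reuse it for every token
pattern = re.compile(r"wom[ae]n")

# fullmatch() only succeeds if the entire token matches the pattern
matches = [token for token in toy_tokens if pattern.fullmatch(token)]

print(matches)
print(len(matches))
```

Longer words that merely contain the pattern, such as 'womanly' or 'washerwoman', are not counted.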

Frequency counts of tokens are useful for comparing different corpora in terms of occurrences of different words or expressions, for example to see whether a word appears much more rarely in one corpus than in another.

Counts of tokens, documents and an entire corpus can also be used to compute a simple pairwise similarity between two documents (later, have a look at Jana Vembunarayanan’s blogpost for a hands-on example of how to do that: https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/).
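
As a rough illustration of the similarity idea, here is a sketch of cosine similarity computed on raw token counts. Note the assumptions: it uses plain counts rather than the TF-IDF weighting described in the blogpost, and the function name `cosine_similarity` and the three toy documents are made up for this example:

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of two token lists, based on raw token counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # only tokens present in both documents contribute to the dot product
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is not similar to anything
    return dot / (norm_a * norm_b)

# hypothetical toy documents, already tokenised by splitting on spaces
doc_one = "cholera cases rose in the hospital".split()
doc_two = "cholera cases fell in the district".split()
doc_three = "rainfall was heavy this year".split()

sim_one_two = cosine_similarity(doc_one, doc_two)
sim_one_three = cosine_similarity(doc_one, doc_three)
print(sim_one_two, sim_one_three)
```

A value near 1 means the two documents use the same words in similar proportions, while 0 means they share no tokens at all.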

# 2. Visualising Frequency Distributions
# (but first, cleaning up the data)


#### Questions & Objectives:

- How can I draw a frequency distribution of the most frequent words in a collection?
- How can I visualise this data as a word cloud?

#### Key Points

- A graph of a frequency distribution can be drawn using the plot() method.
- In this episode you have also learned how to clean data by removing stopwords and other types of tokens from the text.
- A word cloud can be used to visualise tokens in text and their frequency in a different way.

## Visualising Frequency distributions of tokens in text

### Graph of the frequency of the words as they are:

The plot() method can be called to draw the frequency distribution as a graph for the most common tokens in the text.

In [None]:
fdist.plot(30,title='Frequency distribution for 30 most common tokens in our text collection')

You can see that the distribution contains a lot of non-content words like “the”, “of”, “and” etc. (we call these stop words) and punctuation. This is not very useful. Let's have a quick peek at what these words are:

In [None]:
fdist.most_common(100)

Yes, definitely not useful. Many of these words look like noise.

### 🐛Minitask:  Identify 3-4 categories of not-very-helpful tokens in the above set

Look at the above set of most popular tokens. Many of them look important and meaningful ('government', 'disease' etc.), but many others are not very useful.

- identify some families of unhelpful tokens and write names for these families below:

Do not spend too much time on this (max 2 minutes)

In [None]:
# here you can write your answer

#### Removing tokens that are just noise:

We can remove them before drawing the graph. We need to import stopwords from the corpus package to do this. The list of stop words is combined with a list of punctuation and a list of single digits using + signs into a new list of items to be ignored.

Here are some of Python's built-in 'cheat sheets' of punctuation and other 'meaningless characters', and some provided by the nltk library:

In [None]:
nltk.download('stopwords')
from nltk.corpus import stopwords

# let's have a look at the tokens that are usually discarded:
print(string.punctuation)
print(string.digits)
print(stopwords.words('english'))

In [None]:
# note: we turn each string into a list of characters, so it can be added to the other lists
print(list(string.punctuation))
print(list(string.digits))

In [None]:
# remove stopwords, punctuation and digits. the set( ... ) syntax removes duplicates

remove_these = set(stopwords.words('english') + list(string.punctuation) + list(string.digits))

filtered_text = [word 
                 for word in corpus_tokens 
                 if word not in remove_these]

# note: above broken-down 3-line version could be a one-liner (see below). 
# it's up to you, which format you prefer. above, or this:
# filtered_text = [word for word in corpus_tokens if word not in remove_these]


fdist_filtered = FreqDist(filtered_text)
print(fdist_filtered.most_common(30))
fdist_filtered.plot(30,title='Frequency distribution (excluding stopwords and punctuation)')

# this graph should be a bit better
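
If the filtering step feels abstract, the following sketch runs the same pattern on a handful of tokens so you can check the result by eye. `toy_stopwords` and `toy_tokens` are made-up stand-ins for the NLTK stop word list and `corpus_tokens`:

```python
import string

# hypothetical stand-ins for stopwords.words('english') and corpus_tokens
toy_stopwords = ["the", "of", "and", "in"]
toy_tokens = ["the", "number", "of", "cases", ",", "in", "1", "hospital", "."]

# combine all unwanted items into one set; duplicates disappear automatically
remove_toy = set(toy_stopwords + list(string.punctuation) + list(string.digits))

# keep only the tokens that are not in the set
kept = [token for token in toy_tokens if token not in remove_toy]
print(kept)
```

Using a set rather than a list for the items to remove is a deliberate choice: checking whether a token is in a set stays fast however many items it holds, which matters when the corpus has millions of tokens.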

### Manually adding more words to be ignored

This is already much better, but sometimes we want to add more words to ignore by hand. This is easy: we just need to add more elements to the `remove_these` set.

Let's look at the top 100 words again and see which of them are not particularly useful:

In [None]:
print( [word for (word,count) in fdist_filtered.most_common(100)])

There seem to be some elements here that are not meaningful for our analysis:

- rogue punctuation '...', '..', etc
- a lot of short numbers like '12' or '000'
- individual letters 'a', 'j' etc
- some other not very meaningful words 'also','would'

Let's create more lists of things we want to remove:

In [None]:
# I create a range of numbers from 0 to 100 and turn them into strings, so they are like '45' not like 45
numbers_1_to_100 = [str(integer) for integer in range(101)]
print(numbers_1_to_100)

In [None]:
# here are the weird punctuation tokens
extra_punctuation_to_remove = ['.', '..','...','....','.....','......', ').', '.,']
print(extra_punctuation_to_remove)

In [None]:
individual_letters = list(string.ascii_lowercase)
print(individual_letters)
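
Listing every unwanted token by hand works, but it misses anything you did not think of (a number above 100, say). As an alternative sketch, not the approach used in the rest of this notebook, noise can also be detected by rule. The helper name `looks_like_noise` and the toy tokens are made up for this example:

```python
import string

def looks_like_noise(token):
    """Return True for numbers, punctuation-only tokens and single letters."""
    if token.isdigit():
        return True  # '12', '000', '1871', ...
    if all(char in string.punctuation for char in token):
        return True  # '.', '....', ').' (an empty token also ends up here)
    if len(token) == 1 and token in string.ascii_lowercase:
        return True  # 'a', 'j', ...
    return False

# hypothetical toy data standing in for corpus_tokens
toy_tokens = ["12", "000", "....", ").", "j", "hospital", "fever"]
kept = [token for token in toy_tokens if not looks_like_noise(token)]
print(kept)
```

Rules like these are broader than a list, so the warning about careful cleaning applies even more strongly: check what is being removed before trusting the result.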

### 🐛Minitask: Identify more words that are potentially noise

In a minute we will remove tokens that are noise. Based on your previous minitask and the current most popular tokens:

- identify 10 more words that are most likely noise
- add them to the list `some_more_words_to_remove`, run the cell again and continue

In [None]:
# run this cell to see current 100 most popular tokens, as part of the task above 
# add some more words to some_more_words_to_remove
print( [word for (word,count) in fdist_filtered.most_common(100)])

some_more_words_to_remove = [ 'rs', 'per', 'would', '000']



#### Note about cleaning up data carefully
Sometimes you might find words that do not communicate the content, like 'would', and we are showing you how to remove them here.

But it comes with a warning: you have to be very careful when performing steps like these, because they have the potential to completely bias your data. At the same time, careful cleaning of messy datasets is very important.

Also: While it makes sense to remove stop words for this type of frequency analysis, it is essential to keep them in the data for other text analysis tasks. Retaining the original text is crucial, for example, when deriving part-of-speech tags of a text or for recognising names in a text.

Now we can redo the visualisation, this time using an expanded and customised list of items to ignore.
This should be a more meaningful graph!

In [None]:
# let's combine it all together and generate our new graph

remove_these = set(stopwords.words('english') + list(string.punctuation) + list(string.digits) 
        + numbers_1_to_100 + extra_punctuation_to_remove + individual_letters+some_more_words_to_remove)

filtered_text = [word 
                 for word in corpus_tokens 
                 if word not in remove_these]
    
fdist_filtered = FreqDist(filtered_text)
print(fdist_filtered.most_common(30))
fdist_filtered.plot(30,title='Frequency distribution (excluding stopwords and punctuation)')

# 3. Advanced visualisation: Wordclouds

## Basic wordcloud:

We can also present the filtered tokens as a word cloud. It's a modern type of graph where the sizes of words communicate their frequency, often used to get a quick overview of a corpus. Additionally, these word clouds can be shaped or customised, as we'll see below.

We will use the `WordCloud( ).generate_from_frequencies()` method.  The input to this method is a frequency dictionary of all tokens and their counts in the text.

You will also see another way to create a simplified frequency count of words. That's because wordcloud requires words to be in a dictionary format:

 `{'total': 94009, 'year': 82829, 'number': 63358, 'cases': 39876 .... }`
 
We will use the `Counter` class from Python's built-in `collections` module to create such a dictionary, using the `filtered_text` variable as input. Note that it offers fewer features than FreqDist (which is in fact built on top of it), but you might see it in other people's code, so we want you to be familiar with it.

Once we have the data in the correct format, we generate the WordCloud using the frequency dictionary and plot the figure to size. We can show the plot using `plt.show()`.

In [None]:
from collections import Counter
simple_frequencies_dict = Counter(filtered_text)

# let's have a peek into this dictionary. How many times the word 'hospital' appears in filtered_text?
print(simple_frequencies_dict['hospital'])
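
A `Counter` behaves like an ordinary dictionary, which is why it can be used wherever a frequency dictionary is expected. The sketch below uses a made-up `toy_filtered` list in place of `filtered_text` and also shows one way to keep only the most frequent tokens as a plain `dict`:

```python
from collections import Counter

# hypothetical toy data standing in for filtered_text
toy_filtered = ["cases", "year", "cases", "total", "cases", "year"]
toy_frequencies = Counter(toy_filtered)

# Counter is a dict subclass, so it can be used wherever a dict is expected
print(isinstance(toy_frequencies, dict))

# keep only the 2 most frequent tokens, as a plain dictionary
top_two_dict = dict(toy_frequencies.most_common(2))
print(top_two_dict)
```

Trimming the dictionary like this is optional; it is just one way to limit a word cloud to the tokens you care about most.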

#### note on installing extra software on your virtual machine with !pip

But first we'll show you one of the most powerful features of Jupyter notebooks: you can download and install almost any software from the internet into your 'virtual machine'. Because it is 'sandboxed', it is mostly safe. To install things we use the `!pip` Python package installer command.

In [None]:
# run this now, it will install wordcloud on your machine
!pip install wordcloud

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

cloud = WordCloud(max_font_size=80,colormap="hsv").generate_from_frequencies(simple_frequencies_dict)
plt.figure(figsize=(16,12))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

# you should see a colourful diagram below. It is generated on the fly for you. 
# go on, generate it again, you will see that it changes slightly

### 🖇💬Buddy discussion: What text data are wordclouds really good and really bad for?

#### Ask your buddy now if they reached the **BUDDY TASK**. Once you both did, complete this task:

Try to identify strengths and weaknesses unique to the usage of wordclouds.

Don't spend too much time on this (max 2 mins) but take note of your favourite idea.

## Fancy-shaped word cloud:

And now a shaped word cloud for a bit of fun. This will present your word cloud in the shape of a given image.

You need a shape file which we provide for you in the form of the medical symbol (it looks like two snakes wrapped around a staff with wings).

The mask image needs to have a transparent background so that only the black shape is used as a mask for the word cloud - parts that are black will be filled with your words, parts that are transparent will be left empty. You can use your own images later!

It would be fun if we could customise:

- the shape in which the words arrange themselves (e.g. circle, shape of UK, question mark)
- colours to use

The **geeky bits** about how we use images and colours **(feel free to skip reading them)**:

- To display the shaped word cloud you need to import the Image module from PIL as well as numpy. The image first needs to be opened and converted into a numpy array which we call medical_icon_mask_image. 
- A customised colour map (cmap) is created to present the words in black font. 
- Colours use the #RRGGBB format, where each pair of hexadecimal characters (0123456789ABCDEF) describes the amount of red, green and blue we want, e.g. #FF0000 means full red, no green, no blue. #000000 means none of the three, which is black. #FFFFFF is white, and #111111 and #222222 are shades of grey, getting brighter as the amount of each colour increases.
- Then the word cloud is created with a white background, the mask and the colour map set as parameters and generated from the dictionary containing the number of occurrences for each word.
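
If the #RRGGBB notation is new to you, this small sketch converts a few of the colours mentioned above into their red, green and blue amounts (0 to 255). The helper name `hex_to_rgb` is made up for this example:

```python
def hex_to_rgb(hex_colour):
    """Convert a '#RRGGBB' string into a (red, green, blue) tuple of integers."""
    digits = hex_colour.lstrip("#")
    # read two hexadecimal characters at a time and interpret them in base 16
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

for colour in ["#FF0000", "#000000", "#FFFFFF", "#111111", "#222222"]:
    print(colour, hex_to_rgb(colour))
```

The greys have equal amounts of all three colours, and the larger the number, the brighter the grey.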

In [None]:
from PIL import Image
import numpy as np
from matplotlib.colors import LinearSegmentedColormap

# masking image
medical_icon_mask_image = np.array(Image.open("./images/medical.png"))

# Custom Colormap
colors = ["#000000", "#111111", "#101010", "#121212", "#212121", "#222222"]
cmap = LinearSegmentedColormap.from_list("mycmap", colors)

# image details: background, shape-mask, colours 
wordcloud = WordCloud(background_color="white", mask=medical_icon_mask_image, colormap=cmap)

wordcloud.generate_from_frequencies(simple_frequencies_dict)
plt.figure(figsize=(16,12))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()



### 🦋 Extra task (optional): if you have finished everything else already:

- can you create a wordcloud of the presidential speeches corpus?
- then, can you use your own masking image?