# Visualising nursery rhymes

In this notebook we will look at two different ways that we can represent passages of text with numbers, and create visualisations for them.

We will be using nursery rhymes from the [nursery rhyme dataset](https://git.arts.ac.uk/tbroad/nursery-rhymes-dataset) for our visualisation. These were scraped from the website [allnurseryrhymes.com](https://allnurseryrhymes.com/), where you can find out more about lots of the rhymes. Many of them have interesting historical facts to go along with them.

The nursery rhymes dataset is a small dataset that we will be using for the next few weeks while we are learning the basics of natural language processing (NLP). Nursery rhymes are usually short, use simple language, and have repetitive vocabularies. Therefore they are the perfect thing to study while we are learning the basics of NLP!

In this exercise we will be running code that makes two different visualisations of ways in which we can represent text as numbers. The first is a one-hot vector encoding of a nursery rhyme based on words, and the second is a bag-of-words visualisation.

First, let's do some imports:

In [None]:
#if you haven't already installed these, install them now:
%pip install matplotlib
%pip install seaborn

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

Let's load in the nursery rhyme `row-row-row-your-boat.txt`. [You can read more about the rhyme here](https://allnurseryrhymes.com/row-row-row-your-boat/).

<a id='task2'></a>

In [None]:
# Open the file, read its contents, and close it automatically afterwards
with open("../data/nursery-rhymes/row-row-row-your-boat.txt", "r") as f:
    text = f.read()
print(text)

Let's replace the newline characters with spaces. In code we represent a newline with `\n`.

In [None]:
text = text.replace('\n', ' ')
print(text)

You will need to do some more filtering of the text here, but don't worry about that for now. See the tasks at the end of the document for hints on what to do:

<a id='task1'></a>

In [None]:
# Don't change anything in this cell yet. You'll come back here in Task 1.A

# Then, you might want to start by thinking about using text.replace(....) (you'll need to figure out how to fill it in!)


Now let's get the words by splitting the string on the space character ` `. We will also find our vocabulary of unique words by using the built-in [set data structure](https://docs.python.org/3/tutorial/datastructures.html#:~:text=Python%20also%20includes%20a%20data%2C%20difference%2C%20and%20symmetric%20difference.).

In [None]:
words = text.split(' ')
vocab = sorted(set(words))
vocab_size = len(vocab)
print(f'There are {vocab_size} different words in the rhyme.')
print(f'Here are all the words in the nursery rhyme: {vocab}')
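
As a minimal sketch of what the cell above does, here is the same split-then-set idea on a made-up phrase. The phrase and the `example_` variable names are illustrative assumptions; they are not part of the dataset or of the notebook's own variables.

```python
# Hypothetical example phrase, not taken from the nursery rhyme dataset
example_text = "the cat sat on the mat"

# Split on spaces to get every word, in the order it appears
example_words = example_text.split(' ')

# A set keeps one copy of each word; sorting gives a stable, alphabetical order
example_vocab = sorted(set(example_words))

print(example_words)  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(example_vocab)  # ['cat', 'mat', 'on', 'sat', 'the']
```

Note that a `set` removes duplicates but does not remember the original order of the words, which is why the result is sorted before it is used as the vocabulary.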

Now let's assign each word to a number index:

In [None]:
word_to_index = {word: index for index, word in enumerate(vocab)}
print(word_to_index)
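
The dictionary above goes from words to numbers. If we later want to turn numbers back into words, one possible companion is the reverse dictionary. This is only a sketch on a tiny made-up vocabulary; the `example_` names and the word list are assumptions for illustration.

```python
# Hypothetical three-word vocabulary, already sorted
example_vocab = ["boat", "row", "your"]

# Word -> index, built the same way as in the cell above
example_word_to_index = {word: index for index, word in enumerate(example_vocab)}

# Index -> word, made by swapping the keys and values
example_index_to_word = {index: word for word, index in example_word_to_index.items()}

# Encode a short phrase as numbers, then decode it back into words
example_phrase = ["row", "row", "your", "boat"]
encoded = [example_word_to_index[word] for word in example_phrase]
decoded = [example_index_to_word[index] for index in encoded]

print(encoded)  # [1, 1, 2, 0]
print(decoded)  # ['row', 'row', 'your', 'boat']
```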

Now let's make a [one-hot](https://en.wikipedia.org/wiki/One-hot) vector visualisation of the nursery rhyme. This form of representation is important, especially when we want to generate text. Don't worry about understanding all of the code for now, just run it and look at the visualisation.

In [None]:
# Encode the text using a one-hot vector representation
one_hot_matrices = []
for word in words:
    one_hot = np.zeros(vocab_size)
    one_hot[word_to_index[word]] = 1
    one_hot_matrices.append(one_hot)

# Convert the list of one-hot matrices to a 2D NumPy array
one_hot_array = np.array(one_hot_matrices)

# Create a heatmap to visualize the one-hot array
plt.figure(figsize=(12, 6))
plt.imshow(one_hot_array.T, cmap="Blues", aspect="auto")
plt.xticks(np.arange(len(words)), words, rotation=45)
plt.yticks(np.arange(vocab_size), vocab)
plt.xlabel("Nursery rhyme")
plt.ylabel("Vocabulary")
plt.title("One-Hot Vector Encoding of a Nursery Rhyme")
plt.grid(True)
plt.show()
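
The loop in the cell above builds the one-hot vectors one word at a time. As an alternative sketch (not the method used in this notebook), NumPy's identity matrix `np.eye` can give the same kind of result in one step: row `i` of the identity matrix is exactly the one-hot vector for index `i`. The short word list and the `example_` names are assumptions for illustration.

```python
import numpy as np

# Hypothetical short text, already split into words
example_words = ["row", "row", "your", "boat"]
example_vocab = sorted(set(example_words))  # ['boat', 'row', 'your']
example_word_to_index = {word: index for index, word in enumerate(example_vocab)}

# Look up the index of every word in the text
example_indices = [example_word_to_index[word] for word in example_words]

# Pick one row of the identity matrix per word: each row is a one-hot vector
example_one_hot = np.eye(len(example_vocab))[example_indices]

print(example_one_hot.shape)  # (4, 3): one row per word, one column per vocabulary entry
print(example_one_hot)
```

Every row contains a single 1, in the column that belongs to that word, and 0 everywhere else.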

Using this kind of visualisation we can easily see which words repeat and the order of the words. 

The computer, though, just represents this as arrays (aka lists) of numbers which are either 0 or 1. Let's take a look at the values in the matrix (aka table):

In [None]:
# Make a pandas data frame with labels for the rows and columns
df = pd.DataFrame(data = one_hot_array.T, 
                  index =  vocab, 
                  columns = words)

# Now let's inspect our dataframe
df

Now let's look at a different kind of visualisation: a [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) representation of our nursery rhyme.

In bag-of-words, we count the occurrences of each word and ignore the order in which they appear. This type of representation assumes that words that are repeated a lot are more important than words used infrequently. This assumption has its limitations, such as counting common words like 'and', 'or', 'it', but in the next two weeks we will explore methods to address this.

Let's first make an empty array for our counts:

In [None]:
document_term_matrix = np.zeros((1, len(vocab)), dtype=int)
document_term_matrix

Now let's count the occurrences of each word in the nursery rhyme:

In [None]:
# For each word in the vocabulary, count how many times it appears in the rhyme
for i, word_in_vocab in enumerate(vocab):
    for word_in_full_rhyme in words:
        if word_in_vocab == word_in_full_rhyme:
            document_term_matrix[0, i] += 1

document_term_matrix
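
The nested loop above compares every vocabulary word with every word in the rhyme. As a sketch of another way to get the same counts, Python's standard library has `collections.Counter`, which tallies the items in a list. The word list and the `example_` names below are made up for illustration.

```python
from collections import Counter

# Hypothetical short text, already split into words
example_words = ["row", "row", "row", "your", "boat"]
example_vocab = sorted(set(example_words))  # ['boat', 'row', 'your']

# Counter behaves like a dictionary mapping each word to how often it appears
example_counts = Counter(example_words)

# Read the counts out in vocabulary order to get the bag-of-words vector
example_bag_of_words = [example_counts[word] for word in example_vocab]

print(example_bag_of_words)  # [1, 3, 1]
```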

Now let's visualise these counts with a heatmap of different colours:

In [None]:
plt.figure(figsize=(2, 6))
sns.set(font_scale=1.2)  # Adjust font size
sns.heatmap(document_term_matrix.T, annot=True, cbar=False, cmap="YlGnBu", yticklabels=list(vocab), xticklabels='')
plt.ylabel("Vocab")
plt.xlabel("Number of occurrences")
plt.title("Bag of Words Heatmap")
plt.show()
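
The two representations in this notebook are closely related: adding up the one-hot vectors of every word in a text gives its bag-of-words counts. The sketch below shows this on a made-up word list; the `example_` names are assumptions for illustration.

```python
import numpy as np

# Hypothetical short text, already split into words
example_words = ["row", "row", "row", "your", "boat"]
example_vocab = sorted(set(example_words))  # ['boat', 'row', 'your']

# One-hot matrix: one row per word in the text, one column per vocabulary entry
example_indices = [example_vocab.index(word) for word in example_words]
example_one_hot = np.eye(len(example_vocab), dtype=int)[example_indices]

# Summing down each column counts how often each vocabulary word was "hot"
example_bag_of_words = example_one_hot.sum(axis=0)

print(example_bag_of_words)  # [1 3 1]
```

With the variables defined earlier in this notebook, the equivalent expression would be `one_hot_array.sum(axis=0)`. The sum throws away the word order, which is exactly the information that bag-of-words ignores.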

## Tasks

**Task 1.A** Take a look at the vocab in both visualisations. You will notice that some words appear in the vocab more than once. This is because the punctuation (like . and ,) has not been removed from the strings, so our algorithm treats a word with punctuation attached as a separate word. Write some code to remove the punctuation from the original text string in [this cell](#task1).

**Task 1.B** We can also merge more of the words if we make [all the letters the same case](https://www.w3schools.com/python/python_strings_modify.asp), either *lowercase* or *UPPERCASE*. Change that and see how it affects the visualisations. Do this in the [same cell](#task1).

**Task 2.A** Now take a look at some of the other nursery rhymes in `data/nursery-rhymes` and try loading a different rhyme into [this cell](#task2) by changing the path to a different file. Try it with a short nursery rhyme to start with.
- How does this affect the visualisation? Maybe you need to remove more special characters to merge all of the same words together.
- If the visualisation does not look right and you have overlapping text, try changing the numbers in the line of code `plt.figure(figsize=(12, 6))` to something bigger.
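
If you would rather not guess the figure size by hand, one possible approach is to scale it with the length of the rhyme. The helper below is only a sketch: `suggest_figsize` is a hypothetical function, and the inches-per-label values are assumptions that you may need to tune.

```python
def suggest_figsize(num_words, vocab_size,
                    inches_per_word=0.4, inches_per_vocab_item=0.35):
    """Suggest a (width, height) in inches for the one-hot heatmap.

    Hypothetical helper: the width grows with the number of words on the
    x-axis, the height grows with the vocabulary size on the y-axis, and
    neither goes below the notebook's default of (12, 6).
    """
    width = max(12, num_words * inches_per_word)
    height = max(6, vocab_size * inches_per_vocab_item)
    return (width, height)


# A short rhyme keeps the default size
print(suggest_figsize(18, 13))  # (12, 6)

# A longer rhyme gets a wider and taller figure
print(suggest_figsize(100, 40))
```

You could then write `plt.figure(figsize=suggest_figsize(len(words), vocab_size))` in place of the fixed numbers.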

**Task 2.B** Once you are happy with your visualisation, upload it to the Miro board to share with the class.