# Analysis of Alasdair Beckett-King's Lear

![ABK](abk.jpg)


To practice using Python, I'm going to try and find the top 115 most frequently used words in "King Lear".

The reason for this choice is that comedian Alasdair Beckett-King once performed a reading of the top 115 most frequently used words in "King Lear" as if it were a genuine Shakespeare play. The funny thing about it was that you couldn't tell he was just saying a bunch of random Shakespearian words rather than reading from an actual script.

[Here's the skit](https://www.youtube.com/watch?v=ZkIrDMLDDjs).

But did he *really* recite the actual top 115 words from Lear? Let's find out.

In [1]:
# 0) PREPARE ENVIRONMENT
from urllib.request import urlopen
import string

# Google Colab libraries
# from google.colab import files
# from IPython.display import Image
# from IPython.display import set_matplotlib_formats
# set_matplotlib_formats('pdf', 'svg')

# 1) Get Data
Let's get the data from Project Gutenberg, which has King Lear in both HTML and txt format. I'm going to get the .txt file because I'm lazy.

In [2]:
# Download King Lear from Project Gutenberg
response = urlopen("https://www.gutenberg.org/files/1532/1532-0.txt")
data = response.read()
words = data.split()
print('Number of words in the text file is: ', len(words))


Number of words in the text file is:  30802


In [3]:
# Explore data, choosing a word at random and seeing what it looks like
my_word = words[15400]
print("My word is: " + str(my_word))
print("My type is: " + str(type(my_word)))

My word is: b'pen'
My type is: <class 'bytes'>


Hmmm. Each word is in a weird format called "bytes".

# 2) Tidy Data

## 2a. Clean the Text
There are a few problems with the text:
* **Type** - Each word is displayed with a leading b (as in b'pen'). That prefix marks a bytes object rather than a string, so we can't do normal string things to it until we convert it.
* **Case** - We should make everything lower case.
* **Punctuation** - There's a whole heap of punctuation and brackets etc which should be removed.


In [4]:
# Tidy up
preclean = []
for word in words:
    # Convert from bytes to string
    w = word.decode()

    # Convert to lowercase and remove punctuation
    w = w.lower()
    w = ''.join(letter for letter in w if letter not in string.punctuation)

    preclean.append(w)
my_word = preclean[15400]
print("My word is: " + str(my_word))
print("My type is: " + str(type(my_word)))

My word is: pen
My type is: <class 'str'>


That's better. Now our text is in a more manageable string format.
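
As an aside, the same clean-up can be sketched with `str.translate`, which deletes all the punctuation in one pass instead of checking letter by letter. This is only an alternative sketch, not what the cell above does: `clean_word` and the sample tokens are names and data I made up for illustration. It also shows one quirk worth knowing: `string.punctuation` only covers ASCII, so the curly apostrophe in words like "i’ll" survives.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# string.punctuation is ASCII-only, so the curly apostrophe (’) is kept.
STRIP_PUNCT = str.maketrans('', '', string.punctuation)


def clean_word(raw):
    """Decode one bytes token, lowercase it, and drop ASCII punctuation."""
    return raw.decode('utf-8').lower().translate(STRIP_PUNCT)


# Made-up sample tokens, not taken from the downloaded file
sample = [b'Pen,', b'KING', 'I’ll'.encode('utf-8')]
cleaned = [clean_word(token) for token in sample]
print(cleaned)
```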

## 2b. Remove Non-Shakespeare Text  

On the [website](https://www.gutenberg.org/files/1532/1532-h/1532-h.htm), you can see that Project Gutenberg has added a preamble just before the play starts, which we need to remove. They've also added some stuff at the end. Luckily, Python allows us to tidy this up with a couple of split() commands.

The play starts just after the first use of the phrase "twenty Project Gutenberg volunteers". It ends just before the words "End of Project Gutenberg's King Lear". Since the text is now all lower case, the markers in the code have to be lower case too.

In [5]:
# Join the corpus back together again
removing = ' '.join(preclean)
removed = removing.split("volunteers")[1].split("end of project")[0]
removed = removed.split()

print("Word count: " + str(len(removed)))
print("THE FIRST TEN WORDS")
print(removed[0:10])

print("THE LAST TEN WORDS")
print(removed[-10:])


Word count: 27680
THE FIRST TEN WORDS
['the', 'tragedy', 'of', 'king', 'lear', 'by', 'william', 'shakespeare', 'contents', 'act']
THE LAST TEN WORDS
['much', 'nor', 'live', 'so', 'long', 'exeunt', 'with', 'a', 'dead', 'march']
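
For the curious, here is a minimal sketch of the same "keep what's between two markers" idea using `str.partition` on a toy string. The helper name `between` and the toy text are my own inventions. The one practical difference I'm relying on is that `partition` returns an empty string when the start marker is missing, whereas indexing `split(...)[1]` would raise an `IndexError`.

```python
def between(text, start_marker, end_marker):
    """Return the text after the first start_marker and before the next end_marker."""
    # partition splits on the first occurrence only
    _, _, after_start = text.partition(start_marker)
    body, _, _ = after_start.partition(end_marker)
    return body


# Toy text standing in for the lower-cased Gutenberg file
toy = "some preamble volunteers the tragedy of king lear end of project some licence"
play = between(toy, "volunteers", "end of project").split()
print(play)
```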


## 2c. Remove Names

The final problem is that there are some proper names, like "Lear" and "Cordelia", which aren't really words. We should remove them.

In [6]:
list_of_names = [
    "lear", "burgundy", "cornwall", "albany", "kent", "gloucester",
    "edgar", "edmund", "curan", "oswald", "goneril", "regan", 
    "cordelia", "william", "shakespeare"
]

clean = [word for word in removed if word not in list_of_names]

print("Word count before edit : " + str(len(removed)))
print("THE FIRST TEN WORDS")
print(clean[0:10])

print("THE LAST TEN WORDS")
print(clean[-10:])
print("Word count after edit: " + str(len(clean)))

Word count before edit : 27680
THE FIRST TEN WORDS
['the', 'tragedy', 'of', 'king', 'by', 'contents', 'act', 'i', 'scene', 'i']
THE LAST TEN WORDS
['much', 'nor', 'live', 'so', 'long', 'exeunt', 'with', 'a', 'dead', 'march']
Word count after edit: 26354
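
A small variation on the filter above, sketched on toy data: if the names live in a `set` rather than a list, each "is this a name?" check is a hash lookup instead of a scan through the whole list. With fifteen names it makes no real difference, so treat this as a habit rather than a fix. The variable names and the toy word list are made up for the example.

```python
# A set gives constant-time membership checks on average
names = {"lear", "cordelia", "kent"}

# Toy word list standing in for the trimmed play text
toy_words = ['king', 'lear', 'and', 'cordelia', 'enter']

kept = [word for word in toy_words if word not in names]
print(kept)
```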


# 3) Analyse Data

## 3a. Sort
Let's get a list of all the words in frequency order.

The original words of the skit are as follows:

*The and I, to of you my, a that in not this me your thou is his.*  
*Have him with it! He be thy for no so. Thee.*    *laughs*  
*What her will but are as do, Sir Our.*  
*Fool! If all on shall Lord, from come by am good.*    
*O more where now which we let man know.*  

*Out! I'll how well. Who then King there take? Or hear would father?*  
*They at go. Old hath they why she most may yet there make!*  
*Tis was us. Love see must heart upon seek poor. Like, then gentlemen.*  
*Should such what I'm give art one, nor had these can some say.*  

*Eyes away. Night! Nature! To nothing!*

Let's see if the original text has the same word frequencies.

In [7]:
from collections import defaultdict
wordCount = defaultdict(int)
for word in clean:
    wordCount[word] += 1

# Prints words in descending order, so we can read them 
result = sorted(wordCount, key=wordCount.__getitem__, reverse=True)[:120]
print(result)

['the', 'and', 'i', 'to', 'of', 'you', 'my', 'a', 'that', 'in', 'not', 'this', 'me', 'your', 'is', 'thou', 'his', 'with', 'have', 'him', 'it', 'he', 'be', 'thy', 'for', 'no', 'thee', 'what', 'so', 'her', 'but', 'are', 'will', 'as', 'fool', 'our', 'sir', 'if', 'do', 'on', 'all', 'shall', 'lord', 'from', 'by', 'come', 'am', 'which', 'good', 'more', 'when', 'o', 'now', 'know', 'let', 'we', 'king', 'man', 'enter', 'out', 'i’ll', 'who', 'how', 'or', 'than', 'their', 'here', 'well', 'father', 'take', 'they', 'would', 'at', 'go', 'old', 'there', 'hath', 'make', 'scene', 'gentleman', 'may', 'us', 'most', 'she', 'yet', 'was', 'love', '’tis', 'them', 'why', 'see', 'must', 'speak', 'poor', 'upon', 'an', 'should', 'heart', 'then', 'like', 'exit', 'such', 'where', 'give', 'art', 'one', 'had', 'eyes', 'can', 'some', 'away', 'these', 'say', 'life', 'nature', 'nor', 'exeunt', 'nothing', 'too', 'up']
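
The standard library also has a ready-made tool for this kind of tally: `collections.Counter`, whose `most_common(n)` method returns the `n` most frequent items with their counts. Here's a minimal sketch on a toy word list (the list is made up, it's not the play):

```python
from collections import Counter

# Toy word list standing in for the cleaned play text
toy_words = "the and the fool the and".split()

counts = Counter(toy_words)

# most_common(n) returns (word, count) pairs, highest count first
top_two = counts.most_common(2)
print(top_two)
```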


## 3b. Wha????

Wait a minute. Those aren't the words he used in his script! Well, many of them are, but they're not exactly the same. 
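
One way to pin down "not exactly the same" is to compare the two word lists as sets. The sketch below does that on two short, hand-picked toy lists, so its output only illustrates the technique and says nothing about the full data. The helper name `only_in_each` is made up.

```python
def only_in_each(skit_words, counted_words):
    """Return (words only in the skit, words only in the computed list), sorted."""
    skit, counted = set(skit_words), set(counted_words)
    return sorted(skit - counted), sorted(counted - skit)


# Hand-picked toy excerpts, NOT the full lists
skit_excerpt = ['the', 'and', 'i', 'to', 'of', 'hear', 'seek']
counted_excerpt = ['the', 'and', 'i', 'to', 'of', 'enter', 'scene']

only_skit, only_counted = only_in_each(skit_excerpt, counted_excerpt)
print("Only in the skit excerpt:", only_skit)
print("Only in the counted excerpt:", only_counted)
```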

Could it be that some of the words appear the same number of times as each other, which would lead to some small discrepancies? 

Let's print each word along with its frequency.

In [8]:
# Show the result with their frequencies
unsorted = [(wordCount[word], word) for word in wordCount]
result = sorted(unsorted, reverse=True)
result[:120]
# tl;dr No, that's not it

[(921, 'the'),
 (738, 'and'),
 (622, 'i'),
 (570, 'to'),
 (494, 'of'),
 (456, 'you'),
 (454, 'my'),
 (422, 'a'),
 (336, 'that'),
 (314, 'in'),
 (280, 'not'),
 (234, 'this'),
 (225, 'me'),
 (222, 'your'),
 (219, 'is'),
 (216, 'thou'),
 (210, 'his'),
 (206, 'with'),
 (206, 'have'),
 (200, 'him'),
 (193, 'it'),
 (174, 'he'),
 (165, 'be'),
 (158, 'thy'),
 (153, 'for'),
 (152, 'no'),
 (138, 'thee'),
 (136, 'what'),
 (134, 'so'),
 (131, 'her'),
 (130, 'but'),
 (127, 'will'),
 (127, 'are'),
 (123, 'as'),
 (119, 'fool'),
 (115, 'our'),
 (113, 'sir'),
 (112, 'if'),
 (108, 'do'),
 (104, 'on'),
 (101, 'all'),
 (98, 'shall'),
 (97, 'lord'),
 (94, 'from'),
 (91, 'by'),
 (88, 'come'),
 (86, 'am'),
 (83, 'which'),
 (83, 'good'),
 (82, 'more'),
 (79, 'when'),
 (77, 'o'),
 (76, 'now'),
 (76, 'let'),
 (76, 'know'),
 (75, 'we'),
 (74, 'king'),
 (73, 'man'),
 (73, 'enter'),
 (72, 'out'),
 (70, 'i’ll'),
 (69, 'who'),
 (69, 'how'),
 (68, 'or'),
 (67, 'than'),
 (66, 'well'),
 (66, 'their'),
 (66, 'here'),
 (

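It isn't. A handful of words do tie (for example, "with" and "have" both appear 206 times), but a tie can only swap words that sit next to each other in the list, and the differences from the skit are bigger than that.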
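
Rather than eyeballing the list for ties, they can be pulled out directly by grouping words by their count. This is a rough sketch: `find_ties` is a made-up helper, and the sample dictionary holds just a few of the counts copied from the output above.

```python
from collections import defaultdict


def find_ties(word_counts):
    """Map each frequency to the sorted words sharing it, keeping only real ties."""
    by_count = defaultdict(list)
    for word, count in word_counts.items():
        by_count[count].append(word)
    return {count: sorted(words) for count, words in by_count.items() if len(words) > 1}


# A few counts copied from the frequency output above
sample_counts = {'with': 206, 'have': 206, 'him': 200, 'will': 127, 'are': 127, 'as': 123}

ties = find_ties(sample_counts)
print(ties)
```

Running the same helper on the full `wordCount` dictionary would list every tie in the play, not just the ones near the top.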

# 4) Conclusion

Alasdair has taken some poetic license with the word frequencies to make it easier to perform and to flow better. 

Alas, there's more to being a hilarious YouTube star than programming word frequencies in Python. You have to have talent as well. All my dreams are dashed.