# Looking at Macbeth

This notebook is an attempt to put into practice the observations from this past week's reading.

The reading also refers to the following page, which describes how to calculate log-likelihood when comparing texts:
* https://wordhoard.northwestern.edu/userman/analysis-comparewords.html

In [1]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.stem import WordNetLemmatizer
from string import punctuation
import numpy as np
import pandas as pd

In [2]:
# A set gives fast membership tests when filtering a large token list.
myStopWords = set(punctuation) | set(stopwords.words('english'))

In [3]:
WordNetLemmatizer().lemmatize('rocks')

'rock'

In [4]:
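# Read the complete works, lowercase them, and expand the elided "th’" to "the" before tokenizing.
with open('CompleteShakespeare.txt','r') as f:
    completeshakespeare = f.read()
words = word_tokenize(completeshakespeare.lower().replace('th’','the'))
# Drop stop words and punctuation, but keep 'the' so it can be compared later.
completewords = [w for w in words if w not in myStopWords or w in ['the']]
# Reduce each word to its WordNet lemma (e.g. 'ends' becomes 'end').
completestemmed = [WordNetLemmatizer().lemmatize(w) for w in completewords]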

In [5]:
# Apply the same preprocessing to Macbeth alone.
with open('Macbeth.txt','r') as f:
    macbeth = f.read()
words = word_tokenize(macbeth.lower().replace('th’','the'))
macbethwords = [w for w in words if w not in myStopWords or w in ['the']]
# Alternative (requires `from nltk.stem import PorterStemmer`):
# macbethstemmed = [PorterStemmer().stem(w) for w in macbethwords]
macbethstemmed = [WordNetLemmatizer().lemmatize(w) for w in macbethwords]

In [6]:
completewords[:20]

['’',
 'well',
 'ends',
 'well',
 'contents',
 'act',
 'scene',
 'i.',
 'rossillon',
 'room',
 'the',
 'countess',
 '’',
 'palace',
 'scene',
 'ii',
 'paris',
 'room',
 'the',
 'king']
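The `’` token in the output above survives the filter because `string.punctuation` contains only ASCII characters, so the typographic apostrophe never matches the stop list. A minimal sketch of one way to drop such marks follows; the `TYPOGRAPHIC` list and the `extra_stops` name are assumptions for illustration, and the outputs shown in this notebook were produced without this filter.

```python
from string import punctuation

# Assumption: these are the typographic marks worth dropping; extend as needed.
TYPOGRAPHIC = ['’', '‘', '“', '”']

# Hypothetical extended stop list: ASCII punctuation plus the typographic marks.
extra_stops = set(punctuation) | set(TYPOGRAPHIC)

# A few tokens taken from the output above.
tokens = ['’', 'well', 'ends', 'well', 'countess', '’', 'palace']
filtered = [t for t in tokens if t not in extra_stops]
print(filtered)
```

Applying something like this before counting would change every total and score further down, so the numbers would no longer match the ones shown here.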

In [7]:
completestemmed[:20]

['’',
 'well',
 'end',
 'well',
 'content',
 'act',
 'scene',
 'i.',
 'rossillon',
 'room',
 'the',
 'countess',
 '’',
 'palace',
 'scene',
 'ii',
 'paris',
 'room',
 'the',
 'king']

In [8]:
macbethwords[:20]

['macbeth',
 'william',
 'shakespeare',
 'contents',
 'act',
 'scene',
 'i.',
 'open',
 'place',
 'scene',
 'ii',
 'camp',
 'near',
 'forres',
 'scene',
 'iii',
 'heath',
 'scene',
 'iv',
 'forres']

In [9]:
macbethstemmed[:20]

['macbeth',
 'william',
 'shakespeare',
 'content',
 'act',
 'scene',
 'i.',
 'open',
 'place',
 'scene',
 'ii',
 'camp',
 'near',
 'forres',
 'scene',
 'iii',
 'heath',
 'scene',
 'iv',
 'forres']

In [10]:
completefreq = FreqDist(completestemmed)
macbethfreq = FreqDist(macbethstemmed)

In [11]:
completefreq['the']

29971

In [12]:
macbethfreq['the']

775

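According to the above-mentioned website, the following quantities allow us to calculate the log-likelihood ratio, a measure of how much a word's frequency differs between two texts. Here the analysis text is Macbeth and the reference text is the complete works.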

* a: the number of times the word occurs in the analysis text
* b: the number of times the word occurs in the reference text
* c: the total number of words in the analysis text
* d: the total number of words in the reference text
* We then calculate the log-likelihood $G^2$ as:
  * $E_1 = c \cdot (a+b)/(c+d)$
  * $E_2 = d \cdot (a+b)/(c+d)$
  * $G^2 = 2 \cdot (a \ln(a/E_1) + b \ln(b/E_2))$
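To get a feel for the formula before applying it to the real counts, here is a small self-contained sketch with toy numbers. The function name `log_likelihood` and the counts are hypothetical. It treats a zero count as contributing nothing, following the usual convention that $0 \cdot \ln(0) = 0$.

```python
import math

def log_likelihood(a, b, c, d):
    """G^2 for a word seen a times in c analysis words and b times in d reference words."""
    e1 = c * (a + b) / (c + d)  # expected count in the analysis text
    e2 = d * (a + b) / (c + d)  # expected count in the reference text
    # A zero count contributes nothing (0 * ln(0) is taken as 0).
    t1 = a * math.log(a / e1) if a > 0 else 0.0
    t2 = b * math.log(b / e2) if b > 0 else 0.0
    return 2 * (t1 + t2)

# Toy counts (hypothetical): the same relative frequency (1%) in both texts.
same_rate = log_likelihood(10, 100, 1000, 10000)
# Toy counts (hypothetical): 5% in the analysis text versus 1% in the reference text.
higher_rate = log_likelihood(50, 100, 1000, 10000)
print(same_rate, higher_rate)
```

When the word is equally frequent in both texts, the expected counts equal the observed ones and $G^2$ is zero. The further the two relative frequencies drift apart, the larger the score.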

In [13]:
a = macbethfreq['the']
b = completefreq['the']
c = len(macbethstemmed)
d = len(completestemmed)
E1 = c*(a+b)/(c+d)
E2 = d*(a+b)/(c+d)
G2 = 2*((a*np.log(a/E1)) + (b*np.log(b/E2)))

In [14]:
G2

31.34965317727199
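One way to read a score like this: under the hypothesis that the word is equally frequent in both texts, $G^2$ is approximately chi-squared distributed with one degree of freedom, so it can be turned into a rough p-value. The sketch below uses only the standard library; the helper name `g2_p_value` is hypothetical, and the approximation is best trusted for reasonably large counts.

```python
import math

def g2_p_value(g2):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(g2 / 2))

# Common critical values (3.84, 6.63, 10.83) plus the cut-off of 25 used later on.
for g2 in (3.84, 6.63, 10.83, 25.0):
    print(g2, g2_p_value(g2))
```

The familiar thresholds 3.84, 6.63, and 10.83 correspond to p-values of roughly 0.05, 0.01, and 0.001, so a score above 25 is a very strict cut-off.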

In [15]:
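def logli(word):
    # Log-likelihood (G2) of `word` in Macbeth against the complete works.
    a = macbethfreq[word]
    b = completefreq[word]
    c = len(macbethstemmed)
    d = len(completestemmed)
    E1 = c*(a+b)/(c+d)
    E2 = d*(a+b)/(c+d)
    # A zero count contributes nothing (0*ln(0) is taken as 0), so only that term is dropped.
    t1 = a*np.log(a/E1) if a != 0 else 0
    t2 = b*np.log(b/E2) if b != 0 else 0
    G2 = 2*(t1 + t2)
    return G2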
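A caveat on the counts: if `CompleteShakespeare.txt` also contains the text of Macbeth (an assumption, since the notebook does not say), every Macbeth occurrence is counted in both `a` and `b`. The two-sample form of the statistic is usually applied to texts that do not overlap. The sketch below shows one way to derive a disjoint reference count with `collections.Counter`, which NLTK's `FreqDist` subclasses. The toy token lists and variable names are hypothetical.

```python
from collections import Counter

# Toy token lists (hypothetical): the "combined" corpus contains the analysis text.
analysis = Counter(['the', 'thane', 'the', 'dagger'])
combined = Counter(['the', 'thane', 'the', 'dagger', 'the', 'king', 'the'])

# Counter subtraction keeps only positive counts, leaving the non-overlapping remainder.
reference = combined - analysis

c = sum(analysis.values())   # total words in the analysis text
d = sum(reference.values())  # total words in the disjoint reference text
print(reference, c, d)
```

With a disjoint reference, a word that occurs only in the analysis text has `b = 0`, which is where handling zero counts in the formula starts to matter.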

In [16]:
macbethfreq['dagger']/len(macbethstemmed)

0.0008585164835164835

In [17]:
completefreq['dagger']/len(completestemmed)

9.532374100719424e-05

In [18]:
logli('the')

31.34965317727199

In [19]:
logli('dagger')

24.79273019804705

In [20]:
logli('thane')

156.28996392687182
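$G^2$ is never negative, so the score alone does not say whether a word is over-used or under-used in Macbeth. The relative frequencies computed above for 'dagger' (about 0.00086 in Macbeth against about 0.000095 in the complete works) show that it is over-used. A minimal sketch of a helper that reports the direction, using hypothetical names and toy counts:

```python
def usage_direction(a, b, c, d):
    """Return '+' if the word is relatively more frequent in the analysis text,
    '-' if it is less frequent, and '=' if the rates are equal."""
    # Cross-multiply to compare a/c with b/d without dividing.
    left, right = a * d, b * c
    if left > right:
        return '+'
    if left < right:
        return '-'
    return '='

# Toy counts (hypothetical): a occurrences in c words versus b occurrences in d words.
print(usage_direction(50, 100, 1000, 10000))  # 5% versus 1%
print(usage_direction(5, 100, 1000, 10000))   # 0.5% versus 1%
```

Pairing this sign with the score makes it possible to separate words that are distinctive of Macbeth from words the play noticeably avoids.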

In [21]:
# Collect every Macbeth word whose log-likelihood exceeds 25, most frequent first.
wordlist = []
wordfreq = []
loglike = []
for word in sorted(macbethfreq, key=macbethfreq.get, reverse=True):
    score = logli(word)
    if score > 25:
        wordlist.append(word)
        wordfreq.append(macbethfreq[word])
        loglike.append(score)
df = pd.DataFrame({'wordlist':wordlist,'wordfreq':wordfreq,'loglike':loglike})

In [22]:
df

| | wordlist | wordfreq | loglike |
|---|---|---|---|
| 0 | the | 775 | 31.349653 |
| 1 | ’ | 694 | 280.420985 |
| 2 | macbeth | 282 | 1421.734511 |
| 3 | macduff | 111 | 559.618903 |
| 4 | lady | 97 | 119.235725 |
| 5 | banquo | 72 | 362.996045 |
| 6 | malcolm | 59 | 297.455093 |
| 7 | witch | 59 | 246.149512 |
| 8 | ross | 53 | 245.620677 |
| 9 | fear | 44 | 30.222202 |


In [23]:
df.sort_values(by='loglike',ascending=False)

| | wordlist | wordfreq | loglike |
|---|---|---|---|
| 2 | macbeth | 282 | 1421.734511 |
| 3 | macduff | 111 | 559.618903 |
| 5 | banquo | 72 | 362.996045 |
| 6 | malcolm | 59 | 297.455093 |
| 1 | ’ | 694 | 280.420985 |
| 7 | witch | 59 | 246.149512 |
| 8 | ross | 53 | 245.620677 |
| 11 | duncan | 38 | 191.581246 |
| 12 | lennox | 34 | 171.414799 |
| 13 | thane | 31 | 156.289964 |
