In [1]:
import nltk
from nltk import ngrams
import pandas as pd

In [2]:
with open(r'C:\Users\elian\Downloads\hamlet.txt', encoding="utf8") as h:
    hamletRaw = h.read()
with open(r'C:\Users\elian\Downloads\hamletesperanto.txt', encoding="utf8") as e:
    hesperantoRaw = e.read()

In [3]:
start = hamletRaw.find('SCENE. Elsinore.')
end = hamletRaw.find('*** END OF THE PROJECT GUTENBERG EBOOK HAMLET ***')
hamletClean = hamletRaw[start:end].lower()

start = hesperantoRaw.find('La agado havas lokon en Elsinoro.')
end = hesperantoRaw.find('End of the Project Gutenberg EBook of Hamleto, by William Shakespeare')
esperantoClean = hesperantoRaw[start:end].lower()

In [16]:
tokenizer = nltk.RegexpTokenizer(r'\w+', gaps=False, discard_empty=True)  # keeps only runs of word characters: drops punctuation but splits contractions (e.g. "don't" becomes 'don', 't')
esperantoTokens = tokenizer.tokenize(esperantoClean)

In [5]:
tokenizer = nltk.RegexpTokenizer(r'\w+', gaps=False, discard_empty=True)
hamletTokens = tokenizer.tokenize(hamletClean)
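
To see what the `\w+` pattern does to contractions and to the underscore markup around stage directions, here is a small sketch on a made-up sample line. It uses the standard `re` module; that `re.findall` with this pattern behaves like the tokenizer in matching mode is an assumption. The alternative pattern `WORD` is hypothetical and is not what produced the counts in this notebook.

```python
import re

sample = "don't go, my lord. [_exeunt all but hamlet._]"

# \w+ matches runs of letters, digits, and underscores only,
# so the apostrophe splits "don't" and the underscores survive as tokens
split_tokens = re.findall(r"\w+", sample)
print(split_tokens)

# Hypothetical alternative: letters/digits with optional internal apostrophes,
# and underscores excluded from words
WORD = r"[^\W_]+(?:['’][^\W_]+)*"
kept_tokens = re.findall(WORD, sample)
print(kept_tokens)
```

With a pattern like `WORD`, contractions would count as one word and the bare `_` tokens would disappear, which would lower the English count somewhat.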

## How many words are in each text? Find a sentence or two that are translations of one another, but have a big difference in word length. Explain why one needs more or fewer words than its translation.

According to the RegExp tokenizer, Hamlet in Esperanto is 31095 words long, whereas Hamlet in English is 32906. Since the tokenizer splits contractions into separate tokens, the true word counts are probably slightly lower than reported here. The lower count for the Esperanto text seems reasonable, since Esperanto uses simpler grammatical forms than English.

The two sentence pairs I chose showcase structural differences between the two versions. In the first pair, the English stage direction uses an exclusion ("all but Hamlet") to show that only Hamlet remains on stage, which emphasizes his presence, whereas the Esperanto translation names the characters who leave. This suggests that the Esperanto version tends to render things quite literally: it describes what is happening rather than what is not happening. Here, the other characters leave while Hamlet remains, doing nothing, and the Esperanto text states exactly that.

In the second pair, one can see a semantic difference: translated back into English, the Esperanto line means "About Hamlet, about the prince". It omits "so please you", which again shows the emphasis on literal action and meaning. Additionally, while "Lord Hamlet" is a single title in English, the Esperanto line separates the two identities, first saying "Hamlet" and then "the prince"; this suggests that the translation prefers to state the name and the title separately rather than join them.

In [29]:
len(hamletTokens)


32906

In [31]:
len(esperantoTokens)

31095

In [22]:
esperantoSentence = '(La reĝo, reĝino, Laerto kaj korteganoj foriras).'
hamletSentence = '[_Exeunt all but Hamlet._]'

In [32]:
esperantoSentence2 ='OFELIO Pri Hamleto, Pri la reĝido.'
hamletSentence2 ='OPHELIA. So please you, something touching the Lord Hamlet.'
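
As a quick check on the comparison above, the sketch below counts words in the two sentence pairs with the same `\w+` idea and works out the relative gap between the two reported totals. The helper `count_words` is a hypothetical name, and using `re.findall` in place of the tokenizer is an assumption.

```python
import re

def count_words(sentence):
    # Same \w+ idea as the tokenizer: count runs of word characters
    return len(re.findall(r"\w+", sentence))

pairs = [
    ("(La reĝo, reĝino, Laerto kaj korteganoj foriras).",
     "[_Exeunt all but Hamlet._]"),
    ("OFELIO Pri Hamleto, Pri la reĝido.",
     "OPHELIA. So please you, something touching the Lord Hamlet."),
]
for esperanto, english in pairs:
    print(count_words(esperanto), count_words(english))

# Relative difference between the two totals reported above
eng_total, esp_total = 32906, 31095
gap_percent = 100 * (eng_total - esp_total) / eng_total
print(round(gap_percent, 1))
```

Note that the trailing underscore in the English stage direction is counted as a token of its own, so the English count for the first pair is one higher than the number of actual words.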

## Compare the most frequent trigrams, or 4-grams, or 5-grams, of each. Are there some that don’t seem to have counterparts in the other language? Why is that?

Comparing the most frequent trigrams in the Esperanto and English versions, there are some similarities in the semantics of the phrases; however, because Esperanto is more "literal" than Shakespearean English, some of the Esperanto phrases are slightly off. For instance, "my lord hamlet" is by far the most frequent trigram in English, but in Esperanto the most common trigram translates to "Rosencrantz and Guildenstern". I believe this is because the Esperanto version sometimes splits the phrase "my lord hamlet" into two separate phrases (as seen in the previous problem, where Ophelia says "about Hamlet, about the prince" in Esperanto). A similar case is the phrase "I pray you" in English and "Mi petas vin" in Esperanto; the latter translates directly to "I beg you", which is slightly different from "I pray you". The Esperanto phrase occurs 6 more times than the English one, which leads me to believe that "Mi petas vin" stands in for several different English phrases and is therefore used more often.

The most frequent 4-grams in Esperanto and English seem to be closer to each other in translation, since many of them have to do with location. The same pattern can be seen in the most frequent 5-grams; phrases like "a room in the castle" or "another room in the castle" are common in both languages. However, it is interesting that English has more stage directions: while skimming through the Esperanto version, I found few stage directions. This could be a discrepancy between the two editions or a quirk of Esperanto; I think it is the former.

Another interesting phrase that shows up among Esperanto's most frequent 4-grams and 5-grams is "the death of the father"; this could be because the English version may refer to the death of Hamlet's father indirectly, which the Esperanto version then translates with its direct meaning.

In [9]:
def freq_ngrams(text, n ):
    n_grams = ngrams(text, n)
    return nltk.FreqDist([ ' '.join(grams) for grams in n_grams])
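
For readers who want to see what the counting step does without NLTK, here is a minimal standard-library sketch of the same idea: slide a window of length `n` over the tokens, join each window into a string, and tally the strings. The name `count_ngrams` and the toy token list are made up for illustration.

```python
from collections import Counter

def count_ngrams(tokens, n):
    # Slide a window of length n over the token list and join each window into one string
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)

toy_tokens = "my lord hamlet my lord i pray you my lord hamlet".split()
toy_trigrams = count_ngrams(toy_tokens, 3)
print(toy_trigrams.most_common(1))
```

`most_common(k)` returns the `k` highest counts already sorted, which is an alternative to converting to a pandas Series and sorting.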

In [56]:
engTri = freq_ngrams(hamletTokens, 3)
engTriPd = pd.Series(engTri).sort_values(ascending = False)[:10]

In [55]:
espTri = freq_ngrams(esperantoTokens, 3)
espTriPd = pd.Series(espTri).sort_values(ascending = False)[:10]

In [57]:
df1 = pd.DataFrame(engTriPd, columns = ['English Trigrams'])
df2 = pd.DataFrame(espTriPd, columns = ['Esperanto Trigrams'])
pd.concat([df1, df2])  # stacks the two frames row-wise; DataFrame.append is no longer available in pandas 2.x

Trigram,English Trigrams,Esperanto Trigrams
my lord hamlet,64.0,
rosencrantz and guildenstern,22.0,
my lord i,19.0,
the castle enter,15.0,
good my lord,14.0,
in the castle,13.0,
i pray you,12.0,
that i have,11.0,
lord hamlet i,10.0,
room in the,10.0,


In [58]:
eng4 = freq_ngrams(hamletTokens, 4)
eng4Pd = pd.Series(eng4).sort_values(ascending = False)[:10]

In [59]:
esp4 = freq_ngrams(esperantoTokens, 4)
esp4Pd = pd.Series(esp4).sort_values(ascending = False)[:10]

In [62]:
df3 = pd.DataFrame(eng4Pd, columns = ['English 4grams'])
df4 = pd.DataFrame(esp4Pd, columns = ['Esperanto 4grams'])
pd.concat([df3, df4])  # stacks the two frames row-wise; DataFrame.append is no longer available in pandas 2.x

4gram,English 4grams,Esperanto 4grams
in the castle enter,13.0,
room in the castle,10.0,
my lord hamlet i,8.0,
my lord i have,7.0,
the castle enter king,6.0,
rosencrantz and guildenstern _,6.0,
_exeunt rosencrantz and guildenstern,6.0,
my lord hamlet why,6.0,
a room in the,5.0,
_exeunt _ scene ii,5.0,


In [60]:
eng5 = freq_ngrams(hamletTokens, 5)
eng5Pd = pd.Series(eng5).sort_values(ascending = False)[:10]

In [61]:
esp5 = freq_ngrams(esperantoTokens, 5)
esp5Pd = pd.Series(esp5).sort_values(ascending = False)[:10]

In [63]:
df5 = pd.DataFrame(eng5Pd, columns = ['English 5grams'])
df6 = pd.DataFrame(esp5Pd, columns = ['Esperanto 5grams'])
pd.concat([df5, df6])  # stacks the two frames row-wise; DataFrame.append is no longer available in pandas 2.x

5gram,English 5grams,Esperanto 5grams
room in the castle enter,10.0,
in the castle enter king,6.0,
_exeunt rosencrantz and guildenstern _,6.0,
a room in the castle,5.0,
another room in the castle,5.0,
lord _exeunt rosencrantz and guildenstern,3.0,
ghost _beneath _ swear hamlet,3.0,
_exeunt all but hamlet _,3.0,
my lord _exeunt rosencrantz and,3.0,
in the castle enter hamlet,3.0,
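
Because the frames above are stacked row by row, each phrase fills only one language's column. A side-by-side layout may be easier to compare; the sketch below pairs the top entries of two count mappings rank by rank. The function name `side_by_side` and the counts in `eng` and `esp` are hypothetical and are not taken from the texts.

```python
from itertools import zip_longest

def side_by_side(left_counts, right_counts, k=10):
    """Pair the k most frequent entries of two {phrase: count} mappings, rank by rank."""
    left = sorted(left_counts.items(), key=lambda kv: kv[1], reverse=True)[:k]
    right = sorted(right_counts.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Pad the shorter side so every row has both columns
    return list(zip_longest(left, right, fillvalue=("", 0)))

# Hypothetical counts for illustration only
eng = {"in the castle": 3, "my lord i": 2}
esp = {"mi petas vin": 4}
rows = side_by_side(eng, esp)
for (eng_phrase, eng_n), (esp_phrase, esp_n) in rows:
    print(f"{eng_phrase:<20}{eng_n:>4}   {esp_phrase:<20}{esp_n:>4}")
```

Any mapping with an `items()` method should work as input, so the frequency distributions computed earlier could in principle be passed in directly.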


## Compare plural nouns between your two texts. (NNS, NNP, or NNPS in English, -j in Esperanto.) What kinds of things are more or less likely to be plural? How do these differ between languages?

Firstly, there are some discrepancies in both the English POS tags and the Esperanto ones. In English, some characters' names, such as "Fortinbras" or "Horatio", are counted as plural nouns; one likely cause is that the filter also accepts the NNP tag, which marks singular proper nouns. In Esperanto, some adjectives are included in the list of plural nouns, because adjectives describing plural nouns also take the 'j' ending (this is also why I manually excluded "kaj" from the list: it matches the ending pattern, but it simply means "and"). Overall, the plural nouns in the two texts are quite similar, as they usually refer to the presence of more than one *thing*. Differences in plurality start to show up where English refers to groups of people, such as "messengers" or "rivals", but the difference is still not remarkable, since it simply comes down to how the words are translated into Esperanto. For example, the English version does not include "Norwegians" as a plural noun when POS tagged, but the Esperanto version does. I assume that some of this is due to Esperanto's limited vocabulary, which must matter even more in translating Shakespeare, since Shakespearean English is not practical for communicating with other people across languages. I also noticed that while the English version includes "disasters", the Esperanto one does not. Once again, I think this may be due to the translation, since Esperanto tends to explain concepts more literally and at greater length (similar to Spanish) than English does. Finally, in Esperanto, plural nouns are more often used to describe aspects of the setting, such as "streets" or "Sundays".

In [129]:
engPOS = nltk.pos_tag(hamletTokens)

In [141]:
pluralNounsEng = set([word for word, tag in engPOS 
                if tag in ('NNS', 'NNP', 'NNPS')])
print(pluralNounsEng)

{'drunkards', 'imaginations', 'functions', 'brothers', 'captains', 'maggots', 'falls', 'manners', 'rises', 'lives', 'gods', 'herods', 'instances', 'contents', 'loggets', '_retires', 'women', 'defeats', 'cases', 'fines', 'spirits', 'marriages', 'excitements', 'hopes', 'accidents', 'doth', 'others', 'offices', 'ministers', 'secrets', 'malefactions', 'sans', 'buzzers', 'lords', 'sables', 'bounds', 'enterprises', 'cries', 'smooth', 'fiends', 'troubles', 'affections', 'acts', '_for', 'eyes', 'barnardo', 'loves', 'nations', 'spectators', 'rites', 'woodcocks', 'players', 'bells', 'erbears', '_beneath', 'closes', 'winds', 'animals', 'shocks', 'stairs', 'flames', 'needs', 'crows', 'toys', 'brains', 'demands', 'wards', '_exeunt', 'minds', '_draws', 'officers', 'bugs', 'lines', 'kneels', 'shows', 'priests', 'discourse', 'judges', 'churchyards', 'pangs', 'garments', 'soldiers', 'therein', 'souls', 'ho', 'terms', 'occurrents', 'deaths', 'doubts', 'ourselves', 'blasts', 'delights', 'fathers', 'recor

In [140]:
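def pluralEsp():
    # Collect unique tokens ending in -j (nominative plural nouns and adjectives).
    # Accusative plurals in -jn are not matched by this check.
    nounsEsp = []
    for token in esperantoTokens:
        if token.endswith('j'):
            nounsEsp.append(token)
    nounsEsp = set([token for token in nounsEsp if token != 'kaj'])  # 'kaj' just means "and"
    return nounsEsp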

pluralEsp()

{'_elektitaj',
 '_rakontoj',
 'abomenindaj',
 'administratoroj',
 'adultaj',
 'afablaj',
 'aferistoj',
 'aferoj',
 'agoj',
 'agrablaj',
 'aj',
 'aktoroj',
 'aliaj',
 'almozuloj',
 'alportintoj',
 'altaj',
 'altastataj',
 'amataj',
 'amikoj',
 'amoj',
 'amuzoj',
 'anguletoj',
 'animoj',
 'antaŭparoloj',
 'antaŭsignoj',
 'anĝeloj',
 'apartaj',
 'apartaĵoj',
 'apartenaĵoj',
 'aplaŭdataj',
 'aranĝitaj',
 'arogantaj',
 'artaj',
 'artifikoj',
 'artoj',
 'atestantoj',
 'aventuristoj',
 'azenetoj',
 'aĉetaj',
 'aŭdataj',
 'aŭtoroj',
 'bataliloj',
 'batoj',
 'belaj',
 'berberaj',
 'bestaj',
 'bestoj',
 'bienoj',
 'birdetoj',
 'birdoj',
 'blindaj',
 'bonaj',
 'bondeziroj',
 'bonegaj',
 'bonkoraj',
 'bovidoj',
 'brakoj',
 'bravaj',
 'brilantaj',
 'brovoj',
 'bruantaj',
 'ceremonioj',
 'ceteraj',
 'ciklopoj',
 'danaj',
 'danoj',
 'decidoj',
 'deflankiĝaj',
 'deliroj',
 'delogantoj',
 'demandoj',
 'depagoj',
 'devoj',
 'deziroj',
 'diabloj',
 'difinitaj',
 'dimanĉoj',
 'dioj',
 'disligitaj',
 'disp
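
The plural check above relies on `str.endswith`. One thing worth knowing is that an expression such as `'j' or 'jn'` evaluates to `'j'` before `endswith` ever sees it, so only one suffix is tested. To test several suffixes, `endswith` accepts a tuple. The sketch below shows this on a made-up token list; `plural_forms` is a hypothetical helper, and running it on the full text would give a larger set than the one printed above.

```python
# 'j' or 'jn' is evaluated first and yields 'j', so only one suffix is ever tested
single = 'j' or 'jn'
print(single)
print("aferojn".endswith(single))

def plural_forms(tokens, exclude=("kaj",)):
    """Unique tokens ending in -j or -jn, minus excluded function words."""
    return {t for t in tokens if t.endswith(("j", "jn")) and t not in exclude}

sample_tokens = ["aferoj", "aferojn", "kaj", "bonaj", "hamleto"]
print(sorted(plural_forms(sample_tokens)))
```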

## Choose several people (characters, in fiction, or real people, in nonfiction) that appear in your text. For each, write a function to collect all the adjectives you see in the same sentences as those people’s names. What do you notice about them? Are they roughly the same in both languages?

Firstly, there are (once again) some discrepancies in how adjectives were filtered in each text; since the POS tag function counts "lord" as an adjective, it is one of the most common ones found for Hamlet. Both functions also check only the word immediately before the character's name, so the lists are narrower than a full same-sentence search. For the most part, however, the adjectives used to describe the characters are semantically similar. For instance, in the Esperanto version "terrible" and "unpleasant" are used to describe Hamlet, while "unmannerly" and "unfellowed" are used in English. This shows a discrepancy in translation, but the overall connotation of the adjectives remains the same. Interestingly, "ignorant" is one of the adjectives used to describe Hamlet in the English version and "unheard" is used in Esperanto; I am not sure whether these are direct translations, because I do not know where in the text they occur, but I thought it was an interesting example of how Esperanto may have translated "ignorant" in a different way.

For Horatio, again, the semantics seem to be more or less the same between versions; "good" is a descriptor that appears frequently in both. An interesting case is "partisan" in English versus "honest" in Esperanto; though these words mean basically the same thing, they have different connotations in the contexts in which they are used. Personally, when "partisan" is used, I think of politics, and when "honest" is used I tend to think of a more personal or interpersonal context.

For Ophelia, the same holds true; however, there seems to be more variance in the Esperanto adjectives, because she is described as "beloved", "beautiful", and "delightful" (she is also described as "headless", which cannot be literal since she is not beheaded; my guess is that this refers to Ophelia's intelligence). In English, the only adjectives frequently used to describe Ophelia are "poor", "sweet", and "fair".

I think that the adjectives in Esperanto and English describe Ophelia's condition fairly well; she exists to sit and look pretty for the most part. I do think it is interesting that there is more variance and detail in Esperanto's adjectives than in the English ones, because my general impression of Esperanto is that it is a simpler language than English.

In [135]:
def findDescriptorsEng(charname):
    # Print each word tagged 'JJ' that immediately precedes charname.
    for i in range(1, len(engPOS)):  # start at 1 so i-1 never wraps around to the last token
        word, pos = engPOS[i]
        prevWord, prevPOS = engPOS[i-1]
        if word == charname and prevPOS == 'JJ':
            print(prevWord)
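
Since the printed lists repeat words such as "lord" many times, tallying them can make the results easier to read. The sketch below pairs each tagged token with its successor and counts the adjectives that come right before a name. The function name `preceding_adjectives` is hypothetical, and the toy data is hand-tagged for illustration rather than produced by the tagger.

```python
from collections import Counter

def preceding_adjectives(tagged, name):
    # Pair every token with its successor, so there is no index arithmetic to get wrong
    pairs = zip(tagged, tagged[1:])
    return Counter(prev for (prev, prev_tag), (word, _) in pairs
                   if word == name and prev_tag == "JJ")

# Hand-tagged toy data; these tags are made up for illustration
toy = [("good", "JJ"), ("hamlet", "NN"), ("my", "PRP$"), ("good", "JJ"),
       ("hamlet", "NN"), ("sweet", "JJ"), ("ophelia", "NN"),
       ("young", "JJ"), ("hamlet", "NN")]
hamlet_counts = preceding_adjectives(toy, "hamlet")
print(hamlet_counts.most_common())
```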

In [136]:
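espAdj = []
for token in esperantoTokens:
    if token.endswith('a'):  # tokens ending in -a (the singular adjective ending); -aj and -ajn forms are not matched
        espAdj.append(token)
espAdj = [token for token in espAdj if token != 'la']  # 'la' is the article "the"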

In [137]:
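def findDescriptorsEsp(charname):
    # Print each word from espAdj that immediately precedes charname.
    for i in range(1, len(esperantoTokens)):  # start at 1 so i-1 never wraps around to the last token
        word = esperantoTokens[i]
        prevWord = esperantoTokens[i-1]
        if word == charname and prevWord in espAdj:
            print(prevWord)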
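
The question asks for adjectives in the same sentence as a name, which is broader than the previous-word check. A rough sketch of that wider search for Esperanto follows. It assumes that sentences can be split on `.`, `!`, and `?`, and that the endings -a, -aj, and -ajn mark adjectives; both are simplifications. The function name, the stopword list, and the sample text are made up for illustration.

```python
import re

def sentence_adjectives_esp(text, charname, stopwords=("la", "mia", "via", "lia")):
    """Collect -a / -aj / -ajn words from every sentence that mentions charname."""
    found = []
    # Rough sentence split on ., !, ? (an assumption; real texts need something sturdier)
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[^\W_]+", sentence)
        if charname in words:
            found += [w for w in words
                      if w.endswith(("a", "aj", "ajn")) and w not in stopwords]
    return found

sample = "Mia kara Hamleto estas pala. La bela Ofelio venas. Ho, kara Horacio!"
print(sentence_adjectives_esp(sample, "hamleto"))
print(sentence_adjectives_esp(sample, "ofelio"))
```

The ending heuristic will still pick up some non-adjectives that happen to end in -a, so the stopword list would likely need to grow for the full text.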

In [121]:
findDescriptorsEng('hamlet')

valiant
young
queen
good
thee
lordship
lord
funeral
king
like
sight
lord
anger
pale
t
honour
lord
marcellus
custom
ghost
thee
unfold
hear
unnatural
crown
o
_
lord
lord
lord
lord
lord
faith
swear
lord
swear
swear
strange
swear
lord
enter
lord
lord
lord
lord
lord
lord
button
lord
lord
service
lord
child
lord
lord
long
desert
lord
lord
lordship
honesty
deceived
lord
lord
honour
lord
mine
dear
lord
lord
lord
lord
lord
prologue
meant
_
lord
first
t
lord
keen
_
sir
choler
affair
lord
denmark
unmannerly
lord
weasel
whale
_
enter
enter
queen
o
tinct
sweet
mad
o
twain
lord
lord
lord
polonius
ay
c
sir
fortinbras
fee
d
lord
thine
ulcer
enter
easiness
lord
lord
lord
meet
mine
dead
king
fortinbras
young
lord
olympus
i
queen
horatio
lord
certain
possible
lord
d
majesty
hot
spent
ignorant
unfellowed
lordship
lord
lord
fit
t
poor
sir
_
fortune
late
d
dear
hamlet
noble
_
bear


In [122]:
findDescriptorsEsp('hamleto')

mia
eksterorda
unua
pala
terura
griza
malvarmega
lia
neaŭdita
nova
honesta
longa
kara
kara
malagrabla
pikema
bona
finita
kara
luma
decidita
via
mia
fantazia
plenumita
sekvanta
inklina
disponita
dua


In [123]:
findDescriptorsEng('horatio')

welcome
partisan
strange
red
long
cold
struck
wonderful
secret
pray
pajock
good
perceive
queen
servant
good
good
dead
unsatisfied
good


In [124]:
findDescriptorsEsp('horacio')

instruita
ha
mia
malafabla
honesta
brava
lia
kara
bona
mia


In [125]:
findDescriptorsEng('ophelia')

t
few
late
_to
fair
d
fair
farewell
mischief
ophelia
lady
ophelia
poor
sweet
poor
fair


In [126]:
findDescriptorsEsp('ofelio')

lia
senkapa
nova
ravanta
kara
mia
ĉarma
virta
bela
kara
bela


## Familiarize yourself with the concept of an idiom. Esperanto has almost no idioms. Find 2-3 idiomatic phrases in your English text, and find their Esperanto translations. How are they worded unidiomatically in Esperanto? What happens when you try to put the idiomatic phrase into Google Translate, and then translate it back again?

The two idioms I chose are fairly common in English. "In my mind's eye" means "in my imagination or memory", so I find it interesting that the Esperanto version translates the phrase as "in the eyes of the soul". The use of "soul" instead of "mind" connotes a more spiritual or transcendent meaning (when I think of "soul" in this context, I think of something more philosophical), and it can change the entire meaning of what Hamlet is saying, depending on the reader. The next idiom, "in my heart of heart" (or "in my heart of hearts", as it is more commonly used), means one's true or innermost feelings about a subject. In this context, Hamlet is talking to Horatio and saying that he keeps him close to his heart, since he is an honest man who is not "passion's slave". In Esperanto, the phrase is translated as "in the deepest place from my heart", which is very close to the meaning of the original idiom. The most interesting aspect of the translation is the action involved: in Esperanto, Hamlet has "closed" Horatio in the deepest place in his heart, which makes sense but is not exactly what the English text communicates (though it is very close). Though the Esperanto version does not translate the idiomatic expressions word for word, in these two sentences I find that it translates their meaning quite accurately.

In [142]:
englishIdiom1 = "HAMLET. In my mind’s eye, Horatio."
esperantoIdiom1 = "HAMLETO En l' okuloj de l' animo."
esperantoTranslate1 = 'In the eyes of the soul'

In [143]:
englishIdiom2 = 'Give me that man That is not passion’s slave, and I will wear him In my heart’s core, ay, in my heart of heart, As I do thee.'
esperantoIdiom2 = '''Ho, donu vi al mi la viron, kiun Pasio lia ne sklavigis,--mi Lin fermos en la plej profundan lokon De mia koro, kiel vin mi fermis'''
esperantoTranslate2= 'Oh, give me the man that His passion did not enslave me It will close it in the deepest place From my heart, how I closed you'