Assignment #1

Create or find your own corpus. This corpus should have the following characteristics:
• Three different subcorpora, belonging to three different genres (aka registers or text types)
• Each genre at least 5,000 words long

By genre I mean a type of text that differs from other types of text in content, audience, or mode of delivery. A newspaper article and a podcast about electric vehicles may have similar content but have different audiences and modes of delivery. A lecture and an exam may have similar content and audience, but a different mode of delivery (one oral and one written). 

You have many options. You can collect your own email or text messages from several days, your own papers or other course work, or you can collect from external sources. 
A few things to take into account:
• If you use email or text messages, you can only include text sent by you. Text by other participants (including quoted threads in your own messages) needs to be excluded for confidentiality reasons. Alternatively, you can ask the people involved to give you permission to use the data for research purposes. The same applies to text posted on social networks where a password or permission is required to access the data.
• If you use text from websites, you do not normally need permission, unless the site is password-protected.
• Newspaper text is certainly the easiest to find, but please make sure you have distinct genres within it. For instance, you could use a high-brow vs. low-brow newspaper, or opinion pieces vs. news articles.
• You will need to save the corpora as plain text.
• It is fine to use all written text, but if you have access to transcribed sources of spoken language, feel free to use that as well.
• Please include a reference to the source of each genre in your assignment.

Here is what you need to submit, for each subcorpus:
1. The length (in words).
   Command: len(text)
2. The lexical diversity.
   Command: lexical_diversity(text)
3. Top 10 most frequent words and their counts.
   Command:
   fdist1 = FreqDist(text)
   fdist1.most_common(10)
4. Words that are at least 10 characters long and their counts.
5. The longest sentence (type the sentence and give the number of words). Hint: look at the Gutenberg part of Section 2.1 in NLTK.
6. A stemmed version of the longest sentence.
7. Overall (not for each subcorpus): A reflection (1 paragraph or so): What do the most frequent words, the longest words, and longest sentence tell you about each of the 3 genres? How do you interpret the lexical diversity?
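
The `lexical_diversity` command listed above is not built into NLTK; it is a small helper defined in Chapter 1 of the NLTK book. A minimal sketch of that helper, assuming `text` is a list of tokens (the sample token list is made up for illustration and does not come from any corpus):

```python
def lexical_diversity(text):
    """Ratio of distinct tokens (types) to total tokens."""
    return len(set(text)) / len(text)

# Hypothetical toy input, for illustration only
sample_tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(lexical_diversity(sample_tokens))
```

Values closer to 1 mean fewer repeated tokens. The ratio tends to drop as a text gets longer, so it is worth keeping subcorpus sizes in mind when comparing genres.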

For the assignment, you need to submit on Canvas:
• The answers to the first six questions, for each subcorpus, and the answer to question 7. Please submit this as a PDF, clearly indicating which subcorpus each answer is for.
• The commands used to arrive at the answers (a PDF of your notebook or a link to a GitHub repository).

In [81]:
import nltk

In [5]:
import numpy
import matplotlib

In [7]:
with open("/Users/catalinaporime/Desktop/SDA250/Assignment_1/Dahmer.txt", "r", encoding="utf-8") as f:
    Dahmer = f.read()

DahmerTokens = nltk.word_tokenize(Dahmer)

In [150]:
# print(Dahmer)

In [146]:
#Q1:
print("Question #1")
lengthDahmer = len(DahmerTokens)
print(lengthDahmer)
print("The number of words in the Jeffrey Dahmer documentary on YouTube:", lengthDahmer)

Question #1
8084
The number of words in the Jeffrey Dahmer documentary on YouTube: 8084


In [144]:
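#Q2:
print("Question #2")
# lexical_diversity is not part of NLTK; defined here as in Chapter 1 of the NLTK book
def lexical_diversity(text):
    return len(set(text)) / len(text)
lexical_diversity(DahmerTokens)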

Question #2


0.19594260267194458

In [148]:
#Q3:
print("Question #3")
from nltk.probability import FreqDist
fdist1 = FreqDist(DahmerTokens)
# Ask for 13 entries: three of them (',', '.', '-') are punctuation, leaving 10 words
fdist1.most_common(13)

Question #3


[(',', 420),
 ('.', 388),
 ('the', 301),
 ('of', 199),
 ('to', 194),
 ('was', 189),
 ('he', 171),
 ('a', 165),
 ('and', 144),
 ('that', 134),
 ('in', 132),
 ('-', 127),
 ('Dahmer', 115)]
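
Because `word_tokenize` returns punctuation marks as separate tokens, they appear in the counts above. One possible way to count only word-like tokens is sketched here, assuming a token counts as a word if it contains at least one letter or digit. The names `word_tokens` and `sample` are illustrative, and `collections.Counter` stands in for `FreqDist` (which is a `Counter` subclass) so that the snippet runs without NLTK:

```python
from collections import Counter

def word_tokens(tokens):
    """Keep tokens that contain at least one alphanumeric character."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

# Hypothetical toy tokens, for illustration only
sample = ["He", "was", ",", "he", "said", ",", "14-year-old", "-", "."]

# Lowercasing is an optional choice: it merges "He" and "he"
words = [tok.lower() for tok in word_tokens(sample)]
counts = Counter(words)

print(len(words))             # number of word-like tokens
print(counts.most_common(3))  # most frequent words, punctuation excluded
```

Applying a filter like this before `len(...)` or `FreqDist(...)` would change the numbers reported in this notebook, so it is worth stating in the write-up which definition of "word" was used.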

In [140]:
#Q4:
print("Question #4")
dahmerLong = [word for word in DahmerTokens if len(word) >= 10]
longDist = FreqDist(dahmerLong)
print(longDist.most_common())

Question #4
[('television', 6), ('themselves', 5), ('confession', 5), ('necrophilia', 4), ('detectives', 4), ('cannibalism', 3), ('constantly', 3), ('completely', 3), ('interested', 3), ('remembered', 3), ('grandmother', 3), ('14-year-old', 3), ('photographs', 3), ('essentially', 3), ('immediately', 3), ('one-bedroom', 2), ('investigate', 2), ('31-year-old', 2), ('strangling', 2), ('psychological', 2), ('classmates', 2), ('collection', 2), ('apparently', 2), ('neighborhood', 2), ('hitchhiker', 2), ('ultimately', 2), ('eventually', 2), ('dismembered', 2), ('homosexual', 2), ('opportunity', 2), ('predominantly', 2), ('atrocities', 2), ('13-year-old', 2), ('unconscious', 2), ('necrophiliac', 2), ('refrigerator', 2), ('potentially', 2), ('circumstances', 2), ('incredible', 2), ('controlling', 2), ('assistants', 2), ('whispering', 1), ('synonymous', 1), ('revelations', 1), ('journalist', 1), ('antiseptic', 1), ('throughout', 1), ('playground', 1), ('ostracized', 1), ('surrounding', 1), ('pr

In [138]:
#Q5:
print("Question #5")
from nltk.tokenize import sent_tokenize, word_tokenize
sentences = sent_tokenize(Dahmer)
longest_sentence = max(sentences, key=lambda sentence: len(word_tokenize(sentence)))
word_count = len(word_tokenize(longest_sentence))
print("Longest sentence:", longest_sentence)
print("Number of words:", word_count)

Question #5
Longest sentence: - And then when the guy said he wanted to leave, Dahmer clubbed him on the back of the head with a barbell
and then strangled him, then ultimately disposed of the body,
removed all the flesh, and eventually dissolved it in acid
and pulverized the bones with a sledgehammer.
Number of words: 55


In [136]:
#Q6:
print("Question #6")
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
# Tokenize the longest sentence from Q5, then stem each token
long1 = word_tokenize(longest_sentence)
stemmed_word = [stemmer.stem(word) for word in long1]
print(stemmed_word)

Question #6
['-', 'and', 'then', 'when', 'the', 'guy', 'said', 'he', 'want', 'to', 'leav', ',', 'dahmer', 'club', 'him', 'on', 'the', 'back', 'of', 'the', 'head', 'with', 'a', 'barbel', 'and', 'then', 'strangl', 'him', ',', 'then', 'ultim', 'dispos', 'of', 'the', 'bodi', ',', 'remov', 'all', 'the', 'flesh', ',', 'and', 'eventu', 'dissolv', 'it', 'in', 'acid', 'and', 'pulver', 'the', 'bone', 'with', 'a', 'sledgehamm', '.']
