<a href="https://colab.research.google.com/github/giorgiosld/Natural-Language-Processing/blob/main/lab2/T_725_Lab02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T-725 Natural Language Processing: Lab 2
In today's lab, we will be working with text classification.

To begin with, do the following:
* Select `"File" > "Save a copy in Drive"` to create a local copy of this notebook that you can edit.
* Select `"Runtime" > "Run all"` to run the code in this notebook.

## List comprehensions in Python
List comprehensions are a concise way of creating lists in Python, and take the form:

```python
[expression for item in iterable]
```

A list comprehension creates a new list by evaluating some expression for every item in a given iterable (such as a string, a list or a dictionary). Let's look at an example:

In [125]:
sentence = "In a hole in the ground there lived a hobbit."
words = sentence.split()
print(words)

# Example of a list comprehension
word_lengths = [len(word) for word in words]
print(word_lengths)

# This is equivalent to the following loop
word_lengths = []
for word in words:
  word_lengths.append(len(word))

print(word_lengths)

['In', 'a', 'hole', 'in', 'the', 'ground', 'there', 'lived', 'a', 'hobbit.']
[2, 1, 4, 2, 3, 6, 5, 5, 1, 7]
[2, 1, 4, 2, 3, 6, 5, 5, 1, 7]


You can also add a condition to a list comprehension, so that the expression is only evaluated for items that meet a certain criterion:

In [126]:
# Keep only the words that are longer than five characters
long_words = [word for word in words if len(word) > 5]
print(long_words)

['ground', 'hobbit.']


Python also has set and dictionary comprehensions:

In [127]:
lowercase_characters = {c.lower() for c in sentence}
print(lowercase_characters)

word_length = {word: len(word) for word in words}
print(word_length['ground'])

{'a', 'g', 'b', 'n', 'o', 'i', 'l', 't', 'e', ' ', 'v', '.', 'u', 'h', 'r', 'd'}
6


A nested list is a list within another list. You can iterate through nested lists in the following way:

In [128]:
# A list of countries and their capitals within different continents
continents = [
    [('Iceland', 'Reykjavík'), ('Germany', 'Berlin'), ('Spain', 'Madrid')],  # Europe
    [('Japan', 'Tokyo'), ('China', 'Beijing'), ('South Korea', 'Seoul')],  # Asia
    [('Nigeria', 'Abuja'), ('Algeria', 'Algiers'), ('Angola', 'Luanda')]  # Africa
]

# Create a list of all the countries in the previous list
[country for continent in continents for (country, capital) in continent]

['Iceland',
 'Germany',
 'Spain',
 'Japan',
 'China',
 'South Korea',
 'Nigeria',
 'Algeria',
 'Angola']
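
The `for` clauses of a nested comprehension are read left to right, in the same order as the equivalent nested `for` loops. A minimal sketch, using a shortened stand-in for the `continents` list (the data here is illustrative only):

```python
# Shortened stand-in for the `continents` list above (illustrative data only)
continents = [
    [('Iceland', 'Reykjavík'), ('Germany', 'Berlin')],
    [('Japan', 'Tokyo')],
]

# Nested comprehension: the outer loop comes first, then the inner loop
countries = [country for continent in continents for (country, capital) in continent]

# Equivalent nested for loops
countries_loop = []
for continent in continents:
    for (country, capital) in continent:
        countries_loop.append(country)

print(countries == countries_loop)  # True
```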

## Sentiment analysis with NLTK
[Chapter 6](https://www.nltk.org/book/ch06.html) of the NLTK book shows how the toolkit can be used to create document classifiers, including a sentiment analyzer. The NLTK includes the `movie_reviews` corpus, which contains 2,000 movie reviews. Half of the reviews have been labelled as **positive** and the other half as **negative**. Let's download it and take a look:

In [129]:
import nltk
from nltk.corpus import movie_reviews
nltk.download('punkt')

nltk.download('movie_reviews')
print("Categories:", movie_reviews.categories())

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


Categories: ['neg', 'pos']


As expected, there are two categories: `pos` for positive reviews and `neg` for negative reviews. For this particular corpus, each review is stored as a separate text file. To get a list of all the text files in the corpus, we can use `movie_reviews.fileids()`. We can also get a list of files for a specific category:

In [130]:
pos_fileids = movie_reviews.fileids('pos')
neg_fileids = movie_reviews.fileids('neg')

print(pos_fileids[:5])  # The first 5 positive reviews
print(neg_fileids[:5])  # The first 5 negative reviews

['pos/cv000_29590.txt', 'pos/cv001_18431.txt', 'pos/cv002_15918.txt', 'pos/cv003_11664.txt', 'pos/cv004_11636.txt']
['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt']


We can get a list of all the tokens in the corpus with `movie_reviews.words()`. We can also specify a filename to get a single tokenized review:

In [131]:
pos_reviews = [movie_reviews.words(fid) for fid in pos_fileids]
neg_reviews = [movie_reviews.words(fid) for fid in neg_fileids]

print(pos_reviews[0][:10])  # The first 10 tokens of the first positive review
print(neg_reviews[0][:10])  # The first 10 tokens of the first negative review

['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success']
['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party']


Some words, such as *brilliant* and *memorable*, are more strongly associated with positive reviews than negative ones. Similarly, *boring* and *unfunny* have a stronger association with negative reviews.

Using the movie review corpus, we can train a classifier to predict whether a given review is positive or negative. The classifier extracts a set of *features* from every review, which are then used to make the classification. In this case, the features we use will be a dictionary that tells us whether each of the 2,000 most common words in the corpus is present within a review or not.

In [132]:
# Create a set with 2,000 of the most frequent words in the movie review corpus
movie_fd = nltk.FreqDist(movie_reviews.words())
movie_words = {word for word, count in movie_fd.most_common(2000)}

# For a given review (in the form of a list or set of tokens), create a
# dictionary which tells us which words are present and which are not.
def get_review_features(review):
  review_words = set(review)
  return {word: word in review_words for word in movie_words}

In [133]:
# Let's see how this works for the first positive review:
example_features = get_review_features(pos_reviews[0])
print(example_features)
print("'funny' is in the review:", example_features['funny'])
print("'boring' is in the review:", example_features['boring'])

{'this': True, 'club': False, 'boys': False, 'live': False, 'decent': False, 'start': False, 'men': False, 'called': True, 'shame': False, 'shooting': False, 'voice': False, 'project': False, 'children': False, 'solid': True, 'british': True, 'things': False, 'finds': False, 'tone': False, 'impressive': False, 'attempt': True, 'amusing': False, 'lacks': False, 'doubt': False, 'missing': False, 'man': False, 'done': False, 'silent': False, 'step': False, 'aspect': False, 'meaning': False, 'intriguing': False, 'thing': True, 'himself': False, 'third': False, 'd': False, 'secret': True, 'parts': False, 'test': False, 'utterly': False, 'south': False, 'scale': False, 'decided': False, 'jennifer': False, 'william': False, 'reach': False, 'tommy': False, 'he': True, 'goes': False, 'sitting': False, 'law': False, 'various': False, 'george': False, 'development': False, 'happens': False, 'teacher': False, 'arts': False, 'growing': False, '--': False, 'totally': False, 'hand': False, 'made': Tr

Next, let's create a training set that we can use to train a Naive Bayes classifier. The training set, in this case, is a list of tuples in the format `[(features, category), ...]`, where `features` is a dictionary from `get_review_features()` and `category` is either `pos` or `neg`, depending on whether the review is positive or negative. To get an idea of how well the classifier performs, we're going to reserve 10% of the reviews for testing. That means that we'll be training our classifier on 1,800 examples and testing it on 200 examples.

In [134]:
pos_examples = [(get_review_features(review), 'pos') for review in pos_reviews]
neg_examples = [(get_review_features(review), 'neg') for review in neg_reviews]

movie_training = pos_examples[:900] + neg_examples[:900]  # 1800 examples total
movie_test = pos_examples[900:] + neg_examples[900:]  # 200 examples total

Now we have everything we need to train our classifier.

In [135]:
movie_classifier = nltk.NaiveBayesClassifier.train(movie_training)

How well does it perform on the test set?

In [136]:
print("Accuracy:", nltk.classify.accuracy(movie_classifier, movie_test))

Accuracy: 0.815
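
Accuracy is the fraction of test examples whose predicted label matches the true label; with 200 test examples, 0.815 corresponds to 163 correct predictions. A minimal sketch of the calculation, where the `accuracy` helper and the labels are hypothetical and not taken from the corpus:

```python
def accuracy(gold_labels, predicted_labels):
    """Fraction of positions where the predicted label equals the gold label."""
    correct = sum(1 for gold, pred in zip(gold_labels, predicted_labels) if gold == pred)
    return correct / len(gold_labels)

# Hypothetical labels for illustration; not taken from the movie review corpus
gold = ['pos', 'pos', 'neg', 'neg']
predicted = ['pos', 'neg', 'neg', 'neg']
print(accuracy(gold, predicted))  # 0.75
```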


The classifier achieves an accuracy of 81.5%. Let's take a look at which features the classifier finds most informative:

In [137]:
movie_classifier.show_most_informative_features(20)

Most Informative Features
             outstanding = True              pos : neg    =     15.6 : 1.0
                   mulan = True              pos : neg    =      9.0 : 1.0
             wonderfully = True              pos : neg    =      7.1 : 1.0
                  seagal = True              neg : pos    =      7.0 : 1.0
                   damon = True              pos : neg    =      6.1 : 1.0
                   flynt = True              pos : neg    =      5.7 : 1.0
                  wasted = True              neg : pos    =      5.6 : 1.0
                    lame = True              neg : pos    =      5.3 : 1.0
                  poorly = True              neg : pos    =      5.2 : 1.0
                   awful = True              neg : pos    =      4.9 : 1.0
              ridiculous = True              neg : pos    =      4.8 : 1.0
                    jedi = True              pos : neg    =      4.4 : 1.0
                 unfunny = True              neg : pos    =      4.4 : 1.0
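
Each row of this output compares how likely a feature value is under one label versus the other: `pos : neg = 15.6 : 1.0` means the classifier estimates `outstanding = True` to be about 15.6 times as likely in a positive review as in a negative one. A rough sketch of such a ratio computed from raw counts follows. The counts and the `likelihood_ratio` helper are hypothetical, and NLTK smooths its probability estimates, so its reported figures will not match raw-count ratios exactly.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of P(feature | label a) to P(feature | label b), from raw counts."""
    return (count_a / total_a) / (count_b / total_b)

# Hypothetical counts: a word appears in 45 of 900 positive training reviews
# and in 5 of 900 negative ones
ratio = likelihood_ratio(45, 900, 5, 900)
print(f"pos : neg = {ratio:.1f} : 1.0")
```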

# Assignment
Answer the following questions and hand in your solution in Canvas before 23:59 on September 6th. Remember to save your file before uploading it.

## Question 1
The NLTK also includes a `subjectivity` corpus, which contains a collection of sentences that have either been categorized as **subjective** (emotional, expressing personal feelings and views) or **objective** (more rational, factual). Some examples:

* **Objective sentences**:
  * uma thurman stars in quentin tarantino's fourth film venture , kill bill .  
  * he lives in a motor garage with his six friends .
  * the ensuing battle was one of the most savage in u . s . history .
* **Subjective sentences**:
  * seagal's strenuous attempt at a change in expression could very well clinch him this year's razzie .
  * de niro cries . you'll cry for your money back .
  * a heroic tale of persistence that is sure to win viewers' hearts .

Unlike the movie review corpus, where every review is stored in a separate file, here there is only one file for each category.

Complete the following tasks:
1. Import and download the `subjectivity` corpus.
2. Find the names of each category.
3. Using the category names, get the relative path of each file.
4. Get a list of tokenized sentences for each category (using `subjectivity.sents(fileid)`).

In [138]:
# Your solution here
import nltk
from nltk.corpus import subjectivity

# Download the `subjectivity` corpus (and `punkt`) if not already present
nltk.download('punkt')
nltk.download('subjectivity')

# Find the name of each category
# print(f"Categories: {subjectivity.categories()}")

# Get the relative path of the file(s) in each category
obj_path = subjectivity.fileids('obj')
subj_path = subjectivity.fileids('subj')

# print(obj_path)
# print(subj_path)

# Get a list of tokenized sentences for each category
list_sentences_obj = subjectivity.sents(obj_path)
list_sentences_subj = subjectivity.sents(subj_path)

print(list_sentences_obj)
print(list_sentences_subj)

[['the', 'movie', 'begins', 'in', 'the', 'past', 'where', 'a', 'young', 'boy', 'named', 'sam', 'attempts', 'to', 'save', 'celebi', 'from', 'a', 'hunter', '.'], ['emerging', 'from', 'the', 'human', 'psyche', 'and', 'showing', 'characteristics', 'of', 'abstract', 'expressionism', ',', 'minimalism', 'and', 'russian', 'constructivism', ',', 'graffiti', 'removal', 'has', 'secured', 'its', 'place', 'in', 'the', 'history', 'of', 'modern', 'art', 'while', 'being', 'created', 'by', 'artists', 'who', 'are', 'unconscious', 'of', 'their', 'artistic', 'achievements', '.'], ...]
[['smart', 'and', 'alert', ',', 'thirteen', 'conversations', 'about', 'one', 'thing', 'is', 'a', 'small', 'gem', '.'], ['color', ',', 'musical', 'bounce', 'and', 'warm', 'seas', 'lapping', 'on', 'island', 'shores', '.', 'and', 'just', 'enough', 'science', 'to', 'send', 'you', 'home', 'thinking', '.'], ...]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package subjectivity to /root/nltk_data...
[nltk_data]   Package subjectivity is already up-to-date!


## Question 2
Complete the following tasks:
1. Create a set with the 2,000 most common words in the `subjectivity` corpus using `nltk.FreqDist()`.
2. Create a function that takes a single, tokenized sentence as input (e.g., `['the', 'ensuing', 'battle', ...]`), and returns a dictionary of the 2,000 most frequent words and whether or not they are in the sentence (e.g., `{'battle': True, 'amusing': False, ...}`).

In [139]:
# Your solution here
# Build a frequency distribution over all tokens in the corpus
subject_fd = nltk.FreqDist(subjectivity.words())
# most_common(2000) returns a list of (word, count) pairs; the set
# comprehension keeps only the word from each pair.
subject_word = {word for word, count in subject_fd.most_common(2000)}

# Takes a tokenized sentence and returns, for each distinct token in it,
# whether that token is among the most frequent words. Note that the keys are
# the sentence's tokens, not the 2,000 frequent words named in the task.
def get_token_sentence(sentence):
  # Use a set to avoid duplicate tokens
  set_token = set(sentence)
  return {token: token in subject_word for token in set_token}

# See the behaviour using a sentence as an example
example = get_token_sentence(list_sentences_obj[54])
print(example)


{'lead': True, 'that': True, 'molly': False, 'in': True, 'of': True, 'them': True, 'will': True, 'one': True, ',': True, 'ahead': True, '1': False, 'with': True, 'rabbit-proof': False, 'the': True, 'an': True, 'fence': False, 'authorities': False, 'continent': False, 'search': True, 'journey': True, 'determination': False, "australia's": False, 'outback': False, 'epic': True, 'home': True, 'step': True, '500': False, 'bisects': False, 'miles': False, 'over': True, 'grit': False, '.': True, 'on': True, 'guides': False, 'and': True, 'girls': True}
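
The task statement describes a dictionary keyed by the 2,000 most frequent words, whereas the output above is keyed by the tokens of the sentence. A minimal sketch of the former, mirroring `get_review_features` from the movie review example; `frequent_words` and `get_sentence_features` are hypothetical names, and the tiny vocabulary is a stand-in for the real set of frequent words:

```python
# Hypothetical miniature vocabulary standing in for the 2,000 most frequent words
frequent_words = {'the', 'battle', 'amusing', 'was'}

def get_sentence_features(sentence):
    """Map every frequent word to whether it occurs in the tokenized sentence."""
    sentence_tokens = set(sentence)
    return {word: word in sentence_tokens for word in frequent_words}

features = get_sentence_features(['the', 'ensuing', 'battle', 'was', 'savage'])
print(features['battle'], features['amusing'])  # True False
```

With this shape, every sentence yields a dictionary with the same keys, so the absence of a frequent word is also available to the classifier as a feature.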


## Question 3
Complete the following tasks:
1. Create a training set with 9,000 sentences (4,500 of each category)
2. Create a test set with 1,000 sentences (500 of each category)

In [140]:
# Your solution here
# Create labelled (features, category) examples for each category
obj_example = [(get_token_sentence(sentence), 'obj') for sentence in list_sentences_obj]
subj_example = [(get_token_sentence(sentence), 'subj') for sentence in list_sentences_subj]

# Split the examples into a training set and a test set
subjective_training = obj_example[:4500] + subj_example[:4500]
subjective_test = obj_example[4500:5000] + subj_example[4500:5000]

# Check that the training set has 9,000 sentences and the test set has 1,000
print(f"Length training: {len(subjective_training)}")
print(f"Length test: {len(subjective_test)}")


Length training: 9000
Length test: 1000


## Question 4
Complete the following tasks:
1. Train a Naive Bayes classifier using the training set from the previous question.
2. Evaluate the classifier on the test set. How accurate is it?
3. Find the 20 most informative features.

In [141]:
# Your solution here

# Train the Naive Bayes classifier
subjective_clf = nltk.NaiveBayesClassifier.train(subjective_training)
# Evaluation using test set to see the accuracy
print(f"Accuracy: {nltk.classify.accuracy(subjective_clf, subjective_test)}")
# See the most informative features
subjective_clf.show_most_informative_features(20)

Accuracy: 0.923
Most Informative Features
                      -- = True             subj : obj    =     70.1 : 1.0
                   order = True              obj : subj   =     39.0 : 1.0
                 decides = True              obj : subj   =     35.7 : 1.0
                  sister = True              obj : subj   =     27.7 : 1.0
            entertaining = True             subj : obj    =     26.6 : 1.0
              girlfriend = True              obj : subj   =     26.3 : 1.0
                discover = True              obj : subj   =     25.0 : 1.0
                  film's = True             subj : obj    =     25.0 : 1.0
                  you're = True             subj : obj    =     22.6 : 1.0
                daughter = True              obj : subj   =     22.4 : 1.0
                 married = True              obj : subj   =     21.7 : 1.0
                 amusing = True             subj : obj    =     19.7 : 1.0
                   plans = True              obj : subj   

## Question 5
A dialog act is the type of *action* that a speaker performs with an utterance. In the NPS Chat corpus of instant messaging posts, each utterance is labeled with one of 15 dialog act types, such as **Statement**, **Emotion**, **ynQuestion**, and **Continuer**.

Your task is to classify text from the NPS corpus into two dialog acts: **whQuestion** or **Emotion**.

Start by downloading the NPS corpus and getting all posts from the corpus:

In [142]:
nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()

[nltk_data] Downloading package nps_chat to /root/nltk_data...
[nltk_data]   Package nps_chat is already up-to-date!


Create a list that only includes posts of class **Emotion** and **whQuestion**. You can access the class of a post by calling `post.get("class")`.

In [143]:
# Your solution here
# print(posts)
# Keep only the posts labelled `Emotion` or `whQuestion`, as (text, label) pairs
filtered = [(post.text, post.get("class")) for post in posts if post.get("class") in ("Emotion", "whQuestion")]
# print(filtered)


Randomize the posts and create a training set and a test set, where the first 1300 **Emotion + whQuestion** posts are used for training and the rest for testing.

In [144]:
# Your solution here
# Shuffle the posts, then split them into a training set and a test set
import random
random.shuffle(filtered)
training_set = filtered[:1300]
test_set = filtered[1300:]
# print(training_set)
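
Because `random.shuffle` produces a different order on every run, the split and therefore the reported accuracy will vary between runs. One way to make the split repeatable is to shuffle with a seeded generator. This is a sketch under assumptions: the `examples` list is a hypothetical stand-in for the filtered posts, and the seed value is arbitrary.

```python
import random

# Hypothetical stand-in for the list of (text, label) pairs
examples = [(f"post {i}", "Emotion" if i % 2 else "whQuestion") for i in range(10)]

# A dedicated generator with a fixed seed gives the same order on every run
rng = random.Random(42)  # the seed value 42 is arbitrary
shuffled = examples[:]   # copy, so the original order is preserved
rng.shuffle(shuffled)

training, test = shuffled[:8], shuffled[8:]
print(len(training), len(test))  # 8 2
```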

Create a list of the 200 most frequent tokens in the training set. You can access the text of a `post` object by calling `post.text`. Remember that the **split** function will use whitespace to tokenize a string: `some_string.split()`

In [145]:
# Your solution here
# Create a list of tokens from the training set
tokens = [token for post, _ in training_set for token in post.split()]

# Get the 200 most frequent tokens
fd = nltk.FreqDist(tokens)
most_frequent_tokens = [token for token, _ in fd.most_common(200)]

# print(most_frequent_tokens)
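
Since `split()` only breaks on whitespace, punctuation stays attached to the neighbouring word, so `what` and `what?` are counted as different tokens. A small sketch with made-up posts, using `collections.Counter` (whose `most_common` method behaves like the one on `nltk.FreqDist`):

```python
from collections import Counter

# Made-up posts for illustration; not taken from the NPS corpus
posts = ["what's up everyone?", "lol what?", "what is up"]

# Whitespace tokenization keeps punctuation attached to words
print(posts[0].split())  # ["what's", 'up', 'everyone?']

tokens = [token for post in posts for token in post.split()]
counts = Counter(tokens)
print(counts.most_common(1))  # [('up', 2)]
```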



Define two feature selection functions that take a string as input and output a dictionary of features:
* `get_word_features(string)`
* `get_custom_features(string)`

Begin by defining `get_word_features`. This function should use the words as features, just like in the movie review example above.




In [146]:
# Your solution here
def get_word_features(post):
  # Note: `in` on a string is a substring test, so a token such as "at" also
  # counts as present in a post that contains "what".
  return {token: token in post.lower() for token in most_frequent_tokens}
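
The function above tests each token against the whole post string, which is a substring test. An alternative that treats a token as present only when it appears as a whole whitespace-separated token could look like the following sketch. `frequent_tokens` and `get_word_features_by_token` are hypothetical names, and the sketch assumes the frequent tokens are lowercase.

```python
# Hypothetical short list standing in for `most_frequent_tokens`
frequent_tokens = ['what', 'at', 'lol']

def get_word_features_by_token(post):
    """Mark a token as present only if it is a whole whitespace-separated token."""
    post_tokens = set(post.lower().split())
    return {token: token in post_tokens for token in frequent_tokens}

features = get_word_features_by_token("What time is it")
print(features)  # {'what': True, 'at': False, 'lol': False}
```

With substring matching, `at` would be reported as present in this post because it occurs inside `what`.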

Next, define `get_custom_features`. This function should extract features from the text that characterize the **Emotion** and **whQuestion** classes.

In [147]:
# Your solution here
def get_custom_features(custom):
  # Lowercase the post so that matching is case-insensitive
  custom = custom.lower()
  # Matching is by substring. Entries containing uppercase letters (e.g. "XD")
  # can never match the lowercased post, so those features are always False.
  # Question words and punctuation
  features = ["who", "what", "when", "where", "why", "how", "!", "?"]
  # Laughter and interjections
  features += ["lol", "hahaha", "haha", "lmao", "rofl", "teehee", "hehehe", "hehe", "damn", "omg", "grrr", "wtf"]
  # Emoticons
  features += [":)", ":(", "XD", ":-)", "8D", ":X", ":-/", ":/", "<3", ":'(", ";-)", ":p", ":P", ":D", ":|", ":O", ":S", ":*", "O:", "o:", ":o"]
  # Emotion words
  features += ["love", "hate", "like", "dislike", "sad", "happy", "angry", "mad", "annoyed", "excited", "bored", "scared", "fear", "afraid", "surprised", "surprise", "disgusted", "disgust", "shocked", "shock", "confused", "confuse", "confusing", "confusion", "depressed", "depress", "depressing", "depression", "anxious", "anxiety", "anxiously"]

  return {token: token in custom for token in features}


Complete the following tasks:
*   Train two Naive Bayes classifiers on the **Emotion + whQuestion** training set: one that uses the `get_word_features` function and another using `get_custom_features`.
*   Evaluate each classifier on the test set. How accurate are they? Which one is better?
*   What are the 20 most informative features for each classifier?


In [148]:
# Your solution here

# Prepare training sets with word features and custom features
training_word_features = [(get_word_features(post), label) for post, label in training_set]
training_custom_features = [(get_custom_features(post), label) for post, label in training_set]

# Prepare test sets with word features and custom features
test_word_features = [(get_word_features(post), label) for post, label in test_set]
test_custom_features = [(get_custom_features(post), label) for post, label in test_set]

# Train Naive Bayes Classifiers
word_classifier = nltk.NaiveBayesClassifier.train(training_word_features)
custom_classifier = nltk.NaiveBayesClassifier.train(training_custom_features)

# Evaluate classifiers
word_accuracy = nltk.classify.accuracy(word_classifier, test_word_features)
custom_accuracy = nltk.classify.accuracy(custom_classifier, test_custom_features)


print(f"Accuracy (word features): {word_accuracy}")
print(f"Accuracy (custom features): {custom_accuracy}")

# Show most informative features
word_classifier.show_most_informative_features(20)
custom_classifier.show_most_informative_features(20)


Accuracy (word features): 0.9734513274336283
Accuracy (custom features): 0.9911504424778761
Most Informative Features
                    what = True           whQues : Emotio =    181.4 : 1.0
                     how = True           whQues : Emotio =    118.1 : 1.0
                      at = True           whQues : Emotio =     93.6 : 1.0
                      do = True           whQues : Emotio =     58.7 : 1.0
                      up = True           whQues : Emotio =     37.6 : 1.0
                     who = True           whQues : Emotio =     34.2 : 1.0
                      in = True           whQues : Emotio =     34.0 : 1.0
                     and = True           whQues : Emotio =     32.3 : 1.0
                     you = True           whQues : Emotio =     26.5 : 1.0
                    that = True           whQues : Emotio =     23.1 : 1.0
                      is = True           whQues : Emotio =     20.7 : 1.0
                    from = True           whQues : Emotio