# Lab 3
This lab is an experiment in using a single file in which you fill in code blocks where necessary. The file is available as both .py and .ipynb. Using the latter with Jupyter Notebook is highly recommended, as it provides substantially better feedback.


Provide your outputs in a simple report, along with textual answers.


The idea behind this format is to clarify what sort of output is required, as all answers are checked by the tests in the `tests.py` file.

In [2]:
import sklearn
import nltk
import random
import pandas as pd
import re
# feel free to import from modules of sklearn and nltk later
# e.g., from sklearn.model_selection import train_test_split
from sklearn.model_selection import train_test_split

## Exercise 1 - Gender detection of names
In NLTK you'll find the corpus `corpus.names`, a set of roughly 5000 female and 3000 male names.
1) Select a ratio of train/test data (based on experiences from previous labs perhaps?)
2) Build a feature extractor function
3) Build two classifiers:
    - Decision tree
    - Naïve Bayes
    
Finally, write code to evaluate the classifiers. Explain your results. What do you think would change if you altered your feature extractor?
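
When choosing the ratio in step 1, it may help to see how a fractional `test_size` translates into example counts. The sketch below is a rough illustration only: the helper `split_sizes` is hypothetical, the corpus size of 7944 names is an assumption, and the rounding is assumed to follow scikit-learn's behaviour when only `test_size` is given (test count rounded up, the remainder used for training).

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and everything else
    goes to the training set.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_names = 7944  # assumed total number of names in the corpus
for test_size in (0.1, 0.25, 0.75):
    n_train, n_test = split_sizes(n_names, test_size)
    print(f"test_size={test_size}: train={n_train}, test={n_test}")
```

Note that the ratio is the share held out for testing, so a value such as 0.75 leaves only a quarter of the names for training.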

In [4]:
class GenderDataset:
    def __init__(self):
        self.names = nltk.corpus.names
        self.data = None 
        self.build()

    def make_labels(self, gender):
        return [(n, gender) for n in self.names.words(gender + ".txt")]
    
    def build(self):
        self.data = self.make_labels("female")
        self.data.extend(self.make_labels("male"))
    
    def split(self, ratio):
        return train_test_split(self.data, test_size=ratio, shuffle=True, random_state=4)

In [14]:
class Classifier:
    def __init__(self, classifier):
        self.classifier = classifier
        self.model = None
    
    def train(self, data):
        # train on the data passed in, not on the global train_set
        self.model = self.classifier.train(data)
        
    def test(self, data):
        return nltk.classify.accuracy(self.model, data)
    
    def train_and_evaluate(self, train, test):
        self.train(train)
        return self.test(test)
        
    def show_features(self):
        # OPTIONAL
        pass

                                 
class FeatureExtractor:
    def __init__(self, data):
        self.data = data
        self.features = []  
        
        self.build()
                 
    @staticmethod
    def text_to_features(name):
        return {
            "name": name,
            "last_letter": name[-1],
            "first_letter": name[0],
        }
    
    def build(self):
        for name, tag in self.data:
            self.features.append((FeatureExtractor.text_to_features(name), tag))


Note: you should achieve an accuracy of well above 70%!
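
If a classifier falls short of that target, the feature extractor is one place to look. A feature holding the full name is close to unique per example, so it can hardly carry over to unseen names. The sketch below is one possible alternative, not the intended solution: `gender_features` and the features it returns are hypothetical choices, and whether they actually improve accuracy has to be checked by rerunning the evaluation.

```python
def gender_features(name):
    """Hypothetical feature extractor that avoids the full name as a feature."""
    lowered = name.lower()
    return {
        "last_letter": lowered[-1],   # final letter of the name
        "suffix2": lowered[-2:],      # last two letters
        "first_letter": lowered[0],   # initial letter
        "length": len(lowered),       # number of characters
    }


print(gender_features("Shrek"))
```

Such a function could replace `text_to_features` in `FeatureExtractor` without touching the rest of the pipeline, since it returns a plain dictionary per name.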

In [13]:
from nltk.classify import NaiveBayesClassifier, DecisionTreeClassifier

In [28]:
split_ratio = 0.75  # TODO: modify (passed as test_size, i.e. the fraction held out for testing)
train, test = GenderDataset().split(ratio=split_ratio)

classifiers = {
    "decision_tree": Classifier(DecisionTreeClassifier),
    "naive_bayes": Classifier(NaiveBayesClassifier), 
}

train_set = FeatureExtractor(train).features
test_set = FeatureExtractor(test).features

for name, classifier in classifiers.items():
    acc = classifier.train_and_evaluate(train_set, test_set)
    print("Model: {}\tAccuracy: {}".format(name, acc))

Model: decision_tree	Accuracy: 0.6124538435716683
Model: naive_bayes	Accuracy: 0.7554548506210138



## Exercise 2 - Spam or ham
"Spam or ham" refers to classifying a message as either spam or regular mail ("ham"). Follow the instructions and implement the `TODO`s.

In [54]:
spam = pd.read_csv(
    'spam.csv',
    usecols=["v1", "v2"],
    encoding="latin-1"
).rename(columns={"v1": "label", "v2": "text"})

print(spam.label.value_counts())
spam.head()



```
ham     4825
spam     747
Name: label, dtype: int64
label	text
0	ham	Go until jurong point, crazy.. Available only ...
1	ham	Ok lar... Joking wif u oni...
2	spam	Free entry in 2 a wkly comp to win FA Cup fina...
3	ham	U dun say so early hor... U c already then say...
4	ham	Nah I don't think he goes to usf, he lives aro...
```

In [55]:
labelmapping = {
    "ham": 0,
    "spam": 1
    }

spam.label = spam.label.apply(labelmapping.get)
spam.label.value_counts()

0    4825
1     747
Name: label, dtype: int64
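
The label counts are quite unbalanced, which matters when reading accuracy figures later on. As a rough sketch (the helper `majority_baseline` is hypothetical), the accuracy of a model that always predicts the most frequent label can be computed directly from the counts above:

```python
def majority_baseline(label_counts):
    """Return the most frequent label and the accuracy of always predicting it."""
    total = sum(label_counts.values())
    label, count = max(label_counts.items(), key=lambda item: item[1])
    return label, count / total


counts = {0: 4825, 1: 747}  # label counts taken from the output above
label, accuracy = majority_baseline(counts)
print(f"Always predicting {label} gives accuracy {accuracy:.3f}")
```

A classifier is only useful here if it clearly beats this baseline, which is one reason to look at precision and recall rather than accuracy alone.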


In [56]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

class TextCleaner:
    def __init__(self, text):
        self.text = word_tokenize(text) 
        self.stemmer = PorterStemmer() 
        self.stopwords = set(stopwords.words('english'))
        self.lem = WordNetLemmatizer()
    
 
    def lowercase(self):
        """
        Create small functions to replace your tokens (self.text)
        iteratively. Such as a lowercase function.
        """
        self.text = [w.lower() for w in self.text]

    def clean(self):
        self.lowercase()
        self.text = [word for word in self.text if word not in self.stopwords]
        self.text = [self.stemmer.stem(word) for word in self.text]
        self.text = [self.lem.lemmatize(word) for word in self.text]
        return " ".join(self.text)

In [57]:
clean = lambda text: TextCleaner(text).clean()
spam.text = spam.text.apply(clean)

In [58]:
spam.head()



```
label	text
0	0	go jurong point , crazi .. avail bugi n great ...
1	0	ok lar ... joke wif u oni ...
2	1	free entri 2 wkli comp win fa cup final tkt 21...
3	0	u dun say earli hor ... u c alreadi say ...
4	0	nah n't think goe usf , live around though
```

In [59]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer

split_ratio = 0.50 
X_train, X_test, y_train, y_test = train_test_split(
    spam.text, spam.label, test_size=split_ratio, random_state=4310)


vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)


classifier = MultinomialNB()
if classifier:
    classifier.fit(X_train, y_train)

In [60]:
def predict(model, vectorizer, data, all_predictions=False):
    data = vectorizer.transform(data)  # reuse the vocabulary fitted on the training data
    if all_predictions:
        return model.predict_proba(data)
    else:
        return model.predict(data)

def print_examples(data, probs, label1, label2, n=10):
    percent = lambda x: "{}%".format(round(x*100, 1))

    for text, pred in list(zip(data, probs))[:n]:
        print("{}\n{}: {} / {}: {}\n{}".format(
            text,
            label1,
            percent(pred[0]),
            label2,
            percent(pred[1]),
            "-" * 100  # to print a line
        ))
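
The reason `predict` calls `transform` rather than `fit_transform` is that the test data must be mapped onto the vocabulary learned from the training data. The toy class below is a simplified stand-in written for illustration, not scikit-learn's implementation: `TinyCountVectorizer` is hypothetical and only splits on whitespace.

```python
from collections import Counter


class TinyCountVectorizer:
    """Toy bag-of-words vectorizer (hypothetical, whitespace tokens only)."""

    def fit(self, texts):
        # the vocabulary is built from the training texts only
        vocab = sorted({tok for text in texts for tok in text.split()})
        self.vocabulary_ = {tok: i for i, tok in enumerate(vocab)}
        return self

    def transform(self, texts):
        rows = []
        for text in texts:
            row = [0] * len(self.vocabulary_)
            for tok, n in Counter(text.split()).items():
                idx = self.vocabulary_.get(tok)
                if idx is not None:  # tokens unseen during fit are dropped
                    row[idx] = n
            rows.append(row)
        return rows


train_texts = ["free entri win", "ok lar joke", "free free cup"]
test_texts = ["win free prize"]

vec = TinyCountVectorizer().fit(train_texts)
print(vec.vocabulary_)
print(vec.transform(test_texts))  # "prize" is ignored: it was never seen in training
```

Fitting again on the test data would produce a different vocabulary, and the columns would no longer line up with what the classifier was trained on.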

In [62]:
if classifier:
    y_probas = predict(classifier, vectorizer, X_test, all_predictions=True)
    print_examples(X_test, y_probas, "ham", "spam")

    y_pred = predict(classifier, vectorizer, X_test)
    # TODO display a confusion matrix on the test set vs predictions
    confusion_mat = confusion_matrix(y_test, y_pred)
    print(confusion_mat)

    # compute recall and precision from the confusion matrix
    tn, fp, fn, tp = confusion_mat.ravel()
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)

    print("Recall={}\nPrecision={}".format(round(recall, 2), round(precision, 2)))


```
world famamu ....
ham: 98.1% / spam: 1.9%
----------------------------------------------------------------------------------------------------
\aww must nearli dead ! well jez iscom todo workand whilltak forev ! \ '' ''
ham: 98.9% / spam: 1.1%
----------------------------------------------------------------------------------------------------
babe . hope ok . shit night sleep . fell asleep 5.iåõm knacker iåõm dread work tonight . thou upto tonight . x
ham: 100.0% / spam: 0.0%
----------------------------------------------------------------------------------------------------
thank . like well ...
ham: 99.9% / spam: 0.1%
----------------------------------------------------------------------------------------------------
'm read text sent . meant joke . read light
ham: 99.8% / spam: 0.2%
----------------------------------------------------------------------------------------------------
oki ì_ wan meet bishan ? co bishan . 'm drive today .
ham: 100.0% / spam: 0.0%
----------------------------------------------------------------------------------------------------
smile pleasur smile pain smile troubl pour like rain smile sum1 hurt u smile becoz someon still love see u smile ! !
ham: 100.0% / spam: 0.0%
----------------------------------------------------------------------------------------------------
hi : ) ct employe ?
ham: 95.7% / spam: 4.3%
----------------------------------------------------------------------------------------------------
c movi juz last minut decis mah . juz watch 2 lar tot ì_ interest .
ham: 100.0% / spam: 0.0%
----------------------------------------------------------------------------------------------------
; - ) ok . feel like john lennon .
ham: 100.0% / spam: 0.0%
----------------------------------------------------------------------------------------------------
[[2395   15]
 [  31  345]]
Recall=0.92
Precision=0.96
```
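
As a worked check of the figures above, the same metrics can be recomputed by hand from the printed confusion matrix. This is a small sketch: `binary_metrics` is a hypothetical helper, and it assumes the usual layout with true labels as rows and predictions as columns. F1 and accuracy are added for comparison and are not part of the original output.

```python
def binary_metrics(matrix):
    """Compute metrics from a 2x2 matrix laid out as [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)  # share of predicted spam that really is spam
    recall = tp / (tp + fn)     # share of real spam that was caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


metrics = binary_metrics([[2395, 15], [31, 345]])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Here 345 of the 376 spam messages are caught (recall of about 0.92), and 345 of the 360 messages flagged as spam are correct (precision of about 0.96).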

## Exercise 3 - Word features
Word features can be very useful for performing document classification, since the words that appear in a document give a strong indication of what its semantic content is. However, many words occur very infrequently, and some of the most informative words in a document may never have occurred in our training data. One solution is to make use of a lexicon, which describes how different words relate to each other.

Your task:
- Use the WordNet lexicon and augment the movie review document classifier (See NLTK book, Ch. 6, section 1.3) to use features that generalize the words that appear in a document, making it more likely that they will match words found in the training data.
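
The idea of generalizing words through a lexicon can be illustrated without WordNet. The sketch below uses a tiny hand-made mapping; `TOY_LEXICON`, `generalize`, and the example words are all made up for illustration and say nothing about what WordNet actually contains.

```python
# Hypothetical mini-lexicon: each word maps to a shared, more general label.
TOY_LEXICON = {
    "superb": "good", "excellent": "good", "great": "good",
    "awful": "bad", "dreadful": "bad", "terrible": "bad",
}


def generalize(tokens, lexicon=TOY_LEXICON):
    """Replace each token by its lexicon entry, or keep it if there is none."""
    return [lexicon.get(tok, tok) for tok in tokens]


training_vocab = {"good", "bad", "movie"}  # words assumed to be seen in training
review = ["a", "superb", "movie"]

seen_before = [tok for tok in review if tok in training_vocab]
seen_after = [tok for tok in generalize(review) if tok in training_vocab]
print(seen_before)  # ['movie']
print(seen_after)   # ['good', 'movie']
```

After generalization, the rare word "superb" matches a feature the classifier already knows, which is the effect the task aims for.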

Download wordnet and import

In [63]:
nltk.download('wordnet')
from nltk.corpus import movie_reviews
from nltk.corpus import wordnet as wn
import random

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\andri\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [81]:
from itertools import chain
# TODO: implement a function that returns a synonym for "word" if available, otherwise return the word itself
def word_to_syn(word):
    synsets = wn.synsets(word)
    # collect every lemma name across all synsets of the word
    lemmas = set(chain.from_iterable(syn.lemma_names() for syn in synsets))
    return random.choice(list(lemmas)) if lemmas else word

In [82]:
"""
this is from Ch. 6, sec. 1.3, with slight modifications
note that word_to_syn(word) (from the above implementation)
is in the beginning of the following function
"""
documents = [([word_to_syn(word) for word in list(movie_reviews.words(fileid))], category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
n_most_freq = 2000
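# iterating over a FreqDist does not guarantee frequency order, so ask for the most common words explicitly
word_features = [word for word, _ in all_words.most_common(n_most_freq)]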

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [83]:
featuresets = [(document_features(d), c) for (d, c) in documents]

split_ratio = 0.75 
train_set, test_set = train_test_split(featuresets, test_size=split_ratio)


classifier = NaiveBayesClassifier
model = classifier.train(train_set)

In [93]:
# TODO: return a flattened list of input words and their lemmas
def synset_expansion(words) -> list:
    allwords = []
    for word in words:
        synsets = wn.synsets(word)
        # lowercase lemma names from every synset of the word
        lemmas = set(map(str.lower, chain.from_iterable(syn.lemma_names() for syn in synsets)))
        lemmas.add(word)
        allwords.extend(lemmas)
    return allwords

expanded_word_features = synset_expansion(word_features)

In [94]:
# some assertions to test your code :-)
assert sorted(synset_expansion(["pc"])) == ["microcomputer", "pc", "personal_computer"]
assert sorted(synset_expansion(["programming", "coder"])) == [
    'coder',
    'computer_programing',
    'computer_programmer',
    'computer_programming',
    'program',
    'programing',
    'programme',
    'programmer',
    'programming',
    'scheduling',
    'software_engineer'
]

In [95]:
doc_featuresets = [(document_features(d), c) for (d, c) in documents]
doc_train_set, doc_test_set = train_test_split(doc_featuresets, test_size=0.1)

doc_model = model.train(doc_train_set)
doc_model.show_most_informative_features(5)
print("Accuracy: ", nltk.classify.accuracy(doc_model, doc_test_set))

Most Informative Features
         contains(mulan) = True              pos : neg    =      8.2 : 1.0
      contains(touching) = True              pos : neg    =      7.6 : 1.0
        contains(seagal) = True              neg : pos    =      7.6 : 1.0
    contains(determined) = True              pos : neg    =      7.5 : 1.0
   contains(wonderfully) = True              pos : neg    =      7.5 : 1.0
Accuracy:  0.685



In [96]:
def lexicon_features(reviews):
    review_words = set(reviews)
    base_words = set(word_features)  # set lookup avoids scanning the list for every word
    features = {}
    for word in expanded_word_features:
        if word not in base_words:
            # word was only added through the synset expansion
            features['synset({})'.format(word)] = (word in review_words)
        features['contains({})'.format(word)] = (word in review_words)

    return features

Question: do you see any issues with including the synsets? Experiment a bit with different words and verify your ideas.
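
One thing that may be worth checking while experimenting is whether the `synset(...)` and `contains(...)` features carry different information at all. The sketch below mirrors the logic of `lexicon_features` on made-up word lists (`toy_lexicon_features` and all the words are hypothetical) and lists the words for which both features always hold the same value.

```python
def toy_lexicon_features(review_words, expanded, base):
    """Same feature logic as lexicon_features, on explicit toy word lists."""
    present = set(review_words)
    base = set(base)
    features = {}
    for word in expanded:
        if word not in base:
            features[f"synset({word})"] = word in present
        features[f"contains({word})"] = word in present
    return features


base_words = ["bad"]                         # made-up original word features
expanded_words = ["bad", "tough", "spoiled"]  # made-up expansion of those words

feats = toy_lexicon_features(["a", "tough", "film"], expanded_words, base_words)
duplicated = [
    word for word in expanded_words
    if f"synset({word})" in feats
    and feats[f"synset({word})"] == feats[f"contains({word})"]
]
print(duplicated)  # ['tough', 'spoiled']
```

Since Naïve Bayes treats features as independent, a pair of identical features counts the same evidence twice. The toy expansion also hints at a second possible issue: an added lemma may belong to a different sense than the one used in the review.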

In [97]:
# warning: this may take some time to run
lex_featuresets = [(lexicon_features(d), c) for (d, c) in documents]
lex_train_set, lex_test_set = train_test_split(lex_featuresets, test_size=0.1)
lex_model = model.train(lex_train_set)  # the same classifier as you defined above
lex_model.show_most_informative_features()
print("Accuracy: ", nltk.classify.accuracy(lex_model, lex_test_set))

Most Informative Features
  contains(pudding_head) = True              neg : pos    =     13.3 : 1.0
    synset(pudding_head) = True              neg : pos    =     13.3 : 1.0
      contains(touching) = True              pos : neg    =     12.7 : 1.0
   contains(pudden-head) = True              neg : pos    =     12.6 : 1.0
     synset(pudden-head) = True              neg : pos    =     12.6 : 1.0
       contains(declare) = True              pos : neg    =     11.4 : 1.0
         synset(declare) = True              pos : neg    =     11.4 : 1.0
       contains(misfire) = True              neg : pos    =     11.2 : 1.0
         contains(worst) = True              neg : pos    =     11.2 : 1.0
         synset(misfire) = True              neg : pos    =     11.2 : 1.0
Accuracy:  0.695



## Exercise 4 - Experimentation
This exercise is largely open: experiment and put the skills you have built so far to the test!
Large websites are an ideal place to look for large corpora of natural language. In this exercise, you're free to apply what you've learned to real-world data mined from YouTube (see `youtube_data`). Reuse classes defined earlier in the lab if you want.

The only requirement here is to **use a classifier not previously used in the exercise**.

In [None]:
# Trying to classify videos with disabled comments.
yt_data = pd.read_csv(
    'youtube_data/videos.csv',
    usecols=["comments_disabled", "description"],
).rename(columns={"comments_disabled": "label", "description": "text"})
yt_data.fillna("", inplace=True)

In [None]:
yt_data.label = yt_data.label.apply(lambda x: 1 if x else 0) # your transformation goes here
yt_data.label.value_counts()

In [None]:
clean = lambda text: TextCleaner(text).clean()
yt_data.text = yt_data.text.apply(clean)

yt_data.head()

In [None]:
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer

split_ratio = 0.5
# split the YouTube data (not the spam data from Exercise 2)
X_train, X_test, y_train, y_test = train_test_split(
    yt_data.text, yt_data.label, test_size=split_ratio, random_state=4310)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

classifier = ComplementNB()
if classifier:
    classifier.fit(X_train, y_train)  # the sparse matrix can be passed directly, no dense copy needed

In [None]:
def predict(model, vectorizer, data, all_predictions=False):
    data = vectorizer.transform(data)  # reuse the vocabulary fitted on the training data
    if all_predictions:
        return model.predict_proba(data)
    else:
        return model.predict(data)
 

In [None]:
if classifier:
    y_probas = predict(classifier, vectorizer, X_test, all_predictions=True)
    print_examples(X_test, y_probas, "comment", "no_comment")

    y_pred = predict(classifier, vectorizer, X_test)
    confusion_mat = confusion_matrix(y_test, y_pred)
    print(confusion_mat)

    tn, fp, fn, tp = confusion_mat.ravel()
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)

    print("Recall={}\nPrecision={}".format(round(recall, 2), round(precision, 2)))