# T-725 Natural Language Processing: Lab 3
In today's lab, we will be working with logistic regression, part-of-speech tagging and word embeddings.

To begin with, do the following:
* Select `"File" > "Save a copy in Drive"` to create a local copy of this notebook that you can edit.
* Select `"Runtime" > "Run all"` to run the code in this notebook.

## Extracting numerical features from text
Machine learning algorithms generally only accept numerical input, meaning that we must represent all features numerically. For example, to classify a single sentence, we might pass a classifier a list of word counts in that sentence, or a list of `True` and `False` values (which have numerical values of 1 and 0, respectively), representing the presence or absence of particular words.
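As a minimal sketch of the two representations described above, the cell below builds word-count features and presence/absence features by hand. The vocabulary is a hypothetical, hand-picked list used only for illustration; a real vectorizer learns it from the data.

```python
from collections import Counter

# Hypothetical fixed vocabulary (a real vectorizer would learn this from data)
vocabulary = ["best", "of", "times", "wisdom", "worst"]

sentence = "it was the best of times it was the worst of times"
counts = Counter(sentence.split())

# One count per vocabulary word (0 if the word does not occur)
count_features = [counts[word] for word in vocabulary]

# Presence/absence features; True and False behave like 1 and 0
presence_features = [word in counts for word in vocabulary]

print(count_features)     # [1, 2, 2, 0, 1]
print(presence_features)  # [True, True, True, False, True]
```

Because `True` and `False` behave like 1 and 0, `sum(presence_features)` gives the number of vocabulary words that occur in the sentence.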

[Scikit-learn](https://scikit-learn.org/stable/) is a popular machine learning library for Python that implements a wide variety of machine learning algorithms, including naive Bayes and logistic regression. It also offers a convenient way to extract numerical features from text, for example with the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class. `CountVectorizer` generates feature vectors containing character or word n-gram counts for any n within a given range (e.g., `ngram_range=(2, 2)` for only bigrams, or `ngram_range=(1, 3)` for unigrams, bigrams and trigrams). By default it counts word n-grams; setting its `analyzer` parameter to `'char'` makes it count character n-grams instead.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize a vectorizer that counts word bigrams
vectorizer = CountVectorizer(ngram_range=(2, 2))

# Count all bigrams in the sentences and create a feature vector
sentences = ["It was the best of times, it was the worst of times,",
             "it was the age of wisdom, it was the age of foolishness,"]

vector = vectorizer.fit_transform(sentences)

print("Bigrams:", vectorizer.get_feature_names_out())
print("\nFeatures:")
print(vector.toarray())

Bigrams: ['age of' 'best of' 'it was' 'of foolishness' 'of times' 'of wisdom'
 'the age' 'the best' 'the worst' 'times it' 'was the' 'wisdom it'
 'worst of']

Features:
[[0 1 2 0 2 0 0 1 1 1 2 0 1]
 [2 0 2 1 0 1 2 0 0 0 2 1 0]]


Here, `vectorizer` created a matrix with 13 columns (one for each bigram) and two rows (one for each sentence). Each row consists of bigram counts for the corresponding sentence. For example, the first sentence has the bigram counts `[0 1 2 0 2 0 0 1 1 1 2 0 1]`, which means that it contains 0 instances of "age of", 1 instance of "best of", 2 instances of "it was", and so on (we can see which column represents which bigram with `vectorizer.get_feature_names_out()`).

## Creating training and test sets
Scikit-learn lets us quickly split data into training and test sets with the [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. Note that by convention, examples are generally denoted with a capital X while labels are denoted with a lowercase y. Let's create a training and test set for the subjectivity corpus from the NLTK:

In [2]:
import nltk
from nltk.corpus import subjectivity
from sklearn.model_selection import train_test_split

# Download the subjectivity corpus and get the sentences for each category
nltk.download('subjectivity')

obj_fileids = subjectivity.fileids('obj')
subj_fileids = subjectivity.fileids('subj')

# Let's get the untokenized sentences from each category
obj_sentences = subjectivity.raw(obj_fileids).splitlines()
subj_sentences = subjectivity.raw(subj_fileids).splitlines()

X = obj_sentences + subj_sentences
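# One label per sentence (the corpus contains 5,000 sentences per category)
y = ['obj'] * len(obj_sentences) + ['subj'] * len(subj_sentences)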

# Create a word unigram count vectorizer and generate the feature vectors
vectorizer = CountVectorizer(ngram_range=(1, 1))
X_vectorized = vectorizer.fit_transform(X)

# Create a training and test set (80%/20% split). This function always shuffles
# the examples before making the split, but we can make sure that it always
# shuffles them the same way by specifying a specific random_state value.
X_train, X_test, y_train, y_test = train_test_split(X_vectorized,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)

[nltk_data] Downloading package subjectivity to
[nltk_data]     C:\Users\pasqu\AppData\Roaming\nltk_data...
[nltk_data]   Package subjectivity is already up-to-date!


## Logistic regression
We can create a logistic regression classifier with the [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) class:

In [3]:
from sklearn.linear_model import LogisticRegression

subj_clf = LogisticRegression(solver='liblinear')
subj_clf.fit(X_train, y_train)  # Train the model
score = subj_clf.score(X_test, y_test)  # Evaluate the model on the test set
print("Accuracy: {:.1%}".format(score))

Accuracy: 90.2%


Our logistic regression classifier obtains an accuracy of 90.2%, which is quite a bit higher than the accuracy obtained by NLTK's naive Bayes classifier in a previous lab.

Once the classifier is trained, we can use it to classify new sentences:

In [4]:
example_sentences = [
  # Adjacent string literals are joined, so no indentation ends up in the text
  ("Monty Python's Flying Circus, the British comedy group which gained fame via "
   "BBC-TV, send-up Arthurian legend, performed in whimsical fashion with Graham "
   "Chapman an effective straight man as King Arthur."),
  "The funniest movie of 1975 and probably the silliest movie ever made."
]

features = vectorizer.transform(example_sentences)
subj_clf.predict(features)

array(['obj', 'subj'], dtype='<U4')
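Besides hard labels, a trained `LogisticRegression` model can report how confident it is through `predict_proba()`. The sketch below uses a tiny, made-up training set (not the subjectivity corpus), so the probabilities themselves are not meaningful; it only illustrates the shape of the result.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (hypothetical, for illustration only)
toy_sentences = ["the film was released in 1975",
                 "the film stars two actors",
                 "a wonderful and hilarious film",
                 "a dull and boring film"]
toy_labels = ['obj', 'obj', 'subj', 'subj']

toy_vectorizer = CountVectorizer()
toy_clf = LogisticRegression(solver='liblinear')
toy_clf.fit(toy_vectorizer.fit_transform(toy_sentences), toy_labels)

# Each row holds one probability per class, in the order given by classes_
probabilities = toy_clf.predict_proba(
    toy_vectorizer.transform(["a hilarious film"]))

print(toy_clf.classes_)
print(probabilities)
```

Each row of `probabilities` sums to 1, and `predict()` returns the class with the highest probability in that row.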

## Pipelines
Instead of having to call `vectorizer.transform()` every time we use the classifier, we can create a `Pipeline` that automatically extracts features for us.

In [5]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 1))),
    ('clf', LogisticRegression(solver='liblinear'))
])

# The feature vectors are automatically created
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)

pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print("Accuracy: {:.1%}".format(score))

pipeline.predict(example_sentences)

Accuracy: 90.2%


array(['obj', 'subj'], dtype='<U4')

## Creating word embeddings
[Gensim](https://radimrehurek.com/gensim/) is a Python library that makes it easy to generate and work with word embeddings.

Let's start by suppressing some warnings from Gensim:

In [6]:
import warnings

# Suppress some warnings from Gensim about deprecated functions
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

Now, let's create word2vec embeddings for NLTK's movie review corpus:

In [7]:
import nltk
from nltk.corpus import movie_reviews
from gensim.models import Word2Vec

nltk.download('movie_reviews')
nltk.download('punkt')

sents = movie_reviews.sents()
movie_embeddings = Word2Vec(sents, epochs=1, min_count=5, vector_size=50)

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\pasqu\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\pasqu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


What does the vector for *actor* look like?

In [8]:
movie_embeddings.wv['actor']

array([-0.10284212,  0.10995306, -0.21341045, -0.09947027, -0.3638763 ,
       -0.3150482 ,  0.9922025 ,  0.7381499 , -1.0560341 , -0.38521498,
       -0.03333388, -0.7229469 ,  0.59951705,  0.4576334 , -0.40081444,
        0.3702124 ,  0.51480293,  0.26415542, -1.1690065 , -0.75881875,
        0.24920873,  0.70570695,  0.9074972 , -0.1502226 ,  0.47385252,
        0.29440308,  0.05997658, -0.01057256, -0.5727217 , -0.02375873,
       -0.14740857, -0.61561674,  0.19331753, -0.13283251, -0.33181086,
        0.1982581 ,  0.5541349 , -0.00721729,  0.17681986, -0.35482278,
        0.55401146, -0.49998602,  0.36560115,  0.26329264,  1.3752546 ,
        0.07222486, -0.12959574, -0.7434003 ,  0.51343596,  0.3329553 ],
      dtype=float32)
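The individual numbers in a vector like this are hard to interpret; embeddings are usually compared with one another using cosine similarity, which is also what Gensim's `wv.similarity()` computes. A minimal sketch with made-up three-dimensional vectors (not taken from the model above):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional "embeddings"
actor = np.array([1.0, 2.0, 0.0])
actress = np.array([2.0, 4.0, 0.0])
banana = np.array([0.0, 0.0, 5.0])

print(cosine_similarity(actor, actress))  # close to 1: same direction
print(cosine_similarity(actor, banana))   # 0: orthogonal vectors
```

Cosine similarity ignores vector length, so `actress` is maximally similar to `actor` here even though it is twice as long.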

# Assignment
Answer the following questions and hand in your solution in Canvas before 8:30 AM, Monday September 18th. Remember to save your file before uploading it.

## Question 1
The NLTK includes a copy of the *Universal Declaration of Human Rights* (UDHR) in over 300 languages, including Icelandic, Norwegian, Swedish, Danish, Finnish and Faroese.

Create a `Pipeline` with a `CountVectorizer` and a `LogisticRegression` classifier that satisfies the following requirements:

The `CountVectorizer` should:
* Create character-level n-grams.
* Generate unigram, bigram and trigram counts.

The `LogisticRegression` classifier should:
* Use the `liblinear` solver.

Refer to Scikit-learn's reference for the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for information on possible parameters.

Once you've created the pipeline, train it using the `train_udhr(pipeline)` function below, which returns the test examples and labels (and should not be modified). Report the accuracy of the classifier, and try making predictions on a few sentences from these languages, for example from Wikipedia ([is](https://is.wikipedia.org/wiki/Fors%C3%AD%C3%B0a), [no](https://no.wikipedia.org/wiki/Portal:Forside), [sv](https://sv.wikipedia.org/wiki/Portal:Huvudsida), [da](https://da.wikipedia.org/wiki/Forside), [fi](https://fi.wikipedia.org/wiki/Wikipedia:Etusivu), [fo](https://fo.wikipedia.org/wiki/Fors%C3%AD%C3%B0a)). One sentence from each language is enough. Does the classifier perform as well as you would expect, given the reported accuracy?

In [9]:
# Don't change anything in this code cell
import random
from nltk.corpus import udhr
nltk.download('udhr')

def train_udhr(pipeline):
  X = []
  y = []

  # The UDHR is quite small, so let's create 1,000 "fake" sentences in each
  # language by randomly stringing together 3-15 words.
  for lang in languages:
    words = udhr.words(lang)
    sents = [" ".join(random.choices(words, k=random.randint(3, 15))) for x in range(1000)]
    X.extend(sents)
    y += [lang] * len(sents)

  X_train, X_test, y_train, y_test = train_test_split(X,
                                                      y,
                                                      test_size=0.1,
                                                      random_state=42)

  # Train the classifier
  pipeline.fit(X_train, y_train)
  return X_test, y_test

languages = ['Icelandic_Yslenska-Latin1',
             'Norwegian-Latin1',
             'Swedish_Svenska-Latin1',
             'Danish_Dansk-Latin1',
             'Finnish_Suomi-Latin1',
             'Faroese-Latin1']

[nltk_data] Downloading package udhr to
[nltk_data]     C:\Users\pasqu\AppData\Roaming\nltk_data...
[nltk_data]   Package udhr is already up-to-date!


In [22]:
# Your solution here
pipeline = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 3), analyzer='char')),
    ('clf', LogisticRegression(solver='liblinear'))
])

sentences = [
    ["Bandamannaleikarnir 1919 voru íþróttamót sem haldið var að frumkvæði Bandaríkjahers á íþróttavelli skammt fyrir utan París frá 22."],
    ["Gruppa har ved sida av Black Sabbath og Led Zeppelin vorte rekna som pionerar innan hardrock og heavy metal, sjølv om somme bandmedlemmar hevdar at ein ikkje kan kategorisere musikken deira til ein enkelt sjanger."],
    ["Vid en eventuell stormaktskonflikt ville landet, genom att vara alliansfri i fredstid vid en eventuell stormaktskonflikt ha möjlighet att vara neutralt."],
    ["I samme periode blev The Miracles' oprindelige forsanger og grundlægger Smokey Robinson en af historiens mest succesfulde sangskrivere og pladeproducere, og gruppen hed i en periode Smokey Robinson & the Micrales. "],
    ["Lajin tyypillistä elinympäristöä ovat levinneisyysalueen pohjoisosassa erilaiset rämeet, etelässä taas hiekkadyynialueet ja soraiset rannat. Vuoristoalueilla perhosta tavataan myös vuoristoniityillä."],
    ["Hann vaks upp í fátækradømi við einum pápa, ið var krígsveteranur frá 1. heimskríggi| og orsakað av krígsskaðum ofta var innlagdur á sjúkrahúsi."],
]


X_test, y_test = train_udhr(pipeline)
accuracy = str(pipeline.score(X_test, y_test)*100) + "%"

for sentence in sentences:
    result = pipeline.predict(sentence)
    print("The sentence is predicted as: "+str(result))

print("The model performed well given the "+accuracy+" of accuracy. The sentences were predicted correctly.")
  

The sentence is predicted as: ['Icelandic_Yslenska-Latin1']
The sentence is predicted as: ['Norwegian-Latin1']
The sentence is predicted as: ['Swedish_Svenska-Latin1']
The sentence is predicted as: ['Danish_Dansk-Latin1']
The sentence is predicted as: ['Finnish_Suomi-Latin1']
The sentence is predicted as: ['Faroese-Latin1']
The model performed well given the 96.66666666666667% of accuracy. The sentences were predicted correctly.


## Question 2
The logistic regression classifier below tries to determine which of the following tags should be assigned to a given word:
* **NP** (proper nouns, singular),
* **NP\$** (proper nouns, singular and possessive),
* **VBG** (verbs, present participle) or
* **VBD** (verbs, past tense).

The classifier makes its determination based solely on characteristics of the word itself and does not use any contextual features. The function `extract_features(word)` extracts a list of numerical features from each word, currently only the length of the word and whether or not it ends with "r". Using these features, the classifier obtains an accuracy of 37.1%, which is quite poor. Replace the features that the `extract_features()` function generates with your own. Use Python's [string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) to generate the features, and try to get at least 99% accuracy.

**Remember**: each feature must be numerical (or `True`/`False`), and don't forget to add a comma after each feature in the list.
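To illustrate the expected format only, here is a sketch of a feature function that mixes integer and boolean features. The function name and the features themselves are hypothetical and are not meant to be useful for this task.

```python
def toy_features(word):
    # Hypothetical features, chosen only to show the expected format
    return [
        len(word),        # an integer
        word.isdigit(),   # a boolean, treated as 1 or 0
        word.count('-'),  # another integer
    ]

print(toy_features("well-known"))  # [10, False, 1]
print(int(True), int(False))       # 1 0
```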

In [11]:
# Don't change anything in this code cell
from collections import defaultdict
from nltk.corpus import brown

nltk.download('brown')

def get_brown_tags(tag):
  return sorted({w for s in brown_train for w, t in s if t == tag})

def train_model():
  # Create the training set
  word_list = [word for tag_words in words for word in tag_words]
  X = [extract_features(word) for word in word_list]
  y = [tag for tag, tag_words in zip(tags, words) for word in tag_words]

  # Train and evaluate the classifier
  log_clf = LogisticRegression(solver='liblinear', multi_class='ovr')
  log_clf.fit(X, y)
  print("Accuracy: {:.1%}".format(log_clf.score(X, y)))

  # Print the accuracy for each tag
  predictions = log_clf.predict(X)
  errors = defaultdict(list)
  for word, example, label, prediction in zip(word_list, X, y, predictions):
    if label != prediction:
      errors[label].append(word)

  print("\nAccuracy and first 10 errors per tag:")
  for tag, tag_words in zip(tags, words):
    error_words = errors[tag]
    num_total = len(tag_words)
    num_correct = num_total - len(error_words)
    ratio = num_correct / num_total
    print("{:>3} {:,}/{:,} ({:.1%}) {}".format(tag, num_correct, num_total, ratio,
                                              ", ".join(error_words[:10])))

# Download and prepare the Brown corpus for training and testing
brown_train, brown_test = train_test_split(brown.tagged_sents(),
                                           test_size=0.1,
                                           random_state=42)

print("Training sentences: {:,}".format(len(brown_train)))
print("Test sentences: {:,}".format(len(brown_test)))

# Get 1,000 examples of each tag
tags = ['NP', 'NP$', 'VBG', 'VBD']
random.seed(42)
words = [random.sample(get_brown_tags(tag), 1000) for tag in tags]

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\pasqu\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


Training sentences: 51,606
Test sentences: 5,734


In [12]:
# Modify the features generated by this function and run the code cell to see
# how your changes affect the accuracy of the classifier.
def extract_features(word):
    features = [
        word.istitle(),
        word.islower(),
        word.isupper(),
        word.endswith('\'s'),
        word.endswith('\''),
        word.endswith('ing'),
    ]
    return features

# The errors listed by this function are words belonging to that tag that were
# incorrectly assigned another tag. Use them to figure out useful features.
train_model() # 99.2% accuracy

Accuracy: 99.2%

Accuracy and first 10 errors per tag:
 NP 988/1,000 (98.8%) Whiting, niger, Diffring, aerogenes, anhemolyticus, Ring, Kooning, Sing, Rudkoebing, orzae
NP$ 999/1,000 (99.9%) Grevyles
VBG 994/1,000 (99.4%) Followin', Rammin', Shootin', Shippin', Countin', waitin
VBD 988/1,000 (98.8%) Scared, Exclaimed, Sat, Came, Asked, Became, Ran, Replied, Stroked, Thought


## Question 3
Word embeddings can capture semantic and syntactic relationships between words. For example, the offset between the vectors for *king* and *man* is approximately the same as the offset between the vectors for *queen* and *woman* (i.e., *king* is to *man* as *queen* is to *woman*). This means that if we have a good vector representation for each of those words, we should be able to apply vector arithmetic to find that *king* - *man* + *woman* ≈ *queen*.

The function `find_word(a, b, x)`, defined below, finds the word **y**, such that **a** is to **b** as **x** is to **y** (also expressed as **a**:**b** as **x**:**y**).
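The arithmetic behind such analogies can be sketched with made-up two-dimensional vectors. Note that this sketch uses the simple additive form (**b** - **a** + **x**) and picks the nearest remaining word by cosine similarity, whereas `find_word()` below relies on Gensim's `most_similar_cosmul()`, which combines the similarities multiplicatively. The vectors and the helper `toy_analogy()` are hypothetical.

```python
import numpy as np

# Hypothetical 2-D vectors: first axis ~ "royalty", second axis ~ "gender"
toy_vectors = {
    'man':   np.array([1.0, 0.0]),
    'woman': np.array([1.0, 1.0]),
    'king':  np.array([3.0, 0.0]),
    'queen': np.array([3.0, 1.0]),
    'apple': np.array([0.0, 3.0]),
}

def toy_analogy(a, b, x):
    """Return y such that a is to b as x is to y, using b - a + x."""
    target = toy_vectors[b] - toy_vectors[a] + toy_vectors[x]

    def cosine(word):
        v = toy_vectors[word]
        return np.dot(v, target) / (np.linalg.norm(v) * np.linalg.norm(target))

    # The query words themselves are excluded from the candidates
    candidates = [w for w in toy_vectors if w not in (a, b, x)]
    return max(candidates, key=cosine)

print(toy_analogy('man', 'king', 'woman'))  # queen
```

Since the query words are never returned as answers, the reference pair (**a**, **b**) has to consist of words other than **x** and the expected answer.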

Below, we download GloVe word vectors through Gensim's API. Use those vectors and `find_word()` to complete the following tasks:
1. In the UK, people say *petrol* instead of *gas*. Find the British English equivalent of the word *truck*.
2. Find the capital of France.
3. Find the present tense of the verb *flew*.

**Note**: all words in `glove-wiki-gigaword-100` are in lowercase!

In [13]:
import gensim.downloader as api
glove = api.load("glove-wiki-gigaword-100")

def find_word(a, b, x):
  # a is to b as x is to ?
  a = a.lower()
  b = b.lower()
  x = x.lower()
  print(f"> {a}:{b} as {x}:?")
  top_words = glove.most_similar_cosmul(positive=[x, b], negative=[a])
  for num, (word, score) in enumerate(top_words[:5]):
    print(f"{num + 1}: ({score:.3f}) {word}")
  print()

In [14]:
# Example 1: man is to king as woman is to ?
find_word('man', 'king', 'woman')

# Example 2: evening is to dinner as noon is to ?
find_word('evening', 'dinner', 'noon')

# 1) In the UK, people say 'petrol' instead of 'gas'. Find the British English
# equivalent of 'truck'.
find_word('gas', 'truck', 'petrol')

# 2) Find the capital of France. Remember to use only lowercase characters.
find_word('france', 'paris', 'france')

# 3) Find the present tense of the verb "flew".
find_word('flew', 'fly', 'flew')


> man:king as woman:?
1: (0.896) queen
2: (0.850) monarch
3: (0.845) throne
4: (0.837) princess
5: (0.836) elizabeth

> evening:dinner as noon:?
1: (0.839) lunch
2: (0.829) breakfast
3: (0.814) a.m.
4: (0.814) p.m.
5: (0.813) meal

> gas:truck as petrol:?
1: (1.020) lorry
2: (0.957) wagon
3: (0.951) trucks
4: (0.950) lorries
5: (0.945) car

> france:paris as france:?
1: (0.900) prohertrib
2: (0.867) london
3: (0.852) brussels
4: (0.847) french
5: (0.844) rome

> flew:fly as flew:?
1: (0.883) flying
2: (0.871) flies
3: (0.848) planes
4: (0.842) plane
5: (0.839) flight


## Question 4
Gensim offers us several ways to find words that are similar or dissimilar to one another. Complete the following tasks:
1. Use `glove.most_similar(word, topn=5)` to find the five words that are most similar to:
  1. cat
  2. samsung
  3. batman
2. Use `glove.doesnt_match(list_of_strings)` to find which of the words below doesn't fit with the rest:
  1. cat hamster gremlin rabbit goldfish dog
  2. samsung microsoft dell panasonic mcdonalds facebook
  3. batman spiderman daredevil shrek hulk deadpool
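As a rough intuition for the second task, the odd word out can be thought of as the one whose vector points least in the direction of the group's average. The sketch below implements that idea with made-up two-dimensional vectors; it assumes `doesnt_match()` works along these lines, and Gensim's actual implementation may differ in its details.

```python
import numpy as np

# Hypothetical 2-D vectors
toy_vectors = {
    'cat':     np.array([1.0, 0.1]),
    'dog':     np.array([0.9, 0.2]),
    'rabbit':  np.array([1.0, 0.0]),
    'gremlin': np.array([0.1, 1.0]),
}

def toy_doesnt_match(words):
    # Normalize each vector to unit length, then average them
    unit = np.array([toy_vectors[w] / np.linalg.norm(toy_vectors[w])
                     for w in words])
    mean = unit.mean(axis=0)
    mean /= np.linalg.norm(mean)
    # The odd one out has the lowest cosine similarity to the mean
    similarities = unit @ mean
    return words[int(np.argmin(similarities))]

print(toy_doesnt_match(['cat', 'dog', 'rabbit', 'gremlin']))  # gremlin
```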

In [15]:
# Your solution here

print("2): ", glove.most_similar('cat', topn=5))
print("3): ", glove.most_similar('samsung', topn=5))
print("4): ", glove.most_similar('batman', topn=5))
print("5): ", glove.doesnt_match(['cat', 'hamster', 'gremlin', 'rabbit', 'goldfish', 'dog']))
print("6): ", glove.doesnt_match(['samsung', 'microsoft', 'dell', 'panasonic', 'mcdonalds', 'facebook']))
print("7): ", glove.doesnt_match(['batman', 'spiderman', 'daredevil', 'shrek', 'hulk', 'deadpool']))

2):  [('dog', 0.8798074722290039), ('rabbit', 0.7424427270889282), ('cats', 0.732300341129303), ('monkey', 0.7288709878921509), ('pet', 0.719014048576355)]
3):  [('lg', 0.8194022178649902), ('toshiba', 0.7769339084625244), ('hyundai', 0.7322311401367188), ('fujitsu', 0.7246403694152832), ('panasonic', 0.7154008746147156)]
4):  [('superman', 0.8058773279190063), ('superhero', 0.6820072531700134), ('sequel', 0.6592288017272949), ('catwoman', 0.654157817363739), ('joker', 0.6362104415893555)]
5):  gremlin
6):  mcdonalds
7):  shrek
