# Day 1 Tutorials: Natural Language Processing in Humans and Machines

## 1) Preprocessing text

In the lecture, we learned about preprocessing text. In this part of the tutorial, we will apply these concepts with popular Python NLP libraries.


### 1.1) Built-in Python functions

Python has some built-in functions that can help us preprocess the text. For example, we can make any string of text lower-case:

In [None]:
raw_text = 'Once upon a time, in a Natural Language Processing Seminar at TU Berlin.'
print(raw_text.lower())

### 1.2) Tokenization

We can also split a string of text at every space, a kind of very basic tokenizing:

In [None]:
print(raw_text.lower().split())

Note that `split()` can also split at any given character if you pass it as an argument:

In [None]:
print('Once-upon-a-time,-in-a-Natural-Language-Processing-Seminar-at-TU-Berlin.'.lower().split('-'))

We will use several functions from the NLTK library. Python has many libraries that can help you do lots of stuff. You don't have to reinvent the wheel! When you use a library, it's always a good idea to read its documentation to learn how to use it and what you can do with it. You can read more about NLTK [here](https://www.nltk.org/).

Let's import NLTK and instantiate a tokenizer:

In [None]:
import nltk

tokenizer = nltk.tokenize.wordpunct_tokenize

We can now use this tokenizer on our previous text:

In [None]:
tokens = tokenizer(raw_text.lower())
tokens

It did a better job than the built-in `split()` function, as it also separated words from the punctuation.
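To see where the difference comes from, here is a minimal sketch that uses only Python's built-in `re` module. The pattern matches either a run of word characters or a run of punctuation, which is, as far as we know, the same idea `wordpunct_tokenize` is built on. The function name `simple_wordpunct` and the shortened example sentence are ours, and the sketch is not meant to replace NLTK.

```python
import re

def simple_wordpunct(text):
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

sentence = 'once upon a time, in a seminar at tu berlin.'

# split() keeps punctuation glued to the neighbouring word ('time,')
print(sentence.split())

# the regex version separates ',' and '.' into their own tokens
print(simple_wordpunct(sentence))
```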

### 1.3) Stemming and Lemmatization

The NLTK library also provides functions that are very helpful for preprocessing of text in general, such as stemmers and lemmatizers. Let's have a look:

In [None]:
nltk.download('wordnet')

stemmer = nltk.stem.SnowballStemmer('english')
lemmatizer = nltk.stem.WordNetLemmatizer()

print(stemmer.stem('walks'))
print(lemmatizer.lemmatize('walks'))

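Stemming chops off the end of a word to reduce it to its stem, which need not be a real word, whereas lemmatizing uses morphological information (such as part of speech, number, and conjugation) to return the dictionary form of the word (its lemma).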

In [None]:
print('Stem:', stemmer.stem('cries'))
print('Lemma:', lemmatizer.lemmatize('cries'))

print('Stem:', stemmer.stem('mice'))
print('Lemma:', lemmatizer.lemmatize('mice'))
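The contrast between the two approaches can be sketched without any library. The code below is a deliberately naive illustration: `toy_stem` strips a few suffixes by rule, and `toy_lemmatize` looks words up in a tiny hand-written table. Both names and the table are our own assumptions; the real Snowball stemmer and WordNet lemmatizer are far more elaborate.

```python
def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in ("ies", "es", "s", "ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical hand-written lookup table standing in for a real lexicon
TOY_LEMMAS = {"walks": "walk", "cries": "cry", "mice": "mouse"}

def toy_lemmatize(word):
    """Return the dictionary form if we know it, else the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for w in ("walks", "cries", "mice"):
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

A rule-based stemmer has no way of relating an irregular plural such as "mice" to "mouse", while a lookup-based lemmatizer only knows the words in its lexicon.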

### 1.4) Break-out session

These examples were all in English ([like most research in neuroscience](https://www.cell.com/trends/cognitive-sciences/fulltext/S1364-66132200236-4)). Try out some of these functions in other languages, maybe German?

In [None]:
der_stemmer = nltk.stem.SnowballStemmer('german')
...


## 2) Information Extraction

### 2.1) POS tags

Next, we apply part-of-speech tagging to the tokenized words.

Assigning each word a part-of-speech tag (e.g., noun, verb, adjective) helps in understanding the syntactic structure of the sentence.

We will again use the NLTK library. The output will be a list of tuples where the first element is the word and the second element is the POS tag.


In [None]:
nltk.download('averaged_perceptron_tagger')

pos_tags = nltk.tag.pos_tag(tokens)

print("POS Tags:", pos_tags)
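Once you have a list of `(word, tag)` tuples, ordinary Python is enough to work with it. The sketch below groups words by tag and picks out the nouns. The pairs in `example_tags` are hand-written for illustration using Penn Treebank style tags (`DT`, `NN`, `VBD`, `IN`); they are not actual tagger output.

```python
from collections import defaultdict

# Hand-written (word, tag) pairs for illustration; NOT actual tagger output
example_tags = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"),
                ("on", "IN"), ("the", "DT"), ("mat", "NN")]

# Group the words by their tag
words_by_tag = defaultdict(list)
for word, tag in example_tags:
    words_by_tag[tag].append(word)

# Noun tags all start with 'NN' in this tag set (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in example_tags if tag.startswith("NN")]

print(dict(words_by_tag))
print("Nouns:", nouns)
```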


### 2.2) Named Entity Recognition (NER)

We will use the [spaCy](https://spacy.io/) library for NER.

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

# Process the text using the nlp model
doc = nlp('Apple is looking at buying U.K. startup for $1 billion.')

# Extract and print named entities
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

### 2.3) Break-out session

Let's perform POS tagging and NER on a German sentence using spaCy.

NOTE: we will now need a German language model!

In [None]:
nlp = spacy.load("de_core_news_sm")
text = "Angela Merkel besuchte Berlin und sprach mit Vertretern von Siemens."

## 3) Word vectors: GloVe

In the lecture, we studied word vectors. In this part of the tutorials, we will explore some of their capabilities. This was partly inspired by material from the [Natural Language Processing with Deep Learning class](http://web.stanford.edu/class/cs224n/index.html#schedule) from Stanford University.


### 3.1) Getting started

For looking at word vectors, we will use Gensim. Gensim is a package for word and text similarity modeling. It is efficient and scalable, and quite widely used. You can read more about Gensim [here](https://radimrehurek.com/gensim/).

In [None]:
from gensim.models import KeyedVectors
from gensim.test.utils import get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec

We will use the [GloVe word vectors](https://nlp.stanford.edu/projects/glove/) from Stanford. Gensim does not give them first-class support, but it allows you to convert a file of GloVe vectors into word2vec format.

We will first download the embeddings as a zip file.
The `!` prefix in IPython lets us run shell commands, just as you would from a terminal.

In [None]:
!wget https://nlp.stanford.edu/data/glove.6B.zip

In [None]:
# Create a folder for the embeddings and unzip them there
!mkdir word_vectors
!unzip glove.6B.zip -d word_vectors

f-strings are a very convenient way to format strings, for example for filenames. Here, we can set `dim` to the dimension we want to use.

You can play around with the dimension as a trade-off between speed and file size on the one hand and quality on the other. The 50d vectors basically work for similarity but clearly aren't as good for analogy problems. The 300d vectors are even better than the 100d vectors.

In [None]:
import os

dim = 100
word_dir = "word_vectors"
glove_file = os.path.join(word_dir, f"glove.6B.{dim}d.txt")

# this temporary file is used to convert the embeddings into the right format
word2vec_glove_file = get_tmpfile(f"glove.6B.{dim}d.word2vec.txt")
glove2word2vec(glove_file, word2vec_glove_file)
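What does the conversion actually change? To our understanding, the two text formats differ only in a header line: a word2vec text file starts with `<number of vectors> <dimension>`, while a GloVe file starts directly with the first vector. The sketch below shows that idea on a tiny made-up file. The helper name `add_word2vec_header` and the file contents are our own, and the helper reads the whole file into memory, so it is meant for illustration rather than for the full GloVe download.

```python
import os
import tempfile

def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file, prepending the 'count dim' header line."""
    with open(glove_path, encoding="utf-8") as f:
        lines = f.readlines()
    n_vectors = len(lines)
    n_dims = len(lines[0].split()) - 1  # first field is the token itself
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(f"{n_vectors} {n_dims}\n")
        f.writelines(lines)
    return n_vectors, n_dims

# Tiny made-up file with two 3-dimensional vectors (values are invented)
tmp_dir = tempfile.mkdtemp()
toy_glove = os.path.join(tmp_dir, "toy_glove.txt")
toy_w2v = os.path.join(tmp_dir, "toy_w2v.txt")
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")

print(add_word2vec_header(toy_glove, toy_w2v))
with open(toy_w2v, encoding="utf-8") as f:
    print(f.read())
```

Recent Gensim releases (4.x) also document a `no_header=True` argument to `KeyedVectors.load_word2vec_format`, which should make the conversion step unnecessary; check the documentation for the version you have installed.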

In [None]:
# actually load the model
model = KeyedVectors.load_word2vec_format(word2vec_glove_file)

### 3.2) First look at the embeddings

`model` now contains the word vectors in a lookup table by token!

We can quickly check which words are in it, and what the values look like.

In [None]:
model.get_vector('natural')

In [None]:
model['language']

In [None]:
model.get_vector('processing')

These values are not very interesting per se: no single dimension has an explicit meaning. However, as the following cells show, some words are not in the vocabulary of the embeddings. They are called out-of-vocabulary (OOV) words.

In [None]:
model.get_vector('natuerlich')

In [None]:
model.get_vector('lenguage')

In [None]:
model.get_vector('Processing')

In [None]:
model.get_vector('problem,')

In [None]:
model.get_vector('tintinnabulation')
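Each of the lookups above stops with a `KeyError`. If you want to test many words without interrupting the notebook, you can check membership first. Below is a minimal sketch with a plain dictionary standing in for the embedding table (the values are invented); to our knowledge `KeyedVectors` supports the same `in` and `[]` operations, so the hypothetical helper `lookup_or_none` should also work when passed `model`.

```python
def lookup_or_none(vectors, word):
    """Return the vector for `word`, or None if it is out of vocabulary."""
    return vectors[word] if word in vectors else None

# Toy stand-in for the embedding table (invented values)
toy_vectors = {"natural": [0.1, 0.2], "language": [0.3, 0.4]}

for word in ("natural", "Processing", "problem,"):
    vec = lookup_or_none(toy_vectors, word)
    print(word, "->", "OOV" if vec is None else vec)
```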

Given these examples, can you give reasons why some words might be OOV?

Can you think of other cases?

What could be done to remedy these issues?

### 3.3) Useful Gensim functions

The Gensim library comes with useful functions. Again, you don't have to reinvent the wheel!

Let's look at the output of the `most_similar()` function.


In [None]:
model.most_similar('obama')

In [None]:
model.most_similar('merkel')

In [None]:
model.most_similar('berlin')

Do you think these results make sense? What is the number next to the word?
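As a hint for the second question: according to the Gensim documentation, `most_similar()` ranks words by cosine similarity, which measures the angle between two vectors and ignores their length. Here is a minimal pure-Python sketch on hand-made 2D vectors; the function name `cosine_similarity` is ours.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal → 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction → -1.0
```

With the loaded embeddings, `model.similarity('obama', 'clinton')` should give the same kind of score for a pair of real words.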

By default, words are passed as `positive=`, but `most_similar()` also accepts words through `negative=`. Those words then contribute the negative of their vector.

In [None]:
model.most_similar(positive='banana')

In [None]:
model.most_similar(negative='banana')

As you can see, the "opposite" of a given word is not very meaningful. Why do you think that is the case?

### 3.4) Creating a function to do analogies

The geometry of word vectors allows us to use them for many word relations, such as analogies: A is to B what X is to Y.

For example, 'Berlin' is to 'Germany' what 'Rome' is to 'Italy'.

Write a function which does this:

```
analogy('berlin', 'germany', 'rome')
> 'italy'
```

You can draw the geometry of these vectors on a piece of paper to help you.
(Hint: all you need is the function `most_similar`, with the `positive=` and `negative=` parameters).


In [None]:
def analogy(A, B, X):
    raise NotImplementedError("Exercise: return the word Y such that A is to B what X is to Y")

In [None]:
analogy('berlin', 'germany', 'rome')

Let's check that the function works on other types of analogies:

In [None]:
analogy('germany', 'beer', 'france')

In [None]:
analogy('japan', 'japanese', 'australia')

In [None]:
analogy('obama', 'clinton', 'reagan')

In [None]:
analogy('tall', 'tallest', 'long')

In [None]:
analogy('walk', 'walked', 'see')

In [None]:
analogy('good', 'fantastic', 'bad')

### 3.5) Other functions

[Here is the documentation](https://radimrehurek.com/gensim/models/keyedvectors.html) of the `KeyedVectors` model in `Gensim`.
You can even check the [source code](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#L776) of given functions, for example `most_similar()`.

Let's look at other functions:

In [None]:
model.doesnt_match(["breakfast", "cereal", "dinner", "lunch"])

Try any other function that you find interesting!

### 3.6) Visualization

Let's look at the values of a few word vectors. We will use the very popular library [Matplotlib](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html).



In [None]:
import matplotlib.pyplot as plt

# Tools for plotting
%matplotlib inline
plt.style.use('ggplot')

# This creates a figure, you can define the size
plt.figure(figsize=(12,4))

word1 = 'banana'
word2 = 'apple'
word3 = 'car'
plt.plot(model.get_vector(word1), label=word1)
plt.plot(model.get_vector(word2), label=word2)
plt.plot(model.get_vector(word3), label=word3)
plt.legend()

It is always good to add labels on the x and y axes of your plots, as well as a title. Can you find a way to do that?

As we said, the absolute values of the vectors are not very interpretable. We are more interested in how the vectors of different words relate to each other. However, in such a high-dimensional space, these relations are tough to visualize.

This is why dimensionality reduction techniques, such as PCA or t-SNE, are often used to visualize word vectors. Let's use PCA to show some word vectors as a cloud of points. We will use the PCA implementation from the [scikit-learn library](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).

In [None]:
from sklearn.decomposition import PCA
import numpy as np

def display_pca_scatterplot(model, words):

    # get the vectors corresponding to the words in the model
    word_vectors = np.array([model[w] for w in words])

    # fit the PCA and keep the first 2 dimensions
    twodim = PCA().fit_transform(word_vectors)[:,:2]

    # create a scatter plot
    plt.figure(figsize=(6,6))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')

    # add the words close to the points
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.05, y+0.05, word)
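If you are curious what the PCA step does under the hood, here is a minimal NumPy sketch of the same idea: center the data, then project it onto the two directions of largest variance obtained from a singular value decomposition. The function name `pca_2d` and the random stand-in vectors are our own assumptions, and the result may differ from scikit-learn's output by the sign of each axis.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    # Rows of Vt are the principal directions, sorted by explained variance
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

# Random stand-in for 10 word vectors of dimension 100 (not real embeddings)
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(10, 100))

points = pca_2d(toy_vectors)
print(points.shape)
```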

In [None]:
display_pca_scatterplot(model,
                        ['coffee', 'tea', 'beer', 'wine', 'brandy', 'rum', 'champagne', 'water',
                         'spaghetti', 'borscht', 'hamburger', 'pizza', 'falafel', 'sushi', 'meatballs',
                         'dog', 'horse', 'cat', 'monkey', 'parrot', 'koala', 'lizard',
                         'frog', 'toad', 'monkey', 'ape', 'kangaroo', 'wombat', 'wolf',
                         'france', 'paris', 'germany', 'berlin', 'italy', 'rome', 'hungary', 'luxembourg', 'australia', 'fiji', 'china',
                         'homework', 'assignment', 'problem', 'exam', 'test', 'class',
                         'school', 'college', 'university', 'institute'])

Do the 2D visualizations make sense with the analogies we computed earlier?

Go ahead and plot any other words you would like!

In [None]:
display_pca_scatterplot(model,
                        [''])