<a href="https://colab.research.google.com/github/dgromann/cl_intro/blob/main/tutorials/Tutorial2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 2: Introduction to Computational Linguistics

This is the second tutorial with practical exercises for the lecture Introduction to Computational Linguistics in the winter semester 2023. Hands-on exercises are marked with 👋 ⚒ and questions are marked with ❓. Remember to first **store this notebook** in your Drive or GitHub.

Today's focus is on the traditional NLP pipeline, for which we will be using [spaCy](https://spacy.io/), [Natural Language Toolkit (NLTK)](https://www.nltk.org/), and [Stanza](http://stanza.run/).

---

## **Lesson 2: NLP Pipeline**

For the NLP pipeline, we will be using three different libraries today: NLTK, [Stanza](http://stanza.run/), and [spaCy](https://spacy.io/). Of these, only Stanza is not preinstalled in Colab, so we install it first.

In [None]:
!pip install stanza

NLTK and spaCy are already available in a standard Colab notebook. However, we still need to download some data packages for NLTK.

In [None]:
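import nltk
# Resources used by the tagger, the tokenizer and the lemmatizer
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('gutenberg')
# Newer NLTK releases look for these resource names instead
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt_tab')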

## Tokenization and POS Tagging

First, we will use NLTK to tokenize and POS-tag a sample sentence. The tagset that the Perceptron Tagger uses is the [Penn Treebank tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

❓ Are the POS tags for the two different uses of *tears* correct? How does their pronunciation differ?


In [None]:
# Tokenization
from nltk.tokenize import word_tokenize
# Part-of-Speech tagger
from nltk.tag.perceptron import PerceptronTagger

tagger = PerceptronTagger()

# Example sentence
sentence = "It just tears me apart to see you suffering like that and in tears."

# Tokenize the sentence once and reuse the tokens
tokens = word_tokenize(sentence)
print(tokens)
# POS tag each token in the tokenized sentence
pos_tags = tagger.tag(tokens)
print("Part of speech tags of the sentence: ", pos_tags)

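To compare the tags that one word form receives in different positions, the `(token, tag)` pairs returned by the tagger can be grouped by token. The following is only a minimal sketch: the helper name `tags_by_token` is made up for illustration, and the sample pairs are hand-annotated toy data, not actual tagger output. Passing `pos_tags` from the cell above instead of `toy_pairs` should work the same way, since it has the same shape.

```python
from collections import defaultdict


def tags_by_token(tagged):
    """Map each lower-cased token to the list of tags it received, in order."""
    grouped = defaultdict(list)
    for token, tag in tagged:
        grouped[token.lower()].append(tag)
    return dict(grouped)


# Hand-annotated toy pairs (an assumption for illustration, not tagger output)
toy_pairs = [("I", "PRP"), ("saw", "VBD"), ("the", "DT"), ("saw", "NN")]

# Keep only the tokens that were tagged in more than one way
ambiguous = {token: tags for token, tags in tags_by_token(toy_pairs).items()
             if len(set(tags)) > 1}
print(ambiguous)
```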
👋 ⚒ Let's do the same in spaCy. Go to the [spaCy documentation](https://spacy.io/usage/linguistic-features) and perform tokenization and POS tagging on the same example sentence. Note: Output only the tokens, their coarse-grained POS labels as used by spaCy, and their Penn Treebank tags.

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

# Your code here

## Lemmatization and Stemming

We compared these two methods in the lecture. Now it is your turn to experiment with both.

👋 ⚒ Which stemmer works better? Which method, stemming or lemmatization, would you prefer for computing word frequencies in a text corpus?

In [None]:
# Lemmatizer
from nltk.stem import WordNetLemmatizer

# Stemmer
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer

# Lemmatizer
lemmatizer = WordNetLemmatizer()

# Selection of stemmers
ps = PorterStemmer()
ls = LancasterStemmer()
ss = SnowballStemmer("english")

# Exercise: Lemmatize and stem (maybe try different stemmers) the following words
words = ['presumably', 'provisions', 'owed', 'abacus', 'flies', 'dies', 'mules',
        'seizing', 'caresses', 'sensational', 'colonizer', 'traditional', 'plotted']

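For the word frequency question, it can help to see how the choice of normalization changes the counts. The sketch below is one possible setup, not a prescribed solution: the function `frequencies` and the toy token list are hypothetical, and lower-casing stands in for the normalizer by default. With the objects created in the cell above, `normalize=ps.stem` or `normalize=lemmatizer.lemmatize` could be passed instead.

```python
from collections import Counter


def frequencies(tokens, normalize=str.lower):
    """Count tokens after mapping each one through `normalize`."""
    return Counter(normalize(token) for token in tokens)


# Hypothetical toy tokens; in practice this would be a tokenized corpus
toy_tokens = ["Flies", "flies", "fly", "Fly", "dies"]

# Lower-casing alone still keeps "flies" and "fly" apart
print(frequencies(toy_tokens))
```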
In spaCy, lemmatization works much like tokenization and POS tagging, as the cell below shows for our example sentence. Unfortunately, the library provides no function for stemming.

In [None]:
doc = nlp(sentence)
for token in doc:
    print(token.text, token.lemma_)

## Named Entity Recognition (NER)

👋 ⚒ Get the results for NER for the following example sentence in spaCy.

In [None]:
example_sentence = "Vienna is lovely in December."

# Your code here

## Dependency Parsing

Whenever grammatical relations are needed, dependency parsing is very useful. The most common tagset is the set of [Universal Dependency Relations](https://universaldependencies.org/u/dep/).

While NLTK offers some options for dependency parsing, the successful ones depend on the Stanford Parser. Stanza, the more recent Python library from the Stanford NLP Group, is therefore the more useful choice. We first parse the example sentence with spaCy and then with Stanza.

In [None]:
doc = nlp(example_sentence)

for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          list(token.children))

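The head and children columns printed above carry the same information in two directions: once every word's head is known, the children of each word follow. The sketch below derives the children from head indices alone, using the convention that also appears in the Stanza cell further down (1-based head index, with 0 marking the root). The parse is hand-annotated toy data and the helper `children_by_head` is hypothetical. Keying the result by word text is a simplification that would merge repeated words in longer sentences.

```python
from collections import defaultdict

# Hand-annotated toy parse (an assumption, not parser output):
# (word, 1-based index of the head, relation); head 0 marks the root
toy_parse = [("She", 2, "nsubj"), ("reads", 0, "root"), ("books", 2, "obj")]


def children_by_head(parse):
    """Map each head word to the list of words that depend on it."""
    children = defaultdict(list)
    for word, head, _relation in parse:
        if head > 0:  # the root itself has no head
            children[parse[head - 1][0]].append(word)
    return dict(children)


print(children_by_head(toy_parse))
```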
In [None]:
# You can also visualize the dependency relations
from spacy import displacy
displacy.render(doc, style="dep", jupyter=True)

In Stanza, we can perform all of the above operations in one pipeline. spaCy also offers a pipeline that runs several of these processing steps in one go.

In [None]:
import stanza
stanza.download('en')

👋 ⚒ The print statement in the cell below contains two for loops and an if/else expression. Try to rewrite this one-liner as two explicit loops and an if/else statement spread over several lines.

In [None]:
pipeline = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,depparse')
doc = pipeline(example_sentence)

# Try to split the following line into two for statements and one if/else
print(*[f'id: {word.id}\tword: {word.text}\thead id: {word.head}\thead: {sent.words[word.head-1].text if word.head > 0 else "root"}\tdeprel: {word.deprel}' for sent in doc.sentences for word in sent.words], sep='\n')

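As a hint for the exercise, the same kind of rewrite is sketched here on unrelated toy data; the names `grid`, `one_liner` and `unrolled` are made up for illustration. The `for` clauses of a comprehension become nested loops in the same left-to-right order, and the conditional expression becomes an ordinary if/else statement.

```python
grid = [[1, 2], [3, 4]]

# One-liner: two `for` clauses and a conditional expression
one_liner = [f"{n} is {'even' if n % 2 == 0 else 'odd'}" for row in grid for n in row]

# Equivalent multi-line version with two nested loops and an if/else
unrolled = []
for row in grid:
    for n in row:
        if n % 2 == 0:
            label = "even"
        else:
            label = "odd"
        unrolled.append(f"{n} is {label}")

print(*unrolled, sep="\n")
```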
❓ Do you notice any differences between the dependency relations in the spaCy and the Stanza output for this sentence? Do the two parsers agree on the relations that hold in this sentence?
