# Homework 1: NLP Basics and NLP Pipelines (7 + 1 points)

**Welcome to homework 1!** 

The homework contains several tasks. The number of points awarded for a correct solution is given in each task header. The maximum number of points for each homework is 7 + 1 (bonus exercise).
The **grading** for each task is as follows:
* correct answer - **full points** 
* insufficient solution or solution resulting in the incorrect output - **half points**
* no answer or completely wrong solution - **no points**

Even if you don't know how to solve the task, we encourage you to write down your thoughts and progress and try to address the issues that stop you from completing the task.

When working on the written tasks, try to make your answers short and accurate. Most of the time, a question can be answered in 1-3 sentences.

When writing code, make it readable. Choose appropriate names for your variables (`a = 'cat'` is not good, `word = 'cat'` is good). Avoid lines of code longer than 100 characters (79 characters is ideal). If needed, add comments to your code; however, good code should be easily readable without them :)

Finally, all your answers must be written by you alone. Copying them from other sources will be considered academic fraud. You may discuss the tasks with your classmates, but each solution must be individual.



**Before submitting your solution, run Kernel -> Restart & Run All to ensure that all your code works.**

In [None]:
!pip install stanza
import stanza  # must be imported before stanza.download can be called
stanza.download(lang='en')  # download the language model for your chosen language

In [None]:
import nltk
from nltk import word_tokenize, sent_tokenize, pos_tag
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords, wordnet

import spacy
import stanza

from tqdm.notebook import tqdm
import re
from collections import defaultdict, Counter

import matplotlib.pyplot as plt

## Task 1: Find the data (0.5 points)

Find a sufficiently large text dataset in English or any other language supported by spaCy and Stanza. If the resources for your language are very limited, you may use English or another language of your choice.

**What is the language of your data?**

Answer: TODO 

**Where did you get the text data?**

Answer: TODO

**What kind of texts are they? (books, magazines, news articles, etc.)**

Answer: TODO 

**What style(s) of text does your data have? (user comments, scientific, neutral, etc.)**

Answer: TODO

**Was it easy to download the data? If not, describe what difficulties you had and how you resolved them.**

Answer: TODO




## Task 2: Tokenize and count statistics (1 point)

Using either NLTK or spaCy tools, tokenize the text data you found in the previous task.

P.S. If you are using spaCy, don't forget to load an appropriate language model for it.

**Compute and output the following:**
* number of sentences
* number of tokens
* number of unique tokens (or types)
* average length of a sentence 
* average length of a token
* sentence length (tokens in a sentence) histogram (you can use matplotlib.pyplot for that) 
* token length (characters in a token) histogram (you can use matplotlib.pyplot for that) 
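
To make the quantities above concrete, here is a minimal sketch on hypothetical, already tokenized toy data. The sentences and all `toy_*` names are illustrative assumptions and are not part of the assignment data. Tokenization itself is left out, and the length counters hold the same information a histogram would display.

```python
from collections import Counter

# Hypothetical toy data: one list of tokens per sentence.
toy_sentences = [
    ["The", "cat", "sat", "."],
    ["The", "dog", "barked", "loudly", "."],
]

# Flatten the sentences into a single list of tokens.
toy_tokens = [token for sentence in toy_sentences for token in sentence]

toy_num_sentences = len(toy_sentences)
toy_num_tokens = len(toy_tokens)
toy_num_types = len(set(toy_tokens))  # unique tokens

# Sentence length is measured in tokens, token length in characters.
toy_avg_sentence_len = toy_num_tokens / toy_num_sentences
toy_avg_token_len = sum(len(token) for token in toy_tokens) / toy_num_tokens

# Length distributions: the data that a histogram visualises.
sentence_length_counts = Counter(len(sentence) for sentence in toy_sentences)
token_length_counts = Counter(len(token) for token in toy_tokens)

print(toy_num_sentences, toy_num_tokens, toy_num_types)  # 2 9 7
print(toy_avg_sentence_len)                              # 4.5
print(round(toy_avg_token_len, 2))                       # 3.22
```

Note that "The" and "." each occur twice but count as one type each, which is why there are 9 tokens but only 7 types.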

In [None]:
# Replace the path with the name of your data file
data_path = "path_to_your_text_data.txt"

with open(data_path, encoding='utf-8') as f:
    data = f.read()

# Split the data into sentences and tokens

In [None]:

num_sentences = ...
num_tokens = ...
num_unique_tokens = ...
avg_sentence_len = ...
avg_token_len = ...

print("Number of sentences:", num_sentences)
print("Number of tokens:", num_tokens)
print("Number of unique tokens (or types):", num_unique_tokens)
print("Average sentence length:", avg_sentence_len)
print("Average token length:", avg_token_len)

In [None]:
sentence_lengths = ... 

# draw the histogram 
...

In [None]:
token_lengths = ...

# draw the histogram
...

## Task 3: Byte pair encoding (BPE) tokenization (1 point)

### Task 3.1 (0.25 points)

[Byte pair encoding (BPE)](https://en.wikipedia.org/wiki/Byte_pair_encoding) is a simple data compression algorithm. It finds the most frequent pair of adjacent bytes in the data and replaces it with a new byte that does not occur in the data.
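
As an illustration of the compression idea, the following minimal sketch repeats the replacement step three times on a short toy string. The function name `replace_most_frequent_pair` and the replacement symbols `Z`, `Y`, `X` are assumptions chosen for this example. Characters stand in for bytes, and ties between equally frequent pairs are resolved in favour of the pair seen first.

```python
from collections import Counter

def replace_most_frequent_pair(text, new_symbol):
    """Replace the most frequent adjacent pair in `text` with `new_symbol`.

    Returns the compressed text and the pair that was replaced.
    """
    pair_counts = Counter(zip(text, text[1:]))
    (first, second), _ = pair_counts.most_common(1)[0]
    pair = first + second
    return text.replace(pair, new_symbol), pair

text = "aaabdaaabac"
replacement_table = {}

# Each new symbol must not occur anywhere in the original data.
for new_symbol in "ZYX":
    text, pair = replace_most_frequent_pair(text, new_symbol)
    replacement_table[new_symbol] = pair
    print(new_symbol, "=", pair, "->", text)

# Z = aa -> ZabdZabac
# Y = Za -> YbdYbac
# X = Yb -> XdXac
```

The compressed string `XdXac` together with the replacement table is enough to restore the original data by undoing the replacements in reverse order.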

Recently, this idea has been applied to [tokenization](https://www.aclweb.org/anthology/P16-1162.pdf). Suppose we want to train a network that captures the meaning of words, and our data contains the words low, lower, and lowest. If we tokenize the text simply by treating each word as a single token, the model will probably learn the relation between low, lower, and lowest. Now imagine that the model receives new text containing the words small, smaller, and smallest, while the training data contained only small. Since the model did not see smaller and smallest during training, it will most likely fail to capture the relation between them.

One way to solve this is BPE tokenization. It learns the most frequent character sequences and can split an unknown word into **subwords**. In our case, it can split smaller into ['small', 'er'], since small was in the training data along with, probably, many other words ending in -er. Now, instead of one unknown word, the model has two known subwords from which it can draw information.
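
To show how such frequent sequences are learned, here is a minimal sketch that runs three merge steps on a toy vocabulary built from low, lower, and lowest. The word frequencies and the helper names `count_pairs` and `merge_pair` are assumptions made for this illustration. Each word is stored as space-separated symbols ending with the end-of-word marker `</w>`, and merging is done on symbol lists rather than with a regular expression.

```python
from collections import Counter

# Hypothetical word frequencies; symbols are separated by spaces.
toy_vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "l o w e s t </w>": 1}

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[left, right] += freq
    return pairs

def merge_pair(pair, vocab):
    """Join every occurrence of `pair` into a single symbol."""
    merged_vocab = {}
    for word, freq in vocab.items():
        symbols = word.split()
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        merged_vocab[" ".join(merged)] = freq
    return merged_vocab

merges = []
for _ in range(3):
    pairs = count_pairs(toy_vocab)
    best = max(pairs, key=pairs.get)  # ties go to the pair seen first
    toy_vocab = merge_pair(best, toy_vocab)
    merges.append(best)
    print(best, "->", list(toy_vocab))

# ('l', 'o') -> ['lo w </w>', 'lo w e r </w>', 'lo w e s t </w>']
# ('lo', 'w') -> ['low </w>', 'low e r </w>', 'low e s t </w>']
# ('low', '</w>') -> ['low</w>', 'low e r </w>', 'low e s t </w>']
```

After three merges the frequent word low has become a single token, while lower and lowest share the subword `low` followed by individual characters.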

The code below builds the subwords from the text data. To save time, we set the number of merges to 1000.

**Study the code below and answer the questions after it.**


In [None]:
def get_vocab(filename):
    """Counts the whitespace-separated words in a file.

    Each word is stored as space-separated characters followed by '</w>'.
    """
    
    vocab = Counter()
    with open(filename, encoding='utf-8') as f:
        for line in f:
            words = line.strip().split()
            for word in words:
                vocab[' '.join(list(word)) + ' </w>'] += 1
    return vocab

def get_stats(vocab):
    """Computes the frequencies for each pair of characters in the vocab."""

    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[symbols[i],symbols[i+1]] += freq
    return pairs

def merge_vocab(pair, in_vocab):
    """Merges the most frequent pair.

    Arguments:
    pair -- the most frequent symbol pair (tuple(str, str))
    in_vocab -- vocabulary with frequencies (dict)
    """
    
    out_vocab = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in in_vocab:
        out_word = p.sub(''.join(pair), word)
        out_vocab[out_word] = in_vocab[word]
    return out_vocab

def get_tokens_from_vocab(vocab):
    """Returns token frequencies and the current tokenization of each word."""
    tokens_frequencies = Counter()
    vocab_tokenization = {}
    for word, freq in vocab.items():
        word_tokens = word.split()
        for token in word_tokens:
            tokens_frequencies[token] += freq
        vocab_tokenization[''.join(word_tokens)] = word_tokens
    return tokens_frequencies, vocab_tokenization

def measure_token_length(token):
    """Returns the token length, counting '</w>' as one character."""
    if token[-4:] == '</w>':
        return len(token[:-4]) + 1
    else:
        return len(token)

vocab = get_vocab(data_path)

print('==========')
print('Tokens Before BPE')
tokens_frequencies, vocab_tokenization = get_tokens_from_vocab(vocab)
print('All tokens: {}'.format(tokens_frequencies.keys()))
print('Number of tokens: {}'.format(len(tokens_frequencies.keys())))
print('==========')

num_merges = 1000
for i in tqdm(range(num_merges)):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)

tokens_frequencies, vocab_tokenization = get_tokens_from_vocab(vocab)

print('Tokens After BPE')
print('All tokens: {}'.format(tokens_frequencies.keys()))
print('Number of tokens: {}'.format(len(tokens_frequencies.keys())))
print('==========')

Answer the following questions: 

**Study the subwords from your data. Do you see any subwords that make sense from the linguistic point of view? (e.g suffixes, prefixes, common roots etc.). Provide examples.**

Answer: TODO

**What will happen if you increase the number of merges?**

Answer: TODO 

### Task 3.2 (0.75 points) 
Now, you are going to implement a function that splits an unknown word into subwords using the vocab that we built above.

One way to do it is the following:

1. Sort the vocab by token length in descending order.
2. Find the boundaries of the "window" that checks whether a candidate substring of the word has a corresponding subword in the vocab. At the beginning, the start index is 0, since we start scanning the word from the first character. The end index is the length of the longest subword in the vocab, or the length of the word if it is smaller.
3. In a while loop, examine the candidate subwords. If the current subword is in the vocab, append it to the result. Your new start index is then your previous end index, and your new end index is the new start index plus the length of the longest subword in the vocab, or the length of the word if it is smaller than that sum. If the subword is not in the vocab, reduce the end index by one, thus narrowing the search window. Finally, if the length of the window is equal to one and the subword is still not in the vocab, put the unknown token in the result and update the window as above.
4. End the loop when you reach the end of the word.

After you finish with the function, test the tokenizer on a very common word and on a very unusual word (you can even try to invent a word yourself).
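
When testing, it can help to check automatically that a tokenization is consistent with the original word. The following is a minimal sketch of such a check. The helper `is_valid_split` is hypothetical and not part of the assignment, and it assumes, as in the steps above, that each unknown token stands for exactly one character of the word.

```python
def is_valid_split(word, subwords, unknown_token="</u>"):
    """Check that `subwords` cover `word` from left to right without gaps.

    Assumption: every unknown token replaces exactly one character.
    """
    position = 0
    for piece in subwords:
        if piece == unknown_token:
            position += 1  # skip the single character that was unknown
        elif word.startswith(piece, position):
            position += len(piece)
        else:
            return False
    return position == len(word)

print(is_valid_split("smaller</w>", ["small", "er</w>"]))        # True
print(is_valid_split("smaller</w>", ["small", "</u>", "r</w>"]))  # True
print(is_valid_split("smaller</w>", ["small", "est</w>"]))       # False
print(is_valid_split("smaller</w>", ["small"]))                  # False
```

The check only verifies that the pieces reassemble into the word. It says nothing about whether the split is the one the greedy longest-match procedure should produce.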


In [None]:
# Sorting the subwords by the length in the descending order
sorted_tokens_tuple = sorted(
    tokens_frequencies.items(),
    key=lambda item: (measure_token_length(item[0]), item[1]),
    reverse=True,
)
sorted_tokens = [token for (token, freq) in sorted_tokens_tuple]

def tokenize_word(string, sorted_tokens, unknown_token='</u>'):
    """
    Tokenizes the word into subwords using the learned BPE vocab.
    
    Arguments:
    string -- a word to tokenize. Must end with </w>
    sorted_tokens -- vocab tokens sorted by length, then by frequency,
                     in descending order
    unknown_token -- a token that replaces the parts of the word not found
                     in the vocab
    """
    
    if string == '':
        return []
    if sorted_tokens == []:
        return [unknown_token]

    # We are going to store our subwords here
    string_tokens = []
    
    # Find the maximum length of the ngram in vocab
    ngram_max_len = ...
    # End index is the maximum length of the ngram or the length of the string if it's smaller
    end_idx = ...
    # Starting index is 0 in the beginning
    start_idx = 0
    
    while start_idx < len(string):
        subword = string[start_idx:end_idx]
        if subword in sorted_tokens:
            ...
        elif len(subword) == 1:
            ...
        else:
            ...
            
    return string_tokens

# The word should end with "</w>". For example, "cat</w>".
word_known = '...</w>'
word_unknown = '...</w>'

print('Tokenizing word: {}...'.format(word_known))
if word_known in vocab_tokenization:
    print(vocab_tokenization[word_known])
else:
    print(tokenize_word(string=word_known, sorted_tokens=sorted_tokens, unknown_token='</u>'))
    

print('Tokenizing word: {}...'.format(word_unknown))
if word_unknown in vocab_tokenization:
    print(vocab_tokenization[word_unknown])
else:
    print(tokenize_word(string=word_unknown, sorted_tokens=sorted_tokens, unknown_token='</u>'))

## Task 4: Lemmatization and normalization (1 point) 

### Task 4.1 (0.5 points) 

Using either NLTK or spaCy, lemmatize your data. Then make a copy of your data in which all the tokens and lemmas are converted to lowercase.

Provide the following statistics:

* Number of unique lemmas (original case)
* Number of unique lemmas (lower case)
* Number of unique tokens (original case)
* Number of unique tokens (lower case)

In [None]:
# Lemmatize your data
...


# Make a copy of your tokens but in lowercase
...


# Count statistics (no need to calculate the number of unique tokens
# in original case since we did it in Task 2)
num_unique_lemmas = ...
num_unique_lemmas_lower = ...
num_unique_tokens_lower = ...

# Print out the numbers
print("Number of unique lemmas (original case):", num_unique_lemmas)
print("Number of unique lemmas (lower case):", num_unique_lemmas_lower)
print("Number of unique tokens (original case):", num_unique_tokens)
print("Number of unique tokens (lower case):", num_unique_tokens_lower)

### Task 4.2 (0.5 points)

Look at the numbers you got. 

**Imagine that you want to use your data to train a network that captures the meaning of the words. Do you want to use tokens or lemmas? Original or lowercase? Explain your choices.**

Answer: TODO 

**Imagine that you want to use your data to train a system that detects named entities, i.e. names of people, places, companies etc. Do you want to use tokens or lemmas? Original or lowercase? Explain your choice.**

Answer: TODO


## Task 5: Different Pipelines (0.5 points) 

In the next tasks, you need to process your data from Task 1 with two different pipelines. Use Stanza and spaCy for that.

**What components do the pipelines have?**

Answer: TODO 

**What languages do the pipelines support?**

Answer: TODO 

## Task 6: Process your text (2 points) 

### Task 6.1 (1.5 points)

Process the text data from the first task with two different pipelines. Use Stanza and spaCy for that.

Select one sentence from the processed document and print out all the results (tokens, POS tags, lemmas, dependency parse, etc.) from both pipelines.

In [None]:
# Process the text 
# Pipeline 1
...

# Pipeline 2
...

# Print out the results from both pipelines
...

### Task 6.2 (0.5 points)

**Look at the output from both pipelines. Are the results correct, and do the pipelines produce the same output? If not, provide examples of the mistakes and differences.**

Answer: TODO

**What is the difference between a POS tag and a morphological tag?**

Answer: TODO

**What is the difference between tagging and parsing?**

Answer: TODO

**Analyze the dependency parsing results from both pipelines. Do the results make sense? Briefly describe the meaning behind the relations.**

Answer: TODO

**Is one pipeline better than the other based on the output of one sentence?**

Answer: TODO 

## Task 7: Statistics (1 point)

In your processed output (choose the output from only one of the pipelines), compute and print out (in a human-readable format) the following statistics:
* POS tag frequency for each tag (in descending order) 

* 50 most frequent lemmas 
* 10 least frequent lemmas 
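
For frequency statistics like these, the standard library's `collections.Counter` is convenient. The sketch below uses a hypothetical list of lemmas (the `toy_*` names and data are assumptions for illustration) to show how the most and least frequent items can be read from a counter. Lemmas with equal counts appear in the order in which they were first seen, so the choice among tied items is arbitrary.

```python
from collections import Counter

# Hypothetical lemmas, standing in for the output of a pipeline.
toy_lemmas = ["be", "cat", "be", "the", "the", "be", "dog"]
toy_lemma_counts = Counter(toy_lemmas)

# most_common(n) returns the n most frequent items in descending order.
toy_most_frequent = toy_lemma_counts.most_common(2)

# Reading the full ranking backwards yields the least frequent items.
toy_least_frequent = toy_lemma_counts.most_common()[:-3:-1]

print("Most frequent lemmas:")
for lemma, count in toy_most_frequent:
    print(f"{lemma:<10}{count:>5}")

print("Least frequent lemmas:")
for lemma, count in toy_least_frequent:
    print(f"{lemma:<10}{count:>5}")
```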

In [None]:
# Compute and print out POS tag frequency 
...


# Compute and print out 50 most frequent lemmas 
...

# Compute and print out 10 least frequent lemmas 
...

## Bonus Task: WordCloud (1 point) 

A wordcloud gives us a visual representation of the most common words in the data. Visualisation is key to understanding whether we are on the right track with preprocessing: it lets us verify whether more preprocessing is needed before analysing the text data further.

**Your task is to create three wordclouds. One wordcloud should be created before any preprocessing is done to the text data, and the other two should be created from the preprocessed text data.
Do suitable preprocessing for the tasks described in 4.2. This means:**
1. preprocess the data so that we could train a neural network that captures the meaning of words.
2. preprocess the data so that we could train a system that detects named entities.


Python has many open-source libraries for drawing wordclouds. You can use Andreas Mueller's [wordcloud](http://amueller.github.io/word_cloud/) library to do that.



In [None]:
# create the first wordcloud from raw text data
...


In [None]:
# Preprocess the data for goal 1 (capturing the meaning of words)
...

# create the second wordcloud
...

In [None]:
# Preprocess the data for goal 2 (detecting named entities)
...

# create the third wordcloud
...

**What are the differences between these wordclouds (provide examples)? Can you say from the wordclouds if the preprocessing was enough for the tasks?**

Answer: TODO