# Assignment 2: Theses

---

## Task 2) Theses Inspiration

Imagine you had to write another thesis and just couldn't find a good topic to work on.
Well, n-grams to the rescue!
Download the `theses.txt` data set from the `Supplemental Materials` in the `Files` section of our Microsoft Teams group.
This dataset consists of approx. 1,000 thesis topics chosen by students in the past.

In this assignment, you will be sampling from n-grams to generate new potential thesis topics.
Pay extra attention to preprocessing: How would you handle hyphenated words and acronyms/abbreviations?

*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

In [2]:
# Dependencies
import numpy as np
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\flockan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Prepare the Data

1.1 Spend some time on pre-processing. How would you handle hyphenated words and abbreviations/acronyms?

In [3]:
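def load_theses_titles(filepath):
    """Loads all theses titles and returns them as a list."""
    ### YOUR CODE HERE

    # The context manager closes the file even if reading fails.
    with open(filepath, mode="r", encoding="utf-8") as f:
        # One title per line; drop the trailing newline of each line.
        return [line.rstrip("\n") for line in f]

    ### END YOUR CODE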

In [4]:
import re

def preprocess(data):
    """Preprocesses and tokenizes the given theses titles for further use."""
    ### YOUR CODE HERE

    result = []
    for line in data:
        # Split hyphenated words into their parts; drop quotes and parentheses.
        line = re.sub(r'["()]', "", line.replace("-", " "))
        tokenized = nltk.tokenize.word_tokenize(line, language="german")
        # Mark the title boundaries so the models learn how titles start and end.
        result.append(['<s>'] + tokenized + ['</s>'])

    return result

    ### END YOUR CODE
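
The cell above splits hyphenated words into separate tokens. For question 1.1, an alternative is to keep hyphenated compounds and dotted abbreviations together as single tokens. The following is a minimal sketch under that assumption; `TOKEN_PATTERN` and `tokenize_title` are hypothetical names that are not part of the assignment template, and the pattern only covers the simple cases named in the comments.

```python
import re

# Hypothetical token pattern (an assumption, not a reference solution):
#   1. dotted abbreviations such as "z.B." (a letter followed by a dot, at least twice)
#   2. words, optionally joined by hyphens ("Cloud-basierte", "IT-Systemen")
TOKEN_PATTERN = re.compile(r"(?:[^\W\d_]\.){2,}|\w+(?:-\w+)*")

def tokenize_title(title):
    """Tokenizes one title, keeping hyphenated words and dotted abbreviations whole."""
    # Quotes, parentheses and commas match neither alternative and are dropped.
    return ['<s>'] + TOKEN_PATTERN.findall(title) + ['</s>']

print(tokenize_title('Cloud-basierte Analyse von IT-Systemen, z.B. "SAP" (ERP)'))
# ['<s>', 'Cloud-basierte', 'Analyse', 'von', 'IT-Systemen', 'z.B.', 'SAP', 'ERP', '</s>']
```

Keeping compounds whole gives more specific tokens but sparser counts, while splitting them gives the models more shared contexts to sample from.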

In [5]:
basepath = "C:\\Dev\\uni\\seqlrn-assignments\\2-markov-chains\\data\\"
theses_data = preprocess(load_theses_titles(basepath + "theses.txt"))
print(theses_data[422])

['<s>', 'Johan', 'Gotliebs', 'Einführung', 'in', 'die', 'Buchhaltung', 'Edition', 'und', 'Kommentar', 'zu', 'den', 'Geschäftsvorfällen', 'Faktorbuchhaltung', '</s>']


### Train N-gram Models

2.1 Train n-gram models with n = [1, ..., 5]. What about \<s> and \</s>?

In [6]:
def build_n_gram_models(n, data):
    """Counts all k-grams for k = 1, ..., n; the counts for k are stored at index k - 1."""
    ### YOUR CODE HERE

    ngrams = []
    for i in range(1, n + 1):
        freq = nltk.FreqDist()
        for title in data:
            # '<s>' and '</s>' are regular tokens here, so they appear in the n-grams too.
            freq.update(nltk.ngrams(title, i))

        ngrams.append(freq)
    return ngrams

    ### END YOUR CODE

In [7]:
n_gram_models = build_n_gram_models(5, theses_data)
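
The models built above store raw n-gram counts. For sampling, the counts of all n-grams that share a context are usually normalized into conditional probabilities, P(w | context) = count(context, w) / count(context, any word). The sketch below illustrates this on toy data with the standard library only; `ngram_counts`, `conditional_probs` and `toy_titles` are hypothetical names, and `collections.Counter` stands in for `nltk.FreqDist` (which is itself a `Counter` subclass).

```python
from collections import Counter

def ngram_counts(titles, n):
    """Counts all n-grams of size n over the tokenized titles."""
    counts = Counter()
    for tokens in titles:
        # zip over shifted copies of the token list yields the n-gram tuples.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

def conditional_probs(context, counts):
    """Returns {next_token: P(next_token | context)} as relative frequencies."""
    context = tuple(context)
    k = len(context)
    matches = {ng[-1]: c for ng, c in counts.items() if ng[:k] == context}
    total = sum(matches.values())
    return {token: c / total for token, c in matches.items()} if total else {}

# Toy data (made up for illustration, not taken from theses.txt).
toy_titles = [
    ['<s>', 'Entwicklung', 'einer', 'App', '</s>'],
    ['<s>', 'Entwicklung', 'eines', 'Systems', '</s>'],
    ['<s>', 'Entwicklung', 'einer', 'Plattform', '</s>'],
]
bigram_counts = ngram_counts(toy_titles, 2)
print(conditional_probs(['Entwicklung'], bigram_counts))
```

In the toy data, "einer" follows "Entwicklung" in two of three titles, so it receives two thirds of the probability mass.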

### Generate the Titles

3.1 Write a generator that provides thesis titles of desired length. Please do not use the available `lm.generate` method but write your own.

3.2 How can you incorporate seed words?

3.3 How do you handle \</s> tokens (w.r.t. the desired length)?

3.4 If you didn't just copy what nltk's lm.generate does: compare the outputs.

In [16]:
# Notice: If you fix the seed (e.g. via np.random.seed), you get reproducible results.

def sample_next_token(prev, n_gram_model):
    """Samples the next word given the preceding tokens `prev`."""
    ### YOUR CODE HERE

    # Collect every n-gram that starts with `prev`, together with its count.
    prev = tuple(prev)
    candidates = [(ngram[-1], count) for ngram, count in n_gram_model.items()
                  if ngram[:len(prev)] == prev]

    # Unseen context (e.g. an unknown seed word): end the title.
    if not candidates:
        return '</s>'

    # Relative frequencies of the possible next words.
    counts = np.array([count for _, count in candidates], dtype=float)
    probabilities = counts / counts.sum()

    # Sample proportionally to the frequencies instead of uniformly.
    selected = np.random.choice(len(candidates), p=probabilities)

    return candidates[selected][0]

    ### END YOUR CODE


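def generate(n, n_gram_models, seed, title_length):
    """Generates a thesis title using the n_grams, seed word and title length."""
    ### YOUR CODE HERE

    text = ['<s>', seed]
    # '<s>' does not count towards the title length, hence the + 1.
    while text[-1] != '</s>' and len(text) < title_length + 1:
        # Near the start the history is shorter than n - 1 tokens:
        # use the (len(text) + 1)-gram model with the whole history as context.
        if len(text) < n:
            text.append(sample_next_token(text, n_gram_models[len(text)]))

        # Afterwards use the n-gram model with the last n - 1 tokens as context.
        else:
            text.append(sample_next_token(text[len(text) - n + 1:], n_gram_models[n - 1]))

    # Close the title if the length limit was reached first.
    if text[-1] != '</s>':
        text.append('</s>')

    return text

    ### END YOUR CODE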
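
A 5-gram context is often unseen in a data set of this size, for example when the seed word never occurs at the start of a title. One common way to deal with this is to back off to shorter contexts until some continuation is found. The following sketch shows the idea with the standard library only; `ngram_counts`, `sample_with_backoff` and the toy data are hypothetical and not part of the assignment template, and `random.choices` stands in for `np.random.choice`.

```python
import random
from collections import Counter

def ngram_counts(titles, n):
    """Counts all n-grams of size n over the tokenized titles."""
    counts = Counter()
    for tokens in titles:
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

def sample_with_backoff(context, models, rng=random):
    """Samples the next token; models[k] holds the counts of all (k + 1)-grams."""
    # Keep at most as many context tokens as the largest model can use.
    max_context = len(models) - 1
    context = tuple(context)[-max_context:] if max_context > 0 else ()
    while True:
        k = len(context)
        # Continuations of the current context; never emit '<s>' mid-title.
        candidates = {ng[-1]: c for ng, c in models[k].items()
                      if ng[:k] == context and ng[-1] != '<s>'}
        if candidates:
            tokens, weights = zip(*candidates.items())
            return rng.choices(tokens, weights=weights, k=1)[0]
        if not context:
            return '</s>'  # nothing to sample from at all
        context = context[1:]  # back off: drop the oldest context token

# Toy data (made up for illustration, not taken from theses.txt).
toy_titles = [
    ['<s>', 'Entwicklung', 'einer', 'App', '</s>'],
    ['<s>', 'Analyse', 'einer', 'Plattform', '</s>'],
]
toy_models = [ngram_counts(toy_titles, n) for n in (1, 2, 3)]
rng = random.Random(0)  # fixed seed for reproducible samples

print(sample_with_backoff(['<s>', 'Entwicklung'], toy_models, rng))  # seen trigram context
print(sample_with_backoff(['Cloud', 'einer'], toy_models, rng))      # backs off to ['einer']
```

Backing off trades fidelity for coverage: shorter contexts always find a continuation sooner, but the generated titles drift further from the training titles.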

In [17]:
title_length = 20
seed_word = "Entwicklung"
thesis_title = generate(5, n_gram_models, seed_word, title_length)
print(thesis_title)

seed_word = "Cloud"
thesis_title = generate(5, n_gram_models, seed_word, title_length)
print(thesis_title)

['<s>', 'Entwicklung', 'intuitiver', 'Interaktionsmöglichkeiten', 'für', 'die', 'graphischen', 'Auswertungen', 'in', 'der', 'Software', 'IngSoft', 'InterWatt', '</s>']
['<s>', 'Cloud', 'basierte', 'Kurwenwarnung', '</s>']
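
The generated titles are token lists that still contain the boundary markers. To read them as plain titles, the markers can be dropped and the remaining tokens joined. A small sketch, where `to_title` is a hypothetical helper that is not part of the assignment template:

```python
def to_title(tokens):
    """Joins a generated token list into a plain string without boundary markers."""
    return " ".join(token for token in tokens if token not in ('<s>', '</s>'))

print(to_title(['<s>', 'Entwicklung', 'einer', 'App', '</s>']))
# Entwicklung einer App
```

Since hyphens were replaced by spaces during preprocessing, the joined title does not restore the original hyphenation.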
