# Assignment 2: Theses

---

## Task 2) Theses Inspiration

Imagine you had to write another thesis and just couldn't find a good topic to work on.
Well, n-grams to the rescue!
Download the `theses.txt` dataset from the `Supplemental Materials` in the `Files` section of our Microsoft Teams group.
This dataset consists of approx. 1,000 thesis topics chosen by students in the past.

In this assignment, you will be sampling from n-grams to generate new potential thesis topics.
Pay extra attention to preprocessing: How would you handle hyphenated words and acronyms/abbreviations?

*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

In [3]:
# Dependencies
import numpy as np
import nltk
import re
import random
from string import punctuation

### Prepare the Data

1.1 Spend some time on pre-processing. How would you handle hyphenated words and abbreviations/acronyms?
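
One possible answer to the question above is a regex-based tokenizer that keeps hyphenated compounds and dotted abbreviations as single tokens. The following is a minimal sketch, not the reference solution: `TOKEN_PATTERN`, `tokenize_title` and the `split_hyphens` flag are hypothetical names introduced only for illustration, and the example title is made up.

```python
import re

# Hypothetical token pattern (an assumption, not part of the assignment):
#   1. dotted abbreviations such as "z.B." or "e.g." (letter + dot, at least twice)
#   2. words, optionally joined by inner hyphens ("Cloud-basierte", "IT-Services")
TOKEN_PATTERN = re.compile(r"(?:[^\W\d_]\.){2,}|\w+(?:-\w+)*")


def tokenize_title(title, split_hyphens=False):
    """Tokenize one title; optionally split hyphenated compounds into their parts."""
    tokens = TOKEN_PATTERN.findall(title)
    if split_hyphens:
        tokens = [part for token in tokens for part in token.split("-")]
    return tokens


print(tokenize_title("Cloud-basierte IT-Services, z.B. für KMU"))
# ['Cloud-basierte', 'IT-Services', 'z.B.', 'für', 'KMU']
```

Keeping compounds intact yields more specific tokens but sparser counts; splitting them (`split_hyphens=True`) does the opposite. Which trade-off works better for roughly 1,000 titles is worth trying out.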

In [4]:
nltk.download('punkt')

def load_theses_titles(filepath):
    """Loads all theses titles and returns them as a list."""
    ### YOUR CODE HERE
    
    with open(filepath, 'r', encoding='utf-8') as file:
        # Strip surrounding whitespace and skip empty lines
        titles = [line.strip() for line in file if line.strip()]
    return titles
    
    ### END YOUR CODE

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\André\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [62]:
def preprocess(data):
    """Preprocesses and tokenizes the given theses titles for further use."""
    ### YOUR CODE HERE
    
    preprocessed_data = []
    
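    for title in data:
        # Case is kept on purpose; use title.lower() here to normalize it instead.
        tokens = nltk.word_tokenize(title)

        # Split hyphenated words into their parts ("Cloud-basierte" -> "Cloud", "basierte").
        split_tokens = []
        for token in tokens:
            split_tokens.extend(re.split(r'[-\s]', token))

        # Strip remaining punctuation, then drop tokens that became empty.
        tokens = [re.sub(r'[^\w\s-]', '', token) for token in split_tokens]
        tokens = [token for token in tokens if token]
        # Optional sentence padding (disabled): tokens = ['<s>'] + tokens + ['</s>']
        preprocessed_data.append(tokens)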
        
    return preprocessed_data

    ### END YOUR CODE

In [44]:
theses_data = preprocess(load_theses_titles("./data/theses.txt"))


### Train N-gram Models

2.1 Train n-gram models with n = [1, ..., 5]. What about \<s> and \</s>?
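
One common way to deal with the boundary symbols is to pad every title before extracting n-grams, so the model can learn which words start and end a title. The sketch below assumes `n - 1` start symbols and a single end symbol per title; `pad_title` and `extract_n_grams` are hypothetical helpers, not part of the provided template.

```python
def pad_title(tokens, n, start="<s>", end="</s>"):
    """Return the tokens with n-1 start symbols in front and one end symbol behind."""
    return [start] * (n - 1) + list(tokens) + [end]


def extract_n_grams(tokens, n):
    """Return all n-grams (as tuples) of the padded token list."""
    padded = pad_title(tokens, n)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


print(extract_n_grams(["Cloud", "Computing"], 2))
# [('<s>', 'Cloud'), ('Cloud', 'Computing'), ('Computing', '</s>')]
```

With this padding, the first real word is always conditioned on a full context of start symbols, and sampling `</s>` gives the generator a natural stopping signal.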

In [63]:
def build_n_gram_models(n, data):
    """Builds maximum-likelihood n-gram models for every order from 1 up to n."""
    ### YOUR CODE HERE
    
    models = {}
    for i in range(1, n+1):
        n_gram_counts = {}
        prefix_counts = {}
        for title in data:
            # Collect all n-grams of order i as tuples so they can serve as dict keys.
            n_grams = [tuple(title[j:j+i]) for j in range(len(title) - i + 1)]
            # The prefix (context) of an n-gram is everything but its last token;
            # for unigrams this is the empty tuple.
            prefixes = [n_gram[:-1] for n_gram in n_grams]

            for n_gram in n_grams:
                if n_gram in n_gram_counts:
                    n_gram_counts[n_gram] += 1
                else:
                    n_gram_counts[n_gram] = 1

            for prefix in prefixes:
                if prefix in prefix_counts:
                    prefix_counts[prefix] += 1
                else:
                    prefix_counts[prefix] = 1

        # Calculate probabilities for the model of order i
        model = {}
        for n_gram, count in n_gram_counts.items():
            prefix = n_gram[:-1]
            model[n_gram] = count / prefix_counts[prefix]
        models[i] = model

    return models
    
    ### END YOUR CODE

In [64]:
n_gram_models = build_n_gram_models(5, theses_data)
print(n_gram_models[1][("Cloud",)])

0.0005902752158193757
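
Each probability is an n-gram count divided by the count of its prefix, so the probabilities of all continuations of one prefix should sum to 1. A quick sanity check along these lines could look as follows; `check_normalization` and the toy model are hypothetical and only illustrate the idea.

```python
import math
from collections import defaultdict


def check_normalization(model):
    """Return True if the probabilities for every prefix sum to (approximately) 1."""
    totals = defaultdict(float)
    for n_gram, probability in model.items():
        totals[n_gram[:-1]] += probability
    return all(math.isclose(total, 1.0) for total in totals.values())


# Toy bigram model in the same format as above: {n-gram tuple: probability}
toy_bigram_model = {("a", "b"): 0.5, ("a", "c"): 0.5, ("b", "a"): 1.0}
print(check_normalization(toy_bigram_model))
# True
```

The same function can be applied to each entry of a dictionary of models keyed by order, since unigrams simply share the empty tuple as their prefix.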


### Generate the Titles

3.1 Write a generator that provides thesis titles of desired length. Please do not use the available `lm.generate` method but write your own.

3.2 How can you incorporate seed words?

3.3 How do you handle \</s> tokens (w.r.t. the desired length)?

3.4 If you didn't just copy what nltk's `lm.generate` does: compare the outputs.
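
To make questions 3.2 and 3.3 concrete, here is a minimal sketch of one possible strategy: start the title with the seed word, stop early when `</s>` is sampled, and otherwise stop at the desired length. It assumes a bigram model stored as `{(previous, next): probability}` that was trained on titles padded with `</s>`; `generate_until_end` and the toy model are hypothetical.

```python
import random


def generate_until_end(bigram_model, seed_word, max_length, rng):
    """Sample tokens from a bigram model until '</s>' is drawn or max_length is reached."""
    title = [seed_word]
    while len(title) < max_length:
        # All tokens that may follow the last generated token
        candidates = {key[1]: p for key, p in bigram_model.items() if key[0] == title[-1]}
        if not candidates:
            break  # unseen context: stop instead of guessing
        words = list(candidates)
        weights = [candidates[word] for word in words]
        next_word = rng.choices(words, weights=weights, k=1)[0]
        if next_word == "</s>":
            break  # the model signals the end of the title
        title.append(next_word)
    return title


toy_model = {("Cloud", "Computing"): 1.0, ("Computing", "</s>"): 1.0}
rng = random.Random(42)  # a dedicated, seeded generator makes the samples reproducible
print(generate_until_end(toy_model, "Cloud", 10, rng))
# ['Cloud', 'Computing']
```

Stopping at `</s>` means a title can be shorter than requested. An alternative is to reject `</s>` and resample until the desired length is reached, at the cost of less natural endings.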

In [65]:
# Notice: If you fix the seed in numpy.random.choice, you get reproducible results.

def sample_next_token(prev, n_gram_model):
    """Samples the next token given the preceding tokens and one n-gram model."""
    ### YOUR CODE HERE
    # The context must have exactly n-1 tokens for a model of order n.
    if len(prev) != len(next(iter(n_gram_model.keys()))) - 1:
        raise ValueError("The size of the previous tokens must match the n-1 value of the n-gram model.")

    # Collect every token that follows this context, with its conditional probability.
    possible_continuations = {key[-1]: n_gram_model[key] for key in n_gram_model if key[:-1] == prev}

    # Unseen context: fall back to the last token of a uniformly chosen n-gram.
    if not possible_continuations:
        return random.choice(list(n_gram_model.keys()))[-1]

    next_words = list(possible_continuations.keys())
    probabilities = list(possible_continuations.values())

    # Draw one token, weighted by its probability.
    next_word = random.choices(next_words, weights=probabilities, k=1)[0]
    return next_word
    
    ### END YOUR CODE


def generate(n, n_gram_models, seed, title_length):
    """Generates a thesis title using the n-gram models, seed word(s) and title length."""
    ### YOUR CODE HERE
    
    # A seed may contain several words; each becomes one token.
    generated_title = seed.split()

    # Generate additional tokens to complete the title
    while len(generated_title) < title_length:
        # Use the last n-1 tokens as context, or fewer while the title is still short
        context_size = min(n - 1, len(generated_title))
        context = generated_title[-context_size:] if context_size > 0 else []
        # Sample the next token from the model whose order matches the context
        next_token = sample_next_token(tuple(context), n_gram_models[len(context) + 1])
        # Add the next token to the generated title
        generated_title.append(next_token)

    # Combine the tokens into a single string
    generated_title = ' '.join(generated_title)

    return generated_title
    ### END YOUR CODE

In [66]:
title_length = 12
seed_word = "Entwicklung"
thesis_title = generate(3, n_gram_models, seed_word, title_length)
print(thesis_title)
seed_word = "Cloud"
thesis_title = generate(3, n_gram_models, seed_word, title_length)
print(thesis_title)

Entwicklung einer Customer Journey Map für Hardware Ereignisse im Anwendungsbereich der DATEV
Cloud basierte Musik Streaming Dienste für SPS Anwenderprogramme auf die IT Service


In [132]:
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Use the same order for the training pipeline and the model.
order = 2
train, vocab = padded_everygram_pipeline(order, theses_data)
lm = MLE(order)
lm.fit(train, vocab)

generated_text = lm.generate(10)

print(generated_text)


['Software', 'für', 'die', 'IT', 'Dienstleistungsbranche', 'im', 'Kontext', 'eines', 'Reputationssytems', 'für']
