# Assignment 4: Keyphrase Extraction, Named Entity Recognition & Neural Models

Due: Monday, February 06, 2023, at 2pm via Moodle

**Team Members** 
* Björn Bulkens
* Klemens Gerber
* Daniel Knorr

Please note that this assignment comes with quite a number of artifacts, totaling somewhere around 5 GB of necessary disk space. In case you are running into issues or do want to keep your environment "clean", we suggest the use of [Google Colab](https://colab.research.google.com/).

In [None]:
%%bash
. ~/.bashrc
python3 -m pip install keybert
python3 -m pip install git+https://github.com/LIAAD/yake
python3 -m pip install transformers
python3 -m pip install datasets
python3 -m pip install nltk
python3 -m pip install spacy
# Install necessary packages for all questions

## Task 1: Keyphrase Extraction (5 + 3 + 3 + 5) = 16 Points

In this task, we will implement our own unsupervised keyphrase extraction (KPE) module utilizing a simple grammatical ruling system, which we apply to a Sherlock Holmes novel.
To generate TF-IDF-weighted phrases, we will be using the entire collection of Sir Arthur Donan Coyle novels to calculate document frequencies.

Finally, we compare the results to general-purpose KPE libraries.

### Sub Task 1: Unsupervised Keyphrase Extraction System (5 Points)

#### 1. Candidate Generation
We will need to generate a set of suitable candidate phrases first, which can then be ranked as keyphrases later on. To do this, we will again be using spaCy's, this time its rule-based [`Matcher` class](https://spacy.io/api/matcher).

The syntactic pattern of a keyphrase candidate should satisfy the following rules:

1. An optional adjective, noun, proper noun
2. An optional adjective, noun, proper noun
3. A mandatory noun or proper noun.

Add a second pattern, which recognizes the pattern

1. A noun or proper noun
2. An adposition
3. Another noun or proper noun

Note that the first condition will match any phrase of length between 1-3 tokens, which is a suitable approximation for our task at hand, whereas the second pattern is slightly more specific, always matching exactly three tokens.
An example of a valid matched phrases for the first pattern would be "Sherlock Holmes" ([PROPN, PROPN]), and "Hounds of Baskervilles" ([NOUN, ADP, PROPN]) for the second pattern.

In [None]:
import spacy
from spacy.matcher import Matcher

In [None]:
# load language model
nlp = spacy.load("en_core_web_sm", disable=["ner"])
matcher = Matcher(nlp.vocab)

# Define the above patterns
pattern = ## YOUR CODE

matcher.add( ## YOUR CODE

To verify whether your pattern is correct, use the below example.
If you have done everything correctly, your matcher will identify **13 phrases**.

In [None]:
doc = nlp("This is a simple test. It should return 'simple', and 'test', among other phrases. Maybe we can also see if it can recognize the art of war. Would it recognize integer linear programming, too?")
matches = matcher(doc)

print(len(matches))

#### 2. Applying Your System

Once you have matched the correct number of keyphrase candidates on the above example, apply your rule-based matcher to an actual data sample. We are going to use the Sherlock Holmes novel "Hounds of Baskervilles". You can find the raw text file at the following URL:

https://sherlock-holm.es/stories/plain-text/houn.txt

Download the text from this URL and apply your spaCy model and matcher on it.  
**Hint:** Make sure you properly decode your input, since some libraries return binary strings.

In [1]:
from urllib.request import urlopen
def load_txt_from_url(url: str = "https://sherlock-holm.es/stories/plain-text/houn.txt") -> str:
  text = ""
  for line in urlopen(url):
    line = line.decode('UTF-8')
    text += str(line)
  return text

We will now investigate which phrase candidates are the most frequently appearing in this novel, simply based on the phrase frequency. Therefore, convert your abstract match objects into actual strings, lowercase them, and return the 20 most frequently occurring phrase candidates and their respective frequencies.  
**Hint:** For counting occurrences, you may look at `collections.Counter`.

In [None]:
candidates = []
# Lowercase and add the extracted candidate matches to `candidates`
## YOUR CODE

# Count the number of occurrences of different candidate phrases
candidate_phrases = ## YOUR CODE
# Print the most frequently occurring phrases, together with the respective frequencies
print( ## YOUR CODE

#### 3. Briefly summarize the quality of your top 20 candidates:

YOUR ANSWER

### Sub Task 2: Generating Document Frequency Values (3 Points)

To compare the previously generated terms with a more refined model, we are going to extract document frequencies from the collection of all Sherlock Holmes works. Since the books are relatively long documents, we are instead going to split based on a simple heuristic in the input document, which should allow a decent approximation by taking into account individual chapters of each novel.

1. Start by loading the Sherlock Holmes canon from https://sherlock-holm.es/stories/plain-text/cnus.txt  
Afterwards, split the full document into individual chapters. For this, use three consecutive line breaks `\n\n\n` as a splitting condition to approximate the chapters.

In [None]:
df_texts = load_txt_from_url("https://sherlock-holm.es/stories/plain-text/cnus.txt")

split_df_texts = ## YOUR CODE

print(len(split_df_texts))

After splitting, you should have 353 individual "documents" to work with.

2. Now, create a dictionary containing each phrase encountered in the larger corpus, and its associated document frequency. Again, ensure that phrase strings are lowercased for consistency with the previous transformation.  
**Hint:** Since the processing of 353 documents might take a while, incorporate [`tqdm.tqdm`](https://tqdm.github.io/) to visualize progress on the task.

In [None]:
from tqdm import tqdm
from typing import List

def return_occurring_phrases(doc_text: str) -> List[str]:
  # process text with spaCy and apply the Matcher
  doc = ## YOUR CODE
  matches = ## YOUR CODE

  # Candidates can be a set, since we only care about the occurrence *once* for IDF values.
  # Again, extract the lower-cased text of a matched span.
  candidates = set()
  for match_id, start, end in matches:
    ## YOUR CODE

  return list(candidates)

all_document_phrases = []
# Iterate through the individual documents and extract phrases for them. Use `tqdm` to visualize progress
## YOUR CODE

# Once again, count the frequency of term occurrences across all documents
## YOUR CODE



3. Output the 20 most frequently appearing document phrases that your system detected:

In [None]:
print( ## YOUR CODE

### Sub Task 3: Generating Weighted Keyphrases (3 Points)

We can now incorporate the extracted keyphrases to calculate `tf-idf` scores, and return a hopefully improved version of our keyphrases for the original "Hounds of Baskervilles" document. 

1. Iterate over all phrases occurring in the novel "Hounds of Baskervilles", and re-score phrases according to the definition of TF-IDF. Use the smoothed definition of idf:

$ idf(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}| + 1} + 1 $

In [None]:
import math
from typing import Dict

def tf_idf(tf: int, df_count: int) -> float:
  """
  Computes the TF-IDF scores according to the above-mentioned formula.
  Note that you may use a constant for the number of documents (|D|).
  """
  return ## YOUR CODE

tf_idf_weighted_candidates = []

# Iterate through all candidate phrase/frequency pairs and compute the TF-IDF scores for each phrase
# Store the phrase together with its TF-IDF score in `tf_idf_weighted_candidates`
for candidate, tf in candidate_phrases.items():
  tf_idf_weighted_candidates.append( ## YOUR CODE

2. Now print the top 20 candidate phrases by TF-IDF weight, and compare the results to your previous output. 

In [None]:
print(## YOUR CODE

3. Write your insights on the comparison of the results below. Try to theorize why some of the phrases still appear, or why other phrases are no longer present:

YOUR ANSWER HERE

4. Give two examples of how you could further improve the list of keyphrase values.

YOUR ANSWER HERE



### Sub Task 4: Apply off-the-shelf Keyphrase Extraction Tools (5 Points)

To put the findings of your system into context, compare them with two popular open-source libraries, namely [YAKE!](https://github.com/LIAAD/yake) and [KeyBERT](https://github.com/MaartenGr/KeyBERT).

1. First, start by running the document with YAKE!; you may use the default parameters. Print the resulting keyphrases, which by default returns 20 phrases.

In [None]:
from yake import KeywordExtractor

extractor = ## YOUR CODE
keywords = ## YOUR CODE

# Print the top 20 keywords
print(keywords)

2. Compare both runtime efficiency and the extracted phrases with your own system.

YOUR ANSWER HERE

3. Now use the KeyBERT library to extract keyphrases. Importantly, you will need to split the document into separate paragraphs, as the underlying neural model will be unable to handle the complete document as input.  
Use the pattern of `\n\n` to separate the text into smaller paragraphs, and filter out any empty lines after. An "empty line" also constitutes all inputs that only contain newline (`\n`) or whitespace ` ` characters.


In [None]:
# Split the input text according to the specified criteria and filter empty lines out.
split_text = ## YOUR CODE

4. To ensure consistency between the tools when extracting keyphrases, set the *n*-gram range to `(1,3)`.
Otherwise, leave all parameters at the default value, and extract the keyphrases from each paragraph.

In [None]:
from keybert import KeyBERT

# This might take a while to install
model = KeyBERT("all-MiniLM-L6-v2")

# Extract the keyphrases from each split, using the adjusted keyphrase ngram range
# Hint: You may pass a list to the extraction function and KeyBERT will automatically handle iteration.
extracted_phrases = ## YOUR CODE

5. Combine the predictions of all individual splits into a single list. For this, sum up the prediction scores across all splits.  
**Hint:** `collections.defaultdict` makes aggregations like this much easier.

In [None]:
from typing import List, Tuple
from collections import defaultdict

def merge_predictions(list_of_predictions: List[List[Tuple]]) -> List[Tuple]:
    """
    Combines lists of predictions into a single list with added scores.
    """
    phrase_dict = ## YOUR CODE

    # Iterate through all the lists of predictions and add the scores to the correct dict entry
    ## YOUR CODE

    # Extract the 20 keyphrases with the highest weithgts from `phrase_dict`
    phrase_list = ## YOUR CODE

    return phrase_list

In [None]:
print(merge_predictions(extracted_phrases))

6. Again, evaluate the result and compare it to the other two approaches in terms of extraction quality and extraction speed.

YOUR ANSWER HERE

## 2. Named Entity Recognition (4 + 5 + 5 = 14 Points)

Slightly different, but still operating on the sequence level, is the task of Named Entity Recognition (NER).
In this task, we will evaluate the NER capabilities of some more open-source libraries.
Particularly, we will also evaluate the utility of NER as a stand-in for Keyphrase Extraction.

### Sub Task 1: Using spaCy NER (4 Points)

So far, when using spaCy models, we have primarily disabled the NER component, as it requires significant extra compute.
In this task, we will explicitly leave the component enabled, to see what results it can produce on the text from the previous question.

In [2]:
import spacy 

# Load the en_core_web_sm model, but with NER enabled.
nlp = spacy.load("en_core_web_sm")

  from .autonotebook import tqdm as notebook_tqdm


1. Re-load the text for the "Hounds of Baskervilles" novel, and run it with the spacy model.

In [3]:
# Re-use the function from the previous exercise.
text = load_txt_from_url()

doc = nlp(text)

2. Similar to the previous exercise, count the number of occurrences, however, this time for the extracted entities instead of phrases. Print the top 20 most frequently occurring entities.  
Make sure to lowercase the text again during your aggregation.

In [4]:
from collections import Counter
# Count the number of occurrences of particular entities
ents = Counter()

for ent in doc.ents:
    ent_lower = ent.text.lower()
    ent_label = ent.label_
    ents[f"{ent_label}:{ent_lower}"] += 1

# Print the top 20 most frequently occurring entities.
print(ents.most_common(20))

[('PERSON:henry', 119), ('PERSON:holmes', 115), ('PERSON:watson', 108), ('CARDINAL:one', 90), ('PERSON:mortimer', 74), ('PERSON:charles', 74), ('CARDINAL:two', 56), ('GPE:london', 49), ('ORDINAL:first', 41), ('ORG:stapleton', 31), ('PERSON:stapleton', 25), ('PERSON:barrymore', 25), ('PERSON:sherlock holmes', 22), ('PERSON:henry baskerville', 20), ('FAC:baskerville hall', 20), ('CARDINAL:half', 18), ('PERSON:coombe tracey', 17), ('ORG:i.', 15), ('GPE:devonshire', 14), ('PERSON:baskerville', 14)]


You might have noticed some unwanted results in the list, such as "night". Upon closer inspection, it turns out that the NER module further differentiates between different entity *categories*, such as PERSON (referencing, as expected, a physical person) or ORG (organizations, such as companies, NGOs, etc.), but also TIME (under which "night" falls). For reference, you can find the full list of supported NER labels by this particular model [here](https://spacy.io/models/en#en_core_web_sm-labels).

3. Refine the list of most common entities by printing out the top three occurring entities in the category `PERSON`, `ORG` and `GPE` (physical locations) instead.

In [5]:
def get_top_entities_by_class(doc: spacy.tokens.Doc, class_name: str, n: int = 3) -> list[tuple[str, int]]:
    """
    Returns the three most frequent entities (and their frequencies)
    of entity type `class_name` from `doc`.
    """
    # Extract phrase and frequency of a particular entity class
    counter = Counter()
    for ent in doc.ents:
        if (ent.label_ == class_name):
            ent_lower = ent.text.lower()
            ent_label = ent.label_
            counter[ent_lower] += 1
    # Return the top 3 entities and frequencies
    return counter.most_common(n)

# Print the results for "PERSON", "ORG" and "GPE"
print("Person:", get_top_entities_by_class(doc, "PERSON"))
print("ORG:", get_top_entities_by_class(doc, "ORG"))
print("GPE:", get_top_entities_by_class(doc, "GPE"))

Person: [('henry', 119), ('holmes', 115), ('watson', 108)]
ORG: [('stapleton', 31), ('i.', 15), ('times', 10)]
GPE: [('london', 49), ('devonshire', 14), ('england', 13)]


### Sub Task 2: Financial Bank Statements of Deutsche Bank (5 Points)

Instead of using the Sherlock Holmes Novels, we will now compare the functionality of spaCy and NLTK's NER modules on the financial statements of Deutsche Bank from 2021. For this, see the file available on Moodle.

1. Download it and convert the PDF document into text, by using the `pdftotext` command-line utility. In particular, run with the `-layout` option enabled.

In [36]:
%%bash
. ~/.bashrc
pdftotext -layout DB_annual_report.pdf DB_annual_report.txt
# If you have to execute this command through your shell, still paste the command you ran in here.

2. Given that the document is extremely long, split the inputs into chunks of 500.000 characters and process them separately.

In [6]:
import re
def load_long_text_in_chunks(fp: str, chunk_size: int = 500_000) -> list[str]:
    """
    Loads a text file (located at `fp`) and chunks it into chunks fo at most `chunk_size` characters.
    Note that the last chunk might be significantly shorter.
    """
    file = open(fp, 'r')
    text = file.read()
    #clean 
    cleaned = re.sub(r'\n', ' ', text)
    cleaned = re.sub(r'\x0c', '', cleaned)
    cleaned = re.sub(r'\t\x07', '', cleaned)
    cleaned = re.sub(r'\xad', '', cleaned)
    cleaned = re.sub(r'\x07', '', cleaned)

    # Split the text into segments of at most `chunk_size` characters
    chunks = [cleaned[i:i + chunk_size] for i in range(0, len(cleaned), chunk_size)]
    return chunks

In [7]:
db_chunks = load_long_text_in_chunks("./DB_annual_report.txt")

3. Print the top 5 occurring `ORG` entities that are not referencing Deutsche Bank itself, both by using spaCy's NER module and the NER function of NLTK.  
To exclude "Deutsche Bank" entities, filter out all entities that contain both "deutsche" and "bank" in their name, irrespective of the actual upper-/lowercasing.
**Hint:** For more information on how to run NER with NLTK, see [here](https://nanonets.com/blog/named-entity-recognition-with-nltk-and-spacy/#performing-ner-with-nltk-and-spacy)

In [8]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

org_entities_spacy = []
org_entities_nltk = []

def is_deutsche_bank_entity(name: str) -> bool:
    """
    Returns True if the entity name contains "deutsche" and "bank" in some upper-/lowercased version.
    This means both "Deutsche Bank" and "deutsche bank's" should be recognized.
    """
    name_lower = name.lower()
    if ("deutsche" in name_lower or "bank" in name_lower):
        return True
    else:
        return False

for chunk in db_chunks:
    # Process the chunk with spaCy
    doc = nlp(chunk)
    chunk_ents_spacy = []
    chunk_ents_nltk = []
    for ent in doc.ents:
        ent_label = ent.label_
        ent_text = ent.text
        is_deutsche_bank = is_deutsche_bank_entity(ent_text)
        if ent_label == "ORG" and is_deutsche_bank is False:
            chunk_ents_spacy.append(ent_text)

    # And also with NLTK
    for sent in nltk.sent_tokenize(chunk):
        for chunk_ in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            is_deutsche_bank = is_deutsche_bank_entity(' '.join(c[0] for c in chunk_))
            if hasattr(chunk_, 'label') and chunk_.label() == "ORGANIZATION" and is_deutsche_bank is False:
                chunk_ents_nltk.append(' '.join(c[0] for c in chunk_))
    

    # Add all the extracted "ORG" entities to `org_entities`, except those referencing Deutsche Bank
    org_entities_spacy.extend(chunk_ents_spacy)
    org_entities_nltk.extend(chunk_ents_nltk)
    

[nltk_data] Downloading package punkt to /home/bjoern/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/bjoern/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/bjoern/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/bjoern/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [9]:
# Return the top 5 entities by frequency

ents_spacy = Counter()
ents_nltk = Counter()
for ent in org_entities_spacy:
    ents_spacy[ent] += 1

for ent in org_entities_nltk:
    ents_nltk[ent] += 1

entity_counts_spacy = ents_spacy.most_common(5)
entity_counts_nltk = ents_nltk.most_common(5)

print(entity_counts_spacy)
print(entity_counts_nltk)

[('Group', 943), ('the Management Board', 295), ('the Supervisory Board', 239), ('Financial Institution', 223), ('Management Board', 129)]
[('Group', 745), ('Management Board', 438), ('Supervisory Board', 342), ('IFRS', 175), ('Total', 127)]


4. Compare and analyze the different results between the two methods.

We can see, that the first solution "Group" is in both spacy and nltk the same except the frequency. The next two only differ because of a "the" infront of the ORG name on the spacy side. So it seems, that spacy differs between ORGs which have a "the" or no "the" infront of their names, nltk group these to one ORG together. So we can see, that if we sum the frequency of "the Management Board" and "Management Board" by spacy, the result is almost the frequency of "Management Board" by nltk.

### Sub Task 3: Co-Occurrence Counts of Entities (5 Points)

As is becoming apparent, the *raw* occurrence counts of entities might not be meaningful on its own, especially if we are interested in less frequently occurring entities.

Instead, we will "investigate" the entities that are most frequently mentioned in association with "Deutsche Bank". For this purpose, we will look at the textual co-occurrences of two named entities. The basic idea is that entities that frequently appear together are likely related.

1. For each text chunk, extract all mentions of the entity `('Deutsche Bank', 'ORG')`, as well as all `PERSON` entity mentions in the text using spaCy. Store the respective entity name and the text position. Unlike the previous question, you do *not* need to check for different spelllings of the "Deutsche Bank" entity.  
**Hint:** Entities are represented as a [`Span`](https://spacy.io/api/span) element in spaCy, which has access to text position.


In [21]:
db_entity_mentions_with_start_position = []
per_entity_mentions_with_start_position = []

for chunk in db_chunks:
    db_chunk_mentions = []
    per_chunk_mentions = []
    # Process the doc with spaCy
    doc = nlp(chunk)
    
    # Extract only entity mentions of "Deutsche Bank" (ORG) or any PERSON mention.
    # Append each mention, including the text and its starting position, to `chunk_mentions`
    for ent in doc.ents:
        if ent.label_ == "ORG" and ent.text == "Deutsche Bank":
            ent_tuple = (ent.text, ent.start)
            db_chunk_mentions.append(ent_tuple)
        elif ent.label_ == "PERSON":
            ent_tuple = (ent.text, ent.start)
            per_chunk_mentions.append(ent_tuple)
            
        else:
            pass
    
    # Append the chunk's entities to the aggregate list
    db_entity_mentions_with_start_position.extend(db_chunk_mentions)
    per_entity_mentions_with_start_position.extend(per_chunk_mentions)    

    print(len(db_entity_mentions_with_start_position))
    print(len(per_entity_mentions_with_start_position))




118
217
175
314
240
361
489
427
646
806
731
1054


In [35]:
from operator import itemgetter

# Sort the mentions by their starting position
db_entity_mentions_with_start_position.sort(key=itemgetter(1))
per_entity_mentions_with_start_position.sort(key=itemgetter(1))



In [37]:
per_entity_mentions_with_start_position[:2]

[('Karl von Rohr                                                                                                          ',
  133),
 ('Bernd Leukert                                                                                                          ',
  152)]

2. Within each chunk, for each mention of `Deutsche Bank`, search for `PERSON` entities that have a starting position within 200 characters before/after the starting position of the `Deutsche Bank` mention. Count for each `PERSON` entity how many times it occurs nearby a mention of `Deutsche Bank`.  
Aggregate the co-occurrences across all chunks. 

In [43]:
from collections import defaultdict

co_occurrences = defaultdict(int)

for mentions in db_entity_mentions_with_start_position:
    # Iterate through the entities. If the entity is a "Deutsche Bank" mention, extract nearby
    mentions_start = mentions[1]
    # PERSON references (less than +/- 200 character difference in the starting position)
    for per in per_entity_mentions_with_start_position:
        if per[1] > mentions_start - 200 and per[1] < mentions_start + 200:
            # Append the PERSON entity to `co_occurrences`
            co_occurrences[per[0]] += 1
        else:
            pass
    



3. Return the number of co-occurrences and the name of the top 5 frequently occurring `PERSON` entities.


In [86]:
sorted(co_occurrences.items(), key=lambda fu: fu[1], reverse=True)[:5]

[('COVID-19', 224),
 ('MREL', 204),
 ('Paul Achleitner', 119),
 ('Main', 119),
 ('Norbert Winkeljohann', 72)]

4. Look back at the results of your previous task. Are the `PERSON` entities returned by your co-occurrence method the same ones that appear most frequently by raw counts?

In [87]:
# Count all mentions at positiion 0 of the tuples
Counter([x[0] for x in per_entity_mentions_with_start_position]).most_common(5)

[('MREL', 56),
 ('COVID-19', 49),
 ('Paul Achleitner', 30),
 ('Norbert Winkeljohann', 20),
 ('Dagmar Valcárcel', 19)]

They are mostly the same, but in different order and they did co-occur more often the they occured in the text itself. Meaning the word "Deutsche Bank" was placed multiple times on a range of 200 characters of those entities

## 3. Neural Models with Huggingface (3 + 5 + 2 = 10 Points)

For state-of-the-art performance, most text-related tasks nowadays use some variation of the Transformer architecture. The particular advantage is especiall the readily available weights for models that have been pre-trained on large general-purpose datasets, which reduces the amount of domain-specific labeled training data.

In this task, we will explore the [Huggingface](https://hf.co/) ecosystem to see in which way Transformer models can be used.
One of the central aspects of the Huggingface platform is the so-called [Model Hub](https://huggingface.co/models), where you can find many different models uploaded by community members for a variety of tasks.

Because the neural models are generally very expensive to run, this exercise will be limited to  less data than in previous questions.

### Sub Task 1: Loading Transformer Models (3 Points)

1. Install the `transformers` library and load the model `cardiffnlp/twitter-roberta-base-sentiment-latest` to classify a sequence.
2. Report the result of the prediction on the test sequence.

In [53]:
import csv
import urllib.request
import numpy as np

from transformers import AutoTokenizer, AutoModelForSequenceClassification


tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")

model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")

input_text = "Das ist ein Test."

tok_input_text = tokenizer(input_text, return_tensors="pt")
print(tok_input_text)

output = model(**tok_input_text)
print(output)

scores = output[0][0].detach().numpy()
print(scores)

def softmax(x):
    return(np.exp(x)/np.exp(x).sum())

scores = softmax(scores)

task='sentiment'
model_name = "cardiffnlp/twitter-roberta-base-sentiment"

labels=[]
mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html, delimiter='\t')
labels = [row[1] for row in csvreader if len(row) > 1]

ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = labels[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")

{'input_ids': tensor([[   0,  495,  281,   16,   90,  364,  179, 4500,    4,    2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
SequenceClassifierOutput(loss=None, logits=tensor([[-0.0747,  1.2079, -1.0439]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
[-0.07466091  1.2078773  -1.0439109 ]
1) neutral 0.7233
2) negative 0.2006
3) positive 0.0761


### Sub Task 2: Using Pipelines (5 Points)

The most succinct way of using a Transformer model is the [`transformers.pipeline`](https://huggingface.co/docs/transformers/pipeline_tutorial). You can check out the linked tutorial for more information on the topic, but essentially, `pipeline` provides a light-weight wrapper around a number of different popular NLP tasks

1. Instead of manually defining a pipeline, now load a model through a `"text-classification"` pipeline. Look up the neural model that is loaded by default, and post the link to its [model card](https://huggingface.co/docs/hub/model-cards) below.


In [54]:
from transformers import pipeline

classifier = pipeline("text-classification")

classifier(input_text)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.8571606874465942}]

The default model is distilbert-base-uncased-finetuned-sst-2-english [Model Card](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)

In [55]:
# Get model card metadata
print(classifier.modelcard)
# Disappointingly, the model card function is not used here

None


2. Now, instead, load a pipeline for `"text-classification"`, but with a custom model and tokenizer. Use the Model Hub platform to find the most popular model for the German language (by number of downloads) and manually specify the usage of another model (and tokenizer) to the pipeline. Re-run the previous example, and report the prediction result.


In [56]:
# XLM Roberta base language model detection 970,818 downloads
tokenizer = AutoTokenizer.from_pretrained("papluca/xlm-roberta-base-language-detection")

model = AutoModelForSequenceClassification.from_pretrained("papluca/xlm-roberta-base-language-detection")

# Instantiate the pipeline with custom components
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Output the prediction by your pipe on the test sample.
print( pipe("Das ist ein Testtext um den Text zu testen.") )

[{'label': 'de', 'score': 0.9930722117424011}]


3. Keeping in line with the previous exercises, let us now try and actually predict something with the model. Re-load a pipeline, this time for Named Entity Recognition, using the default model.

In [57]:
pipe = pipeline("ner")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


4. Run the pipeline with the text from the Deutsche Bank report from Question 2 and output the results.

In [59]:
hf_entities = []
for chunks in db_chunks:

    chunk_hf_entites = pipe(chunks)
    hf_entities.extend(chunk_hf_entites)

print(len(hf_entities))

179


In [65]:
# Count the number of unique entities
unique_hf_entities = set([x["word"] for x in hf_entities])

In [68]:
print(len(unique_hf_entities))

102


In [69]:
unique_hf_entities

{'##A',
 '##MC',
 '##MF',
 '##R',
 '##TC',
 '##a',
 '##ab',
 '##cker',
 '##d',
 '##elli',
 '##en',
 '##era',
 '##hl',
 '##hn',
 '##io',
 '##ke',
 '##ller',
 '##lt',
 '##m',
 '##mann',
 '##oh',
 '##r',
 '##riz',
 '##rt',
 '##t',
 '##uke',
 '##ulatory',
 '##vie',
 '##wing',
 '##yl',
 '##ß',
 '##ü',
 'A',
 'Act',
 'Alexander',
 'Article',
 'As',
 'Bank',
 'Bern',
 'Board',
 'CC',
 'CF',
 'Camp',
 'Capital',
 'Christian',
 'Commission',
 'Credit',
 'Deutsche',
 'Dodd',
 'Dr',
 'F',
 'Forum',
 'Frank',
 'Garth',
 'Group',
 'Hammond',
 'Hermann',
 'I',
 'James',
 'John',
 'Josef',
 'Karl',
 'Kimberly',
 'Ku',
 'L',
 'Lambert',
 'Le',
 'Lewis',
 'M',
 'Management',
 'Marcus',
 'Math',
 'Mo',
 'Model',
 'More',
 'Nicolas',
 'O',
 'R',
 'RC',
 'Rebecca',
 'Reg',
 'Release',
 'Riley',
 'Risk',
 'Ritchie',
 'S',
 'Sc',
 'Se',
 'Short',
 'Simon',
 'St',
 'States',
 'Stefan',
 'Stein',
 'Stuart',
 'Unit',
 'United',
 'Werner',
 'World',
 'and',
 'von',
 'zur'}

In [71]:
# Count the number of unique entities
Counter([x["word"] for x in hf_entities])

Counter({'Deutsche': 12,
         'Bank': 15,
         'Reg': 1,
         '##ulatory': 1,
         'Credit': 1,
         'Risk': 3,
         'Model': 1,
         'Forum': 1,
         'RC': 2,
         '##R': 2,
         '##MF': 1,
         '##MC': 1,
         'and': 1,
         'Capital': 1,
         'Management': 7,
         'As': 1,
         'I': 2,
         '##A': 1,
         'Article': 1,
         'L': 5,
         'O': 6,
         '##TC': 7,
         'CC': 1,
         'Dodd': 2,
         'Frank': 4,
         'Act': 2,
         'CF': 1,
         'United': 1,
         'States': 1,
         'Release': 1,
         'Unit': 1,
         'Commission': 3,
         'Group': 11,
         'World': 1,
         'Board': 4,
         'Christian': 2,
         'Se': 1,
         '##wing': 1,
         'Karl': 1,
         'von': 3,
         'R': 1,
         '##oh': 1,
         '##r': 1,
         'F': 1,
         '##ab': 1,
         '##riz': 1,
         '##io': 1,
         'Camp': 1,
         '##elli': 

5. Look at the results. Something looks strange here; why is it not working properly? Elaborate your answer.

There are three obivious problem with the results:

1. Tokens which are only one part of word (###) are also returned as entity, which is not really helpful
2. Entities which consist of multiple words are split up (deutsche Bank, names, ...)
3. There are only a fraction of the entites that were found by spacy
--> This is also meant for the occurences of a unique entity like "Deutsche", which has only been found 12 times instead of around 730 times


In [88]:
len(db_entity_mentions_with_start_position)

731

### Sub Task 3: Using Datasets through Huggingface (2 Points)

Instead of using the `transformers` library for model training and inference, it is also possible to use other libraries by Huggingface without neural models.
In particular, the `datasets` library provides a centralized and streamlined way of accessing a variety of different datasets.

1. Using the `datasets` library, load the `imdb` dataset.

In [77]:
from datasets import load_dataset

imdb = load_dataset("imdb")

Downloading builder script: 100%|██████████| 4.31k/4.31k [00:00<00:00, 2.43MB/s]
Downloading metadata: 100%|██████████| 2.17k/2.17k [00:00<00:00, 900kB/s]
Downloading readme: 100%|██████████| 7.59k/7.59k [00:00<00:00, 4.43MB/s]


Downloading and preparing dataset imdb/plain_text to /home/bjoern/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...


Downloading data: 100%|██████████| 84.1M/84.1M [00:23<00:00, 3.59MB/s]
                                                                                              

Dataset imdb downloaded and prepared to /home/bjoern/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.


100%|██████████| 3/3 [00:00<00:00, 327.67it/s]


2. Report the mean length of `text` column for the training, validation and test split, respectively.


In [83]:
mean_length_train = np.mean([len(x) for x in imdb["train"]["text"]])

# There is no validation set only one called "unsupervised" where the labels are -1 (unkown)
mean_length_val = np.mean([len(x) for x in imdb["unsupervised"]["text"]])
mean_length_test = np.mean([len(x) for x in imdb["test"]["text"]])

print(f"Mean length of train set: {mean_length_train}")
print(f"Mean length of unsupervised set: {mean_length_val}")
print(f"Mean length of test set: {mean_length_test}")

Mean length of train set: 1325.06964
Mean length of unsupervised set: 1329.9025
Mean length of test set: 1293.7924
