# Content Outline

Grüezi mitenand! ☕

In this notebook we will look at two NLP use cases: Named Entity Recognition (NER) and Text Summarization (TS).

We will:
1.   Load & explore [Hugging Face datasets](https://huggingface.co/docs/datasets/).
2.   NER: Load an out-of-the-box NER tagger.
3.   NER: Evaluate its performance on the dataset.
4.   TS: Measure extractiveness/abstractiveness.
5.   TS: Use an out-of-the-box extractive summarization model (TextRank).
6.   TS: Evaluate it using ROUGE and show some of its limitations.

In [None]:
# Uncomment the lines below to install the packages that we will need.

# Data
#!pip install datasets
#!pip install pandarallel
#
## NER: spaCy
#!pip install -U pip setuptools wheel
#!pip install -U spacy
#!python -m spacy download en_core_web_sm
#
## Summarization
#!pip install summa
#
##Evaluation
#!pip install sklearn_crfsuite
#!pip install seqeval
#!pip install rouge
#
## TS: TextRank
#!pip install pytextrank

In [None]:
# Imports

# Standard
from typing import List, Dict
from collections import namedtuple

# Data & Data Preprocessing
from datasets import load_dataset, get_dataset_split_names
import pandas as pd
import itertools

# Evaluation
from sklearn import metrics
from seqeval.metrics import accuracy_score
from seqeval.metrics import classification_report
from seqeval.metrics import f1_score
from seqeval.scheme import IOB2
from rouge import Rouge

# Models
from summa import summarizer
import spacy
print(f'using spaCy version {spacy.__version__}')

using spaCy version 3.2.1


# Named Entity Recognition (NER)

## Loading the Data

In [None]:
# For NER we will use the CoNLL++ (conllpp) dataset. You can read more about it here: https://huggingface.co/datasets/conllpp
# You can also get all splits: e.g., load_dataset('conllpp'). For this tutorial we need only the test split.

test = load_dataset('conllpp', split = 'test')

Downloading and preparing dataset conllpp/conllpp (download: 4.63 MiB, generated: 9.78 MiB, post-processed: Unknown size, total: 14.41 MiB) to /root/.cache/huggingface/datasets/conllpp/conllpp/1.0.0/04f15f257dff3fe0fb36e049b73d51ecdf382698682f5e590b7fb13898206ba2...


Dataset conllpp downloaded and prepared to /root/.cache/huggingface/datasets/conllpp/conllpp/1.0.0/04f15f257dff3fe0fb36e049b73d51ecdf382698682f5e590b7fb13898206ba2. Subsequent calls will reuse this data.


In [None]:
# Checking what splits are available.
get_dataset_split_names('conllpp')

['train', 'validation', 'test']

In [None]:
test.shape

(3453, 5)

In [None]:
test

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 3453
})

We can also load the dataset directly into **pandas** as a DataFrame.

In [None]:
df = pd.DataFrame(test, columns=["tokens", 'ner_tags'])
df.head(9)

Unnamed: 0,tokens,ner_tags
0,"[SOCCER, -, JAPAN, GET, LUCKY, WIN, ,, CHINA, ...","[0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 0, 0]"
1,"[Nadim, Ladki]","[1, 2]"
2,"[AL-AIN, ,, United, Arab, Emirates, 1996-12-06]","[5, 0, 5, 6, 6, 0]"
3,"[Japan, began, the, defence, of, their, Asian,...","[5, 0, 0, 0, 0, 0, 7, 8, 0, 0, 0, 0, 0, 0, 0, ..."
4,"[But, China, saw, their, luck, desert, them, i...","[0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
5,"[China, controlled, most, of, the, match, and,...","[5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
6,"[Oleg, Shatskiku, made, sure, of, the, win, in...","[1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
7,"[The, former, Soviet, republic, was, playing, ...","[0, 0, 7, 0, 0, 0, 0, 0, 7, 8, 0, 0, 0, 0, 0, ..."
8,"[Despite, winning, the, Asian, Games, title, t...","[0, 0, 0, 7, 8, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, ..."


## Tagging Schemes
IOB SCHEME:

    I – Token is inside an entity.
    O – Token is outside an entity.
    B – Token is the beginning of an entity.

BILUO SCHEME: 

    B – Token is the beginning of a multi-token entity.
    I – Token is inside a multi-token entity.
    L – Token is the last token of a multi-token entity.
    U – Token is a single-token unit entity.
    O – Token is outside an entity.

For the purposes of this tutorial we will use the IOB scheme. You can find and read more information [here](https://spacy.io/usage/linguistic-features#accessing-ner).
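
To make the relationship between the two schemes concrete, here is a minimal sketch that converts IOB tags into BILUO tags. The function name `iob_to_biluo` is our own (it is not part of spaCy or the dataset), and the sketch assumes that every entity starts with a `B-` tag, as in our dataset.

```python
from typing import List

def iob_to_biluo(tags: List[str]) -> List[str]:
    """Convert IOB tags (every entity starts with B-) to BILUO tags."""
    result = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            result.append('O')
            continue
        prefix, label = tag.split('-', 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else 'O'
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == f'I-{label}'
        if prefix == 'B':
            result.append(f'B-{label}' if continues else f'U-{label}')
        else:  # prefix == 'I'
            result.append(f'I-{label}' if continues else f'L-{label}')
    return result

print(iob_to_biluo(['B-PER', 'I-PER', 'O', 'B-LOC', 'B-LOC', 'I-LOC', 'I-LOC', 'O']))
```

The only extra information BILUO carries is where an entity ends, which is why the conversion only needs to look one tag ahead.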

In [None]:
print(df.iloc[3].tokens)

['Japan', 'began', 'the', 'defence', 'of', 'their', 'Asian', 'Cup', 'title', 'with', 'a', 'lucky', '2-1', 'win', 'against', 'Syria', 'in', 'a', 'Group', 'C', 'championship', 'match', 'on', 'Friday', '.']


In [None]:
' '.join(df.iloc[3].tokens)

'Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday .'

In [None]:
df.iloc[3].ner_tags

[5, 0, 0, 0, 0, 0, 7, 8, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0]

The integer NER tags map to the following labels:

In [None]:
ner_tags_map = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}

The entities that we are interested in are **ORG**, **LOC** and **PER**!
Therefore, for the evaluation we need to treat the MISC tags (7 and 8) as `O`.

To do this, we create a third column that contains the gold standard labels in the IOB scheme for PER, LOC and ORG only.

In [None]:
def reverse_map(dictionary: Dict)->Dict:
  
  return {v:k for k,v in dictionary.items()}

In [None]:
def ner_tags_to_gold_standard(ner_tags: List[int], ner_tags_map=ner_tags_map)->List[str]:
  '''
  Args:
      ner_tags: A list with the same length as the tokens list, with integer
      labels 0-8, where each number encodes an entity type and whether the
      token is at the beginning of or inside the entity (0 means outside).

      ner_tags_map: A mapping between IOB and the numbers defined in the
      dataset description.

  Returns:
      A list of IOB tags, one per original token, with the MISC entity
      mapped to 'O'.
    
  Example Usage:
    ner_tags_to_gold_standard(
      [7, 0, 0, 0, 1, 2, 2, 0, 0, 0, 3, 4], 
      {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8})

    return: ['O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'B-ORG', 'I-ORG']
  '''

  reversed_ner_tags_map = reverse_map(ner_tags_map)
  
  result = []
  for element in ner_tags:
    if (element == 7) or (element == 8):
      result.append('O')
    else:
      result.append(reversed_ner_tags_map.get(element))

  return result

  # # If you want to write the function in one line, you can use the code below.
  # return [reversed_ner_tags_map.get(element) if ((element !=7) and (element !=8)) else 'O' for element in ner_tags]

Below is another example using directly our dataset.

In [None]:
print(df.iloc[3].tokens)
print(df.iloc[3].ner_tags)
print(ner_tags_to_gold_standard(df.iloc[3].ner_tags))

['Japan', 'began', 'the', 'defence', 'of', 'their', 'Asian', 'Cup', 'title', 'with', 'a', 'lucky', '2-1', 'win', 'against', 'Syria', 'in', 'a', 'Group', 'C', 'championship', 'match', 'on', 'Friday', '.']
[5, 0, 0, 0, 0, 0, 7, 8, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0]
['B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


In [None]:
df['gold_standard'] = df.ner_tags.apply(ner_tags_to_gold_standard)
df.head(9)

Unnamed: 0,tokens,ner_tags,gold_standard
0,"[SOCCER, -, JAPAN, GET, LUCKY, WIN, ,, CHINA, ...","[0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 0, 0]","[O, O, B-LOC, O, O, O, O, B-LOC, O, O, O, O]"
1,"[Nadim, Ladki]","[1, 2]","[B-PER, I-PER]"
2,"[AL-AIN, ,, United, Arab, Emirates, 1996-12-06]","[5, 0, 5, 6, 6, 0]","[B-LOC, O, B-LOC, I-LOC, I-LOC, O]"
3,"[Japan, began, the, defence, of, their, Asian,...","[5, 0, 0, 0, 0, 0, 7, 8, 0, 0, 0, 0, 0, 0, 0, ...","[B-LOC, O, O, O, O, O, O, O, O, O, O, O, O, O,..."
4,"[But, China, saw, their, luck, desert, them, i...","[0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[O, B-LOC, O, O, O, O, O, O, O, O, O, O, O, O,..."
5,"[China, controlled, most, of, the, match, and,...","[5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[B-LOC, O, O, O, O, O, O, O, O, O, O, O, O, O,..."
6,"[Oleg, Shatskiku, made, sure, of, the, win, in...","[1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[B-PER, I-PER, O, O, O, O, O, O, O, O, O, O, O..."
7,"[The, former, Soviet, republic, was, playing, ...","[0, 0, 7, 0, 0, 0, 0, 0, 7, 8, 0, 0, 0, 0, 0, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
8,"[Despite, winning, the, Asian, Games, title, t...","[0, 0, 0, 7, 8, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, ...","[O, O, O, O, O, O, O, O, O, O, B-LOC, O, O, O,..."


## Out of the Box NER Model
Often in Named Entity Recognition we will find out-of-the-box tools that can handle common entities such as organizations, locations, people, monetary values, dates, addresses, etc. They usually make a good baseline for our experiments.

One such tool is the [spaCy NER Tagger](https://spacy.io/usage/linguistic-features#accessing-ner). There is also an interactive visualizer that you can play with [here](https://explosion.ai/demos/displacy-ent).

#### Hands-On Exercise 1 (5 min)
Go to a news website or any other website with text and feed some of it to the visualizer linked above. Get familiar with which entities the out-of-the-box tool offers, and let's discuss anything interesting or unexpected that you find.

In [None]:
# Add your code here. 
text = """
Federal Reserve Board (FRB) Chair Jerome Powell has warned the U.S. Congress that the Omicron variant of COVID-19 could threaten the U.S. economic recovery. 
In prepared remarks that he delivered before the U.S. Senate Committee on Banking, Housing, and Urban Affairs on Nov. 30, 2021, Powell said that the Omicron variant could threaten the U.S. labor market and cloud the central bank's inflation forecast.
In his prepared testimony, Powell noted: "The recent rise in COVID-19 cases and the emergence of the Omicron variant pose downside risks to employment and economic activity and increased uncertainty for inflation. 
Greater concerns about the virus could reduce people's willingness to work in person, which would slow progress in the labor market and intensify supply chain disruptions."1
"""

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

# document level
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)
print('The length of the text is:\t',len(text))
print('The number of entities is:\t',len(ents))

# Hint: You can reuse code from previous lectures ;)

[('Federal Reserve Board', 1, 22, 'ORG'), ('FRB', 24, 27, 'ORG'), ('Jerome Powell', 35, 48, 'PERSON'), ('the U.S. Congress', 60, 77, 'ORG'), ('Omicron', 87, 94, 'ORG'), ('U.S.', 134, 138, 'GPE'), ('the U.S. Senate Committee', 204, 229, 'ORG'), ('Urban Affairs', 255, 268, 'ORG'), ('Nov. 30, 2021', 272, 285, 'DATE'), ('Powell', 287, 293, 'PERSON'), ('Omicron', 308, 315, 'ORG'), ('U.S.', 343, 347, 'GPE'), ('Powell', 437, 443, 'PERSON'), ('Omicron', 511, 518, 'ORG')]
The length of the text is:	 799
The number of entities is:	 14


#### spaCy NER Tagger in Code

In [None]:
import spacy
from spacy import displacy

text = 'Nina Hristozova teaches at Hochschule Luzern and after the lectures end she is traveling to Bulgaria.'

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
doc

Nina Hristozova teaches at Hochschule Luzern and after the lectures end she is traveling to Bulgaria.

Looking at the entities below, we see that my name is not extracted correctly: only `Hristozova` is picked up, and it is labelled GPE instead of PERSON. This makes us wonder how well the model deals with Slavic names.

In [None]:
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

[('Hristozova', 5, 15, 'GPE'), ('Hochschule Luzern', 27, 44, 'ORG'), ('Bulgaria', 92, 100, 'GPE')]


Now, the tricky part is to map the positions of the entities found by spaCy exactly onto the tokens in **df** (our test dataset). We are aiming to create a new column **spacy_oob_pred** in the IOB scheme format, and we have to make sure that the length of each list in **gold_standard** equals the length of the corresponding list in **spacy_oob_pred**!

One way to do that would be to turn the entities above into lists of strings, e.g. 

```
['Hochschule', 'Luzern']
```

In [None]:
Entity = namedtuple('Entity', 'entity_tokens label indexes')

def get_entities(text:str)->List[Entity]:
  '''Converts the output we get from spaCy into a format that we can easily use.
  Args:
    text: The source text that we want to extract entities from.
  
  Returns:
    A list of entities, where Entity is a namedtuple with attributes entity_tokens, label and indexes.
  
  Example Usage:
    get_entities('Nina Hristozova teaches at Hochschule Luzern and after the lectures end she is traveling to Bulgaria.')
    return: [Entity(entity_tokens=['Hristozova'], label='GPE', indexes=None), Entity(entity_tokens=['Hochschule', 'Luzern'], label='ORG', indexes=None), Entity(entity_tokens=['Bulgaria'], label='GPE', indexes=None)]
  '''

  # Note: spaCy's English models label people as 'PERSON' (see the output of
  # Exercise 1 above), so the 'PER' entry below never matches and person
  # entities are dropped by this filter.
  return [Entity(e.text.split(' '), e.label_, None) for e in nlp(text).ents if e.label_ in ['PER', 'LOC', 'GPE', 'ORG']]

In [None]:
get_entities(' '.join(df.iloc[3].tokens))

[Entity(entity_tokens=['Japan'], label='GPE', indexes=None),
 Entity(entity_tokens=['Syria'], label='GPE', indexes=None),
 Entity(entity_tokens=['Group', 'C'], label='ORG', indexes=None)]

In [None]:
def get_entity_indexes(tokenized_text: List[str], tokenized_entities: List[Entity])->List[Entity]:
  """
    Parameters:
       tokenized_text (List[str]): The tokenized source text that we want to match in.
       tokenized_entities (List[Entity]): The tokenized patterns/entities that we want to match against the tokenized source. 
    Example Usage:
        get_entity_indexes(
          ['Nina', 'Hristozova', 'teaches', 'at', 'Hochschule', 'Luzern', 'and', 'after', 'the', 'lectures', 'end', 'she', 'is', 'traveling', 'to', 'Bulgaria', '.'], 
          [Entity(entity_tokens=['Nina', 'Hristozova'], label='PER', indexes=None), Entity(entity_tokens=['Hochschule', 'Luzern'], label='ORG', indexes=None), Entity(entity_tokens=['Bulgaria'], label='GPE', indexes=None)])
    Returns:
        [Entity(entity_tokens=['Nina', 'Hristozova'], label='PER', indexes=[0, 1]), Entity(entity_tokens=['Hochschule', 'Luzern'], label='ORG', indexes=[4, 5]), Entity(entity_tokens=['Bulgaria'], label='GPE', indexes=[15])]
  """

  def _token_lists_equal(substr, text):
    """Pattern match. 

    :param substr: the entity, which is a substring of the original text
    :param text: the original text

    :return: True if the entity pattern is found in the text window, False otherwise
    """

    return len(substr) == len(text) and all(a == b for a, b in zip(substr, text))
  
  entity_indexes = []
  for entity in tokenized_entities:
    for start_index in range(len(tokenized_text) - len(entity.entity_tokens) + 1):
      
      if _token_lists_equal(entity.entity_tokens, tokenized_text[start_index: start_index + len(entity.entity_tokens)]):
        entity = entity._replace(indexes=list(range(start_index, start_index + len(entity.entity_tokens))))
        entity_indexes.append(entity)

  return entity_indexes


In [None]:
spacy_pred_example = get_entity_indexes(df.iloc[3].tokens, get_entities(' '.join(df.iloc[3].tokens)))
spacy_pred_example

[Entity(entity_tokens=['Japan'], label='GPE', indexes=[0]),
 Entity(entity_tokens=['Syria'], label='GPE', indexes=[15]),
 Entity(entity_tokens=['Group', 'C'], label='ORG', indexes=[18, 19])]

In [None]:
def get_IOB_scheme(example, tokens):
  result = ['O']*len(tokens)

  for e in example:
    # There are two labels returned from spacy that represent LOC -- GPE and LOC, 
    # therefore we turn all GPE to LOC.
    if e.label == 'GPE':
      e = e._replace(label='LOC')

    
    if not e.indexes:
      continue
    
    result[e.indexes[0]] = f'B-{e.label}'
    if len(e.indexes) > 1:
      for i in e.indexes[1:]:
        result[i] = f'I-{e.label}'

  return result
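
The GPE-to-LOC conversion above can be generalized into a small lookup table. The sketch below is one possible way to do that. The mapping `SPACY_TO_CONLL` and the helper `normalize_labels` are our own names, and the mapping assumes the label names seen in the spaCy outputs earlier in this notebook (`PERSON`, `GPE`, `ORG`) plus `LOC`. Note that using it in the pipeline would change which entities reach the evaluation, so the scores reported below would no longer apply.

```python
from collections import namedtuple
from typing import List

Entity = namedtuple('Entity', 'entity_tokens label indexes')

# Assumed mapping from spaCy's English label names to the CoNLL label names.
SPACY_TO_CONLL = {'PERSON': 'PER', 'GPE': 'LOC', 'LOC': 'LOC', 'ORG': 'ORG'}

def normalize_labels(entities: List[Entity]) -> List[Entity]:
    """Rename known labels and drop entities whose label has no CoNLL counterpart."""
    return [
        e._replace(label=SPACY_TO_CONLL[e.label])
        for e in entities
        if e.label in SPACY_TO_CONLL
    ]

example = [
    Entity(['Jerome', 'Powell'], 'PERSON', None),
    Entity(['U.S.'], 'GPE', None),
    Entity(['Nov.', '30,', '2021'], 'DATE', None),  # no CoNLL counterpart, dropped
]
print(normalize_labels(example))
```

Keeping the mapping in one dictionary makes it easy to see at a glance which spaCy labels are kept and what they are renamed to.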

In [None]:
print(df.iloc[3].gold_standard)

['B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


In [None]:
print(get_IOB_scheme(spacy_pred_example, df.iloc[3].tokens))

['B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O']


Now that we have finally done all the preprocessing to get to the desired data format, we can apply it to all the spaCy predictions and then compare them to our gold standard!

In [None]:
spacy_oob_pred = []
for i,row in df.iterrows():

  example = get_entity_indexes(row.tokens, get_entities(' '.join(row.tokens)))
  spacy_oob_pred.append(get_IOB_scheme(example, row.tokens))

df['spacy_oob_pred'] = spacy_oob_pred

In [None]:
df.head()

Unnamed: 0,tokens,ner_tags,gold_standard,spacy_oob_pred
0,"[SOCCER, -, JAPAN, GET, LUCKY, WIN, ,, CHINA, ...","[0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 0, 0]","[O, O, B-LOC, O, O, O, O, B-LOC, O, O, O, O]","[O, O, O, O, O, O, O, B-LOC, O, O, O, O]"
1,"[Nadim, Ladki]","[1, 2]","[B-PER, I-PER]","[O, O]"
2,"[AL-AIN, ,, United, Arab, Emirates, 1996-12-06]","[5, 0, 5, 6, 6, 0]","[B-LOC, O, B-LOC, I-LOC, I-LOC, O]","[B-ORG, O, B-LOC, I-LOC, I-LOC, O]"
3,"[Japan, began, the, defence, of, their, Asian,...","[5, 0, 0, 0, 0, 0, 7, 8, 0, 0, 0, 0, 0, 0, 0, ...","[B-LOC, O, O, O, O, O, O, O, O, O, O, O, O, O,...","[B-LOC, O, O, O, O, O, O, O, O, O, O, O, O, O,..."
4,"[But, China, saw, their, luck, desert, them, i...","[0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[O, B-LOC, O, O, O, O, O, O, O, O, O, O, O, O,...","[O, B-LOC, O, O, O, O, O, O, O, O, O, O, O, O,..."


## Evaluation
Below we explore two libraries: sklearn and seqeval.

### sklearn
For sklearn we need to flatten the lists. That means that from

```
[['B-PER', 'I-PER', 'O'], ['O', 'O', 'O', 'B-LOC', 'O']]
```

we have to get:

```
['B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'B-LOC', 'O']
```

In [None]:
# Flattening the lists.
gold_flat = list(itertools.chain(*df.gold_standard.values))
pred_flat = list(itertools.chain(*df.spacy_oob_pred.values))

In [None]:
target_names = ['B-PER', 'I-PER', 'B-LOC', 'I-LOC', 'B-ORG', 'I-ORG']
print(metrics.classification_report(gold_flat, pred_flat, labels=target_names))



              precision    recall  f1-score   support

       B-PER       0.00      0.00      0.00      1618
       I-PER       0.00      0.00      0.00      1161
       B-LOC       0.78      0.73      0.76      1646
       I-LOC       0.63      0.61      0.62       259
       B-ORG       0.52      0.28      0.37      1715
       I-ORG       0.47      0.47      0.47       882

   micro avg       0.63      0.31      0.41      7281
   macro avg       0.40      0.35      0.37      7281
weighted avg       0.38      0.31      0.34      7281





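The above gives us the scores on a token level: we can see the performance for each label. Note that the scores for B-PER and I-PER are 0.00: spaCy labels people as `PERSON`, while `get_entities` only keeps the labels `PER`, `LOC`, `GPE` and `ORG`, so no person entity ever reaches the predictions. In NER we would ideally want to see the performance on the level of full entities. The next library that we are going to see in action is called seqeval; it merges the B and I tags into one entity.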
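
To illustrate the difference between token-level and entity-level scoring, here is a simplified sketch of entity-level evaluation for a single sentence. It is not seqeval's implementation: the helpers `extract_spans` and `entity_scores` are our own, stray `I-` tags without a preceding `B-` are simply ignored, and a prediction only counts as correct if both its label and its exact boundaries match the gold entity.

```python
from typing import List, Set, Tuple

def extract_spans(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Collect (label, start, end) spans from IOB tags; end is exclusive."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(tags + ['O']):  # the sentinel closes a trailing entity
        inside = tag.startswith('I-') and label == tag[2:]
        if start is not None and not inside:
            spans.add((label, start, i))
            start, label = None, None
        if tag.startswith('B-'):
            start, label = i, tag[2:]
    return spans

def entity_scores(gold: List[str], pred: List[str]) -> Tuple[float, float]:
    """Return (precision, recall) over exactly matching entity spans."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    return precision, recall

gold = ['B-ORG', 'I-ORG', 'I-ORG', 'O', 'B-LOC']
pred = ['B-ORG', 'I-ORG', 'O', 'O', 'B-LOC']
print(entity_scores(gold, pred))
```

In this example four of the five tokens are tagged correctly, but the ORG entity has the wrong boundary, so only one of the two predicted entities counts as correct at the entity level.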

In [None]:
gold = df.gold_standard
pred = df.spacy_oob_pred
print(classification_report(gold, pred, scheme=IOB2))



              precision    recall  f1-score   support

         LOC       0.78      0.72      0.75      1646
         ORG       0.45      0.25      0.32      1715
         PER       0.00      0.00      0.00      1618

   micro avg       0.65      0.32      0.43      4979
   macro avg       0.41      0.32      0.36      4979
weighted avg       0.41      0.32      0.36      4979



For a more detailed discussion of NER evaluation, see [this blog post](https://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/).

# Text Summarization

## The Data
For this exercise we will use the CNN/Daily Mail dataset. This dataset is widely used in the research community, and state-of-the-art models are benchmarked against one another on it.

In [None]:
summarization_data = load_dataset('cnn_dailymail', '3.0.0', split='test')

Downloading and preparing dataset cnn_dailymail/3.0.0 (download: 558.32 MiB, generated: 1.28 GiB, post-processed: Unknown size, total: 1.82 GiB) to /root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234...


Dataset cnn_dailymail downloaded and prepared to /root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234. Subsequent calls will reuse this data.


In [None]:
summarization_data

Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 11490
})

Below we can see the dataset in a pandas DataFrame. The article is our source text and the highlights are our gold standard summary, also known as our target. Next we will generate a predicted summary with an out-of-the-box extractive summarization model.

In [None]:
df = pd.DataFrame(summarization_data)
df.head(9)

Unnamed: 0,article,highlights,id
0,"(CNN)James Best, best known for his portrayal ...","James Best, who played the sheriff on ""The Duk...",00200e794fa41d3f7ce92cbf43e9fd4cd652bb09
1,(CNN)The attorney for a suburban New York card...,A lawyer for Dr. Anthony Moschetto says the ch...,0021fe8d65bd0d6d76d5fefba2ac02f0c48a43f4
2,(CNN)President Barack Obama took part in a rou...,"""No challenge poses more of a public threat th...",0041698b4463a633f912681b96f73648cb012e33
3,Moscow (CNN)A Russian TV channel aired Hillary...,"Presidential hopeful's video, featuring gay co...",0095ce085581314285f894af73a55ea9ef003412
4,(CNN)Marco Rubio is all in. The Republican se...,"Raul Reyes: In seeking Latino vote, Marco Rubi...",00a51d5454f2ef7dbf4c53471223a27fb9c20681
5,(CNN)SPOILER ALERT! It's not just women gettin...,"Critically acclaimed series ""Orphan Black"" ret...",00dddbedf41ec993a8b976f3cce2dd8ca2c7efed
6,(CNN)Emergency operators get lots of crazy cal...,The ramp agent fell asleep in the plane's carg...,00e5eb6c1af59233b661e59b0954a4f84a2f3904
7,"(CNN)Mullah Mohammed Omar is ""still the leader...","Mullah Omar, the reclusive founder of the Afgh...",012819ffa2547138101055add33deebe7beaa3d4
8,"(CNN)Wanted: film director, must be eager to s...",Michelle MacLaren is no longer set to direct t...,027936fedb5e785fe79e84cb6e55c9cc26042ad3


## Out of the Box Summarization Model
### TextRank
The model that we are going to use in this exercise is called TextRank, which is inspired by the [PageRank](https://towardsdatascience.com/pagerank-3c568a7d2332) algorithm. You can read more about TextRank [here](https://medium.com/data-science-in-your-pocket/text-summarization-using-textrank-in-nlp-4bce52c5b390). Below you can see an example of how to use TextRank via the summa library.
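
To give an intuition for what happens under the hood, here is a minimal sketch of the TextRank idea in plain Python. It is not summa's implementation: the sentence splitter is naive, the similarity is a simple word overlap normalized by sentence length, and all function names and parameter defaults are our own assumptions. Sentences are the nodes of a graph, similarities are the edge weights, and a PageRank-style iteration scores each sentence.

```python
import math
import re
from typing import List

def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]

def similarity(a: str, b: str) -> float:
    words_a = set(re.findall(r'\w+', a.lower()))
    words_b = set(re.findall(r'\w+', b.lower()))
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # skip one-word sentences; this also avoids a zero denominator
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))

def rank_sentences(text: str, damping: float = 0.85, iterations: int = 50) -> List[str]:
    """Return the sentences of `text`, ordered from most to least central."""
    sentences = split_sentences(text)
    n = len(sentences)
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives a share of the score of every sentence it is similar to.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in order]

example = ('The dog likes the red ball. '
           'The dog chased the red ball in the garden. '
           'A garden needs water.')
print(rank_sentences(example))
```

An extractive summary is then simply the top few sentences of this ranking, which is why the output of TextRank always consists of sentences copied from the source.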

In [None]:
text = '''
The sun was setting on a beautiful autumn day in October. Ricky was lying lazily in the garden when a quick brown fox jumped over him. 
Ricky is our beloved family dog who loves playing with the apples fallen on the ground. At 10:04 the phone rang. Iwas for my father - he works at Apple as a Data Scientist. 
he loves his team and they all love him. The only thing that I find annoying is that everyone calls him John, when his oriignal name is Ivan.
'''
summarizer.summarize(text)

'Ricky is our beloved family dog who loves playing with the apples fallen on the ground.'

### Hands-On Exercise 2 (10 min)
First, try TextRank on a few news article texts to get a feeling for how it summarizes. Then, use the TextRank library from above to add a new column to the dataframe with the TextRank summary predictions.

*Hint: Use the article as input to textrank.*

In [None]:
# Add your code here.
article = """
The Swiss National COVID-19 Science Task Force advises the public authorities in the current COVID-19 crisis. 
While the Task Force does not make decisions about measures or actions taken, the volunteer group of experts represents relevant scientific fields and ensures that impartial scientific advice is given.
The members of the Swiss National COVID-19 Science Task Force do not receive any remuneration or compensation for their work in the Task Force. 
Each member has disclosed any potential conflicts of interest. These documents are available on the Organisation page.
"""

summarizer.summarize(article)

'The members of the Swiss National COVID-19 Science Task Force do not receive any remuneration or compensation for their work in the Task Force.'

### Solution

In [None]:
# We can use the pandarallel library to speed up the execution of the lines below.
# This library parallelizes the execution.

from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)

df['textrank_pred'] = df.article.parallel_apply(summarizer.summarize)

INFO: Pandarallel will run on 2 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.

## Evaluation
For the evaluation we will use the ROUGE score. We learned about it in the lecture series. Do you remember what its limitations are?

In [None]:
from rouge import Rouge 

model_prediction = "The brown fox jumped over the dog."

gold_standard = "The quick brown fox jumpled over the lazy dog."

rouge = Rouge()
scores = rouge.get_scores(model_prediction, gold_standard)
scores

[{'rouge-1': {'f': 0.7499999950781251,
   'p': 0.8571428571428571,
   'r': 0.6666666666666666},
  'rouge-2': {'f': 0.2857142808163266, 'p': 0.3333333333333333, 'r': 0.25},
  'rouge-l': {'f': 0.7499999950781251,
   'p': 0.8571428571428571,
   'r': 0.6666666666666666}}]
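
To see where these numbers come from, here is a simplified sketch of ROUGE-N based on n-gram overlap. It is not the implementation of the `rouge` package (whose tokenization and counting may differ on other inputs), and the helper names are our own. On this sentence pair it yields the same ROUGE-1 and ROUGE-2 precision and recall as above.

```python
import re
from collections import Counter
from typing import Dict

def ngrams(text: str, n: int) -> Counter:
    """Count the n-grams of a lowercased, punctuation-free tokenization."""
    tokens = re.findall(r'\w+', text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(prediction: str, reference: str, n: int = 1) -> Dict[str, float]:
    pred, ref = ngrams(prediction, n), ngrams(reference, n)
    overlap = sum((pred & ref).values())  # n-grams that appear in both texts
    p = overlap / sum(pred.values()) if pred else 0.0  # share of the prediction that is matched
    r = overlap / sum(ref.values()) if ref else 0.0    # share of the reference that is covered
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {'p': p, 'r': r, 'f': f}

model_prediction = "The brown fox jumped over the dog."
gold_standard = "The quick brown fox jumpled over the lazy dog."

print(rouge_n(model_prediction, gold_standard, n=1))
print(rouge_n(model_prediction, gold_standard, n=2))
```

Six of the seven words in the prediction also occur in the reference (precision 6/7), and they cover six of its nine words (recall 6/9). Note that "jumped" and "jumpled" do not match: ROUGE only compares surface forms, which is one of its limitations.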

In [None]:
df.head()

Unnamed: 0,article,highlights,id,textrank_pred
0,"(CNN)James Best, best known for his portrayal ...","James Best, who played the sheriff on ""The Duk...",00200e794fa41d3f7ce92cbf43e9fd4cd652bb09,"(CNN)James Best, best known for his portrayal ..."
1,(CNN)The attorney for a suburban New York card...,A lawyer for Dr. Anthony Moschetto says the ch...,0021fe8d65bd0d6d76d5fefba2ac02f0c48a43f4,(CNN)The attorney for a suburban New York card...
2,(CNN)President Barack Obama took part in a rou...,"""No challenge poses more of a public threat th...",0041698b4463a633f912681b96f73648cb012e33,(CNN)President Barack Obama took part in a rou...
3,Moscow (CNN)A Russian TV channel aired Hillary...,"Presidential hopeful's video, featuring gay co...",0095ce085581314285f894af73a55ea9ef003412,Moscow (CNN)A Russian TV channel aired Hillary...
4,(CNN)Marco Rubio is all in. The Republican se...,"Raul Reyes: In seeking Latino vote, Marco Rubi...",00a51d5454f2ef7dbf4c53471223a27fb9c20681,Yet Rubio has been his own worst enemy on what...


In [None]:
df.iloc[0].article

'(CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV\'s "The Dukes of Hazzard," died Monday after a brief illness. He was 88. Best died in hospice in Hickory, North Carolina, of complications from pneumonia, said Steve Latshaw, a longtime friend and Hollywood colleague. Although he\'d been a busy actor for decades in theater and in Hollywood, Best didn\'t become famous until 1979, when "The Dukes of Hazzard\'s" cornpone charms began beaming into millions of American homes almost every Friday night. For seven seasons, Best\'s Rosco P. Coltrane chased the moonshine-running Duke boys back and forth across the back roads of fictitious Hazzard County, Georgia, although his "hot pursuit" usually ended with him crashing his patrol car. Although Rosco was slow-witted and corrupt, Best gave him a childlike enthusiasm that got laughs and made him endearing. His character became known for his distinctive "kew-kew-kew" chuckle and for goofy catchphrases such a

In [None]:
df.iloc[0].highlights

'James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .\n"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .'

In [None]:
df.iloc[0].textrank_pred

'(CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P.\n"Give Uncle Jesse my love when you see him dear friend." "Jimmy Best was the most constantly creative person I have ever known," said Ben Jones, who played mechanic Cooter on the show, in a Facebook post.\nIn the 1950s and 1960s, he accumulated scores of credits, playing a range of colorful supporting characters in such TV shows as "The Twilight Zone," "Bonanza," "The Andy Griffith Show" and "Gunsmoke." He later appeared in a handful of Burt Reynolds\' movies, including "Hooper" and "The End." But Best will always be best known for his "Hazzard" role, which lives on in reruns.'

In [None]:
# Checking if there are any None values in the predicted summaries.
df.textrank_pred.isna().sum()

0

Let's see the ROUGE score for the first 100 elements in the dataframe.

In [None]:
rouge.get_scores(df.iloc[:100].textrank_pred.values, df.iloc[:100].highlights.values, avg=True)

{'rouge-1': {'f': 0.19379958477895773,
  'p': 0.131795553382348,
  'r': 0.4553012185585772},
 'rouge-2': {'f': 0.054097447870300144,
  'p': 0.03512628425617172,
  'r': 0.15625245300877005},
 'rouge-l': {'f': 0.17894739086812603,
  'p': 0.12120719616227973,
  'r': 0.42440605074834076}}

# Take-home Exercise
We see from above that the ROUGE-L F-score is pretty low: about 0.18.
*   Why could that be?
*   How would you investigate why the predictions are so different from the gold standard?


*Hint:* Are the gold standard summaries more abstractive or more extractive based on the source text? What about any differences in the text features between the gold standard and the model predictions (e.g. length)?
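
As a possible starting point for the hint above, the sketch below measures how abstractive a summary is as the share of its n-grams that do not appear in the source text. This is only one of several ways to quantify extractiveness; the function names and the choice of bigrams as the default are our own assumptions.

```python
import re
from typing import Set, Tuple

def ngram_set(text: str, n: int) -> Set[Tuple[str, ...]]:
    tokens = re.findall(r'\w+', text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_ratio(source: str, summary: str, n: int = 2) -> float:
    """Share of the summary's n-grams that are absent from the source.

    0.0 means fully extractive (everything is copied), 1.0 means fully novel.
    """
    summary_ngrams = ngram_set(summary, n)
    if not summary_ngrams:
        return 0.0
    return len(summary_ngrams - ngram_set(source, n)) / len(summary_ngrams)

source = "The quick brown fox jumped over the lazy dog in the garden."
print(novel_ngram_ratio(source, "The quick brown fox jumped over the lazy dog."))
print(novel_ngram_ratio(source, "A fast fox leaped over a sleepy dog."))
```

Applied to the dataframe, one could compare this ratio for the `highlights` column and for the `textrank_pred` column, each measured against `article`.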


In [None]:
# Add your code here.