## Exercize 2.1
### Summarization using Nasari
#### Francesco Sannicola

In [1]:
import string

from nltk.corpus import stopwords
import nltk

Reads the NASARI file and calculate a dict as output

In [2]:
def read_nasari(file):
    """
    :return: a dictionary as {word: {{term:score},...}}
    """
    nasari = {}
    with open(file, 'r', encoding="utf8") as file:
        for row in file.readlines():
            line_splitted = row.split(";")
            dict_entry = {}
            
            # Start from 2 letter (delete "bn:")
            for term in line_splitted[2:]:
                # term and score written like this: "serotonin_1841.0"
                term_score = term.split("_")
                if len(term_score) > 1:
                    dict_entry[term_score[0]] = term_score[1]

            nasari[line_splitted[1].lower()] = dict_entry

    return nasari

Reads the given document.

In [3]:
def read_doc(file):
    """
    :param file: document's path
    :return: a list of all paragraph.
    """
    document = []
    with open(file, 'r', encoding="utf8") as file:
        for row in file.readlines():
            # does not consider lines starting with "#"
            if '#' not in row:
                row = row[:-1]
                if row != '':
                    document.append(row)
    return document

Computes the rank of the given vector. Method used to calculate the weighted overlap between nasari vectors.

In [4]:
def calculate_rank(vector, nasari_vector):
    """
    :param vector: input vector
    :param nasari_vector: input Nasari vector
    :return: vector's rank (position inside the nasari_vector)
    """

    j=9
    for i in range(len(nasari_vector)):
        if nasari_vector[i] == vector:
            # returns index of nasari_vector egual to input vector
            return i + 1

Implementation of the Weighted Overlap between two nasari vectors.
$$
  WO(w_1,w_2) = \frac{\sum_{q \in O} (rank(q, v1) + rank(q, v2))^{-1}}{\sum_{i=1}^{|O|} (2i)^{-1}}
$$

More is WO and more will be similar that 2 vectors.

In [5]:
def weighted_overlap(nasari_vector_1, nasari_vector_2):
    """
    :param nasari_vector_1: in our case represents the Nasari of topic
    :param nasari_vector_2: in our case represents the Nasari of paragraph
    :return: Weighted Overlap square-rooted of two nasari vectors
    """

    overlap_keys = nasari_vector_1.keys() & nasari_vector_2.keys()

    list_overlap_keys = list(overlap_keys)

    if len(list_overlap_keys) != 0:
        return (sum(1 / (calculate_rank(vector, list(nasari_vector_1)) + calculate_rank(vector, list(nasari_vector_2))) for vector in list_overlap_keys)) \
               / \
               (sum(list(map(lambda x: 1 / (2 * x), list(range(1, len(list_overlap_keys) + 1))))))
    return 0

A bag of word algorithm based approach. It calculates a list of word given a text doing stop word and punctuation removal.

In [6]:
def bag_of_word_approach(text):
    """
    :param text: input text
    :return: BoW representation of the text.
    """

    text = text.lower()
    
    stop_words = set(stopwords.words('english'))
    wordnet_lemmatizer = nltk.WordNetLemmatizer()
    
    # text tokenizzation
    tokens = nltk.word_tokenize(text)
    
    # remove stop_word and punctuation
    tokens = list(filter(lambda x: x not in stop_words and x not in string.punctuation, tokens))
    
    return set(wordnet_lemmatizer.lemmatize(token) for token in tokens)

Create a list of nasari vectors depending on the document title (topic).

In [7]:
def get_topic_from_title(document, nasari):
    """
    :param document: input document
    :param nasari: Nasari dictionary
    :return: a list of Nasari vectors.
    """

    # title in on the first document entry
    title = document[0]
    
    # topic calculated with BOW approach
    topic = bag_of_word_approach(title)

    # NB nasari_vectors is a dict of dicts {word: {{term:score},...}}
    nasari_vectors = []

    for word in topic:
        if word in nasari.keys():
            nasari_vectors.append(nasari[word])

    return nasari_vectors


Create a list of nasari vectors depending of text's terms. Very similar to the previous one.

In [8]:
def text_to_nasari(text, nasari):
    """
    :param text: the list of text's terms
    :param nasari: Nasari dictionary
    :return: list of Nasari's vectors.
    """

    tokens = bag_of_word_approach(text)
    
    nasari_vectors = []

    for word in tokens:
        if word in nasari.keys():
            nasari_vectors.append(nasari[word]) 

    return nasari_vectors

Given a list of paragraph from a document, calculate how many of these are preserved depending to a percentage.

In [9]:
def calculate_lines_to_keep(doc_paragraphs, percentage):
    """
    :param doc_paragraphs: document's paragraphs as a list
    :param percentage: reduction percentage
    :return: number of paragraphs to preserve
    """
    return len(doc_paragraphs) - int(round((percentage / 100) * len(doc_paragraphs), 0))
    

Given a list of paragraphs annotated with overlap scores, compute the summarized text.

In [10]:
def reduce_document(doc_paragraphs_overlaps, lines_to_keep):
    """
    :param doc_paragraphs_overlaps: document's paragraphs as a list with an overlap score
    :param lines_to_keep: number of paragraph to keep
    :return: reduced document
    """
    # Order by weighted overlap
    document_sorted  = sorted(doc_paragraphs_overlaps, key=lambda x: x[1], reverse=True)
    reduced_document = document_sorted[:lines_to_keep]
    
    # Restore original order
    reduced_document = sorted(reduced_document, key=lambda x: x[0], reverse=True)

    # Obtain the text
    reduced_document_text = list(map(lambda x: x[2], reduced_document))
    
    # Add the title
    reduced_document_text = [document[0]] + reduced_document_text
    
    return reduced_document_text
    

Applies summarization to the given document, with the specific redution percentage.
I will use title approach.

In [11]:
def summarization(document, nasari, percentage):
    """
    :param document: document
    :param nasari_dict: nasari dict of dicts
    :param percentage: reduction percentage
    :return: document summarized
    """

    # First step: calculate topic/topics from title
    topics = get_topic_from_title(document, nasari)
    doc_paragraphs = []
    i = 0
    
    for doc_paragraph in document[1:]:
        
        # obtain nasari rappresentation of the paragraph
        nasari_text_par = text_to_nasari(doc_paragraph, nasari)
        paragraph_weighted_overlap = 0
        
        # word is a nasari rappresentation of the term {word: {{term:score},...}}
        for word in nasari_text_par:
            topic_weighted_overlap = 0
            
            for topic in topics:
                # for each topic compute the WO for topic and word (comulative)
                topic_weighted_overlap += weighted_overlap(word, topic)
            
            # Mean of WO (based on number of topic)
            if topic_weighted_overlap != 0:
                topic_weighted_overlap /= len(topics)
            
            
            # Comulative paragraph's WO
            paragraph_weighted_overlap += topic_weighted_overlap

        if len(nasari_text_par) != 0:
            # Mean of paragraph's WO
            paragraph_weighted_overlap /= len(nasari_text_par)
            # Create a tuple with paragraph's number, WO and text. Append it.
            doc_paragraphs.append((i, paragraph_weighted_overlap, doc_paragraph))

        i += 1

    # Obtain number of lines to keep
    lines_to_keep = calculate_lines_to_keep(doc_paragraphs, percentage)
    
    # Finally we can execute summarization
    reduced_document = reduce_document(doc_paragraphs, lines_to_keep)
    
    return reduced_document

Call all previously defined methods.
Also calcule BLEU and ROUGE score to see how similar the results are compared to the original documents.

In [12]:
from nltk.translate.bleu_score import sentence_bleu

from bleu import multi_list_bleu

from rouge import Rouge


nasari_file = 'utils/NASARI_vectors/dd-small-nasari-15.txt'
all_docs = ['utils/docs/Andy-Warhol.txt', 'utils/docs/Ebola-virus-disease.txt', 'utils/docs/Life-indoors.txt', 'utils/docs/Napoleon-wiki.txt', 'utils/docs/Trump-wall.txt']

nasari = read_nasari(nasari_file)
rouge = Rouge()


for doc in all_docs:
    
    document = read_doc(doc)
    print('--------------------------------------------------------------------')
    print('\033[1m' +doc.replace("utils/docs/", "").replace(".txt", ""), '\033[0m' + "\tOriginal lenght:", len(document))
    
    for percentage_decrease in [10,20,30]:
        
        summary = summarization(document, nasari, percentage_decrease)

        print(percentage_decrease, "% redution", "\tSummary lenght:", len(summary))
        print()
        
        # Compute BLEU only for 1-gram
        print("BLEU score: ", sentence_bleu([document], summary, weights=(1, 0, 0, 0)))
        
        # COmpute rouge scores for unigram, bigram and l-gram. F1, precision and recall.
        rouge_scores = rouge.get_scores(' '.join(summary), ' '.join(document))
        print("Rogue scores: ", rouge_scores)
        
    print()

--------------------------------------------------------------------
[1mAndy-Warhol [0m	Original lenght: 20
10 % redution 	Summary lenght: 18

BLEU score:  0.8948393168143697


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


Rogue scores:  [{'rouge-1': {'f': 0.9290555756849036, 'p': 1.0, 'r': 0.8675105485232067}, 'rouge-2': {'f': 0.9136137444598561, 'p': 0.9834469328140214, 'r': 0.8530405405405406}, 'rouge-l': {'f': 0.9369369319568218, 'p': 1.0, 'r': 0.8813559322033898}}]
20 % redution 	Summary lenght: 16

BLEU score:  0.7788007830714049
Rogue scores:  [{'rouge-1': {'f': 0.8523002372398267, 'p': 1.0, 'r': 0.7426160337552743}, 'rouge-2': {'f': 0.8376151187156861, 'p': 0.9829351535836177, 'r': 0.7297297297297297}, 'rouge-l': {'f': 0.8771626248332306, 'p': 1.0, 'r': 0.7812018489984591}}]
30 % redution 	Summary lenght: 14

BLEU score:  0.6514390575310556
Rogue scores:  [{'rouge-1': {'f': 0.8030302982242884, 'p': 1.0, 'r': 0.6708860759493671}, 'rouge-2': {'f': 0.7896865472671786, 'p': 0.9836272040302267, 'r': 0.6596283783783784}, 'rouge-l': {'f': 0.8400357413299089, 'p': 1.0, 'r': 0.724191063174114}}]

--------------------------------------------------------------------
[1mEbola-virus-disease [0m	Original len