<H3>PRI 2023/24: first project delivery</H3>

**GROUP 8**
- Daniele Avolio    , ist1111559
- Michele Vitale	, ist1111558	
- Luís Dias	        , ist198557

<H3>Part I: demo of facilities</H3>

A) **Indexing** (preprocessing and indexing options)

Initially, let's create a `function` to read the `documents` we need to analyze.

In [3]:
import os  

def read_files(location:str):
    filespath = []  
    
    for root, dirs, files in os.walk(location):
        for file in files:
            if file.endswith(".txt"): 
                filespath.append(os.path.join(root, file))  
    
    return filespath  

In [4]:
documents_paths = read_files("../BBC News Summary/News Articles")  
print(documents_paths[:5])
print(f"The number of documents is {len(documents_paths)}")

['../BBC News Summary/News Articles\\business\\001.txt', '../BBC News Summary/News Articles\\business\\002.txt', '../BBC News Summary/News Articles\\business\\003.txt', '../BBC News Summary/News Articles\\business\\004.txt', '../BBC News Summary/News Articles\\business\\005.txt']
The number of documents is 2225


Possible function to index

In [5]:
# code, statistics and/def
import os
import time
from whoosh import index, scoring
from whoosh.fields import Schema, TEXT, NUMERIC
from whoosh.analysis import StandardAnalyzer, StemFilter, LowercaseFilter, StopFilter

stoplist = frozenset(
    [
        "and",
        "is",
        "it",
        "an",
        "as",
        "at",
        "have",
        "in",
        "yet",
        "if",
        "from",
        "for",
        "when",
        "by",
        "to",
        "you",
        "be",
        "we",
        "that",
        "may",
        "not",
        "with",
        "tbd",
        "a",
        "on",
        "your",
        "this",
        "of",
        "us",
        "will",
        "can",
        "the",
        "or",
        "are",
    ]
)


def indexing(document_collection, stem=False, stop_words=True):
    start_time = time.time()

    # It's important to put the stoplist check here because in the constructor
    # of the StandardAnalyzer, the stoplist parameter is set automatically to a default
    # so if we want to remove it, we need to check it during the construction
    analyzer = StandardAnalyzer(stoplist=stoplist if stop_words else None)

    if stem:
        analyzer = analyzer | StemFilter()

    

    # print(analyzer)
    

    schema = Schema(
        id=NUMERIC(stored=True),
        content=TEXT(
            analyzer=analyzer,
            stored=True,
        ),
    )

    index_dir = "indexdirectory"
    if not os.path.exists(index_dir):
        os.mkdir(index_dir)

    ix = index.create_in(index_dir, schema)

    writer = ix.writer()

    for doc_id, document in enumerate(document_collection):
        with open(document, "r") as file:
            file.readline() #To skip the first line of the file
            content = file.read()
            writer.add_document(id=doc_id, content=content)
    writer.commit()

    indexing_time = time.time() - start_time

    return ix, indexing_time

In [6]:
from whoosh.qparser import QueryParser

ix, indexing_time = indexing(documents_paths)

print(f"Indexing time: {indexing_time} seconds")
print(f"Number of indexed documents: {ix.doc_count()}")

Indexing time: 10.372246026992798 seconds
Number of indexed documents: 2225


In [7]:
# Print the index indexes
from whoosh.reading import IndexReader

terms = ix.reader().all_terms()

print(list(terms))



In [8]:
with ix.searcher(weighting=scoring.TF_IDF()) as searcher:
    query = QueryParser("content", ix.schema).parse("PC")
    results = searcher.search(query, limit=5)

    for hit in results:
        print(f"Document id: {hit['id']} document score: {hit.score}")
        print("\n")


Document id: 2124 document score: 54.134826360294376


Document id: 2065 document score: 29.149521886312357


Document id: 2111 document score: 29.149521886312357


Document id: 2206 document score: 29.149521886312357


Document id: 2123 document score: 24.98530447398202




`Things to do more:`
- Add a `function` that gives statistics about the `documents` (e.g. number of words, number of characters, etc.)
- Add a `function` that gives the `frequency` of each word in the `documents` (e.g. word1: 10, word2: 5, etc.)
- Somethink else?


B) **Summarization**

*B.1 Summarization solution: results for a given document*

This needs to be changed. It's better to implement a good BM25 algorithm, and this is not the case.

In [9]:
from nltk.tokenize import sent_tokenize
from whoosh import scoring
from transformers import BertModel, BertTokenizer
import numpy as np

# given two vectors representing embeddings, calculate the euclidian distance between them
def euclidian_distance(v1, v2):
    return np.sum((v1 - v2) ** 2) ** 0.5



def summarization(
    document: str,
    max_sentences: int,
    max_characters: int,
    order: bool,
    ix,
    scoring_type: str = "TF_IDF",
):

    # It's better to tokenize into sentences
    sentences = sent_tokenize(document)

    if scoring_type != "BERT":
        # The main idea is to take a sentence and give it a score based on the frequency of its terms
        # Then we select the sentences with the highest scores
        with ix.searcher(
            weighting=(
                scoring.TF_IDF()
                if scoring_type == "TF_IDF"
                else scoring.BM25F(B=0.75, K1=1.2)
            )
        ) as searcher:
            sentence_scores = {}
            for i, sentence in enumerate(sentences):
                score = 0
                for word in sentence.split():
                    # We use the frequency of the word in the whole collection as a score
                    score += searcher.frequency("content", word)

                sentence_scores[i] = score / len(sentences)

        # Organize the sentences by their scores
        # Uses lambda function to sort the dictionary by value, taking the second element of the tuple
        sorted_sentence_scores = sorted(
            sentence_scores.items(), key=lambda item: item[1], reverse=True
        )

        summary_sentences = []
        summary_length = 0
        for i, score in sorted_sentence_scores:
            sentence = sentences[i]
            if summary_length + len(sentence) > max_characters:
                break
            summary_sentences.append((i, sentence))
            summary_length += len(sentence)
            if len(summary_sentences) >= max_sentences:
                break

        # If order is True, sort the sentences into their original order
        if order:
            summary_sentences.sort(key=lambda item: item[0])

        # Join the sentences together into a single string
        summary = " ".join(sentence for i, sentence in summary_sentences)
    else:
        # Set the entire document as anchor
        # Split the document into sentences
        # Get the most important sentences
        # Return the summary
        sentences = nltk.sent_tokenize(document)
        bert_model = BertModel.from_pretrained("bert-base-uncased")
        bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

        # The input for BERT is a list of strings
        # So we need to tokenize the sentences
        tokenized_sentences = bert_tokenizer(
            sentences, return_tensors="pt", padding=True
        )

        anchor_embedding = bert_model(**tokenized_sentences)['last_hidden_state'].squeeze()[0].detach().numpy()
        sentences_embedding = {}
        for s in sentences:
            sentences_embedding[s] = bert_model(**bert_tokenizer(s, return_tensors="pt"))['last_hidden_state'].squeeze()[0].detach().numpy()
            

        # Use the euclidian distance to compare the sentences
        sentences_distance = {}
        for s in sentences:
            score = euclidian_distance(anchor_embedding, sentences_embedding[s])
            sentences_distance[s] = score


        # Sort the sentences by their distance to the anchor
        sorted_sentences_distance = sorted(
            sentences_distance.items(), key=lambda item: item[1], reverse=False
        )



        summary_sentences = []
        summary_length = 0
        for sentence, score in sorted_sentences_distance:
            if summary_length + len(sentence) > max_characters:
                break
            summary_sentences.append((sentence, score))
            summary_length += len(sentence)
            if len(summary_sentences) >= max_sentences:
                break
        
        # If the order is true, use the original senteces order
        if order:
            summary_sentences.sort(key=lambda item: sentences.index(item[0]))
            

        # Join the sentences together into a single string
        summary = " ".join(sentence for sentence, score in summary_sentences)

    return summary

  from .autonotebook import tqdm as notebook_tqdm


In [10]:
# Test the summarization function
document = documents_paths[0]
with open(document, "r") as file:
    content = file.read()
    summaryBERT = summarization(content, max_sentences=5, max_characters=500, order=True, ix=ix, scoring_type="BERT")
    summaryBM25 = summarization(content, max_sentences=5, max_characters=500, order=True, ix=ix, scoring_type="BM25")
    summaryTFIDF = summarization(content, max_sentences=5, max_characters=500, order=True, ix=ix, scoring_type="TF_IDF")
    print(summaryBERT)
    print()
    print(summaryBM25)
    print()
    print(summaryTFIDF)
    print("\n")

Ad sales boost Time Warner profit

Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (Â£600m) for the three months to December, from $639m year-earlier. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn. The company said it was unable to estimate the amount it needed to set aside for legal reserves, which it previously set at $500m.

TimeWarner also has to restate 2000 and 2003 results following a probe by the US Securities Exchange Commission (SEC), which is close to concluding. "Our financial performance was strong, meeting or exceeding all of our full-year objectives and greatly enhancing our flexibility," chairman and chief executive Richard Parsons said. The company said it was unable to estimate the amount it needed to set aside for legal reserves, which it previously set at $500m.

TimeWarner also has to restate 2

*B.2 IR models (TF-IDF, BM25 and BERT)*

In [11]:
# code, statistics and/or charts here
from whoosh.qparser import *


# probably querying (ask Rui)
def test_model(type: str, query: str, ix):
    score = None
    if type == "BM25":
        score = scoring.BM25F(B=0.75, K1=1.2)
    elif type == "TF-IDF":
        score = scoring.TF_IDF()
    elif type == "BERT":
        pass

    with ix.searcher(weighting=score) as searcher:
        q = QueryParser("content", ix.schema, group=OrGroup).parse(query)
        results = searcher.search(q, limit=10)

        for r in results:
            print(f"Id: {r['id']} Score: {r.score}")


test_model("BM25", "PC", ix)

Id: 2065 Score: 8.139982879732802
Id: 2124 Score: 7.604678636099232
Id: 1934 Score: 7.572117589898127
Id: 249 Score: 7.372295992649777
Id: 2206 Score: 7.229416952524419
Id: 2111 Score: 7.047284836325118
Id: 1844 Score: 6.696682075448143
Id: 2151 Score: 6.696682075448143
Id: 2066 Score: 6.673974256809339
Id: 2067 Score: 6.673974256809339


*B.3 Reciprocal rank funsion*

In [12]:
#code, statistics and/or charts here

*B.4 Maximal Marginal Relevance*

In [13]:
#code, statistics and/or charts here

C) **Keyword extraction**

In [40]:
#code, statistics and/or charts here

from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer as TfIdfVectorizer
import pandas as pd
import nltk

import re
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))


print(stop_words)
nltk.download('averaged_perceptron_tagger')

def preprocess_text(text):
    text = text.lower()

    # remove punctuation 
    text = re.sub(r'[^\w\s]','',text)

    # remove stopwords
    words = nltk.word_tokenize(text)

    text = " ".join([word for word in words if word not in stop_words])

    # join the words back into a single string
    text = " ".join(words)

    return text
    

def tfidf(documents):
    vectorizer = TfIdfVectorizer()
    X = vectorizer.fit_transform(documents)
    return (pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out()), vectorizer)

def tdidf_extraction(doc, vectorizer, feature_names):

    doc = preprocess_text(doc)

    print(doc)

    tfidf_vector = vectorizer.transform([doc])
    sorted_items = sort_coo(tfidf_vector.tocoo())
    keywords = extract_topn_from_vector(feature_names, sorted_items, 10)
    return keywords

def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    sorted_items = sorted_items[:topn]
    score_vals = []
    feature_vals = []
    for idx, score in sorted_items:
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])
    results = {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]] = score_vals[idx]
    return results

def keyword_extraction(document:str, maximum_keyword:int, ix, **kwargs):

    word_tag = nltk.pos_tag(nltk.word_tokenize(document))

    # We can filter the words based on their tags keeping only the important ones for keyword extraction
    # filter are here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
    tags = ["NN", "NNS", "NNP", "NNPS"]
    important_words = [word for word, tag in word_tag if tag in tags]


    words_scores = defaultdict(int)

    # assess the score of each word at the document level
    with ix.searcher(weighting=scoring.TF_IDF()) as searcher:
        for word in important_words:
            words_scores[word] = searcher.frequency("content", word)

    sorted_words_scores = sorted(words_scores.items(), key=lambda item: item[1], reverse=True)

    keywords = [word for word in sorted_words_scores[:maximum_keyword]]

    return keywords
    


{'what', 'm', "wasn't", 'don', 'those', 'again', 'been', 'have', "haven't", 'into', 'there', 'haven', 'not', 'too', 'isn', 'hadn', 'are', "shan't", "she's", 'about', 'off', 'now', 'can', 'wasn', 'won', 'during', 'out', 'myself', 'under', "mightn't", 'so', 've', 'until', 'at', 'against', 'hers', "couldn't", 'below', 'this', 'between', 'themselves', 'before', "isn't", 'weren', "you'd", 'while', 'she', 'own', "doesn't", "weren't", 'ourselves', 'we', 'they', 'were', 'when', 'through', 'wouldn', "you're", 'then', 'just', 'these', 'should', 'himself', 'is', 'if', 'for', 't', 'doesn', "hasn't", "you'll", 'the', 'an', 'y', 'doing', 'itself', 'does', "won't", 'such', 'with', 'why', 'will', 'couldn', 'both', "mustn't", 'most', 'down', 'than', 'in', 'mightn', 'was', 'shouldn', "hadn't", 'herself', 'shan', 'theirs', 'only', 'their', 'over', 'further', 'll', "didn't", 'me', 'a', 'on', "don't", 's', 'same', 'because', 'nor', 'hasn', 'any', 'mustn', 'needn', 'very', 'you', 'ours', 'which', 'he', 'ma'

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [38]:

document = documents_paths[0]
with open(document, "r") as file:
    content = file.read()
    keywords = keyword_extraction(content, 10, ix)
    print(keywords)
    print("\n")

# try this later: https://whoosh.readthedocs.io/en/latest/keywords.html


[('film', 844.0), ('back', 824.0), ('music', 815.0), ('way', 737.0), ('market', 687.0), ('company', 682.0), ('firm', 583.0), ('part', 575.0), ('sales', 480.0), ('months', 466.0)]




In [39]:
# Use tfidf to extract keywords
documents = []


for document in documents_paths:
    with open(document, "r") as file:
        file.readline() #To skip the first line of the file
        content = file.read()  

        content = preprocess_text(content)
        


        documents.append(content)

tfidf_matrix, vectorizer = tfidf(documents)
feature_names = vectorizer.get_feature_names_out()


doc = documents[0]
keywords = tdidf_extraction(doc, vectorizer, feature_names)
print(keywords)

    

quarterly profits at us media giant timewarner jumped 76 to 113bn â600m for the three months to december from 639m yearearlier the firm which is now one of the biggest investors in google benefited from sales of highspeed internet connections and higher advert sales timewarner said fourth quarter sales rose 2 to 111bn from 109bn its profits were buoyed by oneoff gains which offset a profit dip at warner bros and less users for aol time warner said on friday that it now owns 8 of searchengine google but its own internet business aol had has mixed fortunes it lost 464000 subscribers in the fourth quarter profits were lower than in the preceding three quarters however the company said aols underlying profit before exceptional items rose 8 on the back of stronger internet advertising revenues it hopes to increase subscribers by offering the online service free to timewarner internet customers and will try to sign up aols existing customers for highspeed broadband timewarner also has to res

D) **Evaluation**

In [None]:
#code, statistics and/or charts here

<H3>Part II: questions materials (optional)</H3>

**(1)** Corpus *D* and summaries *S* description.

In [None]:
#code, statistics and/or charts here

**(2)** Summarization performance for the overall and category-conditional corpora.

In [None]:
#code, statistics and/or charts here

**...** (additional questions with empirical results)

<H3>END</H3>