<H3>PRI 2023/24: first project delivery</H3>

**GROUP 8**
- Daniele Avolio, ist1111559
- Michele Vitale, ist1111558
- Luís Dias, ist198557

<H3>Part I: demo of facilities</H3>

A) **Indexing** (preprocessing and indexing options)

First, we define a function that collects the paths of the documents we need to analyze.

In [1]:
import os  

def read_files(location: str):
    """Return the paths of all .txt files found under `location`."""
    filespath = []

    for root, dirs, files in os.walk(location):
        for file in files:
            if file.endswith(".txt"):
                filespath.append(os.path.join(root, file))

    return filespath

In [2]:
documents_paths = read_files("../BBC News Summary/News Articles")  
print(documents_paths[:5])
print(f"The number of documents is {len(documents_paths)}")

['../BBC News Summary/News Articles\\business\\001.txt', '../BBC News Summary/News Articles\\business\\002.txt', '../BBC News Summary/News Articles\\business\\003.txt', '../BBC News Summary/News Articles\\business\\004.txt', '../BBC News Summary/News Articles\\business\\005.txt']
The number of documents is 2225


A first version of the indexing function, with optional stemming and stop-word removal:

In [3]:
# code, statistics and/or charts here
import os
import time
from whoosh import index, scoring
from whoosh.fields import Schema, TEXT, NUMERIC
from whoosh.analysis import StandardAnalyzer, StemFilter

stoplist = frozenset(
    [
        "and",
        "is",
        "it",
        "an",
        "as",
        "at",
        "have",
        "in",
        "yet",
        "if",
        "from",
        "for",
        "when",
        "by",
        "to",
        "you",
        "be",
        "we",
        "that",
        "may",
        "not",
        "with",
        "tbd",
        "a",
        "on",
        "your",
        "this",
        "of",
        "us",
        "will",
        "can",
        "the",
        "or",
        "are",
    ]
)


def indexing(document_collection, stem=True, stop_words=True):
    start_time = time.time()

    # StandardAnalyzer applies its own default stop list unless told otherwise,
    # so we pass our list explicitly, or None to disable stop-word removal.
    analyzer = StandardAnalyzer(stoplist=stoplist if stop_words else None)

    if stem:
        analyzer = analyzer | StemFilter()

    schema = Schema(
        id=NUMERIC(stored=True),
        content=TEXT(
            analyzer=analyzer,
            stored=True,
        ),
    )

    index_dir = "indexdirectory"
    if not os.path.exists(index_dir):
        os.mkdir(index_dir)

    ix = index.create_in(index_dir, schema)

    writer = ix.writer()

    for doc_id, document in enumerate(document_collection):
        with open(document, "r") as file:
            content = file.read()
            writer.add_document(id=doc_id, content=content)

    writer.commit()

    indexing_time = time.time() - start_time

    return ix, indexing_time

In [4]:
from whoosh.qparser import QueryParser

ix, indexing_time = indexing(documents_paths)

print(f"Indexing time: {indexing_time} seconds")
print(f"Number of indexed documents: {ix.doc_count()}")

Indexing time: 28.83882164955139 seconds
Number of indexed documents: 2225


In [50]:
with ix.searcher(weighting=scoring.TF_IDF()) as searcher:
    query = QueryParser("content", ix.schema).parse("PC")
    results = searcher.search(query, limit=5)

    for hit in results:
        print(f"Document id: {hit['id']} document score: {hit.score}")
        print("\n")


`Still to do:`
- Add a function that reports statistics about the documents (e.g. number of words, number of characters).
- Add a function that reports the frequency of each word in the documents (e.g. word1: 10, word2: 5).
- Something else?
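
The first two items could start from something like the sketch below. It is only a starting point, not part of the delivered solution: the helper names (`tokenize`, `document_statistics`, `word_frequencies`) are hypothetical, and the tokenizer is a plain regular expression rather than the Whoosh analyzer used for indexing, so counts may differ from what the index stores.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_statistics(text):
    """Return simple size statistics for one document."""
    tokens = tokenize(text)
    return {
        "characters": len(text),
        "words": len(tokens),
        "unique_words": len(set(tokens)),
    }


def word_frequencies(text, top_n=10):
    """Return the top_n most frequent tokens as (word, count) pairs."""
    return Counter(tokenize(text)).most_common(top_n)


# Hypothetical sample text, only to show the call pattern
sample = "The firm said profit rose. The profit beat forecasts."
print(document_statistics(sample))
print(word_frequencies(sample, top_n=2))
```

Applying the same functions to the text of each file in `documents_paths` would give per-document statistics that can then be aggregated per category.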


B) **Summarization**

*B.1 Summarization solution: results for a given document*

This needs to be changed. The current solution scores sentences by the raw collection frequency of their terms; a proper BM25 ranking of the sentences would be better.
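
As a possible direction, here is a minimal pure-Python sketch of BM25 sentence scoring. It rests on assumptions that are not in the notebook: each sentence is treated as a small document, the query is the set of all terms in the document, and the IDF uses the non-negative `log(1 + ...)` variant. The names `tokenize` and `bm25_sentence_scores` are hypothetical, and the tokenizer is a simple regular expression instead of the Whoosh analyzer.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_sentence_scores(sentences, k1=1.2, b=0.75):
    """Score each sentence with BM25, using the whole document as the query."""
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n if n else 0.0
    # Sentence frequency: in how many sentences each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term, freq in tf.items():
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization relative to the average sentence length
            norm = freq + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * freq * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical sentences, only to show the call pattern
sentences = [
    "Profit rose sharply at the firm.",
    "The firm said profit rose.",
    "Shares fell.",
]
scores = bm25_sentence_scores(sentences)
ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
print(ranked)
```

The resulting `scores` list could replace the `sentence_scores` dictionary in `summarization`, leaving the selection and ordering logic unchanged.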

In [26]:
from nltk.tokenize import sent_tokenize
from whoosh import scoring


def summarization(
    document: str,
    max_sentences: int,
    max_characters: int,
    order: bool,
    ix,
    scoring_type: str = "TF_IDF",
):

    # Split the document into sentences, the units we rank and select
    sentences = sent_tokenize(document)

    if scoring_type != "BERT":
        # Score each sentence by the collection frequency of its terms,
        # then keep the highest-scoring sentences.
        # Note: searcher.frequency() returns raw counts, so the weighting
        # model chosen here does not influence the scores yet.
        with ix.searcher(
            weighting=scoring.TF_IDF() if scoring_type == "TF_IDF" else scoring.BM25F(
                B=0.75, K1=1.2
            )
        ) as searcher:
            sentence_scores = {}
            for i, sentence in enumerate(sentences):
                score = 0
                for word in sentence.split():
                    # Total number of occurrences of the word in the collection
                    score += searcher.frequency("content", word)

                # Dividing by a constant rescales the scores but keeps the ranking
                sentence_scores[i] = score / len(sentences)

        # Organize the sentences by their scores
        # Uses lambda function to sort the dictionary by value, taking the second element of the tuple
        sorted_sentence_scores = sorted(
            sentence_scores.items(), key=lambda item: item[1], reverse=True
        )

        summary_sentences = []
        summary_length = 0
        for i, score in sorted_sentence_scores:
            sentence = sentences[i]
            if summary_length + len(sentence) > max_characters:
                break
            summary_sentences.append((i, sentence))
            summary_length += len(sentence)
            if len(summary_sentences) >= max_sentences:
                break

        # If order is True, sort the sentences into their original order
        if order:
            summary_sentences.sort(key=lambda item: item[0])

        # Join the sentences together into a single string
        summary = " ".join(sentence for i, sentence in summary_sentences)
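    else:
        # BERT-based scoring is not implemented yet; fail loudly instead of
        # reaching the return below with `summary` undefined
        raise NotImplementedError("BERT scoring is not implemented yet")

    return summary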

In [31]:
# Test the summarization function
document = documents_paths[0]
with open(document, "r") as file:
    content = file.read()
    summary = summarization(content, max_sentences=5, max_characters=500, order=True, ix=ix, scoring_type="TF_IDF")
    print(summary)
    print("\n")

Time Warner said on Friday that it now owns 8% of search-engine Google. The company said it was unable to estimate the amount it needed to set aside for legal reserves, which it previously set at $500m. It intends to adjust the way it accounts for a deal with German music publisher Bertelsmann's purchase of a stake in AOL Europe, which it had reported as advertising revenue.




*B.2 IR models (TF-IDF, BM25 and BERT)*

In [36]:
# code, statistics and/or charts here
from whoosh import scoring
from whoosh.qparser import OrGroup, QueryParser


# probably querying (ask Rui)
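def test_model(model_type: str, query: str, ix):
    # Pick the weighting model used to score the hits
    if model_type == "BM25":
        weighting = scoring.BM25F(B=0.75, K1=1.2)
    elif model_type == "TF-IDF":
        weighting = scoring.TF_IDF()
    elif model_type == "BERT":
        raise NotImplementedError("BERT ranking is not implemented yet")
    else:
        raise ValueError(f"Unknown model type: {model_type}")

    with ix.searcher(weighting=weighting) as searcher:
        q = QueryParser("content", ix.schema, group=OrGroup).parse(query)
        results = searcher.search(q, limit=10)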

        for r in results:
            print(f"Id: {r['id']} Score: {r.score}")


test_model("BM25", "PC", ix)

Id: 2065 Score: 7.881376447520268
Id: 2206 Score: 7.249861037345842
Id: 1934 Score: 7.1905429154835865
Id: 2073 Score: 7.102620146365696
Id: 2124 Score: 7.095863705043671
Id: 1906 Score: 7.029048064877089
Id: 2131 Score: 7.029048064877089
Id: 2159 Score: 6.845299733583026
Id: 2066 Score: 6.798863188653262
Id: 249 Score: 6.795487552532308


*B.3 Reciprocal rank fusion*

In [9]:
#code, statistics and/or charts here
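
Until this cell is filled in, the following sketch shows one common way to compute reciprocal rank fusion. It is not the group's implementation: the function name is hypothetical, the input is assumed to be a list of rankings (best item first) such as sentence indices ordered by TF-IDF and by BM25, and `k=60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings (best item first) into one list of (item, score)."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            # Each ranking contributes 1 / (k + rank) to the item's score
            fused[item] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical sentence indices ranked by two different models
tfidf_ranking = [3, 0, 5, 1]
bm25_ranking = [3, 1, 0, 5]
fused = reciprocal_rank_fusion([tfidf_ranking, bm25_ranking])
print([item for item, score in fused])
```

Items ranked highly by several models accumulate the largest fused score, so the fused order can be used directly to select summary sentences.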

*B.4 Maximal Marginal Relevance*

In [10]:
#code, statistics and/or charts here
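
A possible starting point for this cell is the greedy selection sketched below. The assumptions are ours: `relevance` holds one relevance score per sentence, `similarity` is a square matrix of pairwise sentence similarities (for example cosine similarity of TF-IDF vectors), and `lam` trades relevance against redundancy. All names are hypothetical.

```python
def maximal_marginal_relevance(relevance, similarity, max_sentences, lam=0.7):
    """Greedily pick sentence indices that are relevant but not redundant."""
    selected = []
    candidates = set(range(len(relevance)))

    while candidates and len(selected) < max_sentences:

        def mmr(i):
            # Redundancy is the highest similarity to any sentence already chosen
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy

        # Iterate in sorted order so that ties go to the lowest index
        best = max(sorted(candidates), key=mmr)
        selected.append(best)
        candidates.remove(best)

    return selected


# Hypothetical scores: sentences 0 and 1 are near duplicates, 2 is different
relevance = [0.9, 0.85, 0.3]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(maximal_marginal_relevance(relevance, similarity, max_sentences=2, lam=0.5))
```

With `lam=1.0` the selection reduces to a plain relevance ranking; lower values penalize sentences that repeat what has already been selected.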

C) **Keyword extraction**

In [65]:
#code, statistics and/or charts here

from collections import defaultdict
import nltk

nltk.download('averaged_perceptron_tagger')


def keyword_extraction(document:str, maximum_keyword:int, ix, **kwargs):

    # Tag every token with its part of speech
    word_tag = nltk.pos_tag(nltk.word_tokenize(document))

    # Keep only nouns (Penn Treebank tags), the most likely keyword candidates
    # Tag list: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
    tags = ["NN", "NNS", "NNP", "NNPS"]
    important_words = [word for word, tag in word_tag if tag in tags]

    words_scores = defaultdict(int)

    # Score each candidate by its total frequency in the indexed collection
    with ix.searcher(weighting=scoring.TF_IDF()) as searcher:
        for word in important_words:
            words_scores[word] = searcher.frequency("content", word)

    sorted_words_scores = sorted(words_scores.items(), key=lambda item: item[1], reverse=True)

    # Each keyword is returned as a (word, frequency) pair
    keywords = sorted_words_scores[:maximum_keyword]

    return keywords
    


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [66]:

document = documents_paths[0]
with open(document, "r") as file:
    content = file.read()
    keywords = keyword_extraction(content, 10, ix)
    print(keywords)
    print("\n")

# try this later: https://whoosh.readthedocs.io/en/latest/keywords.html


[('film', 1177.0), ('back', 1039.0), ('firm', 1004.0), ('music', 949.0), ('market', 924.0), ('sale', 738.0), ('part', 634.0), ('deal', 514.0), ('growth', 467.0), ('profit', 333.0)]
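
The scores above are collection-wide frequencies, so words that are common across the whole corpus rank first regardless of the document. A document-level alternative is to weight the term frequency in the document by an inverse document frequency. The sketch below illustrates the idea in plain Python under our own assumptions: the function names are hypothetical, the tokenizer is a simple regular expression, the smoothed IDF `log((1 + N) / (1 + df))` is one of several possible variants, and the part-of-speech filter is left out for brevity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(document, collection, max_keywords=5):
    """Rank the terms of `document` by tf * idf computed over `collection`."""
    n_docs = len(collection)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in collection for term in set(tokenize(doc)))
    tf = Counter(tokenize(document))
    scores = {
        term: freq * math.log((1 + n_docs) / (1 + df[term]))
        for term, freq in tf.items()
    }
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:max_keywords]


# Hypothetical three-document collection, only to show the call pattern
collection = [
    "the firm reported profit",
    "the film won an award",
    "the market fell",
]
print(tfidf_keywords(collection[0], collection, max_keywords=3))
```

A term such as "the", which appears in every document, gets an IDF of zero and drops to the bottom of the ranking.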




D) **Evaluation**

In [12]:
#code, statistics and/or charts here
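
One simple option for this cell, sketched under the assumption that the reference summaries can be split into sentences and compared with the selected sentences by exact match, is to report precision, recall and F1 per document. The function name and the toy inputs are hypothetical; no results from the corpus are implied.

```python
def precision_recall_f1(predicted, reference):
    """Compare two collections of sentences by exact match."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical sentence identifiers standing in for real sentences
predicted = ["s1", "s2", "s3"]
reference = ["s2", "s3", "s4", "s5"]
print(precision_recall_f1(predicted, reference))
```

Averaging these values over all documents, or over the documents of each category, gives overall and category-conditional figures.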

<H3>Part II: questions materials (optional)</H3>

**(1)** Corpus *D* and summaries *S* description.

In [13]:
#code, statistics and/or charts here

**(2)** Summarization performance for the overall and category-conditional corpora.

In [14]:
#code, statistics and/or charts here

**...** (additional questions with empirical results)

<H3>END</H3>