<a href="https://colab.research.google.com/github/IgnatiusEzeani/welsh-text-summarizer/blob/main/code/summarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Summarizer v0.1
This version of the Welsh text summarizer is based on the tutorial by [Praveen Dubey](https://towardsdatascience.com/understand-text-summarization-and-create-your-own-summarizer-in-python-b26a9f09fc70).

#### 0. Import all necessary libraries

In [16]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [17]:
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx

#### 1. Generate clean sentences

In [37]:
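def read_article(file_name):
    # Only the first line of the file is used, so the article must sit on a single line.
    with open(file_name, "r", encoding="utf8") as file:
        filedata = file.readlines()
    article = filedata[0].split(". ")
    sentences = []

    for sentence in article:
        # print(sentence)  # (un)comment as required
        # Note: str.replace treats "[^a-zA-Z]" as a literal string, not a regex,
        # so this call leaves the sentence unchanged before it is split into words.
        sentences.append(sentence.replace("[^a-zA-Z]", " ").split(" "))
    sentences.pop()  # drop the final fragment produced by the ". " split

    return sentences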
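
The reader above splits on the literal string `". "` and only looks at the first line of the file. As a minimal sketch of an alternative, assuming a regex-based split is acceptable, the hypothetical helper `split_sentences` below (not part of the original notebook) takes the raw text and returns the same list-of-word-lists shape. The splitting rule is a rough heuristic and will mis-split abbreviations such as "et al.".

```python
import re

def split_sentences(text):
    """Split raw text into sentences, each returned as a list of word tokens."""
    # Break after '.', '!' or '?' when followed by whitespace (heuristic only).
    raw_sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    for raw in raw_sentences:
        # [\w']+ keeps letters, digits and apostrophes together, so a
        # contraction like "Mae'r" stays one token; punctuation is dropped.
        words = re.findall(r"[\w']+", raw)
        if words:
            sentences.append(words)
    return sentences

print(split_sentences("Mae'r tŷ yn fawr. Mae'r dŵr yn oer!"))
```

Because `\w` is Unicode-aware in Python 3, accented Welsh letters are treated as word characters rather than being stripped, which a pattern like `[^a-zA-Z]` would not guarantee.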

#### 2. Define Sentence Similarity function

In [10]:
def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []
 
    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]
 
    all_words = list(set(sent1 + sent2))
 
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
 
    # build the vector for the first sentence
    for w in sent1:
        if w not in stopwords:
            vector1[all_words.index(w)] += 1
 
    # build the vector for the second sentence
    for w in sent2:
        if w not in stopwords: 
            vector2[all_words.index(w)] += 1
    
    return 1 - cosine_distance(vector1, vector2)
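
To make the measure concrete, here is a dependency-free sketch of the same count-vector cosine similarity. The name `cosine_similarity` and the zero-vector guard are assumptions added for illustration, not part of the notebook: when a sentence consists only of stop words its vector is all zeros, and the cosine is then undefined, so this sketch returns 0.0 in that case.

```python
import math
from collections import Counter

def cosine_similarity(sent1, sent2, stop_words=()):
    """Cosine similarity between two tokenized sentences (lists of words)."""
    stop = set(stop_words)
    counts1 = Counter(w.lower() for w in sent1 if w.lower() not in stop)
    counts2 = Counter(w.lower() for w in sent2 if w.lower() not in stop)

    # Dot product over the words that appear in both sentences.
    dot = sum(counts1[w] * counts2[w] for w in counts1.keys() & counts2.keys())
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))

    # Assumption: treat an all-stop-word (zero-vector) sentence as dissimilar
    # to everything instead of dividing by zero.
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# Two sentences sharing one of their two content words score roughly 0.5.
print(cosine_similarity(["The", "cat", "sat"], ["The", "cat", "ran"], ["the"]))
```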

#### 3. Define the Sentence Matrix function

In [11]:
def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
 
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 != idx2: #ignore if both are same sentences
                similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)
    return similarity_matrix
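
Cosine similarity is symmetric, so each pair only needs to be scored once. The sketch below illustrates that idea with plain nested lists; `build_symmetric_matrix` and the stand-in measure `word_overlap` are hypothetical names introduced here, and any two-argument similarity function could be passed in instead.

```python
def build_symmetric_matrix(sentences, similarity):
    """Pairwise similarity matrix as nested lists, with a zero diagonal."""
    n = len(sentences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Only visit pairs with j > i and mirror the score across the diagonal.
        for j in range(i + 1, n):
            score = similarity(sentences[i], sentences[j])
            matrix[i][j] = score
            matrix[j][i] = score
    return matrix

def word_overlap(sent1, sent2):
    """Stand-in measure for the example: Jaccard overlap of the two word sets."""
    set1, set2 = set(sent1), set(sent2)
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0

for row in build_symmetric_matrix([["a", "b"], ["b", "c"], ["d"]], word_overlap):
    print(row)
```

This halves the number of similarity calls, which matters mainly for long articles since the work still grows with the square of the sentence count.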

#### 4. Generate summary method

In [28]:
def generate_summary(file_name, top_n=5):
    stop_words = stopwords.words('english')
    summarize_text = []

    # Step 1 - Read text and tokenize
    sentences = read_article(file_name)

    # Step 2 - Generate similarity matrix across sentences
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)

    # Step 3 - Rank sentences in similarity matrix
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)

    # Step 4 - Sort by rank and pick the top sentences
    ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    # print("Indexes of top ranked_sentence order are ", ranked_sentence)

    # Guard against top_n exceeding the number of sentences
    for i in range(min(top_n, len(ranked_sentence))):
        summarize_text.append(" ".join(ranked_sentence[i][1]))

    # Step 5 - Output the summarized text
    summarize_text = '.\n'.join(summarize_text)
    print(f"Summarize Text: \n{summarize_text}")
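
The ranking in Step 3 is delegated to `nx.pagerank`. To show roughly what that step computes, the following is a simplified power-iteration sketch over a weighted similarity matrix given as nested lists. It is an illustration under stated assumptions, not the networkx implementation: `rank_sentences` and `top_indices` are hypothetical helpers, and the damping factor of 0.85 mirrors the networkx default.

```python
def rank_sentences(matrix, damping=0.85, iterations=100, tol=1e-8):
    """Power-iteration PageRank over a weighted similarity matrix."""
    n = len(matrix)
    if n == 0:
        return []
    # Normalize each row so its outgoing weights sum to 1. A row of zeros
    # (a sentence similar to nothing) spreads its score evenly instead.
    transition = []
    for row in matrix:
        total = sum(row)
        transition.append([w / total for w in row] if total > 0 else [1.0 / n] * n)

    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * transition[j][i] for j in range(n))
            for i in range(n)
        ]
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores

def top_indices(scores, top_n):
    """Indices of the top_n highest scores, returned in document order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_n])

# Sentence 0 is similar to both others, so it should rank highest.
scores = rank_sentences([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
print(top_indices(scores, 1))
```

Note one design difference from `generate_summary`: `top_indices` returns the selected sentences in their original document order rather than in rank order, which tends to read more naturally in an extractive summary.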

#### 5. Testing the summariser

In [29]:
generate_summary("automatic_summarisation.txt", 3)

Summarize Text: 
This expanding availability of documents has demanded exhaustive research in the area of automatic text summarization.
Automatic text summarization gained attraction as early as the 1950s.
Automatic text summarization is the task of producing a concise and fluent summary while preserving key information content and overall meaning


## Transformer-based Summariser

In [32]:
# !pip install transformers #use if necessary
from transformers import pipeline

In [33]:
summariser = pipeline('summarization')

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


In [41]:
ARTICLE = open("automatic_summarisation.txt", "r", encoding="utf8").read()
ARTICLE

'This expanding availability of documents has demanded exhaustive research in the area of automatic text summarization. According to Radef et al. [53], a summary is defined as “a text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually, significantly less than that”. Automatic text summarization is the task of producing a concise and fluent summary while preserving key information content and overall meaning. In recent years, numerous approaches have been developed for automatic text summarization and applied widely in various domains. For example, search engines generate snippets as the previews of the documents [73]. Other examples include news websites which produce condensed descriptions of news topics usually as headlines to facilitate browsing or knowledge extractive approaches [8, 61, 72]. Automatic text summarization is very challenging, because when we as humans

In [40]:
print(summariser(ARTICLE, max_length=130, min_length=30, do_sample=False))

[{'summary_text': ' A summary is defined as “a text that conveys important information in the original text(s) and that is no longer than half of the original texts’s . This expanding availability of documents has demanded exhaustive research in the area of automatic text summarization . The task of producing a concise and fluent summary while preserving key information content .'}]
