<a href="https://colab.research.google.com/github/ciepielajan/NLP_Text-Summarization/blob/main/3_0_Text_Summarizaton_Word_Frequency.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

3_0_Text Summarizaton_Word_Frequency.ipynb

`loading libraries`

In [16]:
import pandas as pd
# import math
import re
import nltk
nltk.download('punkt')
from nltk import sent_tokenize, word_tokenize, PorterStemmer
nltk.download('stopwords')
from nltk.corpus import stopwords
stopWords = set(stopwords.words("english"))
ps = PorterStemmer()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
!pip -q install wikipedia    # -q = quiet (hide output)
import wikipedia

  Building wheel for wikipedia (setup.py) ... done


`raw text`

In [3]:
text_row = wikipedia.page("Machine Learing").content
# text_row

In [4]:
print("format: ", type(text_row))
print("number of characters: ", len(text_row))
print("number of words: ", len(text_row.split(" ")))
print("number of sentences: ", len(sent_tokenize(text_row)))

format:  <class 'str'>
number of characters:  45879
number of words:  6956
number of sentences:  263


`text cleaning`

> Wikipedia marks section headings using the pattern:

'=== SOME TITLE ==='

First, let's find all the section headings.

In [5]:
wzorzec = r"\=+\s+[\w\s\-]+\=+"
print("matches found: ", len(re.findall(wzorzec,text_row)))
print(re.findall(wzorzec,text_row))

matches found:  46
['== Overview ==', '=== Machine learning approaches ===', '== History and relationships to other fields ==', '=== Artificial intelligence ===', '=== Data mining ===', '=== Optimization ===', '=== Generalization ===', '=== Statistics ===', '== Theory ==', '== Approaches ==', '=== Types of learning algorithms ===', '==== Supervised learning ====', '==== Unsupervised learning ====', '==== Semi-supervised learning ====', '==== Reinforcement learning ====', '==== Self learning ====', '==== Feature learning ====', '==== Sparse dictionary learning ====', '==== Anomaly detection ====', '==== Robot learning ====', '==== Association rules ====', '=== Models ===', '==== Artificial neural networks ====', '==== Decision trees ====', '==== Support-vector machines ====', '==== Regression analysis ====', '==== Bayesian networks ====', '==== Genetic algorithms ====', '=== Training models ===', '==== Federated learning ====', '== Applications ==', '== Limitations ==', '=== Bi

In [6]:
text = re.sub(wzorzec, " ",text_row) # remove section headings
text = re.sub(r'\n+', " ",text)  # remove newline characters
text = re.sub(r'\s{2,}', " ",text) # collapse repeated whitespace into one space
text = text.strip() # strip leading and trailing whitespace
# text

In [7]:
print("format: ", type(text))
print("number of characters: ", len(text))
print("number of words: ", len(text.split(" ")))
print("number of sentences: ", len(sent_tokenize(text)))

format:  <class 'str'>
number of characters:  44241
number of words:  6603
number of sentences:  263
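
The cleaning steps above can be tried on a small string before running them on a full article. The sketch below wraps the same four substitutions in a helper; the name `clean_wikipedia_text` and the sample text are assumptions made for illustration, not part of the notebook. Note that the heading pattern only allows word characters, whitespace and hyphens in a title, so headings containing other punctuation would probably be left in place.

```python
import re

# Same heading pattern as in the notebook: runs of '=' around a title.
HEADING_PATTERN = r"\=+\s+[\w\s\-]+\=+"


def clean_wikipedia_text(raw: str) -> str:
    """Hypothetical helper: strip headings, newlines and repeated whitespace."""
    text = re.sub(HEADING_PATTERN, " ", raw)  # remove section headings
    text = re.sub(r"\n+", " ", text)          # remove newline characters
    text = re.sub(r"\s{2,}", " ", text)       # collapse repeated whitespace
    return text.strip()


# Made-up sample in the same layout as wikipedia's page content.
sample = (
    "Intro sentence.\n\n\n== Overview ==\nFirst point."
    "\n\n=== Data mining ===\nSecond point."
)
cleaned = clean_wikipedia_text(sample)
print(cleaned)  # Intro sentence. First point. Second point.
```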


`Word_Frequency`

`1 Sentence Tokenize`

In [8]:
def _create_frequency_table(text_string) -> dict:
    """
    Build a word-frequency table for the text.
    Each token is stemmed with the Porter stemmer (an algorithm that reduces a word
    to its root form and lowercases it); stems found in the stop-word set are skipped.
    Note: the stop-word check runs on the stem, so a stop word whose stem differs
    from the word itself (e.g. "this" -> "thi") is still counted.
    :rtype: dict
    """
    stopWords = set(stopwords.words("english"))
    words = word_tokenize(text_string)
    ps = PorterStemmer()

    freqTable = dict()
    for word in words:
        word = ps.stem(word)
        if word in stopWords:
            continue
        # count the stem, starting from 0 the first time it is seen
        freqTable[word] = freqTable.get(word, 0) + 1

    return freqTable
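
Because the function above compares the stem against the stop-word set, one possible variant is to filter stop words first and stem afterwards. The following is a minimal sketch of that ordering using only the standard library; `build_frequency_table`, `TOY_STOP_WORDS`, the regex tokenizer and the `stem` parameter are all assumptions introduced here, and the results will differ from the notebook's table.

```python
import re
from collections import Counter
from typing import Callable, Iterable

# Tiny illustrative stop-word set (an assumption; the notebook uses NLTK's English list).
TOY_STOP_WORDS = {"this", "is", "a", "the", "of", "and"}


def build_frequency_table(
    text: str,
    stop_words: Iterable[str] = TOY_STOP_WORDS,
    stem: Callable[[str], str] = lambda w: w,  # identity by default
) -> Counter:
    """Hypothetical variant: drop stop words first, then stem and count."""
    stop_words = set(stop_words)
    # Simple regex tokenizer (an assumption): lowercase runs of letters/digits,
    # which also skips punctuation tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(stem(tok) for tok in tokens if tok not in stop_words)


table = build_frequency_table(
    "This is a model. The model learns, and the data is the fuel of this model."
)
print(table["model"])   # 3
print("this" in table)  # False
```

In the notebook's setting, `stem=PorterStemmer().stem` and `stop_words=stopwords.words("english")` could be passed in place of the toy defaults.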

In [9]:
def _score_sentences(sentences, freqTable) -> dict:
    """
    Score each sentence by the words it contains.
    Basic algorithm: sum the frequencies of every freqTable entry that occurs in the
    sentence, then divide by the number of entries that matched.
    Without the division, long sentences would have an advantage over short ones.
    :rtype: dict
    """

    sentenceValue = dict()

    for sentence in sentences:
        # The key is the first 10 characters of the sentence. This keeps the keys
        # short, but sentences that start with the same 10 characters share one entry.
        key = sentence[:10]
        matched_word_count = 0
        for wordValue in freqTable:
            # Substring test against the lowercased sentence, so an entry also
            # matches when it appears inside a longer word.
            if wordValue in sentence.lower():
                matched_word_count += 1
                sentenceValue[key] = sentenceValue.get(key, 0) + freqTable[wordValue]

        if key in sentenceValue:
            sentenceValue[key] = sentenceValue[key] / matched_word_count

    return sentenceValue
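
Two properties of the scoring above are worth seeing on a toy input: keys built from `sentence[:10]` collide for sentences with the same opening, and the `in` test matches substrings rather than whole words. The sketch below is one possible alternative, keyed by sentence position and matching whole tokens; the function name, the regex tokenizer and the toy frequency table are assumptions, and unlike the notebook it counts a repeated word once per occurrence.

```python
import re
from typing import Dict, List


def score_sentences_by_index(
    sentences: List[str], freq_table: Dict[str, int]
) -> Dict[int, float]:
    """Hypothetical variant: key by sentence position and match whole tokens."""
    scores: Dict[int, float] = {}
    for idx, sentence in enumerate(sentences):
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        matched = [tok for tok in tokens if tok in freq_table]
        if matched:  # sentences with no scored words get no entry (no division by zero)
            scores[idx] = sum(freq_table[tok] for tok in matched) / len(matched)
    return scores


# Both sentences start with the same 10 characters ("In machine"),
# so they would share a single key in a sentence[:10] dictionary.
sentences = [
    "In machine learning, models learn.",
    "In machine translation, text is mapped.",
]
# Made-up frequencies; keys are assumed to be in the same form as the tokens.
freq_table = {"machine": 4, "learning": 3, "models": 2, "learn": 2,
              "translation": 1, "text": 1}

scores = score_sentences_by_index(sentences, freq_table)
print(scores)  # {0: 2.75, 1: 2.0}
```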

In [10]:
def _find_average_score(sentenceValue) -> float:
    """
    Find the average score from the sentence value dictionary
    :rtype: float
    """
    sumValues = 0
    for entry in sentenceValue:
        sumValues += sentenceValue[entry]

    # Average score per sentence key in the original text
    average = (sumValues / len(sentenceValue))

    return average

In [11]:
def _generate_summary(sentences, sentenceValue, threshold):
    """
    Concatenate, in original order, the sentences whose score reaches the threshold.
    """
    summary = ''

    for sentence in sentences:
        if sentence[:10] in sentenceValue and sentenceValue[sentence[:10]] >= (threshold):
            summary += " " + sentence

    return summary
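
The selection step can be illustrated on a few made-up sentences. This sketch assumes scores keyed by sentence position (an assumption; the notebook keys them by the first 10 characters) and uses `str.join`, which avoids the leading space that repeated `" " + sentence` leaves at the start of the summary. The function name and the toy scores are hypothetical.

```python
from typing import Dict, List


def generate_summary(
    sentences: List[str], scores: Dict[int, float], threshold: float
) -> str:
    """Hypothetical variant: keep sentences at or above the threshold, in order."""
    kept = [
        sentence
        for idx, sentence in enumerate(sentences)
        if idx in scores and scores[idx] >= threshold
    ]
    return " ".join(kept)


sentences = ["Alpha is first.", "Beta is second.", "Gamma is third."]
scores = {0: 3.0, 1: 1.0, 2: 2.5}  # made-up scores for illustration

summary = generate_summary(sentences, scores, threshold=2.0)
print(summary)  # Alpha is first. Gamma is third.
```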

`Calling all the functions`

In [12]:
def run_summarization(text):
    # 1 Create the word frequency table
    freq_table = _create_frequency_table(text)

    '''
    We already have a sentence tokenizer, so we just need 
    to run the sent_tokenize() method to create the array of sentences.
    '''

    # 2 Tokenize the sentences
    sentences = sent_tokenize(text)

    # 3 Important Algorithm: score the sentences
    sentence_scores = _score_sentences(sentences, freq_table)

    # 4 Find the threshold
    threshold = _find_average_score(sentence_scores)

    # 5 Important Algorithm: Generate the summary
    summary = _generate_summary(sentences, sentence_scores, 1.6 * threshold)

    return summary
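
The threshold in step 5 is the average sentence score multiplied by 1.6, so only sentences well above average are kept. A small worked example with made-up scores shows the effect of that multiplier; the keys and values below are assumptions chosen to keep the arithmetic easy to follow.

```python
from statistics import mean

# Toy sentence scores (made-up numbers, for illustration only).
toy_scores = {"s1": 10.0, "s2": 20.0, "s3": 30.0, "s4": 60.0}

average = mean(toy_scores.values())  # (10 + 20 + 30 + 60) / 4 = 30.0
threshold = 1.6 * average            # same multiplier as run_summarization, about 48

# Only scores at or above the threshold survive.
selected = [key for key, score in toy_scores.items() if score >= threshold]
print(selected)  # ['s4']

# With a multiplier of 1.0 the cut-off is the plain average, and more sentences pass.
selected_at_average = [key for key, score in toy_scores.items() if score >= average]
print(selected_at_average)  # ['s3', 's4']
```

Raising the multiplier shortens the summary, and lowering it lengthens the summary; the value 1.6 is the notebook's choice rather than a fixed rule.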

In [13]:
summarization = run_summarization(text)
summarization

' In the early days of AI as an academic discipline, some researchers were interested in having machines learn from data. Some statisticians have adopted methods from machine learning, leading to a combined field that they call statistical learning. If the complexity of the model is increased in response, then the training error decreases. The data is known as training data, and consists of a set of training examples. The algorithms, therefore, learn from test data that has not been labeled, classified or categorized. In machine learning, the environment is typically represented as a Markov decision process (MDP). In supervised feature learning, features are learned using labeled input data. It is one of the predictive modeling approaches used in statistics, data mining, and machine learning. In machine learning, genetic algorithms were used in the 1980s and 1990s. Usually, machine learning models require a lot of data in order for them to perform well.'

In [14]:
result = {
  "id":"3_0",
  "character count": len(summarization),
  "word count": len(summarization.split(" ")),
  "sentence count": len(sent_tokenize(summarization))
}

In [17]:
print(pd.DataFrame.from_dict(result, orient='index').T.set_index("id").to_markdown())

|   id |   character count |   word count |   sentence count |
|-----:|------------------:|-------------:|-----------------:|
|  3_0 |               966 |          153 |               10 |
