## Luhn Algorithm Application

The Luhn algorithm is a text summarization technique that uses statistical properties of the text to identify and extract the most important sentences from a document. The algorithm was developed by H.P. Luhn in the 1950s, and is still widely used in various forms today.

Below I summarize the topic of fructose malabsorption by generating a summary using the Luhn algorithm. To create the summary, I selected several articles from sources like Wikipedia and PubMed. The important words were selected based on their total frequency in all of the text. I chose the top 25 words to focus on, and then used the algorithm to identify the most important sentences based on the frequency and distribution of these words. The summary was generated using the top 15 sentences.

#### Import libraries

In [1]:
import re
import nltk
import string
import heapq
import wikipedia
import requests
from bs4 import BeautifulSoup

nltk.download('punkt')
nltk.download('stopwords')
wikipedia.set_lang("en")

# get article text
from goose3 import Goose

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Joukovaa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Joukovaa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Get pubmed and wiki articles on the topic of fructose malabsorption

In [2]:
# article list
article_list = ['https://pubmed.ncbi.nlm.nih.gov/11336160/',
                'https://pubmed.ncbi.nlm.nih.gov/28965810/'
                ]
wiki_article_title = "Fructose malabsorption"

page = wikipedia.page(wiki_article_title)
wiki_article_text = page.content

article_text_list = []
article_text_list.append(wiki_article_text)

#### Get articles' text

In [3]:
print(wiki_article_text[:700])

Fructose malabsorption, formerly named dietary fructose intolerance (DFI), is a digestive disorder in which absorption of fructose is impaired by deficient fructose carriers in the small intestine's enterocytes. This results in an increased concentration of fructose. Intolerance to fructose was first identified and reported in 1956.Similarity in symptoms means that patients with fructose malabsorption often fit the profile of those with irritable bowel syndrome.Fructose malabsorption is not to be confused with hereditary fructose intolerance, a potentially fatal condition in which the liver enzymes that break up fructose are deficient. Hereditary fructose intolerance is quite rare, affecting


In [4]:
for url in article_list:
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        article_html = str(soup.find_all('div', {'class': 'abstract-content selected'})[0])
        g = Goose()
        article = g.extract(raw_html=article_html)
        print("\ncurrent url: %s" % url)
        print(article.cleaned_text)
        article_text_list.append(article.cleaned_text)
    except Exception as e:
        print(f"Error: Could not find article in {url}: {e}")


current url: https://pubmed.ncbi.nlm.nih.gov/11336160/
Background: Fructose malabsorption is characterized by the inability to absorb fructose efficiently. As a consequence fructose reaches the colon where it is broken down by bacteria to short fatty acids, CO2, H2, CH4 and lactic acid. Bloating, cramps, osmotic diarrhea and other symptoms of irritable bowel syndrome are the consequence and can be seen in about 50% of fructose malabsorbers. Recently it was found that fructose malabsorption was associated with early signs of depressive disorders. Therefore, it was investigated whether fructose malabsorption is associated with abnormal tryptophan metabolism.

Methods: Fifty adults (16 men, 34 women) with gastrointestinal discomfort were analyzed by measuring breath hydrogen concentrations after an oral dose of 50 g fructose after an overnight fast. They were classified as normals or fructose malabsorbers according to their breath H2 concentrations. All patients filled out a Beck depress

#### Get English stopwords
Stopwords are common words like 'the', 'a', 'and', etc., that are considered to be of little value in text analysis because they occur frequently in the language and do not carry significant meaning. Removing stopwords helps in reducing the dimensionality of the data, speeding up the analysis, and improving the performance of natural language processing tasks such as text summarization, topic modeling, and text classification. It helps the model to focus on the important words, which are usually more indicative of the content of the text.

In [5]:
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

#### Preprocess text
The preprocess function cleans a given text by converting it to lowercase, tokenizing it into words, and removing stopwords and punctuation. The cleaned and processed text is then returned as a single string. This function is essential for preparing text data for further natural language processing tasks.

In [6]:
def preprocess(text):
    text = re.sub(r'[^a-zA-Z0-9 ]', '', text)
    formatted_text = text.lower()
    tokens = []
    for token in nltk.word_tokenize(formatted_text):
        tokens.append(token)
    tokens = [word for word in tokens if word not in stopwords and word not in string.punctuation]
    formatted_text = ' '.join(element for element in tokens)
    
    return formatted_text

formatted_article_list = [preprocess(article_text) for article_text in article_text_list]
print(formatted_article_list[0][0:750])

fructose malabsorption formerly named dietary fructose intolerance dfi digestive disorder absorption fructose impaired deficient fructose carriers small intestines enterocytes results increased concentration fructose intolerance fructose first identified reported 1956similarity symptoms means patients fructose malabsorption often fit profile irritable bowel syndromefructose malabsorption confused hereditary fructose intolerance potentially fatal condition liver enzymes break fructose deficient hereditary fructose intolerance quite rare affecting 1 20000 30000 people symptoms signs fructose malabsorption may cause gastrointestinal symptoms abdominal pain bloating flatulence diarrhea pathophysiology fructose absorbed small intestine without h


#### Calculate sentences score function
The calculate_sentences_score function calculates the importance score of each sentence in a list of sentences.

It takes three parameters: sentences (a list of sentences), important_words (a list of words considered important), and distance (a threshold for grouping words in a sentence).
It tokenizes each sentence into words.
For each sentence, it finds the indices of the important words in the sentence.
It groups the indices of the important words based on the distance parameter. For example, if distance is 2, then indices [0, 1, 5] will be grouped into [[0, 1], [5]].
For each group of indices, it calculates a score based on the number of important words in the group and the total number of words in the group.
The maximum score among all groups in a sentence is considered as the score of the sentence.
It returns a list of tuples, each containing the score of a sentence and its index in the list of sentences.
This function is used to rank the sentences in the text based on the occurrence of important words, which is useful for text summarization.

In [7]:
# Function to calculate sentences score
def calculate_sentences_score(sentences, important_words, distance):
  scores = []
  sentence_index = 0

  for sentence in [nltk.word_tokenize(sentence) for sentence in sentences]:
    #print('------------')
    #print(sentence)

    word_index = []
    for word in important_words:
      #print(word)
      try:
        word_index.append(sentence.index(word))
      except ValueError:
        pass

    word_index.sort()
    #print(word_index)

    if len(word_index) == 0:
      continue

    # [0, 1, 5]
    groups_list = []
    group = [word_index[0]]
    i = 1 # 3
    while i < len(word_index): # 3
      # first execution: 1 - 0 = 1
      # second execution: 2 - 1 = 1
      if word_index[i] - word_index[i - 1] < distance:
        group.append(word_index[i])
        #print('group', group)
      else:
        groups_list.append(group[:])
        group = [word_index[i]]
        #print('group', group)
      i += 1
    groups_list.append(group)
    #print('all groups', groups_list)

    max_group_score = 0
    for g in groups_list:
      #print(g)
      important_words_in_group = len(g)
      total_words_in_group = g[-1] - g[0] + 1
      score = 1.0 * important_words_in_group**2 / total_words_in_group
      #print('group score', score)

      if score > max_group_score:
        max_group_score = score

    scores.append((max_group_score, sentence_index))
    sentence_index += 1

  #print('final scores', scores)
  return scores


#### Funcion to summarize the text
The summarize function generates a summary of a given text.

It takes five parameters: text (the text to be summarized), top_n_words (the number of most common words to consider as important), distance (a threshold for grouping words in a sentence), number_of_sentences (the number of sentences to include in the summary), and percentage (an optional parameter to specify the percentage of sentences to include in the summary).
It tokenizes the text into sentences and words, and preprocesses the sentences to remove stopwords and punctuation.
It calculates the frequency of each word in the text and selects the top_n_words most common words as important words.
It calculates the score of each sentence using the calculate_sentences_score function.
It selects the number_of_sentences highest-scoring sentences, or if percentage is provided, it selects the highest-scoring sentences that make up the specified percentage of the total number of sentences.
It returns the list of all original sentences, the selected best sentences, and the scores of all sentences.
This function is used to generate a summary of the text by selecting the most important sentences based on the occurrence of important words.

In [26]:
# Function to summarize the text
def summarize(text, top_n_words, distance, number_of_sentences, percentage = 0):
    #text = text.replace('==', '.')
    text = text.replace(';', '.').replace('\n', '.').replace('\t', '.')
    text = re.sub(r'\.{2,}', '.', text)
    text = re.sub(r'\.([A-Z])', r'. \1', text)
    text = re.sub(r'[^a-zA-Z0-9 .!?;]', '', text)
    original_sentences = [sentence for sentence in nltk.sent_tokenize(text)]
    
    #print(original_sentences)
    formatted_sentences = [preprocess(original_sentence) for original_sentence in original_sentences]
    #print(formatted_sentences)
    words = [word for sentence in formatted_sentences for word in nltk.word_tokenize(sentence)]
    #print(words)
    frequency = nltk.FreqDist(words)
    #print(frequency)
    #return frequency
    top_n_words = [word[0] for word in frequency.most_common(top_n_words)]
    #print(top_n_words)
    sentences_score = calculate_sentences_score(formatted_sentences, top_n_words, distance)
    #print(sentences_score)
    if percentage > 0:
      best_sentences = heapq.nlargest(int(len(formatted_sentences) * percentage), sentences_score)
    else:  
      best_sentences = heapq.nlargest(number_of_sentences, sentences_score)
    #print(best_sentences)
    best_sentences_fmtd = [original_sentences[i] for (score, i) in best_sentences]
    
    return original_sentences, best_sentences_fmtd, sentences_score

original_sentences, best_sentences, sentences_score = summarize(article_text_list[0], 25, 2, 10)

print('Best sentences:\n')
for sentence in best_sentences:
    print(sentence)

Best sentences:

Fructose malabsorption may cause gastrointestinal symptoms such as abdominal pain bloating flatulence or diarrhea.
Foods with 3 g of fructose per serving are termed a high fructose load and possibly present a risk of inducing symptoms.
This can cause some surprises and pitfalls for fructose malabsorbers.
Foodlabeling .
Other fruits ripe banana jackfruit passion fruit pineapple rhubarb tamarillo.
Citrus fruit kumquat grapefruit lemon lime mandarin orange tangelo.
Berry fruit blackberry boysenberry cranberry raspberry strawberry loganberry.
Stone fruit apricot nectarine peach plum caution  these fruits contain sorbitol.
The following list of favorable foods was cited in the paper Fructose malabsorption and symptoms of Irritable Bowel Syndrome Guidelines for effective dietary management.
Foods containing added sugars such as agave nectar some corn syrups and fruit juice concentrates.


#### Concatenate all the articles into a single string

In [28]:
# Concatenate all the articles into a single string
all_text = ""
for article_text in article_text_list:
    all_text += article_text

#### Summarize the concatenated text

In [31]:
# Summarize the concatenated text
original_sentences, best_sentences, sentences_score = summarize(all_text, 25, 2, number_of_sentences=25)

In [32]:
print('Best sentences:\n')
for sentence in best_sentences:
    print(sentence)

Best sentences:

Fructose malabsorption may cause gastrointestinal symptoms such as abdominal pain bloating flatulence or diarrhea.
Nevertheless its longterm followup can have negative effects because it causes a detrimental impact on the gut microbiota and metabolome.
This can cause some surprises and pitfalls for fructose malabsorbers.
Foodlabeling .
Other fruits ripe banana jackfruit passion fruit pineapple rhubarb tamarillo.
Citrus fruit kumquat grapefruit lemon lime mandarin orange tangelo.
Berry fruit blackberry boysenberry cranberry raspberry strawberry loganberry.
Stone fruit apricot nectarine peach plum caution  these fruits contain sorbitol.
The following list of favorable foods was cited in the paper Fructose malabsorption and symptoms of Irritable Bowel Syndrome Guidelines for effective dietary management.
Researchers at Monash University in Australia developed dietary guidelines for managing fructose malabsorption particularly for individuals with IBS.
Glucose enhances abs

#### Results Discussion
The final implementation of the summarize function successfully processes and summarizes the given text, accounting for various sentence delimiters (.;\n\t) and extracting the most relevant sentences.

Here's a summary of the process and the findings:

The entire corpus of articles was concatenated into a single text string.

The summarize function was then applied to this text. The function first replaced various sentence delimiters with periods and ensured that there is a space after each period. It then removed all non-alphanumeric characters except for the standard punctuation marks (.,!?, and ;).

The nltk.sent_tokenize function was used to split the text into sentences.

The function then processed the sentences, tokenized them, and created a frequency distribution of the words.

The 25 most common words were selected, and a score was assigned to each sentence based on the frequency of these common words and their distance from each other.

The 25 highest-scoring sentences were then selected as the most relevant sentences.

The result is a collection of sentences that highlight the key aspects of the corpus, such as the symptoms of fructose malabsorption, its impacts on gut microbiota and mental health, the role of fructose-to-glucose ratio, foods with high fructose load, and dietary management strategies. The output includes specific details about foods, symptoms, and dietary guidelines. Some sentences like "Foodlabeling ." and "Gastroenterology." seem out of context, but this could be due to the lack of context in the original text or the fact that these terms appear frequently in the text, thereby getting a high score.

Overall, the summarization function works well for the given text, but there's always room for improvement. For example, the scoring mechanism could be improved by using more advanced techniques such as TF-IDF (Term Frequency-Inverse Document Frequency) or by incorporating a pre-trained language model. Additionally, the function could be modified to handle other non-standard sentence delimiters and formatting issues.