<a href="https://colab.research.google.com/github/georgiacc/week-8/blob/main/text_sumarizier_extended.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Summarisation

One of the themes has been to summarise online content and produce a report throughout the semester. Our solution is to convert everything to text, summarise the text, and use the summary in the final report. This notebook is another step in the process. The lessons from this notebook are:
* understanding how to **reference** online tutorials, videos etc.
* extending and reusing working code. In this case, a text summariser.
* build on and use advanced concepts without implementing them, in this case, machine learning.

> This notebook contains a lot of exploration. If you want to see the final answer, jump to the bottom of the notebook.

# Build a summariser

This section is based on the YouTube video [AI Text Summarization with Hugging Face Transformers in 4 Lines of Python](https://youtu.be/TsfLm5iiYb4)

As Information Systems professionals, we use our skills to be aware of advanced concepts and think about how you can meet the organisational Using *Hugging Face Transformers*, you can leverage a pre-trained summarisation pipeline to start summarising content. In this section, we will:
1. Installing Hugging Face Transformers
2. Building a summarisation pipeline
3. Run model/pipeline to summarisation
4. **Investigate way to reuse the pipeline**

> [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) free state-of-the-art pre-trained machine learning models for processing text, images, audio and video. See the project website for more information.


In [None]:
# Install Dependencies
!pip install transformers -q

In [None]:
# import libraries
from transformers import pipeline

# load sumarisation pipeline
summary_pipeline = pipeline("summarization", model="facebook/bart-large-cnn")


# Let us copy-n-paste some text
article = '''
A lack of transparency and reporting standards in the scientifc community has led to increasing and widespread
concerns relating to reproduction and integrity of results. As an omics science, which generates vast amounts of data and
relies heavily on data science for deriving biological meaning, metabolomics is highly vulnerable to irreproducibility. The
metabolomics community has made substantial eforts to align with FAIR data standards by promoting open data formats,
data repositories, online spectral libraries, and metabolite databases. Open data analysis platforms also exist; however,
they tend to be infexible and rely on the user to adequately report their methods and results. To enable FAIR data science
in metabolomics, methods and results need to be transparently disseminated in a manner that is rapid, reusable, and fully
integrated with the published work. To ensure broad use within the community such a framework also needs to be inclusive
and intuitive for both computational novices and experts alike
'''

# Run the summariser pipeline
summary = summary_pipeline(article, max_length = 50, min_length= 20)

# What does a summary look like?
print(summary)

# By inspection of output, 'summary' is a list.  The first element of the list is a dictionary.
# The key to the dictionary is 'summary_text'.

# Extract and display the summarised text
text = summary[0]['summary_text'] # get first element, then extract the value for key 'summary text
print(text)

Okay that seemed to work.  How can we reuse this code?  How about we create a function that take an 'article' and returns a summary

In [None]:
from transformers import pipeline

def summarise(article):
  summary_pipeline = pipeline("summarization", model="facebook/bart-large-cnn")
  summary = summary_pipeline(article, max_length = 50, min_length= 20)
  text = summary[0]['summary_text'] # get first element, then extract the value for key 'summary text
  return text


A quick test.

In [None]:
some_text = '''
A lack of transparency and reporting standards in the scientifc community has led to increasing and widespread
concerns relating to reproduction
'''

print(summarise(some_text))

Umm... it worked, but with a warning on max_length.   We could reduce the max length or add a check that we have at least 50 words.  Our reasoning (design decision) is that it doesn't really make sense to sumarise say one sentance. We could pick any minimun size, but 50 seems like a good number.

But first, how do I count words in a string?  We did something like this in an earlier notebook where we counted the spaces.  We could search the internat for some code snippets.  We can use the the string method `split()`.

In [None]:
help(str.split)

So `split()` returns a list of words.  The `len()` of the list will be the word count.  Let us try it.


In [None]:
some_text = '''
A lack of transparency and reporting standards in the scientifc community has led to increasing and widespread
concerns relating to reproduction
'''

count = len(some_text.split())
print(count)

Let us update the function to include this check.  We will also add a doc string.  I choosen to use an `assert` statement, but you could do something similar with an `if` statement.

In [None]:
from transformers import pipeline

def summarise(article):
  '''Returns a summary of a text.  The length of the text has to be greater than 50 words'''
  assert len(article.split()) > 50, 'Please make sure your text has at least 50 words'

  summary_pipeline = pipeline("summarization", model="facebook/bart-large-cnn")
  summary = summary_pipeline(article, max_length = 50, min_length= 20)
  text = summary[0]['summary_text'] # get first element, then extract the value for key 'summary text
  return text

In [None]:
some_text = '''
A lack of transparency and reporting standards in the scientifc community has led to increasing and widespread
concerns relating to reproduction
'''

print(summarise(some_text))

Great the assertion worked.

In [None]:
bigger_text='''
A lack of transparency and reporting standards in the scientifc community has led to increasing and widespread
concerns relating to reproduction and integrity of results. As an omics science, which generates vast amounts of data and
relies heavily on data science for deriving biological meaning, metabolomics is highly vulnerable to irreproducibility. The
metabolomics community has made substantial eforts to align with FAIR data standards by promoting open data formats,
data repositories, online spectral libraries, and metabolite databases.
'''

print(summarise(bigger_text))

Okay that is working well. Let us start to use our hard work

How about we summarise each page of a PDF.

    Get/Download the PDF
    for each page in the PDF
        extract the text
        summarise the text

Google search find a simple tutorial [How to Extract Text From PDF File In Python - PyMuPDF](https://youtu.be/RQTiyQzowLQ)

# Summarise PDF

In [None]:
# Setup requirements

# Get a PDF
# Google Scholar 'Toward collaborative open data science in metabolomics using Jupyter Notebooks and cloud computing'
!wget https://link.springer.com/content/pdf/10.1007/s11306-019-1588-0.pdf

# Install required packages
!pip install PyMuPDF -q

> **Note: There is maximum sequence length for the default model.  As we are only demonstrating the concept we will truncate text size to 400 words**

In [None]:
# Import libraries
import fitz

# Use PDF downloaded
pdf = "s11306-019-1588-0.pdf"
doc = fitz.open(pdf)
for page in doc:
  article = page.get_text("Text")
  # The model has a limit size, first 400 words on eachpage
  # Implement a better solution.  So split the body of
  # text into word, take the first 400 words and the join
  # the words into a body of text
  article = ' '.join(article.split()[:400])
  # Run the summariser pipeline
  text = summarise(article)
  print(text)

Lets make it a function.

In [54]:
  import fitz

def sumarise_pdf(pdf):
  ''' Sumarise the first 400 words on each page of the PDF'''

  doc = fitz.open(pdf)
  for page in doc:
    article = page.get_text("Text")
    # The model has a limit size, first 400 words on eachpage
    # Implement a better solution.  So split the body of
    # text into word, take the first 400 words and the join
    # the words into a body of text
    article = ' '.join(article.split()[:400])
    # Run the summariser pipeline
    text = summarise(article)
    return text

IndentationError: expected an indented block after function definition on line 3 (<ipython-input-54-0bdaa66031ab>, line 4)

What would you have to do to make this sumarise a folder with many PDFs?

1. Get the list of files in the directory
2. for each file in the list, call `summarise_pdf()`

How would you save the summaries? What information should you save?  Probably need to know the source document, page number and the summary of the page.  Maybe a Python dictionary or SQL database?

# Scrape text from webpage

Lets use the pipeline to summarise a web page.  Another google search and after looking at a few onlien articles, YouTube videos I settled on this page: [2 Ways to Extract Text From HTML Using Python](https://computersciencehub.io/python/extract-text-from-html-using-python/)


In [None]:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

# get webpage
req = Request("https://en.wikipedia.org/wiki/Python_(programming_language)")
html_page = urlopen(req)

soup = BeautifulSoup(html_page, features="html.parser")

# remove all 'script' and 'style' elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines =  text.splitlines()

# remove empty lines
lines = [x for x in lines if x]

# combine into one body of text
text = ' '.join(lines)
# split into words
text = text.split()
# get first 400 words
text = text[:400]
# join words into text
text = ' '.join(text)

summarise(text)

Lets make it a function!



In [None]:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

def summarise_webpage(URL):
  ''' Sumarise the first 400 words on a website'''
  # get webpage
  req = Request(URL)
  html_page = urlopen(req)
  soup = BeautifulSoup(html_page, features="html.parser")

  # remove all 'script' and 'style' elements
  for script in soup(["script", "style"]):
      script.extract()    # rip it out

  text = soup.get_text() # get text
  lines =  text.splitlines() # break into lines
  lines = [x for x in lines if x] # remove empty lines
  text = ' '.join(lines) # combine into one body of text
  text = text.split() # split into words
  text = text[:400] # get first 400 words
  text = ' '.join(text) # join words into text

  return summarise(text)

text = summarise_webpage("https://en.wikipedia.org/wiki/Python_(programming_language)")
print(text)

What would you have to do to make this sumarise a many URLs?

1. Create a list of URLs
2. for each URL in the list, call `summarise_web_page()`

How would you save the summaries? What information should you save?  Probably need to know the source document, page number and the summary of the page.  Maybe a Python dictionary or SQL database?

# A Final Solution

In [None]:
# Install Dependencies
!pip install transformers -q
!pip install PyMuPDF -q

# Get a PDF
# Google Scholar 'Toward collaborative open data science in metabolomics using Jupyter Notebooks and cloud computing'
!wget https://link.springer.com/content/pdf/10.1007/s11306-019-1588-0.pdf

In [55]:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from transformers import pipeline
import fitz

def summarise(article):
  '''Returns a summary of a text.  The length of the text has to be greater than 50 words'''
  assert len(article.split()) > 50, 'Please make sure your text has at least 50 words'

  summary_pipeline = pipeline("summarization", model="facebook/bart-large-cnn")
  summary = summary_pipeline(article, max_length = 50, min_length= 20)
  text = summary[0]['summary_text'] # get first element, then extract the value for key 'summary text
  return text

def sumarise_pdf(pdf):
  ''' Sumarise the first 400 words on each page of the PDF'''

  doc = fitz.open(pdf)
  for page in doc:
    article = page.get_text("Text")
    # The model has a limit size,
    # first 400 words on eachpage
    article = ' '.join(article.split()[:400])
    # Run the summariser pipeline
    text = summarise(article)
    return text

def summarise_webpage(URL):
  ''' Sumarise the first 400 words on a website'''
  # get webpage
  req = Request(URL)
  html_page = urlopen(req)
  soup = BeautifulSoup(html_page, features="html.parser")

  # remove all 'script' and 'style' elements
  for script in soup(["script", "style"]):
      script.extract()    # rip it out

  text = soup.get_text() # get text
  text =  text.splitlines() # break into lines
  text = [x for x in text if x] # remove empty lines
  text = ' '.join(lines) # combine into one body of text
  text = text.split() # split into words
  text = text[:400] # get first 400 words
  text = ' '.join(text) # join words into text
  return summarise(text)



# Main Program
print("PDF Summary")
print(sumarise_pdf("s11306-019-1588-0.pdf"))

print("Webiste Summary")
print(summarise_webpage("https://en.wikipedia.org/wiki/Python_(programming_language)"))

PDF Summary
Toward collaborative open data science in metabolomics using Jupyter Notebooks and cloud computing. A lack of transparency and reporting standards in the scientific community has led to increasing and widespread concerns relating to reproduction and integrity of results.
Webiste Summary


IndexError: index out of range in self

How can we extend this program?  How about a menu to display different summarising options, e.g. summarise one PDF, summarise many PDF.  The user selects the choice, and your program acts.   Rather than a menu, how about a form interface.  The user can type in a URL or file path and then run the program.

Currently, we are only summarising the first 400 words.  Notice that seem to be repeating logic in `sumarise_pdf()` and `sumarise_web()`.  A better place would be in the `summarise()` function—even better, work out a solution to process the entire body of text.
