# Document Summariser
The goal of this project is to let a user upload a PDF and receive a translated summary of the document. The text is translated from English to Spanish. I originally wanted to translate between English and Dutch, but I could not find a usable translation model for either direction. I therefore chose Spanish, since I find it an interesting language. The target users are people who speak English but not Spanish, who need to translate their documents and also want a quick summary so they do not have to read the entire document.

## Text Extraction
First I will extract the text from a PDF file. I will also split the text into chunks, so that the input is not too long for the model.

In [11]:
import pdfplumber

pdf_path = "./data/cloud-analysis-2.pdf"

text = ""
with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        # Guard against pages for which no text is returned (e.g. scanned images)
        text += page.extract_text() or ""

print(text)

What are the benefits of using a cloud platform?
Some benefits of using a cloud platform instead of self-hosting applications include greater
elasticity. This means the application can easily scale up and down. When self-hosting
scaling out is possible by increasing the servers, this means it’s not as easy as when using a
cloud platform. The initial costs of using a cloud platform are also less, because there is no
need to invest in expensive servers and hardware. It also speeds up deployment, by
enabling deployment anywhere in the world in a matter of minutes. It is also safer and more
reliable. Cloud providers invest in security technologies to defend their platforms from threats
and outages, providing stronger security than most organisations can implement for their
own data centres. They are more reliable because distributed cloud platforms involve
multiple servers and sites around the world for greater reliability and faster disaster recovery
(What Is a Cloud Platform?, n.d.).
Wha
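
The printed text keeps the hard line wraps of the PDF layout. The following is a minimal sketch for joining those lines into flowing text before chunking. It assumes that line breaks inside a paragraph carry no meaning and that blank lines, if present, separate paragraphs. `unwrap_lines` is a hypothetical helper, not part of pdfplumber.

```python
import re

def unwrap_lines(text):
    """Join hard-wrapped lines into paragraphs.

    Blank lines are kept as paragraph breaks; every other run of
    whitespace (including single newlines) becomes one space.
    """
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())

sample = (
    "Some benefits include greater\n"
    "elasticity. This means the application\n"
    "can scale.\n"
    "\n"
    "Next paragraph."
)
print(unwrap_lines(sample))
```

Whether this step helps depends on the PDF: if line breaks do carry structure (lists, tables), it may be better to leave the text as extracted.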

In [12]:
def split_text(text, max_lenght=500):
    # Pack whole sentences into chunks of at most max_lenght characters
    sentences = text.split('. ')
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        if len(current_chunk) + len(sentence) + 1 <= max_lenght:
            current_chunk += sentence + ". "
        else:
            # Avoid emitting an empty chunk when the very first sentence is too long
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence + ". "
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
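
The function above splits only on `'. '`, so sentences ending in `?` or `!` are not treated as boundaries, and the final sentence receives a second period. The following is a rough alternative sketch, not the approach used in this notebook. The names `split_sentences`, `pack_chunks` and `max_chars` are illustrative, and the regular expression will still split incorrectly after abbreviations such as "e.g. ".

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace, keeping the punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def pack_chunks(sentences, max_chars=500):
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

sample = "Cloud platforms scale easily. Is that cheaper? It often is. Deployment is fast."
print(pack_chunks(split_sentences(sample), max_chars=45))
```

Note that both versions measure length in characters, while the models count tokens, so the character limit is only a conservative proxy for the model's input limit.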

In [13]:
chunks = split_text(text, max_lenght=500)
chunks

['What are the benefits of using a cloud platform?\nSome benefits of using a cloud platform instead of self-hosting applications include greater\nelasticity. This means the application can easily scale up and down. When self-hosting\nscaling out is possible by increasing the servers, this means it’s not as easy as when using a\ncloud platform. The initial costs of using a cloud platform are also less, because there is no\nneed to invest in expensive servers and hardware.',
 'It also speeds up deployment, by\nenabling deployment anywhere in the world in a matter of minutes. It is also safer and more\nreliable. Cloud providers invest in security technologies to defend their platforms from threats\nand outages, providing stronger security than most organisations can implement for their\nown data centres.',
 'They are more reliable because distributed cloud platforms involve\nmultiple servers and sites around the world for greater reliability and faster disaster recovery\n(What Is a Cloud 

## Translation
Next I will try to translate some text, so I can eventually translate the summary of the uploaded PDF.

In [14]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

For the translation I will use MarianMTModel from the Hugging Face transformers library. A MarianTokenizer converts the input text into token IDs that the model can process. On the Hugging Face website I found Helsinki-NLP/opus-mt-en-es, a model that translates text from English to Spanish. After translation, the generated tokens are decoded back into readable text.

In [15]:
from transformers import MarianMTModel, MarianTokenizer

model_name = 'Helsinki-NLP/opus-mt-en-es'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate_text(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    translated_tokens = model.generate(**inputs)
    translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
    return translated_text

In [16]:
text = "Hello, how are you?"
translated_text = translate_text(text)
translated_text

'Hola, ¿cómo estás?'

Translating this simple text shows the translation works.

In [17]:
translated_chunks = [translate_text(chunk) for chunk in chunks]
full_translation = " ".join(translated_chunks)
full_translation

'¿Cuáles son los beneficios de usar una plataforma en la nube? Algunos beneficios de usar una plataforma en la nube en lugar de aplicaciones de auto-alojamiento incluyen una mayor elasticidad. Esto significa que la aplicación puede escalar y bajar fácilmente. Cuando el auto-alojamiento es posible al aumentar los servidores, esto significa que no es tan fácil como cuando se utiliza una plataforma en la nube. Los costos iniciales de usar una plataforma en la nube también son menores, porque no hay necesidad de invertir en servidores y hardware caros. También acelera el despliegue, al permitir el despliegue en cualquier parte del mundo en cuestión de minutos. También es más seguro y fiable. Los proveedores de Cloud invierten en tecnologías de seguridad para defender sus plataformas de amenazas y interrupciones, proporcionando una seguridad más fuerte de la que la mayoría de las organizaciones pueden implementar para sus propios centros de datos. Son más fiables porque las plataformas de n

This shows that the text loaded from the PDF has been translated from English to Spanish.
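
The loop above calls the model once per chunk. The following is a minimal sketch of translating several chunks per call instead, assuming the `tokenizer` and `model` objects loaded earlier are passed in. `translate_batch` and `batch_size` are hypothetical names introduced for illustration.

```python
def translate_batch(texts, tokenizer, model, batch_size=8):
    """Translate a list of strings, sending up to batch_size texts to the model per call."""
    translations = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # padding=True pads each text to the longest one in the batch,
        # so the whole batch fits into a single tensor
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=512)
        outputs = model.generate(**inputs)
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations

# Usage with the objects defined earlier in this notebook:
# translated_chunks = translate_batch(chunks, tokenizer, model)
# full_translation = " ".join(translated_chunks)
```

For a handful of chunks the difference is small. Batching mainly pays off on longer documents or when running on a GPU.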

## Summary
Next I will try the summarisation step. For this I will use the pipeline function from the Hugging Face transformers library, which simplifies working with a pre-trained model. The summarization pipeline will summarise the text using the pre-trained facebook/bart-large-cnn model, which I found on the Hugging Face website. Again, a tokenizer converts the text into tokens.

In [18]:
from transformers import pipeline

summariser = pipeline("summarization", model="facebook/bart-large-cnn", tokenizer="facebook/bart-large-cnn")

Device set to use mps:0


In [22]:
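def summarise_text(text, max_length=512, min_length=100):
    summarised_text = []

    # max_length is also used as the chunk size in characters
    chunks = split_text(text, max_lenght=max_length)
    for chunk in chunks:
        # Note: max_length and min_length are passed positionally here. The pipeline
        # ignores them ("Ignoring args" in the output), so the model's defaults apply.
        summary = summariser(chunk, max_length, min_length, do_sample=False)
        summarised_text.append(summary[0]['summary_text'])
    return " ".join(summarised_text)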
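
In the function above, `max_length` and `min_length` reach the pipeline as positional arguments, which the run below reports as ignored; the model's default `max_length` of 142 is used instead. The following is a small sketch of deriving the summary length from the input length and passing it by keyword. The helper `summary_lengths` and its ratios are illustrative assumptions, not tuned values.

```python
def summary_lengths(input_tokens, max_ratio=0.5, min_ratio=0.2, floor=10):
    """Return (max_length, min_length) in tokens, scaled to the input size."""
    max_length = max(floor, int(input_tokens * max_ratio))
    # Keep min_length strictly below max_length and at least 1
    min_length = max(1, min(int(input_tokens * min_ratio), max_length - 1))
    return max_length, min_length

for n in (104, 140, 83):  # input lengths reported in the warnings of this notebook
    print(n, summary_lengths(n))

# Usage with the pipeline defined above (length arguments passed by keyword):
# n_tokens = len(summariser.tokenizer(chunk)["input_ids"])
# max_len, min_len = summary_lengths(n_tokens)
# summary = summariser(chunk, max_length=max_len, min_length=min_len, do_sample=False)
```

With a ratio of 0.5 the computed `max_length` values (52, 70, 41) match the ones suggested in the pipeline's warnings.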

Finally I will combine both models and summarise the translated text.

In [24]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

summarized_translation = summarise_text(full_translation)
print("Summarised Translation:")
print(summarized_translation)

Ignoring args : (512, 100)
Ignoring args : (512, 100)
Your max_length is set to 142, but your input_length is only 104. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=52)
Ignoring args : (512, 100)
Ignoring args : (512, 100)
Your max_length is set to 142, but your input_length is only 140. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=70)
Ignoring args : (512, 100)
Ignoring args : (512, 100)
Ignoring args : (512, 100)
Ignoring args : (512, 100)
Your max_length is set to 142, but your input_length is only 83. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=41)


Summarised Translation:
Algunos beneficios de usar una plataforma en la nube incluyen una mayor elasticidad. Esto significa que la aplicación puede escalar y bajar fácilmente. Cuando el auto-alojamiento es posible al aumentar los servidores, esto no es tan fáeasy. Los costos iniciales de usar una plataforma en la nube son menores. No hay necesidad of invertir en servidores y hardware caros. También acelera el despliegue. Los proveedores de Cloud invierten en tecnologías de seguridad para defender sus plataformas de amenazas y interrupciones. Son más fiables porque implican múltiples servidores y sitios alrededor del mundo. Los tres proveedores de nube más populares son Amazon Web Services (AWS), Microsoft Azure y Google Cloud Platform (GCP) Hay varios componentes necesarios de la aplicación Concert Meetup que necesitan ser implementados. El backend de la aplicación Concert Meetup se basa en una arquitectura de microservicios. Cada microservicio también tiene su propia base of datos, pa
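
The summary above mixes languages in places ("fáeasy", "necesidad of invertir", "base of datos"). One plausible cause, stated here as an assumption rather than a verified diagnosis, is that facebook/bart-large-cnn is fine-tuned on English news articles but receives Spanish input in this notebook. The following sketch reverses the order: each English chunk is summarised first, and each summary is then translated. `summarise_then_translate` is a hypothetical helper that takes the two steps as plain callables.

```python
def summarise_then_translate(chunks, summarise, translate):
    """Summarise each English chunk, then translate each summary."""
    summaries = [summarise(chunk) for chunk in chunks]
    return " ".join(translate(summary) for summary in summaries)

# Usage with the objects defined in this notebook (chunks holds the English text):
# translated_summary = summarise_then_translate(
#     chunks,
#     lambda chunk: summariser(chunk, do_sample=False)[0]["summary_text"],
#     translate_text,
# )
```

This keeps each model on the language it was trained for. Whether it improves the result for this document would need to be checked by running it.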

Next I will print the translated text in a more readable, line-wrapped format and save the result to a .txt file.

In [None]:
import textwrap

def print_wrapped_text(text, width=80):
    wrapped_text = textwrap.fill(text, width=width)
    print(wrapped_text)

print_wrapped_text(full_translation)

¿Cuáles son los beneficios de usar una plataforma en la nube? Algunos beneficios
de usar una plataforma en la nube en lugar de aplicaciones de auto-alojamiento
incluyen una mayor elasticidad. Esto significa que la aplicación puede escalar y
bajar fácilmente. Cuando el auto-alojamiento es posible al aumentar los
servidores, esto significa que no es tan fácil como cuando se utiliza una
plataforma en la nube. Los costos iniciales de usar una plataforma en la nube
también son menores, porque no hay necesidad de invertir en servidores y
hardware caros. También acelera el despliegue, al permitir el despliegue en
cualquier parte del mundo en cuestión de minutos. También es más seguro y
fiable. Los proveedores de Cloud invierten en tecnologías de seguridad para
defender sus plataformas de amenazas y interrupciones, proporcionando una
seguridad más fuerte de la que la mayoría de las organizaciones pueden
implementar para sus propios centros de datos. Son más fiables porque las
plataformas de nu

In [26]:
with open("translated_summary.txt", "w", encoding="utf-8") as file:
    # Save the summarised translation rather than the full translation
    file.write(summarized_translation)

print("Translation saved to translated_summary.txt")

Translation saved to translated_summary.txt


## Streamlit
I have built a Streamlit application that runs the entire process: a user uploads a PDF file, which is then summarised and translated from English into Spanish.