# Imports
Note that some of these packages can be awkward to install and may require some tinkering.

To use pytesseract: <br>
conda install -c conda-forge pytesseract <br>
conda install -c conda-forge tesseract <br>

To use pdf2image: <br>
conda install -c conda-forge pdf2image <br>
conda install -c conda-forge poppler <br>

To use reportlab: <br>
conda install reportlab <br>

In [1]:
from transformers import pipeline # Handles summarization
import requests # Handles translation using the DeepL API
from typing import Optional, Tuple

import pytesseract
from pdf2image import convert_from_path

import PIL
import numpy as np
import cv2
import matplotlib.pyplot as plt
import re

import os

from reportlab.lib.pagesizes import A4
from reportlab.platypus import SimpleDocTemplate, Paragraph, Image
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.enums import TA_CENTER, TA_JUSTIFY
from datetime import datetime
from time import time
from os.path import isfile

# Global variables and constants
The DeepL API key is used for the translation API. If the API key becomes invalid for some reason, you can generate your own API key following the instructions at https://www.deepl.com/docs-api.

In [2]:
# At the time of writing (20.05.2023), the API key still had 450k unused characters.
# Please keep this in mind and use the provided key responsibly.
DEEPL_API_KEY = '898523e2-0911-71ea-8d45-3e60991d2130:fx'
DEEPL_BASE_URL = 'https://api-free.deepl.com'
path = 'images/' #where the images will be saved

<b>Here, the paths to tesseract and tessdata need to be specified for the program to work. </b>

In [3]:
pytesseract.pytesseract.tesseract_cmd = r'C:\Users\annut\miniconda3\envs\Tehisintellekt\Library\bin\tesseract.exe' #r'path to tesseract.exe'
os.environ['TESSDATA_PREFIX'] = r'C:\Users\annut\miniconda3\envs\Tehisintellekt\share\tessdata/' #r'path to /tessdata/'

# Slides to text and images

In [4]:
#source: https://stackoverflow.com/questions/57249273/how-to-detect-paragraphs-in-a-text-document-image-for-a-non-consistent-text-stru
def getParagraphs(image, iterat):
    '''This function returns the locations of paragraphs on a slide, found by using dilation.'''
    
    if iterat == 1:
        iterations = 7
    elif iterat == 2:
        iterations = 10
    else:
        iterations = 12
    
    paragraphs = []
    image_np = np.array(image)
    
    # pdf2image returns PIL images, whose channel order is RGB (not OpenCV's BGR)
    gray = cv2.cvtColor(image_np, cv2.COLOR_RGB2GRAY)
    blur = cv2.GaussianBlur(gray, (7,7), 0)
    thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5,5))
    dilate = cv2.dilate(thresh, kernel, iterations=iterations)

    cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]

    for c in cnts:
        x,y,w,h = cv2.boundingRect(c)
        rect = [x, y, w, h]
        paragraphs.append(rect)
    
    return paragraphs
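
To see why the number of dilation iterations controls how much text ends up in one paragraph box, here is a minimal one-dimensional sketch in plain Python. It does not use OpenCV: `dilate_1d` and `count_blocks` are hypothetical helpers written for this illustration, and the row profile is made-up data in which 1 marks a pixel row containing ink.

```python
def dilate_1d(profile, iterations):
    """Binary dilation with a 3-wide kernel, repeated `iterations` times."""
    current = list(profile)
    for _ in range(iterations):
        grown = current[:]
        for i, value in enumerate(current):
            if value:
                if i > 0:
                    grown[i - 1] = 1
                if i + 1 < len(current):
                    grown[i + 1] = 1
        current = grown
    return current


def count_blocks(profile):
    """Count runs of consecutive 1s (the 1-D analogue of external contours)."""
    blocks = 0
    previous = 0
    for value in profile:
        if value and not previous:
            blocks += 1
        previous = value
    return blocks


# Made-up profile: two ink rows separated by 2 blank rows,
# then a third ink row after 6 more blank rows.
rows = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]

for iterations in (0, 1, 3):
    print(iterations, count_blocks(dilate_1d(rows, iterations)))
```

With no dilation there are three separate blocks, one iteration merges the two close rows, and three iterations merge everything. Wider row spacing therefore needs more iterations, which is why `getParagraphs` maps the spacing setting to 7, 10, or 12 iterations.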

In [5]:
def getContent(filename, lang, skipped, iterat):
    '''This function extracts text and pictures from slides based on the detected paragraphs and returns them 
    (text as a string and pictures as an array of the filenames). The program uses regex to extract only proper 
    text and not, for example, any functions.'''
    
    text = ""
    pictures = []
    picture_count = 0
    pattern = r"[a-zöüõäA-ZÖÄÜÕ][a-zöüõäA-ZÖÄÜÕ]+[,\/-:;?!]*"
    
    images = convert_from_path(filename)
    
    #Scanning slides one by one.
    for j in range(len(images)):
        if j+1 not in skipped:
            image = images[j]
            paragraphs = getParagraphs(image, iterat)

            for i in range(len(paragraphs)-1, -1, -1):

                #The last region (index 0) is assumed to hold the page number, so it is skipped
                if i == 0 and page_numbers:
                    break

                p = paragraphs[i]
                x, y, w, h = p[0], p[1], p[2], p[3]
                segment = image.crop((x, y, x + w, y + h))

                #If the cropped image contains a lot of colors, it is most likely a picture, not text
                unique_colors = set(segment.getdata())
                if len(unique_colors) > 15000:
                    if len(paragraphs) == 1 and skip_images_without_text:
                        break
                    else:
                        picturefilename = "Picture_" + str(picture_count) + ".png"
                        segment.save(path + picturefilename)
                        text += "(Vaata: Pilt " + str(picture_count) + ") "
                        picture_count +=1
                        pictures.append(picturefilename)

                else: 
                    extracted_text = pytesseract.image_to_string(segment, lang=lang)
                    sentence = extracted_text.strip().replace('\n', " ")
                    raw_text = re.findall(pattern, sentence)

                    if len(raw_text) > 0:
                        text += ' '.join(raw_text).replace(";", "").replace(":", "").capitalize() + ". "                        
    
    return text, pictures
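
The text filter used above can be tried in isolation. The sketch below reuses the same pattern and the same post-processing as `getContent`; the helper name `keep_proper_text` and the sample strings are made up for this illustration.

```python
import re

# Same pattern as in getContent: two or more letters, followed by
# optional characters from the final character class.
PATTERN = r"[a-zöüõäA-ZÖÄÜÕ][a-zöüõäA-ZÖÄÜÕ]+[,\/-:;?!]*"


def keep_proper_text(ocr_line):
    """Keep only word-like tokens and format them the way getContent does."""
    words = re.findall(PATTERN, ocr_line)
    if not words:
        return ""
    return " ".join(words).replace(";", "").replace(":", "").capitalize() + ". "


# Single letters and bare numbers are dropped, words are kept.
print(keep_proper_text("x = 5; tulemus on positiivne"))
# A line with only a formula yields no text at all.
print(repr(keep_proper_text("f(x) = 2 + 3")))
```

Because every match needs at least two letters, single-letter variable names and bare numbers are discarded, which is how formulas are kept out of the notes.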

In [6]:
def getSkippedSlides(skip):
    '''This function turns the input for skipped slides to a usable form and returns the slide numbers as an array.'''
    skipped = []
    if skip == "":
        return skipped
    slides = skip.replace(" ", "").split(",")
    
    for slide in slides:
        if "-" in slide:
            no = slide.split("-")
            for i in range(int(no[0]), int(no[1])+1):
                skipped.append(i)
        else:
            skipped.append(int(slide))
    return skipped
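
As a quick check of the skip syntax, here is a small standalone sketch of the same parsing idea. `parse_slide_ranges` is a hypothetical name used only here; unlike the function above, it also tolerates a trailing comma.

```python
def parse_slide_ranges(spec):
    """Expand a string such as "1-3, 5" into a list of slide numbers."""
    numbers = []
    for part in spec.replace(" ", "").split(","):
        if not part:
            continue  # tolerate empty input or a trailing comma
        if "-" in part:
            first, last = part.split("-")
            numbers.extend(range(int(first), int(last) + 1))
        else:
            numbers.append(int(part))
    return numbers


print(parse_slide_ranges("1-3, 5"))
print(parse_slide_ranges("1, 71"))
print(parse_slide_ranges(""))
```

Ranges are inclusive at both ends, so "1-3" covers slides 1, 2 and 3.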

# A short description of the summarization logic
Originally, the plan was to simply summarize the provided text natively in the language it was provided. There are plenty of examples of this available, such as the open-source Reddit bot "autotldr", which has a similar function.

Problems arose when it turned out that many slideshows contain only bullet points, which does not suit the approach similar projects use. Those projects extract the important sentences from the provided text without editing them, which breaks down when the input has no full sentences.

To work around this, we use the summarization pipeline (`SummarizationPipeline`) from the Hugging Face `transformers` library. By translating the source text to English and then summarizing it, we avoid many of the issues that arise with extractive summarization. This also makes it easy to support additional languages: in principle, every language DeepL supports should work. Keep in mind that this is untested functionality and no guarantees are provided.
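
The translate, summarize, translate-back flow can be sketched independently of any real service. In the sketch below, `summarize_via_english`, `fake_translate` and `fake_summarize` are hypothetical names; the two stand-ins only imitate the shape of a translation call and a summarizer so that the example runs offline.

```python
from typing import Callable, Optional, Tuple


def summarize_via_english(
    text: str,
    translate: Callable[[str, str, Optional[str]], Tuple[str, str]],
    summarize: Callable[[str], str],
) -> str:
    """Translate to English, summarize, then translate back to the detected language."""
    source_lang, english_text = translate(text, "EN-GB", None)
    english_summary = summarize(english_text)
    _, summary = translate(english_summary, source_lang, "EN")
    return summary


# Stand-ins so the sketch runs offline; real code would call a
# translation API and a summarization model instead.
def fake_translate(text, target_lang, source_lang=None):
    detected = source_lang or "ET"  # pretend the input was detected as Estonian
    return detected, f"[{target_lang}] {text}"


def fake_summarize(text):
    return text[:20]  # pretend the first 20 characters are the summary


print(summarize_via_english("Tere, see on pikk tekst slaididelt.", fake_translate, fake_summarize))
```

The detected source language from the first call is what lets the second call return the summary in the language of the slides.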

In [7]:
def translate_text(text: str, target_lang: str ='EN-GB', source_lang: Optional[str] = None) -> Tuple[str, str]:
    """This function returns a tuple of (source_lang, translated_text)."""
    # Build the URL for the translation service
    url = f"{DEEPL_BASE_URL}/v2/translate"
    # Build the payload
    payload = { 'text': [text], 'target_lang': target_lang }
    # In case a manual source language is set, we should pass it along. Otherwise, DeepL will handle it for us
    if source_lang is not None:
        payload['source_lang'] = source_lang
    # Headers
    headers = { 'Authorization': f"DeepL-Auth-Key {DEEPL_API_KEY}" }
    # Send the request
    response = requests.post(url, json=payload, headers=headers)
    json_response = response.json()
    # See the DeepL docs for the exact JSON format
    return json_response['translations'][0]['detected_source_language'], json_response['translations'][0]['text']
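
For reference, the request body can be inspected without sending anything. The sketch below uses a hypothetical helper, `build_deepl_payload`, that only assembles the JSON body; the field names `text`, `target_lang` and `source_lang` are the ones the translate endpoint expects.

```python
from typing import Optional


def build_deepl_payload(text: str, target_lang: str = "EN-GB",
                        source_lang: Optional[str] = None) -> dict:
    """Build the JSON body for POST /v2/translate (no request is sent)."""
    payload = {"text": [text], "target_lang": target_lang}
    # Only include source_lang when the caller knows it;
    # otherwise the service detects the language itself.
    if source_lang is not None:
        payload["source_lang"] = source_lang
    return payload


print(build_deepl_payload("Tere"))
print(build_deepl_payload("Tere", source_lang="ET"))
```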

In [8]:
def summarize_text(text: str, language: Optional[str] = None, test=False) -> str:
    """This function returns a summary of the provided text. If the source language is known, pass it in the `language`
        argument for a more accurate translation. For testing, please set test to True and pass in English text only."""
    # Get the translated text with its corresponding language
    source_lang = 'EN'
    translated_text = text
    if not test:
        source_lang, translated_text = translate_text(text, source_lang=language)
    # Summarize the text
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    summarized_text_en = summarizer(translated_text, max_length=1024, min_length=500, do_sample=False, truncation=True)[0]['summary_text']
    # Get back the original language
    returnable_text = summarized_text_en
    if not test:
        _, returnable_text = translate_text(summarized_text_en, target_lang=source_lang, source_lang='EN')
    return returnable_text
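
With short inputs, the fixed `max_length=1024` triggers a warning from `transformers` that suggests a `max_length` of about half the input length. One possible way to derive the limits from the input size is sketched below. `pick_summary_lengths` is a hypothetical helper, the token count is assumed to be known already (for example from the model's tokenizer), and using half of `max_length` as `min_length` is an arbitrary assumption, not a recommendation from the library.

```python
def pick_summary_lengths(input_tokens, max_cap=1024, min_cap=500):
    """Return (min_length, max_length) scaled to the input size."""
    # Follow the warning's suggestion: cap the summary at half the input length.
    max_length = max(2, min(max_cap, input_tokens // 2))
    # Assumption for this sketch: require at least half of max_length.
    min_length = min(min_cap, max_length // 2)
    return min_length, max_length


print(pick_summary_lengths(483))
print(pick_summary_lengths(4000))
```

For long inputs this falls back to the original limits of 500 and 1024.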


# Generating the final PDF

In [9]:
def generateTimeStamp() -> str:
    date_time = datetime.fromtimestamp(time())
    timestamp = date_time.strftime("%d-%m-%Y %H:%M")
    return timestamp


def getNewFileWidth(filename:str, definedheight:int) -> int:
    """This function calculates and returns a new width for an image, given the image file name and the desired height, while preserving aspect ratio."""    
    im = PIL.Image.open(filename)
    w, h = im.size
    im.close()

    ratio = w / h
    newwidth = int(definedheight * ratio)
    
    return newwidth


def createTitleParagraph(content:str, styles) -> Paragraph:
    paragraph = Paragraph(text=content, style=styles["Title"])
    return paragraph


def createTextParagraph(content:str, styles) -> Paragraph:
    paragraph = Paragraph(text=content, style=styles["NormalJustified"])
    return paragraph


def createTimestampParagraph(content:str, styles, lang:str) -> Paragraph:

    if lang == "eng":
        content = f"Generated {content}"
    elif lang == "est":
        content = f"Genereeritud {content}"
    else:
        #default to eng
        content = f"Generated {content}"

    paragraph = Paragraph(text=content, style=styles["Heading5"])
    return paragraph


def createImageReferenceParagraph(content:str, styles, lang:str) -> Paragraph:

    content = content.strip()
    content = content[:len(content) - 4]
    contents = content.split("_")

    number = contents[len(contents) - 1]
    
    if lang == "eng":
        content = f"Image {number}"
    elif lang == "est":
        content = f"Pilt {number}"
    else:
        #default to eng
        content = f"Image {number}"

    paragraph = Paragraph(text=content, style=styles["Reference"])

    return paragraph


def createImageParagraph(filename:str, desiredheight:int) -> Image:

    w = getNewFileWidth(filename=filename, definedheight=desiredheight)

    image = Image(filename=filename, 
                  height=desiredheight,
                  width=w,
                  hAlign="CENTER",
                  lazy=2)

    return image


def createPDF(filename:str, lang:str, header:str, text:str, imagefiles:list = None) -> bool:
    """Creates a PDF file with the provided inputs. List of images can be None or an empty list. Returns True if the PDF file was created successfully."""
    
    timestamp = generateTimeStamp()
    
    #make sure lang variable is good
    lang = str(lang).lower()
    if lang == "eng" or lang == "en":
        lang = "eng"
    if lang == "est" or lang == "ee" or lang == "et":
        lang = "est"

    #create document
    document = SimpleDocTemplate(
        filename=filename,
        pagesize=A4,
        rightMargin=50, leftMargin=50,
        topMargin=50, bottomMargin=50,
    )

    #get some default styles
    styles = getSampleStyleSheet()

    #define some of my own styles
    styles.add(ParagraphStyle(name='Reference',
                                parent=styles['Normal'],
                                fontSize=12,
                                spaceBefore=2,
                                spaceAfter=12,
                                alignment=TA_CENTER))
    
    styles.add(ParagraphStyle(name='NormalJustified',
                                parent=styles['Normal'],
                                fontSize=12,
                                spaceAfter=50,
                                alignment=TA_JUSTIFY))
    
    flowables = []
    
    #add the title
    flowables.append(createTitleParagraph(content=header, styles=styles))

    #add time created
    flowables.append(createTimestampParagraph(content=timestamp, styles=styles, lang=lang))

    #add summary
    flowables.append(createTextParagraph(content=text, styles=styles))

    if imagefiles:
        #add all image files with their references, if they exist
        for imagefilename in imagefiles:
            if isfile(path + imagefilename):
                flowables.append(createImageParagraph(filename=path+imagefilename, desiredheight=150))
                flowables.append(createImageReferenceParagraph(content=imagefilename, styles=styles, lang=lang))

    #build the document, even when there are no images
    document.build(flowables=flowables)
    
    return True
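
The width calculation in `getNewFileWidth` is plain aspect-ratio arithmetic and can be checked without opening any image file. The helper name `scaled_width` and the sample sizes below are made up for this illustration.

```python
def scaled_width(width, height, target_height):
    """Width that keeps the aspect ratio when the image is resized to target_height."""
    ratio = width / height
    return int(target_height * ratio)  # int() truncates towards zero


# A 1600x900 picture shown at the 150-unit height that createPDF uses.
print(scaled_width(1600, 900, 150))
# A picture that is exactly twice as wide as it is tall.
print(scaled_width(300, 150, 150))
```

Because `int()` truncates, the result can be up to one unit narrower than the exact value, which is not noticeable at this size.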

## Run program

Here the code can be tested. Before running the program, make sure you have set the variables in the cells under "Global variables and constants" to match your system.

<b>filename</b> - location and name of the source PDF file<br>
<b>newfile</b> - name of the generated PDF file<br>
<b>language</b> - the language of the PDF (est for Estonian, eng for English and so on: https://www.labnol.org/code/19899-google-translate-languages)<br>
<b>page_numbers</b> - whether the PDF pages have page numbers at the end (True) or not (False)<br>
<b>skip_images_without_text</b> - whether the program should skip images that are on pages without any text (True) or not (False)<br>
<b>skip_slides</b> - slide numbers to skip (example: with "1-3, 5", slides 1, 2, 3 and 5 are skipped)<br>
<b>header</b> - the title of the generated notes<br>
<b>spacing</b> - the estimated spacing between rows on a slide (1 = little, 2 = medium, 3 = a lot)

In [10]:
filename = ""
newfile = ""
language = ""
page_numbers = True
skip_images_without_text = True
skip_slides = ""
header = ""
spacing = 0

In [11]:
#skipped = getSkippedSlides(skip_slides)

#Convert slides to text and pictures
#text, pictures = getContent(filename, language, skipped, spacing)

#Generate summary
#summary = summarize_text(text, test=True)

#Generate pdf
#createPDF(filename=newfile, lang=language, header=header, text=summary, imagefiles=pictures)

#print("The ", header, " notes were saved to file ", newfile)

## Example 1
The slides used in this example are from the course Universum kõigile (LTTO.00.019). They were made by Laurits Leedjärv and suit our program well because they contain complete sentences. We do not skip any slides, since we want all the pictures from the slides to appear in our notes. These slides do not have page numbers. We have also shortened the slideshow to 14 slides.

In [12]:
filename = 'example1.pdf'
newfile = 'notes1.pdf'
language = 'est'
page_numbers = False
skip_images_without_text = False
skip_slides = ""
header = "Astronoomia ajalugu"
spacing = 1

skipped = getSkippedSlides(skip_slides)

#Convert slides to text and pictures
text, pictures = getContent(filename, language, skipped, spacing)

#Generate summary
summary = summarize_text(text, test=False)

#Generate pdf
createPDF(filename=newfile, lang=language, header=header, text=summary, imagefiles=pictures)

print("The", header, "notes were saved to file", newfile)

The  Astronoomia ajalugu  notes were saved to file  notes1.pdf


## Example 2
The slides used in this example are from the course Tehisintellekt (LTAT.01.003). They were made by Mark Fišel and work reasonably well with our program, but since they mostly contain bullet points, the generated notes are not as comprehensive as they could be.

In [13]:
filename = 'example2.pdf'
newfile = 'notes2.pdf'
language = 'est'
page_numbers = True
skip_images_without_text = True
skip_slides = "1, 71"
header = "Mängud"
spacing = 2

skipped = getSkippedSlides(skip_slides)

#Convert slides to text and pictures
text, pictures = getContent(filename, language, skipped, spacing)

#Generate summary
summary = summarize_text(text, test=False)

#Generate pdf
createPDF(filename=newfile, lang=language, header=header, text=summary, imagefiles=pictures)

print("The", header, "notes were saved to file", newfile)

Your max_length is set to 1024, but you input_length is only 483. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=241)


The  Mängud  notes were saved to file  notes2.pdf


## Example 3
The slides used in this example are from the course Tehisintellekt (LTAT.01.003). They were made by Mark Fišel and are not optimal for our program, since the slides contain very little text, but we still wanted to show how our program handles slides like this.

In [15]:
filename = 'example3.pdf'
newfile = 'notes3.pdf'
language = 'est'
page_numbers = True
skip_images_without_text = True
skip_slides = ""
header = "Masinõpe ja andmestikud"
spacing = 3

skipped = getSkippedSlides(skip_slides)

#Convert slides to text and pictures
text, pictures = getContent(filename, language, skipped, spacing)

#Generate summary
summary = summarize_text(text, test=False)

#Generate pdf
createPDF(filename=newfile, lang=language, header=header, text=summary, imagefiles=pictures)

print("The", header, "notes were saved to file", newfile)

Your max_length is set to 1024, but you input_length is only 344. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=172)


The Masinõpe ja andmestikud notes were saved to file notes3.pdf
