## How the programme works

**Input**

The programme needs two paths:
1. A path to a folder that contains pdf-files
2. A path to a .txt file that contains a question on each new line
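
As a minimal sketch of the second input, the snippet below writes a small questions file and reads it back, one question per line. The sample questions and the helper `read_questions` are made up for illustration; the notebook's own reader is `create_dictionary_of_questions`, defined in the Setup section.

```python
import tempfile
from pathlib import Path

def read_questions(path):
    """Return the non-empty lines of a text file as a list of questions."""
    with open(path, "r", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]

# Hypothetical example content: one question per line.
example_questions = "What is the main research question?\nWhich method is used?\n"

with tempfile.TemporaryDirectory() as folder:
    questions_path = Path(folder) / "questions.txt"
    questions_path.write_text(example_questions, encoding="utf-8")
    questions = read_questions(questions_path)

print(questions)
```

Blank lines are skipped in this sketch, so an empty line at the end of the file does not turn into an empty question.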

**What it then does**

It will feed the chatbot your pdfs separately and ask each question about each pdf.

**Output**

The programme will save its files to the folder where your pdfs are stored:
1. One .csv file, which can be opened as a spreadsheet, containing all outputs.
2. For each pdf file, it will save a Word document with the output.
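
If you would rather read the .csv output in Python than in a spreadsheet, something like the sketch below should work. It assumes the layout that `write_csv` produces (a `File Name` column followed by one column per question); the sample row and the helper `load_answers` are invented for illustration.

```python
import csv
import io

# Hypothetical stand-in for the saved .csv: a "File Name" column plus one column per question.
sample_csv = io.StringIO(
    "File Name,What is the main research question?\n"
    'paper_a.pdf,"It asks how X affects Y, and why."\n'
)

def load_answers(handle):
    """Map each file name to a {question: answer} dictionary."""
    answers = {}
    for row in csv.DictReader(handle):
        file_name = row.pop("File Name")
        answers[file_name] = row
    return answers

answers = load_answers(sample_csv)
for file_name, q_and_a in answers.items():
    for question, answer in q_and_a.items():
        print(f"{file_name} | {question} -> {answer}")
```

To read a real output file, replace `sample_csv` with `open(path_to_csv, newline='')`.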

This notebook was made by https://github.com/caspersimon/Ask_many_questions_to_a_pdf. Feel free to use it and share it with anyone who might find it useful.

****

### New in v1.1

**Added**
* Added functionality to save the output as Word documents. Each pdf file will get its own Word document.
* Added the option to change the standard prompt to be sent before asking a question.

**Fixed**
* Very long texts are now truncated, so the LLM can handle them.

**Other changes**
* Progress is now indicated with a progress bar.
* The programme now waits longer for a response before timing out (120 seconds vs. 45 seconds in v1.0). This can be changed in the optional settings.
* The extracted texts are no longer saved on your computer as .txt files, but are instead kept in memory as a dictionary.


### More information

**Step by step, this is what the code does:**
1. First, it identifies the filenames and paths of the pdf files in a designated folder: `pdf_folder`.
2. It converts each pdf file into text that the computer can read. These texts are stored in memory as a dictionary.
3. It reads a list of questions from a designated .txt file: `questions_file`.
    * You should create this file manually. Each line should contain one question. You can look at the example to see how to format this.
4. For each combination of question and file, it sends a prompt to the chatbot that starts with "Hey! I'm trying to understand this paper. Can you help me with a question about this text?", followed by the text and the question.
    * You can change this prompt by changing `prompt` in the optional settings cell.
5. It stores the answers in a .csv file (and a Word file per pdf) in the same folder as your pdfs.
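
Step 4 can be sketched without contacting the chatbot at all. The snippet below only assembles what would be sent for every combination of file and question, using the same chat-history layout as `write_csv`. The helper `build_request`, the placeholder texts, and the placeholder questions are assumptions for illustration, not part of the notebook.

```python
from itertools import product

def build_request(prompt, file_text, question):
    """Assemble the chat history and message for one (file, question) pair."""
    chat_history = [
        {"role": "USER", "text": f"### Context \n\n{prompt} \n\n### Text \n{file_text}."},
        {"role": "CHATBOT", "text": "Yes, I will help you. What is your question?"},
    ]
    return {"chat_history": chat_history, "message": question}

prompt = "Hey! I'm trying to understand this paper. Can you help me with a question about this text?"
texts = {"paper_a.pdf": "Text of paper A", "paper_b.pdf": "Text of paper B"}  # placeholder texts
questions = ["What is the main finding?", "Which method is used?"]  # placeholder questions

# One request per combination of file and question.
planned_requests = {
    (file_name, question): build_request(prompt, text, question)
    for (file_name, text), question in product(texts.items(), questions)
}
print(f"{len(planned_requests)} requests would be sent")
```

The number of requests is the number of files times the number of questions, which is why the running time grows quickly once you add more of either.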

****
## Setup

### Import/install dependencies

In [1]:
import os #to index files in folders
import PyPDF2 #to convert pdf to txt
import cohere #to talk to the LLM
import csv #to export to csv
import time #to make sure the rate limit is not exceeded
import datetime #to name the csv file
from tqdm import tqdm #for the progress bar

#to work with MS word:
from docx import Document 
from docx.shared import Pt
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT
from docx.shared import RGBColor
from docx.shared import Inches

### Defining all the functions

In [8]:
# creates dictionary of filepaths for .pdf files
def create_file_dictionary(pdffolder):
    file_dict = {}
    files = os.listdir(pdffolder)
    for file_name in files:
        if file_name.endswith('.pdf'):
            file_dict[file_name] = {"path": os.path.join(pdffolder, file_name), "storedastxt": 0, "text": ""}
    return file_dict

# takes in the file dictionary, extracts the text of each pdf and stores it in the dictionary (returns nothing)
def maketxt(file_dictionary):
    for key, value in file_dictionary.items():
        if value["storedastxt"] == 0:
            pdf_path = value["path"]
            text = pdf_to_text(pdf_path)
            if text:
                value["text"] = text
                value['storedastxt'] = 1
                print(f"Stored {key} as text")
            else:
                print(f"An error occurred storing {key}")
        else:
            print(f"{key} is already stored as text")

# takes one filepath and returns a text of that filepath
def pdf_to_text(pdf_path):
    try:
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            text = ''
            for page_num in range(len(pdf_reader.pages)):
                page = pdf_reader.pages[page_num]
                text += page.extract_text()
        return text.replace("\n", " ").strip("- ")
    except Exception as e:
        print(f"Error reading PDF file {pdf_path}: {e}")
        return None

# takes the path to the questions file and creates a dictionary of questions,
# where each non-empty line in the txt file becomes a new entry
def create_dictionary_of_questions(questionsfile, delimiter='\n'):
    questions_dict = {}
    counter = -1
    with open(questionsfile, 'r') as file:
        for line in file:
            questionstring = line.strip(delimiter).strip()
            if not questionstring:  # skip blank lines so no empty question is sent
                continue
            counter += 1
            questionnumber = counter
            questions_dict[questionnumber] = {"question": questionstring, "status": "active"}
    return questions_dict

# combines all the functions and actually writes the csv
def write_csv(file_dictionary, questions_list, csv_file_path, api_key, prompt, timeouttime, languagemodel):
    answers_dict = {}
    countdown = len(questions_list) * len(file_dictionary)
    let_totalcount = len(questions_list) * len(file_dictionary)
    print(f"Status: about to start {countdown} requests to the server")
    max_iterations_per_minute = 9
    iterations = 0
    start_time = time.time()
    progress_bar = tqdm(total=let_totalcount, position=0, leave=True)

    for file_name, file_info in file_dictionary.items():
        file_content = file_info["text"]
        answers_dict[file_name] = {}
        print(f"\n\nOperating on file: {file_name}")

        for question in questions_list:
          #  print(f"Asking question: {question}")
            try:
                co = cohere.Client(api_key=api_key, timeout=timeouttime)
                chat_history = [
                    {"role": "USER", "text": f"### Context \n\n{prompt} \n\n### Text \n{file_content}."},
                    {"role": "CHATBOT", "text": "Yes, I will help you. What is your question?"},
                ]
                message = question
                response = co.chat(
                    message=message, 
                    chat_history=chat_history,
                    model = languagemodel,
                    prompt_truncation = "AUTO"
                )
                answer = response.text
                answers_dict[file_name][question] = answer
                countdown -= 1
                progress_bar.update(1)
                iterations += 1
            #    print(f"-->Status: total number of requests remaining: {countdown}")

                elapsed_time = time.time() - start_time
                if elapsed_time >= 60:
                    start_time = time.time()
                    iterations = 0

            #    time_remaining = (countdown / max_iterations_per_minute) * 60
             #   time_remaining_str = str(datetime.timedelta(seconds=int(time_remaining)))
             #   print(f"-->Status: estimated time remaining: {time_remaining_str}")

                if iterations >= max_iterations_per_minute:
                    time.sleep(60 - elapsed_time)
                    print("Paused to avoid exceeding rate limit")
                    start_time = time.time()
                    iterations = 0

            except Exception as e:
                print(f"\nAn error occurred: {e} \n-->Question: {question}, File: {file_name}\n")
                progress_bar.update(1)

    print("\nSlayed. \n\nYes. You can now see the beautiful output. xxxx")
    
    with open(csv_file_path, 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=['File Name'] + questions_list)
        writer.writeheader()

        for file_name, answers in answers_dict.items():
            row_data = {'File Name': file_name}

            for question, answer in answers.items():
                row_data[question] = answer

            writer.writerow(row_data)
    return answers_dict


def run(pdf_folder, questions_file, api_key, prompt, makedoc, timeouttime, languagemodel):
    questions_dict = create_dictionary_of_questions(questions_file)
    file_dict = create_file_dictionary(pdf_folder)
    maketxt(file_dict)

    questions_list = [v["question"] for v in questions_dict.values() if v["status"] == "active"]

    answers_dict = write_csv(file_dict, questions_list, f"{pdf_folder}/output.csv", api_key, prompt, timeouttime, languagemodel)
    if makedoc == True:
        create_word_files(answers_dict)
        

# --------------------------------
# Functions for creating a docx file



# takes a dictionary of answers and creates doc files for each pdf file
def create_word_files(main_dict):
    for filename, q_and_a in main_dict.items():
        doc = Document()
        try:
            for question, answer in q_and_a.items():
                # Add question as header
                heading = doc.add_heading(level=1)
                run = heading.add_run(question.title())  # Convert to title case
                run.bold = True
                run.font.name = 'Times New Roman'
                run.font.size = Pt(12)
                run.font.color.rgb = RGBColor(0, 0, 0)  # Set color to black
                paragraph_format = heading.paragraph_format
                paragraph_format.alignment = WD_PARAGRAPH_ALIGNMENT.CENTER
                paragraph_format.line_spacing = 1.5  # Set line spacing to 1.5
    
                # Split answer into paragraphs at each newline character
                paragraphs = answer.split('\n')
    
                for paragraph in paragraphs:
                    if paragraph.strip():  # Check if paragraph is not empty
                        # Check if paragraph should be formatted as a list
                        if paragraph.startswith('- '):  # Bullet list
                            para = doc.add_paragraph()
                            para.style = 'List Bullet'
                            run = add_bold_text(para, paragraph[2:])  # Remove '- ' from start of paragraph
                            run.font.name = 'Times New Roman'
                            run.font.size = Pt(12)
                            para.paragraph_format.line_spacing = 1.5  # Set line spacing to 1.5
                        elif paragraph.startswith('1. '):  # Numbered list
                            para = doc.add_paragraph()
                            para.style = 'List Number'
                            run = add_bold_text(para, paragraph[3:])  # Remove '1. ' from start of paragraph
                            run.font.name = 'Times New Roman'
                            run.font.size = Pt(12)
                            para.paragraph_format.line_spacing = 1.5  # Set line spacing to 1.5
                        elif paragraph.startswith('### '):  # Header 2
                            heading = doc.add_heading(level=2)
                            run = heading.add_run(paragraph[4:].title())  # Convert to title case and remove '### ' from start of paragraph
                            run.font.name = 'Times New Roman'
                            run.font.size = Pt(12)
                            run.font.color.rgb = RGBColor(0, 0, 0)  # Set color to black
                            paragraph_format = heading.paragraph_format
                            paragraph_format.alignment = WD_PARAGRAPH_ALIGNMENT.LEFT
                            paragraph_format.line_spacing = 1.5  # Set line spacing to 1.5
                        else:  # Normal paragraph
                            para = doc.add_paragraph()
                            para.paragraph_format.first_line_indent = Inches(0.5)  # Indent first line by 0.5 inch
                            run = add_bold_text(para, paragraph)
                            run.font.name = 'Times New Roman'
                            run.font.size = Pt(12)
                            para.paragraph_format.line_spacing = 1.5  # Set line spacing to 1.5
    
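            # Save the document once, after all questions for this file have been added
            # (uses the global pdf_folder defined in the settings cell)
            doc.save(os.path.join(pdf_folder, f"{filename}_summary.docx"))
        except Exception as e:
            print(f"\nAn error occurred with saving to word: {e} \n-->Document: {filename}")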

# adds text to a paragraph and makes the parts wrapped in **double asterisks** bold;
# every run gets the document font, and the last run is returned
def add_bold_text(para, text):
    run = None
    parts = text.split('**')  # Parts at odd positions were inside '**'
    for i, part in enumerate(parts):
        run = para.add_run(part)
        run.font.name = 'Times New Roman'
        run.font.size = Pt(12)
        if i % 2 != 0:  # If part is inside '**', make it bold
            run.bold = True
    return run

***
## Getting started: create files and define paths
### You should create a file and a folder:
1. A .txt file with the questions you want to ask about the pdfs. Each line will be used as a separate question.
2. A folder where you store all of the pdfs you want to ask questions about.


You should define their paths as follows:
* The path to your folder with pdfs, e.g., `/Users/julius/CCS/awesome_project/Files/pdfs`
* The path to your .txt file with questions, e.g., `/Users/julius/CCS/awesome_project/Files/questionsfolder/questions.txt`
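
Before starting a long run, it can help to check that both paths point at something usable. The sketch below is one possible check. The helper `check_inputs` is hypothetical and not part of the notebook, and the temporary folder only stands in for your own paths.

```python
import tempfile
from pathlib import Path

def check_inputs(pdf_folder, questions_file):
    """Return a list of problems found with the two paths (empty if none)."""
    problems = []
    folder = Path(pdf_folder)
    if not folder.is_dir():
        problems.append(f"Folder not found: {folder}")
    elif not any(p.name.endswith(".pdf") for p in folder.iterdir()):
        problems.append(f"No pdf files in: {folder}")
    if not Path(questions_file).is_file():
        problems.append(f"Questions file not found: {questions_file}")
    return problems

# Demonstration with a temporary folder instead of real paths.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "paper.pdf").write_bytes(b"%PDF-1.4")
    (Path(tmp) / "questions.txt").write_text("What is the main finding?\n", encoding="utf-8")
    ok_result = check_inputs(tmp, Path(tmp) / "questions.txt")
    missing_result = check_inputs(tmp, Path(tmp) / "does_not_exist.txt")

print(ok_result)
print(missing_result)
```

With your own paths, you would call `check_inputs(pdf_folder, questions_file)` after defining them and only continue if the returned list is empty.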

#### Define the paths using the cell below
* Change the values in the cell below to match your own paths.

In [11]:
api_key = "YOUR_COHERE_API_KEY" # add your own api-key here.
pdf_folder = "/Users/juliuseikmans/Downloads/CCS2_w4"
questions_file = "/Users/juliuseikmans/Downloads/CCS2_w4/questions.txt"


# ----------------- optional settings ------------------------------------------------------

# makes a word-doc for each pdf. Can be changed to False.
makedoc = True 

# You can change the prompt that comes before each question.
prompt = "Hey! I'm trying to understand this paper. Can you help me with a question about this text?"

# Change how long the programme waits before it times out. 
# Shorter times make the programme faster, but more questions will be skipped
timeouttime = 120

# Change the language model that is used. See https://docs.cohere.com/docs/models for more info.
languagemodel = "command-r-plus"

# Change how the csv file is named. Note that run() as defined above saves to f"{pdf_folder}/output.csv"
# and does not read csv_file_path; pass it to write_csv yourself if you want a time-stamped file name.
t = time.localtime()
current_time = time.strftime("%H:%M:%S", t)
csv_file_path = f"{pdf_folder}/{current_time}.csv"

## Running the programme
* Run the cell below to start the programme.

In [12]:
run(pdf_folder, questions_file, api_key, prompt, makedoc, timeouttime, languagemodel)


Stored 1-s2.0-S0738399120306376-main.pdf as text
Stored 1-s2.0-S0167811619300102-main.pdf as text
Status: about to start 20 requests to the server


  0%|          | 0/20 [00:00<?, ?it/s]



Operating on file: 1-s2.0-S0738399120306376-main.pdf


 50%|█████     | 10/20 [01:28<01:16,  7.66s/it]



Operating on file: 1-s2.0-S0167811619300102-main.pdf


100%|██████████| 20/20 [03:37<00:00, 10.87s/it]



Slayed. 

Yes. You can now see the beautiful output. xxxx


In [75]:
import re

def replace_long_words(text):
    return re.sub(r'\b[\w\.\,\!\?\'\"\\/]{20,}\b', '', text)
    
def replace_non_printable_characters(text):
    return re.sub(r'[\x00-\x1F\x7F]{1,}', '', text)

# takes in the file dictionary, extracts the text of each pdf and stores it in the dictionary (returns nothing)
def maketxt(file_dictionary):
    for key, value in file_dictionary.items():
        if value["storedastxt"] == 0:
            pdf_path = value["path"]
            text = pdf_to_text(pdf_path)
            if text:
                value["text"] = text
                value['storedastxt'] = 1
                print(f"Stored {key} as text")
            else:
                print(f"An error occurred storing {key}")
        else:
            print(f"{key} is already stored as text")

# takes one filepath and returns a text of that filepath
def pdf_to_text(pdf_path):
    try:
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            text = ''
            for page_num in range(len(pdf_reader.pages)):
                page = pdf_reader.pages[page_num]
                text += page.extract_text()
        return replace_non_printable_characters(replace_long_words(text.replace("\n", " ").replace("- ", "").replace("-", "").replace("/", "").strip("\\")))
    except Exception as e:
        print(f"Error reading PDF file {pdf_path}: {e}")
        return None

def run_dev(pdf_folder, questions_file, api_key, prompt, makedoc, timeouttime, languagemodel):
    questions_dict = create_dictionary_of_questions(questions_file)
    file_dict = create_file_dictionary(pdf_folder)
    maketxt(file_dict)
    return file_dict

In [76]:
filestest = run_dev(pdf_folder, questions_file, api_key, prompt, makedoc, timeouttime, languagemodel)

Stored Strategy_Lecture3_Week3.pdf as text
Stored chapter3.pdf as text
Stored chapter2.pdf as text
Stored Strategy_Lecture2_Week2.pdf as text
Stored chapter1.pdf as text
Stored chapter5.pdf as text
Stored chapter4.pdf as text
Stored chapter6.pdf as text
Stored Strategy_Lecture1_Week1.pdf as text


In [47]:
print(replace_long_words("hello this is a test aaaaaaaaaaaaaaaaaaaaaaa"))

hello this is a test  


In [77]:
print(filestest)

