## How the programme works

This programme answers a list of questions about a collection of PDF files. It uses a large language model (LLM) similar to ChatGPT, but this one is made by the company Cohere.

Cohere mostly serves other businesses, for example with chatbots that interact with customers. They made an LLM that can work with an enormous amount of context, and they offer an API to interact with the model. The only limit (for non-business users) is 10 requests per minute; there is no separate word limit, although the text still has to fit in the model's context window. This allows us to insert all the text from a PDF article into a prompt and receive an answer.
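
Because every question is asked once per PDF, the rate limit largely determines how long a run takes. As a rough sketch (the function name `estimate_minutes` is made up for illustration, and the default of 9 requests per minute mirrors the safety margin used in `write_csv` further down), you can estimate the run time like this:

```python
def estimate_minutes(n_files, n_questions, requests_per_minute=9):
    """Rough lower bound on the run time in minutes, ignoring how long each answer takes."""
    total_requests = n_files * n_questions
    return total_requests / requests_per_minute

# Example: 6 PDFs and 12 questions give 72 requests
print(estimate_minutes(6, 12))  # 8.0
```

So 6 PDFs with 12 questions each will take at least about 8 minutes.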

**Step by step, this is what the code does:**
1. First, it identifies the filenames and paths of the PDF files in a designated folder: `folderpath`.
2. It converts each PDF file into a .txt file that the computer can read.
    * It stores these .txt files in the same folder as your PDFs. You can check these files if there are any issues.
3. It reads a list of questions from a designated .txt file: `questionspath`.
    * You should create this file manually. Each line should contain one question. See the example file for the format.
4. For each combination of question and file, it will send a prompt to the chatbot that starts with "Hey! I'm trying to understand this paper. Can you help me with a question about this text?", followed by the text and question.
    * You can change this prompt by editing the `write_csv` function listed under step 3 below.
5. It stores the answers in a .csv file in the same folder as your PDFs. You can then open it in, e.g., Excel, Numbers, or Google Sheets.
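
To make step 4 concrete, here is a minimal sketch of how one request is put together. The helper name `build_chat_request` and the example text are made up for illustration; the prompt wording and the `message` / `chat_history` structure are the ones used in `write_csv`. Nothing is sent to Cohere here.

```python
def build_chat_request(paper_text, question):
    """Return the arguments for one chat request: the paper goes in the history, the question is the message."""
    chat_history = [
        {
            "role": "USER",
            "text": "### Context \n \n Hey! I'm trying to understand this paper. "
                    "Can you help me with a question about this text? \n \n ### Text \n "
                    + paper_text + ".",
        },
        {"role": "CHATBOT", "text": "Yes, I will help you. What is your question?"},
    ]
    return {"message": question, "chat_history": chat_history}

# Hypothetical example input
request = build_chat_request("(full text of the paper)", "What is the main research question?")
print(request["message"])
print(len(request["chat_history"]))
```

The full text of the PDF sits in the first turn of the chat history, and each question is sent as a separate message on top of it. That is why the number of requests equals the number of files times the number of questions.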

**This notebook was made by https://github.com/caspersimon/Ask_many_questions_to_a_pdf. Feel free to use it and share it with whoever might find this useful.**

## Initial, one-time setup

### Step 1: Retrieve API key

To get an API key, go to cohere.com and sign up for an account. Then go to Dashboard --> API keys to get a free API key. Add it to the cell below and run the cell. After this, you will never have to worry about this API key again.
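
If you would rather not store the key inside the notebook (for example because you share the notebook with others), one option is to read it from an environment variable. This is only a sketch: the helper `load_api_key` and the variable name `COHERE_API_KEY` are assumptions made for this example, not something the notebook requires.

```python
import os

def load_api_key(variable_name="COHERE_API_KEY"):
    """Read the API key from an environment variable; raise a clear error if it is not set."""
    key = os.environ.get(variable_name)
    if not key:
        raise RuntimeError(f"Set the {variable_name} environment variable to your Cohere API key.")
    return key

# Usage, once the variable is set in your shell:
# api_key = load_api_key()
```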

In [None]:
api_key = "xxxxxxxxxxxxxxxxxxxxxxx" # add your own api-key here.

***
### Step 2: Import/install dependencies

Make sure the packages below are imported. If they are not installed yet, you can install them with, e.g., `! pip install cohere PyPDF2`.
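
If you are not sure whether the packages are installed, a quick check like the sketch below can tell you before you run the imports. The helper name `missing_packages` is made up for this example; it only uses Python's standard `importlib`.

```python
import importlib.util

def missing_packages(names):
    """Return the names from the list that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(["cohere", "PyPDF2"])
if missing:
    print("Please install:", ", ".join(missing))
else:
    print("All packages are installed.")
```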

In [None]:
import os
import cohere
import PyPDF2
import csv
import time
import datetime

***
### Step 3: Defining all the functions
You can just run the entire block of code below and then never think about it ever again. Unless there is an unexpected error, of course.

In [None]:
# first defining all the functions
def create_file_dictionary(pdffolder):
    # Initialize an empty dictionary
    file_dict = {}
    
    # List all files in the folder
    files = os.listdir(pdffolder)
    
    # Iterate over each file
    for file_name in files:
        # Check if the file has a .pdf extension (upper or lower case)
        if file_name.lower().endswith('.pdf'):
            # Add the file name to the dictionary with its full path as the value
            file_dict[file_name] = {"path": os.path.join(pdffolder, file_name), "storedastxt": 0} 
    
    return file_dict

def pdf_to_text(pdf_path, txt_path):
    # Extract the text of one PDF and write it to txt_path. Returns True on success.
    try:
        with open(pdf_path, 'rb') as pdf_file:
            # Create a PDF reader object
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            print(f"I have opened {pdf_path}")
            # Initialize an empty string to store text
            text = ''

            # Iterate through each page of the PDF
            for page in pdf_reader.pages:
                # Get the text content of the page (may be empty for image-only pages)
                text += page.extract_text() or ''

        # Write the extracted text to a text file
        with open(txt_path, 'w', encoding='utf-8') as txt_file:
            txt_file.write(text.replace("\n", " ").strip("- "))
        return True
    except Exception as e:
        # If there's an error reading the PDF file, report it and skip the file
        print(f"Error reading PDF file {pdf_path}: {e}")
        return False


def maketxt(file_dictionary):
    for key, value in file_dictionary.items():
        if value["storedastxt"] == 0:
            pdf_path = value["path"]
            # Store the text next to the PDF, under the same name but with a .txt extension
            txt_path = os.path.splitext(pdf_path)[0] + ".txt"
            # Only mark the file as stored if the conversion actually worked
            if pdf_to_text(pdf_path, txt_path):
                value["storedastxt"] = 1
                value["filepath"] = txt_path
                print(f"stored {key} as txt")
            else:
                print(f"an error occurred storing {key}")
        elif value["storedastxt"] == 1:
            print(f"{key} is already stored as txt")
        else:
            print(f"an error occurred storing {key}")

def create_dictionary_of_questions(questionsfile, delimiter='\n'):
    # Initialize an empty dictionary
    questions_dict = {}

    # Open the file in read mode
    with open(questionsfile, 'r', encoding='utf-8') as file:
        # Read each line in the file
        for line in file:
            # Remove the delimiter and surrounding whitespace from the line
            questionstring = line.strip(delimiter).strip()

            # Skip empty lines so they are not sent as questions
            if not questionstring:
                continue

            # Number the questions in order, starting at 0
            questionnumber = len(questions_dict)

            # Add the key-value pair to the dictionary
            questions_dict[questionnumber] = {"question": questionstring, "status": "active"}

    return questions_dict


def write_csv(filepath_list, questions_list, csv_file_path):
    answers_dict = {}
    countdown = len(questions_list) * len(filepath_list)
    print(f"status: about to start {countdown} requests to the server")
    
    max_iterations_per_minute = 9
    iterations = 0
    start_time = time.time()
    
    for file_path in filepath_list:
        with open(file_path, 'r', encoding='utf-8') as file:
            file_content = file.read()
        answers_dict[file_path] = {}
        print(f"\n\n Operating on file: {file_path}")
        
        for question in questions_list:
            print(f"asking question: {question}")
            # Ask the model this question, with the full text of the file as context
            try:
                co = cohere.Client(api_key=api_key, timeout=45)  # Adjust timeout value as needed
                chat_history = [
                    {"role": "USER", "text": f"### Context \n \n Hey! I'm trying to understand this paper. Can you help me with a question about this text? \n \n ### Text \n {file_content}."},
                    {"role": "CHATBOT", "text": "Yes, I will help you. What is your question?"},
                ]
                message = question
                response = co.chat(message=message, chat_history=chat_history)
                answer = response.text

                answers_dict[file_path][question] = answer
                countdown -= 1
                iterations += 1
                print(f"-->status: total number of requests remaining: {countdown}")

                # Check if 1 minute has elapsed, and reset the iteration count
                elapsed_time = time.time() - start_time
                if elapsed_time >= 60:
                    start_time = time.time()
                    iterations = 0

                # Calculate estimated time remaining
                time_remaining = (countdown / max_iterations_per_minute) * (60)
                time_remaining_str = str(datetime.timedelta(seconds=int(time_remaining)))
                print(f"-->status: estimated time remaining: {time_remaining_str}")

                # Check if the maximum iterations per minute is reached
                if iterations >= max_iterations_per_minute:
                    # Sleep for the remaining time in the current minute
                    print("pausing to avoid exceeding rate limit")
                    time.sleep(max(0, 60 - elapsed_time))
                    start_time = time.time()
                    iterations = 0
                
            except Exception as e:
                print(f"**\nAn error occurred: {e} \n-->question: {question}, file: {file_path} \n**\n")
                countdown -= 1
                iterations +=1

    print("\nSlayed. \n\nYes. You can now see the beautiful output. xxxx")

    with open(csv_file_path, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=['File Path'] + questions_list)
        writer.writeheader()

        for file_path, answers in answers_dict.items():
            row_data = {'File Path': file_path}

            for question, answer in answers.items():
                row_data[question] = answer

            writer.writerow(row_data)

def run():
    questions_dict = create_dictionary_of_questions(questionspath, delimiter='\n')
    readings_dict = create_file_dictionary(folderpath)
    maketxt(readings_dict)
    
    questions_list = []
    for k,v in questions_dict.items():
        if v["status"]=="active":
            questions_list.append(v["question"])
            
    filepath_list = []
    for k,v in readings_dict.items():
        if v["storedastxt"]==1:
            filepath_list.append(v["filepath"])
    
    # doing the actual programme
    write_csv(filepath_list, questions_list, csv_file_path)

***
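
The CSV that `write_csv` produces has one row per file and one column per question. Here is a minimal sketch of that layout, with made-up file names and answers, written to memory instead of to disk:

```python
import csv
import io

questions = ["What is the research question?", "Which method is used?"]

# Made-up answers, in the same nested shape that write_csv builds up
answers = {
    "paper_a.txt": {questions[0]: "Answer A1", questions[1]: "Answer A2"},
    "paper_b.txt": {questions[0]: "Answer B1"},  # no answer stored for the second question
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["File Path"] + questions)
writer.writeheader()
for file_path, file_answers in answers.items():
    # Questions without a stored answer are written as empty cells
    writer.writerow({"File Path": file_path, **file_answers})

csv_text = buffer.getvalue()
print(csv_text)
```

If a request fails, the answer is simply missing from the dictionary, so the matching cell in the CSV stays empty.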
## Every time: create files and define paths
### You should create two things:
1. A .txt file with the questions you want to ask about the PDFs. Each line will be used as a separate question.
2. A folder where you store all of the PDFs you want to ask questions about.


Define their paths like this:
* The path to your folder with PDFs, e.g., `/Users/julius/CCS/awesome_project/Files/pdfs`
* The path to your .txt file with questions, e.g., `/Users/julius/CCS/awesome_project/Files/questionsfolder/questions.txt`
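
As a sketch of the expected format, a questions file is plain text with one question per line. The questions and the temporary folder below are only examples, chosen so that nothing in your own project gets overwritten:

```python
import os
import tempfile

example_questions = [
    "What is the main research question?",
    "Which method do the authors use?",
    "What are the most important findings?",
]

# Write the example to a temporary folder
example_folder = tempfile.mkdtemp()
example_questionspath = os.path.join(example_folder, "questions.txt")
with open(example_questionspath, "w", encoding="utf-8") as f:
    f.write("\n".join(example_questions) + "\n")

# Read it back line by line: every line is one question
with open(example_questionspath, "r", encoding="utf-8") as f:
    loaded = [line.strip() for line in f if line.strip()]

print(loaded)
```

You can also just type the questions in any text editor and save the file as `questions.txt`.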

#### Define using the cell below
* Change the cell below accordingly.

In [None]:
folderpath = "/Users/julius/CCS/Personal_project/Files/pdfs/raw/week_2" # change this to the folder where you stored your original pdf's
questionspath = "/Users/julius/CCS/Personal_project/Files/questions/questions.txt" # change this to the path of a file with questions

# optional: change how the csv file is saved. If you don't change anything, the file will be named after the current date and time.
t = time.localtime()
current_time = time.strftime("%Y-%m-%d_%H-%M-%S", t)  # no colons, so the file name is valid on every operating system
csv_file_path = os.path.join(folderpath, f"{current_time}.csv")

## Running the actual programme
* Run the cell below to run the programme.

In [None]:
run()