## Exercise A

Embeddings have many uses, and combining them with other APIs extends what they can do. One example is pairing embeddings with chat completions to extract information from a PDF and then build a function that answers questions about the document.

In the following exercise you will create a program that retrieves information from a PDF and answers questions about it. To achieve this you must:
*   Convert a PDF file to embeddings and save them in a CSV file
*   Use embeddings to search the CSV file for the text most related to a user query
*   Send that text, together with the query, to chat completions

The PDF used in this exercise will be `LETI_SISTCA_2023_24_Team2_OpenAI.pdf`.



**Start by importing the required dependencies, initializing the client, and creating the constants**

In [74]:
from openai import OpenAI
import pandas as pd  
import re 
import tiktoken 
import PyPDF2

import ast
from scipy import spatial  

client = OpenAI()


SECTIONS_TO_IGNORE = [
    "Contents",
    "List of Tables",
    "List of Figures",
    "References",
]

MAX_TOKENS = 1600
BATCH_SIZE = 1000   

# TODO #1: Create consts for GPT_MODEL and EMBEDDING_MODEL (small)
EMBEDDING_MODEL = "text-embedding-3-small"
GPT_MODEL = "gpt-3.5-turbo"

**Simple logic to extract the information from the PDF**

The functions below extract the text from the PDF, split it into sections, and break long sections into chunks that fit the token limit.

In [66]:
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ''
        for page_num in range(len(reader.pages)):
            text += reader.pages[page_num].extract_text()
    return text

def split_sections_from_pdf(pdf_text):
    title = []
    text = []
    ignore = True
    current_section = "I.N.I.T.I.A.L-V.A.L.U.E"
    for line in pdf_text.split('\n'):
        line = line.strip()
        
        if not line:
            continue
        if is_new_section(line):
            ignore = ignore_section(line)
            if ignore:
                continue
            title.append(line)
            if current_section != "I.N.I.T.I.A.L-V.A.L.U.E":
                text.append(current_section)
            current_section = ""
        else:
            if not ignore:
                current_section += " " + line
    if current_section:
        text.append(current_section)
        
    
    sections = [(title),(text)]
    return sections

def ignore_section(line):
    if any(section in line for section in SECTIONS_TO_IGNORE):
        return True
    return False

def is_new_section(line):
    pattern = r"\d+\.\d+(?:\.\d+)? [A-Z].*?"
    
    if line.strip().count('.') > 7:
        return False
    if re.match(pattern, line.strip()):
        return True 
    if any(section in line for section in SECTIONS_TO_IGNORE):
        return True

    return False

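def clean_section(section):

    titles = section[0]
    text = section[1]

    # Remove citation markers such as [3] or [12] and trim whitespace.
    # A new list is built because reassigning the loop variable
    # would leave the original list unchanged.
    cleaned_text = [re.sub(r"\[\d+\]", "", line).strip() for line in text]

    return (titles, cleaned_text)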


def num_tokens(text, model = GPT_MODEL):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def halved_by_delimiter(string, delimiter = "\n"):
    chunks = string.split(delimiter)
    if len(chunks) == 1:
        return [string, ""]  
    elif len(chunks) == 2:
        return chunks 
    else:
        total_tokens = num_tokens(string)
        halfway = total_tokens // 2
        best_diff = halfway
        for i, chunk in enumerate(chunks):
            left = delimiter.join(chunks[: i + 1])
            left_tokens = num_tokens(left)
            diff = abs(halfway - left_tokens)
            if diff >= best_diff:
                break
            else:
                best_diff = diff
        left = delimiter.join(chunks[:i])
        right = delimiter.join(chunks[i:])
        return [left, right]
    
    
def truncated_string(string, model, max_tokens, print_warning = True,):
    encoding = tiktoken.encoding_for_model(model)
    encoded_string = encoding.encode(string)
    truncated_string = encoding.decode(encoded_string[:max_tokens])
    if print_warning and len(encoded_string) > max_tokens:
        print(f"Warning: Truncated string from {len(encoded_string)} tokens to {max_tokens} tokens.")
    return truncated_string


def split_strings_from_subsection(title, text, max_tokens = 1000, model = GPT_MODEL, max_recursion = 5):

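    # title and text are both strings, so join them as a two-item list;
    # joining title + text would put the separator between every character.
    string = "\n\n".join([title, text])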
    num_tokens_in_string = num_tokens(string, model)

    if num_tokens_in_string <= max_tokens:
        return [string]

    elif max_recursion == 0:
        return [truncated_string(string, model, max_tokens)]

    else:
        for delimiter in ["\n\n", "\n", ". "]:
            left, right = halved_by_delimiter(text, delimiter=delimiter)
            if left == "" or right == "":

                continue
            else:

                results = []
                for half in [left, right]:
                    half_strings = split_strings_from_subsection(title, half,max_tokens,model,max_recursion - 1)
                    results.extend(half_strings)
                return results
            
    return [truncated_string(string, model, max_tokens)]
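
To see which lines the heading check accepts, the sketch below applies the same pattern and dot-count rule as `is_new_section` to a few made-up lines. The helper name `looks_like_heading` and the sample lines are illustrative assumptions, and the `SECTIONS_TO_IGNORE` check is left out for brevity.

```python
import re

# Same pattern as is_new_section: "N.N" or "N.N.N", a space, then a capital letter.
HEADING_PATTERN = r"\d+\.\d+(?:\.\d+)? [A-Z].*?"

def looks_like_heading(line):
    """Simplified copy of is_new_section without the SECTIONS_TO_IGNORE check."""
    line = line.strip()
    # Table-of-contents entries are padded with dot leaders, so skip dot-heavy lines.
    if line.count(".") > 7:
        return False
    return re.match(HEADING_PATTERN, line) is not None

samples = [
    "2.1 Chat Completions",                    # numbered subsection
    "2.1.3 Vision",                            # numbered sub-subsection
    "1 Introduction",                          # single-level number, no dot
    "2.1 Chat Completions . . . . . . . . 5",  # table-of-contents entry
    "The API returned 3.5 seconds later",      # ordinary sentence
]
results = [looks_like_heading(sample) for sample in samples]
for sample, result in zip(samples, results):
    print(f"{result!s:5} {sample}")
```

Note that a single-level heading such as `1 Introduction` is not matched, because the pattern requires at least two number levels.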



**Call the functions to get the cleaned PDF sections, then split them into strings**

In [67]:

# TODO #2: Call the previously created functions
pdf_text = extract_text_from_pdf("LETI_SISTCA_2023_24_Team2_OpenAI.pdf")
pdf_sections = split_sections_from_pdf(pdf_text)
cleaned_sections = clean_section(pdf_sections)

MAX_TOKENS = 1600
strings = []
titles = cleaned_sections[0]
texts = cleaned_sections[1]
for i in range(len(titles)):
    strings.extend(split_strings_from_subsection(titles[i], texts[i], max_tokens=MAX_TOKENS))




**Transforming the information to embeddings and saving it to a CSV file**

In [72]:
embeddings = []
for batch_start in range(0, len(strings), BATCH_SIZE):
    batch_end = batch_start + BATCH_SIZE
    batch = strings[batch_start:batch_end]
    
    # TODO #3: Make a request to the embeddings API with the batch as input
    response = client.embeddings.create(model=EMBEDDING_MODEL, input=batch)
    
    for i, be in enumerate(response.data):
        assert i == be.index  
    batch_embeddings = [e.embedding for e in response.data]
    embeddings.extend(batch_embeddings)
    
    
df = pd.DataFrame({"text": strings, "embedding": embeddings})


SAVE_PATH = "SISTCA_TEAM2.csv"
df.to_csv(SAVE_PATH, index=False)
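
The batching loop above slices the list of strings into consecutive groups of at most `BATCH_SIZE` items, so each request to the embeddings API has a bounded size. A minimal sketch of the same slicing, with no API call; the helper name `make_batches`, the small batch size, and the placeholder strings are assumptions for illustration.

```python
def make_batches(items, batch_size):
    """Split items into consecutive batches of at most batch_size elements."""
    return [items[start:start + batch_size] for start in range(0, len(items), batch_size)]

chunks = ["chunk 1", "chunk 2", "chunk 3", "chunk 4", "chunk 5"]
batches = make_batches(chunks, batch_size=2)
print(batches)

# Flattening the batches in order gives back the original list, which is why
# the collected embeddings can be paired with the strings by position.
flattened = [item for batch in batches for item in batch]
print(flattened == chunks)
```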

**The first step is done. Now we need to create a function so GPT can answer anything about the PDF using the saved embeddings**

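**Set the embedding model and read the CSV file**

In [76]:

# TODO #4: Set the EMBEDDING_MODEL used for queries. It must be the same model
# that produced the saved embeddings: vectors from different models are not comparable.
EMBEDDING_MODEL = "text-embedding-3-small"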

# TODO #5: Create a variable with the CSV file path
embeddings_path = "SISTCA_TEAM2.csv"

df = pd.read_csv(embeddings_path)
df['embedding'] = df['embedding'].apply(ast.literal_eval)
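
The `ast.literal_eval` step is needed because a CSV file stores every cell as text, so each embedding is read back as the string form of a list rather than as a list of floats. A small sketch with a made-up three-number vector (real embeddings are much longer):

```python
import ast

vector = [0.12, -0.5, 0.33]          # made-up stand-in for an embedding
stored = str(vector)                 # the text form a list takes in a CSV cell
restored = ast.literal_eval(stored)  # parse the text back into a Python list

print(type(stored).__name__, type(restored).__name__)
print(restored == vector)
```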

**Functions to compare the relatedness of the strings with the query**
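
Relatedness is measured here with cosine similarity, which is what the default `1 - spatial.distance.cosine(x, y)` computes. The sketch below does the same ranking in plain Python, using made-up two-dimensional vectors in place of real embeddings; the texts, numbers, and the helper name `cosine_similarity` are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": each text is paired with a hand-picked 2-D vector.
toy_rows = [
    ("about pricing", [1.0, 0.0]),
    ("about models", [0.0, 1.0]),
    ("about both", [1.0, 1.0]),
]
query_vector = [0.9, 0.1]  # a query that points mostly along the first axis

# Sort from most to least related, as strings_ranked_by_relatedness does.
ranked = sorted(
    toy_rows,
    key=lambda row: cosine_similarity(query_vector, row[1]),
    reverse=True,
)
ranked_texts = [text for text, _ in ranked]
print(ranked_texts)
```

The text whose vector points in nearly the same direction as the query comes first, regardless of vector length.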

In [78]:
def strings_ranked_by_relatedness(query, df , relatedness_fn = lambda x, y: 1 - spatial.distance.cosine(x, y), top_n = 100) :
    
    # TODO #6: Make a request to the embeddings API with the query as input
    query_embedding_response = client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=query,
    )
    
    query_embedding = query_embedding_response.data[0].embedding
    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]

strings, relatednesses = strings_ranked_by_relatedness("open ai", df, top_n=5)


def num_tokens(text, model = GPT_MODEL):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def query_message(query,df, model, token_budget):
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    introduction = 'Use the below articles on the document about OpenAI made by Team 2, composed by Patrícia Sousa, Carlos Alves, Jose Leal and Tiago Ribeiro, for SISTCA to answer the subsequent question. If the answer cannot be found in the articles, write "Sorry, the information you seek cannot be found in the document in question."'
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_article = f'\n\nPDF article section:\n"""\n{string}\n"""'
        if (
            num_tokens(message + next_article + question, model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question
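
`query_message` adds sections in order of relatedness and stops at the first one that would push the message over the token budget. The sketch below shows the same packing logic with a whitespace word count standing in for `tiktoken`, so it runs without any model; the helper names, the sample texts, and the budget of 15 are assumptions for illustration.

```python
def count_words(text):
    """Crude stand-in for a token counter: counts whitespace-separated words."""
    return len(text.split())

def build_message(introduction, sections, question, budget, count=count_words):
    """Append sections in order until the next one would exceed the budget."""
    message = introduction
    for section in sections:
        candidate = f"\n\nSection:\n{section}"
        if count(message + candidate + question) > budget:
            break  # stop at the first section that does not fit
        message += candidate
    return message + question

sections = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
packed = build_message(
    introduction="Answer using the sections.",
    sections=sections,
    question="\n\nQuestion: who?",
    budget=15,
)
print(packed)
```

With this budget the first two sections fit and the third is left out. Because the loop breaks rather than skips, a shorter section further down the ranking is never considered once one section fails to fit.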


**Create the final function**  


If chat completions does not have the information to answer the query, it will reply, as instructed in the prompt, _"Sorry, the information you seek cannot be found in the document in question."_

In [80]:
def ask(query, df = df, model = GPT_MODEL, token_budget = 4096 - 500, print_message = False):
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "You answer questions about the document made by Team2 for SISTCA about OpenAI."},
        {"role": "user", "content": message},
    ]
    
    # TODO #7: Make a request to the Chat Completions API
    
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0
    )
    response_message = response.choices[0].message.content
    return response_message


**Test the ask function to verify it**

Chat completions is not fully deterministic, even with `temperature=0`, so the answers may vary from one attempt to another.

In [85]:
# Test the function: be aware that the answers may differ between requests
print(ask("Scientific/technological background")) 
print(ask("Give me the authors")) 
print(ask("What can you tell me about the document")) 
print(ask("Give me the document structure"))

Sorry, the information you seek cannot be found in the document in question.
The authors of the document about OpenAI made by Team 2 for SISTCA are Patrícia Sousa, Carlos Alves, Jose Leal, and Tiago Ribeiro.
The document made by Team 2 for SISTCA about OpenAI covers various topics related to OpenAI, including Chat Completions, AI Companies, Assistants, Vision, Whisper, TTS (Text-to-Speech), and exercises related to challenges and integration of text-to-speech features. It also provides information on setting up OpenAI API keys and integrating text-to-speech features from OpenAI into projects.
The document structure includes sections such as Chat Completions, State-of-the-art AI companies, Exercise B 196 Challenge, Assistants, Vision, Exercise A, MacOS, Document Structure, and Whisper.
