# Word embeddings for a PDF file using the OpenAI API

## Importing necessary libraries

In [11]:
import pdfplumber
import re
import openai  # for generating embeddings
import pandas as pd 
import tiktoken
import streamlit as st

## Extracting text from the manual pdf

The script below extracts text from a PDF and organizes it into sections. It identifies section headers and groups the lines that follow each header as that section's content.

Here is how the script works:

1. **Header Identification Function (`is_header`)**: The `is_header` function applies a simple heuristic: a line counts as a header if it is entirely uppercase or in title case (`str.isupper()` or `str.istitle()`). Because document formatting varies, this rule will misclassify some lines and usually needs adjusting for a particular PDF.

2. **Text Extraction and Sectioning (`extract_sections`)**: This function opens the PDF at `pdf_path` and groups its text into sections, each introduced by a header:
   - It uses `pdfplumber` to extract the raw text of each page.
   - It classifies each line as a header or as content. A header line that follows content closes the current section and starts a new one; consecutive header lines are concatenated into a single header.
   - After all pages are read, it joins each section's content lines into one string. The output is a list of dictionaries, one per section.

3. **Post-Processing for Cohesive Content Representation**: Once all lines are assigned to sections, the list of content lines in each section is joined with newline characters into a single string. This preserves the original line order and makes each section easier to pass to later processing steps.

4. **Resultant Data Structure**: The result is a list of dictionaries, one per section of the PDF. Each dictionary has two keys: `'header'`, the section's title, and `'content'`, the section's text.

Keeping each header together with its content gives the later steps, chunking and embedding, coherent units of text to work with.
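To see what the `isupper()`/`istitle()` heuristic accepts and rejects, it helps to run it on a few lines. The sketch below repeats the heuristic used in this notebook; `classify` is a hypothetical helper name, and the sample lines are either taken from the extracted output shown further down or made up for illustration.

```python
def is_header(text):
    # Same heuristic as in this notebook: all-uppercase or title-case lines.
    return text.isupper() or text.istitle()

def classify(lines):
    # Hypothetical helper: pair each line with the heuristic's verdict.
    return [(line, is_header(line)) for line in lines]

samples = [
    "Petrel Exploration Geophysics",            # title case: detected
    "BENEFITS",                                 # uppercase: detected
    "Petrel Geology and Geological Modeling",   # lowercase "and": missed
    "Interpret regional 2D and 3D projects at your desktop.",  # body text
    "See Figure 3",                             # short body text: false positive
]

for line, verdict in classify(samples):
    print(f"{verdict!s:5}  {line}")
```

The third sample suggests why `Petrel Geology and Geological Modeling` appears inside the content of `sections[1]` instead of starting its own section: `str.istitle()` rejects any line containing a lowercase word such as "and".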

In [12]:
def is_header(text):
    # Simple heuristic to determine if a line of text is a header.
    # This could be improved with more complex logic, analyzing font size, or using machine learning models.
    return text.isupper() or text.istitle()  # Modify this condition based on your specific criteria for headers

def extract_sections(pdf_path):
    sections = []
    current_section = {"header": "", "content": []}

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            if text:
                for line in text.split('\n'):
                    # Check if the line seems like a header
                    if is_header(line):
                        # If we were already filling a section, store it and start a new one
                        if current_section["content"]:
                            sections.append(current_section)
                            current_section = {"header": "", "content": []}
                        current_section["header"] += line + ' ' 
                    else:
                        # Otherwise, it's normal content, add it to the current section
                        current_section["content"].append(line)

        if current_section["content"]:
            sections.append(current_section)

    for section in sections:
        section["content"] = '\n'.join(section["content"])  # Join the list of content lines into a single string

    return sections

In [13]:
# Use the function on your PDF
pdf_path = r"path_to_your_pdf.pdf"  # Update this path
sections = extract_sections(pdf_path)


In [14]:
sections[1]

{'header': 'Petrel Exploration Geophysics ',
 'content': 'Interpret regional 2D and 3D projects at your desktop. Make use of high-performance computing\nfor improved regional understanding.\nPetrel Geology and Geological Modeling\nObtain accurate, high-resolution geological models of reservoir structure and stratigraphy.\nClassification and Estimation | Petrel Facies Modeling | Petrel Well Correlation | Petrel Surface'}

In [15]:
sections

[{'header': '',
  'content': 'WWeellccoommee ttoo tthhee PPeettrreell** hheellpp mmaannuuaall\nPPeettrreell SSeeiissmmiicc ttoo SSiimmuullaattiioonn SSooffttwwaarree\nOOppttiimmiizzee eexxpplloorraattiioonn aanndd ddeevveellooppmmeenntt ooppeerraattiioonnss\nPPeettrreell sseeiissmmiicc ttoo ssiimmuullaattiioonn ssooffttwwaarree hheellppss iinnccrreeaassee rreesseerrvvooiirr ppeerrffoorrmmaannccee bbyy iimmpprroovviinngg aasssseett\ntteeaamm pprroodduuccttiivviittyy.. GGeeoopphhyyssiicciissttss,, ggeeoollooggiissttss,, aanndd rreesseerrvvooiirr eennggiinneeeerrss ccaann ddeevveelloopp ccoollllaabboorraattiivvee\nwwoorrkkfflloowwss aanndd iinntteeggrraattee ooppeerraattiioonnss ttoo ssttrreeaammlliinnee pprroocceesssseess..\nBBeenneeffiittss\nUUnniiffyy wwoorrkkfflloowwss ffoorr EE&&PP tteeaammss-- EElliimmiinnaattee tthhee ggaappss iinn ttrraaddiittiioonnaall ssyysstteemmss tthhaatt rreeqquuiirree\nhhaannddooffffss ffrroomm oonnee tteecchhnniiccaall ddoommaaiinn ttoo tthhee nneexxtt uus
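
The doubled characters in the first section (`WWeellccoommee`) are often a sign that the PDF draws the same text twice, for example to imitate a bold font, and the extractor returns both copies. pdfplumber provides `Page.dedupe_chars()` for this situation. As a text-only fallback, the sketch below collapses a line when every token in it consists of doubled characters. `is_doubled` and `undouble_line` are hypothetical helper names, and the all-tokens rule is an assumption chosen so that ordinary text such as `11` or `mm` is left alone.

```python
def is_doubled(token):
    # True if the token is made of consecutive identical pairs, e.g. "ttoo".
    return (
        len(token) >= 2
        and len(token) % 2 == 0
        and all(token[i] == token[i + 1] for i in range(0, len(token), 2))
    )

def undouble_line(line):
    # Collapse the line only if every token is doubled; otherwise return it unchanged.
    tokens = line.split()
    if tokens and all(is_doubled(t) for t in tokens):
        return " ".join(t[::2] for t in tokens)
    return line

print(undouble_line("WWeellccoommee ttoo tthhee PPeettrreell** hheellpp mmaannuuaall"))
# Welcome to the Petrel* help manual
print(undouble_line("Benefits of version 11"))
# Benefits of version 11
```

Applied to each line before sectioning, a filter like this would also let `is_header` see the real text of the first page. Lines that mix doubled and normal tokens are left untouched by design.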

Next, we combine each section's header and content into one string and collect the strings in a list.

In [16]:
sections_list = []

for section_dict in sections:
    # Extract the 'header' and 'content' from the dictionary
    header = section_dict['header']
    content = section_dict['content']
    
    # Combine 'header' and 'content' with a newline in between
    combined_string = f"{header}\n{content}"
    
    # Append the combined string to your new list
    sections_list.append(combined_string)

In [33]:
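# Length in characters of the longest combined section.
# (A variable named `max` would shadow Python's built-in max().)
max_len = max((len(section) for section in sections_list), default=0)

max_len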

43417

## Preprocessing the text before embedding

The script below regroups the section strings into chunks of roughly 6,000 to 8,000 characters each. The sections vary widely in length (the longest one above has 43,417 characters), and text-processing APIs limit how much text a single input may contain, so long sections are split and short ones are merged.

Here's a detailed breakdown of the script's operation:

1. **Utility Function (`split_text_into_sentences`)**: This function splits text into sentences with a regular expression. It breaks at whitespace that follows a period or question mark, while skipping common false positives such as abbreviations.

2. **Main Processing Function (`process_strings`)**: This function iterates over each string in the list, splits it into sentences, and accumulates the sentences into chunks that respect the minimum and maximum character limits. Chunks can span the boundary between two input strings.

3. **Buffer Management for Text Aggregation**: A `buffer` string accumulates sentences. When adding the next sentence would push the buffer past `max_chars`, the function checks the buffer's length:

   - If the buffer has reached `min_chars`, it is appended to the output list and a new buffer is started with the current sentence.
   - If the buffer is still shorter than `min_chars`, it may grow past `max_chars` before it is emitted, so that no emitted chunk falls below the minimum. This case arises only when a single sentence is longer than the gap between the two limits.

4. **Post-Processing and Edge Case Handling**: After all strings are processed, whatever remains in the buffer is appended to the output only if it has reached `min_chars`. A shorter final remainder is dropped, so text at the very end of the document can be lost.

5. **Output Preparation**: The function returns a new list of chunk strings, stored here as `rearranged_strings`. Every chunk is at least `min_chars` long, and chunks stay within `max_chars` except in the oversize case described in step 3.

Because chunks are built from whole sentences in their original order, each chunk remains readable on its own and is ready for embedding.
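
The buffer logic is easier to follow with small limits. The following is a minimal sketch of the same idea, not the notebook's function: `split_sentences` and `chunk_sentences` are hypothetical names, the sentence splitter is deliberately simpler than the regular expression used below, and, unlike `process_strings`, this sketch keeps a final remainder that is shorter than `min_chars`.

```python
import re

def split_sentences(text):
    # Simplified splitter: break at whitespace after "." or "?".
    return [s.strip() for s in re.split(r"(?<=[.?])\s+", text) if s.strip()]

def chunk_sentences(sentences, min_chars, max_chars):
    chunks, buffer = [], ""
    for sentence in sentences:
        candidate = f"{buffer} {sentence}".strip()
        if len(candidate) <= max_chars:
            buffer = candidate          # sentence still fits
        elif len(buffer) >= min_chars:
            chunks.append(buffer)       # buffer is long enough: emit it
            buffer = sentence           # and start a new one
        else:
            chunks.append(candidate)    # buffer too short: accept an oversize chunk
            buffer = ""
    if buffer:
        chunks.append(buffer)           # assumption: keep the tail instead of dropping it
    return chunks

text = "Aaaa bbbb. Cccc dddd. Eeee ffff. Gggg hhhh."
print(chunk_sentences(split_sentences(text), min_chars=15, max_chars=25))
# ['Aaaa bbbb. Cccc dddd.', 'Eeee ffff. Gggg hhhh.']
```

With `min_chars=15` and `max_chars=25`, two ten-character sentences fit in one chunk and a third does not, so the input is split evenly into two chunks.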


In [42]:
import re

def split_text_into_sentences(text):
    # This function splits text into a list of sentences.
    sentence_endings = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')
    sentences = sentence_endings.split(text)
    return sentences

def process_strings(strings, min_chars=6000, max_chars=8000):
    """
    Process the list of strings, rearranging and splitting content so that each element adheres to character limits.

    :param strings: List of strings.
    :param min_chars: Minimum number of characters allowed in each section.
    :param max_chars: Maximum number of characters allowed in each section.
    :return: New list of strings, with content rearranged and split according to specified limits.
    """
    processed_list = []
    buffer = "" 

    for content in strings:
        sentences = split_text_into_sentences(content)
        for sentence in sentences:
            sentence = sentence.strip()
            if not sentence:
                continue

            # If adding the new sentence doesn't exceed the maximum character limit, we add it to the buffer.
            if len(buffer) + len(sentence) + 1 <= max_chars:  # +1 for space
                buffer += (sentence + " ")
            else:
                if len(buffer) >= min_chars:
                    processed_list.append(buffer.strip())
                    buffer = sentence + " "  # Start a new section with the current sentence
                else:
                    # The buffer is below min_chars and the sentence does not fit under max_chars.
                    # Append the sentence anyway and emit the oversized section, rather than
                    # emitting a section shorter than min_chars. Nothing is popped from
                    # `sentences`, because mutating the list while iterating over it would
                    # duplicate sentences that were already added and skip later ones.
                    buffer += sentence + " "
                    processed_list.append(buffer.strip())
                    buffer = ""  # Start fresh after the oversized section

    # Emit the remainder only if it reaches min_chars; a shorter tail is dropped
    if len(buffer) >= min_chars:
        processed_list.append(buffer.strip())

    return processed_list

# 'sections_list' is the list of combined header/content strings built above.
rearranged_strings = process_strings(sections_list, 6000, 8000)



## Embedding the document by chunks

In [44]:
# Configure your OpenAI API key (this notebook uses the interface of the `openai` package before version 1.0)
openai.api_key =  st.secrets['auth_key']
EMBEDDING_MODEL = "text-embedding-ada-002"  # OpenAI's best embeddings as of Apr 2023

# Number of texts sent per API request. Reduce it if a request still hits the token limit.
# The right value depends on the token counts of the texts in 'rearranged_strings'.
SAFE_BATCH_SIZE = 5  # Conservative batch size for chunks of up to roughly 8,000 characters

embeddings = []
batch_start_indices = range(0, len(rearranged_strings), SAFE_BATCH_SIZE)

for batch_start in batch_start_indices:
    batch_end = batch_start + SAFE_BATCH_SIZE
    batch_texts = rearranged_strings[batch_start:batch_end]
    
    try:
        print(f"Processing batch {batch_start} to {batch_end-1}")
        response = openai.Embedding.create(
            model=EMBEDDING_MODEL, 
            input=batch_texts
        )
        # Extract embeddings and extend the list
        batch_embeddings = [data["embedding"] for data in response["data"]]
        embeddings.extend(batch_embeddings)
    except Exception as e:
        print(f"Error processing batch {batch_start} to {batch_end-1}: {e}")
        continue  # Skip the problematic batch or handle the issue based on your project's requirements


Processing batch 0 to 4
Processing batch 5 to 9
Processing batch 10 to 14
Processing batch 15 to 19
Processing batch 20 to 24
Processing batch 25 to 29
Processing batch 30 to 34
Processing batch 35 to 39
Processing batch 40 to 44
Processing batch 45 to 49
Processing batch 50 to 54
Processing batch 55 to 59
Processing batch 60 to 64
Processing batch 65 to 69
Processing batch 70 to 74
Processing batch 75 to 79
Processing batch 80 to 84
Processing batch 85 to 89
Processing batch 90 to 94
Processing batch 95 to 99
Processing batch 100 to 104
Processing batch 105 to 109
Processing batch 110 to 114
Processing batch 115 to 119
Processing batch 120 to 124
Processing batch 125 to 129
Processing batch 130 to 134
Processing batch 135 to 139
Processing batch 140 to 144
Processing batch 145 to 149
Processing batch 150 to 154
Processing batch 155 to 159
Processing batch 160 to 164
Processing batch 165 to 169
Processing batch 170 to 174
Processing batch 175 to 179
Processing batch 180 to 184
Processi

In [45]:
# Every batch must have succeeded: a skipped batch would pair texts with the wrong embeddings
assert len(embeddings) == len(rearranged_strings), "Mismatch between number of texts and embeddings"

# Construct a DataFrame with one row per chunk
df = pd.DataFrame({
    "text": rearranged_strings,
    "embedding": embeddings
})
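
The embedding loop above skips a batch when the request fails. If that happens to any batch other than the last, the texts and embeddings no longer line up by position. One way to keep them aligned is to store the source index with each embedding. The following is a minimal sketch under that assumption: `embed_in_batches` is a hypothetical helper, and `fake_embed_batch` is a stand-in for the real API call so the example runs offline.

```python
def embed_in_batches(texts, embed_batch, batch_size=5):
    rows = []    # (index, text, embedding) for every text that was embedded
    failed = []  # (first_index, last_index) of every batch that raised
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        try:
            vectors = embed_batch(batch)
        except Exception:
            failed.append((start, start + len(batch) - 1))
            continue
        for offset, (text, vector) in enumerate(zip(batch, vectors)):
            rows.append((start + offset, text, vector))
    return rows, failed

def fake_embed_batch(batch):
    # Stand-in for the API call: fails on purpose for texts containing "fail".
    if any("fail" in text for text in batch):
        raise RuntimeError("simulated API error")
    return [[float(len(text))] for text in batch]

texts = ["a", "bb", "fail", "dddd", "eeeee"]
rows, failed = embed_in_batches(texts, fake_embed_batch, batch_size=2)
print([index for index, _, _ in rows])  # [0, 1, 4]
print(failed)                           # [(2, 3)]
```

The `rows` list can be passed to `pd.DataFrame(rows, columns=["index", "text", "embedding"])`, and the ranges in `failed` show which chunks to retry.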

In [46]:
df

|     | text | embedding |
|-----|------|-----------|
| 0 | WWeellccoommee ttoo tthhee PPeettrreell** hhee... | [-0.02119479887187481, -0.006573337130248547, ... |
| 1 | Benefits \n3D surveys and thousands of 2D seis... | [-0.034185562282800674, -2.8231421310920268e-0... |
| 2 | Surface attribute library for rapid prospect i... | [-0.03658333048224449, 0.013583413325250149, 0... |
| 3 | Uses a standard layer cake approach for domain... | [-0.013447406701743603, 0.005238193087279797, ... |
| 4 | Fault properties can then be visualized in the... | [-0.03174079954624176, -0.015956807881593704, ... |
| ... | ... | ... |
| 446 | Appendix 5 - Well Connection Calculations \nWh... | [-0.015207061544060707, -0.007961086928844452,... |
| 447 | Network access is defined by two values: an IP... | [-0.0030754979234188795, 0.007947798818349838,... |
| 448 | Observe how\nthe Memory usage information in t... | [-0.009794695302844048, -0.01491154171526432, ... |
| 449 | This name should be an existing name (see\nPet... | [-0.02186855487525463, 0.006534830201417208, -... |


## Store document chunks and embeddings

In [47]:
df.to_csv('petrel_manual.csv')
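
A CSV file stores each embedding as text, so after reloading, the `embedding` column contains strings such as `"[-0.034, 0.012]"` and not lists of floats. The sketch below shows the round trip with the standard library only, so that it runs without the notebook's data; the sample row is made up for illustration.

```python
import ast
import csv
import io

# Write one row the way a list-valued column ends up in a CSV file: as its string form.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embedding"])
writer.writerow(["Benefits", str([-0.034, 0.012])])

# Read it back: the embedding is now a plain string.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(type(rows[0]["embedding"]).__name__)  # str

# Parse the string back into a list of floats.
embedding = ast.literal_eval(rows[0]["embedding"])
print(embedding)  # [-0.034, 0.012]
```

With pandas, the same conversion after `pd.read_csv('petrel_manual.csv')` would be `df["embedding"].apply(ast.literal_eval)`.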