# Objective

The purpose of this notebook is to devise a strategy for identifying key entities and their attributes in finance PDF files such as Earning calls transcripts. The process should be reasonably fast (~a couple of seconds).

Elements to be identified are:
- Entities such as people and companies
- Relevant attributes for these, such as roles or relationships

In addition to this, **I have prepared a second task to identify "business developments" and classify them as positive or negative** with a score between -10 (very negative) and 10 (very positive). The reasoning behind this extra task is to show the flexibility of the selected approach and also have another task to outline that advantages and disadvantages of the approach.

-----

To achieve these objectives, we have chosen to leverage [Instruct GPT](https://openai.com/research/instruction-following) in conjunction with the [Langchain library](https://python.langchain.com/docs/get_started/introduction). The decision to use Instruct GPT, rather than a smaller Transformer model like [REBEL](https://huggingface.co/Babelscape/rebel-large), is based on several compelling reasons (which have been outlined by Instruct GPT itself):

**Advantages of Instruct GPT:**

1. **Contextual Understanding:** Instruct GPT, being a larger and more advanced language model, possesses a deep understanding of context and nuances. This positions it well to comprehend and generate text that accurately captures the desired information.

2. **Higher Performance:** Instruct GPT consistently delivers superior results in various natural language understanding tasks due to its larger size and extensive training.

3. **Reduced Ambiguity:** Instruct GPT excels at handling ambiguous queries or instructions by considering context and providing comprehensive responses.

*Extra (outlined by me, text formatted by Chat-GPT)*

4. **Speed of Proof-of-Concept Creation:** Leveraging Instruct GPT's extensive knowledge, we can achieve promising results **without the need for extensive training data**. We can reliably apply zero-shot or few-shot learning techniques for rapid development.

5. **Task Flexibility:** Instruct GPT allows us to adapt our entity and relationship detection by making slight adjustments to the prompts. This means we can maintain a consistent pipeline while obtaining diverse results by varying input instructions.

**Considerations for Smaller Transformer Models:**

1. **Resource Constraints:** Smaller models demand fewer computational resources and memory, making them suitable for deployment in resource-constrained environments, especially if we plan to deploy our private model. `[Comment: This would only apply if we wanted to deploy the model privately]`

2. **Inference Speed:** Smaller models generally offer faster inference times, which can be crucial for real-time or high-throughput applications.

3. **Specificity:** If our information extraction task is highly specialized and domain-specific, a smaller model may suffice and prove to be a more cost-effective solution.

*Extra (outlined by me, text formatted by Chat-GPT)*

4.  **Data availability**. Related to point 3, but, if we have a lot of specific data, it would probably be interesting to try smaller models, because fine-tuning large LLMs can be very expensive.

In [2]:
import openai
import os
import PyPDF2
import re
import time
import tiktoken
import datetime
import pandas as pd

from pathlib import Path
from langchain.llms import AzureOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks import get_openai_callback

openai.api_type = "azure"
openai.api_base = "https://gpt3tests.openai.azure.com/"
openai.api_version = "2022-12-01"

openai.api_key = os.environ["OPENAI_API_KEY"]

# Deployment name in my Azure OpenAI Studio is "Davinci003", model is "text-davinci-003"
engine = "Davinci003" # uses p50k_base tokenizer
model = "text-davinci-003"
openai_api_version = "2022-12-01" # not sure about the openAI API version, did some tests

# Deployment name in my Azure OpenAI Studio is "GPT35Turbo", model is "gpt-35-turbo"
# engine = "GPT35Turbo" # uses cl100k_base tokenizer, but there seems to be issues: https://github.com/openai/openai-python/issues/304
# model = "gpt-35-turbo"
# openai_api_version = "2023-05-15"

MAX_GENERATION_LENGTH = 1000
MODEL_CONTEXT_LENGTH = 4097
EXTRA_TOKENS_FOR_TOKENIZATION_VARIABILITY = 50 # Not sure why, but I have run the same process multiple times and I have seen different tokenizations, need to double check this
TEMPERATURE = 0 # low temperature to avoid GPT's "imagination"

llm = AzureOpenAI(deployment_name=engine, openai_api_version=openai_api_version, temperature=TEMPERATURE)
llm.openai_api_key = openai.api_key
llm.openai_api_base = openai.api_base 
llm.max_tokens = MAX_GENERATION_LENGTH

# 1 - Prepare data

Our starting point is a PDF file containing text. To effectively use this text with our Language Model (LLM), we must adjust it to fit within the model's context length. Here's how we'll do it:

1. **Load the PDF:** We'll begin by loading the content from the PDF file.
2. **Divide into Paragraphs:** Next, we'll break down the text into paragraphs. This step helps us manage the text in smaller, more digestible chunks.
3. **Count Tokens per Paragraph:** Finally, we'll calculate the number of tokens in each paragraph. This information is crucial for generating prompts that fit within the model's capacity.

## 1.1 - Load PDF

In [3]:
pdf_file_name = "East West Bancorp"

# Open the PDF file
pdf_file_path = Path(f'./data/{pdf_file_name}.pdf')

## 1.2 - Dividing Text into Paragraphs

To make the most of our Language Model (LLM), we need to consider its context length limit, which is set at 4097 tokens in this case. Our goal is to extract and provide the LLM with as many complete paragraphs as possible while staying within this token limit. To this purpose we are first going to divide the text into the best possible paragraphs.

#### Process Overview:

1. **Opening the PDF File and Preparing the List for Paragraphs:**
   - Start by opening the PDF file and initializing an empty list to store paragraphs.
2. **Iterating Through PDF Pages:**
   - We go through each page of the PDF document.
3. **Extracting Text from Pages:**
   - Extract the text content from each page.
4. **Using a Regular Expression (`r'\n(?=[A-Z])'`):**
   - We employ a regular expression `r'\n(?=[A-Z])'` to split the text. This splits the text at line breaks (`\n`) that are followed by an uppercase letter, indicating the beginning of a new paragraph.
5. **Filtering and Collecting Valid Paragraphs:**
   - We remove empty paragraphs and add the valid ones to our list.
6. **Cleaning Paragraphs:**
   - To enhance readability, we remove any unnecessary line breaks within paragraphs.
7. **Adding a Line Break at Paragraph Start:**
   - For clear separation when using these paragraphs in prompts, we insert a line break at the beginning of each paragraph.

In [4]:
def extract_paragraphs_from_pdf(pdf_path):
    paragraphs = []

    with open(pdf_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)

        # Iterate through each page of the PDF
        for page_number in range(len(pdf_reader.pages)):
            page = pdf_reader.pages[page_number]
            page_text = page.extract_text()

            # Use regular expressions to split the text into paragraphs
            page_paragraphs = re.split(r'\n(?=[A-Z])', page_text)

            # Remove empty paragraphs
            page_paragraphs = [p.strip() for p in page_paragraphs if p.strip()]

            # Add the page paragraphs to the overall list of paragraphs
            paragraphs.extend(page_paragraphs)

    return paragraphs

# Divide the text into paragraphs
paragraphs = extract_paragraphs_from_pdf(pdf_file_path)

# Now 'paragraphs' is a list of strings, where each string is a paragraph from the PDF. Let's remove the "ignored" line breaks
cleaned_paragraphs = [paragraph.replace("\n", "") for paragraph in paragraphs]

# Let's add one line break at the beginning of each paragraph to make sure in the prompt they are separately considered
cleaned_paragraphs = ["\n" + paragraph for paragraph in paragraphs]

## 1.3 - Count number of tokens per paragraph

We need this information to divide the text into the appropriate blocks that will be later ingested by the LLM. To this purpose, we are going to use the `tiktoken` library.

Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you 
1. Whether the string is too long for a text model to process. 
2. How much an OpenAI API call costs (as usage is priced by token).

`tiktoken` supports three encodings used by OpenAI models:

| Encoding name           | OpenAI models                                       |
|-------------------------|-----------------------------------------------------|
| `cl100k_base`           | `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002`  |
| `p50k_base`             | Codex models, `text-davinci-002`, `text-davinci-003`|
| `r50k_base` (or `gpt2`) | GPT-3 models like `davinci`                         |

You can retrieve the encoding for a model using `tiktoken.encoding_for_model()` as follows:

```python
encoding = tiktoken.encoding_for_model('gpt-3.5-turbo')
encoding = tiktoken.encoding_for_model('text-davinci-003')
```

[Official documentation example for counting tokens](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb)

In [5]:
def count_tokens(encoding, text):
    return len(encoding.encode(text))

encoding = tiktoken.encoding_for_model(model)

# Count tokens for each paragraph and store the results in a list of dictionaries
token_counts = []
for paragraph in cleaned_paragraphs:
    num_tokens = count_tokens(encoding, paragraph)
    token_counts.append({'Paragraph': paragraph, 'Token Count': num_tokens})

# Create a Pandas DataFrame from the list of token counts
df = pd.DataFrame(token_counts)
print(f"Total number of tokens in the file: {df['Token Count'].sum()}")

df

Total number of tokens in the file: 9052


Unnamed: 0,Paragraph,Token Count
0,"\nEast West Bancorp, Inc. ( NASDAQ: EWBC ) Q3 ...",35
1,\nCompany Participants,3
2,\nAdrienne Atkinson - Director of Investor Rel...,10
3,\nDominic Ng - Chairman and Chief Executive Of...,10
4,\nChristopher Del Moral -Niles - Chief Financi...,11
...,...,...
223,\nDominic Ng,4
224,"\nWell, I just want to thank everyone for list...",37
225,\nOperator,3
226,\nThe conference has now concluded. Thank you ...,31


# 2 - Tasks

We are going to use GPT 3 for several different tasks:
* Identify entities and relationships
* Identify business developments and classify them as positive or negative with a score betwen -10 (very negative) and 10 (very positive)

For the "entities & relationships" task, we are going to use a one-shot learning approach, where we provide an example of what we want the model to do.
For the business developments task, we are going to use a zero-shot learning approach. `[Comment: I tried a one-shot approach, but it confused the model (probably needed more work)]`

## 2.1 - Identifying Entities and Relationships

In this task, we combine named-entity recognition and relation extraction in a generative approach. To accomplish this, we will utilize [Instruct GPT](https://openai.com/research/instruction-following) with Langchain, employing a straightforward [LLMChain](https://docs.langchain.com/docs/components/chains/llm-chain). An LLMChain represents a common type of chain, composed of a PromptTemplate, a model (either an LLM or a ChatModel), and, optionally, an output parser.

Our objective involves identifying the following elements:
- Entities, such as individuals and companies.
- Pertinent attributes associated with these entities, such as roles or relationships.

We will primarily focus on three types of relationships that exist between entities:

1. `<is_a>`: This relationship is useful for defining the nature of companies, places, or assets.
2. `<works_at>`: This relationship indicates the place of employment for a person.
3. `<has_position>`: This relationship specifies the role or position held by a person within the company they are associated with.

**Please note:** While our approach can be expanded to encompass more intricate relationships, we have chosen to concentrate on these three examples for clarity and demonstration purposes.


In [6]:
entity_relationships_task = "entity_relationships"

### 2.1.1 - Prepare the prompt

#### Load base prompt

In [7]:
# Base template for entity relationship extraction tasks
base_template_path = Path(f"./prompts/templates/{entity_relationships_task}/base_template.txt")

# File showing the relationships we are interested in extracting
relationships_template_path = Path(f"./prompts/templates/{entity_relationships_task}/pcpa.txt")

# Example prompt for the specific relationships defined above
example_prompt_path = Path(f"./prompts/examples/{entity_relationships_task}/pcpa_microsoft.txt")

# Read the template content from the file
with open(base_template_path, 'r') as prompt_file:
    base_template = prompt_file.read()

# Read the template content from the file
with open(relationships_template_path, 'r') as prompt_file:
    relationships_template = prompt_file.read()

# Read the template content from the file
with open(example_prompt_path, 'r') as prompt_file:
    example_prompt = prompt_file.read()

prompt_text = relationships_template.format(example=example_prompt)
prompt_text = base_template.format(prompt=prompt_text)

n_prompt_tokens = count_tokens(encoding, prompt_text)
print(n_prompt_tokens)

899


#### Prepare chunks of text

In [8]:
def generate_chunks(dataframe, max_tokens_per_chunk):
    # Initialize variables
    chunks = []
    current_chunk = []
    current_token_count = 0

    # Iterate through the DataFrame rows
    for index, row in dataframe.iterrows():
        paragraph = row['Paragraph']
        num_tokens = row['Token Count']

        # If adding the current paragraph to the current chunk does not exceed the limit
        if current_token_count + num_tokens <= max_tokens_per_chunk:
            current_chunk.append(paragraph)
            current_token_count += num_tokens
        else:
            # Add the current chunk to the list of chunks
            if current_chunk:
                chunks.append("".join(current_chunk))
            # Start a new chunk with the current paragraph
            current_chunk = [paragraph]
            current_token_count = num_tokens

    # Add the last chunk, if it exists
    if current_chunk:
        chunks.append("".join(current_chunk))

    return chunks


chunk_size = MODEL_CONTEXT_LENGTH - MAX_GENERATION_LENGTH - EXTRA_TOKENS_FOR_TOKENIZATION_VARIABILITY - n_prompt_tokens

chunks = generate_chunks(df, chunk_size)

total_token_count = 0
for i in range(0, len(chunks)):
    token_count = count_tokens(encoding, chunks[i])
    print(f"{i}: {token_count}")
    total_token_count += token_count

print(f"Total: {total_token_count}")

0: 2133
1: 2147
2: 2133
3: 2141
4: 498
Total: 9052


### 2.1.2 - Run prompt through the LLM

In [9]:
def llm_run(chain, query):
    start_time = time.time()  # Record the start time

    with get_openai_callback() as cb:
        result = chain.run(query)
        print(f'Spent a total of {cb.total_tokens} tokens')

    end_time = time.time()  # Record the end time
    execution_time = end_time - start_time  # Calculate the execution time in seconds
    print(f'Time taken: {round(execution_time, 2)} seconds')

    return result

# Create the prompt template
prompt = PromptTemplate(
    input_variables=["input_text"],
    template=prompt_text + "\n{input_text}",
)

# Create the simple LLMChain
llm_chain = LLMChain(llm=llm, prompt=prompt)

# Run the prompts through the LLM
er_outputs = []
for i in range(len(chunks)):
    print(f"Chunk {i}")
    output = llm_run(llm_chain, chunks[i])
    er_outputs.append(output)

Chunk 0


### 2.1.3 - Format output

In [None]:
def split_relationship_text(text):
    # Split the text based on "<" and ">" to separate the relationship part
    parts = text.split("<")
    
    if len(parts) == 2:
        # Extract entities and relationship
        entity1 = parts[0].strip()
        relationship, entity2  = parts[1].split(">")
        entity2 = entity2.strip()
        return entity1, relationship, entity2
    else:
        # Handle invalid input gracefully
        print(f"Invalid format for: {text}")
        return None, None, None

er_output_dfs = []
for output in er_outputs:
    er_output_dfs.append(pd.DataFrame({"Triplet": output.split("\n")}))

er_output_df = pd.concat(er_output_dfs)
er_output_df = er_output_df[er_output_df["Triplet"] != ""]
er_output_df = er_output_df.drop_duplicates(["Triplet"]).reset_index(drop=True).copy()

# Apply the function to the DataFrame and create new columns
er_output_df[['Entity 1', 'Relationship', 'Entity 2']] = er_output_df['Triplet'].apply(lambda x: pd.Series(split_relationship_text(x)))

Invalid format for:  the 
Invalid format for: beginning of the year, and we expect this trend to continue. We will continue to focus on 
Invalid format for: managing our balance sheet prudently, and we will continue to focus on our sustainability 
Invalid format for: initiatives.
Invalid format for:  We're not selling assets to reduce gearing.
Invalid format for: Yeah. It's a good question. I think it's too early to comment on that. We do have a number of 
Invalid format for: long-term tenants in the portfolio, and we're in the process of renewing those leases. So I think 
Invalid format for: it's too early to comment on that.


### 2.1.4 - Store output

In [None]:
# Create a timestamp for the current date and time
timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# Define the output directory path using pathlib
er_output_directory = Path('./outputs') / pdf_file_name #/ timestamp
er_output_directory.mkdir(parents=True, exist_ok=True)

# Define the output file path for the Parquet file
er_output_file_path = er_output_directory / f'{entity_relationships_task}.parquet'

# Save the DataFrame to Parquet format
er_output_df.to_parquet(er_output_file_path, index=False)

In [None]:
er_output_df

Unnamed: 0,Triplet,Entity 1,Relationship,Entity 2
0,Timothy Collyer <has_position> Managing Director,Timothy Collyer,has_position,Managing Director
1,Michael Green <has_position> Chief Investment ...,Michael Green,has_position,Chief Investment Officer
2,Dion Andrews <has_position> Chief Financial Of...,Dion Andrews,has_position,Chief Financial Officer
3,Timothy Collyer <works_at> Growthpoint Propert...,Timothy Collyer,works_at,Growthpoint Properties Australia
4,Michael Green <works_at> Growthpoint Propertie...,Michael Green,works_at,Growthpoint Properties Australia
...,...,...,...,...
77,Woolworths <is_a> Company,Woolworths,is_a,Company
78,Linfox <is_a> Company,Linfox,is_a,Company
79,Erskine Park <is_a> Place,Erskine Park,is_a,Place
80,Woolworths <works_at> Western Australia,Woolworths,works_at,Western Australia


#### Some comments (`text-davinci-003`)

As observed, the model encounters difficulties due to the absence of well-defined complex relationships between entities, and the lack of a proper ontology exacerbates this issue. Consequently, the model occasionally misassigns `<works_at>` relationships to companies and places when, ideally, it should employ `<is_located_at>` or a similar relation. This confusion arises from the model's limited flexibility, stemming from our failure to provide clear constraints.

One straightforward enhancement is to establish constraints dictating that certain types of relationships should only exist between specific types of entities. For instance, we can flag `<works_at>` relationships that connect a pair of entities that are not `(Person, Company)`.


## 2.2 - Extract business developments and classify their sentiment

This task is a mix of summarization and classification. We first "summarize" the text by extracting key business developments and then classify those as positive or negative. To accomplish this, we will utilize [Instruct GPT](https://openai.com/research/instruction-following) with Langchain, employing a straightforward [LLMChain](https://docs.langchain.com/docs/components/chains/llm-chain). An LLMChain represents a common type of chain, composed of a PromptTemplate, a model (either an LLM or a ChatModel), and, optionally, an output parser.


**Note:** We could also try a different approach where we divide the process in two parts and use [SimpleSequentialChain](https://python.langchain.com/docs/modules/chains/foundational/sequential_chains), where the output of one step is the input to the next. This would allow us to use different models for differents parts of the process. It would be the approach I would have chosen if I had to work with less potent LLMs (also cheaper).

* The first chain would generate a list of developments
* The second chain would take the list of developments as input and classify them as either positive or negative

In [None]:
business_developments_task = "business_developments"

### 2.2.1 - Prepare the prompt

#### Load base prompt

This template is much simpler because we are not providing example (zero-shot learning), we just provide a generic formatting guideline, so we just load a single file.

In [None]:
# Base template for entity relationship extraction tasks
base_template_path = Path(f"./prompts/templates/{business_developments_task}/base_template.txt")

# Read the template content from the file
with open(base_template_path, 'r') as prompt_file:
    base_template = prompt_file.read()

prompt_text = base_template

n_prompt_tokens = count_tokens(encoding, prompt_text)
print(n_prompt_tokens)

220


#### Prepare chunks of text

We have previously defined the `generate_chunks()` function

In [None]:
chunk_size = MODEL_CONTEXT_LENGTH - MAX_GENERATION_LENGTH - EXTRA_TOKENS_FOR_TOKENIZATION_VARIABILITY - n_prompt_tokens

chunks = generate_chunks(df, chunk_size)

total_token_count = 0
for i in range(0, len(chunks)):
    token_count = count_tokens(encoding, chunks[i])
    print(f"{i}: {token_count}")
    total_token_count += token_count

print(f"Total: {total_token_count}")

0: 2798
1: 2817
2: 2687
3: 284
Total: 8586


### 2.2.2 - Run the prompt through the model

In [None]:
# Create the prompt template
prompt = PromptTemplate(
    input_variables=["input_text"],
    template=prompt_text + "\n{input_text}",
)

# Create the simple LLMChain
llm_chain = LLMChain(llm=llm, prompt=prompt)

# Run the prompts through the LLM
bd_outputs = []
for i in range(len(chunks)):
    print(f"Chunk {i}")
    output = llm_run(llm_chain, chunks[i])
    bd_outputs.append(output)

Chunk 0


Spent a total of 3628 tokens
Time taken: 17.57 seconds
Chunk 1
Spent a total of 3412 tokens
Time taken: 10.44 seconds
Chunk 2
Spent a total of 3225 tokens
Time taken: 9.18 seconds
Chunk 3
Spent a total of 612 tokens
Time taken: 2.61 seconds


In [None]:
bd_outputs

["\n\nBusiness Development Summary | Score | Reason for Score\nAcquisition of GSO Building in Dandenong | 8 | The acquisition of the GSO Building in Dandenong provides Growthpoint with a 9.4-year WALE and a modern, long-term Victorian Government tenanted building located in a growing major urban center. \nDivestment of 333 Ann Street, Brisbane | 8 | The divestment of 333 Ann Street, Brisbane provides Growthpoint with a 3.7-year WALE and a strong return on the asset acquired back in 2012 for AUD 109 million. \nLeasing Performance | 9 | Growthpoint achieved a strong leasing performance with over 156,000 square meters leased equivalent to 11.2% of income, resulting in occupancy of 93% across the portfolio. \nIntegration of Fortius Funds Management Platform | 8 | The integration of the Fortius Funds Management platform was successfully completed in September 2022, providing a long-term growth platform for the group. \nGrowth in Metro and Fringe CBD Office Market | 8 | According to JLL data

### 2.2.3 - Format output

In [None]:
def split_business_development_text(text):
    parts = text.split('|')
    if len(parts) == 3:
        business_development_summary = parts[0].strip()
        # score = int(parts[1].strip())
        score = parts[1].strip()
        reason_for_score = parts[2].strip()
        return business_development_summary, score, reason_for_score
    else:
        print(f"Invalid format for: {text}")
        return None, None, None

bd_output_dfs = []
for output in bd_outputs:
    bd_output_dfs.append(pd.DataFrame({"RAW Output": output.split("\n")}))

bd_output_df = pd.concat(bd_output_dfs)
bd_output_df = bd_output_df[bd_output_df["RAW Output"] != ""]
bd_output_df = bd_output_df.drop_duplicates(["RAW Output"]).reset_index(drop=True).copy()

# Apply the function to the DataFrame and create new columns
bd_output_df[['Business development', 'Score', 'Explanation']] = bd_output_df['RAW Output'].apply(lambda x: pd.Series(split_business_development_text(x)))
bd_output_df = bd_output_df.dropna()

Invalid format for:  
Invalid format for: Newstead with two government tenants.
Invalid format for: Yeah. It's a good question. I think it's too early to comment on that. We do have a number of 
Invalid format for: tenants that have been with us for a long time. We have a number of tenants that have been 
Invalid format for: with us for a shorter period of time. So it's really hard to comment on that. We do have a 
Invalid format for: number of tenants that have been with us for a long time, and we do have a number of 
Invalid format for: tenants that have been with us for a shorter period of time. So it's really hard to comment on 
Invalid format for: that. We do have a number of tenants that have been with us for a long time, and we do have 
Invalid format for: a number of tenants that have been with us for a shorter period of time. So it's really hard to 
Invalid format for: comment on that.


### 2.2.4 - Store output

In [None]:
# Create a timestamp for the current date and time
timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# Define the output directory path using pathlib
bd_output_directory = Path('./outputs') / pdf_file_name #/ timestamp
bd_output_directory.mkdir(parents=True, exist_ok=True)

# Define the output file path for the Parquet file
bd_output_file_path = bd_output_directory / f'{business_developments_task}.parquet'

# Save the DataFrame to Parquet format
bd_output_df.to_parquet(bd_output_file_path, index=False)

In [None]:
bd_output_df

Unnamed: 0,RAW Output,Business development,Score,Explanation
0,Business Development Summary | Score | Reason ...,Business Development Summary,Score,Reason for Score
1,Acquisition of GSO Building in Dandenong | 8 |...,Acquisition of GSO Building in Dandenong,8,The acquisition of the GSO Building in Dandeno...
2,"Divestment of 333 Ann Street, Brisbane | 8 | T...","Divestment of 333 Ann Street, Brisbane",8,"The divestment of 333 Ann Street, Brisbane pro..."
3,Leasing Performance | 9 | Growthpoint achieved...,Leasing Performance,9,Growthpoint achieved a strong leasing performa...
4,Integration of Fortius Funds Management Platfo...,Integration of Fortius Funds Management Platform,8,The integration of the Fortius Funds Managemen...
5,Growth in Metro and Fringe CBD Office Market |...,Growth in Metro and Fringe CBD Office Market,8,"According to JLL data, the CBD office markets ..."
6,Tenants Accommodating for Quick Feasible Occup...,Tenants Accommodating for Quick Feasible Occup...,8,Tenants are accommodating for quick feasible o...
7,Positive Momentum in Industrial Property Secto...,Positive Momentum in Industrial Property Sector,9,The positive momentum in the industrial market...
8,High Quality Assets Attractive to Tenants | 9 ...,High Quality Assets Attractive to Tenants,9,"Growthpoint's high quality, energy efficient b..."
9,Leasing Activity | 8 | Growthpoint executed 33...,Leasing Activity,8,"Growthpoint executed 33 leases, accounting for..."


Bad pipe message: %s [b'f\x89\x03\xaa\xf2i\nw\xa5\x1d\xb0\xd5|\x86R\xc2\x87( \xa2\xf3\x97Px%\xf3\xc1\xb9\xf5\xcf[\xd9h\xe4v\xf4t"\x05\xcc\xe6\xee\x08O\x9f=n\xff\x1d9\x85\x00\x08\x13\x02\x13\x03\x13\x01\x00\xff\x01\x00\x00\x8f\x00\x00\x00\x0e\x00\x0c\x00\x00\t127.0.0.1\x00\x0b\x00\x04\x03\x00\x01\x02\x00\n\x00\x0c\x00\n\x00\x1d\x00\x17\x00\x1e\x00\x19\x00\x18\x00#\x00\x00\x00\x16\x00\x00\x00\x17\x00\x00\x00\r\x00\x1e\x00\x1c']
Bad pipe message: %s [b'\x05\x03\x06', b'\x07\x08']
Bad pipe message: %s [b'\t\x08\n\x08\x0b\x08\x04']
Bad pipe message: %s [b'\x08\x06\x04\x01\x05\x01\x06', b'']
Bad pipe message: %s [b'\x93\x8e\x05\x13\x88&\xf3\xf5)\xac\nl3\xfd\xbf!g\x94 \xd6\x80w;\xf12tF-\xc9\xf0z\xe3B\xbe\xb2\xa3\x9f\xc8\xb7\x87\x1bL\x1bq7Be\xaa1\xb2\xe0\x00\x08\x13\x02\x13\x03\x13']
Bad pipe message: %s [b'\x03\x02\x03\x04\x00-\x00\x02\x01\x01\x003\x00&\x00$\x00\x1d\x00 \xc0l\xcdV\x8c\xfc\x1c\xc8\xbcC\x18D\xads\xf1tb\xed\x91\x14R\xb5']
Bad pipe message: %s [b'']
Bad pipe message: %s [b"\x93\x

#### Comments on Model Behavior (`text-davinci-003`):

**Variability in Result Formatting:**
It's noticeable that, even with a temperature setting of 0, the model generates answers in various formats. We can observe a few recurring patterns:
- Some Business development	answers begin with "Business development Summary," while others start with numbered lists like "1.," "2.," etc.
- The explanations for scores are sometimes presented directly, and other times, they begin with terms like "positive," "negative," or "neutral."

To mitigate this variability and achieve more consistent formatting in the answers, we have a couple of options:
* To improve the prompts we use (manually or with prompt fine-tuning)
* Fine-tune the model. 

Personally, I believe that fine-tuning the model using pre-tested gold-standard examples is the most effective solution. This way, we can ensure that the model produces the desired output consistently.

Alternatively, we can consider re-running sections of text that haven't been formatted as intended. We might use a higher temperature setting during these runs and verify if the output aligns with our expectations.

**Extraneous Outputs (Deviation from Guidelines):**
As evident, the model occasionally generates text that doesn't adhere to the instructions we provided. Fortunately, we disregarded these extraneous outputs during the formatting process. To address this issue, we should investigate why this occurs and explore potential improvements to prevent such deviations from our guidelines in the future.
