# Day 1 - Examples: Entity Extraction

Entity extraction means pulling specific pieces of information out of a given document or text. For instance, given a Wikipedia blurb, identify a person's date of birth. Note that we use the text completion APIs here because we do not need to hold a conversation with the model. We only give it instructions and expect an answer.

References/Further Reading:

OpenAI: The code here is a simplified version of https://github.com/openai/openai-cookbook/blob/main/examples/Entity_extraction_for_long_documents.ipynb

PaLM: The code here is based on https://github.com/GoogleCloudPlatform/python-docs-samples/blob/main/generative_ai/extraction.py

In [1]:
# Load environment variables
from dotenv import load_dotenv

load_dotenv("../../.env")

True

## Setup template

We set up a template here that asks the model to extract the information we want from a given document. We can then substitute different documents into this template and extract the same information from each one. Note the trailing "0." at the end of the prompt. It signals to the model that we want it to answer point by point, starting with point 0. Feel free to customize the questions to your liking!

In [10]:
# Placeholder token; it is swapped for the real document later with str.replace()
document = '<document>'
template_prompt=f'''Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output \"Not specified\".
When you extract a key piece of information, include the closest page number.
Use the following format:\n0. What's the name of the company?\n1. Who are the founders of the company?\n2. When is it founded?\n3. When did it raise series A?\nDocument: \"\"\"{document}\"\"\"\n\n0.'''
print(template_prompt)

Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output "Not specified".
When you extract a key piece of information, include the closest page number.
Use the following format:
0. What's the name of the company?
1. Who are the founders of the company?
2. When is it founded?
3. When did it raise series A?
Document: """<document>"""

0.
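
If you want to vary the questions without editing the template string by hand, the same prompt layout can be assembled programmatically. The following is a minimal sketch, not part of the original notebook: `build_prompt`, `questions`, and the sample sentence about "Acme" are hypothetical names and data introduced only for illustration, and the instruction wording is a shortened variant of the template above.

```python
def build_prompt(document: str, questions: list[str]) -> str:
    """Assemble an extraction prompt that ends with the "0." cue."""
    # Number the questions starting at 0, matching the template's format
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions))
    return (
        "Extract key pieces of information from this document.\n"
        'If a particular piece of information is not present, output "Not specified".\n'
        "Use the following format:\n"
        f"{numbered}\n"
        f'Document: """{document}"""\n\n'
        "0."  # trailing cue so the model starts answering at point 0
    )

# Hypothetical questions and document, for illustration only
questions = [
    "What's the name of the company?",
    "Who are the founders of the company?",
]
example_prompt = build_prompt("Acme was founded by Jane Doe.", questions)
print(example_prompt)
```

Building the numbered list from a Python list keeps the question numbering and the trailing cue consistent when questions are added or removed.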


### User Input

Here we add the document we want the model to read. Feel free to swap in a document of your choice!

In [12]:
select_document = """
Overview:
Tech Solutions Inc. is a leading technology consulting firm specializing in providing innovative solutions to businesses across various industries. We offer a comprehensive range of services including software development, IT consulting, project management, and cybersecurity solutions. 
With a strong focus on delivering exceptional quality and customer satisfaction, we have established ourselves as a trusted partner for organizations seeking digital transformation.
They were founded on April 12, 2005 and raise their first seed in 2007 and series A on May 25, 2007.

Founders:

Background: John Smith is a visionary entrepreneur with over 20 years of experience in the technology industry. He has a deep understanding of market trends and has successfully led several software development projects for multinational corporations.
Role in the Company: As a co-founder of Tech Solutions Inc., John Smith plays a pivotal role in shaping the company's strategic direction. His expertise in software development and leadership skills have been instrumental in driving the company's growth.
Sarah Johnson:

Background: Sarah Johnson is a highly accomplished technologist with a strong background in software engineering. She has extensive experience in managing complex IT projects and has a proven track record of delivering innovative solutions.
Role in the Company: As a co-founder of Tech Solutions Inc., Sarah Johnson leads the company's technical operations. Her deep knowledge of software engineering principles and commitment to excellence have been crucial in establishing the company as a leader in the industry.
Together, John Smith and Sarah Johnson founded Tech Solutions Inc. with the aim of providing cutting-edge technology solutions to help businesses thrive in the digital age. Their combined expertise and passion for innovation have been instrumental in the company's success.
"""


## OpenAI

Extracting entities with OpenAI is simple. We send the template (with our document substituted in) to the Completion API, and the returned text is what the model thinks the answers are.

In [14]:
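import openai

# Fill the template by swapping the placeholder for the actual document
prompt = template_prompt.replace('<document>', select_document)

# Legacy Completion interface (openai Python package before 1.0)
response = openai.Completion.create(
    model='text-davinci-003',
    prompt=prompt,
    temperature=0,  # keep the extraction as deterministic as possible
    max_tokens=1500,
)
# The prompt ends with "0.", so prepend it to show the full numbered list
print("0." + response['choices'][0]['text'])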

0. What's the name of the company?
Tech Solutions Inc.

1. Who are the founders of the company?
John Smith and Sarah Johnson (Page 1)

2. When is it founded?
April 12, 2005 (Page 1)

3. When did it raise series A?
May 25, 2007 (Page 1)
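
The printed answers are plain text, so using them downstream usually means parsing them first. Below is a minimal sketch of one way to do that, assuming the output keeps the shape shown above (a numbered question line, the answer on the following line, and blank lines between items). `parse_answers` and `sample_output` are hypothetical names, not part of the OpenAI library, and a model may not always follow this layout.

```python
import re

def parse_answers(completion_text: str) -> dict[int, dict[str, str]]:
    """Parse blank-line-separated blocks: a 'N. question' line, then the answer."""
    answers = {}
    for block in re.split(r"\n\s*\n", completion_text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        header = re.match(r"(\d+)\.\s*(.*)", lines[0])
        if header is None:
            continue  # skip anything that does not start with "N."
        answers[int(header.group(1))] = {
            "question": header.group(2),
            "answer": " ".join(lines[1:]),
        }
    return answers

# A shortened copy of the output above, used as test input
sample_output = """0. What's the name of the company?
Tech Solutions Inc.

1. Who are the founders of the company?
John Smith and Sarah Johnson (Page 1)
"""
parsed = parse_answers(sample_output)
print(parsed[1]["answer"])
```

Keying the result by question number makes it easy to look up an answer regardless of how the model rephrases the question.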


## PaLM

Extracting entities with PaLM is just as straightforward. We do the same thing as we did for OpenAI and send the prompt to the model.

In [9]:
from vertexai.preview.language_models import TextGenerationModel

# Fill the template with the document, same as in the OpenAI cell
prompt = template_prompt.replace('<document>', select_document)

# Load the PaLM text model and request a completion of the prompt
model = TextGenerationModel.from_pretrained("text-bison@001")
response = model.predict(prompt, max_output_tokens=1024)
print(response.text)

The name of the company is Tech Solutions Inc.
1. The founders of the company are John Smith and Sarah Johnson.
2. The company was founded on April 12, 2005.
3. The company raised series A on May 25, 2007.
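
Notice that the first line of this output has no "0." prefix, because the prompt already ended with it. A minimal sketch of how the response could be normalized into a dictionary follows, assuming each answer stays on a single line as it does above. `parse_numbered_lines` and `sample_response` are hypothetical names introduced for illustration and are not part of the Vertex AI SDK.

```python
import re

def parse_numbered_lines(response_text: str) -> dict[int, str]:
    """Map each 'N. answer' line to its answer, restoring the leading '0.' cue."""
    text = response_text.strip()
    if not re.match(r"\d+\.", text):
        # The prompt already ended with "0.", so the model omits it
        text = "0. " + text
    answers = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\d+)\.\s*(.+)", line)
        if match:
            answers[int(match.group(1))] = match.group(2).strip()
    return answers

# A shortened copy of the output above, used as test input
sample_response = """The name of the company is Tech Solutions Inc.
1. The founders of the company are John Smith and Sarah Johnson.
2. The company was founded on April 12, 2005."""
palm_answers = parse_numbered_lines(sample_response)
print(palm_answers[0])
```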
