# Data Scraping Using Prompt Engineering in Generative AI

Carolyn Kao (carolyn.kao@lseg.com)\
July 15, 2024

**Disclaimer**

>> This Jupyter notebook is intended solely for demonstration purposes. The content herein is unofficial and does not contain any proprietary or confidential information belonging to LSEG. All data used in this notebook is publicly available or generated for the purposes of this demo.
>> Furthermore, this notebook does not include the final solutions submitted for PoTaggle III and is not a representation of the final product or solution.

>> This tutorial is partially checked in to https://github.com/chkao831/06-2024_DataScraping_PoTaggle/

## Objective

Given a list of target URLs, the aim is to scrape each company's earnings release and earnings call details without manual data entry or collection.

<img src="./images/DALLE_beautifulsoup.png" alt="drawing" width="50%"/>

## Project Setup

You need an API key to call the Gemini API. \
Get an API key at https://aistudio.google.com/app/apikey

`config.py` is not checked in for security reasons. Refer to `config_sample.py` for an example.
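
As a rough sketch of one way to keep the key out of version control, the snippet below reads it from an environment variable instead of hard-coding it. The variable name `GEMINI_API_KEY` and the helper `load_api_key` are assumptions made for illustration; the actual `config_sample.py` in the repository may be organized differently.

```python
import os


def load_api_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    Both the default variable name and this helper are hypothetical.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the notebook."
        )
    return key
```

With a helper like this, `config.py` could reduce to a single line such as `API_KEY = load_api_key()`, and the notebook's `from config import API_KEY` would keep working unchanged.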

In [1]:
import google.generativeai as genai
from config import API_KEY



In [2]:
import textwrap
from IPython.display import Markdown

def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

In [3]:
# Configure the client with the API key imported from config.py.
# (Alternatively, read it from an environment variable, e.g. os.environ['API_KEY'].)
genai.configure(api_key=API_KEY)

# Choose a model that's appropriate for your use case.
model = genai.GenerativeModel('gemini-1.5-pro-latest')

| Model variant | Input(s) | Output | Optimized for |
|---|---|---|---|
| Gemini 1.5 Pro | Audio, images, videos, and text | Text | Complex reasoning tasks such as code and text generation, text editing, problem solving, data extraction and generation |
| Gemini 1.5 Flash | Audio, images, videos, and text | Text | Fast and versatile performance across a diverse variety of tasks |


## Response Generation

In [4]:
prompt = "Introduce, very briefly, about what function calling is in generative ai api call?"
response = model.generate_content(prompt)

to_markdown(response.text)

> Function calling in generative AI API calls lets you connect your AI model to external tools and APIs.  Instead of just generating text, the model can now **understand your intent to use a tool**,  **figure out what information it needs**, and then **make that API call for you**. 
> 
> Think of it like this: you're asking a helpful assistant to do something that requires more than just talking.  


### Temperature and Max Output Tokens

In [5]:
model = genai.GenerativeModel(
    'gemini-1.5-pro-latest',
    generation_config=genai.GenerationConfig(
        max_output_tokens=100,
        temperature=2,
    ))

Every prompt you send to the model includes parameters that control how the model generates responses. You can use GenerationConfig to configure these parameters. If you don't configure the parameters, the model uses default options, which can vary by model.

<img src="./images/temperature.png" alt="drawing" width="50%"/>

`temperature` controls the randomness of the output. Use higher values for more creative responses and lower values for more deterministic ones. Values range from 0.0 to 2.0, inclusive.

`max_output_tokens` (`maxOutputTokens` in the REST API) sets the maximum number of tokens to include in a candidate.
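
To build some intuition for the effect of temperature, the toy sketch below applies the textbook "divide the scores by the temperature, then take a softmax" recipe to three made-up token scores. This is an illustration only and not a description of how Gemini samples internally; the scores and the helper name `softmax_with_temperature` are assumptions, and a temperature of 0 (usually treated as always picking the top token) is deliberately not modelled.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("this sketch only handles temperature > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature almost all of the probability mass sits on the highest-scoring token, while at high temperature the distribution flattens and less likely tokens are sampled more often.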

In [6]:
response = model.generate_content(prompt)

to_markdown(response.text)

> Function calling in generative AI API calls lets you connect your AI model to external tools or APIs. 
> 
> **Simply put, you can now give the AI instructions that include using specific functions (like calculations, database lookups, or other API interactions) to generate richer, more dynamic responses.** 


In [7]:
model = genai.GenerativeModel(
    'gemini-1.5-pro-latest',
    generation_config=genai.GenerationConfig(
        max_output_tokens=200,
        temperature=0,
    ))

response = model.generate_content(prompt)
to_markdown(response.text)

> Function calling in generative AI API calls allows you to describe functions to the AI model, which it can then intelligently choose to use when generating its response.  Instead of just giving you text, the AI can now return structured data indicating a function to call and the data needed for that function. This lets you connect the AI to your own tools and systems, making it much more powerful and useful for building applications. 


## Extract text from a URL using BeautifulSoup

<img src="./images/soup.png" alt="drawing" width="50%"/>

In [8]:
from bs4 import BeautifulSoup
import requests

In [9]:
url = 'http://newsfile.refinitiv.com/getnewsfile/v1/story?guid=urn:newsml:reuters.com:20231220:nPn2HTQZpa&default-theme=true'

In [10]:
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
text_content = soup.get_text()
clean_text = ' '.join(text_content.split()).strip()

In [11]:
to_markdown(clean_text)

> JELD-WEN to Release Fourth Quarter and Full Year 2023 Results JELD-WEN to Release Fourth Quarter and Full Year 2023 Results PR Newswire CHARLOTTE, N.C., Dec. 20, 2023 CHARLOTTE, N.C., Dec. 20, 2023 /PRNewswire/ -- JELD-WEN Holding, Inc. (NYSE: JELD), a leading global manufacturer of building products, announced today that it will release fourth quarter and full year 2023 results on Monday, February 19, 2024. The company will hold a conference call to discuss the results at 8 a.m. ET on Tuesday, February 20, 2024. Interested investors and other parties can access the call either via webcast by visiting the Investor Relations section of the company's website at https://investors.jeld-wen.com, or by dialing 888-330-2446 from the United States or +1-240-789-2732 internationally and using the conference ID 1285715. For those unable to listen to the live event, a replay will be available on the company's website approximately two hours following completion of the call. To learn more about JELD-WEN, please visit the company's website at https://corporate.jeld-wen.com. About JELD-WEN Holding, Inc.JELD-WEN is a leading global designer, manufacturer and distributor of high-performance interior and exterior doors, windows, and related building products serving the new construction and repair and remodeling sectors. Headquartered in Charlotte, N.C., the company operates facilities in 16 countries in North America and Europe and employs approximately 18,000 people. Since 1960, the JELD-WEN team has been committed to making quality products that create safe and sustainable environments for customers, associates and local communities. The JELD-WEN family of brands includes JELD-WEN® worldwide; LaCantina™ and VPI™ in North America; and Swedoor® and DANA® in Europe. For more information, visit corporate.jeld-wen.com. 
Media Contact:Colleen PenhallVice President, Corporate Communications980-322-2681cpenhall@jeldwen.com Investor Relations Contact:James ArmstrongVice President, Investor Relations704-378-5731jarmstrong@jeldwen.com View original content to download multimedia:https://www.prnewswire.com/news-releases/jeld-wen-to-release-fourth-quarter-and-full-year-2023-results-302012716.html SOURCE JELD-WEN Holding, Inc.

In [12]:
organization_name = soup.find('div', class_='organization')
print(organization_name)

None


## Function Calling

### What's the problem above?

-- **Limited Context Understanding**!!

- Limitation: BeautifulSoup extracts data based on specified tags and classes without understanding the context or semantic meaning of the content.
- Impact: This can lead to incorrect data extraction if the target information is not well-defined or context-dependent.
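
The same limitation applies to any rule-based extractor, not only to tag lookups. As a small illustration, the sketch below runs a date pattern over an excerpt of the press release scraped above. The regular expression is a stand-in written for this example and is not part of the original notebook.

```python
import re

# Excerpt from the press release scraped above.
excerpt = (
    "announced today that it will release fourth quarter and full year 2023 "
    "results on Monday, February 19, 2024. The company will hold a conference "
    "call to discuss the results at 8 a.m. ET on Tuesday, February 20, 2024."
)

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Matches dates written like "February 20, 2024".
DATE_PATTERN = re.compile(rf"(?:{MONTHS}) \d{{1,2}}, \d{{4}}")

dates = DATE_PATTERN.findall(excerpt)
print(dates)
```

The pattern finds both dates in the excerpt but has no way of telling which one is the results release and which one is the conference call. Deciding that requires reading the surrounding sentence, which is the part a language model can help with.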


reference: https://ai.google.dev/gemini-api/docs/function-calling

### A simplified example for data scraping

In [13]:
model = genai.GenerativeModel(model_name='gemini-1.5-pro-latest')

In [14]:
prompt = f"""
I am reading a financial news. I heard that the company will 
be holding a call to discuss the year results. 
Please extract the timezone and the call date for me.

The financial news is extracted as {clean_text}
"""

In [15]:
chat = model.start_chat()
response = chat.send_message(prompt)

In [16]:
response.text

'The conference call is scheduled for **Tuesday, February 20, 2024**, at **8 a.m. ET** (Eastern Time). \n'

### Again, what's the problem above?

-- **Unstable output format**!!

Custom functions can be defined and provided to Gemini models using the Function Calling feature. The models do not directly invoke these functions, but instead generate structured data output that specifies the function name and suggested arguments. 

This output lets you write applications that take the structured output and call external APIs, and the resulting API output can then be incorporated into a further model prompt, allowing for more comprehensive query responses.
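
The application side of this flow can be sketched without calling the API at all. In the snippet below, `suggested_call` is a hand-written dictionary standing in for the structured output a model might return, and `AVAILABLE_FUNCTIONS` and `dispatch` are hypothetical names chosen for this example. The function itself mirrors the `get_time_and_date` function used in this notebook.

```python
def get_time_and_date(time_zone: str, date: str) -> dict:
    """Return the extracted time zone and date as a dictionary."""
    return {"time_zone": time_zone, "date": date}


# Registry of the functions the application is willing to run.
AVAILABLE_FUNCTIONS = {"get_time_and_date": get_time_and_date}


def dispatch(function_call: dict):
    """Look up the suggested function by name and call it with the suggested args."""
    name = function_call["name"]
    if name not in AVAILABLE_FUNCTIONS:
        raise KeyError(f"Model suggested an unknown function: {name}")
    return AVAILABLE_FUNCTIONS[name](**function_call["args"])


# Hand-written stand-in for a model's structured output (not a real API response).
suggested_call = {
    "name": "get_time_and_date",
    "args": {"time_zone": "ET", "date": "February 20, 2024"},
}

result = dispatch(suggested_call)
print(result)
```

Keeping an explicit registry means the application, not the model, decides which functions may actually run.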

| Parameter | Type | Required | Description |
|---|---|---|---|
| time_zone | string | yes | The time zone of the specified event. |
| date | string| yes | The date of the event. |
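
Because the model only suggests arguments, it can be worth checking them against the table before using them. The following is a minimal sketch of such a check; `REQUIRED_PARAMETERS` and `validate_args` are hypothetical names, and the check covers only presence and type, not whether the values are sensible.

```python
# Required parameters and their types, taken from the table above.
REQUIRED_PARAMETERS = {"time_zone": str, "date": str}


def validate_args(args: dict) -> list:
    """Return a list of problems; an empty list means the args match the table."""
    problems = []
    for name, expected_type in REQUIRED_PARAMETERS.items():
        if name not in args:
            problems.append(f"missing required parameter: {name}")
        elif not isinstance(args[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    return problems


print(validate_args({"time_zone": "ET", "date": "February 20, 2024"}))
print(validate_args({"date": 20240220}))
```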

In [17]:
def get_time_and_date(time_zone: str, date: str):
    """Get the time zone and the date of a specified event. 

    Args:
        time_zone: The time zone of the specified event
        date: The date of the event

    Returns:
        A dictionary containing the set time_zone and date.
    """
    return {
        "time_zone": time_zone,
        "date": date
    }

In [18]:
model = genai.GenerativeModel(model_name='gemini-1.5-pro-latest',
                              tools=[get_time_and_date])

In [19]:
chat = model.start_chat()
response = chat.send_message(prompt)

In [20]:
chat.history

[parts {
   text: "\nI am reading a financial news. I heard that the company will \nbe holding a call to discuss the year results. \nPlease extract the timezone and the call date for me.\n\nThe financial news is extracted as JELD-WEN to Release Fourth Quarter and Full Year 2023 Results JELD-WEN to Release Fourth Quarter and Full Year 2023 Results PR Newswire CHARLOTTE, N.C., Dec. 20, 2023 CHARLOTTE, N.C., Dec. 20, 2023 /PRNewswire/ -- JELD-WEN Holding, Inc. (NYSE: JELD), a leading global manufacturer of building products, announced today that it will release fourth quarter and full year 2023 results on Monday, February 19, 2024. The company will hold a conference call to discuss the results at 8 a.m. ET on Tuesday, February 20, 2024. Interested investors and other parties can access the call either via webcast by visiting the Investor Relations section of the company\'s website at https://investors.jeld-wen.com, or by dialing 888-330-2446 from the United States or +1-240-789-2732 inter

In [21]:
for k, v in chat.history[-1].parts[0].function_call.args.items():
    print(f'Parameter: {k}')
    print(f'Output: {v}')

Parameter: date
Output: February 20, 2024
Parameter: time_zone
Output: ET


To steer the output further toward a desired format, add formatting instructions to the prompt:

In [22]:
prompt = f"""
I am reading a financial news. I heard that the company will 
be holding a call to discuss the year results. 
Please extract the timezone and the call date for me.

Please note that after extracting the time zone, if it's an abbreviation, 
help me convert to its full term at your knowledge. 
And please convert the time to YYYY/MM/DD format. 

The financial news is extracted as {clean_text}
"""

In [23]:
response = chat.send_message(prompt)

In [24]:
response

response:
GenerateContentResponse(
    done=True,
    iterator=None,
    result=glm.GenerateContentResponse({
      "candidates": [
        {
          "content": {
            "parts": [
              {
                "function_call": {
                  "name": "get_time_and_date",
                  "args": {
                    "date": "2024/02/20",
                    "time_zone": "Eastern Time"
                  }
                }
              }
            ],
            "role": "model"
          },
          "finish_reason": 1,
          "index": 0,
          "safety_ratings": [
            {
              "category": 8,
              "probability": 1,
              "blocked": false
            },
            {
              "category": 7,
              "probability": 1,
              "blocked": false
            },
            {
              "category": 9,
              "probability": 1,
              "blocked": false
            },
            {
              "category": 1

In [25]:
for k, v in chat.history[-1].parts[0].function_call.args.items():
    print(f'Parameter: {k}')
    print(f'Output: {v}')

Parameter: date
Output: 2024/02/20
Parameter: time_zone
Output: Eastern Time


In [26]:
fc = response.candidates[0].content.parts[0].function_call

In [27]:
fc

name: "get_time_and_date"
args {
  fields {
    key: "time_zone"
    value {
      string_value: "Eastern Time"
    }
  }
  fields {
    key: "date"
    value {
      string_value: "2024/02/20"
    }
  }
}

In [28]:
assert fc.name == 'get_time_and_date'

In [29]:
fc.args['time_zone']

'Eastern Time'

In [30]:
fc.args['date']

'2024/02/20'

## Appendix: Wrap things up together

For demo purposes only.
- Prerequisite: fill in the URLs and the question prompt in `config.py`.

In [31]:
from config import url_list, question_prompt
from util import aggregate_information_into_json
import pandas as pd

In [32]:
# Output table: each URL contributes one "Earnings Release" row and one "Earnings Call" row.
columns = ['Event', 'Org', 'Ticker', 'TimeFrame', 'Date', 'MarketPhase', 'Estimate Days', 'Time', 'Timezone', 'Website', 'Phone_Number', 'Passcode', 'url']
df = pd.DataFrame(columns=columns)

model = genai.GenerativeModel(model_name='gemini-1.5-pro-latest', tools=[aggregate_information_into_json])
num_trials = 5  # maximum number of attempts per URL

for url in url_list:
    print(f"Working on {url}")
    # Fetch the page and reduce it to whitespace-normalized plain text.
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    text_content = soup.get_text()
    clean_text = ' '.join(text_content.split()).strip()
    for _ in range(num_trials):
        try:
            # Start a fresh chat on every attempt so a failed try leaves no state behind.
            chat = model.start_chat()
            response = chat.send_message(
                f"{question_prompt}. The release text is {clean_text}.")
            # Collect the suggested arguments; this fails (and triggers a retry)
            # when the reply carries no function call.
            res = dict(chat.history[-1].parts[0].function_call.args.items())

            row_release = {
                'Event': 'Earnings Release',
                'Org': res['organization_name'],
                'Ticker': res['ticker'],
                'TimeFrame': res['release_time_quarter'],
                'Date': res['news_release_date'],
                'MarketPhase': res['market_phase'],
                'url': url
            }
            row_call = {
                'Event': 'Earnings Call',
                'Org': res['organization_name'],
                'Ticker': res['ticker'],
                'TimeFrame': res['release_time_quarter'],
                'Date': res['call_date'],
                'Time': res['call_time'],
                'Timezone': res['release_timezone'],
                'Website': res['website'],
                'Phone_Number': res['phone_number'],
                'Passcode': res['passcode'],
                'url': url
            }
            # Append both rows, then stop retrying this URL.
            df = pd.concat([df, pd.DataFrame([row_release, row_call])], ignore_index=True)
            break
        except Exception as e:
            # Print the error and try again; the URL is skipped after num_trials failures.
            print(e)

Working on http://newsfile.refinitiv.com/getnewsfile/v1/story?guid=urn:newsml:reuters.com:20231220:nPn2HTQZpa&default-theme=true


Working on http://www.apog.com/news-releases/news-release-details/apogee-enterprises-announces-date-fiscal-2024-third-quarter
Working on http://investors.lee.net/news-releases/news-release-details/lee-enterprises-plans-quarterly-call-and-webcast-december-7-2023
Working on http://newsfile.refinitiv.com/getnewsfile/v1/story?guid=urn:newsml:reuters.com:20231204:nBwb4LlQSa&default-theme=true
'NoneType' object has no attribute 'items'
Working on http://radioone.gcs-web.com/news-releases/news-release-details/urban-one-inc-first-and-second-quarter-2023-results-conference
Working on http://investors.asana.com/news-releases/news-release-details/asana-announce-third-quarter-fiscal-year-2024-financial-results
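
The retry pattern in the loop above can be factored into a small helper. The sketch below is one possible shape for it and is not taken from the repository's `util` module; `call_with_retries` and `flaky_extraction` are hypothetical names, and the flaky function simply simulates the `'NoneType' object has no attribute 'items'` failure seen in the log.

```python
def call_with_retries(task, max_attempts=5):
    """Run task() until it succeeds or max_attempts is reached.

    Returns a (result, errors) pair; result is None if every attempt failed.
    """
    errors = []
    for _ in range(max_attempts):
        try:
            return task(), errors
        except Exception as exc:  # broad on purpose, mirroring the loop above
            errors.append(str(exc))
    return None, errors


# Stand-in for a model call that fails twice before succeeding.
attempts = {"count": 0}


def flaky_extraction():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise AttributeError("'NoneType' object has no attribute 'items'")
    return {"ticker": "NYSE:JELD"}


result, errors = call_with_retries(flaky_extraction)
print(result, len(errors))
```

Returning the collected errors alongside the result makes it easier to report which URLs needed retries, instead of only printing the messages as they occur.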


In [33]:
df

| | Event | Org | Ticker | TimeFrame | Date | MarketPhase | Estimate Days | Time | Timezone | Website | Phone_Number | Passcode | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Earnings Release | [JELD-WEN] | [NYSE:JELD] | [[q4, 2023]] | 20/12/2023 | [BMO] | | | | | | | http://newsfile.refinitiv.com/getnewsfile/v1/s... |
| 1 | Earnings Call | [JELD-WEN] | [NYSE:JELD] | [[q4, 2023]] | 20/02/2024 | | | 8:00 am | UTC-05:00 Eastern Time (US & Canada) | [https://investors.jeld-wen.com, https://corpo... | [[live, 888-330-2446,+1-240-789-2732], [replay... | [No Passcode Found] | http://newsfile.refinitiv.com/getnewsfile/v1/s... |
| 2 | Earnings Release | Apogee | NASDAQ:APOG | [[q3, 2024]] | 07/12/2023 | BMO | | | | | | | http://www.apog.com/news-releases/news-release... |
| 3 | Earnings Call | Apogee | NASDAQ:APOG | [[q3, 2024]] | 21/12/2023 | | | 8:00 am | UTC-06:00 Central Time (US & Canada) | [https://www.apog.com/events-and-presentations... | [[live, []], [replay, []]] | No Passcode Found | http://www.apog.com/news-releases/news-release... |
| 4 | Earnings Release | Lee Enterprises | NASDAQ:LEE | [[q4, 2023]] | December 01, 2023 | BMO | | | | | | | http://investors.lee.net/news-releases/news-re... |
| 5 | Earnings Call | Lee Enterprises | NASDAQ:LEE | [[q4, 2023]] | December 7, 2023 | | | 9:00 am | UTC-06:00 Central Time | https://www.lee.net | No phone-number Found | No Passcode Found | http://investors.lee.net/news-releases/news-re... |
| 6 | Earnings Release | System1 | NYSE:SST | [[q3, 2023]] | 04/12/2023 | AMC | | | | | | | http://newsfile.refinitiv.com/getnewsfile/v1/s... |
| 7 | Earnings Call | System1 | NYSE:SST | [[q3, 2023]] | 12/12/2023 | | | 5:00 pm | UTC-05:00 Eastern Time (US & Canada) | [https://www.businesswire.com/news/home/202312... | [[live, []], [replay, []]] | [No Passcode Found] | http://newsfile.refinitiv.com/getnewsfile/v1/s... |
| 8 | Earnings Release | Urban One | NASDAQ:UONEK,UONE | [[q1, 2023], [q2, 2023]] | 01/12/2023 | BMO | | | | | | | http://radioone.gcs-web.com/news-releases/news... |
| 9 | Earnings Call | Urban One | NASDAQ:UONEK,UONE | [[q1, 2023], [q2, 2023]] | 07/12/2023 | | | 10:00 am | EST | https://www.prnewswire.com/news-releases/urban... | 1-844-721-7241, +1) 409-207-6955 | 7824764 | http://radioone.gcs-web.com/news-releases/news... |
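
The `Date` column above mixes two styles (`20/12/2023` and `December 01, 2023`), which shows that prompt instructions alone do not guarantee a uniform format. A deterministic post-processing step is one way to close that gap. The sketch below is an assumption-laden example: `normalize_date` and `KNOWN_FORMATS` are hypothetical names, slash-separated dates are assumed to be day-first (as `20/12/2023` suggests), and month names are assumed to be in English.

```python
from datetime import datetime

# Formats observed in the Date column above; slash dates are assumed day-first.
KNOWN_FORMATS = ("%d/%m/%Y", "%B %d, %Y")


def normalize_date(raw: str) -> str:
    """Convert a date in one of the known formats to YYYY/MM/DD."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y/%m/%d")
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"Unrecognized date format: {raw!r}")


for raw in ("20/12/2023", "December 01, 2023", "December 7, 2023"):
    print(raw, "->", normalize_date(raw))
```

Applied to the table, this could be used as `df['Date'] = df['Date'].map(normalize_date)`, with unrecognized values surfacing as errors rather than passing through silently.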
