[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fw-ai/cookbook/blob/main/examples/function_calling/fireworks_functions_information_extraction.ipynb)

# Summarize Anything - Information Extraction via [Fireworks Function Calling](https://readme.fireworks.ai/docs/function-calling)

This notebook is inspired by an awesome Colab notebook by [Deepset](https://colab.research.google.com/github/anakin87/notebooks/blob/main/information_extraction_via_llms.ipynb). Check out their OSS LLM orchestration framework, [haystack](https://haystack.deepset.ai/).

In this experiment, we will use the function calling ability of the [Fireworks Function Calling](https://readme.fireworks.ai/docs/function-calling) model to generate structured information from unstructured data.

🎯 Goal: create an application that, given a text (or URL) and a specific structure provided by the user, extracts information from the source.


The "**function calling**" capability, first launched by [OpenAI](https://platform.openai.com/docs/guides/function-calling), unlocks this task: the user can describe a structure by defining a fake function with typed, specific parameters. The LLM then prepares the data in this form and sends it back to the user.

**Fireworks Function Calling**

Fireworks released a high-quality function calling model that is capable of handling long tool context, multi-turn conversations, and interleaving tool invocations with regular conversation. We will use this model as the LLM that powers our app.





## Setup
Let's first install and import the dependencies needed for the demo.

In [None]:
!pip install openai requests beautifulsoup4

In [None]:
import json
import os
from typing import Dict

import openai
from IPython.display import HTML, display

## Setup your API Key

To use the Fireworks AI function calling model, you must first obtain a Fireworks API key. If you don't already have one, you can get one by following the instructions [here](https://readme.fireworks.ai/docs/quickstart).

In [None]:
model_name = "accounts/fireworks/models/firefunction-v1"
client = openai.OpenAI(
    base_url = "https://api.fireworks.ai/inference/v1",
    api_key = "YOUR_FW_API_KEY",
)
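
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. Below is a minimal sketch of reading it from the environment instead. The variable name `FIREWORKS_API_KEY` and the helper `load_api_key` are assumptions made for illustration, not names the client requires.

```python
import os


def load_api_key(env_var: str = "FIREWORKS_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Hypothetical usage with the client defined above:
# client = openai.OpenAI(
#     base_url="https://api.fireworks.ai/inference/v1",
#     api_key=load_api_key(),
# )
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the first API call.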

## Introduction

The [documentation](https://readme.fireworks.ai/docs/function-calling) for FW function calling details the API we can use to specify the list of tools/functions available to the model. We will use the described API to test out the structured response use case.

Before we begin, let's give the function calling model a go with a simple toy example and examine its output.

In [None]:
tools = [
  {
    "type": "function",
    "function": {
      "name": "uber.ride",
      "description": "Find suitable ride for customers given the location, type of ride, and the amount of time the customer is willing to wait as parameters",
      "parameters": {
          "type": "object",
          "properties": {
             "loc":  { "type": "string", "description": "location of the starting place of the uber ride"},
             # JSON Schema has no "enum" type; an enum constrains the values of a string
             "type": { "type": "string", "enum": ["plus", "comfort", "black"], "description": "types of uber ride user is ordering"},
             "time": { "type": "string", "description": "the amount of time in minutes the customer is willing to wait"}
          }
      }
    }
  }
]
tool_choice = "auto"
user_prompt = "Call me an Uber ride type \"Plus\" in Berkeley at zipcode 94704 in 10 minute"
messages = [
    {
         "role": "system",
         "content": "You are a helpful assistant with access to tools. Use them wisely and don't imagine parameter values",
    },
    {
        "role": "user",
        "content": user_prompt,
    }
]

In [None]:
chat_completion = client.chat.completions.create(
    model=model_name,
    messages=messages,
    tools=tools,
    tool_choice=tool_choice,
    temperature=0.1
)

In [None]:
print(chat_completion.choices[0].message.model_dump_json(indent=4))

{
    "content": " ",
    "role": "assistant",
    "function_call": null,
    "tool_calls": [
        {
            "id": "call_m3SQZptP33ehkhux6ogd7zLC",
            "function": {
                "arguments": "{\"loc\": \"Berkeley\", \"type\": \"plus\", \"time\": \"10\"}",
                "name": "uber.ride"
            },
            "type": "function",
            "index": 0
        }
    ]
}


The model returns the function that should be called under the `tool_calls` field. Within each entry, the `name` field holds the name of the function to call, and the `arguments` field holds the arguments as a JSON-encoded string that is intended to follow the JSON Schema declared under `parameters`.


The output above shows a sample input and output for the function calling model. All good! ✅
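
Because `arguments` is a string rather than a dictionary, it has to be decoded before the values can be used. Here is a minimal sketch that uses the argument string from the sample output in place of a live response. `book_ride` is a hypothetical stand-in for whatever function the application would actually call.

```python
import json

# Argument string copied from the sample output above
raw_arguments = '{"loc": "Berkeley", "type": "plus", "time": "10"}'


def book_ride(loc: str, type: str, time: str) -> str:
    """Hypothetical local function whose parameters mirror the `uber.ride` schema."""
    # `time` arrives as a string because the schema declares it as one
    return f"Booking a {type} ride from {loc}, max wait {int(time)} min"


args = json.loads(raw_arguments)  # decode the JSON string into a dict
print(book_ride(**args))
```

With a live response, the same string would come from `chat_completion.choices[0].message.tool_calls[0].function.arguments`.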

## Document Retrieval & Clean Up

Before we can start extracting information, we first need to obtain the document at a given URL and then clean it up. To clean up the HTML, we will use [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/).

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
url = "https://www.rainforest-alliance.org/species/capybara/"

# Fetch the page content from the URL
page = requests.get(url)

# Function to remove tags
def remove_tags(html):

    # parse html content
    soup = BeautifulSoup(html, "html.parser")

    for data in soup(['style', 'script']):
        # Remove <style> and <script> elements together with their contents
        data.decompose()
    # return the remaining visible text, joined by single spaces
    return ' '.join(soup.stripped_strings)


# Extract the cleaned text
cleaned_content = remove_tags(page.content)
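
To see what this cleanup does without fetching a page, here is a small illustration on an inline HTML snippet. It deliberately swaps BeautifulSoup for the standard library's `html.parser` so that it runs with no extra dependencies. `TextOnlyParser` and `strip_tags_stdlib` are names made up for this sketch, which is meant to mirror `remove_tags` on simple input rather than replace it.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text only when we are outside script/style
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def strip_tags_stdlib(html: str) -> str:
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample_html = """
<html><head><style>p { color: red; }</style></head>
<body><h1>Capybara</h1><script>console.log("hi");</script>
<p>The capybara is a <b>large</b> rodent.</p></body></html>
"""
print(strip_tags_stdlib(sample_html))
# → Capybara The capybara is a large rodent.
```

The styling rules and the script body are gone, and only the text a reader would see is passed on to the model.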

## Setup Information Extraction using Function Calling

Once we have clean text from an HTML page at a given URL, we send it to the function calling model. Along with the cleaned text, we also send the schema in which we expect the model to produce its output. This schema is sent as the tool specification of the chat completion call.

For this notebook, we use the `animal_info_tools` schema to extract information from the species info pages of the [Rainforest Alliance](https://www.rainforest-alliance.org/). There are several attributes of the animal that we want the model to extract from the web page, e.g. `weight`, `habitat`, `diet`. Additionally, we mark some attributes as `required`, which asks the model to always output them regardless of the input. Since we supply the model with species information pages, we expect this information to always be present.

**NOTE** We set the temperature to 0.0 to get reliable and consistent output across calls. In this particular example, we want the model to produce the right answer rather than a creative one.

In [None]:
from typing import Dict, List, Any

def extract_data(tools: List[Dict[str, Any]], url: str) -> Dict[str, str]:
  assert len(tools) == 1, "Only one schema can be selected for data extraction"
  tool_choice = {
      "type": "function",
      "function": {
          "name": tools[0]["function"]["name"]
      }
  }
  page = requests.get(url)
  cleaned_content = remove_tags(page.content)

  messages = [
      {
          "role": "system",
          "content": "You are a helpful assistant with access to tools. Use them wisely and don't imagine parameter values."
      },
      {
          "role": "user",
          "content": f"Extract data from the following text. START TEXT {cleaned_content} END TEXT."
      }
  ]

  chat_completion = client.chat.completions.create(
    model=model_name,
    messages=messages,
    tools=tools,
    tool_choice=tool_choice,
    temperature=0.0
  )

  def val_to_color(val):
    """
    Helper function to return a color based on the type/value of a variable
    """
    if isinstance(val, list):
      return "#FFFEE0"
    if val is True:
      return "#90EE90"
    if val is False:
      return "#FFCCCB"
    return ""

  args = json.loads(chat_completion.choices[0].message.tool_calls[0].function.arguments)

  # Convert data to HTML format
  html_content = '<div style="border: 1px solid #ccc; padding: 10px; border-radius: 5px; background-color: #f9f9f9;">'
  for key, value in args.items():
      html_content += f'<p><span style="font-family: Cursive; font-size: 30px;">{key}:</span>'
      html_content += f'&emsp;<span style="background-color:{val_to_color(value)}; font-family: Cursive; font-size: 20px;">{value}</span></p>'
  html_content += '</div>'

  return {"html_visualization": html_content}
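
The rendering half of `extract_data` can be tried without an API key by feeding it a hand-written dictionary. The sketch below pulls that loop into a hypothetical `render_args` helper with simplified styling. Escaping values with `html.escape` is an addition that the function above does not do, and the sample values are made up.

```python
import html
from typing import Any, Dict


def val_to_color(val: Any) -> str:
    """Same color rule as in `extract_data`: lists, True and False get a tint."""
    if isinstance(val, list):
        return "#FFFEE0"
    if val is True:
        return "#90EE90"
    if val is False:
        return "#FFCCCB"
    return ""


def render_args(args: Dict[str, Any]) -> str:
    """Render extracted arguments as a simple HTML card (hypothetical helper)."""
    rows = []
    for key, value in args.items():
        # Escape text so markup inside a scraped value is shown, not interpreted
        rows.append(
            f"<p><b>{html.escape(str(key))}:</b> "
            f'<span style="background-color:{val_to_color(value)};">'
            f"{html.escape(str(value))}</span></p>"
        )
    return "<div>" + "".join(rows) + "</div>"


# Made-up values, only for previewing the layout
sample = {"about_animals": True, "habitat": ["rivers", "marshes"], "note": "<b>x</b>"}
card = render_args(sample)
print(card)
```

In the notebook, `display(HTML(card))` would show the result.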

In [None]:
animal_info_tools = [
{
  "type": "function",
  "function": {
      "name": "extract_data",
      "description": "Extract data from text",
      "parameters": {
          "type": "object",
          "properties": {
              "about_animals": {
                  "description": "Is the article about animals?",
                  "type": "boolean",
              },
              "about_ai": {
                  "description": "Is the article about artificial intelligence?",
                  "type": "boolean",
              },
              "weight": {
                  "description": "the weight of the animal in lbs",
                  "type": "integer",
              },
              "habitat": {
                  "description": "List of places where the animal lives",
                  "type": "array",
                  "items": {"type": "string"},
              },
              "diet": {
                  "description": "What does the animal eat?",
                  "type": "array",
                  "items": {"type": "string"},
              },
              "predators": {
                  "description": "What are the animals that threaten them?",
                  "type": "array",
                  "items": {"type": "string"},
              },
          },
          "required": ["about_animals", "about_ai", "weight", "habitat", "diet", "predators"],
      }
  }
}
]
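
Marking fields as `required` expresses intent, but it can still be worth checking that the parsed arguments actually contain them before relying on the result. Here is a minimal sketch with a hypothetical `missing_required` helper and a cut-down toy schema (not the full `animal_info_tools`):

```python
from typing import Any, Dict, List


def missing_required(tool: Dict[str, Any], args: Dict[str, Any]) -> List[str]:
    """Return the required parameter names that are absent from `args`."""
    required = tool["function"]["parameters"].get("required", [])
    return [name for name in required if name not in args]


# Toy schema in the same shape as the tool definitions in this notebook
toy_tool = {
    "type": "function",
    "function": {
        "name": "extract_data",
        "description": "Extract data from text",
        "parameters": {
            "type": "object",
            "properties": {
                "about_animals": {"type": "boolean"},
                "weight": {"type": "integer"},
                "habitat": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["about_animals", "weight", "habitat"],
        },
    },
}

# Made-up model output that omits one required field
sample_args = {"about_animals": True, "weight": 100}
print(missing_required(toy_tool, sample_args))
# → ['habitat']
```

An empty list means every required field is present. It says nothing about whether the values themselves are correct.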

### Let's learn about Capybara

Given the schema, we expect the model to produce some basic information like `weight`, `habitat`, `diet` & `predators` for the Capybara. You can visit the [webpage](https://www.rainforest-alliance.org/species/capybara/) to see the source of truth.

In [None]:
display(HTML(extract_data(animal_info_tools, url="https://www.rainforest-alliance.org/species/capybara/")['html_visualization']))

You can see that the model identifies the correct weight, `100 lbs`, for the Capybara even though the webpage also mentions the weight in `kgs`. It also identifies the correct habitat etc. for the animal.

### How about Yucatan Deer

In [None]:
display(HTML(extract_data(animal_info_tools, url="https://www.rainforest-alliance.org/species/yucatan-deer/")['html_visualization']))

### Something more massive - African Elephant

In [None]:
display(HTML(extract_data(animal_info_tools, url="https://www.rainforest-alliance.org/species/african-elephants/")['html_visualization']))

## Let's make the example fun - News Summarization

Now let's use a more fun example. For LLMs to leverage world knowledge, they need to be able to organize unstructured sources like websites into more structured information. Let's take the example of a news article announcing a new funding round for the startup [Perplexity AI](https://www.perplexity.ai/). For our sample news summarization app, the user only specifies the short list of information they want from the article and then asks the LLM to generate it for them.

In [None]:
# Detail the information needed
news_info_tools = [
  {
    "type": "function",
    "function": {
        "name": "extract_data",
        "description": "Extract data from text",
        "parameters": {
            "type": "object",
            "properties": {
                "about_ai": {
                    "description": "Is the article about artificial intelligence?",
                    "type": "boolean",
                },
                "company_name": {
                    "description": "The name of the company which is being referenced in document",
                    "type": "string",
                },
                "valuation": {
                    "description": "Valuation of the company which is being referenced in document",
                    "type": "string",
                },
                "investors": {
                    "description": "investors in the company being referenced in document",
                    "type": "array",
                    "items": {"type": "string"},
                },
                "competitors": {
                    "description": "competitors of the company being referenced in document",
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            "required": ["about_ai", "company_name", "valuation", "investors", "competitors"],
        }
    }
  }
]
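
Since both schemas in this notebook share the same shape, the tool definition could be generated from a compact field list, which fits the goal of letting the user specify the structure. This is a sketch under that assumption. `build_extraction_tool` and its `(type, description)` tuple format are invented here for illustration.

```python
from typing import Any, Dict, Tuple


def build_extraction_tool(fields: Dict[str, Tuple[str, str]]) -> Dict[str, Any]:
    """Build a single function-calling tool from {name: (json_type, description)}.

    A type written as "array[string]" becomes an array of strings.
    Every field is marked as required, as in the schemas above.
    """
    properties: Dict[str, Any] = {}
    for name, (json_type, description) in fields.items():
        prop: Dict[str, Any] = {"description": description}
        if json_type.startswith("array[") and json_type.endswith("]"):
            prop["type"] = "array"
            prop["items"] = {"type": json_type[len("array["):-1]}
        else:
            prop["type"] = json_type
        properties[name] = prop
    return {
        "type": "function",
        "function": {
            "name": "extract_data",
            "description": "Extract data from text",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(fields),
            },
        },
    }


tool = build_extraction_tool({
    "about_ai": ("boolean", "Is the article about artificial intelligence?"),
    "investors": ("array[string]", "investors in the company being referenced in document"),
})
print(tool["function"]["parameters"]["required"])
# → ['about_ai', 'investors']
```

The resulting `[tool]` list could then be passed as the `tools` argument of `extract_data`.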

In [None]:
display(HTML(extract_data(news_info_tools, url="https://finance.yahoo.com/news/perplexity-ai-challenge-google-hinges-124622631.html")['html_visualization']))