<a href="https://colab.research.google.com/github/automationcreators/flowchartmaker/blob/main/Cag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import os

os.environ['GOOGLE_API_KEY'] = 'xxx'
os.environ['FIRECRAWL_API_KEY'] = 'xxx'
os.environ['HELICONE_API_KEY'] = 'xxx'
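
The keys above are placeholders. As a small guard before creating any client, the sketch below reports which keys are still unset or still equal to the `'xxx'` placeholder. `missing_keys` is a hypothetical helper, not part of the original notebook, and it assumes the placeholder value stays `'xxx'`.

```python
import os

PLACEHOLDER = "xxx"  # assumption: the dummy value used in the cell above


def missing_keys(names, env=None):
    """Return the names that are unset, empty, or still set to the placeholder."""
    env = os.environ if env is None else env
    return [name for name in names if env.get(name, "") in ("", PLACEHOLDER)]


required = ["GOOGLE_API_KEY", "FIRECRAWL_API_KEY", "HELICONE_API_KEY"]
todo = missing_keys(required)
if todo:
    print("Set real values for:", ", ".join(todo))
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.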

In [None]:
from google import genai
from google.genai import types
import pathlib
import httpx
import json

client = genai.Client(
    api_key=os.environ['GOOGLE_API_KEY'],
    http_options={
        "base_url": 'https://gateway.helicone.ai',
        "headers": {
            "helicone-auth": f'Bearer {os.environ.get("HELICONE_API_KEY")}',
            "helicone-target-url": 'https://generativelanguage.googleapis.com'
        }
    })

doc_url = "https://arxiv.org/pdf/2412.15605v1"  # Replace with the actual URL of your PDF

# Download the PDF and save it to disk (write_bytes returns the number of bytes written)
filepath = pathlib.Path('file.pdf')
filepath.write_bytes(httpx.get(doc_url).content)

121164

In [None]:
prompt = "Who are the authors?"
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        # Send the PDF bytes inline, followed by the question
        types.Part.from_bytes(
            data=filepath.read_bytes(),
            mime_type='application/pdf',
        ),
        prompt,
    ],
)
print(response.text)

## Example API assistant

### Raw response

In [None]:
def direct_llm_call(prompt):
    """Ask the model directly, with no documentation in the context."""
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[prompt])
    return response.text

raw_response = direct_llm_call("Help me generate api request in curl for scrape'ai-jason.com' using firecrawl rest api, return only the curl command")
print(raw_response)

```bash
curl -X POST \
  'https://api.firecrawl.dev/crawler/crawl' \
  -H 'Content-Type: application/json' \
  -H 'X-Firecrawl-Api-Key: YOUR_FIRECRAWL_API_KEY' \
  -d '{
  "url": "https://ai-jason.com",
  "options": {
    "extract_rules": {
      "page": {
        "type": "json"
      }
    }
  }
}'
```

**Important:**

*   Replace `YOUR_FIRECRAWL_API_KEY` with your actual Firecrawl API key.
*   This command uses the default `json` extractor on the whole page.  You'll likely want to define more specific extraction rules within the `extract_rules` section depending on what data you're trying to scrape from the site.  See Firecrawl documentation for details on defining extract rules.



### Actual implementation

In [None]:
from firecrawl import FirecrawlApp

fire = FirecrawlApp(api_key=os.getenv("FIRECRAWL_API_KEY"))

url = "https://docs.firecrawl.dev/"

# Get all links from the website
all_links = fire.map_url(url).get('links')[1:]

len(all_links)

In [None]:
def filter_api_doc_urls(doc_links):
    """Ask the model to keep only the URLs that point to REST API reference pages."""
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=f"{doc_links}\n\nAbove is a list of URLs. Extract every URL that points to a REST API reference page and return only those URLs.",
        config={
            # Constrain the reply to a JSON array of strings
            "response_mime_type": "application/json",
            "response_schema": list[str]
        }
    )

    return json.loads(response.candidates[0].content.parts[0].text)

api_doc_urls = filter_api_doc_urls(all_links)
len(api_doc_urls)

31
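
The filtered list comes back from the model as generated JSON, so it could in principle contain a URL that was never in the input. A minimal sketch of a guard, assuming exact string matching against the mapped links is good enough; `keep_known_urls` is a hypothetical helper and not part of the original notebook.

```python
def keep_known_urls(candidates, known):
    """Keep candidates that appear in `known`, preserving order and dropping duplicates."""
    known_set = set(known)
    seen = set()
    kept = []
    for url in candidates:
        if url in known_set and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


# Illustrative values only; in the notebook the inputs would be
# api_doc_urls (model output) and all_links (Firecrawl map result).
mapped = ["https://example.com/a", "https://example.com/b"]
returned = ["https://example.com/b", "https://example.com/zzz", "https://example.com/b"]
print(keep_known_urls(returned, mapped))
```

In the notebook this would be called as `keep_known_urls(api_doc_urls, all_links)` before scraping.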

In [None]:
def get_markdown_from_urls(urls: list[str]) -> list[dict]:
    """Scrape each URL with Firecrawl and return its URL and markdown."""
    batch_scrape_result = fire.batch_scrape_urls(urls, {'formats': ['markdown']})

    # Keep only the fields needed as model context
    return [
        {"url": page['metadata']['url'], "markdown": page['markdown']}
        for page in batch_scrape_result['data']
    ]

all_markdowns = get_markdown_from_urls(api_doc_urls)
all_markdowns


[{'url': 'https://docs.firecrawl.dev/api-reference/endpoint/search', ...},
 {'url': 'https://docs.firecrawl.dev/api-reference/introduction',
  'markdown': '[Firecrawl Docs home page![light logo](https://mintlify.s3.us-west-1.amazonaws.com/firecrawl/logo/light.svg)![dark logo](https://mintlify.s3.us-west-1.amazonaws.com/firecrawl/logo/dark.svg)](https://firecrawl.dev/)\n\nv1\n\nSearch or ask...\n\nCtrl K\n\nSearch...\n\nNavigation\n\nUsing the API\n\nIntroduction\n\n[Documentation](https://docs.firecrawl.dev/introduction) [SDKs](https://docs.firecrawl.dev/sdks/overview) [Learn](https://www.firecrawl.dev/blog/category/tutorials) [API Reference](https://docs.firecrawl.dev/api-reference/introduction)\n\n## [\u200b](https://docs.firecrawl.dev/api-reference/introduction\\#features)  Features\n\n[**Scrape** \\\\\n\\\\\nExtract content from any webpage in markdown or json format.](https://docs.firecrawl.dev/api-reference/endpoint/scrape) [**Crawl** \\\\\n\\\\\nCrawl entire websites, extract their co
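
All scraped pages are later sent to the model in a single prompt, so their combined size is worth checking first. The sketch below gives a rough estimate, assuming about four characters per token. That ratio is a common rule of thumb, not a figure taken from the Gemini documentation, and `estimate_tokens` is a hypothetical helper.

```python
import json


def estimate_tokens(pages, chars_per_token=4):
    """Roughly estimate the token count of the JSON-serialised pages."""
    payload = json.dumps(pages)
    return len(payload) // chars_per_token


# Illustrative data in the same shape as all_markdowns
sample_pages = [{"url": "https://example.com/doc", "markdown": "# Title\n" + "word " * 200}]
print("approx. tokens:", estimate_tokens(sample_pages))
```

In the notebook, `estimate_tokens(all_markdowns)` would give a first indication of whether the documentation fits comfortably in the model's context window.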

In [None]:
def cag_response_call(prompt):
    """Ask the model with the scraped API docs placed in the context ahead of the prompt."""
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[json.dumps(all_markdowns),
                  prompt])
    return response.text

cag_response = cag_response_call("Help me generate api request in curl for scrape'ai-jason.com' using firecrawl rest api, return only the curl command")
print(cag_response)

```curl
curl --request POST \
  --url https://api.firecrawl.dev/v1/scrape \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "url": "ai-jason.com",
  "formats": [
    "markdown"
  ],
  "onlyMainContent": true,
  "includeTags": [],
  "excludeTags": [],
  "headers": {},
  "waitFor": 0,
  "mobile": false,
  "skipTlsVerification": false,
  "timeout": 30000,
  "jsonOptions": {
    "schema": {},
    "systemPrompt": "<string>",
    "prompt": "<string>"
  },
  "actions": [],
  "location": {
    "country": "US",
    "languages": [
      "en-US"
    ]
  },
  "removeBase64Images": true,
  "blockAds": true,
  "proxy": "basic"
}'
```
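
The two responses differ most visibly in the endpoint they call. A small sketch for pulling that endpoint out of each response so they can be compared side by side; `extract_endpoint` is a hypothetical helper and assumes the endpoint is the first `http(s)` URL in the command.

```python
import re


def extract_endpoint(curl_text):
    """Return the first http(s) URL found in a curl command, or None."""
    match = re.search(r"https?://[^\s'\"\\]+", curl_text)
    return match.group(0) if match else None


# Shortened copies of the two outputs shown in this notebook
raw_sample = "curl -X POST \\\n  'https://api.firecrawl.dev/crawler/crawl' \\\n  -H 'Content-Type: application/json'"
cag_sample = "curl --request POST \\\n  --url https://api.firecrawl.dev/v1/scrape \\\n  --header 'Content-Type: application/json'"

print("raw:", extract_endpoint(raw_sample))
print("cag:", extract_endpoint(cag_sample))
```

In the notebook the same function can be applied to `raw_response` and `cag_response` directly.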
