The main idea is to use scrapers to extract condensed content from web pages first, and only then build a prompt for the language model. We then compare how, and in which direction, the generated output differs between scrapers.

As an example, we scrape the pricing pages of several similar sites.

The results are still mixed: the Firecrawl scraper ran out of free tokens, and Scrapegraph-ai (open source, with graph visualization) has not been added yet. Also, the GPT-4o API now requires repeated phone verification, which is difficult to do from Russia, so it was replaced with GPT-2-style models run locally (`distilgpt2` and `EleutherAI/gpt-neo-2.7B`).

## Setup test: Get competitors' pricing

In [1]:
competitor_sites = [
    {
        "name": "Articulate 360 by Adobe",
        "url": "https://www.articulate.com/360/pricing/freelancers"
    },
    {
        "name": "7taps",
        "url": "https://www.7taps.com/pricing"
    },
    {
        "name": "Mindsmith AI",
        "url": "https://www.mindsmith.ai/pricing"
    },
    {
        "name": "Cards-microlearning",
        "url": "https://www.cards-microlearning.com/en/tarifs"
    },
]

### Setup cost calculations

We can calculate how much it'll cost by using OpenAI's `tiktoken` library.

P.S. As of today, OpenAI hasn't updated `tiktoken` with the actual encoding used in `gpt-4o`, so we'll estimate using the `gpt-4` tokenization encoding (`cl100k_base`).

In [2]:
pip install tiktoken --quiet


In [3]:
import tiktoken

def count_tokens(input_string: str) -> int:
    tokenizer = tiktoken.get_encoding("cl100k_base")

    tokens = tokenizer.encode(input_string)

    return len(tokens)

def calculate_cost(input_string: str, cost_per_million_tokens: float = 5) -> float:
    num_tokens = count_tokens(input_string)

    total_cost = (num_tokens / 1_000_000) * cost_per_million_tokens

    return total_cost

# Example usage:
input_string = "What's the difference between beer nuts and deer nuts? Beer nuts are about 5 dollars. Deer nuts are just under a buck."
cost = calculate_cost(input_string)
print(f"The total cost for using gpt-4o is: $US {cost:.6f}")

The total cost for using gpt-4o is: $US 0.000135
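The printed figure can be checked by hand: at the default rate of $5 per million tokens used above, $0.000135 corresponds to 27 tokens. The sketch below reproduces that arithmetic without `tiktoken`. The helper names `cost_from_token_count` and `tokens_from_cost` are hypothetical and exist only for this illustration. Note that this estimate covers prompt (input) tokens only; tokens generated by the model are not included.

```python
def cost_from_token_count(num_tokens: int, cost_per_million_tokens: float = 5.0) -> float:
    # Same formula as calculate_cost, but starting from a known token count
    return (num_tokens / 1_000_000) * cost_per_million_tokens


def tokens_from_cost(cost: float, cost_per_million_tokens: float = 5.0) -> int:
    # Invert the formula to recover the token count implied by a printed cost
    return round(cost / cost_per_million_tokens * 1_000_000)


implied_tokens = tokens_from_cost(0.000135)
print(implied_tokens)                                   # 27
print(f"{cost_from_token_count(implied_tokens):.6f}")   # 0.000135
```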


In [4]:
pip install prettytable tqdm --quiet

In [5]:
from typing import List, Callable, Dict
from prettytable import PrettyTable, ALL
from tqdm import tqdm

def view_scraped_content(scrape_url_functions: List[Dict[str, Callable[[str], str]]], sites_list: List[Dict[str, str]], characters_to_display: int = 500, table_max_width: int = 50) -> List[Dict[str, str]]:
    content_table_headers = ["Site Name"] + [f"{func['name']} content" for func in scrape_url_functions]
    cost_table_headers = ["Site Name"] + [f"{func['name']} cost" for func in scrape_url_functions]

    content_table = PrettyTable()
    content_table.field_names = content_table_headers

    cost_table = PrettyTable()
    cost_table.field_names = cost_table_headers

    scraped_data = []

    for site in sites_list:
        content_row = [site['name']]
        cost_row = [site['name']]
        site_data = {"provider": site['name'], "sites": []}

        for scrape_function in scrape_url_functions:
            function_name = scrape_function['name']
            for _ in tqdm([site], desc=f"Processing site {site['name']} using {function_name}"):
                try:
                    content = scrape_function['function'](site['url'])
                    content_snippet = content[:characters_to_display]
                    content_row.append(content_snippet)

                    cost = calculate_cost(content)
                    cost_row.append(f"${cost:.6f}")

                    site_data["sites"].append({"name": function_name, "content": content})
                except Exception as e:
                    error_message = f"Error: {str(e)}"
                    content_row.append(error_message)
                    cost_row.append("Error")

                    site_data["sites"].append({"name": function_name, "content": error_message})
                    continue

        content_table.add_row(content_row)
        cost_table.add_row(cost_row)
        scraped_data.append(site_data)

    content_table.max_width = table_max_width
    content_table.hrules = ALL

    cost_table.max_width = table_max_width
    cost_table.hrules = ALL

    print("Content Table:")
    print(content_table)

    print("\nCost Table:\nThis is how much it would cost to use gpt-4o to parse this content for extraction.")
    print(cost_table)

    return scraped_data

## Setup all the scrapers

### Beautiful Soup

In [6]:
pip install requests beautifulsoup4 --quiet

In [7]:
import requests
from bs4 import BeautifulSoup

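def beautiful_soup_scrape_url(url: str) -> str:
    # Note: requests.get does not raise on HTTP errors, so an error page
    # (for example a 403 response) is returned and costed like normal content.
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # str(soup) is the full parsed markup, tags included
    return str(soup)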
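Because `beautiful_soup_scrape_url` returns the whole markup, its token count includes tags, scripts, and styles. A minimal sketch of reducing a page to its visible text is shown below. It uses the standard-library `html.parser` as a stand-in so that it runs without extra dependencies; with Beautiful Soup itself, `soup.get_text()` serves a similar purpose. The class name, helper name, and sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    # Content inside these tags is never shown to the reader
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside skipped tags
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def html_to_visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample page, not taken from any of the scraped sites
sample_html = """<html><head><title>Pricing</title><style>p {color: red}</style></head>
<body><h1>Plans</h1><script>var x = 1;</script><p>Starter: $10 per month</p></body></html>"""
visible_text = html_to_visible_text(sample_html)
print(visible_text)
```

The reduced text is much shorter than the markup, which is the effect the dedicated scrapers aim for before the content reaches the model.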


### Reader API by Jina AI

In [8]:
import requests

def scrape_jina_ai(url: str) -> str:
    # The Reader API is called by prefixing the target URL with https://r.jina.ai/
    response = requests.get("https://r.jina.ai/" + url)
    return response.text

### Firecrawl from Mendable

In [9]:
pip install firecrawl-py --quiet

In [10]:
import firecrawl
import getpass

FIRECRAWL_API_KEY = getpass.getpass("Mendable API Key: ")

def scrape_firecrawl(url: str):
    app = firecrawl.FirecrawlApp(api_key=FIRECRAWL_API_KEY)
    scraped_data = app.scrape_url(url)["markdown"]
    return scraped_data

Mendable API Key: ··········


## Scrapers comparison table

In [11]:
list_of_scraper_functions = [
      {"name": "Beautiful Soup", "function": beautiful_soup_scrape_url},
      {"name": "Firecrawl", "function": scrape_firecrawl},
      {"name": "Jina AI", "function": scrape_jina_ai}
      ]

all_content = view_scraped_content(list_of_scraper_functions, competitor_sites, 200, 20)

Processing site Articulate 360 by Adobe using Beautiful Soup: 100%|██████████| 1/1 [00:00<00:00,  3.18it/s]
Processing site Articulate 360 by Adobe using Firecrawl: 100%|██████████| 1/1 [00:11<00:00, 11.54s/it]
Processing site Articulate 360 by Adobe using Jina AI: 100%|██████████| 1/1 [00:04<00:00,  4.06s/it]
Processing site 7taps using Beautiful Soup: 100%|██████████| 1/1 [00:00<00:00,  4.12it/s]
Processing site 7taps using Firecrawl: 100%|██████████| 1/1 [00:04<00:00,  4.92s/it]
Processing site 7taps using Jina AI: 100%|██████████| 1/1 [00:01<00:00,  1.10s/it]
Processing site Mindsmith AI using Beautiful Soup: 100%|██████████| 1/1 [00:00<00:00,  2.50it/s]
Processing site Mindsmith AI using Firecrawl: 100%|██████████| 1/1 [00:03<00:00,  3.81s/it]
Processing site Mindsmith AI using Jina AI: 100%|██████████| 1/1 [00:01<00:00,  1.12s/it]
Processing site Cards-microlearning using Beautiful Soup: 100%|██████████| 1/1 [00:00<00:00,  3.58it/s]
Processing site Cards-microlearning using Firec

Content Table:
+----------------------+------------------------+----------------------+----------------------+
|      Site Name       | Beautiful Soup content |  Firecrawl content   |   Jina AI content    |
+----------------------+------------------------+----------------------+----------------------+
|  Articulate 360 by   |         <html>         |    [Skip to main     |  Title: Freelancer   |
|        Adobe         | <head><title>403 Forbi |  content](#content)  |     Pricing for      |
|                      |  dden</title></head>   |                      |   Articulate 360 -   |
|                      |         <body>         | [![Articulate](https | Everything You Need  |
|                      | <center><h1>403 Forbid | ://www.articulate.co | to Create E‑Learning |
|                      |   den</h1></center>    | m/wp-content/uploads |                      |
|                      | <hr/><center>nginx</ce | /2023/06/articulate- | URL Source: https:// |
|                      | 




## OpenAI -> GPT-2



In [12]:
pip install openai --quiet


In [13]:
import getpass
from openai import OpenAI

# NEED OPENAI VERIFICATION
#OPENAI_API_KEY = getpass.getpass('Enter your OpenAI API key: ')
#
#client = OpenAI(api_key=OPENAI_API_KEY)
#
#def extract(user_input: str):
#  entity_extraction_system_message = {"role": "system", "content": "Get me the three pricing tiers from this website's content, and return as a JSON with three keys: {cheapest: {name: str, price: float}, middle: {name: str, price: float}, most_expensive: {name: str, price: float}}"}
#
#  messages = [entity_extraction_system_message]
#  messages.append({"role": "user", "content": user_input})
#
#  response = client.chat.completions.create(
#        model="gpt-4o",
#        messages=messages,
#        stream=False,
#        response_format={"type": "json_object"}
#    )
#
#  return response.choices[0].message.content

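Heavy variant: GPT-Neo 2.7B

In [17]:
from functools import lru_cache
from transformers import pipeline, GPT2Tokenizer
import json

MAX_NEW_TOKENS = 150
# Prompt budget: the 1024-token cap used in this notebook, minus room for the generated tokens
MAX_PROMPT_TOKENS = 1024 - MAX_NEW_TOKENS

@lru_cache(maxsize=1)
def load_generator():
    # Load the model and tokenizer once and reuse them across calls
    # (the weights are a download of roughly 10 GB)
    model_name = 'EleutherAI/gpt-neo-2.7B'
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    generator = pipeline('text-generation', model=model_name, tokenizer=tokenizer)
    return generator, tokenizer

def extract(user_input: str) -> dict:
    entity_extraction_system_message = (
        "Extract the three pricing tiers from the given content and return as JSON with keys: "
        "{cheapest: {name: str, price: float}, middle: {name: str, price: float}, most_expensive: {name: str, price: float}}. "
        "Content: "
    )

    generator, tokenizer = load_generator()

    # Concatenate the system message and user input
    prompt = entity_extraction_system_message + user_input

    # Truncate the prompt and decode it back to text, so the truncated version is what gets generated from
    input_ids = tokenizer.encode(prompt, truncation=True, max_length=MAX_PROMPT_TOKENS)
    truncated_prompt = tokenizer.decode(input_ids)

    # Generate the response; return_full_text=False drops the echoed prompt,
    # whose schema braces would otherwise break the JSON slicing below
    response = generator(truncated_prompt, max_new_tokens=MAX_NEW_TOKENS, return_full_text=False, pad_token_id=tokenizer.eos_token_id)

    # Parse the response
    content = response[0]['generated_text']

    # Extract the JSON part of the generated text; fall back to an empty result
    try:
        pricing_data = json.loads(content[content.index('{'):content.rindex('}')+1])
    except (ValueError, json.JSONDecodeError):
        pricing_data = {"cheapest": {"name": "", "price": 0.0}, "middle": {"name": "", "price": 0.0}, "most_expensive": {"name": "", "price": 0.0}}

    return pricing_data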

Light variant: DistilGPT-2

In [14]:
from functools import lru_cache
from transformers import pipeline, GPT2Tokenizer, GPT2LMHeadModel
import json

MAX_NEW_TOKENS = 150
# distilgpt2 has a 1024-token context window; reserve room for the generated tokens
MAX_PROMPT_TOKENS = 1024 - MAX_NEW_TOKENS

@lru_cache(maxsize=1)
def load_generator():
    # Load the distilgpt2 model and tokenizer once and reuse them across calls
    model_name = 'distilgpt2'
    model = GPT2LMHeadModel.from_pretrained(model_name)
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
    return generator, tokenizer

def extract(user_input: str) -> dict:
    entity_extraction_system_message = (
        "Extract the three pricing tiers from the given content and return as JSON with keys: "
        "{cheapest: {name: str, price: float}, middle: {name: str, price: float}, most_expensive: {name: str, price: float}}. "
        "Content: "
    )

    generator, tokenizer = load_generator()

    # Concatenate the system message and user input
    prompt = entity_extraction_system_message + user_input

    # Tokenize the prompt, truncating it to the prompt budget
    input_ids = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=MAX_PROMPT_TOKENS)

    # Decode back to text to ensure length limit is respected
    truncated_prompt = tokenizer.decode(input_ids['input_ids'][0])

    # Generate the response; return_full_text=False drops the echoed prompt,
    # whose schema braces would otherwise break the JSON slicing below
    response = generator(truncated_prompt, max_new_tokens=MAX_NEW_TOKENS, return_full_text=False, pad_token_id=tokenizer.eos_token_id)

    # Parse the response
    content = response[0]['generated_text']

    # Extract the JSON part of the generated text; fall back to an empty result
    try:
        pricing_data = json.loads(content[content.index('{'):content.rindex('}')+1])
    except (ValueError, json.JSONDecodeError):
        pricing_data = {"cheapest": {"name": "", "price": 0.0}, "middle": {"name": "", "price": 0.0}, "most_expensive": {"name": "", "price": 0.0}}

    return pricing_data
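Slicing from the first `{` to the last `}` only works when the generated text contains exactly one brace group. If the text also contains the schema from the prompt, or stray braces, the slice is not valid JSON and the fallback is returned. One possible alternative, sketched here with the standard library only, is to try `json.JSONDecoder.raw_decode` at each `{` and keep the first position that parses into a dictionary. The helper name `first_json_object` and the sample string are hypothetical.

```python
import json
from typing import Optional


def first_json_object(text: str) -> Optional[dict]:
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows
            candidate, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            candidate = None
        if isinstance(candidate, dict):
            return candidate
        # This brace did not start a valid object; try the next one
        start = text.find("{", start + 1)
    return None


# Hypothetical model output: an echoed schema, a valid answer, then a stray brace
generated = (
    'Return JSON with keys: {cheapest: {name: str, price: float}}. '
    'Answer: {"cheapest": {"name": "Starter", "price": 10.0}} and some trailing {text'
)
print(first_json_object(generated))
```

A caller can substitute the empty pricing structure when the helper returns `None`, which keeps the behaviour of the `except` branch above.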


### Display content

In [18]:
from typing import Any, Dict, List

def display_extracted_content(results: List[Dict[str, Any]], num_objects: int):
    table = PrettyTable()
    table.field_names = ["Site", "Provider Name", "Extracted Content"]

    # Ensure num_objects does not exceed the length of the results list
    num_objects = min(num_objects, len(results))

    # Process the specified number of items from the results list with a progress bar
    for result in tqdm(results[:num_objects], desc="Processing results"):
        provider_name = result["provider"]

        for site in result["sites"]:
            function_name = site["name"]
            content = site["content"]

            # Progress bar for each function
            for _ in tqdm(range(1), desc=f"Extracting content with {provider_name} for {function_name}"):
                extracted_content = extract(content)
                table.add_row([provider_name, function_name, extracted_content])

    table.max_width = 50  # Set the maximum width for better display
    table.hrules = ALL

    print("Extracted Content Table:")
    print(table)

In [None]:
display_extracted_content(all_content, num_objects=9)

Processing results:   0%|          | 0/4 [00:00<?, ?it/s]
Extracting content with Articulate 360 by Adobe for Beautiful Soup:   0%|          | 0/1 [00:00<?, ?it/s]

config.json:   0%|          | 0.00/1.46k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/10.7G [00:00<?, ?B/s]

In [None]:
display_extracted_content(all_content, num_objects=9)

Processing results:   0%|          | 0/4 [00:00<?, ?it/s]
Extracting content with Articulate 360 by Adobe for Beautiful Soup:   0%|          | 0/1 [00:00<?, ?it/s]
Extracting content with Articulate 360 by Adobe for Beautiful Soup: 100%|██████████| 1/1 [00:01<00:00,  1.42s/it]

Extracting content with Articulate 360 by Adobe for Firecrawl:   0%|          | 0/1 [00:00<?, ?it/s]
Extracting content with Articulate 360 by Adobe for Firecrawl: 100%|██████████| 1/1 [00:01<00:00,  1.72s/it]

Extracting content with Articulate 360 by Adobe for Jina AI:   0%|          | 0/1 [00:00<?, ?it/s]
Extracting content with Articulate 360 by Adobe for Jina AI: 100%|██████████| 1/1 [00:01<00:00,  1.55s/it]
Processing results:  25%|██▌       | 1/4 [00:04<00:14,  4.72s/it]
Extracting content with 7taps for Beautiful Soup:   0%|          | 0/1 [00:00<?, ?it/s]
Extracting content with 7taps for Beautiful Soup: 100%|██████████| 1/1 [00:02<00:00,  2.44s/it]

Extracting content with 7taps for Firecra

Extracted Content Table:
+-------------------------+----------------+----------------------------------------------------+
|           Site          | Provider Name  |                 Extracted Content                  |
+-------------------------+----------------+----------------------------------------------------+
| Articulate 360 by Adobe | Beautiful Soup |                         {                          |
|                         |                |                    "cheapest": {                   |
|                         |                |                      "name": "",                   |
|                         |                |                      "price": 0.0                  |
|                         |                |                          },                        |
|                         |                |                     "middle": {                    |
|                         |                |                      "name": "",                


