<h1><b>Cloudflare Workers AI</b></h1>

Run machine learning models, powered by serverless GPUs, on Cloudflare's global network.

Available on Free and Paid plans
Workers AI allows you to run AI models in a serverless way, without having to worry about scaling, maintaining, or paying for unused infrastructure. You can invoke models running on GPUs on Cloudflare's network from your own code — from Workers, Pages, or anywhere via the Cloudflare API.
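
As a rough illustration of calling a model "from anywhere via the Cloudflare API", the sketch below builds (but does not send) an HTTP request to the Workers AI REST endpoint using only the standard library. The helper name `build_run_request` and the placeholder credentials are hypothetical; the payload shape depends on the model being called.

```python
import json
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"


def build_run_request(account_id: str, api_token: str, model: str, payload: dict) -> urllib.request.Request:
    """Build a POST request for the Workers AI run endpoint without sending it."""
    url = f"{API_BASE}/accounts/{account_id}/ai/run/{model}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values for illustration only; substitute real credentials.
request = build_run_request(
    "ACCOUNT_ID",
    "API_TOKEN",
    "@cf/openai/whisper-large-v3-turbo",
    {"audio": "<base64-encoded audio>"},
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it; the notebook cells further down use the official `cloudflare` Python client instead.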

Workers AI gives you access to:

- 50+ open-source models, available as part of the model catalog
- Serverless, pay-for-what-you-use pricing
- A fully featured developer platform, including AI Gateway, Vectorize, Workers, and more

Workers AI is included in both the Free and Paid Workers plans and is priced at $0.011 per 1,000 <b>Neurons</b>.

The free allocation lets anyone use up to 10,000 Neurons per day at no charge. To use more than that, you need to sign up for the Workers Paid plan, which charges $0.011 per 1,000 Neurons for any usage above the free allocation of 10,000 Neurons per day.

You can monitor your Neuron usage in the Cloudflare Workers AI dashboard.

All limits reset daily at 00:00 UTC. If you exceed any of these limits, further operations will fail with an error.

- Workers Free and Workers Paid plans: 10,000 Neurons per day free allocation
- Workers Paid plan: $0.011 per 1,000 Neurons
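
The pricing rule above reduces to simple arithmetic. The following is a minimal sketch, assuming billing is computed per day on usage above the free allocation; the function name `daily_cost_usd` is illustrative and not part of any Cloudflare API.

```python
FREE_NEURONS_PER_DAY = 10_000   # free daily allocation stated above
USD_PER_1000_NEURONS = 0.011    # Workers Paid rate stated above


def daily_cost_usd(neurons_used: float) -> float:
    """Estimate one day's charge: only Neurons above the free allocation are billed."""
    billable = max(0.0, neurons_used - FREE_NEURONS_PER_DAY)
    return billable / 1000 * USD_PER_1000_NEURONS


for used in (5_000, 10_000, 25_000):
    print(f"{used:>6} Neurons -> ${daily_cost_usd(used):.3f}")
```

For example, 25,000 Neurons in a day leaves 15,000 billable Neurons, or about $0.165.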

<h1><b>What are Neurons?</b></h1>
    
<h4>Neurons are our way of measuring AI outputs across different models, representing the GPU compute needed to perform your request. The serverless model allows you to pay only for what you use without having to worry about renting, managing, or scaling GPUs.</h4>

<h1><b>Audio Model Pricing</b></h1>

<table>
<tr><th>Model</th><th>Price</th><th>Price in Neurons</th></tr>
<tr><td>@cf/openai/whisper-large-v3-turbo</td><td>$0.0005 per audio minute</td><td>46.63 Neurons per audio minute</td></tr>
</table>
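
To relate the per-minute Neuron figure to the dollar price and the free allocation, here is a small sketch of the arithmetic. It assumes Neuron usage scales linearly with audio length, which is how the per-minute rate is quoted; the function names are illustrative.

```python
NEURONS_PER_AUDIO_MINUTE = 46.63  # whisper-large-v3-turbo rate quoted above
FREE_NEURONS_PER_DAY = 10_000
USD_PER_1000_NEURONS = 0.011


def transcription_neurons(audio_minutes: float) -> float:
    """Neurons consumed by transcribing the given number of audio minutes."""
    return audio_minutes * NEURONS_PER_AUDIO_MINUTE


def free_minutes_per_day() -> float:
    """Audio minutes covered by the free daily allocation."""
    return FREE_NEURONS_PER_DAY / NEURONS_PER_AUDIO_MINUTE


usd_per_minute = NEURONS_PER_AUDIO_MINUTE / 1000 * USD_PER_1000_NEURONS
print(f"Price per audio minute: ${usd_per_minute:.5f}")
print(f"Free audio minutes per day: {free_minutes_per_day():.1f}")
print(f"Neurons for a 30-minute recording: {transcription_neurons(30):.1f}")
```

The computed per-minute price (about $0.00051) agrees with the rounded $0.0005 listed above, and the free allocation covers roughly 214 minutes of audio per day.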

                                                  

In [None]:
!pip install cloudflare

In [None]:
# base64 ships with the Python standard library, so no installation is needed.

In [None]:
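from huggingface_hub import login
from google.colab import userdata as ud

# Read credentials from Colab's secret storage.
at = ud.get('api_token')       # Cloudflare API token
acc = ud.get('account_id')     # Cloudflare account ID
hf_token = ud.get('HF_TOKEN')  # Hugging Face access token

# Log in to Hugging Face so the gated Llama weights can be downloaded later.
login(hf_token, add_to_git_credential = True)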

In [None]:
from google.colab import drive
drive.mount('/content/drive')
path = '/content/drive/MyDrive/denver_extract.mp3'

In [None]:
# Read the whole audio file into memory as raw bytes.
with open(path, "rb") as af:
  ab = af.read()

In [None]:
import base64
from IPython.display import display, Markdown
from cloudflare import Cloudflare

client = Cloudflare(api_token = at)

# Base64 encode the audio data as a string before sending
audio_b64_encoded = base64.b64encode(ab).decode('utf-8')

result = client.ai.with_raw_response.run(
    account_id = acc,
    model_name = "@cf/openai/whisper-large-v3-turbo",
    audio = audio_b64_encoded
    )
text = result.json()['result']['text'] # Directly assign the result to 'text'

display(Markdown(text))
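
The cell above indexes straight into `['result']['text']`, which raises an unhelpful `KeyError` or `TypeError` if the request failed. The sketch below shows one way to guard that lookup. It assumes the response follows the usual Cloudflare API envelope with `success` and `errors` fields alongside `result`; the helper name `extract_transcript` and the sample payloads are made up for illustration.

```python
def extract_transcript(payload: dict) -> str:
    """Return the transcript text from a Workers AI response body, or raise a clear error."""
    # Assumption: the envelope carries a boolean `success` and a list of `errors`.
    if not payload.get("success", False):
        raise RuntimeError(f"Workers AI request failed: {payload.get('errors')}")
    text = (payload.get("result") or {}).get("text")
    if text is None:
        raise KeyError("Response contains no result.text field")
    return text


# Hand-written sample payloads, not real API output.
ok_payload = {"success": True, "errors": [], "result": {"text": "Good evening, everyone."}}
bad_payload = {"success": False, "errors": [{"message": "example failure"}], "result": None}

print(extract_transcript(ok_payload))
try:
    extract_transcript(bad_payload)
except RuntimeError as exc:
    print(f"Handled: {exc}")
```

In the notebook this would be used as `text = extract_transcript(result.json())`.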

In [None]:
print(f"Type of ab: {type(ab)}")
print(f"Size of ab: {len(ab)} bytes")
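
Because the audio is sent as a Base64 string, the request body is larger than the raw file. Base64 encodes every 3 input bytes as 4 output characters (with padding), so the overhead is about one third. A short sketch with stand-in bytes rather than the real MP3:

```python
import base64
import math


def base64_size(n_bytes: int) -> int:
    """Length of the padded Base64 encoding of n_bytes of input."""
    return 4 * math.ceil(n_bytes / 3)


# Stand-in data; in the notebook this would be the `ab` bytes read from the MP3.
sample = bytes(range(256)) * 4
encoded = base64.b64encode(sample)

print(f"Raw size: {len(sample)} bytes")
print(f"Encoded size: {len(encoded)} characters")
print(f"Predicted size: {base64_size(len(sample))} characters")
```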

In [None]:
system_message = """
You produce minutes of meetings from transcripts, with summary, key discussion points,
takeaways and action items with owners, in markdown format without code blocks.
"""

user_prompt = f"""
Below is an extract from the transcript of a Denver council meeting.
Please write minutes in markdown without code blocks, including:
- a summary with attendees, location and date
- discussion points
- takeaways
- action items with owners

Transcription:
{text}
"""

messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_prompt}
  ]

In [None]:
LLAMA = "meta-llama/Llama-3.1-8B-Instruct"

In [None]:
# Uninstall torch to ensure a clean installation
!pip uninstall -y torch

# Install torch with CUDA support (common in Colab) and other dependencies
# If you encounter issues, you might need to specify the CUDA version (e.g., cu118 or cu121) explicitly
# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install torch
!pip install transformers accelerate bitsandbytes sentencepiece requests


In [None]:
import torch
from transformers import AutoTokenizer as AT, AutoModelForCausalLM as AMFCLM, TextStreamer as TS, BitsAndBytesConfig as BABC

In [None]:
# 4-bit NF4 quantization with double quantization and bfloat16 compute
qc = BABC(
    load_in_4bit = True,
    bnb_4bit_use_double_quant = True,
    bnb_4bit_compute_dtype = torch.bfloat16,
    bnb_4bit_quant_type = "nf4"
)
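
To see why 4-bit loading matters on a Colab GPU, the following back-of-the-envelope sketch compares weight memory at different precisions. It assumes roughly 8 billion parameters for the 8B model and ignores quantization metadata, activations, and the KV cache, so real usage will be somewhat higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8e9  # assumption: about 8 billion parameters

for label, bits in (("bfloat16", 16), ("8-bit", 8), ("4-bit (nf4)", 4)):
    print(f"{label:>12}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights shrink from about 16 GB in bfloat16 to about 4 GB in 4-bit form.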

In [None]:
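tz = AT.from_pretrained(LLAMA)
tz.pad_token = tz.eos_token  # reuse EOS as the padding token
# add_generation_prompt appends the assistant header so the model starts its reply
inputs = tz.apply_chat_template(messages, return_tensors = "pt", add_generation_prompt = True).to("cuda")
streamer = TS(tz)  # prints tokens to the cell output as they are generated
model = AMFCLM.from_pretrained(LLAMA, device_map = "auto", quantization_config = qc)
outputs = model.generate(inputs, max_new_tokens = 10000, streamer = streamer)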

In [None]:
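# Decode only the newly generated tokens, skipping the prompt and special tokens
response = tz.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens = True)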

In [None]:
display(Markdown(response))