## eventsFeed - measuring average event size

The [eventsFeed](https://api.catonetworks.com/documentation/#query-eventsFeed) API query is the most heavily used Cato Networks API call by every measure: at the time of writing (February 2025) it ranks first by number of customers who call it, number of calls each week and volume of data transferred. Customers use it to transfer their security, connectivity and system events in JSON format from the Cato data lake to their external SIEM vendor, either using an integration [created by the vendor](https://support.catonetworks.com/hc/en-us/articles/13975273800733-Third-Party-Supported-Integrations-for-Cato-Data) or by creating their own custom integration (often in conjunction with a Cato partner or other third-party expertise).

As well as providing the API as a **pull** method, Cato can **push** events to cloud storage, such as Amazon S3 and Azure Storage. In both cases, one frequently asked question is "How much data am I going to consume?". In this example, we will show how to use the **eventsFeed** query to calculate the average event size for a Cato account. This relies on the following assumptions and pre-requisites:
* Event integration has been enabled, to start feeding events into the queue (CMA / Resources / Event Integrations / Enable integration with Cato events).
* The account is fully onboarded with all features (especially security policies) enabled. Even if this is not the case it is still possible to obtain a reasonably close estimate of the eventual average.
* An API key has been created. Even if the use case is the push to cloud storage, which does not use the API, the **eventsFeed** API query can still provide a very good estimate of the average event size for pushed events.
* This example provides the average event size for the full event feed including all event types and subtypes. If the intention is to fetch only a subset of these by supplying a filter with the API query, this can significantly alter the average event size. In order to calculate the average in this case, the eventsFeed query used in this example would need to include the same filters.
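
Once an average event size is known, answering "How much data am I going to consume?" is simple arithmetic: multiply the average by the event rate. The following is a minimal sketch of that calculation. The function name `estimate_daily_bytes` and the sample figures are illustrative assumptions, not values taken from any real account.

```python
def estimate_daily_bytes(avg_event_bytes: float, events_per_second: float) -> float:
    """Estimate bytes per day from an average event size and a sustained event rate."""
    seconds_per_day = 24 * 60 * 60
    return avg_event_bytes * events_per_second * seconds_per_day


# Hypothetical figures, for illustration only
example_avg_bytes = 1500
example_rate = 100  # events per second
daily_bytes = estimate_daily_bytes(example_avg_bytes, example_rate)
print(f"Estimated volume: {daily_bytes / 1e9:.2f} GB per day")
```

Real event rates vary through the day, so an estimate like this is best treated as an order-of-magnitude guide.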

### Factors affecting the average event size

Each event is a set of fields in JSON format. When measured across all Cato customers, the average event size is normally distributed, but the distribution is wide, with considerable variability between the upper and lower extremes. So while it is possible to state an average event size for the entire customer base, many customers will experience an average significantly above or below it. This is due to:
* Differences in policy logging configuration. For most customers, ~95% of events are Internet Firewall and WAN Firewall, so a customer who chooses not to enable those features or not to log their traffic will have a significantly different feed.
* Differences in enabled features. There is a core set of fields which all events have (such as the ID and timestamp), but otherwise the number and content of fields vary widely between event subtypes, so customers who disable core features or enable additional features will have a different set of events contributing to the average.
* Differences in feature configuration. A customer who enables identity features such as SSO and Identity Agent will have richer fields including additional user identity information, which will increase the size of some of their events.
* Other customer-specific differences, such as internal domain names. A customer whose internal namespace is "example.local" will have shorter device name fields than a customer with "examplesite.examplesubdomain.country.example.local".
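
To make the last two points concrete, here is a small sketch that measures the serialised size of a few made-up events. The field names and values are hypothetical and do not reflect the actual Cato event schema; the only point is that longer names and extra identity fields translate directly into more bytes per event.

```python
import json


def event_size(fields: dict) -> int:
    """Size in bytes of an event serialised as compact UTF-8 JSON."""
    return len(json.dumps(fields, separators=(",", ":")).encode("utf-8"))


# Hypothetical events: field names and values are invented for illustration
base = {
    "event_id": "abc123",
    "time": "1740000000000",
    "device_name": "laptop01.example.local",
}
long_domain = dict(
    base,
    device_name="laptop01.examplesite.examplesubdomain.country.example.local",
)
with_identity = dict(base, user_name="Jane Doe", user_email="jane.doe@example.com")

for label, ev in [("base", base), ("long domain", long_domain), ("with identity", with_identity)]:
    print(f"{label}: {event_size(ev)} bytes")
```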

### How eventsFeed works

[eventsFeed](https://support.catonetworks.com/hc/en-us/articles/360019839477-Cato-API-EventsFeed-Large-Scale-Event-Monitoring) works from a queue, not a timeframe. Some customers find this counter-intuitive at first, but the reasons for doing it this way are clear: Cato is a globally distributed platform with customers who can generate well in excess of a thousand events per second. Gathering these events and processing them for distribution takes a non-zero amount of time, so it becomes extremely difficult to guarantee event delivery based on timeframes. Using a queue ensures that, provided customers keep fetching, they receive every event. A marker is used to maintain position in the queue. Submitting a request without a marker starts fetching from the beginning of the queue. At time of writing, events are aged out of the queue after three days, and each API request fetches a maximum of 3000 events.
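
The marker mechanism can be illustrated with a toy in-memory queue. This is only a sketch of the general idea and not Cato's implementation: the class `ToyEventQueue` is hypothetical, and it uses a plain integer offset as the marker, whereas the real marker should be treated as an opaque string.

```python
from typing import List, Tuple


class ToyEventQueue:
    """Toy in-memory stand-in for a marker-based feed (not Cato's implementation)."""

    def __init__(self, events: List[dict], batch_size: int = 3000):
        self.events = events
        self.batch_size = batch_size

    def fetch(self, marker: str = "") -> Tuple[List[dict], str]:
        # An empty marker means "start from the beginning of the queue"
        start = int(marker) if marker else 0
        batch = self.events[start:start + self.batch_size]
        # The returned marker records the position for the next request
        return batch, str(start + len(batch))


toy_queue = ToyEventQueue([{"id": i} for i in range(7000)])
toy_marker = ""
received = []
while True:
    batch, toy_marker = toy_queue.fetch(toy_marker)
    if not batch:
        break  # nothing new in the queue
    received.extend(batch)
print(f"received {len(received)} events, final marker {toy_marker!r}")
```

Because each request resumes from the marker returned by the previous one, no event is skipped or delivered twice, regardless of how much time passes between requests (within the retention period).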

### Calculating the average event size

In order to calculate the average event size, we will:
1. Make an initial eventsFeed request with an empty marker, to start at the beginning of the queue.
2. Process the response, adding the size of each retrieved event to a list.
3. Make another request using the marker received from the previous request.

We will do the above until we have fetched a specified number of events. At each iteration we will print the average size of the received batch of events, as well as the overall average, to show that the average converges toward a stable value as the number of samples increases.
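
The per-batch and overall averages described above can be sketched offline before touching the API. This example uses invented event sizes, and the helper name `running_averages` is an assumption made for illustration. It keeps running totals rather than storing every size, which is an alternative to the list-based approach used later in this notebook.

```python
from typing import Iterable, Iterator, List, Tuple


def running_averages(batches: Iterable[List[int]]) -> Iterator[Tuple[float, float]]:
    """Yield (batch average, overall average) for each non-empty batch of sizes."""
    total_bytes = 0
    total_count = 0
    for batch in batches:
        if not batch:
            continue  # skip empty batches to avoid dividing by zero
        total_bytes += sum(batch)
        total_count += len(batch)
        yield sum(batch) / len(batch), total_bytes / total_count


# Made-up event sizes in bytes, for illustration only
sample_batches = [[1500, 1600], [1400, 1700, 1550], [1800]]
for n, (batch_avg, overall_avg) in enumerate(running_averages(sample_batches), start=1):
    print(f"batch {n}: batch_average={batch_avg:.0f} overall_average={overall_avg:.0f}")
```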

### Initialising the connection to the API
Firstly, let's import the libraries we need and set up the connection to the API. As usual, we assume that the account ID and API key are preloaded as environment variables, and we use our helper module to encapsulate the business of making an API call (see the [Getting Started](Getting%20Started.ipynb) notebook if any of this is unclear):

In [1]:
#
# Initialise the API connection
#
import datetime
import json
import os
from cato import API
C = API(os.environ["CATO_API_KEY"])

### The eventsFeed query

The eventsFeed query is documented here: [https://api.catonetworks.com/documentation/#query-eventsFeed](https://api.catonetworks.com/documentation/#query-eventsFeed). This is the query we will send:

In [2]:
#
# Define the query
#
query = '''
query eventsFeed($accountIDs:[ID!] $marker:String) {
  eventsFeed(accountIDs:$accountIDs, marker:$marker) {
    marker
    fetchedCount
    accounts {
      id
      records {
        fieldsMap
      }
    }
  }
}'''


#
# Variables
#
variables = {
    "accountIDs": [int(os.environ["CATO_ACCOUNT_ID"])],
    "marker": "",
}

### Implement the fetch loop

Let's call the query, looping until we have fetched the required number of events.

In [3]:
#
# Stop when we've fetched at least this many events
#
limit = 100_000

#
# List to store the event sizes
#
event_sizes = []

#
# Fetch loop
#
iteration = 0
while True:

    #
    # Make the query
    #
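    success, result = C.send("eventsFeed", variables, query)
    if not success:
        # Stop rather than try to parse an error response
        print(f'API call failed: {result}')
        break
    iteration += 1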

    #
    # Get key fields
    #
    fetchedCount = result["data"]["eventsFeed"]["fetchedCount"]
    marker = result["data"]["eventsFeed"]["marker"]
    records = result["data"]["eventsFeed"]["accounts"][0]["records"]

    #
    # Process the records list. In order to get the truest estimate of
    # event size, we convert each event back to a string with no whitespace
    # in separators, and convert to bytes.
    #
    batch_sizes = []
    for event in records:
        event_str = json.dumps(event["fieldsMap"], separators=(',',':'))
        event_bytes = event_str.encode("utf-8")
        batch_sizes.append(len(event_bytes))
        event_sizes.append(len(event_bytes))

    #
    # Stop if we received nothing, to avoid divide by zero
    #
    if fetchedCount == 0:
        print(f'fetchedCount:{fetchedCount}, stopping')
        break

    #
    # Print the current iteration's stats
    #
    print(f'iteration:{iteration} fetchedCount:{fetchedCount} totalFetched:{len(event_sizes)} \
batch_average:{int(sum(batch_sizes)/len(batch_sizes))} \
total_average:{int(sum(event_sizes)/len(event_sizes))}')

    #
    # Stop if we hit the limit
    #
    if len(event_sizes) >= limit:
        print(f'Received {len(event_sizes)} events >= limit:{limit}, stopping')
        break
    
    #
    # Update the marker
    #
    variables["marker"] = marker

iteration:1 fetchedCount:3000 totalFetched:3000 batch_average:1526 total_average:1526
iteration:2 fetchedCount:3000 totalFetched:6000 batch_average:1534 total_average:1530
iteration:3 fetchedCount:3000 totalFetched:9000 batch_average:1564 total_average:1541
iteration:4 fetchedCount:3000 totalFetched:12000 batch_average:1808 total_average:1608
iteration:5 fetchedCount:3000 totalFetched:15000 batch_average:1571 total_average:1601
iteration:6 fetchedCount:3000 totalFetched:18000 batch_average:1545 total_average:1591
iteration:7 fetchedCount:3000 totalFetched:21000 batch_average:1537 total_average:1584
iteration:8 fetchedCount:3000 totalFetched:24000 batch_average:1539 total_average:1578
iteration:9 fetchedCount:3000 totalFetched:27000 batch_average:1581 total_average:1578
iteration:10 fetchedCount:3000 totalFetched:30000 batch_average:1527 total_average:1573
iteration:11 fetchedCount:3000 totalFetched:33000 batch_average:1558 total_average:1572
iteration:12 fetchedCount:3000 totalFetched: