## Hello, future Elastic Open Crawler user!

This notebook is designed to help you migrate your Elastic Crawler configurations to Open Crawler-friendly YAML!

We recommend running the cells one at a time and in order, since each cell depends on the cells before it.

_If you are running this notebook inside Google Colab, or have not installed elasticsearch in your local environment yet, please run the following cell to make sure the Python `elasticsearch` client is installed._

### Setup
First, let's start by making sure `elasticsearch` and other required dependencies are installed and imported by running the following cell:

In [510]:
!pip install elasticsearch pyyaml

from getpass import getpass
from elasticsearch import Elasticsearch

import os
import json
import yaml
import pprint




We are going to need a few things from your Elasticsearch deployment before we can migrate your configurations:
- Your **Elasticsearch Cloud ID**
- An **API key**

To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment.
You can create a new API key from the Stack Management -> API keys menu in Kibana. Be sure to copy your key to a safe place as soon as it is created, because it is displayed only once, at creation time.

In [511]:
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")
API_KEY = getpass("Elastic Api Key: ")

Elastic Cloud ID:  ········
Elastic Api Key:  ········


Great! Now let's try connecting to your Elasticsearch instance.

In [512]:
es_client = Elasticsearch(
    cloud_id=ELASTIC_CLOUD_ID,
    api_key=API_KEY,
)

# query ES to make sure we have a working connection
es_client.info()['tagline']

'You Know, for Search'

Hopefully you received our tagline 'You Know, for Search'. If so, we are connected and ready to go!

If not, please double-check the Cloud ID and API key you provided above.

### Step 1: Acquire Basic Configurations

The first order of business is to establish which Crawlers you have and their basic configuration details.
This migration notebook will attempt to pull configurations for every distinct Crawler you have in your Elasticsearch instance.

In [669]:
# in-memory data structure that maintains current state of the configs we've pulled
inflight_configuration_data = {}

crawler_configurations = es_client.search(
    index=".ent-search-actastic-crawler2_configurations_v2",
)

crawler_counter = 1
for configuration in crawler_configurations["hits"]["hits"]:
    source = configuration["_source"]

    # extract values
    crawler_oid = source["id"]
    output_index = source["index_name"]

    print (f"{crawler_counter}. {output_index}")
    print (f"   Crawler ID is {crawler_oid}\n")
    crawler_counter += 1

    crawl_schedule = [] # either no schedule or a specific schedule - determined in Step 4
    if source["use_connector_schedule"] == False and source["crawl_schedule"]: # an interval schedule is being used
        crawl_schedule = source["crawl_schedule"] # this will be transformed in Step 4

    # populate a temporary hashmap
    temp_conf_map = {
        "output_index": output_index,
        "schedule": crawl_schedule
    }
    # pre-populate some necessary fields in preparation for upcoming steps
    temp_conf_map["domains_temp"] = {}
    temp_conf_map["output_sink"] = "elasticsearch"
    temp_conf_map["full_html_extraction_enabled"] = False
    temp_conf_map["elasticsearch"] = {
        "host": "",
        "port": "",
        "api_key": "",
        # "username": "",
        # "password": "",
    }
    # populate the in-memory data structure
    inflight_configuration_data[crawler_oid] = temp_conf_map

# pprint.pprint(inflight_configuration_data) # REMOVE BEFORE FLIGHT

1. search-search-crawler-fully-loaded-8.18
   Crawler ID is 67b74f16204956a3ce9fd0a4

2. search-daggerfall-unity-website-crawler-8.18
   Crawler ID is 67b74f84204956efce9fd0b7

3. search-migration-crawler
   Crawler ID is 67b7509b2049567f859fd0d4

4. search-basic
   Crawler ID is 67b75aeb20495617d59fd0ea



**Before continuing, please verify in the output above that the correct number of Crawlers was found!**
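
One way to make this check less manual is to compare the number of hits returned with the total that Elasticsearch reports. A search request returns at most 10 hits by default, so a deployment with more Crawlers than that would be truncated unless a larger `size` is passed. The sketch below runs on a hand-written sample response rather than a live cluster, and the helper name `all_hits_returned` is hypothetical.

```python
def all_hits_returned(search_response: dict) -> bool:
    """Return True when the response contains every matching document."""
    hits = search_response["hits"]
    total = hits["total"]
    # Elasticsearch 7+ reports the total as {"value": N, "relation": ...};
    # older versions reported a plain integer
    total_count = total["value"] if isinstance(total, dict) else total
    return len(hits["hits"]) >= total_count


# Hand-written sample (not real data): 12 matching documents, only 10 returned
sample_response = {
    "hits": {
        "total": {"value": 12, "relation": "eq"},
        "hits": [{"_source": {"id": str(i)}} for i in range(10)],
    }
}

print(all_hits_returned(sample_response))
```

If the same check on `crawler_configurations` returns `False`, re-running the search with a larger `size` argument (for example `size=100`) should pick up the remaining Crawlers.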

Now that we have some basic data about your Crawlers, let's use this information to get more configuration values!

### Step 2: URLs, Sitemaps, and Crawl Rules

In this cell, we will need to query Elasticsearch for information about each Crawler's domain URLs, seed URLs, sitemaps, and crawling rules.

In [670]:
crawler_ids_to_query = inflight_configuration_data.keys()

crawler_counter = 1
for crawler_oid in crawler_ids_to_query:
    # query ES to get the crawler's domain configurations
    crawler_domains = es_client.search(
        index=".ent-search-actastic-crawler2_domains",
        query={"match": {"configuration_oid": crawler_oid}},
        _source=["name",
                 "configuration_oid",
                 "id",
                 "sitemaps",
                 "crawl_rules",
                 "seed_urls",
                 "auth"]
        )
    print (f"{crawler_counter}.) Crawler ID {crawler_oid}")
    crawler_counter += 1
    
    # for each domain the Crawler has, grab its config values
    # and update the in-memory data structure
    for domain_info in crawler_domains["hits"]["hits"]:
        source = domain_info["_source"]

        # extract values
        domain_oid = str(source["id"])
        domain_url = source["name"]
        seed_urls = source["seed_urls"]
        sitemap_urls = source["sitemaps"]
        crawl_rules = source["crawl_rules"]

        print (f"    Domain {domain_url} found!")
        
        # transform seed, sitemap, and crawl rules into arrays
        seed_urls_list = []
        for seed_obj in seed_urls:
            seed_urls_list.append(seed_obj["url"])

        sitemap_urls_list = []
        for sitemap_obj in sitemap_urls:
            sitemap_urls_list.append(sitemap_obj["url"])

        crawl_rules_list = []
        for crawl_rules_obj in crawl_rules:
            crawl_rules_list.append({
                "policy": crawl_rules_obj["policy"],
                "type": crawl_rules_obj["rule"],
                "pattern": crawl_rules_obj["pattern"]
            })

        # populate a temporary hashmap
        temp_domain_conf = {"url": domain_url}
        if seed_urls_list:
            temp_domain_conf["seed_urls"] = seed_urls_list
            print (f"    Seed URLs found: {seed_urls_list}")
        if sitemap_urls_list:
            temp_domain_conf["sitemap_urls"] = sitemap_urls_list
            print (f"    Sitemap URLs found: {sitemap_urls_list}")
        if crawl_rules_list:
            temp_domain_conf["crawl_rules"] = crawl_rules_list
            print (f"    Crawl rules found: {crawl_rules_list}")
                
        # populate the in-memory data structure
        inflight_configuration_data[crawler_oid]["domains_temp"][domain_oid] = temp_domain_conf

# pprint.pprint(inflight_configuration_data) # REMOVE BEFORE FLIGHT

1.) Crawler ID 67b74f16204956a3ce9fd0a4
    Domain https://www.speedhunters.com found!
    Seed URLs found: ['https://www.speedhunters.com/2025/01/the-mystery-of-the-hks-zero-r/', 'https://www.speedhunters.com/2025/02/daniel-arsham-eroded-porsche-911/', 'https://www.speedhunters.com/2025/02/5-plus-7-equals-v12-a-custom-bmw-super-saloon/']
    Sitemap URLs found: ['https://www.speedhunters.com/post_tag-sitemap2.xml']
2.) Crawler ID 67b74f84204956efce9fd0b7
    Domain https://www.dfworkshop.net found!
    Seed URLs found: ['https://www.dfworkshop.net/']
    Crawl rules found: [{'policy': 'allow', 'type': 'begins', 'pattern': '/word'}, {'policy': 'deny', 'type': 'contains', 'pattern': 'DOS'}]
    Domain https://www.speedhunters.com found!
    Seed URLs found: ['https://www.speedhunters.com/']
    Crawl rules found: [{'policy': 'deny', 'type': 'begins', 'pattern': '/BMW'}]
3.) Crawler ID 67b7509b2049567f859fd0d4
    Domain https://justinjackson.ca found!
    Seed URLs found: ['https://just
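
The per-domain transformation in this step can also be expressed as a small standalone function, which makes it easy to try on a single document. This is a minimal sketch: the function name `to_domain_config` is hypothetical, and the sample document is hand-written using the same field names the cell reads (`name`, `seed_urls`, `sitemaps`, `crawl_rules`).

```python
def to_domain_config(source: dict) -> dict:
    """Convert one Elastic Crawler domain document into an Open Crawler domain entry."""
    domain_conf = {"url": source["name"]}

    seed_urls = [seed["url"] for seed in source.get("seed_urls") or []]
    sitemap_urls = [sitemap["url"] for sitemap in source.get("sitemaps") or []]
    crawl_rules = [
        {"policy": rule["policy"], "type": rule["rule"], "pattern": rule["pattern"]}
        for rule in source.get("crawl_rules") or []
    ]

    # only include keys that actually have values
    if seed_urls:
        domain_conf["seed_urls"] = seed_urls
    if sitemap_urls:
        domain_conf["sitemap_urls"] = sitemap_urls
    if crawl_rules:
        domain_conf["crawl_rules"] = crawl_rules
    return domain_conf


# Hand-written sample document (not real data)
sample_source = {
    "name": "https://example.com",
    "seed_urls": [{"url": "https://example.com/blog"}],
    "sitemaps": [],
    "crawl_rules": [{"policy": "deny", "rule": "begins", "pattern": "/private"}],
}

print(to_domain_config(sample_source))
```

Note that the Elastic Crawler field `rule` becomes `type` in the output, and empty lists are dropped so they do not appear in the final YAML.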

### Step 3: Extracting the Extraction Rules

In the following cell, we will be acquiring any extraction rules you may have set in your Elastic Crawlers.

In [671]:
extraction_rules = es_client.search(
    index=".ent-search-actastic-crawler2_extraction_rules",
    _source=["configuration_oid", "domain_oid", "rules", "url_filters"]
)

# maps Elastic Crawler value types to Open Crawler actions
action_translation_map = {
    "fixed": "set",
    "extracted": "extract",
}

# start from a clean slate so that re-running this cell does not duplicate rulesets
for crawler_config in inflight_configuration_data.values():
    for domain_conf in crawler_config["domains_temp"].values():
        domain_conf.pop("extraction_rulesets", None)

for exr_rule in extraction_rules["hits"]["hits"]:
    source = exr_rule["_source"]

    config_oid = source["configuration_oid"]
    domain_oid = source["domain_oid"]

    # translate every URL filter on this extraction rule, not only the first
    url_filters = [
        {
            "type": url_filter["filter"],
            "pattern": url_filter["pattern"],
        }
        for url_filter in source["url_filters"] or []
    ]

    # translate every field rule on this extraction rule, not only the first
    ruleset = [
        {
            "action": action_translation_map[rule["content_from"]["value_type"]],
            "field_name": rule["field_name"],
            "selector": rule["selector"],
            "join_as": rule["multiple_objects_handling"],
            "value": rule["content_from"]["value"],
            "source": rule["source_type"],
        }
        for rule in source["rules"] or []
    ]

    # populate the in-memory data structure; append so that a domain with
    # several extraction rules keeps all of them
    domain_conf = inflight_configuration_data[config_oid]["domains_temp"][domain_oid]
    domain_conf.setdefault("extraction_rulesets", []).append({
        "url_filters": url_filters,
        "rules": ruleset,
    })

# pprint.pprint(inflight_configuration_data) # REMOVE BEFORE FLIGHT
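
To see what this translation produces without touching a cluster, the same mapping can be run on a hand-written extraction rule document. The sketch below is illustrative only: `to_extraction_ruleset` is a hypothetical helper, the field names are the ones read by the cell above, and the sample values (selector, filter pattern, and so on) are made up.

```python
ACTION_TRANSLATION = {"fixed": "set", "extracted": "extract"}


def to_extraction_ruleset(source: dict) -> dict:
    """Translate one Elastic Crawler extraction rule document into an Open Crawler ruleset."""
    url_filters = [
        {"type": url_filter["filter"], "pattern": url_filter["pattern"]}
        for url_filter in source.get("url_filters") or []
    ]
    rules = [
        {
            "action": ACTION_TRANSLATION[rule["content_from"]["value_type"]],
            "field_name": rule["field_name"],
            "selector": rule["selector"],
            "join_as": rule["multiple_objects_handling"],
            "value": rule["content_from"]["value"],
            "source": rule["source_type"],
        }
        for rule in source.get("rules") or []
    ]
    return {"url_filters": url_filters, "rules": rules}


# Hand-written sample document (values are made up for illustration)
sample_rule_doc = {
    "url_filters": [{"filter": "begins", "pattern": "/blog"}],
    "rules": [
        {
            "field_name": "author",
            "selector": ".byline",
            "multiple_objects_handling": "array",
            "source_type": "html",
            "content_from": {"value_type": "extracted", "value": None},
        },
        {
            "field_name": "section",
            "selector": "body",
            "multiple_objects_handling": "string",
            "source_type": "html",
            "content_from": {"value_type": "fixed", "value": "blog"},
        },
    ],
}

ruleset = to_extraction_ruleset(sample_rule_doc)
print([rule["action"] for rule in ruleset["rules"]])
```

A value type that is missing from the translation map raises a `KeyError`, which is a deliberate choice here: an unknown value type is surfaced immediately rather than written into the YAML.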

### Step 4: Schedules

In the upcoming cell, we will be gathering any schedules your Crawlers have set.

In [672]:
def generate_cron_expression(interval_values: dict) -> str:
    # TODO: interval schedules are currently passed through unchanged;
    # confirm whether a conversion to a cron expression is needed
    return interval_values

# ---------------------------

for crawler_oid, crawler_config in inflight_configuration_data.items():
    output_index = crawler_config["output_index"]
    
    existing_schedule_value = crawler_config["schedule"]

    if not existing_schedule_value:
        # query ES to get this Crawler's specific time schedule
        schedules_result = es_client.search(
            index=".elastic-connectors-v1",
            query={"match": {"index_name": output_index}},
            _source=["index_name", "scheduling"]
        )
        # update schedule field with cron expression if specific time scheduling is enabled
        schedule_hits = schedules_result["hits"]["hits"]
        if schedule_hits:  # guard against Crawlers with no matching connector document
            full_schedule = schedule_hits[0]["_source"]["scheduling"]["full"]
            if full_schedule["enabled"]:
                crawler_config["schedule"] = full_schedule["interval"]
    elif isinstance(existing_schedule_value[0], dict):
        crawler_config["schedule"] = generate_cron_expression(existing_schedule_value)
    
# pprint.pprint(inflight_configuration_data) # REMOVE BEFORE FLIGHT  
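
The cell above leaves interval schedules untransformed. If a cron expression turns out to be needed, a conversion could look roughly like the sketch below. Everything here is an assumption rather than something this notebook confirms: that an interval entry carries a `frequency` and a `unit` (`hour`, `day`, `week`, or `month`), that the target is the six-field Quartz-style format seen in specific-time schedules such as `0 30 8 * * ?`, and the function name `interval_to_cron` itself. Whether Open Crawler accepts this format should be checked against its documentation.

```python
def interval_to_cron(frequency: int, unit: str) -> str:
    """Approximate an "every N units" interval as a six-field Quartz-style cron string.

    Fields: second minute hour day-of-month month day-of-week
    """
    if frequency < 1:
        raise ValueError("frequency must be a positive integer")
    if unit == "hour":
        return f"0 0 0/{frequency} * * ?"
    if unit == "day":
        return f"0 0 0 1/{frequency} * ?"
    if unit == "week":
        # cron has no "every N weeks"; approximate with a day-of-month step
        return f"0 0 0 1/{7 * frequency} * ?"
    if unit == "month":
        return f"0 0 0 1 1/{frequency} ?"
    raise ValueError(f"unsupported unit: {unit}")


print(interval_to_cron(2, "hour"))
print(interval_to_cron(1, "week"))
```

The day and week cases are approximations, because a day-of-month step restarts on the first of every month instead of counting continuously.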

### Step 5: Creating the Open Crawler YAML configuration files

In this final step, we will be creating the actual YAML files you need to get up and running with Open Crawler!

The upcoming cell performs some final transformations to the in-memory data structure that is keeping track of your configurations.

In [673]:
# Final transform of the in-memory data structure to a form we can dump to YAML
# for each crawler, collect all of its domain configurations into a list
for crawler_config in inflight_configuration_data.values():
    all_crawler_domains = []
    
    for domain_config in crawler_config["domains_temp"].values():
        all_crawler_domains.append(domain_config)
    # create a new key called "domains" that points to a list of domain configs only - no domain_oid values as keys
    crawler_config["domains"] = all_crawler_domains
    # delete the temporary domain key
    del crawler_config["domains_temp"]

# pprint.pprint(inflight_configuration_data) # REMOVE BEFORE FLIGHT 

#### **Wait! Before we continue onto creating our YAML files, we're going to need your input on a few things.**

In the following cell, please enter the following details about the _Elasticsearch instance you will be using with Open Crawler_:
- The Elasticsearch endpoint URL
- The port number of your Elasticsearch endpoint
- An API key

In [660]:
ENDPOINT = input("Elasticsearch endpoint URL: ")
PORT = input("The Elasticsearch endpoint's port number: ")
API_KEY = getpass("Elasticsearch API key: ")

# set the above values in each Crawler's configuration
for crawler_config in inflight_configuration_data.values():
    crawler_config["elasticsearch"]["host"] = ENDPOINT
    crawler_config["elasticsearch"]["port"] = int(PORT)
    crawler_config["elasticsearch"]["api_key"] = API_KEY

Elasticsearch endpoint URL:  https://4911ebad5ed44d149fe8ddad4a4b3751.us-west2.gcp.elastic-cloud.com
The Elasticsearch endpoint's port number:  443
Elasticsearch API key:  ········
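
Because the cell above converts the port with `int(PORT)`, a typo in either prompt only surfaces as an error after all values have been entered. A small validation step could catch this earlier. The sketch below is one possible approach using only the standard library; `validate_endpoint` is a hypothetical helper, not part of the notebook.

```python
from typing import Tuple
from urllib.parse import urlparse


def validate_endpoint(endpoint: str, port: str) -> Tuple[str, int]:
    """Check the endpoint URL and port entered at the prompts and normalise them."""
    cleaned = endpoint.strip().rstrip("/")
    parsed = urlparse(cleaned)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"endpoint must be an http(s) URL, got: {endpoint!r}")

    port_number = int(port.strip())  # raises ValueError for non-numeric input
    if not 1 <= port_number <= 65535:
        raise ValueError(f"port out of range: {port_number}")
    return cleaned, port_number


print(validate_endpoint("https://example.com/", "443"))
```

The returned pair could then be assigned to `host` and `port` in each Crawler's `elasticsearch` block in place of the raw input.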


#### **This is the final step! You have two options here:**

- The "Write to YAML" cell will create one YAML file for each Crawler you have.
- The "Print to output" cell will print each Crawler's configuration YAML in the Notebook, so you can copy-paste them into your Open Crawler YAML files manually.

Feel free to run both! You can run Option 2 first to see the output before running Option 1 to save the configs into YAML files.
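
Keep in mind that each configuration now contains the API key you entered, so printing it in full leaves the key visible in the notebook output. If you plan to share or commit the notebook, one option is to print a redacted copy instead. The sketch below assumes the configuration layout built in the earlier steps (an `elasticsearch` block with an `api_key` entry); the helper name `redact_api_key` and the sample values are hypothetical.

```python
import copy


def redact_api_key(config: dict, mask: str = "<redacted>") -> dict:
    """Return a deep copy of a Crawler config with the Elasticsearch API key masked."""
    redacted = copy.deepcopy(config)
    es_settings = redacted.get("elasticsearch", {})
    if es_settings.get("api_key"):
        es_settings["api_key"] = mask
    return redacted


# Hand-written sample config (not real credentials)
sample_config = {
    "output_index": "search-example",
    "elasticsearch": {"host": "https://example.com", "port": 443, "api_key": "secret"},
}

safe_config = redact_api_key(sample_config)
print(safe_config["elasticsearch"]["api_key"])
print(sample_config["elasticsearch"]["api_key"])  # the original is left untouched
```

In the "Print to output" cell, passing `redact_api_key(crawler_config)` to `yaml.safe_dump` would print the masked version, while the "Write to YAML" cell can keep writing the real key to disk.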

#### Option 1: Write to YAML file

In [661]:
# Dump each Crawler's configuration into its own YAML file
for crawler_config in inflight_configuration_data.values():
    base_dir = os.getcwd()
    file_name = f"{crawler_config['output_index']}-config.yml" # autogen a custom filename
    output_path = os.path.join(base_dir, file_name)

    if os.path.exists(base_dir):
        with open(output_path, 'w') as file:
            yaml.safe_dump(
                crawler_config,
                file,
                sort_keys=False
            )

#### Option 2: Print to output

In [674]:
for crawler_config in inflight_configuration_data.values():
    yaml_out = yaml.safe_dump(
        crawler_config,
        sort_keys=False
    )
    
    print (f"YAML config => {crawler_config['output_index']}-config.yml\n--------")
    print (yaml_out)
    print ("--------------------------------------------------------------------------------")

YAML config => search-search-crawler-fully-loaded-8.18-config.yml
--------
output_index: search-search-crawler-fully-loaded-8.18
schedule: []
output_sink: elasticsearch
full_html_extraction_enabled: false
elasticsearch:
  host: ''
  port: ''
  api_key: ''
domains:
- url: https://www.speedhunters.com
  seed_urls:
  - https://www.speedhunters.com/2025/01/the-mystery-of-the-hks-zero-r/
  - https://www.speedhunters.com/2025/02/daniel-arsham-eroded-porsche-911/
  - https://www.speedhunters.com/2025/02/5-plus-7-equals-v12-a-custom-bmw-super-saloon/
  sitemap_urls:
  - https://www.speedhunters.com/post_tag-sitemap2.xml

--------------------------------------------------------------------------------
YAML config => search-daggerfall-unity-website-crawler-8.18-config.yml
--------
output_index: search-daggerfall-unity-website-crawler-8.18
schedule: 0 30 8 * * ?
output_sink: elasticsearch
full_html_extraction_enabled: false
elasticsearch:
  host: ''
  port: ''
  api_key: ''
domains:
- url: https: