# Fetch, Clean, and Prepare Train-Ready Datasets for AI Tasks in Networking
This notebook provides a step-by-step guide to fetch, clean, and prepare datasets for AI tasks in networking. It includes downloading datasets, normalizing formats, and creating training-ready data.

## Setup Repository and Paths
Set up the directory structure and paths for the datasets. The cell below creates the directories with Python's `os.makedirs`, so it runs unchanged under WSL or any other environment with a Python interpreter.

In [None]:
# Create directories for datasets and tools
import os

base_dir = "datasets"
sub_dirs = ["topologies", "traffic", "telemetry", "configs"]

for sub_dir in sub_dirs:
    path = os.path.join(base_dir, sub_dir)
    os.makedirs(path, exist_ok=True)
    # exist_ok=True means the directory may already have existed
    print(f"Directory ready: {path}")
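
The cell above creates the directories but does not check the result. A minimal sketch of a verification step follows; the helper name `missing_subdirs` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def missing_subdirs(base_dir, sub_dirs):
    """Return the expected sub-directories that do not exist under base_dir."""
    base = Path(base_dir)
    return [name for name in sub_dirs if not (base / name).is_dir()]


# Same layout as the setup cell; an empty list means the structure is complete.
expected = ["topologies", "traffic", "telemetry", "configs"]
print("Missing sub-directories:", missing_subdirs("datasets", expected))
```

An empty list means every expected directory is present; any name in the list points to a directory that still needs to be created.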

## Pull Base Datasets
Download and organize the base datasets: Topology Zoo, SNDlib, MAWI, CIC-IDS2018, and telemetry samples. The first cell fetches the Topology Zoo archive and unzips it.

In [None]:
# Example: Download Topology Zoo dataset
import requests
import zipfile
import io

url = "http://www.topology-zoo.org/files/TopologyZoo2010.zip"
output_dir = "datasets/topologies"

# A timeout keeps the cell from hanging if the server never responds
response = requests.get(url, timeout=60)
if response.status_code == 200:
    with zipfile.ZipFile(io.BytesIO(response.content)) as z:
        z.extractall(output_dir)
    print(f"Downloaded and extracted Topology Zoo dataset to {output_dir}")
else:
    print(f"Failed to download dataset. Status code: {response.status_code}")
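
A server can answer with status 200 and still return something other than a ZIP archive, such as an HTML error page. The sketch below checks the payload before extracting it. The helper name `extract_zip_bytes` is hypothetical; it assumes the downloaded bytes are already in memory, as with `response.content` in the cell above.

```python
import io
import zipfile
from pathlib import Path


def extract_zip_bytes(data, output_dir):
    """Validate an in-memory ZIP payload, extract it, and return member names."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        raise ValueError("Payload is not a ZIP archive")
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        # testzip() returns the name of the first corrupt member, or None
        corrupt = archive.testzip()
        if corrupt is not None:
            raise ValueError(f"Corrupt member in archive: {corrupt}")
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        archive.extractall(output_dir)
        return archive.namelist()
```

In the download cell, this could stand in for the `zipfile.ZipFile(...)` block, for example `extract_zip_bytes(response.content, output_dir)`.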

## Download Topology Zoo
The Topology Zoo archive was downloaded and unzipped into the `datasets/topologies` directory in the previous step. List that directory to confirm the extraction worked.

In [None]:
# Verify Topology Zoo dataset structure
import os

topology_zoo_dir = "datasets/topologies"
files = os.listdir(topology_zoo_dir)
print(f"Files in Topology Zoo directory: {files}")
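
A raw file listing gets long quickly, so a per-extension count can be easier to read. This is a minimal sketch; the helper name `count_by_extension` is hypothetical, and which extensions appear depends on what the archive actually contains.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(directory):
    """Count regular files under directory (recursively) by lowercase suffix."""
    counts = Counter()
    for path in Path(directory).rglob("*"):
        if path.is_file():
            # Files without a suffix are grouped under "<none>"
            counts[path.suffix.lower() or "<none>"] += 1
    return counts


topology_dir = Path("datasets/topologies")
if topology_dir.is_dir():
    for suffix, count in sorted(count_by_extension(topology_dir).items()):
        print(f"{suffix}: {count}")
```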

## Download SNDlib
Fetch SNDlib archives, unzip them, and store them in the `datasets/topologies/sndlib` directory.

In [None]:
# Example: Download SNDlib dataset
sndlib_url = "http://sndlib.zib.de/download/sndlib_xml_2010-10-01.zip"
sndlib_dir = "datasets/topologies/sndlib"

# Reuses requests, zipfile, and io imported in the Topology Zoo cell
response = requests.get(sndlib_url, timeout=60)
if response.status_code == 200:
    with zipfile.ZipFile(io.BytesIO(response.content)) as z:
        z.extractall(sndlib_dir)
    print(f"Downloaded and extracted SNDlib dataset to {sndlib_dir}")
else:
    print(f"Failed to download SNDlib dataset. Status code: {response.status_code}")

## Download Traffic Datasets
Download traffic datasets such as MAWI and CIC-IDS2018. CAIDA datasets are an optional addition.

In [None]:
# Placeholder: the MAWI and CIC-IDS2018 download and extraction logic is not implemented yet
print("MAWI and CIC-IDS2018 download is not implemented yet.")
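
Traffic traces are often too large to hold in memory the way `response.content` does, so one option is to stream each file straight to disk. The sketch below uses the standard library's `urllib.request` rather than `requests`. The helper name `download_to_file` is hypothetical, and the URL in the usage comment is a placeholder, not a real MAWI or CIC-IDS2018 location.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to_file(url, dest_path, timeout=60, chunk_size=1024 * 1024):
    """Stream url to dest_path in chunks and return the destination Path."""
    dest = Path(dest_path)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary ".part" file so an interrupted download
    # never leaves a truncated file under the final name.
    partial = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(partial, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    partial.replace(dest)
    return dest


# Usage (placeholder URL, replace with the dataset's documented location):
# download_to_file("https://example.org/trace.pcap.gz",
#                  "datasets/traffic/trace.pcap.gz")
```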

## Clone Telemetry Samples
Clone the Cisco Innovation Edge telemetry repository into the `datasets/telemetry` directory.

In [None]:
# Clone telemetry samples repository
import subprocess

telemetry_repo_url = "https://github.com/CiscoDevNet/telemetry-sample-code.git"
telemetry_dir = "datasets/telemetry"

subprocess.run(["git", "clone", telemetry_repo_url, telemetry_dir], check=True)
print(f"Cloned telemetry samples to {telemetry_dir}")
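
Because the clone runs with `check=True`, re-running the cell raises an error once the target directory already contains a checkout. A small guard makes the step repeatable. This is a sketch: the helper name `clone_if_missing` is hypothetical, and the shallow clone (`--depth 1`) is an assumption that full history is not needed for dataset preparation.

```python
import subprocess
from pathlib import Path


def clone_if_missing(repo_url, target_dir, shallow=True):
    """Clone repo_url into target_dir unless it already holds a Git checkout.

    Returns True if a clone was performed and False if it was skipped.
    """
    target = Path(target_dir)
    if (target / ".git").exists():
        return False
    command = ["git", "clone"]
    if shallow:
        # Fetch only the latest commit to save time and disk space
        command += ["--depth", "1"]
    command += [repo_url, str(target)]
    subprocess.run(command, check=True)
    return True
```

For example, `clone_if_missing(telemetry_repo_url, telemetry_dir)` would clone on the first run and return `False` on later runs.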

## Fetch Config Schemas and Examples
Clone the OpenConfig models into the `datasets/configs/openconfig` directory. FRR example configurations also belong under `datasets/configs`; the cell below covers only the OpenConfig clone.

In [None]:
# Clone OpenConfig models
openconfig_repo_url = "https://github.com/openconfig/public.git"
configs_dir = "datasets/configs/openconfig"

subprocess.run(["git", "clone", openconfig_repo_url, configs_dir], check=True)
print(f"Cloned OpenConfig models to {configs_dir}")

## Create Dataset Catalog
Create a CSV catalog to track dataset metadata, including source, license, and notes.

In [None]:
# Create a dataset catalog
import pandas as pd

catalog_data = [
    {"Dataset": "Topology Zoo", "Source": "http://www.topology-zoo.org", "License": "Unknown", "Notes": "Network topologies"},
    {"Dataset": "SNDlib", "Source": "http://sndlib.zib.de", "License": "Unknown", "Notes": "Network design library"},
    {"Dataset": "MAWI", "Source": "http://mawi.wide.ad.jp", "License": "Unknown", "Notes": "Traffic traces"},
    {"Dataset": "CIC-IDS2018", "Source": "https://www.unb.ca/cic/datasets/ids-2018.html", "License": "Unknown", "Notes": "Intrusion detection dataset"}
]

catalog_df = pd.DataFrame(catalog_data)
catalog_path = "datasets/dataset_catalog.csv"
catalog_df.to_csv(catalog_path, index=False)
print(f"Dataset catalog saved to {catalog_path}")
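
Every entry in the catalog currently lists its license as "Unknown", so a quick check can show which datasets still need a license review before they are used for training. This is a minimal sketch using the standard `csv` module; the helper name `datasets_with_unknown_license` is hypothetical.

```python
import csv


def datasets_with_unknown_license(catalog_path):
    """Return dataset names whose License field is empty or 'Unknown'."""
    with open(catalog_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return [
            row["Dataset"]
            for row in reader
            if (row.get("License") or "").strip().lower() in ("", "unknown")
        ]


# Usage: datasets_with_unknown_license("datasets/dataset_catalog.csv")
```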