# Domain Name Trends: Popularity & Pricing of `.io` and Other Alt-TLDs

This notebook explores the rise of alternative top-level domains (TLDs)—specifically `.io`, along with `.ai`, `.co`, `.dev`, and `.app`—with a focus on two core questions:

### Research Questions

1. **When did `.io` (and other alt-TLDs) begin to take off in popularity?**  
   In other words: when did developers, startups, and tech products begin registering them in significant numbers?

2. **How have prices for `.io` domains changed over time?**  
   Can we detect major shifts in value, spikes in demand, or effects from registry-driven price changes?

---

### Methods & Data Sources

To answer these questions, we use two main approaches:

#### 1. Certificate Transparency Data via [`crt.sh`](https://crt.sh/)
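Every public HTTPS-enabled website needs a TLS certificate issued by a certificate authority, and publicly trusted certificates are generally recorded in Certificate Transparency (CT) logs. By querying `crt.sh`, a searchable index of those logs, we can: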
- Estimate when a domain first appeared “in the wild” (first cert issuance)
- Build a year-by-year timeline of domain adoption per TLD

This gives us a strong, passive signal of when TLDs gained traction—especially in public-facing apps and startups.
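
As a rough sketch of the timeline step, assuming we already have a first-certificate year for each domain, the work reduces to counting (TLD, year) pairs. The `first_seen` mapping below is made-up placeholder data, not results from `crt.sh`, and `timeline_by_tld` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

def timeline_by_tld(first_seen):
    """Count domains per (TLD, first-cert year).

    first_seen maps domain -> year of earliest certificate (or None if unknown).
    Returns {tld: {year: count}} with years in ascending order.
    """
    counts = defaultdict(Counter)
    for domain, year in first_seen.items():
        if year is None:
            continue  # skip domains with no known certificate history
        tld = domain.rsplit(".", 1)[-1]
        counts[tld][year] += 1
    return {tld: dict(sorted(c.items())) for tld, c in counts.items()}

# Illustrative placeholder data only (hypothetical domains and years)
first_seen = {
    "example-a.io": 2014,
    "example-b.io": 2016,
    "example-c.io": 2016,
    "example-d.ai": 2018,
    "example-e.com": None,
}
print(timeline_by_tld(first_seen))
```

Domains with no known certificate are dropped rather than counted as zero, so a TLD with sparse coverage simply shows fewer data points.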

#### 2. Domain Sale Price Data (Optional Extension)
Where possible, we’ll supplement our analysis with public domain sale data from:
- NameBio (if available)
- Historical WHOIS APIs or registrars (if feasible)
- Public datasets on secondary market sales

This gives us insight into how perceived value has changed—particularly for `.io`, where speculation and hype have played major roles.

---

### Notebook Outputs

- 📈 Trendline: first-seen cert dates for top domains, grouped by TLD and year
- 💸 (Optional) Timeline of average public sale prices for `.io` and peers
- 🔎 Visual cues on inflection points, e.g. post-2014 tech startup boom


# Step 1: Choosing a Set of Domain Names for Each TLD

We'll start by digging into [Cisco's top 1-Million domains](https://s3-us-west-1.amazonaws.com/umbrella-static/index.html) list to select cohorts of about 1,000 domains for each TLD of interest. We'll aim for a good mix of popular and less popular domains in each.

In [22]:
!pip --quiet install tldextract

In [30]:
### TLDs of interest
og_tlds = ["com", "net", "org"]
alt_tlds = ["io", "ai", "co"]

In [3]:
import pandas as pd

# Load the dataset
top_domains_df = pd.read_csv("../input/top-domain-data/cisco-top-1m.csv", names=["rank", "domain"])

# Preview
top_domains_df.head()


Unnamed: 0,rank,domain
0,1,google.com
1,2,microsoft.com
2,3,data.microsoft.com
3,4,e2ro.com
4,5,node.e2ro.com


In [5]:
def get_tld(domain):
    return domain.split(".")[-1]

top_domains_df["tld"] = top_domains_df["domain"].apply(get_tld)
top_domains_df.head()

Unnamed: 0,rank,domain,tld
0,1,google.com,com
1,2,microsoft.com,com
2,3,data.microsoft.com,com
3,4,e2ro.com,com
4,5,node.e2ro.com,com


In [26]:
# Print counts for each of our TLDs of interest
all_tlds = og_tlds + alt_tlds
filtered = top_domains_df[top_domains_df["tld"].isin(all_tlds)]
counts = filtered["tld"].value_counts().reindex(all_tlds, fill_value=0)
counts_df = counts.reset_index()
counts_df.columns = ["TLD", "Count"]
counts_df

Unnamed: 0,TLD,Count
0,com,583019
1,net,152577
2,org,27546
3,io,28358
4,ai,2821
5,co,5585
6,dev,2320


In [28]:
# We want to filter out all of the infra subdomains 
# to get a clearer picture of how many individual companies
# are actually using these domains

import tldextract

# Add eTLD+1 (root domain) column
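def get_root_domain(domain):
    # Parse once and reuse the result instead of calling extract() twice
    ext = tldextract.extract(domain)
    return f"{ext.domain}.{ext.suffix}"

top_domains_df["root_domain"] = top_domains_df["domain"].apply(get_root_domain)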

# Define unwanted infrastructure keywords
infra_keywords = ["cdn", "ads", "akamai", "edge", "sdk", "analytics", "api", "gateway", "internal", "tooling", "uat", "metrics"]

def is_infra(domain):
    return any(kw in domain.lower() for kw in infra_keywords)

# Filter out infrastructure and drop duplicate root domains
filtered_top_domains_df = top_domains_df[~top_domains_df["domain"].apply(is_infra)].copy()
filtered_top_domains_df = filtered_top_domains_df.drop_duplicates("root_domain")
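
Because `is_infra` matches substrings anywhere in the name, a keyword like `ads` also matches `downloads.example.com`, and `api` matches `rapid.example.io`. One possible refinement, sketched here and not the filter used for the counts in this notebook, is to compare keywords against whole dot- or hyphen-separated tokens only. The `is_infra_label` name and the shortened keyword set are assumptions for illustration.

```python
import re

# Shortened, illustrative keyword set (an assumption, not the notebook's full list)
INFRA_KEYWORDS = {"cdn", "ads", "api", "edge", "analytics", "metrics"}

def is_infra_label(domain, keywords=INFRA_KEYWORDS):
    """Return True if any dot- or hyphen-separated token equals a keyword."""
    tokens = re.split(r"[.\-]", domain.lower())
    return any(token in keywords for token in tokens)

# Hypothetical example names
for name in ["api.example.com", "cdn-edge.example.net",
             "downloads.example.com", "rapid.example.io"]:
    print(name, is_infra_label(name))
```

The trade-off is that exact-token matching misses variants such as `cdn1` or `apis`, so it would likely remove fewer rows than the substring version.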


In [33]:
# Print counts for each TLD after filtering out infra subdomains
all_tlds = og_tlds + alt_tlds
filtered = filtered_top_domains_df[filtered_top_domains_df["tld"].isin(all_tlds)]
counts = filtered["tld"].value_counts().reindex(all_tlds, fill_value=0)
counts_df = counts.reset_index()
counts_df.columns = ["TLD", "Count"]
counts_df

Unnamed: 0,TLD,Count
0,com,99829
1,net,11704
2,org,7862
3,io,4299
4,ai,992
5,co,1828


In [37]:
import random

# Parameters
num_samples = counts_df["Count"].min()       # total per TLD
num_top = 100           # how many from the top-ranked domains
num_random = num_samples - num_top

# Store results
top_samples = []
random_samples = []

# Loop through each TLD of interest
for tld in all_tlds:
    tld_df = filtered_top_domains_df[filtered_top_domains_df["tld"] == tld].copy()
    
    # Sort by rank (ascending: rank 1 is most popular)
    tld_df_sorted = tld_df.sort_values("rank")
    
    # Grab top N
    top_n_df = tld_df_sorted.head(num_top).copy()
    
    # Random sample from the remaining (everything ranked below the top N)
    remaining_df = tld_df_sorted.iloc[num_top:]
    random_df = remaining_df.sample(n=min(num_random, len(remaining_df)), random_state=42).copy()
    
    top_n_df["cohort"] = "top"
    random_df["cohort"] = "random"

    
    # Append to results
    top_samples.append(top_n_df)
    random_samples.append(random_df)

# Combine all into one final DataFrame
final_domains = pd.concat(top_samples + random_samples).reset_index(drop=True)

# Preview
final_domains.head()

Unnamed: 0,rank,domain,tld,root_domain,cohort
0,1,google.com,com,google.com,top
1,2,microsoft.com,com,microsoft.com,top
2,4,e2ro.com,com,e2ro.com,top
3,7,windowsupdate.com,com,windowsupdate.com,top
4,10,office.com,com,office.com,top
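
To make the cohort design easier to reason about, here is a minimal standard-library sketch of the same "top N plus random remainder" split on toy data. The `split_cohorts` helper and the `siteN.example` names are hypothetical; the point is only that drawing the random cohort from the rows below the top N keeps the two cohorts disjoint.

```python
import random

def split_cohorts(ranked, num_top, num_random, seed=42):
    """Split (rank, domain) pairs into a top cohort and a random cohort.

    The random cohort is drawn only from domains ranked below the top
    cohort, so the two cohorts never overlap.
    """
    ordered = sorted(ranked)              # ascending rank: 1 is most popular
    top = ordered[:num_top]
    remaining = ordered[num_top:]
    rng = random.Random(seed)             # local RNG keeps the draw reproducible
    k = min(num_random, len(remaining))   # never request more than is available
    return top, rng.sample(remaining, k)

# Hypothetical toy data: 20 placeholder domains ranked 1..20
ranked = [(rank, f"site{rank}.example") for rank in range(1, 21)]
top, rand = split_cohorts(ranked, num_top=5, num_random=8)
print(len(top), len(rand), set(top).isdisjoint(rand))
```

Using a fixed seed plays the same role as `random_state=42` in the pandas version: rerunning the cell yields the same sample.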


In [38]:
# Number of domains per TLD to display
n_per_tld = 10

# Create dictionary: {tld: [list of domains]}
tld_columns = {
    tld: final_domains[final_domains["tld"] == tld]
            .sample(n=min(n_per_tld, len(final_domains[final_domains["tld"] == tld])), random_state=1)
            .sort_values("rank")["domain"]
            .tolist()
    for tld in all_tlds
}

# Convert to DataFrame; wrapping each list in a Series pads shorter columns with NaN
preview_table = pd.DataFrame({tld: pd.Series(domains) for tld, domains in tld_columns.items()})

# Display the table
preview_table


Unnamed: 0,com,net,org,io,ai,co
0,aaplimg.com,bidswitch.net,oneget.org,bidmachine.io,powerad.ai,wunderkind.co
1,instagram.com,smaato.net,getgreenshot.org,codepen.io,forethought.ai,teramind.co
2,app-measurement.com,a-mo.net,collegeboard.org,shopee.io,go2app.ai,redcanary.co
3,3lift.com,ttoverseaus.net,nuget.org,fpjs.io,lunio.ai,mailmunch.co
4,amazonalexa.com,cedexis.net,beachapedia.org,anonymised.io,writefull.ai,9lib.co
5,sharethrough.com,nflximg.net,hltv.org,zprk.io,ivastudio.ai,invol.co
6,inmobi.com,fwmrm.net,sciencecareers.org,bitdrift.io,afp.ai,squidapp.co
7,appsflyer.com,nintendo.net,adtidy.org,pubeasy.io,pushy.ai,pango-paas.co
8,liadm.com,conferdeploy.net,acm.org,airbrake.io,superops.ai,smartytouch.co
9,lijit.com,discordapp.net,sciencemag.org,customer.io,prod.yospace.ai,w-mt.co


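# Step 2: Searching for Years of First Certificates

We'll use `crt.sh` to find the date of the earliest certificate issued for each domain in our sample. This is only a proxy for registration date: a domain can be registered well before its first certificate is logged, so the first-cert year is best read as the latest plausible date by which the domain was in use.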

In [40]:
import requests
import time
from datetime import datetime

# Query for the domain's cert history and extract the date from the earliest one
def get_first_cert_year(domain):
    url = "https://crt.sh/"
    try:
        # Pass the query via params so requests URL-encodes the domain
        response = requests.get(url, params={"q": domain, "output": "json"}, timeout=10)
        response.raise_for_status()
        data = response.json()

        if not data:
            return None
        
        # Get earliest not_before date
        cert_dates = [entry.get("not_before") for entry in data if "not_before" in entry]
        cert_dates = [datetime.fromisoformat(d) for d in cert_dates if d]
        return min(cert_dates).year if cert_dates else None

    except Exception as e:
        print(f"[WARN] {domain}: {e}")
        return None


In [41]:
get_first_cert_year("artlist.io")

2014
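
Running this lookup over the full sample means thousands of HTTP requests, so it is probably worth pacing them. The sketch below is one way to do that, assuming a lookup function with the same contract as `get_first_cert_year` (returns a year or `None`). The `lookup_years` helper, the delay, and the retry count are arbitrary assumptions, not documented `crt.sh` limits, and the stand-in data is made up so the example runs offline.

```python
import time

def lookup_years(domains, lookup, delay_s=1.0, retries=2, sleep=time.sleep):
    """Call lookup(domain) for each domain, pausing between requests.

    lookup should return a year (int) or None; a None result is retried
    up to `retries` more times with a growing pause before giving up.
    """
    results = {}
    for domain in domains:
        year = None
        for attempt in range(retries + 1):
            year = lookup(domain)
            if year is not None:
                break
            if attempt < retries:
                sleep(delay_s * (attempt + 1))  # back off a little more each retry
        results[domain] = year
        sleep(delay_s)                          # fixed pause between domains
    return results

# Stand-in lookup so the sketch runs offline; in the notebook one would pass
# get_first_cert_year instead. The domains and years here are made up.
fake_years = {"example-a.io": 2015, "example-b.ai": None}
print(lookup_years(fake_years, fake_years.get, delay_s=0))
```

One caveat: since `get_first_cert_year` returns `None` both on errors and when a domain has no certificates, this loop also retries domains that genuinely have no history. Distinguishing the two cases would cut down on wasted requests.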