<a href="https://colab.research.google.com/github/archivesunleashed/notebooks/blob/main/geocities/geocities_domain_frequency.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with GeoCities Derivatives | Domain Frequency

In this notebook we'll download some data from the [GeoCities Web Archive Collection Derivatives](https://archive.org/details/geocities-webarchive-collection-derivatives) to demonstrate a few examples of further exploration of web archive data.

# Datasets

First, we will need to download some derivative data from [GeoCities Web Archive Collection Derivatives](https://archive.org/details/geocities-webarchive-collection-derivatives).

In [1]:
%%capture

!mkdir -p data

!wget "https://archive.org/download/geocities-webarchive-collection-derivatives/geocities-domain-frequency.csv.gz" -P data

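Unzip the data. If the file has already been decompressed, `gunzip` prints an `unknown suffix -- ignored` message like the one below, which can be safely ignored.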

In [23]:
!gunzip data/*

gzip: data/geocities-domain-frequency.csv: unknown suffix -- ignored


Let's check our `data` directory and make sure the file has downloaded.

In [24]:
!ls -1 data

geocities-domain-frequency.csv
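
As an alternative to decompressing first, pandas can read a gzipped CSV directly. The sketch below is illustrative only: it writes a tiny stand-in file (a hypothetical path, with a `domain,count` header assumed from the columns shown in this notebook) and reads it back. Pointing `read_csv` at the downloaded `.csv.gz` should work the same way.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the downloaded .csv.gz: a tiny gzipped CSV built from the
# first two rows shown in this notebook (hypothetical path and header).
sample_csv = "domain,count\ngeocities.com,57922449\nyahoo.com,1110567\n"
sample_path = Path(tempfile.mkdtemp()) / "sample-domain-frequency.csv.gz"
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    handle.write(sample_csv)

# compression="infer" is the default; the ".gz" suffix selects gzip.
sample_frame = pd.read_csv(sample_path, compression="infer")
print(sample_frame)
```

Reading the compressed file this way avoids the `gunzip` step, at the cost of decompressing on every load.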


# Environment

Next, we'll set up our environment so we can load our derivatives into [pandas](https://pandas.pydata.org), build charts with [Altair](https://altair-viz.github.io/), and use the [Data Table extension for Colab](https://colab.research.google.com/notebooks/data_table.ipynb).

In [28]:
import pandas as pd
import altair as alt

In [19]:
%load_ext google.colab.data_table

## Let's take a look at the domain frequency derivative.

In [25]:
domain_frequency = pd.read_csv("data/geocities-domain-frequency.csv")
domain_frequency



| | domain | count |
|---|---|---|
| 0 | geocities.com | 57922449 |
| 1 | yahoo.com | 1110567 |
| 2 | amazon.com | 87675 |
| 3 | myspace.com | 67706 |
| 4 | bravenet.com | 62904 |
| ... | ... | ... |
| 147918 | manosguardanapo.blogspot.com | 1 |
| 147919 | bragaplasticos.com.ar | 1 |
| 147920 | abed.org.br | 1 |
| 147921 | mailearners.net | 1 |


What does the distribution of domains look like?

Here we can see which domains are the most frequent within the collection.

In [29]:
# Keep the ten most frequent domains.
top_domains = domain_frequency.sort_values("count", ascending=False).head(10)

# Bars ordered from most to least frequent.
top_domains_bar = (
    alt.Chart(top_domains)
    .mark_bar()
    .encode(
        x=alt.X("domain:O", title="Domain", sort="-y"),
        y=alt.Y("count:Q", title="Count, Mean of Count"),
    )
)

# Red rule at the mean count of these ten domains.
top_domains_rule = (
    alt.Chart(top_domains).mark_rule(color="red").encode(y="mean(count):Q")
)

# Label each bar with its count.
top_domains_text = top_domains_bar.mark_text(align="center", baseline="bottom").encode(
    text="count:Q"
)

# Layer the bars, the mean rule, and the labels into one chart.
(top_domains_bar + top_domains_rule + top_domains_text).properties(
    width=1400, height=700, title="Domains Distribution"
)
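
Because one domain is so much more frequent than the rest, a numeric view can complement the bar chart. The following is a minimal sketch that computes each domain's share and the running cumulative share. It uses only the five most frequent rows shown in the table earlier, so the shares are relative to that sample rather than to the whole collection; `sample`, `ranked`, `share`, and `cumulative_share` are names introduced here for illustration.

```python
import pandas as pd

# The five most frequent rows from the table above (a sample, not the full CSV).
sample = pd.DataFrame(
    {
        "domain": [
            "geocities.com",
            "yahoo.com",
            "amazon.com",
            "myspace.com",
            "bravenet.com",
        ],
        "count": [57922449, 1110567, 87675, 67706, 62904],
    }
)

# Rank by frequency, then express each count as a fraction of the sample total.
ranked = sample.sort_values("count", ascending=False).reset_index(drop=True)
ranked["share"] = ranked["count"] / ranked["count"].sum()

# Running total of the shares, from most to least frequent.
ranked["cumulative_share"] = ranked["share"].cumsum()
print(ranked)
```

Running the same steps on `domain_frequency` instead of `sample` would give shares for the full dataset.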

### Top Level Domain Analysis

pandas allows you to create new columns in a DataFrame based on existing data. This comes in handy for a number of use cases with the data we have. In this case, let's create a new column, `tld`, derived from the existing `domain` column. This example should give you an implementation pattern for extending these datasets for further research and analysis.

A [top-level domain](https://en.wikipedia.org/wiki/Top-level_domain) is the last, highest-level part of a domain name, e.g. `.ca`, `.com`, `.org`, or yes, even `.pizza`.

Things get a bit complicated, however, with some national TLDs. While `qc.ca` (the domain for Quebec) isn't really a top-level domain, it has many of the features of one, as people can register names directly under it. Below, we'll use the `suffix` attribute of the `tldextract` result, which captures multi-part suffixes like this one.

> You can learn more about suffixes at https://publicsuffix.org.
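
To make the suffix idea concrete, here is a minimal sketch of longest-suffix matching against a tiny, hand-picked suffix set. It is not how `tldextract` is implemented, and it ignores the wildcard and exception rules in the real Public Suffix List; `KNOWN_SUFFIXES` and `public_suffix` are hypothetical names used only for illustration.

```python
# A tiny, hand-picked subset of suffixes (assumption: illustration only,
# the real Public Suffix List is far larger and has extra rule types).
KNOWN_SUFFIXES = {"com", "net", "org", "ar", "com.ar", "br", "org.br", "ca", "qc.ca"}


def public_suffix(domain, suffixes=KNOWN_SUFFIXES):
    """Return the longest known suffix of `domain`, or "" if none matches."""
    labels = domain.lower().strip(".").split(".")
    # Candidates run from the whole name down to the last label,
    # so the first hit is the longest matching suffix.
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in suffixes:
            return candidate
    return ""


print(public_suffix("bragaplasticos.com.ar"))  # com.ar
print(public_suffix("abed.org.br"))            # org.br
print(public_suffix("geocities.com"))          # com
```

Trying the longest candidate first is what lets `com.ar` win over plain `ar` for an Argentine commercial domain.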

We'll extract the suffix from the `domain` column with [`tldextract`](https://github.com/john-kurkowski/tldextract) and store it in the new `tld` column.

First we'll install the library in the notebook. Then, we'll create the new column.

In [30]:
%%capture

!pip install tldextract

In [31]:
import tldextract

domain_frequency["tld"] = domain_frequency["domain"].apply(
    lambda domain: tldextract.extract(domain).suffix
)
domain_frequency



| | domain | count | tld |
|---|---|---|---|
| 0 | geocities.com | 57922449 | com |
| 1 | yahoo.com | 1110567 | com |
| 2 | amazon.com | 87675 | com |
| 3 | myspace.com | 67706 | com |
| 4 | bravenet.com | 62904 | com |
| ... | ... | ... | ... |
| 147918 | manosguardanapo.blogspot.com | 1 | com |
| 147919 | bragaplasticos.com.ar | 1 | com.ar |
| 147920 | abed.org.br | 1 | org.br |
| 147921 | mailearners.net | 1 | net |


In [32]:
tld_count = domain_frequency["tld"].value_counts()
tld_count

com          87723
net          11855
org          11018
de            4450
tk            3508
             ...  
sc.gov.br        1
gen.nz           1
kg               1
tn.us            1
wi.us            1
Name: tld, Length: 741, dtype: int64
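
Note that `value_counts()` counts how many distinct domains fall under each suffix, which is not the same as adding up the `count` column per suffix. The sketch below contrasts the two on a handful of rows taken from the table above; `sample`, `domains_per_suffix`, and `weighted_by_count` are names introduced here for illustration, and the numbers describe this sample only.

```python
import pandas as pd

# A few rows from the table above, including the extracted suffix.
sample = pd.DataFrame(
    {
        "domain": [
            "geocities.com",
            "yahoo.com",
            "bragaplasticos.com.ar",
            "abed.org.br",
            "mailearners.net",
        ],
        "count": [57922449, 1110567, 1, 1, 1],
        "tld": ["com", "com", "com.ar", "org.br", "net"],
    }
)

# Number of distinct domains under each suffix (what value_counts gives).
domains_per_suffix = sample["tld"].value_counts()

# Sum of the `count` column under each suffix, largest first.
weighted_by_count = (
    sample.groupby("tld")["count"].sum().sort_values(ascending=False)
)

print(domains_per_suffix)
print(weighted_by_count)
```

Which view is appropriate depends on the question: the first describes the variety of domains per suffix, the second how heavily each suffix appears in the collection.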

In [33]:
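# Count domains per suffix and keep the ten most common as a two-column frame.
tld_count = (
    domain_frequency["tld"]
    .value_counts()
    .rename_axis("TLD")
    .reset_index(name="Count")
    .head(10)
)

# Bars ordered from most to least common suffix.
tld_bar = (
    alt.Chart(tld_count)
    .mark_bar()
    .encode(x=alt.X("TLD:O", sort="-y"), y=alt.Y("Count:Q"))
)

# Red rule at the mean count of these ten suffixes.
tld_rule = alt.Chart(tld_count).mark_rule(color="red").encode(y="mean(Count):Q")

# Label each bar with its count.
tld_text = tld_bar.mark_text(align="center", baseline="bottom").encode(text="Count:Q")

# Layer the bars, the mean rule, and the labels into one chart.
(tld_bar + tld_rule + tld_text).properties(
    width=1400, height=700, title="Top Level Domain Distribution"
)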