# Common Crawl Data

https://www.codecademy.com/courses/big-data-pyspark/projects/pyspark-common-crawl

## Assumptions

- The data comes from https://commoncrawl.org/
- Common Crawl is a non-profit organization that crawls the web and makes its archives and datasets available to the public
- The data is publicly available

## Analyzing Common Crawl Data with RDDs

### Initialize a new Spark Context

In [7]:
from pyspark.sql import SparkSession

spark = SparkSession \
        .builder \
        .getOrCreate()

sc = spark.sparkContext

In [12]:
# Read the domains CSV file into an RDD
common_crawl_domain_counts = sc.textFile('./crawl/cc-main-limited-domains.csv')

# Display first few domains from the RDD
print(common_crawl_domain_counts.take(5))

['367855\t172-in-addr\tarpa\t1', '367856\taddr\tarpa\t1', '367857\tamphic\tarpa\t1', '367858\tbeta\tarpa\t1', '367859\tcallic\tarpa\t1']


### Adjust output

In [16]:
def fmt_domain_graph_entry(entry):
    """
    Formats a Common Crawl domain graph entry. Extracts the site_id,
    domain name, top-level domain (tld), and subdomain count as separate items.
    """
    # Split the entry on delimiter ('\t') into site_id, domain, tld, and num_subdomains
    site_id, domain, tld, num_subdomains = entry.split('\t')        
    return int(site_id), domain, tld, int(num_subdomains)
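
`fmt_domain_graph_entry` assumes that every line has exactly four tab-separated fields. A line with a different field count or a non-numeric value raises `ValueError`, which would fail the Spark job. Below is a minimal sketch of a more tolerant variant, assuming malformed lines can simply be skipped. The name `parse_domain_graph_entry` and the malformed sample line are hypothetical and not part of the original project.

```python
def parse_domain_graph_entry(entry):
    """
    Parse one tab-separated line into (site_id, domain, tld, num_subdomains).
    Returns None for malformed lines instead of raising an exception.
    """
    fields = entry.rstrip('\n').split('\t')
    if len(fields) != 4:
        return None
    site_id, domain, tld, num_subdomains = fields
    try:
        return int(site_id), domain, tld, int(num_subdomains)
    except ValueError:
        # site_id or num_subdomains was not an integer
        return None


# The first line is taken from the output above; the second is a made-up malformed line
sample_lines = ['367855\t172-in-addr\tarpa\t1', 'not-a-valid-line']
parsed = [parse_domain_graph_entry(line) for line in sample_lines]
print(parsed)
```

On the RDD, `common_crawl_domain_counts.map(parse_domain_graph_entry).filter(lambda e: e is not None)` would then drop the skipped lines.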

In [19]:
# Apply `fmt_domain_graph_entry` to the raw data RDD
formatted_host_counts = common_crawl_domain_counts.map(fmt_domain_graph_entry)

# Display the first few entries of the new RDD
print(formatted_host_counts.take(5))

[(367855, '172-in-addr', 'arpa', 1), (367856, 'addr', 'arpa', 1), (367857, 'amphic', 'arpa', 1), (367858, 'beta', 'arpa', 1), (367859, 'callic', 'arpa', 1)]


In [21]:
def extract_subdomain_counts(entry):
    """
    Extract the subdomain count from a Common Crawl domain graph entry.
    """
    
    # Split the entry on delimiter ('\t') into site_id, domain, tld, and num_subdomains
    site_id, domain, tld, num_subdomains = entry.split('\t')
    
    # return ONLY the num_subdomains
    return int(num_subdomains)


# Apply `extract_subdomain_counts` to the raw data RDD
host_counts = common_crawl_domain_counts.map(extract_subdomain_counts)

# Display the first few entries
print(host_counts.take(5))

[1, 1, 1, 1, 1]


### Calculate the total number of subdomains across all domains in the dataset

In [25]:
# Reduce the RDD to a single value, the sum of subdomains, with a lambda function
# as the reduce function
total_host_counts = host_counts.reduce(lambda x, y: x + y)
print(total_host_counts)

595466
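
`reduce` combines elements pairwise, and Spark applies the function within partitions and then across the partial results, so the function should be associative and commutative. Addition qualifies. The local sketch below uses `functools.reduce` on made-up partitions (illustrative values, not the real dataset) to show that the grouping does not change the sum.

```python
from functools import reduce


def add(x, y):
    return x + y


# Hypothetical subdomain counts split across two partitions (illustrative values only)
partition_a = [1, 1, 178]
partition_b = [1, 40]

# Reduce each partition first, then combine the partial results
partial_sums = [reduce(add, partition_a), reduce(add, partition_b)]
total_by_partition = reduce(add, partial_sums)

# Reduce all values in a single pass
total_single_pass = reduce(add, partition_a + partition_b)

print(total_by_partition, total_single_pass)
```

A function such as subtraction would not give a stable result here, because the outcome would depend on how the values are grouped.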


### Stop the SparkSession

In [28]:
spark.stop()

## Exploring Domain Counts with PySpark DataFrames and SQL

### Initialize a new SparkSession and read data

In [56]:
from pyspark.sql import SparkSession

spark = SparkSession \
        .builder \
        .getOrCreate()

In [57]:
# Read the tab-separated CSV file; spark.read returns a DataFrame
common_crawl = spark.read \
               .option('delimiter', '\t') \
               .option('inferSchema', True) \
               .option('header', False) \
               .csv('./crawl/cc-main-limited-domains.csv')


type(common_crawl)
common_crawl.show(5)

+------+-----------+----+---+
|   _c0|        _c1| _c2|_c3|
+------+-----------+----+---+
|367855|172-in-addr|arpa|  1|
|367856|       addr|arpa|  1|
|367857|     amphic|arpa|  1|
|367858|       beta|arpa|  1|
|367859|     callic|arpa|  1|
+------+-----------+----+---+
only showing top 5 rows



### Adjust data

Column names:
- site_id
- domain
- top_level_domain
- num_subdomains

In [59]:
common_crawl = common_crawl.toDF('site_id', 'domain', 'top_level_domain', 'num_subdomains')

In [61]:
common_crawl.show(5)
common_crawl.schema

+-------+-----------+----------------+--------------+
|site_id|     domain|top_level_domain|num_subdomains|
+-------+-----------+----------------+--------------+
| 367855|172-in-addr|            arpa|             1|
| 367856|       addr|            arpa|             1|
| 367857|     amphic|            arpa|             1|
| 367858|       beta|            arpa|             1|
| 367859|     callic|            arpa|             1|
+-------+-----------+----------------+--------------+
only showing top 5 rows



StructType([StructField('site_id', IntegerType(), True), StructField('domain', StringType(), True), StructField('top_level_domain', StringType(), True), StructField('num_subdomains', IntegerType(), True)])
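
With `inferSchema` enabled, Spark needs an extra pass over the file to guess the column types. Since the four columns and their types are known from the schema printed above, the schema can also be supplied up front as a DDL string. This is a sketch under that assumption; the helper `read_domain_counts` and the constant `DOMAIN_SCHEMA_DDL` are hypothetical names, and the function expects an existing SparkSession to be passed in.

```python
# Column names and types as a DDL string, matching the schema shown above
DOMAIN_SCHEMA_DDL = (
    'site_id INT, domain STRING, top_level_domain STRING, num_subdomains INT'
)


def read_domain_counts(spark, path):
    """
    Read the tab-separated domain file with an explicit schema.
    `spark` is an existing SparkSession; `path` points to the CSV file.
    """
    return (
        spark.read
        .schema(DOMAIN_SCHEMA_DDL)
        .option('delimiter', '\t')
        .option('header', False)
        .csv(path)
    )


# Column names derived from the DDL string, as a quick sanity check
column_names = [field.split()[0] for field in DOMAIN_SCHEMA_DDL.split(', ')]
print(column_names)
```

Calling `read_domain_counts(spark, './crawl/cc-main-limited-domains.csv')` would return a DataFrame that already carries the final column names, so the separate `toDF` renaming step would not be needed.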

## Reading and Writing Datasets to Disk

### Save the DataFrame as Parquet files

In [64]:
# Overwrite any existing output so that the cell can be re-run
common_crawl.write.mode('overwrite').parquet('./results/common_crawl/')

In [66]:
common_crawl_domains = spark.read.parquet('./results/common_crawl/')
common_crawl_domains.show(5)

+-------+-----------+----------------+--------------+
|site_id|     domain|top_level_domain|num_subdomains|
+-------+-----------+----------------+--------------+
| 367855|172-in-addr|            arpa|             1|
| 367856|       addr|            arpa|             1|
| 367857|     amphic|            arpa|             1|
| 367858|       beta|            arpa|             1|
| 367859|     callic|            arpa|             1|
+-------+-----------+----------------+--------------+
only showing top 5 rows



## Querying Domain Counts with PySpark DataFrames and SQL

### Create a temp view

In [68]:
common_crawl_domains.createOrReplaceTempView('common_crawl_domains_view')

### Calculate the total number of domains for each top-level domain in the dataset

In [85]:
# Aggregate the DataFrame using DataFrame methods
common_crawl_domains_total_number = common_crawl_domains.groupBy('top_level_domain').count()
common_crawl_domains_total_number.show()

+----------------+-----+
|top_level_domain|count|
+----------------+-----+
|          travel| 6313|
|             map|   34|
|             gov|15007|
|             edu|18547|
|            arpa|   11|
|            jobs| 3893|
|            post|  117|
|            coop| 5319|
+----------------+-----+



In [86]:
# Aggregate the DataFrame using SQL
common_crawl_domains_total_number_query = """SELECT top_level_domain, count(*)
                                             FROM common_crawl_domains_view
                                             GROUP BY 1
                                          """

spark.sql(common_crawl_domains_total_number_query).show(truncate=False)

+----------------+--------+
|top_level_domain|count(1)|
+----------------+--------+
|travel          |6313    |
|map             |34      |
|gov             |15007   |
|edu             |18547   |
|arpa            |11      |
|jobs            |3893    |
|post            |117     |
|coop            |5319    |
+----------------+--------+



### Calculate the total number of subdomains for each top-level domain in the dataset

In [89]:
# Aggregate the DataFrame using DataFrame methods
common_crawl_domains_total_number_subdomains = common_crawl_domains.groupBy('top_level_domain').sum('num_subdomains')
common_crawl_domains_total_number_subdomains.show()

+----------------+-------------------+
|top_level_domain|sum(num_subdomains)|
+----------------+-------------------+
|          travel|              10768|
|             map|                 40|
|             gov|              85354|
|             edu|             484438|
|            arpa|                 17|
|            jobs|               6023|
|            post|                143|
|            coop|               8683|
+----------------+-------------------+



In [90]:
# Aggregate the DataFrame using SQL
common_crawl_domains_total_number_subdomains_query = """SELECT top_level_domain, sum(num_subdomains)
                                                        FROM common_crawl_domains_view
                                                        GROUP BY 1
                                                     """

spark.sql(common_crawl_domains_total_number_subdomains_query).show(truncate=False)

+----------------+-------------------+
|top_level_domain|sum(num_subdomains)|
+----------------+-------------------+
|travel          |10768              |
|map             |40                 |
|gov             |85354              |
|edu             |484438             |
|arpa            |17                 |
|jobs            |6023               |
|post            |143                |
|coop            |8683               |
+----------------+-------------------+
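
The per-TLD sums can be cross-checked against the RDD result: added together, they should reproduce the total of 595466 subdomains computed earlier. Dividing each sum by the corresponding domain count also gives the mean number of subdomains per domain. The plain-Python sketch below uses the values copied from the output tables above; the dictionary names are chosen for illustration.

```python
# Values copied from the output tables above
domain_counts = {
    'travel': 6313, 'map': 34, 'gov': 15007, 'edu': 18547,
    'arpa': 11, 'jobs': 3893, 'post': 117, 'coop': 5319,
}
subdomain_sums = {
    'travel': 10768, 'map': 40, 'gov': 85354, 'edu': 484438,
    'arpa': 17, 'jobs': 6023, 'post': 143, 'coop': 8683,
}

# The per-TLD sums should add up to the total from the RDD reduce step
total_subdomains = sum(subdomain_sums.values())
print(total_subdomains)  # 595466

# Mean subdomains per domain for each TLD, highest first
mean_subdomains = {
    tld: subdomain_sums[tld] / domain_counts[tld] for tld in domain_counts
}
for tld, mean in sorted(mean_subdomains.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{tld}: {mean:.2f}')
```

In Spark, the same averages could be obtained directly with `common_crawl_domains.groupBy('top_level_domain').avg('num_subdomains')`.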



### How many subdomains does nps.gov have?

In [95]:
# Filter the DataFrame using DataFrame Methods
common_crawl_domains_nps_gov = common_crawl_domains.filter(common_crawl_domains.top_level_domain == 'gov').filter(common_crawl_domains.domain == 'nps')
common_crawl_domains_nps_gov.show()

+--------+------+----------------+--------------+
| site_id|domain|top_level_domain|num_subdomains|
+--------+------+----------------+--------------+
|57661852|   nps|             gov|           178|
+--------+------+----------------+--------------+



In [96]:
# Filter the DataFrame using SQL
common_crawl_domains_nps_gov_query = """
                                     SELECT * FROM common_crawl_domains_view
                                     WHERE domain = 'nps' and top_level_domain = 'gov'
                                     """

spark.sql(common_crawl_domains_nps_gov_query).show(truncate=False)

+--------+------+----------------+--------------+
|site_id |domain|top_level_domain|num_subdomains|
+--------+------+----------------+--------------+
|57661852|nps   |gov             |178           |
+--------+------+----------------+--------------+



### Stop the SparkSession

In [98]:
spark.stop()