# Creative Commons DSD Data Exploration/Engineering Scratchwork

Before doing anything, here are the packages needed:

In [1]:
import requests
import numpy as np
import pandas as pd
import os
import random
import time

## Exponential Backoff

This section will implement exponential backoff.
<br> Querying functions will include a call to `expo_backoff` for waiting.

In [2]:
CALLBACK_INDEX = 2
CALLBACK_EXPO = 0
MAX_WAIT = 64

In [3]:
def expo_backoff():
    """Performs exponential backoff upon call.
    
    The function will force a wait of CALLBACK_INDEX ** CALLBACK_EXPO + r seconds,
    where r is a random decimal number between 0.001 and 1.000, inclusive.
    If that value is higher than MAX_WAIT, then it will wait MAX_WAIT seconds
    instead, and CALLBACK_EXPO stops growing.
    """
    global CALLBACK_EXPO
    backoff = random.randint(1, 1000) / 1000 + CALLBACK_INDEX ** CALLBACK_EXPO
    time.sleep(min(backoff, MAX_WAIT))
    if backoff < MAX_WAIT:
        CALLBACK_EXPO += 1

def expo_backoff_reset():
    """Resets the CALLBACK_EXPO to 0.
    """
    global CALLBACK_EXPO
    CALLBACK_EXPO = 0
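
To make the wait schedule concrete, here is a minimal sketch that reproduces the arithmetic of `expo_backoff` without actually sleeping. The helper name `backoff_schedule` and its parameters are hypothetical, and the jitter is passed in as a fixed value (instead of being drawn at random) so that the result is deterministic.

```python
def backoff_schedule(n_calls, base=2, max_wait=64, jitter=0.5):
    """Return the waits (in seconds) that n_calls consecutive backoffs would use.

    Mirrors expo_backoff: wait min(base ** expo + jitter, max_wait) and only
    grow the exponent while the uncapped wait is still below max_wait.
    """
    expo = 0
    waits = []
    for _ in range(n_calls):
        backoff = base ** expo + jitter
        waits.append(min(backoff, max_wait))
        if backoff < max_wait:
            expo += 1
    return waits


print(backoff_schedule(8))
# [1.5, 2.5, 4.5, 8.5, 16.5, 32.5, 64, 64]
```

Since the exponent only goes back to 0 through `expo_backoff_reset`, a caller would presumably reset it after a successful request.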

## Google Custom Search

Credentials are listed below, as well as the file we write data to:

In [19]:
API_KEY = "CENSORED"
PSE_KEY = "CENSORED"
DATA_WRITE_FILE = os.getcwd() + "/GoogleCustomSearch/data_20220925.txt"

### Sampling Method

The objective is to estimate the total number of CC-attributed pages in the following scopes:
1. With no assumption about country of origin or document language.
2. By type of license.
3. By country of origin.
4. By document language.

We define a CC-attributed page as a page that contains a link to the license type's description on `creativecommons.org`, because that is the intended method of attributing a work to a CC license.
<br> We approximate the total number of CC-attributed pages by the number of results returned for a query. However, since many pages are not listed on the Google Search Engine, we will collect some complementary data later to get a better count of total CC-attributed works.

The final result of the data gathering in this notebook is then a table whose column labels are country/language assumptions, whose row labels are license types, and whose cells contain the count of Google-searchable pages for each label combination.

### Acquiring Parameters

To get the types of licenses, we take the following approach:
1. Load the 2018 Google query data to get the license addresses that were queried for.
2. Reduce each address to its license type and version, which means we do not look at language-specific versions of licenses.
3. Store the resulting list of licenses in a NumPy array, as shown below.

Meanwhile, version 2.1 licenses are filtered out because they are no longer implemented.

In [3]:
data_2018 = pd.read_csv(os.getcwd() + "/GoogleCustomSearch/data_2018.txt") \
    .set_index("License Address") \
    .iloc[:, :1] \
    .sort_values(by = "License Address")
license_pattern = r"/creativecommons.org/((?:[^/]+/){3}).*"
license_list = pd.Series(data_2018.index)
license_list = license_list[~license_list.str.contains("2.1", regex = False)]\
                .str.extract(license_pattern, expand = False)\
                .dropna()\
                .unique()
license_list.sort()
print(f"There are {len(license_list)} licenses listed here.")
license_list

There are 31 licenses listed here.


array(['licenses/GPL/2.0/', 'licenses/by-nc-nd/2.0/',
       'licenses/by-nc-nd/2.5/', 'licenses/by-nc-nd/3.0/',
       'licenses/by-nc-nd/4.0/', 'licenses/by-nc-sa/1.0/',
       'licenses/by-nc-sa/2.0/', 'licenses/by-nc-sa/2.5/',
       'licenses/by-nc-sa/3.0/', 'licenses/by-nc-sa/4.0/',
       'licenses/by-nc/2.0/', 'licenses/by-nc/2.5/',
       'licenses/by-nc/3.0/', 'licenses/by-nc/4.0/',
       'licenses/by-nd-nc/1.0/', 'licenses/by-nd/2.0/',
       'licenses/by-nd/2.5/', 'licenses/by-nd/3.0/',
       'licenses/by-nd/4.0/', 'licenses/by-sa/2.0/',
       'licenses/by-sa/2.5/', 'licenses/by-sa/3.0/',
       'licenses/by-sa/4.0/', 'licenses/by/1.0/', 'licenses/by/2.0/',
       'licenses/by/2.5/', 'licenses/by/3.0/', 'licenses/by/4.0/',
       'licenses/sampling+/1.0/', 'publicdomain/mark/1.0/',
       'publicdomain/zero/1.0/'], dtype=object)
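
To illustrate what the filter and `license_pattern` keep and drop, here is a small sketch using only the standard `re` module. The example addresses are hypothetical (they are not taken from `data_2018.txt`), the helper name `extract_license_paths` is made up for this sketch, and the version 2.1 filter is written as a plain substring test.

```python
import re

LICENSE_PATTERN = r"/creativecommons.org/((?:[^/]+/){3}).*"

# Hypothetical example addresses, not taken from data_2018.txt.
addresses = [
    "https://creativecommons.org/licenses/by/4.0/",
    "https://creativecommons.org/licenses/by/4.0/deed.fr",
    "https://creativecommons.org/licenses/by-sa/2.1/jp/",
    "https://creativecommons.org/publicdomain/zero/1.0/legalcode",
    "https://creativecommons.org/licenses/by/4.0",  # no trailing slash
]


def extract_license_paths(urls):
    """Return the sorted, de-duplicated '<group>/<type>/<version>/' segments."""
    found = set()
    for url in urls:
        if "2.1" in url:  # skip version 2.1 licenses
            continue
        match = re.search(LICENSE_PATTERN, url)
        if match:  # addresses without three slash-terminated segments are dropped
            found.add(match.group(1))
    return sorted(found)


print(extract_license_paths(addresses))
# ['licenses/by/4.0/', 'publicdomain/zero/1.0/']
```

Language-specific addresses such as `deed.fr` collapse onto the same entry, which is the behavior described in step 2.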

To get the languages that documents can be in, I referenced all possible values of the `lr` parameter in the reference of this API:
<br> https://developers.google.com/custom-search/v1/reference/rest/v1/cse/list
<br> I copied the table into a .txt file and read its contents as a delimited dataset (the code below uses `:` as the separator).

To reduce the number of languages to deal with, I selected some of the most spoken languages in which to search for CC-related metadata.
<br> Consequently, 8 out of the 35 languages were selected. These languages are listed as follows.

In [4]:
langs = pd.read_csv(os.getcwd() + "/GoogleCustomSearch/google_lang.txt", sep = ":")
langs = langs.set_index("Language")
print(f"There are {len(langs)} languages.")
selected_langs = langs.iloc[[7, 33, 34, 8, 11, 0, 25, 14], :].sort_index()
selected_langs.head(8)

There are 35 languages.


Language,Code
Arabic,lang_ar
Chinese (Simplified),lang_zh-CN
Chinese (Traditional),lang_zh-TW
English,lang_en
French,lang_fr
Indonesian,lang_id
Portuguese,lang_pt
Spanish,lang_es


A similar approach was taken to get all the countries from which search results can originate:
<br> https://developers.google.com/custom-search/docs/json_api_reference#countryCollections

We also have a lot of countries to deal with: 242. This time, we selected 10 out of 242, aiming to have at least one representative from each continent while focusing on those noted 

In [5]:
cntrs = pd.read_csv(os.getcwd() + "/GoogleCustomSearch/google_cntrs.txt", sep = "\t")
cntrs = cntrs.set_index("Country")
print(f"There are {len(cntrs)} countries.")
selected_cntrs = cntrs.loc[
    [
        'India', 'Japan', 'United States', 'Canada', 'Brazil', 
        'Germany', 'United Kingdom', 'Spain', 'Australia', 'Egypt'
    ], :
].sort_index()
selected_cntrs.head(10)

There are 242 countries.


Country,Country Collection Name
Australia,countryAU
Brazil,countryBR
Canada,countryCA
Egypt,countryEG
Germany,countryDE
India,countryIN
Japan,countryJP
Spain,countryES
United Kingdom,countryUK
United States,countryUS


### API Requests

First, we will come up with a method to get an API endpoint URL based on license type, country of origin, and language of document:

In [6]:
def get_request_url(license = None, cntr = None, lang = None):
    """ Provides the API Endpoint URL for specified parameter combinations.
    
    Args:
        license:
            A string representing the type of license, and should be a segment of its
            URL towards the license description. Alternatively, the default None value
            stands for having no assumption about license type.
        cntr:
            A string representing the country code of country that the search results
            would be originating from. Alternatively, the default None value or "all"
            stands for having no assumption about country of origin.
        lang:
            A string representing the language that the search results are presented 
            in. Alternatively, the default None value or "all" stands for having no 
            assumption about language of document.
    
    Returns:
        A string representing the API Endpoint URL for the query specified by this
        function's parameters.
    """
    base_url = r"https://customsearch.googleapis.com/customsearch/v1"
    base_url += f"?key={API_KEY}&cx={PSE_KEY}"
    base_url += r"&q=link%3Acreativecommons.org"
    if license is not None:
        # License paths such as "licenses/by/4.0/" have no leading slash, so add
        # one to separate them from the domain before escaping.
        base_url += ("/" + license).replace("/", "%2F")
    else:
        base_url += "/licenses".replace("/", "%2F")
    # Both None and "all" mean "no restriction", as documented above.
    if cntr is not None and cntr != "all":
        base_url += "&cr=" + cntr
    if lang is not None and lang != "all":
        base_url += "&lr=" + lang
    return base_url

In [7]:
#Interestingly, these addresses don't take away daily quota!
search_jp = get_request_url(cntr = "countryJP")
search_zhTW = get_request_url(license = "licenses/by-nc-nd/2.5/", lang = "lang_zh-TW")
search_us = get_request_url(cntr = "countryUS")
print(search_jp)
print(search_zhTW)
print(search_us)

https://customsearch.googleapis.com/customsearch/v1?key=CENSORED&cx=CENSORED&q=link%3Acreativecommons.org%2Flicenses&cr=countryJP
https://customsearch.googleapis.com/customsearch/v1?key=CENSORED&cx=CENSORED&q=link%3Acreativecommons.org%2Flicenses%2Fby-nc-nd%2F2.5%2F&lr=lang_zh-TW
https://customsearch.googleapis.com/customsearch/v1?key=CENSORED&cx=CENSORED&q=link%3Acreativecommons.org%2Flicenses&cr=countryUS
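
As an alternative to escaping the query by hand, the same kind of URL can be assembled with `urllib.parse.urlencode` from the standard library. This is only a sketch: `build_request_url` and its `license_path` argument are hypothetical names, and the credentials are passed in explicitly rather than read from `API_KEY` and `PSE_KEY`.

```python
from urllib.parse import urlencode

BASE_URL = "https://customsearch.googleapis.com/customsearch/v1"


def build_request_url(api_key, pse_key, license_path=None, cntr=None, lang=None):
    """Hypothetical variant of get_request_url that lets urlencode do the escaping."""
    params = {
        "key": api_key,
        "cx": pse_key,
        # Fall back to the generic "licenses" path when no license is given.
        "q": "link:creativecommons.org/" + (license_path or "licenses"),
    }
    if cntr is not None:
        params["cr"] = cntr
    if lang is not None:
        params["lr"] = lang
    return BASE_URL + "?" + urlencode(params)


print(build_request_url("CENSORED", "CENSORED",
                        license_path="licenses/by-nc-nd/2.5/", lang="lang_zh-TW"))
# https://customsearch.googleapis.com/customsearch/v1?key=CENSORED&cx=CENSORED&q=link%3Acreativecommons.org%2Flicenses%2Fby-nc-nd%2F2.5%2F&lr=lang_zh-TW
```

Building the query from a dict keeps the separator between the domain and the license path in one place, so it cannot be dropped for one branch and kept for the other.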


Following the URL getter method is a metadata getter method for specified Google Custom Search queries:

In [8]:
def get_response_elems(license = None, cntr = None, lang = None):
    """ Provides the metadata for query of specified parameters
    
    Args:
        license:
            A string representing the type of license, and should be a segment of its
            URL towards the license description. Alternatively, the default None value
            stands for having no assumption about license type.
        cntr:
            A string representing the country code of country that the search results
            would be originating from. Alternatively, the default None value or "all"
            stands for having no assumption about country of origin.
        lang:
            A string representing the language that the search results are presented 
            in. Alternatively, the default None value or "all" stands for having no 
            assumption about language of document.
    
    Returns:
        A dict mapping metadata to its value provided from the API query of specified
        parameters.
    """
    url = get_request_url(license = license, cntr = cntr, lang = lang)
    #expo_backoff()
    search_data = requests.get(url).json()
    search_data_dict = {
        "totalResults": search_data["queries"]["request"][0]["totalResults"]
    }
    return search_data_dict

In [68]:
get_response_elems()

{'totalResults': '480000000'}
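
The `#expo_backoff()` call above is commented out, and the lookup assumes that every response contains `queries.request[0].totalResults`. Below is a hedged sketch of how a retry loop with exponential waits could wrap the request. The function name `total_results_with_retry`, the injected `fetch` and `sleep` callables, and the shape of the failed response are all assumptions made for illustration; jitter is left out to keep the example deterministic.

```python
import time


def total_results_with_retry(fetch, max_attempts=5, base=2, max_wait=64,
                             sleep=time.sleep):
    """Call fetch() until its payload contains totalResults, then return it as int.

    fetch is any zero-argument callable returning the decoded JSON dict,
    for example lambda: requests.get(url).json().
    """
    for attempt in range(max_attempts):
        if attempt > 0:
            # Wait 1, 2, 4, ... seconds before each retry, capped at max_wait.
            sleep(min(base ** (attempt - 1), max_wait))
        payload = fetch()
        try:
            return int(payload["queries"]["request"][0]["totalResults"])
        except (KeyError, IndexError, TypeError, ValueError):
            # Assumed failure mode: the payload lacks the expected keys.
            continue
    raise RuntimeError(f"no totalResults after {max_attempts} attempts")


# Simulated responses so the sketch runs without network access or quota.
responses = iter([
    {"error": {"code": 429}},  # assumed shape of a failed response
    {"queries": {"request": [{"totalResults": "480000000"}]}},
])
waits = []
print(total_results_with_retry(lambda: next(responses), sleep=waits.append))
# 480000000
print(waits)
# [1]
```

Converting the count to `int` here is a design choice of the sketch; the API returns `totalResults` as a string, as the output above shows.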

### Recording Data from API Request

Here is a method to set up the header row of the CSV-formatted .txt file:

In [15]:
def set_up_data_file():
    """ Writes the header row to file to contain Google Query data.
    """
    header_title = "LICENSE TYPE,No Priori,"
    for title in selected_cntrs.index:
        header_title += title + ","
    for title in selected_langs.index:
        header_title += title + ","
    with open(DATA_WRITE_FILE, 'a') as f:
        f.write(header_title + "\n")

In [None]:
set_up_data_file()
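
The string concatenation in `set_up_data_file` leaves a trailing comma on the line, which pandas later reads back as an extra unnamed column. A possible alternative, sketched here with the standard `csv` module, writes the cells as a list instead. The helper `write_row` is hypothetical, the label lists are a shortened illustrative subset, and an in-memory buffer stands in for the real data file.

```python
import csv
import io


def write_row(handle, cells):
    """Write one CSV row; csv.writer adds no trailing separator."""
    csv.writer(handle, lineterminator="\n").writerow(cells)


countries = ["Australia", "Brazil"]             # shortened, illustrative subset
languages = ["Arabic", "Chinese (Simplified)"]  # shortened, illustrative subset

buffer = io.StringIO()  # stands in for open(DATA_WRITE_FILE, "a", newline="")
write_row(buffer, ["LICENSE TYPE", "No Priori"] + countries + languages)
print(buffer.getvalue())
# LICENSE TYPE,No Priori,Australia,Brazil,Arabic,Chinese (Simplified)
```

The same helper could write the data rows, and `csv.writer` would also quote any cell that happens to contain a comma.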

And here is the method for recording a row of data based on the `license_type` of the query:

In [9]:
def record_license_data(license_type = None):
    """ Writes the row for LICENSE_TYPE to file to contain Google Query data.
    
    Args:
        license_type:
            A string representing the type of license, and should be a segment of its
            URL towards the license description. Alternatively, the default None value
            stands for having no assumption about license type.
    """
    if license_type is None:
        data_log = "all" + ","
    else:
        data_log = license_type + ","
    no_priori_search = get_response_elems(license = license_type)
    data_log += no_priori_search['totalResults'] + ","
    for cntr_name in selected_cntrs.iloc[:, 0]:
        response = get_response_elems(license = license_type, cntr = cntr_name)
        data_log += response['totalResults'] + ","
    for lang_name in selected_langs.iloc[:, 0]:
        response = get_response_elems(license = license_type, lang = lang_name)
        data_log += response['totalResults'] + ","
    with open(DATA_WRITE_FILE, 'a') as f:
        f.write(data_log + "\n")

In [75]:
record_license_data()



In [20]:
for license_type in license_list[28:]:
    record_license_data(license_type)
    print(f"Logging for {license_type} is completed.")

Logging for licenses/sampling+/1.0/ is completed.
Logging for publicdomain/mark/1.0/ is completed.
Logging for publicdomain/zero/1.0/ is completed.


In [21]:
pd.read_csv(DATA_WRITE_FILE)

Unnamed: 0,LICENSE TYPE,No Priori,Australia,Brazil,Canada,Egypt,Germany,India,Japan,Spain,...,United States,Arabic,Chinese (Simplified),Chinese (Traditional),English,French,Indonesian,Portuguese,Spanish,Unnamed: 20
0,all,480000000,411000,1190000,598000,27600,8590000,228000,215000,253000,...,356000000,243000,193000,140000,393000000,952000,346000,5900000,18300000,
1,licenses/GPL/2.0/,66500,88,307,72,2,4010,23,106,969,...,37200,758,429,139,51500,2370,127,1410,3510,
2,licenses/by-nc-nd/2.0/,19000000,11500,4150,5530,262,57300,3530,9180,14000,...,10100000,1370,3890,384,22400000,105000,3410,22100,68500,
3,licenses/by-nc-nd/2.5/,26800000,17800,4680,16500,316,53800,4700,10300,16100,...,14900000,527,3110,497,26400000,55600,1430,16400,166000,
4,licenses/by-nc-nd/3.0/,29400000,23300,15500,7270,901,71600,4510,25400,22900,...,14200000,1650,5920,1140,23500000,180000,3540,141000,148000,
5,licenses/by-nc-nd/4.0/,75200000,28200,33400,22000,683,185000,15700,54600,39700,...,50300000,2530,15600,6100,77900000,286000,20300,95800,252000,
6,licenses/by-nc-sa/1.0/,4240000,7100,3760,2290,84,29400,1080,3120,6070,...,407000,284,3750,91,3880000,53800,1880,15200,13500,
7,licenses/by-nc-sa/2.0/,2570000,6090,2680,2840,105,31200,1550,3310,8510,...,367000,394,2200,248,1350000,89300,1930,23900,37600,
8,licenses/by-nc-sa/2.5/,8120000,8730,3390,7210,140,26800,1910,2970,8610,...,424000,211,1400,136,6810000,50700,668,16500,105000,
9,licenses/by-nc-sa/3.0/,6060000,10200,6040,3730,477,48000,2340,3820,9900,...,758000,1470,4050,539,3550000,155000,1610,135000,128000,


In [80]:
!tar chvfz notebook.tar.gz ./*

./Creative Commons DSD Data Engineering Scratchwork.ipynb
./GoogleCustomSearch/
./GoogleCustomSearch/.ipynb_checkpoints/
./GoogleCustomSearch/.ipynb_checkpoints/data_2018-checkpoint.txt
./GoogleCustomSearch/.ipynb_checkpoints/google_lang-checkpoint.txt
./GoogleCustomSearch/.ipynb_checkpoints/google_cntrs-checkpoint.txt
./GoogleCustomSearch/.ipynb_checkpoints/data_20220925-checkpoint.txt
./GoogleCustomSearch/data_2018.txt
./GoogleCustomSearch/google_lang.txt
./GoogleCustomSearch/google_cntrs.txt
./GoogleCustomSearch/data_20220925.txt
