# HTML corpus analysis

**Aims:**

For the HTML documents in the CCLW corpus, establish:

- the range of sources (domain names and countries);
- how many pages contain the full policy text;
- how many come from sources that also provide PDFs or structured formats (e.g. XML).

The full write-up is [in Notion](https://www.notion.so/climatepolicyradar/HTML-corpus-analysis-b14dde899c7a4388977fff98cdf784ad).


In [1]:
from urllib.parse import urlparse
import pandas as pd
import seaborn as sns
from IPython.display import HTML

pd.set_option('display.max_rows', 500)
# pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# pd.set_option('display.width', 1000)

In [19]:
PROCESSED_POLICIES_CSV_PATH = "../../policy-search/data/corpus/processed_policies.csv"

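# The first CSV column holds the row index.
df = pd.read_csv(PROCESSED_POLICIES_CSV_PATH, index_col=0)
# Extract the host part of each document URL (e.g. "www.gov.uk").
df['url_domain'] = df['url'].apply(lambda url: urlparse(url).netloc)
# Restrict to English-language HTML documents.
df_html_en = df[(df['doc_mime_type'] == 'text/html') & (df['language'] == 'en')]
df_html_en.head(1)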

Unnamed: 0,source_policy_id,policy_name,policy_title,policy_type,country_code,keyword_list,document_type_list,id,url,source_name,...,policy_txt_file,language,error,responses_list,hazard_list,policy_description,policy_date,sector_list,instrument_list,url_domain
0,10088.0,Circular economy promotion law,,legislative,CHN,circular economy,Law,0,https://www.lawinfochina.com/display.aspx?id=7025&lib=law,cclw,...,cclw-10088-c6879d5bea4f431cb10f27fbaff6104e.txt,en,,Mitigation,,"This law promotes the development of a circular economy, aims to improve resource usage and protect the environment. It notably seeks to enhance energy efficiency, increase the role of renewable energy sources",29/08/2008,Industry;Agriculture (general);Economy-wide,Standards and obligations;Multi-level governance,www.lawinfochina.com


## Examining HTML sources

By reducing the URL of each document with MIME type `text/html` to its domain, we can look at how document sources are distributed across the corpus.
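As a minimal sketch of the URL-to-domain step, the helper below uses only the standard library. The name `url_to_domain` is hypothetical and not part of the notebook. Unlike the plain `netloc` used in the cells, it lower-cases the host, drops any port, and returns an empty string for non-string values such as `NaN`, on the assumption that some rows may lack a URL.

```python
from urllib.parse import urlparse


def url_to_domain(url):
    """Return the lower-cased host of `url`, or "" if none can be extracted.

    Hypothetical helper: `hostname` lower-cases the host and strips the port,
    whereas `netloc` (used in the notebook) keeps both as written.
    """
    if not isinstance(url, str):
        # Missing values read by pandas arrive as float NaN, not str.
        return ""
    return urlparse(url).hostname or ""


# Both the http and https variants of a METI URL map to the same domain.
print(url_to_domain("http://www.meti.go.jp/english/press/2015/0716_01.html"))
print(url_to_domain("https://www.meti.go.jp/english/press/2018/0703_002.html"))
```

Note that the `www.` prefix is kept, so `www.gov.uk` and `assets.publishing.service.gov.uk` still count as separate sources, as they do in the tables in this notebook.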

In [4]:
df_html_cclw = df_html_en[df_html_en['source_name'] == 'cclw']
df_html_en.groupby('source_name').count()["source_policy_id"]

source_name
cclw     133
cpd     2774
Name: source_policy_id, dtype: int64

In [5]:
source_df = pd.concat(
    [
        df_html_cclw['url_domain'].value_counts(normalize=False),
        df_html_cclw['url_domain'].value_counts(normalize=True),
        df_html_cclw['url_domain'].value_counts(normalize=True).cumsum(),
    ],
    axis=1,
    keys=["count", "percentage", "cum. percentage"],
).reset_index().rename(columns={"index": "Title"})

HTML(source_df.to_html(index=False))

Title,count,percentage,cum. percentage
eur-lex.europa.eu,27,0.203008,0.203008
www.legislation.gov.uk,13,0.097744,0.300752
laws-lois.justice.gc.ca,12,0.090226,0.390977
www.gov.uk,10,0.075188,0.466165
www.meti.go.jp,6,0.045113,0.511278
www.canada.ca,6,0.045113,0.556391
climate-laws.org,4,0.030075,0.586466
www.legislation.gov.au,4,0.030075,0.616541
www.retsinformation.dk,4,0.030075,0.646617
cis-legislation.com,3,0.022556,0.669173
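The same count / share / cumulative-share table is built for more than one subset of the corpus, so it could be wrapped in a small helper. The sketch below is one possible way to do that. The function name `domain_summary` and the column names `domain`, `share` and `cum. share` are assumptions, not the notebook's own. The values are fractions between 0 and 1, as in the table above.

```python
import pandas as pd


def domain_summary(domains: pd.Series) -> pd.DataFrame:
    """Summarise how documents are distributed across source domains.

    Hypothetical helper: returns one row per domain, most frequent first,
    with its count, its share of all documents, and the running total.
    """
    summary = domains.value_counts().rename("count").to_frame()
    summary["share"] = summary["count"] / summary["count"].sum()
    summary["cum. share"] = summary["share"].cumsum()
    # Name the index explicitly so the result does not depend on the
    # default index name, which differs between pandas versions.
    return summary.rename_axis("domain").reset_index()


example = pd.Series(["a.org", "a.org", "a.org", "b.org"])
print(domain_summary(example))
```

Under these assumptions, calling `domain_summary(df_html_cclw['url_domain'])` would give the table above, with `domain` in place of `Title`.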


In [16]:
# For inspecting the URLs for one source
df_html_cclw.loc[df_html_cclw['url_domain'] == "www.meti.go.jp", ['policy_name', 'country_code', 'url']]

Unnamed: 0,policy_name,country_code,url
654,Green Growth Strategy Through Achieving Carbon Neutrality in 2050,JPN,https://www.meti.go.jp/english/press/2020/1225_001.html
1306,Basic Hydrogen Strategy,JPN,https://www.meti.go.jp/english/press/2019/0312_002.html
1308,The 5th Strategic Energy Plan,JPN,http://www.meti.go.jp/english/press/2015/0716_01.html
1309,The 5th Strategic Energy Plan,JPN,https://www.meti.go.jp/english/press/2018/0703_002.html
1641,Act on Purchase of Renewable Energy Sourced Electricity by Electric Utilities (Law No. 108 of 2011),JPN,https://www.meti.go.jp/english/press/2020/0225_001.html
1929,Act No. 89 of 2018 on Promoting Utilization of Sea Areas in Development of Power Generation Facilities Using Maritime Renewable Energy Resources,JPN,https://www.meti.go.jp/english/press/2019/0315_003.html


## PDF sources

Given the above analysis, we also looked at the sources of the PDF documents, to see whether any of them have URLs that can be converted into URLs for structured data.

In [20]:
df_pdf_cclw_en = df[(df['doc_mime_type'] == 'application/pdf') & (df['language'] == 'en') & (df['source_name'] == 'cclw')]
df_pdf_cclw_en.groupby('source_name').count()["source_policy_id"]

source_name
cclw    939
Name: source_policy_id, dtype: int64

In [22]:
pd.concat(
    [
        df_pdf_cclw_en['url_domain'].value_counts(normalize=False), 
        df_pdf_cclw_en['url_domain'].value_counts(normalize=True), 
        df_pdf_cclw_en['url_domain'].value_counts(normalize=True).cumsum()
    ], 
    axis=1, 
    keys=["count", "percentage", "cum. percentage"],
).reset_index().rename(columns={"index": "Title"}).head(20)



Unnamed: 0,Title,count,percentage,cum. percentage
0,climate-laws.org,787,0.838126,0.838126
1,www.lse.ac.uk,32,0.034079,0.872204
2,ec.europa.eu,26,0.027689,0.899894
3,extwprlegs1.fao.org,8,0.00852,0.908413
4,assets.publishing.service.gov.uk,5,0.005325,0.913738
5,www.ifrc.org,3,0.003195,0.916933
6,policy.asiapacificenergy.org,3,0.003195,0.920128
7,www.env.go.jp,3,0.003195,0.923323
8,www.preventionweb.net,3,0.003195,0.926518
9,www.meti.go.jp,2,0.00213,0.928647


In [32]:
# For inspecting the URLs for one source
df_pdf_cclw_en.loc[df_pdf_cclw_en['url_domain'] == "eur-lex.europa.eu", ['policy_name', 'country_code', 'url']]

Unnamed: 0,policy_name,country_code,url
1329,An EU Strategy to harness the potential of offshore renewable energy for a climate neutral future,EUR,https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:52020DC0741&from=EN
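To illustrate what converting a PDF URL into a URL for structured data could look like, the sketch below rewrites the EUR-Lex PDF link above into its HTML counterpart. It assumes that EUR-Lex serves the same document under `/TXT/HTML/` with the same `uri` parameter. This matches the site's public URL pattern but has not been verified for every document in the corpus. The function name `eurlex_pdf_to_html` is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def eurlex_pdf_to_html(url):
    """Return the assumed HTML URL for a EUR-Lex PDF URL, or None if it does not match.

    Assumption: replacing /TXT/PDF/ with /TXT/HTML/ and keeping the `uri`
    query parameter yields the HTML rendering of the same document.
    """
    parts = urlparse(url)
    if parts.hostname != "eur-lex.europa.eu" or "/TXT/PDF/" not in parts.path:
        return None
    uri_values = parse_qs(parts.query).get("uri")
    if not uri_values:
        return None
    new_path = parts.path.replace("/TXT/PDF/", "/TXT/HTML/")
    # Keep ":" unescaped so the CELEX identifier stays readable.
    new_query = urlencode({"uri": uri_values[0]}, safe=":")
    return urlunparse((parts.scheme, parts.netloc, new_path, "", new_query, ""))


pdf_url = "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:52020DC0741&from=EN"
print(eurlex_pdf_to_html(pdf_url))
```

Any URL produced this way should be fetched and checked before it replaces the PDF as the source of a document's text.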
