The UNESCO 'World Heritage List' can be found online: [UNESCO Homepage](https://whc.unesco.org), [XML File](https://whc.unesco.org/en/list/xml) \
**Copyright © 1992 - 2024 UNESCO/World Heritage Centre. All rights reserved.**

In [45]:
import urllib.request
import io

import pandas as pd
import re

from bs4 import BeautifulSoup
from unicodedata import normalize

In [None]:
# Download the file directly
with urllib.request.urlopen('https://whc.unesco.org/en/list/xml') as response:
    raw_xml = response.read().decode('utf8')

sites = pd.read_xml(io.StringIO(raw_xml))

In [15]:
# Or access from pre-downloaded file
with open('raw-files/whc-en.xml', 'r', encoding="utf8") as f:
    raw_xml = f.read()
    
sites = pd.read_xml(io.StringIO(raw_xml))

# Data Cleaning
## Text Columns
- Comma-separated strings to be converted to lists
- HTML to be converted to plain text
- Criteria text to be converted to one boolean column per criterion, e.g. '(i)(iii)' -> C1 = True, C3 = True, all other criteria columns (C2, C4-C6, N7-N10) = False
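
As a rough illustration of the criteria conversion for a single string, here is a minimal standard-library sketch. The helper name `criteria_flags` is hypothetical and not part of the notebook; the mapping mirrors the `criteria_mapping` dictionary defined in the Criteria Column section, and the real conversion there works on the whole DataFrame column with pandas.

```python
import re

# Same numeral-to-column mapping the notebook uses for the criteria column.
CRITERIA_COLUMNS = {
    'i': 'C1', 'ii': 'C2', 'iii': 'C3', 'iv': 'C4', 'v': 'C5',
    'vi': 'C6', 'vii': 'N7', 'viii': 'N8', 'ix': 'N9', 'x': 'N10',
}

def criteria_flags(criteria_txt):
    """Hypothetical helper: map '(i)(iii)' to one boolean per criterion column."""
    # Pull out the text inside each pair of parentheses.
    found = set(re.findall(r'\((.*?)\)', criteria_txt))
    # Refuse anything that is not one of the ten known numerals.
    unknown = found - set(CRITERIA_COLUMNS)
    if unknown:
        raise ValueError(f"Unrecognised criteria: {sorted(unknown)}")
    return {column: numeral in found for numeral, column in CRITERIA_COLUMNS.items()}

flags = criteria_flags('(i)(iii)')
print([column for column, is_set in flags.items() if is_set])  # ['C1', 'C3']
```

Every column is present in the result, so sites with different criteria produce rows of the same shape.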

In [13]:
comma_sep_cols = ["iso_code", "states", "secondary_dates"]
html_cols = ["site", "short_description", "justification"]

### Comma-Separated Columns

In [None]:
for colname in comma_sep_cols:
    sites[colname] = sites[colname].str.split(',')

### HTML Columns
My first attempt at an HTML-stripping function used regex substitutions to remove tags. However, many character entities (e.g. "\&ndash;") slipped through.

Mapping these entities by hand was inflexible, which led me to the BeautifulSoup and unicodedata implementation below. BeautifulSoup parses the HTML and decodes the entities into Unicode characters; `unicodedata.normalize` then replaces the remaining compatibility characters with their plain equivalents (e.g. "\xa0" -> " ").
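
The three stages can be seen on a single string with the standard library alone. This is only a sketch: it swaps in `html.unescape` for BeautifulSoup so that it runs without third-party packages, and the sample text is made up for illustration.

```python
import html
import re
from unicodedata import normalize

# Hypothetical sample in the style of the description columns.
sample = "<p>Built 1200&ndash;1300&nbsp;AD</p>"

# Stage 1: stripping tags with a regex leaves the character entities untouched.
tags_stripped = re.sub(r"<[^>]+>", "", sample)

# Stage 2: decoding the entities yields Unicode characters,
# including an en dash (U+2013) and a non-breaking space (U+00A0).
unescaped = html.unescape(tags_stripped)

# Stage 3: NFKD normalisation replaces the non-breaking space with a plain space;
# the en dash has no decomposition and is kept as-is.
plaintext = normalize("NFKD", unescaped)

print(repr(tags_stripped))
print(repr(unescaped))
print(repr(plaintext))
```

A full HTML parser remains the safer choice for real data, since a tag-stripping regex can mishandle malformed or nested markup.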

In [88]:
# Collect every HTML character entity in the original text; knowing the full set may allow a faster implementation.
html_char_entities = set()
for colname in html_cols:
    for rowtext in sites[colname][sites[colname].notna()]:
        # Match named (&ndash;) and numeric (&#8211;, &#x2013;) entities
        found_entities = re.findall(r'&#?\w+;', rowtext)
        html_char_entities.update(found_entities)

In [40]:
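def html_to_plaintext(text):
    """Strip HTML markup and normalise the remaining Unicode characters."""
    parsed_text = BeautifulSoup(text, 'html.parser').get_text()
    plaintext = normalize('NFKD', parsed_text)
    return plaintext

for colname in html_cols:
    # Convert only the non-null entries and write the result back into the DataFrame
    notna_mask = sites[colname].notna()
    sites.loc[notna_mask, colname] = sites.loc[notna_mask, colname].map(html_to_plaintext)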

### Criteria Column

In [86]:
criteria_mapping = {'i': 'C1',
                    'ii': 'C2',
                    'iii': 'C3',
                    'iv': 'C4',
                    'v': 'C5',
                    'vi': 'C6',
                    'vii': 'N7',
                    'viii': 'N8',
                    'ix': 'N9',
                    'x': 'N10'}

def list_from_parentheses(text):
    """Return the contents of each parenthesised group, e.g. '(i)(iii)' -> ['i', 'iii']."""
    return re.findall(r'\((.*?)\)', text)

all_criteria = sites['criteria_txt'].map(list_from_parentheses).explode()

# Raise an error if any extracted entry is missing from the mapping (i.e. is not one of the criteria i-x).
if not all_criteria.isin(criteria_mapping.keys()).all():
    raise ValueError("Found criteria outside of the base mapping table")

# One row per site, one boolean column per criterion
grouped_criteria = all_criteria.groupby([all_criteria.index, all_criteria]).any()
criteria_df = grouped_criteria.unstack(fill_value=False)
criteria_df = criteria_df.rename(columns=criteria_mapping)

sites = sites.assign(**criteria_df)
sites = sites.drop(labels='criteria_txt', axis='columns')

# Data Exploration

In [21]:
# Columns with null entries, shown as a fraction of all rows
na_pct = sites.isna().mean()
na_pct[na_pct > 0]

danger             0.948290
iso_code           0.000834
justification      0.724771
latitude           0.000834
location           0.350292
longitude          0.000834
secondary_dates    0.922435
dtype: float64

In [26]:
# Which entries have a null longitude or latitude?
sites[sites['latitude'].isna() | sites['longitude'].isna()]

Unnamed: 0,category,criteria_txt,danger,date_inscribed,extension,http_url,id_number,image_url,iso_code,justification,...,location,longitude,region,revision,secondary_dates,short_description,site,states,transboundary,unique_number
13,Cultural,(iii)(iv)(vi),,2023,0,https://whc.unesco.org/en/list/1567,1567,https://whc.unesco.org/uploads/sites/site_1567...,"be,fr",,...,,,Europe and North America,0,,This transnational serial property encompasses...,Funerary and memory sites of the First World W...,"Belgium,France",1,2559


In [11]:
sites['transboundary'].value_counts()

transboundary
0    1151
1      48
Name: count, dtype: int64