# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [4]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':')
    # Element.getiterator() was removed in Python 3.9; iter() is its replacement
    cities = [city.find('name').text for city in element.iter('city')]
    print(', '.join(cities))

* Albania:
Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:
Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:
Skopje, Kumanovo
* Serbia:
Beograd, Novi Sad, Niš
* Montenegro:
Podgorica
* Kosovo:
Prishtine
* Andorra:
Andorra la Vella
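
The two cells above use different traversal calls, and the difference matters for nested data. As a minimal sketch on a made-up document (the `<province>` nesting and all names here are assumptions for illustration, not taken from the Mondial file), `iterfind('city')` only matches direct children, while `iter('city')` walks the whole subtree:

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document; the <province> nesting is an assumption
sample = ET.fromstring(
    "<mondial>"
    "<country><name>Xland</name>"
    "<city><name>Alpha</name></city>"
    "<province><name>North</name><city><name>Beta</name></city></province>"
    "</country>"
    "</mondial>"
)
country = sample.find('country')

# iterfind('city') matches only <city> elements that are direct children
direct = [c.find('name').text for c in country.iterfind('city')]

# iter('city') walks the whole subtree, so nested cities are included
nested = [c.find('name').text for c in country.iter('city')]

print(direct)  # ['Alpha']
print(nested)  # ['Alpha', 'Beta']
```

If cities in the real data sit below an intermediate element, a loop built on `iterfind('city')` would skip them, so it is worth checking which call a given cell uses.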


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [5]:
document = ET.parse( './data/mondial_database.xml' )

In [6]:
import pandas as pd

### 10 countries with the lowest infant mortality rates

Looking at the structure, the infant_mortality tag is a child of each country element. To answer this question, I will step through the tree and collect each country's name and infant mortality into a list of tuples, which can then be read into a DataFrame for easy sorting.

In [7]:
# create list of tuples of country, infant mortality pairs.
data = []
for element in document.iterfind('country'):
    if element.find('name') is None:        
        country = ''
    else:
        country = element.find('name').text
    if element.find('infant_mortality') is None:
        infmort = float('nan')  # pd.np was removed in pandas 2.0
    else:
        infmort = element.find('infant_mortality').text
    data.append((country,infmort))
df = pd.DataFrame(data,columns=['country','infant_mortality'])

In [8]:
df.sort_values('infant_mortality',ascending=True).head(10)

Unnamed: 0,country,infant_mortality
38,Monaco,1.81
30,Romania,10.16
153,Fiji,10.2
69,Brunei,10.48
132,Grenada,10.5
237,Mauritius,10.59
124,Panama,10.7
243,Seychelles,10.77
102,United Arab Emirates,10.92
113,Barbados,10.93
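
One caveat about the listing above: infant_mortality is read from the XML as text, so `sort_values` compares the values as strings, which is why 1.81 is followed directly by 10.16. Converting the column with `pd.to_numeric` before sorting, as is done for the city data in the next section, gives a numeric ranking instead. A minimal sketch with made-up values (not Mondial data) shows the difference:

```python
import pandas as pd

# Made-up values for illustration; not taken from the Mondial data
toy = pd.DataFrame({'country': ['A', 'B', 'C'],
                    'infant_mortality': ['10.16', '2.5', '1.81']})

# Sorting the raw strings compares them character by character
as_text = toy.sort_values('infant_mortality')['country'].tolist()

# Converting first gives a true numeric ordering
toy['infant_mortality'] = pd.to_numeric(toy['infant_mortality'], errors='coerce')
as_number = toy.sort_values('infant_mortality')['country'].tolist()

print(as_text)    # ['C', 'A', 'B']
print(as_number)  # ['C', 'B', 'A']
```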


### 10 cities with the largest population

Looking at the structure, the city tags are children of the country elements, and population is a child of each city. The data gives population for several years. I will step through the tree and collect each city with all of its populations and years into a list of tuples, which can then be read into a DataFrame for easy sorting.

The question only asks for the 10 cities with the largest population, which leaves room for interpretation. I could treat city and year together as a unique key (so Paris in 1990 is a different entry from Paris in 2000), take each city once using its latest year, or take each city once using its largest population irrespective of year. I will provide answers for all three interpretations.

In [9]:
# create list of tuples of country, city, population, and year.
data = []
for element in document.iterfind('country'):
    if element.find('name') is None:        
        country = ''
    else:
        country = element.find('name').text
    for subelem in element.iterfind('city'):
        if subelem.find('name') is None:
            city = ''
        else:
            city = subelem.find('name').text
        for child in subelem.iterfind('population'):
            year = child.get('year')
            pop = child.text
            data.append((country,city,year,pop))
df = pd.DataFrame(data,columns=['country','city','year','population'])

In [10]:
df.dtypes

country       object
city          object
year          object
population    object
dtype: object

In [11]:
df.year = pd.to_numeric(df.year,errors='coerce')
df.population = pd.to_numeric(df.population,errors='coerce')

In [12]:
# top 10 largest cities in time
df.sort_values(['population'],ascending=False).head(10)

Unnamed: 0,country,city,year,population
430,South Korea,Seoul,1995,10229262
431,South Korea,Seoul,2000,9895217
432,South Korea,Seoul,2005,9820171
433,South Korea,Seoul,2010,9708483
412,Egypt,Al Qahirah,2006,8471859
204,Thailand,Bangkok,1999,7506700
322,Hong Kong,Hong Kong,2009,7055071
411,Egypt,Al Qahirah,1996,6801931
410,Egypt,Al Qahirah,1986,6053000
229,Vietnam,Ho Chi Minh,2009,5968384


In [13]:
# top 10 largest cities for latest year
df.sort_values(['city','year'],ascending=False).\
    drop_duplicates(subset='city').\
    sort_values('population',ascending=False).head(10)

Unnamed: 0,country,city,year,population
433,South Korea,Seoul,2010,9708483
412,Egypt,Al Qahirah,2006,8471859
204,Thailand,Bangkok,1999,7506700
322,Hong Kong,Hong Kong,2009,7055071
229,Vietnam,Ho Chi Minh,2009,5968384
554,Singapore,Singapore,2010,5076700
409,Egypt,Al Iskandariyah,2006,4123869
566,Taiwan,New Taipei,2012,3939305
437,South Korea,Busan,2010,3403135
270,North Korea,Pyongyang,2008,3255288


In [14]:
# top 10 largest cities irrespective of year
df.sort_values(['city','population'],ascending=False).\
    drop_duplicates(subset='city').\
    sort_values('population',ascending=False).head(10)

Unnamed: 0,country,city,year,population
430,South Korea,Seoul,1995,10229262
412,Egypt,Al Qahirah,2006,8471859
204,Thailand,Bangkok,1999,7506700
322,Hong Kong,Hong Kong,2009,7055071
229,Vietnam,Ho Chi Minh,2009,5968384
554,Singapore,Singapore,2010,5076700
409,Egypt,Al Iskandariyah,2006,4123869
566,Taiwan,New Taipei,2012,3939305
434,South Korea,Busan,1995,3813814
270,North Korea,Pyongyang,2008,3255288
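
The de-duplication above keys on the city name alone, which would merge two different cities that happen to share a name. A minimal sketch of the "latest year" variant keyed on the (country, city) pair, using made-up rows (the city name and numbers are placeholders, not Mondial data):

```python
import pandas as pd

# Made-up rows for illustration; 'Springfield' stands in for a repeated city name
toy = pd.DataFrame(
    [('X', 'Springfield', 2000, 100),
     ('X', 'Springfield', 2010, 150),
     ('Y', 'Springfield', 2005, 900)],
    columns=['country', 'city', 'year', 'population'])

# Keep the most recent row for each (country, city) pair, then rank by population
latest = (toy.sort_values('year')
             .drop_duplicates(subset=['country', 'city'], keep='last')
             .sort_values('population', ascending=False))

print(latest)
```

With `subset='city'` only one Springfield row would survive; with the pair as the key, both countries keep their own entry.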


### 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)

Looking at the structure, the ethnicgroup tags are children of the country elements, and each one gives a percentage of the country's total population. Population is also a child of country and is recorded for several years. I will step through the tree and, for each country, pick out the population for the latest year, multiply it by each ethnic group's percentage, and write the country, ethnic group, and calculated group population into a list of tuples. This can then be read into a DataFrame for easy sorting.

In [15]:
# create list of tuples of country, ethnic group, and group population
# (latest total population * percentage / 100).
data = []
for element in document.iterfind('country'):
    year=0
    totalpop=0
    if element.find('name') is None:        
        country = ''
    else:
        country = element.find('name').text
    for subelem in element.iterfind('population'):
        thisyear = int(subelem.get('year'))
        thispop = int(subelem.text)
        if thisyear > year:
            year = thisyear
            totalpop = thispop
    for subelem in element.iterfind('ethnicgroup'):
        pop = float(subelem.get('percentage'))/100*totalpop
        group = subelem.text
        data.append((country,group,pop))
df = pd.DataFrame(data,columns=['country','ethnicgroup','population'])

In [16]:
df.head()

Unnamed: 0,country,ethnicgroup,population
0,Albania,Albanian,2660131.0
1,Albania,Greek,84004.14
2,Greece,Greek,10059150.0
3,Macedonia,Macedonian,1322388.0
4,Macedonia,Albanian,519068.1


In [17]:
# top 10 ethnic groups
# select the numeric column explicitly so the country strings are not summed
df.groupby('ethnicgroup')[['population']].sum().sort_values('population',ascending=False).head(10)

Unnamed: 0_level_0,population
ethnicgroup,Unnamed: 1_level_1
Han Chinese,1245059000.0
Indo-Aryan,871815600.0
European,494872200.0
African,318325100.0
Dravidian,302713700.0
Mestizo,157734400.0
Bengali,146776900.0
Russian,131857000.0
Japanese,126534200.0
Malay,121993600.0
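
The per-country step behind these totals (take the latest population, then scale by each group's percentage) can be sketched more compactly with `max` and a key function. The country record below is made up for illustration and only mirrors the tag and attribute names used above:

```python
from xml.etree import ElementTree as ET

# Hypothetical country record; the numbers are made up for illustration
country = ET.fromstring(
    '<country><name>Xland</name>'
    '<population year="2001">1000</population>'
    '<population year="2011">2000</population>'
    '<ethnicgroup percentage="75">A</ethnicgroup>'
    '<ethnicgroup percentage="25">B</ethnicgroup>'
    '</country>'
)

# Pick the <population> element with the largest year attribute
latest = max(country.iterfind('population'), key=lambda p: int(p.get('year')))
total = int(latest.text)

# Scale each ethnic group's percentage by that latest total
groups = {g.text: float(g.get('percentage')) / 100 * total
          for g in country.iterfind('ethnicgroup')}

print(latest.get('year'), groups)  # 2011 {'A': 1500.0, 'B': 500.0}
```

Note that `max` raises `ValueError` on an empty iterator, so a country without any population element would need `default=None` and an extra check.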


### name and country of a) longest river, b) largest lake and c) airport at highest elevation

Where in the xml are the rivers, lakes, and airports?

In [18]:
# do they exist as tags in the document?
root = document.getroot() 
alltags = []
for child in root:
    alltags.append(child.tag)
set(alltags)

{'airport',
 'continent',
 'country',
 'desert',
 'island',
 'lake',
 'mountain',
 'organization',
 'river',
 'sea'}
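
The same survey can also report how many elements of each kind there are. A small sketch using `collections.Counter` on a made-up root element (the sample document is an assumption for illustration):

```python
from collections import Counter
from xml.etree import ElementTree as ET

# Hypothetical miniature document with a few top-level elements
sample_root = ET.fromstring(
    '<mondial><country/><country/><river/><lake/><airport/></mondial>')

# Count how often each tag occurs among the root's direct children
tag_counts = Counter(child.tag for child in sample_root)

print(tag_counts.most_common(1))  # [('country', 2)]
```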

These features are listed at the same level as the countries. Get an example of each one to see its structure.

In [19]:
elem = document.find('river')
ET.tostring(elem)

b'<river country="IS" id="river-Thjorsa">\n      <name>Thjorsa</name>\n      <to water="sea-Atlantic" watertype="sea" />\n      <area>7530</area>\n      <length>230</length>\n      <source country="IS">\n         <latitude>65</latitude>\n         <longitude>-18</longitude>\n      </source>\n      <estuary country="IS">\n         <latitude>63.9</latitude>\n         <longitude>-20.8</longitude>\n      </estuary>\n   </river>\n   '

In [20]:
elem = document.find('lake')
ET.tostring(elem)

b'<lake country="SF" id="lake-Inarisee">\n      <name>Inari</name>\n      <located country="SF" province="lteil-LAP-SF" />\n      <to water="river-Paatsjoki" watertype="river" />\n      <area>1040</area>\n      <latitude>68.95</latitude>\n      <longitude>27.7</longitude>\n      <elevation>119</elevation>\n      <depth>92</depth>\n   </lake>\n   '

In [21]:
elem = document.find('airport')
ET.tostring(elem)

b'<airport city="cty-Afghanistan-2" country="AFG" iatacode="HEA">\n      <name>Herat</name>\n      <latitude>34.210017</latitude>\n      <longitude>62.2283</longitude>\n      <elevation>977</elevation>\n      <gmtOffset>5</gmtOffset>\n   </airport>\n   '

Country information is stored as an attribute holding one or more country codes, so a mapping from country code to country name is needed. For each task I will pull out a list of names, countries, and the relevant measurement: a) for rivers, "length" is a child of "river"; b) for lakes, "area" is a child of "lake"; c) for airports, "elevation" is a child of "airport".

In [22]:
#pull out country data for mapping country name to code
data = []
for element in document.iterfind('country'):
    if element.find('name') is not None:        
        country = element.find('name').text
        code = element.get('car_code')
        data.append((code,country))
country_mapping = dict(data)

#### name and country of longest river

In [23]:
# create list of tuples of river, country, and length
# country attribute may list all relevant country codes separated by space
data = []
for element in document.iterfind('river'):
    if element.find('name') is not None:        
        river = element.find('name').text
        if element.find('length') is not None:
            length = element.find('length').text
            codes = element.get('country')
            for code in codes.split(' '):
                data.append((river,country_mapping[code],length))
df = pd.DataFrame(data,columns=['river','country','length'])

df.length = pd.to_numeric(df.length, errors='coerce')

In [24]:
df.sort_values('length',ascending=False).head(1)

Unnamed: 0,river,country,length
300,Amazonas,Peru,6448.0
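
Because a river that crosses several countries contributes one row per country code, `head(1)` reports only one of those rows. A minimal sketch of keeping every row that shares the maximum, using made-up rows (river names, countries, and lengths are placeholders):

```python
import pandas as pd

# Made-up rows for illustration; R2 flows through two countries
toy = pd.DataFrame(
    [('R1', 'X', 500.0), ('R2', 'Y', 900.0), ('R2', 'Z', 900.0)],
    columns=['river', 'country', 'length'])

# Keep all rows whose length equals the maximum, not just the first one
longest = toy[toy['length'] == toy['length'].max()]

print(longest['country'].tolist())  # ['Y', 'Z']
```

The same filter applies to the lake and airport tables with their own measurement columns.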


#### name and country of largest lake

In [25]:
# create list of tuples of lake, country, and area
# country attribute may list all relevant country codes separated by space
data = []
for element in document.iterfind('lake'):
    if element.find('name') is not None:        
        obj = element.find('name').text
        if element.find('area') is not None:
            meas = element.find('area').text
            codes = element.get('country')
            for code in codes.split(' '):
                data.append((obj,country_mapping[code],meas))
df = pd.DataFrame(data,columns=['lake','country','area'])

df.area = pd.to_numeric(df.area, errors='coerce')

In [26]:
df.sort_values('area',ascending=False).head(1)

Unnamed: 0,lake,country,area
68,Caspian Sea,Russia,386400.0


#### name and country of airport at highest elevation

In [27]:
# create list of tuples of airport, country, and elevation
# country attribute may list all relevant country codes separated by space
data = []
for element in document.iterfind('airport'):
    if element.find('name') is not None:        
        obj = element.find('name').text
        if element.find('elevation') is not None:
            meas = element.find('elevation').text
            codes = element.get('country')
            for code in codes.split(' '):
                data.append((obj,country_mapping[code],meas))
df = pd.DataFrame(data,columns=['airport','country','elevation'])

df.elevation = pd.to_numeric(df.elevation, errors='coerce')

In [28]:
df.sort_values('elevation',ascending=False).head(1)

Unnamed: 0,airport,country,elevation
80,El Alto Intl,Bolivia,4063.0
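
The three cells for rivers, lakes, and airports differ only in the tag and the measurement, so they could be folded into one helper. This is a sketch under stated assumptions, not the method used above: `extreme_feature` is a hypothetical name, and the sample document and mapping are made up for illustration.

```python
from xml.etree import ElementTree as ET

def extreme_feature(root, tag, measure, mapping):
    """Return (name, [countries], value) for the largest `measure` under `tag`."""
    best = None
    for elem in root.iterfind(tag):
        name, value = elem.find('name'), elem.find(measure)
        # Skip features without a name or without a measurement
        if name is None or value is None or not value.text:
            continue
        try:
            number = float(value.text)
        except ValueError:
            continue
        if best is None or number > best[2]:
            # The country attribute may hold several space-separated codes
            codes = (elem.get('country') or '').split()
            best = (name.text, [mapping.get(c, c) for c in codes], number)
    return best

# Hypothetical miniature document and code-to-name mapping
sample = ET.fromstring(
    '<mondial>'
    '<river country="X Y"><name>Long</name><length>900</length></river>'
    '<river country="X"><name>Short</name><length>100</length></river>'
    '</mondial>')
result = extreme_feature(sample, 'river', 'length', {'X': 'Xland', 'Y': 'Yland'})

print(result)  # ('Long', ['Xland', 'Yland'], 900.0)
```

Returning the countries as a list keeps every country of a shared feature together, and `mapping.get(c, c)` falls back to the raw code when a code is missing from the mapping.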
