# XML example and exercise
****
+ study examples of accessing nodes in an XML tree structure  
+ work on the exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
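
As a rough illustration of the traversal methods used in this notebook, the sketch below contrasts `findall()` (direct children only) with `iter()` (the whole subtree). The XML snippet and the names in it are made up for the illustration and are not taken from the Mondial data.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document, not taken from the Mondial data.
SAMPLE = """
<mondial>
  <country>
    <name>Examplia</name>
    <city><name>Alpha</name></city>
    <province>
      <name>North</name>
      <city><name>Beta</name></city>
    </province>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)
country = root.find('country')

# findall() only looks at direct children of the element.
direct_cities = [c.find('name').text for c in country.findall('city')]

# iter() walks the whole subtree, so cities nested in a province are included.
all_cities = [c.find('name').text for c in country.iter('city')]

print(direct_cities)  # ['Alpha']
print(all_cities)     # ['Alpha', 'Beta']
```

This difference is why the cells in this notebook use `iter('city')` when a country's cities may sit below an intermediate element.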

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [4]:
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [6]:
# print the name of each country followed by its cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':')
    # iter() replaces getiterator(), which was removed in Python 3.9
    city_names = [city.find('name').text for city in element.iter('city')]
    print(', '.join(city_names))

* Albania:
Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:
Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:
Skopje, Kumanovo
* Serbia:
Beograd, Novi Sad, Niš
* Montenegro:
Podgorica
* Kosovo:
Prishtine
* Andorra:
Andorra la Vella


****
## XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and the documentation at https://docs.python.org/2.7/library/xml.etree.elementtree.html, find:

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [20]:
document = ET.parse( './data/mondial_database.xml' )

In [8]:
import pandas as pd

In [14]:
# map each country name to its infant mortality rate,
# skipping countries that lack a name or a rate
infant_mortality = {}
for country in document.iterfind('country'):
    cname = country.find('name')
    mortality = country.find('infant_mortality')
    if cname is not None and mortality is not None:
        infant_mortality[cname.text] = float(mortality.text)
pd.Series(infant_mortality).sort_values().head(10)

Monaco            1.81
Japan             2.13
Norway            2.48
Bermuda           2.48
Singapore         2.53
Sweden            2.60
Czech Republic    2.63
Hong Kong         2.73
Macao             3.13
Iceland           3.15
dtype: float64

In [32]:
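# keep the last <population> entry listed for each city; cities that share a
# name overwrite one another because the dictionary is keyed by name only
city_pop = {}
for country in document.iterfind('country'):
    for city in country.iter('city'):
        cname = city.find('name')
        if cname is not None:
            for population in city.findall('population'):
                city_pop[cname.text] = float(population.text)
pd.Series(city_pop).sort_values(ascending=False).head(10)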

Shanghai     22315474
Istanbul     13710512
Mumbai       12442373
Moskva       11979529
Beijing      11716620
São Paulo    11152344
Tianjin      11090314
Guangzhou    11071424
Delhi        11034555
Shenzhen     10358381
dtype: float64
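
The cell above takes the last `<population>` element of each city and keys the result by city name alone. A possible refinement, sketched below, picks the entry with the highest `year` and keys by (country, city) so that cities sharing a name stay separate. The sample XML, the helper `latest_population`, and the presence of a `year` attribute on `<population>` are assumptions made for this illustration and should be checked against the actual file.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample; the `year` attribute on <population> is an assumption
# about the data layout.
SAMPLE = """
<mondial>
  <country>
    <name>Examplia</name>
    <city>
      <name>Springfield</name>
      <population year="2011">500</population>
      <population year="2001">400</population>
    </city>
  </country>
  <country>
    <name>Otherland</name>
    <city>
      <name>Springfield</name>
      <population year="2005">900</population>
    </city>
  </country>
</mondial>
"""

def latest_population(element):
    """Return the population value with the highest `year`, or None."""
    best_year, best_value = None, None
    for pop in element.findall('population'):
        if pop.text is None:
            continue
        year = int(pop.get('year', 0))  # entries without a year rank lowest
        if best_year is None or year > best_year:
            best_year, best_value = year, int(pop.text)
    return best_value

root = ET.fromstring(SAMPLE)
city_pop = {}
for country in root.iterfind('country'):
    country_name = country.find('name').text
    for city in country.iter('city'):
        value = latest_population(city)
        if value is not None:
            city_pop[(country_name, city.find('name').text)] = value

print(city_pop)
```

In the sample, the 2011 figure is chosen for the first Springfield even though it is not listed last, and the two Springfields are kept apart.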

In [36]:
# estimate each group's size as country population * percentage / 100,
# then sum the estimates over all countries
ethnic_pop = {}
for country in document.iterfind('country'):
    # use the last <population> entry listed for the country
    total_pop = 0
    for population in country.findall('population'):
        total_pop = int(population.text)
    for eg in country.findall('ethnicgroup'):
        eg_name = eg.text
        eg_percent = eg.get('percentage')  # None if the attribute is missing
        if eg_percent is not None:
            estimate = round(total_pop * float(eg_percent) / 100)
            ethnic_pop[eg_name] = ethnic_pop.get(eg_name, 0) + estimate
pd.Series(ethnic_pop).sort_values(ascending=False).head(10)

Han Chinese    1245058800
Indo-Aryan      871815583
European        494872221
African         318325121
Dravidian       302713744
Mestizo         157734355
Bengali         146776917
Russian         131856994
Japanese        126534212
Malay           121993550
dtype: int64
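
The aggregation behind these totals is: for every country, multiply its population by each group's percentage, then add the results per group across countries. The sketch below works through that arithmetic on made-up figures using `collections.Counter`; the numbers and group names are hypothetical.

```python
from collections import Counter

# Hypothetical figures: (country population, {group: percentage}).
countries = [
    (1000, {'X': 60.0, 'Y': 40.0}),
    (500, {'X': 10.0}),
]

totals = Counter()
for population, shares in countries:
    for group, percent in shares.items():
        # Missing keys start at 0 in a Counter, so no membership test is needed.
        totals[group] += population * percent / 100

print(totals.most_common(2))  # [('X', 650.0), ('Y', 400.0)]
```

Group X receives 600 from the first country and 50 from the second, giving 650.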

In [68]:
# track the running maximum for each feature type;
# the `country` attribute holds one or more country codes
river_len = 0
river_name = ''
c_river = ''
lake_width = 0  # holds the lake's area, despite the name
lake_name = ''
c_lake = ''
airport_ele = 0
airport_name = ''
c_airport = ''
# longest river by <length>
for river in document.findall('river'):
    temp = river.find('length')
    if temp is not None:
        temp = float(temp.text)
        if temp > river_len:
            river_len = temp
            river_name = river.find('name').text
            c_river = river.attrib['country']
# largest lake by <area>
for lake in document.findall('lake'):
    temp = lake.find('area')
    if temp is not None:
        temp = float(temp.text)
        if temp > lake_width:
            lake_width = temp
            lake_name = lake.find('name').text
            c_lake = lake.attrib['country']
# highest airport by <elevation>; some <elevation> elements are empty
for airport in document.findall('airport'):
    temp = airport.find('elevation')
    if temp is not None and temp.text is not None:
        temp = float(temp.text)
        if temp > airport_ele:
            airport_ele = temp
            airport_name = airport.find('name').text
            c_airport = airport.attrib['country']

In [69]:
print('Country with longest river: ' + c_river + ', '+ river_name + ', '+str(river_len))
print('Country with largest lake: ' + c_lake + ', '+ lake_name + ', '+str(lake_width))
print('Country with highest airport: ' + c_airport + ', '+ airport_name + ', '+str(airport_ele))

Country with longest river: CO BR PE, Amazonas, 6448.0
Country with largest lake: R AZ KAZ IR TM, Caspian Sea, 386400.0
Country with highest airport: BOL, El Alto Intl, 4063.0
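
The results above report countries as space-separated codes rather than names. One way to translate them, sketched below, is to build a lookup from each country's code to its name. The sample XML, the helper `country_names`, and the assumption that each `<country>` carries its code in a `car_code` attribute are illustrative and should be verified against the actual file.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample; `car_code` on <country> is an assumption about the data.
SAMPLE = """
<mondial>
  <country car_code="AA"><name>Examplia</name></country>
  <country car_code="BB"><name>Otherland</name></country>
  <river country="AA BB"><name>Long River</name><length>100</length></river>
</mondial>
"""

root = ET.fromstring(SAMPLE)

# Map each country code to the country's name.
code_to_name = {
    country.get('car_code'): country.find('name').text
    for country in root.iterfind('country')
}

def country_names(codes):
    """Translate a space-separated code string; unknown codes pass through."""
    return [code_to_name.get(code, code) for code in codes.split()]

river = root.find('river')
print(country_names(river.get('country')))  # ['Examplia', 'Otherland']
```

Unknown codes are returned unchanged so that a gap in the lookup does not hide a result.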
