# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
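
As a minimal sketch of the traversal calls used in this notebook, the snippet below parses a small inline XML string and contrasts `find`, `findall` with a path, and `iter`. The element names loosely mirror the Mondial layout, but the document, country names, and city names are made up for illustration.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document; not taken from the Mondial file
SAMPLE_XML = """
<mondial>
  <country car_code="AA">
    <name>Aland</name>
    <city><name>Alpha</name></city>
    <province>
      <city><name>Beta</name></city>
    </province>
  </country>
  <country car_code="BB">
    <name>Borduria</name>
    <city><name>Gamma</name></city>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE_XML)

# find() returns the first matching direct child (or None if there is none)
first_country_name = root.find('country').find('name').text

# findall() with a path matches only elements along that exact path
direct_cities = [c.find('name').text for c in root.findall('country/city')]

# iter() walks the whole subtree, so cities nested under <province> are included
all_cities = [c.find('name').text for c in root.iter('city')]

print(first_country_name)
print(direct_cities)
print(all_cities)
```

The difference between the last two lists matters whenever cities can sit at more than one depth in the tree.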

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print (child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [4]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':')
    # iter() walks the whole subtree, so cities nested under provinces are included
    city_names = [subelement.find('name').text for subelement in element.iter('city')]
    print(', '.join(city_names))

* Albania:
Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:
Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:
Skopje, Kumanovo
* Serbia:
Beograd, Novi Sad, Niš
* Montenegro:
Podgorica
* Kosovo:
Prishtine
* Andorra:
Andorra la Vella


****
## XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and the documentation at https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [5]:
import pandas as pd
document = ET.parse( './data/mondial_database.xml' )

Solution 1. 10 countries with the lowest infant mortality rates

In [6]:
#find all infant mortality rates; countries without an <infant_mortality> element are skipped
infmort = []
for child in document.getroot():
    try:
        infmort.append([child.find('name').text, float(child.find('infant_mortality').text)])
    except AttributeError:
        continue

#create dataframe for infant mortality rate then sort to get lowest 10
df_infmort = pd.DataFrame(infmort, columns=['country', 'infant_mortality'])
df_infmort.sort_values('infant_mortality', ascending=True).head(10)

index,country,infant_mortality
36,Monaco,1.81
90,Japan,2.13
109,Bermuda,2.48
34,Norway,2.48
98,Singapore,2.53
35,Sweden,2.6
8,Czech Republic,2.63
72,Hong Kong,2.73
73,Macao,3.13
39,Iceland,3.15
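
A top-N query like this one can also be answered without building a DataFrame. The sketch below applies `heapq.nsmallest` to a list with the same `[country, rate]` shape as `infmort`; the names and rates are hypothetical, not Mondial data.

```python
import heapq

# Hypothetical [country, rate] pairs, same shape as the `infmort` list
sample_rates = [['A', 4.0], ['B', 1.5], ['C', 3.2], ['D', 2.1]]

# Pick the 2 entries with the smallest rate without sorting the full list
lowest_two = heapq.nsmallest(2, sample_rates, key=lambda row: row[1])
print(lowest_two)
```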


Solution 2. 10 cities with the largest population

In [7]:
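#note: the path 'country/city' matches only <city> elements directly under <country>;
#cities nested deeper (for example under <province>) are not visited
population = []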
for element in document.iterfind('country/city'):
    try:
        for pop in element.findall('population'):
            population.append([element.find('name').text, int(pop.get('year')), float(pop.text)])
    except AttributeError:
        continue

#create dataframe with all city populations
df_pop = pd.DataFrame(population, columns = ['city', 'year', 'pop'])

#take the last-listed (assumed most recent) population entry for each city, then sort to find the 10 largest
g = df_pop.groupby('city')
g.tail(1).sort_values('pop').tail(10)

index,city,year,pop
270,Pyongyang,2008,3255288
437,Busan,2010,3403135
566,New Taipei,2012,3939305
409,Al Iskandariyah,2006,4123869
554,Singapore,2010,5076700
229,Ho Chi Minh,2009,5968384
322,Hong Kong,2009,7055071
204,Bangkok,1999,7506700
412,Al Qahirah,2006,8471859
433,Seoul,2010,9708483
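
The solution above takes the last-listed population entry per city, which is the most recent one only if entries appear in chronological order. As a sketch that does not depend on row order, the snippet below selects the entry with the greatest year for each city. The rows are hypothetical and deliberately out of order; they share the `[city, year, pop]` shape of the `population` list.

```python
# Hypothetical [city, year, population] rows; 'X' is deliberately out of order
rows = [
    ['X', 2010, 500.0],
    ['X', 2001, 450.0],
    ['Y', 1999, 900.0],
    ['Y', 2011, 950.0],
    ['Z', 2005, 700.0],
]

# Keep, for each city, the entry with the greatest year
latest = {}
for city, year, pop in rows:
    if city not in latest or year > latest[city][0]:
        latest[city] = (year, pop)

# Rank cities by their latest population, largest first
ranked = sorted(latest.items(), key=lambda item: item[1][1], reverse=True)
top_two = [(city, year, pop) for city, (year, pop) in ranked[:2]]
print(top_two)
```

Keying on the city name alone merges distinct cities that share a name; keying on a (country, city) pair would be one way to avoid that.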


Solution 3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)

In [8]:
#gather country, ethnic group, and estimated group size
#(percentage * last-listed country population, assuming population entries are in chronological order)
ethnic_grp = []
for child in document.getroot():
    country_pop = child.findall('population')
    try:
        p = float(country_pop[-1].text)
        for eth in child.findall('ethnicgroup'):
            ethnic_grp.append([child.find('name').text, eth.text, float(eth.get('percentage'))/100*p])
    except IndexError:
        continue

#create dataframe of ethnic groups then sum by ethnic groups to find 10 most populous groups
df_ethnic = pd.DataFrame(ethnic_grp, columns = ['country', 'ethnic_group', 'pop'])
df_ethnic.groupby('ethnic_group')[['pop']].sum().sort_values('pop').tail(10)

ethnic_group,pop
Malay,121993600.0
Japanese,126534200.0
Russian,131857000.0
Bengali,146776900.0
Mestizo,157734400.0
Dravidian,302713700.0
African,318325100.0
European,494872200.0
Indo-Aryan,871815600.0
Han Chinese,1245059000.0
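
The aggregation behind these totals is a weighted sum: each group's percentage times the country's population, accumulated across countries. A minimal sketch with made-up records (the country names, group labels, and figures are hypothetical):

```python
from collections import defaultdict

# Hypothetical (country, latest_population, {group: percentage}) records
records = [
    ('Aland', 1000.0, {'G1': 50.0, 'G2': 50.0}),
    ('Borduria', 2000.0, {'G1': 25.0, 'G3': 75.0}),
]

# Accumulate percentage / 100 * population per ethnic group across countries
totals = defaultdict(float)
for _country, population, groups in records:
    for group, percentage in groups.items():
        totals[group] += percentage / 100 * population

# Largest groups first
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```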


Solution 4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [9]:
# helper function to find country name
def find_country(code):
    countries = code.split(' ')
    arr = [country.find('name').text for country in document.iterfind('country')
           if country.get('car_code') in countries]
    return " ".join(arr)


#return [name, country, value] for the element with the largest measurement:
#'length' for river, 'area' for lake, 'elevation' for airport
def find_name(itm, measurement):
    max_measurement, country, arr = 0, '', []
    for element in document.iterfind(itm):
        try:
            m = float(element.find(measurement).text)
            if m > max_measurement:
                max_measurement = m
                country = element.get('country')
                arr = [element.find('name').text, find_country(country), m]
        except (TypeError, AttributeError):
            continue
    return arr

#answers
longest_river = find_name('river', 'length')
largest_lake = find_name('lake', 'area')
highest_airport = find_name('airport', 'elevation')
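
`find_country` rescans every country element on each call. As an alternative sketch, the snippet below builds a `car_code` to name dictionary once and uses `max` with a key function to pick the longest river. The inline XML, names, and lengths are hypothetical and only imitate the attributes used above (`car_code`, `country`, `length`).

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document; names and values are made up
SAMPLE = """
<mondial>
  <country car_code="AA"><name>Aland</name></country>
  <country car_code="BB"><name>Borduria</name></country>
  <river country="AA BB"><name>Long</name><length>1200</length></river>
  <river country="BB"><name>Short</name><length>300</length></river>
  <river country="AA"><name>Unmeasured</name></river>
</mondial>
"""
root = ET.fromstring(SAMPLE)

# Build the car_code -> name mapping once instead of rescanning per lookup
code_to_name = {c.get('car_code'): c.find('name').text
                for c in root.iterfind('country')}

# Keep only rivers that actually carry a <length> element
measured = [r for r in root.iterfind('river') if r.find('length') is not None]

# max() with a key picks the element with the largest length
longest = max(measured, key=lambda r: float(r.find('length').text))
countries = [code_to_name[code] for code in longest.get('country').split()]
result = [longest.find('name').text, ' '.join(countries),
          float(longest.find('length').text)]
print(result)
```

Here the country names follow the order of the codes in the `country` attribute, whereas `find_country` returns them in document order.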