# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [3]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html

In [4]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [5]:
# print names of all countries
for child in document_tree.getroot():
    print child.find('name').text

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [6]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    # iter() replaces the deprecated getiterator() and walks every nested <city> element
    city_names = [city.find('name').text for city in element.iter('city')]
    print '* ' + element.find('name').text + ': ' + ', '.join(city_names)

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella
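
The difference between matching direct children and walking the whole subtree can be sketched on a toy tree. The `SAMPLE` string and all names in it are made up for illustration and are not taken from the Mondial data; the print calls use parentheses so the sketch should run under both Python 2.7 and Python 3.

```python
from xml.etree import ElementTree as ET

# Toy document with invented names (hypothetical, not Mondial data).
SAMPLE = """
<mondial>
  <country>
    <name>Examplia</name>
    <province>
      <name>North</name>
      <city><name>Alpha</name></city>
    </province>
    <city><name>Beta</name></city>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)
country = root.find('country')

# findall() with a bare tag name matches direct children only.
direct = [c.find('name').text for c in country.findall('city')]

# iter() walks the whole subtree in document order.
nested = [c.find('name').text for c in country.iter('city')]

# The XPath-style './/city' also reaches every descendant.
xpath = [c.find('name').text for c in country.findall('.//city')]

print(direct)
print(nested)
```

Here `direct` holds only the city that sits immediately under the country, while `nested` and `xpath` also pick up the city inside the province.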


****
## XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and the reference at https://docs.python.org/2.7/library/xml.etree.elementtree.html, find:

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [168]:
document = ET.parse( './data/mondial_database.xml' )

CPU times: user 1.37 s, sys: 22.8 ms, total: 1.39 s
Wall time: 1.4 s


In [171]:
# countries with the lowest infant mortality rate
# caveat: rates are used as dict keys, so countries that share a rate collapse into a single entry
mort_dict = {}
for c in document.findall('country'):
    for node in list(c):
        if node.tag == 'name':
            value = node.text
        if node.tag == 'infant_mortality':
            mort_dict[float(node.text)] = value
# sort the rates once, then report the ten lowest
for rate in sorted(mort_dict)[:10]:
    print mort_dict[rate] + ' has an infant mortality rate of ' + str(rate) + '.'


Monaco has an infant mortality rate of 1.81.
Japan has an infant mortality rate of 2.13.
Bermuda has an infant mortality rate of 2.48.
Singapore has an infant mortality rate of 2.53.
Sweden has an infant mortality rate of 2.6.
Czech Republic has an infant mortality rate of 2.63.
Hong Kong has an infant mortality rate of 2.73.
Macao has an infant mortality rate of 3.13.
Iceland has an infant mortality rate of 3.15.
Italy has an infant mortality rate of 3.31.
CPU times: user 14.7 ms, sys: 4.37 ms, total: 19 ms
Wall time: 15.1 ms
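
When a top-N selection is the only goal, collecting (rate, name) pairs and passing them to `heapq.nsmallest` is one option; pairs keep countries that share a rate as separate entries. A minimal sketch on made-up data follows. The `SAMPLE` string and the `lowest_rates` helper are illustrative names, not part of the exercise files.

```python
import heapq
from xml.etree import ElementTree as ET

# Invented sample (hypothetical values): two countries share a rate, one has none.
SAMPLE = """
<mondial>
  <country><name>A</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>B</name><infant_mortality>2.0</infant_mortality></country>
  <country><name>C</name></country>
  <country><name>D</name><infant_mortality>2.0</infant_mortality></country>
</mondial>
"""

def lowest_rates(root, n):
    """Return the n (rate, name) pairs with the smallest rates."""
    pairs = []
    for country in root.findall('country'):
        rate = country.findtext('infant_mortality')
        if rate is not None:  # skip countries without a value
            pairs.append((float(rate), country.findtext('name')))
    return heapq.nsmallest(n, pairs)

result = lowest_rates(ET.fromstring(SAMPLE), 2)
for rate, name in result:
    print(name + ' has an infant mortality rate of ' + str(rate) + '.')
```

Both countries with the shared rate appear in `result`, which would not be the case if the rate were used as a dictionary key.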


In [170]:
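# cities with the highest population
# note: this path matches only cities nested inside a province
cty_list = []
for cty in document.findall('country/province/city'):
    max_year = 0
    pop = 0
    for node in list(cty):
        if node.tag == 'name':
            value = node.text
        elif node.tag == 'population':
            # keep the most recent population figure
            if int(node.attrib['year']) > max_year:
                max_year = int(node.attrib['year'])
                pop = int(node.text)
    # store (population, name) pairs so cities with equal populations are not overwritten
    cty_list.append((pop, value))

# sort once in descending order of population, then report the ten largest
for pop, city in sorted(cty_list, key=lambda pair: pair[0], reverse=True)[:10]:
    print city + ' has a population of ' + str(pop) + '.'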

Shanghai has a population of 22315474.
Istanbul has a population of 13710512.
Mumbai has a population of 12442373.
Moscow has a population of 11979529.
Beijing has a population of 11716620.
São Paulo has a population of 11152344.
Tianjin has a population of 11090314.
Guangzhou has a population of 11071424.
Delhi has a population of 11034555.
Shenzhen has a population of 10358381.
CPU times: user 82.2 ms, sys: 20.1 ms, total: 102 ms
Wall time: 84.6 ms
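
Picking the most recent estimate among several `<population year="...">` children can also be written with `max` and a key function. This is a sketch on an invented city element; `SAMPLE` and `latest_population` are hypothetical names introduced here.

```python
from xml.etree import ElementTree as ET

# Invented city with population figures for three years (hypothetical values).
SAMPLE = """
<city>
  <name>Alpha</name>
  <population year="1990">100</population>
  <population year="2010">250</population>
  <population year="2000">180</population>
</city>
"""

def latest_population(element):
    """Return (year, population) from the most recent <population> child, or None."""
    entries = element.findall('population')
    if not entries:
        return None
    # choose the child whose year attribute is largest
    newest = max(entries, key=lambda p: int(p.get('year')))
    return int(newest.get('year')), int(newest.text)

latest = latest_population(ET.fromstring(SAMPLE))
print(latest)
```

Returning `None` for elements without any population child keeps missing data distinguishable from a real population of zero.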


In [172]:
# ethnic groups with the largest population
totals = {}
for c in document.findall('country'):
    max_year = 0
    pop = 0
    for node in list(c):
        if node.tag == 'population':
            if int(node.attrib['year']) > max_year:
                max_year = int(node.attrib['year'])
                pop = int(node.text)
        if node.tag == 'ethnicgroup':
            name = node.text
            percent = float(node.attrib['percentage']) / 100
            # add this country's share to the group's running total
            totals[name] = totals.get(name, 0) + (pop * percent)
# rank the groups by total population and report the ten largest
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for group, total in ranked[:10]:
    print 'There are ' + str(int(total)) + ' ' + group + ' people.'

There are 1245058800 Han Chinese people.
There are 871815583 Indo-Aryan people.
There are 494872219 European people.
There are 318325120 African people.
There are 302713744 Dravidian people.
There are 157734354 Mestizo people.
There are 146776916 Bengali people.
There are 131856996 Russian people.
There are 126534212 Japanese people.
There are 121993550 Malay people.
CPU times: user 26.5 ms, sys: 7.9 ms, total: 34.4 ms
Wall time: 27.5 ms
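
The per-country weighting (population times percentage, summed per group) can be sketched with `collections.defaultdict`. The data below is invented and, as a simplifying assumption, each country carries a single population figure; `SAMPLE`, `GroupX` and `GroupY` are placeholder names.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Invented sample (hypothetical values): one group appears in two countries.
SAMPLE = """
<mondial>
  <country>
    <population year="2010">1000</population>
    <ethnicgroup percentage="60">GroupX</ethnicgroup>
    <ethnicgroup percentage="40">GroupY</ethnicgroup>
  </country>
  <country>
    <population year="2010">500</population>
    <ethnicgroup percentage="50">GroupX</ethnicgroup>
  </country>
</mondial>
"""

totals = defaultdict(float)
for country in ET.fromstring(SAMPLE).findall('country'):
    pop = int(country.find('population').text)
    for group in country.findall('ethnicgroup'):
        # weight the country's population by the group's percentage share
        totals[group.text] += pop * float(group.get('percentage')) / 100

# rank the groups from largest to smallest total
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for name, total in ranked:
    print(name + ': ' + str(int(total)))
```

A `defaultdict(float)` starts every new group at 0.0, so no explicit check for a missing key is needed.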


In [228]:
# river, lake, airport
def country_decode(codes):
    # Takes a string of space-separated country codes and returns a list of long-form country names.
    return [concodes[code] for code in codes.split()]

def find_max(feature, metric):
    # Takes two tag names as strings and returns a tuple of the name of the feature with the largest
    # metric, the value of that metric, and the list of countries associated with it.
    max_m = 0
    codes = ''
    max_name = ''
    for f in document.findall(feature):
        for node in list(f):
            if node.tag == 'name':
                name = node.text
            if node.tag == metric:
                try:
                    met = float(node.text)
                except (TypeError, ValueError):
                    # empty or non-numeric values count as 0
                    met = 0
                if met > max_m:
                    max_m = met
                    codes = f.attrib['country']
                    max_name = name
    # decode the country codes once, after the maximum has been found
    countries = country_decode(codes)
    return max_name, max_m, countries
        
# lookup from car codes to long-form country names; must be built before find_max is called
concodes = {}
for c in document.findall('country'):
    code = c.attrib['car_code']
    for node in list(c):
        if node.tag == 'name':
            concodes[code] = node.text
        
riv = find_max('river', 'length')
print 'The longest river is ' + riv[0] + ', which is ' + str(int(riv[1])) + 'km long.'
print 'It runs through: '
for e in riv[2]:
    print e

lake = find_max('lake', 'area')
print '\nThe largest lake is ' + lake[0] + ', with an area of ' + str(int(lake[1])) + 'km^2.'
print 'It borders: '
for e in lake[2]:
    print e

air = find_max('airport', 'elevation')
print '\n' + air[0] + ' in ' + str(air[2][0]) + ' is the highest airport, at an elevation of ' + str(air[1]) + ' meters.'


The longest river is Amazonas, which is 6448km long.
It runs through: 
Colombia
Brazil
Peru

The largest lake is Caspian Sea, with an area of 386400km^2.
It borders: 
Russia
Azerbaijan
Kazakhstan
Iran
Turkmenistan

El Alto Intl in Bolivia is the highest airport, at an elevation of 4063.0 meters.
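
The same lookup (largest metric, then country codes resolved to names) can be sketched with `max` and a key function. Everything in `SAMPLE` is invented, and `measured` and `names_by_code` are hypothetical helper names; features without a usable value are treated as 0, which is an assumption of this sketch.

```python
from xml.etree import ElementTree as ET

# Invented sample (hypothetical values): codes in the country attribute are space-separated.
SAMPLE = """
<mondial>
  <country car_code="AA"><name>Examplia</name></country>
  <country car_code="BB"><name>Samplestan</name></country>
  <river country="AA BB"><name>Long River</name><length>1200</length></river>
  <river country="BB"><name>Short River</name><length>300</length></river>
  <river country="AA"><name>Unmeasured River</name></river>
</mondial>
"""

root = ET.fromstring(SAMPLE)

# map each car code to its country name
names_by_code = {c.get('car_code'): c.findtext('name') for c in root.findall('country')}

def measured(element, metric):
    """Return the metric as a float, or 0.0 when the child is missing or empty."""
    text = element.findtext(metric)
    return float(text) if text else 0.0

longest = max(root.findall('river'), key=lambda r: measured(r, 'length'))
countries = [names_by_code[code] for code in longest.get('country').split()]

print(longest.findtext('name') + ' runs through: ' + ', '.join(countries))
```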
