# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET
import pandas as pd

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
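
As a rough illustration of the traversal methods used in the cells below, the sketch here contrasts `findall()` (direct children only) with `iter()` (the whole subtree). The sample XML and every name in it are made up for illustration; only the tag layout is meant to resemble the Mondial file. It is written with single-argument `print()` calls so that it also runs on Python 3.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document; the content is invented, only the
# nesting of <country>, <province> and <city> mimics the Mondial data.
SAMPLE = """
<mondial>
  <country>
    <name>Aland</name>
    <city><name>Alpha</name></city>
    <province>
      <name>North</name>
      <city><name>Beta</name></city>
    </province>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)
country = root.find('country')   # first direct child with that tag

# findall() only looks at direct children of `country`
direct_cities = [c.find('name').text for c in country.findall('city')]

# iter() walks the whole subtree, so the city nested in the province is included
all_cities = [c.find('name').text for c in country.iter('city')]

print(direct_cities)
print(all_cities)
```

Here `direct_cities` misses the city that sits inside the province, which is why the country/city example below walks the subtree instead.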

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print child.find('name').text

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [4]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print '* ' + element.find('name').text + ':',
    # iter() walks the whole subtree, so cities nested in provinces are included
    city_names = [city.find('name').text for city in element.iter('city')]
    print ', '.join(city_names)

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation
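
One possible approach to item 4a is sketched below on a made-up sample. It assumes that each `<river>` lists its countries as space-separated codes in a `country` attribute, that each `<country>` carries a matching `car_code` attribute, and that the length is stored in a `<length>` child. These are assumptions about the Mondial layout and should be checked against the real file before reuse.

```python
from xml.etree import ElementTree as ET

# Made-up sample; the `car_code` and `country` attributes and the <length>
# element are assumptions about the Mondial layout, not verified facts.
SAMPLE = """
<mondial>
  <country car_code="AA"><name>Aland</name></country>
  <country car_code="BB"><name>Borduria</name></country>
  <river country="AA BB"><name>Long River</name><length>1200</length></river>
  <river country="BB"><name>Short River</name><length>300.5</length></river>
  <river country="AA"><name>Unmeasured River</name></river>
</mondial>
"""

root = ET.fromstring(SAMPLE)

# Map country codes to readable names for the final lookup
code_to_name = {c.get('car_code'): c.find('name').text
                for c in root.iterfind('country')}

# Keep only rivers that actually record a length
measured = [r for r in root.iterfind('river') if r.findtext('length')]
longest = max(measured, key=lambda r: float(r.findtext('length')))

river_name = longest.find('name').text
river_countries = [code_to_name.get(code, code)
                   for code in longest.get('country', '').split()]
print('{}: {}'.format(river_name, ', '.join(river_countries)))
```

The same pattern (filter out entries without a value, take `max()` with a numeric key, then translate the codes) should carry over to lakes and airports if they follow a similar layout.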

In [5]:
document = ET.parse( './data/mondial_database.xml' )
root = document.getroot()

In [6]:
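# population of each country, taken from its first <population> element
# (this ranks countries; it is not yet the city ranking asked for in item 2)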

pop = pd.DataFrame()

for element in document.iterfind('country'):
    entry = pd.DataFrame({'country': element.find('name').text, 'population': int(element.find('population').text)}, index = range(1))
    pop = pop.append(entry, ignore_index=True)
    
#for country in root.iter('country'):
    #name = country.find('name').text
    #population = country.find('population').text
    #print name, population

In [7]:
pop.sort_values(by='population', ascending =False).head(10)

| | country | population |
|---|---|---|
| 55 | China | 543776080 |
| 67 | India | 238396327 |
| 120 | United States | 157813040 |
| 23 | Russia | 102798657 |
| 98 | Japan | 82199470 |
| 88 | Indonesia | 72592192 |
| 11 | Germany | 68230796 |
| 176 | Brazil | 53974725 |
| 53 | United Kingdom | 50616012 |
| 7 | France | 40502513 |
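
The cell above reads the first `<population>` element of each country, whereas item 2 asks for cities. A minimal sketch of a city ranking follows, assuming each `<city>` holds its own `<population year="...">` children and that the entry with the highest `year` is the best estimate. The sample data and the helper name `latest_population` are hypothetical.

```python
from xml.etree import ElementTree as ET

# Invented sample data; only the element and attribute names are meant to
# resemble the Mondial file.
SAMPLE = """
<mondial>
  <country>
    <name>Aland</name>
    <population year="1950">400</population>
    <population year="2011">1000</population>
    <city>
      <name>Alpha</name>
      <population year="2001">150</population>
      <population year="2011">200</population>
    </city>
    <province>
      <name>North</name>
      <city>
        <name>Beta</name>
        <population year="2011">350</population>
      </city>
      <city><name>Gamma</name></city>
    </province>
  </country>
</mondial>
"""

def latest_population(element):
    """Return the most recent population recorded directly on `element`, or None."""
    entries = element.findall('population')   # direct children only
    if not entries:
        return None
    newest = max(entries, key=lambda p: int(p.get('year', 0)))
    return int(newest.text)

root = ET.fromstring(SAMPLE)
rows = []
for country in root.iterfind('country'):
    for city in country.iter('city'):          # cities at any depth
        population = latest_population(city)
        if population is not None:             # skip cities without a figure
            rows.append((city.find('name').text,
                         country.find('name').text,
                         population))

top_cities = sorted(rows, key=lambda row: row[2], reverse=True)[:10]
for name, country_name, population in top_cities:
    print('{} ({}): {}'.format(name, country_name, population))
```

On this sample, `country.find('population')` would return the 1950 figure while `latest_population(country)` returns the 2011 one, so the choice between first and latest entry can change a ranking considerably.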


In [8]:
# 10 countries with the lowest infant mortality rates
from decimal import Decimal
mortality = pd.DataFrame()

for element in document.iterfind('country'):
    try:
        entry = pd.DataFrame({'country': element.find('name').text, 'mortality rate': Decimal(element.find('infant_mortality').text)}, index = range(1))
    except AttributeError:
        # no <infant_mortality> element for this country: skip it instead of
        # re-appending the previous country's entry
        continue
    mortality = mortality.append(entry, ignore_index=True)

In [9]:
mortality.sort_values(by='mortality rate', ascending = True).head(10)

| | country | mortality rate |
|---|---|---|
| 38 | Monaco | 1.81 |
| 98 | Japan | 2.13 |
| 36 | Norway | 2.48 |
| 117 | Bermuda | 2.48 |
| 106 | Singapore | 2.53 |
| 37 | Sweden | 2.6 |
| 10 | Czech Republic | 2.63 |
| 78 | Hong Kong | 2.73 |
| 79 | Macao | 3.13 |
| 44 | Iceland | 3.15 |
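
As an alternative to catching `AttributeError`, missing elements can be tested for explicitly. The sketch below uses `findtext()`, which returns `None` when the child is absent, and collects plain tuples that are sorted at the end. The sample countries and rates are invented, and the pandas step is left out on purpose to keep the example dependent on the standard library only.

```python
from xml.etree import ElementTree as ET

# Invented sample: the second country has no <infant_mortality> element
SAMPLE = """
<mondial>
  <country><name>Aland</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>Borduria</name></country>
  <country><name>Cascara</name><infant_mortality>2.1</infant_mortality></country>
</mondial>
"""

rates = []
for country in ET.fromstring(SAMPLE).iterfind('country'):
    raw = country.findtext('infant_mortality')   # None when the element is absent
    if raw is None:
        continue                                  # leave out countries without data
    rates.append((country.findtext('name'), float(raw)))

# Ascending sort, so the lowest rates come first
lowest = sorted(rates, key=lambda row: row[1])[:10]
for name, rate in lowest:
    print('{}: {}'.format(name, rate))
```

Because each country is either appended once or skipped, no row can be carried over from a previous iteration.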


In [15]:
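# print every population figure recorded for 2011; iter() also descends into
# provinces and cities, so a country name appears once per matching element

pop = pd.DataFrame()

for element in document.iterfind('country'):
    for subelement in element.iter('population'):
        # .get() returns None instead of raising when 'year' is missing
        if subelement.get('year') == '2011':
            print element.find('name').text, subelement.text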

Albania 2800138
Albania 418495
Albania 77075
Albania 113249
Albania 79513
Albania 78703
Albania 51152
Greece 10816286
Greece 608182
Greece 58790
Greece 3828434
Greece 664046
Greece 163688
Greece 139981
Greece 106943
Greece 679796
Greece 213984
Greece 283689
Greece 207855
Greece 102071
Greece 336856
Greece 112486
Greece 1880297
Greece 325182
Greece 623065
Greece 173993
Greece 108642
Greece 309015
Greece 115490
Greece 577903
Greece 547390
Greece 75315
Greece 102223
Greece 732762
Greece 162591
Greece 144449
Greece 199231
Greece 1811
Macedonia 2059794
Macedonia 514967
Macedonia 107745
Serbia 7120666
Serbia 1639121
Serbia 335701
Serbia 257867
Montenegro 620029
Montenegro 150977
Kosovo 1733872
Kosovo 198214
Andorra 78115
Andorra 22256
France 64933400
France 1852325
France 272222
France 110351
France 3254233
France 239399
France 1350682
France 140957
France 1475684
France 108793
France 3217767
France 208033
France 140547
France 1642734
France 151672
France 2556835
France 114185
France 134633
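
The listing above repeats each country because the loop also visits the `<population>` elements of provinces and cities. If only the country-level figure for one year is wanted, one option is an attribute predicate in the path passed to `find()`, which is matched against direct children only. The sketch below shows this on invented sample data.

```python
from xml.etree import ElementTree as ET

# Invented sample: Aland has a 2011 figure at country and city level,
# Borduria has no 2011 figure at all.
SAMPLE = """
<mondial>
  <country>
    <name>Aland</name>
    <population year="2001">900</population>
    <population year="2011">1000</population>
    <city>
      <name>Alpha</name>
      <population year="2011">200</population>
    </city>
  </country>
  <country>
    <name>Borduria</name>
    <population year="2010">500</population>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)
country_2011 = {}
for country in root.iterfind('country'):
    # The predicate selects a direct <population> child whose year is 2011;
    # the city-level entry is not a direct child and is therefore ignored
    entry = country.find("population[@year='2011']")
    if entry is not None:
        country_2011[country.findtext('name')] = int(entry.text)

print(country_2011)
```

Countries without a 2011 entry are simply absent from the resulting dictionary, so a fallback to the nearest available year would have to be added separately.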
