# XML example and exercise
****
+ study examples of accessing nodes in an XML tree structure
+ work on the exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [15]:
from xml.etree import ElementTree as ET
import numpy as np

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
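
As a quick orientation before the examples, here is a minimal sketch of the three traversal calls used throughout this notebook. `SAMPLE_XML` is a hypothetical miniature document invented for illustration; it only mimics the country/province/city nesting and is not taken from the Mondial data.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data (an assumption, not part of the Mondial files).
SAMPLE_XML = """
<mondial>
  <country car_code="AL">
    <name>Albania</name>
    <city><name>Tirana</name></city>
    <province>
      <name>Shkoder</name>
      <city><name>Shkoder</name></city>
    </province>
  </country>
  <country car_code="AND">
    <name>Andorra</name>
    <city><name>Andorra la Vella</name></city>
  </country>
</mondial>
"""

sample_root = ET.fromstring(SAMPLE_XML)

# Iterating over an element yields its direct children only.
country_names = [child.find('name').text for child in sample_root]

albania = sample_root.find('country')
# findall('city') matches direct children of the element ...
direct_cities = [c.find('name').text for c in albania.findall('city')]
# ... while iter('city') walks every descendant, in document order.
all_cities = [c.find('name').text for c in albania.iter('city')]

print(country_names)
print(direct_cities)
print(all_cities)
```

The difference between `findall` and `iter` matters whenever cities may be nested one level deeper, for example inside a province.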

In [5]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [6]:
# print names of all countries
for child in document_tree.getroot():
    print child.find('name').text

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [7]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    # iter() walks all descendants, so cities nested at any depth are included
    city_names = [city.find('name').text for city in element.iter('city')]
    print '* ' + element.find('name').text + ': ' + ', '.join(city_names)

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella


****
## XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and the documentation at https://docs.python.org/2.7/library/xml.etree.elementtree.html, find:

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

### Problem 1

In [42]:
import pandas as pd
document = ET.parse( './data/mondial_database.xml' )
root=document.getroot()

In [40]:
# collect the infant mortality rate of every country that reports one
temp = []
for country in document.getroot():
    mortality = country.find('infant_mortality')
    if mortality is not None:
        temp.append({'Country': country.find('name').text,
                     'Mortality': float(mortality.text)})
data = pd.DataFrame(temp).sort_values(by='Mortality').head(10)
data

| | Country | Mortality |
|---|---|---|
| 36 | Monaco | 1.81 |
| 90 | Japan | 2.13 |
| 109 | Bermuda | 2.48 |
| 34 | Norway | 2.48 |
| 98 | Singapore | 2.53 |
| 35 | Sweden | 2.6 |
| 8 | Czech Republic | 2.63 |
| 72 | Hong Kong | 2.73 |
| 73 | Macao | 3.13 |
| 39 | Iceland | 3.15 |
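
For comparison, the same selection can be sketched without pandas by using `heapq.nsmallest` from the standard library. The function name `lowest_infant_mortality` and the sample document are hypothetical and exist only for this illustration; the element names follow the ones used in the cell above.

```python
import heapq
from xml.etree import ElementTree as ET

# Hypothetical sample data; country "C" deliberately has no rate.
SAMPLE_XML = """
<mondial>
  <country><name>A</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>B</name><infant_mortality>1.8</infant_mortality></country>
  <country><name>C</name></country>
  <country><name>D</name><infant_mortality>2.1</infant_mortality></country>
</mondial>
"""

def lowest_infant_mortality(root, n):
    """Return the n smallest (rate, country name) pairs."""
    rates = []
    for country in root.iter('country'):
        # findtext returns None when the element is absent
        rate = country.findtext('infant_mortality')
        if rate is not None:
            rates.append((float(rate), country.findtext('name')))
    return heapq.nsmallest(n, rates)

print(lowest_infant_mortality(ET.fromstring(SAMPLE_XML), 2))
```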


### Problem 2

In [63]:
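# NOTE: this path matches only <city> elements that are direct children of
# <country>, so cities nested deeper in the tree are skipped, and
# find('population') returns the first population figure listed for a city.
populations = [[city.find('name').text, int(city.find('population').text)]
               for city in root.findall("./country/city/population/..")]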

In [68]:
pd.DataFrame(populations,columns=['City','Population'])\
.sort_values('Population',ascending=False)[:10]

| | City | Population |
|---|---|---|
| 165 | Seoul | 10229262 |
| 123 | Hong Kong | 7055071 |
| 154 | Al Qahirah | 6053000 |
| 75 | Bangkok | 5876000 |
| 87 | Ho Chi Minh | 3924435 |
| 166 | Busan | 3813814 |
| 205 | New Taipei | 3722082 |
| 84 | Hanoi | 3056146 |
| 153 | Al Iskandariyah | 2917000 |
| 204 | Taipei | 2626138 |
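
The path used above only reaches cities that sit directly under a country. A possible variant, sketched here on hypothetical sample data, walks every `city` element with `iter` and picks the most recent figure. It assumes that each `population` element carries a `year` attribute; that assumption, the helper name `latest_city_populations`, and the sample document are illustrative and should be checked against the real file.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data: one city directly under the country,
# two nested in a province, one of them without any population figure.
SAMPLE_XML = """
<mondial>
  <country>
    <name>X</name>
    <city>
      <name>Alpha</name>
      <population year="2001">100</population>
      <population year="2011">150</population>
    </city>
    <province>
      <name>P</name>
      <city>
        <name>Beta</name>
        <population year="2011">900</population>
      </city>
      <city><name>Gamma</name></city>
    </province>
  </country>
</mondial>
"""

def latest_city_populations(root):
    """Return (city, latest population) pairs, largest first."""
    rows = []
    for city in root.iter('city'):          # all descendants, any depth
        counts = city.findall('population')
        if not counts:
            continue                        # no figure reported
        # assumed 'year' attribute; entries without it sort first
        latest = max(counts, key=lambda p: int(p.get('year', 0)))
        rows.append((city.findtext('name'), int(latest.text)))
    return sorted(rows, key=lambda row: row[1], reverse=True)

print(latest_city_populations(ET.fromstring(SAMPLE_XML)))
```

Run against the full data set, this approach may produce a different top 10 than the table above, since it considers more cities and newer figures.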


### Problem 3

In [107]:
ethpop = []

for country in root.findall('country'):
    counts = country.findall('population')
    if not counts:
        continue
    # the last <population> entry is taken as the latest estimate
    latest = int(counts[-1].text)
    for eth in country.findall('ethnicgroup'):
        ethpop.append((latest, float(eth.attrib['percentage']), eth.text))

In [116]:
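df = pd.DataFrame(ethpop, columns=['Pop', 'Perc', 'Ethnicity'])
# head count of each ethnic group within one country
df['Population'] = (df.Pop * df.Perc) / 100

# sum the per-country head counts for each group and keep the 10 largest
df = df.groupby('Ethnicity')[['Population']].sum()\
.sort_values(by='Population', ascending=False).head(10)
df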

| Ethnicity | Population |
|---|---|
| Han Chinese | 1245059000.0 |
| Indo-Aryan | 871815600.0 |
| European | 494872200.0 |
| African | 318325100.0 |
| Dravidian | 302713700.0 |
| Mestizo | 157734400.0 |
| Bengali | 146776900.0 |
| Russian | 131857000.0 |
| Japanese | 126534200.0 |
| Malay | 121993600.0 |
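
The aggregation above boils down to "latest population times percentage, summed per group". A small standard-library sketch makes the arithmetic easy to check by hand. The sample document and the helper name `ethnic_group_totals` are hypothetical; as in the cell above, the last `population` entry of a country is assumed to be the latest one.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical sample data with easy-to-verify numbers.
SAMPLE_XML = """
<mondial>
  <country>
    <name>X</name>
    <population year="2001">1000</population>
    <population year="2011">2000</population>
    <ethnicgroup percentage="75">Alpha</ethnicgroup>
    <ethnicgroup percentage="25">Beta</ethnicgroup>
  </country>
  <country>
    <name>Y</name>
    <population year="2011">400</population>
    <ethnicgroup percentage="50">Beta</ethnicgroup>
  </country>
</mondial>
"""

def ethnic_group_totals(root):
    """Sum (latest population * percentage / 100) per ethnic group."""
    totals = defaultdict(float)
    for country in root.findall('country'):
        counts = country.findall('population')
        if not counts:
            continue
        latest = int(counts[-1].text)   # assumption: last entry is the latest
        for group in country.findall('ethnicgroup'):
            share = float(group.get('percentage'))
            totals[group.text] += latest * share / 100
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

print(ethnic_group_totals(ET.fromstring(SAMPLE_XML)))
```

In the sample, Alpha totals 2000 × 0.75 = 1500 and Beta totals 2000 × 0.25 + 400 × 0.50 = 700.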


### Problem 4

**name and country of a) longest river, b) largest lake and c) airport at highest elevation**

In [323]:
# lookup table: car_code -> country name
con = []
for element in document.iterfind('country'):
    con.append({'Country': element.find('name').text, 'Country Code': element.attrib['car_code']})
country = pd.DataFrame(con)

# one row per (river, country code); rivers without a <length> get length 0
riv = []
for j in document.iterfind('river'):
    length = j.find('length')
    dist = float(length.text) if length is not None else 0
    for c in j.attrib['country'].split(" "):
        riv.append({'Country Code': c, 'River': j.find('name').text, 'Length': dist})
riv2 = pd.DataFrame(riv).merge(country)
# head(1) keeps a single row, so only one of the river's countries is shown
riv2 = riv2.sort_values(by='Length', ascending=False).head(1)
riv2 = riv2[['River', 'Country']]
riv2

| | River | Country |
|---|---|---|
| 299 | Amazonas | Colombia |


In [324]:
# one row per (lake, country code); lakes without an <area> get area 0
l = []
for j in document.iterfind('lake'):
    area = j.find('area')
    a = float(area.text) if area is not None else 0
    for c in j.attrib['country'].split(" "):
        l.append({'Country Code': c, 'Lake': j.find('name').text, 'Area': a})
lake = pd.DataFrame(l).merge(country)
# [:1] keeps a single row, so only one of the lake's countries is shown
lake = lake.sort_values(by='Area', ascending=False)[:1]
lake = lake[['Lake', 'Country']]
lake

| | Lake | Country |
|---|---|---|
| 56 | Caspian Sea | Russia |


In [315]:
apt = []
for j in document.findall('airport'):
    elevation = j.find('elevation')
    # skip airports with a missing or empty <elevation>, or without a name
    if elevation is not None and elevation.text is not None and j.find('name') is not None:
        apt.append((float(elevation.text),
                    j.find('name').text, j.get('country')))
air = pd.DataFrame(apt, columns=['Elev', 'Airport Name', 'Country Code'])\
.sort_values(by='Elev', ascending=False)
air = air[['Airport Name', 'Country Code']][:1]
air

| | Airport Name | Country Code |
|---|---|---|
| 80 | El Alto Intl | BOL |


BOL is the country code for Bolivia.
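
The code-to-name step can also be done programmatically rather than by hand. The following sketch, on hypothetical sample data, resolves the airport's country code through a `car_code` lookup and skips airports whose elevation is missing or empty. The helper name `highest_airport` and the sample values are assumptions for illustration.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data: one usable airport, one with an empty
# <elevation/>, and one without any elevation element.
SAMPLE_XML = """
<mondial>
  <country car_code="AA"><name>Aland</name></country>
  <airport country="AA"><name>High Field</name><elevation>4000</elevation></airport>
  <airport country="AA"><name>Empty Field</name><elevation/></airport>
  <airport country="AA"><name>No Data Field</name></airport>
</mondial>
"""

def highest_airport(root):
    """Return (elevation, airport name, country name) or None."""
    names = {c.get('car_code'): c.findtext('name')
             for c in root.findall('country')}
    best = None
    for airport in root.findall('airport'):
        # findtext gives None for a missing element and '' for an empty one
        text = airport.findtext('elevation')
        if not text:
            continue
        elevation = float(text)
        if best is None or elevation > best[0]:
            code = airport.get('country')
            best = (elevation, airport.findtext('name'), names.get(code, code))
    return best

print(highest_airport(ET.fromstring(SAMPLE_XML)))
```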