# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print child.find('name').text

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra
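
The loop above assumes that every child of the root has a `name` element; `find('name')` returns `None` otherwise and `.text` then fails. A minimal sketch of the same traversal on a small inline document (the XML string and the `(unnamed)` placeholder are made up for illustration and are not taken from the Mondial file), using `findtext` with a default value:

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document; the real file carries many more fields.
SAMPLE = """
<mondial>
  <country car_code="AL"><name>Albania</name></country>
  <country car_code="GR"><name>Greece</name></country>
  <country car_code="XX"/>
</mondial>
"""

root = ET.fromstring(SAMPLE)

# findtext returns the second argument instead of failing when <name> is absent
names = [child.findtext('name', '(unnamed)') for child in root]
print(names)
```

The third country has no `name` child, so it shows up as the placeholder rather than raising `AttributeError`.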


In [4]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print '* ' + element.find('name').text + ':',
    # collect every city nested anywhere below this country
    city_names = [city.find('name').text for city in element.iter('city')]
    print ', '.join(city_names)

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella
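
The example mixes two kinds of lookup: `iterfind('country')` matches only direct children, while iterating over all descendants also reaches cities nested deeper in the tree. A small sketch of the difference, assuming a made-up country in which one city sits directly under the country and another sits inside a province (the names are placeholders):

```python
from xml.etree import ElementTree as ET

# Hypothetical structure: one city directly under the country, one inside a province.
SAMPLE = """
<country>
  <name>Exampleland</name>
  <city><name>Capital City</name></city>
  <province>
    <name>Some Province</name>
    <city><name>Provincial Town</name></city>
  </province>
</country>
"""

country = ET.fromstring(SAMPLE)

# findall/iterfind with a bare tag name only look at direct children
direct = [c.findtext('name') for c in country.findall('city')]

# iter walks the whole subtree in document order
nested = [c.findtext('name') for c in country.iter('city')]

print(direct)
print(nested)
```

Here `direct` contains only the first city, while `nested` contains both.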


****
## XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [5]:
doc = ET.parse( './data/mondial_database.xml' )

In [6]:
# map each country name to its infant mortality rate (None when the element is missing)
countrybabydeath = {}
for element in doc.iter('country'):
    countrybabydeath[element.findtext('name')] = element.findtext('infant_mortality')

In [7]:
import pandas as pd
df = pd.DataFrame.from_dict(countrybabydeath,  orient='index', dtype='float')
df.columns = ['infant_mortality']

In [8]:
df.sort_values('infant_mortality', ascending=True).head(10)


| | infant_mortality |
|---|---|
| Monaco | 1.81 |
| Japan | 2.13 |
| Norway | 2.48 |
| Bermuda | 2.48 |
| Singapore | 2.53 |
| Sweden | 2.6 |
| Czech Republic | 2.63 |
| Hong Kong | 2.73 |
| Macao | 3.13 |
| Iceland | 3.15 |
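
For comparison, the same ranking can be sketched without pandas. The following is only an illustration under stated assumptions: the helper name `lowest_infant_mortality` and the inline sample data are hypothetical, and countries without an `infant_mortality` element are skipped explicitly instead of becoming NaN.

```python
import heapq
from xml.etree import ElementTree as ET

# Hypothetical sample; country B has no infant_mortality element.
SAMPLE = """<mondial>
  <country><name>A</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>B</name></country>
  <country><name>C</name><infant_mortality>1.8</infant_mortality></country>
  <country><name>D</name><infant_mortality>12.0</infant_mortality></country>
</mondial>"""


def lowest_infant_mortality(root, n):
    """Return the n smallest (rate, country name) pairs, lowest first."""
    rates = []
    for country in root.iter('country'):
        text = country.findtext('infant_mortality')
        if text is None:
            continue  # no figure recorded for this country
        rates.append((float(text), country.findtext('name')))
    return heapq.nsmallest(n, rates)


result = lowest_infant_mortality(ET.fromstring(SAMPLE), 2)
print(result)
```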


In [9]:
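citypop = {}
for element in doc.iter('country'):
    for subelement in element.iter('city'):
        # only the 2011 figure is read, so cities without a 2011 entry get None (NaN in the frame);
        # cities that share a name overwrite each other in this dict
        citypop[subelement.findtext('name')] = subelement.findtext('population[@year="2011"]')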

In [10]:
df2 = pd.DataFrame.from_dict(citypop,  orient='index', dtype='float')
df2.columns = ['population']

In [11]:
df2.sort_values('population', ascending=False).head(10)


| | population |
|---|---|
| Mumbai | 12442373.0 |
| Delhi | 11034555.0 |
| Bangalore | 8443675.0 |
| Tehran | 8154051.0 |
| Dhaka | 7423137.0 |
| Hyderabad | 6731790.0 |
| Ahmadabad | 5577940.0 |
| Luanda | 5000000.0 |
| Chennai | 4646732.0 |
| Sydney | 4605992.0 |
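
The ranking above only considers cities that have a population figure for 2011. A possible alternative, sketched here on made-up data, is to take the most recent figure available for each city. It assumes every `population` element carries a `year` attribute (as the `population[@year="2011"]` query suggests); the helper name `latest_city_populations` is hypothetical. Keys combine the city name with the country's `car_code` so that cities sharing a name do not overwrite each other.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: Alpha has two census years, Gamma has no figure at all.
SAMPLE = """<mondial>
  <country car_code="X">
    <city><name>Alpha</name>
      <population year="2001">100</population>
      <population year="2011">150</population>
    </city>
    <city><name>Beta</name>
      <population year="2009">900</population>
    </city>
    <city><name>Gamma</name></city>
  </country>
</mondial>"""


def latest_city_populations(root):
    """Map (city name, car_code) to the most recent population figure."""
    result = {}
    for country in root.iter('country'):
        for city in country.iter('city'):
            counts = city.findall('population')
            if not counts:
                continue  # no population recorded for this city
            # pick the entry with the highest year attribute
            latest = max(counts, key=lambda p: int(p.get('year', 0)))
            key = (city.findtext('name'), country.get('car_code'))
            result[key] = float(latest.text)
    return result


populations = latest_city_populations(ET.fromstring(SAMPLE))
# largest first
ranked = sorted(populations.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```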


In [12]:
# first, create a dict with every ethnic group as a key and a running total of 0
ethnic = {}
for element in doc.iter('country'):
    for subelement in element.iter('ethnicgroup'):
        ethnic[subelement.text] = 0

# population per country: later <population> entries overwrite earlier ones,
# so the last value listed for each country is kept
countrypop = {}
for element in doc.iter('country'):
    for pop in element.findall('population'):
        countrypop[element.findtext('name')] = float(pop.text)

# now populate the ethnic dictionary: each group adds percentage/100 of its country's population
for element in doc.iter('country'):
    for subelement in element.iter('ethnicgroup'):
        ethnic[subelement.text] += float(subelement.get('percentage')) * countrypop[element.findtext('name')] / 100

In [13]:
df3 = pd.DataFrame.from_dict(ethnic,  orient='index', dtype='float')
df3.columns = ['population']

In [14]:
df3.sort_values('population', ascending=False).head(10)


| | population |
|---|---|
| Han Chinese | 1245059000.0 |
| Indo-Aryan | 871815600.0 |
| European | 494872200.0 |
| African | 318325100.0 |
| Dravidian | 302713700.0 |
| Mestizo | 157734400.0 |
| Bengali | 146776900.0 |
| Russian | 131857000.0 |
| Japanese | 126534200.0 |
| Malay | 121993600.0 |
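
The three passes over the document can also be folded into one. The sketch below is an illustration on made-up data, not the notebook's method: the helper name `ethnic_totals` is hypothetical, and it assumes that the last `population` child of a country is the estimate to use, mirroring the overwrite behaviour of the loop above. A `defaultdict` removes the need to pre-fill the totals with zeros.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical sample with two countries sharing one ethnic group.
SAMPLE = """<mondial>
  <country><name>A</name>
    <population year="2000">1000</population>
    <population year="2010">2000</population>
    <ethnicgroup percentage="50">Red</ethnicgroup>
    <ethnicgroup percentage="25">Blue</ethnicgroup>
  </country>
  <country><name>B</name>
    <population year="2010">400</population>
    <ethnicgroup percentage="100">Red</ethnicgroup>
  </country>
</mondial>"""


def ethnic_totals(root):
    """Sum percentage/100 * country population per ethnic group."""
    totals = defaultdict(float)
    for country in root.iter('country'):
        counts = country.findall('population')
        if not counts:
            continue  # cannot scale percentages without a population
        population = float(counts[-1].text)  # last entry listed
        for group in country.findall('ethnicgroup'):
            share = float(group.get('percentage')) / 100
            totals[group.text] += share * population
    return dict(totals)


totals = ethnic_totals(ET.fromstring(SAMPLE))
print(totals)
```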


In [15]:
list(next(doc.iter('river')))


[<Element 'name' at 0x1061d95d0>,
 <Element 'to' at 0x1061d9610>,
 <Element 'area' at 0x1061d9650>,
 <Element 'length' at 0x1061d9690>,
 <Element 'source' at 0x1061d96d0>,
 <Element 'estuary' at 0x1061d9790>]

In [16]:
next(doc.iter('river')).items()

[('country', 'IS'), ('id', 'river-Thjorsa')]

In [17]:
next(doc.iter('country')).keys()

['memberships', 'area', 'car_code', 'capital']

In [18]:
# map each car_code (as used in the 'country' attribute of rivers, lakes and airports) to a country name
countrycode = {}
for country in doc.iter('country'):
    countrycode[country.get('car_code')] = country.findtext('name')

# name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [19]:
longestriver = 0
name = ''
for river in doc.iter('river'):
    length = river.findtext('length')
    if length is not None:
        length = float(length)
        if length > longestriver:
            country = ''
            longestriver = length
            name = river.findtext('name')
            # the 'country' attribute holds space-separated car codes
            for code in river.get('country').split():
                country = country + " " + countrycode[code]
print "Longest river is ", name, "and it is found in ", country, longestriver

Longest river is  Amazonas and it is found in   Colombia Brazil Peru 6448.0


In [20]:
list(next(doc.iter('lake')))


[<Element 'name' at 0x1064d3110>,
 <Element 'located' at 0x1064d3150>,
 <Element 'to' at 0x1064d3190>,
 <Element 'area' at 0x1064d31d0>,
 <Element 'latitude' at 0x1064d3210>,
 <Element 'longitude' at 0x1064d3250>,
 <Element 'elevation' at 0x1064d3290>,
 <Element 'depth' at 0x1064d32d0>]

In [21]:
next(doc.iter('lake')).items()

[('country', 'SF'), ('id', 'lake-Inarisee')]

In [22]:
next(doc.iter('lake')).findtext('area')

'1040'

In [23]:
biggestlake = 0
name = ''
for lake in doc.iter('lake'):
    area = lake.findtext('area')
    if area is not None:
        area = float(area)
        if area > biggestlake:
            country = ''
            biggestlake = area
            name = lake.findtext('name')
            # the 'country' attribute holds space-separated car codes
            for code in lake.get('country').split():
                country = country + " " + countrycode[code]
print "Biggest lake is ", name, "and it is found in ", country, biggestlake

Biggest lake is  Caspian Sea and it is found in   Russia Azerbaijan Kazakhstan Iran Turkmenistan 386400.0


In [24]:
next(doc.iter('airport')).items()

[('city', 'cty-Afghanistan-2'), ('iatacode', 'HEA'), ('country', 'AFG')]

In [25]:
list(next(doc.iter('airport')))


[<Element 'name' at 0x10691c7d0>,
 <Element 'latitude' at 0x10691c810>,
 <Element 'longitude' at 0x10691c850>,
 <Element 'elevation' at 0x10691c890>,
 <Element 'gmtOffset' at 0x10691c8d0>]

In [26]:
highestairport = 0
name = ''
for airport in doc.iter('airport'):
    elevation = airport.findtext('elevation')
    # skip airports whose elevation is missing (None) or empty ('')
    if elevation:
        elevation = float(elevation)
        if elevation > highestairport:
            country = ''
            highestairport = elevation
            name = airport.findtext('name')
            # the 'country' attribute holds space-separated car codes
            for code in airport.get('country').split():
                country = country + " " + countrycode[code]
print "Airport with highest elevation is ", name, "and it is found in ", country, highestairport

Airport with highest elevation is  El Alto Intl and it is found in   Bolivia 4063.0
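
The river, lake and airport searches share one pattern: read a numeric child element, skip missing or empty values, keep the maximum, and translate the `country` attribute into names. A possible refactoring into a single helper is sketched below. The function name `largest`, its parameters and the sample data are all hypothetical. Starting from `None` instead of `0` also means a negative maximum (for example, elevations below sea level) would still be found.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: one river has no length, one airport has an empty elevation.
SAMPLE = """<mondial>
  <country car_code="P"><name>Ploonia</name></country>
  <country car_code="Q"><name>Quorra</name></country>
  <river country="P Q"><name>Long</name><length>900</length></river>
  <river country="P"><name>Short</name><length>120</length></river>
  <river country="Q"><name>Unknown</name></river>
  <airport country="Q"><name>Empty</name><elevation/></airport>
</mondial>"""


def largest(root, tag, field, countrycode):
    """Return (name, [country names], value) for the element with the largest field."""
    best = None
    for element in root.iter(tag):
        text = element.findtext(field)
        if not text:  # None (element missing) or '' (element empty)
            continue
        value = float(text)
        if best is None or value > best[2]:
            codes = element.get('country', '').split()
            countries = [countrycode[code] for code in codes]
            best = (element.findtext('name'), countries, value)
    return best


root = ET.fromstring(SAMPLE)
countrycode = {c.get('car_code'): c.findtext('name') for c in root.iter('country')}

print(largest(root, 'river', 'length', countrycode))
print(largest(root, 'airport', 'elevation', countrycode))
```

The second call returns `None` because the only airport in the sample has an empty `elevation` element.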
