# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [2]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
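
As a minimal sketch of the traversal calls used in this notebook, the snippet below parses a small inline document instead of the Mondial file. The XML string, its place names and the `car_code` values are made up for illustration; only `fromstring`, `find`, `findall`, `iter` and `attrib` come from ElementTree.

```python
from __future__ import print_function  # lets the same code run on Python 2.7 and 3
from xml.etree import ElementTree as ET

# Hypothetical toy document, NOT taken from the Mondial data set.
SAMPLE_XML = """
<mondial>
  <country car_code="X">
    <name>Xland</name>
    <city><name>Xville</name></city>
    <province>
      <name>North</name>
      <city><name>Xport</name></city>
    </province>
  </country>
  <country car_code="Y">
    <name>Yland</name>
    <city><name>Ytown</name></city>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE_XML)

# find() returns the first matching direct child, or None if there is none
first_country = root.find('country')
first_name = first_country.find('name').text

# findall() returns every matching direct child
direct_cities = [c.find('name').text for c in first_country.findall('city')]

# iter() walks the whole subtree, so it also reaches cities nested in a province
all_cities = [c.find('name').text for c in first_country.iter('city')]

# attributes are exposed through the dict-like .attrib mapping
car_codes = [country.attrib['car_code'] for country in root.iter('country')]

print(first_name, direct_cities, all_cities, car_codes)
```

The difference between `findall('city')` and `iter('city')` is the reason the city listing further down needs a subtree walk rather than a lookup of direct children.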

In [3]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [4]:
ET.phone_home() # ok, come on, I really had to. You can't call it ET and expect me not to try to do that...

AttributeError: 'module' object has no attribute 'phone_home'

In [5]:
# print names of all countries
for child in document_tree.getroot():
    print child.find('name').text

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [6]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print '* ' + element.find('name').text + ':',
    # collect the name of every <city> nested anywhere under this country
    city_names = [city.find('name').text for city in element.iter('city')]
    print ', '.join(city_names)

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [7]:
document = ET.parse( './data/mondial_database.xml' )

In [8]:
mondial_root = document.getroot()

In [11]:
# get all the elements that correspond to countries
countries = [child for child in mondial_root if child.tag == 'country'] # get country list




In [12]:
# get infant mortality rates, skipping countries that do not report one
i_mortality = [(country.find('name').text, float(country.find('infant_mortality').text))
               for country in countries if country.find('infant_mortality') is not None]

In [13]:
i_mortality[:10] # take a look at 10 (not yet sorted)

[('Albania', 13.19),
 ('Greece', 4.78),
 ('Macedonia', 7.9),
 ('Serbia', 6.16),
 ('Andorra', 3.69),
 ('France', 3.31),
 ('Spain', 3.33),
 ('Austria', 4.16),
 ('Czech Republic', 2.63),
 ('Germany', 3.46)]

In [14]:
i_mortality.sort(key=lambda x: x[1]) # now sort

In [16]:
i_mortality[:10]

[('Monaco', 1.81),
 ('Japan', 2.13),
 ('Norway', 2.48),
 ('Bermuda', 2.48),
 ('Singapore', 2.53),
 ('Sweden', 2.6),
 ('Czech Republic', 2.63),
 ('Hong Kong', 2.73),
 ('Macao', 3.13),
 ('Iceland', 3.15)]
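
The extract, filter and sort steps above can be tried on a few lines of inline XML. This is only a sketch: the countries and rates are invented placeholders, and `rate_or_none` is a hypothetical helper that does not appear in the notebook.

```python
from __future__ import print_function
from xml.etree import ElementTree as ET

# Invented placeholder data; country "C" deliberately has no <infant_mortality>.
SAMPLE_XML = """
<mondial>
  <country><name>A</name><infant_mortality>7.5</infant_mortality></country>
  <country><name>B</name><infant_mortality>2.25</infant_mortality></country>
  <country><name>C</name></country>
  <country><name>D</name><infant_mortality>4.0</infant_mortality></country>
</mondial>
"""

def rate_or_none(country):
    """Return the infant mortality rate as a float, or None when it is missing."""
    node = country.find('infant_mortality')
    # test with `is not None`: find() returns None for a missing child
    return float(node.text) if node is not None else None

root = ET.fromstring(SAMPLE_XML)
rates = [(c.find('name').text, rate_or_none(c)) for c in root.findall('country')]
rates = [(name, rate) for name, rate in rates if rate is not None]

# sorted() returns a new list and leaves `rates` in document order
lowest = sorted(rates, key=lambda pair: pair[1])[:2]
print(lowest)
```

Using `sorted` instead of `list.sort` keeps the unsorted list available, which is handy when the same data feeds several rankings.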

In [17]:
# another way, because I already miss the nice formatting of pandas
import pandas as pd
i_mortality_df = pd.DataFrame(i_mortality, columns=['Country', 'Infant Mortality Rate'])

In [18]:
i_mortality_df.head(10)

| | Country | Infant Mortality Rate |
|---|---|---|
| 0 | Monaco | 1.81 |
| 1 | Japan | 2.13 |
| 2 | Norway | 2.48 |
| 3 | Bermuda | 2.48 |
| 4 | Singapore | 2.53 |
| 5 | Sweden | 2.6 |
| 6 | Czech Republic | 2.63 |
| 7 | Hong Kong | 2.73 |
| 8 | Macao | 3.13 |
| 9 | Iceland | 3.15 |


In [21]:
# let's load all the ethnic info. Get it in tuples of the form:
#  (country name, country pop, list of ethnic group xml elements)
ethnic_info = [(country.find('name').text,
                # last <population> entry, taken here as the latest estimate
                float(country.findall('population')[-1].text),
                country.findall('ethnicgroup'))
               for country in countries]

In [22]:
# more transformation: convert elements. Functional programming rules!
# concept: the previous returned (country, population, ethnic group XML element objects)
# What we need to do is go through each entry, and transform the ethnic group element objects
# per country into tuples labeled by country and population (possibly redundantly) 
ethnic_info_df = pd.DataFrame(reduce(lambda x,y: x+y,\
       map(lambda tup: map(lambda elem: (tup[0],tup[1], elem.text, float(elem.attrib['percentage'])), tup[2]),\
           ethnic_info)),\
             columns=['Country', 'Population', 'Ethnic Group', 'Percentage of Population'])


In [24]:
ethnic_info_df.head(10) # yay, I got it into a data frame!

| | Country | Population | Ethnic Group | Percentage of Population |
|---|---|---|---|---|
| 0 | Albania | 2800138 | Albanian | 95.0 |
| 1 | Albania | 2800138 | Greek | 3.0 |
| 2 | Greece | 10816286 | Greek | 93.0 |
| 3 | Macedonia | 2059794 | Macedonian | 64.2 |
| 4 | Macedonia | 2059794 | Albanian | 25.2 |
| 5 | Macedonia | 2059794 | Turkish | 3.9 |
| 6 | Macedonia | 2059794 | Gypsy | 2.7 |
| 7 | Macedonia | 2059794 | Serb | 1.8 |
| 8 | Serbia | 7120666 | Serb | 82.9 |
| 9 | Serbia | 7120666 | Montenegrin | 0.9 |
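
For readers who find the nested `map`/`reduce` call hard to follow, the same flattening can be written as a nested list comprehension, which also runs unchanged on Python 3 (where `reduce` is no longer a builtin). This is a sketch on invented data; `SAMPLE_XML` and `rows` are hypothetical names, and taking the last `<population>` entry as the latest estimate is an assumption carried over from the cell above.

```python
from __future__ import print_function
from xml.etree import ElementTree as ET

# Invented toy data, not taken from the Mondial file.
SAMPLE_XML = """
<mondial>
  <country>
    <name>A</name>
    <population year="2001">900</population>
    <population year="2011">1000</population>
    <ethnicgroup percentage="60">Red</ethnicgroup>
    <ethnicgroup percentage="40">Blue</ethnicgroup>
  </country>
  <country>
    <name>B</name>
    <population year="2011">500</population>
    <ethnicgroup percentage="100">Blue</ethnicgroup>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE_XML)

# One output row per (country, ethnic group) pair; the outer loop comes first.
rows = [
    (country.find('name').text,
     float(country.findall('population')[-1].text),  # last entry, assumed latest
     group.text,
     float(group.attrib['percentage']))
    for country in root.findall('country')
    for group in country.findall('ethnicgroup')
]
print(rows)
```

A list of tuples in this shape can be passed to `pd.DataFrame(rows, columns=[...])` in the same way as the result of the `reduce` call.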


In [25]:
# Now set the index: first level is the ethnic group,
# second level is the country
edf = ethnic_info_df.set_index(['Ethnic Group','Country'])
edf.head(7)

| Ethnic Group | Country | Population | Percentage of Population |
|---|---|---|---|
| Albanian | Albania | 2800138 | 95.0 |
| Greek | Albania | 2800138 | 3.0 |
| Greek | Greece | 10816286 | 93.0 |
| Macedonian | Macedonia | 2059794 | 64.2 |
| Albanian | Macedonia | 2059794 | 25.2 |
| Turkish | Macedonia | 2059794 | 3.9 |
| Gypsy | Macedonia | 2059794 | 2.7 |


In [26]:
# assign the population based on the proportion
edf['Group Population']= edf['Population']*edf['Percentage of Population']/100.0

In [27]:
edf.head() # rows are indexed by (ethnic group, country) and now carry the group population; they are not aggregated yet

| Ethnic Group | Country | Population | Percentage of Population | Group Population |
|---|---|---|---|---|
| Albanian | Albania | 2800138 | 95.0 | 2660131.1 |
| Greek | Albania | 2800138 | 3.0 | 84004.14 |
| Greek | Greece | 10816286 | 93.0 | 10059145.98 |
| Macedonian | Macedonia | 2059794 | 64.2 | 1322387.748 |
| Albanian | Macedonia | 2059794 | 25.2 | 519068.088 |


In [28]:
# now use groupby to compute the sum
# first reset index to allow groupby
# then drop the now irrelevant columns (including the non-numeric Country)
# group by ethnic group, then sum, and sort on population, and finally take the head
edf.reset_index()\
.drop(['Country','Population','Percentage of Population'],axis=1)\
.groupby('Ethnic Group')\
.sum()\
.sort_values(by='Group Population',ascending=False)\
.head(10)

| Ethnic Group | Group Population |
|---|---|
| Han Chinese | 1245059000.0 |
| Indo-Aryan | 871815600.0 |
| European | 494872200.0 |
| African | 318325100.0 |
| Dravidian | 302713700.0 |
| Mestizo | 157734400.0 |
| Bengali | 146776900.0 |
| Russian | 131857000.0 |
| Japanese | 126534200.0 |
| Malay | 121993600.0 |
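
As a cross-check of the arithmetic behind this ranking, the same sum can be computed without pandas. Each group's head count in a country is population × percentage / 100, and those counts are accumulated per group across countries. The rows below are invented, and `group_totals` and `ranked` are hypothetical names.

```python
from __future__ import print_function
from collections import defaultdict

# (country, population, ethnic group, percentage): invented rows
rows = [
    ('A', 1000.0, 'Red', 60.0),
    ('A', 1000.0, 'Blue', 40.0),
    ('B', 500.0, 'Blue', 100.0),
]

# Accumulate each group's estimated head count over all countries.
group_totals = defaultdict(float)
for country, population, group, percentage in rows:
    group_totals[group] += population * percentage / 100.0

# Largest group first, mirroring the descending sort in the pandas version.
ranked = sorted(group_totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Here "Blue" totals 400 from country A plus 500 from country B, so it ranks ahead of "Red" at 600 even though "Red" is the larger group inside A.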
