# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET
from lxml import etree
import pandas as pd
import numpy as np

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
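
As a minimal sketch of the traversal methods used in this notebook, the snippet below parses a small inline document and contrasts `find`, `findall`, and `iter`. The sample XML is a made-up miniature that only imitates the Mondial layout; it is not taken from the data file.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document; the real Mondial file is much larger.
SAMPLE = """<mondial>
  <country car_code="AL">
    <name>Albania</name>
    <city><name>Tirana</name></city>
    <province>
      <name>Shkoder</name>
      <city><name>Shkoder</name></city>
    </province>
  </country>
  <country car_code="AND">
    <name>Andorra</name>
    <city><name>Andorra la Vella</name></city>
  </country>
</mondial>"""
root = ET.fromstring(SAMPLE)

# find() returns the first matching direct child (or None if there is none).
first_country = root.find("country")
first_name = first_country.find("name").text

# findall() returns every matching direct child, but does not descend further.
direct_cities = [c.find("name").text for c in first_country.findall("city")]

# iter() walks the whole subtree in document order, so nested cities are included.
all_cities = [c.find("name").text for c in first_country.iter("city")]

print(first_name, direct_cities, all_cities)
```

The difference between `findall` and `iter` matters whenever elements of interest may be nested at more than one depth.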

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [4]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':')
    # Element.getiterator() was removed in Python 3.9; iter() is its replacement
    city_names = [city.find('name').text for city in element.iter('city')]
    print(', '.join(city_names))

* Albania:
Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:
Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:
Skopje, Kumanovo
* Serbia:
Beograd, Novi Sad, Niš
* Montenegro:
Podgorica
* Kosovo:
Prishtine
* Andorra:
Andorra la Vella


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

Let's first try to understand the structure of the tree, at least for the elements needed to answer the four questions above: country name, infant mortality rate, population, ethnic group, river, lake, and airport.

In [45]:
document = ET.parse('./data/mondial_database.xml')
print("Root is: ", document.getroot().tag)
print("Children are: ", {child.tag for child in document.getroot()})

Root is:  mondial
Children are:  {'organization', 'continent', 'island', 'airport', 'country', 'lake', 'desert', 'river', 'mountain', 'sea'}


It looks like the country information is stored in the "country" element. Let's explore that in detail.

In [67]:
from xml.dom import minidom
# getiterator() was removed in Python 3.9; take the first element from iter() instead
minidom.parseString(ET.tostring(next(document.iter("country")), 'utf-8')).toprettyxml(indent="\t")

'<?xml version="1.0" ?>\n<country area="28750" capital="cty-Albania-Tirane" car_code="AL" memberships="org-BSEC org-CEI org-CD org-SELEC org-CE org-EAPC org-EBRD org-EITI org-FAO org-IPU org-IAEA org-IBRD org-ICC org-ICAO org-ICCt org-Interpol org-IDA org-IFRCS org-IFC org-IFAD org-ILO org-IMO org-IMF org-IOC org-IOM org-ISO org-OIF org-ITU org-ITUC org-IDB org-MIGA org-NATO org-OSCE org-OPCW org-OAS org-OIC org-PCA org-UN org-UNCTAD org-UNESCO org-UNIDO org-UPU org-WCO org-WFTU org-WHO org-WIPO org-WMO org-UNWTO org-WTO">\n\t\n      \n\t<name>Albania</name>\n\t\n      \n\t<population measured="est." year="1950">1214489</population>\n\t\n      \n\t<population measured="est." year="1960">1618829</population>\n\t\n      \n\t<population measured="est." year="1970">2138966</population>\n\t\n      \n\t<population measured="est." year="1980">2734776</population>\n\t\n      \n\t<population measured="est." year="1990">3446882</population>\n\t\n      \n\t<population year="1997">3249136</populat

The (truncated) output above shows the contents of the first country element. It holds the name of the country, its population in several years, infant mortality, ethnic group percentages, city names and city populations (again for multiple years), and more. These sub-elements are what we need to answer questions 1 through 3, so let us address those three questions first.
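
Since the pretty-printed dump is long, a compact alternative is to count the child tags of one element. The following is a sketch only: the two population values are copied from the output above, while the `infant_mortality` and `ethnicgroup` values are placeholders chosen for illustration.

```python
from collections import Counter
from xml.etree import ElementTree as ET

# Abbreviated stand-in for one country element.
# The infant_mortality and ethnicgroup values are placeholders, not real data.
SAMPLE_COUNTRY = """<country car_code="AL">
  <name>Albania</name>
  <population measured="est." year="1990">3446882</population>
  <population year="1997">3249136</population>
  <infant_mortality>0.0</infant_mortality>
  <ethnicgroup percentage="100">Placeholder</ethnicgroup>
</country>"""

country = ET.fromstring(SAMPLE_COUNTRY)

# Iterating over an Element yields its direct children.
tag_counts = Counter(child.tag for child in country)

for tag, count in tag_counts.items():
    print(tag, count)
```

Applied to a real country element, the same `Counter` shows at a glance which sub-elements repeat (such as "population") and which occur once.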

## 10 countries with the lowest infant mortality rates

In [68]:
#1
contlist = []
imlist = []
for ce in document.iterfind('country'): # Loop through all "country" elements
    contlist.append(ce.find("name").text) # store country name
    ime = ce.find("infant_mortality") # find "infant_mortality" sub-elements
    if ime is None: # some countries do not have the record
        imlist.append(np.nan)
    else:
        imlist.append(ime.text)
# create a data frame
df1 =  pd.DataFrame({"Country": contlist, "Infant Mortality":  imlist})
df1["Infant Mortality"] = df1["Infant Mortality"].astype(float)
df1 = df1.sort_values("Infant Mortality").reset_index(drop=True)
df1["Rank"] = np.arange(1, len(df1)+1)
df1[["Rank", "Country", "Infant Mortality"]].head(10)

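|   | Rank | Country | Infant Mortality |
|---|------|---------|------------------|
| 0 | 1 | Monaco | 1.81 |
| 1 | 2 | Japan | 2.13 |
| 2 | 3 | Bermuda | 2.48 |
| 3 | 4 | Norway | 2.48 |
| 4 | 5 | Singapore | 2.53 |
| 5 | 6 | Sweden | 2.6 |
| 6 | 7 | Czech Republic | 2.63 |
| 7 | 8 | Hong Kong | 2.73 |
| 8 | 9 | Macao | 3.13 |
| 9 | 10 | Iceland | 3.15 |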


## 10 cities with the largest population

Since populations are recorded for multiple years for cities, we will first store city names and their population at all given years.

In [95]:
#2
citlist = []
poplist = []
yrlist = []
for ce in document.iter("city"):
    # cities without any population record contribute no rows
    for pe in ce.iter("population"):
        citlist.append(ce.find("name").text)
        poplist.append(pe.text)
        yrlist.append(pe.get("year"))

df2 =  pd.DataFrame({"City": citlist, "Population":  poplist, "Year": yrlist})
df2.head()

|   | City | Population | Year |
|---|------|------------|------|
| 0 | Tirana | 192000 | 1987 |
| 1 | Tirana | 244153 | 1990 |
| 2 | Tirana | 418495 | 2011 |
| 3 | Shkodër | 62000 | 1987 |
| 4 | Shkodër | 77075 | 2011 |


We can create two ranking tables: one using each city's population in its latest recorded year, and another using each city's maximum recorded population.

The first table ranks cities by their latest recorded population:

In [96]:
df2["Population"] = df2["Population"].fillna(0).astype(int)
df2["Year"] = df2["Year"].fillna(0).astype(int)
df21 = df2.sort_values(["City", "Year"], ascending=False).groupby("City").head(1)
df21 = df21.rename(columns = {'Year':'Latest Year'})
df21 = df21.sort_values("Population", ascending=False).reset_index(drop=True)
df21["Rank"] = np.arange(1, len(df21)+1)
df21[["Rank", "City", "Population", "Latest Year"]].head(10)

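|   | Rank | City | Population | Latest Year |
|---|------|------|------------|-------------|
| 0 | 1 | Shanghai | 22315474 | 2010 |
| 1 | 2 | Istanbul | 13710512 | 2012 |
| 2 | 3 | Mumbai | 12442373 | 2011 |
| 3 | 4 | Moskva | 11979529 | 2013 |
| 4 | 5 | Beijing | 11716620 | 2010 |
| 5 | 6 | São Paulo | 11152344 | 2010 |
| 6 | 7 | Tianjin | 11090314 | 2010 |
| 7 | 8 | Guangzhou | 11071424 | 2010 |
| 8 | 9 | Delhi | 11034555 | 2011 |
| 9 | 10 | Shenzhen | 10358381 | 2010 |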


Now let's build the other table, which uses each city's maximum recorded population across all available years.

In [100]:
df22 = df2.sort_values(["City", "Population"], ascending=False).groupby("City").head(1)
df22 = df22.rename(columns = {'Population':'Maximum Population', 'Year': 'Year of Maximum Population'})
df22 = df22.sort_values("Maximum Population", ascending=False).reset_index(drop=True)
df22["Rank"] = np.arange(1, len(df22)+1)
df22[["Rank", "City", "Maximum Population", "Year of Maximum Population"]].head(10)

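|   | Rank | City | Maximum Population | Year of Maximum Population |
|---|------|------|--------------------|----------------------------|
| 0 | 1 | Shanghai | 22315474 | 2010 |
| 1 | 2 | Istanbul | 13710512 | 2012 |
| 2 | 3 | Delhi | 12877470 | 2001 |
| 3 | 4 | Mumbai | 12442373 | 2011 |
| 4 | 5 | Moskva | 11979529 | 2013 |
| 5 | 6 | Beijing | 11716620 | 2010 |
| 6 | 7 | São Paulo | 11152344 | 2010 |
| 7 | 8 | Tianjin | 11090314 | 2010 |
| 8 | 9 | Guangzhou | 11071424 | 2010 |
| 9 | 10 | Shenzhen | 10358381 | 2010 |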


The two tables are very similar; the only difference is Delhi's position. Delhi's recorded population is lower in 2011 than in 2001, whereas every other city in the top ten has its maximum in its latest recorded year.
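
The difference between the two selection rules can be sketched without pandas. The records below are copied from the outputs above (Tirana and Delhi only); the dictionary names `latest` and `largest` are illustrative and do not appear elsewhere in this notebook.

```python
# (city, year, population) records taken from the tables above.
records = [
    ("Tirana", 1987, 192000),
    ("Tirana", 1990, 244153),
    ("Tirana", 2011, 418495),
    ("Delhi", 2001, 12877470),
    ("Delhi", 2011, 11034555),
]

latest = {}   # city -> (year, population) for the most recent year
largest = {}  # city -> (year, population) for the highest population

for city, year, population in records:
    if city not in latest or year > latest[city][0]:
        latest[city] = (year, population)
    if city not in largest or population > largest[city][1]:
        largest[city] = (year, population)

for city in latest:
    print(city, "latest:", latest[city], "largest:", largest[city])
```

For Tirana both rules pick the 2011 record, while for Delhi they disagree, which is exactly the shift seen between the two rankings.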

## 10 ethnic groups with the largest overall populations

We know the percentages of the various ethnic groups in each country, and we also know each country's latest recorded population. From these two pieces of information, we can calculate the population of every ethnic group in every country, and then sum the populations of each ethnic group over all countries.
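
As a small worked example of this calculation, the sketch below uses an entirely hypothetical two-country document (countries A and B, groups X and Y, round numbers) so that the arithmetic is easy to check by hand.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical data: names, years, and numbers are invented for illustration.
SAMPLE = """<mondial>
  <country>
    <name>A</name>
    <population year="2000">1000</population>
    <population year="2010">2000</population>
    <ethnicgroup percentage="75">X</ethnicgroup>
    <ethnicgroup percentage="25">Y</ethnicgroup>
  </country>
  <country>
    <name>B</name>
    <population year="2011">400</population>
    <ethnicgroup percentage="50">X</ethnicgroup>
  </country>
</mondial>"""

totals = defaultdict(float)
for country in ET.fromstring(SAMPLE).findall("country"):
    # Pick the population element with the most recent year.
    latest = max(country.findall("population"), key=lambda p: int(p.get("year")))
    population = float(latest.text)
    for group in country.findall("ethnicgroup"):
        # Group size in this country = percentage of the latest population.
        totals[group.text] += float(group.get("percentage")) * population / 100

print(dict(totals))
```

Group X gets 75% of 2000 from country A plus 50% of 400 from country B, and group Y gets 25% of 2000 from country A.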

In [109]:
ethname = []
ethpop = []
for x in document.findall("country"):
    pl = []
    yr = []
    for pr in x.findall("population"):
        pl.append(float(pr.text))
        yr.append(int(pr.get("year")))
    pop = pl[yr.index(max(yr))] # Population at latest year
    for eth in x.findall("ethnicgroup"):
        ethname.append(eth.text)
        ethpop.append(float(eth.get("percentage"))*pop/100)
df3 = pd.DataFrame({"Ethnic Group": ethname, "Population": ethpop})
df3["Population"] = df3["Population"].astype(int)
df3 = df3.groupby("Ethnic Group")["Population"].sum().reset_index()
df3 = df3.sort_values("Population", ascending=False).reset_index(drop=True)
df3["Rank"] = np.arange(1, len(df3)+1)
df3[["Rank", "Ethnic Group", "Population"]].head(10)

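|   | Rank | Ethnic Group | Population |
|---|------|--------------|------------|
| 0 | 1 | Han Chinese | 1245058800 |
| 1 | 2 | Indo-Aryan | 871815583 |
| 2 | 3 | European | 494872201 |
| 3 | 4 | African | 318325104 |
| 4 | 5 | Dravidian | 302713744 |
| 5 | 6 | Mestizo | 157734349 |
| 6 | 7 | Bengali | 146776916 |
| 7 | 8 | Russian | 131856989 |
| 8 | 9 | Japanese | 126534212 |
| 9 | 10 | Malay | 121993548 |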


## Name and country of the longest river

For the final question, let us explore the river, lake, and airport elements, starting with rivers.

In [110]:
# getiterator() was removed in Python 3.9; take the first element from iter() instead
minidom.parseString(ET.tostring(next(document.iter("river")), 'utf-8')).toprettyxml(indent="\t")

'<?xml version="1.0" ?>\n<river country="IS" id="river-Thjorsa">\n\t\n      \n\t<name>Thjorsa</name>\n\t\n      \n\t<to water="sea-Atlantic" watertype="sea"/>\n\t\n      \n\t<area>7530</area>\n\t\n      \n\t<length>230</length>\n\t\n      \n\t<source country="IS">\n\t\t\n         \n\t\t<latitude>65</latitude>\n\t\t\n         \n\t\t<longitude>-18</longitude>\n\t\t\n      \n\t</source>\n\t\n      \n\t<estuary country="IS">\n\t\t\n         \n\t\t<latitude>63.9</latitude>\n\t\t\n         \n\t\t<longitude>-20.8</longitude>\n\t\t\n      \n\t</estuary>\n\t\n   \n</river>\n'

This is the first river element in the tree. It contains the river's name and length, and a "country" attribute that appears to hold the codes of the countries through which the river flows.

In [111]:
#4a
rivername = []
riverlength = []
country = []
for rie in document.findall("river"):
    rivername.append(rie.find("name").text)
    country.append(rie.get("country"))
    rle = rie.find("length")
    if rle is None:
        riverlength.append(np.nan)
    else:
        riverlength.append(rle.text)
df_river = pd.DataFrame({"River": rivername, "Country": country, "Length": riverlength})
df_river["Length"] = df_river["Length"].fillna(0).astype(float)
df_river = df_river.sort_values("Length", ascending=False).reset_index(drop=True)
print("Longest river (based on available data) is", df_river.River[0], "with length = ", 
      df_river.Length[0], "located in", df_river["Country"][0])

Longest river (based on available data) is Amazonas with length =  6448.0 located in CO BR PE


There is an issue with the country here: it is reported as codes rather than names. We first need to map these codes to country names, which we can do using the "car_code" attribute of each country element.

In [136]:
contlist = []
codelist = []
for ce in document.iterfind('country'): # Loop through all "country" elements
    contlist.append(ce.find("name").text) # store country name
    codelist.append(ce.get("car_code")) # store country code
country_map = pd.DataFrame({"Code": codelist, "Country": contlist}) 
country_map = country_map.set_index("Code")

In [144]:
list(country_map.loc[df_river["Country"][0].split(sep=" ")].Country)

['Colombia', 'Brazil', 'Peru']

In [145]:
print("Longest river (based on available data) is", df_river.River[0], "with length = ", 
      df_river.Length[0], "located in", list(country_map.loc[df_river["Country"][0].split(sep=" ")].Country))

Longest river (based on available data) is Amazonas with length =  6448.0 located in ['Colombia', 'Brazil', 'Peru']
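
The same lookup can also be sketched with a plain dictionary instead of a data frame. The three code-to-name pairs below come from the output above; the helper name `expand_codes` is hypothetical, and falling back to the raw code for unknown entries is an assumption made for this sketch.

```python
# Mapping taken from the output above; a full map would cover every country.
code_to_name = {"CO": "Colombia", "BR": "Brazil", "PE": "Peru"}

def expand_codes(codes, mapping):
    """Turn a space-separated string of country codes into a list of names.

    Unknown codes are returned unchanged (an assumption for this sketch).
    """
    return [mapping.get(code, code) for code in codes.split()]

print(expand_codes("CO BR PE", code_to_name))
```

A dictionary lookup with `get` avoids the `KeyError` that an index-based lookup would raise if a code were missing from the map.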


## Name and country of the largest lake

Similarly, we can find information about lakes.

In [146]:
# getiterator() was removed in Python 3.9; take the first element from iter() instead
minidom.parseString(ET.tostring(next(document.iter("lake")), 'utf-8')).toprettyxml(indent="\t")

'<?xml version="1.0" ?>\n<lake country="SF" id="lake-Inarisee">\n\t\n      \n\t<name>Inari</name>\n\t\n      \n\t<located country="SF" province="lteil-LAP-SF"/>\n\t\n      \n\t<to water="river-Paatsjoki" watertype="river"/>\n\t\n      \n\t<area>1040</area>\n\t\n      \n\t<latitude>68.95</latitude>\n\t\n      \n\t<longitude>27.7</longitude>\n\t\n      \n\t<elevation>119</elevation>\n\t\n      \n\t<depth>92</depth>\n\t\n   \n</lake>\n'

Knowing the area of all lakes, we can rank them based on largest area.

In [147]:
#4b
lakename = []
lakearea = []
country = []
for le in document.findall("lake"):
    lakename.append(le.find("name").text)
    country.append(le.get("country"))
    lae = le.find("area")
    if lae is None:
        lakearea.append(np.nan)
    else:
        lakearea.append(lae.text)
df_lake = pd.DataFrame({"Lake": lakename, "Country": country, "Area": lakearea})
df_lake["Area"] = df_lake["Area"].fillna(0).astype(float)
df_lake = df_lake.sort_values("Area", ascending=False).reset_index(drop=True)
print("Largest lake (based on available data) is", df_lake.Lake[0], "with area = ", 
      df_lake.Area[0], "located in", list(country_map.loc[df_lake["Country"][0].split(sep=" ")].Country))

Largest lake (based on available data) is Caspian Sea with area =  386400.0 located in ['Russia', 'Azerbaijan', 'Kazakhstan', 'Iran', 'Turkmenistan']


This is interesting because, despite its name, the Caspian Sea is recorded in this dataset as a lake. For more information (and fun), read [this](http://www.thenational.ae/business/industry-insights/energy/lake-or-sea-a-tricky-question-for-the-caspian).

## Name and country of the highest airport

Finally, let's find the highest airport in the world. Let's first quickly check the structure of the airport element.

In [148]:
# getiterator() was removed in Python 3.9; take the first element from iter() instead
minidom.parseString(ET.tostring(next(document.iter("airport")), 'utf-8')).toprettyxml(indent="\t")

'<?xml version="1.0" ?>\n<airport city="cty-Afghanistan-2" country="AFG" iatacode="HEA">\n\t\n      \n\t<name>Herat</name>\n\t\n      \n\t<latitude>34.210017</latitude>\n\t\n      \n\t<longitude>62.2283</longitude>\n\t\n      \n\t<elevation>977</elevation>\n\t\n      \n\t<gmtOffset>5</gmtOffset>\n\t\n   \n</airport>\n'

We can focus on airport name and elevation here.

In [149]:
#4c
airportname = []
airportelev = []
country = []
for ae in document.findall("airport"):
    airportname.append(ae.find("name").text)
    country.append(ae.get("country"))
    aee = ae.find("elevation")
    if aee is None:
        airportelev.append(np.nan)
    else:
        airportelev.append(aee.text)
df_airport = pd.DataFrame({"Airport": airportname, "Country": country, "Elevation": airportelev})
df_airport["Elevation"] = df_airport["Elevation"].fillna(0).astype(float)
df_airport = df_airport.sort_values("Elevation", ascending=False).reset_index(drop=True)
print("Highest elevated airport (based on available data) is", df_airport.Airport[0], "with elevation = ", 
      df_airport.Elevation[0], "located in", country_map.loc[df_airport.Country[0], "Country"])

Highest elevated airport (based on available data) is El Alto Intl with elevation =  4063.0 located in Bolivia
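
The river, lake, and airport cells all follow the same pattern: collect a name, a country attribute, and one numeric sub-element, then take the maximum. As a sketch, that pattern could be folded into one helper. The function name `largest_feature` is hypothetical, the first river, lake, and airport entries are copied from the element dumps above, and the remaining two rivers are invented to exercise the comparison and the missing-value case.

```python
from xml.etree import ElementTree as ET

# Thjorsa, Inari, and Herat come from the dumps above; the other rivers are invented.
SAMPLE = """<mondial>
  <river country="IS"><name>Thjorsa</name><length>230</length></river>
  <river country="XX"><name>Hypothetical short river</name><length>100</length></river>
  <river country="XX"><name>Hypothetical unmeasured river</name></river>
  <lake country="SF"><name>Inari</name><area>1040</area></lake>
  <airport country="AFG"><name>Herat</name><elevation>977</elevation></airport>
</mondial>"""

def largest_feature(root, tag, measure):
    """Return (name, value, country_codes) for the element with the largest measure.

    Elements lacking the measure are skipped; returns None if nothing qualifies.
    """
    best = None
    for element in root.findall(tag):
        text = element.findtext(measure)
        if text is None:
            continue  # no measurement recorded for this element
        value = float(text)
        if best is None or value > best[1]:
            best = (element.findtext("name"), value, element.get("country", "").split())
    return best

root = ET.fromstring(SAMPLE)
for tag, measure in [("river", "length"), ("lake", "area"), ("airport", "elevation")]:
    print(tag, largest_feature(root, tag, measure))
```

Skipping elements without a measurement, rather than filling them with 0, keeps missing values from ever being mistaken for real ones.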
