# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
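
As a minimal sketch of the access patterns used in the cells below, the following parses a small inline document rather than the Mondial file. The XML snippet, its country and city names, and the variable names are illustrative assumptions and are not taken from the data set.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document; the real Mondial file is much larger.
sample_xml = """
<mondial>
  <country car_code="X1">
    <name>Exampleland</name>
    <province>
      <city><name>Alpha</name></city>
      <city><name>Beta</name></city>
    </province>
  </country>
  <country car_code="X2">
    <name>Samplestan</name>
    <city><name>Gamma</name></city>
  </country>
</mondial>
"""

root = ET.fromstring(sample_xml)

# find() returns the first matching direct child (or None if there is none).
first_country_name = root.find('country').find('name').text

# findall() returns all matching direct children.
country_names = [c.find('name').text for c in root.findall('country')]

# iter() walks the whole subtree, so it also reaches cities nested in provinces.
cities_by_country = {
    c.find('name').text: [city.find('name').text for city in c.iter('city')]
    for c in root.findall('country')
}

print(first_country_name)
print(country_names)
print(cities_by_country)
```

The distinction matters for cities: `findall('city')` on a country would only see cities that are direct children, while `iter('city')` also finds those listed under a province.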

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print( child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [4]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':')
    # iter() replaces getiterator(), which was removed in Python 3.9
    city_names = [city.find('name').text for city in element.iter('city')]
    print(', '.join(city_names))

* Albania:
Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:
Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:
Skopje, Kumanovo
* Serbia:
Beograd, Novi Sad, Niš
* Montenegro:
Podgorica
* Kosovo:
Prishtine
* Andorra:
Andorra la Vella


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

First, import libraries and read the xml file in.

In [5]:
import xml.etree.ElementTree as ET
import numpy as np
import pandas as pd

# read in the 'tree'
tree = ET.parse('data/mondial_database.xml')

# (1) Find the 10 countries with the lowest infant mortality rates
My approach is to loop through all countries and get the infant mortality rates (some countries do not have a value). Then I put the results in a dataframe and sort to find the 10 lowest rates.
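
The same idea can be sketched without pandas on a small inline sample. This is only an illustration of handling the missing values: the country names and rates below are made up, and `findtext()` is used in place of the `find()` plus `.text` pattern from the notebook.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample; names and rates are made up for illustration.
sample_xml = """
<mondial>
  <country><name>Aland</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>Bland</name></country>
  <country><name>Cland</name><infant_mortality>2.1</infant_mortality></country>
  <country><name>Dland</name><infant_mortality>10.0</infant_mortality></country>
</mondial>
"""
root = ET.fromstring(sample_xml)

rates = []    # (country, rate) pairs for countries that report a value
missing = []  # countries without an <infant_mortality> element

for country in root.findall('country'):
    name = country.findtext('name')
    # findtext() returns None when the child element is absent
    value = country.findtext('infant_mortality')
    if value is None:
        missing.append(name)
    else:
        rates.append((name, float(value)))

# Sort ascending by rate and keep the lowest two (the exercise asks for 10).
lowest = sorted(rates, key=lambda pair: pair[1])[:2]
print(lowest)
print(missing)
```

Keeping the missing countries in a separate list makes it easy to report them, which the notebook does with `isnull()` on the dataframe.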

In [6]:
# Make empty lists to store results in
country_list  = []
inf_mort_list = []

# loop through each country
for country in tree.findall('country'):
    # get country name and add to list
    this_country = country.find('name').text
    country_list.append(this_country)
    # check if there is an 'infant_mortality' element for this country
    inf_mort_node = country.find('infant_mortality')
    # find() returns None if the element is missing; store NaN in that case
    if inf_mort_node is None:
        inf_mort = np.nan
    else:
        # if it exists, convert the text to a number
        inf_mort = float(inf_mort_node.text)
    # add to list
    inf_mort_list.append(inf_mort)

# make into a dataframe and sort to find 10 countries w/ lowest rates
df = pd.DataFrame({'country':country_list,'infant_mort':inf_mort_list})
df.sort_values('infant_mort').head(10)

Unnamed: 0,country,infant_mort
38,Monaco,1.81
98,Japan,2.13
117,Bermuda,2.48
36,Norway,2.48
106,Singapore,2.53
37,Sweden,2.6
10,Czech Republic,2.63
78,Hong Kong,2.73
79,Macao,3.13
44,Iceland,3.15


Some countries have no infant mortality value; they are listed below for reference.

In [7]:
# list countries where there was no infant mortality given
df[df.infant_mort.isnull()]

Unnamed: 0,country,infant_mort
4,Montenegro,
5,Kosovo,
41,Holy See,
42,Ceuta,
43,Melilla,
52,Svalbard,
82,Christmas Island,
83,Cocos Islands,
137,Curacao,
139,Saint Martin,


# (2) Find the 10 cities with the largest population
First I loop through all countries and collect one row per city population estimate, with the country, province, city, year, and population. Some cities do not have a province listed.

In [8]:
population_list = []

for country in tree.findall('country'):
    this_country = country.find('name').text

    # iter() walks the whole subtree, so cities nested inside provinces are included
    # (iter() replaces getiterator(), which was removed in Python 3.9)
    for city in country.iter('city'):
        city_name = city.find('name').text
        # not every city has a 'province' attribute
        prov = city.attrib.get('province', 'None')

        # one row per population estimate; each estimate carries a 'year' attribute
        for pop in city.iter('population'):
            population_list.append([this_country, prov, city_name, int(pop.attrib['year']), int(pop.text)])

pop_df = pd.DataFrame.from_records(population_list, columns=['country','province','city','year','population'])
pop_df.head()

Unnamed: 0,country,province,city,year,population
0,Albania,,Tirana,1987,192000
1,Albania,,Tirana,1990,244153
2,Albania,,Tirana,2011,418495
3,Albania,,Shkodër,1987,62000
4,Albania,,Shkodër,2011,77075


We have population estimates for different years for each city. I then get the latest year for each city, and extract those rows into a new dataframe.
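
The "latest estimate per city" step can be sketched in plain Python as follows. This is not the pandas approach used in the notebook, and the records are hypothetical. The sketch keys on (country, province, city) rather than on the city name alone, on the assumption that two different cities can share a name.

```python
# Hypothetical (country, province, city, year, population) records.
records = [
    ('Aland', 'prov-A-1', 'Springfield', 2001, 1000),
    ('Aland', 'prov-A-1', 'Springfield', 2011, 1500),
    ('Bland', 'prov-B-2', 'Springfield', 1995, 9000),
    ('Bland', 'prov-B-2', 'Harbor', 2010, 400),
]

# Keep only the most recent (year, population) pair for each distinct city.
latest = {}
for country, province, city, year, population in records:
    key = (country, province, city)
    if key not in latest or year > latest[key][0]:
        latest[key] = (year, population)

# Rank cities by their latest population, largest first.
largest_first = sorted(latest.items(), key=lambda item: item[1][1], reverse=True)
for (country, province, city), (year, population) in largest_first:
    print(country, city, year, population)
```

With this key, the two cities named Springfield stay separate, each with its own latest estimate.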

In [9]:
# Find the latest year with a population estimate for each city.
# Note: grouping by city name alone merges different cities that share a name.
latest_year = pop_df.groupby('city').year.max()

# Make a new dataframe with just the latest-year row(s) for each city.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
pop_latest = pd.concat(
    [pop_df[(pop_df.city == city) & (pop_df.year == year)] for city, year in latest_year.items()]
)
pop_latest.head()

Unnamed: 0,country,province,city,year,population
2335,Netherlands,prov-Netherlands-12,'s-Hertogenbosch,2014.0,143822.0
409,Spain,prov-Spain-13,A Coruña,2011.0,245053.0
719,Germany,prov-Germany-11,Aachen,2011.0,236420.0
2755,Denmark,prov-DK-1,Aalborg,2012.0,104885.0
8841,Nigeria,prov-WAN-1,Aba,1991.0,500183.0


Finally, we can sort this dataframe and find the top 10 cities.

In [10]:
pop_latest.sort_values('population',ascending=False).head(10)

Unnamed: 0,country,province,city,year,population
3750,China,prov-China-32,Shanghai,2010.0,22315474.0
2607,Turkey,prov-Turkey-38,Istanbul,2012.0,13710512.0
4303,India,prov-India-14,Mumbai,2011.0,12442373.0
1546,Russia,prov-Russia-19,Moskva,2013.0,11979529.0
3746,China,prov-China-31,Beijing,2010.0,11716620.0
8208,Brazil,prov-Brazil-25,São Paulo,2010.0,11152344.0
3754,China,prov-China-33,Tianjin,2010.0,11090314.0
3364,China,prov-China-5,Guangzhou,2010.0,11071424.0
4399,India,prov-India-32,Delhi,2011.0,11034555.0
3371,China,prov-China-5,Shenzhen,2010.0,10358381.0


# (3) Find the 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
I use an approach similar to the one for question 1. I loop over the countries and collect each listed ethnic group. The ethnic group sizes are given as percentages, so I multiply each percentage by the country's total population (from the most recent year) to get the group's population.
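
A worked sketch of the percentage arithmetic on a tiny inline sample is given here. All names and numbers are hypothetical; the element and attribute names (`population`, `year`, `ethnicgroup`, `percentage`) follow those used in the notebook code.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical sample with two countries sharing one ethnic group.
sample_xml = """
<mondial>
  <country>
    <name>Aland</name>
    <population year="2001">900</population>
    <population year="2011">1000</population>
    <ethnicgroup percentage="80">Alpha</ethnicgroup>
    <ethnicgroup percentage="20">Beta</ethnicgroup>
  </country>
  <country>
    <name>Bland</name>
    <population year="2010">500</population>
    <ethnicgroup percentage="50">Beta</ethnicgroup>
  </country>
</mondial>
"""
root = ET.fromstring(sample_xml)

totals = defaultdict(float)
for country in root.findall('country'):
    pops = country.findall('population')
    if not pops:
        continue  # no total population, so percentages cannot be converted
    # pick the estimate with the highest 'year' attribute
    latest = max(pops, key=lambda p: int(p.attrib['year']))
    total_pop = float(latest.text)
    for eth in country.iter('ethnicgroup'):
        # percentage * population / 100 gives the estimated head count
        totals[eth.text] += float(eth.attrib['percentage']) * total_pop / 100

ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Here Beta is counted in both countries (200 in Aland plus 250 in Bland), which is the cross-country sum the question asks for.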

In [11]:

eth_list = []

for country in tree.findall('country'):
    #print('\n')
    this_country = country.find('name').text
    #print('country: ' + this_country)
    
    # find the population from the most recent year
    latest_year = 0
    the_pop = np.nan
    for popel in country.findall('population'):
        this_year = int(popel.attrib['year'])
        # keep the estimate from the most recent year seen so far
        if this_year > latest_year:
            latest_year = this_year
            the_pop = float(popel.text)

    # convert each group's percentage into a head count
    for eth in country.iter('ethnicgroup'):
        eth_list.append([this_country, eth.text, float(eth.attrib['percentage']) * the_pop / 100])
        
eth_df = pd.DataFrame.from_records(eth_list,columns=['country','ethnic_group','population'])
eth_df.head()

Unnamed: 0,country,ethnic_group,population
0,Albania,Albanian,2660131.0
1,Albania,Greek,84004.14
2,Greece,Greek,10059150.0
3,Macedonia,Macedonian,1322388.0
4,Macedonia,Albanian,519068.1


Finally, I group by ethnic group, sum populations (some groups appear in multiple countries), and sort.

In [12]:
eth_df.groupby('ethnic_group').population.sum().sort_values(ascending=False).head(10)

ethnic_group
Han Chinese    1.245059e+09
Indo-Aryan     8.718156e+08
European       4.948722e+08
African        3.183251e+08
Dravidian      3.027137e+08
Mestizo        1.577344e+08
Bengali        1.467769e+08
Russian        1.318570e+08
Japanese       1.265342e+08
Malay          1.219936e+08
Name: population, dtype: float64

# (4) Find the name and country of a) longest river, b) largest lake and c) airport at highest elevation

Rivers, lakes, and airports list their countries as codes rather than names. So first I'll make a dictionary that maps country codes to country names, so we can give the full country names in the answers.
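
As a small sketch of how such a dictionary is used, the snippet below maps a space-separated code string to names. The three entries are a hand-written subset of the mapping, and `codes_to_names` is a hypothetical helper, not part of the notebook.

```python
# Hand-written subset of the car_code -> name mapping, for illustration only.
country_dict = {'CO': 'Colombia', 'BR': 'Brazil', 'PE': 'Peru'}


def codes_to_names(codes, mapping):
    """Hypothetical helper: turn 'CO BR PE' into a list of country names.

    Unknown codes are returned unchanged instead of raising KeyError.
    """
    return [mapping.get(code, code) for code in codes.split()]


names = codes_to_names('CO BR PE', country_dict)
print(names)
```

Falling back to the raw code for unknown entries is a design choice for this sketch; a plain `mapping[code]` lookup would instead fail loudly on a code that is missing from the dictionary.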

In [13]:
country_dict = {}
for country in tree.findall('country'):
    country_dict[country.attrib['car_code']] = country.find('name').text

country_dict

{'A': 'Austria',
 'AFG': 'Afghanistan',
 'AG': 'Antigua and Barbuda',
 'AL': 'Albania',
 'AMSA': 'American Samoa',
 'AND': 'Andorra',
 'ANG': 'Angola',
 'ARM': 'Armenia',
 'ARU': 'Aruba',
 'AUS': 'Australia',
 'AXA': 'Anguilla',
 'AZ': 'Azerbaijan',
 'B': 'Belgium',
 'BD': 'Bangladesh',
 'BDS': 'Barbados',
 'BEN': 'Benin',
 'BERM': 'Bermuda',
 'BF': 'Burkina Faso',
 'BG': 'Bulgaria',
 'BHT': 'Bhutan',
 'BI': 'Burundi',
 'BIH': 'Bosnia and Herzegovina',
 'BOL': 'Bolivia',
 'BR': 'Brazil',
 'BRN': 'Bahrain',
 'BRU': 'Brunei',
 'BS': 'Bahamas',
 'BVIR': 'British Virgin Islands',
 'BY': 'Belarus',
 'BZ': 'Belize',
 'C': 'Cuba',
 'CAM': 'Cameroon',
 'CAYM': 'Cayman Islands',
 'CDN': 'Canada',
 'CEU': 'Ceuta',
 'CH': 'Switzerland',
 'CI': 'Cote dIvoire',
 'CL': 'Sri Lanka',
 'CN': 'China',
 'CO': 'Colombia',
 'COCO': 'Cocos Islands',
 'COM': 'Comoros',
 'COOK': 'Cook Islands',
 'CR': 'Costa Rica',
 'CUR': 'Curacao',
 'CV': 'Cape Verde',
 'CY': 'Cyprus',
 'CZ': 'Czech Republic',
 'D': 'German

## (a) Longest River

In [14]:
river_list = []

for river in tree.findall('river'):
    #print('\n')
    river_name = river.find('name').text
    #print('river name : ' + river_name)
    country = river.attrib['country']
    #print('country : ' + country)
    # not every river has a 'length' element; use NaN when it is missing
    length_node = river.find('length')
    if length_node is not None:
        river_length = float(length_node.text)
    else:
        river_length = np.nan

    river_list.append([country, river_name, river_length])
    
river_df = pd.DataFrame.from_records(river_list,columns=['country','river','length'])
longest_river = river_df.sort_values('length',ascending=False).head(1)
longest_river

Unnamed: 0,country,river,length
174,CO BR PE,Amazonas,6448.0


Print out the full country names:

In [15]:
countries = longest_river.country.str.split().tolist()[0]
for country in countries:
    print(country_dict[country])

Colombia
Brazil
Peru


## (b) Largest lake

In [16]:
lake_list = []

for lake in tree.findall('lake'):
    #print('\n')
    lake_name = lake.find('name').text
    #print('lake name : ' + lake_name)
    country = lake.attrib['country']
    #print('country : ' + country)
    # not every lake has an 'area' element; use NaN when it is missing
    area_node = lake.find('area')
    if area_node is not None:
        lake_area = float(area_node.text)
    else:
        lake_area = np.nan

    lake_list.append([country, lake_name, lake_area])
    
lake_df = pd.DataFrame.from_records(lake_list,columns=['country','lake','area'])
largest_lake = lake_df.sort_values('area',ascending=False).head(1)
largest_lake

Unnamed: 0,country,lake,area
54,R AZ KAZ IR TM,Caspian Sea,386400.0


In [17]:
countries = largest_lake.country.str.split().tolist()[0]
for country in countries:
    print(country_dict[country])

Russia
Azerbaijan
Kazakhstan
Iran
Turkmenistan


## (c) Highest Elevation Airport

In [18]:
airport_list = []

for airport in tree.findall('airport'):
    #print('\n')    
    airport_name = airport.find('name').text
    #print('airport name : ' + airport_name)
    country = airport.attrib['country']
    #print('country : ' + country)
    # guard against a missing 'elevation' element as well as an empty one
    elev_node = airport.find('elevation')
    if elev_node is not None and elev_node.text is not None:
        airport_elev = float(elev_node.text)
    else:
        airport_elev = np.nan

    airport_list.append([country, airport_name, airport_elev])
    
airport_df = pd.DataFrame.from_records(airport_list,columns=['country','airport','elev'])
highest_airport = airport_df.sort_values('elev',ascending=False).head(1)
highest_airport

Unnamed: 0,country,airport,elev
80,BOL,El Alto Intl,4063.0


In [19]:
countries = highest_airport.country.str.split().tolist()[0]
for country in countries:
    print(country_dict[country])

Bolivia
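
The river, lake, and airport searches above follow the same pattern, so they could be factored into one helper. The following is a hedged sketch on a made-up inline sample: `largest_feature` is a hypothetical function, and the feature names and values are illustrative only.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample; names, codes, and values are made up.
sample_xml = """
<mondial>
  <river country="X1 X2"><name>Long River</name><length>1200</length></river>
  <river country="X1"><name>Short River</name><length>300</length></river>
  <river country="X2"><name>Unmeasured River</name></river>
  <lake country="X1"><name>Big Lake</name><area>50.5</area></lake>
</mondial>
"""


def largest_feature(root, tag, measure):
    """Return (name, country_codes, value) for the <tag> element with the
    largest numeric <measure> child, or None if no element has a value."""
    best = None
    for node in root.findall(tag):
        text = node.findtext(measure)
        # skip elements where the measure is absent or empty
        if text is None or not text.strip():
            continue
        value = float(text)
        if best is None or value > best[2]:
            best = (node.findtext('name'), node.attrib['country'].split(), value)
    return best


root = ET.fromstring(sample_xml)
longest_river = largest_feature(root, 'river', 'length')
largest_lake = largest_feature(root, 'lake', 'area')
print(longest_river)
print(largest_lake)
```

On the real file, calls such as `largest_feature(tree.getroot(), 'airport', 'elevation')` would be the analogous usage, assuming every feature carries a `country` attribute as in the cells above.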
