# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [11]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html

In [12]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [17]:
document_tree.getroot()[0].find('name').text

'Albania'
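
The same access pattern can be tried without the data file. The sketch below parses a small inline XML string; `sample_xml` is hypothetical sample data shaped like the Mondial file, not the real dataset, and the variable names are only illustrative.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data shaped like the Mondial file (not the real dataset)
sample_xml = """
<mondial>
  <country car_code="AL"><name>Albania</name></country>
  <country car_code="GR"><name>Greece</name></country>
</mondial>
"""

sample_root = ET.fromstring(sample_xml)

# An element behaves like a list of its children, so [0] is the first <country>
first_country = sample_root[0]
first_name = first_country.find('name').text   # text of the first <name> child
first_code = first_country.get('car_code')     # attribute lookup by name

print(first_name, first_code)
```

Indexing is positional, so `[0]` only means "first country" as long as the first child of the root really is a `country` element.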

In [14]:
# print names of all countries
for child in document_tree.getroot():
    print (child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [9]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':')
    cities_string = ''
    # iter() replaces getiterator(), which was removed in Python 3.9
    for subelement in element.iter('city'):
        cities_string += subelement.find('name').text + ', '
    print(cities_string[:-2])

* Albania:
Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:
Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:
Skopje, Kumanovo
* Serbia:
Beograd, Novi Sad, Niš
* Montenegro:
Podgorica
* Kosovo:
Prishtine
* Andorra:
Andorra la Vella
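
The loop above walks the whole subtree of each country, which matters when cities are nested below other elements. As a rough sketch of the difference between a direct-child search and a subtree walk, the snippet below uses hypothetical inline data (the nesting under `province` is an assumption made for illustration).

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data: one city directly under <country>, one nested deeper
sample_xml = """
<mondial>
  <country>
    <name>Albania</name>
    <city><name>Tirana</name></city>
    <province>
      <name>Elbasan</name>
      <city><name>Elbasan</name></city>
    </province>
  </country>
</mondial>
"""

sample_root = ET.fromstring(sample_xml)
country = sample_root.find('country')

# findall('city') sees only direct children of <country>
direct_cities = [c.find('name').text for c in country.findall('city')]

# iter('city') walks the whole subtree in document order
all_cities = [c.find('name').text for c in country.iter('city')]

print(', '.join(direct_cities))
print(', '.join(all_cities))
```

Building a list and calling `', '.join(...)` also avoids trimming a trailing separator by hand.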


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find:

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [37]:
document = ET.parse( './data/mondial_database.xml' )
root = document.getroot()
root.attrib

{}

### 1. 10 countries with the lowest infant mortality rates
I couldn't find a way to sort using ElementTree alone, though that may not be the point of the exercise. Instead, I load each country's rate into a dict keyed by country name and sort its items by value to get the countries with the lowest infant mortality rates.

In [43]:
t = {}
for country in root.findall('country'):
    if ET.iselement(country.find('infant_mortality')):
        name = country.find('name').text
        inf_mort = country.find('infant_mortality').text
        t[name] = float(inf_mort)

t_sorted = sorted(t.items(), key=lambda x: x[1])
t_sorted[:10]

[('Monaco', 1.81),
 ('Japan', 2.13),
 ('Norway', 2.48),
 ('Bermuda', 2.48),
 ('Singapore', 2.53),
 ('Sweden', 2.6),
 ('Czech Republic', 2.63),
 ('Hong Kong', 2.73),
 ('Macao', 3.13),
 ('Iceland', 3.15)]
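
ElementTree has no sorting of its own, but elements are ordinary Python objects, so they can be sorted directly with a key function. The sketch below shows this on hypothetical inline data; the country names and rates are made up for illustration.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data; country B has no infant_mortality element
sample_xml = """
<mondial>
  <country><name>A</name><infant_mortality>13.19</infant_mortality></country>
  <country><name>B</name></country>
  <country><name>C</name><infant_mortality>2.13</infant_mortality></country>
  <country><name>D</name><infant_mortality>4.78</infant_mortality></country>
</mondial>
"""

sample_root = ET.fromstring(sample_xml)

# findtext() returns None when the child is missing, so those countries are skipped
rated = [c for c in sample_root.findall('country')
         if c.findtext('infant_mortality') is not None]

# Sort the elements themselves by the numeric value of the child text
rated.sort(key=lambda c: float(c.findtext('infant_mortality')))

lowest = [(c.findtext('name'), float(c.findtext('infant_mortality')))
          for c in rated[:2]]
print(lowest)
```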

### 2. 10 cities with the largest population

In [75]:
t = {}
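# Note: 'country/city' matches only <city> elements directly under <country>;
# any cities nested deeper (for example inside <province>) would need './/city'
for city in root.iterfind('country/city'):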
    p = {}
    if ET.iselement(city.find('population')):
        
        # Collect all population elements into a dict to sort
        for pop_element in city.findall('population'):
            year = pop_element.attrib['year']
            population = pop_element.text
            p[int(year)] = int(population)
        
        # Sort descending by year, extract first value
        p_sorted = sorted(p.items(), key=lambda x:x[0], reverse=True)
        city_name = city.find('name').text
        latest_popl = p_sorted[0][1]
        t[city_name] = int(latest_popl)

# Sort descending by population, print first 10
t_sorted = sorted(t.items(), key=lambda x:x[1], reverse=True)
t_sorted[:10]

[('Seoul', 9708483),
 ('Al Qahirah', 8471859),
 ('Bangkok', 7506700),
 ('Hong Kong', 7055071),
 ('Ho Chi Minh', 5968384),
 ('Singapore', 5076700),
 ('Al Iskandariyah', 4123869),
 ('New Taipei', 3939305),
 ('Busan', 3403135),
 ('Pyongyang', 3255288)]

#### Avoiding the second for loop
After eyeballing the XML input file, I think it's safe to assume that the population elements appear in ascending order of year, so I only need to pick the last population element for each city.

In [79]:
t = {}
for city in root.iterfind('country/city'):
    if ET.iselement(city.find('population')):
        
        #Get last population element
        population = city.findall('population')[-1].text
        city_name = city.find('name').text
        t[city_name] = int(population)

# Sort descending by population, print first 10
t_sorted = sorted(t.items(), key=lambda x:x[1], reverse=True)
t_sorted[:10]

[('Seoul', 9708483),
 ('Al Qahirah', 8471859),
 ('Bangkok', 7506700),
 ('Hong Kong', 7055071),
 ('Ho Chi Minh', 5968384),
 ('Singapore', 5076700),
 ('Al Iskandariyah', 4123869),
 ('New Taipei', 3939305),
 ('Busan', 3403135),
 ('Pyongyang', 3255288)]
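
If the ordering assumption is in doubt, the latest figure can be picked by the `year` attribute without a second loop. This is a minimal sketch on hypothetical inline data; it also uses `'.//city'` so that nested cities are visited, and collects tuples rather than a dict so that two cities with the same name would not overwrite each other.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data: years out of order, one nested city, one city without data
sample_xml = """
<mondial>
  <country>
    <city>
      <name>Alpha</name>
      <population year="2011">500</population>
      <population year="2001">400</population>
    </city>
    <province>
      <city>
        <name>Beta</name>
        <population year="1990">900</population>
      </city>
    </province>
    <city><name>Gamma</name></city>
  </country>
</mondial>
"""

sample_root = ET.fromstring(sample_xml)

latest = []
# './/city' also matches cities nested inside other elements such as <province>
for city in sample_root.iterfind('.//city'):
    pops = city.findall('population')
    if not pops:
        continue  # no population recorded for this city
    # Pick the element with the highest year, regardless of document order
    newest = max(pops, key=lambda p: int(p.get('year')))
    latest.append((city.findtext('name'), int(newest.text)))

# Sort descending by population
latest.sort(key=lambda item: item[1], reverse=True)
print(latest)
```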

### 3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
Now it gets interesting. Each country element could have multiple 'population' elements and multiple 'ethnicgroup' elements. Each 'ethnicgroup' element has a 'percentage' attribute. For instance, this is an example element:
    
    <country car_code="AL"
            area="28750">
        <name>Albania</name>
        <population year="2001" measured="census">3069275</population>
        <population year="2011" measured="census">2800138</population>
        ...
        <ethnicgroup percentage="95">Albanian</ethnicgroup>
        <ethnicgroup percentage="3">Greek</ethnicgroup>
    </country>
    
This means that for each country we parse, we need the population of each ethnic group, computed as the country's latest population multiplied by the group's percentage. I'm going to use a pandas DataFrame here because it will be useful for other calculations too.

In [105]:
import pandas as pd    # needed for pd.set_option below
from pandas import DataFrame

t = []
for country in root.findall('country'):
    if ET.iselement(country.find('ethnicgroup')):
        name = country.find('name').text
        population = int(country.findall('population')[-1].text)
        
        # Calculate population for each ethnic group
        for ethnicgroup in country.findall('ethnicgroup'):
            eg_name = ethnicgroup.text
            eg_population = float(population) * float(ethnicgroup.attrib['percentage']) / 100
            
            t.append({'country': name, 'ethnicgroup': eg_name, 'population': eg_population})

df = DataFrame(t)
pd.set_option('display.float_format', lambda x: '%.3f' % x)    # Setting display format of float to avoid scientific format
df.groupby('ethnicgroup')['population'].sum().sort_values(ascending=False)[:10]

ethnicgroup
Han Chinese   1245058800.000
Indo-Aryan     871815583.440
European       494872219.720
African        318325120.369
Dravidian      302713744.250
Mestizo        157734354.937
Bengali        146776916.720
Russian        131856996.077
Japanese       126534212.000
Malay          121993550.374
Name: population, dtype: float64
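
The same aggregation can be sketched with the standard library alone, which makes the arithmetic easy to check by hand. The data below is hypothetical: country X has 2000 people in its latest estimate (75% Red, 25% Blue) and country Y has 400 (50% Blue), so Red should total 1500 and Blue 500 + 200 = 700.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical sample data with made-up group names and figures
sample_xml = """
<mondial>
  <country>
    <name>X</name>
    <population year="2001">1000</population>
    <population year="2011">2000</population>
    <ethnicgroup percentage="75">Red</ethnicgroup>
    <ethnicgroup percentage="25">Blue</ethnicgroup>
  </country>
  <country>
    <name>Y</name>
    <population year="2010">400</population>
    <ethnicgroup percentage="50">Blue</ethnicgroup>
  </country>
</mondial>
"""

sample_root = ET.fromstring(sample_xml)

totals = defaultdict(float)
for country in sample_root.findall('country'):
    pops = country.findall('population')
    if not pops:
        continue  # cannot estimate group sizes without a population
    # Latest estimate chosen by year attribute rather than by position
    newest = max(pops, key=lambda p: int(p.get('year')))
    population = int(newest.text)
    for group in country.findall('ethnicgroup'):
        share = float(group.get('percentage')) / 100
        totals[group.text] += population * share

# Sort descending by summed population
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```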

### 4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [146]:
# Longest river

# Defining a function to get country names from country codes
def get_country_from_code(country_codes):
    countries = ''
    for country_code in country_codes:
        country = root.find('country[@car_code="{}"]'.format(country_code))
        if ET.iselement(country):
            countries += country.find('name').text + ','
    return countries.rstrip(',')

r = []
for river in root.findall('river'):
    if ET.iselement(river.find('length')):
        name = river.find('name').text
        country_codes = river.attrib['country'].split()
        length = float(river.find('length').text)

        countries = get_country_from_code(country_codes)
        r.append({'name': name, 'countries': countries, 'length': length})
        
riversdf = DataFrame(r).sort_values(by='length',ascending=False)[:10]
riversdf

|  | countries | length | name |
| --- | --- | --- | --- |
| 174 | Colombia,Brazil,Peru | 6448.0 | Amazonas |
| 137 | China | 6380.0 | Jangtse |
| 136 | China | 4845.0 | Hwangho |
| 123 | Russia | 4400.0 | Lena |
| 201 | Congo,Zaire | 4374.0 | Zaire |
| 138 | China,Laos,Thailand,Cambodia,Vietnam | 4350.0 | Mekong |
| 115 | Russia,Kazakhstan,China | 4248.0 | Irtysch |
| 186 | Mali,Niger,Nigeria,Guinea | 4184.0 | Niger |
| 160 | United States | 4130.0 | Missouri |
| 119 | Russia | 4092.0 | Jenissej |
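
The helper `get_country_from_code` searches the tree once per country code. One possible alternative, sketched here on hypothetical inline data, is to build a code-to-name dict once and look codes up in it; `code_to_name` and `names_for_codes` are illustrative names, not part of the notebook.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data; the code "ZZ" deliberately matches no country
sample_xml = """
<mondial>
  <country car_code="R"><name>Russia</name></country>
  <country car_code="KAZ"><name>Kazakhstan</name></country>
  <river country="R KAZ"><name>Irtysch</name><length>4248</length></river>
  <river country="R ZZ"><name>Sample</name><length>10</length></river>
</mondial>
"""

sample_root = ET.fromstring(sample_xml)

# Build the car_code -> country name mapping once
code_to_name = {c.get('car_code'): c.findtext('name')
                for c in sample_root.findall('country')}

def names_for_codes(codes):
    """Return comma-separated country names; unknown codes are skipped."""
    return ','.join(code_to_name[code] for code in codes if code in code_to_name)

rivers = [(r.findtext('name'), names_for_codes(r.get('country').split()))
          for r in sample_root.findall('river')]
print(rivers)
```

Unknown codes are skipped silently here, matching the behavior of the original helper.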


In [145]:
# Largest lake

l = []
for lake in root.findall('lake'):
    if ET.iselement(lake.find('area')):
        name = lake.find('name').text
        country_codes = lake.attrib['country'].split()
        area = float(lake.find('area').text)
        
        countries = get_country_from_code(country_codes)
        l.append({'name': name, 'countries': countries, 'area': area})
        
lakesdf = DataFrame(l).sort_values(by='area',ascending=False)[:10]
lakesdf

|  | area | countries | name |
| --- | --- | --- | --- |
| 54 | 386400.0 | Russia,Azerbaijan,Kazakhstan,Iran,Turkmenistan | Caspian Sea |
| 107 | 82103.0 | Canada,United States | Lake Superior |
| 79 | 68870.0 | Tanzania,Kenya,Uganda | Lake Victoria |
| 104 | 59600.0 | Canada,United States | Lake Huron |
| 106 | 57800.0 | United States | Lake Michigan |
| 47 | 41650.0 | Israel,Jordan,West Bank | Dead Sea |
| 81 | 32893.0 | Zaire,Zambia,Burundi,Tanzania | Lake Tanganjika |
| 96 | 31792.0 | Canada | Great Bear Lake |
| 43 | 31492.0 | Russia | Ozero Baikal |
| 87 | 29600.0 | Malawi,Mozambique,Tanzania | Lake Malawi |


In [144]:
# Airport at highest elevation

a = []
for airport in root.findall('airport'):
    # 'and' short-circuits, so .text is only read when the element exists
    if ET.iselement(airport.find('elevation')) and airport.find('elevation').text is not None:
        name = airport.find('name').text
        country_codes = airport.attrib['country'].split()
        elevation = float(airport.find('elevation').text) 
        countries = get_country_from_code(country_codes)
        a.append({'name': name, 'countries': countries, 'elevation': elevation})
        
airportsdf = DataFrame(a).sort_values(by='elevation',ascending=False)[:10]
airportsdf


|  | countries | elevation | name |
| --- | --- | --- | --- |
| 80 | Bolivia | 4063.0 | El Alto Intl |
| 212 | China | 4005.0 | Lhasa-Gonggar |
| 230 | China | 3963.0 | Yushu Batang |
| 787 | Peru | 3827.0 | Juliaca |
| 789 | Peru | 3311.0 | Teniente Alejandro Velasco Astete Intl |
| 82 | Bolivia | 2905.0 | Juana Azurduy De Padilla |
| 308 | Ecuador | 2813.0 | Mariscal Sucre Intl |
| 779 | Peru | 2719.0 | Coronel Fap Alfredo Mendivil Duarte |
| 781 | Peru | 2677.0 | Mayor General FAP Armando Revoredo Iglesias Ai... |
| 666 | Mexico | 2581.0 | Licenciado Adolfo Lopez Mateos Intl |
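
The river, lake and airport cells share one shape: find elements by tag, read a numeric child, skip missing or empty values, and sort descending. As a rough sketch, that shape could be factored into one helper; `top_n` is a hypothetical name, and the inline data below is made up to cover the missing and empty cases.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data: one empty <elevation/> and one airport without it
sample_xml = """
<mondial>
  <airport country="BOL"><name>High</name><elevation>4063</elevation></airport>
  <airport country="PE"><name>Empty</name><elevation/></airport>
  <airport country="PE"><name>Missing</name></airport>
  <airport country="MEX"><name>Low</name><elevation>2581</elevation></airport>
</mondial>
"""

sample_root = ET.fromstring(sample_xml)

def top_n(root, tag, field, n=10):
    """Return up to n (name, value) pairs for `tag` elements, largest `field` first."""
    rows = []
    for element in root.findall(tag):
        text = element.findtext(field)
        # findtext() gives None for a missing child and '' for an empty one
        if text is None or not text.strip():
            continue
        rows.append((element.findtext('name'), float(text)))
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]

highest = top_n(sample_root, 'airport', 'elevation', n=2)
print(highest)
```

The same call with `'river'` and `'length'`, or `'lake'` and `'area'`, would cover the other two parts of the question under the same assumptions.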
