# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
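
As a quick orientation before the examples, here is a minimal sketch of the three lookups they rely on: `find` (first matching direct child), `findall` (all matching direct children), and `iter` (all matching descendants at any depth). The inline XML is a hypothetical sample that only mimics the layout of the Mondial file; the nesting shown is an assumption.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample; the nesting is assumed, not copied from the real file.
sample = """<mondial>
  <country car_code="SRB">
    <name>Serbia</name>
    <city><name>Beograd</name></city>
    <province>
      <name>Vojvodina</name>
      <city><name>Novi Sad</name></city>
    </province>
  </country>
  <country car_code="MNE">
    <name>Montenegro</name>
    <city><name>Podgorica</name></city>
  </country>
</mondial>"""
sample_root = ET.fromstring(sample)

serbia = sample_root.find('country')        # first direct child named 'country'
direct_cities = serbia.findall('city')      # direct children only
all_cities = list(serbia.iter('city'))      # descendants at any depth

print(serbia.find('name').text)
print([c.find('name').text for c in direct_cities])
print([c.find('name').text for c in all_cities])
```

The difference between `findall` and `iter` matters whenever cities can sit both directly under a country and inside a province.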

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [4]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':')
    # Element.iter() replaces getiterator(), which was removed in Python 3.9
    city_names = [city.find('name').text for city in element.iter('city')]
    print(', '.join(city_names))

* Albania:
Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:
Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:
Skopje, Kumanovo
* Serbia:
Beograd, Novi Sad, Niš
* Montenegro:
Podgorica
* Kosovo:
Prishtine
* Andorra:
Andorra la Vella


****
## XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [5]:
document = ET.parse( './data/mondial_database.xml' )

#### 10 countries with the lowest infant mortality rates

In [6]:
root = document.getroot()
infant_mortality_stats = []
infant_mortality_stats_unavailable = []

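for country in root.findall('country'):
    try:
        infant_mortality_stats.append((country.find('name').text, float(country.find('infant_mortality').text)))
    except (AttributeError, TypeError, ValueError):
        # no usable <infant_mortality> value for this country
        infant_mortality_stats_unavailable.append(country.find('name').text)
# reverse=True sorts from highest to lowest rate, so the slice below lists the
# 10 highest rates; sort with reverse=False to list the 10 lowest instead
infant_mortality_stats.sort(key=lambda x: x[1], reverse=True)

print('Top 10 Infant Mortality Rates\n')
for rank, stat in enumerate(infant_mortality_stats[:10], start=1):
    print("{0}) {1}: {2}".format(rank, stat[0], stat[-1]))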

Top 10 Infant Mortality Rates

1) Western Sahara: 145.82
2) Afghanistan: 117.23
3) Mali: 104.34
4) Somalia: 100.14
5) Central African Republic: 92.86
6) Guinea-Bissau: 90.92
7) Chad: 90.3
8) Niger: 86.27
9) Angola: 79.99
10) Burkina Faso: 76.8
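
Since the exercise asks for the lowest rates, the following is a minimal sketch of the ascending variant. It runs on a tiny inline sample with made-up country names and rates (none of them come from the Mondial data), and the names `sample_root`, `rates`, and `lowest` are illustrative only.

```python
from xml.etree import ElementTree as ET

# Made-up rates for illustration; these are not values from the Mondial data.
sample = """<mondial>
  <country><name>A</name><infant_mortality>12.5</infant_mortality></country>
  <country><name>B</name><infant_mortality>3.1</infant_mortality></country>
  <country><name>C</name></country>
  <country><name>D</name><infant_mortality>7.0</infant_mortality></country>
</mondial>"""
sample_root = ET.fromstring(sample)

rates = []
for country in sample_root.findall('country'):
    rate = country.findtext('infant_mortality')  # None when the tag is missing
    if rate is not None:
        rates.append((country.findtext('name'), float(rate)))

# An ascending sort puts the lowest rates first.
lowest = sorted(rates, key=lambda pair: pair[1])[:10]
for rank, (name, rate) in enumerate(lowest, start=1):
    print("{0}) {1}: {2}".format(rank, name, rate))
```

Checking for `None` instead of catching exceptions keeps countries without a rate out of the ranking without hiding unrelated errors.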


#### 10 cities with the largest population

In [7]:
city_popn = []
city_popn_unk = []

for country in root.findall('country'):
    # only cities nested under a <province> are visited here
    for province in country.findall('province'):
        for city in province.findall('city'):
            popn = [int(population.text) for population in city.findall('population')]
            if popn:
                # the last <population> entry is taken as the latest estimate
                city_popn.append((city.find('name').text, popn[-1]))
            else:
                city_popn_unk.append(city.find('name').text)
city_popn.sort(key=lambda x: x[1], reverse=True)

print('10 cities with the largest population')
for rank, pop in enumerate(city_popn[:10], start=1):
    print("{0}) {1}: {2}".format(rank, pop[0], pop[-1]))

10 cities with the largest population
1) Shanghai: 22315474
2) Istanbul: 13710512
3) Mumbai: 12442373
4) Moskva: 11979529
5) Beijing: 11716620
6) São Paulo: 11152344
7) Tianjin: 11090314
8) Guangzhou: 11071424
9) Delhi: 11034555
10) Shenzhen: 10358381
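
A possible refinement, sketched here under assumptions: visit every city regardless of nesting with `iter`, and pick the latest estimate by a `year` attribute rather than by position. The `year` attribute, the helper name `latest_population`, and all figures in the inline sample are assumptions for illustration, not facts taken from the data file.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample; the 'year' attribute and all figures are assumptions.
sample = """<mondial>
  <country>
    <name>X</name>
    <city>
      <name>Capital</name>
      <population year="2011">500</population>
      <population year="2001">400</population>
    </city>
    <province>
      <name>P</name>
      <city>
        <name>Port</name>
        <population year="2001">900</population>
      </city>
      <city><name>Village</name></city>
    </province>
  </country>
</mondial>"""
sample_root = ET.fromstring(sample)


def latest_population(city):
    """Return the count with the greatest year, or None if the city has none."""
    counts = city.findall('population')
    if not counts:
        return None
    newest = max(counts, key=lambda p: int(p.get('year', 0)))
    return int(newest.text)


city_sizes = []
for city in sample_root.iter('city'):  # matches cities at any depth
    size = latest_population(city)
    if size is not None:
        city_sizes.append((city.findtext('name'), size))
city_sizes.sort(key=lambda pair: pair[1], reverse=True)
print(city_sizes)
```

Selecting by year does not depend on the order in which the estimates appear in the file.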


#### 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)

In [349]:
# collect the distinct ethnic group names across all countries
ethnic_popn = set()

for country in root.findall('country'):
    for ethnic_group in country.findall('ethnicgroup'):
        ethnic_popn.add(ethnic_group.text)

In [350]:
ethnicity_popn_per_country = {}
for ethnicity in ethnic_popn:
    ethnicity_popn_per_country[ethnicity] = []

In [351]:
for country in root.findall('country'):
    # the last <population> entry is taken as the latest estimate
    country_popn = int(country.findall('population')[-1].text)
    for ethnicity in country.findall('ethnicgroup'):
        ethnicity_popn_per_country[ethnicity.text].append(float(ethnicity.attrib['percentage'])*.01*country_popn)

In [352]:
ethnic_totalpopn = []

for ethnicity, country_popn in ethnicity_popn_per_country.items():
    ethnic_totalpopn.append((ethnicity, sum(country_popn)))    
ethnic_totalpopn.sort(key=lambda x: x[1], reverse=True)

print('Top 10 Ethnicities By (Rounded) Population\n')
for rank, stat in enumerate(ethnic_totalpopn[:10], start=1):
    print("{0}) {1}: {2}".format(rank, stat[0], int(stat[1])))

Top 10 Ethnicities By (Rounded) Population

1) Han Chinese: 1245058800
2) Indo-Aryan: 871815583
3) European: 494872219
4) African: 318325120
5) Dravidian: 302713744
6) Mestizo: 157734354
7) Bengali: 146776916
8) Russian: 131856996
9) Japanese: 126534212
10) Malay: 121993550
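
The same aggregation can be done in a single pass with a `defaultdict`, which removes the need to collect the group names first. This is a minimal sketch on a hypothetical inline sample; the group names, percentages, and populations are made up.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical sample; names, percentages and populations are made up.
sample = """<mondial>
  <country>
    <name>X</name>
    <population>1000</population>
    <population>2000</population>
    <ethnicgroup percentage="50">Alpha</ethnicgroup>
    <ethnicgroup percentage="25">Beta</ethnicgroup>
  </country>
  <country>
    <name>Y</name>
    <population>400</population>
    <ethnicgroup percentage="100">Alpha</ethnicgroup>
  </country>
</mondial>"""
sample_root = ET.fromstring(sample)

totals = defaultdict(float)  # group name -> summed head count
for country in sample_root.findall('country'):
    # the last <population> entry is taken as the latest estimate
    latest = int(country.findall('population')[-1].text)
    for group in country.findall('ethnicgroup'):
        totals[group.text] += float(group.get('percentage')) * 0.01 * latest

ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for rank, (name, count) in enumerate(ranked[:10], start=1):
    print("{0}) {1}: {2}".format(rank, name, int(round(count))))
```

Rounding only at print time keeps the running sums exact up to floating-point error.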


#### name and country of a) longest river, b) largest lake and c) airport at highest elevation

#### a) longest river

In [192]:
countries = {}
for country in root.findall('country'):
    countries[country.attrib['car_code']]=country.find('name').text
# exploratory pass: collect the distinct river names, lengths and country codes
rivertest = set()
riverlength = set()
rivercountry = set()
for river in root.findall('river'):
    for rivername in river.findall('name'):
        rivertest.add(rivername.text)
    for riverlen in river.findall('length'):
        riverlength.add(riverlen.text)
    # 'country' is an attribute holding space-separated car codes, not a child tag
    rivercountry.update(river.get('country', '').split())

In [247]:
rivertest = []
for river in root.findall('river'):
    name = river.find('name').text
    try:
        length = int(river.find('length').text)
    except (AttributeError, TypeError, ValueError):
        # missing or non-integer length: use a sentinel that sorts last
        length = -999
    country = river.get('country')
    rivertest.append([name, length, country])

In [374]:
from operator import itemgetter
riversort = sorted(rivertest,key=itemgetter(1),reverse=True)
print(riversort[0])

['Amazonas', 6448, 'CO BR PE']


In [355]:
countries = {}
for country in root.findall('country'):
    countries[country.attrib['car_code']]=country.find('name').text
print(countries['CO'])

Colombia


In [281]:
# resolve the space-separated car codes of the longest river to country names
river_countries = [countries[code] for code in riversort[0][2].split()]
print('The longest river in the database is ' + str(riversort[0]) + ' and is located in the countries of ' + ', '.join(river_countries[:-1]) + ' and ' + river_countries[-1] + ' (note: the Nile may be longer, but the database has no length for it)')

The longest river in the database is ['Amazonas', 6448, 'CO BR PE'] and is located in the countries of Colombia, Brazil and Peru (note: the Nile may be longer, but the database has no length for it)


#### b) largest lake 

In [298]:
laketest = []
for lake in root.findall('lake'):
    name = lake.find('name').text
    try:
        area = int(lake.find('area').text)
    except (AttributeError, TypeError, ValueError):
        # missing or non-integer area: use a sentinel that sorts last
        area = -999
    country = lake.get('country')
    laketest.append([name, area, country])

In [300]:
lakesort = sorted(laketest,key=itemgetter(1),reverse=True)
print(lakesort[0])

['Caspian Sea', 386400, 'R AZ KAZ IR TM']


In [301]:
# resolve the space-separated car codes of the largest lake to country names
lake_countries = [countries[code] for code in lakesort[0][2].split()]
print('The largest lake in the database is the ' + str(lakesort[0]))
print('It is located in the countries of ' + ', '.join(lake_countries[:-1]) + ' and ' + lake_countries[-1])

The largest lake in the database is the ['Caspian Sea', 386400, 'R AZ KAZ IR TM']
It is located in the countries of Russia, Azerbaijan, Kazakhstan, Iran and Turkmenistan


#### c) airport at highest elevation

In [328]:
airporttest = []
for airport in root.findall('airport'):
    name = airport.find('name').text
    try:
        elevation = int(airport.find('elevation').text)
    except (AttributeError, TypeError, ValueError):
        # missing or non-integer elevation: use a sentinel that sorts last
        elevation = -999
    country = airport.get('country')
    airporttest.append([name, elevation, country])

In [330]:
airportsort = sorted(airporttest,key=itemgetter(1),reverse=True)
print(airportsort[0])

['El Alto Intl', 4063, 'BOL']


In [333]:
print('The airport (in the database) at the highest elevation is ' + str(airportsort[0]))
print('It is located in the country of ' + countries[airportsort[0][2]])

The airport (in the database) at the highest elevation is ['El Alto Intl', 4063, 'BOL']
It is located in the country of Bolivia
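
The river, lake, and airport cells repeat one pattern: scan a tag, read a numeric child, keep the maximum, and translate car codes to country names. A minimal sketch of that pattern as one helper follows. The helper name `largest`, the mapping name `code_to_name`, and the inline sample (including its country names and figures) are hypothetical.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample; countries, rivers and figures are made up.
sample = """<mondial>
  <country car_code="NN"><name>Northland</name></country>
  <country car_code="SS"><name>Southland</name></country>
  <river country="NN SS"><name>Long</name><length>900</length></river>
  <river country="NN"><name>Unmeasured</name></river>
  <river country="SS"><name>Short</name><length>120.5</length></river>
</mondial>"""
sample_root = ET.fromstring(sample)

code_to_name = {c.get('car_code'): c.findtext('name')
                for c in sample_root.findall('country')}


def largest(root, tag, measure, code_to_name):
    """Return (name, value, country names) for the `tag` element with the
    greatest `measure` child, or None if no element has that child."""
    best = None
    for element in root.findall(tag):
        text = element.findtext(measure)
        if text is None:
            continue  # skip unmeasured entries instead of using a sentinel
        value = float(text)
        if best is None or value > best[1]:
            codes = element.get('country', '').split()
            best = (element.findtext('name'), value, codes)
    if best is None:
        return None
    name, value, codes = best
    return name, value, [code_to_name.get(code, code) for code in codes]


result = largest(sample_root, 'river', 'length', code_to_name)
print(result)
```

Parsing with `float` also keeps entries whose measure is not a whole number, which `int` would reject.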
