# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [2]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
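
As a minimal sketch of the tree access used in the cells below, the following parses a small inline XML string instead of the Mondial file. The sample document and its values are made up for illustration; only the `country`, `name`, and `city` tag names follow the examples in this notebook.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data; the tag names mirror those used in this notebook.
SAMPLE_XML = """
<mondial>
  <country car_code="AL">
    <name>Albania</name>
    <city><name>Tirana</name></city>
  </country>
  <country car_code="AND">
    <name>Andorra</name>
    <city><name>Andorra la Vella</name></city>
  </country>
</mondial>
"""

# fromstring() returns the root Element directly; ET.parse() returns an
# ElementTree, whose root Element is obtained with getroot().
sample_root = ET.fromstring(SAMPLE_XML)

# Iterating over an element yields its direct children.
country_names = [child.find('name').text for child in sample_root]

# XML attributes are exposed through the attrib dictionary.
car_codes = [child.attrib['car_code'] for child in sample_root]

print(country_names)
print(car_codes)
```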

In [3]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [4]:
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)

In [5]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':')
    # Element.iter() replaces getiterator(), which was removed in Python 3.9
    city_names = [subelement.find('name').text for subelement in element.iter('city')]
    print(', '.join(city_names))

* Albania:
Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:
Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:
Skopje, Kumanovo
* Serbia:
Beograd, Novi Sad, Niš
* Montenegro:
Podgorica
* Kosovo:
Prishtine
* Andorra:
Andorra la Vella


****
## XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [6]:
document = ET.parse( './data/mondial_database.xml' )

In [7]:
root = document.getroot()
infant_mortality_stats = []
infant_mortality_stats_unavailable = []

for country in root.findall('country'):
    name = country.find('name').text
    try:
        infant_mortality_stats.append((name, float(country.find('infant_mortality').text)))
    except (AttributeError, TypeError, ValueError):
        # no <infant_mortality> element, or its text is empty or not numeric
        infant_mortality_stats_unavailable.append(name)
# reverse=True ranks the highest rates first; exercise item 1 asks for the
# lowest, which requires an ascending sort (omit reverse=True)
infant_mortality_stats.sort(key=lambda x: x[1], reverse=True)

print('Top 10 Infant Mortality Rates\n')
for rank, stat in enumerate(infant_mortality_stats[:10], start=1):
    print("{0}. Country: {1} - {2}".format(rank, stat[0], stat[1]))

Top 10 Infant Mortality Rates

1. Country: Western Sahara - 145.82
2. Country: Afghanistan - 117.23
3. Country: Mali - 104.34
4. Country: Somalia - 100.14
5. Country: Central African Republic - 92.86
6. Country: Guinea-Bissau - 90.92
7. Country: Chad - 90.3
8. Country: Niger - 86.27
9. Country: Angola - 79.99
10. Country: Burkina Faso - 76.8
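
Exercise item 1 asks for the lowest rates, whereas the listing above ranks the highest first. A minimal sketch of the ascending selection, using `heapq.nsmallest` on made-up sample data (the helper name `lowest_infant_mortality`, the country names, and the rates are hypothetical, not taken from the Mondial file):

```python
import heapq
from xml.etree import ElementTree as ET

# Hypothetical sample data; country C deliberately has no rate.
SAMPLE_XML = """
<mondial>
  <country><name>A</name><infant_mortality>12.5</infant_mortality></country>
  <country><name>B</name><infant_mortality>3.1</infant_mortality></country>
  <country><name>C</name></country>
  <country><name>D</name><infant_mortality>47.0</infant_mortality></country>
</mondial>
"""

def lowest_infant_mortality(root, n):
    """Return the n (name, rate) pairs with the smallest rates."""
    rates = []
    for country in root.findall('country'):
        rate = country.findtext('infant_mortality')
        if rate is None:
            continue  # skip countries without an <infant_mortality> element
        rates.append((country.findtext('name'), float(rate)))
    return heapq.nsmallest(n, rates, key=lambda pair: pair[1])

sample_root = ET.fromstring(SAMPLE_XML)
lowest = lowest_infant_mortality(sample_root, 2)
for rank, (name, rate) in enumerate(lowest, start=1):
    print("{0}. Country: {1} - {2}".format(rank, name, rate))
```

Sorting the full list in ascending order and slicing the first ten entries would give the same ranking; `nsmallest` just avoids sorting entries that are never printed.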


In [8]:
city_population = []
city_population_unavailable = []

for country in root.findall('country'):
    # findall('city') matches only cities that are direct children of a country;
    # cities nested deeper (for example under a province element) are not visited
    for city in country.findall('city'):
        population_data = [int(population.text) for population in city.findall('population')]
        if population_data:
            # take the last <population> entry, assumed to be the most recent
            city_population.append((city.find('name').text, population_data[-1]))
        else:
            city_population_unavailable.append(city.find('name').text)
city_population.sort(key=lambda x: x[1], reverse=True)

print('Top 10 Cities By Population\n')
for rank, stat in enumerate(city_population[:10], start=1):
    print("{0}. {1} - {2:,}".format(rank, stat[0], stat[1]))

Top 10 Cities By Population

1. Seoul - 9,708,483
2. Al Qahirah - 8,471,859
3. Bangkok - 7,506,700
4. Hong Kong - 7,055,071
5. Ho Chi Minh - 5,968,384
6. Singapore - 5,076,700
7. Al Iskandariyah - 4,123,869
8. New Taipei - 3,939,305
9. Busan - 3,403,135
10. Pyongyang - 3,255,288
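
`findall('city')` matches only direct children of the element it is called on, while `iter('city')` walks all descendants. The sketch below shows the difference on a made-up document in which one city sits inside a `province` element. Whether and where the Mondial file nests cities this way is an assumption worth checking against the data, and the helper name `latest_city_populations` is hypothetical.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data with one direct and one nested city.
SAMPLE_XML = """
<mondial>
  <country>
    <name>Examplia</name>
    <city>
      <name>Direct City</name>
      <population year="2001">100</population>
      <population year="2011">150</population>
    </city>
    <province>
      <name>Example Province</name>
      <city>
        <name>Nested City</name>
        <population year="2011">900</population>
      </city>
    </province>
  </country>
</mondial>
"""

def latest_city_populations(country, nested):
    """Return (city, population) pairs, largest first.

    nested=False looks at direct child cities only (findall);
    nested=True also visits cities at any depth (iter).
    """
    cities = country.iter('city') if nested else country.findall('city')
    result = []
    for city in cities:
        populations = city.findall('population')
        if populations:
            # The last <population> entry is assumed to be the most recent one.
            result.append((city.findtext('name'), int(populations[-1].text)))
    return sorted(result, key=lambda pair: pair[1], reverse=True)

country = ET.fromstring(SAMPLE_XML).find('country')
direct_only = latest_city_populations(country, nested=False)
all_cities = latest_city_populations(country, nested=True)
print(direct_only)
print(all_cities)
```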


In [9]:
# First, collect all the unique ethnic groups.

ethnicities = set()

for country in root.findall('country'):
    for ethnicity in country.findall('ethnicgroup'):
        ethnicities.add(ethnicity.text)
        
ethnicity_pop_per_country = {}
for ethnicity in ethnicities:
    ethnicity_pop_per_country[ethnicity] = []

In [10]:
for country in root.findall('country'):
    # the last <population> entry is taken as the most recent estimate
    country_recent_population = int(country.findall('population')[-1].text)
    for ethnicity in country.findall('ethnicgroup'):
        ethnicity_pop_per_country[ethnicity.text].append(float(ethnicity.attrib['percentage'])*.01*country_recent_population)

In [11]:
ethnicity_total_pop = []

for ethnicity, country_pops in ethnicity_pop_per_country.items():
    ethnicity_total_pop.append((ethnicity, sum(country_pops)))    
ethnicity_total_pop.sort(key=lambda x: x[1], reverse=True)

print('Top 10 Ethnicities By (Rounded) Population\n')
for rank, stat in enumerate(ethnicity_total_pop[:10], start=1):
    # int() truncates the fractional part rather than rounding to nearest
    print("{0}. {1} - {2:,}".format(rank, stat[0], int(stat[1])))

Top 10 Ethnicities By (Rounded) Population

1. Han Chinese - 1,245,058,800
2. Indo-Aryan - 871,815,583
3. European - 494,872,219
4. African - 318,325,120
5. Dravidian - 302,713,744
6. Mestizo - 157,734,354
7. Bengali - 146,776,916
8. Russian - 131,856,996
9. Japanese - 126,534,212
10. Malay - 121,993,550
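
The three cells above can be condensed into a single pass over the countries. A minimal sketch using `collections.defaultdict`, run on made-up sample data (the helper name `ethnic_group_totals`, the country and group names, and all figures are hypothetical):

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical sample data; Group B appears in both countries.
SAMPLE_XML = """
<mondial>
  <country>
    <name>Examplia</name>
    <population year="2001">900</population>
    <population year="2011">1000</population>
    <ethnicgroup percentage="75">Group A</ethnicgroup>
    <ethnicgroup percentage="25">Group B</ethnicgroup>
  </country>
  <country>
    <name>Samplestan</name>
    <population year="2011">200</population>
    <ethnicgroup percentage="50">Group B</ethnicgroup>
  </country>
</mondial>
"""

def ethnic_group_totals(root):
    """Sum each group's estimated population over all countries, largest first."""
    totals = defaultdict(float)
    for country in root.findall('country'):
        populations = country.findall('population')
        if not populations:
            continue  # no population figure to apply the percentages to
        # The last <population> entry is assumed to be the most recent one.
        latest = int(populations[-1].text)
        for group in country.findall('ethnicgroup'):
            share = float(group.attrib['percentage']) / 100
            totals[group.text] += share * latest
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)

totals = ethnic_group_totals(ET.fromstring(SAMPLE_XML))
for rank, (group, population) in enumerate(totals, start=1):
    print("{0}. {1} - {2:,}".format(rank, group, round(population)))
```

Accumulating into a `defaultdict` removes the need to collect the set of group names first, since a missing key starts at `0.0`.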
