****
## XML exercise

Using data in 'data/mondial_database.xml', find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

## Solutions
+ Refer to the `xml.etree.ElementTree` documentation [here](https://docs.python.org/2/library/xml.etree.elementtree.html)
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial

In [1]:
import pandas as pd
from xml.etree import ElementTree as ET

In [2]:
doctree = ET.parse( './data/mondial_database.xml' )

In [3]:
# Get root
treeroot = doctree.getroot()
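
Before querying, it can help to see which element types sit directly under the root, since the root holds more than just `country` elements. The following is a minimal sketch that uses a small made-up XML string (not the real Mondial file) so that it runs on its own; the names in `sample` are invented for illustration.

```python
from collections import Counter
from xml.etree import ElementTree as ET

# Hypothetical miniature document, only for illustration
sample = """
<mondial>
  <country car_code="A"><name>Aland</name></country>
  <country car_code="B"><name>Borduria</name></country>
  <river country="A"><name>Long River</name></river>
</mondial>
"""

root = ET.fromstring(sample)

# Count the direct children of the root by tag name
tag_counts = Counter(child.tag for child in root)
print(tag_counts)
```

On the real file, the same expression could be applied to `treeroot` instead of `root`.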

## 1. 10 countries with the lowest infant mortality rates

In [4]:
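# Iterate over the children of the root; only elements that have both a
# name and an infant_mortality value (i.e. countries) are recorded
country_cnt = 0
infant_mort = dict()
for country in treeroot:
    try:
        infant_mort[country.find('name').text] = float(country.find('infant_mortality').text)
    except (AttributeError, TypeError, ValueError):
        # Element has no name or no usable infant_mortality value
        continue
    country_cnt = country_cnt + 1
print('There are {} countries with an infant mortality rate'.format(country_cnt))

There are 228 countries with an infant mortality rate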


In [5]:
mort_df = pd.DataFrame({'Infant Mortality': infant_mort})
mort_df.sort_values('Infant Mortality').head(10)

Unnamed: 0,Infant Mortality
Monaco,1.81
Japan,2.13
Norway,2.48
Bermuda,2.48
Singapore,2.53
Sweden,2.6
Czech Republic,2.63
Hong Kong,2.73
Macao,3.13
Iceland,3.15


## 2. 10 cities with the largest population

In [6]:
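# Iterate over every city of every country
pop_dict = dict()
city_cnt = 0
for element in doctree.iterfind('country'):
    # iter() walks the whole subtree, so cities nested inside provinces are
    # included; iterfind('city') would only match direct children of the country.
    # (getiterator() was removed in Python 3.9.)
    for subelement in element.iter('city'):
        city_cnt = city_cnt + 1
        pop = 0
        year = None  # reset so a year is never carried over from the previous city
        try:
            # Keep the largest recorded population and the year it refers to
            for e in subelement.findall('population'):
                if pop < int(e.text):
                    pop = int(e.text)
                    year = e.get('year')
            pop_dict[subelement.find('name').text] = (pop, int(year))
        except (AttributeError, TypeError, ValueError):
            # Skip cities without a usable name, population or year
            continue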
print('There are {} cities'.format(city_cnt))

There are 3380 cities
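
The choice of traversal method matters because many cities are nested inside `province` elements rather than sitting directly under their country. Here is a minimal sketch of the difference, run on a made-up XML snippet (the names are invented, not taken from Mondial):

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document, only for illustration
sample = """
<country>
  <name>Aland</name>
  <city><name>Capital City</name></city>
  <province>
    <name>North</name>
    <city><name>Northville</name></city>
  </province>
</country>
"""

country = ET.fromstring(sample)

# findall/iterfind with a bare tag only match direct children
direct = [c.find('name').text for c in country.findall('city')]

# iter walks the whole subtree, so nested cities are included
nested = [c.find('name').text for c in country.iter('city')]

# The './/city' path is another way to reach all descendants
via_path = [c.find('name').text for c in country.findall('.//city')]

print(direct)
print(nested)
print(via_path)
```

Only the last two lists contain the city inside the province.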


In [7]:
city_pop_df = pd.DataFrame({'Population, Year': pop_dict})
city_pop_df.sort_values('Population, Year',ascending=False).head(10)

Unnamed: 0,"Population, Year"
Shanghai,"(22315474, 2010)"
Istanbul,"(13710512, 2012)"
Delhi,"(12877470, 2001)"
Mumbai,"(12442373, 2011)"
Moskva,"(11979529, 2013)"
Beijing,"(11716620, 2010)"
São Paulo,"(11152344, 2010)"
Tianjin,"(11090314, 2010)"
Guangzhou,"(11071424, 2010)"
Shenzhen,"(10358381, 2010)"


## 3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)

In [8]:
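ethn_cnt = 0
ethn_dict = dict()
for element in doctree.iterfind('country'):
    # Country population: the largest recorded figure,
    # which is not necessarily the latest estimate
    pop = 0
    try:
        for e in element.findall('population'):
            if pop < int(e.text):
                pop = int(e.text)
    except (TypeError, ValueError):
        pass  # keep whatever population was found so far
    # Ethnic group extraction.
    # NOTE: 'percentage' is given in percent (0-100) and is not divided by 100 here,
    # so the totals below are 100 times the actual head counts.
    try:
        for e in element.findall('ethnicgroup'):
            ethn_dict[e.text] = ethn_dict.get(e.text, 0) + round(float(e.get('percentage')) * pop)
    except (TypeError, ValueError):
        continue
    ethn_cnt = ethn_cnt + 1
# ethn_cnt counts the countries processed, not the distinct ethnic groups
print('Processed {} countries'.format(ethn_cnt))

Processed 244 countries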


In [9]:
eth_df = pd.DataFrame({'Ethnic Groups':ethn_dict})
eth_df.sort_values('Ethnic Groups',ascending=False).head(10)

Unnamed: 0,Ethnic Groups
Han Chinese,124505880000
Indo-Aryan,87181558344
European,49493951565
African,31835969804
Dravidian,30271374425
Mestizo,15785527300
Bengali,14677691672
Russian,13686655065
Japanese,12728900789
Malay,12199362027
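
The `percentage` attribute is expressed in percent, so a head count needs a division by 100, and "best/latest estimate" suggests using the most recent `population` entry rather than the largest one. Below is a minimal sketch of that variant, run on a made-up XML snippet rather than the real file; the country and group names are invented, and the sketch assumes every `population` element carries a `year` attribute.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document, only for illustration
sample = """
<mondial>
  <country>
    <name>Aland</name>
    <population year="2001">900</population>
    <population year="2011">1000</population>
    <ethnicgroup percentage="60">Alpha</ethnicgroup>
    <ethnicgroup percentage="40">Beta</ethnicgroup>
  </country>
  <country>
    <name>Borduria</name>
    <population year="2010">500</population>
    <population year="1990">700</population>
    <ethnicgroup percentage="50">Alpha</ethnicgroup>
  </country>
</mondial>
"""

root = ET.fromstring(sample)
totals = {}
for country in root.iterfind('country'):
    populations = country.findall('population')
    if not populations:
        continue
    # Latest estimate: the entry with the most recent year, not the largest value
    latest = max(populations, key=lambda p: int(p.get('year')))
    pop = int(latest.text)
    for group in country.findall('ethnicgroup'):
        share = float(group.get('percentage')) / 100  # percent -> fraction
        totals[group.text] = totals.get(group.text, 0) + round(share * pop)

print(totals)
```

In this sample, Borduria contributes its 2010 figure (500) even though the 1990 figure is larger.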


## 4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [10]:
#a) Longest River
river_cnt = 0
river_dict = dict()
for element in doctree.iterfind('river'):
    try :
        river_dict[element.find('name').text] = (float(element.find('length').text),element.attrib['country'])
        river_cnt = river_cnt + 1
    except (AttributeError, KeyError, TypeError, ValueError):  # river without name, length or country
        continue
print('There are {} rivers'.format(river_cnt))

There are 233 rivers


In [11]:
(pd.DataFrame(river_dict,index=['Length','Country/Countries']).T).sort_values('Length',ascending=False).head(1)

Unnamed: 0,Length,Country/Countries
Amazonas,6448,CO BR PE


In [12]:
# b) Largest Lake
lake_cnt = 0
lake_dict = dict()
for element in doctree.iterfind('lake'):
    try :
        lake_dict[element.find('name').text] = (float(element.find('area').text),element.attrib['country'])
        lake_cnt = lake_cnt + 1
    except (AttributeError, KeyError, TypeError, ValueError):  # lake without name, area or country
        continue
print('There are {} lakes'.format(lake_cnt))

There are 139 lakes


In [13]:
(pd.DataFrame(lake_dict,index=['Area','Country/Countries']).T).sort_values('Area',ascending=False).head(1)

Unnamed: 0,Area,Country/Countries
Caspian Sea,386400,R AZ KAZ IR TM


In [14]:
# c) Airport at highest altitude
airport_cnt = 0
airport_dict = dict()
for element in doctree.iterfind('airport'):
    try :
        airport_dict[element.find('name').text] = (float(element.find('elevation').text),element.attrib['country'])
        airport_cnt = airport_cnt + 1
    except (AttributeError, KeyError, TypeError, ValueError):  # airport without name, elevation or country
        continue
print('There are {} airports'.format(airport_cnt))

There are 1289 airports


In [15]:
(pd.DataFrame(airport_dict,index=['Elevation','Country']).T).sort_values('Elevation',ascending=False).head(1)

Unnamed: 0,Elevation,Country
El Alto Intl,4063,BOL
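
The three lookups in this part follow the same pattern: scan one element type and keep the entry with the largest value of one numeric child. As a rough sketch, that pattern could be factored into a single helper. The name `largest_feature` and the sample XML are hypothetical and only serve as illustration.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document, only for illustration
sample = """
<mondial>
  <river country="A B"><name>Long River</name><length>1200</length></river>
  <river country="A"><name>Short River</name><length>300</length></river>
  <river country="B"><name>Unmeasured River</name></river>
</mondial>
"""

def largest_feature(root, tag, measure):
    """Return (name, value, country codes) for the `tag` element with the
    largest numeric `measure` child; elements lacking one are skipped."""
    best = None
    for element in root.iterfind(tag):
        value_text = element.findtext(measure)
        name = element.findtext('name')
        if not value_text or name is None:
            continue
        candidate = (name, float(value_text), element.get('country'))
        if best is None or candidate[1] > best[1]:
            best = candidate
    return best

root = ET.fromstring(sample)
longest = largest_feature(root, 'river', 'length')
print(longest)
```

With the real tree, calls such as `largest_feature(treeroot, 'lake', 'area')` or `largest_feature(treeroot, 'airport', 'elevation')` would follow the same shape.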


## With minidom

A useful video on this approach is [here](https://www.bing.com/videos/search?q=how+to+display+xml+tags+in+python&view=detail&mid=685EF420F64C693670CD685EF420F64C693670CD&FORM=VIRE).
I tried that method as well, but found the ElementTree approach used above more appealing.
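
For comparison, here is a minimal sketch of how an infant mortality lookup might look with `xml.dom.minidom` from the standard library. It runs on a made-up XML string with invented country names, not on the Mondial file.

```python
from xml.dom import minidom

# Hypothetical miniature document, only for illustration
sample = """
<mondial>
  <country><name>Aland</name><infant_mortality>3.5</infant_mortality></country>
  <country><name>Borduria</name><infant_mortality>2.1</infant_mortality></country>
</mondial>
"""

dom = minidom.parseString(sample)

rates = {}
for country in dom.getElementsByTagName('country'):
    # getElementsByTagName searches all descendants; take the first match
    name_node = country.getElementsByTagName('name')[0]
    rate_node = country.getElementsByTagName('infant_mortality')[0]
    # The text lives in a child text node, hence firstChild.data
    rates[name_node.firstChild.data] = float(rate_node.firstChild.data)

ranked = sorted(rates.items(), key=lambda item: item[1])
print(ranked)
```

The extra step through `firstChild.data` to reach the text is one reason the ElementTree version tends to read more compactly.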