# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
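
The lookup methods used in the cells below differ in how deep they search. A minimal sketch on a tiny inline document shows the difference; the XML string and its country and city names are made up for illustration and are not taken from the Mondial file.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document, loosely shaped like the Mondial data
SAMPLE = """
<mondial>
  <country>
    <name>Aland</name>
    <province>
      <name>North</name>
      <city><name>Alpha</name></city>
    </province>
    <city><name>Beta</name></city>
  </country>
  <country>
    <name>Borduria</name>
    <city><name>Gamma</name></city>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)
first_country = root.find('country')

# find() returns the first direct child with the given tag, or None
first_country_name = first_country.find('name').text

# findall() and iterfind() look at direct children only
direct_cities = [c.find('name').text for c in first_country.findall('city')]

# iter() walks the whole subtree, so cities nested in a province are included
all_cities = [c.find('name').text for c in first_country.iter('city')]

print(first_country_name)
print(direct_cities)
print(all_cities)
```

This is why the examples combine `iterfind('country')` on the root with a deeper search for `city` inside each country.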

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print (child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [4]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':')
    cities_string = ''
    # iter() replaces getiterator(), which was removed in Python 3.9
    for subelement in element.iter('city'):
        cities_string += subelement.find('name').text + ', '
    print(cities_string[:-2])

* Albania:
Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:
Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:
Skopje, Kumanovo
* Serbia:
Beograd, Novi Sad, Niš
* Montenegro:
Podgorica
* Kosovo:
Prishtine
* Andorra:
Andorra la Vella


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [5]:
document = ET.parse( './data/mondial_database.xml' )

In [6]:
# print out the names of all the countries
for child in document.getroot():
    print (child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra
France
Spain
Austria
Czech Republic
Germany
Hungary
Italy
Liechtenstein
Slovakia
Slovenia
Switzerland
Belarus
Latvia
Lithuania
Poland
Ukraine
Russia
Belgium
Luxembourg
Netherlands
Bosnia and Herzegovina
Croatia
Bulgaria
Romania
Turkey
Denmark
Estonia
Faroe Islands
Finland
Norway
Sweden
Monaco
Gibraltar
Guernsey
Holy See
Ceuta
Melilla
Iceland
Ireland
San Marino
Jersey
Malta
Isle of Man
Moldova
Portugal
Svalbard
United Kingdom
Afghanistan
China
Iran
Pakistan
Tajikistan
Turkmenistan
Uzbekistan
Armenia
Georgia
Azerbaijan
Bahrain
Bangladesh
Myanmar
India
Bhutan
Brunei
Malaysia
Laos
Thailand
Cambodia
Vietnam
Kazakhstan
North Korea
Kyrgyzstan
Hong Kong
Macao
Mongolia
Nepal
Christmas Island
Cocos Islands
Cyprus
Gaza Strip
Israel
Egypt
Indonesia
Timor-Leste
Papua New Guinea
Iraq
Jordan
Kuwait
Saudi Arabia
Syria
Lebanon
West Bank
Japan
South Korea
Maldives
Oman
United Arab Emirates
Yemen
Philippines
Qatar
Singapore
Sri Lanka
Taiwan
Anguil

In [7]:
# build a dictionary mapping each country name to its infant mortality rate
d = {}
for child in document.getroot():  # loop over the children of the root element
    rate = child.find('infant_mortality')
    if rate is not None:  # skip entries without an infant mortality rate
        d[child.find('name').text] = rate.text

In [8]:
# pretty-print the smaller file so we can inspect the tree structure
import lxml.etree as etree
x = etree.parse('./data/mondial_database_less.xml')
print(etree.tostring(x, pretty_print=True))

b'<!DOCTYPE mondial SYSTEM "mondial.dtd">\n<mondial>\n   <country car_code="AL" area="28750" capital="cty-Albania-Tirane" memberships="org-BSEC org-CEI org-CD org-SELEC org-CE org-EAPC org-EBRD org-EITI org-FAO org-IPU org-IAEA org-IBRD org-ICC org-ICAO org-ICCt org-Interpol org-IDA org-IFRCS org-IFC org-IFAD org-ILO org-IMO org-IMF org-IOC org-IOM org-ISO org-OIF org-ITU org-ITUC org-IDB org-MIGA org-NATO org-OSCE org-OPCW org-OAS org-OIC org-PCA org-UN org-UNCTAD org-UNESCO org-UNIDO org-UPU org-WCO org-WFTU org-WHO org-WIPO org-WMO org-UNWTO org-WTO">\n      <name>Albania</name>\n      <population measured="est." year="1950">1214489</population>\n      <population measured="est." year="1960">1618829</population>\n      <population measured="est." year="1970">2138966</population>\n      <population measured="est." year="1980">2734776</population>\n      <population measured="est." year="1990">3446882</population>\n      <population year="1997">3249136</population>\n      <population 

In [9]:
# the dictionary now maps country names to rates, both stored as strings
print(d)

{'Albania': '13.19', 'Greece': '4.78', 'Macedonia': '7.9', 'Serbia': '6.16', 'Andorra': '3.69', 'France': '3.31', 'Spain': '3.33', 'Austria': '4.16', 'Czech Republic': '2.63', 'Germany': '3.46', 'Hungary': '5.09', 'Italy': '3.31', 'Liechtenstein': '4.33', 'Slovakia': '5.35', 'Slovenia': '4.04', 'Switzerland': '3.73', 'Belarus': '3.64', 'Latvia': '7.91', 'Lithuania': '6', 'Poland': '6.19', 'Ukraine': '8.1', 'Russia': '7.08', 'Belgium': '4.18', 'Luxembourg': '4.28', 'Netherlands': '3.66', 'Bosnia and Herzegovina': '5.84', 'Croatia': '5.87', 'Bulgaria': '15.08', 'Romania': '10.16', 'Turkey': '21.43', 'Denmark': '4.1', 'Estonia': '6.7', 'Faroe Islands': '5.71', 'Finland': '3.36', 'Norway': '2.48', 'Sweden': '2.6', 'Monaco': '1.81', 'Gibraltar': '6.29', 'Guernsey': '3.47', 'Iceland': '3.15', 'Ireland': '3.74', 'San Marino': '4.52', 'Jersey': '3.86', 'Malta': '3.59', 'Isle of Man': '4.17', 'Moldova': '12.93', 'Portugal': '4.48', 'United Kingdom': '4.44', 'Afghanistan': '117.23', 'China': '14

In [10]:
# create a new dictionary with the value strings replaced with floats
# (despite its name, the_list is a dictionary)
the_list = {country: float(rate) for country, rate in d.items()}

In [11]:
print(the_list)

{'Albania': 13.19, 'Greece': 4.78, 'Macedonia': 7.9, 'Serbia': 6.16, 'Andorra': 3.69, 'France': 3.31, 'Spain': 3.33, 'Austria': 4.16, 'Czech Republic': 2.63, 'Germany': 3.46, 'Hungary': 5.09, 'Italy': 3.31, 'Liechtenstein': 4.33, 'Slovakia': 5.35, 'Slovenia': 4.04, 'Switzerland': 3.73, 'Belarus': 3.64, 'Latvia': 7.91, 'Lithuania': 6.0, 'Poland': 6.19, 'Ukraine': 8.1, 'Russia': 7.08, 'Belgium': 4.18, 'Luxembourg': 4.28, 'Netherlands': 3.66, 'Bosnia and Herzegovina': 5.84, 'Croatia': 5.87, 'Bulgaria': 15.08, 'Romania': 10.16, 'Turkey': 21.43, 'Denmark': 4.1, 'Estonia': 6.7, 'Faroe Islands': 5.71, 'Finland': 3.36, 'Norway': 2.48, 'Sweden': 2.6, 'Monaco': 1.81, 'Gibraltar': 6.29, 'Guernsey': 3.47, 'Iceland': 3.15, 'Ireland': 3.74, 'San Marino': 4.52, 'Jersey': 3.86, 'Malta': 3.59, 'Isle of Man': 4.17, 'Moldova': 12.93, 'Portugal': 4.48, 'United Kingdom': 4.44, 'Afghanistan': 117.23, 'China': 14.79, 'Iran': 39.0, 'Pakistan': 57.48, 'Tajikistan': 35.03, 'Turkmenistan': 38.13, 'Uzbekistan': 1

In [12]:
# sort the country names by their rate; sorted() returns a list of the keys
y = sorted(the_list, key=the_list.get)


In [13]:
# the ten countries with the lowest infant mortality rates
y[0:10]

['Monaco',
 'Japan',
 'Norway',
 'Bermuda',
 'Singapore',
 'Sweden',
 'Czech Republic',
 'Hong Kong',
 'Macao',
 'Iceland']
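
The conversion to floats matters because the rates are parsed as strings, and strings compare character by character. A small sketch, using a handful of rates copied from the dictionary printed above, shows the pitfall and an alternative that keeps the rate next to each name. The use of `heapq.nsmallest` is a suggestion, not what the notebook itself does.

```python
import heapq

# A few rates copied from the dictionary printed above (still strings)
rates = {'Albania': '13.19', 'Greece': '4.78', 'Monaco': '1.81',
         'Norway': '2.48', 'Sweden': '2.6', 'Czech Republic': '2.63'}

# Sorting the raw strings puts '13.19' ahead of '2.48'
string_order = sorted(rates, key=rates.get)

# Convert to floats before comparing
as_floats = {country: float(rate) for country, rate in rates.items()}

# nsmallest returns (country, rate) pairs ordered by rate
lowest = heapq.nsmallest(3, as_floats.items(), key=lambda item: item[1])
for country, rate in lowest:
    print(country, rate)
```

Returning pairs rather than bare names makes it easy to check the ranking against the printed dictionary.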

## 2. 10 cities with the largest population


In [14]:
# rebuild the infant mortality dictionary from question 1 (not used for this question)
d = {}
for child in document.getroot():
    if child.find('infant_mortality') is not None:
        d[child.find('name').text] = child.find('infant_mortality').text

In [15]:
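# map each city name to its population (still a string at this point)
# note: find() returns the first <population> listed for a city, and
# cities that share a name overwrite each other in this dictionary
city_pop = {}
for element in document.iterfind('country'):
    for subelement in element.iter('city'):  # iter() replaces the removed getiterator()
        if subelement.find('population') is not None:
            city_pop[subelement.find('name').text] = subelement.find('population').text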

In [16]:
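# c_pop is a second name for the same dictionary, so this loop
# converts the values of city_pop to floats in place
c_pop = city_pop
for key in c_pop:
    c_pop[key] = float(c_pop[key])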

In [17]:
# sort the city names by population, largest first, and show the top ten
c = sorted(c_pop, key=c_pop.get, reverse=True)
c[0:10]

['Seoul',
 'Mumbai',
 'São Paulo',
 'Jakarta',
 'Shanghai',
 'Ciudad de México',
 'Moskva',
 'Tokyo',
 'Beijing',
 'Delhi']

In [18]:
city_pop['Seoul']

10229262.0

In [19]:
city_pop['Mumbai']

9925891.0
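
The ranking above relies on the first `<population>` element of each city and on city names being unique. A hedged sketch of one way to tighten both points: pick the entry with the highest `year` attribute and key the dictionary by (country, city). The inline XML is invented for illustration, `latest_population` is a hypothetical helper, and it is an assumption that city populations carry a `year` attribute the way the country populations in the tree dump do.

```python
from xml.etree import ElementTree as ET

# Hypothetical data: two cities share a name, one city has no population
SAMPLE = """
<mondial>
  <country>
    <name>Aland</name>
    <city>
      <name>Springfield</name>
      <population year="1990">100</population>
      <population year="2010">250</population>
    </city>
  </country>
  <country>
    <name>Borduria</name>
    <city>
      <name>Springfield</name>
      <population year="2005">400</population>
    </city>
    <city><name>Nowhere</name></city>
  </country>
</mondial>
"""

def latest_population(element):
    """Return the population with the highest 'year' attribute, or None."""
    entries = element.findall('population')
    if not entries:
        return None
    newest = max(entries, key=lambda p: int(p.get('year', 0)))
    return float(newest.text)

root = ET.fromstring(SAMPLE)
city_pop = {}
for country in root.iterfind('country'):
    country_name = country.find('name').text
    for city in country.iter('city'):
        pop = latest_population(city)
        if pop is not None:  # skip cities without any population entry
            city_pop[(country_name, city.find('name').text)] = pop

# Largest first; each item is ((country, city), population)
top = sorted(city_pop.items(), key=lambda item: item[1], reverse=True)
print(top)
```

Keying by the pair also gives the country of each city for free, which is useful when reporting the result.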

## 3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)

In [20]:
# first look at the data: find() returns only the first <ethnicgroup> of each country
for child in document.getroot():
    if child.find('ethnicgroup') is not None:
        print(child.find('ethnicgroup').text)

Albanian
Greek
Macedonian
Serb
Montenegrin
Albanian
Spanish
Mediterranean Nordic
Austrian
Czech
German
Hungarian
Italian
Slovak
Slovene
German
Belorussian
Latvian
Lithuanian
German
Ukrainian
Russian
Fleming
Luxembourgish
Dutch
Muslim
Croat
Bulgarian
Romanian
Turkish
Estonian
Scandinavian
Finn
Norwegian
Swede
French
Norman-French
Celt
Irish
Norman-French
Moldavian/Romanian
Norwegian
English
Tajik
Han Chinese
Arab
Tajik
Turkmen
Uzbek
Armenian
Georgian
Azeri
Arab
Bengali
Indian
Dravidian
Bhote
Chinese
Malay
Lao Loum
Chinese
Chinese
Viet/Kinh
Kazakh
Kyrgyz
Chinese
Chinese
Mongol
Chinese
Greek
Jewish
Jewish
European
Javanese
Kurdish
Armenian
Arab
Arab
Arab
Armenian
Jewish
Japanese
South Asian
Chinese
Indian
Indian
Sinhalese
Chinese
Black
European/Caribbean Amerindian
European
Mestizo
Mestizo
Mestizo
Black
Black
British Isles
European
Black
Mestizo
Mestizo
Mestizo
European
Carib Indians
Mulatto
African
Mestizo
Mestizo
Danish
African
Chinese
African
Chinese
Mestizo
White
African
African
Basqu

In [21]:
# this attempt prints nothing: an <ethnicgroup> element has no nested
# <ethnicgroup> child, so find() always returns None here
for element in document.iterfind('country'):
    for subelement in element.iter('ethnicgroup'):
        if subelement.find('ethnicgroup') is not None:
            print(subelement.find('ethnicgroup').text)

In [22]:
# this attempt also prints nothing: iterfind('ethnicgroup') only matches direct
# children of the root, and <ethnicgroup> elements are nested inside <country>
for element in document.iterfind('ethnicgroup'):
    print(element.text)

In [23]:
# collect [country, ethnic group, share of the population as a fraction]
e_groups = []
for child in document.iterfind('country'):
    country = child.find('name').text
    for sub in child.iterfind('ethnicgroup'):
        etg_name = sub.text
        etg_prcnt = float(sub.attrib['percentage']) * 0.01
        e_groups.append([country, etg_name, etg_prcnt])
print(e_groups)

[['Albania', 'Albanian', 0.9500000000000001], ['Albania', 'Greek', 0.03], ['Greece', 'Greek', 0.93], ['Macedonia', 'Macedonian', 0.642], ['Macedonia', 'Albanian', 0.252], ['Macedonia', 'Turkish', 0.039], ['Macedonia', 'Gypsy', 0.027000000000000003], ['Macedonia', 'Serb', 0.018000000000000002], ['Serbia', 'Serb', 0.8290000000000001], ['Serbia', 'Montenegrin', 0.009000000000000001], ['Serbia', 'Hungarian', 0.039], ['Serbia', 'Roma', 0.013999999999999999], ['Serbia', 'Bosniak', 0.018000000000000002], ['Serbia', 'Croat', 0.011000000000000001], ['Montenegro', 'Montenegrin', 0.43], ['Montenegro', 'Serb', 0.32], ['Montenegro', 'Bosniak', 0.08], ['Montenegro', 'Albanian', 0.05], ['Kosovo', 'Albanian', 0.92], ['Kosovo', 'Serbian', 0.05], ['Andorra', 'Spanish', 0.43], ['Andorra', 'Andorran', 0.33], ['Andorra', 'Portuguese', 0.11], ['Andorra', 'French', 0.02], ['Andorra', 'African', 0.05], ['Spain', 'Mediterranean Nordic', 1.0], ['Austria', 'Austrian', 0.9109999999999999], ['Austria', 'Turkish', 

In [24]:
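# collect [country, ethnic group, estimated number of people in the group]
# note: find('population') returns the first <population> element of a country,
# which in this file is the earliest listed year, not the latest estimate
e_groups = []
for child in document.iterfind('country'):
    country = child.find('name').text
    pop = float(child.find('population').text)
    for sub in child.iterfind('ethnicgroup'):
        etg_name = sub.text
        etg_pop = float(sub.attrib['percentage']) * 0.01 * pop
        e_groups.append([country, etg_name, etg_pop])
print(e_groups)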

[['Albania', 'Albanian', 1153764.55], ['Albania', 'Greek', 36434.67], ['Greece', 'Greek', 1020033.3], ['Macedonia', 'Macedonian', 519200.808], ['Macedonia', 'Albanian', 203798.448], ['Macedonia', 'Turkish', 31540.236], ['Macedonia', 'Gypsy', 21835.548000000003], ['Macedonia', 'Serb', 14557.032000000001], ['Serbia', 'Serb', 5581040.224], ['Serbia', 'Montenegrin', 60590.304000000004], ['Serbia', 'Hungarian', 262557.984], ['Serbia', 'Roma', 94251.58399999999], ['Serbia', 'Bosniak', 121180.60800000001], ['Serbia', 'Croat', 74054.816], ['Montenegro', 'Montenegrin', 133876.63], ['Montenegro', 'Serb', 99629.12], ['Montenegro', 'Bosniak', 24907.28], ['Montenegro', 'Albanian', 15567.050000000001], ['Kosovo', 'Albanian', 1457684.8], ['Kosovo', 'Serbian', 79222.0], ['Andorra', 'Spanish', 2664.71], ['Andorra', 'Andorran', 2045.01], ['Andorra', 'Portuguese', 681.67], ['Andorra', 'French', 123.94], ['Andorra', 'African', 309.85], ['Spain', 'Mediterranean Nordic', 18618086.0], ['Austria', 'Austrian',

In [25]:
import pandas as pd

# columns: a = country, b = ethnic group, c = estimated group population
df = pd.DataFrame(e_groups, columns=["a", "b", "c"])

df.head()

|   | a | b | c |
|---|---|---|---|
| 0 | Albania | Albanian | 1153764.55 |
| 1 | Albania | Greek | 36434.67 |
| 2 | Greece | Greek | 1020033.3 |
| 3 | Macedonia | Macedonian | 519200.808 |
| 4 | Macedonia | Albanian | 203798.448 |


In [33]:
# final answer: total estimated population per ethnic group, largest first
df_g = df.groupby(['b'])['c'].agg('sum').sort_values(ascending=False)
df_g.head(10)

b
Han Chinese    4.975551e+08
European       1.928658e+08
Indo-Aryan     1.716454e+08
Russian        9.275844e+07
African        8.632937e+07
Japanese       8.170627e+07
German         6.623219e+07
Dravidian      5.959908e+07
English        4.231499e+07
Mestizo        3.554233e+07
Name: c, dtype: float64
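
The totals above are built from the first `<population>` element of each country, while the exercise asks for the best or latest estimate. A minimal sketch of the same aggregation using the entry with the highest `year` attribute follows. The inline XML and its names are made up, and the sketch assumes every `<population>` element of interest has a `year` attribute, as in the tree dump shown earlier.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical data shaped like the country entries in the tree dump
SAMPLE = """
<mondial>
  <country>
    <name>Aland</name>
    <population year="1950">1000</population>
    <population year="2010">2000</population>
    <ethnicgroup percentage="75">Alandic</ethnicgroup>
    <ethnicgroup percentage="25">Bordurian</ethnicgroup>
  </country>
  <country>
    <name>Borduria</name>
    <population year="2011">4000</population>
    <ethnicgroup percentage="50">Bordurian</ethnicgroup>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)
totals = defaultdict(float)
for country in root.iterfind('country'):
    populations = country.findall('population')
    if not populations:
        continue  # no population data, nothing to distribute
    # pick the entry with the most recent year instead of the first one
    latest = max(populations, key=lambda p: int(p.get('year', 0)))
    country_pop = float(latest.text)
    for group in country.iterfind('ethnicgroup'):
        share = float(group.get('percentage')) / 100
        totals[group.text] += share * country_pop

# Largest groups first
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

Run against the real file, this would likely change the figures reported above, since those were computed from the earliest listed year.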

In [None]:
# check which <population> value find() returns for each country (the first one listed)
for child in document.iterfind('country'):
    print(child.find('population').text)

## 4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [None]:
# unfinished draft: look for <river> elements nested inside each country
for element in document.iterfind('country'):
    for subelement in element.iter('river'):
        if subelement.find('name') is not None:
            print(subelement.find('name').text)

In [54]:
# print the length and name of every river
# note: this stops with an AttributeError at the first river without a <length> element
for element in document.iterfind('river'):
    print(element.find('length').text)
    print(element.find('name').text)

230
Thjorsa
206
Joekulsa a Fjoellum
604
Glomma
322
Lagen
93
Goetaaelv
460
Klaraelv
470
Umeaelv
520
Dalaelv
320
Vaesterdalaelv
241
Oesterdalaelv
145
Paatsjoki
300
Ounasjoki
550
Kemijoki
107
Oulujoki
203
Kymijoki
121
Kokemaeenjoki
162
Vuoksi
346
Thames
925
Maas
1013
Loire
647
Garonne
812
Rhone
480
Saone
453
Doubs
290
Isere
776
Seine
514
Marne
1007
Tajo
897
Douro
742
Guadiana
657
Guadalquivir
925
Ebro
652
Po
248
Ticino
313
Adda
75
Mincio
415
Etsch
405
Tiber
240
Arno
2845
Donau
45.9
Breg
43
Brigach
147
Iller
264
Lech
295
Isar
168
Ammer
35
Würm
517
Inn
150
Alz
225
Salzach
254
Enns
358
March
250
Raab
403
Waag
749
Drau
453
Mur
1308
Theiss
945
Save
346
Drina
140
Tara
120
Piva
185
Morava
308
Western Morava
295
Southern Morava
615
Olt
953
Pruth
1352
Dnjestr
440
Weser
211
Aller
281
Leine
292
Werra
221
Fulda
1091
Elbe
440
Moldau
1324
Rhein
524
Main
544
Mosel
227
Saar
367
Neckar
288
Aare
164
Reuss
36.3
Limmat
866
Oder
1047
Weichsel
448
Narew
772
Western Bug
99.5
Moraca
44
Buna
152
Drin
175
White Dr

AttributeError: 'NoneType' object has no attribute 'text'
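
The traceback comes from a river that has no `<length>` child, so `find('length')` returns `None`. A hedged sketch of a guarded version that also resolves the river's countries is below. The inline XML is invented; the `car_code` attribute appears in the tree dump above, but the `country` attribute on `<river>` holding space-separated car codes is an assumption about the file.

```python
from xml.etree import ElementTree as ET

# Hypothetical data: one river has no <length> element
SAMPLE = """
<mondial>
  <country car_code="AA"><name>Aland</name></country>
  <country car_code="BB"><name>Borduria</name></country>
  <river country="AA"><name>Short</name><length>120</length></river>
  <river country="AA BB"><name>Long</name><length>950.5</length></river>
  <river country="BB"><name>Unmeasured</name></river>
</mondial>
"""

root = ET.fromstring(SAMPLE)

# Map car codes to country names so the river's country attribute can be resolved
code_to_name = {c.get('car_code'): c.find('name').text
                for c in root.iterfind('country')}

rivers = []
for river in root.iterfind('river'):
    length = river.find('length')
    if length is None or not length.text:
        continue  # skip rivers without a recorded length
    codes = river.get('country', '').split()
    countries = [code_to_name.get(code, code) for code in codes]
    rivers.append((float(length.text), river.find('name').text, countries))

# Compare on the numeric length only
longest = max(rivers, key=lambda r: r[0])
print(longest)
```

Converting the length with `float` rather than `int` matters because some lengths in the output above, such as 45.9, are not whole numbers.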

In [58]:

# print the name of every lake
for element in document.iterfind('lake'):
    print(element.find('name').text)

Inari
Oulujaervi
Kallavesi
Saimaa
Paeijaenne
Mjoesa-See
Storuman
Siljan
Maelaren
Vaenern
Vaettern
Arresoe
Loch Ness
Loch Lomond
Bodensee
Chiemsee
Starnberger See
Ammersee
Laacher Maar
Lac Leman
Zurichsee
Thunersee
Brienzersee
Vierwaldstattersee
Lago Maggiore
Lago di Como
Lago di Garda 
Lago Trasimeno
Lago di Bolsena
Lago di Bracciano
Laguna de Gallocanta
Neusiedlersee
Balaton
Lake Skutari
Lake Prespa
Lake Ohrid
Kiev Reservoir
Kakhovka Reservoir
Kremenchuk Reservoir
Kuybyshev Reservoir
Ozero Ladoga
Ozero Onega
Ozero Pskovskoje
Ozero Baikal
Ozero Taimyr
Ozero Chanka
Ozero Tschany
Dead Sea
Lake Genezareth
Lake Van
Lake Keban
Lake Urmia
Daryacheh ye Namak
Hamun e Jaz Murian
Caspian Sea
Ozero Aral
Ozero Balchash
Issyk-Kul
Koli Sarez
Lop Nor
Laguna de Bay
Lake Toba
Segara Anak
Qinghai Lake
Nam Co
Lake Nasser
Lake Volta
Lake Bosumtwi
Lake Kainji
Chad Lake
Barrage de Mbakaou
Lake Nyos
Lac Assal
Lake Abbe
Lake Abaya
Chew Bahir
Lake Turkana
Lake Tana
Lake Sese Seko/Albertsee
Rutanzige/Eduardsee
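
The lake and airport parts of the question follow the same pattern: find the element whose numeric child is largest while skipping entries without a value. A minimal sketch with a hypothetical helper, `max_by_child`, is shown below. The inline XML is made up, and the tag names `area`, `airport` and `elevation`, as well as the `country` attribute, are assumptions about the Mondial file that should be checked against the tree dump.

```python
from xml.etree import ElementTree as ET

# Hypothetical data: some entries lack a value or have an empty element
SAMPLE = """
<mondial>
  <lake country="AA"><name>Pond</name><area>12</area></lake>
  <lake country="BB"><name>Big Water</name><area>3400</area></lake>
  <lake country="BB"><name>Unknown Size</name></lake>
  <airport country="AA"><name>Valley Field</name><elevation>15</elevation></airport>
  <airport country="BB"><name>Summit Strip</name><elevation>4100</elevation></airport>
  <airport country="BB"><name>No Data</name><elevation/></airport>
</mondial>
"""

def max_by_child(root, tag, child):
    """Return (value, name, country attribute) for the element of type `tag`
    whose numeric `child` element is largest, or None if none has a value."""
    best = None
    for element in root.iter(tag):
        node = element.find(child)
        if node is None or not node.text:
            continue  # missing or empty value
        value = float(node.text)
        if best is None or value > best[0]:
            best = (value, element.find('name').text, element.get('country'))
    return best

root = ET.fromstring(SAMPLE)
largest_lake = max_by_child(root, 'lake', 'area')
highest_airport = max_by_child(root, 'airport', 'elevation')
print(largest_lake)
print(highest_airport)
```

The returned country value is whatever code the attribute holds; mapping it to a country name would need a lookup built from the `car_code` attributes of the `<country>` elements.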
