Climate data analysis (part I)
==============================

We want to analyze worldwide climate data from the [National Climatic Data Center](http://www.ncdc.noaa.gov/data-access), which archives the world's largest collection of climate data, with historical data dating back many centuries. To evaluate whether their datasets are relevant for our analysis, we can download their list of countries. The file is on the class website and is available as part of this exercise; each line contains the name of a country for which data can be downloaded. We would like to analyze it using list, set, and dictionary comprehensions. In a subsequent exercise, we will use the original complete data file, which provides not only each country's name but also its code, so that the corresponding data can be collected and analyzed.
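
For orientation, here is a minimal sketch of the three comprehension forms on a small made-up list. The `sample` data and the variable names are illustrative assumptions only; they are not taken from the NCDC file.

```python
# Hypothetical sample data, not the real NCDC list
sample = ["norway", "india", "india", "belgium"]

# List comprehension: keeps order and duplicates
lengths = [len(name) for name in sample]

# Set comprehension: duplicates collapse to a single entry
unique_names = {name for name in sample}

# Dictionary comprehension: maps each name to a value computed from it
initials = {name: name[0] for name in sample}

print(lengths)
print(sorted(unique_names))
print(initials)
```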

We load the data for you into a large string containing all the countries:

In [7]:
import requests
import codecs

url = 'https://gawron.sdsu.edu/python_for_ss/course_core/data/NCDC_country_list.txt'
r = requests.get(url)

countries_raw = r.content
# Downloaded web content is often a byte string.  We're pretty sure this one is utf8
countries = codecs.decode(countries_raw,encoding='utf-8')
# Let's normalize the content
countries = countries.lower()
print(countries)

norway
sweden
finland
united kingdom (includes northern ireland)
ireland
iceland
greenland
denmark, faeroe islands
netherlands
belgium
luxembourg
switzerland, liechtenstein
france
spain
gibraltar
portugal, azores, cape verde, madeira
germany (former east germany, west germany)
austria
czechoslovakia
poland
hungary
yugoslavia (former), albania, slovenia, croatia
albania
yugoslavia (former), slovenia, croatia
romania
bulgaria
italy
malta
greece
turkey
cyprus
former soviet union
former soviet union
former soviet union
former soviet union
former soviet union
former soviet union
former soviet union
former soviet union
former soviet union
former soviet union
former soviet union
former soviet union
syria
lebanon
israel
jordan
saudi arabia
kuwait
iraq
iran
afghanistan
saudi arabia
bahrain
qatar
united arab emirates
oman
yemen
democratic yemen
pakistan
bangladesh
india 
india
sri lanka
maldives
mongolia
nepal
hong kong
macau
hong kong
taiwan
north korea
south korea
japan
myanmar (was burma)
tha

Question 1
----------

We would like to list all the countries in this list that start with the letter "b" because we are interested in datasets for Brazil. This can be done with a `for` loop as follows:

    # strip() removes a trailing newline so the list has no empty last entry
    country_list = countries.strip().split("\n")
    b_countries = []
    for country in country_list:
        if country[0] == "b":
            b_countries.append(country)

Re-write this to use a list comprehension instead.

In [None]:
# your code goes here

Solution:

In [None]:
# We need to loop over all the country names, but we start from one string that contains all of them.
# The first task is therefore to split it into a list of names:
country_list = countries.strip().split("\n")
# Then we can list all countries whose first letter is "b"
b_countries = [country for country in country_list if country[0] == "b"]
# Of course, we can combine the two steps into one expression.
# Here we also test whether a name starts with "b" using the dedicated string
# method 'startswith'
b_countries = [country for country in countries.split("\n") if country.startswith("b")]

In [None]:
print("Countries that start with 'b':")
print(b_countries)
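
One practical difference between the two tests is worth noting. Indexing with `country[0]` raises `IndexError` on an empty string, which `split("\n")` produces when the text ends with a newline, whereas `startswith` simply returns `False`. The sketch below illustrates this on a made-up string; `sample_text` and the other names are assumptions for illustration.

```python
sample_text = "belgium\nbrazil\nnorway\n"  # note the trailing newline
lines = sample_text.split("\n")            # the last element is an empty string

# startswith is safe on empty strings
safe = [name for name in lines if name.startswith("b")]

# Indexing needs a guard (or a strip() before splitting)
try:
    unsafe = [name for name in lines if name[0] == "b"]
except IndexError:
    unsafe = None  # the empty last element triggered the error

print(lines)
print(safe)
print(unsafe)
```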

Question 2
----------

Several countries are repeated in the result generated by the list comprehension. This is because NCDC uses multiple codes for a given country when it is particularly large. Convert your list to another standard Python data structure that enforces uniqueness.

In [None]:
# your code goes here

Solution:

In [None]:
unique_b_countries = set(b_countries)
print(unique_b_countries)
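
Because a set has no guaranteed order, the printed result may differ from run to run. If a stable display is wanted, one option is to sort the set into a list, as in this small sketch; the `names` list is made up and stands in for `b_countries`.

```python
# Hypothetical duplicates, standing in for b_countries
names = ["bulgaria", "belgium", "bulgaria", "bahrain", "belgium"]

unique_names = set(names)             # order is not guaranteed
ordered_names = sorted(unique_names)  # alphabetical list for display

print(ordered_names)
```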

Question 3
----------

If we are always going to collect all the country names and then remove duplicates, we could build a set directly rather than going through a list. Use a set comprehension (or a generator expression if you are using an older version of Python) to produce the set of names that start with "b".

In [None]:
# your code goes here

Solution:

In [None]:
unique_b_countries = {country for country in countries.split("\n") if country.startswith("b")}
# generator expression version (for Python 2.6)
unique_b_countries = set(country for country in countries.split("\n") if country.startswith("b"))

In [None]:
print("unique countries starting with 'b':")
print(unique_b_countries)

Question 4
----------

Use a dictionary comprehension (or generator expression) to produce a dictionary whose keys are *all* the countries and whose values are the number of times they appear in the data file because they have been sub-divided. Print the content of the dictionary in a nice way, one country per line.

In [None]:
# your code goes here

Solution:

In [None]:
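# Count whole lines, not substrings: str.count on the full text would also
# match "oman" inside "romania"
country_list = [line.strip() for line in countries.splitlines() if line.strip()]
country_frequencies = {country: country_list.count(country) for country in country_list}
# generator expression version
country_frequencies = dict((country, country_list.count(country)) for country in country_list)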

In [None]:
# or alternatively, and more efficiently: count each distinct country only once
country_list = [line.strip() for line in countries.splitlines() if line.strip()]
unique_countries = set(country_list)
country_frequencies = {country: country_list.count(country) for country in unique_countries}
# generator expression version
country_frequencies = dict((country, country_list.count(country)) for country in unique_countries)

In [None]:
print("Number of times countries have been sub-divided:")
for key, value in country_frequencies.items():
    print("{key} : {value}".format(key=key, value=value))
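
As an aside that goes beyond the comprehension exercise, the standard library's `collections.Counter` can tally whole lines in a single pass. The sketch below uses a made-up `sample_text` (an assumption, not the NCDC data). It also shows why counting on the raw string can overcount: `str.count` matches substrings, so "oman" is also found inside "romania".

```python
from collections import Counter

# Hypothetical sample text, one country per line
sample_text = "romania\noman\nindia\nindia\n"
lines = sample_text.strip().split("\n")

# Counter tallies each distinct line in one pass over the list
line_counts = Counter(lines)

# str.count on the raw text matches substrings, so it overcounts "oman"
substring_count = sample_text.count("oman")

for name, count in sorted(line_counts.items()):
    print("{name} : {count}".format(name=name, count=count))
print(line_counts["oman"], substring_count)
```

`Counter` is a `dict` subclass, so the printing loop from the cell above works on it unchanged.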

Copyright 2008-2016, Enthought, Inc.  
Use only permitted under license.  Copying, sharing, redistributing or other unauthorized use strictly prohibited.  
http://www.enthought.com