<div align="right">Python 2.7 Jupyter Notebook</div>

# Sources of data

### Your completion of the Notebook exercises will be graded based on your ability to: 

> **Apply**: Are you able to execute code, using the supplied examples, that performs the required functionality on supplied or generated data sets? 

> **Evaluate**: Are you able to interpret the results and justify your interpretation based on the observed data?

> **Create**: Are you able to produce notebooks that serve as a computational record of a session and that can be used to share your insights with others? 

# Notebook introduction

Data collection is expensive and time consuming, as Arek Stopczynski alluded to in the video 2 resource on the learning path. 
In some cases you will be lucky enough to have existing datasets available to support your analysis. You may have datasets from previous analyses, access to providers, or curated datasets from your organization. In many cases, however, you will not have access to the data that you require to support your analysis, and you will have to find alternate mechanisms. 
The data quality requirements will differ based on the problem that you are trying to solve. Taking the hypothetical case of geocoding a location that was introduced in Module 1, the accuracy of the geocoded location does not need to be exact when you are simply trying to plot the locations of students on a map. Geocoding a location for an automated vehicle to turn off the highway, on the other hand, has an entirely different accuracy requirement.

> **Note**:

> Those of you who work in large organizations may be privileged enough to have company data governance and data quality initiatives. These efforts and teams can generally add significant value both in terms of supplying company-standard curated data, and making you aware of the internal policies that need to be adhered to.

As a data analyst or data scientist, it is important to be aware of the implications of your decisions. You need to choose the appropriate set of tools and methods to deal with sourcing and supplying data.

Technology has matured in recent years, giving access to a host of data sources that can be used in your analyses. In many cases you can access free resources, or pay for data that has been curated, has lower latency, or comes with a service-level agreement. Some governments have even made datasets publicly available.

You have been introduced to [OpenPDS](http://openpds.media.mit.edu/) in the video content where the focus shifts from supplying raw data - where the provider needs to apply security principles before sharing datasets - to supplying answers rather than data. OpenPDS allows users to collect, store, and control access to their data, while also allowing them to protect their privacy. In this way, users still have ownership of their data, as defined in the new deal on data. 

This notebook will demonstrate another example of sourcing external data to enrich your analyses. The Python ecosystem contains a rich set of tools and libraries that can help you to exploit the available resources.

This course will not go into detail regarding the various options to source and interact with social data from sources such as Twitter, LinkedIn, Facebook, and Google Plus. However, you should be able to find libraries that will assist you in sourcing and manipulating these sources of data.

Twitter data is a good example. Depending on the options selected by the Twitter user, a tweet contains more than the message or content that most users are aware of: it also contains a view of the person's network, their home location, the location from which the message was sent, and a number of other features that can be very useful when studying networks around a topic of interest. Professor Pentland pointed out the difference between what you share with the world (how you want to be seen) and what you actually do and believe (what you commit to). Keep these concepts in mind when you start exploring additional sources of data. If you are interested in the topic, you can start exploring the options by visiting the [twitter library on PyPI](https://pypi.python.org/pypi/twitter). 

Start with the five Rs introduced in Module 1, and consider the following questions:
- How accurate does my dataset need to be?
- How often should the dataset be updated?
- What happens if the data provider is no longer available?
- Do I need to adhere to any organizational standards to ensure consistent reporting or integration with other applications?
- Are there any implications to getting the values wrong?

You may need to start with “untrusted” data sources as a means of validating that your analysis can be executed. Once this is done, you can replace the untrusted components with trusted and curated datasets as your analysis matures.
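One way to judge when an untrusted source is good enough, or to check it once a trusted source becomes available, is to compare the two on the records they share. The sketch below is illustrative only: the function name `flag_discrepancies`, the column names, the made-up values, and the 10% tolerance are all assumptions rather than part of the course material.

```python
import pandas as pd


def flag_discrepancies(untrusted, trusted, key, value, tolerance=0.10):
    """Return rows where the two sources differ by more than `tolerance`.

    `tolerance` is a relative difference (0.10 = 10%). It is an assumed
    threshold; choose it from the accuracy your analysis requires.
    """
    merged = pd.merge(untrusted, trusted, on=key, how='inner',
                      suffixes=('_untrusted', '_trusted'))
    difference = (merged[value + '_untrusted'] - merged[value + '_trusted']).abs()
    merged['rel_diff'] = difference / merged[value + '_trusted'].abs()
    return merged[merged['rel_diff'] > tolerance].reset_index(drop=True)


# Hypothetical values, used only to exercise the function.
untrusted = pd.DataFrame({'country': ['A', 'B', 'C'],
                          'population': [100.0, 250.0, 300.0]})
trusted = pd.DataFrame({'country': ['A', 'B', 'C'],
                        'population': [101.0, 200.0, 300.0]})

flagged = flag_discrepancies(untrusted, trusted, 'country', 'population')
print(flagged)
```

Only the rows that exceed the tolerance are returned, so an empty result suggests that the untrusted source agrees with the reference to within the accuracy you asked for.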

> **Note**: 

> It is strongly recommended that you save a checkpoint after applying significant changes or completing exercises. This allows you to return the notebook to a previous state should you wish to do so. On the Jupyter menu, select "File", then "Save and Checkpoint" from the dropdown menu that appears.

#### Load libraries and set options

In [1]:
import pandas as pd
from pandas_datareader import data, wb
import numpy as np
import matplotlib
import folium
import geocoder
#import urllib2
%pylab inline
pylab.rcParams['figure.figsize'] = (10, 8)

Populating the interactive namespace from numpy and matplotlib


# 1. Source additional data from public sources 
## 1.1 World-bank

This example will demonstrate how to source data from an external source to enrich your existing analyses. You will need to combine the data sources and add additional features to the example of student locations plotted on the world map in Module 1, Notebook 3.

The specific indicator chosen has little relevance other than to demonstrate the process that you will typically follow in completing your projects. Population counts, from an untrusted source, are added to the map, and scaling factors are applied to the number of students and the population size of each country to demonstrate how external data can be added with minimal effort.

You can read more about the library that is utilized in this notebook [here](https://pandas-datareader.readthedocs.io/en/latest/remote_data.html#world-bank).

In [11]:
# Load the grouped_geocoded dataset from Module 1.
df1 = pd.read_csv('data/grouped_geocoded.csv',index_col=[0])

# Prepare the student location dataset for use in this example.
# We use the geometrical center by obtaining the mean location for all observed coordinates per country.
df2 = df1.groupby('country').agg({'student_count': [np.sum], 'lat': [np.mean], 
                                  'long': [np.mean]}).reset_index()
# Reset the index.
df3 = df2.reset_index(level=1, drop=True)

# Get the external dataset from worldbank
#  We have selected indicator, "SP.POP.TOTL"
df4 = wb.download(
                    # Specify indicator to retrieve
                    indicator='SP.POP.TOTL',
                    country=['all'],
                    # Start Year
                    start='2008',
                    # End Year
                    end=2016
                )

# The dataset contains entries for multiple years.
#    Keep, per country, the row holding the maximum value of the indicator.
#    For a growing population this is usually, but not always, the latest year.
df5 = df4.reset_index()
idx = df5.groupby(['country'])['SP.POP.TOTL'].transform(max) == df5['SP.POP.TOTL']

# Create a new dataframe containing only the rows selected by the mask above.
df6 = df5[idx]

# Combine the student and population datasets.
df7 = pd.merge(df3, df6, on='country', how='left')

# Rename the columns of our merged dataset.
df8 = df7.rename(index=str, columns={('lat', 'mean'): "lat_mean", 
                                ('long', 'mean'): "long_mean", 
                                ('SP.POP.TOTL'): "PopulationTotal_Latest_WB",
                                ('student_count', 'sum'): "student_count"}
           )

          country     (country, )   lat_mean  student_count   long_mean  year  \
0       Australia       Australia -34.256125             26  147.042406  2015   
1         Austria         Austria  47.796835              3   16.027805  2015   
2         Belgium         Belgium  50.807547              6    4.378567  2015   
3          Brazil          Brazil -21.556434             24  -44.808458  2015   
4          Canada          Canada  46.172621             61  -88.355898  2015   
5           Chile           Chile -33.448890              1  -70.669266  2015   
6           China           China  25.726065              6  113.358120  2015   
7      Costa Rica      Costa Rica   9.920695              2  -84.146152  2015   
8         Croatia         Croatia  45.810939              2   15.896201  2008   
9  Czech Republic  Czech Republic  49.911921              2   13.913994  2015   

   PopulationTotal_Latest_WB  
0               2.378117e+07  
1               8.611088e+06  
2              

> **Note**:

> The cell above will complete with a warning message the first time that you execute the cell. You can ignore the warning and continue to the next cell to plot the indicator added.

> The visualization below has no analytical meaning. The scaling factors selected are used only to demonstrate the difference in population sizes and number of students on this course per country.
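The data preparation cell above keeps, for each country, the row with the largest indicator value, which is not necessarily the most recent year (Croatia, for example, is matched to 2008 in the output). If you want the latest reported value instead, one possible approach is sketched below on made-up data. The column name `value` and the toy numbers are assumptions, and the string-to-integer conversion is there because the year may arrive as text.

```python
import pandas as pd

# Made-up long-format data mimicking the shape of the downloaded
# indicator after reset_index(): one row per country and year.
raw = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': ['2013', '2014', '2015', '2013', '2014', '2015'],
    'value': [10.0, 11.0, 12.0, 9.0, 8.0, None],
})

# Convert the year to an integer so that sorting is numeric.
raw['year'] = raw['year'].astype(int)

# Drop years with no reported value, then keep the most recent row per country.
latest = (raw.dropna(subset=['value'])
             .sort_values(['country', 'year'])
             .groupby('country')
             .tail(1)
             .reset_index(drop=True))

# For comparison: the row holding the maximum value per country.
largest = raw.loc[raw.groupby('country')['value'].idxmax()].reset_index(drop=True)

print(latest)
print(largest)
```

For country B the two selections disagree: the latest reported value is from 2014, while the maximum value is from 2013. Which one you want depends on the question you are asking of the data.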

In [3]:
# Plot the combined dataset

# Set map center and zoom level
mapc = [0, 30]
zoom = 2

# Create map object.
map_osm = folium.Map(location=mapc,
                     tiles='Stamen Toner',
                     zoom_start=zoom)

# Plot each of the locations that we geocoded.
for j in range(len(df8)):
    # Plot a blue circle marker for country population.
    folium.CircleMarker([df8.lat_mean[j], df8.long_mean[j]],
                    radius=df8.PopulationTotal_Latest_WB[j]/500,
                    popup='Population',
                    color='#3186cc',
                    fill_color='#3186cc',
                   ).add_to(map_osm)
    # Plot a red circle marker for students per country.
    folium.CircleMarker([df8.lat_mean[j], df8.long_mean[j]],
                    radius=df8.student_count[j]*10000,
                    popup='Students',
                    color='red',
                    fill_color='red',
                   ).add_to(map_osm)
# Show the map.
map_osm
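The scaling factors in the plotting cell above are hand-picked for each series. As an alternative, the sketch below scales any non-negative series to a chosen maximum radius using a square root, so that marker area rather than radius tracks the value. The function name `scale_radius` and the default `max_radius` are assumptions for illustration and are not folium requirements.

```python
import numpy as np
import pandas as pd


def scale_radius(values, max_radius=30.0):
    """Map non-negative values to radii so that marker AREA tracks the value.

    The largest value gets `max_radius`; missing values get radius 0.
    """
    values = pd.Series(values, dtype=float)
    largest = values.max()  # NaN entries are skipped by default
    if not largest > 0:
        # Covers an all-NaN or all-zero series.
        return pd.Series(0.0, index=values.index)
    return (np.sqrt(values / largest) * max_radius).fillna(0.0)


# Hypothetical values: the NaN stands for a country that found no match
# in the left merge and therefore has no indicator value.
population = [4.0e6, 1.0e6, np.nan, 1.6e7]
radii = scale_radius(population)
print(radii.tolist())
```

Giving missing values a radius of zero keeps the plotting loop from failing on countries that did not match in the merge, at the cost of hiding them, so it is worth checking how many rows are affected before plotting.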

<br>
<div class="alert alert-info">
<b>Exercise 1 Start.</b>
</div>

### Instructions

> Copy the code from the previous two cells into the cells below. After you've reviewed the available indicators in the [worldbank](http://data.worldbank.org/indicator) dataset, replace the population indicator with an indicator of your choice. Add comments (lines starting with #) giving a brief description of your view on the observed results. Make sure to provide the tutor with a clear description of why you selected the indicator, what your expectation was when you started and what you think the results may indicate.

> **Note**: Advanced users are welcome to source data from alternate data sources, or to manually upload files to their virtual analysis environment.


In [18]:
# Your data preparation code
# Load the grouped_geocoded dataset from Module 1.
df1 = pd.read_csv('data/grouped_geocoded.csv',index_col=[0])

# Prepare the student location dataset for use in this example.
# We use the geometrical center by obtaining the mean location for all observed coordinates per country.
df2 = df1.groupby('country').agg({'student_count': [np.sum], 'lat': [np.mean], 
                                  'long': [np.mean]}).reset_index()
# Reset the index.
df3 = df2.reset_index(level=1, drop=True)

# Get the external dataset from worldbank
#  We have selected indicator, "NY.GDP.PCAP.KD"
df4 = wb.download(
                    # Specify indicator to retrieve
                    indicator='NY.GDP.PCAP.KD',
                    country=['all'],
                    # Start Year
                    start='2008',
                    # End Year
                    end=2016
                )

# The dataset contains entries for multiple years.
#    Keep, per country, the row holding the maximum value of the indicator.
#    This is not necessarily the latest year.
df5 = df4.reset_index()
idx = df5.groupby(['country'])['NY.GDP.PCAP.KD'].transform(max) == df5['NY.GDP.PCAP.KD']

# Create a new dataframe containing only the rows selected by the mask above.
df6 = df5[idx]

# Combine the student and GDP datasets.
df7 = pd.merge(df3, df6, on='country', how='left')

# Rename the columns of our merged dataset.
df8 = df7.rename(index=str, columns={('lat', 'mean'): "lat_mean", 
                                ('long', 'mean'): "long_mean", 
                                ('NY.GDP.PCAP.KD'): "GDPTotal_Latest_WB",
                                ('student_count', 'sum'): "student_count"}
           )

In [19]:
# Your plotting code
# Plot the combined dataset

# Set map center and zoom level
mapc = [0, 30]
zoom = 2

# Create map object.
map_osm = folium.Map(location=mapc,
                     tiles='Stamen Toner',
                     zoom_start=zoom)

# Plot each of the locations that we geocoded.
for j in range(len(df8)):
    # Plot a blue circle marker for country GDP.
    folium.CircleMarker([df8.lat_mean[j], df8.long_mean[j]],
                    radius=df8.GDPTotal_Latest_WB[j]*2,
                    popup='GDP',
                    color='#3186cc',
                    fill_color='#3186cc',
                   ).add_to(map_osm)
    # Plot a red circle marker for students per country.
    folium.CircleMarker([df8.lat_mean[j], df8.long_mean[j]],
                    radius=df8.student_count[j]*10000,
                    popup='Students',
                    color='red',
                    fill_color='red',
                   ).add_to(map_osm)
# Show the map.
map_osm

In [None]:
# Comment: I chose GDP per capita as the indicator because it is a representative measure in the World Bank dataset.
# My expectation when I started was that some European countries would have a larger GDP radius than student radius.
# The results suggest that this expectation holds, and that wealthier countries tend to have at least a few students, even if not many.

<br>
<div class="alert alert-info">
<b>Exercise 1 End.</b>
</div>

> **Exercise complete**:
    
> This is a good time to "Save and Checkpoint".

## 1.2 Wikipedia

To demonstrate how quickly data can be sourced from public, "untrusted" data sources, you have been supplied with a number of sample scripts below. While these sources contain extremely rich datasets that you can acquire with minimal effort, they can be amended by anyone and may not be 100% accurate. In some cases you will have to manually transform the datasets, while in others you might be able to use pre-built libraries.

Execute the code cells below before completing exercise 2.

In [20]:
#!pip install wikipedia
import wikipedia

# Display page summary
print wikipedia.summary("MIT")

The Massachusetts Institute of Technology (MIT) is a private research university in Cambridge, Massachusetts. Founded in 1861 in response to the increasing industrialization of the United States, MIT adopted a European polytechnic university model and stressed laboratory instruction in applied science and engineering. Researchers worked on computers, radar, and inertial guidance during World War II and the Cold War. Post-war defense research contributed to the rapid expansion of the faculty and campus under James Killian. The current 168-acre (68.0 ha) campus opened in 1916 and extends over 1 mile (1.6 km) along the northern bank of the Charles River basin.
MIT, with five schools and one college which contain a total of 34 departments, is often cited as among the world's top universities. The Institute is traditionally known for its research and education in the physical sciences and engineering, and more recently in biology, economics, linguistics, and management as well. The "Enginee

In [21]:
# Display a single sentence summary.
wikipedia.summary("MIT", sentences=1)

u'The Massachusetts Institute of Technology (MIT) is a private research university in Cambridge, Massachusetts.'

In [22]:
# Create variable page that contains the wikipedia information.
page = wikipedia.page("List of countries and dependencies by population")

# Display the page title.
page.title

u'List of countries and dependencies by population'

In [23]:
# Display the page URL. This can be utilised to create links back to descriptions.
page.url

u'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
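As noted at the start of this section, some public sources have to be transformed manually before they can be used. The sketch below assumes rows copied from a web table, with thousands separators and bracketed footnote markers. The country names, values, and the helper `strip_footnotes` are invented for illustration and do not come from the Wikipedia page above.

```python
import re

import pandas as pd

# Invented rows in the style of a table copied from a web page.
rows = [
    ('Country A[a]', '1,234,567'),
    ('Country B', '89,012 [b]'),
    ('Country C', 'n/a'),
]
table = pd.DataFrame(rows, columns=['country', 'population'])


def strip_footnotes(text):
    """Remove bracketed footnote markers such as [a] or [12]."""
    return re.sub(r'\[[^\]]*\]', '', text).strip()


table['country'] = table['country'].map(strip_footnotes)

# Remove footnote markers and thousands separators from the numbers.
cleaned = table['population'].map(strip_footnotes)
cleaned = cleaned.map(lambda text: text.replace(',', ''))

# errors='coerce' turns unparseable entries into NaN instead of raising.
table['population'] = pd.to_numeric(cleaned, errors='coerce')
print(table)
```

Entries that cannot be parsed end up as missing values rather than stopping the cell, which makes it easy to count how much of a manually sourced table needs attention.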

<br>
<div class="alert alert-info">
<b>Exercise 2 Start.</b>
</div>

### Instructions

> After executing the cells for the Wikipedia example in section 1.2, think about the potential implications of using these "public" and, in many cases, "untrusted" data sources when doing analysis or creating data products.

> **Please compile and submit a short list of pros and cons (three each). Your submission will be evaluated.**

> Your submission can be a simple markdown list or you can use the table syntax provided below.

Add your answer in this markdown cell. The contents of this cell should be replaced with your answer.

**Submit as a list:**

ListType
- Pro: Answers can be obtained quickly.
- Pro: Convenient for answering general questions.
- Pro: General information is aggregated in one place.
- Con: Not reliable.
- Con: Content can be changed many times.
- Con: Hard to get accurate answers to specific questions.

**or as a table.**

| Type | Description |
| ---- | ----------- |
| Pro | Description 1 |
| Pro | Description 2 |
| Pro | Description 3 |
| Con | Description 1 |
| Con | Description 2 |
| Con | Description 3 |


<br>
<div class="alert alert-info">
<b>Exercise 2 End.</b>
</div>

> **Exercise complete**:
    
> This is a good time to "Save and Checkpoint".

## 2. Submit your notebook

Please make sure that you:
- Perform a final "Save and Checkpoint";
- Download a copy of the notebook in ".ipynb" format to your local machine using "File", "Download as", and "IPython Notebook (.ipynb)"; and
- Submit a copy of this file to the online campus.