## Homework 1  | Pandas, Exploratory Data Analysis, & Data Provenance
### ASSIGNED: Monday, 11 Feb 2019.   DUE: Thursday, 21 Feb 2019 at 9:30 am. 
<b>PLEASE NOTE: If you're in a technical major, you should not submit either homework #1 (i.e., this homework) or homework #2 to be graded. You may still do this homework if you are unfamiliar with Python.</b>

#### Purpose:
The purpose of this assignment is to 
1. introduce you to working in python and pandas, 
2. heighten your critical acumen when evaluating claims using data,
3. and make you more sensitive to questions regarding data provenance.  

## Data Provenance & EDA

Alain Desrosieres argues that the central tension in histories examining the role of statistics in political discourse is that the statistical entities that statistics uses are both real and fabrications: real in that they must be taken as “uncontestable standards” of reference insofar as they serve as  compelling evidence for a particular claim; fabrications in that they are the result of “the provisional and fragile crowning of a series of conventions of equivalence between entities.”[1] The statistical entity of life expectancy, for instance, is real insofar as it serves as a proxy for the health of populations and individuals, and is used to justify disparities in health and life insurance pricing and coverage for different populations. Yet in calculating life expectancy, one quickly discovers not a single computational method, but hundreds--each with a different set of assumptions that yield different results. Deciding which life expectancy estimation to employ is tied up with what the measure will be used to do, and so involves political, ethical, and even moral decisions about who and what should be counted and excluded.[2] 

Tracing the historical transformation of a statistical entity from a contingent, context-sensitive description into a “universal” property provides insight into the political institutions that created it while also making legible the ways in which that statistical entity exerted a reciprocal pressure back upon the institutions and individuals that created it.[3] Exploring the political implications of statistical entities is further complicated by their historical tendency to be repurposed for use in new arguments. While life expectancy was developed for assigning and categorizing individuals according to their likelihood of death while their life insurance policy was in force, this statistical category was subsequently put to more sinister purposes: namely, to “demonstrate” the existence of racial biological characteristics and then to serve as “evidence” that race was an appropriate category for screening immigrants.[4]

<b>Our immediate objective is to get some practice using Pandas to explore a data set, but we also want to be mindful of the implicit assumptions that using this data entails.</b> The data you'll be using was obtained from the World Health Organization (WHO) website at http://apps.who.int/gho/data/node.main.688. You can find a discussion of how WHO tabulates life expectancy for each country in this PDF (http://www.who.int/healthinfo/statistics/LT_method_1990_2012.pdf). The life expectancies for different countries are often not equally reliable: for many countries, life expectancies were inferred rather than observed directly, and both the sources and amounts of error can differ greatly from country to country. Furthermore, life expectancy as calculated by WHO may not match the "official" life expectancy values as calculated by each country. If you want to make any claims regarding the data you're about to explore, you would first need to know what methods, data, and politics were involved in producing a life expectancy measurement for each country you're examining. As you perform some basic EDA below, ask yourself if the comparisons you're being asked to perform are ethical. For what decisions and contexts would such comparisons be unethical?


[1] Alain Desrosieres, <em>The Politics of Large Numbers: A History of Statistical Reasoning</em>, Cambridge, MA: Harvard University Press, 1998, 324-325.

[2] Desrosieres, <em>The Politics of Large Numbers</em>, 325.

[3] Desrosieres, <em>The Politics of Large Numbers</em>, 324.

[4] Dan Bouk, <em>How Our Days Became Numbered: Risk and the Rise of the Statistical Individual</em>, Chicago, IL: University of Chicago Press, 187-188, 201-202.

## Instructions 
#### This assignment is to be done on your own, but you can talk about the assignment with your classmates if you get stuck. (If you do this, be sure to mention who you worked with in the space provided below.) Feel free to also use Stack Overflow (but please provide a citation and a link to the specific answer if you do this). Finally, you may also visit Will Yumou during his TA office hours. Provide your code to justify your answer to each question.  

#### Be sure to rename this homework notebook so that it includes your name. Finally, please note that your code must run with the "life.expectancy.countries.csv" as originally provided to you. 

#### List any students you talked with about this assignment here:
1. [person 1]
2. [person 2]
3. etc.


## Homework Problems


#### Question 1 (10 points)

Import the life.expectancy.countries.csv file into a pandas dataframe entitled "lifeexpectancy". Rename the column titles of this data frame using the list below entitled "column_names". Use the code provided below to help you do this. Also be sure to drop the first row of your CSV using the following code: 

`lifeexpectancy = lifeexpectancy.drop(lifeexpectancy.index[[0]])`  
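
As a minimal sketch of this rename-then-drop pattern on made-up data (the CSV text, the `toy` dataframe, and `toy_names` are hypothetical and are not the assignment file), the steps might look like this:

```python
import io
import pandas as pd

# Hypothetical CSV text whose first data row repeats a sub-header,
# loosely mimicking a file that ships with two header lines.
csv_text = (
    "Country,Year,Value\n"
    "Country,Year,Both sexes\n"
    "Atlantis,2000,70.1\n"
    "Atlantis,2001,70.4\n"
)

toy = pd.read_csv(io.StringIO(csv_text))

# Replace the column titles with a list of the same length.
toy_names = ["country", "year", "life expectancy"]
toy.columns = toy_names

# Drop the first row (the leftover sub-header).
toy = toy.drop(toy.index[[0]])
print(toy)
```

Assigning to `toy.columns` only works when the list has exactly one name per column, which is worth checking before you rename.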

In [1]:
column_names = ["country", "year", "life expectancy at birth (both sexes)",
                "life expectancy at birth (female)", "life expectancy at birth (male)",
                "life expectancy at age 60 (both sexes)", "life expectancy at age 60 (female)",
                "life expectancy at age 60 (male)"]

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [3]:
# TYPE ANSWER HERE






<b>Important</b>: For the current version of pandas, when you import "life.expectancy.countries.csv" into pandas in the usual manner, it sets all the life expectancy ages (i.e., columns 2 - 7) as "objects" instead of "floats". I'm not sure why it does this, but it will cause problems when you try to plot things. To fix this, be sure to run the following line of code once you've <b>finished</b> question 1 but before you begin question 2: 

In [4]:
le_cols = lifeexpectancy.loc[:, 'life expectancy at birth (both sexes)':].columns  # in recent pandas, assigning through .loc[:, ...] can leave the dtype as object
lifeexpectancy[le_cols] = lifeexpectancy[le_cols].astype(float)
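
To check whether a conversion like this took effect, it can help to inspect `dtypes` before and after. The sketch below uses a hypothetical dataframe (`toy` and its values are invented for illustration) in which the numbers were read in as text:

```python
import pandas as pd

# Hypothetical data: the numeric column arrived as strings.
toy = pd.DataFrame({"country": ["Atlantis", "Atlantis"],
                    "life expectancy": ["70.1", "70.4"]})
print(toy.dtypes)  # the life expectancy column is not a float dtype yet

# Replacing the whole column with its converted version changes its dtype.
toy["life expectancy"] = toy["life expectancy"].astype(float)
print(toy.dtypes)

# pd.to_numeric is an alternative; errors="coerce" turns unparseable
# entries into NaN instead of raising an error.
as_numbers = pd.to_numeric(pd.Series(["70.1", "n/a"]), errors="coerce")
print(as_numbers)
```

Note that `errors="coerce"` silently hides bad entries, so it is worth counting the resulting NaNs before relying on the converted column.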

#### Question 2 (10 points) 

How many different _countries_ do you have data for? How many different years of life expectancy data do you have for each country? 

In [4]:
# TYPE CODE HERE





#### Question 3 (5 points)

Using pandas, make a new dataframe that contains all the data for Brazil. 

Hint: the following pseudocode gives you a general idea of what you need to do: 

<code> lifeexpectancy[lifeexpectancy['column title']=='Name_of_Country']</code>.
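
The pseudocode above relies on boolean indexing. A small sketch on a hypothetical dataframe (`toy`, its column names, and the country names are invented for illustration) shows the two steps separately:

```python
import pandas as pd

toy = pd.DataFrame({"country": ["Atlantis", "Atlantis", "Lemuria"],
                    "year": [2000, 2001, 2000],
                    "value": [70.1, 70.4, 65.2]})

# The comparison yields a boolean Series with one True/False per row.
mask = toy["country"] == "Atlantis"

# Indexing with the mask keeps only the rows where it is True.
atlantis = toy[mask]
print(atlantis)
```

The comparison is exact, so differences in capitalization or stray whitespace in a country name would produce an empty result.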

In [12]:
# TYPE ANSWER HERE





#### Question 4 (5 points)

Plot life expectancy (from birth, "both sexes") as a function of year for Brazil using the dataframe you constructed in question 3.  

In [13]:
# TYPE ANSWER HERE




#### Question 5 (10 points)

Which country has the highest life expectancy (from birth) for men, women, and both sexes? What are the associated years for each of these life expectancies? (Be sure to show your code!)

In [14]:
# TYPE ANSWER HERE





#### Question 6 (30 points)

Using life expectancy data for "both sexes" from birth, which country has the fastest growing life expectancy on average for all years provided? Likewise, which country has the slowest growing (or even fastest decreasing) life expectancy on average for all years provided? Using pandas, plot the life expectancy of these two countries as a function of year in the same graph. <b>Suggestion: If you find yourself getting stuck on this question, consider doing questions 7 and 8 first and then coming back to this question.</b>

In [21]:
# TYPE ANSWER HERE

# HINT #1: 
# We're going to treat years as numbers, and in order to do this we need to change years to floats or ints.
# You don't have to solve it this way, but this is one way to do it.

# Set years to floats.
# Note that some values may be NaN; astype(float) keeps them as NaN without warning, so do this with care...
num_cols = lifeexpectancy.loc[:, 'year':].columns
lifeexpectancy[num_cols] = lifeexpectancy[num_cols].astype(float)


# HINT #2: 
# Determine slopes for all countries for entire time period of the data set

# HINT #3:
# Identify which country has the most extreme positive slope and which country has the most extreme negative slope 

# HINT #4:
# Graph both countries in the same plot
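
One possible reading of "slope" in the hints is the least-squares slope of life expectancy against year. This is an assumption, not the only valid definition; the change between the first and last year divided by the number of years is another. A sketch on hypothetical data (`toy`, `fitted_slope`, and all values are invented for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["Atlantis"] * 3 + ["Lemuria"] * 3,
    "year": [2000.0, 2001.0, 2002.0] * 2,
    "value": [70.0, 71.0, 72.0, 65.0, 64.5, 64.0],
})

def fitted_slope(group):
    """Least-squares slope of value against year for one country."""
    clean = group.dropna(subset=["year", "value"])
    slope, _intercept = np.polyfit(clean["year"], clean["value"], deg=1)
    return slope

# One slope per country, in units of "value" per year.
slopes = pd.Series({name: fitted_slope(group)
                    for name, group in toy.groupby("country")})

fastest = slopes.idxmax()  # label with the largest slope
slowest = slopes.idxmin()  # label with the smallest (most negative) slope
print(slopes)
print(fastest, slowest)
```

Dropping NaN rows before fitting matters here because `np.polyfit` does not skip missing values on its own.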

#### Question 7 (15 points)

Pick 3 countries you'd like to compare, and plot their life expectancies (from birth, "both sexes") on the same graph.  

In [17]:
# TYPE ANSWER HERE



#### Question 8 (15 points)

Plot the _average life expectancy_ for _all_ countries as a function of year.  

In [18]:
# TYPE ANSWER HERE

