First, I'll import pandas so that I can read and manipulate the data in our CSV file.

In [1]:
import pandas as pd

In [2]:
life_ex = pd.read_csv('all_data.csv')

In [4]:
life_ex.head()

,Country,Year,Life expectancy at birth (years),GDP
0,Chile,2000,77.3,77860930000.0
1,Chile,2001,77.3,70979920000.0
2,Chile,2002,77.8,69736810000.0
3,Chile,2003,77.9,75643460000.0
4,Chile,2004,78.0,99210390000.0


To summarize the mean life expectancy and GDP for each country in our dataset, I'll chain the `groupby()` and `mean()` methods. To make the result easier to read, I'll also call `sort_values()` to list the countries in descending order of mean life expectancy. This has the added bonus of showing me every unique country in our dataset.

In [13]:
life_ex.groupby('Country').mean().sort_values(by=['Life expectancy at birth (years)'], ascending=False)

Country,Year,Life expectancy at birth (years),GDP
Germany,2007.5,79.65625,3094776000000.0
Chile,2007.5,78.94375,169788800000.0
United States of America,2007.5,78.0625,14075000000000.0
Mexico,2007.5,75.71875,976650600000.0
China,2007.5,74.2625,4957714000000.0
Zimbabwe,2007.5,50.09375,9062580000.0
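
Since the mean of `Year` isn't very meaningful in this summary, one option is to select only the two columns of interest before averaging. The sketch below is a minimal illustration on a small stand-in frame rather than the real file: the Chile rows are copied from the `head()` output, while `Exampleland` and its values are invented purely for the example.

```python
import pandas as pd

# Stand-in data: Chile rows come from the head() output above;
# "Exampleland" is a hypothetical country with made-up values.
sample = pd.DataFrame({
    "Country": ["Chile", "Chile", "Exampleland", "Exampleland"],
    "Year": [2000, 2001, 2000, 2001],
    "Life expectancy at birth (years)": [77.3, 77.3, 60.0, 62.0],
    "GDP": [77860930000.0, 70979920000.0, 1.0e9, 3.0e9],
})

# Select only the columns we want to average, so Year is left out.
summary = (
    sample.groupby("Country")[["Life expectancy at birth (years)", "GDP"]]
    .mean()
    .sort_values(by="Life expectancy at birth (years)", ascending=False)
)
print(summary)
```

Swapping `sample` for `life_ex` should give the same table as above, minus the `Year` column.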


We can see that Germany has the highest life expectancy, but not the highest GDP. These seem like two metrics that would be interesting to plot against each other.

We can also see that Zimbabwe has the lowest life expectancy by a wide margin (about 24 years below China, the next lowest), which makes it effectively an outlier in this data. This also makes me curious how many data points we have for each country. Below I'll use the `value_counts()` method on the `Country` column to check this.

In [15]:
life_ex['Country'].value_counts()

Chile                       16
China                       16
Germany                     16
Mexico                      16
United States of America    16
Zimbabwe                    16
Name: Country, dtype: int64

Nice to see that we have an equal number of data points from each country. I would also like to quickly check whether there are any missing data points. The quickest way I can think of is using `isnull()` and `value_counts()` together, which counts each distinct combination of `True`/`False` across the columns. If any `True` values appear, we can investigate those further.

In [17]:
life_ex.isnull().value_counts()

Country  Year   Life expectancy at birth (years)  GDP  
False    False  False                             False    96
dtype: int64
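
An alternative that I find easier to read when there are many columns is `isnull().sum()`, which reports the number of missing values per column. This is only a sketch on a small stand-in frame: the rows are based on the Chile data from `head()`, but one life expectancy value has been deliberately replaced with `NaN` so there is something to find.

```python
import pandas as pd

# Stand-in data based on the Chile rows above; the NaN is injected
# on purpose for illustration and is NOT present in the real dataset.
sample = pd.DataFrame({
    "Country": ["Chile", "Chile", "Chile"],
    "Year": [2000, 2001, 2002],
    "Life expectancy at birth (years)": [77.3, float("nan"), 77.8],
    "GDP": [77860930000.0, 70979920000.0, 69736810000.0],
})

# Count missing values in each column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Pull out the rows that contain at least one missing value.
rows_with_missing = sample[sample.isnull().any(axis=1)]
print(rows_with_missing)
```

On the real `life_ex` frame, every count should come back as zero, matching the result above.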

An equal number of values for each country with no missing data points... this data is looking pretty clean so far.

It appears from the `head()` output and the identical mean `Year` of 2007.5 for every country that the data is spread evenly across years. It would be nice to confirm this and get the range at the same time.

In [18]:
life_ex['Year'].value_counts()

2000    6
2001    6
2002    6
2003    6
2004    6
2005    6
2006    6
2007    6
2008    6
2009    6
2010    6
2011    6
2012    6
2013    6
2014    6
2015    6
Name: Year, dtype: int64
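
The counts show six rows for every year from 2000 to 2015. For a more explicit check, the range and the per-country coverage can be computed directly. The following is a rough sketch on a toy frame with two hypothetical countries (`A` and `B`), not the real data; it also assumes there are no duplicate country/year rows.

```python
import pandas as pd

# Toy stand-in: two hypothetical countries over three years.
sample = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [2000, 2001, 2002, 2000, 2001, 2002],
})

# First and last year covered by the data.
first_year, last_year = sample["Year"].min(), sample["Year"].max()

# Number of distinct years recorded for each country.
years_per_country = sample.groupby("Country")["Year"].nunique()

# Every country should cover each year in the range exactly once.
expected_years = last_year - first_year + 1
every_country_complete = bool((years_per_country == expected_years).all())
print(first_year, last_year, every_country_complete)
```

Run against `life_ex`, this should report 2000 and 2015 as the range if the data is as complete as the counts above suggest.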