# Data analyses with pandas dataframes

---

### Learning objectives

- Describe what a dataframe is.
- Load external data from a .csv file into a dataframe with pandas.
- Summarize the contents of a dataframe with pandas.
- Learn to use dataframe attributes `loc[]`, `head()`, `info()`, `describe()`, `shape`, `columns`, `index`.
- Learn to clean dirty data.
- Understand the split-apply-combine concept for data analysis.
    - Use `groupby()`, `mean()`, `agg()` and `size()` to apply this technique.

### Lesson outline

- Manipulating and analyzing data with pandas
    - Data set background (10 min)
    - What are dataframes (15 min)
    - Data wrangling with pandas (40 min)
- Cleaning data (20 min)
- Split-apply-combine techniques in `pandas`
    - Using `mean()` to summarize categorical data (20 min)
    - Using `size()` to summarize categorical data (15 min)
    
---

## Dataset background

Today,
we will be working with real data about the world
combined from multiple sources by the [Gapminder foundation].
Gapminder is an independent Swedish organization
that fights devastating misconceptions about global development.
They also promote a fact-based world view
through the production of free teaching and data exploration resources.
Insights from the Gapminder data sources
have been popularized through the efforts of public health professor Hans Rosling.
It is highly recommended to check out his [entertaining videos],
most famously [The best stats you have ever seen].
Before we start exploring the data,
we recommend taking [this 5-10 min quiz],
to see how knowledgeable (or ignorant) you are about the world.
Then we will learn how to dive deeper into this data using Python!

[Gapminder foundation]: https://www.gapminder.org/about-gapminder/
[entertaining videos]: https://www.gapminder.org/videos/
[The best stats you have ever seen]: https://www.youtube.com/watch?v=hVimVzgtD6w
[this 5-10 min quiz]: http://forms.gapminder.org/s3/test-2018

The dataset is stored as a comma separated value (CSV) file. Each row
holds information for a single country in a single year, and the columns represent:

| Column                | Description                                                                                            |
|-----------------------|--------------------------------------------------------------------------------------------------------|
| country               | Country name                                                                                           |
| year                  | Year of observation                                                                                    |
| population            | Population in the country at each year                                                                 |
| region                | Continent the country belongs to                                                                       |
| sub_region            | Sub regions as defined by                                                                              |
| income_group          | Income group [as specified by the world bank]                                                          |
| life_expectancy       | The average number of years a newborn child would <br>live if mortality patterns were to stay the same |
| income                | GDP per capita (in USD) adjusted <br>for differences in purchasing power                               |
| children_per_woman    | Number of children born to each woman                                                                  |
| child_mortality       | Deaths of children under 5 years <br>of age per 1000 live births                                       |
| pop_density           | Average number of people per km<sup>2</sup>                                                            |
| co2_per_capita        | CO2 emissions from fossil fuels (tonnes per capita)                                                    |
| years_in_school_men   | Average number of years attending primary, secondary,<br>and tertiary school for 25-36 years old men   |
| years_in_school_women | Average number of years attending primary, secondary,<br>and tertiary school for 25-36 years old women |

[as specified by the world bank]: https://datahelpdesk.worldbank.org/knowledgebase/articles/378833-how-are-the-income-group-thresholds-determined

To read the data into Python,
we are going to use a function called `read_csv` from the Python-package [pandas].
As mentioned previously,
Python packages are a bit like phone apps:
they are not essential to the core Python library,
but provide domain-specific functionality.
To use a package,
it first needs to be imported.

[pandas]: https://pandas.pydata.org/

In [1]:
# pandas is given the nickname `pd`
import pandas as pd

pandas can read CSV files saved on the computer or directly from a URL.
Here,
we read data that we have compiled from Gapminder
and uploaded to our GitHub repository.

In [3]:
url = 'https://raw.githubusercontent.com/UofTCoders/2018-09-10-utoronto/go-pages/data/world-data-Gapminder.csv'
world_data = pd.read_csv(url)


To view the dataframe that pandas created,
type `world_data` in a cell and run it,
just as when viewing the content of any variable in Python.

In [5]:
world_data

Unnamed: 0,country,year,population,region,sub_region,income_group,life_expectancy,income,children_per_woman,child_mortality,pop_density,co2_per_capita,years_in_school_men,years_in_school_women
0,Afghanistan,1800,3280000,Asia,Southern Asia,Low,28.2,603,7.00,469.0,,,,
1,Afghanistan,1801,3280000,Asia,Southern Asia,Low,28.2,603,7.00,469.0,,,,
2,Afghanistan,1802,3280000,Asia,Southern Asia,Low,28.2,603,7.00,469.0,,,,
3,Afghanistan,1803,3280000,Asia,Southern Asia,Low,28.2,603,7.00,469.0,,,,
4,Afghanistan,1804,3280000,Asia,Southern Asia,Low,28.2,603,7.00,469.0,,,,
5,Afghanistan,1805,3280000,Asia,Southern Asia,Low,28.2,603,7.00,469.0,,,,
6,Afghanistan,1806,3280000,Asia,Southern Asia,Low,28.1,603,7.00,470.0,,,,
7,Afghanistan,1807,3280000,Asia,Southern Asia,Low,28.1,603,7.00,470.0,,,,
8,Afghanistan,1808,3280000,Asia,Southern Asia,Low,28.1,603,7.00,470.0,,,,
9,Afghanistan,1809,3280000,Asia,Southern Asia,Low,28.1,603,7.00,470.0,,,,


This is how a dataframe is displayed in the Jupyter notebook.
The Jupyter notebook displays pandas dataframes in a tabular format,
and adds cosmetic conveniences such as the bold font type for the column and row names,
the alternating grey and white zebra stripes for the rows,
and highlighting of the row the mouse pointer hovers over.
The increasing numbers on the far left are the dataframe's index or row names.
These are not present in the CSV file,
but were added by `pandas` to easily distinguish between the rows.

## What are dataframes?

A dataframe is the representation of data in a tabular format,
similar to how data is often arranged in spreadsheets.
The data is rectangular,
meaning that all rows have the same number of columns
and all columns have the same number of rows.
As mentioned in the previous lectures,
when our data is arranged in a tidy format,
the columns can be referred to as the "features" or "variables" of the data,
while each row represents an individual "observation".
Dataframes are the standard data structure for most tabular data,
and what we will use for data wrangling, statistics and plotting.
A dataframe can be created by hand,
but most commonly they are generated by an input function,
such as `read_csv()`,
when importing spreadsheet data from your hard drive (or the web).

As can be seen above,
the default is to display the first and last five rows
and truncate everything in between,
as indicated by the ellipsis (`...`).
If we wanted to display only the first 5 lines,
we could use the `head()` method.

In [6]:
world_data.head()

Unnamed: 0,country,year,population,region,sub_region,income_group,life_expectancy,income,children_per_woman,child_mortality,pop_density,co2_per_capita,years_in_school_men,years_in_school_women
0,Afghanistan,1800,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,
1,Afghanistan,1801,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,
2,Afghanistan,1802,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,
3,Afghanistan,1803,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,
4,Afghanistan,1804,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,


Methods are very similar to functions.
The main difference is that they belong to an object
(the method `head()` belongs to the dataframe `world_data`).
Methods operate on the object they belong to,
which is why we can call the method with empty parentheses and no arguments.
Compare this with the function `type()` that was introduced previously.

In [7]:
type(world_data)

pandas.core.frame.DataFrame

Here,
the `world_data` variable is explicitly passed as an argument to `type()`.
An immediately tangible advantage with methods is that they simplify tab completion.
Just type the name of the dataframe,
a period,
and then hit tab to see all the relevant methods for that dataframe
instead of fumbling around with all the available functions in Python
(there are quite a few!)
and figuring out which ones operate on dataframes and which do not.
Methods also facilitate readability when chaining many operations together,
which will be shown in detail later.
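
The difference between calling a function and calling a method can be sketched on a tiny dataframe. The `toy` dataframe and its values below are made up for illustration and are not part of the Gapminder data.

```python
import pandas as pd

# A small made-up dataframe, used only for illustration
toy = pd.DataFrame({'country': ['A', 'B', 'C'],
                    'year': [1800, 1801, 1802]})

# A function receives the object as an explicit argument
n_rows = len(toy)

# A method belongs to the object and is called with the dot-syntax
first_two = toy.head(2)

# Methods can be chained: each call operates on the result of the previous one
largest_year = toy['year'].sort_values(ascending=False).head(1)
print(n_rows, first_two.shape, largest_year.iloc[0])
```

Reading the chained line from left to right gives the order of the operations: select the column, sort it, then keep the first value.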

The columns in a dataframe can contain data of different types,
e.g. integers, floats, and objects (which include strings, lists, dictionaries, and more).
General information about the dataframe
(including the column data types)
can be obtained with the `info()` method.

In [8]:
world_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38982 entries, 0 to 38981
Data columns (total 14 columns):
country                  38982 non-null object
year                     38982 non-null int64
population               38982 non-null int64
region                   38982 non-null object
sub_region               38982 non-null object
income_group             38982 non-null object
life_expectancy          38982 non-null float64
income                   38982 non-null int64
children_per_woman       38982 non-null float64
child_mortality          38980 non-null float64
pop_density              12282 non-null float64
co2_per_capita           16285 non-null float64
years_in_school_men      8188 non-null float64
years_in_school_women    8188 non-null float64
dtypes: float64(7), int64(3), object(4)
memory usage: 4.2+ MB


The information includes the total number of rows and columns,
the number of non-null observations,
the column data types,
and the memory (RAM) usage.
The number of non-null observations is not the same for all columns,
which means that some columns contain null (or NA) values
indicating that there is missing data for some observations.
The column data type indicates which type of data is stored in that column,
and approximately corresponds to the following:

- **Categorical/Qualitative**
    - Nominal (labels, e.g. 'red', 'green', 'blue')
        - `object`, `category`
    - Ordinal (labels with order, e.g. 'Jan', 'Feb', 'Mar')
        - `object`, `category`, `int`
    - Binary (only two outcomes, e.g. True or False)
        - `bool`
- **Quantitative/Numerical**
    - Discrete (whole numbers, often counting, e.g. number of children)
        - `int`
    - Continuous (measured values with decimals, e.g. weight)
        - `float`

Note that an `object` could contain different types,
e.g. `str` or `list`.
Also note that there can be exceptions to the schema above,
but it is a useful general guide.
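
As a rough illustration of this schema, the sketch below builds a small made-up dataframe with one column per kind of variable and inspects the data types pandas assigns. The column names and values are hypothetical. Text columns are reported as `object` in the pandas version used in this lesson, while newer versions may report a dedicated string type instead.

```python
import pandas as pd

# Made-up example data, one column per kind of variable
toy = pd.DataFrame({
    'color': ['red', 'green', 'blue'],   # nominal
    'month': ['Jan', 'Mar', 'Feb'],      # ordinal, stored as text for now
    'is_island': [True, False, True],    # binary
    'n_children': [2, 0, 3],             # discrete
    'weight': [61.5, 72.0, 80.25],       # continuous
})
print(toy.dtypes)

# Text labels have no built-in order; an ordered categorical makes the order explicit
toy['month'] = pd.Categorical(toy['month'],
                              categories=['Jan', 'Feb', 'Mar'],
                              ordered=True)
sorted_months = toy['month'].sort_values().tolist()
print(sorted_months)
```

Sorting the ordered categorical follows the calendar order given in `categories` rather than alphabetical order.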

After reading in the data into a dataframe,
`head()` and `info()` are two of the most useful methods
to get an idea of the structure of this dataframe.
There are a few additional methods
that can facilitate the understanding of what a dataframe contains:

- Content:
    - `world_data.head(n)` - shows the first `n` rows
    - `world_data.tail(n)` - shows the last `n` rows

- Summary:
    - `world_data.info()` - column names and data types, number of non-null observations, and memory consumption
    - `world_data.describe()` - summary statistics for each column

The suffixed parentheses indicate that the method is being called,
which means that there is a computation carried out
when we execute the code.
Parameters can be put inside these parentheses
to change the behavior of the method.
For example,
`head(10)` tells the `head()` method to show the first ten rows of the dataframe,
instead of the default first five.

In addition to methods that compute values on demand,
dataframes can also have pre-calculated values stored with the same dot-syntax.
Values stored like this are often frequently accessed
and it saves time to store the value directly instead of recomputing it every time it is needed.
For example,
every time `pandas` creates a dataframe,
the number of rows and columns is computed and stored in the `shape` attribute.
Some useful pre-computed values are shown below.

- Names:
    - `world_data.columns` - the names of the columns
    - `world_data.index` - the names of the rows (referred to as the index in pandas)

- Size:
    - `world_data.shape` - the number of rows and columns stored as a tuple
    - `world_data.shape[0]` - the number of rows
    - `world_data.shape[1]` - the number of columns

In `shape[0]`,
the `[0]` part accesses the first element of the tuple via indexing
and it is not the same as passing a number to `head()`,
which changes how a calculation happens.
Generally,
anything accessible via the dot-syntax,
is an *attribute* of the dataframe (including methods).
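
A minimal sketch of the distinction between stored values and methods, using a made-up `toy` dataframe rather than the Gapminder data:

```python
import pandas as pd

# A small made-up dataframe, used only for illustration
toy = pd.DataFrame({'country': ['A', 'B', 'C'],
                    'year': [1800, 1801, 1802]})

# Stored values are accessed without parentheses
n_rows, n_cols = toy.shape          # the tuple (3, 2) unpacked into two variables
column_names = list(toy.columns)
row_labels = list(toy.index)

# Methods are called with parentheses and can take parameters
first_row = toy.head(1)
print(n_rows, n_cols, column_names, row_labels)
```

Writing `toy.shape()` with parentheses would raise an error, since a tuple cannot be called like a method.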

>#### Challenge
>
>Based on the output of `world_data.info()`, can you answer the following questions?
>
>* What is the class of the object `world_data`?
>* How many rows and how many columns are in this object?
>* Why is there not the same number of rows (observations) for each column?

### Saving dataframes locally

When using data from an online source,
it is good practice to keep a copy stored locally on your computer
in case you want to do offline analyses,
the online version of the file changes,
or the file is taken down.
To save a local copy,
the data could be downloaded manually
or the current `world_data` dataframe could be saved to disk as a CSV-file with `to_csv()`.

In [None]:
world_data.to_csv('world-data.csv', index=False)
# `index=False` because the index (the numbered row names)
# was generated automatically when pandas loaded the file
# and does not need to be saved

Since the data is now saved locally,
the next time this notebook is opened,
it could be loaded from the local path instead of downloading it from the URL.

In [10]:
world_data = pd.read_csv('world-data.csv')
world_data.head()

Unnamed: 0,country,year,population,region,sub_region,income_group,life_expectancy,income,children_per_woman,child_mortality,pop_density,co2_per_capita,years_in_school_men,years_in_school_women
0,Afghanistan,1800,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,
1,Afghanistan,1801,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,
2,Afghanistan,1802,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,
3,Afghanistan,1803,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,
4,Afghanistan,1804,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,
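
To see what `index=False` changes, the following sketch writes a small made-up dataframe to CSV text both ways and reads the text back. It relies on `to_csv()` returning a string when no file path is given, and uses `io.StringIO` in place of a file on disk; the `toy` data is hypothetical.

```python
import io
import pandas as pd

# A small made-up dataframe, used only for illustration
toy = pd.DataFrame({'country': ['A', 'B'], 'year': [1800, 1801]})

# Without a file path, `to_csv()` returns the CSV text as a string
with_index = toy.to_csv()
without_index = toy.to_csv(index=False)

# Read the CSV text back in, as if it were a file
reloaded_with = pd.read_csv(io.StringIO(with_index))
reloaded_without = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_with.columns))     # the saved index comes back as an extra column
print(list(reloaded_without.columns))
```

When the index is saved, it has no column name in the file, so pandas labels it `Unnamed: 0` on reading and then adds a fresh index on top of it.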


### Indexing and subsetting dataframes

The world data dataframe has rows and columns,
which means it has two dimensions.
We can "subset" the dataframe 
and extract data only from a single column
by using its name inside brackets.
pandas recognizes the column names in the dataframe,
so tab autocompletion can be used when typing out the column name.

In [11]:
world_data['year'].head()

0    1800
1    1801
2    1802
3    1803
4    1804
Name: year, dtype: int64

The name of the column and its data type are shown at the bottom.
Remember that the numbers on the left are the index of the dataframe,
which was added by `pandas` upon importing the data.
You could also select a column with the dot-syntax `world_data.year`,
but using brackets is clearer so this tutorial will stick to that.
To select multiple columns,
the column names can be passed as a list inside the brackets
(so there will be double brackets,
one for the dataframe indexing and one for the list).

In [13]:
world_data[['country', 'year']].head()

Unnamed: 0,country,year
0,Afghanistan,1800
1,Afghanistan,1801
2,Afghanistan,1802
3,Afghanistan,1803
4,Afghanistan,1804


The output is displayed a bit differently this time.
The reason is that when there was only one column `pandas` technically returned a `Series`,
not a `DataFrame`.
This can be confirmed by using `type` as previously.

In [14]:
type(world_data['year'])

pandas.core.series.Series

In [15]:
type(world_data[['country', 'year']])

pandas.core.frame.DataFrame

Every column in a dataframe is a `Series`
and pandas glues them together to form a `DataFrame`.
There can be performance benefits to working with `Series`,
but pandas often takes care of conversions between these two object types under the hood,
so this introductory tutorial will not make any further distinction between a `Series` and a `DataFrame`.
Many of the analysis techniques used here will apply to both series and dataframes.

Selecting with single brackets (`[]`) is a shortcut for common operations,
such as selecting columns by labels as above.
For more flexible and robust row and column selection,
the more verbose `loc[<rows>, <columns>]` syntax can be used
(`.loc` stands for "location").

In [None]:
world_data.loc[[0, 2, 4], ['country', 'year']]
# Although methods usually have trailing parentheses,
# square brackets are used with `loc[]` to stay
# consistent with the indexing with square brackets in general in Python
# (e.g. lists and Numpy arrays)

A single number can be selected,
which returns that value (an integer in this case),
rather than a `DataFrame` or `Series` with one value.

In [17]:
world_data.loc[4, 'year']

1804

In [18]:
type(world_data.loc[4, 'year'])

numpy.int64

To select all rows,
but only a subset of columns,
the colon character (`:`) can be used.

In [19]:
world_data.loc[:, ['country', 'year']].head() # head() is used to limit the length of the output

Unnamed: 0,country,year
0,Afghanistan,1800
1,Afghanistan,1801
2,Afghanistan,1802
3,Afghanistan,1803
4,Afghanistan,1804


The same syntax can be used to select all columns,
but only a subset of rows.

In [20]:
world_data.loc[[3, 4], :]

Unnamed: 0,country,year,population,region,sub_region,income_group,life_expectancy,income,children_per_woman,child_mortality,pop_density,co2_per_capita,years_in_school_men,years_in_school_women
3,Afghanistan,1803,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,
4,Afghanistan,1804,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,


When selecting all columns,
the `:` could be left out as a convenience.

In [21]:
world_data.loc[[3, 4]]

Unnamed: 0,country,year,population,region,sub_region,income_group,life_expectancy,income,children_per_woman,child_mortality,pop_density,co2_per_capita,years_in_school_men,years_in_school_women
3,Afghanistan,1803,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,
4,Afghanistan,1804,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,


It is also possible to select slices of rows and column labels.

In [22]:
world_data.loc[2:4, 'country':'region']

Unnamed: 0,country,year,population,region
2,Afghanistan,1802,3280000,Asia
3,Afghanistan,1803,3280000,Asia
4,Afghanistan,1804,3280000,Asia


It is important to realize that `loc[]` selects rows and columns by their *labels*.
To instead select by row or column *position*,
use `iloc[]` (integer location).

In [23]:
world_data.iloc[[2, 3, 4], [0, 1, 2]]

Unnamed: 0,country,year,population
2,Afghanistan,1802,3280000
3,Afghanistan,1803,3280000
4,Afghanistan,1804,3280000


The index of `world_data` consists of consecutive integers,
so in this case selecting from the index by labels or position will return the same rows.
As will be shown later,
an index could also consist of text names,
just like the columns.

While selecting slices by label is inclusive of both the start and end,
selecting slices by position is inclusive of the start but exclusive of the end position,
just like when slicing in lists.

In [24]:
world_data.iloc[2:5, :4] # `iloc[2:5]` gives the same result as `loc[2:4]` above

Unnamed: 0,country,year,population,region
2,Afghanistan,1802,3280000,Asia
3,Afghanistan,1803,3280000,Asia
4,Afghanistan,1804,3280000,Asia
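
The difference between labels and positions is easier to see when the index is not made of consecutive integers. The sketch below uses a made-up dataframe and `set_index()` to turn the hypothetical `country` column into text row labels.

```python
import pandas as pd

# A small made-up dataframe, used only for illustration
toy = pd.DataFrame({'country': ['A', 'B', 'C', 'D'],
                    'year': [1800, 1801, 1802, 1803]})
toy = toy.set_index('country')   # the row labels are now 'A', 'B', 'C', 'D'

# Label slices include both the start and the end label
by_label = toy.loc['B':'C']

# Position slices include the start but exclude the end position
by_position = toy.iloc[1:3]

print(by_label)
print(by_position)
```

Both selections return the rows labelled `B` and `C`, although the end of the slice is written differently in each case.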


Selecting slices of row positions is a common operation,
and has thus been given a shortcut syntax with single brackets.

In [25]:
world_data[2:5]

Unnamed: 0,country,year,population,region,sub_region,income_group,life_expectancy,income,children_per_woman,child_mortality,pop_density,co2_per_capita,years_in_school_men,years_in_school_women
2,Afghanistan,1802,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,
3,Afghanistan,1803,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,
4,Afghanistan,1804,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,


>#### Challenge
>
>1. Extract the 200th and 201st row of the `world_data` dataset
>   and assign the resulting dataframe to a new variable name (`world_data_200_201`).
>   Remember that Python indexing starts at 0!
>
>2. How can you get the same result as from `world_data.head()`
>   by using row slices instead of the `head()` method?
>
>3. There are at least three distinct ways to extract the last row of the dataframe.
>   Which can you find?

### Filtering observations

The `describe()` method was mentioned above
as a way of retrieving summary statistics of a dataframe.
Together with `info()` and `head()`,
this is often a good place to start exploratory data analysis
as it gives a helpful overview of the numeric variables in the data set.

In [26]:
world_data.describe()

Unnamed: 0,year,population,life_expectancy,income,children_per_woman,child_mortality,pop_density,co2_per_capita,years_in_school_men,years_in_school_women
count,38982.0,38982.0,38982.0,38982.0,38982.0,38980.0,12282.0,16285.0,8188.0,8188.0
mean,1909.0,14220750.0,43.073468,4527.128033,5.384391,292.050891,120.900572,3.236894,7.681019,6.948334
std,63.220006,67224230.0,16.219216,9753.116041,1.642597,161.56229,382.454242,6.079257,3.185983,3.876399
min,1800.0,12500.0,1.0,247.0,1.12,1.95,0.502,0.0,0.9,0.21
25%,1854.0,506000.0,31.2,876.0,4.55,141.0,14.8,0.188,5.16,3.62
50%,1909.0,2140000.0,35.5,1450.0,5.91,361.0,46.0,0.944,7.65,6.98
75%,1964.0,6870000.0,55.6,3520.0,6.63,420.0,110.0,4.02,10.1,9.98
max,2018.0,1420000000.0,84.2,178000.0,8.87,756.0,8270.0,101.0,15.3,15.7


A common next step would be to plot the data to explore relationships between different variables,
but before getting into plotting in the next lecture,
we will elaborate on the dataframe object and several of its common operations.

An often desired operation is to select a subset of rows matching a criterion,
e.g. which observations have a life expectancy above 83 years.
To do this,
the "greater than" comparison operator that was introduced previously can be used
to filter the relevant rows.

In [27]:
world_data['life_expectancy'] > 83

0        False
1        False
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
38952    False
38953    False
38954    False
38955    False
38956    False
38957    False
38958    False
38959    False
38960    False
38961    False
38962    False
38963    False
38964    False
38965    False
38966    False
38967    False
38968    False
38969    False
38970    False
38971    False
38972    False
38973    False
38974    False
38975    False
38976    False
38977    False
38978    False
38979    False
38980    False
38981    False
Name: life_expectancy, Length: 38982, dtype: bool

The result is a boolean series with one value for every row in the dataframe
indicating whether it is `True` or `False`
that this row has a value above 83 in the column `life_expectancy`.
To find out how many observations there are matching this condition,
the `sum()` method can be used
since each `True` will be `1` and each `False` will be `0`.

In [28]:
above_83_bool = world_data['life_expectancy'] > 83
above_83_bool.sum()

20

Instead of assigning to the intermediate variable `above_83_bool`,
we can use methods directly on the resulting boolean series
by surrounding it with parentheses.

In [29]:
(world_data['life_expectancy'] > 83).sum()

20
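
The counting trick can be checked by hand on a few made-up values. The numbers below are hypothetical and only chosen so that the result is easy to verify. Since `True` counts as `1`, `mean()` on the same boolean series gives the fraction of matching rows.

```python
import pandas as pd

# Made-up life expectancy values, used only for illustration
life = pd.Series([79.5, 83.2, 84.0, 61.3], name='life_expectancy')

above_83 = life > 83             # False, True, True, False
n_above = above_83.sum()         # number of True values
share_above = above_83.mean()    # fraction of True values
print(n_above, share_above)
```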

The boolean array can be used to select only those rows from the dataframe
that meet the specified condition.

In [30]:
world_data[world_data['life_expectancy'] > 83]

Unnamed: 0,country,year,population,region,sub_region,income_group,life_expectancy,income,children_per_woman,child_mortality,pop_density,co2_per_capita,years_in_school_men,years_in_school_women
17513,Japan,2012,128000000,Asia,Eastern Asia,High,83.2,36400,1.4,3.0,352.0,9.58,14.8,15.2
17514,Japan,2013,128000000,Asia,Eastern Asia,High,83.4,37100,1.42,2.9,352.0,9.71,14.9,15.3
17515,Japan,2014,128000000,Asia,Eastern Asia,High,83.6,37300,1.43,2.8,352.0,9.47,15.0,15.4
17516,Japan,2015,128000000,Asia,Eastern Asia,High,83.8,37800,1.44,3.0,351.0,,15.1,15.5
17517,Japan,2016,128000000,Asia,Eastern Asia,High,83.9,38200,1.46,2.7,350.0,,,
17518,Japan,2017,127000000,Asia,Eastern Asia,High,84.0,38600,1.47,2.83,350.0,,,
17519,Japan,2018,127000000,Asia,Eastern Asia,High,84.2,39100,1.48,2.76,349.0,,,
30653,Singapore,2012,5270000,Asia,South-eastern Asia,High,83.2,76000,1.26,2.8,7530.0,6.9,13.6,13.3
30654,Singapore,2013,5360000,Asia,South-eastern Asia,High,83.2,78500,1.25,2.7,7660.0,10.4,13.7,13.5
30655,Singapore,2014,5450000,Asia,South-eastern Asia,High,83.4,80300,1.25,2.7,7780.0,10.3,13.8,13.7


As before,
this can be combined with selection of a particular set of columns.

In [31]:
world_data.loc[world_data['life_expectancy'] > 83, ['country', 'year', 'life_expectancy']]

Unnamed: 0,country,year,life_expectancy
17513,Japan,2012,83.2
17514,Japan,2013,83.4
17515,Japan,2014,83.6
17516,Japan,2015,83.8
17517,Japan,2016,83.9
17518,Japan,2017,84.0
17519,Japan,2018,84.2
30653,Singapore,2012,83.2
30654,Singapore,2013,83.2
30655,Singapore,2014,83.4


A single expression can be used to filter for several criteria,
either matching *all* criteria with the `&` operator,
or *any* criteria with the `|`.
These special operators are used instead of `and` and `or`
to make sure that the comparison occurs for each row in the dataframe.
Parentheses are added to indicate the priority of the comparisons.

In [33]:
world_data.loc[(world_data['sub_region'] == 'Northern Europe') & (world_data['year'] == 1879), ['sub_region', 'country', 'year']]

Unnamed: 0,sub_region,country,year
9496,Northern Europe,Denmark,1879
11248,Northern Europe,Estonia,1879
11905,Northern Europe,Finland,1879
15409,Northern Europe,Iceland,1879
16504,Northern Europe,Ireland,1879
19132,Northern Europe,Latvia,1879
20227,Northern Europe,Lithuania,1879
25921,Northern Europe,Norway,1879
33367,Northern Europe,Sweden,1879
36871,Northern Europe,United Kingdom,1879


To increase readability,
long statements can be put on multiple lines.
Anything that is within parentheses or brackets can be continued on the next line.
When inside a bracket or parenthesis,
the indentation is not significant to the Python interpreter,
but it is recommended to align code in meaningful ways,
to make it more readable.

In [None]:
world_data.loc[(world_data['sub_region'] == 'Northern Europe') &
               (world_data['year'] == 1879),
               ['sub_region', 'country', 'year']]

Above,
we assumed that `'Northern Europe'` was a value within the `sub_region` column.
When we don't know which values exist in a column,
the `unique()` method can reveal them.

In [34]:
world_data['sub_region'].unique()

array(['Southern Asia', 'Southern Europe', 'Northern Africa',
       'Sub-Saharan Africa', 'Latin America and the Caribbean',
       'Western Asia', 'Australia and New Zealand', 'Western Europe',
       'Eastern Europe', 'South-eastern Asia', 'Northern America',
       'Eastern Asia', 'Northern Europe', 'Melanesia', 'Central Asia',
       'Micronesia', 'Polynesia'], dtype=object)

With the `|` operator, rows matching either of the supplied criteria are returned.

In [None]:
world_data.loc[(world_data['year'] == 1800) |
               (world_data['year'] == 1801) ,
               ['country', 'year']].head()
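
The sketch below contrasts `&` and `|` on a made-up dataframe and shows what happens when `and` is used instead. The parentheses around each comparison matter because `&` and `|` bind more tightly than `==` in Python.

```python
import pandas as pd

# A small made-up dataframe, used only for illustration
toy = pd.DataFrame({'country': ['A', 'A', 'B', 'B'],
                    'year': [1800, 1801, 1800, 1801]})

# Rows matching both conditions
both = toy.loc[(toy['country'] == 'A') & (toy['year'] == 1801)]

# Rows matching at least one of the conditions
either = toy.loc[(toy['country'] == 'A') | (toy['year'] == 1801)]

# `and` tries to reduce each whole series to a single True or False,
# which pandas refuses to do
try:
    toy.loc[(toy['country'] == 'A') and (toy['year'] == 1801)]
    and_failed = False
except ValueError as error:
    and_failed = True
    print(error)

print(len(both), len(either))
```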

Additional useful ways of subsetting the data include `between()`,
which checks if a numerical value is within a given range,
and `isin()`,
which checks if a value is contained in a given list.

In [None]:
# `unique` is used to show that only the relevant items are returned
world_data.loc[world_data['year'].between(2000, 2015), 'year'].unique()

In [37]:
world_data.loc[world_data['region'].isin(['Africa', 'Asia', 'Americas']), 'region'].unique()

array(['Asia', 'Africa', 'Americas'], dtype=object)
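
A small sketch of both methods on made-up data, which also shows that `between()` includes both endpoints by default and that a condition can be inverted with the `~` operator:

```python
import pandas as pd

# A small made-up dataframe, used only for illustration
toy = pd.DataFrame({'region': ['Asia', 'Europe', 'Africa', 'Oceania'],
                    'year': [1999, 2000, 2015, 2016]})

# Both 2000 and 2015 are kept, 1999 and 2016 are dropped
in_range = toy.loc[toy['year'].between(2000, 2015), 'year'].tolist()

# Keep only the rows whose region is in the list
selected = toy.loc[toy['region'].isin(['Asia', 'Africa']), 'region'].tolist()

# `~` flips True and False, keeping the rows that are NOT in the list
not_selected = toy.loc[~toy['region'].isin(['Asia', 'Africa']), 'region'].tolist()

print(in_range, selected, not_selected)
```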

### Creating new columns

A frequent operation when working with data
is to create new columns based on the values in existing columns.
For example,
to find the total income in a country,
we could multiply the income per person with the population:

In [38]:
world_data['population_income'] = world_data['income'] * world_data['population']
world_data[['population', 'income', 'population_income']].head()

Unnamed: 0,population,income,population_income
0,3280000,603,1977840000
1,3280000,603,1977840000
2,3280000,603,1977840000
3,3280000,603,1977840000
4,3280000,603,1977840000
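
The multiplication is carried out row by row, which can be verified on a made-up dataframe with round numbers. The sketch also shows `assign()`, an alternative that is not used elsewhere in this lesson: it returns a new dataframe with the extra column and leaves the original unchanged. The column name `population_thousands` is hypothetical.

```python
import pandas as pd

# A small made-up dataframe, used only for illustration
toy = pd.DataFrame({'population': [1000, 2000], 'income': [5, 7]})

# Bracket assignment adds the column to `toy` itself
toy['population_income'] = toy['income'] * toy['population']

# `assign()` returns a copy with the new column; `toy` is not modified
with_thousands = toy.assign(population_thousands=toy['population'] / 1000)

print(toy)
print(with_thousands)
```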


>#### Challenge
>
>1. Subset `world_data` to include observations from 1995 to 2001.
>   Check that the dimensions of the resulting dataframe are 1253 x 15.
>
>2. Subset the data to include only observation from year 2000 and onwards,
>   from all regions except 'Asia',
>   and retain only the columns `country`, `year`, and `sub_region`.
>   The dimensions of the resulting dataframe should be 2489 x 3.

In [39]:
# Challenge solutions

# 1.
world_data.loc[world_data['year'].between(1995, 2001)].shape

# 2.
world_data.loc[(world_data['year'] >= 2000) &
               (world_data['region'] != 'Asia'),
               ['country', 'year', 'sub_region']].shape

(2489, 3)

## Split-apply-combine techniques in pandas

Many data analysis tasks can be approached using the *split-apply-combine* paradigm:
split the data into groups,
apply some operation on each group,
and combine the results into a single table.

![Image credit Jake VanderPlas](img/split-apply-combine.png)

*Image credit Jake VanderPlas*

pandas facilitates this workflow through the use of `groupby()` to split data,
and summary/aggregation functions such as `mean()`,
which collapses each group into a single-row summary of that group.
When the mean is computed,
the default behavior is to ignore NA values.
The arguments to `groupby()` are column names that reference *categorical* variables
by which the summary statistics should be calculated.

In [55]:
world_data.groupby('region')['population'].sum()

region
Africa       59192998600
Americas     63837885500
Asia        330133218800
Europe       98766930400
Oceania       2422277600
Name: population, dtype: int64

The output is a series that is indexed with the grouped variable (the region)
and the result of the aggregation (the total population) as the values.
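
To make the three steps concrete, the sketch below carries out split, apply, and combine by hand on a made-up dataframe and compares the result with `groupby()`. The values are hypothetical, and one of them is missing on purpose to show that `mean()` skips NA values by default.

```python
import pandas as pd

# A small made-up dataframe with one missing value, used only for illustration
toy = pd.DataFrame({'region': ['Asia', 'Asia', 'Europe', 'Europe'],
                    'pop_density': [100.0, None, 50.0, 70.0]})

# By hand: split the rows by region, apply mean() to each group,
# and combine the results into one series
manual = {}
for region in toy['region'].unique():
    group = toy.loc[toy['region'] == region, 'pop_density']
    manual[region] = group.mean()   # the missing value is ignored
manual = pd.Series(manual)

# The same computation with groupby()
grouped = toy.groupby('region')['pop_density'].mean()

print(manual)
print(grouped)
```

The mean for Asia is computed from the single available value, not from two values with the missing one counted as zero.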

These population numbers are abnormally high
because the summary was made for all the years in the dataframe,
instead of for a single year.
To view only the data from the most recent year,
we can use what we learnt previously to filter the dataframe for observations in 2018 only.
Compare these results to the picture in the world ignorance survey
that placed 4 billion people in Asia and 1 billion in each of the other regions.

In [56]:
world_data_2018 = world_data.loc[world_data['year'] == 2018]
world_data_2018.groupby('region')['population'].sum()

region
Africa      1286388200
Americas    1010688000
Asia        4514211000
Europe       742109000
Oceania       40212000
Name: population, dtype: int64

Individual regions can be selected from the resulting series using `loc[]`.

In [57]:
region_pop_2018 = world_data_2018.groupby('region')['population'].sum()
region_pop_2018.loc[['Asia', 'Europe']]

region
Asia      4514211000
Europe     742109000
Name: population, dtype: int64

As a shortcut,
`loc[]` can be omitted when indexing a series.
This is similar to selecting columns from a dataframe with just `[]`.

In [58]:
region_pop_2018[['Asia', 'Europe']]

region
Asia      4514211000
Europe     742109000
Name: population, dtype: int64

This indexing can be used to normalize the population numbers to the region of interest.

In [59]:
region_pop_2018 = world_data_2018.groupby('region')['population'].sum()
region_pop_2018 / region_pop_2018['Europe']

region
Africa      1.733422
Americas    1.361913
Asia        6.082949
Europe      1.000000
Oceania     0.054186
Name: population, dtype: float64

There are six times as many people living in Asia as in Europe.

Groups can also be created from multiple columns,
e.g. it could be interesting to compare how densely populated countries are on average
in different income brackets around the world.

In [60]:
world_data_2018.groupby(['region', 'income_group'])['pop_density'].mean()

region    income_group
Africa    High             207.000000
          Low              118.640741
          Lower middle      69.331250
          Upper middle      94.457500
Americas  High             136.426000
          Low              403.000000
          Lower middle     113.950000
          Upper middle      92.931875
Asia      High            1121.654545
          Low              115.866667
          Lower middle     262.606471
          Upper middle     235.447692
Europe    High             176.563214
          Lower middle      99.500000
          Upper middle      67.832222
Oceania   High              10.610000
          Lower middle      52.500000
          Upper middle      90.266667
Name: pop_density, dtype: float64

Note that `income_group` is an ordinal variable,
i.e. a categorical variable with an inherent order to it.
pandas has not listed the values of that variable in the order we would expect
(low, lower-middle, upper-middle, high).
The order of a categorical variable can be specified in the dataframe,
using the top level pandas function `Categorical()`.

In [61]:
# Reassign in the main dataframe since we will use more than just the 2018 data later
world_data['income_group'] = (
    pd.Categorical(world_data['income_group'], ordered=True,
                   categories=['Low', 'Lower middle', 'Upper middle', 'High'])
)

# Need to recreate the 2018 dataframe since the categorical was changed in the main frame
world_data_2018 = world_data.loc[world_data['year'] == 2018]
world_data_2018['income_group'].dtype

CategoricalDtype(categories=['Low', 'Lower middle', 'Upper middle', 'High'], ordered=True)
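
To see what the `ordered=True` flag changes,
here is a small self-contained sketch using a made-up series (the values are illustrative only).
With an ordered categorical,
sorting follows the declared category order instead of alphabetical order,
and comparisons against a category become possible.

```python
import pandas as pd

# Hypothetical income groups for five made-up countries
groups = pd.Series(['High', 'Low', 'Upper middle', 'Lower middle', 'Low'])

ordered_groups = pd.Series(
    pd.Categorical(groups, ordered=True,
                   categories=['Low', 'Lower middle', 'Upper middle', 'High'])
)

# Plain strings sort alphabetically, the ordered categorical sorts by rank
alphabetical = groups.sort_values().tolist()
by_rank = ordered_groups.sort_values().tolist()
print(alphabetical)
print(by_rank)

# Comparisons use the category order: which entries are at least 'Upper middle'?
at_least_upper_middle = ordered_groups >= 'Upper middle'
print(at_least_upper_middle.tolist())
```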

In [62]:
world_data_2018.groupby(['region', 'income_group'])['pop_density'].mean()

region    income_group
Africa    Low              118.640741
          Lower middle      69.331250
          Upper middle      94.457500
          High             207.000000
Americas  Low              403.000000
          Lower middle     113.950000
          Upper middle      92.931875
          High             136.426000
Asia      Low              115.866667
          Lower middle     262.606471
          Upper middle     235.447692
          High            1121.654545
Europe    Lower middle      99.500000
          Upper middle      67.832222
          High             176.563214
Oceania   Lower middle      52.500000
          Upper middle      90.266667
          High              10.610000
Name: pop_density, dtype: float64

Now the values appear in the order we would expect.
The value for Asia in the high income bracket looks suspiciously high.
It would be interesting to see which countries were averaged to that value.

In [63]:
# Build the boolean masks from the same dataframe that is being filtered
world_data_2018.loc[(world_data_2018['region'] == 'Asia') &
                    (world_data_2018['income_group'] == 'High'),
                    ['country', 'pop_density']]

            country  pop_density
2627        Bahrain       2060.0
9197         Cyprus        129.0
16862        Israel        391.0
17519         Japan        349.0
18614        Kuwait        236.0
26279          Oman         15.6
28469         Qatar        232.0
29564  Saudi Arabia         15.6
30659     Singapore       8270.0
31973   South Korea        526.0


Extreme values,
such as the city-state Singapore,
can heavily skew averages,
so it could be a good idea to use a more robust statistic such as the median instead.

In [64]:
world_data_2018.groupby(['region', 'income_group'])['pop_density'].median()

region    income_group
Africa    Low              66.70
          Lower middle     74.75
          Upper middle     12.81
          High            207.00
Americas  Low             403.00
          Lower middle     68.20
          Upper middle     55.95
          High             37.80
Asia      Low              82.35
          Lower middle     92.00
          Upper middle    106.00
          High            236.00
Europe    Lower middle     99.50
          Upper middle     68.70
          High            109.50
Oceania   Lower middle     22.70
          Upper middle     69.90
          High             10.61
Name: pop_density, dtype: float64

 <!--TODO remove? -->
The returned series has an index that is a combination of the columns `region` and `income_group`,
and is referred to as a `MultiIndex`.
The same syntax as previously can be used to select rows on the region level.

In [65]:
med_density_2018 = world_data_2018.groupby(['region', 'income_group'])['pop_density'].median()
med_density_2018[['Africa', 'Americas']]

region    income_group
Africa    Low              66.70
          Lower middle     74.75
          Upper middle     12.81
          High            207.00
Americas  Low             403.00
          Lower middle     68.20
          Upper middle     55.95
          High             37.80
Name: pop_density, dtype: float64

To select specific values from both levels of the `MultiIndex`,
a list of tuples can be passed to `loc[]`.

In [66]:
med_density_2018.loc[[('Africa', 'High'), ('Americas', 'High')]]

region    income_group
Africa    High            207.0
Americas  High             37.8
Name: pop_density, dtype: float64

To select only the low income values from all regions,
the `xs()` (cross section) method can be used.

In [67]:
med_density_2018.xs('Low', level='income_group')

region
Africa       66.70
Americas    403.00
Asia         82.35
Name: pop_density, dtype: float64

The names and values of the index levels can be seen by inspecting the index object.

In [68]:
med_density_2018.index

MultiIndex(levels=[['Africa', 'Americas', 'Asia', 'Europe', 'Oceania'], ['Low', 'Lower middle', 'Upper middle', 'High']],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4], [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 1, 2, 3, 1, 2, 3]],
           names=['region', 'income_group'])

Although `MultiIndexes` offer succinct and fast ways to access data,
they also require memorization of additional syntax
and are strictly speaking not essential unless speed is of particular concern.
It can therefore be easier to reset the index,
so that all values are stored in columns.

In [69]:
med_density_2018_res = med_density_2018.reset_index()
med_density_2018_res

     region  income_group  pop_density
0    Africa           Low        66.70
1    Africa  Lower middle        74.75
2    Africa  Upper middle        12.81
3    Africa          High       207.00
4  Americas           Low       403.00
5  Americas  Lower middle        68.20
6  Americas  Upper middle        55.95
7  Americas          High        37.80
8      Asia           Low        82.35
9      Asia  Lower middle        92.00


After resetting the index,
the same comparison syntax introduced earlier can be used instead of `xs()` or passing lists of tuples to `loc[]`.

In [70]:
med_density_2018_low = med_density_2018_res.loc[med_density_2018_res['income_group'] == 'Low']
med_density_2018_low

     region  income_group  pop_density
0    Africa           Low        66.70
4  Americas           Low       403.00
8      Asia           Low        82.35


`reset_index()` grants the freedom of not having to work with indexes,
but it is still worth keeping in mind that selecting on an index level with `xs()`
can be orders of magnitude faster than using boolean comparisons on large dataframes.

 <!--TODO remove? -->
The opposite operation, creating an index from existing columns,
can be performed with `set_index()` on any column (or combination of columns),
ideally one that creates an index with unique values.

In [71]:
med_density_2018_low.set_index(['region', 'income_group'])

                       pop_density
region   income_group
Africa   Low                 66.70
Americas Low                403.00
Asia     Low                 82.35
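
The round trip between a `MultiIndex` and regular columns can be tried on a small made-up series
(hypothetical numbers, not the Gapminder values).
This sketch only checks that `xs()` and a boolean comparison after `reset_index()` pick out the same values,
and that `set_index()` restores the original index;
it does not measure speed.

```python
import pandas as pd

# Hypothetical medians indexed by two levels
toy_index = pd.MultiIndex.from_tuples(
    [('A', 'Low'), ('A', 'High'), ('B', 'Low'), ('B', 'High')],
    names=['region', 'income_group'],
)
toy_median = pd.Series([1.0, 2.0, 3.0, 4.0], index=toy_index, name='pop_density')

# Selection on an index level with xs()
low_xs = toy_median.xs('Low', level='income_group')

# The same selection with a boolean comparison on a regular column
toy_flat = toy_median.reset_index()
low_bool = toy_flat.loc[toy_flat['income_group'] == 'Low']

# set_index() moves the columns back into a MultiIndex
restored = toy_flat.set_index(['region', 'income_group'])['pop_density']

print(low_xs)
print(low_bool)
print(restored)
```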


> Challenge
>
> 1. Which is the highest population density in each region?
>
> 2. The low income group for the Americas had the same population density for both the mean and the median.
>    This could mean that there are few observations in this group.
>    List all the low income countries in the Americas.

In [72]:
# Challenge solutions

# 1.
world_data_2018.groupby('region')['pop_density'].max()

region
Africa       625.0
Americas     666.0
Asia        8270.0
Europe      1350.0
Oceania      151.0
Name: pop_density, dtype: float64

In [73]:
# 2.
# Build the boolean masks from the same dataframe that is being filtered
world_data_2018.loc[(world_data_2018['region'] == 'Americas') &
                    (world_data_2018['income_group'] == 'Low'),
                    ['country', 'pop_density']]

       country  pop_density
14891    Haiti        403.0


### Multiple aggregations on grouped data

Since the same grouped dataframe will be used in multiple code chunks below,
we can assign it to a new variable
instead of typing out the grouping expression each time.

In [74]:
grouped_world_data = world_data_2018.groupby(['region', 'sub_region'])
grouped_world_data['life_expectancy'].mean()

region    sub_region                     
Africa    Northern Africa                    74.716667
          Sub-Saharan Africa                 63.682609
Americas  Latin America and the Caribbean    75.600000
          Northern America                   80.650000
Asia      Central Asia                       71.340000
          Eastern Asia                       76.440000
          South-eastern Asia                 73.630000
          Southern Asia                      72.211111
          Western Asia                       76.122222
Europe    Eastern Europe                     75.110000
          Northern Europe                    80.140000
          Southern Europe                    79.466667
          Western Europe                     82.100000
Oceania   Australia and New Zealand          82.350000
          Melanesia                          63.700000
          Micronesia                         62.200000
          Polynesia                          71.550000
Name: life_expectancy, dtype: float64

Instead of using the `mean()` or `sum()` methods directly,
the more general `agg()` method could be called
to aggregate by any existing aggregation functions.
The equivalent to the `mean()` method would be to call `agg()` and specify `'mean'`.

In [75]:
grouped_world_data['life_expectancy'].agg('mean')

region    sub_region                     
Africa    Northern Africa                    74.716667
          Sub-Saharan Africa                 63.682609
Americas  Latin America and the Caribbean    75.600000
          Northern America                   80.650000
Asia      Central Asia                       71.340000
          Eastern Asia                       76.440000
          South-eastern Asia                 73.630000
          Southern Asia                      72.211111
          Western Asia                       76.122222
Europe    Eastern Europe                     75.110000
          Northern Europe                    80.140000
          Southern Europe                    79.466667
          Western Europe                     82.100000
Oceania   Australia and New Zealand          82.350000
          Melanesia                          63.700000
          Micronesia                         62.200000
          Polynesia                          71.550000
Name: life_expectancy, dtype: float64

This general approach is more flexible and powerful,
since multiple aggregation functions can be applied in the same line of code
by passing them as a list to `agg()`.
For instance,
the standard deviation and mean could be computed in the same call:

In [76]:
grouped_world_data['life_expectancy'].agg(['mean', 'std'])

                                               mean       std
region   sub_region
Africa   Northern Africa                  74.716667  3.510793
         Sub-Saharan Africa               63.682609  4.540108
Americas Latin America and the Caribbean  75.600000  3.721559
         Northern America                 80.650000  2.192031
Asia     Central Asia                     71.340000  0.808084
         Eastern Asia                     76.440000  6.566430
         South-eastern Asia               73.630000  4.835298
         Southern Asia                    72.211111  6.426983
         Western Asia                     76.122222  4.585214
Europe   Eastern Europe                   75.110000  2.711478


The returned output is in this case a dataframe,
and the row `MultiIndex` (`region` and `sub_region`) is indicated in bold font when displayed in the notebook.

By passing a dictionary to `.agg()`
it is possible to apply different aggregations to the different columns.
Long code statements can be broken down into multiple lines
if they are enclosed by parentheses, brackets, or braces,
something that will be described in detail later.

In [77]:
grouped_world_data[['population', 'income']].agg(
    {'population': 'sum',
     'income': ['min', 'median', 'max']
    }
)

                                          population  income
                                                 sum     min  median     max
region   sub_region
Africa   Northern Africa                   237270000    4440   11200   18300
         Sub-Saharan Africa               1049118200     629    1985   27500
Americas Latin America and the Caribbean   646688000    1710   13700   30300
         Northern America                  364000000   43800   49350   54900
Asia     Central Asia                       71890000    2920    6690   24200
         Eastern Asia                     1626920000    1390   16000   39100
         South-eastern Asia                655870000    1490    7255   83900
         Southern Asia                    1887261000    1870    6890   17400
         Western Asia                      272270000    2430   20750  121000
Europe   Eastern Europe                    291970000    5330   24100   32300
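
Because a dictionary with several functions per column produces two levels of column labels,
individual results are selected with a tuple of (column, function).
A small sketch on made-up numbers (not the Gapminder data):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'region': ['A', 'A', 'B', 'B'],
    'population': [10, 20, 30, 40],
    'income': [100, 300, 500, 900],
})

summary = toy.groupby('region').agg(
    {'population': 'sum',
     'income': ['min', 'max']}
)

# The columns are now pairs of (original column, aggregation)
print(summary.columns.tolist())

# A single aggregated column is selected with a tuple
max_income = summary[('income', 'max')]
print(max_income)
```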


There are plenty of aggregation methods available in pandas
(e.g. `sem`, `var`, `sum`),
most of which can be seen at [the end of this section] in the `pandas` documentation,
or explored using tab-complete on the grouped dataframe.

[the end of this section]: https://pandas.pydata.org/pandas-docs/stable/groupby.html#aggregation

In [None]:
# This is a side note if there are issues with tab completion
# Tab completion might only work like this:
# find_agg_methods = grouped_world_data['life_expectancy']
# find_agg_methods.<tab>

Even if a function is not part of the `pandas` library,
it can be passed to `agg()`.

In [79]:
import numpy as np

grouped_world_data['pop_density'].agg(np.mean)

region    sub_region                     
Africa    Northern Africa                     50.113333
          Sub-Saharan Africa                 108.143043
Americas  Latin America and the Caribbean    126.558966
          Northern America                    19.880000
Asia      Central Asia                        38.504000
          Eastern Asia                       248.202000
          South-eastern Asia                 961.110000
          Southern Asia                      460.388889
          Western Asia                       298.355556
Europe    Eastern Europe                      88.629000
          Northern Europe                     64.897000
          Southern Europe                    202.166667
          Western Europe                     256.000000
Oceania   Australia and New Zealand           10.610000
          Melanesia                           28.475000
          Micronesia                         146.000000
          Polynesia                          110.450000
Name: pop_density, dtype: float64

Any function can be passed like this,
including functions you create yourself.
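
As an illustration,
the sketch below defines a hypothetical helper, `value_range()`,
and passes it to `agg()` on a made-up dataframe.
Neither the helper nor the numbers come from the lesson's dataset.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'region': ['A', 'A', 'A', 'B', 'B'],
    'life_expectancy': [60.0, 70.0, 65.0, 80.0, 82.0],
})


def value_range(values):
    """Return the difference between the largest and smallest value."""
    return values.max() - values.min()


# A custom function can be used on its own or alongside built-in names
ranges = toy.groupby('region')['life_expectancy'].agg(value_range)
summary = toy.groupby('region')['life_expectancy'].agg(['mean', value_range])

print(ranges)
print(summary)
```

When a custom function is passed in a list,
the resulting column is named after the function.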

> #### Challenge
>
> 1. What's the mean life expectancy for each income group in 2018?
>
> 2. What are the min, median, and max life expectancies
>    for each income group within each region?

In [80]:
# Challenge solutions

# 1.
world_data_2018.groupby('income_group')['life_expectancy'].mean()

income_group
Low             63.744118
Lower middle    69.053488
Upper middle    74.283673
High            79.919231
Name: life_expectancy, dtype: float64

In [81]:
# 2.
world_data_2018.groupby(['region', 'income_group'])['life_expectancy'].agg(['min', 'median', 'max'])

                        min  median   max
region   income_group
Africa   Low           51.6   62.50  68.3
         Lower middle  51.1   66.35  78.0
         Upper middle  63.5   67.10  77.9
         High          74.2   74.20  74.2
Americas Low           64.5   64.50  64.5
         Lower middle  73.1   74.90  78.7
         Upper middle  68.2   75.80  81.4
         High          73.4   77.60  82.2
Asia     Low           58.7   70.45  72.2
         Lower middle  67.9   71.50  77.8


## Additional sections (time permitting)

### Using `size()` to summarize categorical data 

When working with data,
we commonly want to know the number of observations present for each level of a categorical variable.
For this,
pandas provides the `size()` method.
For example,
to find the number of observations per region
(in this case unique countries during year 2018):

In [82]:
world_data_2018.groupby('region').size()

region
Africa      52
Americas    31
Asia        47
Europe      39
Oceania      9
dtype: int64

`size()` can also be used when grouping on multiple variables.

In [83]:
world_data_2018.groupby(['region', 'income_group']).size()

region    income_group
Africa    Low             27
          Lower middle    16
          Upper middle     8
          High             1
Americas  Low              1
          Lower middle     4
          Upper middle    16
          High            10
Asia      Low              6
          Lower middle    17
          Upper middle    13
          High            11
Europe    Lower middle     2
          Upper middle     9
          High            28
Oceania   Lower middle     4
          Upper middle     3
          High             2
dtype: int64

If there are many groups,
`size()` is not that useful on its own.
For example,
it is difficult to quickly find the five sub-regions with the most countries.

In [84]:
world_data_2018.groupby('sub_region').size()

sub_region
Australia and New Zealand           2
Central Asia                        5
Eastern Asia                        5
Eastern Europe                     10
Latin America and the Caribbean    29
Melanesia                           4
Micronesia                          1
Northern Africa                     6
Northern America                    2
Northern Europe                    10
Polynesia                           2
South-eastern Asia                 10
Southern Asia                       9
Southern Europe                    12
Sub-Saharan Africa                 46
Western Asia                       18
Western Europe                      7
dtype: int64

Since there are many rows in this output,
it would be beneficial to sort the values so that the largest sub-regions are easy to find.
This is easy to do with the `sort_values()` method.

In [85]:
world_data_2018.groupby('sub_region').size().sort_values()

sub_region
Micronesia                          1
Australia and New Zealand           2
Polynesia                           2
Northern America                    2
Melanesia                           4
Eastern Asia                        5
Central Asia                        5
Northern Africa                     6
Western Europe                      7
Southern Asia                       9
Northern Europe                    10
South-eastern Asia                 10
Eastern Europe                     10
Southern Europe                    12
Western Asia                       18
Latin America and the Caribbean    29
Sub-Saharan Africa                 46
dtype: int64

That's better,
but it could be helpful to display the largest sub-regions on top.
In other words,
the output should be arranged in descending order.

In [86]:
world_data_2018.groupby('sub_region').size().sort_values(ascending=False).head(5)

sub_region
Sub-Saharan Africa                 46
Latin America and the Caribbean    29
Western Asia                       18
Southern Europe                    12
Eastern Europe                     10
dtype: int64

Looks good!

### Method chaining

By now,
the code statement has grown quite long because many methods have been *chained* together.
It can be tricky to keep track of what is going on in long method chains.
To make the code more readable,
it can be broken up into multiple lines by surrounding it with parentheses.

In [87]:
(world_data_2018
     .groupby('sub_region')
     .size()
     .sort_values(ascending=False)
     .head(5)
)

sub_region
Sub-Saharan Africa                 46
Latin America and the Caribbean    29
Western Asia                       18
Southern Europe                    12
Eastern Europe                     10
dtype: int64

This looks neater and makes long method chains easier to read.
There is no absolute rule for when to break code into multiple lines,
but always try to write code that is easy for collaborators to understand.
Remember that your most common collaborator is a future version of yourself!

pandas has a convenience method, `nlargest()`, for returning the top five results,
so the values don't need to be sorted explicitly.

In [88]:
(world_data_2018
     .groupby(['sub_region'])
     .size()
     .nlargest()  # the default is 5
)

sub_region
Sub-Saharan Africa                 46
Latin America and the Caribbean    29
Western Asia                       18
Southern Europe                    12
Eastern Europe                     10
dtype: int64

To include more attributes about these sub-regions,
add those columns to `groupby()`.

In [89]:
(world_data_2018
     .groupby(['region', 'sub_region'])
     .size()
     .nlargest()  # the default is 5
)

region    sub_region                     
Africa    Sub-Saharan Africa                 46
Americas  Latin America and the Caribbean    29
Asia      Western Asia                       18
Europe    Southern Europe                    12
Asia      South-eastern Asia                 10
dtype: int64

In [90]:
world_data.head()

Unnamed: 0,country,year,population,region,sub_region,income_group,life_expectancy,income,children_per_woman,child_mortality,pop_density,co2_per_capita,years_in_school_men,years_in_school_women,population_income
0,Afghanistan,1800,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,,1977840000
1,Afghanistan,1801,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,,1977840000
2,Afghanistan,1802,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,,1977840000
3,Afghanistan,1803,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,,1977840000
4,Afghanistan,1804,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,,1977840000


>#### Challenge
>
> 1. How many countries are there in each income group worldwide?
> 2. Assign the variable name `world_data_2015` to a dataframe containing only the values from year 2015
>    (i.e. the same way as `world_data_2018` was created).
> 3.
>    a. For those countries where women went to school longer than men,
>       how many are there in each income group?
>    b. Do the same as above but for countries where men went to school longer than women.
>       What does this distribution tell you?

In [91]:
# Challenge solutions
# 1.
world_data_2018.groupby('income_group').size()

income_group
Low             34
Lower middle    43
Upper middle    49
High            52
dtype: int64

In [92]:
# 2
world_data_2015 = world_data.loc[world_data['year'] == 2015]

In [93]:
# 3a
(world_data_2015
     .loc[world_data_2015['years_in_school_men'] < world_data_2015['years_in_school_women']]
     .groupby('income_group')
     .size()
)

income_group
Low              0
Lower middle    14
Upper middle    33
High            47
dtype: int64

In [94]:
# 3b
(world_data_2015
     .loc[world_data_2015['years_in_school_men'] > world_data_2015['years_in_school_women']]
     .groupby('income_group')
     .size()
)

income_group
Low             34
Lower middle    29
Upper middle    11
High             5
dtype: int64

### Data cleaning tips

`dropna()` removes rows that contain explicit `NaN` values
as well as values that pandas assumed to be `NaN`,
such as the non-numeric values in the `life_expectancy` column.


In [None]:
world_data_2018.dropna()

Instead of dropping observations that have `NaN` values in any column,
a subset of columns can be considered.

In [None]:
world_data_2018.dropna(subset=['life_expectancy'])

Non-numeric values can also be coerced into explicit `NaN` values
via the `to_numeric()` top level function.

In [None]:
pd.to_numeric(world_data_2018['life_expectancy'], errors='coerce')
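
The three steps above can be tried on a small made-up dataframe
(hypothetical values; the `'unknown'` entry stands in for a non-numeric value in an otherwise numeric column):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing and one non-numeric value
toy = pd.DataFrame({
    'country': ['X', 'Y', 'Z'],
    'life_expectancy': ['70.5', 'unknown', '81.2'],
    'income': [1000.0, 2000.0, np.nan],
})

# Coerce non-numeric strings into explicit NaN values
toy['life_expectancy'] = pd.to_numeric(toy['life_expectancy'], errors='coerce')

# Drop rows with NaN in any column
complete_rows = toy.dropna()

# Drop rows only when NaN occurs in the listed columns
has_life_expectancy = toy.dropna(subset=['life_expectancy'])

print(complete_rows)
print(has_life_expectancy)
```

Note that `dropna()` returns a new dataframe and leaves `toy` unchanged,
so the result needs to be assigned to a variable to be kept.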