# Extracting subsets of data frames

In this notebook, we will learn how to manipulate pandas DataFrame objects, starting with extracting subsets.

In [1]:
# import pandas
import pandas as pd
# load the gapminder dataset
gapminder = pd.read_csv('https://raw.githubusercontent.com/UofUDELPHI/2024-02-08-python/main/content/complete/data/gapminder.csv')
# take a look at the head of gapminder
gapminder.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


### Extracting multiple columns

Suppose that you want to extract multiple columns at once from your DataFrame object. You might imagine that you can do this by providing two column names inside the square brackets that follow the object name, as follows:

In [None]:
# try to extract two columns: country and gdpPercap from gapminder using the `df[]` notation
gapminder['country', 'gdpPercap']

KeyError: ('country', 'gdpPercap')

However, this approach clearly results in an error!

The problem is that the `df[]` syntax expects only one value (or object) inside the square brackets. Fortunately, you can provide multiple column names as a single **list** object.

The code below creates a *list* containing the `'country'` and `'gdpPercap'` string values:

In [2]:
# create a list containing the names of the columns we want to extract: 'country' and 'gdpPercap' 
['country', 'gdpPercap']

['country', 'gdpPercap']

You can extract both the `country` and `gdpPercap` columns by providing this *list* inside the indexing square brackets, which results in two sets of square brackets:

In [3]:
# provide the list of names inside the `df[]` notation to extract the two columns from gapminder
gapminder[['country', 'gdpPercap']]

Unnamed: 0,country,gdpPercap
0,Afghanistan,779.445314
1,Afghanistan,820.853030
2,Afghanistan,853.100710
3,Afghanistan,836.197138
4,Afghanistan,739.981106
...,...,...
1699,Zimbabwe,706.157306
1700,Zimbabwe,693.420786
1701,Zimbabwe,792.449960
1702,Zimbabwe,672.038623


Above, the outer square brackets are the indexing syntax, and the inner square brackets create a list that contains the two column name strings.
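
To see why the version without the inner brackets fails, it can help to look at what Python does with the comma: it packs the two names into a single tuple, and pandas then looks for one column whose name is that tuple. The sketch below uses a small made-up DataFrame called `toy` (a hypothetical stand-in for `gapminder`, not part of the original notebook) so that it runs without downloading anything:

```python
import pandas as pd

# Hypothetical toy data standing in for gapminder (the values are made up)
toy = pd.DataFrame({
    'country': ['A', 'A', 'B'],
    'year': [1952, 1957, 1952],
    'gdpPercap': [100.0, 110.0, 200.0],
})

# Without the inner brackets, Python packs the two names into one tuple
key = 'country', 'gdpPercap'
print(type(key))

# toy[key] is the same as toy['country', 'gdpPercap'], so it raises a KeyError
raised_key_error = False
try:
    toy[key]
except KeyError:
    raised_key_error = True
print(raised_key_error)

# With a list, pandas returns a DataFrame that holds both columns
two_cols = toy[['country', 'gdpPercap']]
print(two_cols)
```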

### The `.loc` indexer

An alternative (and ultimately more flexible) approach to subsetting a pandas DataFrame is to use the `.loc` indexer. The syntax is similar to the square bracket syntax, except now the square brackets expect *two* values: one for the *row* index and one for the *column* index. If you want to extract a subset of *rows* by their index values, use this `df.loc[,]` syntax, because the `df[]` syntax we have used so far selects columns when given column names.

The general syntax is `df.loc[rows, cols]`. For example, the code below extracts the entry in the row with index `3` (the fourth row) and the `gdpPercap` column:

In [4]:
# Use `df.loc[,]` to extract the entry with row index 3 from the 'gdpPercap' column
gapminder.loc[3, 'gdpPercap']

836.1971382
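
Note that `.loc` looks rows up by their index *label*, not by their position. With the default integer index the two coincide, which is why the row with index `3` is also the fourth row. The sketch below illustrates the difference on a small made-up DataFrame (`toy` and its values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy data with the default integer index 0, 1, 2, 3
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'gdpPercap': [100.0, 110.0, 200.0, 210.0],
})

# With the default index, label 3 happens to be the fourth row
value = toy.loc[3, 'gdpPercap']

# Keep only the last two rows: they keep their original labels 2 and 3
subset = toy.tail(2)
print(list(subset.index))

# Label 3 still works, even though it is now the second row of `subset`
same_value = subset.loc[3, 'gdpPercap']
print(value, same_value)
```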

### Using `:` with `.loc` to select all rows/columns

If you want to extract all rows (or columns), you can replace the corresponding index entry with `:`. So the following code will extract all rows for the `gdpPercap` column:

In [5]:
# Use `df.loc[,]` to extract all rows from the 'gdpPercap' column
gapminder.loc[:, 'gdpPercap']
# what are two other ways that you could do this same thing?
# gapminder.gdpPercap
# gapminder['gdpPercap']

0       779.445314
1       820.853030
2       853.100710
3       836.197138
4       739.981106
           ...    
1699    706.157306
1700    693.420786
1701    792.449960
1702    672.038623
1703    469.709298
Name: gdpPercap, Length: 1704, dtype: float64
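
The three ways of extracting a single column mentioned in the code comments above can be compared directly. The sketch below does this on a small made-up DataFrame (`toy` is a hypothetical stand-in for `gapminder`). It also shows one caveat of the attribute (dot) syntax that matters for this dataset: a column named `pop` collides with the existing `DataFrame.pop` method, so `df.pop` gives you the method rather than the column.

```python
import pandas as pd

# Hypothetical toy data (made-up values) with a 'pop' column like gapminder
toy = pd.DataFrame({
    'country': ['A', 'A', 'B'],
    'pop': [1000, 1100, 2000],
    'gdpPercap': [100.0, 110.0, 200.0],
})

# Three ways to extract the same column as a Series
via_loc = toy.loc[:, 'gdpPercap']
via_brackets = toy['gdpPercap']
via_attribute = toy.gdpPercap
print(via_loc.equals(via_brackets), via_loc.equals(via_attribute))

# Caveat: `toy.pop` is the DataFrame.pop method, not the 'pop' column,
# so use toy['pop'] or toy.loc[:, 'pop'] for that column
print(callable(toy.pop))
pop_column = toy.loc[:, 'pop']

# `:` also works in the column position: all columns for the row labeled 1
row_1 = toy.loc[1, :]
print(row_1)
```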

If you want to extract multiple columns (or rows), you still need to provide all of the index values that you want to extract in a list (so that you are still only providing two entries inside the `.loc[]` square brackets). So the following code will extract all rows (`:`) for the `country` and `gdpPercap` columns:

In [6]:
# use `df.loc[,]` to extract all rows for the 'country' and 'gdpPercap' columns
gapminder.loc[:, ['country', 'gdpPercap']]
# gapminder[['country', 'gdpPercap']]

Unnamed: 0,country,gdpPercap
0,Afghanistan,779.445314
1,Afghanistan,820.853030
2,Afghanistan,853.100710
3,Afghanistan,836.197138
4,Afghanistan,739.981106
...,...,...
1699,Zimbabwe,706.157306
1700,Zimbabwe,693.420786
1701,Zimbabwe,792.449960
1702,Zimbabwe,672.038623


The same applies to rows. To extract the rows with index `4` through `8` (the fifth through ninth rows) and the `country` and `gdpPercap` columns, provide two lists inside the `df.loc[]` square brackets: `[4, 5, 6, 7, 8]` for the rows and `['country', 'gdpPercap']` for the columns:

In [7]:
# extract the rows with index 4, 5, 6, 7, and 8 for the country and gdpPercap columns
gapminder.loc[[4, 5, 6, 7, 8],['country', 'gdpPercap']]

Unnamed: 0,country,gdpPercap
4,Afghanistan,739.981106
5,Afghanistan,786.11336
6,Afghanistan,978.011439
7,Afghanistan,852.395945
8,Afghanistan,649.341395


If your row index is a sequence of integers, you can instead provide a `range` object. Note that `range(4, 9, 1)` stops before `9`, so it yields `4, 5, 6, 7, 8`:

In [8]:
# use the `range()` function to simplify the code in the previous cell
gapminder.loc[range(4, 9, 1),['country', 'gdpPercap']]


Unnamed: 0,country,gdpPercap
4,Afghanistan,739.981106
5,Afghanistan,786.11336
6,Afghanistan,978.011439
7,Afghanistan,852.395945
8,Afghanistan,649.341395
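
Another option, not used in the cells above, is a slice such as `4:8`. Unlike `range()`, a label slice inside `.loc` includes *both* endpoints, so `4:8` covers the labels 4 through 8. The sketch below compares the two on a small made-up DataFrame (`toy` is a hypothetical stand-in for `gapminder`):

```python
import pandas as pd

# Hypothetical toy data with ten rows and the default integer index 0..9
toy = pd.DataFrame({
    'country': list('AAAAABBBBB'),
    'gdpPercap': [float(i) for i in range(10)],
})

# range(4, 9) stops before 9, so it selects the labels 4, 5, 6, 7, 8
by_range = toy.loc[range(4, 9), ['country', 'gdpPercap']]

# A label slice in .loc includes both endpoints: 4, 5, 6, 7, 8
by_slice = toy.loc[4:8, ['country', 'gdpPercap']]

# Both approaches select the same row labels
same_rows = list(by_range.index) == list(by_slice.index)
print(list(by_slice.index), same_rows)
```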


### Using `.loc` with non-numeric indexes

Note that we can index the rows using `.loc` with integers only because the row index of `gapminder` consists of integers. If the row index instead held the `country` values, as in the `gapminder_country` DataFrame defined below, we could not use integers to subset the rows and would need to use the country names instead.

Let's create `gapminder_country`, whose row index corresponds to the country variable:

In [9]:
# define gapminder_country as a new dataframe with the country column as the row index
gapminder_country = gapminder.set_index('country')
# look at gapminder_country
gapminder_country

country,continent,year,lifeExp,pop,gdpPercap
Afghanistan,Asia,1952,28.801,8425333,779.445314
Afghanistan,Asia,1957,30.332,9240934,820.853030
Afghanistan,Asia,1962,31.997,10267083,853.100710
Afghanistan,Asia,1967,34.020,11537966,836.197138
Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...
Zimbabwe,Africa,1987,62.351,9216418,706.157306
Zimbabwe,Africa,1992,60.377,10704340,693.420786
Zimbabwe,Africa,1997,46.809,11404948,792.449960
Zimbabwe,Africa,2002,39.989,11926563,672.038623


To extract the rows for Germany (i.e., where the row index value is equal to `'Germany'`), provide `'Germany'` in the row position of `.loc`. The following code extracts the `gdpPercap` column for all rows of `gapminder_country` whose index value is `'Germany'`:

In [10]:
# use the `df.loc[,]` notation to extract the rows for Germany for the gdpPercap column
gapminder_country.loc['Germany','gdpPercap']

country
Germany     7144.114393
Germany    10187.826650
Germany    12902.462910
Germany    14745.625610
Germany    18016.180270
Germany    20512.921230
Germany    22031.532740
Germany    24639.185660
Germany    26505.303170
Germany    27788.884160
Germany    30035.801980
Germany    32170.374420
Name: gdpPercap, dtype: float64
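
Two details of this behavior are worth checking on a small example: a label that appears in several rows returns all of them (as a Series here), while a label that appears once returns a single value, and integer labels no longer work once the index holds country names. The sketch below uses made-up data (`toy`, `toy_country`, and all values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data (made-up values), then move 'country' into the row index
toy = pd.DataFrame({
    'country': ['Germany', 'Germany', 'Iceland'],
    'year': [1952, 1957, 1952],
    'gdpPercap': [7000.0, 10000.0, 7300.0],
})
toy_country = toy.set_index('country')

# Two rows share the label 'Germany', so the result is a Series with two entries
germany = toy_country.loc['Germany', 'gdpPercap']
print(germany)

# Only one row has the label 'Iceland', so the result is a single value
iceland = toy_country.loc['Iceland', 'gdpPercap']
print(iceland)

# The index no longer contains integers, so an integer label raises a KeyError
try:
    toy_country.loc[0, 'gdpPercap']
    integer_label_failed = False
except KeyError:
    integer_label_failed = True
print(integer_label_failed)
```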

### Exercise

1. Extract the population and year columns for Australia using `gapminder_country`.

In [11]:
gapminder_country.loc['Australia',['pop', 'year']]

country,pop,year
Australia,8691212,1952
Australia,9712569,1957
Australia,10794968,1962
Australia,11872264,1967
Australia,13177000,1972
Australia,14074100,1977
Australia,15184200,1982
Australia,16257249,1987
Australia,17481977,1992
Australia,18565243,1997


2. Extract the `country` and `lifeExp` columns for the first, second, and third rows of `gapminder`.

In [12]:
gapminder.loc[[0, 1, 2], ['country', 'lifeExp']]

Unnamed: 0,country,lifeExp
0,Afghanistan,28.801
1,Afghanistan,30.332
2,Afghanistan,31.997
