# Subsetting `pd.DataFrame`
2023-10-09

There are *many* ways to subset a data frame.

We'll review some core methods to do this.


## Our data
We will use simplified data from the National Snow and Ice Data Center. Column descriptions:


- **year**: calendar year
- **europe - antarctica**: change in glacial volume (km3) in each region that year
- **global_glacial_volume_change**: cumulative global glacial volume change (km3), starting in 1961
- **annual_sea_level_rise**: annual rise in sea level (mm)
- **cumulative_sea_level_rise**: cumulative rise in sea level (mm) since 1961

Let's read in the data!


In [1]:
import pandas as pd
import numpy as np

In [4]:
# read in file

df = pd.read_csv('glacial_loss.csv')

# see the first 5 rows
df.head()

Unnamed: 0,year,europe,arctic,alaska,asia,north_america,south_america,antarctica,global_glacial_volume_change,annual_sea_level_rise,cumulative_sea_level_rise
0,1961,-5.128903,-108.382987,-18.72119,-32.350759,-14.359007,-4.739367,-35.116389,-220.823515,0.61001,0.61001
1,1962,5.576282,-173.25245,-24.32479,-4.67544,-2.161842,-13.694367,-78.222887,-514.269862,0.810625,1.420635
2,1963,-10.123105,-0.423751,-2.047567,-3.027298,-27.535881,3.419633,3.765109,-550.57564,0.100292,1.520927
3,1964,-4.508358,20.070148,0.4778,-18.675385,-2.248286,20.732633,14.853096,-519.589859,-0.085596,1.435331
4,1965,10.629385,43.695389,-0.115332,-18.414602,-19.398765,6.862102,22.793484,-473.112003,-0.128392,1.306939


In [9]:
# get number of rows and columns
# 43 rows, 11 columns
df.shape

(43, 11)

In [5]:
# get column names

df.columns

Index(['year', 'europe', 'arctic', 'alaska', 'asia', 'north_america',
       'south_america', 'antarctica', 'global_glacial_volume_change',
       'annual_sea_level_rise', 'cumulative_sea_level_rise'],
      dtype='object')

In [10]:
# info about non-null values and dtype of columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43 entries, 0 to 42
Data columns (total 11 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   year                          43 non-null     int64  
 1   europe                        43 non-null     float64
 2   arctic                        43 non-null     float64
 3   alaska                        43 non-null     float64
 4   asia                          43 non-null     float64
 5   north_america                 43 non-null     float64
 6   south_america                 43 non-null     float64
 7   antarctica                    43 non-null     float64
 8   global_glacial_volume_change  43 non-null     float64
 9   annual_sea_level_rise         43 non-null     float64
 10  cumulative_sea_level_rise     43 non-null     float64
dtypes: float64(10), int64(1)
memory usage: 3.8 KB


In [7]:
# check the data types of each column

df.dtypes

year                              int64
europe                          float64
arctic                          float64
alaska                          float64
asia                            float64
north_america                   float64
south_america                   float64
antarctica                      float64
global_glacial_volume_change    float64
annual_sea_level_rise           float64
cumulative_sea_level_rise       float64
dtype: object

# Selecting a single column...

## ...by column name

This is the simplest case for selecting data. Suppose we are interested in the annual sea level rise. Then we can access that single column in this way:

```
df['column_name']
```

Note: this is the same syntax we use to access a value in a dictionary. Remember we can think of a `pd.DataFrame` as a dictionary where the keys are the column names.
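To make the dictionary analogy concrete, here is a minimal sketch. It assumes a toy data frame (`toy`, a made-up name) built from the first three rows of two columns shown above, not the full `glacial_loss.csv` file.

```python
import pandas as pd

# toy data: first three rows of two columns from the table above
data = {
    'year': [1961, 1962, 1963],
    'annual_sea_level_rise': [0.61001, 0.810625, 0.100292],
}
toy = pd.DataFrame(data)

# same square-bracket syntax for both objects
print(data['year'])   # a plain Python list
print(toy['year'])    # a pandas.Series

# like dictionary keys, column names can be listed and tested for membership
print(list(toy.keys()))
print('year' in toy)
```

The analogy is not perfect: indexing the dictionary returns the list we stored, while indexing the data frame returns a `pandas.Series`.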


Example: sea level rise

In [14]:
# select a single column by using square brackets

annual_rise = df['annual_sea_level_rise']

# check the type of the output
print(type(annual_rise))

annual_rise.head()

<class 'pandas.core.series.Series'>


0    0.610010
1    0.810625
2    0.100292
3   -0.085596
4   -0.128392
Name: annual_sea_level_rise, dtype: float64

Since we only selected a single column the output is a `pandas.Series`.



`df['column_name']` is an example of **selecting by label**, which means we want to select data from our data frame using the *names* of the columns, not their *position*.

## ... with attribute syntax

We can also access a single column as an attribute of the data frame: `df.column_name`


In [16]:
annual_rise_2 = df.annual_sea_level_rise
annual_rise_2.head()

0    0.610010
1    0.810625
2    0.100292
3   -0.085596
4   -0.128392
Name: annual_sea_level_rise, dtype: float64
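Attribute syntax is convenient, but it does not work for every column name. The sketch below uses a hypothetical toy data frame whose column names `'sea level'` and `'count'` are made up to show two cases where square brackets are needed.

```python
import pandas as pd

# hypothetical toy frame; 'sea level' and 'count' are made-up column names
toy = pd.DataFrame({
    'year': [1961, 1962, 1963],
    'sea level': [0.61, 0.81, 0.10],
    'count': [1, 2, 3],
})

# works: 'year' is a valid Python name and does not clash with anything
print(toy.year.equals(toy['year']))

# toy.sea level would be a syntax error, so brackets are the only option
print(toy['sea level'])

# toy.count is the DataFrame.count method, not the column
print(callable(toy.count))
print(toy['count'])
```

For this reason `df['column_name']` is the safer default when column names come from a file we do not control.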


# Selecting multiple columns...


## ... using a list of column names


This is another example of selecting by labels. We just need to pass a list with the column names to the square brackets []. 



```
df[['col_1', 'col_2', 'col_3']]
```

The list of column names `['col_1', 'col_2', etc.]` goes inside the selection brackets `[]`

Notice each column name is a string. 



For example, say we want to look at the change in glacial volume in Europe and Asia, then we can select those columns like this:


In [21]:
# select columns with names "europe" and "asia"

europe_asia = df[['europe', 'asia']]

# double brackets because we're passing the list ['europe', 'asia'] 
# to the selection brackets []

In [24]:
# check the type of the resulting selection
print(type(europe_asia))

# check the shape of the selection
print(europe_asia.shape)
# 43 rows, 2 columns

<class 'pandas.core.frame.DataFrame'>
(43, 2)
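The number of brackets changes the type of the result, even when we ask for one column. A small sketch, assuming a toy data frame (`toy`) with the first three rows of `year`, `europe` and `asia` from the table above:

```python
import pandas as pd

toy = pd.DataFrame({
    'year': [1961, 1962, 1963],
    'europe': [-5.128903, 5.576282, -10.123105],
    'asia': [-32.350759, -4.67544, -3.027298],
})

as_series = toy['europe']     # a string inside [] gives a pandas.Series
as_frame = toy[['europe']]    # a one-element list inside [] gives a pandas.DataFrame

print(type(as_series), as_series.shape)
print(type(as_frame), as_frame.shape)
```

The series has a one-dimensional shape, while the data frame keeps two dimensions (rows and a single column).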


## ... using a slice

Yet another example of selecting by label! In this case we will use the `loc` selection. 

`loc` is short for "locate" :)

The general syntax is:

```
df.loc[ row-selection , column-selection]
```



where `row-selection` and `column-selection` are the rows and columns we want to subset from the data frame.

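To select a slice of columns for all rows, the format is:

```
df.loc[ : , 'start_col_name':'stop_col_name' ]
```

Here the colon `:` selects all rows, and the column `stop_col_name` is included in the result.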

Let’s start with a simple example, where we want to select a slice of columns, say the change in glacial volume per year in all regions. This corresponds to all columns between `europe` and `antarctica`. (The textbook example slices from `arctic` to `antarctica` instead.)


In [30]:
# select all columns between 'europe' and 'antarctica'

all_regions = df.loc[: , 'europe':'antarctica']



print(all_regions.shape)

print(type(all_regions))

all_regions.head()

(43, 7)
<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,europe,arctic,alaska,asia,north_america,south_america,antarctica
0,-5.128903,-108.382987,-18.72119,-32.350759,-14.359007,-4.739367,-35.116389
1,5.576282,-173.25245,-24.32479,-4.67544,-2.161842,-13.694367,-78.222887
2,-10.123105,-0.423751,-2.047567,-3.027298,-27.535881,3.419633,3.765109
3,-4.508358,20.070148,0.4778,-18.675385,-2.248286,20.732633,14.853096
4,10.629385,43.695389,-0.115332,-18.414602,-19.398765,6.862102,22.793484


Notice two things:

- we used the colon `:` as the `row-selection` parameter, which means “select all the rows”
- the `column-selection` parameter is the slice `'europe':'antarctica'`, and the data frame we got includes both endpoints of that slice. In other words we get the `europe` column and the `antarctica` column. This is different from how slicing works in base Python and NumPy, where the endpoint is not included.
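The difference between the two kinds of slicing is easy to check side by side. This is a minimal sketch on a toy data frame (`toy`) holding the first two rows of five columns from the table above:

```python
import pandas as pd

toy = pd.DataFrame({
    'year': [1961, 1962],
    'europe': [-5.128903, 5.576282],
    'arctic': [-108.382987, -173.25245],
    'alaska': [-18.72119, -24.32479],
    'asia': [-32.350759, -4.67544],
})

names = list(toy.columns)

# base Python: the stop position (3, 'alaska') is excluded
python_slice = names[1:3]

# loc: the stop label ('alaska') is included
loc_slice = list(toy.loc[:, 'europe':'alaska'].columns)

print(python_slice)
print(loc_slice)
```

The list slice stops before `alaska`, while the `loc` slice keeps it.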

### Check-in:


Select the following from `df`:

- year data
- data from Arctic, Alaska, Asia, North America
- global change in glacial volume and cumulative sea level rise

In [39]:
# year data
year_data = df.year
year_data.head()

0    1961
1    1962
2    1963
3    1964
4    1965
Name: year, dtype: int64

In [35]:
# data from Arctic, Alaska, Asia, North America
selected_regions = df.loc[: , 'arctic':'north_america']
selected_regions.head()

Unnamed: 0,arctic,alaska,asia,north_america
0,-108.382987,-18.72119,-32.350759,-14.359007
1,-173.25245,-24.32479,-4.67544,-2.161842
2,-0.423751,-2.047567,-3.027298,-27.535881
3,20.070148,0.4778,-18.675385,-2.248286
4,43.695389,-0.115332,-18.414602,-19.398765


In [38]:
# global change in glacial volume and cumulative sea level rise

glacial_cumulative = df[['global_glacial_volume_change', 'cumulative_sea_level_rise']]
glacial_cumulative.head()

Unnamed: 0,global_glacial_volume_change,cumulative_sea_level_rise
0,-220.823515,0.61001
1,-514.269862,1.420635
2,-550.57564,1.520927
3,-519.589859,1.435331
4,-473.112003,1.306939


# Selecting rows ...


## ... using a condition

Selecting the rows that satisfy a particular condition is, in my experience, the most common kind of row subsetting.

The general syntax for this type of selection is `df[condition_on_rows]`.


For example, suppose we are interested in all data after 1996. We can select those rows in this way:

In [53]:
# select all rows with year > 1996
after_96 = df[ df['year']>1996 ]

print(after_96.shape)
print(type(after_96))

after_96

(7, 11)
<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,year,europe,arctic,alaska,asia,north_america,south_america,antarctica,global_glacial_volume_change,annual_sea_level_rise,cumulative_sea_level_rise
36,1997,-13.724106,-24.832246,-167.229145,-34.406403,-27.680661,-38.213286,-20.17909,-4600.686013,0.909625,12.709077
37,1998,-13.083338,-110.429302,-107.879027,-58.115702,30.169987,-3.797978,-48.129928,-4914.831966,0.867807,13.576884
38,1999,-8.039555,-64.644068,-87.714653,-26.211723,5.888512,-8.03863,-40.653001,-5146.368231,0.639603,14.216487
39,2000,-17.00859,-96.494055,-44.445,-37.518173,-29.191986,-2.767698,-58.87383,-5435.317175,0.798202,15.014688
40,2001,-8.419109,-145.415483,-55.749505,-35.977022,-0.926134,7.553503,-86.774675,-5764.039931,0.908074,15.922762
41,2002,-3.392361,-48.718943,-87.12,-36.127226,-27.853498,-13.484593,-30.20396,-6013.2255,0.688358,16.61112
42,2003,-3.392361,-48.718943,-67.253634,-36.021991,-75.066475,-13.22343,-30.20396,-6289.640976,0.763579,17.374699


Let’s break down what is happening here. 

In this case the condition for our rows is `df['year']>1996`. In the `df[condition_on_rows]` syntax, it plays the role of `condition_on_rows`: it checks, row by row, whether the value in the year column is greater than 1996.

Let’s see this explicitly:

In [45]:
# check the type of df['year']>1996
print(type(df['year']>1996))

df['year']>1996

<class 'pandas.core.series.Series'>


0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36     True
37     True
38     True
39     True
40     True
41     True
42     True
Name: year, dtype: bool

in-class notes:
- `df['year'] > 1996` is a `pd.Series` with boolean values (`True` or `False`) 
- `pd.Series` with boolean values are often called **masks**
- When we pass such a series of boolean values to the selection brackets `[]` we keep only those rows with a `True` value in the pd.Series/mask
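Since a mask is just a `pd.Series` of booleans, we can store it in a variable, count its `True` values, or flip it. A minimal sketch, assuming a toy data frame (`toy`) with the 1995 to 1998 rows of two columns from the data above:

```python
import pandas as pd

toy = pd.DataFrame({
    'year': [1995, 1996, 1997, 1998],
    'annual_sea_level_rise': [0.494767, 0.506405, 0.909625, 0.867807],
})

# build the mask once and reuse it
mask = toy['year'] > 1996

print(mask.tolist())
print(mask.sum())      # True counts as 1, so this is the number of matching rows
print(len(toy[mask]))  # same number of rows after selecting
print(toy[~mask])      # ~ flips the mask: rows with year <= 1996
```

Checking `mask.sum()` before subsetting is a quick way to see how many rows a condition will keep.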


Here’s another example of using a condition. Suppose we want to look at data from years 1970 to 1979 (inclusive). One way of doing this is to use the `isin` method in our condition:

In [60]:
range(1970,1979) # careful: the right endpoint is excluded, so this stops at 1978

range(1970, 1979)

In [62]:
df.year.isin(range(1970,1980))

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9      True
10     True
11     True
12     True
13     True
14     True
15     True
16     True
17     True
18     True
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
Name: year, dtype: bool

In [61]:
seventies = df[ df.year.isin(range(1970,1980))]
seventies

Unnamed: 0,year,europe,arctic,alaska,asia,north_america,south_america,antarctica,global_glacial_volume_change,annual_sea_level_rise,cumulative_sea_level_rise
9,1970,-6.452316,-24.494667,-0.125296,-36.120199,11.61979,11.636911,4.400377,-999.018177,0.110225,2.759719
10,1971,0.414711,-42.904189,28.103328,-8.702938,-9.964542,1.061299,-6.735536,-1038.104459,0.107973,2.867692
11,1972,-5.144729,-27.004031,-22.14335,-40.883357,32.36373,-14.968034,-6.223849,-1122.885506,0.234202,3.101894
12,1973,4.08109,9.839444,22.985188,-31.432594,-20.883232,2.103649,10.539823,-1125.677743,0.007713,3.109607
13,1974,1.545615,-40.126998,-29.517874,-43.861622,-23.991402,-21.338825,4.419343,-1279.964287,0.426206,3.535813
14,1975,7.431192,-32.410467,-44.094084,-43.357442,-30.85881,-2.368842,-7.775315,-1434.818037,0.427773,3.963586
15,1976,3.986753,21.686639,-28.234725,-67.292125,-12.534421,-19.465358,19.250607,-1518.185129,0.230296,4.193882
16,1977,4.89141,-33.12301,-5.662139,-62.165684,-15.905332,2.65495,-23.727249,-1652.4534,0.370907,4.564788
17,1978,8.404591,-77.561015,-12.503384,-22.85804,-31.097609,7.127708,-9.140167,-1791.355022,0.383706,4.948495
18,1979,3.916703,-88.351684,-63.938851,-49.242043,-12.076624,-17.718503,-9.578557,-2030.537848,0.660726,5.609221


Let’s break it down:

- `df['year']` is the column with the year values, a `pandas.Series`,

- in `df['year'].isin()`, `isin` is a method of the `pandas.Series` and we call it using the dot `.` (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html),

- `range(1970,1980)` constructs consecutive integers from 1970 to 1979 - remember the right endpoint (1980) is not included!

- `df['year'].isin(range(1970,1980))` is then a `pandas.Series` of boolean values indicating which rows have year equal to 1970, …, 1979.

- when we put `df['year'].isin(range(1970,1980))` inside the selection brackets `[]` we obtain the rows of the data frame with year equal to 1970, …, 1979.
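`isin` is not limited to consecutive values, and for a range of years there is also the `Series.between` method, which includes both endpoints by default. The sketch below is only an illustration on a toy data frame (`toy`) with a single `year` column; it is not part of the lesson's data.

```python
import pandas as pd

# toy frame with years 1968, ..., 1982
toy = pd.DataFrame({'year': list(range(1968, 1983))})

# two ways to keep 1970, ..., 1979
with_isin = toy[toy['year'].isin(range(1970, 1980))]
with_between = toy[toy['year'].between(1970, 1979)]  # both endpoints included by default

print(with_isin.equals(with_between))

# isin also accepts values that are not consecutive
picked = toy[toy['year'].isin([1969, 1975, 1981])]
print(picked['year'].tolist())
```

With `between` we write the last year we want (1979), so there is no off-by-one step as with `range`.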

## ... using multiple conditions


We can combine multiple conditions by surrounding each one in parentheses `()` and using the or operator `|` and the and operator `&`.

Syntax:

```
# select rows of df that satisfy condition1 OR condition2
df[ (condition1) | (condition2) ]
```

and

```
# select rows of df that satisfy condition1 AND condition2
df[ (condition1) & (condition2) ]
```
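Here is a small sketch of why we use `|` and `&` rather than Python's `or` and `and`. It assumes a toy data frame (`toy`) with the 1962 to 1965 rows of two columns from the data above, and stores each condition in its own variable.

```python
import pandas as pd

toy = pd.DataFrame({
    'year': [1962, 1963, 1964, 1965],
    'annual_sea_level_rise': [0.810625, 0.100292, -0.085596, -0.128392],
})

negative = toy['annual_sea_level_rise'] < 0
large = toy['annual_sea_level_rise'] > 0.8

either = toy[negative | large]   # rows where at least one condition holds
both = toy[negative & large]     # rows where both hold (none here)

print(either['year'].tolist())
print(len(both))

# the keywords `and` / `or` need a single True/False, which a Series cannot give
try:
    toy[negative and large]
except ValueError as err:
    print('ValueError:', err)
```

The parentheses matter when the conditions are written inline, because `|` and `&` bind more tightly than `<` and `>`. Without them, Python would try to evaluate something like `0 | df['annual_sea_level_rise']` first.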


Example: OR

In [68]:
# select rows with 
# annual_sea_level_rise<0mm OR annual_sea_level_rise>0.8 mm

df[ (df['annual_sea_level_rise']<0) | (df['annual_sea_level_rise']>0.8)]

Unnamed: 0,year,europe,arctic,alaska,asia,north_america,south_america,antarctica,global_glacial_volume_change,annual_sea_level_rise,cumulative_sea_level_rise
1,1962,5.576282,-173.25245,-24.32479,-4.67544,-2.161842,-13.694367,-78.222887,-514.269862,0.810625,1.420635
3,1964,-4.508358,20.070148,0.4778,-18.675385,-2.248286,20.732633,14.853096,-519.589859,-0.085596,1.435331
4,1965,10.629385,43.695389,-0.115332,-18.414602,-19.398765,6.862102,22.793484,-473.112003,-0.128392,1.306939
31,1992,16.175828,16.115404,-1.112326,9.582169,1.380067,-16.355425,11.861513,-3429.634639,-0.10496,9.474129
36,1997,-13.724106,-24.832246,-167.229145,-34.406403,-27.680661,-38.213286,-20.17909,-4600.686013,0.909625,12.709077
37,1998,-13.083338,-110.429302,-107.879027,-58.115702,30.169987,-3.797978,-48.129928,-4914.831966,0.867807,13.576884
40,2001,-8.419109,-145.415483,-55.749505,-35.977022,-0.926134,7.553503,-86.774675,-5764.039931,0.908074,15.922762


### Check-in 2:

Use two conditions and the `&` operator to select data from years 1970 to 1979 (inclusive).

In [72]:
df[ (df.year > 1969) & (df.year < 1980) ]

Unnamed: 0,year,europe,arctic,alaska,asia,north_america,south_america,antarctica,global_glacial_volume_change,annual_sea_level_rise,cumulative_sea_level_rise
9,1970,-6.452316,-24.494667,-0.125296,-36.120199,11.61979,11.636911,4.400377,-999.018177,0.110225,2.759719
10,1971,0.414711,-42.904189,28.103328,-8.702938,-9.964542,1.061299,-6.735536,-1038.104459,0.107973,2.867692
11,1972,-5.144729,-27.004031,-22.14335,-40.883357,32.36373,-14.968034,-6.223849,-1122.885506,0.234202,3.101894
12,1973,4.08109,9.839444,22.985188,-31.432594,-20.883232,2.103649,10.539823,-1125.677743,0.007713,3.109607
13,1974,1.545615,-40.126998,-29.517874,-43.861622,-23.991402,-21.338825,4.419343,-1279.964287,0.426206,3.535813
14,1975,7.431192,-32.410467,-44.094084,-43.357442,-30.85881,-2.368842,-7.775315,-1434.818037,0.427773,3.963586
15,1976,3.986753,21.686639,-28.234725,-67.292125,-12.534421,-19.465358,19.250607,-1518.185129,0.230296,4.193882
16,1977,4.89141,-33.12301,-5.662139,-62.165684,-15.905332,2.65495,-23.727249,-1652.4534,0.370907,4.564788
17,1978,8.404591,-77.561015,-12.503384,-22.85804,-31.097609,7.127708,-9.140167,-1791.355022,0.383706,4.948495
18,1979,3.916703,-88.351684,-63.938851,-49.242043,-12.076624,-17.718503,-9.578557,-2030.537848,0.660726,5.609221


In [73]:
df[ (df.year >= 1970) & (df.year <= 1979) ]

Unnamed: 0,year,europe,arctic,alaska,asia,north_america,south_america,antarctica,global_glacial_volume_change,annual_sea_level_rise,cumulative_sea_level_rise
9,1970,-6.452316,-24.494667,-0.125296,-36.120199,11.61979,11.636911,4.400377,-999.018177,0.110225,2.759719
10,1971,0.414711,-42.904189,28.103328,-8.702938,-9.964542,1.061299,-6.735536,-1038.104459,0.107973,2.867692
11,1972,-5.144729,-27.004031,-22.14335,-40.883357,32.36373,-14.968034,-6.223849,-1122.885506,0.234202,3.101894
12,1973,4.08109,9.839444,22.985188,-31.432594,-20.883232,2.103649,10.539823,-1125.677743,0.007713,3.109607
13,1974,1.545615,-40.126998,-29.517874,-43.861622,-23.991402,-21.338825,4.419343,-1279.964287,0.426206,3.535813
14,1975,7.431192,-32.410467,-44.094084,-43.357442,-30.85881,-2.368842,-7.775315,-1434.818037,0.427773,3.963586
15,1976,3.986753,21.686639,-28.234725,-67.292125,-12.534421,-19.465358,19.250607,-1518.185129,0.230296,4.193882
16,1977,4.89141,-33.12301,-5.662139,-62.165684,-15.905332,2.65495,-23.727249,-1652.4534,0.370907,4.564788
17,1978,8.404591,-77.561015,-12.503384,-22.85804,-31.097609,7.127708,-9.140167,-1791.355022,0.383706,4.948495
18,1979,3.916703,-88.351684,-63.938851,-49.242043,-12.076624,-17.718503,-9.578557,-2030.537848,0.660726,5.609221


Example: AND

In [71]:
# from the book

# select rows with cumulative_sea_level_rise>10 AND  global_glacial_volume_change<-300
df[ (df['cumulative_sea_level_rise']>10) & (df['global_glacial_volume_change']<-300)]

Unnamed: 0,year,europe,arctic,alaska,asia,north_america,south_america,antarctica,global_glacial_volume_change,annual_sea_level_rise,cumulative_sea_level_rise
32,1993,16.685013,-73.666274,-43.70204,-65.99513,-33.151246,-20.578403,-20.311577,-3672.582082,0.671126,10.145254
33,1994,0.741751,-3.069084,-59.962273,-59.00471,-89.506142,-15.258449,-8.168498,-3908.977191,0.653025,10.79828
34,1995,-2.139665,-58.167778,-74.141762,3.500155,-0.699374,-19.863392,-25.951496,-4088.082873,0.494767,11.293047
35,1996,-6.809834,-4.550205,-74.847017,-67.436591,4.86753,-21.080115,-11.781489,-4271.401594,0.506405,11.799452
36,1997,-13.724106,-24.832246,-167.229145,-34.406403,-27.680661,-38.213286,-20.17909,-4600.686013,0.909625,12.709077
37,1998,-13.083338,-110.429302,-107.879027,-58.115702,30.169987,-3.797978,-48.129928,-4914.831966,0.867807,13.576884
38,1999,-8.039555,-64.644068,-87.714653,-26.211723,5.888512,-8.03863,-40.653001,-5146.368231,0.639603,14.216487
39,2000,-17.00859,-96.494055,-44.445,-37.518173,-29.191986,-2.767698,-58.87383,-5435.317175,0.798202,15.014688
40,2001,-8.419109,-145.415483,-55.749505,-35.977022,-0.926134,7.553503,-86.774675,-5764.039931,0.908074,15.922762
41,2002,-3.392361,-48.718943,-87.12,-36.127226,-27.853498,-13.484593,-30.20396,-6013.2255,0.688358,16.61112


## ... by position



All the selections we have done so far have been using labels or using a condition. Sometimes we might want to select certain rows depending on their *actual* position in the data frame. In this case we use `iloc` selection, where `iloc` stands for *integer-location* based indexing. The syntax is:

```
df.iloc[row-indices]
```



Example:

In [74]:
# select the fifth row (remember, indexing starts from 0 in python)

df.iloc[4]

year                            1965.000000
europe                            10.629385
arctic                            43.695389
alaska                            -0.115332
asia                             -18.414602
north_america                    -19.398765
south_america                      6.862102
antarctica                        22.793484
global_glacial_volume_change    -473.112003
annual_sea_level_rise             -0.128392
cumulative_sea_level_rise          1.306939
Name: 4, dtype: float64

In [75]:
type(df.iloc[4])

pandas.core.series.Series

In [76]:
# select rows 23 through 30 (inclusive)

df.iloc[23:31]

Unnamed: 0,year,europe,arctic,alaska,asia,north_america,south_america,antarctica,global_glacial_volume_change,annual_sea_level_rise,cumulative_sea_level_rise
23,1984,8.581427,-5.755672,-33.466092,-20.528535,-20.734676,-8.267686,-3.261011,-2569.339802,0.232609,7.097624
24,1985,-5.97098,-49.651089,12.065473,-31.571622,-33.833985,10.072906,-13.587886,-2682.857926,0.313586,7.41121
25,1986,-5.680642,22.900847,7.557447,-18.920773,-33.014743,-4.65203,30.482473,-2684.197632,0.003701,7.414911
26,1987,8.191477,12.38778,-24.007862,-41.12197,-48.560996,1.670733,3.13019,-2773.325568,0.24621,7.66112
27,1988,-11.117228,-31.066489,49.897712,-21.300712,-46.545435,13.460422,-37.986834,-2858.767621,0.236028,7.897148
28,1989,14.86322,-23.462392,-36.112726,-46.528372,-57.756422,-21.68747,-10.044757,-3041.169131,0.503872,8.40102
29,1990,-1.226009,-27.484542,-92.713339,-35.553433,-56.563056,-31.077022,-29.893352,-3318.220397,0.765335,9.166355
30,1991,-14.391425,-34.898689,-8.822063,-15.338299,-31.45801,-7.162909,-35.968429,-3467.630284,0.412734,9.579089
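In the examples above the position of a row and its index label happen to be the same number. That stops being true after we subset. A minimal sketch, assuming a toy data frame (`toy`) with the 1995 to 1998 rows of `year` and `europe` from the data above:

```python
import pandas as pd

toy = pd.DataFrame({
    'year': [1995, 1996, 1997, 1998],
    'europe': [-2.139665, -6.809834, -13.724106, -13.083338],
})

# keep rows after 1996; the index labels are carried over from toy
late = toy[toy['year'] > 1996]

print(late.index.tolist())     # labels kept from the original data frame
print(late.iloc[0]['year'])    # first row by position
print(late.loc[2]['year'])     # row whose index label is 2 (the same row)

# there is no row labeled 0 any more, so loc fails where iloc works
try:
    late.loc[0]
except KeyError:
    print('no row with label 0 in late')
```

So `iloc` always counts rows from the top of whatever data frame we have, while `loc` looks for the label.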


# Selecting rows and columns simultaneously ...


Selecting rows and columns simultaneously can be done using `loc` (labels or conditions) or `iloc` (integer position).

## ... by labels or conditions

When we want to select rows and columns simultaneously by labels or conditions we can use `loc` selection with the syntax


```
df.loc[ row-selection , column-selection]
```

Here we specify both parameters, `row-selection` and `column-selection`. These parameters can be a condition (which generates a boolean array) or a subset of labels from the index or the column names.

Let’s see an example:

In [77]:
# select the change in glacial volume in Europe per year after 2000

df.loc[ df['year'] > 2000 , ['year', 'europe']]

Unnamed: 0,year,europe
40,2001,-8.419109
41,2002,-3.392361
42,2003,-3.392361


Let’s break it down:

- we are using the `df.loc[ row-selection , column-selection]` syntax

- the `row-selection` parameter is the condition `df['year'] > 2000`, which is a boolean array saying which years are greater than 2000

- the `column-selection` parameter is `['year','europe']` which is a list with the names of the two columns we are interested in.
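The same selection can be written in two steps with plain brackets, which may help to see what `loc` is doing. This sketch assumes a toy data frame (`toy`) with the 2000 to 2003 rows of three columns from the data above.

```python
import pandas as pd

toy = pd.DataFrame({
    'year': [2000, 2001, 2002, 2003],
    'europe': [-17.00859, -8.419109, -3.392361, -3.392361],
    'asia': [-37.518173, -35.977022, -36.127226, -36.021991],
})

# rows and columns in one step with loc
one_step = toy.loc[toy['year'] > 2000, ['year', 'europe']]

# the same result in two steps: first the rows, then the columns
two_steps = toy[toy['year'] > 2000][['year', 'europe']]
print(one_step.equals(two_steps))

# a single column name instead of a list gives a pandas.Series
europe_only = toy.loc[toy['year'] > 2000, 'europe']
print(type(europe_only))
```

Both forms return the same rows and columns here; the `loc` version states the row and column selection in a single call.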

## ... by position

When we want to select rows and columns simultaneously by position we use `iloc` selection with the syntax:

```
df.iloc[ row-indices , column-indices]
```

Example:

In [81]:
# select rows with indices 3 through 7 (inclusive) (4th through 8th rows)
# and the 4th and 5th columns
df.iloc[ 3:8, [3,4] ]

Unnamed: 0,alaska,asia
3,0.4778,-18.675385
4,-0.115332,-18.414602
5,0.224762,-14.630284
6,-7.17403,-39.013695
7,-0.660556,7.879589



Let’s break it down:

- we are using the `df.iloc[ row-indices , column-indices]` syntax

- the `row-indices` parameter is the slice of *integer indices* 3:8. Remember the right endpoint (8) won’t be included.

- the `column-indices` parameter is the list of integer indices 3 and 4. This means we are selecting the fourth and fifth column.
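To check which columns sit at given positions, we can index `df.columns` with the same integers. The sketch below assumes a toy data frame (`toy`) with the first five rows of five columns from the data above, and compares an `iloc` selection with the matching `loc` selection.

```python
import pandas as pd

toy = pd.DataFrame({
    'year': [1961, 1962, 1963, 1964, 1965],
    'europe': [-5.128903, 5.576282, -10.123105, -4.508358, 10.629385],
    'arctic': [-108.382987, -173.25245, -0.423751, 20.070148, 43.695389],
    'alaska': [-18.72119, -24.32479, -2.047567, 0.4778, -0.115332],
    'asia': [-32.350759, -4.67544, -3.027298, -18.675385, -18.414602],
})

# positions: rows 1, 2, 3 (4 is excluded) and the 4th and 5th columns
by_position = toy.iloc[1:4, [3, 4]]

# names of the columns at positions 3 and 4
print(toy.columns[[3, 4]].tolist())

# labels: rows 1 through 3 (3 is included) and the same columns by name
by_label = toy.loc[1:3, ['alaska', 'asia']]

print(by_position.equals(by_label))
```

Note the different stop values: `1:4` for `iloc` and `1:3` for `loc` pick the same rows, because only the `loc` slice includes its endpoint.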

See also the notes about `loc` and `iloc`: https://carmengg.github.io/eds-220-book/lectures/lesson-2-pandas-basics.html#notes-about-loc-and-iloc