Link to Medium blog post: https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c

# Selecting Subsets of Data in Pandas: Part 1

### Example selecting some columns and all rows

In [3]:
import pandas as pd
import numpy as np

data = {
    'state': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'],
    'color': ['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
    'food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
    'age': [30, 2, 12, 4, 32, 33, 69],
    'height': [165, 70, 120, 80, 180, 172, 150],
    'score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2]
}

df = pd.DataFrame(data, index=['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])


We will first look at a sample DataFrame with fake data.

In [4]:
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


### Extracting the individual DataFrame components

Earlier, we mentioned the three components of the DataFrame: the index, the columns and the data (values). We can extract each of these components into its own variable. Let’s do that and then inspect them:

In [6]:
index = df.index
columns = df.columns
values = df.values

In [7]:
index

Index(['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'], dtype='object')

In [8]:
columns

Index(['state', 'color', 'food', 'age', 'height', 'score'], dtype='object')

In [9]:
values

array([['NY', 'blue', 'Steak', 30, 165, 4.6],
       ['TX', 'green', 'Lamb', 2, 70, 8.3],
       ['FL', 'red', 'Mango', 12, 120, 9.0],
       ['AL', 'white', 'Apple', 4, 80, 3.3],
       ['AK', 'gray', 'Cheese', 32, 180, 1.8],
       ['TX', 'black', 'Melon', 33, 172, 9.5],
       ['TX', 'red', 'Beans', 69, 150, 2.2]], dtype=object)

### Data types of the components

Let’s output the type of each component to understand exactly what kind of object they are.

In [10]:
type(index)

pandas.core.indexes.base.Index

In [11]:
type(columns)

pandas.core.indexes.base.Index

In [12]:
type(values)

numpy.ndarray

### Understanding these types

Interestingly, both the index and the columns are the same type. They are both a pandas Index object. This object is quite powerful in itself, but for now you can just think of it as a sequence of labels for either the rows or the columns.

The values are a NumPy ndarray, which stands for n-dimensional array, and is the primary container of data in the NumPy library. Pandas is built directly on top of NumPy and it's this array that is responsible for the bulk of the workload.
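As a quick check of the description above, here is a minimal sketch that confirms the three component types programmatically. The two-row DataFrame `small` is a cut-down version of the sample data, used only to keep the example short.

```python
import numpy as np
import pandas as pd

# A two-row subset of the sample data, used only for illustration
small = pd.DataFrame(
    {'state': ['NY', 'TX'], 'age': [30, 2], 'score': [4.6, 8.3]},
    index=['Jane', 'Niko'],
)

# Both the row labels and the column labels are pandas Index objects
index_is_index = isinstance(small.index, pd.Index)
columns_is_index = isinstance(small.columns, pd.Index)

# The data comes back as a NumPy ndarray: one row per label, one column per name
values = small.values
values_is_ndarray = isinstance(values, np.ndarray)
shape = values.shape

print(index_is_index, columns_is_index, values_is_ndarray, shape)
```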

### Beginning with just the indexing operator on DataFrames

We will begin our journey of selecting subsets by using just the indexing operator on a DataFrame. Its main purpose is to select a single column or multiple columns of data.

### Selecting a single column as a Series

To select a single column of data, simply put the name of the column between the brackets. Let’s select the food column:

In [13]:
df['food']

Jane          Steak
Niko           Lamb
Aaron         Mango
Penelope      Apple
Dean         Cheese
Christina     Melon
Cornelia      Beans
Name: food, dtype: object

### Anatomy of a Series

Selecting a single column of data returns the other pandas data container, the Series. A Series is a one-dimensional sequence of labeled data. There are two main components of a Series, the index and the data (or values). There are NO columns in a Series.

The visual display of a Series is just plain text, as opposed to the nicely styled table for DataFrames. The sequence of person names on the left is the index. The sequence of food items on the right is the values.

You will also notice two extra pieces of data on the bottom of the Series. The name of the Series is the name of the column it was selected from. You will also see the data type, or dtype, of the Series. You can ignore both of these items for now.
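The parts of a Series described above can also be read off as attributes. This is a small sketch on a three-row version of the sample data (shortened here only for brevity); it pulls out the index, the values and the name, and checks that there is no `columns` attribute.

```python
import pandas as pd

# A three-row version of the sample data, used only for illustration
df = pd.DataFrame(
    {'food': ['Steak', 'Lamb', 'Mango'], 'age': [30, 2, 12]},
    index=['Jane', 'Niko', 'Aaron'],
)
food = df['food']

labels = list(food.index)    # the row labels shown on the left
data = list(food.values)     # the values shown on the right
series_name = food.name      # the column the Series was selected from

# A Series has an index and values, but no columns
has_columns = hasattr(food, 'columns')

print(labels, data, series_name, has_columns)
```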

### Selecting multiple columns with just the indexing operator

It’s possible to select multiple columns with just the indexing operator by passing it a list of column names. Let’s select color, food, and score:



In [14]:
df[['color', 'food', 'score']]

Unnamed: 0,color,food,score
Jane,blue,Steak,4.6
Niko,green,Lamb,8.3
Aaron,red,Mango,9.0
Penelope,white,Apple,3.3
Dean,gray,Cheese,1.8
Christina,black,Melon,9.5
Cornelia,red,Beans,2.2


### Selecting multiple columns returns a DataFrame

Selecting multiple columns returns a DataFrame. You can actually select a single column as a DataFrame with a one-item list:

In [15]:
df[['food']]

Unnamed: 0,food
Jane,Steak
Niko,Lamb
Aaron,Mango
Penelope,Apple
Dean,Cheese
Christina,Melon
Cornelia,Beans


Although this resembles the Series from above, it is technically a DataFrame, which is a different object.
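One way to see the difference between the two selections is to compare their types and shapes. The sketch below does that on a three-row version of the sample data (shortened only for brevity).

```python
import pandas as pd

# A three-row version of the sample data, used only for illustration
df = pd.DataFrame(
    {'food': ['Steak', 'Lamb', 'Mango'], 'age': [30, 2, 12]},
    index=['Jane', 'Niko', 'Aaron'],
)

as_series = df['food']     # column name passed directly
as_frame = df[['food']]    # column name wrapped in a one-item list

series_type = type(as_series).__name__
frame_type = type(as_frame).__name__

# The Series is one-dimensional; the DataFrame keeps a column dimension
series_shape = as_series.shape
frame_shape = as_frame.shape

print(series_type, series_shape)
print(frame_type, frame_shape)
```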

### Column order doesn’t matter

When selecting multiple columns, you can select them in any order that you choose. It doesn’t have to be the same order as the original DataFrame. For instance, let’s select height and color.

In [16]:
df[['height', 'color']]

Unnamed: 0,height,color
Jane,165,blue
Niko,70,green
Aaron,120,red
Penelope,80,white
Dean,180,gray
Christina,172,black
Cornelia,150,red


### Exceptions

There are a couple of common exceptions that arise when doing selections with just the indexing operator.

- If you misspell a column name, you will get a KeyError
- If you forget to wrap multiple column names in a list, you will also get a KeyError

In [17]:
df['height']

Jane         165
Niko          70
Aaron        120
Penelope      80
Dean         180
Christina    172
Cornelia     150
Name: height, dtype: int64

In [18]:
df['color', 'age'] # should be df[['color', 'age']]

KeyError: ('color', 'age')
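Both mistakes can be reproduced safely by catching the exception. In this sketch, the misspelled name `'colour'` is a made-up example of a typo, and the DataFrame is a shortened version of the sample data.

```python
import pandas as pd

# A three-row version of the sample data, used only for illustration
df = pd.DataFrame(
    {'color': ['blue', 'green', 'red'], 'age': [30, 2, 12]},
    index=['Jane', 'Niko', 'Aaron'],
)

# 'colour' is a misspelling; ('color', 'age') is what df['color', 'age'] passes
bad_keys = ['colour', ('color', 'age')]
raised = []
for key in bad_keys:
    try:
        df[key]
    except KeyError as exc:
        raised.append(type(exc).__name__)

# Wrapping the names in a list is the fix for the second mistake
fixed = df[['color', 'age']]

print(raised)
print(list(fixed.columns))
```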

### Summary of just the indexing operator


- Its primary purpose is to select columns by the column names
- Select a single column as a Series by passing the column name directly to it: df['col_name']
- Select multiple columns as a DataFrame by passing a list to it: df[['col_name1', 'col_name2']]
- You actually can select rows with it, but this will not be shown here as it is confusing and not used often.

### Getting started with .loc

The .loc indexer selects data in a different way than just the indexing operator. It can select subsets of rows or columns. It can also simultaneously select subsets of rows and columns. Most importantly, it only selects data by the LABEL of the rows and columns.

### Select a single row as a Series with .loc

The .loc indexer will return a single row as a Series when given a single row label. Let's select the row for Niko.

In [19]:
df.loc['Niko']


state        TX
color     green
food       Lamb
age           2
height       70
score       8.3
Name: Niko, dtype: object

We now have a Series, where the old column names are now the index labels. The name of the Series has become the old index label, Niko in this case.



### Select multiple rows as a DataFrame with .loc

To select multiple rows, put all the row labels you want to select in a list and pass that to .loc. Let's select Niko and Penelope.

In [20]:
df.loc[['Niko', 'Penelope']]


Unnamed: 0,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Penelope,AL,white,Apple,4,80,3.3


### Use slice notation to select a range of rows with .loc

It is possible to ‘slice’ the rows of a DataFrame with .loc by using slice notation. Slice notation uses a colon to separate start, stop and step values. For instance, we can select all the rows from Niko through Dean like this:

In [21]:
df.loc['Niko':'Dean']

Unnamed: 0,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8


### .loc includes the last value with slice notation

Notice that the row labeled with Dean was kept. In other data containers such as Python lists, the last value is excluded.
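To make the contrast with lists concrete, the sketch below slices the same names twice: once as a plain Python list using integer positions, and once as row labels with .loc. The one-column DataFrame is a reduced version of the sample data.

```python
import pandas as pd

names = ['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia']
df = pd.DataFrame({'age': [30, 2, 12, 4, 32, 33, 69]}, index=names)

# List slicing works on positions and stops BEFORE the stop position
start = names.index('Niko')
stop = names.index('Dean')
list_slice = names[start:stop]

# .loc slicing works on labels and INCLUDES the stop label
loc_slice = list(df.loc['Niko':'Dean'].index)

print(list_slice)  # ['Niko', 'Aaron', 'Penelope']
print(loc_slice)   # ['Niko', 'Aaron', 'Penelope', 'Dean']
```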

### Other slices

You can use slice notation similarly to how you use it with lists. Let’s slice from the beginning through Aaron:

In [22]:
df.loc[:'Aaron']

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0


Slice from Niko to Christina stepping by 2:

In [23]:
df.loc['Niko':'Christina':2]

Unnamed: 0,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Penelope,AL,white,Apple,4,80,3.3
Christina,TX,black,Melon,33,172,9.5


Slice from Dean to the end:

In [24]:
df.loc['Dean':]

Unnamed: 0,state,color,food,age,height,score
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


### Selecting rows and columns simultaneously with .loc

Unlike just the indexing operator, it is possible to select rows and columns simultaneously with .loc. You do it by separating your row and column selections by a comma. It will look something like this:

In [None]:
df.loc[row_selection, column_selection]

### Select two rows and three columns

For instance, if we wanted to select the rows Dean and Cornelia along with the columns age, state and score we would do this:

In [26]:
df.loc[['Dean', 'Cornelia'], ['age', 'state', 'score']]

Unnamed: 0,age,state,score
Dean,32,AK,1.8
Cornelia,69,TX,2.2


### Use any combination of selections for either row or columns for .loc


Row or column selections can be any of the following as we have already seen:

- A single label
- A list of labels
- A slice with labels

We can use any of these three for either row or column selections with .loc. Let's see some examples.

Let’s select two rows and a single column:

In [27]:
df.loc[['Dean', 'Aaron'], 'food']

Dean     Cheese
Aaron     Mango
Name: food, dtype: object

Select a slice of rows and a list of columns:

In [28]:
df.loc['Jane':'Penelope', ['state', 'color']]

Unnamed: 0,state,color
Jane,NY,blue
Niko,TX,green
Aaron,FL,red
Penelope,AL,white


Select a single row and a single column. This returns a scalar value.

In [29]:
df.loc['Jane', 'age']

30

Select a slice of rows and columns

In [30]:
df.loc[:'Dean', 'height':]

Unnamed: 0,height,score
Jane,165,4.6
Niko,70,8.3
Aaron,120,9.0
Penelope,80,3.3
Dean,180,1.8


### Selecting all of the rows and some columns

It is possible to select all of the rows by using a single colon. You can then select columns as normal:

In [31]:
df.loc[:, ['food', 'color']]

Unnamed: 0,food,color
Jane,Steak,blue
Niko,Lamb,green
Aaron,Mango,red
Penelope,Apple,white
Dean,Cheese,gray
Christina,Melon,black
Cornelia,Beans,red


You can also use this notation to select all of the columns:

In [32]:
df.loc[['Penelope','Cornelia'], :]

Unnamed: 0,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Cornelia,TX,red,Beans,69,150,2.2


But, it isn’t necessary as we have seen, so you can leave out that last colon:

In [33]:
df.loc[['Penelope','Cornelia']]

Unnamed: 0,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Cornelia,TX,red,Beans,69,150,2.2


### Assign row and column selections to variables


It might be easier to assign row and column selections to variables before you use .loc. This is useful if you are selecting many rows or columns:

In [34]:
rows = ['Jane', 'Niko', 'Dean', 'Penelope', 'Christina']
cols = ['state', 'age', 'height', 'score']
df.loc[rows, cols]

Unnamed: 0,state,age,height,score
Jane,NY,30,165,4.6
Niko,TX,2,70,8.3
Dean,AK,32,180,1.8
Penelope,AL,4,80,3.3
Christina,TX,33,172,9.5


### Summary of .loc

- Only uses labels
- Can select rows and columns simultaneously
- Selection can be a single label, a list of labels or a slice of labels
- Put a comma between row and column selections
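One pattern worth noticing in the examples above is that the kind of selection determines what comes back: a single label drops a dimension, while a list or a slice keeps it. A minimal sketch on a shortened version of the sample data:

```python
import pandas as pd

# A three-row version of the sample data, used only for illustration
df = pd.DataFrame(
    {'state': ['NY', 'TX', 'FL'], 'age': [30, 2, 12]},
    index=['Jane', 'Niko', 'Aaron'],
)

# Single row label and single column label: a scalar value
scalar = df.loc['Jane', 'age']

# List of row labels and single column label: a Series
series = df.loc[['Jane', 'Niko'], 'age']

# Slice of row labels and a one-item list of columns: still a DataFrame
frame = df.loc['Jane':'Niko', ['age']]

print(scalar)
print(type(series).__name__, type(frame).__name__)
```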


### Getting started with .iloc

The .iloc indexer is very similar to .loc but only uses integer locations to make its selections. The name .iloc itself stands for integer location, which should help you remember what it does.
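To connect the two indexers, the sketch below selects the same row once by label and once by integer location. It uses `Index.get_loc` to look up the position of a label; the four-row DataFrame is a shortened version of the sample data.

```python
import pandas as pd

# A four-row version of the sample data, used only for illustration
df = pd.DataFrame(
    {'food': ['Steak', 'Lamb', 'Mango', 'Apple'], 'age': [30, 2, 12, 4]},
    index=['Jane', 'Niko', 'Aaron', 'Penelope'],
)

# Integer locations start at 0, so Penelope sits at location 3
position = df.index.get_loc('Penelope')

by_label = df.loc['Penelope']
by_position = df.iloc[position]

# Both selections return the same row as a Series
same_row = by_label.equals(by_position)

print(position, same_row)
```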

### Selecting a single row with .iloc

By passing a single integer to .iloc, it will select one row as a Series:



In [35]:
df.iloc[3]

state        AL
color     white
food      Apple
age           4
height       80
score       3.3
Name: Penelope, dtype: object

### Selecting multiple rows with .iloc

Use a list of integers to select multiple rows:

In [36]:
df.iloc[[5, 2, 4]]       # remember, don't do df.iloc[5, 2, 4]

Unnamed: 0,state,color,food,age,height,score
Christina,TX,black,Melon,33,172,9.5
Aaron,FL,red,Mango,12,120,9.0
Dean,AK,gray,Cheese,32,180,1.8


### Use slice notation to select a range of rows with .iloc


Slice notation works just like it does with a list in this instance and is exclusive of the last element:

In [37]:
df.iloc[3:5]

Unnamed: 0,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8


Select from integer location 3 to the end:

In [38]:
df.iloc[3:]

Unnamed: 0,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


Select from integer location 3 to the end, stepping by 2:

In [39]:
df.iloc[3::2]

Unnamed: 0,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Christina,TX,black,Melon,33,172,9.5


### Selecting rows and columns simultaneously with .iloc

Just like with .loc, any combination of a single integer, a list of integers or a slice can be used to select rows and columns simultaneously. Just remember to separate the selections with a comma.

Select two rows and two columns:

In [40]:
df.iloc[[2,3], [0, 4]]

Unnamed: 0,state,height
Aaron,FL,120
Penelope,AL,80


Select a slice of the rows and two columns:

In [41]:
df.iloc[3:6, [1, 4]]

Unnamed: 0,color,height
Penelope,white,80
Dean,gray,180
Christina,black,172


Select slices for both

In [42]:
df.iloc[2:5, 2:5]

Unnamed: 0,food,age,height
Aaron,Mango,12,120
Penelope,Apple,4,80
Dean,Cheese,32,180


Select a single row and column



In [43]:
df.iloc[0, 2]

'Steak'

Select all the rows and a single column

In [44]:
df.iloc[:, 5]

Jane         4.6
Niko         8.3
Aaron        9.0
Penelope     3.3
Dean         1.8
Christina    9.5
Cornelia     2.2
Name: score, dtype: float64

### Deprecation of .ix

Early in the development of pandas, there existed another indexer, ix. This indexer was capable of selecting both by label and by integer location. While it was versatile, it caused lots of confusion because it's not explicit. Sometimes integers can also be labels for rows or columns. Thus there were instances where it was ambiguous.

The .ix indexer was deprecated in pandas 0.20 and removed entirely in pandas 1.0, so use .loc or .iloc instead.
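The ambiguity is easiest to see when the row labels are themselves integers. In the sketch below, the index `[2, 0, 1]` is a made-up example: the same integer `0` means one row as a label and a different row as a position, which is why an explicit indexer is needed.

```python
import pandas as pd

# Integer row labels that deliberately do not match the row positions
df = pd.DataFrame({'food': ['Steak', 'Lamb', 'Mango']}, index=[2, 0, 1])

# .loc treats 0 as a LABEL: the row whose index label is 0
by_label = df.loc[0, 'food']

# .iloc treats 0 as a POSITION: the first row
by_position = df.iloc[0, 0]

print(by_label)     # Lamb
print(by_position)  # Steak
```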

### Selecting subsets of Series

We can also, of course, do subset selection with a Series. Earlier I recommended using just the indexing operator for column selection on a DataFrame. Since Series do not have columns, I suggest using only .loc and .iloc. You can use just the indexing operator, but it's ambiguous as it can take both labels and integers. I will come back to this at the end of the tutorial.

Typically, you will create a Series by selecting a single column from a DataFrame. Let’s select the food column:

In [45]:
food = df['food']
food

Jane          Steak
Niko           Lamb
Aaron         Mango
Penelope      Apple
Dean         Cheese
Christina     Melon
Cornelia      Beans
Name: food, dtype: object

### Series selection with .loc

Series selection with .loc is quite simple, since we are only dealing with a single dimension. You can again use a single row label, a list of row labels or a slice of row labels to make your selection. Let's see several examples.

Let’s select a single value:

In [46]:
food.loc['Aaron']

'Mango'

Select three different values. This returns a Series:

In [47]:
food.loc[['Dean', 'Niko', 'Cornelia']]

Dean        Cheese
Niko          Lamb
Cornelia     Beans
Name: food, dtype: object

Slice from Niko to Christina, which is inclusive of the last label:

In [48]:
food.loc['Niko':'Christina']

Niko           Lamb
Aaron         Mango
Penelope      Apple
Dean         Cheese
Christina     Melon
Name: food, dtype: object

Slice from Penelope to the end:

In [49]:
food.loc['Penelope':]

Penelope      Apple
Dean         Cheese
Christina     Melon
Cornelia      Beans
Name: food, dtype: object

Select a single value in a list, which returns a Series:

In [50]:
food.loc[['Aaron']]

Aaron    Mango
Name: food, dtype: object

### Series selection with .iloc


Series subset selection with .iloc happens similarly to .loc except it uses integer location. You can use a single integer, a list of integers or a slice of integers. Let's see some examples.

Select a single value:

In [51]:
food.iloc[0]

'Steak'

Use a list of integers to select multiple values:

In [52]:
food.iloc[[4, 1, 3]]

Dean        Cheese
Niko          Lamb
Penelope     Apple
Name: food, dtype: object

Use a slice, which is exclusive of the last integer:

In [53]:
food.iloc[4:6]

Dean         Cheese
Christina     Melon
Name: food, dtype: object

### Comparison to Python lists and dictionaries

It may be helpful to compare pandas' ability to make selections by label and integer location to that of Python lists and dictionaries.

Python lists allow for selection of data only through integer location. You can use a single integer or slice notation to make the selection but NOT a list of integers.

Let’s see examples of subset selection of lists using integers:



In [54]:
some_list = ['a', 'two', 10, 4, 0, 'asdf', 'mgmt', 434, 99]

In [55]:
some_list[5]

'asdf'

In [56]:
some_list[-1]

99

In [57]:
some_list[:4]

['a', 'two', 10, 4]

In [58]:
some_list[3:]

[4, 0, 'asdf', 'mgmt', 434, 99]

In [59]:
some_list[2:6:3]

[10, 'asdf']

### Selection by label with Python dictionaries

All values in each dictionary are labeled by a key. We use this key to make single selections. Dictionaries only allow selection with a single label. Slices and lists of labels are not allowed.

In [61]:
d = {'a':1, 'b':2, 't':20, 'z':26, 'A':27}

In [62]:
d['a']

1

In [63]:
d['A']

27
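The restrictions mentioned above can be checked directly. This sketch passes a list of integers to a list and a list of keys to a dictionary, and records the exception each one raises. Both containers are shortened versions of the ones above, and the comprehensions are one common workaround rather than the only option.

```python
some_list = ['a', 'two', 10, 4, 0]
d = {'a': 1, 'b': 2, 't': 20}

errors = []

# A list does not accept a list of integers
try:
    some_list[[0, 2]]
except TypeError as exc:
    errors.append(type(exc).__name__)

# A dictionary does not accept a list of keys
try:
    d[['a', 't']]
except TypeError as exc:
    errors.append(type(exc).__name__)

# Selecting several items takes an explicit loop or comprehension
picked_from_list = [some_list[i] for i in [0, 2]]
picked_from_dict = [d[key] for key in ['a', 't']]

print(errors)
print(picked_from_list, picked_from_dict)
```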

### Pandas has power of lists and dictionaries

DataFrames and Series are able to make selections with integers like a list and with labels like a dictionary.
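As a small illustration of that claim, the sketch below uses one Series both ways: .iloc for list-style access by integer, and .loc for dictionary-style access by label. It also uses the list-of-integers and label-slice selections that plain lists and dictionaries do not offer. The three-item Series is a shortened version of the food column.

```python
import pandas as pd

food = pd.Series(['Steak', 'Lamb', 'Mango'], index=['Jane', 'Niko', 'Aaron'])

# List-like: select by integer location, including negative integers
last_item = food.iloc[-1]

# Dictionary-like: select by label
niko_item = food.loc['Niko']

# Beyond lists: a list of integer locations
picked_positions = list(food.iloc[[0, 2]])

# Beyond dictionaries: a slice of labels (inclusive of the last label)
picked_labels = list(food.loc['Jane':'Niko'])

print(last_item, niko_item)
print(picked_positions, picked_labels)
```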

### Using just the indexing operator to select rows from a DataFrame — Confusing!

Above, I used just the indexing operator to select a column or columns from a DataFrame. But, it can also be used to select rows using a slice. This behavior is very confusing in my opinion. The entire operation changes completely when a slice is passed.

Let’s use an integer slice as our first example:

In [64]:
df[3:6]

Unnamed: 0,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5


To add to this confusion, you can slice by labels as well.

In [65]:
df['Aaron':'Christina']

Unnamed: 0,state,color,food,age,height,score
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5


### Using just the indexing operator to select rows from a Series — Confusing!

You can also use just the indexing operator with a Series. Again, this is confusing because it can accept integers or labels. Let’s see some examples.



In [66]:
food

Jane          Steak
Niko           Lamb
Aaron         Mango
Penelope      Apple
Dean         Cheese
Christina     Melon
Cornelia      Beans
Name: food, dtype: object

In [67]:
food[2:4]

Aaron       Mango
Penelope    Apple
Name: food, dtype: object

In [68]:
food['Niko':'Dean']

Niko          Lamb
Aaron        Mango
Penelope     Apple
Dean        Cheese
Name: food, dtype: object

Since Series don’t have columns, you can use a single label or a list of labels to make selections as well:

In [69]:
food['Dean']

'Cheese'

In [70]:
food[['Dean', 'Christina', 'Aaron']]

Dean         Cheese
Christina     Melon
Aaron         Mango
Name: food, dtype: object

Again, I recommend against doing this. Always use .iloc or .loc instead.
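For reference, here is a sketch of how the slices from the last two sections could be written with the explicit indexers. It uses a two-column version of the sample data, and the expected labels in the comments match the outputs shown earlier.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
        'age': [30, 2, 12, 4, 32, 33, 69],
    },
    index=['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'],
)
food = df['food']

# Integer slices: .iloc makes it explicit that positions are meant
rows_by_position = df.iloc[3:6]        # instead of df[3:6]
values_by_position = food.iloc[2:4]    # instead of food[2:4]

# Label slices: .loc makes it explicit that labels are meant
rows_by_label = df.loc['Aaron':'Christina']    # instead of df['Aaron':'Christina']
values_by_label = food.loc['Niko':'Dean']      # instead of food['Niko':'Dean']

print(list(rows_by_position.index))  # ['Penelope', 'Dean', 'Christina']
print(list(rows_by_label.index))     # ['Aaron', 'Penelope', 'Dean', 'Christina']
print(list(values_by_position))      # ['Mango', 'Apple']
print(list(values_by_label))         # ['Lamb', 'Mango', 'Apple', 'Cheese']
```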