# Slicing & Indexing

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('../data/sample_data.csv', index_col=0)
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


## `[ ]` on DataFrames

Its main purpose is to select a **single column** or **multiple columns** of data. We have already seen it in action. `[]` is called the indexing operator.

### Selecting one column

To select a single column of data, simply put the name of the column in-between the brackets. Let's select the food column:

In [3]:
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


In [4]:
df['food']

Jane          Steak
Niko           Lamb
Aaron         Mango
Penelope      Apple
Dean         Cheese
Christina     Melon
Cornelia      Beans
Name: food, dtype: object

### Selecting multiple columns

It's possible to select multiple columns with just the indexing operator by passing it a list of column names. Let's select *color*, *food*, and *score*:

In [5]:
df[['color', 'food', 'score']]

Unnamed: 0,color,food,score
Jane,blue,Steak,4.6
Niko,green,Lamb,8.3
Aaron,red,Mango,9.0
Penelope,white,Apple,3.3
Dean,gray,Cheese,1.8
Christina,black,Melon,9.5
Cornelia,red,Beans,2.2


Selecting multiple columns always returns a DataFrame. You can actually select a single column as a DataFrame with a one-item list. 

Another thing to note is that column order doesn't matter. You can select columns in any order you choose; it doesn't have to match the order in the original DataFrame.
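
The following is a minimal sketch of the points above. It assumes a small hand-built DataFrame (three rows copied from the sample table) rather than the CSV file, and the variable names `as_series`, `as_frame` and `reordered` are only illustrative.

```python
import pandas as pd

# Small stand-in for the sample data (values copied from the table above).
df = pd.DataFrame(
    {'color': ['blue', 'green', 'red'],
     'food': ['Steak', 'Lamb', 'Mango'],
     'score': [4.6, 8.3, 9.0]},
    index=['Jane', 'Niko', 'Aaron'],
)

as_series = df['food']               # a column name selects a Series
as_frame = df[['food']]              # a one-item list selects a DataFrame
reordered = df[['score', 'color']]   # columns come back in the order requested

print(type(as_series).__name__)   # Series
print(type(as_frame).__name__)    # DataFrame
print(list(reordered.columns))    # ['score', 'color']
```

The one-item list is handy when later code expects a two-dimensional object even for a single column.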

### Exceptions
There are a couple of common exceptions that arise when doing selections with just the indexing operator.
* If you misspell a column name, you will get a **`KeyError`**
* If you forget to wrap multiple column names in a list, you will also get a **`KeyError`**

In [6]:
df['height']

Jane         165
Niko          70
Aaron        120
Penelope      80
Dean         180
Christina    172
Cornelia     150
Name: height, dtype: int64

In [7]:
df['color', 'age'] # should be:  df[['color', 'age']]

KeyError: ('color', 'age')
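
Both failure modes can be reproduced on a tiny DataFrame. This is only a sketch: the DataFrame is built by hand, and the misspelled name `'colour'` and the `errors` list are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'color': ['blue', 'green'], 'age': [30, 2]},
    index=['Jane', 'Niko'],
)

errors = []
# First a misspelled name, then two names passed without a list (a tuple).
for key in ['colour', ('color', 'age')]:
    try:
        df[key]
    except KeyError as exc:
        errors.append(type(exc).__name__)
        print(f'KeyError for {key!r}')

# Wrapping the names in a list works.
print(df[['color', 'age']].shape)   # (2, 2)
```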

### Summary of `[ ]` operator
* Its primary purpose is to select columns by the column names
* Select a single column as a Series by passing the column name directly to it: **`df['col_name']`**
* Select multiple columns as a DataFrame by passing a **list** to it: **`df[['col_name1', 'col_name2']]`**

## `.loc` on DataFrames

The **`.loc`** indexer selects data in a different way than the `[ ]` operator. It can select subsets of rows or columns, or both at the same time. Most importantly, it only selects data by the **LABEL** of the rows and columns.

### Selecting single row 

The **`.loc`** indexer will return a single row as a Series when given a single row label. Let's select the row for **Niko**.

In [8]:
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


In [9]:
df.loc['Niko'] 

state        TX
color     green
food       Lamb
age           2
height       70
score       8.3
Name: Niko, dtype: object

We now have a Series, where the old column names are now the index labels. The **`name`** of the Series has become the old index label, **`Niko`** in this case.
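
A quick way to check this is to inspect the `name` and `index` attributes of the returned Series. The sketch below assumes a hand-built two-row DataFrame instead of the CSV file.

```python
import pandas as pd

df = pd.DataFrame(
    {'state': ['NY', 'TX'],
     'color': ['blue', 'green'],
     'food': ['Steak', 'Lamb'],
     'age': [30, 2]},
    index=['Jane', 'Niko'],
)

row = df.loc['Niko']

print(row.name)          # Niko
print(list(row.index))   # ['state', 'color', 'food', 'age']
print(row['food'])       # Lamb
```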

### Selecting multiple rows
To select multiple rows, put all the row labels you want to select in a list and pass that to **`.loc`**. Let's select `Niko` and `Penelope`.

In [10]:
df.loc[['Niko', 'Penelope']]

Unnamed: 0,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Penelope,AL,white,Apple,4,80,3.3


### Selecting range of rows
It is possible to 'slice' the rows of a DataFrame with `.loc` by using **slice notation**. Slice notation uses a colon to separate **start**, **stop** and **step** values. For instance, we can select all the rows from `Niko` through `Dean` like this:

In [11]:
df.loc['Niko':'Dean':1]

Unnamed: 0,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8


```{note} 
`.loc` includes the last value with slice notation. Notice that the row labeled with `Dean` was kept. In other data containers such as Python lists, the last value is excluded.
```
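
To make the contrast concrete, here is a small sketch that slices the same five names once as row labels with `.loc` and once as a plain Python list. The DataFrame is built by hand and the variable names are illustrative.

```python
import pandas as pd

names = ['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean']
df = pd.DataFrame({'age': [30, 2, 12, 4, 32]}, index=names)

label_slice = df.loc['Niko':'Dean']   # the stop label 'Dean' is included
list_slice = names[1:4]               # the stop position 4 ('Dean') is excluded

print(list(label_slice.index))   # ['Niko', 'Aaron', 'Penelope', 'Dean']
print(list_slice)                # ['Niko', 'Aaron', 'Penelope']
```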

You can use slice notation similarly to how you use it with lists. Let's slice from the beginning through *Aaron*:

In [12]:
df.loc[:'Aaron']

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0


Slice from *Niko* to *Christina* stepping by 2:

In [13]:
df.loc['Niko':'Christina':2] 

Unnamed: 0,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Penelope,AL,white,Apple,4,80,3.3
Christina,TX,black,Melon,33,172,9.5


Slice from *Dean* to the end:

In [14]:
df.loc['Dean':]

Unnamed: 0,state,color,food,age,height,score
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


### Selecting rows and columns simultaneously
Unlike just the indexing operator, it is possible to select rows and columns simultaneously with `.loc`. You do it by separating your row and column selections by a **comma**. It will look something like this:

```{code-block} python

df.loc[row_selection, column_selection]

```

**Selecting *m* rows and *n* columns**

```{note}
Selecting *m* rows and *n* columns does not mean that the number of rows and columns must differ.
We can also select *m* rows and *m* columns.
```

For instance, if we wanted to select the rows *Dean* and *Cornelia* along with the columns *age*, *state* and *score* we would do this:

In [15]:
df.loc[['Dean', 'Cornelia'], ['age', 'state', 'score']]

Unnamed: 0,age,state,score
Dean,32,AK,1.8
Cornelia,69,TX,2.2


### Possible combinations

Row or column selections can be any of the following as we have already seen:
* A single label
* A list of labels
* A slice with labels

We can use any of these three for either row or column selections with **`.loc`**. Let's see some examples.

Let's select two rows and a single column:

In [16]:
df.loc[['Dean', 'Aaron'], 'food']

Dean     Cheese
Aaron     Mango
Name: food, dtype: object

Select a single row and a list of columns:

In [17]:
df.loc['Dean', ['age', 'food']]

age         32
food    Cheese
Name: Dean, dtype: object

Select a slice of rows and a list of columns:

In [18]:
df.loc['Jane':'Penelope', ['state', 'color']] # slicing notation does not have []

Unnamed: 0,state,color
Jane,NY,blue
Niko,TX,green
Aaron,FL,red
Penelope,AL,white


```{warning}
Look closely at the example above. The slice is not wrapped in square brackets: a slice already describes a range of labels, so it is passed to `.loc` directly, unlike a list of labels.
```

Select a single row and a single column. This returns a scalar value.

In [19]:
df.loc['Jane', 'age']

30

Select a slice of rows and a slice of columns:

In [20]:
df.loc[:'Dean', 'height':]

Unnamed: 0,height,score
Jane,165,4.6
Niko,70,8.3
Aaron,120,9.0
Penelope,80,3.3
Dean,180,1.8
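
The type of the result depends on what is passed on each side of the comma. The sketch below, on a hand-built three-row DataFrame with illustrative variable names, collects the combinations shown above in one place.

```python
import pandas as pd

df = pd.DataFrame(
    {'state': ['NY', 'TX', 'FL'],
     'age': [30, 2, 12],
     'score': [4.6, 8.3, 9.0]},
    index=['Jane', 'Niko', 'Aaron'],
)

scalar = df.loc['Jane', 'age']                  # label, label   -> single value
row_part = df.loc['Jane', ['age', 'score']]     # label, list    -> Series
col_part = df.loc[['Jane', 'Niko'], 'age']      # list,  label   -> Series
block = df.loc['Jane':'Niko', 'age':'score']    # slice, slice   -> DataFrame
one_cell = df.loc[['Jane'], ['age']]            # list,  list    -> DataFrame (1 x 1)

print(scalar)                       # 30
print(type(row_part).__name__)      # Series
print(type(col_part).__name__)      # Series
print(block.shape)                  # (2, 2)
print(one_cell.shape)               # (1, 1)
```

A single label reduces the result by one dimension; a list or a slice keeps that dimension, even when it contains only one item.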


**Selecting all rows & some columns**

It is possible to select all of the rows by using a single colon. You can then select columns as normal:

In [21]:
df.loc[:, ['food', 'color']]

Unnamed: 0,food,color
Jane,Steak,blue
Niko,Lamb,green
Aaron,Mango,red
Penelope,Apple,white
Dean,Cheese,gray
Christina,Melon,black
Cornelia,Beans,red


You can also use this notation to select all of the columns:

In [22]:
df.loc[['Penelope','Cornelia'], :]

Unnamed: 0,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Cornelia,TX,red,Beans,69,150,2.2


But, it isn't necessary as we have seen, so you can leave out that last colon:

In [23]:
df.loc[['Penelope','Cornelia']]

Unnamed: 0,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Cornelia,TX,red,Beans,69,150,2.2


### Store the selection in variables

It might be easier to assign row and column selections to variables before you use `.loc`. This is useful if you are selecting many rows or columns:

In [24]:
rows = ['Jane', 'Niko', 'Dean', 'Penelope', 'Christina'] 
cols = ['state', 'age', 'height', 'score']
df.loc[rows, cols]

Unnamed: 0,state,age,height,score
Jane,NY,30,165,4.6
Niko,TX,2,70,8.3
Dean,AK,32,180,1.8
Penelope,AL,4,80,3.3
Christina,TX,33,172,9.5


### Summary of `.loc` operator

- Only uses labels
- Can select rows and columns simultaneously
- Selection can be a single label, a list of labels or a slice of labels
- Put a comma between row and column selections

## `.iloc` on DataFrames

The `.iloc` indexer is very similar to `.loc` but only uses **integer locations** to make its selections. The name `.iloc` itself stands for *integer location*, which should help you remember what it does.

### Selecting a single row

By passing a single integer to `.iloc`, it will select one row as a Series:

In [25]:
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


In [26]:
df.iloc[3]

state        AL
color     white
food      Apple
age           4
height       80
score       3.3
Name: Penelope, dtype: object

### Selecting multiple rows

Use a list of integers to select multiple rows:

In [27]:
df.iloc[[5, 2, 4]]  # remember, don't do df.iloc[5, 2, 4]  Error!

Unnamed: 0,state,color,food,age,height,score
Christina,TX,black,Melon,33,172,9.5
Aaron,FL,red,Mango,12,120,9.0
Dean,AK,gray,Cheese,32,180,1.8


### Selecting range of rows

```{note} 
Slice notation works just like a list in this instance and is exclusive of the last element
```

In [28]:
df.iloc[3:5]

Unnamed: 0,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8


Select from integer location 3 to the end:

In [29]:
df.iloc[3:]

Unnamed: 0,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


Select from integer location 3 to the end, stepping by 2:

In [30]:
df.iloc[3::2]

Unnamed: 0,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Christina,TX,black,Melon,33,172,9.5


### Selecting rows and columns simultaneously

Just like with `.loc`, any combination of a single integer, lists of integers or slices can be used to select rows and columns simultaneously. Just remember to separate the selections with a **comma**.

Select two rows and two columns:

In [31]:
df.iloc[[2,3], [0, 4]]

Unnamed: 0,state,height
Aaron,FL,120
Penelope,AL,80


Select a slice of the rows and two columns:

In [32]:
df.iloc[3:6, [1, 4]]

Unnamed: 0,color,height
Penelope,white,80
Dean,gray,180
Christina,black,172


Select slices for both:

In [33]:
df.iloc[2:5, 2:5]

Unnamed: 0,food,age,height
Aaron,Mango,12,120
Penelope,Apple,4,80
Dean,Cheese,32,180


Select a single row and a single column:

In [34]:
df.iloc[0, 2]

'Steak'

Select all the rows and a single column:

In [35]:
df.iloc[:, 5]

Jane         4.6
Niko         8.3
Aaron        9.0
Penelope     3.3
Dean         1.8
Christina    9.5
Cornelia     2.2
Name: score, dtype: float64
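
Since `.loc` and `.iloc` differ only in how they address rows and columns, the same subset can be written either way. The sketch below assumes a hand-built copy of the first four rows of the sample data and compares the two forms with `DataFrame.equals`.

```python
import pandas as pd

df = pd.DataFrame(
    {'state': ['NY', 'TX', 'FL', 'AL'],
     'color': ['blue', 'green', 'red', 'white'],
     'food': ['Steak', 'Lamb', 'Mango', 'Apple'],
     'age': [30, 2, 12, 4],
     'height': [165, 70, 120, 80],
     'score': [4.6, 8.3, 9.0, 3.3]},
    index=['Jane', 'Niko', 'Aaron', 'Penelope'],
)

by_label = df.loc[['Aaron', 'Penelope'], ['state', 'height']]
by_position = df.iloc[[2, 3], [0, 4]]   # rows 2 and 3, columns 0 and 4

print(by_label.equals(by_position))               # True
print(df.loc['Jane', 'food'], df.iloc[0, 2])      # Steak Steak
```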

## Selecting Series

We can also do subset selection with a Series. Since Series do not have columns, we suggest using only `.loc` and `.iloc`.

We will create a Series by selecting a single column from a DataFrame. Let's select the *food* column:

In [36]:
food = df['food']
food

Jane          Steak
Niko           Lamb
Aaron         Mango
Penelope      Apple
Dean         Cheese
Christina     Melon
Cornelia      Beans
Name: food, dtype: object

### Series selection with *.loc*

Series selection with `.loc` is quite simple, since we are only dealing with a single dimension. You can again use a *single row label*, a *list of row labels* or a *slice of row labels* to make your selection. Let's see several examples.

Let's select a single value:

In [37]:
food

Jane          Steak
Niko           Lamb
Aaron         Mango
Penelope      Apple
Dean         Cheese
Christina     Melon
Cornelia      Beans
Name: food, dtype: object

In [38]:
food.loc['Aaron'] # label, list of labels, slice of labels

'Mango'

Select three different values. This returns a Series:

In [39]:
food.loc[['Dean', 'Niko', 'Cornelia']]

Dean        Cheese
Niko          Lamb
Cornelia     Beans
Name: food, dtype: object

Slice from *Niko* through *Christina*. The slice includes the last label:

In [40]:
food.loc['Niko':'Christina']

Niko           Lamb
Aaron         Mango
Penelope      Apple
Dean         Cheese
Christina     Melon
Name: food, dtype: object

Slice from *Penelope* to the end:

In [41]:
food.loc['Penelope':]

Penelope      Apple
Dean         Cheese
Christina     Melon
Cornelia      Beans
Name: food, dtype: object

### Series selection with *.iloc*

Series subset selection with `.iloc` happens similarly to `.loc` except it uses *integer location*. You can use a *single integer*, a *list of integers* or a *slice of integers*. Let's see some examples.

Select a single value:

In [42]:
food.iloc[0]

'Steak'

Use a list of integers to select multiple values:

In [43]:
food.iloc[[4, 1, 3]]

Dean        Cheese
Niko          Lamb
Penelope     Apple
Name: food, dtype: object

Use a slice. The last integer is excluded:

In [44]:
food.iloc[4:6]

Dean         Cheese
Christina     Melon
Name: food, dtype: object

## Python list vs Dictionary

It may be helpful to compare pandas' ability to make selections by label and integer location to that of Python lists and dictionaries.

Python lists allow for selection of data only through *integer location*. You can use a single integer or slice notation to make the selection but NOT a list of integers.

Python dictionaries allow for selection of data only through *keys*. All you can do is pass a single key to get the associated value but NOT a *list of labels* or a *slice of labels*.

So, you can see how pandas DataFrames and Series combine everything Python lists and dictionaries can do. Beyond that, pandas objects support many more kinds of selection and are very flexible.
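
The comparison can be sketched in a few lines. The names `foods_list`, `foods_dict` and `failures` are illustrative, and the data is a hand-typed subset of the *food* column.

```python
import pandas as pd

foods_list = ['Steak', 'Lamb', 'Mango', 'Apple']
foods_dict = {'Jane': 'Steak', 'Niko': 'Lamb', 'Aaron': 'Mango', 'Penelope': 'Apple'}

print(foods_list[1])         # Lamb
print(foods_list[1:3])       # ['Lamb', 'Mango']
print(foods_dict['Niko'])    # Lamb

failures = []
try:
    foods_list[[0, 2]]               # a list of integers is not accepted
except TypeError:
    failures.append('list of integers')
try:
    foods_dict[['Jane', 'Aaron']]    # a list of keys is not accepted
except TypeError:
    failures.append('list of keys')
print(failures)   # ['list of integers', 'list of keys']

# A Series accepts all of these forms.
foods = pd.Series(foods_dict)
print(list(foods.iloc[[0, 2]]))             # ['Steak', 'Mango']
print(list(foods.loc[['Jane', 'Aaron']]))   # ['Steak', 'Mango']
print(list(foods.loc['Niko':'Aaron']))      # ['Lamb', 'Mango']
```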

## The default indexes

If you don't specify a column to be the index when first reading in the data, pandas will use the integers **0** to **n-1** as the *index*. This technically creates a `RangeIndex` object. Let's take a look at it.

```{note} 
`df` and `df2` are different objects. We will compare them in the next few cells, so pay close attention.
```

In [46]:
df2 = pd.read_csv('../data/sample_data2.csv')
df2

Unnamed: 0,Names,state,color,food,age,height,score
0,Jane,NY,blue,Steak,30,165,4.6
1,Niko,TX,green,Lamb,2,70,8.3
2,Aaron,FL,red,Mango,12,120,9.0
3,Penelope,AL,white,Apple,4,80,3.3
4,Dean,AK,gray,Cheese,32,180,1.8
5,Christina,TX,black,Melon,33,172,9.5
6,Cornelia,TX,red,Beans,69,150,2.2


In [47]:
df2.index

RangeIndex(start=0, stop=7, step=1)

This object is similar to Python `range()` objects. Let's create one:

In [48]:
range(7)

range(0, 7)

Converting both of these objects to a list produces the exact same thing:

In [49]:
list(df2.index)

[0, 1, 2, 3, 4, 5, 6]

In [50]:
list(range(7))

[0, 1, 2, 3, 4, 5, 6]
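
The same default index appears when a DataFrame is built without an explicit index. The sketch below assumes a hand-built three-row DataFrame; `set_index` is not used elsewhere in this section and is shown only as one way to move a column into the index after loading.

```python
import pandas as pd

# No index given, so pandas falls back to 0..n-1.
df2 = pd.DataFrame({'Names': ['Jane', 'Niko', 'Aaron'], 'age': [30, 2, 12]})

print(df2.index)                              # RangeIndex(start=0, stop=3, step=1)
print(isinstance(df2.index, pd.RangeIndex))   # True
print(list(df2.index) == list(range(3)))      # True

# Promote the Names column to the index to get string labels instead.
named = df2.set_index('Names')
print(list(named.index))                      # ['Jane', 'Niko', 'Aaron']
```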

For now, it's not at all important that you have a **`RangeIndex`**. Selections from it happen just the same with **`.loc`** and **`.iloc`**. Let's look at some examples.

In [51]:
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


In [52]:
col = [1,4]

In [53]:
df.iloc[[1,3,5], col]

Unnamed: 0,color,height
Niko,green,70
Penelope,white,80
Christina,black,172


In [54]:
df2.loc[[2, 4, 5], ['food', 'color']]

Unnamed: 0,food,color
2,Mango,red
4,Cheese,gray
5,Melon,black


In [55]:
df2.iloc[[2, 4, 5], [3,2]]

Unnamed: 0,food,color
2,Mango,red
4,Cheese,gray
5,Melon,black


There is a subtle difference when using a slice: **`.iloc`** excludes the last value, while **`.loc`** includes it. The difference is easy to miss here because the *labels* and *integer locations* have the same values.

In [56]:
df2.iloc[:3]

Unnamed: 0,Names,state,color,food,age,height,score
0,Jane,NY,blue,Steak,30,165,4.6
1,Niko,TX,green,Lamb,2,70,8.3
2,Aaron,FL,red,Mango,12,120,9.0


In [57]:
df2.loc[:3]

Unnamed: 0,Names,state,color,food,age,height,score
0,Jane,NY,blue,Steak,30,165,4.6
1,Niko,TX,green,Lamb,2,70,8.3
2,Aaron,FL,red,Mango,12,120,9.0
3,Penelope,AL,white,Apple,4,80,3.3


Negative indexing is another way to see *how important it is to understand this difference*: `.iloc[-1]` counts back from the end, while `.loc[-1]` looks for a row whose label is `-1`.

In [58]:
df2.iloc[-1] 

Names     Cornelia
state           TX
color          red
food         Beans
age             69
height         150
score          2.2
Name: 6, dtype: object

In [59]:
df2.loc[-1] # searches for a row labeled -1

KeyError: -1
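
Labels and integer locations only coincide while the rows stay in their original order. The sketch below assumes a hand-built five-row DataFrame and uses `sort_values`, which is not covered in this section, purely to reorder the rows so that the two notions diverge.

```python
import pandas as pd

df2 = pd.DataFrame(
    {'Names': ['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean'],
     'age': [30, 2, 12, 4, 32]}
)

print(len(df2.iloc[:3]))   # 3  (position 3 excluded)
print(len(df2.loc[:3]))    # 4  (label 3 included)

# After sorting, each row keeps its label but moves to a new position.
by_age = df2.sort_values('age')
print(list(by_age.index))          # [1, 3, 2, 0, 4]
print(by_age.iloc[0, 0])           # Niko  (first position)
print(by_age.loc[0, 'Names'])      # Jane  (row labeled 0)

# There is no row labeled -1, so .loc raises a KeyError.
try:
    df2.loc[-1]
    missing_label = False
except KeyError:
    missing_label = True
print(missing_label)               # True
```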

```{admonition} Selecting the same column twice?
This is rather peculiar, but you can actually select the same column more than once.
```

Let's see it in action.

In [60]:
df[['age', 'age', 'age']]

Unnamed: 0,age,age,age
Jane,30,30,30
Niko,2,2,2
Aaron,12,12,12
Penelope,4,4,4
Dean,32,32,32
Christina,33,33,33
Cornelia,69,69,69
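
One consequence worth knowing is that pandas itself keeps the repeated label unchanged, so the result has non-unique column names. The sketch below assumes a hand-built two-row DataFrame; `dup` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [30, 2], 'score': [4.6, 8.3]},
    index=['Jane', 'Niko'],
)

dup = df[['age', 'age', 'age']]

print(list(dup.columns))            # ['age', 'age', 'age']
print(dup.columns.is_unique)        # False
# With duplicate names, selecting 'age' returns every matching column.
print(type(dup['age']).__name__)    # DataFrame
print(dup['age'].shape)             # (2, 3)
```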


## Conclusion

### Exercise

Check out [these](../nbs/Pandas_Exercise.html#slicing-and-indexing-exercise) exercise questions on slicing and indexing.