# Advanced Indexing

In this tutorial, we will cover some aspects of advanced indexing, for both rows and columns. As with earlier tutorials, we will start by creating a demonstration dataframe, then operate on it. 

**Note!** You should make a copy of this notebook before editing it. From the "File" menu, select "Save a copy in Drive". 


In [1]:
# Initial Imports
%matplotlib inline
import pandas as pd
import numpy as np
from IPython.display import display, HTML
    
success = HTML('<p style="color:green; font-size:30pt">Success!</p>')

In [2]:
df  = pd.read_csv('http://ds.civicknowledge.org.s3.amazonaws.com/civicknowledge.com/pandas-training/colors-sizes.csv', index_col=False)


# Indexer Overview

Indexers are denoted with square brackets, "[" and "]", and are used to access a subset of rows and columns. There are a lot of indexers. Most of them are attached directly to the dataframe object, like ``df['color']``, which selects a single column, but there are also indexers that are attached to separate sub-objects, such as ``df.loc[0]``, which selects a single row. In all but one case, the indexer will have the square brackets. Some of these indexers are: 

\[ Note: the size column is named 'siz' because ``df.size`` is a property that returns the number of elements in the dataframe (rows times columns). So, if the column was named "size", then ``df['size']`` would access the column, but ``df.size`` would not. Naming it 'siz' solves this problem. \] 

|Indexer|Description|
|-------|-----------|
|``df['color']``|Return a single column, as a ``pandas.Series`` object|
|``df.color``|Access the 'color' column as a property. (The only square bracket exception)|
|``df[['color','siz']]``|Return multiple columns.|
|``df.iloc[10]``|Return the row at position 10 (the 11th row, since numbering starts at 0)|
|``df.loc['blue']``|Return rows with the index label of 'blue'. (Assumes ``.set_index('color')`` was executed first)|
|``df[df.color=='blue']``|Return rows where the color is blue|

There are a few other indexers which we won't cover here, but you can read about ``.at[]`` and ``.iat[]`` [in the pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#fast-scalar-value-getting-and-setting). 


## Indexing Columns

The most commonly used  indexer is the column indexer, which selects columns.  Here are some demonstrations of the various column indexers. 


In [3]:
# Column indexer
df['color'].head()

0    blue
1    blue
2    blue
3    blue
4    blue
Name: color, dtype: object

In [4]:
# Column property
df.color.head()

0    blue
1    blue
2    blue
3    blue
4    blue
Name: color, dtype: object
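
The property shortcut only works when the column name does not collide with an attribute that the dataframe already has, which is why the demonstration data uses 'siz'. Here is a minimal sketch of the collision, using a small made-up dataframe (``demo`` is a hypothetical name, not part of the tutorial data):

```python
import pandas as pd

# A tiny, made-up dataframe with a column that really is named 'size'
demo = pd.DataFrame({'color': ['blue', 'green'], 'size': ['large', 'small']})

by_bracket = demo['size']   # the column, as a pandas.Series
by_property = demo.size     # NOT the column: the number of elements (2 rows x 2 columns)

print(by_bracket.tolist())
print(by_property)
```

The bracket form always reaches the column, so it is the safer choice when you are not sure whether a name is already taken.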

You can also access multiple columns by providing a list of strings to the base indexer: 

In [5]:
df[['color','siz']].head()

Unnamed: 0,color,siz
0,blue,large
1,blue,large
2,blue,medium
3,blue,medium
4,blue,small


For the multiple column indexer, ``df[['color','siz']]``, don't think of this as "double brackets"; it is actually a list inside of the indexer. That is, you could also have written:

```python 
idx = ['color','siz'] # A list of strings. 
df[idx] # Index with a list of strings
```


## Indexing Rows

There are three main row indexers: 
* ``df.loc[]``, to get rows by the index label. 
* ``df.iloc[]``, to get rows by the row number. 
* ``df[pandas.Series]``, to get rows where the Series has a ``True`` value. 


The ``df.iloc[]`` indexer returns rows based on their row numbers, starting at 0. 

In [6]:
# Return the row at position 10 (the 11th row, since numbering starts at 0)
df.iloc[10]

color        green
siz          small
frequency     4616
opacity       2105
focus          881
squelch        636
barity        6151
Name: 10, dtype: object

The ``df.loc[]`` indexer returns rows based on their index label. By default, the index label is the same as the row number, so to get the row at position 10, we can do the same thing as with ``df.iloc[]``, but use ``df.loc[]``:

In [7]:
df.loc[10]

color        green
siz          small
frequency     4616
opacity       2105
focus          881
squelch        636
barity        6151
Name: 10, dtype: object

But that only works because the default index is the row number, so the labels for the index are the same as the row numbers. If we change the index, such as by using ``.set_index()``, we will get different labels. For instance, we can use ``df.set_index('color')`` to set colors as the index labels; then ``df.loc[]`` can access rows by their color. 

In [8]:
# Return the rows with an index label of 'blue'
df.set_index('color').loc['blue'].head()

Unnamed: 0_level_0,siz,frequency,opacity,focus,squelch,barity
color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
blue,large,1720,1540,440,2069,9790
blue,large,1762,2284,626,2100,9965
blue,medium,2551,1708,432,1366,8336
blue,medium,2918,1790,817,1504,8385
blue,small,3371,2165,756,749,4498
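
The difference between position and label is easiest to see when the labels are integers that are not in row order. The following is a minimal sketch with made-up data (``demo`` and its values are assumptions for illustration only):

```python
import pandas as pd

# Made-up data whose integer labels are deliberately out of order
demo = pd.DataFrame({'value': [10, 20, 30]}, index=[2, 0, 1])

by_position = demo.iloc[0, 0]      # first row by position (its label is 2)
by_label = demo.loc[0, 'value']    # the row whose label is 0 (the second row)

print(by_position, by_label)
```

Here ``.iloc[0]`` and ``.loc[0]`` reach different rows, even though both are given the number 0.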


The last important row indexer is the Boolean indexer, which compares a column to a value to produce a series of True or False values, which are then used to select rows. 

In [9]:
# Return rows where the color is "blue"

df[df.color=='blue']

Unnamed: 0,color,siz,frequency,opacity,focus,squelch,barity
0,blue,large,1720,1540,440,2069,9790
1,blue,large,1762,2284,626,2100,9965
2,blue,medium,2551,1708,432,1366,8336
3,blue,medium,2918,1790,817,1504,8385
4,blue,small,3371,2165,756,749,4498
5,blue,small,4799,2075,401,658,4374



Here is another way to look at the row indexer ``df[df.color=='blue']``. Since the indexer is attached to ``df``, it looks like it should index columns. The difference is the datatype of the object inside the brackets. You could also write this indexer as: 

```python 
idx = (df.color=='blue')
df[idx]
```

The ``idx`` variable now holds a ``pandas.Series``, and the values of the series are booleans, ``True`` or ``False``. 

```
[In] df.color=='blue'
[Out] 
0      True
1      True
2      True
3      True
4      True
5      True
6      False
7      False
8      False
```

The indexer can detect the difference between a list of strings ( which returns columns ) and a ``Series`` of booleans, which returns rows. 
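
To see both behaviors side by side, here is a small sketch on a made-up three-row dataframe (``demo`` is a hypothetical name; the values are only for illustration):

```python
import pandas as pd

demo = pd.DataFrame({
    'color': ['blue', 'blue', 'green'],
    'siz': ['large', 'small', 'large'],
    'focus': [440, 756, 685],
})

columns = demo[['color', 'focus']]   # list of strings: selects columns
mask = demo['color'] == 'blue'       # Series of booleans, one per row
rows = demo[mask]                    # boolean Series: selects rows

print(columns.shape)   # all 3 rows, 2 of the columns
print(rows.shape)      # 2 of the rows, all 3 columns
```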



## Slicing

All of the indexers can actually handle a more complex specification for what rows to return, using Python slice notation. Slice notation is defined for lists in Python, and is carried over to pandas, so we can explore how it works first with Python lists. Here is a simple list and a slice:



In [10]:
l = ['a','b','c','d','e']

# This will give you the values between the 2 and 4 positions:
l[2:4]


['c', 'd']

The way to read the slice notation is to imagine that the numbers point between the elements of the list, and the numbering starts from zero: 

```
 'a' 'b' 'c' 'd' 'e'
0   1   2   3   4   5 
```

So the notation ``[2:4]`` will return all of the values between 2 and 4, which are 'c' and 'd'. 

You can also describe the slice with negative numbers, which start from the right: 

```
  'a'  'b'  'c'  'd'  'e'
-5   -4   -3   -2   -1   0 
```

Using this notation, you can also get 'c' and 'd', indexing from the back:


In [11]:
l[-3:-1]

['c', 'd']

The more useful way to use negative slices is to cut off some elements, for instance, to get rid of the first and last elements: 

In [12]:
l[1:-1]

['b', 'c', 'd']

Python handles these slices in a tricky way: it converts them to an object before sending them to the indexer function, and you can create that object manually. (Sometimes you have to.) The object is created with ``slice()``, so the last indexing operation could also be written as: 


In [13]:
l[slice(1,-1)]

['b', 'c', 'd']
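
A slice object can also be inspected and reused. The following sketch uses only built-in Python; the variable names are chosen for illustration:

```python
l = ['a', 'b', 'c', 'd', 'e']

s = slice(1, -1)

# A slice object just stores the three parts of the start:stop:step notation
parts = (s.start, s.stop, s.step)   # step is None when it is left out

# .indices() resolves negative or missing values for a sequence of a given length
resolved = s.indices(len(l))        # (start, stop, step) for a 5-element list

# The same slice object can be reused on different sequences
from_list = l[s]
from_string = 'abcde'[s]

print(parts, resolved, from_list, from_string)
```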

Slice notation is particularly useful for the ``df.iloc[]`` indexer. For instance, to get rows 10 through 19 (the slice stops before row 20): 

In [14]:
df.iloc[10:20]

Unnamed: 0,color,siz,frequency,opacity,focus,squelch,barity
10,green,small,4616,2105,881,636,6151
11,green,small,2601,1757,1015,535,6411
12,orange,large,1584,1776,2365,2226,22948
13,orange,large,2184,1535,2447,2108,22675
14,orange,medium,4326,1711,1177,1428,23527
15,orange,medium,1142,1662,2094,1408,23565
16,orange,small,3515,1822,1219,662,23049
17,orange,small,1455,1876,1332,800,23267
18,red,large,2334,1522,363,2270,3128
19,red,large,1509,1803,234,2081,3133


Or, to get the last 5:

In [15]:
df.iloc[-5:] # Leaving off the last part of the slice defaults to the end

Unnamed: 0,color,siz,frequency,opacity,focus,squelch,barity
25,yellow,large,4588,1454,1756,2130,8453
26,yellow,medium,2186,2091,879,1204,11341
27,yellow,medium,4580,1808,1761,1392,11475
28,yellow,small,1611,1502,1104,600,10148
29,yellow,small,1724,1886,1175,647,10061


## Extended Slice Notation

Python lists are one dimensional, but pandas has two dimensional tables, so pandas has an extended slice notation, where the indexer can take a **tuple**. A tuple is a lot like a list, but it is defined with parentheses, and it can't be changed after it is created. So, one of our past examples could have been written with a tuple instead of a list: 



In [16]:
l = ('a','b','c','d','e')
l[2:4]

('c', 'd')

In Python, nearly all of the time you see things separated by commas between parentheses -- like ``(1,2,3)`` -- it is a tuple. The tuple does not require the parentheses, just the commas. So our example above could also have been written as:

```python 
l = 'a','b','c','d','e'
l[2:4]
```

However, it is good style to use parentheses to make the tuple explicit. 
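
A few properties of tuples are worth checking directly. This sketch uses only built-in Python, and the variable names are just for illustration:

```python
t = 'a', 'b', 'c'          # commas alone create a tuple
same = ('a', 'b', 'c')     # the parenthesized form is the same value

single = ('a',)            # a one-element tuple needs the trailing comma
not_a_tuple = ('a')        # without the comma this is just the string 'a'

try:
    same[0] = 'z'          # tuples cannot be changed after creation
    changed = True
except TypeError:
    changed = False

print(t == same, type(single).__name__, type(not_a_tuple).__name__, changed)
```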

Some pandas indexers can take tuples to indicate a slice of both the rows and the columns of a dataframe. For instance, in our ``df`` dataframe:



In [17]:
df.head()

Unnamed: 0,color,siz,frequency,opacity,focus,squelch,barity
0,blue,large,1720,1540,440,2069,9790
1,blue,large,1762,2284,626,2100,9965
2,blue,medium,2551,1708,432,1366,8336
3,blue,medium,2918,1790,817,1504,8385
4,blue,small,3371,2165,756,749,4498


The value in the third row (index label is 2) and the fourth column (column label is 'opacity') is `1708`. Because positions are counted from zero, that is row position 2 and column position 3. We can get this value by giving the ``df.iloc`` indexer an explicit request for that row and column, separated by a comma:

In [18]:
df.iloc[2,3] # Or, more explicitly: df.iloc[(2,3)]

1708
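
Positions for both rows and columns are counted from zero, so it can help to cross-check a positional lookup against the equivalent label lookup. A minimal sketch, using a hand-typed copy of the first three rows and four columns of the data (``demo`` is a hypothetical name):

```python
import pandas as pd

# The first three rows and four columns of the tutorial data, typed in by hand
demo = pd.DataFrame({
    'color': ['blue', 'blue', 'blue'],
    'siz': ['large', 'large', 'medium'],
    'frequency': [1720, 1762, 2551],
    'opacity': [1540, 2284, 1708],
})

by_position = demo.iloc[2, 3]        # row position 2, column position 3
by_label = demo.loc[2, 'opacity']    # row label 2, column label 'opacity'

print(by_position, by_label)
```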

The top level tuple, elements separated by a comma, indicates that we want to index both the rows and the columns. Each of those elements can also be slices, so if we wanted to get a range of columns and rows: 


In [19]:
df.iloc[ 2:4,4:6]

Unnamed: 0,focus,squelch
2,432,1366
3,817,1504


To be really clear about what is being created here: the things on either side of the comma are slices, and the comma binds them together into a tuple. So, the indexer above is shorthand for:

In [20]:
rows = slice(2,4)
columns = slice(4,6)
idx = (rows, columns)

df.iloc[idx]


Unnamed: 0,focus,squelch
2,432,1366
3,817,1504
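
If you want to see exactly what Python hands to an indexer, you can write a tiny class that returns whatever it receives inside the square brackets. ``ShowKey`` below is a hypothetical helper written for this illustration; it is not part of pandas:

```python
class ShowKey:
    """Hypothetical helper: returns whatever is passed inside the square brackets."""

    def __getitem__(self, key):
        return key


show = ShowKey()

one_slice = show[2:4]          # a single slice object
two_slices = show[2:4, 4:6]    # a tuple of two slice objects
with_list = show[[2, 3], :]    # a tuple holding a list and a "take everything" slice

print(one_slice)
print(two_slices)
print(with_list)
```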


### Fully Specified Indexes

One more complication is that instead of slices, you can also list the items from the index that you want to use. So in the above examples, ``slice(2,4)`` is equivalent to directly listing rows 2 and 3, and ``slice(4,6)`` is equivalent to directly listing columns 4 and 5. However, the indexer already has a special meaning for tuples, so to directly specify the items, you have to use a list, which is created with square brackets instead of parentheses. (There are other cases where you can use tuples.)

In [21]:
rows = [2,3]
columns = [4,5]
idx = (rows, columns)

df.iloc[idx]

Unnamed: 0,focus,squelch
2,432,1366
3,817,1504


Or, more compactly: 

In [22]:
df.iloc[ [2,3],[4,5] ]

Unnamed: 0,focus,squelch
2,432,1366
3,817,1504


The ``df.iloc[]`` indexer is handy, but you rarely want to use the row numbers to slice tables; index labels are much more useful. So, let's set an index on the dataframe and use labels. First, set the index. 

In [23]:
df_i = df.set_index('color')
df_i.head()

Unnamed: 0_level_0,siz,frequency,opacity,focus,squelch,barity
color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
blue,large,1720,1540,440,2069,9790
blue,large,1762,2284,626,2100,9965
blue,medium,2551,1708,432,1366,8336
blue,medium,2918,1790,817,1504,8385
blue,small,3371,2165,756,749,4498


Then we can use the ``df.loc[]`` indexer, with index labels, to get rows. 

In [24]:
# Get rows with the color blue
df_i.loc['blue'].head()

Unnamed: 0_level_0,siz,frequency,opacity,focus,squelch,barity
color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
blue,large,1720,1540,440,2069,9790
blue,large,1762,2284,626,2100,9965
blue,medium,2551,1708,432,1366,8336
blue,medium,2918,1790,817,1504,8385
blue,small,3371,2165,756,749,4498


To use slices, the indices must be sorted, which we can do with ``.sort_index()``. We have to do it twice, once for rows (axis 0) and a second time for columns (axis 1).

In [25]:
# Sort indices. 
df_i = df.set_index('color').sort_index(axis=0).sort_index(axis=1)

# Get a range of rows, with colors from 'blue' to 'red'. Slicing on labels requires
# that the index is sorted. 
df_i.loc['blue':'red'].head()

Unnamed: 0_level_0,barity,focus,frequency,opacity,siz,squelch
color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
blue,9790,440,1720,1540,large,2069
blue,9965,626,1762,2284,large,2100
blue,8336,432,2551,1708,medium,1366
blue,8385,817,2918,1790,medium,1504
blue,4498,756,3371,2165,small,749
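
One difference is worth keeping in mind when moving between the two indexers: a label slice with ``.loc[]`` includes the end label, while a positional slice with ``.iloc[]`` stops before the end position. Here is a minimal sketch with made-up data (``demo`` and its values are assumptions for illustration):

```python
import pandas as pd

demo = pd.DataFrame(
    {'value': [1, 2, 3, 4, 5]},
    index=['red', 'blue', 'yellow', 'green', 'orange'],
).sort_index()   # sorted labels: blue, green, orange, red, yellow

by_label = demo.loc['blue':'orange']   # includes the 'orange' row
by_position = demo.iloc[0:2]           # stops before position 2

print(by_label.index.tolist())
print(by_position.index.tolist())
```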


In [26]:
# Select both rows and columns:
df_i.loc['blue':'red', 'barity':'opacity'].head()


Unnamed: 0_level_0,barity,focus,frequency,opacity
color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
blue,9790,440,1720,1540
blue,9965,626,1762,2284
blue,8336,432,2551,1708
blue,8385,817,2918,1790
blue,4498,756,3371,2165


In [27]:
# Use explicit label names instead of slices:
df_i.loc[ ('green','yellow'), ('focus','squelch') ].head()


Unnamed: 0_level_0,focus,squelch
color,Unnamed: 1_level_1,Unnamed: 2_level_1
green,685,2047
green,1126,2112
green,1205,1416
green,1193,1423
green,881,636


# Multi-Indexes

As we saw in the Restructuring Tables notebook, dataframes can have multi-level indexes for both rows and columns. In our dataframe, records have both an identifying color and size, and they have a hierarchical relationship: for each color, there is a set of rows for each size. 


In [28]:
df.head(8)

Unnamed: 0,color,siz,frequency,opacity,focus,squelch,barity
0,blue,large,1720,1540,440,2069,9790
1,blue,large,1762,2284,626,2100,9965
2,blue,medium,2551,1708,432,1366,8336
3,blue,medium,2918,1790,817,1504,8385
4,blue,small,3371,2165,756,749,4498
5,blue,small,4799,2075,401,658,4374
6,green,large,3315,1736,685,2047,6677
7,green,large,4733,1457,1126,2112,6769


Setting the ``color`` and ``siz`` columns as the index will make that hierarchy explicit.

In [29]:
df_mi = df.set_index(['color','siz'])
df_mi.head(12)

Unnamed: 0_level_0,Unnamed: 1_level_0,frequency,opacity,focus,squelch,barity
color,siz,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
blue,large,1720,1540,440,2069,9790
blue,large,1762,2284,626,2100,9965
blue,medium,2551,1708,432,1366,8336
blue,medium,2918,1790,817,1504,8385
blue,small,3371,2165,756,749,4498
blue,small,4799,2075,401,658,4374
green,large,3315,1736,685,2047,6677
green,large,4733,1457,1126,2112,6769
green,medium,4218,1819,1205,1416,7115
green,medium,4492,2098,1193,1423,7245


Now, all of the items that have the color of `blue` are grouped together, and the next level down all of the rows with the same size are grouped together. Each of the columns in the index is known as a "level".

When we use the ``df.loc[]`` indexer, we can access rows using one or both of these index levels. We can index with a tuple, where the first element of the tuple is the color, and the second is the size. 


In [30]:
df_mi.loc[ ('blue','large') ]

Unnamed: 0_level_0,Unnamed: 1_level_0,frequency,opacity,focus,squelch,barity
color,siz,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
blue,large,1720,1540,440,2069,9790
blue,large,1762,2284,626,2100,9965
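
The same lookups can be tried on a smaller table. The sketch below uses one hand-copied row per color and size pair (``demo`` is a hypothetical name). Because each pair appears only once here, giving both levels returns a single row as a ``Series`` rather than a two-row dataframe:

```python
import pandas as pd

# One row per (color, siz) pair, copied by hand from the tutorial data
demo = pd.DataFrame({
    'color': ['blue', 'blue', 'blue', 'green', 'green', 'green'],
    'siz': ['large', 'medium', 'small', 'large', 'medium', 'small'],
    'focus': [440, 432, 756, 685, 1205, 881],
}).set_index(['color', 'siz'])

blue_rows = demo.loc['blue']              # first level only: all of the blue rows
one_row = demo.loc[('blue', 'large')]     # both levels: here exactly one row, as a Series

print(blue_rows.index.tolist())
print(one_row['focus'])
```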


As before, we can specify multiple items for one of the levels by using a tuple, but that requires a bit more care. For instance, if you wanted to get the color 'blue' and the sizes of 'large' and 'small', you can sometimes put the 'large' and 'small' in a tuple:

```python
df_mi.loc[ ('blue', ('large', 'small') ) ]
```

However, most of the time this won't work, because the parser will think you want to use 'large' and 'small' as column names. So, you have to add a second component to specify all of the columns. Fortunately, the slice notation ":", which means "from the start to the end", will work. We can also use ``slice(None)``, which means the same thing. 


In [31]:
df_mi.loc[ ('blue', ('large', 'small') ),slice(None) ]

Unnamed: 0_level_0,Unnamed: 1_level_0,frequency,opacity,focus,squelch,barity
color,siz,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
blue,large,1720,1540,440,2069,9790
blue,large,1762,2284,626,2100,9965
blue,small,3371,2165,756,749,4498
blue,small,4799,2075,401,658,4374


To expand what this means:


In [32]:
color_label = 'blue'

size_label = ('large','small')

row_indexer = (color_label, size_label)

column_indexer = slice(None) # This means "from the first column to the last" or "all columns"

df_mi.loc[ row_indexer, column_indexer]



Unnamed: 0_level_0,Unnamed: 1_level_0,frequency,opacity,focus,squelch,barity
color,siz,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
blue,large,1720,1540,440,2069,9790
blue,large,1762,2284,626,2100,9965
blue,small,3371,2165,756,749,4498
blue,small,4799,2075,401,658,4374



The ``slice(None)`` trick can be used to mean "all of a level", so we could get all of the colors, but only some of the sizes. However, in a lot of cases, the better style is to use ":" instead of ``slice(None)``. In the following code, we can make that substitution for the column indexer, but not for the color label, because ":" is not valid syntax inside a parenthesized tuple.

In [33]:
df_mi.loc[ (slice(None), ('large','small') ), :].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,frequency,opacity,focus,squelch,barity
color,siz,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
blue,large,1720,1540,440,2069,9790
blue,large,1762,2284,626,2100,9965
blue,small,3371,2165,756,749,4498
blue,small,4799,2075,401,658,4374
green,large,3315,1736,685,2047,6677


To Recap: 

* ``slice(None)`` means to take all labels of the color level in the index.
* ``('large','small')`` means to take only the 'large' and 'small' labels in the siz level. 
* ``(slice(None), ('large','small') )`` combines the two previous into an indexer for the rows
* ``:`` is on the right side of the top level comma, so it is the column indexer, meaning to take all columns. 

We can take the complexity one step further and select only a few of the columns:

In [34]:
df_mi.loc[ (slice(None), ('large','small') ), ('opacity','squelch')].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,opacity,squelch
color,siz,Unnamed: 2_level_1,Unnamed: 3_level_1
blue,large,1540,2069
blue,large,2284,2100
blue,small,2165,749
blue,small,2075,658
green,large,1736,2047
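
pandas also provides ``pd.IndexSlice``, which builds the same kind of row indexer but lets you write ":" for a whole level. The sketch below is one possible alternative spelling, not the form used in this tutorial. It assumes a small hand-copied table (``demo`` is a hypothetical name) and uses lists for the size and column labels:

```python
import pandas as pd

demo = pd.DataFrame({
    'color': ['blue', 'blue', 'blue', 'green', 'green', 'green'],
    'siz': ['large', 'medium', 'small', 'large', 'medium', 'small'],
    'focus': [440, 432, 756, 685, 1205, 881],
    'barity': [9790, 8336, 4498, 6677, 7115, 6151],
}).set_index(['color', 'siz']).sort_index()

# The form used in this tutorial: slice(None) stands in for ":" inside the tuple
with_slice_none = demo.loc[(slice(None), ['large', 'small']), ['focus']]

# pd.IndexSlice builds the same tuple, but lets you write ":" directly
idx = pd.IndexSlice
with_index_slice = demo.loc[idx[:, ['large', 'small']], ['focus']]

print(with_index_slice)
```

Both selections keep every color, only the 'large' and 'small' sizes, and only the 'focus' column.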


There is only one more step to realize the full capability of the ``df.loc[]`` indexer: you can have a multi-index for columns too. But that is a specialized case, so we'll leave that for later. 

# Test

In [35]:
# First, let's re-load our dataset
df  = pd.read_csv('http://ds.civicknowledge.org.s3.amazonaws.com/civicknowledge.com/pandas-training/colors-sizes.csv', index_col=False)


# Task 1

Select row number 5 from ``df`` and return it as the output of the cell ( Leave it as the last line in the cell ) 


In [36]:
# Solution


In [37]:
# Test
assert 'df' in locals(), "Dataframe 'df' isn't loaded. Be sure to run the cell right after the 'Test' heading "
assert _.name == 5
success

# Task 2

Select rows 10 to 20 from ``df``. Remember that your output should include row #10, but not row #20.


In [38]:
# Solution


In [39]:
# Test
assert all((_ == df.iloc[10:20]).color)
success

# Task 3

Assign a new dataframe, ``df1``, that is the ``df`` dataframe, but with the index set to the ``color`` column. Then, select all of the rows that have the color blue. 


In [40]:
# Solution


In [41]:
# Test
assert _.focus.sum()== df[df.color == 'blue'].focus.sum()
success

# Task 4

Produce the same dataframe as in Task 3 -- all the rows with color 'blue' --  but using a column comparison on the ``df`` dataframe to produce a boolean series. ( You don't need to set the index. ) 


In [42]:
# Solution


In [43]:
# Test

assert len(_.columns) == 7, 'Table has wrong number of columns'
assert len(_) == 6, 'Table has wrong number of rows'
assert _.frequency.sum()==  17121, 'Table has wrong sum of values'
success

# Task 5

Use the ``.loc[]`` indexer to select rows with color "green" and the "focus" and "barity" columns. You will need to use ``.set_index()``


In [44]:
# Solution


In [45]:
assert len(_.columns) == 2, 'Table has wrong number of columns'
assert len(_) == 6, 'Table has wrong number of rows'
assert _.sum().sum() ==  46473, 'Table has wrong sum of values'
success

# Task 6

Create a new dataframe, ``t``, which is a copy of ``df`` but has the index set to 'siz' and 'color'. Index ``t`` to return all values of ``siz`` but only the colors 'blue' and 'green', and only the columns 'focus' and 'barity'.

In [46]:
# Solution


In [47]:
assert len(_) == 12, "Dataframe isn't the right length."
assert _.sum().sum() == 95293, 'DataFrame values did not have the correct sum. '
success