## **Summary: Selection & Indexing**

The example `df` used throughout this summary has lower-case labels for the index (rows) and upper-case labels for the column names.

In [2]:
import pandas as pd

df = pd.DataFrame.from_dict(dict(a=[1,11,111],
                                 b=[2,22,222],
                                 c=[3,33,333],
                                 aa=[1,11,111],
                                 bb=[2,22,222],
                                 cc=[3,33,333]),
                             orient='index',
                             columns=['A',"B","C"])
df
                            

|  | A | B | C |
| :-- | --: | --: | --: |
| a | 1 | 11 | 111 |
| b | 2 | 22 | 222 |
| c | 3 | 33 | 333 |
| aa | 1 | 11 | 111 |
| bb | 2 | 22 | 222 |
| cc | 3 | 33 | 333 |


## **The Main Idea**

The selection conventions of `pandas` objects fairly consistently follow
those of `numpy` arrays.  A `pandas` `DataFrame` is an
analogue of a `numpy` **2D array**, except that it is indexed primarily by
label (keyword indexing) rather than by position.  The analogue of a `numpy` 1D array
is a `pandas` `Series`. Both the rows and the columns of a `DataFrame`
are `Series` instances. The thing to remember about a `pandas` `Series`
is that it still has an index and still uses keyword indexing.
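
The claim that rows and columns are both `Series` can be checked directly. The following is a minimal sketch that rebuilds the example `df`; the names `col` and `row` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame.from_dict(
    dict(a=[1, 11, 111], b=[2, 22, 222], c=[3, 33, 333],
         aa=[1, 11, 111], bb=[2, 22, 222], cc=[3, 33, 333]),
    orient="index", columns=["A", "B", "C"])

col = df["A"]      # a column: a Series indexed by the row labels
row = df.loc["a"]  # a row: a Series indexed by the column names

print(type(col).__name__, list(col.index))
print(type(row).__name__, list(row.index))
```

Both objects carry an index, which is what makes keyword indexing possible on either one.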

### **(1) Keyword Indexing**

Unlike `numpy`, which uses positional indexing, `pandas` uses keyword indexing.  The simplest case is the form `df[col][row]`.

In [2]:
df['A']

a     1
b     2
c     3
aa    1
bb    2
cc    3
Name: A, dtype: int64

is column `'A'` of `df`.  This is a `Series` instance.  The index is the same as the index of the `DataFrame`. For any `Series` object the index can be used to access values.

In [3]:
df['A']['a']

1

is a single value in the `DataFrame` or, from the `pandas` point of view, a 0-d
object: no rows, no columns, and no index.
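
Label lookup works the same way on any `Series`, not only on a column of `df`. Here is a small sketch using a hypothetical standalone `Series` named `s` (not part of the example above); `np.ndim` is used only to confirm that the result is a scalar.

```python
import numpy as np
import pandas as pd

# Hypothetical standalone Series, not part of the example df.
s = pd.Series([10, 20, 30], index=["x", "y", "z"])

value = s["y"]         # look up by label, as with df['A']['a']
print(value)           # 20
print(np.ndim(value))  # 0: a scalar has no rows, columns, or index
```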

### **(2) Numpy style indexing .loc[ . . . ]**

To access an indexed object (whether a `DataFrame` or a `Series`) using the index, `pandas` provides the `.loc[...]` indexer, an attribute that is subscripted with square brackets rather than called like a method. The conventions follow the axis ordering of `numpy` in that rows come first, then columns.  Both single indices and slices work.

In [21]:
df.loc['a']

A      1
B     11
C    111
Name: a, dtype: int64

is a row and therefore a `Series` instance.  The index of
this `Series` is the column names.

In [22]:
df.loc['a':'aa']

|  | A | B | C |
| :-- | --: | --: | --: |
| a | 1 | 11 | 111 |
| b | 2 | 22 | 222 |
| c | 3 | 33 | 333 |
| aa | 1 | 11 | 111 |


is a `DataFrame`.  Note that a label slice includes its end label (`'aa'` here).  Using `.loc[..]` first with a `DataFrame` and then with a `Series`:

In [40]:
df.loc['a'].loc['B']

11

Note that `df.loc['a']` is a `Series` whose index is the column names
of `df`, so that using `.loc[..]` again means indexing by column.
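
One place where `.loc[...]` slices depart from `numpy` slices is the treatment of the end point. The sketch below rebuilds the example `df` and compares a label slice with a positional slice (`.iloc[...]`, covered later in this summary); the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame.from_dict(
    dict(a=[1, 11, 111], b=[2, 22, 222], c=[3, 33, 333],
         aa=[1, 11, 111], bb=[2, 22, 222], cc=[3, 33, 333]),
    orient="index", columns=["A", "B", "C"])

by_label = df.loc["a":"c"]   # label slice: the end label 'c' is included
by_position = df.iloc[0:2]   # positional slice: position 2 is excluded

print(list(by_label.index))     # ['a', 'b', 'c']
print(list(by_position.index))  # ['a', 'b']
```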

### **(3) Boolean Indexing**

`pandas` also follows `numpy` in allowing Boolean indexing.

In [23]:
df['A'] == 1

a      True
b     False
c     False
aa     True
bb    False
cc    False
Name: A, dtype: bool

is a Boolean `Series`.

In [24]:
df[df['A'] == 1]

|  | A | B | C |
| :-- | --: | --: | --: |
| a | 1 | 11 | 111 |
| aa | 1 | 11 | 111 |


is a `DataFrame` containing a subset of the rows of `df`.  In this case
we have used a Boolean `Series` as a Boolean mask to select rows of the `DataFrame`, also an idea borrowed from `numpy`.
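
For comparison, here is a sketch of the same masking idea on a plain `numpy` array, followed by a combined condition in `pandas`. It rebuilds the example `df`; the names `arr`, `mask`, and `both` are illustrative. Combined conditions use `&` and `|` with each comparison parenthesized.

```python
import pandas as pd

df = pd.DataFrame.from_dict(
    dict(a=[1, 11, 111], b=[2, 22, 222], c=[3, 33, 333],
         aa=[1, 11, 111], bb=[2, 22, 222], cc=[3, 33, 333]),
    orient="index", columns=["A", "B", "C"])

# numpy version: a Boolean array selects rows of a 2D array.
arr = df.to_numpy()
mask = arr[:, 0] == 1   # True where the first column equals 1
print(arr[mask])

# pandas version with two conditions combined element-wise.
both = df[(df["A"] == 1) & (df["B"] == 11)]
print(list(both.index))  # ['a', 'aa']
```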

### **(4) Fancy indexing**

`pandas` also allows an analogue of "fancy indexing" (`numpy`'s term for selecting
a sequence of values with a sequence of indices).

In [25]:
df[['A','B']]

|  | A | B |
| :-- | --: | --: |
| a | 1 | 11 |
| b | 2 | 22 |
| c | 3 | 33 |
| aa | 1 | 11 |
| bb | 2 | 22 |
| cc | 3 | 33 |


is a `DataFrame` containing columns `A` and `B`  of `df`.

In [26]:
df.loc[['a','bb']]

|  | A | B | C |
| :-- | --: | --: | --: |
| a | 1 | 11 | 111 |
| bb | 2 | 22 | 222 |


is a `DataFrame` containing rows `a` and `bb` of `df`.

We can also restrict the columns:

In [9]:
df.loc[:,['A','C']]

|  | A | C |
| :-- | --: | --: |
| a | 1 | 111 |
| b | 2 | 222 |
| c | 3 | 333 |
| aa | 1 | 111 |
| bb | 2 | 222 |
| cc | 3 | 333 |


The following example illustrates a difference from `numpy`.  In `numpy`,

```
a[[r1,r2],[c1,c2]]
```

retrieves a sequence of two points from a 2D array, 
the points at `a[r1,c1]` and `a[r2,c2]`.

This is illustrated in the next cell.

In [8]:
import numpy as np
a = np.arange(4).reshape((2,2))

print(f"{a=}")
print()
print(f"{a[[0,1],[0,1]]=}")

a=array([[0, 1],
       [2, 3]])

a[[0,1],[0,1]]=array([0, 3])


An apparently analogous expression for a `pandas` `DataFrame` is given in the next cell:

In [3]:
df.loc[['a','bb'],['A','C']]

|  | A | C |
| :-- | --: | --: |
| a | 1 | 111 |
| bb | 2 | 222 |


It turns out that, for selection, this gives the same result as

In [10]:
df.loc[['a','bb']][['A','C']]

|  | A | C |
| :-- | --: | --: |
| a | 1 | 111 |
| bb | 2 | 222 |


It doesn't select two data values;
it restricts the columns to `['A','C']` and the rows 
to `['a','bb']`.
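
If the `numpy`-style pointwise selection is what is wanted, one possible approach (an assumption here, not something the summary above prescribes) is to translate the labels into positions with `Index.get_indexer` and then index the underlying array. The names `rows`, `cols`, `points`, and `block` are illustrative.

```python
import pandas as pd

df = pd.DataFrame.from_dict(
    dict(a=[1, 11, 111], b=[2, 22, 222], c=[3, 33, 333],
         aa=[1, 11, 111], bb=[2, 22, 222], cc=[3, 33, 333]),
    orient="index", columns=["A", "B", "C"])

# .loc with two lists selects a rectangular block, not individual points.
block = df.loc[["a", "bb"], ["A", "C"]]
print(block.shape)  # (2, 2)

# Translate labels to integer positions, then use numpy fancy indexing.
rows = df.index.get_indexer(["a", "bb"])   # positions of the row labels
cols = df.columns.get_indexer(["A", "C"])  # positions of the column labels
points = df.to_numpy()[rows, cols]         # values at (a, A) and (bb, C)
print(points)
```

This sketch assumes the row and column labels are unique, as they are in the example `df`.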

### **(5) .iloc[ . . . ] Positional indexing**

Although `pandas` selection is almost always done through keyword indexing, `pandas` does provide a way to use positional indexing with
any indexed object through the `.iloc[...]` attribute.  Consider `df`
repeated here.

In [4]:
df

|  | A | B | C |
| :-- | --: | --: | --: |
| a | 1 | 11 | 111 |
| b | 2 | 22 | 222 |
| c | 3 | 33 | 333 |
| aa | 1 | 11 | 111 |
| bb | 2 | 22 | 222 |
| cc | 3 | 33 | 333 |


Both simple positional indices and slices work:

In [10]:
df.iloc[2]

A      3
B     33
C    333
Name: c, dtype: int64

is the third row of `df`, a `Series` instance.

In [5]:
df.iloc[2:4]

|  | A | B | C |
| :-- | --: | --: | --: |
| c | 3 | 33 | 333 |
| aa | 1 | 11 | 111 |


is a `DataFrame` with the third and fourth rows of `df`.

Here is an example using `.iloc[...]` on a row of `df`:

In [6]:
df.loc['bb'].iloc[1:3]

B     22
C    222
Name: bb, dtype: int64

This uses positional indexing on `df.loc['bb']`, a `Series` indexed by
column names, to get a sub-`Series` containing only the second and third columns.
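
`.iloc[...]` also accepts a row position and a column position together, in the same rows-then-columns order as `.loc[...]`. The sketch below rebuilds the example `df` and shows this, along with one way to mix a label and positions by converting the label with `Index.get_loc`. The names `pos` and `mixed` are illustrative.

```python
import pandas as pd

df = pd.DataFrame.from_dict(
    dict(a=[1, 11, 111], b=[2, 22, 222], c=[3, 33, 333],
         aa=[1, 11, 111], bb=[2, 22, 222], cc=[3, 33, 333]),
    orient="index", columns=["A", "B", "C"])

print(df.iloc[2, 0])   # third row, first column: 3
print(df.iloc[:, 2])   # every row of the third column (column 'C')

# Convert a column label to its position, then slice rows by position.
pos = df.columns.get_loc("C")  # 2
mixed = df.iloc[1:3, pos]      # rows 'b' and 'c' of column 'C'
print(mixed)
```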

The following table reviews and summarizes the discussion above, using the example `df`:

 
| Selection | Native Pandas | Numpy-like |
| :-- | :-: | :-: |
| row | NA | `df.loc['c']` |
| row slice | NA | `df.loc['c':'bb']` |
| col | `df['A']` | `df.loc[:,'A']` |
| row, col | `df['A']['c']` | `df.loc['c','A']` |
|  |  | `df.loc['c']['A']` |
| bool series | `df['A'] == 2` | Not used |
| bool selection | `df[df['A'] == 2]` | `df.loc[df['A'] == 2]` |
| row (position) | NA | `df.iloc[2]` |
| col (position) | NA | `df.iloc[:,2]` |
| fancy (cols) | `df[['A','C']]` | `df.loc[:,['A','C']]` |
| fancy (rows) | NA | `df.loc[['b','bb']]` |
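
The equivalences between the two columns of the table can be checked mechanically. The following sketch rebuilds the example `df` and collects one Boolean per row of the table in a dictionary; the name `checks` and its keys are illustrative.

```python
import pandas as pd

df = pd.DataFrame.from_dict(
    dict(a=[1, 11, 111], b=[2, 22, 222], c=[3, 33, 333],
         aa=[1, 11, 111], bb=[2, 22, 222], cc=[3, 33, 333]),
    orient="index", columns=["A", "B", "C"])

checks = {
    # column: native form vs .loc form
    "col": df["A"].equals(df.loc[:, "A"]),
    # single value: chained lookup vs the two .loc forms
    "row, col": df["A"]["c"] == df.loc["c", "A"] == df.loc["c"]["A"],
    # Boolean selection: native form vs .loc form
    "bool selection": df[df["A"] == 2].equals(df.loc[df["A"] == 2]),
    # position vs label for the same row and the same column
    "row (position)": df.iloc[2].equals(df.loc["c"]),
    "col (position)": df.iloc[:, 2].equals(df["C"]),
    # fancy indexing on columns: native form vs .loc form
    "fancy (cols)": df[["A", "C"]].equals(df.loc[:, ["A", "C"]]),
}

for name, ok in checks.items():
    print(f"{name}: {ok}")
```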