# Data Indexing and Selection

We looked in detail at methods and tools to access, set, and modify values in NumPy arrays.
These included:

- **indexing** 
```python
arr[2, 1]
```
- **slicing** 
```python
arr[:, 1:5]
```
- **masking**
```python
arr[arr > 0]
```
- **fancy indexing**
```python
arr[[1, 5]]
```
- and **combinations**
```python
arr[:, [1, 5]]
```

**Here we'll look at similar means of accessing and modifying values in Pandas ``Series`` and ``DataFrame`` objects.**
If you have used the NumPy patterns, the corresponding patterns in Pandas will feel very familiar, though there are a few quirks to be aware of.
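As a quick refresher, the NumPy patterns listed above can be run on a small array. This is only a sketch: the array `arr` and its 3x6 shape are assumptions chosen for illustration, and the fancy-indexing rows are `[0, 2]` rather than `[1, 5]` because the example array has only three rows.

```python
import numpy as np

# A small 3x6 array used only for illustration (values run from -4 to 13)
arr = np.arange(18).reshape(3, 6) - 4

single = arr[2, 1]         # indexing: one element (row 2, column 1)
block = arr[:, 1:5]        # slicing: all rows, columns 1 through 4
positive = arr[arr > 0]    # masking: 1-D array of the elements greater than 0
rows = arr[[0, 2]]         # fancy indexing: rows 0 and 2
cols = arr[:, [1, 5]]      # combination: all rows, columns 1 and 5

print(single)              # 9
print(block.shape)         # (3, 4)
print(positive.size)       # 13
print(rows.shape)          # (2, 6)
print(cols.shape)          # (3, 2)
```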

We'll start with the simple case of the one-dimensional ``Series`` object, and then move on to the more complicated two-dimensional ``DataFrame`` object.

## Data Selection in Series

As we saw in the previous section, a **``Series`` object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.**
If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.

In [2]:
# first things first
import pandas as pd

### Series as dictionary

Like a dictionary, the ``Series`` object provides a mapping from a collection of keys to a collection of values:

In [3]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], 
                index = ["a", "b", "c", "d"])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [4]:
print("- by label: ", data['a']) # explicit access by label
print("- by position: ", data[0]) # implicit access by position (recent pandas versions deprecate this; prefer data.iloc[0])

- by label:  0.25
- by position:  0.25


We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:

In [8]:
dicti = {4: 4}
4 in dicti

True

In [5]:
"a" in data

True

In [9]:
# as if it were a dictionary
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [10]:
# access the index of the Series
data.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [12]:
# values is an attribute, not a method, so data.values() does not work
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [14]:
# we can access the index and the values together, as with a dictionary
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [15]:
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [17]:
# the same result can be built by zipping the index and the values
list(zip(data.index, data.values))

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

[Don't quite remember what `zip` does?](https://www.programiz.com/python-programming/methods/built-in/zip)

``Series`` objects can even be modified with a dictionary-like syntax.
Just as you can extend a dictionary by assigning to a new key, you can extend a ``Series`` by assigning to a new index value:

In [19]:
dicti["7"] = 7

In [20]:
dicti

{4: 4, '7': 7}

In [21]:
# "e" does not exist yet, so this creates a new entry
data["e"] = 0.30

In [24]:
# "b" already exists, so this accesses its value and modifies it
data["b"] = 0.30

In [25]:
data

a    0.25
b    0.30
c    0.75
d    1.00
e    0.30
dtype: float64

### Series as one-dimensional array

A ``Series`` builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays – that is, *slices*, *masking*, and *fancy indexing*.
Examples of these are as follows:

In [28]:
data_dup = pd.Series([0.25, 0.5, 0.75, 1.0, 0.6],
                    index = ["a", "b", "c", "d", "c"])
data_dup

a    0.25
b    0.50
c    0.75
d    1.00
c    0.60
dtype: float64

In [29]:
# slicing by explicit index (no duplicated labels)
data["a": "c"]

a    0.25
b    0.30
c    0.75
dtype: float64

<table align="left">
 <tr><td width="80"><img src="./img/error.png" style="width:auto;height:auto"></td>
     <td style="text-align:left">
         <h3>ERRORS: duplicated index</h3>
         
 </td></tr>
</table>

In [30]:
# KeyError: "Cannot get right slice bound for non-unique label: 'c'"
data_dup['a':'c']

KeyError: "Cannot get right slice bound for non-unique label: 'c'"
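The error above comes from a label that is both duplicated and out of order. The following sketch rebuilds `data_dup` and shows one possible workaround, sorting the index first; whether sorting is acceptable depends on the data, so treat it as an illustration rather than a general fix.

```python
import pandas as pd

data_dup = pd.Series([0.25, 0.5, 0.75, 1.0, 0.6],
                     index=["a", "b", "c", "d", "c"])

# "c" appears twice and the index is not sorted, so the right bound of 'a':'c' is ambiguous
print(data_dup.index.is_unique)                # False
print(data_dup.index.is_monotonic_increasing)  # False

raised = False
try:
    data_dup["a":"c"]
except KeyError as err:
    raised = True
    print("KeyError:", err)

# After sorting, the two "c" entries sit next to each other and the slice is well defined
subset = data_dup.sort_index().loc["a":"c"]
print(subset)
```

The sorted slice contains four entries: `a`, `b`, and both `c` values.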

In [37]:
# slicing by implicit integer index
data_dup[2:5]

c    0.75
d    1.00
c    0.60
dtype: float64

In [42]:
# duplicated labels can still be accessed by position
print(data_dup[2])
data_dup[4]

0.75


0.6

In [43]:
data_dup["c"]

c    0.75
c    0.60
dtype: float64

In [45]:
data

a    0.25
b    0.30
c    0.75
d    1.00
e    0.30
dtype: float64

In [None]:
# in Python, for scalars: and, or, not
# in pandas, element-wise: &, |, ~

In [54]:
(data > 0.3).any() and (data < 0.8).any() 
# methods that reduce the whole Series to a single Boolean (such as any())
# allow us to use the scalar operators

True

In [56]:
# masking
data[(data > 0.3) & (data < 0.8)]

c    0.75
dtype: float64
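To make the difference between the scalar keywords and the element-wise operators concrete, here is a small sketch. It rebuilds `data` with the values used at this point in the section, so it can run on its own.

```python
import pandas as pd

data = pd.Series([0.25, 0.30, 0.75, 1.00, 0.30],
                 index=["a", "b", "c", "d", "e"])

both = data[(data > 0.3) & (data < 0.8)]     # element-wise AND
either = data[(data < 0.3) | (data > 0.8)]   # element-wise OR
negated = data[~(data > 0.3)]                # element-wise NOT

print(list(both.index))      # ['c']
print(list(either.index))    # ['a', 'd']
print(list(negated.index))   # ['a', 'b', 'e']

# The keyword `and` asks for the truth value of a whole Series,
# which is ambiguous, so pandas raises a ValueError.
ambiguous = False
try:
    data[(data > 0.3) and (data < 0.8)]
except ValueError as err:
    ambiguous = True
    print("ValueError:", err)
```

The parentheses around each comparison are needed because `&` and `|` bind more tightly than `>` and `<`.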

In [57]:
# fancy indexing
lista_num = ["a", "c"]
data[lista_num]

a    0.25
c    0.75
dtype: float64

In [61]:
data[["a", "c"]]

a    0.25
c    0.75
dtype: float64

Among these, slicing may be the source of the most confusion.
**Notice that when slicing with an explicit index (i.e., ``data['a':'c']``), the final index is *included* in the slice, while when slicing with an implicit index (i.e., ``data[0:2]``), the final index is *excluded* from the slice.**
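The inclusive versus exclusive behaviour is easy to check by counting the rows each slice returns. This sketch uses the original four-element `data` and writes the positional slice with `iloc` so that the intent is unambiguous.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=["a", "b", "c", "d"])

by_label = data["a":"c"]       # label slice: the stop label "c" is included
by_position = data.iloc[0:2]   # positional slice: the stop position 2 is excluded

print(list(by_label.index))     # ['a', 'b', 'c']
print(list(by_position.index))  # ['a', 'b']
```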

### Indexers: loc and iloc

These slicing and indexing conventions can be a source of confusion.
For example, if your ``Series`` has an explicit integer index, an indexing operation such as **``data[1]`` will use the explicit indices, while a slicing operation like ``data[1:3]`` will use the implicit Python-style index.**

In [62]:
data_num = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data_num

1    a
3    b
5    c
dtype: object

**Explicit index when indexing** 

In [67]:
# by label
data_num[5]

'c'

**Implicit index when slicing**

In [64]:
# by position
# slicing always returns a slice of the original object
data_num[1:3]

3    b
5    c
dtype: object

Because of this potential confusion in the case of integer indexes, Pandas provides some special *indexer* attributes that explicitly expose certain indexing schemes.
These are not functional methods, but attributes that expose a particular slicing interface to the data in the ``Series``.

First, the **``loc`` attribute allows indexing and slicing that always references the explicit index:**

In [66]:
# by label
data_num.loc[1]

'a'

In [69]:
# by labels
data_num.loc[1:5]

1    a
3    b
5    c
dtype: object

In [71]:
data.loc["a":"d"]

a    0.25
b    0.30
c    0.75
d    1.00
dtype: float64

The **``iloc`` attribute allows indexing and slicing that always references the implicit Python-style index**:

In [72]:
# by position
data_num.iloc[1]

'b'

In [79]:
# by positions
data_num.iloc[0:5:1] # here we use start:stop:step; a stop past the end is simply clipped

1    a
3    b
5    c
dtype: object

One guiding principle of Python code is that "explicit is better than implicit."
The explicit nature of ``loc`` and ``iloc`` makes them very useful in maintaining clean and readable code; especially in the case of integer indexes, **I recommend using these both to make code easier to read and understand, and to prevent subtle bugs due to the mixed indexing/slicing convention.**
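A side-by-side comparison may help fix the idea. The sketch below reuses the integer-indexed `data_num` from this section and shows that the same numbers mean different things to the two indexers.

```python
import pandas as pd

data_num = pd.Series(["a", "b", "c"], index=[1, 3, 5])

print(data_num.loc[1])             # label 1    -> 'a'
print(data_num.iloc[1])            # position 1 -> 'b'

print(list(data_num.loc[1:3]))     # labels 1 through 3, inclusive -> ['a', 'b']
print(list(data_num.iloc[1:3]))    # positions 1 and 2             -> ['b', 'c']
```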

## Data Selection in DataFrame

Recall that a ``DataFrame`` acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of ``Series`` structures sharing the same index.
These analogies can be helpful to keep in mind as we explore data selection within this structure.

### DataFrame as a dictionary

The first analogy we will consider is the ``DataFrame`` as a dictionary of related ``Series`` objects.
Let's return to our example of areas and populations of states:

In [84]:
import numpy as np

In [94]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995, 'Arizona': 735835})

pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135, 'Arizona': np.nan})

df = pd.DataFrame({"area": area, "pop": pop})
df
#np.nan

              area         pop
California  423967  38332521.0
Texas       695662  26448193.0
New York    141297  19651127.0
Florida     170312  19552860.0
Illinois    149995  12882135.0
Arizona     735835         NaN


The individual ``Series`` that make up the columns of the ``DataFrame`` can be accessed via dictionary-style indexing of the column name:

In [88]:
df["area"]

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Arizona       735835
Name: area, dtype: int64

Equivalently, we can use attribute-style access with column names that are strings:

In [89]:
df.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Arizona       735835
Name: area, dtype: int64

This attribute-style column access actually accesses the exact same object as the dictionary-style access:

In [90]:
df.area is df['area']

True

Though this is a useful shorthand, keep in mind that it does not work for all cases!
For example, **if the column names are not strings, or if the column names conflict with methods of the ``DataFrame``, this attribute-style access is not possible.**
For example, the ``DataFrame`` has a [``pop()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pop.html?highlight=pop#pandas.DataFrame.pop) method, so ``df.pop`` will point to this rather than the ``"pop"`` column:

In [91]:
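# df.pop is the DataFrame.pop() method, not the "pop" column.
# Calling it removes the column from df and returns it (the output below),
# so the call is left commented out to keep df intact:
#df.pop('area')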

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Arizona       735835
Name: area, dtype: int64

In particular, you should avoid the temptation to try column assignment via attribute (i.e., **use ``data['pop'] = z`` rather than ``data.pop = z``**).
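The sketch below illustrates both pitfalls on a two-row frame. The column name `total` is hypothetical and exists only for this example; the warning that pandas emits on attribute assignment is silenced so that the output stays short.

```python
import warnings
import pandas as pd

df = pd.DataFrame({"area": [423967, 695662],
                   "pop": [38332521, 26448193]},
                  index=["California", "Texas"])

print(callable(df.pop))        # True: df.pop is the DataFrame.pop method
print(df["pop"].tolist())      # the column is still reachable with brackets

# Assigning to a new attribute name sets a plain Python attribute, not a column
with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # pandas warns about this pattern
    df.total = [1, 2]
created_by_attribute = "total" in df.columns
print(created_by_attribute)    # False

# Bracket assignment does create the column
df["total"] = df["area"] + df["pop"]
created_by_brackets = "total" in df.columns
print(created_by_brackets)     # True
```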

Like with the ``Series`` objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:

In [102]:
df["densidad"] = df["pop"] / df["area"]
df

              area         pop    densidad
California  423967  38332521.0   90.413926
Texas       695662  26448193.0   38.018740
New York    141297  19651127.0  139.076746
Florida     170312  19552860.0  114.806121
Illinois    149995  12882135.0   85.883763
Arizona     735835         NaN         NaN


This shows a preview of the straightforward syntax of element-by-element arithmetic between ``Series`` objects; we'll dig into this further in **Operating on Data in Pandas**.

### DataFrame as two-dimensional array

As mentioned previously, we can also view the ``DataFrame`` as an enhanced two-dimensional array.
We can examine the raw underlying data array using the ``values`` attribute:

In [103]:
df.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01],
       [7.35835000e+05,            nan,            nan]])
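Note that the integer `area` column appears as floats in the array above. A NumPy array has a single dtype, so pandas picks one that can hold every column. A minimal sketch on a reduced two-row frame, using `to_numpy()` as an alternative to `values`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"area": [423967, 735835],
                   "pop": [38332521.0, np.nan]},
                  index=["California", "Arizona"])

print(df.dtypes)        # area is int64, pop is float64
raw = df.to_numpy()     # same data as df.values
print(raw.dtype)        # float64
print(raw.shape)        # (2, 2)
```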

With this picture in mind, many familiar array-like operations can be done on the ``DataFrame`` itself.
**For example, we can transpose the full ``DataFrame`` to swap rows and columns:**

In [None]:
# df.drop(columns = "area", inplace = True)

In [112]:
df.T

          California       Texas    New York     Florida    Illinois   Arizona
area        423967.0    695662.0    141297.0    170312.0    149995.0  735835.0
pop       38332520.0  26448190.0  19651130.0  19552860.0  12882140.0       NaN
densidad    90.41393    38.01874    139.0767    114.8061    85.88376       NaN


When it comes to indexing of ``DataFrame`` objects, however, it is clear that the dictionary-style indexing of columns precludes our **ability to simply treat it as a NumPy array.**
In particular, passing a single index to an array accesses a row:

In [108]:
df.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

and passing a single "index" to a ``DataFrame`` accesses a column:

In [115]:
df['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Arizona       735835
Name: area, dtype: int64

In [116]:
df['area']["California"]

423967
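The contrast between the two conventions can be summarised in a short sketch on a reduced frame. It also shows a single cell read with one `loc` call, which gives the same value as the chained `df['area']["California"]` lookup.

```python
import pandas as pd

df = pd.DataFrame({"area": [423967, 695662, 141297],
                   "pop": [38332521, 26448193, 19651127]},
                  index=["California", "Texas", "New York"])

first_row_array = df.values[0]             # NumPy: a single index selects a row
area_column = df["area"]                   # DataFrame: a single key selects a column

first_row = df.iloc[0]                     # row by position
texas_row = df.loc["Texas"]                # row by label
one_value = df.loc["California", "area"]   # one cell, in a single lookup

print(list(first_row_array))   # [423967, 38332521]
print(one_value)               # 423967
```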

Thus for array-style indexing, we need another convention.
Here Pandas again uses the ``loc`` and ``iloc`` indexers mentioned earlier, together with a third one, ``at``.


Let's start by looking at this third indexing attribute, **``at``, which accesses a single value for a row/column label pair.**

It is similar to ``loc`` in that both provide label-based lookups. Use ``at`` if you only need to get or set a single value in a ``DataFrame`` or ``Series``.

In [128]:
# move the index into a regular column
prueba = df.reset_index()

In [129]:
prueba

        index    area         pop    densidad
0  California  423967  38332521.0   90.413926
1       Texas  695662  26448193.0   38.018740
2    New York  141297  19651127.0  139.076746
3     Florida  170312  19552860.0  114.806121
4    Illinois  149995  12882135.0   85.883763
5     Arizona  735835         NaN         NaN


In [130]:
prueba.at[3, 'pop'] # by label (after reset_index the row labels are the integers 0 to 5)

19552860.0

In [118]:
df.at["Florida", "pop"]

19552860.0
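Since ``at`` can both read and write a single cell, a short round trip shows the typical use. This is a sketch on a reduced two-row frame; ``iat``, the position-based counterpart of ``at``, is included for comparison.

```python
import pandas as pd

df = pd.DataFrame({"area": [170312, 149995],
                   "pop": [19552860.0, 12882135.0]},
                  index=["Florida", "Illinois"])

# Reading one cell: at and loc return the same scalar
same = df.at["Florida", "pop"] == df.loc["Florida", "pop"]
print(same)                          # True

# Writing one cell by label
df.at["Illinois", "area"] = 150000
print(df.at["Illinois", "area"])     # 150000

# iat addresses a single cell by integer position
print(df.iat[0, 0])                  # 170312
```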

Using the ``iloc`` indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the ``DataFrame`` index and column labels are maintained in the result:

In [140]:
df.iloc[:3]

              area         pop    densidad
California  423967  38332521.0   90.413926
Texas       695662  26448193.0   38.018740
New York    141297  19651127.0  139.076746


In [138]:
df.iloc[:,:2] # rows first, columns second

              area         pop
California  423967  38332521.0
Texas       695662  26448193.0
New York    141297  19651127.0
Florida     170312  19552860.0
Illinois    149995  12882135.0
Arizona     735835         NaN


Similarly, using the ``loc`` indexer we can index the underlying data in an array-like style but using the explicit index and column names:

In [139]:
df.loc[:,:"pop"]

              area         pop
California  423967  38332521.0
Texas       695662  26448193.0
New York    141297  19651127.0
Florida     170312  19552860.0
Illinois    149995  12882135.0
Arizona     735835         NaN


In [146]:
# "prueba" is an extra column created in a cell that is not shown here;
# its values are consistent with 2 * densidad
df.loc["California": "Arizona", ["pop","prueba"]] # rows: slicing; columns: fancy indexing

                   pop      prueba
California  38332521.0  180.827852
Texas       26448193.0   76.037481
New York    19651127.0  278.153492
Florida     19552860.0  229.612241
Illinois    12882135.0  171.767526
Arizona            NaN         NaN


Any of the familiar NumPy-style data access patterns can be used within these indexers.
For example, in the ``loc`` indexer we can combine **masking** and **fancy indexing** as in the following:

In [150]:
df

              area         pop    densidad      prueba
California  423967  38332521.0   90.413926  180.827852
Texas       695662  26448193.0   38.018740   76.037481
New York    141297  19651127.0  139.076746  278.153492
Florida     170312  19552860.0  114.806121  229.612241
Illinois    149995  12882135.0   85.883763  171.767526
Arizona     735835         NaN         NaN         NaN


In [151]:
df.loc[df.densidad > 100, ['pop', 'densidad']] # rows first, columns second

                 pop    densidad
New York  19651127.0  139.076746
Florida   19552860.0  114.806121
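One detail worth noticing is that a row with a missing density never passes a comparison mask, because comparing `NaN` with a number yields `False`. The sketch below uses a reduced three-row frame to show this, along with `isna()` as one way to select the missing rows on purpose.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"pop": [19651127.0, 26448193.0, np.nan],
                   "densidad": [139.076746, 38.018740, np.nan]},
                  index=["New York", "Texas", "Arizona"])

mask = df["densidad"] > 100
print(mask.tolist())            # [True, False, False]

dense = df.loc[mask, ["pop", "densidad"]]
print(list(dense.index))        # ['New York']

# Rows with a missing density have to be requested explicitly
missing = df.loc[df["densidad"].isna(), ["pop"]]
print(list(missing.index))      # ['Arizona']
```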


Any of these indexing conventions may also be used to set or modify values; this is done in the standard way that you might be accustomed to from working with NumPy:

In [153]:
df.iloc[0,2] = 90

In [154]:
df

              area         pop    densidad      prueba
California  423967  38332521.0   90.000000  180.827852
Texas       695662  26448193.0   38.018740   76.037481
New York    141297  19651127.0  139.076746  278.153492
Florida     170312  19552860.0  114.806121  229.612241
Illinois    149995  12882135.0   85.883763  171.767526
Arizona     735835         NaN         NaN         NaN
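Assignment works through the label-based and mask-based forms as well. The following sketch, on a reduced three-row frame with made-up replacement values, sets one cell by position, one by label, and then every row matched by a Boolean mask.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"area": [423967, 695662, 735835],
                   "pop": [38332521.0, 26448193.0, np.nan]},
                  index=["California", "Texas", "Arizona"])

df.iloc[0, 1] = 38000000.0               # one cell, by position
df.loc["Texas", "area"] = 700000         # one cell, by label
df.loc[df["pop"].isna(), "pop"] = 0.0    # every row selected by a mask

print(df)
```

Doing the selection and the assignment in a single `loc` or `iloc` call, rather than chaining two bracket lookups, makes sure that the original `DataFrame` is the object being modified.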


To build up your fluency in Pandas data manipulation, I suggest spending some time with a simple ``DataFrame`` and exploring the types of indexing, slicing, masking, and fancy indexing that are allowed by these various indexing approaches.