# Data Manipulation: Indexing, Slicing, and Filtering

Topics Reviewed:

- Indexing
- Slicing
- Conditional Filtering
- Conditional Filling Data with `where()`

In [90]:
import pandas as pd
import numpy as np

np.random.seed(0)

## Indexing

Indexing refers to the action of accessing and/or editing data using indexes (labels).

The following are some ways to access data in the `DataFrame`:

| Access Method  | Code                                                   | Returned Object               |
|----------------|--------------------------------------------------------|-------------------------------|
| Access entire column | `df[<column_label>]` or `df.column_label`              | Series                        |
| Access entire row    | `df.loc[<row_label>]` or `df.iloc[<absolute_position>]` | Series                        |
| Access specific cell | `df.loc[<row_label>, <column_label>]`                   | Scalar |
|                      | `df.iloc[<absolute_row_position>, <absolute_col_position>]` | Scalar |
|                      | `df.at[<row_label>, <column_label>]`                    | Scalar                        |
|                      | `df.iat[<absolute_row_position>, <absolute_col_position>]` | Scalar                        |
| Access section of adjacent cells | `df.loc[<list_row_labels>, <list_col_labels>]`         | Series or DataFrame |
|                                   | `df.iloc[<slicing_row_positions>, <slicing_col_positions>]` | Series or DataFrame |


**NOTE:** Two differences between these accessors are worth keeping in mind:

1. The difference between `.loc[]` and `.iloc[]` (and likewise between `.at[]` and `.iat[]`) is that
    - `.loc[]` uses the row and/or column labels
    - `.iloc[]` uses the absolute integer position in the DataFrame (similar to indexing in NumPy)


2. The difference between `.loc[]` and `.at[]` (and likewise between `.iloc[]` and `.iat[]`) is that
    - `.loc[]` can return several values (as a Series or DataFrame) or a scalar
    - `.at[]` can return only a scalar
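
The two differences above can be sketched on a small frame. The frame `demo` and its values are made up for illustration; the point is only that a row label (101) and a row position (1) are different things even when they refer to the same row.

```python
import pandas as pd

# Hypothetical frame: row labels start at 100, row positions start at 0
demo = pd.DataFrame(
    {"name": ["Xavier", "Ann", "Jana"], "age": [41, 28, 33]},
    index=[100, 101, 102],
)

# Label 101 sits at position 1 and column "name" at position 0,
# so these four calls all read the same cell
by_label = demo.loc[101, "name"]
by_position = demo.iloc[1, 0]
by_label_scalar = demo.at[101, "name"]
by_position_scalar = demo.iat[1, 0]

# .loc can also return several values at once; .at is limited to one cell
several = demo.loc[[100, 101], "name"]  # Series with two entries

print(by_label, by_position, by_label_scalar, by_position_scalar)
print(type(several).__name__)
```
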

The following cells give an example of each case mentioned in the table.

In [91]:
data = {
     'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
     'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
              'Manchester', 'Cairo', 'Osaka'],
     'age': [41, 28, 33, 34, 38, 31, 37],
     'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
 }

row_labels = [100, 101, 102, 103, 104, 105, 106]
df = pd.DataFrame(data=data, index=row_labels)

df

Unnamed: 0,name,city,age,py-score
100,Xavier,Mexico City,41,88.0
101,Ann,Toronto,28,79.0
102,Jana,Prague,33,81.0
103,Yi,Shanghai,34,80.0
104,Robin,Manchester,38,68.0
105,Amal,Cairo,31,61.0
106,Nori,Osaka,37,84.0


In [92]:
# 1. Access to an entire column -> return Series
# alternative: df.name
df["name"]

100    Xavier
101       Ann
102      Jana
103        Yi
104     Robin
105      Amal
106      Nori
Name: name, dtype: object

In [93]:
# 2. Access to an entire row -> return Series
# alternative: df.iloc[1] (label 101 is at position 1)
df.loc[101]

name            Ann
city        Toronto
age              28
py-score       79.0
Name: 101, dtype: object

In [94]:
#3. Access to a specific value -> return Scalar
# alternatives (label 101 is at row position 1, 'name' at column position 0):
# df.iloc[1,0]
# df.at[101,'name']
# df.iat[1,0]
df.loc[101,'name']

'Ann'

In [95]:
#4. Access to a group of adjacent cells -> return Series
df.loc[:,"name"]

100    Xavier
101       Ann
102      Jana
103        Yi
104     Robin
105      Amal
106      Nori
Name: name, dtype: object

In [96]:
#4. Access to a group of adjacent cells -> return DataFrame
#NOTE: In this case, we are using slicing to access values. We elaborate
# more in the next section.
df.iloc[:,:2]

Unnamed: 0,name,city
100,Xavier,Mexico City
101,Ann,Toronto
102,Jana,Prague
103,Yi,Shanghai
104,Robin,Manchester
105,Amal,Cairo
106,Nori,Osaka


## Slicing

Using `.loc[]` and `.iloc[]` together with slices (or lists), you can access
different chunks of data in a DataFrame much as you would with
NumPy arrays. In other words, the slice construct (`:`) works similarly to NumPy.

When using slicing, it is important to take into account the following:

1. There is a **difference between `.loc[]` and `.iloc[]`**.
    - `.loc[]` accepts a right-inclusive slice of labels, e.g. `.loc[1:5]` returns the rows labeled 1 through 5, with 5 included.
    - `.iloc[]` accepts a right-exclusive slice of positions, e.g. `.iloc[1:5]` returns the rows at positions 1 to 4 (5 is not included). This is consistent with NumPy arrays.

2. You can **skip rows and columns** with `.iloc[]` by using a step parameter, in the same way as with NumPy arrays.

3. You can use the **slice construct (`:`) in `.loc[]`** with non-numeric labels too. For a unique index in which both endpoint labels are present, the slice follows the order of the index and includes both endpoints, so with an index of `"a"` to `"g"`, `.loc["a":"c"]` returns rows `a`, `b`, and `c`.


**NOTE**: Don’t use tuples instead of lists or integer arrays to get ordinary rows or columns.

**NOTE**: Instead of using the slicing construct (`:`), you could also use 
the built-in Python class `slice()`, as well as `np.s_[]` or `pd.IndexSlice[]`.
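
As a rough illustration of the note above, the sketch below builds the same selection four ways. The frame `demo` is hypothetical; `slice()`, `np.s_` and `pd.IndexSlice` are the existing Python, NumPy and pandas helpers, and for a one-dimensional slice each of them simply produces a `slice` object.

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for this sketch
demo = pd.DataFrame(
    {"name": ["Xavier", "Ann", "Jana", "Yi"], "age": [41, 28, 33, 34]},
    index=[100, 101, 102, 103],
)

# Positions 0 and 2 of the first column, written three equivalent ways
with_colon = demo.iloc[0:3:2, 0]
with_slice = demo.iloc[slice(0, 3, 2), 0]
with_np_s = demo.iloc[np.s_[0:3:2], 0]

# pd.IndexSlice builds a label slice, so with .loc the end label is included
with_index_slice = demo.loc[pd.IndexSlice[100:102], "name"]

print(with_colon)
print(with_index_slice)
```

Slice objects are mostly useful when the bounds are computed elsewhere and need to be stored in a variable or passed to a function.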

In [97]:
# Access to a column
# alternative: df.iloc[:,0]
df.loc[:,'name']

100    Xavier
101       Ann
102      Jana
103        Yi
104     Robin
105      Amal
106      Nori
Name: name, dtype: object

In [98]:
# 1. Access to a chunk of the DataFrame
# NOTE: we are using both slices (with :) and lists
# NOTE: loc is inclusive and includes the row 105
df.loc[100:105, ['name', 'city']]

Unnamed: 0,name,city
100,Xavier,Mexico City
101,Ann,Toronto
102,Jana,Prague
103,Yi,Shanghai
104,Robin,Manchester
105,Amal,Cairo


In [99]:
# NOTE: iloc is exclusive and doesn't include the row 105
df.iloc[0:5, [0,1]]

Unnamed: 0,name,city
100,Xavier,Mexico City
101,Ann,Toronto
102,Jana,Prague
103,Yi,Shanghai
104,Robin,Manchester


In [100]:
# 2. You can skip some rows with a step  
# NOTE: we are taking the rows at positions 0 to 5 (6 not included) with a step of 2
df.iloc[0:6:2, 0]

100    Xavier
102      Jana
104     Robin
Name: name, dtype: object

In [101]:
# 3. Label slices also work on non-numeric indexes with loc
# NOTE: after changing the index to letters, both (:) and label slices are valid
df.index = ["a","b","c","d", "e", "f", "g"]
df

# df.loc["a":"c", ['name', 'city']]  # rows a, b and c (both ends included)
df.loc[:,['name',"city"]]

Unnamed: 0,name,city
a,Xavier,Mexico City
b,Ann,Toronto
c,Jana,Prague
d,Yi,Shanghai
e,Robin,Manchester
f,Amal,Cairo
g,Nori,Osaka
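
As a quick check of how `.loc[]` treats a label slice on a string index, here is a minimal sketch on a hypothetical frame `demo`. It assumes a unique index in which both endpoint labels exist; the slice then follows the order of the index and includes both ends.

```python
import pandas as pd

# Hypothetical frame with a string index
demo = pd.DataFrame(
    {
        "name": ["Xavier", "Ann", "Jana", "Yi"],
        "city": ["Mexico City", "Toronto", "Prague", "Shanghai"],
    },
    index=["a", "b", "c", "d"],
)

# Label slice on a string index: rows a, b and c (the end label is included)
chunk = demo.loc["a":"c", ["name", "city"]]
print(chunk)
```
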


## Setting Data

You can use the accessors to modify parts of a pandas DataFrame by assigning one of the following:

1. A Python list
2. A NumPy array
3. A scalar

In [102]:
data = {
     'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
     'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
              'Manchester', 'Cairo', 'Osaka'],
     'age': [41, 28, 33, 34, 38, 31, 37],
     'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
 }

row_labels = [100, 101, 102, 103, 104, 105, 106]
df = pd.DataFrame(data=data, index=row_labels)

df

Unnamed: 0,name,city,age,py-score
100,Xavier,Mexico City,41,88.0
101,Ann,Toronto,28,79.0
102,Jana,Prague,33,81.0
103,Yi,Shanghai,34,80.0
104,Robin,Manchester,38,68.0
105,Amal,Cairo,31,61.0
106,Nori,Osaka,37,84.0


In [103]:
# setting values with a list
df.loc[:102, "py-score"] = [40, 50, 60]
df

Unnamed: 0,name,city,age,py-score
100,Xavier,Mexico City,41,40.0
101,Ann,Toronto,28,50.0
102,Jana,Prague,33,60.0
103,Yi,Shanghai,34,80.0
104,Robin,Manchester,38,68.0
105,Amal,Cairo,31,61.0
106,Nori,Osaka,37,84.0


In [104]:
# setting values with a NumPy array
df.loc[103:105, "py-score"] = np.array([7, 8, 9])
df

Unnamed: 0,name,city,age,py-score
100,Xavier,Mexico City,41,40.0
101,Ann,Toronto,28,50.0
102,Jana,Prague,33,60.0
103,Yi,Shanghai,34,7.0
104,Robin,Manchester,38,8.0
105,Amal,Cairo,31,9.0
106,Nori,Osaka,37,84.0


In [105]:
# setting value with a scalar (broadcasting)
df.loc[:,"age"] = 25
df

Unnamed: 0,name,city,age,py-score
100,Xavier,Mexico City,25,40.0
101,Ann,Toronto,25,50.0
102,Jana,Prague,25,60.0
103,Yi,Shanghai,25,7.0
104,Robin,Manchester,25,8.0
105,Amal,Cairo,25,9.0
106,Nori,Osaka,25,84.0


## Filtering
Data filtering is another powerful feature of pandas. It works similarly to indexing with Boolean arrays in NumPy.

Applying a condition to a column (`Series`) returns a `Series` of Boolean values. Passing that `Series` back to the `DataFrame` keeps only the rows where the value is `True`.

You can build more complex expressions by combining conditions with the following operators:

- **NOT** (`~`)
- **AND** (`&`)
- **OR** (`|`)
- **XOR** (`^`)
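
The operators listed above can be sketched on a small hypothetical frame `demo` (the names `young` and `high` are illustrative only). Each condition is stored as a Boolean `Series` first, then combined.

```python
import pandas as pd

# Hypothetical frame used only for this sketch
demo = pd.DataFrame(
    {"age": [41, 28, 33, 34], "py-score": [88.0, 79.0, 81.0, 80.0]},
    index=[100, 101, 102, 103],
)

young = demo["age"] < 35        # False, True, True, True
high = demo["py-score"] >= 80   # True, False, True, True

not_young = demo[~young]          # NOT: rows where age >= 35
either = demo[young | high]       # OR: rows meeting at least one condition
exactly_one = demo[young ^ high]  # XOR: rows meeting exactly one condition

print(list(not_young.index))    # [100]
print(list(either.index))       # [100, 101, 102, 103]
print(list(exactly_one.index))  # [100, 101]
```

When the conditions are written inline rather than stored in variables, each comparison needs its own parentheses, because `&`, `|` and `^` bind more tightly than `<` or `>=` in Python.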

In [106]:
data = {
     'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
     'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
              'Manchester', 'Cairo', 'Osaka'],
     'age': [41, 28, 33, 34, 38, 31, 37],
     'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
 }

row_labels = [100, 101, 102, 103, 104, 105, 106]
df = pd.DataFrame(data=data, index=row_labels)

df

Unnamed: 0,name,city,age,py-score
100,Xavier,Mexico City,41,88.0
101,Ann,Toronto,28,79.0
102,Jana,Prague,33,81.0
103,Yi,Shanghai,34,80.0
104,Robin,Manchester,38,68.0
105,Amal,Cairo,31,61.0
106,Nori,Osaka,37,84.0


In [107]:
# 1. Filtering people with age < 35
# NOTE: df["age"] returns a Series, and comparing it with < 35 yields a Series of booleans
df[df["age"] < 35]

Unnamed: 0,name,city,age,py-score
101,Ann,Toronto,28,79.0
102,Jana,Prague,33,81.0
103,Yi,Shanghai,34,80.0
105,Amal,Cairo,31,61.0


In [108]:
# 2. Complex Filtering
# Filter people with age < 35 and py-score >= 80
df[(df["age"] < 35) & (df["py-score"] >= 80)]

Unnamed: 0,name,city,age,py-score
102,Jana,Prague,33,81.0
103,Yi,Shanghai,34,80.0


## Conditional Filling Data

You can use the `.where(cond, other)` method, which keeps the values where `cond` is `True` and replaces the values where it is `False`.

- `cond` specifies the condition, given as a `Series` of booleans. It can also be a callable.
- `other` specifies the replacement value used where `cond` is `False`. It can also be a callable.

**NOTE:** Callables for `cond` and `other` are outside the scope of this tutorial.
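
A minimal sketch of the two parameters, using a hypothetical `Series` named `scores`. It also shows what happens when `other` is left out: pandas then fills the replaced positions with `NaN`.

```python
import pandas as pd

# Hypothetical scores used only for this sketch
scores = pd.Series([88.0, 79.0, 81.0], index=[100, 101, 102])

# Values are kept where the condition is True and replaced where it is False
kept_or_zero = scores.where(scores >= 80, other=0)

# Without `other`, the replaced positions become NaN
kept_or_nan = scores.where(scores >= 80)

print(kept_or_zero.tolist())  # [88.0, 0.0, 81.0]
print(kept_or_nan.isna().tolist())  # [False, True, False]
```

`.where()` returns a new object and leaves `scores` unchanged, which is why the result has to be assigned back to keep it.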

In [112]:
# 3. Replacing cells where condition is False
df['py-score'] = df['py-score'].where(cond = df["py-score"] >= 80, other=0)
df

Unnamed: 0,name,city,age,py-score
100,Xavier,Mexico City,41,88.0
101,Ann,Toronto,28,0.0
102,Jana,Prague,33,81.0
103,Yi,Shanghai,34,80.0
104,Robin,Manchester,38,0.0
105,Amal,Cairo,31,0.0
106,Nori,Osaka,37,84.0
