In [2]:
import numpy as np
import pandas as pd

In [3]:
np.random.seed(0)

# Essential Basic Functionality

This notebook provides an introduction to the fundamental mechanics of the pandas package. We'll cover indexing, filtering, arithmetic and binary operations, function application, descriptive statistics, sorting, ranking, and reindexing. We'll also explore some useful concepts such as Copy-on-Write (CoW) and data alignment.

We won't cover every capability of the methods presented. Instead, we'll focus on the commonly used features, so that you gain the practical skills needed to handle real-world datasets efficiently.

For comprehensive information, refer to the [official pandas documentation](https://pandas.pydata.org/docs/user_guide/basics.html).

# Indexing and Selection

Indexing in pandas refers to selecting specific segments of data, such as rows and columns of a DataFrame. There are several ways to do this, categorized as follows:

- **Selection by label:** This involves using axis labels (label-based indexing). Pandas provides different methods for this, depending on whether you're working with a Series or a DataFrame. Key points include:
    - Every label you request must exist in the index, or you'll get a `KeyError`.
    - When slicing, both the start and stop bounds are included if they exist in the index.
    - Integers can be used as labels, but they refer to the label, not the position.

- **Selection by position:** This involves using integer positions (position-based indexing). Pandas also provides methods for this, which vary depending on the data structure. Key points include:
    - It follows 0-based indexing, similar to Python and NumPy.
    - When slicing, the start bound is included, and the upper bound is excluded.
    - Using a non-integer, even if it's a valid label, raises an error (an `IndexError` or a `TypeError`, depending on the indexer).

- **Selection by callable:** Methods like `.loc`, `.iloc`, and `[]` can accept a callable as an indexer. The callable must be a function with one argument (the calling Series or DataFrame) that returns valid output for indexing.
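
As a minimal sketch of the three selection styles listed above (the Series `s` and its labels are made up for illustration), note how a label slice includes its stop bound while a positional slice does not:

```python
import pandas as pd

# Hypothetical data, used only to illustrate the three selection styles.
s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = s.loc["a":"c"]              # label slice: both bounds are included
by_position = s.iloc[0:2]              # positional slice: the stop bound is excluded
by_callable = s.loc[lambda x: x > 20]  # the callable receives the Series itself

print(by_label.index.tolist())     # ['a', 'b', 'c']
print(by_position.index.tolist())  # ['a', 'b']
print(by_callable.index.tolist())  # ['c', 'd']
```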

`.loc` and `.iloc` are the recommended indexing methods for accessing data within a DataFrame or Series.

- `.loc` is used for label-based indexing. It allows you to access rows and columns by labels, boolean arrays, label slicing, alignable boolean Series, and other methods (check the [documentation](https://pandas.pydata.org/docs/user_guide/indexing.html#selection-by-label) for full information).

- `.iloc` is used for position-based indexing. It allows you to access data by integer positions, similar to how you would with a list or an array in Python. It expects integer positions for rows and columns and also supports slicing with integer positions or boolean arrays.

The reason to prefer these methods is the way the `[]` operator treats integers: selecting values with `[]` is label-oriented, whereas slicing with integers is always position-oriented. To avoid this ambiguity, prefer indexing with `.loc` and `.iloc`.

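> **NOTE:**
> You can also retrieve a DataFrame column using dot attribute notation (for example, `df.beauty`). However, this is not recommended, because a column name can clash with an existing DataFrame attribute or method.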

## Series

Starting with Series, the `[]` indexing operator works in a label-based manner.

In [3]:
# Indexing on Series

s = pd.Series(np.arange(4.), index=[3, 1, 2, 0])
s

3    0.0
1    1.0
2    2.0
0    3.0
dtype: float64

In [4]:
# Label-based indexing
# of a single label

s.loc[3]

0.0

In [5]:
# List of labels

s.loc[[3,0]]

3    0.0
0    3.0
dtype: float64

In [6]:
# Slice of labels (included)

s.loc[3:2]

3    0.0
1    1.0
2    2.0
dtype: float64

In [7]:
# Alignable boolean Series

s.loc[pd.Series([True, False, True, False], index=[3, 1, 2, 0])]

3    0.0
2    2.0
dtype: float64

In [8]:
# Label-based indexing vs. regular [] indexing

s1 = pd.Series(np.arange(4.), index=[0, 1, 2, 3])
s2 = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

In [9]:
s1

0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64

In [10]:
s2

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [11]:
s1[[0, 1, 2]]

0    0.0
1    1.0
2    2.0
dtype: float64

In [12]:
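# s2 has no integer labels, so [] falls back to positional lookup here.
# Recent pandas versions deprecate this fallback and emit a FutureWarning;
# s2.iloc[[0, 1, 2]] is the explicit equivalent.

s2[[0, 1, 2]]
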
a    0.0
b    1.0
c    2.0
dtype: float64

In [13]:
# With loc, the expression s2.loc[[0, 1, 2]] does not work
# because loc indexes exclusively by label,
# and s2 does not contain integer labels.

s2.loc[[0, 1, 2]]

KeyError: "None of [Index([0, 1, 2], dtype='int64')] are in the [index]"

In [None]:
# Position-based indexing

s = pd.Series(np.arange(4.), index=[3, 1, 2, 0])
s

In [None]:
s.iloc[3]

In [14]:
s.iloc[[0, 3]]

3    0.0
0    3.0
dtype: float64

In [15]:
s.iloc[:2]

3    0.0
1    1.0
dtype: float64

In [16]:
s.iloc[[True, False, True, False]]

3    0.0
2    2.0
dtype: float64

## DataFrames

Indexing into a DataFrame with the `[]` operator retrieves one or more columns, given either a single label or a list of labels.

In [17]:
df = pd.DataFrame(
    data=np.arange(16).reshape((4, 4)),
    index=["USP", "UFMG", "Unicamp", "UFSCar"],
    columns=["beauty", "students", "parties", "tuscas"]
)
df

Unnamed: 0,beauty,students,parties,tuscas
USP,0,1,2,3
UFMG,4,5,6,7
Unicamp,8,9,10,11
UFSCar,12,13,14,15


In [18]:
# Note that a single column label returns a Series,
# while a list of column labels returns a DataFrame

df['beauty']

USP         0
UFMG        4
Unicamp     8
UFSCar     12
Name: beauty, dtype: int64

In [19]:
df[['beauty']]

Unnamed: 0,beauty
USP,0
UFMG,4
Unicamp,8
UFSCar,12


In [20]:
df[['students', 'tuscas']]

Unnamed: 0,students,tuscas
USP,1,3
UFMG,5,7
Unicamp,9,11
UFSCar,13,15


In [21]:
# The slicing occurs over rows, not columns!

df[:2]

Unnamed: 0,beauty,students,parties,tuscas
USP,0,1,2,3
UFMG,4,5,6,7


In [22]:
# Label-based indexing using loc

df.loc['UFSCar']

beauty      12
students    13
parties     14
tuscas      15
Name: UFSCar, dtype: int64

In [23]:
# Similarly, a single row label returns a Series,
# while a list of row labels returns a DataFrame

df.loc[['UFSCar', 'USP']]

Unnamed: 0,beauty,students,parties,tuscas
UFSCar,12,13,14,15
USP,0,1,2,3


In [24]:
# Label-based indexing to retrieve rows and columns

df.loc[['UFSCar', 'USP'], 'tuscas']

UFSCar    15
USP        3
Name: tuscas, dtype: int64

In [25]:
df.loc[['UFSCar', 'USP'], ['beauty', 'tuscas']]

Unnamed: 0,beauty,tuscas
UFSCar,12,15
USP,0,3


In [26]:
# Position-based indexing

df.iloc[3]

beauty      12
students    13
parties     14
tuscas      15
Name: UFSCar, dtype: int64

In [27]:
df.iloc[[0,3]]

Unnamed: 0,beauty,students,parties,tuscas
USP,0,1,2,3
UFSCar,12,13,14,15


In [28]:
df.iloc[[0,3], 0:3]

Unnamed: 0,beauty,students,parties
USP,0,1,2
UFSCar,12,13,14


In [29]:
# Combining both label-based and position-based indexing

df.loc[df.index[:2], 'students']

USP     1
UFMG    5
Name: students, dtype: int64

In [30]:
df.iloc[:2, df.columns.get_loc('students')]

USP     1
UFMG    5
Name: students, dtype: int64

In [31]:
# Using a callable on a Series resulting from a column selection on a DataFrame

df['beauty'].loc[lambda beauty: beauty >= 10]

UFSCar    12
Name: beauty, dtype: int64

There are many ways to select and rearrange the data contained in a pandas object. These were just examples of the most common practices.

For more information and examples, take a look at the [official pandas tutorial](https://pandas.pydata.org/docs/user_guide/indexing.html).

# Filtering and Assignment

Indexing is most often used to filter data based on conditions and to assign values to the rows or columns that meet them.

To perform comparisons, pandas provides comparison methods alongside the overloaded Python operators, as shown in the table below. You can combine the resulting boolean objects with `|` (or), `&` (and), and `~` (not) for more advanced filtering.

| Method | Python Operator |
|--------|-----------------|
| eq     | ==              |
| ne     | !=              |
| lt     | <               |
| le     | <=              |
| gt     | >               |
| ge     | >=              |

By using alignable boolean vectors to filter data (a technique called masking), you can efficiently select and modify subsets of the DataFrame.

In [32]:
df = pd.DataFrame(
    data=np.arange(16).reshape((4, 4)),
    index=["A", "B", "C", "D"],
    columns=["beauty", "students", "parties", "tuscas"]
)
df

Unnamed: 0,beauty,students,parties,tuscas
A,0,1,2,3
B,4,5,6,7
C,8,9,10,11
D,12,13,14,15


In [33]:
# The comparison '>=' returns a boolean Series aligned with the index of df.
# We can use it to filter rows based on the desired criteria.

df['students'] >= 10

A    False
B    False
C    False
D     True
Name: students, dtype: bool

In [34]:
# Alternatively:
# mask = df['students'] >= 10
# df[mask]

# Very common practice

df[df['students'] >= 10]

Unnamed: 0,beauty,students,parties,tuscas
D,12,13,14,15


Always group your conditions with parentheses! The overloaded operators follow Python's default precedence rules, in which `&` binds more tightly than the comparison operators, so parentheses are needed to guarantee the intended order of evaluation. For example, Python evaluates `df['students'] >= 1 & df['tuscas'] <= 5` as the chained comparison `df['students'] >= (1 & df['tuscas']) <= 5`, while the intended expression is `(df['students'] >= 1) & (df['tuscas'] <= 5)`.
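
The following sketch, on a small made-up DataFrame, shows what happens in practice. It assumes a reasonably recent pandas version, where the unparenthesized form fails with a `ValueError` because Python ends up asking for the truth value of a whole Series:

```python
import pandas as pd

# Hypothetical data with the same column names as the example above.
df = pd.DataFrame({"students": [1, 5, 9, 13], "tuscas": [3, 7, 11, 15]})

try:
    # Parsed as: df["students"] >= (1 & df["tuscas"]) <= 5
    df[df["students"] >= 1 & df["tuscas"] <= 5]
    outcome = "no error"
except ValueError:
    # The chained comparison needs bool(Series), which is ambiguous.
    outcome = "ValueError"

# With parentheses, each comparison is evaluated before '&' combines them.
filtered = df[(df["students"] >= 1) & (df["tuscas"] <= 5)]

print(outcome)
print(filtered.index.tolist())
```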

In [35]:
# Using the [] operator (very common practice)

df[(df['students']>=1) & (df['tuscas']<=5)]

Unnamed: 0,beauty,students,parties,tuscas
A,0,1,2,3


In [36]:
# Using loc (also very common)

df.loc[(df['students']>=1) & (df['tuscas']<=5)]

Unnamed: 0,beauty,students,parties,tuscas
A,0,1,2,3


In [37]:
# Once we can filter the data based on custom conditions,
# we can also modify the selected rows.

df[(df['students']>=1) & (df['tuscas']<=5)] = 0
df

Unnamed: 0,beauty,students,parties,tuscas
A,0,0,0,0
B,4,5,6,7
C,8,9,10,11
D,12,13,14,15


In [38]:
df[df > 10] = 1000
df

Unnamed: 0,beauty,students,parties,tuscas
A,0,0,0,0
B,4,5,6,7
C,8,9,10,1000
D,1000,1000,1000,1000


## Assignment

Assigning to a column that doesn't exist creates a new column. To assign a list or array to a column, its length must match the length of the DataFrame. If you assign a Series, its labels are realigned exactly to the DataFrame's index, and missing values are inserted for any index labels the Series does not contain.
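
As a small sketch of both rules (the data and the column name `too_short` are made up for illustration), a Series is matched to rows by label rather than by order, while a list of the wrong length is rejected:

```python
import pandas as pd

df = pd.DataFrame({"beauty": [0, 4, 8, 12]}, index=["A", "B", "C", "D"])

# A Series is matched to rows by label, not by order; 'D' has no match.
df["rating"] = pd.Series([1, 2, 3], index=["C", "A", "B"])
print(df["rating"].tolist())  # [2.0, 3.0, 1.0, nan]

# A list must provide exactly one value per row.
try:
    df["too_short"] = [1, 2, 3]
    length_error = False
except ValueError:
    length_error = True
print(length_error)  # True
```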

In [39]:
df['rating'] = 5
df

Unnamed: 0,beauty,students,parties,tuscas,rating
A,0,0,0,0,5
B,4,5,6,7,5
C,8,9,10,1000,5
D,1000,1000,1000,1000,5


In [40]:
df['rating'] = np.arange(4)
df

Unnamed: 0,beauty,students,parties,tuscas,rating
A,0,0,0,0,0
B,4,5,6,7,1
C,8,9,10,1000,2
D,1000,1000,1000,1000,3


In [41]:
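# The Series below has the default integer index (0, 1, 2), which shares no
# labels with the index of df (A to D), so every row receives a missing value.
# Compare with: df['rating'] = pd.Series([1, 2, 3], index=['A', 'B', 'C'])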

df['rating'] = pd.Series([1, 2, 3])
df

Unnamed: 0,beauty,students,parties,tuscas,rating
A,0,0,0,0,
B,4,5,6,7,
C,8,9,10,1000,
D,1000,1000,1000,1000,


## Caution: Views, Copies and Copy-on-Write

When manipulating data with pandas, it's important to understand the distinction between views and copies. Some operations return views of the underlying data, while others create actual copies. This behavior can lead to unintended mutations if not understood correctly.

Pandas issues a `SettingWithCopyWarning` to alert users that they might be assigning to a copy of a slice, which is often unintentional. This warning can sometimes be a false positive, but it mostly helps users identify and catch potential bugs.

To address these inconsistencies, pandas introduced a Copy-on-Write (CoW) feature in version 1.5, which will become the default behavior in pandas 3.0.

In [42]:
df = pd.DataFrame({
    'dish': ['Salad', 'Pasta', 'Ice Cream', 'Cake', 'Steak'],
    'total_price': [36.00, 120.00, 14.5, 54.00, 172.5],
    'quantity': [3, 2, 1, 4, 3],
})
df

Unnamed: 0,dish,total_price,quantity
0,Salad,36.0,3
1,Pasta,120.0,2
2,Ice Cream,14.5,1
3,Cake,54.0,4
4,Steak,172.5,3


In [43]:
# Get a view and change it
# (warning expected)

salad = df['dish']
salad.iloc[0] = 'oops'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  salad.iloc[0] = 'oops'


In [44]:
salad

0         oops
1        Pasta
2    Ice Cream
3         Cake
4        Steak
Name: dish, dtype: object

In [45]:
# Side-effect !
df

Unnamed: 0,dish,total_price,quantity
0,oops,36.0,3
1,Pasta,120.0,2
2,Ice Cream,14.5,1
3,Cake,54.0,4
4,Steak,172.5,3


The effects of view-versus-copy behavior are especially common with chained indexing, which refers to accessing or setting values in a pandas object through two consecutive indexing operations.

Assigning a value to the result of chained indexing usually leads to unexpected results, since pandas doesn't guarantee whether the first operation returns a view of the underlying data or a copy.

The outcome depends on several technical factors, most notably the internal memory layout of the data, especially when pandas uses NumPy as its backend.

For example:

In [46]:
# Chained indexing

df['total_price'][df['total_price'] > 100] = 99
df

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df['total_price'][df['total_price'] > 100] = 99
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['total_price

Unnamed: 0,dish,total_price,quantity
0,oops,36.0,3
1,Pasta,99.0,2
2,Ice Cream,14.5,1
3,Cake,54.0,4
4,Steak,99.0,3


In [47]:
with pd.option_context("mode.copy_on_write", True):
    df['total_price'][df['total_price'] > 100] = 99
    display(df)

/var/folders/b6/l0rr20x91ys1hhpzy1g789380000gn/T/ipykernel_54871/202190318.py:2: ChainedAssignmentError: A value is trying to be set on a copy of a DataFrame or Series through chained assignment.
When using the Copy-on-Write mode, such chained assignment never works to update the original DataFrame or Series, because the intermediate object on which we are setting values always behaves as a copy.

Try using '.loc[row_indexer, col_indexer] = value' instead, to perform the assignment in a single step.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['total_price'][df['total_price'] > 100] = 99


Unnamed: 0,dish,total_price,quantity
0,oops,36.0,3
1,Pasta,99.0,2
2,Ice Cream,14.5,1
3,Cake,54.0,4
4,Steak,99.0,3


In [48]:
with pd.option_context("mode.copy_on_write", True):
    df.loc[df['total_price'] > 100, 'total_price'] = 99
    display(df)

Unnamed: 0,dish,total_price,quantity
0,oops,36.0,3
1,Pasta,99.0,2
2,Ice Cream,14.5,1
3,Cake,54.0,4
4,Steak,99.0,3


### Copy-on-Write

Copy-on-Write (CoW) ensures that any DataFrame or Series derived from another always behaves as a copy. As a result, you can only change the values of an object by modifying the object itself. CoW prevents in-place updates to a DataFrame or Series that shares data with another object.

Under CoW, chained assignment never updates the original object, so we must always assign through `.loc` instead.

Because `.loc` performs the selection and the assignment in a single operation, it updates the original DataFrame or Series without causing unintended side effects.
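
The sketch below shows a pattern that behaves the same with or without CoW enabled: take an explicit `.copy()` when you want an independent object, and use a single `.loc` call when you want to update the original. The data mirrors the menu example above.

```python
import pandas as pd

df = pd.DataFrame({
    "dish": ["Salad", "Pasta", "Ice Cream", "Cake", "Steak"],
    "total_price": [36.0, 120.0, 14.5, 54.0, 172.5],
})

# An explicit copy can be modified freely without touching df.
prices = df["total_price"].copy()
prices.iloc[0] = 0.0

# A single .loc call selects rows and column and assigns in one step.
df.loc[df["total_price"] > 100, "total_price"] = 99.0

print(df["total_price"].tolist())  # [36.0, 99.0, 14.5, 54.0, 99.0]
print(prices.tolist())             # [0.0, 120.0, 14.5, 54.0, 172.5]
```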

> Although CoW also has implications for performance and memory management, we won't cover them here.

## `where()` and `mask()` methods

The `where(cond, other, ...)` method implements if-then logic, allowing conditional replacement of elements in a DataFrame. For each element in the DataFrame, if `cond` is True, the original element is kept; otherwise, the corresponding element from `other` is used. If the axis of `other` does not align with the axis of the condition (`cond`), the misaligned index positions will be filled with `False`.

By default, `where()` returns a modified copy of the DataFrame. To modify the original DataFrame in place, set `inplace=True`.

Alternatively, we can use the `mask()` method, which is the inverse of `where()`: it replaces the values where `cond` is True.
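
A minimal sketch of this relationship, using a made-up Series: calling `where()` with a condition gives the same result as calling `mask()` with the negated condition.

```python
import pandas as pd

s = pd.Series([-2, -1, 0, 1, 2])

# Keep negative values; replace everything else with 0.
kept_negative = s.where(s < 0, 0)

# mask() replaces where the condition is True, so negate it to match where().
same_result = s.mask(~(s < 0), 0)

print(kept_negative.tolist())             # [-2, -1, 0, 0, 0]
print(kept_negative.equals(same_result))  # True
```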

In [49]:
df = pd.DataFrame(np.random.uniform(-1, 1, size=(5,5)))
df

Unnamed: 0,0,1,2,3,4
0,0.097627,0.430379,0.205527,0.089766,-0.15269
1,0.291788,-0.124826,0.783546,0.927326,-0.233117
2,0.58345,0.05779,0.136089,0.851193,-0.857928
3,-0.825741,-0.959563,0.66524,0.556314,0.740024
4,0.957237,0.598317,-0.077041,0.561058,-0.763451


In [50]:
# Where cond is True, keep the original element.
# Otherwise, use other (default is NaN)

df.where(df < 0)

Unnamed: 0,0,1,2,3,4
0,,,,,-0.15269
1,,-0.124826,,,-0.233117
2,,,,,-0.857928
3,-0.825741,-0.959563,,,
4,,,-0.077041,,-0.763451


In [51]:
# which is equivalent to:
# df2 = df.copy()
# df2[~(df2 < 0)] = 0
# df2

df.where(df < 0, 0)

Unnamed: 0,0,1,2,3,4
0,0.0,0.0,0.0,0.0,-0.15269
1,0.0,-0.124826,0.0,0.0,-0.233117
2,0.0,0.0,0.0,0.0,-0.857928
3,-0.825741,-0.959563,0.0,0.0,0.0
4,0.0,0.0,-0.077041,0.0,-0.763451


In [52]:
# Opposite of the where() method.
# It replaces the values where the condition evaluates to True.

df.mask(df < 0, 0)

Unnamed: 0,0,1,2,3,4
0,0.097627,0.430379,0.205527,0.089766,0.0
1,0.291788,0.0,0.783546,0.927326,0.0
2,0.58345,0.05779,0.136089,0.851193,0.0
3,0.0,0.0,0.66524,0.556314,0.740024
4,0.957237,0.598317,0.0,0.561058,0.0


# Arithmetic and Binary Operations

Performing element-wise operations such as arithmetic, boolean comparisons, and binary operations is among the most common day-to-day data manipulation tasks.

Pandas DataFrames and Series offer several methods for these operations, including `add()`, `sub()`, `mul()`, `div()`, and their corresponding reverse operations `radd()`, `rsub()`, etc. These methods, along with overloaded Python operators, enable binary operations between DataFrame/Series and various data types (scalar, sequence, Series, or DataFrame).

| Method              | Python Operator |
|---------------------|-----------------|
| add, radd           | +               |
| sub, rsub           | -               |
| div, rdiv           | /               |
| floordiv, rfloordiv | //              |
| mul, rmul           | *               |
| pow, rpow           | **              |

These methods include three optional parameters to control their behavior:

- `axis:` Whether to align by index (`0` or `'index'`) or by columns (`1` or `'columns'`); `'columns'` is the default. For Series input, this is the axis on which to match the Series index.
- `level:` Enables broadcasting across a specified level in a MultiIndex, matching index values at the given level.
- `fill_value:` Specifies a value that replaces missing (NaN) values before the operation is performed. If both corresponding values in the DataFrame or Series are missing, the result will also be missing. The default is `None`.
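
As a brief sketch on made-up data, the method forms match their operator counterparts, the reverse variants swap the operands, and `axis` decides which labels a Series is matched against:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Method and operator forms give the same result.
print(df.sub(1).equals(df - 1))     # True

# The reverse variant swaps the operands: rsub(10) computes 10 - df.
print(df.rsub(10).equals(10 - df))  # True

# axis='index' matches the Series labels against the row labels.
per_row = pd.Series([1, 2])
by_row = df.sub(per_row, axis="index")
print(by_row.values.tolist())       # [[0, 2], [0, 2]]
```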

In [53]:
df = pd.DataFrame(np.ones((4,4)), columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
0,1.0,1.0,1.0,1.0
1,1.0,1.0,1.0,1.0
2,1.0,1.0,1.0,1.0
3,1.0,1.0,1.0,1.0


In [54]:
s = pd.Series(np.ones((4,)))
s

0    1.0
1    1.0
2    1.0
3    1.0
dtype: float64

In [55]:
df + 1

Unnamed: 0,A,B,C,D
0,2.0,2.0,2.0,2.0
1,2.0,2.0,2.0,2.0
2,2.0,2.0,2.0,2.0
3,2.0,2.0,2.0,2.0


In [56]:
df['A'] * 10

0    10.0
1    10.0
2    10.0
3    10.0
Name: A, dtype: float64

In [57]:
(df['A'] + df['B']) * (df['C'] - 2)

0   -2.0
1   -2.0
2   -2.0
3   -2.0
dtype: float64

In [58]:
df['A'] / s

0    1.0
1    1.0
2    1.0
3    1.0
dtype: float64

In [59]:
df['A'] + s

0    2.0
1    2.0
2    2.0
3    2.0
dtype: float64


Note that these operations return a new object. To update a column of a DataFrame, for instance, you need to assign the result of the operation back to it.

In [60]:
# Note that the operations return a copy of the original data.
# If you want to update a column of a DataFrame, for instance,
# you need to assign the result of the operation to it.

df['A'] = df['A'] + s
df

Unnamed: 0,A,B,C,D
0,2.0,1.0,1.0,1.0
1,2.0,1.0,1.0,1.0
2,2.0,1.0,1.0,1.0
3,2.0,1.0,1.0,1.0


## Data Alignment

In pandas, operations like arithmetic, indexing, and filtering rely on data alignment. This means that these operations are performed based on the matching of indexes and columns between the involved DataFrames or Series.

Data alignment ensures that operations occur only on elements that share the same index and column labels. If the indexes or columns do not align, pandas will reindex the data structures to enforce alignment before performing the operation. As a result, any mismatched labels will produce NaN values in the output, indicating the absence of corresponding data points.

By preserving the alignment of indexes and columns, pandas ensures that operations maintain data context and consequently prevents errors that might occur when working with heterogeneous or misaligned data.

For example, to illustrate data alignment behavior, consider the following DataFrames:

In [61]:
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
df2 = pd.DataFrame({'B': [7, 8, 9], 'C': [10, 11, 12]}, index=['b', 'c', 'd'])

In [62]:
df1

Unnamed: 0,A,B
a,1,4
b,2,5
c,3,6


In [63]:
df2

Unnamed: 0,B,C
b,7,10
c,8,11
d,9,12


In [64]:
df1 + df2

Unnamed: 0,A,B,C
a,,,
b,,12.0,
c,,14.0,
d,,,
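
The reindexing step can be made visible with the `align()` method. The sketch below rebuilds the two DataFrames from this section and shows that both are first expanded to the union of their labels, after which the addition only finds values on both sides for column `B` in rows `b` and `c`:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"B": [7, 8, 9], "C": [10, 11, 12]}, index=["b", "c", "d"])

# align() returns both objects reindexed to the union of their labels.
left, right = df1.align(df2)
print(left.index.tolist())    # ['a', 'b', 'c', 'd']
print(left.columns.tolist())  # ['A', 'B', 'C']

# Adding the aligned frames fills a cell only where both sides have a value.
total = left + right
print(total.loc["b", "B"])    # 12.0
```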


## Broadcasting DataFrames and Series

Broadcasting allows performing arithmetic operations between arrays of different shapes under certain conditions, making it seem as if the arrays had the same shape. The smaller array is "broadcast" across the larger array to make their shapes compatible.

In pandas, broadcasting enables interaction between list-like data structures, scalar values, Series, and DataFrames. Specifically, for Series and DataFrames, broadcasting is managed by the `axis` parameter, allowing element-wise operations across rows and columns.

This means that when performing operations like addition (`add()`) or subtraction (`sub()`) between a Series and a DataFrame, the alignment of the Series index with the DataFrame's columns is necessary for row-wise operations (`axis=1`) to run successfully. In this case, the Series is broadcast across each row of the DataFrame, applying the operation element-wise to each corresponding column. Conversely, for column-wise operations (`axis=0`), the Series is broadcast across each column, and its index should align with the DataFrame's index.

For example, consider the following scenario:

In [65]:
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
df

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6


In [66]:
s = pd.Series([10, 20, 30])
s

0    10
1    20
2    30
dtype: int64

In [67]:
df.add(s, axis='index')

Unnamed: 0,A,B
0,11,14
1,22,25
2,33,36


In [68]:
df.add(s, axis='columns')

Unnamed: 0,A,B,0,1,2
0,,,,,
1,,,,,
2,,,,,


In [69]:
s = pd.Series([10, 20], index=['A', 'B'])
s

A    10
B    20
dtype: int64

In [70]:
df.sub(s, axis=1)

Unnamed: 0,A,B
0,-9,-16
1,-8,-15
2,-7,-14
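
One common practical use of this kind of broadcasting is centering data. As a sketch on the same small DataFrame, the column means are labeled by column, so they broadcast across rows, while the row means are labeled by row, so they broadcast across columns:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Column means are indexed by column label, so they broadcast across rows.
col_centered = df.sub(df.mean(), axis="columns")

# Row means are indexed by row label, so they broadcast across columns.
row_centered = df.sub(df.mean(axis="columns"), axis="index")

print(col_centered["A"].tolist())    # [-1.0, 0.0, 1.0]
print(row_centered.loc[0].tolist())  # [-1.5, 1.5]
```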


## Filling Missing Data

Missing data is common in real-world datasets, and it particularly affects arithmetic operations. After all, what should the result of `1 + NaN` be? To handle such cases, pandas arithmetic methods provide a fallback value to substitute when one of the values at a location is missing, which is controlled by the `fill_value` parameter.

`fill_value` accepts a float or `None` (the default) and ensures that missing values are filled before the operation is performed, so the calculation proceeds even in the presence of `NaN`s.

In [71]:
df1 = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8]
})
df2 = pd.DataFrame({
    'A': [np.nan, 2, 3, 4],
    'B': [5, 6, np.nan, 8]
})

In [72]:
df1

Unnamed: 0,A,B
0,1.0,5.0
1,2.0,
2,,7.0
3,4.0,8.0


In [73]:
df2

Unnamed: 0,A,B
0,,5.0
1,2.0,6.0
2,3.0,
3,4.0,8.0


In [74]:
df1.add(df2, fill_value=0)

Unnamed: 0,A,B
0,1.0,10.0
1,4.0,6.0
2,3.0,7.0
3,8.0,16.0
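
Keep in mind that `fill_value` only helps when at least one side has a value. A minimal sketch with two made-up Series: the position where both inputs are missing stays missing.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, np.nan])
b = pd.Series([10.0, 20.0, np.nan])

# Position 1: only 'a' is missing, so it is treated as 0.
# Position 2: both are missing, so the result stays NaN.
result = a.add(b, fill_value=0)
print(result.tolist())  # [11.0, 20.0, nan]
```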


## Boolean Reductions

Boolean reductions are methods that reduce a DataFrame or Series to boolean values based on a condition: one value per column or row for a DataFrame, or a single value for a Series or when reducing the whole DataFrame. These operations evaluate whether all or any elements in the data structure meet a specified criterion.

The methods available for this are `all()` and `any()`. Both methods accept the following parameters:

- `axis:` Specifies which axis or axes to reduce. For Series, this parameter is unused and defaults to `0` (or `'index'`).
- `bool_only:` Includes only boolean columns. Not implemented for Series.
- `skipna:` Excludes null values. If the entire row/column is null and `skipna=True`, the result is the same as for an empty row or column: `True` for `all()` and `False` for `any()`. With `skipna=False`, null values are treated as `True` because they are not equal to zero.
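
The two methods differ on empty or all-missing input: `all()` returns `True` (no element violates the condition), whereas `any()` returns `False` (no element satisfies it). The sketch below, on made-up Series, illustrates this together with the effect of `skipna=False`:

```python
import numpy as np
import pandas as pd

empty = pd.Series([], dtype="float64")
only_nan = pd.Series([np.nan])

# Empty input: all() is vacuously True, any() is False.
print(bool(empty.all()), bool(empty.any()))        # True False

# With the default skipna=True, an all-NaN Series behaves like an empty one.
print(bool(only_nan.all()), bool(only_nan.any()))  # True False

# With skipna=False, NaN counts as True because it is not equal to zero.
print(bool(only_nan.any(skipna=False)))            # True
```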

In [75]:
df = pd.DataFrame(
    np.random.uniform(-1, 1, size=(5,5)),
    columns=['A', 'B', 'C', 'D', 'E'],
)
df

Unnamed: 0,A,B,C,D,E
0,0.279842,-0.713293,0.889338,0.043697,-0.170676
1,-0.470889,0.548467,-0.087699,0.136868,-0.96242
2,0.235271,0.224191,0.233868,0.887496,0.363641
3,-0.280984,-0.125936,0.395262,-0.879549,0.333533
4,0.341276,-0.579235,-0.742147,-0.369143,-0.272578


In [76]:
df > 0

Unnamed: 0,A,B,C,D,E
0,True,False,True,True,False
1,False,True,False,True,False
2,True,True,True,True,True
3,False,False,True,False,True
4,True,False,False,False,False


In [77]:
# Default axis='index'

(df > 0).all()

A    False
B    False
C    False
D    False
E    False
dtype: bool

In [78]:
(df > 0).all(axis='columns')

0    False
1    False
2     True
3    False
4    False
dtype: bool

In [79]:
# When axis=None, reduce the whole DataFrame

(df > 0).all(axis=None)

False

In [80]:
(df > 0).any()

A    True
B    True
C    True
D    True
E    True
dtype: bool

In [81]:
(df > 0).any(axis='columns')

0    True
1    True
2    True
3    True
4    True
dtype: bool

In [82]:
# Reducing a Series
# always returns a single scalar
# (the behavior is the same for axis='index' and axis=None)

(df['A'] > 0).all()

False

In [83]:
(df.loc[0, :] > 0).all()

False

# Iteration

In pandas, you can iterate over Series and DataFrames, with behavior varying based on the object in question. Iterating over a Series treats it as array-like, producing values. DataFrames, on the other hand, follow a dict-like convention, iterating over the column labels.

To summarize, basic iteration (`for i in object`) yields:

- **Series:** values
- **DataFrame:** column labels

To iterate over DataFrame rows, you can use these methods:

- **iterrows():** Iterates over DataFrame rows as (index, Series) pairs. This converts rows to Series objects, which can change data types and affect performance.
- **itertuples():** Iterates over DataFrame rows as namedtuples of the values. This method is much faster than iterrows() and is generally preferable for accessing DataFrame values.

However, keep in mind that iterating through pandas objects is **VERY SLOW**. Manual row iteration is usually **unnecessary** and should be **avoided**.

In [84]:
df = pd.DataFrame({
    'dish': ['Salad', 'Pasta', 'Ice Cream', 'Cake', 'Steak'],
    'total_price': [36.00, 120.00, 14.5, 54.00, 172.5],
    'quantity': [3, 2, 1, 4, 3],
})
df

Unnamed: 0,dish,total_price,quantity
0,Salad,36.0,3
1,Pasta,120.0,2
2,Ice Cream,14.5,1
3,Cake,54.0,4
4,Steak,172.5,3


In [85]:
# Iterating over DataFrame column labels

for col in df:
    print(col)

dish
total_price
quantity


In [86]:
# Iterating over DataFrame rows

for index, row in df.iterrows():
    print(index, row, sep="\n")

0
dish           Salad
total_price     36.0
quantity           3
Name: 0, dtype: object
1
dish           Pasta
total_price    120.0
quantity           2
Name: 1, dtype: object
2
dish           Ice Cream
total_price         14.5
quantity               1
Name: 2, dtype: object
3
dish           Cake
total_price    54.0
quantity          4
Name: 3, dtype: object
4
dish           Steak
total_price    172.5
quantity           3
Name: 4, dtype: object
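
For comparison, here is a sketch of the same kind of row traversal with `itertuples()`, next to the vectorized expression that avoids the loop altogether. The data repeats the menu example, and the variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "dish": ["Salad", "Pasta", "Ice Cream", "Cake", "Steak"],
    "total_price": [36.00, 120.00, 14.5, 54.00, 172.5],
    "quantity": [3, 2, 1, 4, 3],
})

# itertuples() yields one namedtuple per row; columns become attributes.
unit_prices = [row.total_price / row.quantity for row in df.itertuples(index=False)]

# The vectorized form computes the same values without an explicit loop.
vectorized = df["total_price"] / df["quantity"]

print(unit_prices)          # [12.0, 60.0, 14.5, 13.5, 57.5]
print(vectorized.tolist())  # [12.0, 60.0, 14.5, 13.5, 57.5]
```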


# Function Application and Mapping

When working with pandas, applying custom logic to data is a common task which often involves using User-Defined Functions (UDFs). UDFs allow you to encapsulate your own logic or leverage functions from other libraries to operate on pandas objects.

Pandas provides two main methods for applying UDFs: `apply` and `map`. The choice between these methods depends on the level of operation your UDF performs—whether it processes an entire DataFrame, a Series, or individual elements.

- **Row or Column-wise Function Application:** Use the `apply()` method for functions applied along the axes of a DataFrame. The `axis` argument (which defaults to `'index'`) controls whether the function receives each column (`'index'`) or each row (`'columns'`). Note that the return type of the function affects the final output:
    - If the function returns a Series, the output is a DataFrame, with columns matching the Series index.
    - If the function returns any other type, the output is a Series.

    You can adjust this behavior using the `result_type` parameter.

- **Elementwise Function Application**: Use the `map()` method to apply a function that operates on single values. On a Series, `map()` accepts any function that takes and returns a single value. On DataFrames, it applies the function elementwise.

To chain several functions that each take and return a DataFrame or Series, use `pipe()`.
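
A minimal sketch of `pipe()`, assuming two hypothetical helper functions (`add_unit_price` and `keep_cheaper_than` are made up for this example). Each one takes a DataFrame and returns a DataFrame, so the steps read top to bottom instead of as nested calls:

```python
import pandas as pd

df = pd.DataFrame({
    "dish": ["Salad", "Pasta", "Ice Cream", "Cake", "Steak"],
    "total_price": [36.00, 120.00, 14.5, 54.00, 172.5],
    "quantity": [3, 2, 1, 4, 3],
})

def add_unit_price(frame: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper: returns a new frame with a unit_price column.
    return frame.assign(unit_price=frame["total_price"] / frame["quantity"])

def keep_cheaper_than(frame: pd.DataFrame, limit: float) -> pd.DataFrame:
    # Hypothetical helper: keeps rows whose unit price is below the limit.
    return frame[frame["unit_price"] < limit]

# Extra arguments after the function are forwarded to it.
result = df.pipe(add_unit_price).pipe(keep_cheaper_than, limit=20)
print(result["dish"].tolist())  # ['Salad', 'Ice Cream', 'Cake']
```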

In [87]:
df = pd.DataFrame({
    'dish': ['Salad', 'Pasta', 'Ice Cream', 'Cake', 'Steak'],
    'total_price': [36.00, 120.00, 14.5, 54.00, 172.5],
    'quantity': [3, 2, 1, 4, 3],
})
df

Unnamed: 0,dish,total_price,quantity
0,Salad,36.0,3
1,Pasta,120.0,2
2,Ice Cream,14.5,1
3,Cake,54.0,4
4,Steak,172.5,3


In [88]:
def individual_price(row: pd.Series):
    return row['total_price'] / row['quantity']
    
df.apply(individual_price, axis='columns')

0    12.0
1    60.0
2    14.5
3    13.5
4    57.5
dtype: float64

In [89]:
df['total_price'].apply(lambda x: x / 2)

0    18.00
1    60.00
2     7.25
3    27.00
4    86.25
Name: total_price, dtype: float64

In [90]:
# We can achieve the same behavior using map.
# map is meant for simple one-value-in, one-value-out functions,
# and it can skip missing values with na_action='ignore'.

df['total_price'].map(lambda x: x / 2, na_action='ignore')

0    18.00
1    60.00
2     7.25
3    27.00
4    86.25
Name: total_price, dtype: float64

In [91]:
# map also accepts a dict, which is handy for expressing
# several if-then replacements at once
# (values missing from the dict become NaN)

real_name = {
    'Salad': 'Caesar salad',
    'Pasta': 'Fettuccine Alfredo',
    'Cake': 'Ninho and Strawberry',
}
df['dish'].map(real_name)

0            Caesar salad
1      Fettuccine Alfredo
2                     NaN
3    Ninho and Strawberry
4                     NaN
Name: dish, dtype: object

In [92]:
# Invoking function with parameter

def final_price(price: float, dolar: bool) -> float:
    multiplier = 6 if dolar else 1
    return price * multiplier

df['total_price'].apply(final_price, dolar=True)

0     216.0
1     720.0
2      87.0
3     324.0
4    1035.0
Name: total_price, dtype: float64

In [93]:
# Apply a NumPy function to a DataFrame:
# apply passes each column to np.sqrt, which works element by element

df[['quantity', 'total_price']].apply(np.sqrt)

Unnamed: 0,quantity,total_price
0,1.732051,6.0
1,1.414214,10.954451
2,1.0,3.807887
3,2.0,7.348469
4,1.732051,13.133926


In [94]:
# We can achieve the same behavior
# using map on DataFrame

df[['quantity', 'total_price']].map(np.sqrt)

Unnamed: 0,quantity,total_price
0,1.732051,6.0
1,1.414214,10.954451
2,1.0,3.807887
3,2.0,7.348469
4,1.732051,13.133926


In [95]:
# Using reducing/aggregation functions column-wise

df[['quantity', 'total_price']].apply(np.sum, axis=0)

quantity        13.0
total_price    397.0
dtype: float64

In [96]:
# Using reducing/aggregation functions row-wise

df[['quantity', 'total_price']].apply(np.sum, axis=1)

0     39.0
1    122.0
2     15.5
3     58.0
4    175.5
dtype: float64

In [97]:
# Control result type (broadcasting)

df2 = df.copy()
df2.loc[:, ['quantity', 'total_price']] = df2[['quantity', 'total_price']].apply(np.sum, axis=0, result_type='broadcast')
df2

Unnamed: 0,dish,total_price,quantity
0,Salad,397.0,13
1,Pasta,397.0,13
2,Ice Cream,397.0,13
3,Cake,397.0,13
4,Steak,397.0,13


In [98]:
# Control result type (expand)

def enrich(row: pd.Series):
    return pd.Series(
        [row['total_price'] / row['quantity'], row['total_price'] / 2],
        index=['individual_price', 'price_for_two']
    )

df.apply(enrich, axis='columns', result_type='expand')

Unnamed: 0,individual_price,price_for_two
0,12.0,18.0
1,60.0,60.0
2,14.5,7.25
3,13.5,27.0
4,57.5,86.25


## No Need for Loops

The `apply()` method is a recommended alternative to looping over pandas DataFrames. It lets you apply any function, especially user-defined functions (UDFs), along a DataFrame's axis, so you can implement custom logic without manual loops.
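
As a minimal sketch of this idea, the cell below computes the same per-item price three ways: with an explicit `iterrows()` loop, with `apply()`, and with plain column arithmetic. The small DataFrame is illustrative and mirrors the one used earlier in this section. All three give the same values; column arithmetic is usually the most concise option when the logic can be written that way.

```python
import pandas as pd

df = pd.DataFrame({
    'dish': ['Salad', 'Pasta', 'Ice Cream'],
    'total_price': [36.0, 120.0, 14.5],
    'quantity': [3, 2, 1],
})

# 1) Manual loop: iterate row by row and collect the results in a list
looped = []
for _, row in df.iterrows():
    looped.append(row['total_price'] / row['quantity'])

# 2) apply: the same logic, called once per row without a manual loop
applied = df.apply(lambda row: row['total_price'] / row['quantity'], axis='columns')

# 3) Arithmetic operators: operate on whole columns at once
vectorized = df['total_price'] / df['quantity']

print(looped)
print(applied.tolist())
print(vectorized.tolist())
```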

# Descriptive Statistics

Pandas objects provide a range of built-in mathematical and statistical methods for computing descriptive statistics on Series and DataFrame objects, and these methods handle missing data by default. Most of them are aggregations such as `sum()`, `mean()`, and `quantile()`, which produce lower-dimensional results.

There are also methods such as `cumsum()` and `cumprod()` that return an object of the same size.

The table below shows some popular aggregation methods.

| Method         | Description                                                                                  |
|----------------|----------------------------------------------------------------------------------------------|
| count          | Number of non-NA values                                                                      |
| min, max       | Compute minimum and maximum values                                                           |
| idxmin, idxmax | Compute index labels at which minimum or maximum value is obtained, respectively             |
| quantile       | Compute sample quantile ranging from 0 to 1 (default: 0.5)                                   |
| sum            | Sum of values                                                                                |
| mean           | Mean of values                                                                               |
| median         | Arithmetic median (50% quantile) of values                                                   |
| var            | Sample variance of values                                                                    |
| std            | Sample standard deviation of values                                                          |
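
To make the difference between aggregations and same-size methods concrete, here is a small sketch on an illustrative Series (the values and labels are made up for this example).

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5], index=list('abcde'))

# Aggregations reduce the Series to a single value
n_values = s.count()         # number of non-NA values
label_of_max = s.idxmax()    # index label of the maximum
label_of_min = s.idxmin()    # index label of the first minimum
median = s.quantile(0.5)     # the 50% quantile

# Cumulative methods return a Series with the same length as the input
running_sum = s.cumsum()
running_product = s.cumprod()

print(n_values, label_of_max, label_of_min, median)
print(running_sum.tolist())
print(running_product.tolist())
```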

In [99]:
np.random.uniform(-1, 1, size=(4,3))

array([[ 0.14039354, -0.12279697,  0.97674768],
       [-0.79591038, -0.58224649, -0.67738096],
       [ 0.30621665, -0.49341679, -0.06737845],
       [-0.51114882, -0.68206083, -0.77924972]])

In [100]:
data = np.array([
    [0.14039354, -0.12279697, 0.97674768],
    [-0.79591038, np.nan, -0.67738096],
    [0.30621665, -0.49341679, np.nan],
    [0.17302587, -0.95978491, np.nan]
])

df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'], columns=['col1', 'col2', 'col3'])
df

Unnamed: 0,col1,col2,col3
a,0.140394,-0.122797,0.976748
b,-0.79591,,-0.677381
c,0.306217,-0.493417,
d,0.173026,-0.959785,


In [101]:
# Calling DataFrame’s sum method returns a Series containing column sums

df.sum()

col1   -0.176274
col2   -1.575999
col3    0.299367
dtype: float64

In [102]:
# Passing axis="columns" or axis=1 sums across the columns instead

df.sum(axis='columns')

a    0.994344
b   -1.473291
c   -0.187200
d   -0.786759
dtype: float64

## Skipping Missing Data

By default, pandas excludes NaN values when computing the result. This behavior is controlled by the `skipna` parameter. If `skipna` is set to `False`, any NaN value in a row or column results in NaN for that row or column.

In [103]:
df.mean(axis='index', skipna=False)

col1   -0.044069
col2         NaN
col3         NaN
dtype: float64

In [104]:
df.mean(axis='index', skipna=True)

col1   -0.044069
col2   -0.525333
col3    0.149683
dtype: float64

## `unique()`, `value_counts()`, `isin()` 

In addition to descriptive statistics, pandas offers the built-in `unique()` and `value_counts()` methods for inspecting distinct values and counting occurrences, respectively.

- `unique()`: Returns the unique values in a Series or Index.
- `value_counts()`: Returns a Series containing the counts of unique values, in descending order by default. Its parameters include `normalize` to return proportions instead of counts, `sort` to turn sorting by count on or off, `ascending` to choose the sort direction, and `dropna` to include or exclude NaN values.

Additionally, there's the `isin()` method, which checks whether each element in the DataFrame or Series is contained in a specified list-like object. This is useful for filtering data based on membership in a given set of values.

In [105]:
s = pd.Series(['apple', 'coconut', 'banana', 'banana', 'apple', 'banana'])
s

0      apple
1    coconut
2     banana
3     banana
4      apple
5     banana
dtype: object

In [106]:
s.unique()

array(['apple', 'coconut', 'banana'], dtype=object)

In [107]:
s.value_counts()

banana     3
apple      2
coconut    1
Name: count, dtype: int64

In [108]:
s.isin(['apple'])

0     True
1    False
2    False
3    False
4     True
5    False
dtype: bool
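
The `value_counts()` parameters and the filtering use of `isin()` described in this section can be sketched as follows. The Series is the fruit example with one missing value appended for illustration.

```python
import pandas as pd

fruits = pd.Series(['apple', 'coconut', 'banana', 'banana', 'apple', 'banana', None])

# Proportions instead of counts (missing values are dropped by default)
shares = fruits.value_counts(normalize=True)

# Count the missing value as its own category
with_missing = fruits.value_counts(dropna=False)

# Least frequent value first
rarest_first = fruits.value_counts(ascending=True)

# Use the boolean mask from isin() to keep only selected values
selected = fruits[fruits.isin(['apple', 'banana'])]

print(shares)
print(with_missing)
print(rarest_first)
print(selected)
```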

# Sorting and Ranking

Sorting and ranking DataFrames or Series is an essential operation when manipulating data. It ensures your data is organized according to your needs, enhancing readability and usability.

## Sorting

Pandas supports three types of sorting: by index labels, by column values, and by a combination of both.

- `sort_index()`: Sorts by row labels (index) or column labels, in lexicographic order for strings. It returns a new, sorted object. Use the `axis` parameter to choose between row labels and column labels.
- `sort_values()`: Sorts a DataFrame (or Series) by its values. For a DataFrame, pass one or more column names to the `by` parameter to determine the sort order; a Series needs no `by`.
- To sort by index and values, combine the use of both methods.

By default, data is sorted in ascending order. To sort in descending order, set the `ascending` parameter to `False`.

In [109]:
df = pd.DataFrame(np.arange(16).reshape(4,4), index=[3, 1, 2, 0], columns=['C', 'B', 'D', 'A'])
df

Unnamed: 0,C,B,D,A
3,0,1,2,3
1,4,5,6,7
2,8,9,10,11
0,12,13,14,15


In [110]:
df.sort_index()

Unnamed: 0,C,B,D,A
0,12,13,14,15
1,4,5,6,7
2,8,9,10,11
3,0,1,2,3


In [111]:
df.sort_index(axis='columns')

Unnamed: 0,A,B,C,D
3,3,1,0,2
1,7,5,4,6
2,11,9,8,10
0,15,13,12,14


In [112]:
df.sort_values(by=['C', 'B'], ascending=False)

Unnamed: 0,C,B,D,A
0,12,13,14,15
2,8,9,10,11
1,4,5,6,7
3,0,1,2,3


In [113]:
df['A'].sort_values(ascending=False)

0    15
2    11
1     7
3     3
Name: A, dtype: int64
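
Two further patterns may be useful here, shown as a sketch on a made-up DataFrame: giving each sort column its own direction, and combining `sort_index()` with `sort_values()` so that rows with equal values stay in index order. The second pattern relies on `kind='stable'`, which preserves the existing order of ties when sorting by a single column.

```python
import pandas as pd

df = pd.DataFrame(
    {'group': ['b', 'a', 'b', 'a'], 'score': [1, 3, 2, 4]},
    index=[2, 0, 3, 1],
)

# One direction per column: group ascending, score descending
by_direction = df.sort_values(by=['group', 'score'], ascending=[True, False])

# Sort by index first, then do a stable sort by value:
# rows in the same group keep their index order
by_value_then_label = df.sort_index().sort_values(by='group', kind='stable')

print(by_direction)
print(by_value_then_label)
```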

## Ranking

Ranking assigns ranks from one to the number of valid data points in an array, starting with the lowest value. Pandas provides the `rank()` method for both Series and DataFrame objects. By default, `rank()` breaks ties by assigning each tied group the mean of the ranks it spans.

You can change how ties are handled with the `method` parameter, which offers the options `'average'`, `'min'`, `'max'`, `'first'` (ranks assigned in order of appearance in the data), and `'dense'`.

In [114]:
df = pd.DataFrame({
    'dish': ['Salad', 'Pasta', 'Ice Cream', 'Cake', 'Steak'],
    'quantity': [3, 2, 1, 4, 3],
})
df

Unnamed: 0,dish,quantity
0,Salad,3
1,Pasta,2
2,Ice Cream,1
3,Cake,4
4,Steak,3


In [115]:
df['default_rank'] = df['quantity'].rank()
df['1st_rank'] = df['quantity'].rank(method='first')
df['max_rank'] = df['quantity'].rank(method='max')
df['min_rank'] = df['quantity'].rank(method='min')
df

Unnamed: 0,dish,quantity,default_rank,1st_rank,max_rank,min_rank
0,Salad,3,3.5,3.0,4.0,3.0
1,Pasta,2,2.0,2.0,2.0,2.0
2,Ice Cream,1,1.0,1.0,1.0,1.0
3,Cake,4,5.0,5.0,5.0,5.0
4,Steak,3,3.5,4.0,4.0,3.0
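
The `'dense'` option and descending ranks are not covered by the cell above. A short sketch using the same quantities:

```python
import pandas as pd

quantity = pd.Series([3, 2, 1, 4, 3])

# 'dense' works like 'min', but the rank increases by exactly 1 between groups
dense = quantity.rank(method='dense')

# ascending=False gives rank 1 to the largest value
descending = quantity.rank(method='min', ascending=False)

print(dense.tolist())
print(descending.tolist())
```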


# Reindexing

As we already know, the Index object is a key component in pandas' data manipulation toolkit, so handling it properly is important.

When loading a dataset or creating a DataFrame, setting an index is a common task, especially if one or more columns work well as an index. You can do this with the `set_index()` method of a DataFrame, which accepts a column name for a regular Index or a list of column names for a MultiIndex and returns a new, re-indexed DataFrame.

Conversely, the `reset_index()` method reverses this operation. It moves the index values back into the DataFrame's columns and reverts to a simple integer index.

In [116]:
df = pd.DataFrame({
    'dish': ['Salad', 'Pasta', 'Ice Cream', 'Cake', 'Steak'],
    'total_price': [36.00, 120.00, 14.5, 54.00, 172.5],
    'quantity': [3, 2, 1, 4, 3],
})
df

Unnamed: 0,dish,total_price,quantity
0,Salad,36.0,3
1,Pasta,120.0,2
2,Ice Cream,14.5,1
3,Cake,54.0,4
4,Steak,172.5,3


In [117]:
# Setup index

df = df.set_index('dish')
df

dish,total_price,quantity
Salad,36.0,3
Pasta,120.0,2
Ice Cream,14.5,1
Cake,54.0,4
Steak,172.5,3


In [118]:
# Reset index

df = df.reset_index()
df

Unnamed: 0,dish,total_price,quantity
0,Salad,36.0,3
1,Pasta,120.0,2
2,Ice Cream,14.5,1
3,Cake,54.0,4
4,Steak,172.5,3


Additionally, the `reindex` method creates a new object with values rearranged to match a new index, for rows or columns. Labels that were not present in the original object get missing values.

The `method` parameter offers fill strategies such as `'ffill'`, `'bfill'`, and `'nearest'` for these gaps; it applies only when the index is monotonically increasing or decreasing.

In [119]:
df.reindex(columns=['total_price', 'dish'])

Unnamed: 0,total_price,dish
0,36.0,Salad
1,120.0,Pasta
2,14.5,Ice Cream
3,54.0,Cake
4,172.5,Steak
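
The cell above only reorders existing columns. The sketch below, on an illustrative Series, shows what happens when the new index contains labels that are not in the original: by default they are filled with NaN, while `method='ffill'` or `fill_value` supply other values.

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0], index=[0, 2, 4])

# Labels 1, 3 and 5 are new, so they get NaN
plain = s.reindex(range(6))

# Forward fill: each new label takes the value of the previous existing label
ffilled = s.reindex(range(6), method='ffill')

# Use a constant for new labels instead
zero_filled = s.reindex(range(6), fill_value=0.0)

print(plain)
print(ffilled)
print(zero_filled)
```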


# References

- [Python for Data Analysis by Wes McKinney (3e)](https://wesmckinney.com/book/)
- [Pandas Official Documentation](https://pandas.pydata.org/docs/user_guide/10min.html)
- [Frequently Asked Questions (FAQ) on Pandas](https://pandas.pydata.org/docs/user_guide/gotchas.html)
- [Deep Dive into pandas Copy-on-Write Mode: Part I](https://towardsdatascience.com/deep-dive-into-pandas-copy-on-write-mode-part-i-26982e7408c6)


# Appendix: Multi-Indexing / Advanced Indexing

Multi-indexing in pandas allows for more complex data structures by enabling multiple levels of indexing on rows and columns. This hierarchical indexing system lets you work with higher-dimensional data in a more intuitive and accessible way.

Multi-indexing provides a way to group and aggregate data across multiple dimensions, supporting advanced data manipulation and analysis. This makes it a powerful tool for handling real-world datasets that often come with nested or hierarchical relationships.
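
A minimal sketch of these ideas, assuming a hypothetical `category` column added to the dishes example: `set_index()` with a list of columns builds the MultiIndex, `.loc` selects by one or more levels, and `groupby(level=...)` aggregates over a level.

```python
import pandas as pd

orders = pd.DataFrame({
    'category': ['Main', 'Main', 'Dessert', 'Dessert'],  # hypothetical grouping
    'dish': ['Pasta', 'Steak', 'Cake', 'Ice Cream'],
    'total_price': [120.0, 172.5, 54.0, 14.5],
})

# Two columns become a two-level row index; sorting keeps lookups well-behaved
indexed = orders.set_index(['category', 'dish']).sort_index()

# Select every row under one outer label
mains = indexed.loc['Main']

# Select a single value with a tuple of labels, one per level
cake_price = indexed.loc[('Dessert', 'Cake'), 'total_price']

# Aggregate over the outer level
per_category = indexed.groupby(level='category')['total_price'].sum()

print(mains)
print(cake_price)
print(per_category)
```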

# Exercises

To help you understand the concepts covered in this notebook, here are some practice problems.

These questions refer to a dataset containing information on the salary, position, and employability of professionals in the data field, primarily in the US. The dataset is available on [Kaggle by Zee solver](https://www.kaggle.com/datasets/zeesolver/data-eng-salary-2024/data).

## Dataset Description

*"The 2024 dataset on data developer salaries and employment attributes offers valuable insights into the evolving landscape of data developers. It includes key variables such as salary, job title, experience level, employment type, employee residence, remote work ratio, company location, and company size. This data enables detailed analysis of salary trends, employment patterns, and geographic variations in data developer roles. Researchers, analysts, and organizations can leverage this dataset to better understand compensation trends, the distribution of data developer roles across different regions, and the impact of remote work and company size on employment in this field."*

## Columns Description

- **experience_level:** Level of professional experience (e.g., junior, mid, senior).
- **employment_type:** Type of job contract (e.g., full-time, part-time, contract).
- **job_title:** The specific role or title of the employee (e.g., Data Engineer).
- **salary:** The compensation received, in the original currency.
- **salary_currency:** The currency in which the salary is paid.
- **salary_in_usd:** The salary converted into US dollars for comparison.
- **employee_residence:** The location where the employee resides.
- **remote_ratio:** Percentage of work done remotely.
- **company_location:** The geographical location of the company.
- **company_size:** The scale of the company, often based on employee count.



In [4]:
pd.options.mode.copy_on_write = True

In [5]:
df = pd.read_csv('https://raw.githubusercontent.com/ahayasic/workshop-pandas-zero-to-hero/main/datasets/data-jobs-salary-2024/Dataset_salary_2024.csv')
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2024,SE,FT,AI Engineer,202730,USD,202730,US,0,US,M
1,2024,SE,FT,AI Engineer,92118,USD,92118,US,0,US,M
2,2024,SE,FT,Data Engineer,130500,USD,130500,US,0,US,M
3,2024,SE,FT,Data Engineer,96000,USD,96000,US,0,US,M
4,2024,SE,FT,Machine Learning Engineer,190000,USD,190000,US,0,US,M


1. Select rows where experience_level is 'SE' and display the mean salary_in_usd for these rows.

2. Select rows where job_title is 'Machine Learning Engineer' or 'Data Engineer' and show the top 10 salary_in_usd. Display the job_title, salary_in_usd, employee_residence, and experience_level.

3. Select rows where the salary_in_usd is between 100000 and 150000 and show the count of each experience_level.

4. Select rows where the employee_residence is in ['US', 'CA', 'UK'] and calculate the sum of salary_in_usd for these rows.

5. What are the job_title, experience_level, and employment_type of the highest and lowest salary_in_usd for employees whose employee_residence is not 'US'?

6. What are the minimum, maximum, and average salary_in_usd for the year 2021 for experience_level 'EX'?

7. Create a new column salary_in_brl that converts salary_in_usd to BRL (use a conversion rate of 6 for years before 2023 and 5 for 2024).

8. Create a new column job_category based on job_title where titles containing 'Engineer' are categorized as 'Engineering', and others are categorized as 'Other'.

9. Create a new column tax_rate where the rate is 5% if experience_level is 'EN' and 20% otherwise.

10. Create a new column annual_bonus where it is 5% of salary_in_usd for 'EN', 8% for 'MI', 12% for 'SE', and 20% for 'EX'.

11. Increase all salary_in_usd values by 10%.

12. Add a new column happy_birthday_bonus which is 5% of salary_in_usd.

13. Calculate the total compensation (salary_in_usd + happy_birthday_bonus + annual_bonus).

14. Calculate the final compensation (salary_in_usd + happy_birthday_bonus + annual_bonus - tax).

15. (Plus) Given questions 10 to 14, create a solution using `iterrows()`, `apply()`, and *arithmetic operators*, each one in a specific cell. At the start of each cell, add the `%%timeit` magic command.