# Grouping and aggregation with pandas

## Aggregation and reduction

Similar to NumPy, pandas supports data aggregation and reduction functions 
such as computing sums or averages. By _"aggregation"_ or _"reduction"_ 
we mean that the result of a computation has a lower dimension than the original data: for example, the mean reduces a series of observations (1 dimension) into a scalar value (0 dimensions).

Unlike in NumPy, these operations
can also be applied to subsets of the data which have been
grouped according to some criterion. 

Such operations are often referred to as *split-apply-combine* (see the official [user guide](https://pandas.pydata.org/docs/user_guide/groupby.html)) as they involve these three steps:

1. *Split* data into groups based on some criteria;
2. *Apply* some function to each group separately; and
3. *Combine* the results into a single `DataFrame` or `Series`.

See also the pandas [cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) for an illustration of such operations.
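
To make the three steps concrete, they can be spelled out by hand and compared with `groupby()`. The following is a minimal sketch on a small made-up `DataFrame`; the column names `group` and `value` are illustrative assumptions and not part of the data used in this lecture.

```python
import pandas as pd

# Small made-up data set (illustrative only)
toy = pd.DataFrame({
    'group': ['a', 'b', 'a', 'b', 'a'],
    'value': [1.0, 2.0, 3.0, 6.0, 5.0],
})

# 1. Split: collect the values belonging to each group
pieces = {
    key: toy.loc[toy['group'] == key, 'value']
    for key in toy['group'].unique()
}

# 2. Apply: compute the mean of each piece separately
means = {key: piece.mean() for key, piece in pieces.items()}

# 3. Combine: assemble the group-level results into a single Series
manual = pd.Series(means).sort_index()

# groupby() performs all three steps in a single expression
grouped = toy.groupby('group')['value'].mean()

print(manual)
print(grouped)
```

Both approaches return one mean per group; the `groupby()` variant is shorter and avoids the explicit loop over groups.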

We first set the path pointing to the folder which contains the data files used in this lecture. You may need to adapt it to your own environment.

In [1]:
# Uncomment this to use files in the local data/ directory
DATA_PATH = '../data'

# Uncomment this to load data directly from GitHub

# DATA_PATH = 'https://raw.githubusercontent.com/richardfoltyn/TECH2-H24/main/data'

### Working with entire DataFrames

The simplest way to perform data reduction is to invoke the desired
function on the entire `DataFrame`.

In [2]:
import pandas as pd

# Read in Titanic passenger data, set PassengerId column as index
df = pd.read_csv(f'{DATA_PATH}/titanic.csv', index_col='PassengerId')

# Compute mean of all numerical columns
df.mean(numeric_only=True)

Survived     0.383838
Pclass       2.308642
Age         29.699118
Fare        32.204208
dtype: float64

Methods such as [`mean()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html) 
are by default applied column-wise, i.e., separately to each
column. The `numeric_only=True` argument is used to discard
all non-numeric columns (depending on the version of pandas, `mean()` will otherwise
issue a warning or raise an error if there are non-numerical columns in the `DataFrame`).

One big advantage over NumPy is that missing values (represented
by `np.nan`) are automatically ignored:

In [3]:
# mean() automatically drops missing observations
mean_pandas = df['Age'].mean()

# Compare this to the NumPy variant:
import numpy as np

# Returns NaN since some ages are missing (coded as NaN)
mean_numpy = np.mean(df['Age'].to_numpy())

print(f'Mean using Pandas: {mean_pandas}')
print(f'Mean using NumPy:  {mean_numpy}')

Mean using Pandas: 29.69911764705882
Mean using NumPy:  nan


As we have seen previously, NumPy implements an additional set of aggregation functions which drop NaNs, for example [`np.nanmean()`](https://numpy.org/doc/2.0/reference/generated/numpy.nanmean.html).
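
The correspondence between the two libraries can be illustrated as follows. This is a minimal sketch using a small made-up `Series` (not the Titanic data): `np.nanmean()` matches the pandas default, whereas passing `skipna=False` makes pandas propagate missing values like `np.mean()` does.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value (illustrative only)
ages = pd.Series([20.0, np.nan, 40.0])

# pandas skips NaN by default
mean_default = ages.mean()

# With skipna=False, the NaN propagates as with np.mean()
mean_strict = ages.mean(skipna=False)

# NumPy variant that drops NaN before averaging
mean_nanmean = np.nanmean(ages.to_numpy())

print(mean_default, mean_strict, mean_nanmean)
```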

### Working on subsets of data (grouping)

Applying aggregation functions to the entire `DataFrame` is similar
to what we can do with NumPy. The added flexibility of pandas
becomes obvious once we want to apply these functions to subsets of
data, i.e., groups which we can define based on values or index labels.

For example, we can easily group passengers by class using
[`groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html):

In [4]:
import pandas as pd

# Import Titanic data set, set PassengerId column as index
df = pd.read_csv(f'{DATA_PATH}/titanic.csv', index_col='PassengerId')

# Group observations by accommodation class (first, second, third)
groups = df.groupby(['Pclass'])

Here `groups` is a special pandas object which can subsequently be
used to process group-specific data. To compute the group-wise
averages, we can simply run

In [5]:
groups.mean(numeric_only=True)

Pclass,Survived,Age,Fare
1,0.62963,38.233441,84.154687
2,0.472826,29.87763,20.662183
3,0.242363,25.14062,13.67555


Groups support column indexing: if we want to only compute the
total fare paid by passengers in each class, we can do this as follows:

In [6]:
groups['Fare'].sum()

Pclass
1    18177.4125
2     3801.8417
3     6714.6951
Name: Fare, dtype: float64

#### Built-in aggregations

There are numerous routines to aggregate grouped data, for example:

- [`mean()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.mean.html):
    averages within each group
- [`sum()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.sum.html):
    sum values within each group
- [`std()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.std.html), 
    [`var()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.var.html): 
    within-group standard deviation and variance
- [`quantile()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html):
    compute quantiles within each group
- [`size()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.size.html): 
    number of observations in each group
- [`count()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.count.html):
    number of non-missing observations in each group
- [`first()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.first.html), 
    [`last()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.last.html): 
    first and last elements in each group
- [`min()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.min.html), 
    [`max()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.max.html): 
    minimum and maximum elements within a group

See the [official documentation](https://pandas.pydata.org/docs/user_guide/groupby.html#built-in-aggregation-methods) for a complete list.
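
The difference between `size()` and `count()` only shows up when there are missing values. The sketch below uses a small made-up `DataFrame` (the values are illustrative, not taken from the Titanic data) to show that `size()` counts all rows in a group while `count()` counts only the non-missing ones.

```python
import numpy as np
import pandas as pd

# Made-up data with missing ages (illustrative only)
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 2],
    'Age': [30.0, np.nan, 25.0, np.nan, 40.0],
})
toy_groups = toy.groupby('Pclass')

# Number of rows in each group, including rows with missing age
n_rows = toy_groups.size()

# Number of non-missing ages in each group
n_valid = toy_groups['Age'].count()

print(n_rows)
print(n_valid)
```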

*Example: Number of elements within each group*

In [9]:
groups.size()       # return number of elements in each group

Pclass
1    216
2    184
3    491
dtype: int64

*Example: Return last and first observations of each group*

In [7]:
groups.last()

Pclass,Survived,Name,Sex,Age,Ticket,Fare,Cabin,Embarked
1,1,"Behr, Mr. Karl Howell",male,26.0,111369,30.0,C148,C
2,0,"Montvila, Rev. Juozas",male,27.0,211536,13.0,E77,S
3,0,"Dooley, Mr. Patrick",male,32.0,370376,7.75,E121,Q


In [8]:
groups[['Survived', 'Age', 'Sex', 'Fare']].first()      # return first observation in each group

Pclass,Survived,Age,Sex,Fare
1,1,38.0,female,71.2833
2,1,14.0,female,30.0708
3,0,22.0,male,7.25


<div class="alert alert-info">
<h3> Your turn</h3>
Use the Titanic data set to perform the following aggregations:
<ol>
    <li>Compute the average survival rate by sex (stored in the <TT>Sex</TT> column).</li>
    <li>Count the number of passengers aged 50+. Compute the average survival rate by sex for this group.</li>
    <li>Count the number of passengers below the age of 20 by class and sex. Compute the average survival rate for this group (by class and sex).</li>
</ol>
</div>

In [11]:
# 1) Average survival rate by sex
df.groupby("Sex")["Survived"].mean()


Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

In [12]:
# Number of passengers by sex
df["Sex"].value_counts()

Sex
male      577
female    314
Name: count, dtype: int64

In [14]:

# Select all female passengers
df2 = df.loc[df["Sex"] == "female"]
df2

PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss Laina",female,26.0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,113803,53.1000,C123,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,237736,30.0708,,C
...,...,...,...,...,...,...,...,...,...
881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,230433,26.0000,,S
883,0,3,"Dahlberg, Miss Gerda Ulrika",female,22.0,7552,10.5167,,S
886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,382652,29.1250,,Q
888,1,1,"Graham, Miss Margaret Edith",female,19.0,112053,30.0000,B42,S


In [15]:
# 2) Average survival rate by sex among passengers aged 50+
df.query("Age>=50").groupby("Sex")["Survived"].mean()

Sex
female    0.909091
male      0.134615
Name: Survived, dtype: float64

In [16]:
# 3) Average survival rate by class and sex among passengers below the age of 20
df.query("Age < 20").groupby(["Pclass", "Sex"])["Survived"].mean()

Pclass  Sex   
1       female    0.928571
        male      0.571429
2       female    1.000000
        male      0.526316
3       female    0.533333
        male      0.190476
Name: Survived, dtype: float64

In [None]:
# Mean age, computed from the non-missing observations
mean_age = df["Age"].mean()


In [None]:
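# Flag passengers whose age is missing
missing = df["Age"].isna()

# Fill missing ages with the mean age, working on a copy so that df
# (and the results in the remainder of this section) remain unchanged
df_filled = df.copy()
df_filled.loc[missing, "Age"] = mean_age
df_filled.loc[missing]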

#### Writing custom aggregations

We can create custom aggregation routines by calling 
[`agg()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html)
(short-hand for [`aggregate()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html))
on the grouped object. These functions operate on one column at a time, so it is only possible to use observations from that column for computations. 

For example, we can alternatively call the built-in aggregation functions we just covered via `agg()`:

In [18]:

df.groupby("Pclass")["Age"].agg(np.mean)


Pclass
1    38.233441
2    29.877630
3    25.140620
Name: Age, dtype: float64

In [19]:
df.query("Age >= 40").groupby("Pclass")["Age"].count()

Pclass
1    81
2    37
3    45
Name: Age, dtype: int64

In [24]:
# Custom aggregation: count the observations with age 40 or above
def my_agg(x):
    return np.sum(x >= 40)

df.groupby("Pclass")["Age"].agg(my_agg)

In [23]:
df.groupby("Pclass")["Age"].agg(lambda x: np.sum(x >= 40))

Pclass
1    81
2    37
3    45
Name: Age, dtype: int64

In [9]:
# Calculate group means in needlessly complicated way
groups["Age"].agg("mean")

# More direct approach:
# groups["Age"].mean()

Pclass
1    38.233441
2    29.877630
3    25.140620
Name: Age, dtype: float64

On the other hand, we _have to_ use `agg()` if there is no built-in function to perform the desired aggregation.
To illustrate, imagine that we want to count the number of passengers aged 40+ in each class. There is no built-in function to achieve this, so we need to use `agg()` combined with a custom function to perform the desired aggregation:

In [10]:
import numpy as np

groups['Age'].agg(lambda x: np.sum(x >= 40))

Pclass
1    81
2    37
3    45
Name: Age, dtype: int64

Note that we called `agg()` only on the column `Age`, otherwise
the function would be applied to every column separately, which is not
what we want.
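
As an alternative to a custom function, the same count can be obtained by first creating a boolean indicator and then summing it within groups, since `True` counts as 1. The following sketch uses made-up data (not the Titanic data set) to check that both approaches agree; it is meant as an illustration of the idea rather than a recommended replacement for `agg()`.

```python
import numpy as np
import pandas as pd

# Made-up data (illustrative only)
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 3],
    'Age': [45.0, 38.0, np.nan, 60.0, 41.0, 22.0],
})

# Custom aggregation via agg() and a lambda expression
via_agg = toy.groupby('Pclass')['Age'].agg(lambda x: np.sum(x >= 40))

# Alternative: build the indicator first, then use the built-in sum().
# Comparisons with NaN evaluate to False, so missing ages are not counted.
is_40_plus = toy['Age'] >= 40
via_sum = is_40_plus.groupby(toy['Pclass']).sum()

print(via_agg)
print(via_sum)
```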

#### Applying multiple functions at once

It is possible to apply multiple functions in a single call by passing a list of functions. These can be passed as strings or as callables (functions).

*Example: Applying multiple functions to a **single** column*

To compute the mean and median passenger age by class, we proceed as follows:

In [27]:
# Named aggregation on two columns (this syntax is introduced further below)
df.groupby("Pclass").agg(max_fare=("Fare", "max"), av_age=("Age", "mean"))

Pclass,max_fare,av_age
1,512.3292,38.233441
2,73.5,29.87763
3,69.55,25.14062


In [25]:
df.groupby("Pclass")["Age"].agg(["mean", "median", "min", "max"])

Pclass,mean,median,min,max
1,38.233441,37.0,0.92,80.0
2,29.87763,29.0,0.67,70.0
3,25.14062,24.0,0.42,74.0


In [11]:
groups['Age'].agg(['mean', 'median'])

Pclass,mean,median
1,38.233441,37.0
2,29.87763,29.0
3,25.14062,24.0



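Note that we could have also specified these functions by passing references to the corresponding NumPy functions instead (recent versions of pandas may issue a `FutureWarning` in this case and recommend passing the function names as strings):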

In [12]:
groups['Age'].agg([np.mean, np.median])


Pclass,mean,median
1,38.233441,37.0
2,29.87763,29.0
3,25.14062,24.0


The following more advanced syntax allows us to create new column names using existing columns and some operation:

```python
    groups.agg(
        new_column_name1=('column_name1', 'operation1'),
        new_column_name2=('column_name2', 'operation2'),
        ...
    )
```
This is called ["named aggregation"](https://pandas.pydata.org/docs/user_guide/groupby.html#named-aggregation)
as the keywords determine the output column _names_.

*Example: Applying multiple functions to **multiple** columns*

The following code computes the average age and the highest fare in a single aggregation:

In [13]:
groups.agg(
    average_age=('Age', 'mean'), 
    max_fare=('Fare', 'max')
)

Pclass,average_age,max_fare
1,38.233441,512.3292
2,29.87763,73.5
3,25.14062,69.55


Finally, the most flexible aggregation method is `apply()`, which calls a
given function, passing the _entire_ group-specific subset of data (including
all columns) as an argument. You need to use `apply()` if data from more than one column is required to compute a statistic of interest.
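
As an illustration, the within-group correlation between two columns cannot be computed with `agg()` since it requires both columns at once. The following is a minimal sketch using made-up data; the helper function `age_fare_corr()` is a hypothetical name introduced only for this example.

```python
import pandas as pd

# Made-up data (illustrative only): age and fare move together in class 1
# and in opposite directions in class 2
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 2],
    'Age': [20.0, 30.0, 40.0, 20.0, 30.0, 40.0],
    'Fare': [10.0, 20.0, 30.0, 30.0, 20.0, 10.0],
})

def age_fare_corr(group):
    """Return the correlation between age and fare within one group."""
    return group['Age'].corr(group['Fare'])

# Select the two required columns so that the function receives
# only those columns for each group
corr_by_class = toy.groupby('Pclass')[['Age', 'Fare']].apply(age_fare_corr)

print(corr_by_class)
```

The function receives a `DataFrame` for each group, so it can combine columns freely; the result contains one value per group.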

<div class="alert alert-info">
<h3> Your turn</h3>
Use the Titanic data set to perform the following aggregations:
<ol>
    <li>Compute the minimum, maximum and average age by embarkation port (stored in the column <TT>Embarked</TT>) in a single <TT>agg()</TT> operation.
    Note that there are several ways to solve this problem.</li>
    <li>Compute the number of passengers, the average age and the fraction of women by embarkation port in a single <TT>agg()</TT> operation. This one is more challenging and probably requires use of <TT>lambda</TT> expressions.</li>
</ol>
</div>

In [26]:
df["Embarked"].value_counts()

Embarked
S    644
C    168
Q     77
Name: count, dtype: int64

In [31]:
df.groupby("Embarked")["Age"].agg(["min", "max", "mean"])

Embarked,min,max,mean
C,0.42,71.0,30.814769
Q,2.0,70.5,28.089286
S,0.67,80.0,29.445397


In [32]:
# Indicator for female passengers (its mean is the fraction of women)
df["Female"] = (df["Sex"] == "female")

In [36]:
df.groupby("Embarked").agg(
    number_passengers=("Age", "size"),
    avg_age=("Age", "mean"),
    frac_women=("Female", "mean"),
)

Embarked,number_passengers,avg_age,frac_women
C,168,30.814769,0.434524
Q,77,28.089286,0.467532
S,644,29.445397,0.315217


## Transformations

In the previous section, we combined grouping and reduction, i.e., data at the group level was reduced to a single statistic such as the mean. Alternatively, we can combine grouping with the
[`transform()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html) function which assigns the result of a computation to each observation within a group and consequently leaves the number of observations unchanged.

For example, for _each_ observation we could compute the average fare by class as follows:

In [40]:
# Average age within each class, assigned to every passenger of that class
df["Avg_age"] = df.groupby("Pclass")["Age"].transform("mean")

In [47]:
# Deviation of each observation from its group mean
def my_diff(x):
    return x - np.mean(x)

df["Diff_fare"] = df.groupby("Pclass")["Fare"].transform(my_diff)


In [48]:
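# Average fare within each class, assigned to every passenger of that class
df['Avg_Fare'] = df.groupby('Pclass')['Fare'].transform('mean')

# Print class, fare and class-level average fare for the first 10 passengers
df[['Pclass', 'Fare', 'Avg_Fare']].head(10)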

PassengerId,Pclass,Fare,Avg_Fare
1,3,7.25,13.67555
2,1,71.2833,84.154687
3,3,7.925,13.67555
4,1,53.1,84.154687
5,3,8.05,13.67555
6,3,8.4583,13.67555
7,1,51.8625,84.154687
8,3,21.075,13.67555
9,3,11.1333,13.67555
10,2,30.0708,20.662183


As you can see, instead of collapsing the `DataFrame` to only 3 observations (one for each class), the number of observations remains the same, and the average fare is constant within each class. 
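
One way to think about `transform()` is as an aggregation followed by copying the group-level result back to every row of the group. The sketch below, which uses made-up data rather than the Titanic data set, performs these two steps by hand with `mean()` and `map()` and compares the outcome to `transform('mean')`.

```python
import pandas as pd

# Made-up data (illustrative only)
toy = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 2],
    'Fare': [8.0, 70.0, 10.0, 50.0, 30.0],
})

# Aggregation: one mean per class
class_means = toy.groupby('Pclass')['Fare'].mean()

# Copy the class-level mean back to each row by looking up the row's class
by_hand = toy['Pclass'].map(class_means)

# transform() performs the aggregation and the copying in one step
by_transform = toy.groupby('Pclass')['Fare'].transform('mean')

print(by_hand)
print(by_transform)
```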

When would we want to use `transform()` instead of aggregation? Such use cases arise whenever we want to perform computations that include the individual value as well as an aggregate statistic.

*Example: Deviation from average fare*

Assume that we want to compute how much each passenger's fare differed from the average fare in their respective class. We could compute this using `transform()` as follows:

In [49]:
import numpy as np

# Compute difference of passenger's fare and avg. fare paid within class
df['Fare_Diff'] = df.groupby('Pclass')['Fare'].transform(lambda x: x - np.mean(x))

# Print relevant columns
df[['Pclass', 'Fare', 'Fare_Diff']].head(10)

PassengerId,Pclass,Fare,Fare_Diff
1,3,7.25,-6.42555
2,1,71.2833,-12.871387
3,3,7.925,-5.75055
4,1,53.1,-31.054687
5,3,8.05,-5.62555
6,3,8.4583,-5.21725
7,1,51.8625,-32.292187
8,3,21.075,7.39945
9,3,11.1333,-2.54225
10,2,30.0708,9.408617


<div class="alert alert-info">
<h3> Your turn</h3>
Use the Titanic data set to perform the following aggregations:
<ol>
    <li>Compute the <i>excess</i> fare paid by each passenger relative to the minimum fare by embarkation port and class, i.e., compute <i>Fare - min(Fare)</i>
        by port and class.</li>
</ol>
</div>

***
# Working with time series data

In economics and finance, we frequently work with time series data, i.e., observations that are associated with a particular point in time (time stamp) or a time period. pandas offers comprehensive support for such data, in particular if the time stamp or time period is used as the index of a `Series` or `DataFrame`.
This section presents a few of the most important concepts; see the official [documentation](https://pandas.pydata.org/docs/user_guide/timeseries.html) for a comprehensive guide.

To illustrate, let's construct a set of daily data for the first three months of 2024, i.e., the period 2024-01-01 to 2024-03-31 using the 
[`date_range()`](https://pandas.pydata.org/docs/reference/api/pandas.date_range.html) function
(we use the date format `YYYY-MM-DD` in this section, but pandas also supports other date formats).

In [16]:
import pandas as pd
import numpy as np

# Create sequence of dates from 2024-01-01 to 2024-03-31 
# at daily frequency
index = pd.date_range(start='2024-01-01', end='2024-03-31', freq='D')

# Use date range as index for Series with some artificial data
data = pd.Series(np.arange(len(index)), index=index)

# Print first 5 observations
data.head(5)

2024-01-01    0
2024-01-02    1
2024-01-03    2
2024-01-04    3
2024-01-05    4
Freq: D, dtype: int64

## Indexing with date/time indices

pandas implements several convenient ways to select observations associated with a particular date or a set of dates. For example, if we want to select one specific date, we can pass it as a string to `.loc[]`:

In [17]:
# Select single observation by date
data.loc['2024-01-01']

0

It is also possible to select a time period by passing a start and end point (where the end point is included, as usual with label-based indexing in pandas):

In [18]:
# Select first 5 days
data.loc['2024-01-01':'2024-01-05']

2024-01-01    0
2024-01-02    1
2024-01-03    2
2024-01-04    3
2024-01-05    4
Freq: D, dtype: int64

A particularly useful way to index time periods is to pass a partial index. For example, if we want to select all observations from January 2024, we could use the range `'2024-01-01':'2024-01-31'`, but it is much easier to specify the partial index `'2024-01'` instead, which includes all observations from January.

In [19]:
# Select all observations from January 2024
data.loc['2024-01']

2024-01-01     0
2024-01-02     1
2024-01-03     2
2024-01-04     3
2024-01-05     4
2024-01-06     5
2024-01-07     6
2024-01-08     7
2024-01-09     8
2024-01-10     9
2024-01-11    10
2024-01-12    11
2024-01-13    12
2024-01-14    13
2024-01-15    14
2024-01-16    15
2024-01-17    16
2024-01-18    17
2024-01-19    18
2024-01-20    19
2024-01-21    20
2024-01-22    21
2024-01-23    22
2024-01-24    23
2024-01-25    24
2024-01-26    25
2024-01-27    26
2024-01-28    27
2024-01-29    28
2024-01-30    29
2024-01-31    30
Freq: D, dtype: int64

## Lags, differences, and other useful transformations

When working with time series data, we often need to create lags or leads of a variable (e.g., if we want to include lagged values in a regression model). In pandas, this is done using 
[`shift()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html)
which shifts the observations by the desired number of periods (default: 1) while leaving the index unchanged. For example, invoking
`shift(1)` creates lagged observations of a `Series` (or of each column in a `DataFrame`):

In [20]:
# Lag observations by 1 period
data.shift(1).head(5)

2024-01-01    NaN
2024-01-02    0.0
2024-01-03    1.0
2024-01-04    2.0
2024-01-05    3.0
Freq: D, dtype: float64

Note that the first observation is now missing since there is no preceding observation which could have provided the lagged value.
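
Leads are created in the same way by passing a negative number of periods. The following minimal sketch uses a short artificial daily series, constructed like the one above but covering only five days, to contrast a one-period lag with a one-period lead.

```python
import numpy as np
import pandas as pd

# Five days of artificial daily data (illustrative only)
index = pd.date_range(start='2024-01-01', end='2024-01-05', freq='D')
toy = pd.Series(np.arange(len(index)), index=index)

# Lag: each date receives the value from the previous day
lag = toy.shift(1)

# Lead: each date receives the value from the following day
lead = toy.shift(-1)

print(pd.DataFrame({'value': toy, 'lag': lag, 'lead': lead}))
```

With a lead, it is the last observation that becomes missing since there is no subsequent observation.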

Another useful method is 
[`diff()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.diff.html)
which computes the difference between adjacent observations (the period over which the difference
is taken can be passed as a parameter).

In [21]:
# Compute 1-period difference
data.diff().head(5)

2024-01-01    NaN
2024-01-02    1.0
2024-01-03    1.0
2024-01-04    1.0
2024-01-05    1.0
Freq: D, dtype: float64

Note that `diff()` is identical to manually computing the difference with the lagged value like this:
```python
data - data.shift()
```
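
The equivalence also holds for differences over more than one period. The sketch below checks this on a short artificial series (made up for this example) for both the default one-period difference and a three-period difference.

```python
import numpy as np
import pandas as pd

# Artificial daily data: squares of 0, 1, ..., 9 (illustrative only)
index = pd.date_range(start='2024-01-01', periods=10, freq='D')
toy = pd.Series(np.arange(10) ** 2, index=index)

# One-period difference, built-in and by hand
diff1 = toy.diff()
manual1 = toy - toy.shift()

# Three-period difference, built-in and by hand
diff3 = toy.diff(periods=3)
manual3 = toy - toy.shift(3)

# equals() treats missing values in the same position as equal
print(diff1.equals(manual1), diff3.equals(manual3))
```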

Additionally, we can use 
[`pct_change()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pct_change.html)
which computes the percentage change (the relative difference) over a given number of periods (default: 1).

In [22]:
# Compute percentage change vs. previous period
data.pct_change().head(5)

2024-01-01         NaN
2024-01-02         inf
2024-01-03    1.000000
2024-01-04    0.500000
2024-01-05    0.333333
Freq: D, dtype: float64

Again, this is just a convenience method that is a short-cut for manually computing the percentage change:
```python
(data - data.shift()) / data.shift()
```
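
The manual formula also explains the `inf` in the output above: the first observation of the artificial data is 0, so the second percentage change divides by zero. The following sketch uses made-up values that avoid this problem to confirm that `pct_change()` and the manual computation agree.

```python
import numpy as np
import pandas as pd

# Made-up daily values without zeros (illustrative only)
index = pd.date_range(start='2024-01-01', periods=4, freq='D')
toy = pd.Series([100.0, 110.0, 99.0, 99.0], index=index)

# Relative change vs. the previous period, built-in and by hand
growth = toy.pct_change()
manual = (toy - toy.shift()) / toy.shift()

print(pd.DataFrame({'value': toy, 'pct_change': growth, 'manual': manual}))
```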

## Resampling and aggregation

Another useful feature of the time series support in pandas is *resampling* which is used to group observations by time period and apply some aggregation function.
This can be accomplished using the 
[`resample()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html)
method which in its simplest form takes a string argument that describes how observations should be grouped
(`'YE'` for aggregation to years, `'QE'` for quarters, `'ME'` for months, `'W'` for weeks, etc.; pandas versions before 2.2 use `'Y'`, `'Q'` and `'M'` instead of the first three).

For example, if we want to aggregate our 3 months of artificial daily data to monthly frequency, we would use `resample('ME')`. This returns an object which is very similar to the one returned by `groupby()` we studied previously, and we can call various aggregation methods such as `mean()`:

In [23]:
# Resample to monthly frequency, aggregate to mean of daily observations 
# within each month
data.resample('ME').mean()

2024-01-31    15.0
2024-02-29    45.0
2024-03-31    75.0
Freq: ME, dtype: float64
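
The similarity to `groupby()` can be made explicit by grouping on the month of each date instead. The sketch below rebuilds the artificial daily data and computes the same monthly means with `groupby()`. Note that this shortcut groups on the month number only, so with data spanning several years it would pool the same calendar month across years, which `resample()` does not do.

```python
import numpy as np
import pandas as pd

# Artificial daily data for the first three months of 2024
index = pd.date_range(start='2024-01-01', end='2024-03-31', freq='D')
data = pd.Series(np.arange(len(index)), index=index)

# Group by the month number of each date (1 = January, etc.)
monthly = data.groupby(data.index.month).mean()

print(monthly)
```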

Similarly, we can use `resample('W')` to resample to weekly frequency. Below,
we combine this with the aggregator 
[`last()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.last.html) 
to return the last observation of each week (weeks by default end on Sundays, as indicated by `W-SUN` in the output):

In [24]:
# Return last observation of each week
data.resample('W').last()

2024-01-07     6
2024-01-14    13
2024-01-21    20
2024-01-28    27
2024-02-04    34
2024-02-11    41
2024-02-18    48
2024-02-25    55
2024-03-03    62
2024-03-10    69
2024-03-17    76
2024-03-24    83
2024-03-31    90
Freq: W-SUN, dtype: int64
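
The day on which a week ends can be changed by appending a weekday to the frequency string. As a minimal sketch, the code below rebuilds the artificial daily data and uses `'W-FRI'` so that each week ends on a Friday, which might be useful for data that follows a business week.

```python
import numpy as np
import pandas as pd

# Artificial daily data for the first three months of 2024
index = pd.date_range(start='2024-01-01', end='2024-03-31', freq='D')
data = pd.Series(np.arange(len(index)), index=index)

# Weekly frequency with weeks ending on Fridays; each week is labeled
# by its final day, and last() returns the last available observation
weekly_fri = data.resample('W-FRI').last()

print(weekly_fri.head(3))
```

Since the data end on a Sunday, the final week is incomplete and is labeled by the following Friday, which lies outside the sample.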