In [2]:
import numpy as np
import pandas as pd

To apply your own or another library’s functions to pandas objects, you should be aware of the methods below. The appropriate method depends on whether your function expects to operate on an **entire DataFrame or Series**, **row- or column-wise**, or **elementwise**.

- Tablewise Function Application: **pipe()**
- Row or Column-wise Function Application: **apply()**
- Aggregation API: **agg()** and **transform()**
- Applying Elementwise Functions: **applymap()** (renamed **DataFrame.map()** in pandas 2.1)

### Tablewise function application


Some functions take a whole DataFrame, operate on it, and return it.

If such functions need to be called in a **chain**, consider using the **pipe()** method.

In [63]:
df = pd.DataFrame(
    {
        'city_and_country': ["Chicago, IL", "NewYork, NYC"]
    }
)
df

Unnamed: 0,city_and_country
0,"Chicago, IL"
1,"NewYork, NYC"


In [57]:
def extract_city_name(df):
    # adds a 'city_name' column (modifies df in place) and returns df so calls can be chained
    df['city_name'] = df['city_and_country'].str.split(', ').str.get(0)
    return df

In [58]:
def add_country_name(df, country_name):
    # overwrites 'city_and_country' with "<city_name>-<country_name>" and returns df
    df['city_and_country'] = df['city_name'] + '-' + country_name
    return df

In [59]:
add_country_name(extract_city_name(df), 'US')

Unnamed: 0,city_and_country,city_name
0,Chicago-US,Chicago
1,NewYork-US,NewYork


Instead of nesting the calls, you can chain them with pipe():

In [62]:
df.pipe(extract_city_name).pipe(add_country_name, country_name='US')

Unnamed: 0,city_and_country,city_name
0,Chicago-US,Chicago
1,NewYork-US,NewYork


Note that each function must return the DataFrame. If a function returns nothing, the next pipe() call receives None and the chain fails.
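
The two functions above modify the DataFrame they receive. As a minimal sketch, assuming you would rather leave the input untouched, the same chain can be written with `DataFrame.assign()`, which returns a new DataFrame. The variable name `cities` is only for illustration.

```python
import pandas as pd


def extract_city_name(df):
    # assign() returns a new DataFrame, so the caller's df is not modified
    return df.assign(city_name=df["city_and_country"].str.split(", ").str.get(0))


def add_country_name(df, country_name):
    # extra arguments given to pipe() are forwarded to this function
    return df.assign(city_and_country=df["city_name"] + "-" + country_name)


cities = pd.DataFrame({"city_and_country": ["Chicago, IL", "NewYork, NYC"]})
result = cities.pipe(extract_city_name).pipe(add_country_name, country_name="US")
print(result)
```

Because each step returns a new object, `cities` still holds the original single column after the chain runs.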

### Row or column-wise function application


In [67]:
df = pd.DataFrame(
    {
        "one": pd.Series(np.random.randn(3), index=["a", "b", "c"]),
        "two": pd.Series(np.random.randn(4), index=["a", "b", "c", "d"]),
        "three": pd.Series(np.random.randn(3), index=["b", "c", "d"]),
    }
)
df

Unnamed: 0,one,two,three
a,-0.527831,2.235041,
b,-0.345712,-1.494081,1.001526
c,0.036244,1.852079,-0.796647
d,,0.324931,-0.157571


In [69]:
df.apply(np.mean)

one     -0.279100
two      0.729493
three    0.015769
dtype: float64

In [71]:
df.apply(np.mean, axis=1)

a    0.853605
b   -0.279423
c    0.363892
d    0.083680
dtype: float64

In [72]:
df.apply("mean")

one     -0.279100
two      0.729493
three    0.015769
dtype: float64

The return type of the function passed to apply() affects the type of the final output from DataFrame.apply for the default behaviour:

- If the applied function returns a Series, the final output is a DataFrame. The columns match the index of the Series returned by the applied function.
- If the applied function returns any other type, the final output is a Series.

This default behaviour can be overridden using the result_type argument, which accepts three options: **reduce**, **broadcast**, and **expand**. These determine how list-like return values expand (or not) to a DataFrame.
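
A small sketch of how result_type changes the output when the applied function returns a list. The frame and the helper `min_max` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})


def min_max(row):
    # returns a plain list, not a Series
    return [row.min(), row.max()]


# default: a list-like return value gives a Series whose elements are lists
as_series = df.apply(min_max, axis=1)

# "expand": each list element becomes its own column (labelled 0, 1, ...)
expanded = df.apply(min_max, axis=1, result_type="expand")

# "broadcast": the result keeps the original columns and index
broadcast = df.apply(min_max, axis=1, result_type="broadcast")

print(as_series)
print(expanded)
print(broadcast)
```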

In [3]:
tsdf = pd.DataFrame(
    np.random.randn(1000, 3),
    columns=["A", "B", "C"],
    index=pd.date_range("1/1/2000", periods=1000),
)
tsdf

Unnamed: 0,A,B,C
2000-01-01,0.646363,-1.991371,-0.629919
2000-01-02,-0.982122,0.791272,0.870482
2000-01-03,-0.295142,-0.887436,1.589199
2000-01-04,-0.164457,-0.093939,-1.059803
2000-01-05,-1.683638,-1.351925,0.433507
...,...,...,...
2002-09-22,-0.405605,0.881207,-0.320403
2002-09-23,-1.010377,0.146564,-0.379882
2002-09-24,1.746125,-0.747612,-1.514185
2002-09-25,1.247031,0.566825,0.638369


In [4]:
tsdf.apply(pd.Series.idxmax)

A   2000-05-20
B   2002-07-04
C   2000-02-18
dtype: datetime64[ns]

___

In [86]:
def substract_and_devide(x, sub, dev):
    return (x - sub)/dev

In [94]:
substract_and_devide(3, 2, 1)

1.0

In [96]:
tsdf.apply(substract_and_devide, args=[2, 1]) # extra positional arguments for your function go in args

Unnamed: 0,A,B,C
2000-01-01,-3.501748,-1.691045,-2.921484
2000-01-02,-2.278006,-0.947675,-1.403341
2000-01-03,-2.541073,-2.628935,-1.264815
2000-01-04,-1.827911,-0.947848,-2.415320
2000-01-05,-3.147390,-2.464809,-3.335453
...,...,...,...
2002-09-22,-2.529966,-1.904061,-2.012666
2002-09-23,-3.088455,-1.684880,-0.167563
2002-09-24,-1.320803,-2.646608,-1.591851
2002-09-25,-3.994000,-1.539966,-0.748881


In [97]:
tsdf.apply(substract_and_devide, sub=2, dev=1) # you can also pass them as keyword arguments

Unnamed: 0,A,B,C
2000-01-01,-3.501748,-1.691045,-2.921484
2000-01-02,-2.278006,-0.947675,-1.403341
2000-01-03,-2.541073,-2.628935,-1.264815
2000-01-04,-1.827911,-0.947848,-2.415320
2000-01-05,-3.147390,-2.464809,-3.335453
...,...,...,...
2002-09-22,-2.529966,-1.904061,-2.012666
2002-09-23,-3.088455,-1.684880,-0.167563
2002-09-24,-1.320803,-2.646608,-1.591851
2002-09-25,-3.994000,-1.539966,-0.748881


___

Another useful feature is the ability to pass Series methods to carry out some Series operation on each column or row:



In [3]:
s = pd.Series([0, 2, np.nan, 8])
s

0    0.0
1    2.0
2    NaN
3    8.0
dtype: float64

In [6]:
s.interpolate(method='polynomial', order=2)

0    0.000000
1    2.000000
2    4.666667
3    8.000000
dtype: float64

In [105]:
# first let's make some NaN elements in the data frame
tsdf.iloc[1:3, :] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in every version
tsdf.head()

Unnamed: 0,A,B,C
2000-01-01,-1.501748,0.308955,-0.921484
2000-01-02,,,
2000-01-03,,,
2000-01-04,0.172089,1.052152,-0.41532
2000-01-05,-1.14739,-0.464809,-1.335453


In [107]:
# now fill these NaN elements by interpolating each column:
tsdf.head().apply(pd.Series.interpolate, method='polynomial', order=2)

Unnamed: 0,A,B,C
2000-01-01,-1.501748,0.308955,-0.921484
2000-01-02,-0.005091,1.439033,-0.208336
2000-01-03,0.552855,1.686766,-0.039614
2000-01-04,0.172089,1.052152,-0.41532
2000-01-05,-1.14739,-0.464809,-1.335453


___
Finally, apply() takes an argument **raw**, which is False by default; in that case each row or column is converted into a Series before the function is applied. When set to True, the passed function will instead receive an ndarray object, which has **positive performance implications** if you do not need the indexing functionality.

For example, if your function only needs the values as a NumPy ndarray and runs faster on one, you can set raw=True.
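
The following sketch shows what the function receives in each mode. The helper `value_range` and the list `received` are illustrative names, not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["a", "b"])

received = []  # records the type of object handed to the function


def value_range(values):
    received.append(type(values).__name__)
    return values.max() - values.min()


# default (raw=False): each column arrives as a Series
as_series = df.apply(value_range)
types_default = set(received)

received.clear()

# raw=True: each column arrives as a plain ndarray, with no index attached
as_array = df.apply(value_range, raw=True)
types_raw = set(received)

print(types_default, types_raw)
print(as_array)
```

Both calls give the same numbers here; the difference is only in the object the function sees, so raw=True is an option when the function never touches the index.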

### Aggregation API


In [8]:
tsdf = pd.DataFrame(
    np.random.randn(10, 3),
    columns=["A", "B", "C"],
    index=pd.date_range("1/1/2000", periods=10),
)

In [9]:
tsdf.iloc[3:7] = np.nan
tsdf

Unnamed: 0,A,B,C
2000-01-01,1.337564,1.430222,-1.347326
2000-01-02,0.059352,1.598077,1.99849
2000-01-03,-0.629846,-0.289471,-0.142053
2000-01-04,,,
2000-01-05,,,
2000-01-06,,,
2000-01-07,,,
2000-01-08,1.158912,-1.003115,-1.638364
2000-01-09,-0.298271,2.322714,0.315841
2000-01-10,-0.06076,0.149427,0.548375


Using a single function is equivalent to apply(). You can also pass named methods as strings. These will return a Series of the aggregated output:



In [10]:
tsdf.agg(np.sum)
# agg() is an alias for aggregate(); they do the same thing

A    1.566951
B    4.207854
C   -0.265037
dtype: float64

In [11]:
tsdf.sum()

A    1.566951
B    4.207854
C   -0.265037
dtype: float64

In [113]:
tsdf['A'].agg(np.sum)

0.3349363905386644

### Aggregating with multiple functions

In [114]:
tsdf.agg(
    [np.sum, np.mean]
)

Unnamed: 0,A,B,C
sum,0.334936,-1.019666,-1.752516
mean,0.055823,-0.169944,-0.292086


### Aggregating with a dict


In [121]:
tsdf.agg(
    {'A': [np.mean],
     'B': [np.mean, np.sum],
    }
)

Unnamed: 0,A,B
mean,0.055823,-0.169944
sum,,-1.019666


Note: agg() expects functions that reduce their input to a single value, like np.mean or np.sum. If you pass a function that does not reduce, such as np.exp or a lambda like lambda x: x*2, the output is not an aggregation and can look odd.
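
A short sketch of the difference, using a small made-up frame. The behaviour shown for the non-reducing lambda is what I would expect from recent pandas versions, where agg() falls back to applying the function column by column; treat it as an assumption for older releases.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]})

# a reducing function: one value per column, returned as a Series
reduced = df.agg("sum")

# a non-reducing function: nothing is aggregated,
# the result has the same shape as the input
not_reduced = df.agg(lambda col: col * 2)

print(reduced)
print(not_reduced)
```

If the intent is an elementwise or column-preserving operation, transform() states that intent more clearly than agg().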

### Mixed dtypes

In [122]:
mdf = pd.DataFrame(
    {
        "A": [1, 2, 3],
        "B": [1.0, 2.0, 3.0],
        "C": ["foo", "bar", "baz"],
        "D": pd.date_range("20130101", periods=3),
    }
)
mdf

Unnamed: 0,A,B,C,D
0,1,1.0,foo,2013-01-01
1,2,2.0,bar,2013-01-02
2,3,3.0,baz,2013-01-03


In [123]:
mdf.agg(['min', 'sum'])


Unnamed: 0,A,B,C,D
min,1,1.0,bar,2013-01-01
sum,6,6.0,foobarbaz,NaT


In [125]:
mdf.drop('D', axis=1).agg(['min', 'sum'])

Unnamed: 0,A,B,C
min,1,1.0,bar
sum,6,6.0,foobarbaz


### Custom describe


In [133]:
from functools import partial
q_25 = partial(pd.Series.quantile, q=0.25)
q_25.__name__ = "25%"
q_75 = partial(pd.Series.quantile, q=0.75)
q_75.__name__ = "75%"

In [134]:
tsdf.describe()

Unnamed: 0,A,B,C
count,6.0,6.0,6.0
mean,0.055823,-0.169944,-0.292086
std,1.61558,0.661301,0.660579
min,-1.514748,-0.894342,-1.261432
25%,-1.210628,-0.562196,-0.601348
50%,-0.124855,-0.372484,-0.365027
75%,0.650618,0.160902,0.203897
max,2.762838,0.90638,0.525552


In [139]:
tsdf.agg(
    [
        'count',
        'mean',
        'std',
        'min',
        q_25,
        'median',
        q_75,
        'max'
    ]
)

Unnamed: 0,A,B,C
count,6.0,6.0,6.0
mean,0.055823,-0.169944,-0.292086
std,1.61558,0.661301,0.660579
min,-1.514748,-0.894342,-1.261432
25%,-1.210628,-0.562196,-0.601348
median,-0.124855,-0.372484,-0.365027
75%,0.650618,0.160902,0.203897
max,2.762838,0.90638,0.525552


### Transform API


The transform() method returns an object that is indexed the same (same size) as the original. This API allows you to provide multiple operations at the same time rather than one-by-one. Its API is quite similar to the .agg API.
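
As a quick illustration of "indexed the same as the original", the sketch below contrasts transform() with agg() on a small made-up frame.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 40.0]})

# transform(): the function is applied per column and the
# result keeps the original index and shape
centered = df.transform(lambda col: col - col.mean())

# agg(): the same frame reduced to one value per column
means = df.agg("mean")

print(centered)
print(means)
```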

In [142]:
tsdf = pd.DataFrame(
    np.random.randn(10, 3),
    columns=["A", "B", "C"],
    index=pd.date_range("1/1/2000", periods=10),
)

tsdf.iloc[3:7] = np.nan
tsdf

Unnamed: 0,A,B,C
2000-01-01,-0.134446,-1.165682,0.559153
2000-01-02,0.754433,-0.27937,-0.241849
2000-01-03,-0.146348,0.357,-0.639142
2000-01-04,,,
2000-01-05,,,
2000-01-06,,,
2000-01-07,,,
2000-01-08,-0.801472,-0.039668,0.124807
2000-01-09,-0.069687,-1.240844,0.078372
2000-01-10,-2.079853,1.213023,-0.104695


In [143]:
tsdf.transform(np.abs)

Unnamed: 0,A,B,C
2000-01-01,0.134446,1.165682,0.559153
2000-01-02,0.754433,0.27937,0.241849
2000-01-03,0.146348,0.357,0.639142
2000-01-04,,,
2000-01-05,,,
2000-01-06,,,
2000-01-07,,,
2000-01-08,0.801472,0.039668,0.124807
2000-01-09,0.069687,1.240844,0.078372
2000-01-10,2.079853,1.213023,0.104695


### Transform with multiple functions


In [144]:
tsdf.transform([np.abs, lambda x: 0])

Unnamed: 0_level_0,A,A,B,B,C,C
Unnamed: 0_level_1,absolute,<lambda>,absolute,<lambda>,absolute,<lambda>
2000-01-01,0.134446,0,1.165682,0,0.559153,0
2000-01-02,0.754433,0,0.27937,0,0.241849,0
2000-01-03,0.146348,0,0.357,0,0.639142,0
2000-01-04,,0,,0,,0
2000-01-05,,0,,0,,0
2000-01-06,,0,,0,,0
2000-01-07,,0,,0,,0
2000-01-08,0.801472,0,0.039668,0,0.124807,0
2000-01-09,0.069687,0,1.240844,0,0.078372,0
2000-01-10,2.079853,0,1.213023,0,0.104695,0


### Transforming with a dict


In [145]:
tsdf.transform(
    {
        'A': np.abs,
        'B': np.exp,
        'C': lambda x: 0,
    }
)

Unnamed: 0,A,B,C
2000-01-01,0.134446,0.31171,0
2000-01-02,0.754433,0.75626,0
2000-01-03,0.146348,1.429036,0
2000-01-04,,,0
2000-01-05,,,0
2000-01-06,,,0
2000-01-07,,,0
2000-01-08,0.801472,0.961109,0
2000-01-09,0.069687,0.28914,0
2000-01-10,2.079853,3.363638,0


In [148]:
tsdf.transform(
    {
        "A": np.abs,
        "B": [lambda x: 0, np.exp]
    }
)


Unnamed: 0_level_0,A,B,B
Unnamed: 0_level_1,absolute,<lambda>,exp
2000-01-01,0.134446,0,0.31171
2000-01-02,0.754433,0,0.75626
2000-01-03,0.146348,0,1.429036
2000-01-04,,0,
2000-01-05,,0,
2000-01-06,,0,
2000-01-07,,0,
2000-01-08,0.801472,0,0.961109
2000-01-09,0.069687,0,0.28914
2000-01-10,2.079853,0,3.363638


Note: two major differences between apply and transform

There are two major differences between the transform and apply groupby methods.

- **Input**: apply implicitly passes all the columns for each group as a DataFrame to the custom function, while transform passes each column for each group individually as a Series to the custom function.
- **Output**: the custom function passed to apply can return a scalar, a Series, or a DataFrame (or a NumPy array or even a list). The custom function passed to transform must return a sequence (a one-dimensional Series, array, or list) the same length as the group.

So transform works on just one Series at a time, and apply works on the entire DataFrame at once.
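
The groupby case can be sketched as follows. The frame and its column names (`key`, `a`, `b`) are invented for the example; the columns are selected explicitly so that only `a` and `b` are handed to the functions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": ["x", "x", "y"],
        "a": [1, 2, 3],
        "b": [10, 20, 30],
    }
)

grouped = df.groupby("key")[["a", "b"]]

# apply(): the function sees each group as a DataFrame, so it can
# combine columns; here it returns one scalar per group
totals = grouped.apply(lambda g: (g["a"] * g["b"]).sum())

# transform(): the function sees one column of one group at a time
# and must return something as long as that group
centered = grouped.transform(lambda s: s - s.mean())

print(totals)
print(centered)
```

`totals` has one entry per group, while `centered` has one row per original row.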

In [151]:
# transform() requires output the same length as the input, so reducing functions fail:
# df.transform(np.sum) --> raises ValueError: Function did not transform
# df.transform(add_two_columns, axis='columns') --> raises ValueError: Function did not transform

# both of these work fine with apply()

In [152]:
def add_1(s):
    return s + 1

In [153]:
tsdf.apply(add_1)

Unnamed: 0,A,B,C
2000-01-01,0.865554,-0.165682,1.559153
2000-01-02,1.754433,0.72063,0.758151
2000-01-03,0.853652,1.357,0.360858
2000-01-04,,,
2000-01-05,,,
2000-01-06,,,
2000-01-07,,,
2000-01-08,0.198528,0.960332,1.124807
2000-01-09,0.930313,-0.240844,1.078372
2000-01-10,-1.079853,2.213023,0.895305


In [154]:
tsdf.transform(add_1)

Unnamed: 0,A,B,C
2000-01-01,0.865554,-0.165682,1.559153
2000-01-02,1.754433,0.72063,0.758151
2000-01-03,0.853652,1.357,0.360858
2000-01-04,,,
2000-01-05,,,
2000-01-06,,,
2000-01-07,,,
2000-01-08,0.198528,0.960332,1.124807
2000-01-09,0.930313,-0.240844,1.078372
2000-01-10,-1.079853,2.213023,0.895305


### Applying elementwise functions


Since not all functions can be vectorized (accept NumPy arrays and return another array or value), the methods **applymap**() on **DataFrame** and analogously **map**() on **Series** accept any Python function taking a single value and returning a single value. Since pandas 2.1, applymap() is deprecated in favour of the equivalent DataFrame.**map**(). For example:

In [156]:
df

Unnamed: 0,one,two,three
a,-0.527831,2.235041,
b,-0.345712,-1.494081,1.001526
c,0.036244,1.852079,-0.796647
d,,0.324931,-0.157571


In [157]:
def f(x):
    return len(str(x))

In [158]:
df['one'].apply(f)

a    19
b    20
c    19
d     3
Name: one, dtype: int64

In [159]:
df['one'].map(f)

a    19
b    20
c    19
d     3
Name: one, dtype: int64

In [160]:
df.applymap(f)

Unnamed: 0,one,two,three
a,19,17,3
b,20,19,15
c,19,18,19
d,3,18,19
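
Since applymap() is deprecated from pandas 2.1 onward, here is a hedged sketch of one way to write elementwise code that runs on both older and newer versions. The frame and the helper `text_length` are illustrative; the `hasattr` check is just one possible approach.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.5, 22.25], "two": [3.0, 4.125]})


def text_length(x):
    # takes a single value and returns a single value
    return len(str(x))


# DataFrame.map exists in pandas >= 2.1; fall back to applymap otherwise
elementwise = df.map if hasattr(df, "map") else df.applymap
lengths = elementwise(text_length)

print(lengths)
```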


In [161]:
df.apply(f)

one      81
two      81
three    83
dtype: int64

Series.map() has an additional feature; it can be used to easily “link” or “map” values defined by a secondary series:



In [171]:
s = pd.Series(
    [
        'X', 'X', 'Y', 'X', 'Y', 'Y'
    ],
)
s

0    X
1    X
2    Y
3    X
4    Y
5    Y
dtype: object

In [172]:
s.map(
    pd.Series({'X': 0, 'Y': 1})
)

0    0
1    0
2    1
3    0
4    1
5    1
dtype: int64
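
Series.map() also accepts a plain dict. A small sketch with made-up data shows what happens to values that have no entry in the mapping: they become NaN, which also turns the result into floats.

```python
import pandas as pd

s = pd.Series(["X", "X", "Y", "Z"])

# "Z" has no entry in the mapping, so it maps to NaN
codes = s.map({"X": 0, "Y": 1})

# one way to handle the gaps: fill them with a sentinel and restore integers
codes_filled = codes.fillna(-1).astype(int)

print(codes)
print(codes_filled)
```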

___
Practice

In [174]:
tsdf.head()

Unnamed: 0,A,B,C
2000-01-01,-0.134446,-1.165682,0.559153
2000-01-02,0.754433,-0.27937,-0.241849
2000-01-03,-0.146348,0.357,-0.639142
2000-01-04,,,
2000-01-05,,,


In [178]:
tsdf.head().transform(pd.Series.interpolate, method='linear', order=2)

Unnamed: 0,A,B,C
2000-01-01,-0.134446,-1.165682,0.559153
2000-01-02,0.754433,-0.27937,-0.241849
2000-01-03,-0.146348,0.357,-0.639142
2000-01-04,-0.146348,0.357,-0.639142
2000-01-05,-0.146348,0.357,-0.639142


In [177]:
tsdf.head().apply(pd.Series.interpolate, method='linear', order=2)

Unnamed: 0,A,B,C
2000-01-01,-0.134446,-1.165682,0.559153
2000-01-02,0.754433,-0.27937,-0.241849
2000-01-03,-0.146348,0.357,-0.639142
2000-01-04,-0.146348,0.357,-0.639142
2000-01-05,-0.146348,0.357,-0.639142
