# Basics - Apply, Map and Vectorised Functions

In [1]:
import pandas as pd
import numpy as np

data = np.round(np.random.normal(size=(4, 3)), 2)
df = pd.DataFrame(data, columns=["A", "B", "C"])
df.head()

      A     B     C
0 -0.36  1.25 -0.26
1  0.05 -0.51  0.11
2  0.56 -0.13  0.16
3  0.37 -0.84 -2.01


## Apply

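Used to execute an arbitrary function against an entire DataFrame, or a subsection of it. By default the function receives each column as a Series, so vectorised operations inside it act on the whole column at once.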

In [2]:
df.apply(lambda x: 1 + np.abs(x))

      A     B     C
0  1.36  2.25  1.26
1  1.05  1.51  1.11
2  1.56  1.13  1.16
3  1.37  1.84  3.01


In [3]:
df.A.apply(np.abs)

0    0.36
1    0.05
2    0.56
3    0.37
Name: A, dtype: float64
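
The direction in which `apply` walks a DataFrame is controlled by its `axis` argument. A minimal sketch, assuming a small hypothetical frame named `demo` (not part of the notebook's data), that computes a range per column and then per row:

```python
import pandas as pd

# Hypothetical data, used only to illustrate the axis argument
demo = pd.DataFrame({"A": [1, 2], "B": [10, 20]})

# Default (axis=0): the function receives each column as a Series
col_ranges = demo.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series
row_ranges = demo.apply(lambda row: row.max() - row.min(), axis=1)

print(col_ranges)
print(row_ranges)
```

The column-wise result is indexed by column name, the row-wise result by the original row index.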

In [6]:
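# This version fails: apply passes a whole column (a Series) as x,
# and `if x > 0` on a Series raises a ValueError.
#def double_if_positive(x):
#    if x > 0:
#        return 2 * x
#    return x
#
#df.apply(double_if_positive)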
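
The commented-out version above is written for a single number, which is why it does not work when `apply` hands it a whole column. A minimal sketch of that failure, plus one way to run a scalar function element by element with `Series.map`; the names `demo` and `double_if_positive_scalar` are hypothetical and used only for illustration:

```python
import pandas as pd

def double_if_positive_scalar(x):
    # Written for one number at a time, not for a whole column
    if x > 0:
        return 2 * x
    return x

# Hypothetical data, used only for this illustration
demo = pd.DataFrame({"A": [-1.0, 2.0], "B": [3.0, -4.0]})

try:
    demo.apply(double_if_positive_scalar)
    failed = False
except ValueError as err:
    # The truth value of a Series is ambiguous
    failed = True
    print("apply failed:", err)

# Map the scalar function over every element of every column
elementwise = demo.apply(lambda col: col.map(double_if_positive_scalar))
print(elementwise)
```

This works, but it calls a Python function once per element, so it is usually slower than a vectorised expression.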

In [4]:
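def double_if_positive(x):
    # Boolean-mask assignment works on a whole column, but it is in-place
    # and can also change the original DataFrame
    x[x > 0] *= 2
    return x

df.apply(double_if_positive)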

      A     B     C
0 -0.36  2.50 -0.26
1  0.10 -0.51  0.22
2  1.12 -0.13  0.32
3  0.74 -0.84 -2.01


In [6]:
df

      A     B     C
0 -0.36  2.50 -0.26
1  0.10 -0.51  0.22
2  1.12 -0.13  0.32
3  0.74 -0.84 -2.01


In [7]:
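def double_if_positive(x):
    x = x.copy()  # work on a copy so df itself is left untouched
    x[x > 0] *= 2
    return x

# raw=True passes each column as a NumPy array instead of a Series
df.apply(double_if_positive, raw=True)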

      A     B     C
0 -0.36  5.00 -0.26
1  0.20 -0.51  0.44
2  2.24 -0.13  0.64
3  1.48 -0.84 -2.01
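
Another way to leave the original data alone is to skip the in-place assignment altogether. A minimal sketch using `DataFrame.mask`, which returns a new object with values replaced wherever the condition holds; `demo` is a hypothetical frame used only for illustration:

```python
import pandas as pd

# Hypothetical data, used only for this illustration
demo = pd.DataFrame({"A": [-1.0, 2.0], "B": [3.0, -4.0]})

# Where demo > 0, take the value from demo * 2; elsewhere keep demo
doubled = demo.mask(demo > 0, demo * 2)

print(doubled)
print(demo)  # the original frame is unchanged
```

Because nothing is assigned into `demo`, no explicit copy is needed.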


## Map

Similar to apply, but operates on a Series element by element, and accepts a dictionary (or a function) as the mapping. With a dictionary, values that have no matching key become NaN.


In [11]:
series = pd.Series(["Steve", "Alex", "Jess", "Mark"])

In [12]:
series.map({"Steve": "Stephen"})

0    Stephen
1        NaN
2        NaN
3        NaN
dtype: object
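
If the unmatched names should be kept rather than turned into NaN, there are a couple of common options. A minimal sketch, assuming the same four names as above; the variable names `names`, `mapping`, `kept` and `replaced` are only for illustration:

```python
import pandas as pd

names = pd.Series(["Steve", "Alex", "Jess", "Mark"])
mapping = {"Steve": "Stephen"}

# Option 1: map, then fill the gaps with the original values
kept = names.map(mapping).fillna(names)

# Option 2: replace only touches values that appear in the dictionary
replaced = names.replace(mapping)

print(kept.tolist())
print(replaced.tolist())
```

Both give the same result here; which one reads better depends on whether missing keys are expected.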

In [13]:
series.map(lambda d: f"I am {d}")

0    I am Steve
1     I am Alex
2     I am Jess
3     I am Mark
dtype: object

## Vectorised functions

Pandas and NumPy provide many vectorised functions. Here are some examples.

In [17]:
display(df, df.abs())

      A     B     C
0 -1.01 -0.70  1.24
1  1.56  0.10  1.36
2  0.76 -0.05 -0.07
3 -1.55 -0.19 -0.08


      A     B     C
0  1.01  0.70  1.24
1  1.56  0.10  1.36
2  0.76  0.05  0.07
3  1.55  0.19  0.08


In [18]:
series = pd.Series(["Obi-Wan Kenobi", "Luke Skywalker", "Han Solo", "Leia Organa"])

In [20]:
"Luke Skywalker".split()

['Luke', 'Skywalker']

In [23]:
series.str.split(expand=True)

         0          1
0  Obi-Wan     Kenobi
1     Luke  Skywalker
2      Han       Solo
3     Leia     Organa


In [24]:
series.str.contains("Skywalker")

0    False
1     True
2    False
3    False
dtype: bool

In [26]:
series.str.upper().str.split()

0    [OBI-WAN, KENOBI]
1    [LUKE, SKYWALKER]
2          [HAN, SOLO]
3       [LEIA, ORGANA]
dtype: object
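
The `.str` accessor can also index into the lists that `str.split` produces, which makes it easy to chain further steps. A minimal sketch, assuming the same four names as above; `names`, `first_names` and `surname_lengths` are illustrative variable names:

```python
import pandas as pd

names = pd.Series(["Obi-Wan Kenobi", "Luke Skywalker", "Han Solo", "Leia Organa"])

# .str[0] picks the first item of each split list
first_names = names.str.split().str[0]

# Chain again: take the last item, then measure its length
surname_lengths = names.str.split().str[-1].str.len()

print(first_names.tolist())
print(surname_lengths.tolist())
```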

## User defined functions

Let's investigate a simple example: finding the hypotenuse given x and y distances.


In [16]:
data2 = np.random.normal(10, 2, size=(100000, 2))
df2 = pd.DataFrame(data2, columns=["x", "y"])

In [17]:
hypot = (df2.x**2 + df2.y**2)**0.5
print(hypot[0])

8.602557842281396


In [18]:
def hypot1(x, y):
    return np.sqrt(x**2 + y**2)

h1 = []
for index, (x, y) in df2.iterrows():
    h1.append(hypot1(x, y))
print(h1[0])

8.602557842281396


In [19]:
def hypot2(row):
    return np.sqrt(row.x**2 + row.y**2)

h2 = df2.apply(hypot2, axis=1)
print(h2[0])

8.602557842281396


In [20]:
def hypot3(xs, ys):
    return np.sqrt(xs**2 + ys**2)
h3 = hypot3(df2.x, df2.y)
print(h3[0])

8.602557842281396
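
For this particular calculation NumPy already ships a vectorised function, `np.hypot`. A minimal sketch comparing it with the hand-written expression; the frame `points` and the seeded generator are assumptions made so the example is self-contained:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df2, with a fixed seed for repeatability
rng = np.random.default_rng(0)
points = pd.DataFrame(rng.normal(10, 2, size=(1000, 2)), columns=["x", "y"])

manual = np.sqrt(points.x**2 + points.y**2)
builtin = np.hypot(points.x, points.y)

print(np.allclose(manual, builtin))
```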


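Vectorising everything you can is the key to speeding up your code. Of the approaches above, the iterrows loop and the row-wise apply call a Python function once per row, while the other two operate on whole columns at once. Once you've vectorised what you can, use profiling tools to investigate further. PyCharm Professional has a great optimisation tool built in, and Jupyter has the %lprun (line profiler) magic command, available here: https://github.com/rkern/line_profiler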
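
A rough way to see the difference for yourself is to time each approach with the standard-library `timeit` module. This is a minimal sketch, not a careful benchmark: the frame `small`, its size, and the function names are assumptions chosen so the slow versions finish quickly, and the numbers will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the 100,000 rows used above so the slow versions stay quick
rng = np.random.default_rng(0)
small = pd.DataFrame(rng.normal(10, 2, size=(2000, 2)), columns=["x", "y"])

def loop_version(frame):
    # One Python-level call per row via iterrows
    return [np.sqrt(x**2 + y**2) for _, (x, y) in frame.iterrows()]

def apply_version(frame):
    # Row-wise apply: still one Python-level call per row
    return frame.apply(lambda row: np.sqrt(row.x**2 + row.y**2), axis=1)

def vectorised_version(frame):
    # Whole-column arithmetic
    return np.sqrt(frame.x**2 + frame.y**2)

timings = {}
for name, func in [
    ("iterrows", loop_version),
    ("apply", apply_version),
    ("vectorised", vectorised_version),
]:
    # Total time for three runs of each approach
    timings[name] = timeit.timeit(lambda: func(small), number=3)
    print(f"{name:>10}: {timings[name]:.4f} s")
```

All three functions return the same values, so any gap in the printed times comes from how the work is dispatched rather than from the arithmetic itself.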

### Recap

* apply
* map
* .str & similar