# [Applying Functions to Series and DataFrames](#)

**Applying functions** is a fundamental skill in data manipulation with Pandas. It allows you to transform, analyze, and extract insights from your data efficiently. In this lecture, we'll explore various methods to apply functions to both Series and DataFrames in Pandas.


Pandas provides several powerful methods for applying functions:

- `.apply()`: A versatile method for applying functions to Series or DataFrame axes
- `.map()`: Transforms each element of a Series using a mapping or a function; DataFrames also offer an element-wise `.map()`


<img src="../images/apply-function.png" width="800">

Throughout this lecture, we'll cover:
- How to use built-in functions and create custom functions
- Applying functions to Series and DataFrames
- Advanced techniques like vectorization and aggregation
- Practical examples and best practices for efficient data manipulation


By mastering these techniques, you'll be able to write cleaner, more efficient code for data processing tasks in Pandas.


## <a id='toc1_'></a>[Applying Functions to Series](#toc0_)

Series are one-dimensional labeled arrays in Pandas. We can apply various functions to transform or analyze these Series efficiently.


### <a id='toc1_1_'></a>[Using Built-in Functions](#toc0_)


Pandas and NumPy provide many built-in functions that can be directly applied to Series. These functions are optimized for performance and are often vectorized, meaning they operate on the entire Series at once.


In [1]:
import pandas as pd

In [2]:
# Create a sample Series
s = pd.Series([1, 2, 3, 4, 5])


In [3]:
# Using built-in functions
s.sum()


15

In [4]:
s.mean()


3.0

In [5]:
s.max()


5

In [6]:
s.min()


1

In [7]:
s.abs()


0    1
1    2
2    3
3    4
4    5
dtype: int64

You can also use NumPy functions directly on Series:


In [8]:
import numpy as np

In [9]:
np.log(s)

0    0.000000
1    0.693147
2    1.098612
3    1.386294
4    1.609438
dtype: float64

In [10]:
np.exp(s)

0      2.718282
1      7.389056
2     20.085537
3     54.598150
4    148.413159
dtype: float64

In [11]:
np.sin(s)

0    0.841471
1    0.909297
2    0.141120
3   -0.756802
4   -0.958924
dtype: float64

### <a id='toc1_2_'></a>[Applying Custom Functions with .apply()](#toc0_)


The `.apply()` method allows you to apply custom functions to every element in a Series.


In [12]:
# Define a custom function
def square(x):
    return x ** 2

In [13]:
# Apply the custom function
s.apply(square)

0     1
1     4
2     9
3    16
4    25
dtype: int64

In [14]:
# Using a lambda function
s.apply(lambda x: x ** 2)

0     1
1     4
2     9
3    16
4    25
dtype: int64

In [15]:
# More complex custom function
def categorize(x):
    if x < 3:
        return 'Low'
    elif x < 5:
        return 'Medium'
    else:
        return 'High'

s.apply(categorize)

0       Low
1       Low
2    Medium
3    Medium
4      High
dtype: object

### <a id='toc1_3_'></a>[Using .map() for Series Transformation](#toc0_)


The `.map()` method is useful for transforming a Series based on a mapping (dictionary) or a function.


In [16]:
# Using a dictionary for mapping
mapping = {1: 'One', 2: 'Two', 3: 'Three', 4: 'Four', 5: 'Five'}
s.map(mapping)

0      One
1      Two
2    Three
3     Four
4     Five
dtype: object

In [17]:
# Using a function with map
s.map(lambda x: x * 10)

0    10
1    20
2    30
3    40
4    50
dtype: int64

By default, the function is also applied to missing values. To skip them and keep them as NaN, pass `na_action='ignore'`:

In [18]:
s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])

In [19]:
s.map('I am a {}'.format)

0       I am a cat
1       I am a dog
2       I am a nan
3    I am a rabbit
dtype: object

In [20]:
# Handling missing values in mapping
s.map('I am a {}'.format, na_action='ignore')

0       I am a cat
1       I am a dog
2              NaN
3    I am a rabbit
dtype: object

`.map()` is particularly useful when you want to replace values in a Series based on a predefined mapping.


In [21]:
# Example with categorical data
fruits = pd.Series(['apple', 'banana', 'cherry', 'date', 'elderberry'])
fruit_colors = {'apple': 'red', 'banana': 'yellow', 'cherry': 'red', 'date': 'brown'}

fruits.map(fruit_colors)

0       red
1    yellow
2       red
3     brown
4       NaN
dtype: object
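Keys that are missing from the dictionary come back as `NaN`, as `elderberry` does above. A minimal sketch of two common ways to supply a fallback instead, assuming a placeholder label of `'unknown'` (a hypothetical choice, not part of the original data):

```python
import pandas as pd
from collections import defaultdict

fruits = pd.Series(['apple', 'banana', 'cherry', 'date', 'elderberry'])
fruit_colors = {'apple': 'red', 'banana': 'yellow', 'cherry': 'red', 'date': 'brown'}

# Option 1: map first, then fill the unmapped entries
filled = fruits.map(fruit_colors).fillna('unknown')

# Option 2: a dict subclass that defines __missing__ (here, defaultdict)
# supplies the default during the lookup itself
colors_with_default = defaultdict(lambda: 'unknown', fruit_colors)
defaulted = fruits.map(colors_with_default)
```

Option 1 keeps the mapping itself unchanged, which may be preferable when the same dictionary is reused elsewhere.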

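**Note**: When given a dictionary, `.map()` performs a lookup for each value and is typically faster than calling a Python function per element with `.apply()`. When given a function, the two behave similarly for element-wise work; `.apply()` is more flexible because it can also forward extra arguments to the function.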
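One concrete form of the flexibility of `.apply()` mentioned in the note: `Series.apply()` forwards extra positional arguments through `args` and passes keyword arguments on to the function. A small sketch, where `scale_and_shift` is a hypothetical helper introduced only for illustration:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

def scale_and_shift(x, factor, offset=0):
    # Hypothetical helper: multiply by `factor`, then add `offset`
    return x * factor + offset

# Extra positional arguments go in `args`; keyword arguments pass straight through
result = s.apply(scale_and_shift, args=(10,), offset=1)
```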


These methods provide powerful ways to transform and analyze Series data in Pandas. Choose the appropriate method based on your specific use case and performance requirements.

## <a id='toc2_'></a>[Applying Functions to DataFrames](#toc0_)

DataFrames are two-dimensional labeled data structures in Pandas. We can apply functions to entire DataFrames, specific columns, rows, or individual elements.


### <a id='toc2_1_'></a>[Using DataFrame-wide Functions](#toc0_)


Many built-in functions in Pandas can be applied directly to DataFrames, where they operate column by column by default.


In [22]:
# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
})

# Apply DataFrame-wide functions
df.sum()

A      15
B     150
C    1500
dtype: int64

In [23]:
df.mean()

A      3.0
B     30.0
C    300.0
dtype: float64

In [24]:
df.max()

A      5
B     50
C    500
dtype: int64

In [25]:
df.min()

A      1
B     10
C    100
dtype: int64

In [26]:
df.describe()

Unnamed: 0,A,B,C
count,5.0,5.0,5.0
mean,3.0,30.0,300.0
std,1.581139,15.811388,158.113883
min,1.0,10.0,100.0
25%,2.0,20.0,200.0
50%,3.0,30.0,300.0
75%,4.0,40.0,400.0
max,5.0,50.0,500.0


### <a id='toc2_2_'></a>[Applying Functions to Columns with .apply()](#toc0_)


You can use `.apply()` to apply a function to each column of a DataFrame.


In [27]:
# Apply a function to each column
df.apply(np.sum)

A      15
B     150
C    1500
dtype: int64

In [28]:
# Custom function for columns
def column_stats(col):
    return pd.Series({
        'min': col.min(),
        'max': col.max(),
        'mean': col.mean(),
        'median': col.median()
    })

df.apply(column_stats)

Unnamed: 0,A,B,C
min,1.0,10.0,100.0
max,5.0,50.0,500.0
mean,3.0,30.0,300.0
median,3.0,30.0,300.0


### <a id='toc2_3_'></a>[Applying Functions to Rows with .apply(axis=1)](#toc0_)


To apply a function to each row of a DataFrame, use `.apply()` with `axis=1`.


In [29]:
# Function to apply to each row
def row_sum(row):
    return row.sum()

df.apply(row_sum, axis=1)

0    111
1    222
2    333
3    444
4    555
dtype: int64

In [30]:
# More complex row operation
def categorize_row(row):
    total = row.sum()
    if total < 200:
        return 'Low'
    elif total < 400:
        return 'Medium'
    else:
        return 'High'

df.apply(categorize_row, axis=1)

0       Low
1    Medium
2    Medium
3      High
4      High
dtype: object
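The row categorization above can also be written without a Python-level loop over rows. A sketch under the assumption that the same thresholds (200 and 400) apply, using `pd.cut` on the row totals:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
})

# Row totals computed in one vectorized call
totals = df.sum(axis=1)

# right=False makes each bin [lower, upper), matching the `<` comparisons
row_category = pd.cut(
    totals,
    bins=[-np.inf, 200, 400, np.inf],
    labels=['Low', 'Medium', 'High'],
    right=False,
)
```

The result is a categorical Series rather than an object Series, so convert it with `.astype(str)` if plain strings are needed.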

### <a id='toc2_4_'></a>[Using `.map()` for Element-wise Operations](#toc0_)


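`.map()` applies a function to every element in the DataFrame. `DataFrame.map` is available from pandas 2.1 onward; earlier versions provide the same behavior as `.applymap()`.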
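`DataFrame.map` was introduced in pandas 2.1 as the new name for `DataFrame.applymap`. For code that may run on either side of that change, one possible sketch is to pick whichever method exists (the small frame here is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})

# Prefer DataFrame.map (pandas >= 2.1); fall back to applymap on older versions
elementwise = getattr(df, 'map', None) or df.applymap

# Format every element as a string with two decimal places
formatted = elementwise(lambda x: f"{x:.2f}")
```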


In [60]:
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
})

In [61]:
# Apply a function to every element
df.map(lambda x: f"{x:.2f}")

Unnamed: 0,A,B,C
0,1.00,10.00,100.00
1,2.00,20.00,200.00
2,3.00,30.00,300.00
3,4.00,40.00,400.00
4,5.00,50.00,500.00


In [62]:
# Another example: categorizing values
def categorize(x):
    if x <= 50:
        return 'Low'
    elif x <= 250:
        return 'Medium'
    else:
        return 'High'

df.map(categorize)

Unnamed: 0,A,B,C
0,Low,Low,Medium
1,Low,Low,Medium
2,Low,Low,High
3,Low,Low,High
4,Low,Low,High


**Note**: While `.map()` is convenient for element-wise operations, it can be slower than vectorized operations for large DataFrames. When possible, use vectorized operations or apply functions to specific columns for better performance.


In [63]:
# Vectorized operation (faster)
bins = [-np.inf, 50, 250, np.inf]
labels = ['Low', 'Medium', 'High']
df_categorized = pd.DataFrame({
    'A': pd.cut(df['A'], bins=bins, labels=labels),
    'B': pd.cut(df['B'], bins=bins, labels=labels),
    'C': pd.cut(df['C'], bins=bins, labels=labels)
})

df_categorized

Unnamed: 0,A,B,C
0,Low,Low,Medium
1,Low,Low,Medium
2,Low,Low,High
3,Low,Low,High
4,Low,Low,High
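When every column uses the same bins, the per-column `pd.cut` calls can be collapsed into one column-wise `.apply()`. This is a sketch of an alternative spelling rather than a faster method: `pd.cut` still does the vectorized work on each column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
})

bins = [-np.inf, 50, 250, np.inf]
labels = ['Low', 'Medium', 'High']

# Keyword arguments given to .apply() are forwarded to pd.cut for each column
df_categorized = df.apply(pd.cut, bins=bins, labels=labels)
```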


These methods provide powerful ways to manipulate and analyze DataFrame data in Pandas. Choose the appropriate method based on your specific use case, considering both functionality and performance.

## <a id='toc3_'></a>[Advanced Function Application](#toc0_)

As you become more proficient with Pandas, you'll encounter situations that require more advanced function application techniques. This section covers vectorized operations, lambda functions, and applying multiple functions simultaneously.


### <a id='toc3_1_'></a>[Vectorized Operations for Performance](#toc0_)


Vectorized operations in Pandas are highly optimized and perform operations on entire arrays at once, rather than element by element. This leads to significant performance improvements, especially for large datasets.


In [35]:
# Create a sample DataFrame
df = pd.DataFrame({
    'A': np.random.rand(100000),
    'B': np.random.rand(100000)
})
df

Unnamed: 0,A,B
0,0.298588,0.024213
1,0.543137,0.699933
2,0.466441,0.592198
3,0.374486,0.866813
4,0.891496,0.299927
...,...,...
99995,0.474479,0.814835
99996,0.657864,0.767147
99997,0.139012,0.356356
99998,0.343908,0.801411


In [36]:
# Vectorized operation (fast)
%timeit df['C'] = df['A'] + df['B']

77.4 µs ± 649 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [37]:
# Non-vectorized operation using apply (slow)
%timeit df['D'] = df.apply(lambda row: row['A'] + row['B'], axis=1)

279 ms ± 1.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [38]:
# Compare results
df.head()

Unnamed: 0,A,B,C,D
0,0.298588,0.024213,0.322801,0.322801
1,0.543137,0.699933,1.24307,1.24307
2,0.466441,0.592198,1.058639,1.058639
3,0.374486,0.866813,1.241299,1.241299
4,0.891496,0.299927,1.191422,1.191422


Whenever possible, use vectorized operations for better performance:


In [39]:
# More vectorized operations
df['E'] = np.sqrt(df['A'])
df['F'] = np.where(df['B'] > 0.5, 'High', 'Low')

df.head()

Unnamed: 0,A,B,C,D,E,F
0,0.298588,0.024213,0.322801,0.322801,0.546432,Low
1,0.543137,0.699933,1.24307,1.24307,0.736978,High
2,0.466441,0.592198,1.058639,1.058639,0.682965,High
3,0.374486,0.866813,1.241299,1.241299,0.611953,High
4,0.891496,0.299927,1.191422,1.191422,0.94419,Low


### <a id='toc3_2_'></a>[Using lambda Functions](#toc0_)


Lambda functions are anonymous, inline functions that are useful for simple operations. They're often used with `.apply()`, `.map()`, and other Pandas methods.


In [40]:
# Using lambda with Series
s = pd.Series([1, 2, 3, 4, 5])
s.apply(lambda x: x**2 if x % 2 == 0 else x**3)

0      1
1      4
2     27
3     16
4    125
dtype: int64

In [41]:
# Using lambda with DataFrame columns
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50]
})
df

Unnamed: 0,A,B
0,1,10
1,2,20
2,3,30
3,4,40
4,5,50


In [42]:
df['C'] = df['A'].apply(lambda x: 'Even' if x % 2 == 0 else 'Odd')
df['D'] = df.apply(lambda row: row['A'] * row['B'], axis=1)
df

Unnamed: 0,A,B,C,D
0,1,10,Odd,10
1,2,20,Even,40
2,3,30,Odd,90
3,4,40,Even,160
4,5,50,Odd,250


While lambda functions are convenient, they can make code less readable for complex operations. In such cases, it's often better to define a named function.
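As an illustration of that trade-off, a chained conditional that would be hard to read as a lambda can be written as a named function. The thresholds and sample values below are hypothetical and serve only to show the pattern:

```python
import pandas as pd

def income_category(income):
    """Label an income as 'High', 'Medium', or 'Low' (illustrative thresholds)."""
    if income > 60000:
        return 'High'
    if income > 40000:
        return 'Medium'
    return 'Low'

incomes = pd.Series([35000, 45000, 60000, 75000])

# The named function documents itself and can be tested on its own
categories = incomes.apply(income_category)
```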


### <a id='toc3_3_'></a>[Applying a Function to Multiple Columns](#toc0_)

When working with DataFrames, you often need to apply the same function to multiple columns simultaneously. Pandas provides several ways to accomplish this efficiently.

You can use the `.apply()` method on a subset of columns:


In [74]:
# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500],
    'D': ['a', 'b', 'c', 'd', 'e']
})
df

Unnamed: 0,A,B,C,D
0,1,10,100,a
1,2,20,200,b
2,3,30,300,c
3,4,40,400,d
4,5,50,500,e


In [75]:
# Apply a function to multiple numeric columns
df[['A', 'B', 'C']].apply(lambda x: x * 2)

Unnamed: 0,A,B,C
0,2,20,200
1,4,40,400
2,6,60,600
3,8,80,800
4,10,100,1000


For element-wise operations on multiple columns, you can use `.map()`:


In [77]:
# Apply a function to all elements in specific columns
df[['A', 'B', 'C']].map(lambda x: f"{x:.2f}")

Unnamed: 0,A,B,C
0,1.00,10.00,100.00
1,2.00,20.00,200.00
2,3.00,30.00,300.00
3,4.00,40.00,400.00
4,5.00,50.00,500.00


These methods provide flexible and efficient ways to apply functions to multiple columns in Pandas DataFrames. Choose the most appropriate method based on your specific use case, considering both readability and performance.

Remember that for large datasets, vectorized operations and `pd.eval()` are generally more performant than apply-based methods. However, for complex custom logic, `.apply()` and its variants remain valuable tools in the Pandas toolkit.
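Since `pd.eval()` has not appeared earlier in this lecture, here is a minimal sketch using `DataFrame.eval()`, which evaluates a string expression against the frame's columns through the same expression engine. Any speed benefit is generally expected only on large frames and with the optional `numexpr` package installed; on a tiny frame like this one, the example mainly shows the syntax.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# Evaluate a column expression given as a string
total = df.eval('A + B')

# Equivalent plain vectorized form, for comparison
same_total = df['A'] + df['B']
```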

### <a id='toc3_4_'></a>[Applying Multiple Functions with .agg()](#toc0_)


The `.agg()` method allows you to apply multiple functions to one or more columns simultaneously.


In [43]:
# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
})
df

Unnamed: 0,A,B,C
0,1,10,100
1,2,20,200
2,3,30,300
3,4,40,400
4,5,50,500


In [44]:
# Apply multiple functions to all columns
df.agg(['sum', 'mean', 'max', 'min'])

Unnamed: 0,A,B,C
sum,15.0,150.0,1500.0
mean,3.0,30.0,300.0
max,5.0,50.0,500.0
min,1.0,10.0,100.0


In [70]:
# Same call with the axis named explicitly ('rows' is the default axis 0)
df.agg(['sum', 'mean', 'max', 'min'], axis='rows')

Unnamed: 0,A,B,C
sum,15.0,150.0,1500.0
mean,3.0,30.0,300.0
max,5.0,50.0,500.0
min,1.0,10.0,100.0


In [71]:
# axis=0 is the integer form of the same axis
df.agg(['sum', 'mean', 'max', 'min'], axis=0)

Unnamed: 0,A,B,C
sum,15.0,150.0,1500.0
mean,3.0,30.0,300.0
max,5.0,50.0,500.0
min,1.0,10.0,100.0


In [73]:
# Apply multiple functions across each row instead
df.agg(['sum', 'mean', 'max', 'min'], axis='columns')

Unnamed: 0,sum,mean,max,min
0,111.0,37.0,100.0,1.0
1,222.0,74.0,200.0,2.0
2,333.0,111.0,300.0,3.0
3,444.0,148.0,400.0,4.0
4,555.0,185.0,500.0,5.0


In [45]:
# Apply different functions to different columns
df.agg({
    'A': ['sum', 'mean'],
    'B': ['min', 'max'],
    'C': 'std'
})

Unnamed: 0,A,B,C
sum,15.0,,
mean,3.0,,
min,,10.0,
max,,50.0,
std,,,158.113883


In [46]:
# Using custom functions with .agg()
def range_calc(x):
    return x.max() - x.min()

df.agg(['mean', range_calc])

Unnamed: 0,A,B,C
mean,3.0,30.0,300.0
range_calc,4.0,40.0,400.0


The `.agg()` method is particularly useful when you need to compute multiple summary statistics for your data efficiently.
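The same method works on a single column or on a subset of columns. A short sketch, reusing the sample values from this section:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
})

# One column: the result is a Series indexed by function name
a_stats = df['A'].agg(['sum', 'mean'])

# A subset of columns: the result is a DataFrame with one row per function
ab_stats = df[['A', 'B']].agg(['min', 'max'])
```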


These advanced techniques allow for more flexible and efficient data manipulation in Pandas. By combining vectorized operations, lambda functions, and aggregation methods, you can perform complex data transformations and analyses with concise and performant code.

## <a id='toc4_'></a>[Practical Examples and Use Cases](#toc0_)

Let's explore some practical examples and use cases for applying functions to Series and DataFrames. These examples will demonstrate how to use the techniques we've learned in real-world scenarios.


### <a id='toc4_1_'></a>[Example 1: Data Cleaning and Transformation](#toc0_)


Suppose we have a dataset of customer information with some inconsistencies:


In [47]:
# Create a sample DataFrame
df = pd.DataFrame({
    'Name': ['John Smith', 'Jane Doe', 'Bob Johnson', 'Alice Brown'],
    'Age': [28, 35, 42, 31],
    'Income': ['$45,000', '$60,000', '$75,000', '$55,000'],
    'City': ['NEW YORK', 'los angeles', 'Chicago', 'HOUston']
})

df

Unnamed: 0,Name,Age,Income,City
0,John Smith,28,"$45,000",NEW YORK
1,Jane Doe,35,"$60,000",los angeles
2,Bob Johnson,42,"$75,000",Chicago
3,Alice Brown,31,"$55,000",HOUston


Let's clean and transform this data:


In [48]:
# Clean up names (capitalize first letter of each word)
df['Name'] = df['Name'].apply(lambda x: x.title())
df

Unnamed: 0,Name,Age,Income,City
0,John Smith,28,"$45,000",NEW YORK
1,Jane Doe,35,"$60,000",los angeles
2,Bob Johnson,42,"$75,000",Chicago
3,Alice Brown,31,"$55,000",HOUston


In [49]:
# Convert income to numeric, removing '$' and ','
df['Income'] = df['Income'].apply(lambda x: float(x.replace('$', '').replace(',', '')))
df

Unnamed: 0,Name,Age,Income,City
0,John Smith,28,45000.0,NEW YORK
1,Jane Doe,35,60000.0,los angeles
2,Bob Johnson,42,75000.0,Chicago
3,Alice Brown,31,55000.0,HOUston
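The income conversion can also be done with the vectorized `.str` accessor instead of a per-element lambda. A sketch on a standalone Series with the same values:

```python
import pandas as pd

income = pd.Series(['$45,000', '$60,000', '$75,000', '$55,000'])

# Strip '$' and ',' with one regular expression, then convert the strings to floats
income_numeric = income.str.replace(r'[$,]', '', regex=True).astype(float)
```

Unlike the lambda, the `.str` methods pass missing values through as `NaN` rather than raising an error on them.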


In [50]:
# Standardize city names (title case, so multi-word names stay capitalized)
df['City'] = df['City'].apply(lambda x: x.title())
df

Unnamed: 0,Name,Age,Income,City
0,John Smith,28,45000.0,New York
1,Jane Doe,35,60000.0,Los Angeles
2,Bob Johnson,42,75000.0,Chicago
3,Alice Brown,31,55000.0,Houston


In [51]:
# Add a new column for income category
df['Income_Category'] = df['Income'].apply(lambda x: 'High' if x > 60000 else 'Medium' if x > 40000 else 'Low')
df

Unnamed: 0,Name,Age,Income,City,Income_Category
0,John Smith,28,45000.0,New York,Medium
1,Jane Doe,35,60000.0,Los Angeles,Medium
2,Bob Johnson,42,75000.0,Chicago,High
3,Alice Brown,31,55000.0,Houston,Medium


### <a id='toc4_2_'></a>[Example 2: Text Data Processing](#toc0_)


Let's process a dataset containing customer reviews:


In [52]:
# Create a sample DataFrame with customer reviews
df = pd.DataFrame({
    'Review': [
        "Great product, highly recommended!",
        "Disappointing quality, wouldn't buy again.",
        "Average product, nothing special.",
        "Excellent service and fast delivery!",
        "Terrible customer support, avoid this company."
    ]
})
df

Unnamed: 0,Review
0,"Great product, highly recommended!"
1,"Disappointing quality, wouldn't buy again."
2,"Average product, nothing special."
3,Excellent service and fast delivery!
4,"Terrible customer support, avoid this company."


In [53]:
# Function to calculate review length
def review_length(text):
    return len(text.split())

In [54]:
# Function to detect sentiment (very simple approach)
def detect_sentiment(text):
    positive_words = ['great', 'excellent', 'good', 'best', 'amazing']
    negative_words = ['bad', 'terrible', 'worst', 'disappointing', 'avoid']

    text_lower = text.lower()
    if any(word in text_lower for word in positive_words):
        return 'Positive'
    elif any(word in text_lower for word in negative_words):
        return 'Negative'
    else:
        return 'Neutral'

In [55]:
# Apply functions to the DataFrame
df['Review_Length'] = df['Review'].apply(review_length)
df['Sentiment'] = df['Review'].apply(detect_sentiment)

df

Unnamed: 0,Review,Review_Length,Sentiment
0,"Great product, highly recommended!",4,Positive
1,"Disappointing quality, wouldn't buy again.",5,Negative
2,"Average product, nothing special.",4,Neutral
3,Excellent service and fast delivery!,5,Positive
4,"Terrible customer support, avoid this company.",6,Negative
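Parts of this processing can also be expressed with the `.str` accessor. The sketch below reproduces the word count and a positive-keyword flag; it keeps the same simple substring matching as `detect_sentiment`, so it is equally rough as a sentiment signal.

```python
import pandas as pd

reviews = pd.Series([
    "Great product, highly recommended!",
    "Disappointing quality, wouldn't buy again.",
    "Average product, nothing special.",
    "Excellent service and fast delivery!",
    "Terrible customer support, avoid this company."
])

# Word count per review: split on whitespace, then count the pieces
review_length = reviews.str.split().str.len()

# True where any positive keyword occurs (case-insensitive substring match)
positive_words = ['great', 'excellent', 'good', 'best', 'amazing']
has_positive = reviews.str.contains('|'.join(positive_words), case=False)
```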


These examples demonstrate how applying functions to Series and DataFrames can be used in real-world scenarios such as data cleaning, data transformation, and simple text processing. The flexibility of Pandas makes it well suited to diverse data manipulation tasks.