# [String Manipulation](#)

String manipulation is a crucial skill in data analysis and preprocessing. Pandas provides a powerful and efficient way to work with text data in Series and DataFrames. In this lecture, we'll explore various techniques for manipulating strings in Pandas, from basic operations to advanced regex-based transformations.


Pandas leverages Python's string methods and extends them to work with Series and DataFrames, allowing for vectorized string operations. This means you can apply string functions to entire columns of data at once, which is both convenient and performant.


Some key benefits of using Pandas for string manipulation include:

1. **Vectorized operations**: Apply string methods to entire Series or DataFrame columns efficiently.
2. **Handling of missing values**: String methods pass NaN/None values through to the result instead of raising errors.
3. **Chaining methods**: Combine multiple string operations in a single line of code.
4. **Integration with regular expressions**: Powerful pattern matching and extraction capabilities.


Throughout this lecture, we'll cover:
- Accessing string methods in Pandas
- Basic string operations like changing case and trimming whitespace
- Extracting and replacing substrings
- Working with regular expressions
- Splitting and joining strings
- Advanced string operations and practical use cases


Let's start by importing Pandas and creating a sample DataFrame to work with:


In [1]:
import pandas as pd
import numpy as np

# Create a sample DataFrame with text data
df = pd.DataFrame({
    'Name': ['John Smith', 'Jane Doe', 'Bob Johnson', 'Alice Brown', np.nan],
    'Email': ['john@example.com', 'jane@example.com', 'bob@example.com', 'alice@example.com', 'unknown@example.com'],
    'Address': ['123 Main St, NY', '456 Elm St, CA', '789 Oak St, TX', '321 Pine St, FL', '654 Maple St, WA']
})

df

Unnamed: 0,Name,Email,Address
0,John Smith,john@example.com,"123 Main St, NY"
1,Jane Doe,jane@example.com,"456 Elm St, CA"
2,Bob Johnson,bob@example.com,"789 Oak St, TX"
3,Alice Brown,alice@example.com,"321 Pine St, FL"
4,,unknown@example.com,"654 Maple St, WA"


This DataFrame contains common types of string data you might encounter in real-world datasets. We'll use this DataFrame to demonstrate various string manipulation techniques throughout the lecture.


Remember that when working with strings in Pandas, we typically access string methods through the `.str` accessor. This allows Pandas to apply the operations in a vectorized manner, which is much more efficient than iterating over each element individually.


In the following sections, we'll dive deep into specific string manipulation techniques and see how they can be applied to clean, transform, and extract information from text data in Pandas.

## <a id='toc1_'></a>[Accessing String Methods in Pandas](#toc0_)

In Pandas, string methods are accessed through the `.str` accessor. This accessor provides a convenient way to apply string operations to entire Series or DataFrame columns containing string data. The `.str` accessor allows you to use most of Python's string methods, as well as some additional Pandas-specific string functions.


Here's how you can access string methods in Pandas:


In [2]:
# Create a sample Series with string data
s = pd.Series(['apple', 'banana', 'cherry', 'date', np.nan])

# Access string methods using the .str accessor
s.str.upper()

0     APPLE
1    BANANA
2    CHERRY
3      DATE
4       NaN
dtype: object

Let's explore some key points about using the `.str` accessor:


### <a id='toc1_1_'></a>[Vectorized Operations](#toc0_)


The `.str` accessor applies operations to all non-null elements in a Series or DataFrame column. This vectorized approach is much faster than iterating over each element individually.


In [3]:
# Vectorized operation on the 'Name' column
df['Name'].str.lower()

0     john smith
1       jane doe
2    bob johnson
3    alice brown
4            NaN
Name: Name, dtype: object
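
To make the comparison concrete, the sketch below contrasts an explicit Python loop with the `.str` accessor on a small Series. The `names` Series is illustrative data (not the lecture's `df`), and it is created with `dtype=object` as an assumption so that the missing value behaves like the `NaN` shown in the outputs above.

```python
import numpy as np
import pandas as pd

# Illustrative data; dtype=object keeps the classic NaN-based behavior
names = pd.Series(['John Smith', 'Jane Doe', np.nan], dtype=object)

# Explicit loop: missing values have to be checked by hand
looped = pd.Series(
    [value.lower() if isinstance(value, str) else value for value in names],
    index=names.index,
    dtype=object,
)

# .str accessor: one call, missing values are passed through automatically
vectorized = names.str.lower()

print(vectorized.equals(looped))  # True
```

Both approaches give the same result; the accessor version is shorter and does not need the `isinstance` guard.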

### <a id='toc1_2_'></a>[Handling Missing Values](#toc0_)


Pandas handles missing values (NaN or None) for you when using string methods. Most methods leave them in place, so a missing input produces a missing output; boolean checks also accept an `na` argument to choose a fill value.


In [4]:
# Notice how the NaN value is preserved
df['Name'].str.len()

0    10.0
1     8.0
2    11.0
3    11.0
4     NaN
Name: Name, dtype: float64
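
The `float64` dtype in this output is a side effect of the missing value: a plain integer column cannot hold `NaN`. As a minimal sketch, assuming you would rather keep whole-number lengths, the result can be converted to pandas' nullable `Int64` dtype. The `names` Series here is illustrative and uses `dtype=object` as an assumption to match the outputs above.

```python
import numpy as np
import pandas as pd

names = pd.Series(['John Smith', 'Jane Doe', np.nan], dtype=object)

# NaN forces the lengths into a float column
lengths = names.str.len()
print(lengths.dtype)  # float64

# The nullable integer dtype keeps the gap as <NA> and the lengths as integers
nullable_lengths = lengths.astype('Int64')
print(nullable_lengths.dtype)  # Int64
```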

### <a id='toc1_4_'></a>[Accessing Individual Characters](#toc0_)


You can access individual characters or slices of strings using indexing:


In [None]:
# Get the first character of each name
df['Name'].str[0]

0      J
1      J
2      B
3      A
4    NaN
Name: Name, dtype: object

In [None]:
# Get the first three characters of each name
df['Name'].str[:3]

0    Joh
1    Jan
2    Bob
3    Ali
4    NaN
Name: Name, dtype: object

### <a id='toc1_3_'></a>[Method Chaining](#toc0_)


You can chain multiple string operations together for more complex transformations:


In [5]:
# Chain multiple string operations
df['Email'].str.split('@').str[0].str.capitalize()

0       John
1       Jane
2        Bob
3      Alice
4    Unknown
Name: Email, dtype: object

### <a id='toc1_5_'></a>[Boolean Indexing](#toc0_)


String methods that return boolean values can be used for filtering:


In [8]:
# Boolean Series: True where the name starts with 'J'; na=False maps missing names to False
df['Name'].str.startswith('J', na=False)

0     True
1     True
2    False
3    False
4    False
Name: Name, dtype: bool

In [9]:
# Filter names that start with 'J'
# df[df['Name'].str.startswith('J')] # will raise an error since there are NaN values
df[df['Name'].str.startswith('J', na=False)] # use na=False to handle NaN values

Unnamed: 0,Name,Email,Address
0,John Smith,john@example.com,"123 Main St, NY"
1,Jane Doe,jane@example.com,"456 Elm St, CA"


### <a id='toc1_6_'></a>[Applying to DataFrame Columns](#toc0_)


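The `.str` accessor is available on a Series (a single column), not on a DataFrame. To apply a string method to several columns at once, call it on each column with `DataFrame.apply()`: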


In [10]:
# Apply string method to multiple columns
df[['Name', 'Email']].apply(lambda x: x.str.upper())

Unnamed: 0,Name,Email
0,JOHN SMITH,JOHN@EXAMPLE.COM
1,JANE DOE,JANE@EXAMPLE.COM
2,BOB JOHNSON,BOB@EXAMPLE.COM
3,ALICE BROWN,ALICE@EXAMPLE.COM
4,,UNKNOWN@EXAMPLE.COM


### <a id='toc1_7_'></a>[Regular Expressions](#toc0_)


The `.str` accessor also provides methods for working with regular expressions:


In [11]:
# Extract the domain from email addresses
df['Email'].str.extract(r'@(.+)$')

Unnamed: 0,0
0,example.com
1,example.com
2,example.com
3,example.com
4,example.com


Remember that when using the `.str` accessor, you're working with Pandas Series or DataFrame columns, not individual Python strings. This means that some Python string methods might behave slightly differently or have additional parameters when used through the `.str` accessor.


By leveraging the `.str` accessor, you can perform powerful and efficient string manipulations on your Pandas data structures. In the following sections, we'll explore specific string operations in more detail and see how they can be applied to solve common data cleaning and transformation tasks.

## <a id='toc2_'></a>[Basic String Operations](#toc0_)

Pandas provides a variety of basic string operations that are frequently used in data cleaning and preprocessing. Let's explore some of the most common operations.


### <a id='toc2_1_'></a>[Changing Case (lower, upper, title, swapcase)](#toc0_)


Changing the case of strings is often necessary for standardization or formatting purposes.


In [12]:
# Create a sample Series
s = pd.Series(['APPLE', 'banana', 'Cherry', 'DATE'])

In [13]:
# Convert to lowercase
s.str.lower()

0     apple
1    banana
2    cherry
3      date
dtype: object

In [14]:
# Convert to uppercase
s.str.upper()

0     APPLE
1    BANANA
2    CHERRY
3      DATE
dtype: object

In [15]:
# Convert to title case
s.str.title()

0     Apple
1    Banana
2    Cherry
3      Date
dtype: object

In [16]:
# Swap case
s.str.swapcase()

0     apple
1    BANANA
2    cHERRY
3      date
dtype: object

In [17]:
# Applying to our DataFrame
df['Name'].str.title()

0     John Smith
1       Jane Doe
2    Bob Johnson
3    Alice Brown
4            NaN
Name: Name, dtype: object

### <a id='toc2_2_'></a>[Trimming Whitespace (strip, lstrip, rstrip)](#toc0_)


Removing unwanted whitespace is crucial for data cleaning and comparison operations.


In [18]:
# Create a Series with whitespace
s = pd.Series(['  apple  ', ' banana ', 'cherry  ', '  date'])

In [19]:
# Remove whitespace from both ends
s.str.strip()

0     apple
1    banana
2    cherry
3      date
dtype: object

In [20]:
# Remove whitespace from left side
s.str.lstrip()

0     apple  
1     banana 
2    cherry  
3        date
dtype: object

In [21]:
# Remove whitespace from right side
s.str.rstrip()

0      apple
1     banana
2     cherry
3       date
dtype: object

In [22]:
# Applying to our DataFrame
df['Address'] = df['Address'].str.strip()
df['Address']

0     123 Main St, NY
1      456 Elm St, CA
2      789 Oak St, TX
3     321 Pine St, FL
4    654 Maple St, WA
Name: Address, dtype: object

You can also specify characters to remove:


In [23]:
# Remove specific characters
pd.Series(['##apple##', '#banana#', 'cherry##']).str.strip('#')

0     apple
1    banana
2    cherry
dtype: object

### <a id='toc2_3_'></a>[Padding Strings (pad, center, ljust, rjust)](#toc0_)


Padding strings is useful for formatting output or aligning text.


In [24]:
# Create a Series of strings
s = pd.Series(['apple', 'banana', 'cherry'])

In [25]:
# Pad strings to a specified length
s.str.pad(width=10, side='left', fillchar='*')

0    *****apple
1    ****banana
2    ****cherry
dtype: object

In [26]:
# Center strings
s.str.center(10, fillchar='-')

0    --apple---
1    --banana--
2    --cherry--
dtype: object

In [27]:
# Left justify
s.str.ljust(10, fillchar='>')

0    apple>>>>>
1    banana>>>>
2    cherry>>>>
dtype: object

In [28]:
# Right justify
s.str.rjust(10, fillchar='<')

0    <<<<<apple
1    <<<<banana
2    <<<<cherry
dtype: object

In [29]:
# Applying to our DataFrame
df['Name'].str.ljust(15, fillchar='-')

0    John Smith-----
1    Jane Doe-------
2    Bob Johnson----
3    Alice Brown----
4                NaN
Name: Name, dtype: object

These methods are particularly useful when you need to format strings for display or when working with fixed-width data formats.


These basic string operations form the foundation of text data preprocessing in Pandas. They are essential for cleaning and standardizing text data before more complex analyses or transformations are performed. In the next sections, we'll explore more advanced string manipulation techniques.

## <a id='toc3_'></a>[String Information and Checks](#toc0_)

String information and checks are essential operations when working with text data. Pandas provides several methods to extract information from strings and check their contents.


### <a id='toc3_1_'></a>[Length of Strings](#toc0_)


To get the length of strings in a Series or DataFrame column, we use the `.str.len()` method:


In [30]:
# Calculate the length of each name
df['Name'].str.len()

0    10.0
1     8.0
2    11.0
3    11.0
4     NaN
Name: Name, dtype: float64

In [31]:
# Calculate the length of each email address
df['Email'].str.len()


0    16
1    16
2    15
3    17
4    19
Name: Email, dtype: int64

This operation returns the number of characters in each string, including spaces and punctuation. It's useful for tasks like:
- Identifying unusually long or short entries
- Filtering data based on string length
- Creating features for machine learning models


### <a id='toc3_2_'></a>[Checking String Contents (startswith, endswith, contains)](#toc0_)


Pandas provides several methods to check the contents of strings:


The `.str.startswith()` method checks if a string starts with a specified substring:


In [32]:
# Check if names start with 'J'
df['Name'].str.startswith('J')

0     True
1     True
2    False
3    False
4      NaN
Name: Name, dtype: object

In [33]:
# Check if email addresses start with 'john'
df['Email'].str.startswith('john')

0     True
1    False
2    False
3    False
4    False
Name: Email, dtype: bool

Similarly, `.str.endswith()` checks if a string ends with a specified substring:


In [34]:
# Check if email addresses end with '.com'
df['Email'].str.endswith('.com')

0    True
1    True
2    True
3    True
4    True
Name: Email, dtype: bool

In [35]:
# Check if addresses end with 'NY'
df['Address'].str.endswith('NY')

0     True
1    False
2    False
3    False
4    False
Name: Address, dtype: bool

The `.str.contains()` method checks if a string contains a specified substring:


In [36]:
# Check if names contain 'Smith'
df['Name'].str.contains('Smith')

0     True
1    False
2    False
3    False
4      NaN
Name: Name, dtype: object

In [37]:
# Check if addresses contain 'St'
df['Address'].str.contains('St')

0    True
1    True
2    True
3    True
4    True
Name: Address, dtype: bool

You can also use regular expressions with `contains()`:


In [38]:
# Check if email contains a digit
df['Email'].str.contains(r'\d')

0    False
1    False
2    False
3    False
4    False
Name: Email, dtype: bool

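These methods are case-sensitive by default. `str.startswith()` and `str.endswith()` have no `case` argument, so normalize the case first (for example with `.str.lower()`) if you need a case-insensitive prefix or suffix check. `str.contains()` accepts `case=False` directly: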


In [39]:
# Case-insensitive check for 'SMITH' in names
df['Name'].str.contains('SMITH', case=False)

0     True
1    False
2    False
3    False
4      NaN
Name: Name, dtype: object
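
For prefix and suffix checks, one way to ignore case is to lowercase the column before testing it. The sketch below uses an illustrative `emails` Series (not the lecture's `df`) to show both routes side by side.

```python
import pandas as pd

emails = pd.Series(['John@Example.com', 'jane@example.com', 'JOHNNY@example.com'])

# contains() accepts case=False directly
has_john = emails.str.contains('john', case=False)

# startswith() has no case argument, so normalize the case first
starts_with_john = emails.str.lower().str.startswith('john')

print(has_john.tolist())          # [True, False, True]
print(starts_with_john.tolist())  # [True, False, True]
```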

These string checking methods return boolean Series, which can be useful for:
- Filtering data based on string contents
- Creating binary features for analysis or modeling
- Validating data entries


Here's an example of using these methods for filtering:


In [40]:
# Filter rows where the name starts with 'J' and the address contains 'St'
# na=False maps the missing name to False so the mask is purely boolean
df[(df['Name'].str.startswith('J', na=False)) & (df['Address'].str.contains('St'))]

Unnamed: 0,Name,Email,Address
0,John Smith,john@example.com,"123 Main St, NY"
1,Jane Doe,jane@example.com,"456 Elm St, CA"


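By default, these methods propagate missing values: a NaN input produces NaN in the result, which is why the outputs for the `Name` column above have dtype `object` rather than `bool`. Pass `na=False` to treat missing values as False, which is the safe choice when the result is used as a filter mask, or pass `na=np.nan` to state the default behavior explicitly: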


In [41]:
# Preserve NaN values in the result
df['Name'].str.contains('Smith', na=np.nan)

0     True
1    False
2    False
3    False
4      NaN
Name: Name, dtype: object
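
The difference between the default and `na=False` is easiest to see next to each other. This is a minimal sketch on an illustrative `names` Series; `dtype=object` is an assumption made so the missing value behaves like the `NaN` in the outputs above.

```python
import numpy as np
import pandas as pd

names = pd.Series(['John Smith', 'Jane Doe', np.nan], dtype=object)

# Default: the missing value stays missing in the result
default_mask = names.str.contains('Smith')

# na=False: the missing value becomes False, giving a clean boolean mask
filled_mask = names.str.contains('Smith', na=False)

print(default_mask.isna().tolist())  # [False, False, True]
print(filled_mask.tolist())          # [True, False, False]

# The filled mask can be used directly for filtering
selected = names[filled_mask]
print(selected.tolist())             # ['John Smith']
```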

These string information and checking methods provide powerful tools for exploring and manipulating text data in Pandas, allowing you to efficiently analyze and filter your data based on string contents.

## <a id='toc4_'></a>[Extracting and Replacing Substrings](#toc0_)

Extracting specific parts of strings and replacing substrings are common operations in data cleaning and text analysis. Pandas provides several methods to perform these tasks efficiently.


### <a id='toc4_1_'></a>[Substring Extraction (slice, get)](#toc0_)


Pandas offers multiple ways to extract substrings from Series or DataFrame columns.


The `str.slice()` method allows you to extract a portion of each string in a Series.


In [42]:
# Create a sample Series
s = pd.Series(['apple', 'banana', 'cherry', 'date'])
s

0     apple
1    banana
2    cherry
3      date
dtype: object

In [43]:
# Extract the first three characters
s.str.slice(0, 3)

0    app
1    ban
2    che
3    dat
dtype: object

In [44]:
# Extract from the second character to the end
s.str.slice(1)

0     pple
1    anana
2    herry
3      ate
dtype: object

In [45]:
# Extract the last two characters
s.str.slice(-2)

0    le
1    na
2    ry
3    te
dtype: object

In [46]:
# Applying to our DataFrame
df['Name'].str.slice(0, 5)

0    John 
1    Jane 
2    Bob J
3    Alice
4      NaN
Name: Name, dtype: object

The `str.get()` method is used to extract a single character at a specified position.


In [47]:
# Get the first character
s.str.get(0)

0    a
1    b
2    c
3    d
dtype: object

In [48]:
# Get the last character
s.str.get(-1)

0    e
1    a
2    y
3    e
dtype: object

In [49]:
# Applying to our DataFrame
df['Name'].str.get(0)

0      J
1      J
2      B
3      A
4    NaN
Name: Name, dtype: object

### <a id='toc4_2_'></a>[Finding and Replacing (find, replace)](#toc0_)


Finding and replacing substrings are crucial operations for data cleaning and transformation.


The `str.find()` method locates the position of a substring within each string in a Series.


In [50]:
# Create a sample Series
s = pd.Series(['apple pie', 'banana split', 'cherry tart'])
s

0       apple pie
1    banana split
2     cherry tart
dtype: object

In [51]:
# Find the position of 'a' in each string
s.str.find('a')

0    0
1    1
2    8
dtype: int64

In [52]:
# Find the position of 'an' in each string
s.str.find('an')

0   -1
1    1
2   -1
dtype: int64

In [53]:
# Applying to our DataFrame
df['Email'].str.find('@')

0    4
1    4
2    3
3    5
4    7
Name: Email, dtype: int64

The `str.replace()` method substitutes occurrences of a substring with another string.


In [54]:
# Replace 'a' with 'X'
s.str.replace('a', 'X')

0       Xpple pie
1    bXnXnX split
2     cherry tXrt
dtype: object

In [55]:
# Replace the first occurrence of 'a' with 'X'
s.str.replace('a', 'X', n=1)

0       Xpple pie
1    bXnana split
2     cherry tXrt
dtype: object

In [56]:
# Using regex for more complex replacements
s.str.replace(r'\s+', '_', regex=True)

0       apple_pie
1    banana_split
2     cherry_tart
dtype: object

In [57]:
# Applying to our DataFrame
# regex=False treats 'example.com' as literal text, so '.' only matches a dot
df['Email'].str.replace('example.com', 'newdomain.com', regex=False)

0       john@newdomain.com
1       jane@newdomain.com
2        bob@newdomain.com
3      alice@newdomain.com
4    unknown@newdomain.com
Name: Email, dtype: object
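
The `regex` argument of `str.replace()` decides whether characters such as `.` are literal or special. The sketch below uses an illustrative `emails` Series containing one deliberately malformed address to show how the two modes differ.

```python
import pandas as pd

emails = pd.Series(['john@example.com', 'jane@exampleXcom'])

# Literal replacement: only the exact text 'example.com' is replaced
literal = emails.str.replace('example.com', 'newdomain.com', regex=False)

# Regex replacement: '.' matches any character, so 'exampleXcom' matches too
pattern = emails.str.replace('example.com', 'newdomain.com', regex=True)

print(literal.tolist())  # ['john@newdomain.com', 'jane@exampleXcom']
print(pattern.tolist())  # ['john@newdomain.com', 'jane@newdomain.com']
```

Passing `regex` explicitly avoids depending on the default of the installed Pandas version.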

Let's combine these techniques in a practical example:


In [59]:
df

Unnamed: 0,Name,Email,Address
0,John Smith,john@example.com,"123 Main St, NY"
1,Jane Doe,jane@example.com,"456 Elm St, CA"
2,Bob Johnson,bob@example.com,"789 Oak St, TX"
3,Alice Brown,alice@example.com,"321 Pine St, FL"
4,,unknown@example.com,"654 Maple St, WA"


In [60]:
df['Email'].str.find('@')

0    4
1    4
2    3
3    5
4    7
Name: Email, dtype: int64

In [61]:
# Extract username from email
df['Email'].str.split('@').str[0]

0       john
1       jane
2        bob
3      alice
4    unknown
Name: Email, dtype: object

In [62]:
# Mask part of the email for privacy
df['Email'].apply(lambda x: x[:3] + '*' * (x.find('@') - 3) + x[x.find('@'):] if isinstance(x, str) else x)

0       joh*@example.com
1       jan*@example.com
2        bob@example.com
3      ali**@example.com
4    unk****@example.com
Name: Email, dtype: object
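
The slice-based lambda above assumes every username has at least three characters. As an alternative sketch (not the approach used in this lecture), the same masking can be written with `str.replace()` and a callable replacement, which also copes with shorter usernames. The `emails` Series and the `mask_username` helper are hypothetical names introduced for this example.

```python
import pandas as pd

emails = pd.Series(['john@example.com', 'bob@example.com', 'al@x.com'])

def mask_username(match):
    # Keep the captured prefix and replace the remaining characters with stars
    return match.group(1) + '*' * len(match.group(2))

# Group 1: up to three leading characters; group 2: the rest of the username
masked = emails.str.replace(r'^([^@]{1,3})([^@]*)(?=@)', mask_username, regex=True)

print(masked.tolist())  # ['joh*@example.com', 'bob@example.com', 'al@x.com']
```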

In [63]:
# Replace state abbreviations with full names
state_mapping = {'NY': 'New York', 'CA': 'California', 'TX': 'Texas', 'FL': 'Florida', 'WA': 'Washington'}
df['Address'].replace(state_mapping, regex=True)

0       123 Main St, New York
1      456 Elm St, California
2           789 Oak St, Texas
3        321 Pine St, Florida
4    654 Maple St, Washington
Name: Address, dtype: object

In this example, we've:
1. Extracted usernames from email addresses.
2. Created a masked version of email addresses for privacy.
3. Replaced state abbreviations with full state names in the address.


These operations demonstrate how substring extraction and replacement can be used to transform and enrich your data. They're particularly useful for tasks like data anonymization, standardization, and feature engineering.
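
One caveat about the state replacement: with `regex=True`, each key is matched anywhere in the string, so an abbreviation that also appears inside another word is replaced there too. The sketch below shows the effect on illustrative data and one possible way around it, extracting only the trailing two-letter code and translating it with `map()`. The `addresses` Series and the shortened `state_mapping` are assumptions made for this example.

```python
import pandas as pd

state_mapping = {'NY': 'New York', 'CA': 'California'}
addresses = pd.Series(['123 Main St, NY', '9 CAMP Rd, CA'])

# Regex replacement matches the key anywhere, including inside 'CAMP'
naive = addresses.replace(state_mapping, regex=True)
print(naive.tolist())

# Extract only the trailing two-letter code, then translate it with the mapping
codes = addresses.str.extract(r',\s*([A-Z]{2})$', expand=False)
full_names = codes.map(state_mapping)

# Rebuild the address from the street part and the full state name
street = addresses.str.replace(r',\s*[A-Z]{2}$', '', regex=True)
safe = street.str.cat(full_names, sep=', ')
print(safe.tolist())  # ['123 Main St, New York', '9 CAMP Rd, California']
```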


Remember that when working with large datasets, vectorized operations using Pandas string methods are generally more efficient than applying Python's built-in string methods to each element individually. However, for very complex string manipulations, you might need to combine Pandas methods with custom Python functions using `apply()`.


## <a id='toc5_'></a>[Regular Expressions in Pandas](#toc0_)

Regular expressions (regex) are powerful tools for pattern matching and text manipulation. Pandas integrates regex functionality, allowing you to perform complex string operations efficiently on Series and DataFrame columns.


### <a id='toc5_1_'></a>[Introduction to Regular Expressions](#toc0_)


Regular expressions are sequences of characters that define a search pattern. They're particularly useful for:
- Validating string formats (e.g., email addresses, phone numbers)
- Extracting specific parts of strings
- Complex find-and-replace operations


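In Pandas, methods such as `str.match()`, `str.findall()`, and `str.extract()` always interpret their pattern as a regex, while `str.contains()`, `str.replace()`, and `str.split()` take a `regex` parameter that controls whether the pattern is treated as a regex or as a literal string.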


### <a id='toc5_2_'></a>[Using `re.match`, `re.search`, and `re.findall`](#toc0_)


Pandas provides methods that correspond to Python's `re` module functions:


`str.match()` checks if the pattern matches at the beginning of each string (similar to `re.match`).


In [64]:
# Create a sample Series
s = pd.Series(['foo123', 'bar456', '123baz'])

# Check if strings start with letters
s.str.match(r'[a-zA-Z]+')

0     True
1     True
2    False
dtype: bool

`str.contains()` checks if the pattern is found anywhere in each string (similar to `re.search`).


In [65]:
# Check if strings contain digits
s.str.contains(r'\d+')

0    True
1    True
2    True
dtype: bool

`str.findall()` finds all occurrences of the pattern in each string (similar to `re.findall`).


In [66]:
# Find all digits in each string
s.str.findall(r'\d+')

0    [123]
1    [456]
2    [123]
dtype: object

In [68]:
# Applying to our DataFrame
df['Address'].str.findall(r'\d+')  # Extract all numbers from addresses

0    [123]
1    [456]
2    [789]
3    [321]
4    [654]
Name: Address, dtype: object
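
Because these methods differ only in where the pattern is allowed to match, it can help to run one pattern through several of them. The sketch below does this for the sample Series, and also includes `str.fullmatch()`, which requires the whole string to match.

```python
import pandas as pd

s = pd.Series(['foo123', 'bar456', '123baz'])
pattern = r'[a-zA-Z]+'

anchored_start = s.str.match(pattern)    # pattern must match at the start
whole_string = s.str.fullmatch(pattern)  # pattern must match the entire string
anywhere = s.str.contains(pattern)       # pattern may match anywhere

print(anchored_start.tolist())  # [True, True, False]
print(whole_string.tolist())    # [False, False, False]
print(anywhere.tolist())        # [True, True, True]
```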

### <a id='toc5_3_'></a>[Extracting Information with Regex](#toc0_)


Regex is particularly powerful for extracting structured information from strings.


`str.extract()` extracts the first occurrence of a pattern and returns one column per capture group.


In [72]:
s

0    foo123
1    bar456
2    123baz
dtype: object

In [70]:
# Extract the first word and number from each string
s.str.extract(r'([a-zA-Z]+)(\d+)')

Unnamed: 0,0,1
0,foo,123
1,bar,456
2,,


In [71]:
# Applying to our DataFrame
df['Address'].str.extract(r'(\d+)\s+(.+),\s+(\w{2})')  # Extract street number, street name, and state

Unnamed: 0,0,1,2
0,123,Main St,NY
1,456,Elm St,CA
2,789,Oak St,TX
3,321,Pine St,FL
4,654,Maple St,WA


`str.extractall()` extracts all occurrences of a pattern and returns one row per match, indexed by the original row and a `match` number.


In [73]:
# Extract all words and numbers
s.str.extractall(r'([a-zA-Z]+)(\d+)')

Unnamed: 0_level_0,Unnamed: 1_level_0,0,1
Unnamed: 0_level_1,match,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0,foo,123
1,0,bar,456


In [74]:
# Applying to our DataFrame
df['Email'].str.extractall(r'(\w+)@(\w+\.\w+)')  # Extract username and domain from email

Unnamed: 0_level_0,Unnamed: 1_level_0,0,1
Unnamed: 0_level_1,match,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0,john,example.com
1,0,jane,example.com
2,0,bob,example.com
3,0,alice,example.com
4,0,unknown,example.com


Let's combine these techniques in a practical example:


In [86]:
# Ensure we have our sample DataFrame
df = pd.DataFrame({
    'Name': ['John Smith', 'Jane Doe', 'Bob Johnson', 'Alice Brown', None],
    'Email': ['john@example.com', 'jane@example.com', 'bob@example.com', 'alice@example.com', 'unknown@example.com'],
    'Address': ['123 Main St, NY 10001', '456 Elm St, CA 90210', '789 Oak St, TX 75001', '321 Pine St, FL 33101', '654 Maple St, WA 98101']
})
df

Unnamed: 0,Name,Email,Address
0,John Smith,john@example.com,"123 Main St, NY 10001"
1,Jane Doe,jane@example.com,"456 Elm St, CA 90210"
2,Bob Johnson,bob@example.com,"789 Oak St, TX 75001"
3,Alice Brown,alice@example.com,"321 Pine St, FL 33101"
4,,unknown@example.com,"654 Maple St, WA 98101"


In [87]:
# Extract components from the Address
address_components = df['Address'].str.extract(r'(\d+)\s+(.+),\s+(\w{2})\s+(\d{5})')
address_components.columns = ['Street_Number', 'Street_Name', 'State', 'Zip_Code']
address_components

Unnamed: 0,Street_Number,Street_Name,State,Zip_Code
0,123,Main St,NY,10001
1,456,Elm St,CA,90210
2,789,Oak St,TX,75001
3,321,Pine St,FL,33101
4,654,Maple St,WA,98101


In [88]:
# Extract username and domain from Email
email_components = df['Email'].str.extract(r'(\w+)@(\w+\.\w+)')
email_components.columns = ['Username', 'Domain']
email_components

Unnamed: 0,Username,Domain
0,john,example.com
1,jane,example.com
2,bob,example.com
3,alice,example.com
4,unknown,example.com


In [89]:
# Combine the extracted information with the original DataFrame
df_expanded = pd.concat([df, address_components, email_components], axis=1)
df_expanded

Unnamed: 0,Name,Email,Address,Street_Number,Street_Name,State,Zip_Code,Username,Domain
0,John Smith,john@example.com,"123 Main St, NY 10001",123,Main St,NY,10001,john,example.com
1,Jane Doe,jane@example.com,"456 Elm St, CA 90210",456,Elm St,CA,90210,jane,example.com
2,Bob Johnson,bob@example.com,"789 Oak St, TX 75001",789,Oak St,TX,75001,bob,example.com
3,Alice Brown,alice@example.com,"321 Pine St, FL 33101",321,Pine St,FL,33101,alice,example.com
4,,unknown@example.com,"654 Maple St, WA 98101",654,Maple St,WA,98101,unknown,example.com


In [90]:
# Validate email format
df_expanded['Valid_Email'] = df_expanded['Email'].str.match(r'^[\w\.-]+@[\w\.-]+\.\w+$')
df_expanded

Unnamed: 0,Name,Email,Address,Street_Number,Street_Name,State,Zip_Code,Username,Domain,Valid_Email
0,John Smith,john@example.com,"123 Main St, NY 10001",123,Main St,NY,10001,john,example.com,True
1,Jane Doe,jane@example.com,"456 Elm St, CA 90210",456,Elm St,CA,90210,jane,example.com,True
2,Bob Johnson,bob@example.com,"789 Oak St, TX 75001",789,Oak St,TX,75001,bob,example.com,True
3,Alice Brown,alice@example.com,"321 Pine St, FL 33101",321,Pine St,FL,33101,alice,example.com,True
4,,unknown@example.com,"654 Maple St, WA 98101",654,Maple St,WA,98101,unknown,example.com,True


In [92]:
# Extract the first three digits of Zip_Code (stored in the 'Area_Code' column)
df_expanded['Area_Code'] = df_expanded['Zip_Code'].str.slice(0, 3)

# Display results
df_expanded

Unnamed: 0,Name,Email,Address,Street_Number,Street_Name,State,Zip_Code,Username,Domain,Valid_Email,Area_Code
0,John Smith,john@example.com,"123 Main St, NY 10001",123,Main St,NY,10001,john,example.com,True,100
1,Jane Doe,jane@example.com,"456 Elm St, CA 90210",456,Elm St,CA,90210,jane,example.com,True,902
2,Bob Johnson,bob@example.com,"789 Oak St, TX 75001",789,Oak St,TX,75001,bob,example.com,True,750
3,Alice Brown,alice@example.com,"321 Pine St, FL 33101",321,Pine St,FL,33101,alice,example.com,True,331
4,,unknown@example.com,"654 Maple St, WA 98101",654,Maple St,WA,98101,unknown,example.com,True,981


In this comprehensive example, we've:
1. Extracted structured information from the Address field (street number, street name, state, and zip code).
2. Parsed the Email field to separate username and domain.
3. Validated the email format using a regex pattern.
4. Extracted the first three digits of the zip code into the `Area_Code` column.


This demonstrates how regular expressions can be used to extract, validate, and transform complex string data in Pandas. Regular expressions are incredibly powerful, but they can also become complex. It's often a good practice to break down complex regex patterns into smaller, more manageable parts and test them individually before combining them into more sophisticated patterns.
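
One way to follow that advice is to keep each piece of the pattern in its own variable, check the pieces separately, and only then combine them. The sketch below rebuilds the address pattern this way; the variable names and the sample `addresses` Series are assumptions made for illustration.

```python
import pandas as pd

addresses = pd.Series(['123 Main St, NY 10001', '456 Elm St, CA 90210'])

# Small building blocks, each simple enough to test on its own
number = r'\d+'
street = r'.+'
state = r'[A-Z]{2}'
zip_code = r'\d{5}'

# Check individual pieces before combining them
zip_ok = addresses.str.contains(zip_code + r'$')
state_ok = addresses.str.contains(rf',\s+{state}\s')
print(zip_ok.all(), state_ok.all())  # True True

# Combine the pieces, adding one capture group per component
pattern = rf'({number})\s+({street}),\s+({state})\s+({zip_code})'
components = addresses.str.extract(pattern)
components.columns = ['Street_Number', 'Street_Name', 'State', 'Zip_Code']
print(components.iloc[0].tolist())  # ['123', 'Main St', 'NY', '10001']
```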


Remember that while regex operations in Pandas are vectorized, they can still be computationally expensive for very large datasets. In such cases, consider using more specialized text processing libraries or preprocessing your data in chunks.

## <a id='toc6_'></a>[Splitting and Joining Strings](#toc0_)

Splitting strings into multiple parts and joining strings together are common operations in data preprocessing and text analysis. Pandas provides efficient methods for these operations that can be applied to entire Series or DataFrame columns.


### <a id='toc6_1_'></a>[Splitting Strings (split, rsplit)](#toc0_)


Pandas offers methods to split strings based on a specified delimiter.


The `str.split()` method splits a string into a list of substrings.


In [104]:
# Create a sample Series
s = pd.Series(['apple pie', 'banana split', 'cherry tart', 'hot date pudding'])
s

0           apple pie
1        banana split
2         cherry tart
3    hot date pudding
dtype: object

In [105]:
# Split strings by space
s.str.split()

0            [apple, pie]
1         [banana, split]
2          [cherry, tart]
3    [hot, date, pudding]
dtype: object

In [110]:
# Split with expand=True to return a DataFrame
s.str.split(expand=True)

Unnamed: 0,0,1,2
0,apple,pie,
1,banana,split,
2,cherry,tart,
3,hot,date,pudding


In [109]:
# Limit to one split so everything after the first space stays in one column
s.str.split(n=1, expand=True)

Unnamed: 0,0,1
0,apple,pie
1,banana,split
2,cherry,tart
3,hot,date pudding


In [107]:
# Applying to our DataFrame
df['Name'].str.split(expand=True)

Unnamed: 0,0,1
0,John,Smith
1,Jane,Doe
2,Bob,Johnson
3,Alice,Brown
4,,


The `str.rsplit()` method is similar to `split()`, but starts from the right side of the string.


In [111]:
# Right split with a limit
s.str.rsplit(n=1, expand=True)

Unnamed: 0,0,1
0,apple,pie
1,banana,split
2,cherry,tart
3,hot date,pudding


In [112]:
# Applying to our DataFrame
df['Email'].str.rsplit('@', n=1, expand=True)

Unnamed: 0,0,1
0,john,example.com
1,jane,example.com
2,bob,example.com
3,alice,example.com
4,unknown,example.com


### <a id='toc6_2_'></a>[Joining Strings (cat)](#toc0_)


The `str.cat()` method is used to concatenate strings in Pandas.


In [113]:
# Create sample Series
first_names = pd.Series(['John', 'Jane', 'Bob', 'Alice'])
last_names = pd.Series(['Smith', 'Doe', 'Johnson', 'Brown'])

In [114]:
# Concatenate with a space separator
full_names = first_names.str.cat(last_names, sep=' ')
full_names

0     John Smith
1       Jane Doe
2    Bob Johnson
3    Alice Brown
dtype: object

In [115]:
# Concatenate multiple Series
pd.Series(['Mr.', 'Ms.', 'Mr.', 'Ms.']).str.cat([first_names, last_names], sep=' ')


0     Mr. John Smith
1       Ms. Jane Doe
2    Mr. Bob Johnson
3    Ms. Alice Brown
dtype: object

In [116]:
# Applying to our DataFrame
df['Address'].str.cat(df['Email'], sep=' | ')

0        123 Main St, NY 10001 | john@example.com
1         456 Elm St, CA 90210 | jane@example.com
2          789 Oak St, TX 75001 | bob@example.com
3       321 Pine St, FL 33101 | alice@example.com
4    654 Maple St, WA 98101 | unknown@example.com
Name: Address, dtype: object
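
When one of the inputs is missing, `str.cat()` returns a missing value for that row unless `na_rep` is given. The sketch below shows both behaviors on illustrative `first` and `last` Series; `dtype=object` and the `'?'` placeholder are assumptions made for this example.

```python
import numpy as np
import pandas as pd

first = pd.Series(['John', np.nan], dtype=object)
last = pd.Series(['Smith', 'Doe'], dtype=object)

# Default: a missing value in either input makes the joined result missing
default_join = first.str.cat(last, sep=' ')

# na_rep: missing parts are replaced with a placeholder before joining
filled_join = first.str.cat(last, sep=' ', na_rep='?')

print(default_join.isna().tolist())  # [False, True]
print(filled_join.tolist())          # ['John Smith', '? Doe']
```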

Let's combine these techniques in a practical example:


In [122]:
# Ensure we have our sample DataFrame
df = pd.DataFrame({
    'Name': ['John Smith', 'Jane Doe', 'Bob Johnson', 'Alice Brown', None],
    'Email': ['john@example.com', 'jane@example.com', 'bob@example.com', 'alice@example.com', 'unknown@example.com'],
    'Address': ['123 Main St, NY, 10001', '456 Elm St, CA, 90210', '789 Oak St, TX, 75001', '321 Pine St, FL, 33101', '654 Maple St, WA, 98101']
})
df

Unnamed: 0,Name,Email,Address
0,John Smith,john@example.com,"123 Main St, NY, 10001"
1,Jane Doe,jane@example.com,"456 Elm St, CA, 90210"
2,Bob Johnson,bob@example.com,"789 Oak St, TX, 75001"
3,Alice Brown,alice@example.com,"321 Pine St, FL, 33101"
4,,unknown@example.com,"654 Maple St, WA, 98101"


In [123]:
# Split Name into First Name and Last Name
name_split = df['Name'].str.split(n=1, expand=True)
df['First_Name'] = name_split[0]
df['Last_Name'] = name_split[1]
df

Unnamed: 0,Name,Email,Address,First_Name,Last_Name
0,John Smith,john@example.com,"123 Main St, NY, 10001",John,Smith
1,Jane Doe,jane@example.com,"456 Elm St, CA, 90210",Jane,Doe
2,Bob Johnson,bob@example.com,"789 Oak St, TX, 75001",Bob,Johnson
3,Alice Brown,alice@example.com,"321 Pine St, FL, 33101",Alice,Brown
4,,unknown@example.com,"654 Maple St, WA, 98101",,


In [124]:
# Split Email into Username and Domain
email_split = df['Email'].str.split('@', expand=True)
df['Username'] = email_split[0]
df['Domain'] = email_split[1]
df

Unnamed: 0,Name,Email,Address,First_Name,Last_Name,Username,Domain
0,John Smith,john@example.com,"123 Main St, NY, 10001",John,Smith,john,example.com
1,Jane Doe,jane@example.com,"456 Elm St, CA, 90210",Jane,Doe,jane,example.com
2,Bob Johnson,bob@example.com,"789 Oak St, TX, 75001",Bob,Johnson,bob,example.com
3,Alice Brown,alice@example.com,"321 Pine St, FL, 33101",Alice,Brown,alice,example.com
4,,unknown@example.com,"654 Maple St, WA, 98101",,,unknown,example.com


In [125]:
# Split Address into components
address_split = df['Address'].str.rsplit(',', n=2, expand=True)
df['Street'] = address_split[0]
df['State'] = address_split[1].str.strip().str[:2]
df['Zip'] = address_split[2].str.strip()
df

Unnamed: 0,Name,Email,Address,First_Name,Last_Name,Username,Domain,Street,State,Zip
0,John Smith,john@example.com,"123 Main St, NY, 10001",John,Smith,john,example.com,123 Main St,NY,10001
1,Jane Doe,jane@example.com,"456 Elm St, CA, 90210",Jane,Doe,jane,example.com,456 Elm St,CA,90210
2,Bob Johnson,bob@example.com,"789 Oak St, TX, 75001",Bob,Johnson,bob,example.com,789 Oak St,TX,75001
3,Alice Brown,alice@example.com,"321 Pine St, FL, 33101",Alice,Brown,alice,example.com,321 Pine St,FL,33101
4,,unknown@example.com,"654 Maple St, WA, 98101",,,unknown,example.com,654 Maple St,WA,98101


In [126]:
# Join First Name and Last Name with a comma
df['Name_Formatted'] = df['Last_Name'].str.cat(df['First_Name'], sep=', ')
df

Unnamed: 0,Name,Email,Address,First_Name,Last_Name,Username,Domain,Street,State,Zip,Name_Formatted
0,John Smith,john@example.com,"123 Main St, NY, 10001",John,Smith,john,example.com,123 Main St,NY,10001,"Smith, John"
1,Jane Doe,jane@example.com,"456 Elm St, CA, 90210",Jane,Doe,jane,example.com,456 Elm St,CA,90210,"Doe, Jane"
2,Bob Johnson,bob@example.com,"789 Oak St, TX, 75001",Bob,Johnson,bob,example.com,789 Oak St,TX,75001,"Johnson, Bob"
3,Alice Brown,alice@example.com,"321 Pine St, FL, 33101",Alice,Brown,alice,example.com,321 Pine St,FL,33101,"Brown, Alice"
4,,unknown@example.com,"654 Maple St, WA, 98101",,,unknown,example.com,654 Maple St,WA,98101,


In [127]:
# Create a full contact info string
df['Contact_Info'] = df['Name'].str.cat([df['Email'], df['Address']], sep=' | ')
df

Unnamed: 0,Name,Email,Address,First_Name,Last_Name,Username,Domain,Street,State,Zip,Name_Formatted,Contact_Info
0,John Smith,john@example.com,"123 Main St, NY, 10001",John,Smith,john,example.com,123 Main St,NY,10001,"Smith, John","John Smith | john@example.com | 123 Main St, N..."
1,Jane Doe,jane@example.com,"456 Elm St, CA, 90210",Jane,Doe,jane,example.com,456 Elm St,CA,90210,"Doe, Jane","Jane Doe | jane@example.com | 456 Elm St, CA, ..."
2,Bob Johnson,bob@example.com,"789 Oak St, TX, 75001",Bob,Johnson,bob,example.com,789 Oak St,TX,75001,"Johnson, Bob","Bob Johnson | bob@example.com | 789 Oak St, TX..."
3,Alice Brown,alice@example.com,"321 Pine St, FL, 33101",Alice,Brown,alice,example.com,321 Pine St,FL,33101,"Brown, Alice","Alice Brown | alice@example.com | 321 Pine St,..."
4,,unknown@example.com,"654 Maple St, WA, 98101",,,unknown,example.com,654 Maple St,WA,98101,,


In this comprehensive example, we've:
1. Split the 'Name' column into 'First_Name' and 'Last_Name'.
2. Parsed the 'Email' column into 'Username' and 'Domain'.
3. Extracted components from the 'Address' column.
4. Created a formatted name with last name first.
5. Generated a full contact info string by joining multiple columns.


These operations demonstrate how splitting and joining strings can be used to restructure and enrich your data. They're particularly useful for:
- Parsing complex string fields into structured data
- Reformatting data for display or analysis
- Combining information from multiple columns into a single field


Remember that real-world data often contains inconsistencies and missing values. It's good practice to handle these edge cases explicitly in your string operations, for instance by checking how many parts a split actually produced and by deciding how missing values should appear in joined strings.
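
A minimal sketch of two such safeguards, using a small hypothetical `people` table (not part of the lecture's data): `expand=True` pads rows that split into fewer parts, and the `na_rep` argument of `str.cat` keeps a missing name from blanking out the whole joined string.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one single-word name and one missing name
people = pd.DataFrame({
    'Name': ['John Smith', 'Cher', np.nan],
    'Email': ['john@example.com', 'cher@example.com', 'unknown@example.com'],
})

# expand=True pads short rows with missing values, so 'Cher' gets no last name
parts = people['Name'].str.split(n=1, expand=True)
people['First_Name'] = parts[0]
people['Last_Name'] = parts[1]

# na_rep substitutes a placeholder instead of propagating NaN to the result
people['Contact_Info'] = people['Name'].str.cat(people['Email'], sep=' | ', na_rep='N/A')

# Flag rows whose name could not be split into two parts
people['Needs_Review'] = people['Last_Name'].isna()
people
```

Without `na_rep`, the last row's `Contact_Info` would be missing entirely, as in the `Contact_Info` column of the combined example above.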


By mastering these splitting and joining techniques, you can efficiently manipulate and restructure string data in Pandas, making your data cleaning and preprocessing tasks much more manageable.

## <a id='toc7_'></a>[Practical Examples and Use Cases](#toc0_)

Let's explore some real-world scenarios where string manipulation in Pandas is particularly useful. These examples will demonstrate how to apply the techniques we've learned to solve common data processing tasks.


### <a id='toc7_1_'></a>[Cleaning and Standardizing Customer Data](#toc0_)


Suppose we have a messy customer database that needs cleaning and standardization.


In [128]:
import pandas as pd
import numpy as np

In [129]:
# Create a sample messy customer database
df = pd.DataFrame({
    'Customer_Name': ['john doe', 'JANE SMITH', 'Bob Johnson', '  Alice Brown  ', np.nan],
    'Email': ['john.doe@email.com', 'jane.smith@email.com', 'bob.j@email', 'alice@email.com', 'unknown'],
    'Phone': ['(123) 456-7890', '987-654-3210', '1234567890', '(456)789-0123', np.nan],
    'Address': ['123 Main St., New York, NY 10001', '456 Elm St, Los Angeles CA 90001', '789 Oak Road TX 75001', '321 Pine Avenue, Miami FL 33101', '654 Maple St Seattle WA 98101']
})
df

Unnamed: 0,Customer_Name,Email,Phone,Address
0,john doe,john.doe@email.com,(123) 456-7890,"123 Main St., New York, NY 10001"
1,JANE SMITH,jane.smith@email.com,987-654-3210,"456 Elm St, Los Angeles CA 90001"
2,Bob Johnson,bob.j@email,1234567890,789 Oak Road TX 75001
3,Alice Brown,alice@email.com,(456)789-0123,"321 Pine Avenue, Miami FL 33101"
4,,unknown,,654 Maple St Seattle WA 98101


In [130]:
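# Clean and standardize the data
# Names: trim surrounding whitespace, then convert to title case
df['Customer_Name'] = df['Customer_Name'].str.strip().str.title()
# Emails: lowercase, and repair addresses that end in the truncated domain '@email'
df['Email'] = df['Email'].str.lower().str.replace(r'@email$', '@email.com', regex=True)
# Phones: keep digits only, then format 10-digit numbers as (XXX) XXX-XXXX
df['Phone'] = df['Phone'].str.replace(r'\D', '', regex=True).str.replace(r'(\d{3})(\d{3})(\d{4})', r'(\1) \2-\3', regex=True)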
df

Unnamed: 0,Customer_Name,Email,Phone,Address
0,John Doe,john.doe@email.com,(123) 456-7890,"123 Main St., New York, NY 10001"
1,Jane Smith,jane.smith@email.com,(987) 654-3210,"456 Elm St, Los Angeles CA 90001"
2,Bob Johnson,bob.j@email.com,(123) 456-7890,789 Oak Road TX 75001
3,Alice Brown,alice@email.com,(456) 789-0123,"321 Pine Avenue, Miami FL 33101"
4,,unknown,,654 Maple St Seattle WA 98101
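
Note that the last row still holds `unknown` in the `Email` column: the cleaning step repairs one known problem but does not validate the result. One possible follow-up is to flag values that don't look like an email address. The sketch below uses a deliberately simple pattern chosen for illustration (an assumption, not a complete email validator) and a standalone `emails` Series rather than the lecture's DataFrame.

```python
import pandas as pd

emails = pd.Series(['john.doe@email.com', 'bob.j@email.com', 'unknown', None])

# Simplified pattern (illustrative only): local part, '@', then a dotted domain
pattern = r'^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$'

# na=False makes missing values count as invalid instead of propagating NaN
is_valid = emails.str.match(pattern, na=False)

# Replace invalid entries with missing values so they are easy to filter later
cleaned = emails.where(is_valid)
```

Keeping the boolean mask separate from the cleaned column makes it easy to report how many rows failed validation before discarding anything.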


In [131]:
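# Extract city and state from address
# Note: this assumes every address has the form 'Street, City, State Zip'.
# Rows with fewer commas are not parsed correctly, as the output below shows.
df[['Street', 'City', 'State_Zip']] = df['Address'].str.split(',', expand=True)
df[['State', 'Zip']] = df['State_Zip'].str.strip().str.split(expand=True)
df['State'] = df['State'].str.upper()
df['Zip'] = df['Zip'].str.replace(r'\D', '', regex=True)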
df

Unnamed: 0,Customer_Name,Email,Phone,Address,Street,City,State_Zip,State,Zip
0,John Doe,john.doe@email.com,(123) 456-7890,"123 Main St., New York, NY 10001",123 Main St.,New York,NY 10001,NY,10001
1,Jane Smith,jane.smith@email.com,(987) 654-3210,"456 Elm St, Los Angeles CA 90001",456 Elm St,Los Angeles CA 90001,,,
2,Bob Johnson,bob.j@email.com,(123) 456-7890,789 Oak Road TX 75001,789 Oak Road TX 75001,,,,
3,Alice Brown,alice@email.com,(456) 789-0123,"321 Pine Avenue, Miami FL 33101",321 Pine Avenue,Miami FL 33101,,,
4,,unknown,,654 Maple St Seattle WA 98101,654 Maple St Seattle WA 98101,,,,
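
Only the first row is parsed correctly here, because it is the only address with two commas. When the delimiter is unreliable, it is often safer to anchor on the part of the string that is consistent. In this sample every address ends with a two-letter state and a five-digit ZIP code, so a regex anchored at the end of the string can recover both. The following is a sketch under that assumption; it does not try to separate street from city, which would need more information than these strings provide.

```python
import pandas as pd

addresses = pd.Series([
    '123 Main St., New York, NY 10001',
    '456 Elm St, Los Angeles CA 90001',
    '789 Oak Road TX 75001',
    '321 Pine Avenue, Miami FL 33101',
    '654 Maple St Seattle WA 98101',
])

# Two capital letters followed by five digits at the very end of the string
state_zip = addresses.str.extract(r'\b(?P<State>[A-Z]{2})\s+(?P<Zip>\d{5})\s*$')

# Everything before the state/ZIP suffix (street, plus city when present)
rest = addresses.str.replace(r',?\s*\b[A-Z]{2}\s+\d{5}\s*$', '', regex=True)

parsed = pd.concat([rest.rename('Street_City'), state_zip], axis=1)
parsed
```

Named groups (`?P<State>`, `?P<Zip>`) become the column names of the extracted DataFrame, which avoids a separate renaming step.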


In [132]:
# Drop intermediate columns and reorder
df = df.drop('State_Zip', axis=1)
df = df[['Customer_Name', 'Email', 'Phone', 'Street', 'City', 'State', 'Zip']]
df

Unnamed: 0,Customer_Name,Email,Phone,Street,City,State,Zip
0,John Doe,john.doe@email.com,(123) 456-7890,123 Main St.,New York,NY,10001
1,Jane Smith,jane.smith@email.com,(987) 654-3210,456 Elm St,Los Angeles CA 90001,,
2,Bob Johnson,bob.j@email.com,(123) 456-7890,789 Oak Road TX 75001,,,
3,Alice Brown,alice@email.com,(456) 789-0123,321 Pine Avenue,Miami FL 33101,,
4,,unknown,,654 Maple St Seattle WA 98101,,,


### <a id='toc7_3_'></a>[Extracting Information from Product Descriptions](#toc0_)


Imagine we have a dataset of product descriptions and we need to extract key information.


In [144]:
# Sample product data
products = pd.DataFrame({
    'Product_ID': ['A001', 'B002', 'C003', 'D004', 'E005'],
    'Description': [
        'Red leather wallet, size: 4x3 inches, price: $29.99',
        'Blue denim jeans, waist: 32, length: 34, price: $59.95',
        'Stainless steel watch, diameter: 40mm, water resistant to 50m, price: $129.00',
        'Wireless headphones, battery life: 20 hours, color: black, price: $89.99',
        'Laptop backpack, capacity: 20L, color: gray, fits up to 15" laptop, price: $49.99'
    ]
})
products

Unnamed: 0,Product_ID,Description
0,A001,"Red leather wallet, size: 4x3 inches, price: $..."
1,B002,"Blue denim jeans, waist: 32, length: 34, price..."
2,C003,"Stainless steel watch, diameter: 40mm, water r..."
3,D004,"Wireless headphones, battery life: 20 hours, c..."
4,E005,"Laptop backpack, capacity: 20L, color: gray, f..."


In [145]:
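# Extract information using regex
# expand=False returns a Series (rather than a one-column DataFrame) for a single capture group
products['Price'] = products['Description'].str.extract(r'price: \$(\d+\.\d{2})', expand=False).astype(float)
products['Color'] = products['Description'].str.extract(r'color: (\w+)', expand=False)
products['Size'] = products['Description'].str.extract(r'size: ([\w\s]+)', expand=False)
# Dimensions: a number (optionally NxM) followed by a unit: inches, mm, " or '
products['Dimensions'] = products['Description'].str.extract(r'(\d+(?:x\d+)?\s*(?:inches|mm|"|\'))', expand=False)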
products

Unnamed: 0,Product_ID,Description,Price,Color,Size,Dimensions
0,A001,"Red leather wallet, size: 4x3 inches, price: $...",29.99,,4x3 inches,4x3 inches
1,B002,"Blue denim jeans, waist: 32, length: 34, price...",59.95,,,
2,C003,"Stainless steel watch, diameter: 40mm, water r...",129.0,,,40mm
3,D004,"Wireless headphones, battery life: 20 hours, c...",89.99,black,,
4,E005,"Laptop backpack, capacity: 20L, color: gray, f...",49.99,gray,,"15"""
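
Writing one `str.extract` call per attribute works when the attributes are known in advance. When descriptions follow a loose `key: value` convention, as these do, `str.extractall` can collect every pair in one pass. The sketch below is one possible approach on two of the descriptions; the pattern and the `pairs`/`wide` names are illustrative choices, and it assumes values contain no commas.

```python
import pandas as pd

desc = pd.Series([
    'Red leather wallet, size: 4x3 inches, price: $29.99',
    'Wireless headphones, battery life: 20 hours, color: black, price: $89.99',
])

# key: words and spaces before ': '; value: everything up to the next comma
pairs = desc.str.extractall(r'(?P<key>\w[\w ]*): (?P<value>[^,]+)')

# Reshape from one row per match to one column per key
wide = (
    pairs.droplevel('match')
         .set_index('key', append=True)['value']
         .unstack()
)
wide
```

Keys that are absent from a description show up as missing values in `wide`, so the result has the same shape for every product.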


In [146]:
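# Analyze the data
print("\nAverage price:", products['Price'].mean())
# mode() returns all tied values in sorted order; [0] picks the first.
# Here 'black' and 'gray' each appear once, so 'black' wins only alphabetically.
print("Most common color:", products['Color'].mode()[0])
print("Price range:", products['Price'].min(), "-", products['Price'].max())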


Average price: 71.784
Most common color: black
Price range: 29.99 - 129.0


### <a id='toc7_4_'></a>[Normalizing and Categorizing Text Data](#toc0_)


Let's normalize and categorize a dataset of book titles and authors.


In [147]:
# Sample book data
books = pd.DataFrame({
    'Title': ['The Great Gatsby', 'To Kill a Mockingbird', '1984', 'Pride and Prejudice', 'The Catcher in the Rye'],
    'Author': ['F. Scott Fitzgerald', 'Harper Lee', 'George Orwell', 'Jane Austen', 'J.D. Salinger'],
    'Year': ['1925', '1960', '1949', '1813', '1951']
})
books

Unnamed: 0,Title,Author,Year
0,The Great Gatsby,F. Scott Fitzgerald,1925
1,To Kill a Mockingbird,Harper Lee,1960
2,1984,George Orwell,1949
3,Pride and Prejudice,Jane Austen,1813
4,The Catcher in the Rye,J.D. Salinger,1951


In [148]:
# Normalize titles and authors
books['Title_Normalized'] = books['Title'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
books['Author_Last_Name'] = books['Author'].str.split().str[-1].str.lower()
books

Unnamed: 0,Title,Author,Year,Title_Normalized,Author_Last_Name
0,The Great Gatsby,F. Scott Fitzgerald,1925,the great gatsby,fitzgerald
1,To Kill a Mockingbird,Harper Lee,1960,to kill a mockingbird,lee
2,1984,George Orwell,1949,1984,orwell
3,Pride and Prejudice,Jane Austen,1813,pride and prejudice,austen
4,The Catcher in the Rye,J.D. Salinger,1951,the catcher in the rye,salinger


In [149]:
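# Categorize by century
# Note: the hard-coded 'th' suffix is correct for these books (19th, 20th)
# but would produce '21th' for a year such as 2005.
books['Century'] = books['Year'].astype(int).apply(lambda x: f"{(x-1)//100 + 1}th century")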
books

Unnamed: 0,Title,Author,Year,Title_Normalized,Author_Last_Name,Century
0,The Great Gatsby,F. Scott Fitzgerald,1925,the great gatsby,fitzgerald,20th century
1,To Kill a Mockingbird,Harper Lee,1960,to kill a mockingbird,lee,20th century
2,1984,George Orwell,1949,1984,orwell,20th century
3,Pride and Prejudice,Jane Austen,1813,pride and prejudice,austen,19th century
4,The Catcher in the Rye,J.D. Salinger,1951,the catcher in the rye,salinger,20th century
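
The fixed `th` suffix happens to be right for every book in this sample, but it would label a book from 2005 as "21th century". If the dataset might span other centuries, a small helper can pick the correct English ordinal suffix. The `ordinal` function below is a hypothetical helper written for this illustration, not a Pandas function.

```python
import pandas as pd

def ordinal(n: int) -> str:
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the other teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f"{n}{suffix}"

years = pd.Series(['1813', '1925', '2005', '150'])

# Same century formula as above, with the suffix chosen per value
century = years.astype(int).map(lambda y: f"{ordinal((y - 1) // 100 + 1)} century")
```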


In [150]:
# Create a searchable field
books['Searchable'] = books['Title_Normalized'] + ' ' + books['Author'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
books

Unnamed: 0,Title,Author,Year,Title_Normalized,Author_Last_Name,Century,Searchable
0,The Great Gatsby,F. Scott Fitzgerald,1925,the great gatsby,fitzgerald,20th century,the great gatsby f scott fitzgerald
1,To Kill a Mockingbird,Harper Lee,1960,to kill a mockingbird,lee,20th century,to kill a mockingbird harper lee
2,1984,George Orwell,1949,1984,orwell,20th century,1984 george orwell
3,Pride and Prejudice,Jane Austen,1813,pride and prejudice,austen,19th century,pride and prejudice jane austen
4,The Catcher in the Rye,J.D. Salinger,1951,the catcher in the rye,salinger,20th century,the catcher in the rye jd salinger


In [159]:
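# Demonstrate search functionality
search_term = 'gatsby'
# regex=False treats the term as literal text; na=False excludes rows with missing values
results = books[books['Searchable'].str.contains(search_term, regex=False, na=False)]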
print("\nSearch results for '{}':".format(search_term))
results[['Title', 'Author']]


Search results for 'gatsby':


Unnamed: 0,Title,Author
0,The Great Gatsby,F. Scott Fitzgerald
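
A search like this only succeeds when the query is written the same way as the `Searchable` field (lowercase, no punctuation). One way to make it more forgiving is to normalize the query with the same rules before matching. The `search_books` function below is a hypothetical helper sketched for this purpose; it requires every word of the query to appear somewhere in the searchable text.

```python
import re
import pandas as pd

books = pd.DataFrame({
    'Title': ['The Great Gatsby', 'The Catcher in the Rye'],
    'Author': ['F. Scott Fitzgerald', 'J.D. Salinger'],
    'Searchable': ['the great gatsby f scott fitzgerald',
                   'the catcher in the rye jd salinger'],
})

def search_books(books: pd.DataFrame, query: str) -> pd.DataFrame:
    """Return rows whose Searchable text contains every word of the query."""
    # Normalize the query like the Searchable column: lowercase, no punctuation
    normalized = re.sub(r'[^\w\s]', '', query.lower())
    mask = pd.Series(True, index=books.index)
    for word in normalized.split():
        # regex=False matches the word literally; na=False skips missing text
        mask &= books['Searchable'].str.contains(word, regex=False, na=False)
    return books[mask]

search_books(books, 'J.D. Salinger')[['Title', 'Author']]
```

Because the query is normalized first, `'J.D. Salinger'`, `'jd salinger'`, and `'SALINGER'` all find the same book.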
