## **Lecture: Regular Expressions for Advanced String Cleaning and Feature Engineering**
Regular expressions (regex) are a compact language for matching, substituting, and extracting patterns in text. These lecture notes cover regex-based string cleaning and feature engineering in pandas, with practical examples built on the `tips` dataset from the seaborn library.

In pandas, regex support is exposed through the `.str` accessor, including the `str.contains()`, `str.extract()`, `str.replace()`, and `str.split()` methods, among others. A single pattern can often replace several chained string operations and pull structured values out of free text.

Let's start by importing the necessary libraries and loading a dataset from seaborn.

In [None]:
import pandas as pd
import seaborn as sns

# Load the tips dataset from seaborn
tips = sns.load_dataset("tips")

Before diving into examples, let's review some basic regular expression syntax:

- `.` matches any character except newline
- `\d` matches any digit character
- `\w` matches any word character (alphanumeric and underscore)
- `\s` matches any whitespace character
- `^` matches the start of a string
- `$` matches the end of a string
- `[]` matches any single character listed within the brackets
- `|` matches either the expression before or after the pipe
- `()` groups expressions together and captures the text they match
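
The syntax elements listed above can be tried out with Python's built-in `re` module before moving to pandas. The sketch below is only an illustration: the sample string `text` is made up and does not come from any dataset.

```python
import re

# Hypothetical sample string, not taken from any dataset
text = "Table 12: bill_total = 24.59 USD"

digits = re.findall(r"\d+", text)                          # runs of digit characters
words = re.findall(r"\w+", text)                           # runs of word characters
starts_with_table = bool(re.search(r"^Table", text))       # ^ anchors at the start
ends_with_currency = bool(re.search(r"(USD|EUR)$", text))  # | inside a group, $ at the end
vowels = re.findall(r"[aeiou]", "bill")                    # character class
amount = re.search(r"(\d+)\.(\d+)", text).groups()         # two capture groups

print(digits, words, starts_with_table, ends_with_currency, vowels, amount)
```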

**1. str.contains()**

This method checks if each string element contains the specified regex pattern and returns a boolean Series.

In [None]:
# Check whether 'time' contains a clock time: digits, then ':', then two digits.
# In tips, 'time' holds only 'Lunch' or 'Dinner', so every row is False.
tips['has_time_format'] = tips['time'].str.contains(r'\d+:\d{2}')
print(tips['has_time_format'].head())
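
To see the same pattern produce matches, here is a minimal sketch on a small hand-made Series. The values in `timestamps` are hypothetical and are not part of the tips dataset. The `na=False` argument is one way to treat missing values as non-matches.

```python
import pandas as pd

# Hypothetical values for illustration; the tips dataset has no such column
timestamps = pd.Series(["Dinner at 19:30", "Lunch", "booked 7:05", None])

# na=False turns missing values into False so the result is a clean boolean mask
has_clock_time = timestamps.str.contains(r"\d+:\d{2}", na=False)
print(has_clock_time.tolist())

# The mask can be used to filter rows directly
matching = timestamps[has_clock_time]
print(matching.tolist())
```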


**2. str.extract()**

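This method extracts the capture groups of the first match in each string element. With `expand=True` it returns a DataFrame with one column per capture group; elements that do not match yield NaN.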

In [None]:
# Extract hour and minute into two columns, one per capture group.
# In tips, 'time' holds 'Lunch' or 'Dinner', so both columns are NaN here.
tips[['hour', 'minute']] = tips['time'].str.extract(r'(\d+):(\d+)', expand=True)
print(tips[['hour', 'minute']].head())
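
As a sketch of extraction used for feature engineering, the example below runs on hypothetical strings that do contain clock times. The Series `timestamps` and the group names `hour` and `minute` are assumptions chosen for illustration. Named groups become the column names of the result.

```python
import pandas as pd

# Hypothetical values for illustration
timestamps = pd.Series(["Dinner at 19:30", "Lunch", "booked 7:05"])

# Named groups (?P<name>...) become the column names of the result
parts = timestamps.str.extract(r"(?P<hour>\d+):(?P<minute>\d{2})")

# Extracted values are strings; convert each column to numbers for use as features
parts = parts.apply(pd.to_numeric)
print(parts)
```

Rows without a match stay missing after the conversion, so they can be handled with the usual pandas tools such as `isna()` or `fillna()`.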


**3. str.replace()**

This method replaces occurrences of a regex pattern in each string element with a specified value.

In [None]:
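# Replace all non-alphabetic characters in the 'day' column with an empty string.
# regex=True is required: since pandas 2.0, str.replace treats the pattern as a literal by default.
tips['day_cleaned'] = tips['day'].str.replace(r'[^a-zA-Z]', '', regex=True)
print(tips['day_cleaned'].head())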
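
The `day` column in tips is already clean, so the replacement above leaves it unchanged. A rough sketch on messier input shows the effect more clearly; the labels in `raw_days` are invented for this example.

```python
import pandas as pd

# Hypothetical messy labels for illustration
raw_days = pd.Series(["  Sun!! ", "sat_01", "THUR (late)", "Fri."])

cleaned = (
    raw_days
    .str.replace(r"\(.*?\)", "", regex=True)    # drop parenthesised notes
    .str.replace(r"[^a-zA-Z]", "", regex=True)  # keep letters only
    .str.capitalize()                           # normalise the casing
)
print(cleaned.tolist())
```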


**4. str.split()**

This method splits each string element by the specified regex pattern and returns a list of substrings.

In [None]:
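# Split the 'time' column on the colon (':'); without expand=True each element becomes a list.
# In tips, 'time' holds 'Lunch' or 'Dinner' (no colon), so every list has a single item.
tips['time_split'] = tips['time'].str.split(r':')
print(tips['time_split'].head())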
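
Splitting on a regex becomes useful when the delimiter varies. The following sketch assumes pandas 1.4 or later, where `str.split()` accepts a `regex` argument, and uses a hypothetical Series `slots` with three different separators.

```python
import pandas as pd

# Hypothetical values for illustration
slots = pd.Series(["19:30", "7.05", "12-15"])

# Without expand, each element becomes a list of substrings
as_lists = slots.str.split(r"[:.\-]", regex=True)

# With expand=True, the pieces go into separate DataFrame columns
as_columns = slots.str.split(r"[:.\-]", regex=True, expand=True)
as_columns.columns = ["hour", "minute"]

print(as_lists.tolist())
print(as_columns)
```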

**5. Combining regex with other string methods**

You can combine regular expressions with other string methods for more complex operations.

In [None]:
# Extract the first word from the 'day' column and convert it to lowercase
tips['day_first_word'] = tips['day'].str.extract(r'(\w+)', expand=False).str.lower()
print(tips['day_first_word'].head())
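
The same idea can produce several features at once. The sketch below is one possible arrangement, not a fixed recipe: the free-text Series `labels` and the feature names `day`, `is_dinner`, and `n_words` are hypothetical.

```python
import pandas as pd

# Hypothetical free-text labels for illustration
labels = pd.Series(["Sun dinner", " SAT lunch", "thur  Lunch ", "Fri"])

features = pd.DataFrame({
    # First word, stripped of surrounding whitespace and lowercased
    "day": labels.str.strip().str.extract(r"^(\w+)", expand=False).str.lower(),
    # Boolean flag from a case-insensitive match
    "is_dinner": labels.str.contains(r"dinner", case=False),
    # Number of whitespace-separated words
    "n_words": labels.str.count(r"\S+"),
})
print(features)
```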