# Lecture 2.2: Regular Expressions for Advanced String Cleaning and Feature Engineering in Pandas

## Introduction
In this lecture, we explore regular expressions (regex) and how to use them in Pandas for advanced string cleaning and feature engineering. Regular expressions provide a flexible way to match, extract, and manipulate text patterns, which makes them well suited to messy text data. We will work through practical examples on the 20 Newsgroups text dataset, loaded through scikit-learn.

## Understanding Regular Expressions
A regular expression is a sequence of characters that defines a search pattern. Such patterns let you perform matching and manipulation operations that would be awkward with plain string methods. Methods on Pandas' `str` accessor, such as `str.replace`, `str.extract`, and `str.count`, accept regular expressions, so a pattern can be applied to an entire column in one call.

Here are some common regular expression operators and their meanings:

- **`.`** (dot): Matches any single character except newline
- **`^`** (caret): Matches the beginning of the string (or of each line in multiline mode)
- **`$`** (dollar sign): Matches the end of the string (or of each line in multiline mode)
- **`*`** (asterisk): Matches zero or more occurrences of the preceding character or group
- **`+`** (plus): Matches one or more occurrences of the preceding character or group
- **`?`** (question mark): Matches zero or one occurrence of the preceding character or group
- **`[]`** (square brackets): Defines a character class, matching any one of the enclosed characters
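
To make the operators above concrete, here is a minimal sketch using Python's built-in `re` module. The word list and the `matching` helper are made up for illustration; they are not part of the lecture's dataset or of any library.

```python
import re

# Toy words, made up for illustration
words = ["cat", "cart", "ct", "Cat", "scatter"]

def matching(pattern):
    """Return the words in which the pattern is found (hypothetical helper)."""
    return [w for w in words if re.search(pattern, w)]

starts_with_c = matching(r"^c")          # ['cat', 'cart', 'ct']
ends_with_t = matching(r"t$")            # ['cat', 'cart', 'ct', 'Cat']
one_char_between = matching(r"^c.t$")    # ['cat']
optional_a = matching(r"^ca?t$")         # ['cat', 'ct']
one_or_more = matching(r"^c[a-z]+t$")    # ['cat', 'cart']
zero_or_more = matching(r"^c[a-z]*t$")   # ['cat', 'cart', 'ct']
either_case = matching(r"[Cc]at")        # ['cat', 'Cat', 'scatter']
```

Note that `re.search` looks for the pattern anywhere in the string, which is why the unanchored `[Cc]at` also finds `scatter`.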

## Practical Examples
Let's work through some practical examples using the 20 Newsgroups text dataset, which scikit-learn exposes through `fetch_20newsgroups`.

In [1]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Load the 20 Newsgroups dataset (downloaded and cached on first use)
news = fetch_20newsgroups(subset='all')
data = pd.DataFrame({'text': news.data})

1. **Removing email addresses:**

In [2]:
data['text_cleaned'] = data['text'].str.replace(r'\S+@\S+', '', regex=True)
print(data['text_cleaned'].head())

0    From: Mamatha Devineni Ratnam \nSubject: Pens ...
1    From:  (Matthew B Lawson)\nSubject: Which high...
2    From:  (Hilmi Eren)\nSubject: Re: ARMENIA SAYS...
3    From:  (Guy Dawson)\nSubject: Re: IDE vs SCSI,...
4    From: Alexander Samuel McDiarmid \nSubject: dr...
Name: text_cleaned, dtype: object


2. **Extracting URLs:**

In [3]:
data['urls'] = data['text'].str.extract(r'(https?://\S+)', expand=False)
print(data['urls'].head())

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: urls, dtype: object
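
`str.extract` returns only the first match in each string, and `NaN` where nothing matches. If a text may contain several URLs, `str.findall` is one way to collect all of them. The sketch below uses made-up posts with placeholder URLs rather than the newsgroup data.

```python
import pandas as pd

# Made-up posts for illustration; the URLs are placeholders
posts = pd.Series([
    "See http://example.com/a and https://example.org/b for details",
    "No links here",
])

# First match only (NaN when there is no match)
first_url = posts.str.extract(r"(https?://\S+)", expand=False)

# Every match, returned as a list per row (empty list when there is no match)
all_urls = posts.str.findall(r"https?://\S+")
```

Because `\S+` consumes every non-whitespace character, a URL followed directly by punctuation (for example a closing parenthesis) will include that punctuation in the match.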


3. **Counting the number of digits in each text:**

In [4]:
data['digit_count'] = data['text'].str.count(r'\d')
print(data['digit_count'].head())

0     5
1    11
2     4
3    65
4    18
Name: digit_count, dtype: int64
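
The pattern `\d` counts individual digit characters. If the goal is to count whole numbers instead, `\d+` counts each run of consecutive digits once. A small sketch on made-up strings:

```python
import pandas as pd

# Made-up strings for illustration
notes = pd.Series(["Call 555-1234 at 10am", "no digits"])

digit_chars = notes.str.count(r"\d")   # individual digit characters
digit_runs = notes.str.count(r"\d+")   # maximal runs of consecutive digits

print(digit_chars.tolist())  # [9, 0]
print(digit_runs.tolist())   # [3, 0]
```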


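4. **Removing HTML tags:** This cell starts again from the raw `text` column, so it overwrites the `text_cleaned` column created in step 1. The pattern also strips angle-bracketed email addresses (rows 0 and 4 below), because anything between `<` and `>` looks like a tag to it.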

In [5]:
data['text_cleaned'] = data['text'].str.replace(r'<[^>]+>', '', regex=True)
print(data['text_cleaned'].head())

0    From: Mamatha Devineni Ratnam \nSubject: Pens ...
1    From: mblawson@midway.ecn.uoknor.edu (Matthew ...
2    From: hilmi-er@dsv.su.se (Hilmi Eren)\nSubject...
3    From: guyd@austin.ibm.com (Guy Dawson)\nSubjec...
4    From: Alexander Samuel McDiarmid \nSubject: dr...
Name: text_cleaned, dtype: object
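
If angle-bracketed addresses should survive tag removal, one option is a narrower pattern that requires a letter right after `<` (or `</`) and forbids `@` inside the tag. This is a heuristic sketch under that assumption, not an HTML parser, and the example string is made up.

```python
import pandas as pd

# Made-up example string; the address is a placeholder
s = pd.Series(["Mail <user+@example.edu> about <b>bold</b> text<br/>"])

# Pattern from the cell above: removes everything between < and >
loose = s.str.replace(r"<[^>]+>", "", regex=True)

# Narrower heuristic: tag name must start with a letter, no @ allowed inside
strict = s.str.replace(r"</?[A-Za-z][^>@]*>", "", regex=True)

print(loose.iloc[0])   # Mail  about bold text
print(strict.iloc[0])  # Mail <user+@example.edu> about bold text
```

For real HTML documents, a dedicated parser is usually a safer choice than any regular expression.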


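5. **Extracting email domains:** Because `\S+` matches any non-whitespace character, the captured group keeps the trailing `>` when an address is wrapped in angle brackets (rows 0 and 4 below).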

In [6]:
data['email_domain'] = data['text'].str.extract(r'\S+@(\S+)', expand=False)
print(data['email_domain'].head())

0          andrew.cmu.edu>
1    midway.ecn.uoknor.edu
2                dsv.su.se
3           austin.ibm.com
4          andrew.cmu.edu>
Name: email_domain, dtype: object
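
One way to avoid the stray `>` is to restrict the captured group to characters that typically appear in domain names. The sketch below compares both patterns on made-up header lines; the character class `[A-Za-z0-9.-]` is a simplifying assumption, not a full email grammar.

```python
import pandas as pd

# Made-up header lines for illustration
headers = pd.Series([
    "From: Some Person <user+@example.edu>",
    "From: someone@mail.example.org (Some Name)",
    "No address here",
])

# Pattern from the cell above: the group runs to the next whitespace
loose = headers.str.extract(r"\S+@(\S+)", expand=False)

# Tighter group: letters, digits, dots, and hyphens only
strict = headers.str.extract(r"@([A-Za-z0-9.-]+)", expand=False)

print(loose.iloc[0])   # example.edu>
print(strict.iloc[0])  # example.edu
```

The tighter pattern can still pick up a trailing dot when an address ends a sentence, so a final `.str.rstrip('.')` may be worth adding depending on the data.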


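6. **Splitting text into words and counting words per document:**

In [10]:
# Split on whitespace, then count the resulting tokens in each document
data['words'] = data['text'].str.split()
data['word_count'] = data['words'].str.len()
data.head()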

| | text | text_cleaned | urls | digit_count | email_domain | words | word_count |
|---|---|---|---|---|---|---|---|
| 0 | `From: Mamatha Devineni Ratnam <mr47+@andrew.cm...` | `From: Mamatha Devineni Ratnam \nSubject: Pens ...` | NaN | 5 | `andrew.cmu.edu>` | `[From:, Mamatha, Devineni, Ratnam, <mr47+@andr...` | 157 |
| 1 | `From: mblawson@midway.ecn.uoknor.edu (Matthew ...` | `From: mblawson@midway.ecn.uoknor.edu (Matthew ...` | NaN | 11 | `midway.ecn.uoknor.edu` | `[From:, mblawson@midway.ecn.uoknor.edu, (Matth...` | 134 |
| 2 | `From: hilmi-er@dsv.su.se (Hilmi Eren)\nSubject...` | `From: hilmi-er@dsv.su.se (Hilmi Eren)\nSubject...` | NaN | 4 | `dsv.su.se` | `[From:, hilmi-er@dsv.su.se, (Hilmi, Eren), Sub...` | 568 |
| 3 | `From: guyd@austin.ibm.com (Guy Dawson)\nSubjec...` | `From: guyd@austin.ibm.com (Guy Dawson)\nSubjec...` | NaN | 65 | `austin.ibm.com` | `[From:, guyd@austin.ibm.com, (Guy, Dawson), Su...` | 538 |
| 4 | `From: Alexander Samuel McDiarmid <am2o+@andrew...` | `From: Alexander Samuel McDiarmid \nSubject: dr...` | NaN | 18 | `andrew.cmu.edu>` | `[From:, Alexander, Samuel, McDiarmid, <am2o+@a...` | 150 |
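
The cell above counts words per document. For word frequencies across the whole collection, one possible approach is to tokenize with a regex, flatten the lists with `explode`, and tally with `value_counts`. The sketch below uses a made-up three-row frame, and the token pattern `[a-z']+` is an illustrative assumption.

```python
import pandas as pd

# Made-up documents for illustration
docs = pd.DataFrame({"text": ["The cat sat", "the dog sat down", "A cat"]})

# Lowercase, then keep runs of letters and apostrophes as tokens
docs["words"] = docs["text"].str.lower().str.findall(r"[a-z']+")

# One row per token, then count how often each token occurs overall
word_freq = docs["words"].explode().value_counts()

print(word_freq["the"])  # 2
print(word_freq["dog"])  # 1
```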


## Feature Engineering with Regular Expressions
Regular expressions are also useful for feature engineering: you can derive new numeric features from specific text patterns. For example, you can count the URLs, email addresses, or hashtags in each text and use those counts as features in a machine learning model.

In [12]:
# Count the URLs in each text
data['num_urls'] = data['text'].str.count(r'https?://\S+')

# Count the email-like tokens (anything of the form something@something)
data['num_emails'] = data['text'].str.count(r'\S+@\S+')

# Count the hashtags in each text
data['num_hashtags'] = data['text'].str.count(r'#\w+')
data

| | text | text_cleaned | urls | digit_count | email_domain | words | word_count | num_urls | num_emails | num_hashtags |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `From: Mamatha Devineni Ratnam <mr47+@andrew.cm...` | `From: Mamatha Devineni Ratnam \nSubject: Pens ...` | NaN | 5 | `andrew.cmu.edu>` | `[From:, Mamatha, Devineni, Ratnam, <mr47+@andr...` | 157 | 0 | 1 | 0 |
| 1 | `From: mblawson@midway.ecn.uoknor.edu (Matthew ...` | `From: mblawson@midway.ecn.uoknor.edu (Matthew ...` | NaN | 11 | `midway.ecn.uoknor.edu` | `[From:, mblawson@midway.ecn.uoknor.edu, (Matth...` | 134 | 0 | 2 | 0 |
| 2 | `From: hilmi-er@dsv.su.se (Hilmi Eren)\nSubject...` | `From: hilmi-er@dsv.su.se (Hilmi Eren)\nSubject...` | NaN | 4 | `dsv.su.se` | `[From:, hilmi-er@dsv.su.se, (Hilmi, Eren), Sub...` | 568 | 0 | 2 | 0 |
| 3 | `From: guyd@austin.ibm.com (Guy Dawson)\nSubjec...` | `From: guyd@austin.ibm.com (Guy Dawson)\nSubjec...` | NaN | 65 | `austin.ibm.com` | `[From:, guyd@austin.ibm.com, (Guy, Dawson), Su...` | 538 | 0 | 8 | 0 |
| 4 | `From: Alexander Samuel McDiarmid <am2o+@andrew...` | `From: Alexander Samuel McDiarmid \nSubject: dr...` | NaN | 18 | `andrew.cmu.edu>` | `[From:, Alexander, Samuel, McDiarmid, <am2o+@a...` | 150 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18841 | `From: jim.zisfein@factory.com (Jim Zisfein) \n...` | `From: jim.zisfein@factory.com (Jim Zisfein) \n...` | NaN | 16 | `factory.com` | `[From:, jim.zisfein@factory.com, (Jim, Zisfein...` | 337 | 0 | 4 | 0 |
| 18842 | `From: rdell@cbnewsf.cb.att.com (richard.b.dell...` | `From: rdell@cbnewsf.cb.att.com (richard.b.dell...` | NaN | 21 | `cbnewsf.cb.att.com` | `[From:, rdell@cbnewsf.cb.att.com, (richard.b.d...` | 144 | 0 | 5 | 0 |
| 18843 | `From: westes@netcom.com (Will Estes)\nSubject:...` | `From: westes@netcom.com (Will Estes)\nSubject:...` | NaN | 10 | `netcom.com` | `[From:, westes@netcom.com, (Will, Estes), Subj...` | 139 | 0 | 2 | 0 |
| 18844 | `From: steve@hcrlgw (Steven Collins)\nSubject: ...` | `From: steve@hcrlgw (Steven Collins)\nSubject: ...` | NaN | 32 | `hcrlgw` | `[From:, steve@hcrlgw, (Steven, Collins), Subje...` | 194 | 0 | 4 | 0 |
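
When several count features share the same structure, keeping the patterns in a dictionary and looping over it can reduce repetition. The sketch below reuses the three patterns from the cell above on a made-up frame; the `has_contact_info` flag is a hypothetical derived feature added only for illustration.

```python
import pandas as pd

# Made-up posts for illustration; addresses and URLs are placeholders
posts = pd.DataFrame({"text": [
    "Email me at a@example.com or b@example.org #help",
    "Visit https://example.com #news #update",
    "Nothing special here",
]})

# Feature name -> regex pattern (same patterns as in the lecture cell)
patterns = {
    "num_urls": r"https?://\S+",
    "num_emails": r"\S+@\S+",
    "num_hashtags": r"#\w+",
}

# One count column per pattern
for name, pattern in patterns.items():
    posts[name] = posts["text"].str.count(pattern)

# Hypothetical binary feature derived from the counts
posts["has_contact_info"] = (posts["num_urls"] + posts["num_emails"]) > 0

print(posts[["num_urls", "num_emails", "num_hashtags"]].values.tolist())
# [[0, 2, 1], [1, 0, 2], [0, 0, 0]]
```

Keeping the patterns in one place also makes it easier to review them: `\S+@\S+`, for instance, counts any whitespace-delimited token containing `@`, which may include strings that are not real email addresses.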
