# 🐼 String Processing with Pandas

In [0]:
import pandas as pd

In [3]:
time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]
df = pd.DataFrame(time_sentences, columns=['text'])
df

                                                text
0     Monday: The doctor's appointment is at 2:45pm.
1  Tuesday: The dentist's appointment is at 11:30...
2  Wednesday: At 7:00pm, there is a basketball game!
3  Thursday: Be back home by 11:15 pm at the latest.
4  Friday: Take the train at 08:10 am, arrive at ...


🔢 Number of characters for each string in the data frame

In [4]:
df['text'].str.len()

0    46
1    50
2    49
3    49
4    54
Name: text, dtype: int64

🍡 Number of whitespace-separated tokens for each string in the data frame

In [7]:
df['text'].str.split().str.len()

0     7
1     8
2     8
3    10
4    10
Name: text, dtype: int64

💬 Entries that contain a given word

In [8]:
df['text'].str.contains('appointment')

0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool
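
A small sketch of the options that usually matter with `str.contains`: `case`, `regex`, and `na`. The shortened series `text` and the `None` entry are made up for illustration.

```python
import pandas as pd

# Illustrative data: two of the sentences above plus a missing value
text = pd.Series([
    "Monday: The doctor's appointment is at 2:45pm.",
    "Wednesday: At 7:00pm, there is a basketball game!",
    None,
])

# case=False ignores letter case; na=False turns missing values into False
any_case = text.str.contains("APPOINTMENT", case=False, na=False)

# regex=False matches the pattern literally, so "." is only a dot
literal_dot = text.str.contains("2.45", regex=False, na=False)

# regex=True (the default for contains) lets "." match any character, e.g. ":"
regex_dot = text.str.contains("2.45", regex=True, na=False)

print(any_case.tolist(), literal_dot.tolist(), regex_dot.tolist())
```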

## 🤯 Regex in Pandas

🔢 Digit occurrences in each string (regex)

In [11]:
df['text'].str.count(r'\d')

0    3
1    4
2    3
3    4
4    8
Name: text, dtype: int64

🔍 All occurrences of digits

In [12]:
df['text'].str.findall(r'\d')

0                   [2, 4, 5]
1                [1, 1, 3, 0]
2                   [7, 0, 0]
3                [1, 1, 1, 5]
4    [0, 8, 1, 0, 0, 9, 0, 0]
Name: text, dtype: object

⌛ Hours and minutes

In [14]:
df['text'].str.findall(r'(\d{1,2}):(\d{1,2})')

0               [(2, 45)]
1              [(11, 30)]
2               [(7, 00)]
3              [(11, 15)]
4    [(08, 10), (09, 00)]
Name: text, dtype: object

💢 Replace patterns with fixed values

In [15]:
df['text'].str.replace(r'\w+day', 'esma', regex=True)

0         esma: The doctor's appointment is at 2:45pm.
1      esma: The dentist's appointment is at 11:30 am.
2         esma: At 7:00pm, there is a basketball game!
3        esma: Be back home by 11:15 pm at the latest.
4    esma: Take the train at 08:10 am, arrive at 09...
Name: text, dtype: object
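
Since pandas 2.0, `str.replace` treats the pattern as a literal string unless `regex=True` is passed. The sketch below contrasts the two modes on a single sentence; the variable names are only illustrative.

```python
import pandas as pd

text = pd.Series(["Monday: The doctor's appointment is at 2:45pm."])

# regex=False: looks for the literal characters \w+day, which never occur
as_literal = text.str.replace(r"\w+day", "esma", regex=False)

# regex=True: \w+day is a pattern and matches "Monday"
as_pattern = text.str.replace(r"\w+day", "esma", regex=True)

print(as_literal[0])
print(as_pattern[0])
```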

💥 Replace patterns with dynamic values

In [19]:
df['text'].str.replace(r'(\w+day)', lambda m: m.group(1)[:3], regex=True)

0          Mon: The doctor's appointment is at 2:45pm.
1       Tue: The dentist's appointment is at 11:30 am.
2          Wed: At 7:00pm, there is a basketball game!
3         Thu: Be back home by 11:15 pm at the latest.
4    Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object
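
The callable receives a `re.Match` object, so any string logic can go inside it. As one more sketch (not part of the original notebook), the snippet below pads single-digit hours with a leading zero; the pattern `\b(\d):` is an assumption chosen for these sample sentences.

```python
import pandas as pd

text = pd.Series([
    "Monday: The doctor's appointment is at 2:45pm.",
    "Tuesday: The dentist's appointment is at 11:30 am.",
])

# \b(\d): matches an hour made of exactly one digit followed by a colon;
# the lambda rebuilds it as a zero-padded hour plus the colon
padded = text.str.replace(
    r"\b(\d):",
    lambda m: m.group(1).zfill(2) + ":",
    regex=True,
)

print(padded.tolist())
```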

✨ Using matches to create new columns

In [20]:
df['text'].str.extract(r'(\d{1,2}):(\d{1,2})')

    0   1
0   2  45
1  11  30
2   7  00
3  11  15
4  08  10
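
Two things are easy to miss with `str.extract`: it keeps only the first match per row, and the captured values are strings (or NaN when nothing matches). A minimal sketch, with a made-up second sentence that contains no time:

```python
import pandas as pd

text = pd.Series([
    "Friday: Take the train at 08:10 am, arrive at 09:00am.",
    "No time given.",  # hypothetical row without a match
])

parts = text.str.extract(r"(\d{1,2}):(\d{1,2})")

# Only the first time (08:10) is kept, as strings with leading zeros
print(parts.iloc[0].tolist())

# Rows without a match are NaN, so pd.to_numeric is safer than astype(int)
hours = pd.to_numeric(parts[0])
print(hours.tolist())
```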


In [22]:
df['text'].str.extractall(r'(\d{1,2}):(\d{1,2}) *(pm|am)')

          0   1   2
  match
0 0       2  45  pm
1 0      11  30  am
2 0       7  00  pm
3 0      11  15  pm
4 0      08  10  am
  1      09  00  am
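
`extractall` returns one row per match, indexed by the original row label and a `match` level. The sketch below shows two common follow-ups under that layout: counting matches per sentence and keeping only the first match. The third sentence is invented to show a row with no match.

```python
import pandas as pd

text = pd.Series([
    "Monday: The doctor's appointment is at 2:45pm.",
    "Friday: Take the train at 08:10 am, arrive at 09:00am.",
    "No time given.",  # hypothetical row without a match
])

matches = text.str.extractall(r"(\d{1,2}):(\d{1,2}) *(pm|am)")

# Matches per original row; rows with no match are absent, so reindex with 0
per_row = matches.groupby(level=0).size().reindex(text.index, fill_value=0)

# Keep only the first match of each row and drop the "match" level
first = matches.xs(0, level="match")

print(per_row.tolist())
print(first)
```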


👩‍💼 Column Naming

In [28]:
df['text'].str.extractall(r'(?P<Time>(?P<Hour>\d{1,2}):(?P<Min>\d{1,2}) *(?P<Period>pm|am))')

             Time Hour Min Period
  match
0 0        2:45pm    2  45     pm
1 0      11:30 am   11  30     am
2 0        7:00pm    7  00     pm
3 0      11:15 pm   11  15     pm
4 0      08:10 am   08  10     am
  1       09:00am   09  00     am
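
Named groups make the extracted columns easy to compute with. As a sketch, the snippet below turns each time into minutes since midnight; the `Minutes` column and the 12-hour conversion rule are additions for illustration, not part of the original notebook.

```python
import pandas as pd

text = pd.Series([
    "Monday: The doctor's appointment is at 2:45pm.",
    "Friday: Take the train at 08:10 am, arrive at 09:00am.",
])

times = text.str.extractall(
    r"(?P<Hour>\d{1,2}):(?P<Min>\d{2}) *(?P<Period>am|pm)"
)

# Captured groups are strings, so convert before doing arithmetic
hour = times["Hour"].astype(int)
minute = times["Min"].astype(int)

# 12-hour to 24-hour: 12am becomes 0, 12pm stays 12, other pm hours get +12
hour24 = hour % 12 + (times["Period"] == "pm") * 12

times["Minutes"] = hour24 * 60 + minute
print(times)
```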
