String Manipulation

String Object Methods

In [25]:
val = 'a,b,  guido'
val.split(',')

['a', 'b', '  guido']

In [26]:
pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

In [27]:
first, second, third = pieces
first + '::' + second + '::' + third

'a::b::guido'

In [28]:
'::'.join(pieces)  # faster and more idiomatic than chaining +

'a::b::guido'

In [29]:
print('guido' in val)
print(val.index(','))
print(val.find(':'))

True
1
-1


In [30]:
val.index(':')

ValueError: substring not found
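
The difference between index and find matters when the substring may be absent: find returns -1, while index raises ValueError. A minimal sketch of handling both cases, reusing the same val as above; the helper name safe_index is hypothetical and only for illustration.

```python
val = 'a,b,  guido'

def safe_index(s, sub):
    """Return the position of sub in s, or None if absent (hypothetical helper)."""
    try:
        return s.index(sub)
    except ValueError:
        # index raises instead of returning a sentinel value
        return None

pos_find = val.find(':')           # -1 signals "not found"
pos_index = safe_index(val, ':')   # None, because index raised ValueError
pos_comma = safe_index(val, ',')   # position of the first comma
```

Checking for -1 is easy to forget, so the exception raised by index can be the safer choice when a missing substring is a genuine error.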

In [31]:
val.count(',')

2

In [32]:
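val.replace(',', '::')  # returns 'a::b::  guido'; not displayed, since only the last expression is shown
val.replace(',', '')    # replacing with an empty string deletes the commas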

'ab  guido'

Regular Expressions

In [33]:
import re
text = "foo    bar\t baz  \tqux"
re.split(r'\s+', text)  # \s+ matches one or more whitespace characters

['foo', 'bar', 'baz', 'qux']

In [34]:
# You can compile the regex yourself with re.compile, forming a reusable regex object
regex = re.compile(r'\s+')
regex.split(text)

['foo', 'bar', 'baz', 'qux']

In [35]:
# To get a list of all substrings matching the regex, use the findall method
regex.findall(text)

['    ', '\t ', '  \t']

To avoid unwanted escaping with \ in a regular expression, use raw string literals such as r'C:\x' instead of the equivalent 'C:\\x'.
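
As a small illustration of the raw-string point, the sketch below checks that the two spellings produce the same characters and then uses the raw form in a regex. The variable names are made up for this example.

```python
import re

raw = r'C:\x'        # the backslash is kept literally
escaped = 'C:\\x'    # the same characters, written with an escaped backslash
same = raw == escaped
length = len(raw)    # C, :, \, x

# With a raw string, \s reaches the regex engine unchanged as the whitespace class
parts = re.split(r'\s+', 'foo    bar\t baz')
```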

Creating a regex object with re.compile is highly recommended if you intend to apply the same expression to many strings; doing so will save CPU cycles.
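
A minimal sketch of that pattern: compile once, then reuse the object for every string. The sample lines are invented for illustration. Note that the module-level re functions also cache recently used patterns, so the saving is likely to be modest in small scripts.

```python
import re

# Compile once, outside the loop
whitespace = re.compile(r'\s+')

lines = ['foo    bar', 'baz\t qux', 'one  two   three']  # made-up sample strings

# Reuse the same compiled object for every string
tokens = [whitespace.split(line) for line in lines]
```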

In [36]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
Bob ab._%+-@ab.cc.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

In [37]:
regex.findall(text)

['dave@google.com',
 'steve@gmail.com',
 'rob@gmail.com',
 'ryan@yahoo.com',
 'ab._%+-@ab.cc.com']

In [38]:
# search returns a match object for the first email address in the text.
# For this regex, the match object can only tell us the start and end positions of the match in the string
m = regex.search(text)
m
text[m.start():m.end()]

'dave@google.com'

In [39]:
# regex.match returns None, because it matches only if the pattern occurs at the start of the string
print(regex.match(text))

None
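
To make the contrast between match and search concrete, here is a small sketch using the same email pattern on two invented strings, one with the address at the start and one with it after a name.

```python
import re

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

at_start = 'dave@google.com is Dave'   # address at position 0
later = 'Dave dave@google.com'         # address after a name

m_start = regex.match(at_start)   # match object: the pattern occurs at the start
m_later = regex.match(later)      # None: the string starts with 'Dave '
s_later = regex.search(later)     # match object: search scans the whole string
```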


In [40]:
# sub returns a new string with occurrences of the pattern replaced by a new string
print(regex.sub('REDACTED', text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
Bob REDACTED



Suppose you want to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment:

In [41]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

A match object produced by this modified regex returns a tuple of the pattern components from its groups method:

In [42]:
m = regex.match('wesm@bright.net')
m.groups()

('wesm', 'bright', 'net')

findall returns a list of tuples when the pattern has groups:

In [43]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com'),
 ('ab._%+-', 'ab.cc', 'com')]

sub also has access to the groups in each match through special symbols such as \1 and \2. The symbol \1 corresponds to the first matched group, \2 to the second, and so forth:

In [44]:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
Bob Username: ab._%+-, Domain: ab.cc, Suffix: com



Vectorized String Functions in pandas

In [45]:
import numpy as np
import pandas as pd
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
print(data)
data.isnull()

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object


Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

In [46]:
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object

In [47]:
pattern
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

In [48]:
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches

Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object

In [49]:
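# matches holds booleans rather than captured groups, so element access has nothing to index;
# this pandas version returns NaN, and newer versions may raise AttributeError instead
matches.str.get(1)
matches.str[0]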

Dave    NaN
Steve   NaN
Rob     NaN
Wes     NaN
dtype: float64
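
Because str.match returns only booleans here, indexing into matches cannot recover the captured groups. One way to get them, assuming a pandas version that provides Series.str.extract, is sketched below. It returns a DataFrame with one column per group.

```python
import re
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# One row per element, one column per capture group; missing input stays missing
parts = data.str.extract(pattern, flags=re.IGNORECASE)

# Column 1 holds the second group (the domain name)
domains = parts[1]
```

The columns are labeled 0, 1, and 2 because the groups in this pattern are unnamed.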

In [50]:
data.str[:5]

Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object