# String Manipulation


In [7]:
val = 'a,b ,  guido'
print(val.split(','))

['a', 'b ', '  guido']


In [8]:
# split is often combined with strip to trim whitespace (including line breaks):
pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

In [10]:
# These substrings could be concatenated together with a two-colon delimiter using addition:
first, second, third = pieces
first + '::' + second + '::' + third

'a::b::guido'

In [31]:
# But this isn’t a practical generic method. A faster and more Pythonic way is to pass a
# list or tuple to the join method on the string '::':
print('::'.join(pieces))
print('***'.join(['Andrei', 'Benya', 'Yulia']))

a::b::guido
Andrei***Benya***Yulia
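
One detail worth keeping in mind is that join expects every element to be a string. The following is a minimal sketch using a hypothetical mixed-type list, `items`, which is an assumption and not part of the example above:

```python
# Hypothetical mixed-type list (not from the example above)
items = ['a', 1, 2.5]

# join raises TypeError on non-string elements, so convert each one first
joined = '::'.join(str(x) for x in items)
print(joined)
```

Converting with str inside a generator expression avoids building an intermediate list.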


In [20]:
# Other methods are concerned with locating substrings. Using Python’s in keyword is
# the best way to detect a substring, though index and find can also be used
print('guido' in val)
print(val.index('g'))
print(val.find(':'))


True
7
-1


In [18]:
# Note that the difference between find and index is that index raises an exception if
# the substring isn’t found (versus returning -1)
val.index(':')

ValueError: substring not found
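
The two failure modes call for different handling. As a rough sketch (the variable names `found_with_find` and `found_with_index` are illustrative assumptions), a missing substring can be detected either by testing the return value of find or by catching the exception from index:

```python
val = 'a,b ,  guido'

# find signals a missing substring with -1, so test the return value
pos = val.find(':')
found_with_find = pos != -1

# index signals a missing substring by raising ValueError
try:
    val.index(':')
    found_with_index = True
except ValueError:
    found_with_index = False

print(found_with_find, found_with_index)
```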

In [21]:
# Relatedly, count returns the number of occurrences of a particular substring
val.count(',')

2

In [23]:
# replace will substitute occurrences of one pattern for another. It is commonly used
# to delete patterns, too, by passing an empty string
print(val.replace(',', '::'))
print(val.replace(',', ''))

a::b ::  guido
ab   guido
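
replace also accepts an optional third argument that caps the number of substitutions. A short sketch, reusing the same `val` string:

```python
val = 'a,b ,  guido'

# Replace only the first comma; later occurrences are left untouched
first_only = val.replace(',', '::', 1)
print(first_only)
```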


In [29]:
# endswith returns True if the string ends with the given suffix
print(val.endswith('o'))
print(val.endswith('j'))

True
False


In [28]:
# startswith returns True if the string begins with the given prefix
print(val.startswith('a'))
print(val.startswith('x'))

True
False


In [32]:
val

'a,b ,  guido'

In [34]:
# find returns the index of the first occurrence; rfind returns the index of the last
print(val.find(','))
print(val.rfind(','))

1
4
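
One possible use of rfind is splitting a string around its last delimiter. The sketch below assumes the same `val` string; the names `head`, `tail`, and `parts` are illustrative. The built-in rpartition method performs the equivalent split in a single call:

```python
val = 'a,b ,  guido'

# Position of the last comma, then slice around it
last = val.rfind(',')
head, tail = val[:last], val[last + 1:]

# rpartition returns (before, separator, after) for the last occurrence
parts = val.rpartition(',')
print(repr(head), repr(tail), parts)
```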


## Regular expressions

In [3]:
import re

text = 'foo   bar\t baz   \tqux'

re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

When you call re.split(r'\s+', text), the regular expression is first compiled, and then its split method is called on the passed text. You can compile the regex yourself with re.compile, forming a reusable regex object:

In [4]:
regex = re.compile(r'\s+')
regex.split(text)

['foo', 'bar', 'baz', 'qux']
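
Compiling pays off mainly when the same pattern is applied to many strings. A minimal sketch, assuming a hypothetical list of input lines named `lines`:

```python
import re

# Compile once, then reuse the regex object for every input
whitespace = re.compile(r'\s+')

# Hypothetical inputs (not from the example above)
lines = ['foo   bar', 'baz\t qux', 'one two']
tokens = [whitespace.split(line) for line in lines]
print(tokens)
```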

If, instead, you want a list of all substrings matching the regex, you can use the
findall method:

In [5]:
regex.findall(text)

['   ', '\t ', '   \t']

In [6]:
re.findall(r'\s+', text)

['   ', '\t ', '   \t']

match and search are closely related to findall. While findall returns all matches in a string, search returns only the first match. More rigidly, match only matches at the beginning of the string. As a less trivial example, let’s consider a block of text and a regular expression capable of identifying most email addresses.

To avoid unwanted escaping with \ in a regular expression, use raw
string literals like r'C:\x' instead of the equivalent 'C:\\x'.
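
The effect of the r prefix can be checked directly. This is a small sketch; the sample string passed to findall is a made-up assumption:

```python
import re

# A raw literal keeps the backslash as-is; a regular literal needs it doubled
same_pattern = r'\d+' == '\\d+'

# '\n' is one newline character, while r'\n' is two characters: backslash and n
lengths = (len('\n'), len(r'\n'))

# Hypothetical sample string (not from the text above)
numbers = re.findall(r'\d+', 'room 101, floor 7')
print(same_pattern, lengths, numbers)
```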

In [7]:
text = '''Dave dave@google.com
    Steve steve@gmail.com
    Rob rob@gmail.com
    Ryan ryan@yahoo.com'''
print(text)
pattern = r'[A-Z0-9.%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags = re.IGNORECASE)



Dave dave@google.com
    Steve steve@gmail.com
    Rob rob@gmail.com
    Ryan ryan@yahoo.com


Using findall on the text produces a list of the email addresses:

In [8]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

search returns a special match object for the first email address in the text. For the preceding regex, the match object can only tell us the start and end position of the pattern in the string:

In [12]:
m = regex.search(text)
print(m)
print(m.start())
print(m.end())
print(text[m.start():m.end()])

<re.Match object; span=(5, 20), match='dave@google.com'>
5
20
dave@google.com


regex.match returns None, as it will only match if the pattern occurs at the start of the
string:

In [14]:
print(regex.match(text))
print(regex.match('andrew.kristov@gmail.com this is my email'))

None
<re.Match object; span=(0, 24), match='andrew.kristov@gmail.com'>
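
The difference between the matching methods can be summarized side by side. The sketch below also uses fullmatch, which is not covered in the text above and is included as an assumption about what is useful here: it succeeds only when the entire string matches the pattern.

```python
import re

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

s = 'Dave dave@google.com'

# match anchors at the start of the string, so the leading name blocks it
at_start = regex.match(s)

# search scans forward and returns the first match anywhere
anywhere = regex.search(s).group()

# fullmatch requires the whole string to be an email address
whole_ok = regex.fullmatch('dave@google.com') is not None
whole_bad = regex.fullmatch('dave@google.com and more') is not None

print(at_start, anywhere, whole_ok, whole_bad)
```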


Relatedly, sub will return a new string with occurrences of the pattern replaced by a new string:

In [16]:
print(regex.sub('<e-mail>', text))

Dave <e-mail>
    Steve <e-mail>
    Rob <e-mail>
    Ryan <e-mail>


Suppose you wanted to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment:

In [17]:
pattern = r'([A-Z0-9._%-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags = re.IGNORECASE)


re.compile(r'([A-Z0-9._%-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})',
re.IGNORECASE|re.UNICODE)

A match object produced by this modified regex returns a tuple of the pattern components with its groups method:

In [19]:
m = regex.match('andrew.kristov@gmail.com')
m.groups()

('andrew.kristov', 'gmail', 'com')
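
Groups can also be given names with the (?P<name>...) syntax, which makes the result easier to read than positional tuples. A minimal sketch; the group names `username`, `domain`, and `suffix` are chosen here for illustration:

```python
import re

# Same structure as the grouped pattern, with a name attached to each group
pattern = (r'(?P<username>[A-Z0-9._%+-]+)'
           r'@(?P<domain>[A-Z0-9.-]+)'
           r'\.(?P<suffix>[A-Z]{2,4})')
regex = re.compile(pattern, flags=re.IGNORECASE)

m = regex.match('andrew.kristov@gmail.com')

# groupdict maps each group name to its matched text
named = m.groupdict()
print(named)
print(m.group('domain'))
```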

findall returns a list of tuples when the pattern has groups:


In [27]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

sub also has access to groups in each match using special symbols like \1 and \2. The symbol \1 corresponds to the first matched group, \2 corresponds to the second, and so forth:

In [30]:
print(regex.sub(r'<Username: \1, Domain: \2, Suffix: \3>', text))

Dave <Username: dave, Domain: google, Suffix: com>
    Steve <Username: steve, Domain: gmail, Suffix: com>
    Rob <Username: rob, Domain: gmail, Suffix: com>
    Ryan <Username: ryan, Domain: yahoo, Suffix: com>
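
When the replacement needs logic beyond backreferences, sub also accepts a function that receives each match object and returns the replacement string. The helper `mask_username` below is hypothetical and only sketches the idea:

```python
import re

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

def mask_username(m):
    """Hypothetical helper: keep the first letter of the username only."""
    username, domain, suffix = m.groups()
    return f'{username[0]}***@{domain}.{suffix}'

# sub calls mask_username once per match and splices in its return value
masked = regex.sub(mask_username, 'Dave dave@google.com')
print(masked)
```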


# Vectorized String Functions in pandas

Cleaning up a messy dataset for analysis often requires a lot of string munging and regularization. To complicate matters, a column containing strings will sometimes have missing data:

In [37]:
import numpy as np
import pandas as pd

In [42]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [43]:
data.isnull()

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

String and regular expression methods can be applied (passing a lambda or other function) to each value using data.map, but this will fail on the NA (null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values. These are accessed through Series’s str attribute; for example, we could check whether each email address has 'gmail' in it with str.contains:


In [45]:
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
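
If the result is meant for Boolean filtering, the NaN entry can get in the way. One option, sketched here under the assumption that missing addresses should count as non-matches, is the na parameter of str.contains:

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# na=False fills the result for missing values with False instead of NaN
is_gmail = data.str.contains('gmail', na=False)

# The Boolean result can then be used directly as a mask
gmail_users = data[is_gmail]
print(gmail_users)
```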

Regular expressions can be used, too, along with any re options like IGNORECASE:

In [50]:
pattern = r'([A-Z0-9._%-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
data.str.findall(pattern, flags = re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object
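
When each value contains at most one match, str.extract may be a more convenient alternative: it returns the captured groups as columns of a DataFrame rather than as lists of tuples. A sketch using the same data and pattern:

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# One column per capture group; rows with missing input are all NaN
parts = data.str.extract(pattern, flags=re.IGNORECASE)
print(parts)
```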

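There are a couple of ways to do vectorized element retrieval: either use str.get or index into the str attribute. Note first that str.match returns a Series of Booleans indicating whether each value matches the pattern, not the matched groups: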

In [51]:
matches = data.str.match(pattern, flags = re.IGNORECASE)
matches

Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object

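To access elements in embedded lists, we can pass an index to either str.get or str[...]. That does not work on matches here: str.match returned Booleans rather than lists, so accessing the str attribute itself raises an error (and str is an accessor, not a callable):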

In [55]:
matches.str(0)

AttributeError: Can only use .str accessor with string values!
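
A working variant is sketched below. It assumes the goal is to retrieve the first match per row, so it builds the Series of lists with str.findall instead of str.match and then indexes into it with str.get or str[...]:

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# findall yields a list of tuples per row (NaN where the input is missing)
matches = data.str.findall(pattern, flags=re.IGNORECASE)

# Both forms pick element 0 of each embedded list
first = matches.str.get(0)
first_alt = matches.str[0]

# The accessor can be chained to reach into the tuple, here the domain name
domains = first.str.get(1)
print(first)
print(domains)
```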