
In [1]:
import numpy as np
import pandas as pd

# 7.3 String Manipulation
pandas adds to the mix by enabling you to apply string methods and regular expressions concisely on whole arrays of data, while also handling the annoyance of missing data.

## String Object Methods
As an example, a comma-separated string can be broken into pieces with `split`:

In [2]:
val = 'a,b, guido'
val.split(',')

['a', 'b', ' guido']

`split` is often combined with `strip` to trim whitespace (including line breaks):

In [3]:
pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

These substrings could be concatenated together with a two-colon delimiter using addition:

In [4]:
first, second, third = pieces
first + '::' + second + '::' + third

'a::b::guido'

A faster and more Pythonic way is to pass a list or tuple to the `join` method on the string `'::'`:

In [5]:
'::'.join(pieces)

'a::b::guido'

Using Python’s `in` keyword is the best way to detect a substring, though `index` and `find` can also be used:

In [6]:
'guido' in val

True

In [7]:
val.index(',')

1

In [8]:
val.find(':')

-1

Note the difference between `find` and `index` is that `index` raises an exception if the string isn’t found (versus returning –1):

In [9]:
val.index(':')

ValueError: substring not found
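
Because the two methods signal "not found" differently, the calling code has to handle each one differently. The following is a minimal sketch of both styles, reusing the `val` string from above; the variable names `found_with_find` and `pos_index` are only illustrative.

```python
val = 'a,b, guido'

# find reports a missing substring with -1, so compare the result explicitly
pos = val.find(':')
found_with_find = pos != -1

# index raises ValueError instead, so it needs a try/except
try:
    pos_index = val.index(':')
except ValueError:
    pos_index = None

print(found_with_find, pos_index)
```

`find` is convenient when a missing substring is an expected case, while `index` suits situations where a missing substring should be treated as an error.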

Relatedly, `count` returns the number of occurrences of a particular substring:

In [10]:
val.count(',')

2

`replace` will substitute occurrences of one pattern for another. It is also commonly used to delete a pattern by passing an empty string as the replacement:

In [11]:
val.replace(',', '::')

'a::b:: guido'

In [12]:
val.replace(',', '')

'ab guido'

#### See Table 7-3 for a listing of some of Python’s string methods
![image.png](attachment:image.png)
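
In case the table image does not render, here is a rough sketch of a few other built-in `str` methods that were not demonstrated above. The selection is an assumption made for illustration, not a transcription of Table 7-3, and the `padded` string is a made-up sample.

```python
val = 'a,b, guido'
padded = '  guido \n'   # hypothetical sample with surrounding whitespace

starts = val.startswith('a')       # does the string begin with 'a'?
ends = val.endswith('guido')       # does it end with 'guido'?
upper = val.upper()                # uppercase copy
last_comma = val.rfind(',')        # position of the last comma, or -1 if absent
right_trimmed = padded.rstrip()    # trim trailing whitespace only
left_trimmed = padded.lstrip()     # trim leading whitespace only
filled = 'b'.ljust(3, '.')         # left-justify, padding on the right to width 3

print(starts, ends, upper, last_comma)
print(repr(right_trimmed), repr(left_trimmed), filled)
```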

## Regular Expressions
Regular expressions provide a flexible way to search or match (often more complex) string patterns in text.
* The `re` module functions fall into three categories: **pattern matching, substitution, and splitting.**
* Let’s look at a simple example: suppose we wanted to split a string with a variable number of whitespace characters (tabs, spaces, and newlines). The regex describing one or more whitespace characters is `\s+`:

In [13]:
import re
text = "foo bar\t baz \tqux"
# Use a raw string so the backslash in \s reaches the regex engine unchanged
re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

* When you call `re.split`, the regular expression is first compiled, and then its `split` method is called on the passed text.
* You can compile the regex yourself with `re.compile`, forming a reusable regex object:

In [14]:
regex = re.compile(r'\s+')
regex.split(text)

['foo', 'bar', 'baz', 'qux']

If, instead, you want a list of all substrings matching the regex, use the `findall` method:

In [15]:
regex.findall(text)

[' ', '\t ', ' \t']

Creating a regex object with `re.compile` is highly recommended if you intend to apply the same expression to many strings; doing so will save CPU cycles.

`match` and `search` are closely related to `findall`:
* `findall` returns all matches in a string.
* `search` returns only the first match.
* `match` only matches at the beginning of the string.
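
As a small illustration of reusing one compiled pattern across many strings, the sketch below compiles the whitespace regex once and applies it in a loop. The `lines` list is made-up sample data.

```python
import re

# Compile once, outside the loop
whitespace = re.compile(r'\s+')

# Hypothetical sample strings
lines = ['foo bar\t baz', 'a  b', 'single']

# Reuse the same compiled object for every string
tokens = [whitespace.split(line) for line in lines]
print(tokens)
```

The `re` module also keeps an internal cache of recently compiled patterns, so the saving comes mostly from skipping the repeated lookup; holding a named regex object additionally keeps the pattern defined in one place.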

In [16]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

In [17]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

`search` returns a special match object for the first email address in the text.

In [18]:
m = regex.search(text)
m

<re.Match object; span=(5, 20), match='dave@google.com'>

In [19]:
text[m.start():m.end()]

'dave@google.com'

`regex.match` returns `None`, as it will match only if the pattern occurs at the start of the string:

In [20]:
print(regex.match(text))

None
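
Since both `match` and `search` return `None` when nothing is found, code that uses the result usually needs a guard before calling methods on it. This is a minimal sketch using the same email pattern; the helper name `first_email` and the sample strings are hypothetical.

```python
import re

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

def first_email(s):
    """Return the first email-like substring in s, or None if there is none."""
    m = regex.search(s)
    return m.group(0) if m is not None else None

# match succeeds only when the pattern starts at position 0
at_start = regex.match('dave@google.com is Dave')
not_at_start = regex.match('Dave dave@google.com')

print(at_start.group(0))
print(not_at_start)
print(first_email('no address here'))
```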


Relatedly, `sub` will return a new string with occurrences of the pattern replaced by a new string:

In [21]:
print(regex.sub('REDACTED', text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED



Suppose you wanted to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix. **To do this, put parentheses around the parts of the pattern to segment:**

In [22]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

A match object produced by this modified regex returns a tuple of the pattern components with its `groups` method:

In [23]:
m = regex.match('wesm@bright.net')
m.groups()

('wesm', 'bright', 'net')

`findall` returns a list of tuples when the pattern has groups:

In [24]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

`sub` also has access to groups in each match using special symbols like `\1` and `\2`. The symbol `\1` corresponds to the first matched group, `\2` corresponds to the second, and so forth:

In [25]:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com



#### Table 7-4 provides a brief summary
![image.png](attachment:image.png)
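
In case the table image does not render: beyond the functions demonstrated above, compiled regex objects also provide `finditer` and `subn`. The sketch below shows both on a shortened copy of the email text. Whether these appear in Table 7-4 is an assumption on my part.

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
"""
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

# finditer is like findall, but yields match objects one at a time
spans = [(m.group(1), m.span()) for m in regex.finditer(text)]

# subn is like sub, but also returns the number of replacements made
redacted, n = regex.subn('REDACTED', text)

print(spans)
print(n)
print(redacted)
```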

## Vectorized String Functions in pandas
Cleaning up a messy dataset for analysis often requires a lot of string munging and regularization. To complicate matters, a column containing strings will sometimes have missing data:

In [26]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com','Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [27]:
data.isnull()

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

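You can apply string and regular expression methods to each value with `data.map` (passing a lambda or other function), **but it will fail on the NA (null) values.** To cope with this, Series has array-oriented string methods that skip NA values, accessed through the `str` attribute. For example, `str.contains` checks whether each email address contains `'gmail'`: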

In [28]:
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
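
For comparison, the sketch below shows the failure that `data.map` runs into on the missing value, along with two ways of dealing with it: the `na_action='ignore'` option of `Series.map`, and the `na` argument of `str.contains`, which fills in a value for missing entries. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# A plain Python string test breaks when it reaches the missing value
try:
    data.map(lambda x: 'gmail' in x)
    map_failed = False
except TypeError:
    map_failed = True

# na_action='ignore' passes missing values through without calling the function
safe = data.map(lambda x: 'gmail' in x, na_action='ignore')

# str.contains skips missing values; na=False reports them as False instead
has_gmail = data.str.contains('gmail', na=False)

print(map_failed)
print(safe)
print(has_gmail)
```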

Regular expressions can be used, too, along with any `re` options like `IGNORECASE`:

In [29]:
pattern

'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [30]:
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

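`str.match` checks whether the pattern matches at the start of each string, returning `True` or `False` for each value and leaving missing values as `NaN`:

In [31]:
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches

Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object

* There are a couple of ways to do vectorized element retrieval.
* Either use `str.get` or index into the `str` attribute. Both take an index and apply it to each element, which is how you access items in the embedded lists returned by `str.findall`.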
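
Vectorized element retrieval can be sketched as follows, starting from the lists that `str.findall` returns. The names `found`, `first_match`, `domains`, and `prefix` are illustrative.

```python
import re
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# Each non-missing element is a list holding one (username, domain, suffix) tuple
found = data.str.findall(pattern, flags=re.IGNORECASE)

# Take the first (and only) tuple from each list; found.str[0] is equivalent
first_match = found.str.get(0)

# Index again to pull the domain out of each tuple
domains = first_match.str[1]

# The same indexing syntax also slices the original strings
prefix = data.str[:5]

print(domains)
print(prefix)
```

Missing values stay missing at every step, so no explicit null check is needed.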

#### See Table 7-5 for more pandas string methods.
![image.png](attachment:image.png)

![image.png](attachment:image.png)
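
In case the table images do not render, here is a brief sketch of a few other methods available through the `str` attribute. The selection is an assumption made for illustration, not a transcription of Table 7-5.

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

upper = data.str.upper()                     # uppercase each string
lengths = data.str.len()                     # length of each string
is_gmail = data.str.endswith('gmail.com')    # suffix test per element
usernames = data.str.split('@').str[0]       # split, then take the first piece

print(upper)
print(lengths)
print(is_gmail)
print(usernames)
```

As with `str.contains`, each of these propagates missing values rather than raising an error.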