## String Manipulations

In [14]:
val = 'a,     b,  guido      , bajo'
print(val)

a,     b,  guido      , bajo


In [4]:
# split on commas, then strip the surrounding whitespace from each piece
val2 = [x.strip() for x in val.split(',')]
val2

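['a', 'b', 'guido', 'bajo']

In [5]:
# tuple unpacking: one name per element of val2
first, second, third, fourth = val2
first + "::" + second + "::" + third + "::" + fourth

'a::b::guido::bajo'

In [6]:
# a more practical and idiomatic approach is the join method
"::".join(val2)

'a::b::guido::bajo'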

In [8]:
# checking if guido is in val2
'guido' in val2

True

In [10]:
# searching in string
print("index",val.index(','))
print("find",val.find(','))

index 1
find 1


In [13]:
print("find",val.find(':'))   # find returns -1 when the substring is absent
print("index",val.index(':')) # index raises ValueError instead; otherwise the two behave the same

find -1


ValueError: substring not found
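
Because `index` raises while `find` returns a sentinel value, the two call for different handling when the substring may be absent. The following is a minimal sketch of both styles; the helper name `position_of` is hypothetical and not part of the original notebook.

```python
val = 'a,     b,  guido      , bajo'

def position_of(text, sub):
    """Return the first position of sub in text, or None if it is absent."""
    try:
        return text.index(sub)
    except ValueError:
        return None

# find signals "not found" with -1, so the result must be tested explicitly
pos = val.find(':')
found_with_find = pos != -1

# the in operator is the simplest choice when only membership matters
has_comma = ',' in val

print(position_of(val, ','), position_of(val, ':'), found_with_find, has_comma)
```

Note that `-1` is also a valid index (the last character), so using the result of `find` for slicing without checking it first can silently produce wrong results.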

In [20]:
# get string counts
print(", -- ", val.count(','),"\n"
     "a --", val.count('a'))

, --  3 
a -- 2


In [21]:
# replace will substitute occurrences of one pattern for another. This is commonly used
# to delete patterns, too, by passing an empty string:
val.replace(',', '::')

'a::     b::  guido      :: bajo'

In [22]:
val.replace(',', '')

'a     b  guido       bajo'

### Python built-in string methods

```
count                  Return the number of non-overlapping occurrences of substring in the string.
endswith, startswith   Returns True if string ends with suffix (starts with prefix).
join                   Use string as delimiter for concatenating a sequence of other strings.
index                  Return position of first character in substring if found in the string. Raises ValueError if not found.
find                   Return position of first character of first occurrence of substring in the string. Like index, but returns -1 if not found.
rfind                  Return position of first character of last occurrence of substring in the string. Returns -1 if not found.
replace                Replace occurrences of string with another string.
strip, rstrip, lstrip  Trim whitespace, including newlines, from both ends (strip), the right end (rstrip), or the left end (lstrip).
split                  Break string into list of substrings using passed delimiter.
lower, upper           Convert alphabet characters to lowercase or uppercase, respectively.
ljust, rjust           Left justify or right justify, respectively. Pad the opposite side of the string with spaces (or some other fill character) to return a string with a minimum width.
```
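
Several methods in the table are not exercised in the cells above. The sketch below tries them on sample strings chosen purely for illustration (`path` and `padded` are assumptions, not data from the notebook).

```python
# sample strings chosen for illustration only
path = 'data/raw/Report.CSV'
padded = '  guido \n'

is_csv = path.lower().endswith('.csv')   # case-insensitive suffix check
in_data = path.startswith('data/')
first_slash = path.find('/')             # position of the first '/'
last_slash = path.rfind('/')             # position of the last '/'
filename = path[last_slash + 1:]

print(is_csv, in_data, first_slash, last_slash, filename)

# the strip family only touches the ends of the string
print(repr(padded.strip()))    # 'guido'
print(repr(padded.lstrip()))   # 'guido \n'
print(repr(padded.rstrip()))   # '  guido'

# ljust/rjust pad to a minimum width, optionally with a fill character
print(repr('ab'.ljust(5)))        # 'ab   '
print(repr('ab'.rjust(5, '.')))   # '...ab'
print(filename.upper())           # REPORT.CSV
```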

###  Regular expressions

In [2]:
import re

In [3]:
text = "foo    bar\t baz  \tqux"
text

'foo    bar\t baz  \tqux'

In [4]:
re.split(r'\s+', text)  # raw string avoids the invalid-escape warning for \s

['foo', 'bar', 'baz', 'qux']

In [5]:
# compiled version of regex
regex = re.compile(r'\s+')
regex.split(text)

['foo', 'bar', 'baz', 'qux']

In [6]:
# to get a list of all patterns matching the regex, you can use the findall method:
regex.findall(text)

['    ', '\t ', '  \t']

In [7]:
# match and search are closely related to findall. While findall returns all matches in a
# string, search returns only the first match. More rigidly, match only matches at the
# beginning of the string.

text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

In [8]:
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

In [9]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

In [12]:
# search returns a special match object for the first email address in the text. For the
# above regex, the match object can only tell us the start and end position of the pattern
# in the string:

m = regex.search(text)
m

<_sre.SRE_Match object; span=(5, 20), match='dave@google.com'>

In [11]:
text[m.start():m.end()]

'dave@google.com'

In [14]:
# regex.match returns None, because it matches only if the pattern occurs at the start of the string:

print(regex.match(text))

None
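
To make the difference between `match`, `search`, and `findall` concrete, here is a small side-by-side sketch. It reuses the email pattern from above on a shortened two-line copy of the text, so the variable contents are an assumption for illustration.

```python
import re

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

text = "Dave dave@google.com\nSteve steve@gmail.com\n"

# match anchors at position 0; the text starts with 'Dave ', so there is no match
at_start = regex.match(text)

# search scans forward and stops at the first match
first = regex.search(text)

# match succeeds when the string itself begins with an address
bare = regex.match('dave@google.com')

# findall collects every non-overlapping match
everything = regex.findall(text)

print(at_start)
print(first.group(), first.span())
print(bare.group())
print(everything)
```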


In [15]:
# sub will return a new string with occurrences of the pattern replaced by a new string:

print(regex.sub('REDACTED', text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED



In [16]:
# Suppose we want to find email addresses and simultaneously segment each address
# into its 3 components: username, domain name, and domain suffix.
# To do this, put parentheses around the parts of the pattern to segment:

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

In [17]:
regex = re.compile(pattern, flags=re.IGNORECASE)

In [18]:
# A match object produced by this modified regex returns a tuple of the pattern components
# with its groups method

m = regex.match('wesm@bright.net')

In [19]:
m.groups()

('wesm', 'bright', 'net')

In [20]:
# findall returns a list of tuples when the pattern has groups:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

In [21]:
# sub also has access to groups in each match using special symbols like \1, \2, etc.:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com



#### Regular expression methods

```
findall, finditer          Return all non-overlapping matches of the pattern in a string. findall returns them as a list (strings, or tuples when the pattern has groups), while finditer yields match objects one by one from an iterator.
match                      Match pattern at start of string and optionally segment pattern components into groups. If the pattern matches, returns a match object, otherwise None.
search                     Scan string for a match to pattern, returning a match object if one is found, otherwise None. Unlike match, the match can be anywhere in the string as opposed to only at the beginning.
split                      Break string into pieces at each occurrence of pattern.
sub, subn                  Replace occurrences of pattern in string with replacement expression (all by default; pass count to limit the number). subn does the same but returns a (new_string, number_of_substitutions) tuple. Use symbols \1, \2, ... to refer to match group elements in the replacement string.
```
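
The cells above do not exercise `finditer` or `subn`, so here is a short sketch of both using the grouped email pattern and the same sample text. The variable names `domains`, `redacted`, and `first_only` are introduced only for this illustration.

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

# finditer yields one match object at a time, so groups are available per match
domains = [m.group(2) for m in regex.finditer(text)]
print(domains)

# subn returns the new string together with the number of substitutions made
redacted, n_replaced = regex.subn('REDACTED', text)
print(n_replaced)

# the count argument limits how many occurrences sub (or subn) replaces
first_only = regex.sub('REDACTED', text, count=1)
print(first_only)
```

`finditer` is the better fit when the text is large or when each match needs its position (`m.start()`, `m.end()`) as well as its groups, since `findall` returns only the matched strings or group tuples.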

### Vectorized string functions in pandas