## [ String Manipulation ]
- python is popular for raw data manipulation, especially for strings and text.
- string object methods make most operations easy.
- regular expressions (regex) are used for complex pattern matching.
- pandas enhances this by allowing vectorized string and regex operations on entire arrays. 
- pandas handles missing data (NaN) gracefully during string operations.
- this makes text cleaning and analysis more efficient and concise

In [67]:
import numpy as np 
import pandas as pd 

## [ Python Built-In String Object Methods ]

In [68]:
# in many string munging and scripting applications, built-in string methods are sufficient.

# example, a comma-separated string can be broken into pieces with split
val = "a,b, guido"
val.split(",")

['a', 'b', ' guido']

In [69]:
# split is often combined with strip to trim whitespace (including line breaks)
pieces = [x.strip() for x in val.split(",")]
pieces

['a', 'b', 'guido']

In [70]:
# these substrings could be concatenated together using addition
first, second, third = pieces
first + "??" + second + "??" + third

'a??b??guido'

In [71]:
# but this isn't a practical generic method. a faster and more pythonic way is to pass a list or tuple to the join method on the string "?"
"?".join(pieces) 

'a?b?guido'
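
One detail worth keeping in mind is that join only accepts strings. The sketch below uses a hypothetical mixed-type list (`values` is not part of the original example) to show the error and one common way around it.

```python
# hypothetical list mixing strings and numbers
values = ["a", 1, 2.5]

try:
    ",".join(values)            # fails: join expects every item to be a str
except TypeError as exc:
    print("join failed:", exc)

# converting each item to str first makes the join work
joined = ",".join(str(v) for v in values)
print(joined)
```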

In [72]:
# other methods are concerned with locating substrings.
# using python's "in" keyword is the best way to detect a substring, though index and find can also be used
"guido" in val

True

In [73]:
val.index(",")

1

In [74]:
val.find("?")

-1

In [75]:
# difference between find and index is that index raises an exception if the string isn't found
val.index("??")

ValueError: substring not found
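
The difference between the two "not found" signals matters in practice. A minimal sketch, reusing the same `val`, of how each result is usually checked; the variable names are illustrative only.

```python
val = "a,b, guido"

# find signals "not found" with -1, so check the result before using it
pos = val.find("??")
found_with_find = pos != -1

# pitfall: -1 is also a valid slice bound, so this silently drops the last character
truncated = val[:pos]

# index signals "not found" by raising ValueError
try:
    val.index("??")
    found_with_index = True
except ValueError:
    found_with_index = False

print(found_with_find, found_with_index, repr(truncated))
```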

In [16]:
# count returns the number of occurrences of a particular substring
val.count(",")

2

In [18]:
# replace will substitute occurrences of one pattern for another.
# commonly used to delete patterns, too, by passing an empty string

val.replace(",", ":")

'a:b: guido'

In [19]:
val.replace(",", "")

'ab guido'



| **Method**               | **Description**                                      |
|--------------------------|------------------------------------------------------|
| `str.lower()`            | Converts string to lowercase                         |
| `str.upper()`            | Converts string to uppercase                         |
| `str.title()`            | Capitalizes first letter of each word                |
| `str.capitalize()`       | Capitalizes first character of the string            |
| `str.swapcase()`         | Swaps lowercase to uppercase and vice versa          |
| `str.strip()`            | Removes leading/trailing whitespace                  |
| `str.lstrip()`           | Removes leading whitespace                           |
| `str.rstrip()`           | Removes trailing whitespace                          |
| `str.replace(old, new)`  | Replaces substring with another                      |
| `str.split(sep)`         | Splits string into list                              |
| `str.rsplit(sep)`        | Splits from the right                                |
| `str.join(iterable)`     | Joins elements with a string as separator            |
| `str.find(sub)`          | Returns lowest index of substring; -1 if not found   |
| `str.rfind(sub)`         | Returns highest index of substring; -1 if not found  |
| `str.index(sub)`         | Like `find()` but raises error if not found          |
| `str.rindex(sub)`        | Like `rfind()` but raises error if not found         |
| `str.startswith(sub)`    | Checks if string starts with `sub`                   |
| `str.endswith(sub)`      | Checks if string ends with `sub`                     |
| `str.isalpha()`          | Checks if all characters are alphabetic              |
| `str.isdigit()`          | Checks if all characters are digits                  |
| `str.isnumeric()`        | Checks if string has only numeric characters         |
| `str.isalnum()`          | Checks if all characters are alphanumeric            |
| `str.isspace()`          | Checks if all characters are whitespace              |
| `str.islower()`          | Checks if all letters are lowercase                  |
| `str.isupper()`          | Checks if all letters are uppercase                  |
| `str.istitle()`          | Checks if string is title-cased                      |
| `str.zfill(width)`       | Pads string with zeros on the left                   |
| `str.ljust(width)`       | Left-justifies string with spaces                    |
| `str.rjust(width)`       | Right-justifies string with spaces                   |
| `str.center(width)`      | Centers string with spaces                           |
| `str.count(sub)`         | Counts occurrences of substring                      |
| `str.partition(sep)`     | Splits into 3 parts: before, sep, after              |
| `str.rpartition(sep)`    | Same as above but from right                         |
| `str.encode()`           | Encodes string to bytes                              |
| `str.casefold()`         | More aggressive lowercasing (for comparisons)        |
| `str.expandtabs()`       | Replaces tabs with spaces                            |
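
A few of the less common methods from the table, shown on hypothetical inputs (the strings below are made up for illustration):

```python
line = "  Name=Guido van Rossum  "   # hypothetical input

# partition splits once, at the first separator, into (before, sep, after)
key, sep, value = line.strip().partition("=")
print(key, "|", value)

# zfill pads with zeros on the left up to the given width
padded = "42".zfill(5)

# ljust pads on the right; the optional second argument is the fill character
label = "id".ljust(6, ".")

# casefold is meant for caseless comparison and handles cases lower() misses
same = "Straße".casefold() == "STRASSE".casefold()

# expandtabs replaces each tab with spaces up to the next tab stop
expanded = "a\tb".expandtabs(4)

print(padded, label, same, repr(expanded))
```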


## [ Regular Expressions ]
- Regular expressions provide a flexible way to search or match (often more complex) string patterns in text. 
- A single expression, commonly called a regex, is a string formed according to the regular expression language.
- Python's built-in `re` module is responsible for applying regular expressions to strings.
- writing regular expressions is itself a broad topic

- the `re` module functions fall into three categories:   
    - pattern matching
    - substitution
    - splitting
- Naturally these are all related: a regex describes a pattern to locate in the text, which can then be used for many purposes.

In [30]:
# example
# suppose we wanted to split a string with a variable number of whitespace characters (tabs, space, and newlines)
# the regex describing one or more whitespace characters is `\s+`

import re
text = "foo     bar\t  baz  \tqux"
re.split(r"\s+", text)

# when re.split(r"\s+", text) is called, the regular expression is first compiled, and then its split method is called on the passed text.
# you can compile the regex yourself with re.compile, forming a reusable regex object

['foo', 'bar', 'baz', 'qux']

In [21]:
regex = re.compile(r"\s+")
regex.split(text)

['foo', 'bar', 'baz', 'qux']

In [22]:
# to get a list of all substrings matching the regex, use the findall method

regex.findall(text)

['     ', '\t  ', '  \t']

NOTE: 
- To avoid unwanted escaping with \ in a regular expression, use raw
string literals like r"C:\x" instead of the equivalent "C:\\x"
- creating a regex object with re.compile is highly recommended if you intend to apply the same expression to many strings; doing so will save CPU cycles
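
Both points in the note can be checked with a short sketch. The input lines are hypothetical. Note that the `re` module also caches recently used patterns internally, so the saving from compiling yourself is usually modest rather than dramatic.

```python
import re

# the raw literal and the doubly escaped literal are the same string
assert r"C:\x" == "C:\\x"
assert len(r"\s+") == 3          # backslash, "s", "+"

# compile once, then reuse the regex object on many strings
whitespace = re.compile(r"\s+")
lines = ["foo   bar", "baz\tqux", "one\n two"]   # hypothetical inputs
tokens = [whitespace.split(line) for line in lines]
print(tokens)
```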

In [31]:
# `match` and `search` are closely related to `findall`
# while `findall` returns all matches in a string, 
# `search` returns only the first match
# and `match` only matches at the beginning of the string

# example
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com"""
pattern = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}"


# re.IGNORECASE makes the regex case insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

# using findall on the text produces a list of the email addresses
regex.findall(text)


['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

In [32]:
# search returns a special match object for the first email address in the text
# For the preceding regex, the match object can only tell us the start and end position of the pattern in the string
m = regex.search(text)
print(m)

text[m.start():m.end()]     # slice text manually using the match's start and end positions

# basically, it's just a way to grab the first match from the string

<re.Match object; span=(5, 20), match='dave@google.com'>


'dave@google.com'

In [33]:
# regex.match returns None here:
# it tries to match the pattern ONLY at the beginning of the string,
# and this text starts with "Dave ", not with an email address
print(regex.match(text))

None


In [34]:
# `sub` will return a new string with occurrences of the pattern replaced by a new string
print(regex.sub("BullShit", text))

Dave BullShit
Steve BullShit
Rob BullShit
Ryan BullShit


In [50]:
# Suppose you wanted to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment

pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
regex = re.compile(pattern, flags=re.IGNORECASE)

# a match object produced by this regex returns a tuple of the pattern components with its groups method
m = regex.match("sd@dark.hor")
m.groups()

('sd', 'dark', 'hor')

In [39]:
# findall returns a list of tuples when the pattern has () (groups)
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

In [40]:
# sub also has access to groups in each match using special symbols like \1 and \2
# the symbol \1 corresponds to the first matched group, \2 corresponds to the second,...

print(regex.sub(r"Username: \1, Domain: \2, Suffix: \3", text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
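
Numbered groups get hard to read as patterns grow. As a possible extension of the example above, groups can be named with the `(?P<name>...)` syntax. The address used here is hypothetical.

```python
import re

# same pattern as above, with each group given a name
pattern = r"(?P<username>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})"
regex = re.compile(pattern, flags=re.IGNORECASE)

# groupdict returns the named groups as a dictionary
m = regex.match("wesm@bright.net")   # hypothetical address
parts = m.groupdict()
print(parts)

# named groups can be referenced in the replacement string with \g<name>
rewritten = regex.sub(r"\g<username> at \g<domain>", "Rob rob@gmail.com")
print(rewritten)
```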


Regular expression methods
- findall
- finditer
- match
- search
- split
- sub, subn
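
Of the methods listed, `finditer` and `subn` are not demonstrated above. A short sketch on a two-line version of the earlier text; the replacement string "REDACTED" is an arbitrary choice.

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com"""
regex = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}", flags=re.IGNORECASE)

# finditer yields match objects one at a time, so positions are available too
spans = [(m.group(), m.span()) for m in regex.finditer(text)]
print(spans)

# subn works like sub but also returns the number of replacements made
new_text, n = regex.subn("REDACTED", text)
print(n)
print(new_text)
```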

## [ String Functions in pandas ]

In [79]:
# cleaning up a messy dataset for analysis often requires a lot of string manipulation
# to complicate matters, a column containing strings will sometimes have missing data

data = {"Dave": "dave@google.com", "Steve": "steve@gmail.com", "Rob": "rob@gmail.com", "Wes": np.nan}
data = pd.Series(data)
print(data)
data.isna()

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object


Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

In [80]:
# string and regular expression methods can be applied to each value using data.map, but it will fail on NA values
# to cope with this, Series has array-oriented methods for string operations that skip over and propagate NA values
# these are accessed through Series's str attribute

# example, 
data.str.contains("gmail")

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
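
To see why `data.map` is a problem here, the sketch below applies a plain Python test to a smaller version of the Series. The object dtype is requested explicitly as an assumption, so the result does not depend on the default string dtype of the installed pandas version.

```python
import numpy as np
import pandas as pd

data = pd.Series(
    {"Dave": "dave@google.com", "Steve": "steve@gmail.com", "Wes": np.nan},
    dtype=object,
)

# a plain Python substring test fails when it reaches the float NaN
try:
    data.map(lambda x: "gmail" in x)
    map_failed = False
except TypeError:
    map_failed = True

# the str accessor skips the missing value and propagates it to the result
result = data.str.contains("gmail")
print(map_failed)
print(result)
```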

In [81]:
# the result of this operation has an object dtype.
# Pandas has special extension types for strings, integers, and booleans.
# These types handle missing values better than the regular data types, which used to have problems when data was incomplete or missing.

data_as_string_ext = data.astype('string')
data_as_string_ext

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                 <NA>
dtype: string

In [82]:
data_as_string_ext.str.contains("gmail")

Dave     False
Steve     True
Rob       True
Wes       <NA>
dtype: boolean
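
When the result is meant to be used as a filter, a missing entry in the mask can be inconvenient. One option, sketched here, is the `na` parameter of `str.contains`, which sets the value to use for missing entries.

```python
import numpy as np
import pandas as pd

data = pd.Series(
    {"Dave": "dave@google.com", "Steve": "steve@gmail.com",
     "Rob": "rob@gmail.com", "Wes": np.nan},
    dtype="string",
)

# na=False treats the missing entry as "no match", so the mask has no <NA>
mask = data.str.contains("gmail", na=False)
gmail_users = data[mask]
print(gmail_users)
```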

In [83]:
# regular expressions can be used too, along with any re options like IGNORECASE

pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
data.str.findall(pattern, flags=re.IGNORECASE)

# returns a list of tuples for each row
# each tuple contains parts of the match (username , domain , tld)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

In [84]:
# there are a couple of ways to do vectorized element retrieval
# either use str.get or index into the str attribute

matches = data.str.findall(pattern, flags=re.IGNORECASE).str[0]
matches

# takes only the first match from each row
# now we have just one tuple per row (not a list)
# easier to work with if you want to split it into columns

Dave     (dave, google, com)
Steve    (steve, gmail, com)
Rob        (rob, gmail, com)
Wes                      NaN
dtype: object

In [86]:
matches.str.get(1)

Dave     google
Steve     gmail
Rob       gmail
Wes         NaN
dtype: object

In [87]:
# similarly slicing strings using this syntax
data.str[:5]

Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object

In [88]:
# str.extract method will return the captured groups of a regular expression as a DataFrame
data.str.extract(pattern, flags=re.IGNORECASE)

           0       1    2
Dave    dave  google  com
Steve  steve   gmail  com
Rob      rob   gmail  com
Wes      NaN     NaN  NaN
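
If the pattern uses named groups, `str.extract` uses the group names as column labels. A minimal sketch on a smaller Series; the group names are chosen here for illustration.

```python
import re
import numpy as np
import pandas as pd

data = pd.Series(
    {"Dave": "dave@google.com", "Rob": "rob@gmail.com", "Wes": np.nan},
    dtype=object,
)

# same pattern as above, with each group named via (?P<name>...)
named = r"(?P<username>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})"
parts = data.str.extract(named, flags=re.IGNORECASE)
print(parts)
```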


Partial Listing of `Series.str` String Methods

| Method        | Description |
|---------------|-------------|
| `cat`         | Concatenate strings element-wise with optional delimiter |
| `contains`    | Return Boolean array indicating whether each string contains pattern/regex |
| `count`       | Count occurrences of pattern |
| `extract`     | Extract regex groups into DataFrame columns |
| `endswith`    | Check if each string ends with the given pattern |
| `startswith`  | Check if each string starts with the given pattern |
| `findall`     | Find all occurrences of pattern/regex for each string |
| `get`         | Get the i-th element from each string |
| `isalnum`     | Check if all characters are alphanumeric |
| `isalpha`     | Check if all characters are alphabetic |
| `isdecimal`   | Check if all characters are decimals |
| `isdigit`     | Check if all characters are digits |
| `islower`     | Check if all characters are lowercase |
| `isnumeric`   | Check if all characters are numeric |
| `isupper`     | Check if all characters are uppercase |
| `join`        | Join strings in each element with a separator |
| `len`         | Get length of each string |
| `lower`, `upper` | Convert to lowercase or uppercase |
| `match`       | Use `re.match` to check if string matches pattern |
| `pad`         | Add whitespace to left, right, or both sides |
| `center`      | Center the string with padding (same as `pad(side="both")`) |
| `repeat`      | Repeat each string (e.g. `s.str.repeat(3)`) |
| `replace`     | Replace pattern/regex with another string |
| `slice`       | Slice substrings from each string |
| `split`       | Split strings on delimiter or regex |
| `strip`       | Trim whitespace from both sides |
| `rstrip`      | Trim whitespace from right side |
| `lstrip`      | Trim whitespace from left side |
