# String Methods

| Method              | Description                                                                 |
| ------------------- | --------------------------------------------------------------------------- |
| `str[x:y]`          | Slices `str`, returning indices x (inclusive) to y (not inclusive)          |
| `str.lower()`       | Returns a copy of a string with all letters converted to lowercase          |
| `str.replace(a, b)` | Replaces all instances of the substring `a` in `str` with the substring `b` |
| `str.split(a)`      | Returns substrings of `str` split at a substring `a`                        |
| `str.strip()`       | Removes leading and trailing whitespace from `str`                          |
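
As a quick illustration of the table above, the following sketch applies each method to a single plain Python string. The sample value `raw` is made up for this example; the same methods are available on a pandas Series through the `.str` accessor.

```python
# Hypothetical sample value, chosen only to exercise each method in the table.
raw = '  Lewis & Clark County  '

stripped = raw.strip()                   # 'Lewis & Clark County'
lowered = stripped.lower()               # 'lewis & clark county'
replaced = lowered.replace('&', 'and')   # 'lewis and clark county'
words = replaced.split(' ')              # ['lewis', 'and', 'clark', 'county']
first_five = replaced[0:5]               # 'lewis'

print(stripped, lowered, replaced, words, first_five, sep='\n')
```

Each call returns a new string; `raw` itself is never modified.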

In [0]:
import pandas as pd
state = pd.DataFrame({
    'County': [
        'De Witt County',
        'Lac qui Parle County',
        'Lewis and Clark County',
        'St John the Baptist Parish',
    ],
    'State': [
        'IL',
        'MN',
        'MT',
        'LA',
    ]
})
population = pd.DataFrame({
    'County': [
        'DeWitt  ',
        'Lac Qui Parle',
        'Lewis & Clark',
        'St. John the Baptist',
    ],
    'Population': [
        '16,798',
        '8,067',
        '55,716',
        '43,044',
    ]
})

In [2]:
state

|     | County                     | State |
| --- | -------------------------- | ----- |
| 0   | De Witt County             | IL    |
| 1   | Lac qui Parle County       | MN    |
| 2   | Lewis and Clark County     | MT    |
| 3   | St John the Baptist Parish | LA    |


In [3]:
population

|     | County               | Population |
| --- | -------------------- | ---------- |
| 0   | DeWitt               | 16798      |
| 1   | Lac Qui Parle        | 8067       |
| 2   | Lewis & Clark        | 55716      |
| 3   | St. John the Baptist | 43044      |


In [0]:
state['County'] = (state['County']
 .str.lower()
 .str.strip()
 .str.replace(' parish', '')
 .str.replace(' county', '')
 .str.replace('&', 'and')
 .str.replace('.', '', regex=False)  # literal period, not the regex wildcard
 .str.replace(' ', '')
)

population['County'] = (population['County']
 .str.lower()
 .str.strip()
 .str.replace(' parish', '')
 .str.replace(' county', '')
 .str.replace('&', 'and')
 .str.replace('.', '', regex=False)  # literal period, not the regex wildcard
 .str.replace(' ', '')
)

In [5]:
state

|     | County           | State |
| --- | ---------------- | ----- |
| 0   | dewitt           | IL    |
| 1   | lacquiparle      | MN    |
| 2   | lewisandclark    | MT    |
| 3   | stjohnthebaptist | LA    |


In [6]:
population

|     | County           | Population |
| --- | ---------------- | ---------- |
| 0   | dewitt           | 16798      |
| 1   | lacquiparle      | 8067       |
| 2   | lewisandclark    | 55716      |
| 3   | stjohnthebaptist | 43044      |


In [7]:
state.merge(population, on='County')

|     | County           | State | Population |
| --- | ---------------- | ----- | ---------- |
| 0   | dewitt           | IL    | 16798      |
| 1   | lacquiparle      | MN    | 8067       |
| 2   | lewisandclark    | MT    | 55716      |
| 3   | stjohnthebaptist | LA    | 43044      |
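
The two cleaning chains above are identical, so one way to avoid repeating them is to wrap the steps in a function. The sketch below is one possible arrangement and not part of the original notebook: the name `canonicalize_county` is hypothetical, and it works on plain Python strings rather than on a Series.

```python
def canonicalize_county(name):
    """Hypothetical helper: reduce a county name to a canonical join key."""
    return (name
        .lower()
        .strip()
        .replace(' parish', '')
        .replace(' county', '')
        .replace('&', 'and')
        .replace('.', '')
        .replace(' ', '')
    )

# The county names from the two example tables
state_names = ['De Witt County', 'Lac qui Parle County',
               'Lewis and Clark County', 'St John the Baptist Parish']
population_names = ['DeWitt  ', 'Lac Qui Parle',
                    'Lewis & Clark', 'St. John the Baptist']

state_keys = [canonicalize_county(n) for n in state_names]
population_keys = [canonicalize_county(n) for n in population_names]

print(state_keys)
print(state_keys == population_keys)  # True: every pair now shares a key
```

With pandas, a function like this could be applied element-wise, for example with `state['County'].map(canonicalize_county)`.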


# Regular Expressions

**Meta Characters**

This table includes most of the important *meta characters*, which help us specify certain patterns we want to match in a string.

| Char   | Description                         | Example                    | Matches        | Doesn't Match |
| ------ | ----------------------------------- | -------------------------- | -------------- | ------------- |
| .      | Any character except \n             | `...`                      | abc            | ab<br>abcd    |
| [ ]    | Any character inside brackets       | `[cb.]ar`                  | car<br>.ar     | jar           |
| [^ ]   | Any character _not_ inside brackets | `[^b]ar`                   | car<br>par     | bar<br>ar     |
| \*     | 0 or more of last symbol            | `[pb]*ark`                 | bbark<br>ark   | dark          |
| +      | 1 or more of last symbol            | `[pb]+ark`                 | bbpark<br>bark | dark<br>ark   |
| ?      | 0 or 1 of last symbol               | `s?he`                     | she<br>he      | the           |
| {_n_}  | Exactly _n_ of last symbol          | `hello{3}`                 | hellooo        | hello         |
| &#124; | Pattern before or after bar         | <code>we&#124;[ui]s</code> | we<br>us<br>is | e<br>s        |
| \      | Escapes next character              | `\[hi\]`                   | [hi]           | hi            |
| ^      | Beginning of line                   | `^ark`                     | ark two        | dark          |
| \$     | End of line                         | `ark$`                     | noahs ark      | noahs arks    |
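
The rows of the table can be checked directly with Python's `re` module. The sketch below assumes the "Matches" and "Doesn't Match" columns refer to the whole string, so it uses `re.fullmatch` for most rows; the two anchor rows use `re.search`, since `^` and `$` only make sense when the pattern may match part of a longer string. The list names are illustrative.

```python
import re

# (pattern, strings that should match, strings that should not), from the table
examples = [
    (r'...',      ['abc'],            ['ab', 'abcd']),
    (r'[cb.]ar',  ['car', '.ar'],     ['jar']),
    (r'[^b]ar',   ['car', 'par'],     ['bar', 'ar']),
    (r'[pb]*ark', ['bbark', 'ark'],   ['dark']),
    (r'[pb]+ark', ['bbpark', 'bark'], ['dark', 'ark']),
    (r's?he',     ['she', 'he'],      ['the']),
    (r'hello{3}', ['hellooo'],        ['hello']),
    (r'we|[ui]s', ['we', 'us', 'is'], ['e', 's']),
    (r'\[hi\]',   ['[hi]'],           ['hi']),
]

for pattern, should_match, should_not_match in examples:
    for s in should_match:
        assert re.fullmatch(pattern, s), (pattern, s)
    for s in should_not_match:
        assert not re.fullmatch(pattern, s), (pattern, s)

# Anchor rows: (pattern, string, whether re.search should find a match)
anchor_examples = [
    (r'^ark', 'ark two',    True),
    (r'^ark', 'dark',       False),
    (r'ark$', 'noahs ark',  True),
    (r'ark$', 'noahs arks', False),
]

anchor_results = [bool(re.search(p, s)) for p, s, _ in anchor_examples]
assert anchor_results == [expected for _, _, expected in anchor_examples]

print('all table rows behave as described')
```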

**Shorthand Character Sets**

Some commonly used character sets have shorthands.

| Description                                 | Bracket Form       | Shorthand |
| ------------------------------------------- | ------------------ | --------- |
| Alphanumeric character or underscore        | `[a-zA-Z0-9_]`     | `\w`      |
| Not an alphanumeric character or underscore | `[^a-zA-Z0-9_]`    | `\W`      |
| Digit                                       | `[0-9]`            | `\d`      |
| Not a digit                                 | `[^0-9]`           | `\D`      |
| Whitespace                                  | `[\t\n\f\r\p{Z}]`  | `\s`      |
| Not whitespace                              | `[^\t\n\f\r\p{Z}]` | `\S`      |
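
A short sketch of the shorthands in use with Python's `re` module. The sample string is made up for illustration. Note that in Python `\w` also matches the underscore, which the example makes visible.

```python
import re

# Hypothetical sample; '\t' in a normal (non-raw) string is a tab character.
sample = 'Room_42 opens at 9am\ton Mon'

digits = re.findall(r'\d', sample)       # ['4', '2', '9']
word_runs = re.findall(r'\w+', sample)   # ['Room_42', 'opens', 'at', '9am', 'on', 'Mon']
pieces = re.split(r'\s+', sample)        # splits on the spaces and on the tab
non_digits = re.sub(r'\D', '', sample)   # '429': everything that is not a digit removed

print(digits, word_runs, pieces, non_digits, sep='\n')
```

Because `Room_42` comes back as a single run, the underscore is clearly being treated as a word character.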

In [0]:
import re

# re.findall(pattern, string)

In [9]:
import re
gmail_re = r'[a-zA-Z0-9]+@gmail\.com'
text = '''
From: email1@gmail.com
To: email2@yahoo.com and email3@gmail.com
'''
re.findall(gmail_re, text)

['email1@gmail.com', 'email3@gmail.com']

In [10]:
phone_re = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
text  = "Sam's number is 382-384-3840 and Mary's is 123-456-7890."
re.findall(phone_re, text)

['382-384-3840', '123-456-7890']

In [11]:
# Same regex with parentheses around the digit groups
phone_re = r"([0-9]{3})-([0-9]{3})-([0-9]{4})"
text  = "Sam's number is 382-384-3840 and Mary's is 123-456-7890."
# Named `matches` rather than `list` to avoid shadowing the built-in
matches = re.findall(phone_re, text)
matches

[('382', '384', '3840'), ('123', '456', '7890')]

In [12]:
matches[0][2]

'3840'
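
When only the first match is needed, or when each match should be handled one at a time, `re.search` and `re.finditer` return match objects that expose the same parenthesized groups. A minimal sketch reusing the phone pattern from above; the variable names are illustrative.

```python
import re

phone_re = r"([0-9]{3})-([0-9]{3})-([0-9]{4})"
text = "Sam's number is 382-384-3840 and Mary's is 123-456-7890."

first = re.search(phone_re, text)  # match object for the first phone number (None if absent)
whole = first.group(0)             # '382-384-3840', the entire match
parts = first.groups()             # ('382', '384', '3840'), one entry per group
area_code = first.group(1)         # '382', groups are numbered from 1

# re.finditer yields one match object per phone number found
area_codes = [m.group(1) for m in re.finditer(phone_re, text)]  # ['382', '123']

print(whole, parts, area_code, area_codes, sep='\n')
```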

# re.sub(pattern, replacement, string)

In [13]:
messy_dates = '03/12/2018, 03.13.18, 03/14/2018, 03:15:2018'
regex = r'[/.:]'
string = re.sub(regex, '-', messy_dates)
string

'03-12-2018, 03-13-18, 03-14-2018, 03-15-2018'

In [14]:
string[12:20]

'03-13-18'
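
The pattern `[/.:]` works because `.` is an ordinary character inside square brackets. Outside brackets it is the "any character" metacharacter and must be escaped. The sketch below contrasts the three spellings on one of the dates from above.

```python
import re

date = '03.13.18'

unescaped = re.sub(r'.', '-', date)      # '--------': every character is replaced
escaped = re.sub(r'\.', '-', date)       # '03-13-18': only literal periods
in_brackets = re.sub(r'[.]', '-', date)  # '03-13-18': same effect as escaping

print(unescaped, escaped, in_brackets, sep='\n')
```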

# re.split(pattern, string)

In [15]:
toc = '''
PLAYING PILGRIMS============3
A MERRY CHRISTMAS===========13
THE LAURENCE BOY============31
BURDENS=====================55
BEING NEIGHBORLY============76
'''
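toc

'\nPLAYING PILGRIMS============3\nA MERRY CHRISTMAS===========13\nTHE LAURENCE BOY============31\nBURDENS=====================55\nBEING NEIGHBORLY============76\n'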


In [16]:
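toc.strip()

'PLAYING PILGRIMS============3\nA MERRY CHRISTMAS===========13\nTHE LAURENCE BOY============31\nBURDENS=====================55\nBEING NEIGHBORLY============76'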


In [17]:
# First, split into individual lines
lines = re.split('\n', toc.strip())
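lines

['PLAYING PILGRIMS============3',
 'A MERRY CHRISTMAS===========13',
 'THE LAURENCE BOY============31',
 'BURDENS=====================55',
 'BEING NEIGHBORLY============76']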


In [18]:
# Then, split into chapter title and page number
# Matches any sequence of = characters
split_re = r'=+' 
[re.split(split_re, line) for line in lines]

[['PLAYING PILGRIMS', '3'],
 ['A MERRY CHRISTMAS', '13'],
 ['THE LAURENCE BOY', '31'],
 ['BURDENS', '55'],
 ['BEING NEIGHBORLY', '76']]
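
The split results are still strings, including the page numbers. One possible follow-up, sketched here with illustrative names, is to unpack each pair and convert the page number to an integer so it can be sorted or compared numerically.

```python
import re

toc = '''
PLAYING PILGRIMS============3
A MERRY CHRISTMAS===========13
THE LAURENCE BOY============31
BURDENS=====================55
BEING NEIGHBORLY============76
'''

chapters = []
for line in toc.strip().split('\n'):
    # Each line has exactly one run of '=' characters, so the split yields two pieces
    title, page = re.split(r'=+', line)
    chapters.append((title, int(page)))

print(chapters)
```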

In [19]:
text = '''
"Christmas won't be Christmas without any presents," grumbled Jo, lying on the rug.
"It's so dreadful to be poor!" sighed Meg, looking down at her old dress.
"I don't think it's fair for some girls to have plenty of pretty things, and other girls nothing at all," added little Amy, with an injured sniff.
"We've got Father and Mother, and each other," said Beth contentedly from her corner.
The four young faces on which the firelight shone brightened at the cheerful words, but darkened again as Jo said sadly, "We haven't got Father, and shall not have him for a long time."
'''.strip()
little = pd.DataFrame({'sentences': text.split('\n')})
little

|     | sentences                                          |
| --- | -------------------------------------------------- |
| 0   | "Christmas won't be Christmas without any pres...  |
| 1   | "It's so dreadful to be poor!" sighed Meg, loo...  |
| 2   | "I don't think it's fair for some girls to hav...  |
| 3   | "We've got Father and Mother, and each other,"...  |
| 4   | The four young faces on which the firelight sh...  |


In [20]:
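# Extract the first span of text within double quotes from each sentence
quote_re = r'"([^"]+)"'
# expand=False returns a Series (one capture group) rather than a one-column DataFrame
spoken = little['sentences'].str.extract(quote_re, expand=False)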
little['dialog'] = spoken
little

|     | sentences                                          | dialog                                             |
| --- | -------------------------------------------------- | -------------------------------------------------- |
| 0   | "Christmas won't be Christmas without any pres...  | Christmas won't be Christmas without any prese...  |
| 1   | "It's so dreadful to be poor!" sighed Meg, loo...  | It's so dreadful to be poor!                       |
| 2   | "I don't think it's fair for some girls to hav...  | I don't think it's fair for some girls to have...  |
| 3   | "We've got Father and Mother, and each other,"...  | We've got Father and Mother, and each other,       |
| 4   | The four young faces on which the firelight sh...  | We haven't got Father, and shall not have him ...  |
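
`str.extract` keeps only the first match in each row. When a line may contain more than one quoted span, `re.findall` with the same pattern returns all of them. The sketch below uses a made-up line, not one from the text above, to show the difference.

```python
import re

quote_re = r'"([^"]+)"'

# Hypothetical line with two separate quoted spans
line = '"Yes," said Jo, "we can."'

first_only = re.search(quote_re, line).group(1)  # 'Yes,'
all_quotes = re.findall(quote_re, line)          # ['Yes,', 'we can.']

print(first_only)
print(all_quotes)
```

In pandas, `Series.str.findall` and `Series.str.extractall` play the corresponding role for whole columns.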
