# Table of Contents
1. [String Basics](#String-Basics)
2. [String Formatting](#String-Formatting)
   - 2.1 [F-strings](#F-strings)
   - 2.2 [String formatting](#String-formatting)
   - 2.3 [C-style (older) formatting](#C-style-(older)-formatting)
3. [Regular Expressions](#Regular-Expressions)

Many data cleaning steps involve working with strings, since datasets often
contain unstructured text.
In other cases, multiple variables are stored in a single column, and you will need to
parse that column into separate columns using string manipulation.
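
As a rough illustration of the second case, the sketch below splits values that pack two variables into one string. The values and the underscore separator are made up for this example; they do not come from a real dataset.

```python
# hypothetical column values that pack two variables into one string
values = ['cases_guinea', 'deaths_liberia']

# split each value on the underscore into its two parts
parsed = [v.split('_') for v in values]

print(parsed)  # [['cases', 'guinea'], ['deaths', 'liberia']]
```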

# String Basics

In [1]:
# create a string
word = 'grail'

In [2]:
# get the first letter of the string
# this is similar to how we get items out of a list
word[0]

'g'

In [3]:
# negative numbers count from the back
word[-1]

'l'
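
Indexing extends to slicing, which pulls out a range of characters instead of a single one. A minimal sketch, reusing the same `word` value:

```python
word = 'grail'

# slice from index 0 up to (but not including) index 3
first_three = word[0:3]

# omitting the end means "to the end of the string"
last_two = word[-2:]

print(first_three)  # gra
print(last_two)     # il
```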

In [4]:
# strings have methods that perform certain tasks
# for example we can upper-case all the characters
word.upper()

'GRAIL'

In [5]:
# notice the original string does not change
# strings are immutable, so assign the result of upper() to a variable to keep it
word

'grail'
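
The sketch below shows both sides of this behavior: keeping the result of a method by binding it to a name, and what happens when you try to change a string in place. The name `shouted` is just an illustrative choice.

```python
word = 'grail'

# upper() returns a new string; bind it to a name to keep it
shouted = word.upper()

# strings cannot be modified in place, so item assignment raises TypeError
try:
    word[0] = 'G'
except TypeError as err:
    error_message = str(err)

print(word)           # grail
print(shouted)        # GRAIL
print(error_message)
```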

In [6]:
# count the number of times a substring appears
word.count('i')

1

In [7]:
# create a ZIP code that is missing its leading zero
zipcode = '1234'

In [8]:
# zfill pads the string with leading zeros up to the given width
zipcode.zfill(5)

'01234'
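
ZIP codes often lose their leading zeros when they are read in as numbers. A small sketch of repairing several at once, assuming the made-up values below arrived as integers:

```python
# made-up ZIP codes that were parsed as integers, dropping leading zeros
raw_zipcodes = [1234, 90210, 501]

# convert each to a string first, since zfill is a string method
zipcodes = [str(z).zfill(5) for z in raw_zipcodes]

print(zipcodes)  # ['01234', '90210', '00501']
```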

# String Formatting

## F-strings

In [9]:
var = 'flesh wound'

In [10]:
# new in Python 3.6: f-strings
# notice the f before the opening quotation mark
s = f"It's just a {var}"

In [11]:
s

"It's just a flesh wound"

In [12]:
# if the variable is not defined in the current scope, a NameError is raised
f"It's just a {foo}"

NameError: name 'foo' is not defined
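
The braces in an f-string can hold any expression, and a format specification can follow after a colon. A brief sketch with illustrative variable names:

```python
pi = 3.14159265
digits = 7

# {pi:.2f} rounds to 2 decimal places; {digits + 1} evaluates the expression
s = f"pi to 2 decimals is {pi:.2f}, and {digits + 1} is one more than {digits}"

# the zero-padding spec also works inside an f-string
padded = f"{123456:010d}"

print(s)       # pi to 2 decimals is 3.14, and 8 is one more than 7
print(padded)  # 0000123456
```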

## String formatting

In [13]:
# str.format() fills the {} placeholders without the f-string notation
var = 'flesh wound'
s = "It's just a {}"

In [14]:
s.format(var)

"It's just a flesh wound"

In [15]:
# 0 is the positional index of the argument
# 010d pads the integer with zeros to a width of 10
"my ID is: {0:010d}".format(123456)

'my ID is: 0000123456'
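
Placeholders can also be named or reused by position, which helps when a template has several fields. A minimal sketch; the field names `knight` and `injury` are made up for illustration:

```python
# named placeholders are filled by keyword arguments
template = "{knight} says it's just a {injury}"
s = template.format(knight='The Black Knight', injury='flesh wound')

# positional indexes let the same argument appear more than once
repeated = "{0}, {0}, {1}".format('spam', 'eggs')

print(s)         # The Black Knight says it's just a flesh wound
print(repeated)  # spam, spam, eggs
```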

## C-style (older) formatting

In [16]:
# %d is replaced by the integer on the right of the % operator
s = 'I only know %d digits of pi' % 7

In [17]:
s

'I only know 7 digits of pi'
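
With more than one placeholder, the values are passed as a tuple. The sketch below uses `%s` for a string, `%d` for an integer, and `%.2f` for a float rounded to 2 decimal places:

```python
# values are substituted in order, one per placeholder
s = '%s has %d letters; pi is roughly %.2f' % ('grail', 5, 3.14159)

print(s)  # grail has 5 letters; pi is roughly 3.14
```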

# Regular Expressions

In [18]:
import re

In [19]:
tele_num = '1234'

In [20]:
# check whether the string starts with 5 digits
# the raw string (r'...') passes the backslashes to the regex engine unchanged
m = re.match(r'\d\d\d\d\d', tele_num)

In [21]:
# nothing is displayed: there was no match, so m is None
m

In [22]:
# bool() of a match object is True; with no match we get None, which is False
bool(m)

False
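
Note that `re.match` only looks at the beginning of the string. The sketch below contrasts it with `re.search`, which scans the whole string, and `re.fullmatch`, which requires the entire string to fit the pattern. The sample text is made up for illustration.

```python
import re

text = 'ext. 12345'  # made-up sample string

# match anchors at the start, and the string starts with 'ext'
starts_with_digits = bool(re.match(r'\d{5}', text))

# search finds the pattern anywhere in the string
contains_digits = bool(re.search(r'\d{5}', text))

# fullmatch succeeds only if the whole string is exactly 5 digits
exactly_five = bool(re.fullmatch(r'\d{5}', '12345'))
six_digits = bool(re.fullmatch(r'\d{5}', '123456'))

print(starts_with_digits, contains_digits, exactly_five, six_digits)
# False True True False
```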

In [23]:
# phone numbers are written in many different formats, for example:
'123-456-7890'

'123-456-7890'

In [24]:
'(123) 456-7890'

'(123) 456-7890'

In [25]:
'+1 (123) 456-7890'

'+1 (123) 456-7890'
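
One possible pattern that accepts all three of the phone number formats above is sketched here. It is an assumption about what should count as valid, not a general phone number validator, and it is deliberately loose: for example, it would also accept unbalanced parentheses.

```python
import re

# optional '+1 ' prefix, optional parentheses around the area code,
# a space or dash, then 3 digits, a dash, and 4 digits
phone = re.compile(r'(?:\+1 )?\(?(\d{3})\)?[ -](\d{3})-(\d{4})')

numbers = ['123-456-7890', '(123) 456-7890', '+1 (123) 456-7890']

# the capture groups pull out the three runs of digits
parts = [phone.fullmatch(n).groups() for n in numbers]

print(parts[0])  # ('123', '456', '7890')
```

Each of the three inputs yields the same tuple of digit groups, which is a convenient way to normalize mixed formats.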

In [26]:
# compile a pattern of 3 digits so it can be reused
p = re.compile(r'\d\d\d')

In [27]:
# '124' starts with 3 digits, so the pattern matches
bool(p.match('124'))

True
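
A compiled pattern can be applied to many strings. The sketch below runs the same 3-digit pattern over a few made-up candidates; note that `match` only needs the start of the string to fit, so `'1245'` still matches.

```python
import re

p = re.compile(r'\d{3}')  # equivalent to \d\d\d

candidates = ['124', '12', '1245', 'a124']  # made-up test strings
results = [bool(p.match(c)) for c in candidates]

# group() returns the text that was actually matched
matched_text = p.match('1245').group()

print(results)       # [True, False, True, False]
print(matched_text)  # 124
```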