**Strings**
+ a string is a sequence of characters
+ they are created with a matching pair of opening and closing single or double quotes


In [5]:
word = 'grail'
sent = 'a scratch'

+ strings can be thought of as a container of characters
+ you can subset a string like any other container

In [2]:
print(word[0])

g


In [6]:
print(sent[0])

a


In [7]:
# get the first 3 characters
# note that index 3 is really the 4th character
print(word[0:3])

gra


+ Slicing notation in Python is left-side inclusive, right-side exclusive
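
One consequence of the left-inclusive, right-exclusive rule is that the length of a slice equals the stop index minus the start index, and adjacent slices split a string without overlap. A small sketch (the names `first` and `rest` are illustrative):

```python
word = 'grail'

# the slice [0:3] covers indices 0, 1, 2 -> 3 - 0 = 3 characters
first = word[0:3]
rest = word[3:5]

print(first, rest)            # gra il
print(len(first) == 3 - 0)    # True
print(first + rest == word)   # True
```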


In [8]:
# passing in a negative index starts the count from the end of a container
# get the last letter
print(sent[-1])

h


In [9]:
# get 'a'
print(sent[-9:-8])

a


In [10]:
# get 'a'
print(sent[0:-8])

a


In [11]:
# tries to get 'scratch', but the end index -1 is exclusive, so the final 'h' is dropped
print(sent[2:-1])

scratc


In [12]:
# tries to get 'scratch', but the end index -1 is exclusive, so the final 'h' is dropped
print(sent[-7:-1])

scratc


**Getting the Last Character in a String**
+ Python slicing is right-side exclusive
+ to include the last character, need to specify an end index that is 1 greater than the last index
+ to do so, need to get the 'len' (length) of the string
    + and then pass that value into the slicing notation


In [13]:
# note that the last index is 1 position smaller than the # returned for len
s_len = len(sent)
print(s_len)

9


In [15]:
print(sent[2:s_len])

scratch
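
Slice bounds are more forgiving than single indexes. In standard Python, a stop value past the end of the string is clipped to the end, while indexing a single position past the end raises an IndexError. A minimal sketch:

```python
sent = 'a scratch'

# a slice end beyond the last index is clipped to the end of the string
print(sent[2:100])   # scratch

# indexing a single out-of-range position raises an error instead
try:
    sent[100]
except IndexError as err:
    print('IndexError:', err)
```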


**Slicing from the Beginning or to the End**
+ a very common task is to slice a value from the beginning, to a certain point in the string (or container)
+ the 1st element is always at index 0, so can always write 'word[0:3]' to get the 1st three elements
+ or can also do 'word[-3:len(word)]' to get the last 3 elements

In [16]:
print(word[0:3])

gra


In [17]:
print(word[:3])

gra


In [18]:
print(sent[2:len(sent)])

scratch


In [19]:
print(sent[2:])

scratch


In [20]:
print(sent[:])

a scratch


**Slicing in increments**

In [21]:
print(sent[::2])

asrth


In [22]:
print(sent[::3])

act
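
The third value in the slicing notation is the step (start:stop:step). A short sketch, not part of the original notes, showing a step of 2 from an explicit start and a negative step, which walks the string backwards:

```python
word = 'grail'
sent = 'a scratch'

# start at index 2, then take every 2nd character
every_other = sent[2::2]
print(every_other)     # srth

# a step of -1 returns the characters in reverse order
reversed_word = word[::-1]
print(reversed_word)   # liarg
```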


**Join**
+ the 'join' method is called on a separator string and takes a container of strings; it returns a new string containing each element of the container, separated by that string

In [24]:
d1 = '40°'
m1 = "46'"
s1 = '52.837"'
u1 = 'N'

d2 = '73°'
m2 = "58'"
s2 = '26.302"'
u2 = 'W'

coords = ' '.join([d1, m1, s1, u1, d2, m2, s2, u2])
print(coords)

40° 46' 52.837" N 73° 58' 26.302" W
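
The elements passed to 'join' must already be strings. A minimal sketch with a hypothetical list of integers, converting each element before joining:

```python
numbers = [40, 46, 52]

# join only accepts strings, so convert each element first
joined = ', '.join(str(n) for n in numbers)
print(joined)   # 40, 46, 52

# passing non-string elements raises a TypeError
try:
    ', '.join(numbers)
except TypeError as err:
    print('TypeError:', err)
```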


**Splitlines**
+ the 'splitlines' method is similar to the 'split' method
+ often used on strings that are multiple lines long; it returns a list in which each element is a line of the multiple-line string

In [25]:
multi_str = """Guard: What? Ridden on a horse?
King Arthur: Yes!
Guard: You're using coconuts!
King Arthur: What?
Guard: You've got...coconut[s] and you're bangin' 'em together.
"""
print(multi_str)

Guard: What? Ridden on a horse?
King Arthur: Yes!
Guard: You're using coconuts!
King Arthur: What?
Guard: You've got...coconut[s] and you're bangin' 'em together.



In [26]:
# can get every line as a separate element in a list using 'splitlines'
multi_str_split = multi_str.splitlines()
print(multi_str_split)

['Guard: What? Ridden on a horse?', 'King Arthur: Yes!', "Guard: You're using coconuts!", 'King Arthur: What?', "Guard: You've got...coconut[s] and you're bangin' 'em together."]


In [27]:
guard = multi_str_split[::2]
print(guard)

['Guard: What? Ridden on a horse?', "Guard: You're using coconuts!", "Guard: You've got...coconut[s] and you're bangin' 'em together."]


In [28]:
# can use the 'replace' method on the string to remove the 'Guard: ' label, then use splitlines

guard = multi_str.replace("Guard: ", "").splitlines()[::2]
print(guard)

['What? Ridden on a horse?', "You're using coconuts!", "You've got...coconut[s] and you're bangin' 'em together."]
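
'splitlines' and 'split' differ in how they treat a trailing newline. A small sketch on a shortened, illustrative version of the dialogue:

```python
text = "King Arthur: Yes!\nGuard: You're using coconuts!\n"

# splitlines drops the trailing newline and produces no empty last element
print(text.splitlines())

# split('\n') keeps an empty string after the final newline
print(text.split('\n'))

# keepends=True keeps the line-break characters on each line
print(text.splitlines(keepends=True))
```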


**String Formatting**
+ formatting strings allows you to specify a generic template for a string, and then insert variables into the template
+ can also control how values are displayed, e.g. showing 2 decimal places in a float

**Formatting Character Strings**
+ can write a string with special placeholder characters, and use the format method on the string to insert values


In [29]:
var = 'flesh wound'
s = "It's just a {}!"

print(s.format(var))

It's just a flesh wound!


In [31]:
print(s.format('scratch'))

It's just a scratch!


In [32]:
# placeholders can also refer to variables multiple times

# using variables multiple times by index
s = """Black Knight: 'Tis but a {0}.
King Arthur: A {0}? Your arm's off!
"""
print(s.format('scratch'))

Black Knight: 'Tis but a scratch.
King Arthur: A scratch? Your arm's off!



In [33]:
# can also give the placeholders a name
s = 'Hayden Planetarium Coordinates: {lat}, {lon}'
print(s.format(lat='40.7815° N', lon='73.9733° W'))

Hayden Planetarium Coordinates: 40.7815° N, 73.9733° W


**Formatting Numbers**


In [34]:
print('Some digits of pi: {}'.format(3.14159265359))

Some digits of pi: 3.14159265359


In [36]:
# can format numbers, and use thousands-place comma separators
print('In 2005, Lu Chao of China recited {:,} digits of pi'.format(67890))

In 2005, Lu Chao of China recited 67,890 digits of pi


In [37]:
# numbers can be used to perform a calculation and be formatted to a certain # of digits
# can calculate a proportion, and format it into a %

# the 0 in {0:.4} and {0:.4%} refers to index 0 of the arguments passed to format
# with no type code, .4 means 4 significant digits
# with a %, the value is multiplied by 100 and shown with 4 decimal places and a % sign

print("I remember {0:.4} or {0:.4%} of what Lu Chao recited".format(7/67890))

I remember 0.0001031 or 0.0103% of what Lu Chao recited
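
The meaning of the precision depends on the type code that follows it. A short sketch comparing a few type codes on the same proportion (the name `prop` is illustrative):

```python
prop = 7 / 67890

print('{:.4}'.format(prop))    # 0.0001031  (4 significant digits)
print('{:.4f}'.format(prop))   # 0.0001     (4 digits after the decimal point)
print('{:.4%}'.format(prop))   # 0.0103%    (times 100, 4 decimals, % sign)
print('{:.2e}'.format(prop))   # 1.03e-04   (scientific notation)
```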


In [38]:
# can use string formatting to pad a # with zeros, similar to how 'zfill' works on strings

# the first 0 refers to the index in this format
# the second zero refers to the character to fill
# the 5 in this case refers to how many characters in total
# the d signals that the value is formatted as a decimal integer
# Pad the number with 0s so the entire string has 5 characters

print("My ID number is {0:05d}".format(42))

My ID number is 00042
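
For comparison, a minimal sketch of the same padding done with 'zfill', which works on strings rather than numbers (the variable names are illustrative):

```python
id_num = 42

padded_format = '{:05d}'.format(id_num)
padded_zfill = str(id_num).zfill(5)
print(padded_format, padded_zfill)   # 00042 00042

# for negative numbers both put the sign first and count it toward the width
print('{:05d}'.format(-42), str(-42).zfill(5))   # -0042 -0042
```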


**C printf Style Formatting**
+ another way to perform string formatting is with the % operator
+ this follows the 'C printf' style formatting


In [39]:
# the d represents an integer digit
s = 'I only know %d digits of pi' % 7
print(s)

I only know 7 digits of pi


In [40]:
# the s represents a string
# and note the string pattern uses round brackets ( )
# instead of curly brackets {}
# the variable passed is a Python dict, which uses {}

print('Some digits of %(cont)s: %(value).2f' %{'cont': 'e', 'value':2.718})

Some digits of e: 2.72
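
When a pattern has several positional placeholders, the values are passed as a tuple. A small sketch, with illustrative names, that also shows a width and precision in the C style:

```python
name = 'pi'
n_digits = 7

# several positional values are passed as a tuple
s = 'I only know %d digits of %s' % (n_digits, name)
print(s)         # I only know 7 digits of pi

# width and precision work as in C: 8 characters wide, 3 decimals
t = '%8.3f' % 3.14159265359
print(repr(t))   # '   3.142'
```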


**Formatted Literal Strings in Python 3.6**
+ f-strings were added in Python 3.6
+ the string must begin with the letter f
    + this syntax tells Python that we have a formatted literal string
+ can then use the variable directly in the placeholder {}, without calling the format method

In [41]:
var = 'flesh wound'
s = f"It's just a {var}!"
print(s)

It's just a flesh wound!


In [42]:
lat = '40.7815° N'
lon = '73.9733° W'
s = f'Hayden Planetarium Coordinates: {lat}, {lon}'
print(s)

Hayden Planetarium Coordinates: 40.7815° N, 73.9733° W
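
The placeholders in an f-string accept expressions and the same format specifications used with the format method. A brief sketch reusing the numbers from the earlier examples (the variable names are illustrative):

```python
recited = 67890
remembered = 7

# an expression and a format spec inside the same placeholder
s = f'I remember {remembered / recited:.4%} of the {recited:,} digits'
print(s)    # I remember 0.0103% of the 67,890 digits

id_str = f'My ID number is {42:05d}'
print(id_str)   # My ID number is 00042
```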


**8.6 RegEx**
+ RegEx provides a way to find and match patterns in strings
+ but the syntax can be difficult to read
+ might want to use 'https://regex101.com/' as a resource
+ regular expressions in Python use the 're' module
+ to use regular expressions, write a string that contains the RegEx pattern and provide a string for the pattern to match against

In [1]:
import re
tele_num = '1234567890'
# a raw string (r'...') passes the backslashes to the 're' module unchanged
m = re.match(pattern=r'\d\d\d\d\d\d\d\d\d\d', string=tele_num)
print(type(m))

<class 're.Match'>


In [2]:
print(m)

<re.Match object; span=(0, 10), match='1234567890'>


In [3]:
# in the printed object above, 'span' identifies the indices of the string where the match occurred
# 'match' identifies the exact string that got matched

print(bool(m))

True


In [4]:
# should print match
if m:
    print('match')
else:
    print('no match')

match


In [5]:
# if wanted to extract some of the match object values, such as the index position, or the actual string
# can use a few methods

# get the first index of the string match
print(m.start())

0


In [6]:
# get the end index of the string match (1 past the last matched character)
print(m.end())

10


In [7]:
# get the start and end index of the string match as a tuple
print(m.span())

(0, 10)


In [8]:
# the string that matched the pattern
print(m.group())

1234567890
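
Parts of a match can be extracted separately by wrapping them in parentheses (capture groups). A minimal sketch, assuming the 10 digits split into a 3-digit area code, a 3-digit prefix, and a 4-digit line number:

```python
import re

tele_num = '1234567890'

# parentheses create capture groups: area code, prefix, line number
m = re.match(pattern=r'(\d{3})(\d{3})(\d{4})', string=tele_num)

print(m.group())    # 1234567890 (the whole match, same as m.group(0))
print(m.group(1))   # 123
print(m.groups())   # ('123', '456', '7890')
```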


In [9]:
# telephone numbers can be a little more complex than a series of 10 consecutive digits

tele_num_spaces = '123 456 7890'

In [10]:
# can simplify the previous pattern
m = re.match(pattern=r'\d{10}', string=tele_num_spaces)
print(m)

None


In [11]:
if m:
    print('match')
else:
    print('no match')

no match
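
Note that 're.match' only checks whether the pattern fits at the beginning of the string; it does not require the whole string to fit. A short sketch comparing it with 're.fullmatch' and 're.search' (the test strings are made up for illustration):

```python
import re

p = r'\d{10}'

# match only requires the pattern to fit at the start of the string
m_start = re.match(p, '12345678901')
print(m_start)    # matches the first 10 of the 11 digits

# fullmatch requires the whole string to fit the pattern
m_full = re.fullmatch(p, '12345678901')
print(m_full)     # None

# search scans the string for the first location where the pattern fits
m_search = re.search(p, 'call 1234567890')
print(m_search)   # match with span (5, 15)
print(re.match(p, 'call 1234567890'))   # None
```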


In [15]:
# now assume the pattern has 3 digits, a space, another 3 digits, another space, followed by 4 digits

# you may see the RegEx pattern as a separate variable
# bc it can get long
# and make the actual match function call hard to read
p = r'\d{3}\s?\d{3}\s?\d{4}'
m = re.match(pattern=p, string=tele_num_spaces)
print(m)

<re.Match object; span=(0, 12), match='123 456 7890'>


In [19]:
# area codes can be surrounded by parentheses, with a dash between the 7 digits
tele_num_space_paren_dash = '(123) 456-7890'
p = r'\(?\d{3}\)?\s?\d{3}\s?-?\d{4}'
m = re.match(pattern=p, string=tele_num_space_paren_dash)
print(m)

<re.Match object; span=(0, 14), match='(123) 456-7890'>


In [21]:
# finally there could be a country code before the number
cnty_tele_num_space_paren_dash = '+1 (123) 456-7890'
p = r'\+?1\s?\(?\d{3}\)?\s?\d{3}\s?-?\d{4}'
m = re.match(pattern=p, string=cnty_tele_num_space_paren_dash)
print(m)

<re.Match object; span=(0, 17), match='+1 (123) 456-7890'>


**Find a Pattern**
+ can use the 'findall' function to find all matches of a pattern within a string

In [22]:
p = r'\d+'
# python will concatenate 2 string literals next to each other,
# so a long string can be split across lines inside parentheses

s = ("13 Jodie Whitaker, war John Hurt, 12 Peter Capaldi, "
     "11 Matt Smith, 10 David Tennant, 9 Christopher Eccleston")
m = re.findall(pattern=p, string=s)
print(m)

['13', '12', '11', '10', '9']
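
When the pattern contains capture groups, 'findall' returns the groups instead of the whole match, and 'finditer' yields match objects so the positions are available too. A small sketch on a shortened version of the string above:

```python
import re

s = "13 Jodie Whitaker, 12 Peter Capaldi, 11 Matt Smith"

# with two capture groups, findall returns (number, first name) tuples
pairs = re.findall(pattern=r'(\d+) (\w+)', string=s)
print(pairs)   # [('13', 'Jodie'), ('12', 'Peter'), ('11', 'Matt')]

# finditer yields match objects, so the matched text and its span can be read
for m in re.finditer(pattern=r'\d+', string=s):
    print(m.group(), m.span())
```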


**Compiling a Pattern**
+ python's 're' module allows you to compile a pattern so it can be reused
    + this helps when data operations occur on a column-by-column or row-by-row basis


In [23]:
p = re.compile(r'\d{10}')
s = '1234567890'

# note: calling match on the compiled pattern;
# not using the re.match function
m = p.match(s)
print(m)

<re.Match object; span=(0, 10), match='1234567890'>


In [24]:
p = re.compile(r'\d+')
s = "13 Jodie Whitaker, war John Hurt, 12 Peter Capaldi, 11 Matt Smith, 10 David Tennant, 9 Christopher Eccleston"
m = p.findall(s)
print(m)

['13', '12', '11', '10', '9']


In [25]:
p = re.compile(r'\w+\s?\w+:\s?')
s = "Guard: You're using coconuts!"
m = p.sub(string=s, repl='')
print(m)

You're using coconuts!
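
A compiled pattern pays off when the same pattern is applied to many strings. A minimal sketch, assuming the goal is to strip the speaker label from each line of the dialogue; the pattern and the names `speaker`, `lines`, and `spoken` are illustrative:

```python
import re

# hypothetical pattern: one or two words, a colon, and an optional space
speaker = re.compile(r'\w+\s?\w+:\s?')

lines = [
    "Guard: What? Ridden on a horse?",
    "King Arthur: Yes!",
    "Guard: You're using coconuts!",
]

# count=1 removes only the first label on each line
spoken = [speaker.sub(repl='', string=line, count=1) for line in lines]
print(spoken)
```

The same compiled object can be reused for 'match', 'findall', and 'sub' without passing the pattern again.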
