## Strings, Regex, and Serialization 

#### Will Norris

In [21]:
a = '''multi
line 
string'''
print(a)

multi
line 
string


In [5]:
concat = ("Add" "These" "Strings")
print(concat)

AddTheseStrings


In [7]:
# see all of the string methods available 
dir(str)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

### Accessing String Methods: 

In [34]:
# Boolean methods 
test_str = "I like to eat pizza"

# call the method on the str class, passing the string explicitly
print(str.isalpha(test_str))
# or call it on the string instance (False here because spaces are not letters)
print(test_str.isalpha())

False
False


In [15]:
title = "I Like To Eat Pizza"
title.istitle()

True

#### Numbers and Strings, Unicode is weird 
- ```isdecimal``` is only true when every character is a Unicode decimal digit, and the period, ".", is not one

In [18]:
# numbers and strings 
decimal = '45.2'
print(decimal.isdecimal())

decimal2 = '45\u06602'
print(decimal2.isdecimal())

False
True


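##### ```float``` accepts both, but reads them differently:
- ```\u0660``` is the Arabic-Indic digit zero, so ```'45\u06602'``` parses as 4502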

In [20]:
print(float(decimal))
print(float(decimal2))

45.2
4502.0
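
A minimal sketch of how the three numeric checks differ, using a few sample characters chosen here for illustration (they are not from the original notebook). Only characters that pass ```isdecimal``` can be converted by ```int``` or ```float```.

```python
# Sample characters picked for illustration only
samples = {
    "ascii five": "5",
    "arabic-indic zero": "\u0660",
    "superscript two": "\u00b2",
    "one half": "\u00bd",
}

for name, ch in samples.items():
    print(f"{name:18s} decimal={ch.isdecimal()} "
          f"digit={ch.isdigit()} numeric={ch.isnumeric()}")

# A decimal digit converts cleanly
print(int("\u0660"))  # 0

# A superscript is a "digit" but not a "decimal", so int() refuses it
try:
    int("\u00b2")
except ValueError as err:
    print("int() rejects it:", err)
```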


### Built in Pattern Finding & Manipulation: 

In [24]:
s = "Hi my name is Will"

print(s.count('Will'))
print(s.count('will'))

print(s.lower().count('will'))

1
0
1


In [44]:
s = "Use Arcpy to buffer the points in your shapefile"

split = s.split()

print(split)
print()
print(" ".join(split))
print()
print(s.replace("Arcpy", "Geopandas"))

['Use', 'Arcpy', 'to', 'buffer', 'the', 'points', 'in', 'your', 'shapefile']

Use Arcpy to buffer the points in your shapefile

Use Geopandas to buffer the points in your shapefile


### Quick Note: 
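- Strings are immutable, so every string method returns a new string object instead of changing the original 
- Keeping the result under a new variable name keeps both strings in memory; reusing the same name lets Python free the old string once nothing else refers to it 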

In [33]:
s = "ArcMap 10.x still uses python 2.x"
s_low = s.lower()
print(s)
print(s_low)

s = s.lower()
print(s)

ArcMap 10.x still uses python 2.x
arcmap 10.x still uses python 2.x
arcmap 10.x still uses python 2.x
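
A small sketch of the point about memory, assuming nothing beyond built-in string behavior: ```lower``` builds a separate object, and rebinding the name drops the last reference to the old one.

```python
s = "ArcMap 10.x still uses python 2.x"
lowered = s.lower()

# lower() built a new string; the original is untouched
print(s)
print(lowered)
print(lowered is s)   # False: two separate objects

# Rebinding the name: the mixed-case string has no remaining references
# after this line, so the interpreter is free to reclaim it
s = s.lower()
print(s == lowered)   # True: same contents
```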


### String Formatting: 
- ```format``` accepts ```*args``` and ```**kwargs``` 
    - This means extra arguments that the template never uses are silently ignored 

In [135]:
template = "Please {} the points in your shapefile using {}"
print(template.format('clip ', 'Shapely'))
print(template.format('buffer', 'Geopandas', 'whatamidoinghere?'))

Please clip  the points in your shapefile using Shapely
Please buffer the points in your shapefile using Geopandas
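
Passing too much is fine, but passing too little is not. A short sketch of the failure cases (the field name ```tool``` and the keyword ```software``` are made up for the example):

```python
template = "Please {} the points in your shapefile using {}"

# Too many arguments: the extra one is ignored
print(template.format("buffer", "Geopandas", "unused"))

# Too few positional arguments: IndexError
try:
    template.format("buffer")
except IndexError as err:
    print("IndexError:", err)

# A named field with no matching keyword: KeyError
try:
    "Using {tool}".format(software="Geopandas")
except KeyError as err:
    print("KeyError:", err)
```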


### Using formatting parameters out of order or more than once: 

In [48]:
template = "Please {0} the points using {1}. Make sure to use {1} and not {2}."
print(template.format('clip', 'shapely', 'arcpy'))

Please clip the points using shapely. Make sure to use shapely and not arcpy.


### Escaping Braces: 
- Sometimes we want to print a literal '{' or '}' instead of having it treated as a replacement field
- Doubling the braces, '{{' and '}}', achieves this 

In [49]:
template = """
   public class {0} {{
       public static void main(String[] args) {{
           System.out.println("{1}");
}} }}"""
print(template.format("MyClass", "print('hello world')"))


   public class MyClass {
       public static void main(String[] args) {
           System.out.println("print('hello world')");
} }


### Keyword Arguments: 
- Formatting complex strings can be a pain to keep organized 
    - So, Python lets us name the fields and pass keyword arguments instead

In [53]:
template = """
   From: <{from_email}>
   To: <{to_email}>
   Subject: {subject}
   {message}"""

print(template.format(
    from_email = "will@example.com",
    to_email = "bob@example.com",
    message = "Did you know ArcMap 10.x still uses python 2.x?? ",
    subject = "2.7?!?!"))


   From: <will@example.com>
   To: <bob@example.com>
   Subject: 2.7?!?!
   Did you know ArcMap 10.x still uses python 2.x?? 


### Mixing Keyword and Positional arguments: 
- Works fine as long as we put positional arguments first!

In [9]:
print("{} like {soft} {} than {soft2}".format("I","more", soft="geopandas", soft2='arcpy'))

I like geopandas more than arcpy


### Container Lookups:
- We can put so much more than strings into our ```format``` 💕
    - Lists and Dicts can be indexed
        - as both positional and keyword arguments
    - Floats and Ints can be converted 

In [10]:
emails = ("will@example.com", "bob@example.com")

message = {
           'subject': "Arcpy on 2.7?!?",
           'message': "It will be deprecated this year!"
           }
    
template = """
    From: <{0[0]}>
    To: <{0[1]}>
    Subject: {message[subject]} {message[message]}
    """ 

print(template.format(emails, message=message))


    From: <will@example.com>
    To: <bob@example.com>
    Subject: Arcpy on 2.7?!? It will be deprecated this year!
    


#### We can go nest crazy if we want to:

In [57]:
emails = ("will@example.com", "bob@example.com")
message = {
           'emails': emails,
           'subject': "Arcpy on 2.7?!?",
           'message': "Maybe we should try open sourced software!"
           }
template = """
From: <{0[emails][0]}>
To: <{0[emails][1]}>
Subject: {0[subject]}
{0[message]}"""
print(template.format(message))


From: <will@example.com>
To: <bob@example.com>
Subject: Arcpy on 2.7?!?
Maybe we should try open sourced software!


### Object Lookups: 
- Just like Containers, we can use objects in ```format```'s!
    - Access attributes via ```object.attribute```
- Slightly more readable than container lookups, but making a class for formatting has too much overhead 
    - only do this if the class already exists 

In [60]:
class EMail:
    def __init__(self, from_addr, to_addr, subject, message):
       self.from_addr = from_addr
       self.to_addr = to_addr
       self.subject = subject
       self.message = message

email = EMail("will@example.com", "bob@example.com",
        "We could dump windows if we drop Arc!",
        "What are we waiting for exactly???")

template = """
From: <{0.from_addr}>
To: <{0.to_addr}>
Subject: {0.subject}
{0.message}"""
print(template.format(email))


From: <will@example.com>
To: <bob@example.com>
Subject: We could dump windows if we drop Arc!
What are we waiting for exactly???


### Making Things Pretty

In [61]:
subtotal = 12.32
tax = subtotal * 0.07
total = subtotal + tax
print("Sub: ${0} Tax: ${1} Total: ${total}".format(
   subtotal, tax, total=total))

Sub: $12.32 Tax: $0.8624 Total: $13.182400000000001


In [62]:
print("Sub: ${0:0.2f} Tax: ${1:0.2f} "
       "Total: ${total:0.2f}".format(
           subtotal, tax, total=total))

Sub: $12.32 Tax: $0.86 Total: $13.18


In [69]:
orders = [('burger', 2, 5),
           ('fries', 3.5, 1),
           ('cola', 1.75, 3)]

print("PRODUCT    QUANTITY    PRICE    SUBTOTAL")

for product, price, quantity in orders:
    subtotal = price * quantity
    print("{0:10s}{1: ^9d}    ${2: <8.2f}${3: >6.2f}".format(
       product, quantity, price, subtotal))

PRODUCT    QUANTITY    PRICE    SUBTOTAL
burger        5        $2.00    $ 10.00
fries         1        $3.50    $  3.50
cola          3        $1.75    $  5.25


- ```10s``` means a string padded to 10 characters 
- ```^9d``` means an integer centered in 9 characters 
    - ```^``` means align to the center of the padding 
- ```<``` and ```>``` tell us to align to the left or right
- ```6.2f``` means a float at least 6 characters wide with 2 decimal places 
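
To make the padding visible, the sketch below swaps the space fill for ```*``` or ```_``` and wraps each field in brackets. The values are arbitrary.

```python
# A visible fill character shows exactly where the padding goes
print("[{:10s}]".format("cola"))     # [cola      ]
print("[{:*^9d}]".format(5))         # [****5****]
print("[{:_<8.2f}]".format(1.75))    # [1.75____]
print("[{:>6.2f}]".format(5.25))     # [  5.25]
```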

### Formatter Types: 
- ```s,d,f``` = strings, integers, floats
- ```o``` octal format
- ```X``` hexadecimal (uppercase; ```x``` gives lowercase) 
- ```%``` multiply by 100 and format as percentage 
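
A quick sketch of these type codes applied to the same number (```b``` for binary is included as one more code from the same family):

```python
value = 255
print("{:d}".format(value))    # 255
print("{:o}".format(value))    # 377
print("{:X}".format(value))    # FF
print("{:x}".format(value))    # ff
print("{:b}".format(value))    # 11111111
print("{:.1%}".format(0.25))   # 25.0%
```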

#### Datetime (and other libraries) have their own format syntax: 

In [70]:
import datetime
print("{0:%Y-%m-%d %I:%M%p }".format(
       datetime.datetime.now()))

2019-02-16 11:55AM 
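
As a sketch of what happens here: the text after the colon is handed to the object itself, and ```datetime``` treats it as a ```strftime``` pattern. The fixed timestamp below is only there so the output is repeatable.

```python
import datetime

# Fixed time (instead of now()) so the result is repeatable
stamp = datetime.datetime(2019, 2, 16, 11, 55)

via_format = "{0:%Y-%m-%d %H:%M}".format(stamp)
via_strftime = stamp.strftime("%Y-%m-%d %H:%M")

print(via_format)                   # 2019-02-16 11:55
print(via_format == via_strftime)   # True
```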


### Strings are Unicode: 
- Strings in Python are sequences of Unicode characters
- But when we read strings from files or the network, they arrive as bytes!
    - Bytes are the lowest-level storage format in computing 
    - 8 bits (an int from 0-255, or hex 00-FF)
   
"a" => 97 => 0x61 => 01100001

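The chain above can be reproduced with built-ins. A minimal sketch:

```python
code = ord("a")
print(code)                  # 97
print(hex(code))             # 0x61
print(format(code, "08b"))   # 01100001

# and back again
print(chr(0x61))             # a
print(bytes([97]))           # b'a'

# indexing a bytes object gives the integer value of that byte
print("a".encode("ascii")[0])  # 97
```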
### Converting Bytes to Text: 
- Here we have hex-escaped ASCII bytes and one non-ASCII byte 
    - Printing a ```bytes``` object shows printable ASCII bytes as characters and everything else as a hex escape; decoding with latin-1 turns ```\xe9``` into é

In [1]:
characters = b'\x63\x6c\x69\x63\x68\xe9' 
print(characters) 
print(characters.decode("latin-1"))

b'clich\xe9'
cliché


### Converting Text to Bytes: 
- We can easily ```encode``` characters to bytes in a given format

In [2]:
characters = "cliché"
print(characters.encode("UTF-8"))
print(characters.encode("latin-1"))
print(characters.encode("CP437"))
print(characters.encode("ascii"))  # fails: é is not an ASCII character

b'clich\xc3\xa9'
b'clich\xe9'
b'clich\x82'


UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 5: ordinal not in range(128)
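
If raising an error is not what we want, ```encode``` and ```decode``` take an ```errors``` argument. A short sketch of three of the standard handlers:

```python
characters = "cliché"

print(characters.encode("ascii", errors="replace"))            # b'clich?'
print(characters.encode("ascii", errors="ignore"))             # b'clich'
print(characters.encode("ascii", errors="xmlcharrefreplace"))  # b'clich&#233;'

# decode accepts the same argument for bytes that are invalid in the codec
raw = b"clich\xe9"                       # latin-1 bytes, not valid UTF-8
recovered = raw.decode("utf-8", errors="replace")
print(recovered)                         # 'clich' followed by U+FFFD
```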

## Regular Expressions:
- Given a string, determine whether that string matches a given pattern and, optionally, collect substrings that contain relevant information. 
- The real tool for parsing strings and pattern finding 👨🏼‍🔧
- Hard to read, but really useful!!
    - checking if a URL is valid
    - finding which of the available files belong to the band we are looking for 
    - etc, etc, etc (Strings have lots of useful information)

In [128]:
import re

def match(search_str, pattern):
    # re.match only tests the pattern at the start of search_str
    result = re.match(pattern, search_str)
    if result:
        return "It Matches!"
    else:
        return "No Match!"

### Matching Patterns: 


In [129]:
import re
search_string = "hello world"
pattern1 = "hello"
pattern2 = "world"
print(match(search_string, pattern1))
print(match(search_string, pattern2))

It Matches!
No Match!


#### Notes: 
- ```match``` only matches at the beginning of the string 
    - We can use ```^``` and ```$``` to represent the start and end of a string

In [31]:
pat = "^hello"
print(match(search_string, pat))
# "^hello$" fails because the string continues after "hello"
print(match(search_string, pat+'$'))

It Matches!
No Match!
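
For comparison, here is a sketch of the two sibling functions in ```re```: ```search``` looks anywhere in the string and ```fullmatch``` requires the whole string to match. The variable name ```greeting``` is just for this example.

```python
import re

greeting = "hello world"

print(re.match("world", greeting))       # None: "world" is not at the start
print(re.search("world", greeting))      # a match object with span (6, 11)
print(re.fullmatch("hello", greeting))   # None: the whole string must match
print(re.fullmatch("hello world", greeting) is not None)  # True
```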


### Matching a Selection of Characters: 
- In regex, the ```.``` represents any character wildcard 

In [36]:
pat = "hel.o"
print(match("hello world", pat))

print(match("hell world", pat))

It Matches!
No Match!


#### Matching on Specific Sets of Characters: 

In [45]:
# can be either "l" or "p"
pat = "hel[lp]o"
print(match("hello world", pat))

# any single letter in that spot (upper or lower case)
pat = "hel[a-zA-Z]o world" 
print(match("hello world", pat))

It Matches!
It Matches!


### Characters That Must be Escaped: 
- ```. ^ $ * + ? {} [] \ | ()```
- We will also use the ```finditer``` method to match patterns 
    - Before, we only checked whether a match occurred; now we can store each match and where it is in the text

In [92]:
text = '''
0.99
0.00
Mr. William Norris
Mrs. Josey Joserson 
Mr. Finch Fincherson 
Ms. Misty Misterson
Mr Will 
Mrs Sanfransisco

ArcMap 10.x
ArcMap 10.9

719-333-9095
720-303-4242
303.720.6767
800-777-9494
900-333-5435
800-202-5454
'''

# we must escape the "." (a raw string keeps the backslash intact)
pattern = re.compile(r"0\.[0-9][0-9]")
matches = pattern.finditer(text)
for match in matches: print(match)

<_sre.SRE_Match object; span=(1, 5), match='0.99'>
<_sre.SRE_Match object; span=(6, 10), match='0.00'>
<_sre.SRE_Match object; span=(177, 181), match='0.67'>


In [74]:
print(text[1:5])
print(text[6:10])

0.99
0.00
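
When the text to find comes from a variable, escaping by hand is error prone. A sketch using ```re.escape```, which adds the backslashes for us (the sample strings are made up):

```python
import re

literal = "10.x"
pattern = re.compile(re.escape(literal))

print(re.escape(literal))                        # 10\.x
print(bool(pattern.search("ArcMap 10.x")))       # True
print(bool(pattern.search("ArcMap 1099")))       # False: no literal "10.x"

# Without escaping, "." is a wildcard and matches the underscore
print(bool(re.search(literal, "ArcMap 10_x")))   # True
```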


### Pattern Matching with Regex: 
- ```.```  $\rightarrow$   Any character except newline 
- ```\d```  $\rightarrow$  Digit (0-9)
- ```\D```  $\rightarrow$  Not a Digit (0-9)
- ```\w```  $\rightarrow$  Word Character 
- ```\W```  $\rightarrow$  Not a Word Character 
- ```\s```  $\rightarrow$  Whitespace
- ```\S```  $\rightarrow$  Take a wild guess!
- ```\b```  $\rightarrow$  Word Boundary 
- etc, etc, etc, refer to [the docs](https://docs.python.org/3/library/re.html)
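
A small sketch of a few of these classes on one sample line. It uses ```re.findall```, which returns the matched text as a list, instead of ```finditer```; the sample sentence is invented for the example.

```python
import re

line = "Mr. Finch called 720-303-4242 twice"

print(re.findall(r"\d", line)[:3])    # ['7', '2', '0']
print(re.findall(r"\d+", line))       # ['720', '303', '4242']
print(re.findall(r"\w+", line))       # ['Mr', 'Finch', 'called', '720', '303', '4242', 'twice']
print(len(re.findall(r"\s", line)))   # 4 (the spaces between words)
print(re.findall(r"\bt\w*", line))    # ['twice']
```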

In [67]:
pattern = re.compile(r'\bMr')
matches = pattern.finditer(text)
for match in matches: print(match)

<_sre.SRE_Match object; span=(11, 13), match='Mr'>
<_sre.SRE_Match object; span=(30, 32), match='Mr'>
<_sre.SRE_Match object; span=(51, 53), match='Mr'>


In [68]:
# Escaping the period means only 'Mr.' matches, which excludes 'Mrs.' (and 'Mr' with no period)
pattern = re.compile(r'\bMr\.')
matches = pattern.finditer(text)
for match in matches: print(match)

<_sre.SRE_Match object; span=(11, 14), match='Mr.'>
<_sre.SRE_Match object; span=(51, 54), match='Mr.'>


In [76]:
# let's grab all the phone numbers in our text 
pattern = re.compile(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d')
matches = pattern.finditer(text)
for match in matches: print(match)

<_sre.SRE_Match object; span=(119, 131), match='719-333-9095'>
<_sre.SRE_Match object; span=(132, 144), match='720-303-4242'>
<_sre.SRE_Match object; span=(145, 157), match='303.720.6767'>


In [81]:
# only find 800 or 900 numbers
pattern = re.compile(r'[89]00[-.]\d\d\d[-.]\d\d\d\d')
matches = pattern.finditer(text)
for match in matches: print(match)

<_sre.SRE_Match object; span=(158, 170), match='800-777-9494'>
<_sre.SRE_Match object; span=(171, 183), match='900-333-5435'>
<_sre.SRE_Match object; span=(184, 196), match='800-202-5454'>


### Character Sets Behave a Little Differently: 

In [84]:
# inside a character set, '^' negates the set instead of anchoring to the beginning 
pattern = re.compile(r'[^a-zA-Z]')
matches = pattern.finditer(text)
for match in matches: print(match)

<_sre.SRE_Match object; span=(0, 1), match='\n'>
<_sre.SRE_Match object; span=(1, 2), match='0'>
<_sre.SRE_Match object; span=(2, 3), match='.'>
<_sre.SRE_Match object; span=(3, 4), match='9'>
<_sre.SRE_Match object; span=(4, 5), match='9'>
<_sre.SRE_Match object; span=(5, 6), match='\n'>
<_sre.SRE_Match object; span=(6, 7), match='0'>
<_sre.SRE_Match object; span=(7, 8), match='.'>
<_sre.SRE_Match object; span=(8, 9), match='0'>
<_sre.SRE_Match object; span=(9, 10), match='0'>
<_sre.SRE_Match object; span=(10, 11), match='\n'>
<_sre.SRE_Match object; span=(13, 14), match='.'>
<_sre.SRE_Match object; span=(14, 15), match=' '>
<_sre.SRE_Match object; span=(22, 23), match=' '>
<_sre.SRE_Match object; span=(29, 30), match='\n'>
<_sre.SRE_Match object; span=(33, 34), match='.'>
<_sre.SRE_Match object; span=(34, 35), match=' '>
<_sre.SRE_Match object; span=(40, 41), match=' '>
<_sre.SRE_Match object; span=(49, 50), match=' '>
<_sre.SRE_Match object; span=(50, 51), match='\n'>
<_sre.SRE_Matc

### Matching Multiple Characters: 
Quantifiers: 
- ```*```  $\rightarrow$  0 or more 
- ```+```  $\rightarrow$  1 or more 
- ```?```   $\rightarrow$ 0 or 1
- ```{3}```   $\rightarrow$ exactly 3 repetitions
- ```{3,4}```  $\rightarrow$  between 3 and 4 repetitions 
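
Each quantifier applies to the item just before it. A sketch with a small helper (```full``` is a hypothetical name, wrapping ```re.fullmatch```) so every line is a yes/no answer:

```python
import re

def full(pattern, s):
    """Return True when the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

print(full(r"ab*", "a"))           # True:  zero b's is allowed
print(full(r"ab+", "a"))           # False: at least one b is required
print(full(r"colou?r", "color"))   # True:  the u is optional
print(full(r"\d{3}", "7193"))      # False: four digits, not exactly three
print(full(r"\d{3,4}", "7193"))    # True:  three or four digits
```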

In [86]:
pattern = re.compile(r'\d{3}.\d{3}.\d{4}')
matches = pattern.finditer(text)
for match in matches: print(match)

<_sre.SRE_Match object; span=(119, 131), match='719-333-9095'>
<_sre.SRE_Match object; span=(132, 144), match='720-303-4242'>
<_sre.SRE_Match object; span=(145, 157), match='303.720.6767'>
<_sre.SRE_Match object; span=(158, 170), match='800-777-9494'>
<_sre.SRE_Match object; span=(171, 183), match='900-333-5435'>
<_sre.SRE_Match object; span=(184, 196), match='800-202-5454'>


In [96]:
# match the names of all the "Mr"s w/ or w/o period  
pattern = re.compile(r'Mr\.?\s[A-Z]\w*') # can also be (r'Mr\.?\s[A-Z][a-z]*')
matches = pattern.finditer(text)
for match in matches: print(match)

<_sre.SRE_Match object; span=(11, 22), match='Mr. William'>
<_sre.SRE_Match object; span=(51, 60), match='Mr. Finch'>
<_sre.SRE_Match object; span=(93, 100), match='Mr Will'>


In [99]:
# match all of our names 
pattern = re.compile(r'M(r|s|rs)\.?\s[A-Z]\w*')
matches = pattern.finditer(text)
for match in matches: print(match)

<_sre.SRE_Match object; span=(11, 22), match='Mr. William'>
<_sre.SRE_Match object; span=(30, 40), match='Mrs. Josey'>
<_sre.SRE_Match object; span=(51, 60), match='Mr. Finch'>
<_sre.SRE_Match object; span=(73, 82), match='Ms. Misty'>
<_sre.SRE_Match object; span=(93, 100), match='Mr Will'>
<_sre.SRE_Match object; span=(102, 118), match='Mrs Sanfransisco'>


In [110]:
emails = '''
WillN@gmail.com
will_norris@icloud.com
wino6687@colorado.edu
ilikepizza@pizza.org
will.norris@icloud.com
'''

In [111]:
pattern = re.compile(r'[a-zA-Z]+@[a-zA-Z]+\.[a-z]+')
matches = pattern.finditer(emails)
for match in matches: print(match)

<_sre.SRE_Match object; span=(1, 16), match='WillN@gmail.com'>
<_sre.SRE_Match object; span=(22, 39), match='norris@icloud.com'>
<_sre.SRE_Match object; span=(62, 82), match='ilikepizza@pizza.org'>
<_sre.SRE_Match object; span=(88, 105), match='norris@icloud.com'>


In [112]:
# the previous pattern only allowed letters before the @; also allow digits, "_" and "."
pattern = re.compile(r'[a-zA-Z0-9_.]+@[a-zA-Z]+\.[a-z]+')
matches = pattern.finditer(emails)
for match in matches: print(match)

<_sre.SRE_Match object; span=(1, 16), match='WillN@gmail.com'>
<_sre.SRE_Match object; span=(17, 39), match='will_norris@icloud.com'>
<_sre.SRE_Match object; span=(40, 61), match='wino6687@colorado.edu'>
<_sre.SRE_Match object; span=(62, 82), match='ilikepizza@pizza.org'>
<_sre.SRE_Match object; span=(83, 105), match='will.norris@icloud.com'>


### Creating Groups in Regex

In [130]:
urls = '''
https://www.google.com
http://willmnorris.com
https://youtube.com
https://www.nasa.gov
'''

In [131]:
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = pattern.finditer(urls)
for match in matches: 
    print(match.group(0))
    print(match.group(2))
    print()

https://www.google.com
google

http://willmnorris.com
willmnorris

https://youtube.com
youtube

https://www.nasa.gov
nasa



In [132]:
# the back-references \2 and \3 keep only groups 2 and 3 of each match
subbed_urls = pattern.sub(r'\2\3', urls)
print(subbed_urls)


google.com
willmnorris.com
youtube.com
nasa.gov
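
Numbered groups get hard to track as patterns grow. One alternative, sketched here, is to name the groups with ```(?P<name>...)``` and mark groups we do not need as non-capturing with ```(?:...)```. The names ```domain``` and ```tld``` are chosen for this example.

```python
import re

pattern = re.compile(r"https?://(?:www\.)?(?P<domain>\w+)(?P<tld>\.\w+)")

m = pattern.search("https://www.nasa.gov")
print(m.group("domain"))   # nasa
print(m.group("tld"))      # .gov
print(m.groupdict())       # {'domain': 'nasa', 'tld': '.gov'}

# named back-references work in substitutions too
short = pattern.sub(r"\g<domain>\g<tld>", "http://willmnorris.com")
print(short)               # willmnorris.com
```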



## Serializing Objects: 
- Often, we want to store objects so we can use them later 
- When we transfer data, it isn't in its complete structured format
    - We encode data down to bytes and send binary or other broken-down data formats over transmission lines
- In Python, ```pickle``` allows us to store objects in a serialized form 
    - Converts objects down to bytes that can be stored or transported easily
- Serialization resembles compression in that data is encoded and later decoded, but its goal is portability, not a smaller size 

In [134]:
import pickle

some_data = ['a list', 'containing', 5, 
             'values including another list',['inner', 'list']]

with open("pickled_list", 'wb') as file:
    pickle.dump(some_data, file)
    
with open("pickled_list", 'rb') as file:
    loaded_data = pickle.load(file)

print(loaded_data)
assert loaded_data == some_data

['a list', 'containing', 5, 'values including another list', ['inner', 'list']]


Contents of Pickled List: ![](https://i.ibb.co/S05Vtcn/Screen-Shot-2019-02-19-at-4-04-32-PM.png)
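
The same round trip can be done in memory with ```pickle.dumps``` and ```pickle.loads```, which is a handy way to see that the serialized form really is a ```bytes``` object. A minimal sketch with made-up data; note that the ```pickle``` documentation warns to only unpickle data from sources you trust.

```python
import pickle

# Made-up data for the example
some_data = {"emails": ("will@example.com", "bob@example.com"), "count": 2}

blob = pickle.dumps(some_data)   # serialize to a bytes object in memory
print(type(blob))                # <class 'bytes'>

restored = pickle.loads(blob)    # rebuild an equal object from the bytes
print(restored == some_data)     # True
print(restored is some_data)     # False: it is a copy, not the same object
```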
