# Advanced Modules Exercise Solutions
It's time to test your new skills. This puzzle project combines several of them, including unzipping files with Python and using the os module to search through many files automatically.

# Your Goal
This is a puzzle, so we don't want to give you too much guidance and instead have you figure out things on your own.

There is a .zip file called 'unzip_me_for_instructions.zip'. Unzip it, open the .txt file with Python, read the instructions, and see if you can figure out what you need to do!

# If you get stuck or don't know where to start, here is a guide/hints

# Step 1: Unzipping the File
We can easily use the shutil library to extract and unzip the contents of the .zip file

In [4]:
import shutil

In [2]:
shutil.unpack_archive('unzip_me_for_instructions.zip','','zip')

ReadError: unzip_me_for_instructions.zip is not a zip file
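A ReadError like the one above usually means the file at that path is not a valid zip archive (for example, an incomplete download). The following is a minimal sketch of the unpacking step, not the course's own solution. It builds a tiny archive in a temporary folder so it can run anywhere, and the names `demo.zip` and `workdir` are made up for illustration.

```python
import os
import shutil
import tempfile
import zipfile

# Work in a throwaway directory so nothing in the real project is touched.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, 'demo.zip')           # hypothetical archive name
extract_dir = os.path.join(workdir, 'extracted_content')   # target folder for the contents

# Build a small, valid zip so the example is self-contained.
with zipfile.ZipFile(archive_path, 'w') as zf:
    zf.writestr('Instructions.txt', 'Find the phone number hidden in the files.')

# zipfile.is_zipfile() is a cheap sanity check before unpacking.
if zipfile.is_zipfile(archive_path):
    shutil.unpack_archive(archive_path, extract_dir, 'zip')

extracted_files = os.listdir(extract_dir)
print(extracted_files)
```

Passing an explicit extract directory (instead of an empty string) makes it clear where the files will end up.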

# Step 2: Read the instructions file
Let's figure out what we need to do, open the instructions.txt file.

In [3]:
with open('extracted_content/Instructions.txt') as f:
    content = f.read()
    print(content)

FileNotFoundError: [Errno 2] No such file or directory: 'extracted_content/Instructions.txt'

# Step 3: Regular Expression to Find the Link
There are many approaches to take here, but since we know we are looking for a phone number, there should be digits in the form ###-###-####. We can easily create a regular expression for this and test it. Once it's tested and working, we can figure out how to run it through all the .txt documents.

In [5]:
import re

In [6]:
pattern = r'\d{3}-\d{3}-\d{4}'

In [7]:
test_string = "here is a random number 1231231234 , here is phone number formatted 123-123-1234"


In [8]:
re.findall(pattern,test_string)

['123-123-1234']

# Step 4: Create a function for regex
Let's put this inside a function that applies it to the contents of a .txt file, this way we can apply this function to all the txt files in the extracted_content folder.

In [9]:
def search(file,pattern= r'\d{3}-\d{3}-\d{4}'):
    # Use a context manager so the file is always closed
    with open(file,'r') as f:
        text = f.read()
    
    # Search once and reuse the result
    match = re.search(pattern,text)
    if match:
        return match
    else:
        return ''

# Step 5: OS Walk through the Files to Get the Link
Now that we have a basic function to search through the text of the files, let's perform an os.walk through the unzipped directory to find the links hidden somewhere in one of the text files.

In [15]:
import os

In [16]:
results = []
for folder , sub_folders , files in os.walk(os.path.join(os.getcwd(),"extracted_content")):
    
    for f in files:
        # os.path.join uses the correct separator on any operating system
        full_path = os.path.join(folder,f)
         
        results.append(search(full_path))

In [17]:
for r in results:
    if r != '':
        print(r.group())
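The search-and-walk steps can also be combined into one helper. This is only a sketch under assumptions: it builds its own small directory tree in a temporary folder as a stand-in for extracted_content, and the names `find_phone_numbers` and `PHONE_PATTERN` are made up for illustration.

```python
import os
import re
import tempfile

PHONE_PATTERN = re.compile(r'\d{3}-\d{3}-\d{4}')

def find_phone_numbers(root):
    """Return every phone-number-like string found in files under root."""
    found = []
    for folder, sub_folders, files in os.walk(root):
        for name in files:
            # os.path.join picks the right separator for the current OS.
            full_path = os.path.join(folder, name)
            with open(full_path, 'r') as f:
                found.extend(PHONE_PATTERN.findall(f.read()))
    return found

# Build a tiny directory tree to search (stand-in for extracted_content).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'One', 'Two'))
with open(os.path.join(root, 'One', 'a.txt'), 'w') as f:
    f.write('nothing to see here')
with open(os.path.join(root, 'One', 'Two', 'b.txt'), 'w') as f:
    f.write('call 555-010-9999 today')

numbers = find_phone_numbers(root)
print(numbers)
```

Returning a flat list of matched strings avoids having to filter out empty placeholder results afterwards.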

In [18]:
719-266-2837

-2384

# Collections Module
The collections module is a built-in module that implements specialized container data types providing alternatives to Python’s general purpose built-in containers. We've already gone over the basics: dict, list, set, and tuple.

Now we'll learn about the alternatives that the collections module provides.

# Counter
Counter is a dict subclass which helps count hashable objects. Inside of it elements are stored as dictionary keys and the counts of the objects are stored as the value.

Let's see how it can be used:

In [19]:
from collections import Counter

# Counter() with lists

In [21]:
lst = [1,2,2,2,2,3,3,3,1,2,1,12,3,2,32,1,21,1,223,1]

Counter(lst)

Counter({1: 6, 2: 6, 3: 4, 12: 1, 32: 1, 21: 1, 223: 1})

# Counter with strings

In [22]:
Counter('aabsbsbsbhshhbbsbs')

Counter({'a': 2, 'b': 7, 's': 6, 'h': 3})

In [23]:
s = 'How many times does each word show up in this sentence word times each each word'

words = s.split()

Counter(words)

Counter({'How': 1,
         'many': 1,
         'times': 2,
         'does': 1,
         'each': 3,
         'word': 3,
         'show': 1,
         'up': 1,
         'in': 1,
         'this': 1,
         'sentence': 1})

In [24]:
# Methods with Counter()
c = Counter(words)

c.most_common(2)

[('each', 3), ('word', 3)]

# Common patterns when using the Counter() object
sum(c.values())                 # total of all counts
c.clear()                       # reset all counts
list(c)                         # list unique elements
set(c)                          # convert to a set
dict(c)                         # convert to a regular dictionary
list(c.items())                 # convert to a list of (elem, cnt) pairs
Counter(dict(list_of_pairs))    # convert from a list of (elem, cnt) pairs
c.most_common()[:-n-1:-1]       # n least common elements
c += Counter()                  # remove zero and negative counts
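
A few of these patterns are easier to follow on a concrete Counter. The snippet below is a small illustrative sketch; the sample string and variable names are arbitrary.

```python
from collections import Counter

c = Counter('aabbbc')           # counts: a=2, b=3, c=1
c['d'] = 0                      # a key with a zero count

total = sum(c.values())         # total of all counts
pairs = list(c.items())         # materialize the (elem, cnt) pairs as a list
n = 2
least_common = c.most_common()[:-n-1:-1]   # n least common elements

c += Counter()                  # drops keys whose count is zero or negative
remaining_keys = sorted(c)

print(total, least_common, remaining_keys)
```
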
# defaultdict
defaultdict is a dictionary-like object which provides all the methods of a dictionary but takes a first argument (default_factory) that supplies the default value for missing keys. Using defaultdict is faster than doing the same with the dict.setdefault method.

A defaultdict created with a default factory does not raise a KeyError on lookup. Any key that does not exist gets the value returned by the default factory.

In [25]:
from collections import defaultdict

In [26]:
d ={}

In [27]:
d['one']

KeyError: 'one'

In [28]:
d  = defaultdict(object)

In [29]:
d['one']

<object at 0x26f45a315c0>

In [30]:
for item in d:
    print(item)

one


Can also initialize with default values:

In [31]:
d = defaultdict(lambda: 0)

In [32]:
d['one']

0
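A common use of defaultdict is grouping items. The sketch below, with an arbitrary word list, groups words by their first letter and shows the equivalent plain-dict version using setdefault for comparison.

```python
from collections import defaultdict

words = ['apple', 'avocado', 'banana', 'blueberry', 'cherry']

# With defaultdict, a missing key is created with an empty list on first access.
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)

# The plain-dict equivalent needs setdefault() on every insertion.
plain = {}
for word in words:
    plain.setdefault(word[0], []).append(word)

print(dict(by_letter))
```

Keep in mind that simply looking up a missing key on a defaultdict also inserts it, as the loop over d above showed.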

# namedtuple
The standard tuple uses numerical indexes to access its members, for example:

In [33]:
t = (12,13,14)

In [34]:
t[0]

12

For simple use cases, this is usually enough. On the other hand, remembering which index should be used for each value can lead to errors, especially if the tuple has a lot of fields and is constructed far from where it is used. A namedtuple assigns names, as well as the numerical index, to each member.

Each kind of namedtuple is represented by its own class, created by using the namedtuple() factory function. The arguments are the name of the new class and a string containing the names of the elements.

You can basically think of namedtuples as a very quick way of creating a new object/class type with some attribute fields. For example:

In [35]:
from collections import namedtuple

In [36]:
Dog = namedtuple('Dog',['age','breed','name'])

sam = Dog(age=2,breed='Lab',name='Sammy')

frank = Dog(age=2,breed='Shepard',name="Frankie")

We construct the namedtuple by first passing the object type name (Dog) and then passing the field names, either as a list of strings (as above) or as a single string with spaces between the field names. We can then call on the various attributes:

In [37]:
sam

Dog(age=2, breed='Lab', name='Sammy')

In [38]:
sam.age

2

In [39]:
sam.breed

'Lab'
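Because a namedtuple is still a tuple, it supports unpacking and indexing, and it also comes with a few helper methods. A short sketch reusing the Dog example:

```python
from collections import namedtuple

Dog = namedtuple('Dog', ['age', 'breed', 'name'])
sam = Dog(age=2, breed='Lab', name='Sammy')

age, breed, name = sam            # unpacks like a regular tuple
first_field = sam[0]              # index access still works
as_dict = sam._asdict()           # field name -> value mapping
older_sam = sam._replace(age=3)   # tuples are immutable, so this returns a new Dog

print(first_field, as_dict, older_sam)
```
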

# Conclusion
Hopefully you now see how incredibly useful the collections module is in Python and it should be your go-to module for a variety of common tasks!

# Opening and Reading Files
So far we've discussed how to open files manually, one by one. Let's explore how we can open files programmatically.

# Review: Understanding File Paths

In [40]:
pwd

'C:\\Users\\Madhu'

# Create Practice File
We will begin by creating a practice text file that we will be using for demonstration.

In [41]:
f = open('practice.txt','w+')

In [42]:
f.write('test')
f.close()

# Getting Directories
Python has a built-in os module that allows us to use operating system dependent functionality.

You can get the current directory:

In [43]:
import os

In [44]:
os.getcwd()

'C:\\Users\\Madhu'

# Listing Files in a Directory
You can also use the os module to list directories.

In [45]:
# In your current directory
os.listdir()

['.anaconda',
 '.astropy',
 '.atom',
 '.conda',
 '.condarc',
 '.config',
 '.eclipse',
 '.gitconfig',
 '.ipynb_checkpoints',
 '.ipython',
 '.jupyter',
 '.m2',
 '.matplotlib',
 '.node_repl_history',
 '.p2',
 '.pylint.d',
 '.python_history',
 '.spyder-py3',
 '.tooling',
 '.viminfo',
 '.vscode',
 '.webclipse',
 '.webclipse.properties',
 '3D Objects',
 'abcdefghijk.txt',
 'Advanced-Python Modules.ipynb',
 'anaconda3',
 'AppData',
 'Application Data',
 'cap.py',
 'Contacts',
 'Cookies',
 'Desktop',
 'dfghntnaathajth.txt',
 'Documents',
 'Downloads',
 'eclipse',
 'Favorites',
 'github',
 'IntelGraphicsProfiles',
 'Jupyter.IPYNB',
 'Links',
 'Local Settings',
 'MicrosoftEdgeBackups',
 'Music',
 'My Documents',
 'My Python Stuff',
 'myfile.txt',
 'my_new_file.txt',
 'NetHood',
 'NTUSER.DAT',
 'ntuser.dat.LOG1',
 'ntuser.dat.LOG2',
 'NTUSER.DAT{0c2b14a3-e1d2-11e9-a60b-d008b304490f}.TM.blf',
 'NTUSER.DAT{0c2b14a3-e1d2-11e9-a60b-d008b304490f}.TMContainer00000000000000000001.regtrans-ms',
 'NTUSER.

In [46]:
# In any directory you pass
os.listdir("C:\\Users")

['All Users',
 'Default',
 'Default User',
 'Default.migrated',
 'desktop.ini',
 'Madhu',
 'Public']

# Moving Files
You can use the built-in shutil module to move files to different locations. Keep in mind that there are permission restrictions. For example, if you are logged in as User A, you won't be able to make changes to the top-level Users folder without the proper permissions.

In [47]:
import shutil

In [49]:
shutil.move('practice.txt','C:\\Users\\Madhu')

Error: Destination path 'C:\Users\Madhu\practice.txt' already exists
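The error above appears because the file already lives in the destination folder. The following sketch shows a move into a different folder. It runs inside a temporary directory, and the folder name `archive` is hypothetical.

```python
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
source = os.path.join(root, 'practice.txt')
target_dir = os.path.join(root, 'archive')   # hypothetical destination folder
os.mkdir(target_dir)

with open(source, 'w') as f:
    f.write('test')

# shutil.move returns the path of the file at its new location.
new_path = shutil.move(source, target_dir)

moved = os.path.exists(new_path) and not os.path.exists(source)
print(new_path, moved)
```
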

In [50]:
os.listdir()


# Deleting Files
NOTE: Python provides three functions for permanently deleting files and folders:

os.unlink(path) deletes the file at the path you provide.
os.rmdir(path) deletes the folder at the path you provide (the folder must be empty).
shutil.rmtree(path) is the most dangerous, as it removes all files and folders contained in the path.

None of these operations can be reversed, which means that if you make a mistake you won't be able to recover the file. Instead we will use the send2trash module, a safer alternative that sends deleted files to the trash bin instead of removing them permanently.
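
To see how the three permanent deletion functions differ, here is a sketch that only touches a temporary directory created for the purpose. The file and folder names are arbitrary.

```python
import os
import shutil
import tempfile

root = tempfile.mkdtemp()

# os.unlink removes a single file.
file_path = os.path.join(root, 'scratch.txt')
with open(file_path, 'w') as f:
    f.write('temporary')
os.unlink(file_path)

# os.rmdir removes a folder only if it is empty.
empty_dir = os.path.join(root, 'empty')
os.mkdir(empty_dir)
os.rmdir(empty_dir)

# shutil.rmtree removes a folder and everything inside it.
full_dir = os.path.join(root, 'full')
os.makedirs(os.path.join(full_dir, 'nested'))
with open(os.path.join(full_dir, 'nested', 'data.txt'), 'w') as f:
    f.write('gone soon')
shutil.rmtree(full_dir)

leftover = os.listdir(root)
print(leftover)
```
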


In [51]:
import send2trash

In [52]:
os.listdir()


In [55]:
# The module name is all lowercase
send2trash.send2trash('RECYCLEBIN')

In [56]:
os.listdir()


# Walking through a directory
Often you will just need to "walk" through a directory, that is visit every file or folder and check to see if a file is in the directory, and then perhaps do something with that file. Usually recursively walking through every file and folder in a directory would be quite tricky to program, but luckily the os module has a direct method call for this called os.walk(). Let's explore how it works.

In [57]:
os.getcwd()

'C:\\Users\\Madhu'

In [58]:
os.listdir()


In [59]:
for folder , sub_folders , files in os.walk("Example_Top_Level"):
    
    print("Currently looking at folder: "+ folder)
    print('\n')
    print("THE SUBFOLDERS ARE: ")
    for sub_fold in sub_folders:
        print("\t Subfolder: "+sub_fold )
    
    print('\n')
    
    print("THE FILES ARE: ")
    for f in files:
        print("\t File: "+f)
    print('\n')
    
    # Now look at subfolders
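
As an alternative to os.walk, the pathlib module can search a directory tree recursively with rglob. This sketch builds a small tree in a temporary folder; the file and folder names are made up for illustration.

```python
import tempfile
from pathlib import Path

# Build a small tree: one file at the top, one file two levels down.
root = Path(tempfile.mkdtemp())
(root / 'sub' / 'deep').mkdir(parents=True)
(root / 'notes.txt').write_text('top level')
(root / 'sub' / 'deep' / 'more.txt').write_text('nested')
(root / 'sub' / 'image.png').write_bytes(b'')

# rglob('*.txt') yields matching files in root and in every subfolder.
txt_files = sorted(p.relative_to(root).as_posix() for p in root.rglob('*.txt'))
print(txt_files)
```
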

# DATETIME MODULE
Python has the datetime module to help deal with timestamps in your code. Time values are represented with the time class. Times have attributes for hour, minute, second, and microsecond. They can also include time zone information. The arguments to initialize a time instance are optional, but the default of 0 is unlikely to be what you want.
# Time
Let's take a look at how we can extract time information from the datetime module. We can create a timestamp by specifying datetime.time(hour,minute,second,microsecond)

In [60]:
import datetime

t = datetime.time(4, 20, 1)

# Let's show the different components
print(t)
print('hour  :', t.hour)
print('minute:', t.minute)
print('second:', t.second)
print('microsecond:', t.microsecond)
print('tzinfo:', t.tzinfo)

04:20:01
hour  : 4
minute: 20
second: 1
microsecond: 0
tzinfo: None


Note: A time instance only holds values of time, and not a date associated with the time.

We can also check the min and max values a time of day can have in the module:

In [61]:
print('Earliest  :', datetime.time.min)
print('Latest    :', datetime.time.max)
print('Resolution:', datetime.time.resolution)

Earliest  : 00:00:00
Latest    : 23:59:59.999999
Resolution: 0:00:00.000001


The min and max class attributes reflect the valid range of times in a single day.

# Dates
datetime (as you might suspect) also allows us to work with date timestamps. Calendar date values are represented with the date class. Instances have attributes for year, month, and day. It is easy to create a date representing today’s date using the today() class method.

Let's see some examples:

In [62]:
today = datetime.date.today()
print(today)
print('ctime:', today.ctime())
print('tuple:', today.timetuple())
print('ordinal:', today.toordinal())
print('Year :', today.year)
print('Month:', today.month)
print('Day  :', today.day)

2021-04-05
ctime: Mon Apr  5 00:00:00 2021
tuple: time.struct_time(tm_year=2021, tm_mon=4, tm_mday=5, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=0, tm_yday=95, tm_isdst=-1)
ordinal: 737885
Year : 2021
Month: 4
Day  : 5


As with time, the range of date values supported can be determined using the min and max attributes.

print('Earliest  :', datetime.date.min)
print('Latest    :', datetime.date.max)
print('Resolution:', datetime.date.resolution)

Another way to create new date instances uses the replace() method of an existing date. For example, you can change the year, leaving the day and month alone.

In [64]:
d1 = datetime.date(2015, 3, 11)
print('d1:', d1)

d2 = d1.replace(year=1990)
print('d2:', d2)

d1: 2015-03-11
d2: 1990-03-11


# Arithmetic
We can perform arithmetic on date objects to check for time differences. For example:

In [65]:
d1

datetime.date(2015, 3, 11)

In [66]:
d2

datetime.date(1990, 3, 11)

In [67]:
d1-d2

datetime.timedelta(days=9131)

This gives us the difference in days between the two dates. You can use the timedelta class to specify durations in various units of time (days, hours, minutes, etc.).
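
A timedelta can be created directly and added to or subtracted from a date. A brief sketch reusing d1 and d2 from above:

```python
import datetime

d1 = datetime.date(2015, 3, 11)
d2 = datetime.date(1990, 3, 11)

gap = d1 - d2                                        # subtracting dates gives a timedelta
one_week_later = d1 + datetime.timedelta(weeks=1)    # shift a date by a duration
mixed = datetime.timedelta(days=1, hours=6, minutes=30)

print(gap.days)                 # whole days in the difference
print(one_week_later)
print(mixed.total_seconds())    # the whole duration expressed in seconds
```
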

Great! You should now have a basic understanding of how to use datetime with Python to work with timestamps in your code!

# Math and Random Modules
Python comes with a built in math module and random module. In this lecture we will give a brief tour of their capabilities. Usually you can simply look up the function call you are looking for in the online documentation.

*Math Module

*Random Module

We won't go through every function available in these modules since there are so many, but we will show some useful ones.

# Useful Math Functions

In [68]:
import math

In [69]:
help(math)

Help on built-in module math:

NAME
    math

DESCRIPTION
    This module provides access to the mathematical functions
    defined by the C standard.

FUNCTIONS
    acos(x, /)
        Return the arc cosine (measured in radians) of x.
    
    acosh(x, /)
        Return the inverse hyperbolic cosine of x.
    
    asin(x, /)
        Return the arc sine (measured in radians) of x.
    
    asinh(x, /)
        Return the inverse hyperbolic sine of x.
    
    atan(x, /)
        Return the arc tangent (measured in radians) of x.
    
    atan2(y, x, /)
        Return the arc tangent (measured in radians) of y/x.
        
        Unlike atan(y/x), the signs of both x and y are considered.
    
    atanh(x, /)
        Return the inverse hyperbolic tangent of x.
    
    ceil(x, /)
        Return the ceiling of x as an Integral.
        
        This is the smallest integer >= x.
    
    comb(n, k, /)
        Number of ways to choose k items from n items without repetition and without order

# Rounding Numbers

In [70]:
value = 4.35

In [71]:
math.floor(value)

4

In [72]:
math.ceil(value)

5

In [73]:
round(value)

4
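Two details of rounding are worth knowing. In Python 3, round() sends exact ties to the nearest even number, and floats that look like ties may not be stored as exact ties. Also, floor and ceil behave differently from truncation for negative numbers. A short sketch:

```python
import math

tie_down = round(2.5)            # tie: goes to the even neighbour, 2
tie_up = round(3.5)              # tie: goes to the even neighbour, 4
not_a_tie = round(2.675, 2)      # 2.675 is stored slightly below 2.675, so this gives 2.67

negative = -4.35
floored = math.floor(negative)   # next integer down
ceiled = math.ceil(negative)     # next integer up
truncated = math.trunc(negative) # simply drops the fractional part

print(tie_down, tie_up, not_a_tie, floored, ceiled, truncated)
```
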

# Mathematical Constants

In [74]:
math.pi

3.141592653589793

In [75]:
from math import pi

In [76]:
pi

3.141592653589793

In [77]:
math.e

2.718281828459045

In [78]:
math.tau

6.283185307179586

In [79]:
math.inf

inf

# Logarithmic Values

In [80]:
math.e

2.718281828459045

In [81]:
# Log Base e
math.log(math.e)

1.0

In [82]:
# Will produce an error if the value does not exist mathematically
math.log(0)

ValueError: math domain error

In [83]:
math.log(10)

2.302585092994046

In [84]:
math.e ** 2.302585092994046

10.000000000000002

# Custom Base

In [85]:
# math.log(x,base)
math.log(100,10)

2.0

In [86]:
10**2

100
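For the common bases 10 and 2, the math module also has dedicated functions, which the Python documentation describes as usually more accurate than passing a base to math.log. Since floating-point results can be off by a tiny amount, math.isclose is a safer comparison than ==. A small sketch:

```python
import math

log10_value = math.log10(1000)   # dedicated base-10 logarithm
log2_value = math.log2(8)        # dedicated base-2 logarithm

# Compare floats with a tolerance instead of exact equality.
close_enough = math.isclose(math.log(1000, 10), 3.0)

print(log10_value, log2_value, close_enough)
```
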

# Trigonometric Functions

In [87]:
# Radians
math.sin(10)

-0.5440211108893698

In [88]:
math.degrees(pi/2)

90.0

In [89]:
math.radians(180)

3.141592653589793

# Random Module
The random module allows us to create pseudorandom numbers. We can even set a seed to produce the same random sequence every time.

The explanation of how a computer attempts to generate random numbers is beyond the scope of this course since it involves higher-level mathematics. But if you are interested in this topic check out:

https://en.wikipedia.org/wiki/Pseudorandom_number_generator
https://en.wikipedia.org/wiki/Random_seed
# Understanding a seed
Setting a seed allows us to start from a seeded pseudorandom number generator, which means the same random numbers will show up in a series. Note that if you're using Jupyter, the seed needs to be set in the same cell to guarantee the same results each time. Getting the same set of random numbers can be important in situations where you are trying different variations of functions and want to compare their performance on random values fairly (so you need the same set of random numbers each time).

In [90]:
import random

In [91]:
random.randint(0,100)

7

In [92]:
random.randint(0,100)

60

In [93]:
# The value 101 is completely arbitrary, you can pass in any number you want
random.seed(101)
# You can run this cell as many times as you want, it will always return the same number
random.randint(0,100)

74

In [94]:
random.randint(0,100)

24

In [95]:
# The value 101 is completely arbitrary, you can pass in any number you want
random.seed(101)
print(random.randint(0,100))
print(random.randint(0,100))
print(random.randint(0,100))
print(random.randint(0,100))
print(random.randint(0,100))

74
24
69
45
59
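
One way to avoid depending on which cell set the global seed is to create a separate generator object with random.Random. This is a sketch only; the helper name `draw` is made up for illustration, and no particular output values are assumed.

```python
import random

def draw(seed, count=5):
    """Return `count` random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)    # independent generator, separate from the global one
    return [rng.randint(0, 100) for _ in range(count)]

first = draw(101)
second = draw(101)

# The same seed gives the same sequence, no matter what ran in between.
print(first == second)
```
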


# Random Integers

In [96]:
random.randint(0,100)

6

# Random with Sequences
Grab a random item from a list

In [97]:
mylist = list(range(0,20))

In [98]:
mylist

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

In [99]:
random.choice(mylist)

16

In [100]:
mylist

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

# Sample with Replacement
Take a sample size, allowing picking elements more than once. Imagine a bag of numbered lottery balls, you reach in to grab a random lotto ball, then after marking down the number, you place it back in the bag, then continue picking another one.

In [101]:
random.choices(population=mylist,k=10)

[4, 4, 5, 13, 4, 19, 1, 3, 1, 15]

# Sample without Replacement
Once an item has been randomly picked, it can't be picked again. Imagine a bag of numbered lottery balls, you reach in to grab a random lotto ball, then after marking down the number, you leave it out of the bag, then continue picking another one.

In [104]:
random.sample(population=mylist,k=10)

[2, 13, 14, 9, 6, 1, 10, 16, 18, 5]
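
The difference between the two sampling functions can be checked directly. The sketch below uses an arbitrary seed, and also shows the optional weights argument of random.choices and the error raised when a sample without replacement is larger than the population.

```python
import random

rng = random.Random(42)          # arbitrary seed, only to make runs repeatable
population = list(range(20))

with_replacement = rng.choices(population, k=10)     # duplicates are possible
without_replacement = rng.sample(population, k=10)   # every pick is unique

# weights makes some items more likely; here 'heads' is three times as likely.
weighted = rng.choices(['heads', 'tails'], weights=[3, 1], k=8)

# Sampling without replacement cannot return more items than exist.
try:
    rng.sample(population, k=25)
except ValueError as err:
    too_large_error = str(err)

print(len(set(without_replacement)), too_large_error)
```
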

# Shuffle a list
Note: This affects the object in place!

In [105]:
# Don't assign this to anything!
random.shuffle(mylist)

In [106]:
mylist

[19, 17, 14, 5, 2, 9, 4, 3, 16, 0, 15, 10, 11, 6, 1, 13, 7, 18, 12, 8]

# Random Distributions
Uniform Distribution

In [110]:
# Continuous: random picks a value between a and b, and each value has an equal chance of being picked.
random.uniform(a=0,b=100)

18.084194718110446

Normal/Gaussian Distribution: https://en.wikipedia.org/wiki/Normal_distribution

In [109]:
random.gauss(mu=0,sigma=1)

0.15136969676306386

Final Note: If you find yourself using these libraries a lot, take a look at the NumPy library for Python, which covers all these capabilities very efficiently. We cover this library and a lot more in our data science and machine learning courses.

# Python Debugger
You've probably used a variety of print statements to try to find errors in your code. A better way of doing this is by using Python's built-in debugger module (pdb). The pdb module implements an interactive debugging environment for Python programs. It includes features to let you pause your program, look at the values of variables, and watch program execution step-by-step, so you can understand what your program actually does and find bugs in the logic.

This is a bit difficult to show since it requires creating an error on purpose, but hopefully this simple example illustrates the power of the pdb module.
Note: Keep in mind it would be pretty unusual to use pdb in a Jupyter Notebook setting.

Here we will create an error on purpose, trying to add a list to an integer

In [112]:
x = [1,3,4]
y = 2
z = 3

result = y + z
print(result)
result2 = y+x
print(result2)

5


TypeError: unsupported operand type(s) for +: 'int' and 'list'

Hmmm, looks like we get an error! Let's implement a set_trace() using the pdb module. This will allow us to basically pause the code at the point of the trace and check if anything is wrong.

In [None]:
import pdb

x = [1,3,4]
y = 2
z = 3

result = y + z
print(result)

# Set a trace using Python Debugger
pdb.set_trace()

result2 = y+x
print(result2)

5
--Return--
None
> <ipython-input-113-6c36e8161fda>(11)<module>()
      9 
     10 # Set a trace using Python Debugger
---> 11 pdb.set_trace()
     12 
     13 result2 = y+x

ipdb> Traceback
*** NameError: name 'Traceback' is not defined


Great! Now we could check what the various variables were and check for errors. You can use 'q' to quit the debugger. For more information on general debugging techniques and more methods, check out the official documentation: https://docs.python.org/3/library/pdb.html

# Overview of Regular Expressions
Regular expressions (sometimes called regex for short) allow a user to search for strings using almost any sort of rule they can come up with. For example, finding all capital letters in a string, or finding a phone number in a document.

Regular expressions are notorious for their seemingly strange syntax. This strange syntax is a byproduct of their flexibility. Regular expressions have to be able to filter out any string pattern you can imagine, which is why they have a complex string pattern format.

Let's begin by explaining how to search for basic patterns in a string!

# Searching for Basic Patterns
Let's imagine that we have the following string:

In [2]:
text = "The person's phone number is 408-555-1234. Call soon!"

We'll start off by trying to find out if the string "phone" is inside the text string. Now we could quickly do this with:

In [3]:
'phone' in text

True

But let's show the format for regular expressions, because later on we will be searching for patterns that won't have such a simple solution.

In [4]:
import re

In [5]:
pattern = 'phone'

In [6]:
re.search(pattern,text)

<re.Match object; span=(13, 18), match='phone'>

In [8]:
pattern = "NOT IN TEXT"

In [9]:
re.search(pattern,text)

Now we've seen that re.search() will take the pattern, scan the text, and then return a Match object. If the pattern is not found, None is returned (in Jupyter Notebook this just means that nothing is output below the cell).

Let's take a closer look at this Match object.

In [10]:
pattern = 'phone'

In [11]:
match = re.search(pattern,text)

In [12]:
match

<re.Match object; span=(13, 18), match='phone'>

Notice the span. The Match object also provides the start and end index of the match.

In [13]:
match.span()

(13, 18)

In [14]:
match.start()

13

In [15]:
match.end()

18

But what if the pattern occurs more than once?

In [16]:
text = "my phone is a new phone"

In [17]:
match = re.search("phone",text)

In [18]:
match.span()

(3, 8)

Notice it only matches the first instance. If we wanted a list of all matches, we can use .findall() method:

In [19]:
matches = re.findall("phone",text)

In [20]:
matches

['phone', 'phone']

In [21]:
len(matches)

2

To get actual match objects, use the iterator:

In [22]:
for match in re.finditer("phone",text):
    print(match.span())

(3, 8)
(18, 23)


If you wanted the actual text that matched, you can use the .group() method.

In [23]:
match.group()

'phone'
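
The text and position of every match can be collected in one pass by iterating over the Match objects. A short sketch using the same sentence:

```python
import re

text = "my phone is a new phone"

# Each Match object knows its matched text and where it was found.
hits = [(m.group(), m.start(), m.end()) for m in re.finditer("phone", text)]
print(hits)
```
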

# Patterns
So far we've learned how to search for a basic string. What about more complex examples? Such as trying to find a telephone number in a large string of text? Or an email address?

We could just use search method if we know the exact phone or email, but what if we don't know it? We may know the general format, and we can use that along with regular expressions to search the document for strings that match a particular pattern.

This is where the syntax may appear strange at first, but take your time with this. Often it's just a matter of looking up the pattern code.

Let's begin!

# Identifiers for Characters in Patterns
Characters such as a digit or a single string have different codes that represent them. You can use these to build up a pattern string. Notice how these make heavy use of the backslash \ . Because of this, when defining a pattern string for a regular expression we use the format:

r'mypattern'

placing the r in front of the string allows python to understand that the \ in the pattern string are not meant to be escape slashes.

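Below is a table of the most common identifiers:

| Character | Description | Example Pattern | Example Match |
|---|---|---|---|
| \d | A digit | file_\d\d | file_25 |
| \w | Alphanumeric (letter, digit, or underscore) | \w-\w\w\w | A-b_1 |
| \s | Whitespace | a\sb\sc | a b c |
| \D | A non-digit | \D\D\D | ABC |
| \W | Non-alphanumeric | \W\W\W\W\W | *-+=) |
| \S | Non-whitespace | \S\S\S\S | Yoyo |
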
For example:

In [24]:
text = "My telephone number is 408-555-1234"

In [27]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d',text)

In [28]:
phone.group()

'408-555-1234'

Notice the repetition of \d. That is a bit of an annoyance, especially if we are looking for very long strings of numbers. Let's explore the possible quantifiers.

# Quantifiers
Now that we know the special character designations, we can use them along with quantifiers to define how many we expect.
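
The most common quantifiers are + (one or more), * (zero or more), ? (zero or one), {n} (exactly n), and {m,n} (between m and n). The sketch below tries a few of them on made-up sample strings; \b marks a word boundary so that longer numbers are not partially matched.

```python
import re

text = "ids: 7, 42, 123, 2024 and 99999"

one_or_more = re.findall(r'\d+', text)             # runs of one or more digits
exactly_three = re.findall(r'\b\d{3}\b', text)     # numbers with exactly three digits
two_to_four = re.findall(r'\b\d{2,4}\b', text)     # numbers with two to four digits
optional_s = re.findall(r'\bcats?\b', "cat cats catsup")   # the trailing s is optional

print(one_or_more, exactly_three, two_to_four, optional_s)
```
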

Let's rewrite our pattern using these quantifiers:

In [31]:
re.search(r'\d{3}-\d{3}-\d{4}',text)

<_sre.SRE_Match object; span=(23, 35), match='408-555-1234'>

# Groups
What if we wanted to do two tasks, find phone numbers, but also be able to quickly extract their area code (the first three digits). We can use groups for any general task that involves grouping together regular expressions (so that we can later break them down).

Using the phone number example, we can separate groups of regular expressions using parenthesis:

In [32]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

In [36]:
results = re.search(phone_pattern,text)

In [37]:
# The entire result
results.group()

'408-555-1234'

In [39]:
# Can then also call by group position.
# remember groups were separated by parenthesis ()
# Something to note is that group ordering starts at 1. Passing in 0 returns everything
results.group(1)

'408'

In [40]:
results.group(2)

'555'

In [None]:
results.group(3)

'1234'

In [41]:
# We only had three groups of parenthesis
results.group(4)

IndexError: no such group
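
Groups can also be given names, which is often easier to read than numeric positions. This is a sketch; the group names `area`, `prefix`, and `line` are chosen here for illustration.

```python
import re

text = "My telephone number is 408-555-1234"

# (?P<name>...) creates a group that can be looked up by name.
pattern = re.compile(r'(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})')
m = pattern.search(text)

all_groups = m.groups()      # every captured group as a tuple
area = m.group('area')       # look up one group by name
as_dict = m.groupdict()      # name -> captured text

print(all_groups, area, as_dict)
```
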

# Additional Regex Syntax
# Or operator |
Use the pipe operator to express an or statement. For example:

In [None]:
re.search(r"man|woman","This man was here.")

<_sre.SRE_Match object; span=(5, 8), match='man'>

In [None]:

re.search(r"man|woman","This woman was here.")

<_sre.SRE_Match object; span=(5, 10), match='woman'>

# The Wildcard Character
Use a "wildcard" as a placement that will match any character placed there. You can use a simple period . for this. For example:

In [42]:
re.findall(r".at","The cat in the hat sat here.")

['cat', 'hat', 'sat']

In [43]:
re.findall(r".at","The bat went splat")

['bat', 'lat']

Notice how we only matched three characters ('lat' rather than 'splat'). That is because we need a . for each wildcard character, or we can use the quantifiers described above to set our own rules.

In [44]:
re.findall(r"...at","The bat went splat")

['e bat', 'splat']

However, this still leads to the problem of grabbing too much text before the match. Really we only want words that end with "at".

In [45]:
# One or more non-whitespace that ends with 'at'
re.findall(r'\S+at',"The bat went splat")

['bat', 'splat']

# Starts with and Ends With
We can use the ^ to signal starts with, and the $ to signal ends with:

In [46]:
# Ends with a number
re.findall(r'\d$','This ends with a number 2')

['2']

In [47]:
# Starts with a number
re.findall(r'^\d','1 is the loneliest number.')

['1']

Note that this is for the entire string, not individual words!
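
If the text has several lines and the anchors should apply to each line, the re.MULTILINE flag changes the meaning of ^ and $ accordingly. A small sketch with a made-up three-line string:

```python
import re

lines = "1 apple\nbanana 2\n3 cherries"

# Without the flag, ^ only matches at the very start of the string.
default_start = re.findall(r'^\d', lines)

# With re.MULTILINE, ^ and $ match at the start and end of every line.
per_line_start = re.findall(r'^\d', lines, flags=re.MULTILINE)
per_line_end = re.findall(r'\d$', lines, flags=re.MULTILINE)

print(default_start, per_line_start, per_line_end)
```
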

# Exclusion
To exclude characters, we can use the ^ symbol in conjunction with a set of brackets []. Anything inside the brackets is excluded. For example:

In [48]:
phrase = "there are 3 numbers 34 inside 5 this sentence."

In [49]:
re.findall(r'[^\d]',phrase)

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e',
 '.']

To get the words back together, use a + sign

In [50]:
re.findall(r'[^\d]+',phrase)

['there are ', ' numbers ', ' inside ', ' this sentence.']

We can use this to remove punctuation from a sentence.

In [51]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

In [52]:
re.findall('[^!.? ]+',test_phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

In [53]:
clean = ' '.join(re.findall('[^!.? ]+',test_phrase))

In [54]:
clean

'This is a string But it has punctuation How can we remove it'
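
Another way to get the same result is to delete the unwanted characters with re.sub instead of collecting the wanted ones. The sketch below assumes that "punctuation" means the characters in string.punctuation, which is a broader set than the three characters used above.

```python
import re
import string

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Build a character class from every ASCII punctuation character.
# re.escape protects characters such as ] and \ that are special inside a class.
punct_class = '[' + re.escape(string.punctuation) + ']'

# Replace each punctuation character with nothing.
clean_sub = re.sub(punct_class, '', test_phrase)
print(clean_sub)
```
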

# Brackets for Grouping
As we showed above we can use brackets to group together options, for example if we wanted to find hyphenated words:

In [55]:
text = 'Only find the hypen-words in this sentence. But you do not know how long-ish they are'

In [56]:
re.findall(r'[\w]+-[\w]+',text)



['hypen-words', 'long-ish']

# Parenthesis for Multiple Options
If we have multiple options for matching, we can use parenthesis to list out these options. For Example:

In [57]:
# Find words that start with cat and end with one of these options: 'fish','nap', or 'claw'
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"

In [58]:
re.search(r'cat(fish|nap|claw)',text)

<re.Match object; span=(27, 34), match='catfish'>

In [59]:
re.search(r'cat(fish|nap|claw)',texttwo)

<re.Match object; span=(32, 38), match='catnap'>

In [60]:
# None returned
re.search(r'cat(fish|nap|claw)',textthree)
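
One thing to watch for: when a pattern contains a capturing group, re.findall returns only the captured part, not the whole match. A non-capturing group, written (?:...), keeps the grouping without changing what findall returns. A sketch with a made-up sentence:

```python
import re

text = 'catfish and catnap, but not caterpillar'

# Capturing group: findall returns just the text inside the parentheses.
captured = re.findall(r'cat(fish|nap|claw)', text)

# Non-capturing group: findall returns the full matches.
whole = re.findall(r'cat(?:fish|nap|claw)', text)

print(captured, whole)
```
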


# Conclusion
Excellent work! For full information on all possible patterns, check out: https://docs.python.org/3/howto/regex.html