Today we'll cover:

1. [Running OS (operating system) commands](#OS-interface)
2. [Handling command line arguments](#Command-line-arguments)
3. [Math](#Math)
4. [Pattern matching](#Pattern-matching)
5. [Dates and times](#Dates-and-times)
6. [Performance measurement](#Performance-measurement)

# OS interface

In [1]:
import os

In [2]:
os.getcwd()

'/Users/tewaria/Development/stats607a-fall2015'

In [3]:
os.chdir('homeworks')

In [4]:
os.getcwd()

'/Users/tewaria/Development/stats607a-fall2015/homeworks'

In [5]:
os.chdir('..') # change directory to parent directory

In [6]:
os.getcwd()

'/Users/tewaria/Development/stats607a-fall2015'

In [7]:
[os.getenv(var) for var in ['HOME', 'PATH']] # return the environment variables HOME and PATH

['/Users/tewaria',
 '/Users/tewaria/anaconda/bin:/opt/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/opt/X11/bin:/usr/local/munki:/usr/texbin:/Users/tewaria/bin']

You can list files in a directory using `listdir`.

In [8]:
os.listdir('homeworks') # list files in directory homeworks

['assignment_one_kmeans.py',
 'assignment_one_nearest_neighbor.py',
 'assignment_one_optimization.py',
 'datasets',
 'hw1.aux',
 'hw1.log',
 'hw1.out',
 'hw1.pdf',
 'hw1.synctex.gz',
 'hw1.tex',
 'hw1_ver1.0.pdf',
 'losses.py',
 'stats607-header.tex']
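
`os.listdir` returns the names in arbitrary order and mixes files with subdirectories. Here is a minimal sketch of filtering its result. It assumes a throwaway temporary directory filled with made-up file names (both are illustrative and not part of the course repository), and it is written to run under Python 2.7 or 3:

```python
import os
import shutil
import tempfile

# Create a scratch directory with a few empty files (hypothetical names).
scratch = tempfile.mkdtemp()
try:
    for name in ['losses.py', 'hw1.tex', 'hw1.pdf', 'kmeans.py']:
        open(os.path.join(scratch, name), 'w').close()
    os.mkdir(os.path.join(scratch, 'datasets'))

    # os.listdir gives no ordering guarantee, so sort for a stable result.
    entries = sorted(os.listdir(scratch))

    # Keep only the Python source files.
    py_files = [name for name in entries if name.endswith('.py')]

    # Keep only the entries that are directories.
    subdirs = [name for name in entries
               if os.path.isdir(os.path.join(scratch, name))]
finally:
    shutil.rmtree(scratch)  # clean up the scratch directory

print(py_files)
print(subdirs)
```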

# Command line arguments

When you run a command on the OS shell, such as `ls -l -a`, the arguments supplied to it (`-l` and `-a` in this case) are called command line arguments.

Now let us put the following Python code in a script called `print_cmdline_args.py`:

    import sys
    
    print 'This python\'s script name is ' + sys.argv[0]
    
    if len(sys.argv) > 1:
        print 'It was called with the following arguments:',
        for arg in sys.argv[1:]:
            print arg,
        print
    else:
        print 'It was called with no arguments.'

In [9]:
os.system('python print_cmdline_args.py')

0

Well... The return value `0` is just the command's exit status, which tells us that it executed successfully. What if we want to see the output?

In [10]:
print os.popen('python print_cmdline_args.py').read()

This python's script name is print_cmdline_args.py
It was called with no arguments.



In Python 2.7 or later, one can also use the `subprocess.check_output()` function from the `subprocess` module.
Let us now call the script with some arguments.

In [11]:
print os.popen('python print_cmdline_args.py having arguments that enlighten are fun in scripts as well as in life!').read()

This python's script name is print_cmdline_args.py
It was called with the following arguments: having arguments that enlighten are fun in scripts as well as in life!
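
The `subprocess.check_output()` route mentioned above can be sketched as follows. To stay self-contained, the sketch does not rely on `print_cmdline_args.py`. Instead it runs a one-line stand-in script passed with `-c` (an assumption made purely for illustration) and uses `sys.executable` as the interpreter.

```python
import subprocess
import sys

# Stand-in for print_cmdline_args.py: echo the arguments back, space-separated.
child_code = 'import sys; print(" ".join(sys.argv[1:]))'

# Passing a list avoids shell quoting issues; each item is one argument.
raw = subprocess.check_output(
    [sys.executable, '-c', child_code, 'arguments', 'are', 'fun'])

# check_output returns the child's standard output (bytes in Python 3).
output = raw.decode().strip()
print(output)
```

Unlike `os.system()`, `check_output()` raises `subprocess.CalledProcessError` when the command exits with a non-zero status, so failures cannot go unnoticed.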



# Math

In [12]:
import math

In [13]:
print dir(math)

['__doc__', '__file__', '__name__', '__package__', 'acos', 'acosh', 'asin', 'asinh', 'atan', 'atan2', 'atanh', 'ceil', 'copysign', 'cos', 'cosh', 'degrees', 'e', 'erf', 'erfc', 'exp', 'expm1', 'fabs', 'factorial', 'floor', 'fmod', 'frexp', 'fsum', 'gamma', 'hypot', 'isinf', 'isnan', 'ldexp', 'lgamma', 'log', 'log10', 'log1p', 'modf', 'pi', 'pow', 'radians', 'sin', 'sinh', 'sqrt', 'tan', 'tanh', 'trunc']
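
A few of the names listed above in action. This is a small sketch, with inputs chosen only for illustration:

```python
import math

root = math.sqrt(16)            # square root, returned as a float
dist = math.hypot(3, 4)         # Euclidean norm sqrt(3**2 + 4**2)
fact = math.factorial(5)        # 5! = 1*2*3*4*5
down = math.floor(2.7)          # largest integer <= 2.7
up = math.ceil(2.2)             # smallest integer >= 2.2
angle = math.degrees(math.pi)   # convert pi radians to degrees

print(root)
print(dist)
print(fact)
print(angle)
```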


In [14]:
import random

In [15]:
print 'A randomly picked name from the math module directory: ' \
        + random.choice(dir(math)[4:]) # choose a random element from a sequence

A randomly picked name from the math module directory: erf


In [16]:
random.choice("STATS 607A") # also works for strings

'A'

In [17]:
my_list = range(-5,5); print my_list

[-5, -4, -3, -2, -1, 0, 1, 2, 3, 4]


In [18]:
random.shuffle(my_list) # shuffles list in place

In [19]:
print my_list

[3, -1, -4, 4, 1, 2, -2, -5, -3, 0]


In [20]:
random.sample(my_list, 5) # random sample of 5 (drawn without replacement)

[1, -5, 4, 3, -4]

In [21]:
random.randrange(10,20) # equivalent to, but faster than, random.choice(range(10,20))

15

In [22]:
random.random() # floating point number in [0,1)

0.94125121147423

In [23]:
random.uniform(0,10) # returns a + (b-a)*random()

4.883736359235109

In [24]:
random.gauss(0, 1) # the ubiquitous standard normal (args are mean, std dev)

-0.1199821153557051

# Pattern matching

Regular expressions are a powerful way to find and extract complicated patterns in strings. The Python module `re` provides regular expression matching operations.

In [25]:
import re

Let's create a list with valid and invalid email addresses.

In [26]:
addr_list = ['someone@somewhere.com', 'prof@university.edu', \
             'employee@company,com', 'student@school.edu', \
             'silly@com', 'missing_at#server.org']

What is a valid email address anyway? The actual answer can get a bit complicated (see [this question](http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address)), but for today's lecture, we'll define a valid email address as anything of the form:

`username@server.domain`

where `username` can be any non-empty string of letters and numbers, `server` can be any non-empty string of letters and numbers, and `domain` is any non-empty string of letters.

In [27]:
email_pattern = r'[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z]+' # a raw string with a regular expression

There are several things to note about this regular expression:

1. `[]` is used to match a set of characters
2. `+` means "one or more repetitions of"
3. `@` matches the character `@` (and nothing else)
4. `.` matches any single character. So, if we're actually looking for a period we have to escape it.
5. But Python itself interprets backslash escape sequences in ordinary string literals, so we wrote the pattern as a raw string (one that starts with `r`), which passes backslashes through to `re` unchanged.
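
Points 4 and 5 can be checked directly. The sketch below compares the pattern with and without the escaped period on one of the invalid addresses, then compares the length of an ordinary and a raw string literal. The names `strict` and `loose` are made up for this illustration.

```python
import re

strict = r'[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z]+'  # escaped period
loose = r'[a-zA-Z0-9]+@[a-zA-Z0-9]+.[a-zA-Z]+'    # unescaped: any character

bad_addr = 'employee@company,com'

strict_result = re.match(strict, bad_addr)  # None: ',' is not a literal '.'
loose_result = re.match(loose, bad_addr)    # a match: '.' accepts the ','

print(strict_result is None)
print(loose_result is not None)

# In an ordinary literal Python turns \n into one newline character;
# in a raw literal the backslash stays as a character of its own.
newline_len = len('\n')
raw_len = len(r'\n')

print(newline_len)
print(raw_len)
```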

In [28]:
re.match(email_pattern, addr_list[0])

<_sre.SRE_Match at 0x1067ea168>

In [29]:
re.match(email_pattern, addr_list[1])

<_sre.SRE_Match at 0x1067ea238>

Hmmm... What's going on? `re.match()` just seems to return objects, but we want to know whether the pattern matched or not. Let's try it on an invalid email address.

In [30]:
re.match(email_pattern, addr_list[2])

Okay, so `re.match()` returns `None` if no match occurs.

So, now let us validate all addresses. When you know a pattern will be matched a lot, it is useful (for efficiency) to compile the regular expression.

In [31]:
email_prog = re.compile(email_pattern)

`re.compile()` returns a `RegexObject` whose `match()` method can be called later.

In [32]:
email_prog.match('a@b.c')

<_sre.SRE_Match at 0x1067ea3d8>

Remember our list of addresses...

In [33]:
print addr_list

['someone@somewhere.com', 'prof@university.edu', 'employee@company,com', 'student@school.edu', 'silly@com', 'missing_at#server.org']


In [34]:
[addr for addr in addr_list if email_prog.match(addr)] # pick valid ones

['someone@somewhere.com', 'prof@university.edu', 'student@school.edu']

In [35]:
[bool(email_prog.match(addr)) for addr in addr_list] # True for each valid address, False for each invalid one

[True, True, False, True, False, False]

In [36]:
bool(email_prog.match('a_user@university.edu'))

False

We would like to allow for underscores in the username. In Python regular expressions `\w` is equivalent to `[a-zA-Z0-9_]`.

In [37]:
email_prog = re.compile(r'\w+@[a-zA-Z0-9]+\.[a-zA-Z]+')

In [38]:
bool(email_prog.match('a_user@university.edu'))

True

However, what about the string `'a_user@university.edu and a lot of other stuff'`?

In [39]:
bool(email_prog.match('a_user@university.edu and a lot of other stuff'))

True

Actually, `match` only tells you whether the pattern occurs at the beginning of the string. To make sure there is nothing after the pattern, we need the regular expression to match the entire string, not just a prefix of it. The character `$` has a special meaning in regular expressions: it matches the end of the string.

In [40]:
email_prog = re.compile(r'\w+@[a-zA-Z0-9]+\.[a-zA-Z]+$')

In [41]:
bool(email_prog.match('a_user@university.edu and a lot of other stuff'))

False
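
Two related details are worth a quick sketch. `re.search()` looks for the pattern anywhere in the string, while `re.match()` only tries the beginning. Also, `$` matches just before a trailing newline as well as at the very end, whereas `\Z` matches only at the very end. The sample sentence below is made up for illustration.

```python
import re

pattern = r'\w+@[a-zA-Z0-9]+\.[a-zA-Z]+'
text = 'write to a_user@university.edu today'  # made-up sentence

at_start = re.match(pattern, text)    # None: the address is not at position 0
anywhere = re.search(pattern, text)   # scans the string for the first match
found = anywhere.group(0)

print(at_start is None)
print(found)

# '$' tolerates a single trailing newline; '\Z' does not.
with_newline = 'a_user@university.edu\n'
dollar_ok = bool(re.match(pattern + r'$', with_newline))
strict_ok = bool(re.match(pattern + r'\Z', with_newline))

print(dollar_ok)
print(strict_ok)
```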

Now what about `a_user@dept.university.edu`?

In [42]:
bool(email_prog.match('a_user@dept.university.edu'))

False

We can allow the `server` pattern `[a-zA-Z0-9]+\.` itself to repeat 1 or more times!

In [43]:
email_prog = re.compile(r'\w+@([a-zA-Z0-9]+\.)+[a-zA-Z]+$')

In [44]:
bool(email_prog.match('a_user@dept.university.edu'))

True

Let's extract the users from all email addresses.

In [45]:
addr_list.append('a_user@dept.university.edu'); print addr_list

['someone@somewhere.com', 'prof@university.edu', 'employee@company,com', 'student@school.edu', 'silly@com', 'missing_at#server.org', 'a_user@dept.university.edu']


In [46]:
email_prog = re.compile(r'(\w+)@([a-zA-Z0-9]+\.)+[a-zA-Z]+$') # note how we surrounded the user pattern in parentheses

The `group` method returns the subgroups of the match. `group(0)` returns the entire match. `group(1)` returns the first subgroup. `group(2)` returns the second subgroup, and so on.

In [47]:
email_prog.match(addr_list[0]).group(1)

'someone'

In [48]:
email_prog.match(addr_list[0]).group(2)

'somewhere.'

In [49]:
[email_prog.match(addr).group(1) for addr in addr_list if email_prog.match(addr)] # extract users from valid email addresses

['someone', 'prof', 'student', 'a_user']
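
One subtlety explains the `'somewhere.'` result above: when a parenthesized group repeats, it keeps only its last repetition. The sketch below shows this on the multi-level address. It then tries a hypothetical variant pattern, `whole_server_prog` (not part of the lecture), that wraps the whole server-and-domain part in one capturing group and uses `(?:...)` for the repeated piece so that it does not capture.

```python
import re

email_prog = re.compile(r'(\w+)@([a-zA-Z0-9]+\.)+[a-zA-Z]+$')
m = email_prog.match('a_user@dept.university.edu')

user = m.group(1)
last_piece = m.group(2)   # only the last repetition, not 'dept.'
all_groups = m.groups()   # tuple of all subgroups, starting at group 1

print(user)
print(last_piece)

# Hypothetical variant: capture everything after the '@' as one group.
whole_server_prog = re.compile(r'(\w+)@((?:[a-zA-Z0-9]+\.)+[a-zA-Z]+)$')
server = whole_server_prog.match('a_user@dept.university.edu').group(2)

print(server)
```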

# Dates and times

In [50]:
from datetime import date

In [51]:
now = date.today()

In [52]:
next_new_year_day = date(now.year+1, 1, 1)

In [53]:
print next_new_year_day

2016-01-01


In [54]:
print "There are %d days until the next New Year's Day." % (next_new_year_day - now).days

There are 104 days until the next New Year's Day.


In [55]:
type(next_new_year_day - now) # difference of two dates gives a timedelta object

datetime.timedelta
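
Since `date.today()` changes from day to day, here is the same arithmetic with fixed dates so that the result is reproducible. The starting date is a hypothetical stand-in for "today". The sketch also shows the reverse operation: adding a `timedelta` to a `date` gives another `date`.

```python
from datetime import date, timedelta

start = date(2015, 9, 19)     # hypothetical "today", fixed for reproducibility
new_year = date(2016, 1, 1)

gap = new_year - start        # subtracting two dates gives a timedelta
days_left = gap.days

# Adding a timedelta to a date gives another date.
one_week_later = start + timedelta(days=7)
round_trip = start + gap

print(days_left)
print(one_week_later)
```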

# Performance measurement

In [56]:
from timeit import Timer

In [57]:
map_and_lambda_timer = Timer('map(lambda x: x**2, mylist)','mylist = range(10)') # args are code to be timed, set up code

In [58]:
list_comprehension_timer = Timer('[x**2 for x in mylist]','mylist = range(10)')

In [59]:
map_and_lambda_timer.timeit()

1.9929089546203613

In [60]:
list_comprehension_timer.timeit()

1.224012851715088

By default, `timeit()` repeats the code a million times. But we can supply an argument.

In [61]:
map_and_lambda_timer.timeit(2*10**6) # time 2 million executions

3.8529388904571533

In [62]:
list_comprehension_timer.timeit(2*10**6) # list comprehension is a little bit faster

2.458868980407715

In [63]:
map_and_builtin_timer = Timer('map(abs, mylist)','mylist = range(-5,5)') # no lambda function, using built-in abs

In [64]:
list_comp_builtin_timer = Timer('[abs(x) for x in mylist]','mylist = range(-5,5)') # same setup as the map version, for a fair comparison

In [65]:
map_and_builtin_timer.timeit()

0.9781341552734375

In [66]:
list_comp_builtin_timer.timeit()

1.4141829013824463

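Now the list comprehension is slower! With a built-in function such as `abs`, `map` no longer pays for a Python-level `lambda` call on every element.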
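
Single timings like the ones above are noisy, so a common practice is to repeat the measurement and keep the smallest value. The sketch below does this with `Timer.repeat()`. The repeat count and the number of executions are arbitrary choices for illustration. The `list()` wrappers are there only so the sketch behaves the same under Python 3, where `range` and `map` are lazy.

```python
from timeit import Timer

setup = 'mylist = list(range(-5, 5))'
map_timer = Timer('list(map(abs, mylist))', setup)
comp_timer = Timer('[abs(x) for x in mylist]', setup)

# repeat(r, n) calls timeit(n) r times and returns a list of r timings.
map_times = map_timer.repeat(3, 10000)
comp_times = comp_timer.repeat(3, 10000)

# The minimum is usually the run least disturbed by other activity.
best_map = min(map_times)
best_comp = min(comp_times)

print(best_map)
print(best_comp)
```

Which version wins depends on the Python version and the machine, so it is worth re-running the comparison rather than relying on the numbers shown in this lecture.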