# Default arguments

In [8]:
def default_arg(x, exponent=2):
    """Function with a default argument.
    Note how the name of the argument explains what it does.
    Always prefer meaningful names."""
    return x**exponent

In [3]:
default_arg(2) # can be called without the second argument

4

In [4]:
default_arg(2, 3)

8

In [9]:
# you can type out the argument name making
# the call more readable
default_arg(2, exponent = 5)

32

## Returning more than one thing

In [1]:
def return_two():
    return 1, 2 # will automatically turn into a tuple

In [11]:
return_two()

(1, 2)

In [13]:
foo = 12
bar = 42

## Assigning more than one thing

In [14]:
foo, bar = bar, foo

In [15]:
foo, bar

(42, 12)

In [2]:
one, two, three = range(1, 4) # lists work as well

In [17]:
one, two, three

(1, 2, 3)
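Returning several values and unpacking them combine naturally: a function that returns a tuple can be unpacked directly at the call site. A minimal sketch; `min_max` is a hypothetical helper, not part of the original notebook.

```python
def min_max(values):
    """Return the smallest and the largest element as a tuple."""
    return min(values), max(values)

# unpack the returned tuple into two names in one step
lowest, highest = min_max([3, 1, 4, 1, 5])
# lowest is 1, highest is 5
```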

# Dictionaries

Dictionaries, or `dict`s for short, are one of the cornerstones of Python. You create them with curly brackets, `{` and `}`.

In [4]:
my_dict = {1000: 'a', 1024: 'b'}
my_dict

{1000: 'a', 1024: 'b'}

In [5]:
# Dict elements are retrieved by key, using square brackets as with list indices.
my_dict[1000]

'a'

In [21]:
my_dict.keys() # retrieve only the keys

[1000, 1024]

In [6]:
my_dict.values() # retrieve the values

['a', 'b']

In [7]:
# keys and values are always in sync
my_dict.values()[my_dict.keys().index(1000)] == my_dict[1000]

True

In [8]:
# dict values can be nearly anything
my_dict[500] = []

In [24]:
my_dict

{500: [], 1000: 'a', 1024: 'b'}

Dict **keys** need to be hashable.

In [25]:
hash(12) # hashable

12

In [26]:
hash(2.3) # hashable

2523358617

In [27]:
hash('hello') # hashable

840651671246116861

In [28]:
hash((1, 'foo')) # hashable

6617818165100668875

In [29]:
hash([1, 'foo']) # _not_ hashable

TypeError: unhashable type: 'list'

In [30]:
hash({'foo': 12}) # _not_ hashable

TypeError: unhashable type: 'dict'

Dicts can emulate a sort of `switch` statement that Python lacks.

In [35]:
fns = {'sum': sum,
       'len': len}

In [36]:
fns['sum']([1,2,3])

6

In [38]:
fns['len']([1,2,3])

3

In [9]:
# remove and return the value corresponding
# to the key 500 (an integer, not the string '500')
my_dict.pop(500)

[]

In [10]:
my_dict # doesn't contain the pair (500, []) anymore

{1000: 'a', 1024: 'b'}

## Avoiding errors

Element access with square brackets (e.g. `my_dict['foo']`) raises a `KeyError` when the key (`'foo'` in this example) is not found. The methods `get` and `setdefault` are here to help.

In [43]:
my_dict['foo']

KeyError: 'foo'

In [11]:
# return "Not Here!" if key not found.
my_dict.get('foo', "Not Here!")

'Not Here!'

In [46]:
# still the same
my_dict

{1000: 'a', 1024: 'b'}

In [14]:
# return 5*5 = 25 if key 5 not found.
# in addition, add the key and value to the dict
my_dict.setdefault(5, 5*5)

25

In [76]:
my_dict # now contains (5, 25)

{5: 25, 1000: 'a', 1024: 'b'}

In [15]:
# setdefault calls to a key that's already contained 
# in the dict will return the previously stored value
# and _not_ modify the dictionary
my_dict.setdefault(5, "nope")

25

In [16]:
my_dict

{5: 25, 1000: 'a', 1024: 'b'}

The `in` keyword can be used to check a dict for presence of a key.

In [17]:
5 in my_dict

True
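A common use of `setdefault` is grouping items without checking for the key first. The sketch below groups words by their first letter; the word list and the name `by_letter` are made up for illustration.

```python
words = ['apple', 'avocado', 'banana', 'blueberry', 'cherry']

by_letter = {}
for word in words:
    # create an empty list the first time a letter is seen,
    # then append to whichever list is stored under that key
    by_letter.setdefault(word[0], []).append(word)

# by_letter == {'a': ['apple', 'avocado'],
#               'b': ['banana', 'blueberry'],
#               'c': ['cherry']}

# get only reads; it never adds the key to the dict
missing = by_letter.get('z', [])   # [] and 'z' is still not a key
```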

# Dict comprehensions

Dict comprehensions work just like list comprehensions.

In [50]:
[i*2 for i in [1,2,3]]

[2, 4, 6]

In [51]:
{i: i**2 for i in [2,3,4]}

{2: 4, 3: 9, 4: 16}
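As with list comprehensions, a dict comprehension can filter with an `if` clause, and it can iterate over the items of another dict. A small sketch with made-up data:

```python
prices = {'apple': 3, 'banana': 1, 'cherry': 7}

# keep only the entries whose value exceeds 2
expensive = {name: price for name, price in prices.items() if price > 2}
# expensive == {'apple': 3, 'cherry': 7}

# swap keys and values (only safe when the values are unique and hashable)
by_price = {price: name for name, price in prices.items()}
# by_price == {3: 'apple', 1: 'banana', 7: 'cherry'}
```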

# Special dictionaries

The `dict` subclasses `defaultdict` and `Counter` are a great way of keeping count of things, e.g. the number of occurrences of each word in a text.

In [20]:
from collections import defaultdict, Counter

The `defaultdict` class takes a function that accepts no arguments and returns the default **value** for keys that are missing. The built-in `int` is one such example (in Python, `int` is both a type and a callable that creates integers).

In [53]:
int

int

In [54]:
int()

0

By the way, `float` and other types work very similarly.

In [55]:
float()

0.0

In [21]:
# if a key is not present, use the
# int function to create its value
count_dict = defaultdict(int)

In [22]:
# adds the key 'apple' with value int()
# and returns the value
count_dict['apple']

0

In [23]:
count_dict

defaultdict(int, {'apple': 0})

In [58]:
count_dict['orange'] += 1

In [60]:
count_dict

defaultdict(int, {'apple': 0, 'orange': 1})

## Counter

The `Counter` class is initialized with anything iterable (e.g. lists, tuples, etc.) and can give you quick answers to questions like 'what are the 10 most used words in this document'.

In [61]:
counter = Counter([1,2,2,2,3,3,5])

In [62]:
counter

Counter({1: 1, 2: 3, 3: 2, 5: 1})

In [63]:
counter.most_common(2)

[(2, 3), (3, 2)]
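To connect this to the word-counting question above, here is a minimal sketch that counts the words of a short made-up sentence, once with `defaultdict(int)` and once with `Counter`. Splitting on whitespace is a simplification; real text would also need punctuation and capitalization handling.

```python
from collections import defaultdict, Counter

text = "the quick brown fox jumps over the lazy dog the end"
words = text.split()

# manual counting with defaultdict: missing keys start at int() == 0
manual_counts = defaultdict(int)
for word in words:
    manual_counts[word] += 1

# Counter does the same in one call
word_counts = Counter(words)

# both approaches agree on every count
same = dict(manual_counts) == dict(word_counts)   # True

# the single most frequent word together with its count
top = word_counts.most_common(1)                  # [('the', 3)]
```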

# Gotchas

We'll now see some common pitfalls.

In [64]:
def add_one(some_list):
    some_list.append(1)

In [65]:
my_list = []

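In Python, a function receives references to the objects passed in. This is often loosely called ['passing by reference'][pbr], although strictly speaking it is call by sharing. Mutable objects such as lists and dicts can therefore be modified inside a function, and the caller sees the change. Immutable objects such as `int`, `float`, and `str` cannot be changed in place, so the caller's value stays the same.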

[pbr]: https://en.wikipedia.org/wiki/Evaluation_strategy#Call_by_reference

In [66]:
add_one(my_list)

In [67]:
my_list

[1]
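The difference between mutating an argument and rebinding the parameter name is easy to miss. The sketch below (all function names are illustrative) suggests that only in-place changes are visible to the caller.

```python
def append_one(some_list):
    # mutates the object the caller passed in
    some_list.append(1)

def rebind(some_list):
    # rebinds the local name only; the caller's list is untouched
    some_list = some_list + [1]
    return some_list

original = []
append_one(original)        # original is now [1]
result = rebind(original)   # result is [1, 1], original is still [1]

def increment(number):
    # ints are immutable, so += creates a new object
    number += 1
    return number

n = 5
m = increment(n)            # m is 6, n is still 5
```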

Default values for function parameters are created **just once**. Be careful when modifying them.

In [69]:
def default_list(li = []):
    li.append(1)
    return li

In [70]:
default_list()

[1]

In [71]:
default_list()

[1, 1]

In [72]:
# if you need complex default values, create them like so:
def better_list(li = None):
    if li is None:
        li = []
    li.append(1)
    return li

In [73]:
better_list(), better_list(), better_list()

([1], [1], [1])

In [74]:
my_dict[[1,2]]

TypeError: unhashable type: 'list'

In [75]:
# Break question: can functions be
# dictionary keys? Yes, they can!
{sum: 5}

{<function sum>: 5}

# Argument lists

In [86]:
# functions can have an arbitrary number of arguments
def print_many(*args):
    for i in args:
        print i

In [87]:
print_many(1,4,'foo')

1
4
foo


To better understand the example below, note that you can call the sum function like this:

In [24]:
sum([1,2,3])

6

But not like this:

In [25]:
sum(1,2,3)

TypeError: sum expected at most 2 arguments, got 3

In [81]:
def my_sum(*args):
    # args is a tuple in here
    return sum(args)

In [82]:
my_sum(1)

1

Now, `my_sum` *can* be called like in the **second** example above.

In [83]:
my_sum(1,2,3,4,5,6)

21

In [84]:
range(10)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Lists can be turned into many arguments with a star in the function call.

In [85]:
my_sum(*range(10))

45

## Why?

Argument lists are most often used to pass arguments on to *another* function.

In [26]:
def apply_to_many(fn, *args):
    return fn(args)

In [91]:
apply_to_many(sum, 1, 2, 3)

6

In [27]:
# in the same way, functions can have arbitrary
# _named_ arguments, i.e. arguments in the form
#  f(foo = 1, bar = 12)
def many_named_args(**args):
    # args is a dictionary
    print args

In [94]:
many_named_args(arg1=42, arg2=9, name="James")

{'arg1': 42, 'arg2': 9, 'name': 'James'}
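The counterpart to the single star in a function call is the double star, which turns a dictionary into named arguments. A minimal sketch; `describe` and `params` are hypothetical names.

```python
def describe(name, age):
    return "{0} is {1} years old".format(name, age)

params = {'name': 'James', 'age': 35}

# equivalent to describe(name='James', age=35)
sentence = describe(**params)
# sentence == 'James is 35 years old'
```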


In [28]:
# just like argument lists, this technique is often used
# to pass arguments on to another function
def wrapped(fn, *args, **kwargs):
    if fn == sum or fn == len:
        return fn(*args, **kwargs)
    else:
        return None

In [96]:
wrapped(sum, [1,2,3])

6

In [97]:
wrapped(int, 3, key=5) # returns None, since int is neither sum nor len

# Decorators

Argument lists and dictionaries can be used to build new functions from existing ones. Python provides a beautiful syntax for this called *decorators*.

In [29]:
def cached(fn):
    """Decorator to cache the result of function calls."""
    result_cache = {}
    def inner(*args):
        print result_cache # for educational purposes
        # HOMEWORK: replace the lines below with _one_ call
        #           to dict.setdefault
        if args in result_cache:
            return result_cache[args]
        else:
            result = fn(*args)
            result_cache[args] = result
            return result
    return inner

The function above can be used to build a new function from an existing one.

In [101]:
my_cached_sum = cached(my_sum)

In [102]:
my_cached_sum(1,2,3)

{}


6

In [103]:
my_cached_sum(1,2,3)

{(1, 2, 3): 6}


6

The two cells below are equivalent, but the upper one has a nicer syntax. The `@cached` line is called a *decorator*.

In [104]:
@cached
def my_cached_sum(*args):
    return sum(args)

In [105]:
def my_cached_sum(*args):
    return sum(args)

my_cached_sum = cached(my_cached_sum)
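One side effect of decorating is that the new function reports the name and docstring of `inner` rather than those of the original function. The standard library's `functools.wraps` copies them over. The sketch below is a variant of `cached` without the debugging print; `cached_quietly`, `add`, and `calls` are names chosen for illustration.

```python
import functools

def cached_quietly(fn):
    """Cache results of fn, keyed by its positional arguments."""
    result_cache = {}

    @functools.wraps(fn)  # copy fn's name and docstring onto inner
    def inner(*args):
        if args not in result_cache:
            result_cache[args] = fn(*args)
        return result_cache[args]
    return inner

calls = []  # records every real (uncached) call

@cached_quietly
def add(*args):
    """Add all arguments."""
    calls.append(args)
    return sum(args)

add(1, 2, 3)   # computed, returns 6
add(1, 2, 3)   # served from the cache, returns 6
# len(calls) == 1 and add.__name__ == 'add'
```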

# More on reading .csv

The Python `csv` module makes reading data from `csv` files much easier.

In [106]:
import csv

In [107]:
data = []
with open('data/trends.csv') as trends:
    reader = csv.DictReader(trends)
    for i in reader:
        data.append(i)

In [108]:
data[:5]

[{'Week': '2011-12-04',
  'big data': '14',
  'data science': '9',
  'machine learning': '18'},
 {'Week': '2011-12-11',
  'big data': '13',
  'data science': '9',
  'machine learning': '17'},
 {'Week': '2011-12-18',
  'big data': '13',
  'data science': '6',
  'machine learning': '15'},
 {'Week': '2011-12-25',
  'big data': '11',
  'data science': '5',
  'machine learning': '12'},
 {'Week': '2012-01-01',
  'big data': '15',
  'data science': '8',
  'machine learning': '13'}]
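Note that `DictReader` returns every value as a string. The sketch below converts the counts to integers. To stay self-contained it reads from an in-memory string instead of `data/trends.csv`; the two rows are copied from the output above, while the column order in the header is an assumption. It is written for Python 3, where `io.StringIO` accepts ordinary strings.

```python
import csv
import io

# stand-in for open('data/trends.csv'); column order is assumed
raw = io.StringIO(
    "Week,big data,data science,machine learning\n"
    "2011-12-04,14,9,18\n"
    "2011-12-11,13,9,17\n"
)

rows = []
for row in csv.DictReader(raw):
    # keep the date as text, convert every other column to int
    rows.append({key: (value if key == 'Week' else int(value))
                 for key, value in row.items()})

# rows[0]['big data'] == 14
# total for 'machine learning' over both weeks: 18 + 17 == 35
ml_total = sum(row['machine learning'] for row in rows)
```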

# JSON

JavaScript Object Notation, or `JSON`, is a standard format used all over the web to pass around data in a structured way. It looks very similar to Python dictionaries and lists.

In [31]:
import json

In [32]:
# make python objects out of a string
json.loads("""
{"foo": 12,
 "bar": [1,2,5]}""")

{u'bar': [1, 2, 5], u'foo': 12}

In [34]:
data = json.loads("""
{"foo": 12,
 "bar": [1,2,5]}""")

In [113]:
data, type(data)

({u'bar': [1, 2, 5], u'foo': 12}, dict)

In [35]:
# make a valid JSON string out of a python object
print json.dumps(data, indent=2)

{
  "foo": 12, 
  "bar": [
    1, 
    2, 
    5
  ]
}
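A round trip through `dumps` and `loads` gives back an equal object for dicts, lists, strings, and numbers, but not every Python type survives unchanged. A short sketch with made-up data:

```python
import json

record = {"name": "James", "scores": [1, 2, 5], "active": True, "rank": None}

# dict -> JSON string -> dict
restored = json.loads(json.dumps(record))
# restored == record

# tuples have no JSON equivalent and come back as lists
pair = json.loads(json.dumps((1, 2)))
# pair == [1, 2]

# JSON object keys are always strings
keyed = json.loads(json.dumps({1000: 'a'}))
# keyed == {'1000': 'a'}
```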


# Objects

Objects are an important concept in modern programming languages. Objects are *instances* of classes. There can be many instances of any given class.

In [38]:
# empty class
class Person(object):
    pass

In [39]:
# create an object, i.e. an _instance_ of that class
kirk = Person()

In [119]:
kirk

<__main__.Person at 0x10b57a350>

In [40]:
# set properties
kirk.firstname = "James"

In [122]:
kirk.middlename = "Tiberius"

In [123]:
kirk.lastname = "Kirk"

In [124]:
# access properties
kirk.firstname

'James'

In [125]:
# create another instance
spock = Person()

In [126]:
# spock has no lastname attribute, but kirk does
spock.lastname

AttributeError: 'Person' object has no attribute 'lastname'

It is usually not desirable to have objects of the same class having different attributes.

In [41]:
# set a single attribute
class BetterPerson(object):
    is_better = True # attribute

In [128]:
guy = BetterPerson()

In [129]:
guy.is_better

True

In [130]:
guy.is_better = False # change attribute

In [131]:
guy.is_better

False

In [132]:
other_guy = BetterPerson()

In [133]:
# changing the attribute on one object won't
# affect other objects' attributes
other_guy.is_better

True

The approach above gives every object of the class the **same** initial *value* for the attribute. Usually, we want an attribute to have a different value for each object, like the name of a person. To this end, we use the special `__init__` method, also known as the *constructor*.

In [134]:
class EvenBetterPerson(object):
    def __init__(self, firstname, lastname):
        self.firstname = firstname
        self.lastname = lastname

In [135]:
kirk_v2 = EvenBetterPerson("James T.", "Kirk")

In [136]:
kirk_v2.firstname, kirk_v2.lastname

('James T.', 'Kirk')

In [137]:
EvenBetterPerson() # can't create without providing info

TypeError: __init__() takes exactly 3 arguments (1 given)

## Inheritance

If we want to add functionality to a class, we could just copy and paste.

In [142]:
class EvenBetterPersonWithPrint(object):
    def __init__(self, firstname, lastname):
        self.firstname = firstname
        self.lastname = lastname
    def print_me(self):
        #print "Person: {me.lastname}, {me.firstname}".format(me=self)
        print "Person: " + self.lastname + ", " + self.firstname

In [143]:
p = EvenBetterPersonWithPrint("Stephen", "Hawking")
p.print_me()

Person: Hawking, Stephen


In [144]:
p.firstname

'Stephen'

In [145]:
p.lastname

'Hawking'

In [146]:
p_prime = p

In [147]:
p_prime.lastname, p_prime.firstname

('Hawking', 'Stephen')

In [148]:
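# classroom question: How do I delete things?
# del removes the name p_prime; the object itself stays
# alive as long as another name (here p) refers to it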
del p_prime

In [149]:
p_prime

NameError: name 'p_prime' is not defined

Adding functionality can also be achieved by letting a class *inherit* from a base class.

In [154]:
class PersonWithFullName(EvenBetterPersonWithPrint):
    # PersonWithFullName will have all the attributes and
    # methods of EvenBetterPersonWithPrint, plus everything
    # defined in here
    def get_full_name(self):
        return self.firstname + " " + self.lastname

In [155]:
# the constructor still works
p = PersonWithFullName("Donald", "Trump")

In [156]:
# so does the print_me method
p.print_me()

Person: Trump, Donald


In [157]:
# additional functionality
p.get_full_name()

'Donald Trump'

In [158]:
type(p)

__main__.PersonWithFullName

In [159]:
# classroom question (and answer):
# How do I know whether I'm dealing with an instance of a subclass?
isinstance(p, EvenBetterPersonWithPrint)

True
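A subclass can also extend the constructor and override methods of its base class. The sketch below uses simplified stand-in classes (`BasePerson`, `Officer`, and the `rank` attribute are made up for this example) and calls the base class through `super`. The two-argument form of `super` is used because it works in both Python 2 and Python 3.

```python
class BasePerson(object):
    def __init__(self, firstname, lastname):
        self.firstname = firstname
        self.lastname = lastname

    def describe(self):
        return "Person: " + self.lastname + ", " + self.firstname


class Officer(BasePerson):
    def __init__(self, firstname, lastname, rank):
        # let the base class set firstname and lastname
        super(Officer, self).__init__(firstname, lastname)
        self.rank = rank

    def describe(self):
        # override: reuse the base text and append the rank
        return super(Officer, self).describe() + " (" + self.rank + ")"


officer = Officer("James", "Kirk", "Captain")
officer.describe()                 # 'Person: Kirk, James (Captain)'
isinstance(officer, BasePerson)    # True
issubclass(Officer, BasePerson)    # True
```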

# Web scraping

Scraping (sometimes: *crawling*) is a great way of retrieving information from the internet in a structured way. But you can also do a lot of harm.

- Be nice.
- Follow the rules.
- Read the terms and conditions.
- Read the robots.txt.
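Checking robots.txt can be automated. The sketch below feeds a made-up rule set to the standard library's robots.txt parser so that it runs without network access; for a real site you would point the parser at the actual file with `set_url` and `read` instead. The import path is the Python 3 one (in Python 2 the module is called `robotparser`), and the bot name and URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# made-up rules: every crawler must stay out of /private/
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

allowed = parser.can_fetch("my-bot", "http://example.com/index.html")
blocked = parser.can_fetch("my-bot", "http://example.com/private/data.html")
# allowed is True, blocked is False
```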

## HTML

HTML is the language of websites.

In [44]:
html = open('example.html').read()

In [163]:
print html

<html>
    <head>
        <title>Test page</title>
    </head>
    <body>
        <h1 id="title">This is an example page</h1>
        <h2 id="list-header">This is a list of temperatures</h2>
        <ul>
            <li>2017-01-20: 5&deg;</li>
            <li>2017-01-21: 7&deg;</li>
            <li>2017-01-22: 4&deg;</li>
            <li>2017-01-23: 2&deg;</li>
        </ul>
        <h2>Links</h2>
        This is a <a href="example2.html">link</a> to a second page.
        <h2>Image</h2>
        This is a histogram (whose axes should be labeled!).
        <img src="histogram.png" width="50%" alt="Histogram of something.">
    </body>
</html>


Learn more about HTML in the [W3Schools HTML tutorial](http://www.w3schools.com/html/).

In [42]:
# To extract information from HTML pages easily,
# we will use BeautifulSoup
from bs4 import BeautifulSoup

In [45]:
soup = BeautifulSoup(html, 'lxml')

In [166]:
# body tag
soup.body

<body>\n<h1 id="title">This is an example page</h1>\n<h2 id="list-header">This is a list of temperatures</h2>\n<ul>\n<li>2017-01-20: 5\xb0</li>\n<li>2017-01-21: 7\xb0</li>\n<li>2017-01-22: 4\xb0</li>\n<li>2017-01-23: 2\xb0</li>\n</ul>\n<h2>Links</h2>\n        This is a <a href="example2.html">link</a> to a second page.\n        <h2>Image</h2>\n        This is a histogram (whose axes should be labeled!).\n        <img alt="Histogram of something." src="histogram.png" width="50%"/>\n</body>

In [167]:
# head tag
soup.head

<head>\n<title>Test page</title>\n</head>

In [168]:
# _first_ list item
soup.li

<li>2017-01-20: 5\xb0</li>

In [169]:
# find _all_ list items
soup('li')

[<li>2017-01-20: 5\xb0</li>,
 <li>2017-01-21: 7\xb0</li>,
 <li>2017-01-22: 4\xb0</li>,
 <li>2017-01-23: 2\xb0</li>]

In [46]:
# get the _first_ list
soup.ul

<ul>\n<li>2017-01-20: 5\xb0</li>\n<li>2017-01-21: 7\xb0</li>\n<li>2017-01-22: 4\xb0</li>\n<li>2017-01-23: 2\xb0</li>\n</ul>

In [171]:
# first li tag in the first ul tag
soup.ul.li

<li>2017-01-20: 5\xb0</li>

In [172]:
soup.li.text # get the text

u'2017-01-20: 5\xb0'

In [173]:
# get the 'src' attribute of the first image tag
soup.img['src']

'histogram.png'

In [174]:
# get the tag name
soup.img.name

'img'

In [175]:
# get the tag name of the image's parent tag
soup.img.parent.name

'body'

In [47]:
# make a list from the list's child tags
list(soup.ul.children)

[u'\n',
 <li>2017-01-20: 5\xb0</li>,
 u'\n',
 <li>2017-01-21: 7\xb0</li>,
 u'\n',
 <li>2017-01-22: 4\xb0</li>,
 u'\n',
 <li>2017-01-23: 2\xb0</li>,
 u'\n']

In [179]:
soup('h2') # find all second-level headlines

[<h2 id="list-header">This is a list of temperatures</h2>,
 <h2>Links</h2>,
 <h2>Image</h2>]

In [180]:
# find (potentially) all second-level headers
# with id 'list-header', though ids should be unique
soup('h2', {'id': 'list-header'})

[<h2 id="list-header">This is a list of temperatures</h2>]

In [182]:
# you'd use this to follow all links on a page ...
soup('a')

[<a href="example2.html">link</a>]

# Scrapy

You will find the scraping examples at https://github.com/dhesse/stk_inf_scraping.

In [183]:
# the urlparse module is a useful tool to parse URLs;
# its urljoin function resolves a link against a base URL
from urlparse import urljoin

In [186]:
urljoin('http://localhost:8888/files/example.html', 'example2.html')

'http://localhost:8888/files/example2.html'
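Links found with `soup('a')` are often relative, so `urljoin` is what turns them into addresses a crawler can request. A minimal sketch with made-up link targets; the `try`/`except` import is there because the function lives in `urllib.parse` on Python 3.

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

page = 'http://localhost:8888/files/example.html'

# hrefs as they might appear in <a> tags on that page
hrefs = ['example2.html', '/index.html', 'http://example.com/other.html']

absolute = [urljoin(page, href) for href in hrefs]
# ['http://localhost:8888/files/example2.html',
#  'http://localhost:8888/index.html',
#  'http://example.com/other.html']
```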