### Python Data Structures, Functions and Flow

In the Introduction to Python lecture we covered a wide range of Python, but in places we skipped over the details of what was going on. Today we will reinforce the topics from the last lecture and continue to build a deeper understanding of Python programming.

### Data Types

'Base' Python has a variety of 'atomic' data types, and as we saw, it treats them differently.

If you have ever used Python 2, you have probably learned the difference between a float and an int the hard way ([Learn Python the Hard Way](https://learnpythonthehardway.org/) is a good, albeit controversial, Python resource):

In [1]:
#integer division
#in python 2, this is how the standard / behaves on two integers
2//3

0
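
For comparison, here is a small sketch of how the two division operators behave in Python 3. The variable names are only for illustration:

```python
# True division always returns a float in Python 3
true_div = 2 / 3
# Floor division rounds down to a whole number
floor_div = 2 // 3
# "Down" means towards negative infinity, not towards zero
neg_floor_div = -7 // 2

print(true_div)       # 0.6666666666666666
print(floor_div)      # 0
print(neg_floor_div)  # -4
```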

We have three major types of numbers. 

1) Complex numbers, which contain `i`, the square root of -1. Python writes it as `j`, following the engineering convention. We probably won't touch complex numbers again:

In [2]:
complex(1, 2)

(1+2j)

2) Integers, which have arbitrary precision in Python 3:

In [3]:
12345678999935345235345345344534234353958712981249879235829352335987

12345678999935345235345345344534234353958712981249879235829352335987

3) Floats, which are double-precision floating point numbers, much like the floats in SQL:

In [4]:
123.456

123.456

Between these types, we can carry out basic math, which will mostly follow what we expect:

In [5]:
9 * 1.25

11.25

In [6]:
for i in [1,2,3.5,4.5]:
    print(i + 2)

3
4
5.5
6.5


We do have floating point issues in some cases:


In [7]:
-0.1 + 0.2 -0.3

-0.19999999999999998
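
One common way to live with this is to compare floats within a tolerance rather than exactly. The sketch below uses `math.isclose` and the `decimal` module from the standard library; the variable names are made up for the example:

```python
import math
from decimal import Decimal

total = -0.1 + 0.2 - 0.3

# Exact equality fails because of the rounding error shown above
print(total == -0.2)              # False
# math.isclose compares within a relative tolerance instead
print(math.isclose(total, -0.2))  # True

# Decimal does exact base-10 arithmetic when built from strings
exact = Decimal('-0.1') + Decimal('0.2') - Decimal('0.3')
print(exact)                      # -0.2
```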

### Booleans

Booleans, `True` and `False`, behave as we have seen. They are treated as the numeric values 1 and 0 during math:

In [8]:
True + 5

6

In [9]:
False + 5

5

We can combine them using `and`, `or` and `not`:

In [10]:
True and True
True or False
not True

False

We have the standard &, |, and ^ which work fine on booleans:

In [11]:
print(True & True)
print(True | False)
print(True^False)

True
True
True


However, these are bitwise operators, so they may not do what you expect on other types. For example, `^` on integers is XOR, not exponentiation:

In [12]:
12 ^ 9 # exponentiation is **

5
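
To make the difference concrete, here is a short sketch comparing the logical keywords with the bitwise operators on the same two integers:

```python
# `and` / `or` work on truthiness and return one of their operands
print(12 and 9)  # 9
print(12 or 9)   # 12

# `&`, `|` and `^` combine the individual bits of each integer
print(12 & 9)    # 8   (0b1100 & 0b1001 == 0b1000)
print(12 | 9)    # 13  (0b1100 | 0b1001 == 0b1101)
print(12 ^ 9)    # 5   (0b1100 ^ 0b1001 == 0b0101)

# Exponentiation is written with ** instead
print(12 ** 2)   # 144
```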

We also have the value `None` which denotes a missing piece of data:

In [13]:
None
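
A typical place to meet `None` is as the return value of a function that found nothing. The helper below is hypothetical and only there to illustrate the usual `is None` check:

```python
def find_first_negative(values):
    # Hypothetical helper: returns the first negative number, or None
    for v in values:
        if v < 0:
            return v
    return None

result = find_first_negative([3, 1, 4])

# `is None` is the idiomatic way to test for a missing value
print(result is None)  # True
```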

### Strings

We have seen some strings and string operations already:

In [14]:
mystr = 'abcdefghij'

print(mystr + mystr)
print(mystr*3)
print(mystr[0:2])
print(mystr/3)

abcdefghijabcdefghij
abcdefghijabcdefghijabcdefghij
ab


TypeError: unsupported operand type(s) for /: 'str' and 'int'

In addition to basic operations, there are a wide variety of string methods:

In [15]:
print(mystr.upper())
print(mystr.capitalize())
print(mystr.find('a'))
print(mystr.index('h'))
print(mystr[::-1])

ABCDEFGHIJ
Abcdefghij
0
7
jihgfedcba


Strings can contain special characters, written as an escape sequence starting with a backslash. The most common are `'\n'` for a newline and `'\t'` for a tab. We can also use a `\` to escape a character such as a quote, and triple quotes to write a string over multiple lines (like we did in our docstrings).

In [16]:
x = 'abc\ndef'
y = 'abc\tdef'
print(x,y)
x = '''
my long 
string
with lines and stuff'''
print(x)

abc
def abc	def

my long 
string
with lines and stuff


There are a few other basic data types we will not touch, so for now just know that we can represent binary and hex numbers if we really need to.
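
As a quick illustration of the binary and hex representations, using only built-ins:

```python
# Integer literals can be written in binary (0b) or hexadecimal (0x)
print(0b1010)         # 10
print(0xff)           # 255

# bin() and hex() turn an integer into a string in that base
print(bin(10))        # 0b1010
print(hex(255))       # 0xff

# int() can parse a string in a given base
print(int('ff', 16))  # 255
```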

### Containers 

We have seen lists and tuples previously.

Lists are mutable containers for data:

In [17]:
mylist = [1,True,3,'4']
mylist.append(5)
print(mylist)
out = mylist.pop()
mylist[3] = 8
print(out)
print(mylist)
#see the tab completion for the plethora of methods

[1, True, 3, '4', 5]
5
[1, True, 3, 8]


We can iterate over lists and other iterables:

In [18]:
for i in mylist:
    print(i)

1
True
3
8


Tuples are the immutable version of a list. They can hold any combination of data types, but cannot be modified once created. This severely reduces the number of methods available:

In [19]:
mytuple = (1,True,3,'4')
mytuple.count(3)

1
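
A short sketch of what "immutable" means in practice: assigning to an element fails, so we build a new tuple instead. The name `newtuple` is just for the example:

```python
mytuple = (1, True, 3, '4')

try:
    mytuple[3] = 8  # tuples do not support item assignment
except TypeError as err:
    print('TypeError:', err)

# To "change" a tuple we create a new one from pieces of the old one
newtuple = mytuple[:3] + (8,)
print(newtuple)  # (1, True, 3, 8)
```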

We also have sets and dicts, which serve different purposes.

While we can search for values in lists, Python has no better option than checking each element in turn:

In [20]:
x = list(range(10000))
y = set(range(10000))

In [21]:
%%timeit
for i in [10000, 99999, 10001, 5000, 100]:
    i in x

321 µs ± 2.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Sets don't allow duplicates or ordering (though watch this space!). Because their values are hashed, lookups are much faster than in a list:

In [22]:
%%timeit
for i in [10000, 99999, 10001, 5000, 100]:
    i in y

230 ns ± 6.75 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


Sets are like the mathematical notion of sets: we can take the union, intersection or difference. They are mutable, which means that we can add or remove objects from them. As they are unordered, indexing doesn't work:

In [23]:
myset = {1,2,3,4,5,1,2}
print(myset)
print(type(myset))
myset[0]

{1, 2, 3, 4, 5}
<class 'set'>


TypeError: 'set' object does not support indexing

In [24]:
print(myset.intersection({1,2,6,7,8}))
print(myset.union({1,2,6,7,8}))
print(myset.add(10))
print(myset)

{1, 2}
{1, 2, 3, 4, 5, 6, 7, 8}
None
{1, 2, 3, 4, 5, 10}
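
The same operations are also available as operators. In this sketch the results are passed through `sorted` only so that the printed order is predictable:

```python
a = {1, 2, 3, 4, 5}
b = {1, 2, 6, 7, 8}

print(sorted(a & b))  # intersection: [1, 2]
print(sorted(a | b))  # union: [1, 2, 3, 4, 5, 6, 7, 8]
print(sorted(a - b))  # difference: [3, 4, 5]
print(sorted(a ^ b))  # symmetric difference: [3, 4, 5, 6, 7, 8]
```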


As the contents of a set are hashed, a set cannot contain unhashable objects, which includes the mutable built-in containers. So a set can't contain lists, only tuples:

In [25]:
x = {(1,2,3),(4,5,6)}
y = {[1,2,3], [4,5,6]}

TypeError: unhashable type: 'list'
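
We can check hashability directly with the built-in `hash`. The sketch below also uses `frozenset`, an immutable set type we have not covered, purely to show that an immutable container can be nested inside a set:

```python
# hash() works on immutable built-ins such as tuples
print(hash((1, 2, 3)) == hash((1, 2, 3)))  # True

# but it raises a TypeError for a list
try:
    hash([1, 2, 3])
except TypeError as err:
    print('TypeError:', err)

# frozenset is an immutable set, so it can live inside another set
nested = {frozenset({1, 2}), frozenset({3, 4})}
print(frozenset({2, 1}) in nested)  # True
```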

### Dictionaries

Python dictionaries are effectively a mapping from a hashed key to another object. We can think of them as 'hash maps', lookup tables, or just a dict.

As of Python 3.7 they are guaranteed to keep insertion order (in CPython 3.6 this was an implementation detail), and their keys cannot be duplicated.

In [26]:
my_dict = {'a':[1,2,3,4], 'b':[5,6,7,8]}
print(my_dict['a'])

[1, 2, 3, 4]


In [27]:
my_dict.update({'c':[9,10,11,12]})
my_dict['d'] = [13,14,15,16]
print(my_dict)

{'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [9, 10, 11, 12], 'd': [13, 14, 15, 16]}


In [28]:
for i,j in my_dict.items():
    print(i)
    print(j)

a
[1, 2, 3, 4]
b
[5, 6, 7, 8]
c
[9, 10, 11, 12]
d
[13, 14, 15, 16]
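
A few more everyday dictionary operations, as a minimal sketch: testing for a key, looking up with a fallback, and what happens when we assign to a key that already exists:

```python
my_dict = {'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]}

# Membership tests look at the keys
print('a' in my_dict)        # True

# .get returns None (or a default we supply) instead of raising KeyError
print(my_dict.get('z'))      # None
print(my_dict.get('z', []))  # []

# Assigning to an existing key replaces its value; the key is not duplicated
my_dict['a'] = 'replaced'
print(list(my_dict))         # ['a', 'b']
```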


### Basic Data Types:

| Type       | Example     | Mutable | Ordered | Duplicates | Subsetting |
|------------|-------------|---------|---------|------------|------------|
| int        | 2           | NA      | NA      | NA         | NA         |
| float      | 2.5         | NA      | NA      | NA         | NA         |
| complex    | 2.5 + 0.6J  | NA      | NA      | NA         | NA         |
| boolean    | True        | NA      | NA      | NA         | NA         |
| string     | 'abcd'      | No      | Yes     | Yes        | x[1]       |
| list       | [1,2,3,'a'] | Yes     | Yes     | Yes        | x[1]       |
| tuple      | (1,2,3,'a') | No      | Yes     | Yes        | x[1]       |
| set        | {1,2,3,4}   | Yes     | No      | No         | No         |
| dictionary | {1:1,2:2}   | Yes     | Yes (insertion order) | No (keys) | x['key']   |


### Exercises

* The order of operations in python is the standard BODMAS that you learned in math class.

    Without evaluating, what would you expect the following to give us?

    ```
    9 + 5 * 2
    9 + 5 ** 2
    (9 + 5) ** 2
    ```

* Print a string that looks like:
    ```
    \n \n \n " ' " '
    ```
    To the console, using escapes, triple quotes, or any other method.
    

* Using methods, turn the string 'mississippi' to the list `['MI', 'I', 'IPPI']`

* I have the dict:

    ```
    my_dict = {('a','b','c'):[1,2,3,{'b':[[1],[2],[3],[[4],['xyz']]]}]}
    ```

    Subset out the letter 'z' using dict, list and string subsetting.

* We can cast between various data types, using the `int`, `str`, `float` and `bool` functions.

    What do you expect from the following conversions? Does it match what you get?
    ```
    int(True)
    int(3.5)
    int('a string')
    str(8)
    str(None)
    str(True)
    bool(10)
    bool(-1)
    bool(1.5)
    ```

* How does the `del` command work? Can you delete the third item from this list? Does it happen in place?

    ```
    mylist = [1,2,[],[34]]
    ```



### Statements 

We covered for loops:

In [29]:
for i in [1,2,3]:
    print(i)


1
2
3


In this loop, `i` is the loop variable. We could have given it any other name, as long as we referenced it properly inside the loop. After the loop, the variable `i` remains, and it overwrites any `i` that we were already using:

In [30]:
i = 5
for i in [1,2]:
    print(i)
    
print(i)

1
2
2


For this reason, it is typical to keep short names like i, j, k and x for loop variables, and to use descriptive names for everything else.

We can also use a `comprehension`, which is syntactic sugar and a very 'pythonic' way of writing simple loops:

In [31]:
[x * 5 for x in range(5)]

[0, 5, 10, 15, 20]

In a list comprehension we surround the statement in `[` and `]`, and give the iterable at the end.

We carry out the statement for each item in the iterable, setting it as the variable we gave, and get a list back out.

In [32]:
[str(i + j) for i,j in {"a":'b', 'c':'d'}.items()]

['ab', 'cd']
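
To see what the sugar is hiding, here is a sketch of the first comprehension, `[x * 5 for x in range(5)]`, written out as an ordinary loop. The name `result` is just for the example:

```python
# Start with an empty list and append once per item in the iterable
result = []
for x in range(5):
    result.append(x * 5)

print(result)                                # [0, 5, 10, 15, 20]
print(result == [x * 5 for x in range(5)])   # True
```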

Dictionary comprehensions create a dictionary:

In [33]:
{i: j for i,j in ((1,2),(3,4))}

{1: 2, 3: 4}

Ternary expressions are a shortcut for if/else:

In [34]:
a = 6
b = 7

a if a > b else b

7

And we can use `if` clauses and ternary expressions in list comprehensions to filter or alter our iterables:

In [35]:
print([x for x in range(10) if x%2 == 0])
print([x if x%2 == 0 else 5 for x in range(10)])

[0, 2, 4, 6, 8]
[0, 5, 2, 5, 4, 5, 6, 5, 8, 5]


We don't overwrite or alter any variables outside list comprehensions. Other than this they behave exactly like the equivalent for loops, and should be seen as `syntactic sugar`. Don't worry if you don't feel comfortable writing them for now - it takes practice.
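
The scoping difference can be checked directly. In this sketch the comprehension leaves the outer `x` alone, while the for loop overwrites it:

```python
x = 'outside'

doubled = [x * 2 for x in range(3)]
after_comprehension = x
print(doubled)              # [0, 2, 4]
print(after_comprehension)  # outside

# A plain for loop reuses the outer name and leaves it changed
for x in range(3):
    pass
print(x)                    # 2
```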

### Exercise

Rewrite the two list comprehensions above to be for loops with if/else clauses. Which do you find easier?

### Conditionals

We have seen if/else conditionals several times. There is another similar statement called `while`:

In [36]:
x = 0
while x < 5:
    print(x)
    x+=1

0
1
2
3
4


`while` works in a similar way to `if`. It evaluates the condition to its right and, if it is true, carries out its clause, then repeats until the condition is no longer true.

Loops are a little dangerous: it is very easy to send your program into an infinite loop with a poorly crafted conditional.
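
As an illustration, the condition `x != 5` below can never become false, because `x` only takes even values. One possible safeguard, sketched here with a made-up `max_iterations` limit, is to add a second condition that caps the number of iterations:

```python
x = 0
iterations = 0
max_iterations = 1000  # hypothetical safety limit

# Without the second condition this loop would never end:
# x takes the values 2, 4, 6, ... and is never equal to 5
while x != 5 and iterations < max_iterations:
    x += 2
    iterations += 1

print(iterations)  # 1000
```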

We can modify the control flow of our loops and clauses using the statements `break`, `pass` and `continue`:

In [37]:
mylist = [1,'num',3,4]

print('break')
for i in mylist:
    if type(i) == int:
        print(i)
    else:
        break
        print('hi')

print('pass')
for i in mylist:
    if type(i) == int:
        print(i)
    else:
        pass
        print('hi')

print('continue')
for i in mylist:
    if type(i) == int:
        print(i)
    else:
        continue
        print('hi')

break
1
pass
1
hi
3
4
continue
1
3
4


`break` stops the for loop, `pass` does nothing, and `continue` ends the current iteration and moves on to the next one.

### Functions

Python functions allow us to make reusable commands to carry out our analysis. We can define our own functions and use them in the same way as built-in or imported functions.

Let's look again at a simple function:

In [38]:
def my_adder(x,y):
    '''
    adds two numbers together and returns the output
    '''
    return x + y

print(my_adder(1,2))
print(my_adder('a','b'))

3
ab


That's probably not what we meant to happen! Recall that python looks at how an operation is implemented for a particular type of object, and applies it.

We might want our function to be a little more restrictive. There are a variety of ways we can do this:

In [39]:
### type hinting:
def my_adder1(x:int,y:int) -> int:
    '''
    adds two numbers together and returns the output
    '''
    return x + y

print(my_adder1('a','b'))

###assertions
def my_adder2(x,y):
    '''
    adds two numbers together and returns the output
    '''
    assert isinstance(x, (float, int)), 'x must be numeric'
    assert isinstance(y, (float, int)), 'y must be numeric'

    return x + y

my_adder2('a','b')

ab


AssertionError: x must be numeric

Type hinting does not force the inputs to be of the correct type - it just 'hints' to anyone reading the code that the inputs and output are of a certain type. Type hints are a relatively recent addition to the language, so much existing code does not use them.
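
We can see where the hints end up by looking at the function's `__annotations__` attribute. This sketch repeats `my_adder1` without its docstring so that it runs on its own:

```python
def my_adder1(x: int, y: int) -> int:
    return x + y

# The hints are stored on the function object...
print(my_adder1.__annotations__)

# ...but nothing checks them when the function is called
print(my_adder1('a', 'b'))  # ab
```

External tools such as type checkers can read these hints, but Python itself does not enforce them.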

Assertion is best practice - we can tell python to throw an error if we get an unexpected input, which is better than returning us unexpected output, like 'ab'.

This kind of programming is often referred to as 'defensive programming': In a lot of cases it is easier to prevent a misuse of a function than catch the bug later.

We will cover some more defensive programming techniques as we continue through the course.
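
One caveat worth knowing is that `assert` statements are skipped when Python is run with the `-O` flag. An alternative, sketched below, is to raise an exception explicitly. `my_adder3` is a hypothetical variant of `my_adder2`, not something from the previous lecture:

```python
def my_adder3(x, y):
    '''
    adds two numbers together and returns the output
    '''
    # Check each input and raise a descriptive error if it is not numeric
    for name, value in (('x', x), ('y', y)):
        if not isinstance(value, (float, int)):
            raise TypeError(f'{name} must be numeric, got {type(value).__name__}')
    return x + y

try:
    my_adder3('a', 'b')
except TypeError as err:
    print('TypeError:', err)  # TypeError: x must be numeric, got str
```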

Recall that our docstring is printed when we use the help function - how can we make it more useful?

One great way is to follow the style [used in numpy and pandas](https://github.com/pandas-dev/pandas/blob/v0.22.0/pandas/core/tools/numeric.py#L15-L177). In this example, the function is 162 lines of code, of which 79 lines are the docstring, and several more are inline comments. The docstring includes parameters, input types and descriptions, output descriptions, examples and changes made from previous versions.

This helps a lot when reading back through code some time later (though it is a bit of overkill for our adder):

In [42]:
def my_adder(x,y):
    '''
    Adds two numerics and returns the output
    
    Parameters
    ----------
    x: A numeric input
    y: A numeric input
    
    Returns
    -------
    ret: A numeric output, the sum of x and y
    
    Examples
    --------
    >>> x, y = 4, 5
    >>> my_adder(x, y)
    9
    '''
    assert isinstance(x, (float, int)), 'x must be numeric'
    assert isinstance(y, (float, int)), 'y must be numeric'
    #inline comments can help if we have a tricky piece of logic
    #and are denoted by a # in front of the line
    ret = x + y
    #maybe we could add an assertion here?
    return ret

In [43]:
help(my_adder)

Help on function my_adder in module __main__:

my_adder(x, y)
    Adds two numerics and returns the output
    
    Parameters
    ----------
    x: A numeric input
    y: A numeric input
    
    Returns
    -------
    ret: A numeric output, the sum of x and y
    
    Examples
    --------
    >>> x, y = 4, 5
    >>> my_adder(x, y)
    9



This seems like a lot to add, but when we are writing a function, we will almost always have an idea of what the inputs and outputs are, and run several small examples to test it. Putting it in at function definition will almost always help.
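
A side benefit of writing examples with the standard `>>>` prompt is that the `doctest` module from the standard library can run them for us. The sketch below uses a trimmed-down copy of `my_adder` so that it is self-contained:

```python
import doctest

def my_adder(x, y):
    '''
    Adds two numerics and returns the output

    Examples
    --------
    >>> x, y = 4, 5
    >>> my_adder(x, y)
    9
    '''
    return x + y

# Run every example found in the docstring and report what happened.
# The dict supplies the names the examples are allowed to use.
doctest.run_docstring_examples(
    my_adder, {'my_adder': my_adder}, verbose=True, name='my_adder'
)
```

If an example's output does not match what the docstring claims, `doctest` reports the mismatch, which helps keep the documentation and the code in step.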

However, most functions are probably not documented like this...

### Function Arguments

An extremely useful python feature is the ability to use named and unnamed arguments, and to include defaults.

Here, the value y is given a default of 10, and we have added the `*restofarguments` to allow us to take any number of additional values:

In [44]:
def my_adder(x, y = 10, *restofarguments):
    for i in x,y:
        assert isinstance(i, (float, int)), f'{i} must be numeric'
    print('Rest of arguments: ', list(restofarguments))
    ret = x + y
    return ret

In [45]:
print(my_adder(11))
print(my_adder(1, 2, 3,4,5,6))

Rest of arguments:  []
21
Rest of arguments:  [3, 4, 5, 6]
3


Keyword arguments are matched by name, so they can be given in any order; positional arguments are matched by position:

In [73]:
def my_func(x = 0, y = 1, z = 2):
    print(x)
    print(y)
    print(z)
    
my_func(z = 'z',y = 'y',x = 'x')
my_func(z = 'x')

x
y
z
0
1
x
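
Positional and keyword arguments can be mixed in one call, as long as no parameter is filled twice. This sketch uses a version of `my_func` that returns its arguments instead of printing them:

```python
def my_func(x=0, y=1, z=2):
    return x, y, z

# 'a' fills x by position, z is matched by name, y keeps its default
print(my_func('a', z='c'))  # ('a', 1, 'c')

# Giving x both by position and by name is an error
try:
    my_func('a', x='b')
except TypeError as err:
    print('TypeError:', err)
```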


We can take in arbitrary positional and keyword arguments using \* and \*\* to collect them. Notice we receive them as a tuple and a dict:

In [51]:
def myfunc(*args, **kwargs):
    print('positional args: ', args)
    print('keyword args: ', kwargs)

myfunc(1,2,3,4, x = 'a', z = 'b')

positional args:  (1, 2, 3, 4)
keyword args:  {'x': 'a', 'z': 'b'}
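
The same symbols work in the other direction when calling a function: `*` spreads a sequence over the positional parameters and `**` spreads a dict over the keyword parameters. A minimal sketch, with made-up names:

```python
def my_func(x=0, y=1, z=2):
    return x, y, z

args = (10, 20)
kwargs = {'z': 30}

# Equivalent to my_func(10, 20, z=30)
print(my_func(*args, **kwargs))  # (10, 20, 30)
```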


### Exercises

1. Create a function to convert miles to kilometers.
2. Add documentation to the function; the parameters, return, and examples.
3. Add assertions about the input(s), to make sure they are numeric.
4. Modify your function to convert miles to km by default, but km to miles if given `units = 'km'` as an argument.
5. Update the documentation.
6. Import numpy, and pass in -np.inf and np.inf as the value to convert. What do we expect to happen?
7. Pass in np.nan. What would we expect here?

In [28]:

import numpy as np

#Write a function to convert miles to km
def miles_to_km(miles, units = 'km'):
    '''
    Input parameters:
    miles -> numeric value to convert
    units -> units of numeric output. Default is 'km'
    '''
    assert isinstance(miles, (float,int)), 'miles must be numeric'
    #note the parentheses: units.lower without () is the method itself, not its result
    assert units.lower() in ('miles', 'km'), 'Unit must be in Km or miles'
    assert miles >= 0, 'miles must not be negative'

    #return the converted value together with its units in both cases
    if units.lower() == "km":
        return miles*1.60934, units
    else:
        return miles*0.621371, units

#miles_to_km(100, units = 'miles')
#miles_to_km(100, units = 'km')
miles_to_km(100)
#miles_to_km(100, units = 'somethin')


(160.934, 'km')

In [27]:
miles_to_km(100, units = 'miles')


(62.137100000000004, 'miles')

In [29]:
miles_to_km(np.inf, units='miles')


(inf, 'miles')
