In [2]:
from IPython.core.display import HTML
def css_styling():
    styles = open("styles/custom.css", "r").read()
    return HTML(styles)
css_styling()

# Introduction to Python

![python](http://imgs.xkcd.com/comics/python.png)

(via [xkcd](http://imgs.xkcd.com/comics/python.png))

## What is Python?

Python is a modern, open source, object-oriented programming language, created by a Dutch programmer, Guido van Rossum. It is an interpreted language: source code is not compiled ahead of time, but is translated and executed when the program is run. The reference implementation, CPython, is itself written in C. Python is frequently compared to languages like Perl and Ruby. It offers much of the power and flexibility of lower-level (*i.e.* compiled) languages, without the steep learning curve, and without most of the associated debugging pitfalls. The language is very clean and readable, and it is available for almost every modern computing platform.

## Why use Python for scientific programming?

Python offers a number of advantages to scientists, for experienced and novice programmers alike:

***Powerful and easy to use***  
Python is simultaneously powerful, flexible and easy to learn and use (in general, these qualities are traded off against one another in a programming language). Anything that can be coded in C, FORTRAN, or Java can be done in Python, almost always in fewer lines of code, and with fewer debugging headaches. Its standard library is extremely rich, including modules for string manipulation, regular expressions, file compression, mathematics, profiling and debugging (to name only a few). Unnecessary language constructs, such as `END` statements and block-delimiting brackets, are absent, making the code terse, efficient, and easy to read. Finally, Python is object-oriented, a programming paradigm that is particularly well-suited to scientific programming because it allows data structures to be abstracted in a natural way.

***Interactive***  
Python may be run interactively on the command line, in much the same way as Octave or S-Plus/R. Rather than compiling and running a particular program, commands may be entered serially, each followed by the `Return` key. This is often useful for mathematical programming and debugging.

***Extensible***  
Python is often referred to as a “glue” language, meaning that it is useful in a mixed-language environment. Frequently, programmers must interact with colleagues who operate in other programming languages, or use significant quantities of legacy code that would be problematic or expensive to re-code. Python was designed to interact with other programming languages, and in many cases C or FORTRAN code can be compiled directly into Python programs (using utilities such as `f2py` or `weave`). Additionally, since Python is an interpreted language, it can sometimes be slow relative to its compiled cousins. In many cases this performance deficit is due to a short loop of code that runs thousands or millions of times. Such bottlenecks may be removed by coding a function in FORTRAN, C or Cython, and compiling it into a Python module.

***Third-party modules***  
There is a vast body of Python modules created outside the auspices of the Python Software Foundation. These include utilities for database connectivity, mathematics, statistics, and charting/plotting. Some notables include:

* ***NumPy***: Numerical Python (NumPy) is a set of extensions that provides the ability to specify and manipulate array data structures. It provides array manipulation and computational capabilities similar to those found in Matlab or Octave. 
* ***SciPy***: An open source library of scientific tools for Python, SciPy supplements the NumPy module. SciPy gathers a variety of high-level science and engineering modules together as a single package. SciPy includes modules for graphics and plotting, optimization, integration, special functions, signal and image processing, genetic algorithms, ODE solvers, and others.
* ***Matplotlib***: Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Its syntax is very similar to Matlab.
* ***Pandas***: A module that provides high-performance, easy-to-use data structures and data analysis tools. In particular, the `DataFrame` class is useful for spreadsheet-like representation and manipulation of data. It also includes high-level plotting functionality.
* ***IPython***: An enhanced Python shell, designed to increase the efficiency and usability of coding, testing and debugging Python. It includes both a Qt-based console and an interactive HTML notebook interface, both of which feature multiline editing, interactive plotting and syntax highlighting.

***Free and open***  
Python is released on all platforms under the Python Software Foundation License, an open source license that is compatible with the GNU General Public License, meaning that the language and its source are freely distributable. Not only does this keep costs down for scientists and universities operating under a limited budget, but it also frees programmers from licensing concerns for any software they may develop. There is little reason to buy expensive licenses for software such as Matlab or Maple, when Python can provide the same functionality for free!


### Sample code: mean and standard deviation

Here is a quick example of a Python program. We will call it `stats.py`, because Python programs typically end with the `.py` suffix. This code consists of some fake data and a function, `mean`, which calculates the mean; a companion function, `var`, for the variance is defined further below. Python can be internally documented by adding lines beginning with the `#` symbol, or with simple strings enclosed in quotation marks. Here is the code:

In [3]:
# Import modules you might use
import numpy as np

# Some data, in a list
my_data = [12, 5, 17, 8, 9, 11, 21]

# Function for calculating the mean of some data
def mean(data):

    # Initialize sum to zero
    sum_x = 0.0

    # Loop over data
    for x in data:

        # Add to sum
        sum_x += x 
    
    # Divide by number of elements in list, and return
    return sum_x / len(data)

Notice that, rather than using parentheses or brackets to enclose units of code (such as loops or conditional statements), Python simply uses indentation. This relieves the programmer from worrying about a stray bracket causing her program to crash. Also, it forces programmers to code in neat blocks, making programs easier to read. So, for the following snippet of code:

In [4]:
sum_x = 0

# Loop over data
for x in my_data:
    
    # Add to sum
    sum_x += x 

print(sum_x)

83


The first line initializes a variable to hold the sum, and the second initiates a loop, where each element in the data list is given the name `x`, and is used in the code that is indented below. The first line of subsequent code that is not indented signifies the end of the loop. It takes some getting used to, but works rather well.

Now let's call the function:

In [5]:
mean(my_data)

11.857142857142858

Our specification of `mean` is by no means the most efficient implementation. Python provides some syntax and built-in functions to make things easier, and sometimes faster. Here is a more compact `mean`, along with a function `var` for the variance:

In [6]:
# Function for calculating the mean of some data
def mean(data):

    # Call sum, then divide by the number of elements
    return sum(data)/len(data)

# Function for calculating variance of data
def var(data):

    # Get mean of data from function above
    x_bar = mean(data)

    # Do sum of squares in one line
    sum_squares = sum([(x - x_bar)**2 for x in data])

    # Divide by n-1 and return
    return sum_squares/(len(data)-1)

In the new implementation of `mean`, we use the built-in function `sum` to reduce the function to a single line. Similarly, `var` employs a **list comprehension** syntax to make a more compact and efficient loop.

An alternative looping construct involves the `map` function. Suppose that we had a number of datasets, for each of which we want to calculate the mean:

In [7]:
x = (45, 95, 100, 47, 92, 43)
y = (65, 73, 10, 82, 6, 23)
z = (56, 33, 110, 56, 86, 88) 
datasets = (x,y,z)

In [8]:
datasets

((45, 95, 100, 47, 92, 43), (65, 73, 10, 82, 6, 23), (56, 33, 110, 56, 86, 88))

This can be done using a classical loop:

In [9]:
means = []
for d in datasets:
    means.append(mean(d))

In [10]:
means

[70.33333333333333, 43.166666666666664, 71.5]

Or, more succinctly using `map`:

In [11]:
list(map(mean, datasets))

[70.33333333333333, 43.166666666666664, 71.5]

Similarly, we did not have to code these functions ourselves to get means and variances; the `numpy` package that we imported at the beginning of the module provides equivalent functions:

In [12]:
np.mean(datasets, axis=1)

array([ 70.33333333,  43.16666667,  71.5       ])
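The same shortcut exists for the variance, with one caveat that a short sketch may make clearer: `np.var` divides by `n` by default, whereas the `var` function above divides by `n - 1`. Passing `ddof=1` should make the two agree. The snippet redefines the data and a compact `var` so that it runs on its own; the names `numpy_vars` and `manual_vars` are introduced here purely for illustration.

```python
import numpy as np

datasets = ((45, 95, 100, 47, 92, 43),
            (65, 73, 10, 82, 6, 23),
            (56, 33, 110, 56, 86, 88))

def var(data):
    # Sample variance: sum of squared deviations divided by n - 1
    x_bar = sum(data) / len(data)
    return sum((x - x_bar)**2 for x in data) / (len(data) - 1)

# ddof=1 switches NumPy from the population (n) to the sample (n - 1) divisor
numpy_vars = np.var(datasets, axis=1, ddof=1)
manual_vars = [var(d) for d in datasets]

# True when both approaches give the same three variances
print(np.allclose(numpy_vars, manual_vars))
```

Leaving out `ddof=1` would give slightly smaller values, which matters most for small samples like these.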

## Data Types and Data Structures

In the introduction above, you have already seen some of the important Python data structures, including integers, floating-point numbers, lists and tuples. It is worthwhile, however, to quickly introduce all of the built-in data structures relevant to everyday Python programming.

### Literals
The simplest data structures are literals, which appear directly in programs, and include most simple strings and numbers:


In [13]:
42              # Integer
0.002243        # Floating-point
5.0J            # Imaginary
'foo'
"bar"           # Several string types
s = """Multi-line
string"""

There are a handful of constants that exist in the built-in namespace. Importantly, there are the boolean values `True` and `False`:

In [14]:
type(True)

bool

Either of these can be negated using `not`.

In [15]:
not False

True

In addition, there is a special value, `None`, that represents the absence of a value.

In [16]:
x = None
print(x)

None


All the arithmetic operators are available in Python:

In [17]:
15/4

3.75

> **Compatibility Corner**: Note that when using Python 2, you would get a different answer! Dividing an integer by an integer will yield another integer. Though this is "correct", it is not intuitive, and hence was changed in Python 3.

Operator precedence can be enforced using parentheses:

In [18]:
(14 - 5) * 4

36
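As a brief illustration of the other arithmetic operators and their precedence, the following sketch (Python 3 semantics assumed; the variable names are arbitrary) shows floor division, remainder and exponentiation next to true division:

```python
true_quotient = 15 / 4      # true division: 3.75
floor_quotient = 15 // 4    # floor division: 3
remainder = 15 % 4          # remainder: 3
negative_floor = -15 // 4   # floors toward negative infinity: -4
power = 2 ** 10             # exponentiation: 1024
unary = -2 ** 2             # ** binds tighter than unary minus: -4
grouped = (-2) ** 2         # parentheses change the result: 4

print(true_quotient, floor_quotient, remainder, negative_floor,
      power, unary, grouped)
```

The last two lines are a common source of surprises, which is one more reason to add parentheses whenever the intended order is not obvious.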

There are several Python data structures that are used to encapsulate several elements in a set or sequence.

### Tuples

The first sequence data structure is the tuple, which is simply an immutable, ordered sequence of elements. These elements may be of arbitrary and mixed types. The tuple is specified by a comma-separated sequence of items, enclosed by parentheses:


In [19]:
(34,90,56) # Tuple with three elements

(34, 90, 56)

In [20]:
(15,) # Tuple with one element

(15,)

In [21]:
(12, 'foobar') # Mixed tuple

(12, 'foobar')

Individual elements in a tuple can be accessed by *indexing*. This amounts to specifying the appropriate element index enclosed in square brackets following the tuple name:

In [22]:
foo = (5,7,2,8,2,-1,0,4)
foo[0]

5

Notice that the index is ***zero-based***, meaning that the first index is zero, rather than one (in contrast to R). So above, index 0 retrieves the first item, which is 5; likewise, index 5 would retrieve the sixth item, not the fifth.

Two or more sequential elements can be indexed by *slicing*:

In [23]:
foo[2:5]

(2, 8, 2)

This retrieves the third, fourth and fifth (but not the sixth!) elements -- *i.e.*, up to, but not including, the final index. One may also slice or index starting from the end of a sequence, by using negative indices:


In [24]:
foo[:-2]

(5, 7, 2, 8, 2, -1)

As you can see, this returns all elements except the final two. 

You can add an optional third element to the slice, which specifies a **step** value. For example, the following returns every other element of `foo`, starting with the second element of the tuple.

In [25]:
foo[1::2]

(7, 8, -1, 4)
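Negative indices and steps combine in a few handy ways. The sketch below reuses the tuple `foo` from above; the other variable names are only illustrative.

```python
foo = (5, 7, 2, 8, 2, -1, 0, 4)

last = foo[-1]           # final element: 4
tail = foo[-3:]          # final three elements: (-1, 0, 4)
backwards = foo[::-1]    # whole tuple reversed: (4, 0, -1, 2, 8, 2, 7, 5)
every_third = foo[::3]   # indices 0, 3 and 6: (5, 8, 0)

print(last, tail, backwards, every_third)
```

Slicing always returns a new tuple, so `foo` itself is left untouched.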

The elements of a tuple, as defined above, are **immutable**. Therefore, Python takes offense if you try to change them:

In [26]:
a = (1,2,3)
a[0] = 6

TypeError: 'tuple' object does not support item assignment

The `TypeError` is called an ***exception***, which in this case indicates that you have tried to perform an action on a type that does not support it. We will learn about handling exceptions further along.

Finally, the `tuple()` function can create a tuple from any sequence:

In [27]:
tuple('foobar')

('f', 'o', 'o', 'b', 'a', 'r')

Why does this happen? Because in Python, strings are considered a sequence of characters.

### Lists

Lists complement tuples in that they are a ***mutable***, ordered sequence of elements. To distinguish them from tuples, they are enclosed by square brackets:

In [28]:
# List with five elements
[90, 43.7, 56, 1, -4]

[90, 43.7, 56, 1, -4]

Elements of a list can be arbitrarily substituted by assigning new values to the associated index:

In [30]:
bar = [5,8,4,2,7,9,4,1]
bar[3] = -5
bar

[5, 8, 4, -5, 7, 9, 4, 1]

Operations on lists are somewhat unusual. For example, multiplying a list by an integer does not multiply each element by that integer, as you might expect, but rather:

In [31]:
bar * 3

[5, 8, 4, -5, 7, 9, 4, 1, 5, 8, 4, -5, 7, 9, 4, 1, 5, 8, 4, -5, 7, 9, 4, 1]

Which is simply three copies of the list, concatenated together. This is useful for generating lists with identical elements:

In [32]:
[0]*10

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

(incidentally, this works with tuples as well)

In [33]:
(3,)*10

(3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
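One caution about this repetition idiom is worth a small sketch: it copies references, not objects. That is harmless for numbers, but repeating a list that contains another list yields several references to the same inner list. The variable names below are hypothetical.

```python
# Repeating a list of immutable values is safe
zeros = [0] * 3
zeros[0] = 1              # [1, 0, 0]

# Repeating a nested list copies the reference, not the inner list
shared = [[0] * 3] * 2
shared[0][0] = 1          # both rows change: [[1, 0, 0], [1, 0, 0]]

# A list comprehension builds an independent inner list for each row
independent = [[0] * 3 for _ in range(2)]
independent[0][0] = 1     # only the first row changes

print(zeros, shared, independent)
```

When a grid of mutable rows is needed, the comprehension form is usually the safer choice.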

Since lists are mutable, they retain several ***methods***, some of which mutate the list. For example:

In [34]:
bar.extend(foo) # Adds foo to the end of bar (in-place)
bar

[5, 8, 4, -5, 7, 9, 4, 1, 5, 7, 2, 8, 2, -1, 0, 4]

In [35]:
bar.append(5) # Appends 5 to the end of bar
bar

[5, 8, 4, -5, 7, 9, 4, 1, 5, 7, 2, 8, 2, -1, 0, 4, 5]

In [36]:
bar.insert(0, 4) # Inserts 4 at index 0
bar

[4, 5, 8, 4, -5, 7, 9, 4, 1, 5, 7, 2, 8, 2, -1, 0, 4, 5]

In [37]:
bar.remove(7) # Removes the first occurrence of 7
bar

[4, 5, 8, 4, -5, 9, 4, 1, 5, 7, 2, 8, 2, -1, 0, 4, 5]

In [38]:
bar.remove(100) # Oops! Doesn’t exist

ValueError: list.remove(x): x not in list

In [39]:
bar.reverse() # Reverses bar in place
bar

[5, 4, 0, -1, 2, 8, 2, 7, 5, 1, 4, 9, -5, 4, 8, 5, 4]

In [40]:
bar.sort() # Sorts bar in place
bar

[-5, -1, 0, 1, 2, 2, 4, 4, 4, 4, 5, 5, 5, 7, 8, 8, 9]
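Because `sort` and `reverse` work in place, they return `None` rather than the modified list. When the original order needs to be kept, the built-in `sorted` function returns a new list instead. A minimal sketch, using a made-up list `scores`:

```python
scores = [3, -1, 7, 2]

ordered = sorted(scores)                    # new list; scores is unchanged
descending = sorted(scores, reverse=True)   # largest first
unchanged = list(scores)                    # copy taken before sorting in place
result = scores.sort()                      # sorts in place and returns None

print(ordered)      # [-1, 2, 3, 7]
print(descending)   # [7, 3, 2, -1]
print(result)       # None
print(scores)       # [-1, 2, 3, 7]
```

A frequent slip is writing `scores = scores.sort()`, which replaces the list with `None`.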

### Dictionaries

One of the more flexible built-in data structures is the dictionary. A dictionary maps a set of keys to a collection of associated values. These mappings are mutable and, unlike lists or tuples, are not indexed by position: rather than using the sequence index to return elements of the collection, the corresponding key must be used. Before Python 3.7, dictionaries were also formally **unordered**; since version 3.7 they preserve insertion order. Dictionaries are specified by a comma-separated sequence of keys and values, which are separated in turn by colons. The dictionary is enclosed by curly braces.

For example:

In [43]:
my_dict = {'a':16, 
           'b':(4,5), 
           'foo':'''(noun) a term used as a universal substitute 
           for something real, especially when discussing technological ideas and 
           problems'''}
my_dict

{'a': 16,
 'b': (4, 5),
 'foo': '(noun) a term used as a universal substitute \n           for something real, especially when discussing technological ideas and \n           problems'}

In [44]:
my_dict['b']

(4, 5)

Notice that `a` indexes an integer, `b` a tuple, and `foo` a string (now you know what foo means). Hence, a dictionary is a sort of associative array. Some languages refer to such a structure as a hash or key-value store.
	
As with lists, being mutable, dictionaries have a variety of methods and functions that take dictionary arguments. For example, some dictionary functions include:

In [45]:
len(my_dict)

3

In [46]:
# Checks to see if ‘a’ is in my_dict
'a' in my_dict

True

Some useful dictionary methods are:

In [47]:
# Returns a view of the key/value pairs
my_dict.items() 

dict_items([('a', 16), ('foo', '(noun) a term used as a universal substitute \n           for something real, especially when discussing technological ideas and \n           problems'), ('b', (4, 5))])

In [48]:
# Returns a view of the keys
my_dict.keys() 

dict_keys(['a', 'foo', 'b'])

In [49]:
# Returns a view of the values
my_dict.values() 

dict_values([16, '(noun) a term used as a universal substitute \n           for something real, especially when discussing technological ideas and \n           problems', (4, 5)])
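These methods are most often used in loops. The sketch below uses a small made-up dictionary (`small_dict` is not part of the example above) and assumes Python 3.7 or later, where dictionaries preserve insertion order.

```python
small_dict = {'a': 16, 'b': 4, 'c': 9}   # made-up values for illustration

# items() yields (key, value) pairs that unpack directly in a for loop
pairs = []
for key, value in small_dict.items():
    pairs.append('{0}={1}'.format(key, value))
print(pairs)                          # ['a=16', 'b=4', 'c=9']

# Views are live: they reflect later changes to the dictionary
keys = small_dict.keys()
small_dict['d'] = 25
print(list(keys))                     # ['a', 'b', 'c', 'd']

# Wrap a view in list() or sorted() when a real list is needed
print(sorted(small_dict.values()))    # [4, 9, 16, 25]
```

Iterating over the dictionary itself, as in `for key in small_dict`, is equivalent to iterating over `small_dict.keys()`.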

When we try to look up a key that does not exist, Python raises a `KeyError`.

In [50]:
my_dict['c']

KeyError: 'c'

If we would rather not get the error, we can use the `get` method, which returns `None` if the key is not present.

In [51]:
my_dict.get('c')

Custom return values can be specified with a second argument.

In [52]:
my_dict.get('c', -1)

-1
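A typical use of the default argument is counting, where a missing key should be treated as zero. The following is a minimal sketch; the names `letters` and `counts` are chosen only for this example, and the printed order assumes Python 3.7 or later.

```python
letters = 'foobar'
counts = {}

for letter in letters:
    # A missing letter counts as 0, so no KeyError is raised on first sight
    counts[letter] = counts.get(letter, 0) + 1

print(counts)   # {'f': 1, 'o': 2, 'b': 1, 'a': 1, 'r': 1}
```

Note that `get` never modifies the dictionary; the assignment on the left-hand side is what stores the new count.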

# Programming with Python

## Control Flow Statements

The control flow of a program determines the order in which lines of code are executed. All else being equal, Python code is executed linearly, in the order that lines appear in the program. However, all is not usually equal, and so the appropriate control flow is frequently specified with the help of control flow statements. These include loops, conditional statements and calls to functions. Let’s look at a few of these here.

### for statements
One way to repeatedly execute a block of statements (*i.e.* loop) is to use a `for` statement. These statements iterate over the elements of a specified sequence, according to the following syntax:

In [53]:
for letter in 'ciao':
    print('give me a {0}'.format(letter.upper()))

give me a C
give me a I
give me a A
give me a O


Recall that strings are simply regarded as sequences of characters. Hence, the above `for` statement loops over each letter, converting each to upper case with the `upper()` method and printing it. 

Similarly, as shown in the introduction, **list comprehensions** may be constructed using `for` statements:

In [54]:
[i**2 for i in range(10)]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Here, the expression loops over `range(10)` -- the sequence from 0 to 9 -- and squares each before placing it in the returned list.
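Comprehensions can also filter with a trailing `if`, nest several `for` clauses, and build dictionaries. A short sketch, with variable names invented for the illustration:

```python
# Keep only the even numbers before squaring
even_squares = [i**2 for i in range(10) if i % 2 == 0]

# Two for clauses behave like nested loops (the first is the outer loop)
pairs = [(i, j) for i in range(2) for j in range(2)]

# Curly braces with key: value build a dictionary instead of a list
square_lookup = {i: i**2 for i in range(4)}

print(even_squares)    # [0, 4, 16, 36, 64]
print(pairs)           # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(square_lookup)   # {0: 0, 1: 1, 2: 4, 3: 9}
```

Once a comprehension needs more than one condition or two loops, an ordinary `for` loop is usually easier to read.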

### if statements
As the name implies, `if` statements execute particular sections of code depending on some tested **condition**. For example, to code an absolute value function, one might employ conditional statements:

In [55]:
def absval(some_list):

    # Create empty list
    absolutes = []    

    # Loop over elements in some_list
    for value in some_list:

        # Conditional statement
        if value<0:
            # Negative value
            absolutes.append(-value)

        else:
            # Non-negative value
            absolutes.append(value)
    
    return absolutes 

Here, each value in `some_list` is tested for the condition that it is negative, in which case it is multiplied by -1, otherwise it is appended as-is. 

For conditions that have more than two possible values, the `elif` clause can be used:

In [56]:
x = 5
if x < 0:
    print('x is negative')
elif x % 2:
    print('x is positive and odd')
else:
    print('x is even and non-negative')

x is positive and odd


### while statements

A different type of conditional loop is provided by the `while` statement. Rather than iterating a specified number of times, according to a given sequence, `while` executes its block of code repeatedly, until its condition is no longer true. 

For example, suppose we want to sample from a truncated normal distribution, where we are only interested in positive-valued samples. The following function is one solution:

In [57]:
# Import function
from numpy.random import normal

def truncated_normals(how_many, l):

    # Create empty list
    values = []

    # Loop until we have specified number of samples
    while (len(values) < how_many):

        # Sample from standard normal
        x = normal(0,1)

        # Append if not truncated
        if x > l: values.append(x)

    return values    

In [58]:
truncated_normals(15, 0)

[0.9943567055289931,
 0.4888419632979138,
 0.278003681708471,
 1.2687963432420057,
 0.6698480838580629,
 0.6014385976019375,
 0.08837710154582967,
 1.834903917288043,
 0.7327293710377771,
 0.8254672290602311,
 0.29937223268981206,
 1.9885672421872016,
 0.765976444198716,
 0.9801065620267986,
 0.7443657742503714]

This function iteratively samples from a standard normal distribution, and appends each sample to the output list if it exceeds the truncation point `l` (zero in the call above, so only positive values are kept), stopping to return the list once the specified number of values have been added.
	
Obviously, the body of the `while` statement should contain code that eventually renders the condition false, otherwise the loop will never end! An exception to this is if the body of the statement contains a `break` or `return` statement; in either case, the loop will be interrupted.
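To illustrate both ways of leaving a loop, here is a sketch of Newton's method for square roots. It is only an illustration, not a replacement for `math.sqrt`: the function name `newton_sqrt` and its `tol` and `max_iter` parameters are hypothetical, and the input is assumed to be positive. The `while` condition caps the number of iterations, and `break` exits early once successive estimates agree.

```python
def newton_sqrt(a, tol=1e-12, max_iter=50):
    # Approximate the square root of a positive number a

    # Starting guess
    x = a if a > 1 else 1.0
    iterations = 0

    # The condition guarantees that the loop ends eventually
    while iterations < max_iter:
        new_x = 0.5 * (x + a / x)
        iterations += 1

        # Leave early once the estimate has stopped changing
        if abs(new_x - x) < tol:
            x = new_x
            break

        x = new_x

    return x

print(newton_sqrt(2.0))
```

Combining a bounded condition with `break` is a common defensive pattern: the loop stops as soon as the work is done, but cannot run forever if convergence fails.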


## Error Handling

Inevitably, some code you write will generate errors, at least in some situations. Unless we explicitly anticipate and **handle** these errors, they will cause your code to halt (sometimes this is a good thing!). Errors are handled using `try/except` blocks.

If code executed in the `try` block generates an error, code execution moves to the `except` block. If the exception that is specified corresponds to that which has been raised, the code in the `except` block is executed before continuing; otherwise, the exception is propagated and the code is halted.

In [59]:
absval(-5)

TypeError: 'int' object is not iterable

In the call to `absval`, we passed a single negative integer, whereas the function expects some sort of iterable data structure. Other than changing the function itself, we can avoid this error using exception handling.

In [60]:
x = -5
try:
    print(absval(x))
except TypeError:
    print('The argument to absval must be iterable!')

The argument to absval must be iterable!


In [61]:
x = -5
try:
    print(absval(x))
except TypeError:
    print(absval([x]))

[5]


We can raise exceptions manually by using the `raise` statement.

In [62]:
raise ValueError('This is the wrong value')

ValueError: This is the wrong value
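In practice, `raise` is usually placed inside a function to reject bad input, and the caller decides how to respond. The sketch below uses a hypothetical function `checked_mean`; it also shows the optional `as`, `else` and `finally` parts of a `try` statement, which are not covered above.

```python
def checked_mean(data):
    # Refuse empty input instead of failing with a ZeroDivisionError
    if len(data) == 0:
        raise ValueError('data must contain at least one value')
    return sum(data) / len(data)

try:
    result = checked_mean([])
except ValueError as err:
    # 'as err' binds the exception object so its message can be inspected
    result = None
    message = str(err)
else:
    # Runs only when no exception was raised
    message = 'ok'
finally:
    # Runs in every case, which makes it suitable for clean-up code
    print('finished')

print(result, message)
```

Raising a specific exception type, rather than a generic `Exception`, lets callers catch exactly the failures they know how to handle.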

## Importing and Manipulating Data

Python includes operations for importing and exporting data from files and binary objects, and third-party packages exist for database connectivity. The easiest way to import data from a file is to parse a **delimited** text file, which can usually be exported from spreadsheets and databases. Data may be read from and written to regular files through file objects, which are created with the built-in `open` function:

In [63]:
microbiome = open('data/microbiome.csv')

Here, a file containing microbiome data in a comma-delimited format is opened, and assigned to an object, called `microbiome`. The next step is to transfer the information in the file to a usable data structure in Python. Since this dataset contains five variables (the name of the taxon, the de-identified patient identifier, the group, the bacteria count in tissue and the bacteria count in stool), it is convenient to use a dictionary. This allows each variable to be specified by name.


First, a dictionary object is initialized, with appropriate keys and corresponding lists, initially empty. Since the file has a header, we can use it to generate an empty dict:

In [64]:
column_names = next(microbiome).rstrip('\n').split(',')
column_names

['Taxon', 'Patient', 'Group', 'Tissue', 'Stool']

> **Compatibility Corner**: In Python 2, `open` returns a `file` object with its own `next` method. In Python 3, the object returned by `open` is an iterator without a public `next` method, so the built-in function `next` is used to advance it.

In [65]:
mb_dict = {name:[] for name in column_names}

In [66]:
mb_dict

{'Group': [], 'Patient': [], 'Stool': [], 'Taxon': [], 'Tissue': []}

In [67]:
for line in microbiome:
    taxon, patient, group, tissue, stool = line.rstrip('\n').split(',')
    mb_dict['Taxon'].append(taxon)
    mb_dict['Patient'].append(int(patient))
    mb_dict['Group'].append(int(group))
    mb_dict['Tissue'].append(int(tissue))
    mb_dict['Stool'].append(int(stool))

For each line in the file, data elements are split by the comma delimiter, using the `split` method that is built-in to string objects. Each datum is subsequently appended to the appropriate list stored in the dictionary. After all the data is parsed, it is polite to close the file:

In [68]:
microbiome.close()

The data can now be readily accessed by indexing the appropriate variable by name:

In [69]:
mb_dict['Tissue'][:10]

[136, 1174, 408, 831, 693, 718, 173, 228, 162, 372]
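The same parsing can be sketched with two standard-library conveniences: a `with` block, which closes the file automatically, and `csv.DictReader`, which uses the header row as keys. So that the snippet runs without the data file, `io.StringIO` stands in for `open`, and the two data rows are made up; only the column names come from the header shown above.

```python
import csv
import io

# Made-up rows in the same column layout as the header shown above
text = """Taxon,Patient,Group,Tissue,Stool
TaxonA,1,0,10,20
TaxonB,2,1,30,40
"""

columns = {}

# io.StringIO stands in for open(...); 'with' closes the handle on exit
with io.StringIO(text) as handle:
    reader = csv.DictReader(handle)
    for row in reader:
        for name, value in row.items():
            # Convert every column except Taxon to an integer
            converted = value if name == 'Taxon' else int(value)
            columns.setdefault(name, []).append(converted)

print(columns['Tissue'])   # [10, 30]
```

Unlike a plain `split(',')`, the `csv` module also copes with quoted fields that themselves contain commas.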

A second approach to importing data involves interfacing directly with a relational database management system. Relational databases are far more efficient for storing, maintaining and querying data than plain text files or spreadsheets, particularly for large datasets or multiple tables. Python can connect to most database systems, either through third-party packages or through the standard library. For example, `sqlite3` is a standard-library module that provides connectivity for SQLite databases:

In [70]:
import sqlite3
db = sqlite3.connect(database='data/baseball-archive-2011.sqlite')

# create a cursor object to communicate with database
cur = db.cursor()   

In [71]:
# run query
cur.execute('SELECT playerid, HR, SB FROM Batting WHERE yearID=1970')

# fetch data, and assign to variable
baseball = cur.fetchall() 
baseball[:10]

[('aaronha01', 38, 9),
 ('aaronto01', 2, 0),
 ('abernte02', 0, 0),
 ('abernte02', 0, 0),
 ('abernte02', 0, 0),
 ('acosted01', 0, 0),
 ('adairje01', 0, 0),
 ('ageeto01', 24, 31),
 ('aguirha01', 0, 0),
 ('akerja01', 0, 0)]
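When part of a query comes from a variable, it is safer to pass the value as a parameter than to paste it into the SQL string. The sketch below does not use the baseball database: it builds a throwaway in-memory table with a similar layout, and every row in it is invented for the illustration.

```python
import sqlite3

# ':memory:' creates a temporary database that lives only in RAM
db = sqlite3.connect(':memory:')
cur = db.cursor()

# A small stand-in table with made-up rows
cur.execute('CREATE TABLE Batting '
            '(playerID TEXT, yearID INTEGER, HR INTEGER, SB INTEGER)')
rows = [('player_a', 1970, 10, 2),
        ('player_b', 1970, 0, 0),
        ('player_c', 1971, 5, 1)]
cur.executemany('INSERT INTO Batting VALUES (?, ?, ?, ?)', rows)
db.commit()

# The ? placeholder is filled in from the tuple of parameters
year = 1970
cur.execute('SELECT playerID, HR, SB FROM Batting '
            'WHERE yearID = ? ORDER BY playerID', (year,))
result = cur.fetchall()
print(result)   # [('player_a', 10, 2), ('player_b', 0, 0)]

db.close()
```

Placeholders let the database driver handle quoting, which avoids both syntax errors and SQL injection.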

## Functions

Python uses the `def` statement to encapsulate code into a callable function. Here again is a very simple Python function:

In [72]:
# Function for calculating the mean of some data
def mean(data):

    # Initialize sum to zero
    sum_x = 0.0

    # Loop over data
    for x in data:

        # Add to sum
        sum_x += x 
    
    # Divide by number of elements in list, and return
    return sum_x / len(data)

As we can see, arguments are specified in parentheses following the function name. If there are sensible "default" values, they can be specified as a **keyword argument**.

In [73]:
def var(data, sample=True):

    # Get mean of data from function above
    x_bar = mean(data)

    # Do sum of squares in one line
    sum_squares = sum([(x - x_bar)**2 for x in data])

    # Divide by n-1 (sample) or n (population) and return
    if sample:
        return sum_squares/(len(data)-1)
    return sum_squares/len(data)

Non-keyword arguments must always precede keyword arguments, and must always be presented in order; order is not important for keyword arguments.
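A short sketch of the calling conventions, using a compact re-statement of `var` so that the snippet runs on its own (the list `values` is made up):

```python
def var(data, sample=True):
    # Variance with an n - 1 (sample) or n (population) divisor
    x_bar = sum(data) / len(data)
    sum_squares = sum((x - x_bar)**2 for x in data)
    return sum_squares / (len(data) - 1 if sample else len(data))

values = [2, 4, 4, 6]

default_call = var(values)                    # sample=True is assumed
positional = var(values, False)               # arguments matched by position
by_keyword = var(values, sample=False)        # positional first, then keyword
reordered = var(sample=False, data=values)    # keywords may come in any order

print(default_call, positional, by_keyword, reordered)
```

Writing `var(sample=False, values)` instead would be rejected with a `SyntaxError`, because a positional argument may not follow a keyword argument.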

Arguments can also be passed to functions as a `tuple`/`list`/`dict` using the asterisk notation.

In [74]:
def some_computation(a=-1, b=4.3, c=7):
    return (a + b) / float(c)

args = (5, 4, 3)
some_computation(*args)

3.0

In [75]:
kwargs = {'b':4, 'a':5, 'c':3}
some_computation(**kwargs)

3.0
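The asterisk notation also works in the other direction: in a function definition, `*args` collects any extra positional arguments into a tuple and `**kwargs` collects extra keyword arguments into a dictionary. The functions `describe` and `scaled_sum` below are hypothetical examples, not part of the code above.

```python
def describe(*args, **kwargs):
    # args is a tuple of positional arguments, kwargs a dict of keyword arguments
    return len(args), sorted(kwargs)

def scaled_sum(*values, scale=1.0):
    # Any number of values may be passed; scale can only be given by keyword
    return scale * sum(values)

print(describe(1, 2, 3, scale=2.0, label='x'))   # (3, ['label', 'scale'])
print(scaled_sum(1, 2, 3))                       # 6.0
print(scaled_sum(1, 2, 3, scale=2))              # 12
```

This is how functions such as `print` accept a variable number of arguments.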

The `lambda` expression creates an anonymous one-line function, which can simply be assigned to a name:

In [76]:
import numpy as np
normalize = lambda data: (np.array(data) - np.mean(data)) / np.std(data)

or not:

In [77]:
(lambda data: (np.array(data) - np.mean(data)) / np.std(data))([5,8,3,8,3,1,2,1])

array([ 0.42192651,  1.54706386, -0.32816506,  1.54706386, -0.32816506,
       -1.07825663, -0.70321085, -1.07825663])
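Anonymous functions are most useful as short arguments to other functions, for example as the `key` that tells `sorted` or `max` what to compare. A minimal sketch with made-up data:

```python
pairs = [('b', 3), ('a', 10), ('c', 1)]   # made-up (label, count) pairs

# Sort by the second element of each pair rather than the first
by_count = sorted(pairs, key=lambda pair: pair[1])

# The same key function picks out the pair with the largest count
largest = max(pairs, key=lambda pair: pair[1])

print(by_count)   # [('c', 1), ('b', 3), ('a', 10)]
print(largest)    # ('a', 10)
```

For anything longer than a single expression, a named function defined with `def` is generally clearer.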

## Example: Least Squares Estimation

Let's try coding a statistical function. Suppose we want to estimate the parameters of a simple linear regression model. The objective of regression analysis is to specify an equation that will predict some response variable $Y$ based on a set of predictor variables $X$. This is done by fitting parameter values $\beta$ of a regression model using extant data for $X$ and $Y$. This equation has the form:

$$Y = X\beta + \epsilon$$

where $\epsilon$ is a vector of errors. One way to fit this model is the method of *least squares*, whose estimate is given by:

$$\hat{\beta} = (X^{\prime} X)^{-1}X^{\prime} Y$$

We can write a function that calculates this estimate, with the help of some functions from other modules:

In [43]:
from numpy.linalg import inv
from numpy import transpose, array, dot

We will call this function `solve`, requiring the predictor and response variables as arguments. For simplicity, we will restrict the function to univariate regression, whereby only a single slope and intercept are estimated:

In [44]:
def solve(x,y):
    'Estimates regression coefficients from data'

    '''
    The first step is to specify the design matrix. For this, 
    we need to create a vector of ones (corresponding to the intercept term), 
    and along with x, create an n x 2 array:
    '''
    X = transpose(array([[1]*len(x), x]))

    '''
    An array is a data structure from the numpy package, similar to a list, 
    but allowing for multiple dimensions. Next, we calculate the transpose of X, 
    using another numpy function, transpose
    '''
    Xt = transpose(X)

    '''
    Finally, we use the matrix multiplication function dot, also from numpy, 
    to calculate the matrix products. The inverse function inv is provided by the 
    numpy.linalg module. Provided that X'X is not singular (which would raise an 
    exception), this yields estimates of the intercept and slope, as an array
    '''
    b_hat = dot(inv(dot(Xt,X)), dot(Xt,y))

    return b_hat 

Here is `solve` in action:

In [45]:
solve((10,5,10,11,14),(-4,3,0,23,0.6))

array([ 2.04380952,  0.24761905])
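As a cross-check, the same estimates can be obtained from NumPy's own least squares routine, `numpy.linalg.lstsq`, which avoids forming the matrix inverse explicitly. This is a sketch for comparison rather than a replacement for `solve`; the variable names are chosen for this example.

```python
import numpy as np

x = np.array([10, 5, 10, 11, 14], dtype=float)
y = np.array([-4, 3, 0, 23, 0.6])

# Design matrix: a column of ones (intercept) and a column of x (slope)
X = np.column_stack([np.ones_like(x), x])

# lstsq returns the solution, residuals, rank and singular values
beta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

# Approximately [2.0438, 0.2476], matching the output of solve above
print(beta)
```

Explicit inversion is fine for a two-parameter problem, but a dedicated solver tends to be more stable numerically when predictors are strongly correlated.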

## Object-oriented Programming

As previously stated, Python is an object-oriented programming (OOP) language, in contrast to procedural languages like FORTRAN and C. As the name implies, object-oriented languages employ objects to create convenient abstractions of data structures. This allows for more flexible programs, fewer lines of code, and a more natural programming paradigm in general. An object is simply a modular unit of data and associated functions, related to the state and behavior, respectively,  of some abstract entity. Object-oriented languages group similar objects into classes. For example, consider a Python class representing a bird:

In [81]:
class bird:
    # Class representing a bird

    name = 'bird'
    
    def __init__(self, sex):
        # Initialization method
        
        self.sex = sex

    def fly(self):
        # Makes bird fly

        print('Flying!')
        
    def nest(self):
        # Makes bird build nest

        print('Building nest ...')
        
    @classmethod
    def get_name(cls):
        # Class methods are shared among instances
        
        return cls.name

You will notice that this `bird` class is simply a container for a handful of functions (called *methods* in Python): the initializer `__init__`, the instance methods `fly` and `nest`, and the class method `get_name`, as well as one class attribute, `name`. The methods represent functions in common with all members of this class. You can run this code in Python, and create birds:

In [82]:
Tweety = bird('male')
Tweety.name

'bird'

In [83]:
Tweety.fly()

Flying!


In [84]:
Foghorn = bird('male')
Foghorn.nest()

Building nest ...


A `classmethod` can be called without instantiating an object.

In [85]:
bird.get_name()

'bird'

Whereas standard methods cannot:

In [86]:
bird.fly()

TypeError: fly() missing 1 required positional argument: 'self'

As many instances of the `bird` class can be generated as desired, though it may quickly become boring. One of the important benefits of using object-oriented classes is code re-use. For example, we may want more specific kinds of birds, with unique functionality:

In [87]:
class duck(bird):
    # Duck is a subclass of bird

    name = 'duck'
    
    def swim(self):
        # Ducks can swim

        print('Swimming!')

    def quack(self,n):
        # Ducks can quack
    
        print('Quack! ' * n)

Notice that this new `duck` class refers to the `bird` class in parentheses after the class declaration; this is called **inheritance**. The subclass `duck` automatically inherits all of the variables and methods of the superclass, but allows new functions or variables to be added. In addition to flying and nest-building, our duck can also swim and quack:

In [88]:
Daffy = duck('male')
Daffy.swim()

Swimming!


In [89]:
Daffy.quack(3)

Quack! Quack! Quack! 


In [90]:
Daffy.nest()

Building nest ...


Along with adding new variables and methods, a subclass can also override existing variables and methods of the superclass. For example, one might define `fly` in the `duck` subclass to return an entirely different string. It is easy to see how inheritance promotes code re-use, sometimes dramatically reducing development time. Classes which are very similar need not be coded repetitiously, but rather, just extended from a single superclass. 
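
Overriding can be sketched as follows. The `Bird` and `Duck` classes here are hypothetical, pared-down versions of the ones above, and `fly` returns its string rather than printing it so that the result is easy to inspect:

```python
class Bird:
    # Hypothetical, pared-down version of the `bird` class above
    name = 'bird'

    def fly(self):
        return 'Flying!'

    def nest(self):
        return 'Building nest ...'

    @classmethod
    def get_name(cls):
        return cls.name


class Duck(Bird):
    # Overrides a class attribute of the superclass
    name = 'duck'

    def fly(self):
        # Overrides Bird.fly for Duck instances only
        return 'Flapping low over the pond!'


daffy = Duck()

print(daffy.fly())        # uses the overridden method
print(daffy.nest())       # still inherited unchanged from Bird
print(Duck.get_name())    # the class method sees the overridden attribute
print(Bird().fly())       # the superclass itself is unaffected
```

Only the methods that differ need to be written in the subclass; everything else is looked up on the superclass.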

This brief introduction to object-oriented programming is intended only to introduce new users of Python to this programming paradigm. There are many more salient object-oriented topics, including interfaces, composition, and introspection. I encourage interested readers to refer to any number of current Python and OOP books for a more comprehensive treatment.

## In Python, everything is an object

Everything (and I mean *everything*) in Python is an object, in the sense that it possesses attributes, such as methods and variables, that we usually associate with more "structured" objects like those we created above.

Check it out:

In [91]:
dir(1)

['__abs__',
 '__add__',
 '__and__',
 '__bool__',
 '__ceil__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floor__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__index__',
 '__init__',
 '__int__',
 '__invert__',
 '__le__',
 '__lshift__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rlshift__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__round__',
 '__rpow__',
 '__rrshift__',
 '__rshift__',
 '__rsub__',
 '__rtruediv__',
 '__rxor__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__trunc__',
 '__xor__',
 'bit_length',
 'conjugate',
 'denominator',
 'from_bytes',
 'imag',
 'numerator',
 'real',
 'to_bytes']

In [92]:
(1).bit_length()

1

This has implications for how assignment works in Python.

Let's create a trivial class:

In [93]:
class Thing: pass

and instantiate it:

In [94]:
x = Thing()
x

<__main__.Thing at 0x104e150f0>

Here, `x` is simply a "label" for the object that we created when calling `Thing`. That object resides at the memory location that is shown when we print `x`. Notice that if we create another `Thing`, we create a new object, and give it a label. We know it is a new object because it has its own memory location.

In [95]:
y = Thing()
y

<__main__.Thing at 0x104e151d0>

What if we assign `x` to `z`?

In [96]:
z = x
z

<__main__.Thing at 0x104e150f0>

We see that the object labeled with `z` is the same object as that labeled with `x`. So, we say that `z` is a label (or name) with a *binding* to the object created by `Thing`.

So, there are no "variables", in the sense of a container for values, in Python. There are only labels and bindings.

In [97]:
x.name = 'thing x'

In [98]:
z.name

'thing x'
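
Rather than comparing memory addresses by eye, the same thing can be checked with the `is` operator and the built-in `id` function. A small sketch (the variable names holding the results are just for illustration):

```python
class Thing:
    pass


x = Thing()
y = Thing()
z = x    # binds a second name to the same object; nothing is copied

same_xz = x is z              # True: two labels, one object
same_xy = x is y              # False: two separate objects
ids_match = id(x) == id(z)    # True: identical identity

# An attribute set through one label is visible through the other
z.name = 'thing x'
shared_name = x.name

# Rebinding z to a new object leaves x untouched
z = Thing()
still_same = x is z           # False

print(same_xz, same_xy, ids_match, shared_name, still_same)
```

Note the difference between *mutating* the object (setting `z.name`) and *rebinding* the label (`z = Thing()`): only the former is visible through `x`.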

This can get you into trouble. Consider the following (seemingly innocuous) way of creating a dictionary of empty lists:

In [99]:
evil_dict = dict.fromkeys(column_names, [])
evil_dict

{'Group': [], 'Patient': [], 'Stool': [], 'Taxon': [], 'Tissue': []}

In [100]:
evil_dict['Tissue'].append(5)

In [101]:
evil_dict

{'Group': [5], 'Patient': [5], 'Stool': [5], 'Taxon': [5], 'Tissue': [5]}

Why did this happen?
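
One way to investigate is to ask whether the values are really separate lists. In the sketch below, `column_names` is a shortened, hypothetical list used only for illustration; `dict.fromkeys` binds every key to the single list object it was given, whereas a dict comprehension evaluates `[]` afresh for each key:

```python
# Hypothetical, shortened list of keys for illustration
column_names = ['Patient', 'Group', 'Tissue']

# Every key is bound to the one list object passed in
shared = dict.fromkeys(column_names, [])
shared['Tissue'].append(5)
all_same_object = all(v is shared['Tissue'] for v in shared.values())

# A dict comprehension creates a new list for each key
separate = {name: [] for name in column_names}
separate['Tissue'].append(5)

print(shared)
print(all_same_object)
print(separate)
```

With separate lists, appending to `separate['Tissue']` leaves the other values empty.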

## References

* [Learn Python the Hard Way](http://learnpythonthehardway.org/book/)  
* [Learn X in Y Minutes (where X=Python)](http://learnxinyminutes.com/docs/python/)  
* [29 common beginner Python errors on one page](http://pythonforbiologists.com/index.php/29-common-beginner-python-errors-on-one-page/)
* [Understanding Python's Execution Model](http://www.jeffknupp.com/blog/2013/02/14/drastically-improve-your-python-understanding-pythons-execution-model/)