# Intro to python for R users

You will find that python and R have a number of similarities, but python has a bit more general-purpose tooling.

## indentation and "blocks"

Unlike in R, blocks are defined by whitespace. Indenting a group of lines (the convention is four spaces) has the same effect as wrapping them in `{}` in R. For example, this is an `if`/`else` in python:

In [33]:
value = 3
if value < 100:
    print('value is less than 100')
else:
    print('value is greater than or equal to 100')

value is less than 100
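
Blocks can also be nested, and `elif` chains several conditions together. Here is a small sketch of both (the function name `describe` is made up for illustration):

```python
def describe(value):
    # each extra level of indentation opens a new block, like a nested {} in R
    if value < 100:
        if value < 0:
            return 'negative'
        return 'less than 100'
    elif value == 100:
        return 'exactly 100'
    else:
        return 'greater than 100'

print(describe(3))
print(describe(100))
```

Mixing tabs and spaces inside one block is an error in python 3, so it is easiest to let your editor insert spaces.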


Instead of using the symbols `&`, `|`, and `!`, python uses the keywords `and`, `or`, and `not`:

In [34]:
True or False

True

In [35]:
True and False

False

In [36]:
not False

True
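
These keywords work on single values, much like `&&` and `||` in R, and they short-circuit. A quick sketch (the variable names are only for illustration):

```python
value = 3
others = []

# `and` / `or` combine single True/False values, like && and || in R
in_range = value > 0 and value < 100
print(in_range)

# they short-circuit: the right-hand side is skipped when the left side
# already decides the answer, so others[0] is never evaluated here
safe = len(others) > 0 and others[0] > 0
print(safe)
```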

## primitives and datatypes

Like R, python is dynamically typed, so datatypes are determined at runtime. Python also has the usual primitive types, such as integers, floats, and strings:

In [37]:
x = 3
print(x)
type(x)

3


int

In [38]:
type(3.14)

float

In [39]:
type('this is string')

str
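
Converting between these types is explicit, roughly like `as.integer()`, `as.numeric()`, and `as.character()` in R. A minimal sketch using only built-in functions:

```python
# convert between types explicitly
n = int('42')
x = float(3)
s = str(3.14)

print(n)
print(x)
print(s)

# isinstance is the usual way to test a type, similar to is.numeric() in R
print(isinstance(n, int))
print(isinstance(s, str))
```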

## containers

Python also has container data structures similar to R. You can create a list using either `[]` or `list()`:

In [40]:
a_list = []
a_list.append(3)
a_list.append(7)
a_list

[3, 7]

Lists are indexed by integer position rather than by name, and indices start at 0, not 1 (unlike R lists). Like R lists, they can contain mixed data types:

In [41]:
a_list.append('adding a string')
a_list

[3, 7, 'adding a string']

In [42]:
a_list[0]

3

In [43]:
a_list[2]

'adding a string'
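
Two more indexing features are worth knowing if you come from R: negative indices and slices. A short sketch, rebuilding the same list so it runs on its own:

```python
a_list = [3, 7, 'adding a string']

last = a_list[-1]        # negative indices count from the end
first_two = a_list[0:2]  # slices include the start and exclude the end
n_items = len(a_list)    # like length() in R

print(last)
print(first_two)
print(n_items)
```

Note that a negative index in python counts from the end; it does not drop elements as it does in R.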

You can also check if a value is inside of a list by using the keyword `in`:

In [44]:
7 in a_list

True

In [45]:
0 in a_list

False

I generally do not recommend this for large lists, as the run time is linear in the size of the list. If you find yourself doing this type of lookup often, you might want to use a dictionary instead (below).

To get behavior similar to named lists in R, you have to use a different data structure called a dictionary. A dictionary is essentially a hash table. You can create a new dictionary using `{}` or `dict()`.

In [46]:
a_dict = {}
a_dict['name 1'] = 3
a_dict[3] = 0
a_dict

{3: 0, 'name 1': 3}

For every entry in a dictionary, there is a "key" (analogous to the name in R) and a "value". Notice that we can use integers or strings as keys. The keys above are `3` and `'name 1'` with values 0 and 3, respectively.

You can also use the keyword `in` with a dictionary, where it checks the keys (not the values). It is much more efficient here, as the lookup takes constant time on average:

In [47]:
3 in a_dict

True

In [48]:
0 in a_dict

False
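
A few other dictionary operations come up all the time. The sketch below rebuilds the dictionary from above and uses only standard dictionary methods:

```python
a_dict = {'name 1': 3, 3: 0}

# .get returns a default instead of raising KeyError for a missing key
missing = a_dict.get('no such key', -1)

# loop over (key, value) pairs, similar to looping over names(x) in R
for key, value in a_dict.items():
    print('{!r} -> {!r}'.format(key, value))

# keys can be removed with del
del a_dict[3]
print(a_dict)
```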

## loops

The most common loop in python is the `for` loop, which works much as it does in R. The function `range` (python) is similar to the function `seq` (R), but keep in mind that it starts at 0 by default and the end is exclusive (not included).

In [49]:
some_numbers = range(10)
for x in some_numbers:
    if x % 2 == 0:
        print('{} is even'.format(x))

0 is even
2 is even
4 is even
6 is even
8 is even
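
`range` also accepts a start and a step, which makes the comparison with `seq` more direct. A small sketch (wrapping in `list` only so the values can be inspected):

```python
# range(start, stop, step): stop is excluded, so this matches seq(1, 10) in R
one_to_ten = list(range(1, 11))

# every second number, like seq(0, 8, by = 2) in R
evens = list(range(0, 10, 2))

# enumerate yields (index, item) pairs when you need both
labels = []
for i, letter in enumerate(['a', 'b', 'c']):
    labels.append('{}:{}'.format(i, letter))

print(one_to_ten)
print(evens)
print(labels)
```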


## functions

Functions are defined with the keyword `def`, and the body is an indented block. Unlike in R, the value of the last expression is not returned automatically, so you need an explicit `return`:

In [50]:
def is_even(x):
    return x % 2 == 0

In [51]:
is_even(3)

False

In [52]:
is_even(10)

True
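
As in R, arguments can have default values and can be passed by name. A sketch with a made-up function `is_multiple` that generalizes `is_even`:

```python
def is_multiple(x, of=2):
    # `of` has a default value, so it can be left out by the caller
    return x % of == 0

print(is_multiple(10))         # uses the default of=2
print(is_multiple(10, of=3))   # keyword argument, as in R
```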

# numpy, scipy, and importing packages

Python's built-in types are not vectorized, but packages such as numpy add vectorized operations. Likewise, the core language has little support for statistics, but the package scipy covers a lot of the basics (distributions, extensions on random samples, etc.).

To import packages in python, you can use the `import` keyword:

In [53]:
import numpy

Unlike `library()` in R, `import` does not bring the contents of numpy into the global namespace; they stay in the namespace `numpy`. That means that to use the numpy log function, you need to prefix it with `numpy.`:

In [54]:
numpy.log([3, 1, 10])

array([ 1.09861229,  0.        ,  2.30258509])

Alternatively, python allows you to alias a namespace using the `as` keyword:

In [55]:
import numpy as np
np.log([3, 1, 10])

array([ 1.09861229,  0.        ,  2.30258509])

You can also use the keyword `from` to import only a few functions directly into the global namespace:

In [56]:
from math import log, exp
log(3)

1.0986122886681098

All vectorized operations in numpy act on `array` objects. They are similar to lists, but more efficient, because they have a fixed size and store elements of a single type. You can create one either by passing a list to the function `array` or by passing the dimensions to the function `zeros`:

In [57]:
arr_1 = np.array([1, 15, 9])
arr_1

array([ 1, 15,  9])

In [58]:
arr_2 = np.zeros(10)
arr_2

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

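You can also make an n-dimensional array by passing a tuple of dimensions to `zeros` (this is how you would make a matrix). Calling `np.ndarray` directly is best avoided, as it returns uninitialized memory whose contents are arbitrary. Notice that the matrix is laid out row-wise, rather than column-wise as in R.

In [59]:
np.zeros((3, 2))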

array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])
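
The row-wise layout is easiest to see when the values differ. A sketch using `reshape`, where `order='F'` asks for the column-wise (Fortran) fill that R uses:

```python
import numpy as np

values = np.arange(6)  # the integers 0 through 5

# default (row-major): fill each row before moving to the next
by_row = values.reshape((3, 2))

# order='F' fills column by column, which is what matrix(0:5, nrow = 3) does in R
by_col = values.reshape((3, 2), order='F')

print(by_row)
print(by_col)
```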

Unlike lists, every single element must be of the same data type (like vectors or matrices in R). You can specify the data type with the optional argument `dtype`. If you do not, numpy infers it from the data. Note that the inferred type might not be the one you want:

In [60]:
arr_1.dtype

dtype('int64')

In [61]:
arr_2.dtype

dtype('float64')

In [62]:
arr_3 = np.array([1.0, 15, 9], dtype=np.int32)
arr_3.dtype

dtype('int32')
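
Be careful when forcing an integer type onto fractional data. A small sketch with made-up numbers, using `astype` to convert an existing array:

```python
import numpy as np

measurements = np.array([1.7, 15.2, 9.9])

# converting to an integer type drops the fractional part (it does not round)
as_int = measurements.astype(np.int32)

# mixing integers and floats in one array promotes everything to float
mixed = np.array([1, 2.5, 3])

print(as_int)
print(mixed.dtype)
```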

### vectorization

Numpy allows you to do vectorization as you would in R. For example, we can take the log of a series of numbers and multiply each result by a constant:

In [63]:
all_x = np.array(range(1, 11))
np.log(all_x) * 5.0

array([  0.        ,   3.4657359 ,   5.49306144,   6.93147181,
         8.04718956,   8.95879735,   9.72955075,  10.39720771,
        10.98612289,  11.51292546])
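
Comparisons and logical indexing are vectorized too, which should feel familiar from R. A sketch that rebuilds `all_x` so it runs on its own:

```python
import numpy as np

all_x = np.arange(1, 11)

# comparisons are element-wise and give a boolean array
is_even = all_x % 2 == 0

# a boolean array can be used as an index, like x[x %% 2 == 0] in R
evens = all_x[is_even]

# element-wise "and" uses &, because the keyword `and` only works on single values
big_evens = all_x[(all_x % 2 == 0) & (all_x > 5)]

print(evens)
print(big_evens)
```

The parentheses around each comparison are needed because `&` binds more tightly than `==` and `>`.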

Note that the `log` function called here is `np.log`, not the `log` we imported from the package `math` earlier. The `math` version only accepts single numbers:

In [64]:
log(all_x) * 5.0

TypeError: only length-1 arrays can be converted to Python scalars

You can vectorize any function that accepts primitive data types using the numpy function `np.vectorize`:

In [65]:
def my_abs(x):
    if x >= 0:
        return x
    return -x

In [66]:
test_list = [1, -1, 0, -30, 100]

v_my_abs = np.vectorize(my_abs)
v_my_abs(test_list)

array([  1,   1,   0,  30, 100])

In [67]:
np.sum(v_my_abs(test_list))

132
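
Keep in mind that `np.vectorize` is a convenience: it still loops over the elements in python, so it is not a speed-up. When numpy already ships a compiled equivalent (here `np.abs`), that is usually the better choice. A sketch comparing the two:

```python
import numpy as np

def my_abs(x):
    if x >= 0:
        return x
    return -x

test_list = [1, -1, 0, -30, 100]

# np.vectorize wraps a python-level loop around my_abs
looped = np.vectorize(my_abs)(test_list)

# np.abs does the same work in compiled code
built_in = np.abs(test_list)

print(np.array_equal(looped, built_in))
```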

You can also use the map-reduce paradigm without using numpy.

The function `map` is very similar to `lapply` (or `mapply` when given several lists). It takes a function and one or more lists, and applies the function element by element. In python 3, `map` returns a lazy iterator, so we wrap it in `list` to see the results.

In [68]:
list(map(my_abs, test_list))

[1, 1, 0, 30, 100]

You can then aggregate things using `reduce`:

In [69]:
all_abs = list(map(my_abs, test_list))
all_abs

[1, 1, 0, 30, 100]

In [70]:
from functools import reduce  # needed on python 3, harmless on python 2
reduce(lambda x, y: x + y, all_abs, 0)

132

Of course, we could also have used the built-in function `sum`:

In [71]:
sum(all_abs)

132
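
In everyday python, a list comprehension is a common alternative to `map`. A sketch that repeats `my_abs` and `test_list` so it runs on its own:

```python
def my_abs(x):
    if x >= 0:
        return x
    return -x

test_list = [1, -1, 0, -30, 100]

# a list comprehension does the job of map ...
all_abs = [my_abs(x) for x in test_list]

# ... and an optional `if` clause filters at the same time
positive_only = [x for x in test_list if x > 0]

total = sum(all_abs)
print(all_abs)
print(positive_only)
print(total)
```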

# problem 1

We are going to implement an efficient version of the function `table()` from R using dictionaries and functions. Write a function `add_occurrence` that takes 2 arguments: `value` and `counter`. The variable `counter` is a dictionary that maps each value to its count. The function `table` below calls it once per element, and the signature will look something like this:

In [1]:
def table(a_list):
    counter = {}
    for value in a_list:
        add_occurrence(value, counter)
        
    return counter

In [2]:
def add_occurrence(value, counter):
    # add in code here
    pass
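
To check your answer, you can compare against `collections.Counter` from the standard library, which counts occurrences in the same way. This is only a reference for testing, not the solution; the sample data is made up:

```python
from collections import Counter

values = ['a', 'b', 'a', 'c', 'a']

# Counter is a dictionary subclass that maps each value to its count
expected = Counter(values)
print(expected['a'])

# once add_occurrence is written, this comparison should hold:
#     table(values) == dict(expected)
```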

# problem 2

Refer back to problem set 3, problem 2. We can re-implement that solution using python. A few functions you might need are:

- `scipy.special.gammaln`
- `log_choose` (implemented below)

Implement 2 different versions:
1. Using a `for` loop
2. Using vectorization and numpy

In [74]:
from scipy.special import gammaln

def log_choose(n, k):
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
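
If you want to sanity-check `log_choose` on single numbers without scipy, the same formula can be written with `math.lgamma` from the standard library. The name `log_choose_scalar` is made up for this sketch, and unlike `gammaln` it does not accept arrays:

```python
from math import exp, lgamma

def log_choose_scalar(n, k):
    # same formula as log_choose, but math.lgamma only accepts single numbers
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

# choose(5, 2) is 10, so exponentiating should give (approximately) 10
approx = exp(log_choose_scalar(5, 2))
print(round(approx, 6))
```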