Review
====

## Statistical foundations

- Experimental design
    - Usually want to isolate a main effect from confounders
    - Can we use a randomized experiment design?
        - Batch effects
- Replication
    - Essential for science
- Exploratory data analysis
    - Always eyeball the data
    - Facility with graphics libraries is essential
    - Even better are interactive graphics libraries (IPython notebook is ideal)
        - Bokeh
- As the amount of data grows
    - Simple algorithms may perform better than complex ones
    - Non-parametric models may perform better than parametric ones
    - But big data can often be interpreted as many pieces of small data
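
One way to combine randomization with protection against batch effects is to randomize within each batch (a blocked design). The sketch below is illustrative only: the sample names, the batch labels and the helper `randomize_within_batches` are hypothetical and do not come from these notes.

```python
import random
from collections import defaultdict

def randomize_within_batches(units, batch_of, seed=0):
    """Randomly assign half of each batch to treatment, the rest to control."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    by_batch = defaultdict(list)
    for unit in units:
        by_batch[batch_of[unit]].append(unit)

    assignment = {}
    for members in by_batch.values():
        shuffled = members[:]  # copy so the caller's list is not mutated
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for unit in shuffled[:half]:
            assignment[unit] = "treatment"
        for unit in shuffled[half:]:
            assignment[unit] = "control"
    return assignment

# Hypothetical data: 8 samples processed in 2 batches of 4
samples = ["s%d" % i for i in range(8)]
batch_of = dict((s, "batch_A" if i < 4 else "batch_B")
                for i, s in enumerate(samples))
assignment = randomize_within_batches(samples, batch_of)
```

Because every batch contributes equally to both arms, a difference between batches cannot masquerade as a treatment effect.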

## Computing foundations

- Polyglot programming
    - R and/or SAS (for statistical libraries)
    - Python (for glue and data munging)
    - C/C++ (for high performance)
    - Command line tools and [Unix philosophy](http://www.faqs.org/docs/artu/ch01s06.html)
    - SQL (for managing data)
    - Scala (for Spark)
- Need for concurrency
    - Functional style is increasingly important
        - Prefer immutable data structures
        - Prefer pure functions
            - Same input always gives same output
            - Does not cause any side effects
- With big data, lazy evaluation can be helpful
    - Prefer generators to lists
    - Look at the `itertools` module in the Python 2 standard library
- Composability for maintainability and extensibility
    - Small pieces, loosely joined
    - Combinator pattern
    - Again, all this was in the original [Unix philosophy](http://www.faqs.org/docs/artu/ch01s06.html)
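
A minimal sketch of how lazy evaluation, pure functions and composition fit together. The helper names (`compose`, `squares`, `evens`, `take`) are made up for illustration; only `count`, `islice` and `reduce` come from the standard library.

```python
from functools import reduce
from itertools import count, islice

def compose(*funcs):
    """Return a function that applies funcs from left to right."""
    return lambda x: reduce(lambda acc, f: f(acc), funcs, x)

def squares(numbers):
    # Generator expression: nothing is computed until a value is requested
    return (n * n for n in numbers)

def evens(numbers):
    return (n for n in numbers if n % 2 == 0)

def take(k):
    # Only this final step materializes a list
    return lambda numbers: list(islice(numbers, k))

# Small pieces, loosely joined: each stage is pure and knows nothing of the others
pipeline = compose(squares, evens, take(5))

# count(1) is an infinite stream, so this only works because evaluation is lazy
first_even_squares = pipeline(count(1))
print(first_even_squares)  # [4, 16, 36, 64, 100]
```

Replacing the generators with list comprehensions would never finish on the infinite input, which is the practical argument for preferring generators when the data is large.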

## Mathematical foundations

- Core: probability and linear algebra
- Calculus is important but secondary
- Graphs and networks increasingly relevant

## Statistical algorithms

- Numbers as leaky abstractions
- Don't just use black boxes
    - Make an effort to understand what each algorithm you call is doing
    - At minimum, can you explain what the algorithm is doing in plain English?
    - Can you implement a simple version from the ground up?
- Categories of algorithms
    - Big matrix manipulations (matrix decomposition is key)
    - Continuous optimization - order 0 (function values only), 1 (gradients), 2 (Hessians)
        - Most convenient = [scipy.optimize](http://docs.scipy.org/doc/scipy/reference/optimize.html#module-scipy.optimize)
        - For convex optimization [cvxpy](http://www.cvxpy.org/en/latest/)
    - EM algorithm has wide applicability in both frequentist and Bayesian domains
    - Graph algorithms
        - Most convenient = [networkx](http://networkx.github.io/)
        - Fastest on single and multi-core machines = [graph-tool](https://graph-tool.skewed.de/)
        - For cluster = [Spark GraphX](https://spark.apache.org/graphx/)
    - Monte Carlo methods, MCMC and simulations
- [Machine learning](https://www.cbinsights.com/blog/python-tools-machine-learning/)
    - [sklearn](http://scikit-learn.org/stable/)
    - MADLib in [SQL](https://github.com/madlib/madlib) or [Python](https://github.com/pivotalsoftware/pymadlib)
    - [Spark MLLib](https://spark.apache.org/mllib/)
- Making code fast
    - Make it run, make it right, make it fast
    - Python has amazing profiling tools - use them
    - For profiling C code, try [gperftools](https://code.google.com/p/gperftools/)
    - Compilation: Try numba or Cython in preference to writing raw C/C++
    - Parallel programming
        - Python GIL (only one thread runs Python bytecode at a time, so use processes for CPU-bound work)
        - Use Queue from threading or multiprocessing to build a pipeline
        - Skip OpenMP (except within Cython) and MPI
- Big data
    - [Spark](http://spark.apache.org/) is the killer app
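
As an illustration of implementing a simple version from the ground up, here is a sketch of a first-order method (gradient descent) and a second-order method (Newton's method) in one dimension. The test function, step size and tolerances are arbitrary choices for this example; in practice `scipy.optimize` would be the tool to reach for.

```python
def gradient_descent(grad, x0, step=0.1, tol=1e-10, max_iter=10000):
    """First-order method: uses only the gradient."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) < tol:  # gradient close to zero: stationary point reached
            break
        x = x - step * g
    return x

def newton(grad, hess, x0, tol=1e-10, max_iter=100):
    """Second-order method: also uses curvature (the second derivative)."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) < tol:
            break
        x = x - g / hess(x)  # step scaled by the inverse curvature
    return x

# Hypothetical test function f(x) = (x - 3)**2 + 1, minimized at x = 3
grad = lambda x: 2 * (x - 3)
hess = lambda x: 2.0

x_gd = gradient_descent(grad, x0=0.0)
x_newton = newton(grad, hess, x0=0.0)
```

On this quadratic, Newton's method lands on the minimum in a single step, while gradient descent needs on the order of a hundred iterations. That is the usual trade-off between cheap steps and informative steps.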


## Building a parallel pipeline

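In [23]:
from multiprocessing import Queue, Process

In [24]:
class Worker(Process):
    """Apply func to each item from in_queue and put the result on out_queue."""
    def __init__(self, func, in_queue, out_queue):
        super(Worker, self).__init__()
        self.func = func
        self.in_queue = in_queue
        self.out_queue = out_queue

    def run(self):
        while True:
            item = self.in_queue.get()  # blocks until an item is available
            if item is None:  # sentinel: pass it downstream, then stop
                self.out_queue.put(None)
                break
            self.out_queue.put(self.func(item))

In [20]:
# Note: lambdas cannot be pickled, so this relies on the fork start method
f1 = lambda x: x+1
f2 = lambda x: x*2
f3 = lambda x: x%4

In [21]:
# Each stage's output queue is the next stage's input queue
q0 = Queue(10)
q1 = Queue(10)
q2 = Queue(10)
q3 = Queue()  # unbounded, so final results can accumulate

In [26]:
ps = [Worker(f1, q0, q1), Worker(f2, q1, q2), Worker(f3, q2, q3)]
for p in ps:
    p.start()

for i in range(100):
    q0.put(i)
q0.put(None)  # signal that there is no more input

results = list(iter(q3.get, None))  # drain until the sentinel arrives
for p in ps:
    p.join()

print(len(results))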