# Introduction to Scientific Python #

## Purpose of this tutorial ##

This tutorial aims to be a not-so-gentle but useful introduction to the Python programming language. The intended audience is anyone with some familiarity with programming but limited experience using Python for scientific work. Good computational skills are an increasingly important part of the skill set for graduate students and researchers in any field. Unfortunately, the education and tools available can be very patchy. There's nothing wrong with learning as you go, but it WILL make more work for you down the road. You will run into a problem you can't solve with your specific toolkit, your dataset will get too big, you won't be able to figure out why your analysis takes hours to run, and you'll spend days making graphs for a paper that could have been finished in half an hour. Particularly for people who are just starting out, it pays to put in some hours up front learning good programming and data analysis practices.

## Why Python? ##

Python is an excellent language for dealing with data. Its libraries for statistics, graphing, data management and scientific programming are extremely good and constantly improving. Python is also a language that can grow with you. Languages like R are also excellent for data but are fairly domain-specific and will not give you the tools to solve a wider range of problems. You may not be planning to write a web application, machine learning system or GUI today, but if you start in Python those tools will be there when you need them.

## Table of Contents ##

1. [Introduction](00_scipy_introduction.ipynb)
    1. Setting Up Your Environment
    2. The Scientific Python Stack
    
2. [Programming in Python](01_scipy_programming.ipynb)
    1. Types and data structures
    2. Branching and conditionals
    3. Loops and iteration
    4. Functions and Functional Programming
    5. Classes
    6. Input & Output

3. [Essential Tasks and the Libraries to do them](02_scipy_essentials.ipynb)
    1. Numeric data with Numpy
    2. Everything data with Pandas
    3. Graphing with matplotlib
    4. Statistics with Numpy & Scipy

## Setting Up Your Environment ##

For scientific programming I highly recommend using one of the prebuilt distributions for science and data analysis. Anaconda, Canopy and Python(x,y) are all good, but for this tutorial we will be using Anaconda. There are a couple of reasons for using one of these distributions rather than installing vanilla Python and then downloading the necessary libraries yourself. Anaconda is batteries-included: it provides a Python environment that already has all the vital libraries and manages them for you. Particularly on Windows, manually installing some of the math libraries is a hassle because they contain C, Fortran and other non-Python code that must be compiled every time you update them. All of these distributions also give you an environment for interactive computing (rapidly modifying your code and seeing the results), which you will find extremely handy for common tasks such as tweaking graphs for publications.

1. Go to this link and download the distribution for your system: [Anaconda Download](https://www.continuum.io/downloads)
2. Open the downloaded installer and follow the installation instructions.
3. That's it.
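
Once the installer finishes, a quick check like the following can confirm that the core libraries are importable. This is a minimal sketch: the helper name `installed_versions` is made up for illustration, and the list of packages checked at the bottom is an assumption based on the libraries this tutorial uses.

```python
import importlib


def installed_versions(package_names):
    """Return a dict mapping each name to its version, or None if missing."""
    versions = {}
    for name in package_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The package could not be imported in this environment.
            versions[name] = None
        else:
            # Not every module defines __version__, so fall back to "unknown".
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    core = ["numpy", "scipy", "pandas", "matplotlib"]
    for name, version in installed_versions(core).items():
        print(name, version if version else "NOT INSTALLED")
```

If any line reports `NOT INSTALLED`, the Python being run is probably not the one Anaconda installed.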

## Familiarize yourself with your tools ##

Anaconda comes with several tools and interfaces that you will want to get to know. Many of these, such as Conda, closely resemble industry-standard Python tools (pip), so skills learned with them will transfer easily.

### 1. [IPython Command Prompt](http://ipython.readthedocs.io/en/stable/) ###

![IPython Shell](files/img/ipythonshell.png)

The IPython command prompt is an enhanced version of the Python shell for interactively running Python code. Any code you type in is run immediately within the shell. As you are learning, I suggest you keep an IPython shell open to try out lines of code and experiment. Open one by typing the command `ipython` into your system's command prompt or by opening the IPython QtConsole.

#### Magic Commands ( % ) ####
Commands that start with `%` or `%%` are magic commands that interact with the IPython environment. When invoked in Jupyter notebooks, `%%` applies the magic command to the entire cell. Try the commands below to start logging your command history to a file; when you are finished, just type `%logstop`.

[Full list of Magics](http://ipython.readthedocs.io/en/stable/interactive/magics.html)

~~~~
%logstart mydirectory/logfile.py
x = 'this command will be recorded in our logfile'
%logstop
~~~~

#### System Commands ( ! ) ####
Commands that start with `!` are passed to your system shell. They act as though you had typed them into your system's normal command prompt, but they can interact directly with Python code and their output can be saved in Python variables.

~~~~
!ping bbc.com
results = !ping bbc.com
~~~~
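Outside of IPython the `!` syntax is not available. A rough plain-Python counterpart, sketched here with the standard library's `subprocess` module, captures a command's output as a list of lines, much as `results = !command` does. The command being run (the Python interpreter printing two lines) is only a stand-in, chosen because it behaves the same on every platform.

```python
import subprocess
import sys

# Run a command and capture what it prints, as text rather than bytes.
# sys.executable is the path of the Python interpreter running this code.
output = subprocess.check_output(
    [sys.executable, "-c", "print('first line'); print('second line')"],
    universal_newlines=True,
)

# Split into a list of lines, similar to what IPython stores in `results`.
results = output.splitlines()
print(results)
```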

#### Enhanced Object Information ( ? ) ####
Commands that start or end with `?` or `??` provide information about an object. For now, just know that this can be used to print useful information about variables.

~~~~
myvariable = 125
?myvariable
??myvariable
~~~~
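The `?` syntax only works inside IPython. In an ordinary script, the built-ins `type` and `dir` give roughly similar information. The sketch below approximates part of what `?` reports and is not a reproduction of it.

```python
myvariable = 125

# The type of the object, which `?` also reports.
print(type(myvariable).__name__)

# The first line of the type's docstring.
print(type(myvariable).__doc__.splitlines()[0])

# Names of the public attributes and methods available on the object.
public_names = [name for name in dir(myvariable) if not name.startswith("_")]
print(public_names)
```

Calling `help(myvariable)` prints the full documentation in the same spirit as `??`.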

Everything else you type in will be interpreted as Python code.

~~~~
x = 'hello world'
print(x)
~~~~

### 2. [Jupyter/IPython Notebooks](https://jupyter.readthedocs.io/en/latest/index.html) ###

![Jupyter Notebooks](files/img/jupyterpreview.png)

Jupyter notebooks are a web server/browser interface for running Python code and displaying the results. They can also include text, graphics and mathematical formulas, and can be exported to a variety of formats. Jupyter notebooks were created with the goal of enabling shareable and reproducible analysis for research, and they are excellent for this purpose. A well-documented Jupyter notebook can be used to clean your data, run your analyses, generate publication-quality graphics and explain what you did, and can then be exported to make a manuscript, a slideshow, a blog post and so on.

#### Cells ####
A Jupyter notebook is made up of a series of 'cells'. Cells are blocks of code or text that can be executed independently of one another. While cells execute independently, they do not have separate namespaces: a variable declared in one cell can be accessed or overwritten by another cell. Be careful, as this can introduce unexpected behavior.
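
The shared namespace is easy to demonstrate. The sketch below imitates three cells in plain Python by running three code strings against one dictionary that stands in for the notebook's namespace. It illustrates the behavior only and is not how Jupyter is implemented.

```python
# One dictionary plays the role of the notebook's shared namespace.
namespace = {}

cell_1 = "samples = [1, 2, 3]"
cell_2 = "samples = samples + [4]"   # overwrites what cell_1 defined
cell_3 = "total = sum(samples)"

# Running cell_2 twice changes the final result, just as re-running
# a notebook cell would.
for cell in (cell_1, cell_2, cell_2, cell_3):
    exec(cell, namespace)

print(namespace["samples"], namespace["total"])
```

Because `cell_2` ran twice, `samples` ends up with an extra element. When results look odd, restarting the kernel and running all cells from top to bottom is a simple way to rule this out.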

#### Kernel ####
The 'heart' of the Jupyter notebook is the kernel, the program that interprets and executes the code you write in your notebook. The main kernel is of course Python, but notebooks let you swap out the kernel and use different versions of Python or other languages such as R, Julia, Octave, Matlab, Haskell and many others.

#### Interface ####
Open up a new Jupyter Notebook now and you should see some familiar menu items such as File -> Save and some unfamiliar ones such as File -> Revert to Checkpoint. Take a few minutes to play with these and familiarize yourself with the environment. I suggest trying Help -> User Interface Tour. Make sure you can:

1. Run a cell
2. Create and delete cells
3. Change the content type of cells from code to markdown and back
4. Export your notebook to different formats
5. Open and close new notebooks
6. Save!!
7. Revert to a checkpoint

### 3. [Conda Package and Environment Manager](https://conda.io/docs/get-started.html) ###

#### Packages ####
The Conda package manager is an open source package management program that comes with Anaconda. Conda fills much the same role as pip, the default Python package manager, and its basic commands (such as `install` and `list`) look very similar, although the two tools are not identical. Packages are libraries of Python code that are not included in the basic Python installation and add some additional functionality. By default Anaconda comes with a number of packages preinstalled. You can see these by opening up a command prompt and typing:

~~~~
conda list
~~~~

Conda can be used to install any package found in the Conda repository, which should contain almost any package you'll need for now. The following commands search for and install a package. To install a package into a particular environment, specify the name of the environment with the install command.

~~~~
conda search beautifulsoup4
conda install beautifulsoup4
~~~~
or
~~~~
conda install --name environmentName beautifulsoup4
~~~~

#### Virtual Environments ####

Conda also manages what are called virtual environments. A virtual environment is an independent installation of Python which can include configuration information and a set of packages. Because Python is an interpreted language, all Python scripts are executed in some environment. Anaconda comes with a default environment but you can also create your own. This can be useful if you have scripts that require out-of-date packages to run, if you want to distribute a program along with its dependencies to colleagues, or simply for keeping your particular setup and package list all in one place. Using virtual environments is good practice and standard in industry, so it's a good habit to pick up.

The following code creates an environment called myLanguageVenv using Python 2.7 and installs two useful packages for dealing with text: the Natural Language Toolkit (NLTK) and beautifulsoup, a web scraping package. We then list information about all available environments and use the activate command to switch to our newly created environment.

~~~~
conda create --name myLanguageVenv python=2.7 nltk beautifulsoup4
conda info --envs
activate myLanguageVenv
~~~~
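After activating an environment it is worth confirming from inside Python which interpreter is actually running. A minimal check using only the standard library might look like this. The exact paths printed depend on where Anaconda was installed, so none are shown here.

```python
import sys

# Path of the interpreter executing this code. Inside a Conda environment
# this normally points into that environment's directory.
print(sys.executable)

# Root directory of the active Python installation or environment.
print(sys.prefix)

# Interpreter version, for example to confirm the 2.7 requested above.
major, minor = sys.version_info[:2]
print("Python %d.%d" % (major, minor))
```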

### 4. Text Editor ###

The old ways are still strong. We will mostly be using Jupyter Notebooks for this tutorial, but you should make sure you have access to a good programming-focused text editor. Atom, Notepad++, SublimeText, Vim, Emacs and others are all good options.

### 5. IDE ###

For scientific data analysis, integrated development environments (IDEs) are not particularly useful. However, if you find yourself doing web development, GUI programming or application development, I highly recommend [PyCharm](https://www.jetbrains.com/pycharm/download), which is free for educational institutions. IDEs include advanced tools for programming, builds and interacting with external tools, and can be essential for certain kinds of programming.

### 6. A Web Browser ###

Never code alone. The best part of the Python ecosystem for scientific computing is the incredible range of resources, tutorials and community available. My entirely subjective opinion is that Python has the best community by far and that its official documentation is a close second to that of industry giants like Matlab. Here are some of my favorite resources.

- [Python Official Docs](https://www.python.org/doc/)
- [Scientific Python Hub](http://www.scipy.org/)
- [Scipy Lecture Notes (very good)](http://www.scipy-lectures.org/)
- [Software Carpentry Lessons](https://software-carpentry.org/lessons/)
- [Data Carpentry Lessons](http://www.datacarpentry.org/lessons/)
- [NBViewer (view/share interesting notebooks online)](https://nbviewer.jupyter.org/)
- [Curated Collection of Interesting Jupyter Notebooks](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks)
- [Collection of Machine Learning notebooks/resources](https://sebastianraschka.com/notebooks/python-notebooks.html)
- [Pandas Cookbook](https://github.com/jvns/pandas-cookbook)
- [Pystats Subreddit](https://www.reddit.com/r/pystats/)
- [PyData Conference (videos of all talks)](https://www.youtube.com/user/PyDataTV/videos)

## The Scientific Python Stack ##

The type of programming you will be doing as a graduate student or scientist is a bit different from the work done by professional developers. Scientific programming is almost always data-focused work with the goal of using statistics and analysis to demonstrate and document some interesting effect. The main challenges all programmers working with data face are reliably importing data from various formats, getting it into a useful data structure, and then producing relevant analyses and graphs from the data. In the last few years a set of Python libraries for working with data has emerged as the de facto standard for this kind of work. Taken together, these libraries cover most of the workflow for scientific work and are often called the scientific Python stack.

### IPython ###
An enhanced environment for doing scientific work in Python. It provides a number of workflow enhancements such as embedded Markdown and MathJax formulas for documentation, inline graphing and more.

### Numpy ###
The backbone library for scientific work in Python. Numpy provides fixed-type numeric array and matrix data structures for Python, along with fast matrix and linear algebra functions. Numpy arrays should always be used over native Python lists for numeric data.
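
As a small illustration of the difference, arithmetic on a Numpy array applies to every element at once, whereas the same operator on a list means something else entirely. The values here are arbitrary.

```python
import numpy as np

values = [1.0, 2.0, 3.0, 4.0]
array = np.array(values)

# With a list, * repeats the list rather than multiplying its elements.
print(values * 2)

# With an array, * multiplies element by element, with no explicit loop.
print(array * 2)

# Arrays have a single fixed element type and built-in reductions.
print(array.dtype, array.mean())
```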

### Pandas ###
A data analysis library which provides a spreadsheet-like DataFrame object for efficiently storing and manipulating data, as well as functions for importing and exporting data. It acts as a powerful wrapper around Numpy objects.
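
A minimal sketch of the DataFrame idea follows. The column names and values are invented purely for illustration.

```python
import pandas as pd

# A tiny table with made-up values; each dictionary key becomes a column.
df = pd.DataFrame({
    "group": ["control", "control", "treatment", "treatment"],
    "weight": [10.0, 12.0, 15.0, 17.0],
})

# Select rows with a boolean condition, much like a spreadsheet filter.
heavy = df[df["weight"] > 11.0]
print(heavy)

# A numeric column is backed by a Numpy array.
print(type(df["weight"].values))

# Summarize one column per group.
group_means = df.groupby("group")["weight"].mean()
print(group_means)
```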

### SciPy ###
The library containing most of Python's scientific and statistical tools. It contains functions for statistics, signal processing, integration, linear algebra, clustering and many others.
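
As one example of the statistical tools, `scipy.stats.ttest_ind` runs an independent two-sample t-test. The two samples below are made-up numbers, and whether a t-test is the appropriate test depends on your data.

```python
import numpy as np
from scipy import stats

# Two small samples with made-up values.
control = np.array([4.9, 5.1, 5.0, 5.2, 4.8])
treatment = np.array([5.9, 6.1, 6.0, 6.2, 5.8])

# Independent two-sample t-test; returns the t statistic and the p-value.
t_statistic, p_value = stats.ttest_ind(control, treatment)
print(t_statistic, p_value)
```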

### Matplotlib ###
The main graphing library for Python. It provides an interface for creating and manipulating graphics, everything from simple scatter plots to animated heat maps.
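
A simple scatter plot can be sketched as follows. The data points and the output file name are arbitrary. The `Agg` backend is selected only so the example runs without a display; inside a notebook that line is unnecessary.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # draw to a file; omit this line inside a notebook
import matplotlib.pyplot as plt

# Made-up points for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A simple scatter plot")

# Save the figure to the system's temporary directory.
output_path = os.path.join(tempfile.gettempdir(), "scatter_example.png")
fig.savefig(output_path)
```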

#### Example ####
Here's a high-level overview of how you would use these libraries to complete a common scientific task. A collaborator has sent you an Excel spreadsheet containing the data from an experiment testing the effect of an intervention on a number of variables. Columns contain different variables and each row is an independent data point. You've been asked to take a look at the data and see which variables, if any, the treatment had an effect on. How would we go about this with Python?

1. Open an IPython Notebook.

2. Use Pandas' import functionality to load the spreadsheet into a Pandas DataFrame object. The numeric data will automatically be converted to Numpy arrays and the column names will be preserved.

3. Using Pandas' grouping functionality, split the data into the control and treatment groups.

4. Loop through all the variables of interest and, using SciPy, run the appropriate statistical tests between control and treatment for each variable.

5. Use Matplotlib to produce graphs of the data for each variable.

6. Export your IPython Notebook containing the results and graphs to a PDF and email it to your collaborator.
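
Steps 2 to 5 can be sketched in code roughly as follows. This is an illustration under stated assumptions, not a finished analysis. There is no real spreadsheet here, so the table is simulated with random numbers; in practice the simulated block would be replaced by something like `pd.read_excel("experiment.xlsx")`, where the file name is hypothetical. The column names `group`, `var_a` and `var_b` are also invented, and a t-test is used only as a stand-in for "the appropriate statistical test".

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

# Step 2. Stand-in for loading the spreadsheet: simulated, random data.
rng = np.random.RandomState(0)
n = 30
df = pd.DataFrame({
    "group": ["control"] * n + ["treatment"] * n,
    "var_a": np.concatenate([rng.normal(10, 1, n), rng.normal(12, 1, n)]),
    "var_b": rng.normal(5, 1, 2 * n),
})
variables = ["var_a", "var_b"]

# Step 3. Split the rows into control and treatment groups.
groups = df.groupby("group")
control = groups.get_group("control")
treatment = groups.get_group("treatment")

# Step 4. Run a two-sample t-test for each variable of interest.
p_values = {}
for column in variables:
    t_statistic, p_value = stats.ttest_ind(
        control[column].values, treatment[column].values)
    p_values[column] = p_value
    print(column, t_statistic, p_value)

# Step 5. One box plot per variable, comparing the two groups.
fig, axes = plt.subplots(1, len(variables))
for ax, column in zip(axes, variables):
    ax.boxplot([control[column].values, treatment[column].values])
    ax.set_xticklabels(["control", "treatment"])
    ax.set_title(column)

output_path = os.path.join(tempfile.gettempdir(), "experiment_boxplots.png")
fig.savefig(output_path)
```

With many variables, remember that running one test per variable raises the chance of a false positive, so some correction for multiple comparisons is usually in order.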