## Predictive Modeling with Python - _IPython Notebooks and Viewing Data in Python_
#### Author: Kevin Bache

## Outline
1. Python
2. IPython Notebooks
3. Numpy
4. Pandas
5. Loading Data
6. Plotting Data

## Python and IPython
* `python` is a programming language and also the name of the program that runs Python programs
* To run a script from the command line, you can use either `python` (for example, `python my_script.py`) or `ipython` (for example, `ipython my_script.py`)
* If you're using the interpreter interactively (to load and explore data, try out a new package, etc.), always use `ipython` over `python`, because `ipython` has a bunch of features that are just plain great

### IPython Goodies

|Run this                     |Get This          |
|-----------------------------|------------------|
|`import numpy`               | Nothing.  Just run it so the code below works. |
|`numpy.ar` (then tab)        | Tab completion!  |
|`?numpy.arange` (then 'q')   | Easy help!       |
|`x = !ls` (then `print x`)   | Access to shell! |

### Python and IPython Exercise - 10 Minutes:
1. Partner up with someone next to you.  On one of your computers:
  1. Start two terminal windows and cd in each to the directory where you stored the course files
  1. Start python in one terminal and ipython in the other terminal
  1. In each terminal:
    1. Write a small program which prints the numbers 1 through 10
    1. Try the commands in the table above
  1. Write a small program in ipython to print out the names of all ipython notebook files in the course directory.  
     **Hint**: IPython notebook files end with ".ipynb"

## IPython Notebook
* IPython notebook is an interactive front-end to ipython which lets you combine snippets of python code with explanations, images, videos, whatever.  
* It's also really convenient for conveying experimental results.

### IPython Notebook Exercise 1 - 5 Minutes:
1. Start a terminal window and cd to the directory where you stored the course files
1. Start the IPython Notebook server with the command `ipython notebook`.  The IPython notebook server runs your python code behind the scenes and renders the output into the notebook
1. Create a new notebook by clicking New (top right) >> Python 2 Notebook

### Notebook Concepts
* **Cells** -- That grey box is called a cell.  An IPython notebook is nothing but a series of cells.  
* **Selecting** -- You can tell if you have a cell selected because it will have a thin, black box around it.
* **Running a Cell** -- Running a cell displays its output.  You can run a cell by pressing **`shift + enter`** while it's selected (or click the play button toward the top of the screen). 
* **Modes** -- There are two different ways of having a cell selected:
  * **Command Mode** -- Lets you delete a cell (press **`d`** twice) and change its type (more on this in a second).
  * **Edit Mode** -- Lets you change the contents of a cell.
  * **`Esc`** -- Changes from edit mode to command mode.
  * **`Enter`** -- Changes from command mode to edit mode.

### IPython Notebook Exercise 2 - 10 Minutes:
1. Click Help >> User Interface Tour and take the tour
1. Click Help >> Keyboard Shortcuts.  Mice are for suckers.
1. Write and run a small program in the first cell to print out the full path and name of every file in the current directory.  
  **Hint:** the command `pwd` prints the current directory on Linux or Mac.  `echo %cd%` does something similar on Windows.

### More Notebook Concepts
* **Cell Types** -- There are 3 types of cells: python, markdown, and raw.
  * **Python Cells** -- Contain python code. Running a python cell displays its output. Press **`y`** in command mode to convert any selected cell into a python cell. All cells start their lives as python cells.
  * **Markdown Cells** -- Contain formatted text, lists, links, etc. Press **`m`** in command mode to convert the selected cell into a markdown cell.
  * **Raw Cells** -- Useful for a few advanced corner cases.  We won't deal with these at all today.
  
### IPython Notebook Exercise 3 - 10 Minutes:
1. Partner up with someone next to you
  1. Check out this [Markdown Cheat Sheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet).  Markdown is a set of simple commands for formatting text to make it pretty.  It isn't specific to IPython Notebooks; it's used all over (for example, a lot of blogging platforms let you write your content in markdown because it's easy and HTML is a pain in the butt).
  1. Check out this [Stack Overflow post](http://stackoverflow.com/questions/13208286/how-to-write-latex-in-ipython-notebook) about using LaTeX in IPython Notebooks.
  1. Create and render a markdown cell which contains bold text, a nested numbered list, a working link to UCI's website, an image, a table, some rendered LaTeX, and a YouTube video.

## Numpy
Numpy is the main package that you'll use for doing scientific computing in Python.  Numpy provides a multidimensional array datatype called `ndarray` which can do things like vector and matrix computations.

In [None]:
# you don't have to rename numpy to np but it's customary to do so
import numpy as np

# you can create a 1-d array with a list of numbers
a = np.array([1, 4, 6])
print 'a:'
print a
print 'a.shape:', a.shape
print 

# you can create a 2-d array with a list of lists of numbers
b = np.array([[6, 7], [3, 1], [4, 0]])
print 'b:'
print b
print 'b.shape:', b.shape
print

In [None]:
# you can create an array of ones
print 'np.ones((3, 4)):'
print np.ones((3, 4))
print

# you can create an array of zeros
print 'np.zeros((2, 5)):'
print np.zeros((2, 5))
print

# you can create an array from a range of numbers and reshape it
print 'np.arange(6):'
print np.arange(6)
print 
print 'np.arange(6).reshape(2, 3):'
print np.arange(6).reshape(2, 3)
print

# you can take the transpose of a matrix with .transpose or .T
print 'b and b.T:'
print b
print 
print b.T

In [None]:
# you can iterate over rows (enumerate supplies the row index)
for i, this_row in enumerate(b):
    print 'row', i, ': ', this_row
print 
    
# you can access sections of an array with slices
print 'first two rows of the first column of b:'
print b[:2, 0]
print

In [None]:
# you can concatenate arrays in various ways:
print 'np.hstack([b, b]):'
print np.hstack([b, b])
print

print 'np.vstack([b, b]):'
print np.vstack([b, b])
print

In [None]:
# you can perform matrix multiplication with np.dot()
c = np.dot(a, b)
print 'c = np.dot(a, b):'
print c
print

# you can perform element-wise multiplication with * 
d = b * b
print 'd = b * b:'
print d
print
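
The difference between `np.dot` and `*` is easiest to see by looking at shapes. Here is a minimal sketch that rebuilds `a` and `b` from the cells above; it uses the parenthesized form of `print` so that it should run under both Python 2 and Python 3, which is a choice made for this sketch rather than the style of the other cells.

```python
import numpy as np

a = np.array([1, 4, 6])                 # shape (3,)
b = np.array([[6, 7], [3, 1], [4, 0]])  # shape (3, 2)

# np.dot sums over the shared dimension of length 3, leaving shape (2,)
c = np.dot(a, b)
print('np.dot(a, b) = %s, shape %s' % (c, c.shape))

# * multiplies matching entries, so the result keeps the shape of b
d = b * b
print('b * b has shape %s' % (d.shape,))

# the inner dimensions must agree: (3, 2) dot (3,) is not defined
try:
    np.dot(b, a)
except ValueError as err:
    print('np.dot(b, a) failed: %s' % err)
```

Swapping the argument order fails because the last dimension of `b` (2) does not match the length of `a` (3).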

In [None]:
# numpy provides a ton of other functions for working with matrices
m = np.array([[1, 2],[3, 4]])
m_inverse = np.linalg.inv(m)
print 'inverse of [[1, 2],[3, 4]]:'
print m_inverse
print
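
One way to sanity-check an inverse is to multiply it by the original matrix and compare the result with the identity. This is a small sketch, not part of the original exercise; the name `product` is just an illustrative choice.

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])
m_inverse = np.linalg.inv(m)

# a matrix times its inverse should give the identity matrix
product = np.dot(m, m_inverse)
print(product)

# floating-point rounding can leave tiny errors, so compare with a
# tolerance (np.allclose) rather than with exact equality
print(np.allclose(product, np.eye(2)))
```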

In [None]:
# and for doing all kinds of sciency type stuff.  like generating random numbers:
np.random.seed(5678)
n = np.random.randn(3, 4)
print 'a matrix with random entries drawn from a Normal(0, 1) distribution:'
print n
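
The call to `np.random.seed` is what makes the numbers above repeatable. A quick sketch of that behavior follows; the variable names `first`, `second`, and `third` are made up for illustration.

```python
import numpy as np

# seeding with the same value restarts the same random sequence
np.random.seed(5678)
first = np.random.randn(3, 4)

np.random.seed(5678)
second = np.random.randn(3, 4)

# without reseeding, the generator just continues the sequence
third = np.random.randn(3, 4)

print(np.array_equal(first, second))  # same seed, same numbers
print(np.array_equal(first, third))   # no reseed, different numbers
```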

## Pandas
Pandas is a Python package which builds on numpy arrays to add some useful data analysis features.  Most importantly, it provides a DataFrame data type similar to R's data frame.

In [None]:
# like with numpy, you don't have to rename pandas to pd, but it's customary to do so
import pandas as pd

df = pd.DataFrame(data=b,  columns=['Weight', 'Height'])
print 'b:'
print b
print 
print 'DataFrame version of b:'
print df
print
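
Named columns are the main thing a DataFrame gives you over a bare array. The sketch below rebuilds the same `df` and pulls data out by column name; the threshold of 3 and the names `weights` and `heavy` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

b = np.array([[6, 7], [3, 1], [4, 0]])
df = pd.DataFrame(data=b, columns=['Weight', 'Height'])

# select a single column by name (this gives a pandas Series)
weights = df['Weight']
print(weights)

# keep only the rows where a condition on a column holds
heavy = df[df['Weight'] > 3]
print(heavy)

# like a numpy array, a DataFrame has a .shape
print(df.shape)
```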

In [None]:
# Pandas can save and load CSV files.  
# Python can do this too, but with Pandas, you get a DataFrame 
# at the end which understands things like column headings
baseball = pd.read_csv('data/baseball.dat.txt', skipinitialspace=True)

# A DataFrame's .head() method shows its first 5 rows
print baseball.head()
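
To see what `skipinitialspace=True` is doing, here is a sketch that reads a tiny CSV from a string instead of a file. The column names and player rows are invented for illustration and are not the contents of `baseball.dat.txt`.

```python
import io
import pandas as pd

# hypothetical CSV text with a space after each comma
csv_text = u"Name, Height, Weight\nPlayer A, 74, 180\nPlayer B, 72, 215\n"

# without skipinitialspace, the header names keep their leading spaces
plain = pd.read_csv(io.StringIO(csv_text))
print(list(plain.columns))

# with skipinitialspace=True, spaces after the delimiter are dropped
trimmed = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
print(list(trimmed.columns))
```

With the leading spaces removed, you can write `trimmed['Height']` rather than having to remember `plain[' Height']`.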

In [None]:
# Printing a DataFrame shows you a bit more: 
print baseball

## Pandas Joins
The real magic with a Pandas DataFrame comes from the merge function, which matches up rows from two DataFrames on a shared key column (here, `Name`) and combines their columns.  Let's load another file which has shoe sizes for just a few players.

In [None]:
# load shoe size data
shoe_sizes = pd.read_csv('data/baseball2.dat.txt', skipinitialspace=True)

print shoe_sizes

In [None]:
merged = pd.merge(baseball, shoe_sizes, on=['Name'])
print merged

In [None]:
merged_outer = pd.merge(baseball, shoe_sizes, on=['Name'], how='outer')
print merged_outer.head()
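
The difference between the default (inner) merge and `how='outer'` is easier to see on a tiny example. The two DataFrames below are made-up stand-ins, not the course data files, and the column name `ShoeSize` is an assumption for this sketch.

```python
import pandas as pd

# hypothetical data: only Player B and Player C appear in both tables
stats = pd.DataFrame({'Name': ['Player A', 'Player B', 'Player C'],
                      'Height': [74, 72, 70]})
shoes = pd.DataFrame({'Name': ['Player B', 'Player C', 'Player D'],
                      'ShoeSize': [11.0, 9.5, 10.0]})

# default inner merge: keeps only names found in both DataFrames
inner = pd.merge(stats, shoes, on=['Name'])
print(inner)

# outer merge: keeps every name and fills the missing values with NaN
outer = pd.merge(stats, shoes, on=['Name'], how='outer')
print(outer)
```

The inner merge drops Player A and Player D, while the outer merge keeps them with `NaN` in the column that has no data for them.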