# An introduction to Python for Data Science

In [None]:
from IPython.display import Image
Image("http://imgs.xkcd.com/comics/python.png")

## Python Language
* General Purpose Programming Language created in 1991
* Batteries included
   * xml/json/csv parsing
   * email sending
   * operating system/network programming
   * [And more](https://docs.python.org/2/tutorial/stdlib.html)
* Often used as a "Glue" Language, to tie different processes together.
* Popular in Finance, Scientific Computing

* Growing Popularity in Data Science/Machine Learning

## Why Python for Data Science
* General Purpose 
* Rich and Growing Data Analysis Support
* Most Numeric Operations use same linear algebra routines as R/Octave
* Easy to Prototype Code, and then scale for "Production"
* Can call out to R code! Check out the [rpy2](https://pypi.python.org/pypi/rpy2) module
* Huge community in the open source world
* Commercial Support available
    * Enthought
    * Continuum

## Why Not Python?!
* Python 2 or Python 3?
 * A split in the language a few years ago changed some internal language assumptions
 * I still prefer Python 2...
 * Most data science libraries target Python 2, with some supporting Python 3
 * Figure it out by 2020... :-)
* Do you already know R?  Are you productive with R? Are you on a deadline?
* Much richer Statistical Modelling community in R.
* Greater Breadth of Sciences/Academia writing R packages.
    * [CRAN](http://cran.r-project.org/) is a great resource.  For Python, check out [PyPI](http://pypi.python.org).


## Getting Started
Caveat! This will not teach you much of the Python language; instead it will show you around some of the tools available for Data Science.

The first thing to know is the <b>Zen Of Python</b>

In [None]:
import this

In [None]:
# Each "Cell" is a runnable piece of code.
# Press <Shift>-<Enter> To execute this cell

print "I'm programming in Python"

# Variable Assignment
a = 1 + 1
b = 1.0 + 2.0
c = a + b
print c
# Lists
# Can be of mixed types
l = [a, b, "a", "b"]

# Dictionaries
m = {
"Teacher": "Ernst",
"TA": "Matt",
"Class Size": 42,
42: "Class Size",
}
print m["TA"] 
# Modify a Dictionary Element
m["TA"] = "MGD"
print m["TA"]

In [None]:
# Variables in one cell are available for use in another cell
print "a,b", a, b

print "len(l)", len(l) # How long is that list ?

print "l[0]", l[0]   # Print the FIRST item in the list
print "l[1]", l[1]   # Print the Second item in the list
print "l[-1]", l[-1]  # Print the last item in a list

# Range Syntax
print "l[0:2]", l[0:2] # Print the first and second items in a list (index 2 is excluded)
print "l", l
print "l[::-1]", l[::-1]  # What does this do ?  

print "For Loops over lists, and strange math"
for i in l:
    print "{0} * {1} = {2}".format(i, 10, i * 10)
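
Slices are half-open: `l[start:stop]` includes `start` but stops just before `stop`. Here is a small sketch on a stand-in list (the name `letters` and its values are only for illustration), written so it runs on both Python 2 and 3:

```python
from __future__ import print_function  # lets the same code run on Python 2 and 3

letters = ["a", "b", "c", "d", "e"]  # stand-in list, just for illustration

first_two = letters[0:2]    # indices 0 and 1; index 2 is excluded
last_two = letters[-2:]     # negative indices count from the end
every_other = letters[::2]  # a third value sets the step

print(first_two, last_two, every_other)
```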

In [None]:
print "Let's look at dictionaries"
print "len(m)", len(m) # How many entries in that dictionary

print "m", m  # Printing a dictionary is a bit messy
print    # Inserts a blank line

# loop over all items in the dictionary in an arbitrary order.  
for key, value in m.iteritems():
    print "{0} = {1} ?!?!".format(key, value)


print "Can we go through the dictionary in order of the keys ?"
for key, value in sorted(m.iteritems()):
    print "{0} = {1} ?!?!".format(key, value)
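
A note on portability: `iteritems()` exists only on Python 2 dictionaries, while `items()` works on both Python 2 and 3. Sorting the mixed string and integer keys used above also only works on Python 2, because Python 3 refuses to compare an `int` with a `str`. One possible workaround, sketched here on a fresh copy of the dictionary, is to sort on the string form of each key:

```python
from __future__ import print_function

m = {"Teacher": "Ernst", "TA": "MGD", "Class Size": 42, 42: "Class Size"}

# items() is available on both Python 2 and Python 3
for key, value in m.items():
    print("{0} = {1}".format(key, value))

# Sort on str(key) so integer and string keys can be ordered together
ordered_keys = [key for key, _ in sorted(m.items(), key=lambda kv: str(kv[0]))]
print(ordered_keys)
```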
    

In [None]:
# Python Functions!
def some_function(an_argument, another_argument):
    # Loop 5 times
    for i in range(5):
        if i == 1:
            print i, an_argument
        elif i == 2:
            print i, another_argument
        else:
            print "Just Pleasantly Waiting"
    return "All Done"

print some_function("Hello", "Goodbye")

In [None]:
a = set([1,2,3,4,1,2,3,4])
print "a", a
b = set([3,5,4,3,4,5])
print "b", b

# Set difference, we may use this tonight
print "a-b", a - b
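
Difference is only one of the set operators. As a quick sketch, here are the others on the same two sets, written to run on both Python 2 and 3:

```python
from __future__ import print_function

a = set([1, 2, 3, 4, 1, 2, 3, 4])  # duplicates collapse, leaving 1, 2, 3, 4
b = set([3, 5, 4, 3, 4, 5])        # leaves 3, 4, 5

print("a | b", a | b)  # union: in either set
print("a & b", a & b)  # intersection: in both sets
print("a ^ b", a ^ b)  # symmetric difference: in exactly one set
print("b - a", b - a)  # difference is not symmetric: compare with a - b
```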

#### Is your head ready to explode yet?
* Another good online [Introduction](https://docs.python.org/2/tutorial/introduction.html)
* Other language features to learn as you wish
    * [Iterators](https://docs.python.org/2/whatsnew/2.2.html#pep-234-iterators)
    * [List Comprehensions](https://docs.python.org/2/whatsnew/2.0.html#list-comprehensions)
    * [Generators](https://docs.python.org/2/glossary.html#term-generator)
    * [Exceptions](https://docs.python.org/2/tutorial/errors.html)
    * [classes](https://docs.python.org/2/tutorial/classes.html)
    * [inheritance](https://docs.python.org/2/whatsnew/2.2.html#multiple-inheritance-the-diamond-rule)
    * [Abstract base classes](https://docs.python.org/2/whatsnew/2.6.html#pep-3119-abstract-base-classes)
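
To give a taste of a few of these features, here is a minimal sketch of a list comprehension, a generator, and exception handling. The function names `countdown` and `safe_divide` are made up for this example:

```python
from __future__ import print_function

# List comprehension: build a list in one expression
squares = [n * n for n in range(5)]

# Generator: produces values lazily, one at a time
def countdown(start):
    while start > 0:
        yield start
        start -= 1

# Exception handling: recover from an error instead of crashing
def safe_divide(numerator, denominator):
    try:
        return numerator / float(denominator)
    except ZeroDivisionError:
        return None

print(squares, list(countdown(3)), safe_divide(1, 0))
```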


## Exploring the Data Science Ecosystem

### The Data Science Ecosystem in Python
* array operations, statistics, linear algebra: [numpy](http://www.numpy.org/), [scipy](http://www.scipy.org)
* DataFrame: [pandas](http://pandas.pydata.org/)
* Machine Learning: [scikit-learn](http://scikit-learn.org/stable/)
* Plotting: [matplotlib](http://matplotlib.org/), also [Bokeh](http://bokeh.pydata.org/en/latest/), and [seaborn](http://stanford.edu/~mwaskom/software/seaborn/)
* Statistical models: [statsmodels](http://statsmodels.sourceforge.net/)
* Graphs/networks: [networkx](https://networkx.github.io/)
* Workbench: [ipython](http://ipython.org/)

### Plus the rest of the Python ecosystem
* [Django](https://www.djangoproject.com/)/[Flask](http://flask.pocoo.org/): Web sites
* [Jinja](http://jinja.pocoo.org/): Text formatting/templating
* [nose](https://nose.readthedocs.org/en/latest/): software testing
* [astroml](http://www.astroml.org/): All about Astronomy 
  * Cool Binning Algorithm [Bayesian Blocks](http://www.astroml.org/examples/algorithms/plot_bayesian_blocks.html)
* [SQLAlchemy](http://www.sqlalchemy.org/): Database connections
* [Impyla](https://github.com/cloudera/impyla): Connecting to Hadoop/Impala

See more at the [PyPI (Python Package Index)](https://pypi.python.org/pypi)!



## IPython Notebook
* Interactive, Web Based Development Environment for Python
* Being Used for reproducible science
    * Being used for creating/writing academic papers 
    * Does everyone remember the funny videos Ernst played about Data Quality and Data Dictionaries!?
* Work is underway to expand the notebook for other programming languages
 * Julia
 * Haskell
 * bash
 * R
 * More!
* Some Great Example Notebooks
 * [Collection of notebook examples](https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks)
 * [Travelling Salesman Problem in Python](http://nbviewer.ipython.org/url/norvig.com/ipython/TSPv3.ipynb)

In [None]:
# This hack will install a nifty plotting library
# that helps make pretty plots.

# From within the notebook, you can run shell commands on your computer!
!pip install seaborn astroML

In [None]:
# If the above didn't work, try running pip through the current Python interpreter
import subprocess
import sys
for package in ["seaborn", "astroML"]:
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

In [None]:
# Loading the python libraries we will need for today!

import pandas as pd  # Can now refer to pandas as "pd".  Convenience.
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
# Set up our session for interactive plotting.  (There's a config file knob somewhere I always forget to set)
%matplotlib inline

### Want to get help while using ipython ?
Type the name of the function, and then a question mark.  Interactive help pops up!

### Want to see what methods an object supports ?
Type the object's name followed by "." and press "<tab>"
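
The `?` and tab shortcuts only work inside IPython. In plain Python, the built-in `help()` and `dir()` functions give similar information. A small sketch using the built-in `dict` type (the helper name `public_methods` is made up for this example):

```python
from __future__ import print_function

def public_methods(obj):
    """Return the attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]

print(public_methods(dict))  # roughly what tab completion would list
# help(dict.get)             # uncomment to print the documentation for dict.get
```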


In [None]:
# Read the documentation on read_csv
pd.read_csv?

In [None]:
# Read the documentation on the dataframe
pd.DataFrame?

In [None]:
# See what methods a DataFrame supports by placing your cursor right after "pd.DataFrame." below and pressing <Tab>
pd.DataFrame.abs?

### Let's load some data!
We'll be using the [ILPD](https://archive.ics.uci.edu/ml/datasets/ILPD+%28Indian+Liver+Patient+Dataset%29) Dataset from homework 2

#### Dataset description
1. Age: Age of the patient
2. Gender: Gender of the patient
3. TB: Total Bilirubin
4. DB: Direct Bilirubin
5. Alkphos: Alkaline Phosphatase
6. Sgpt: Alanine Aminotransferase
7. Sgot: Aspartate Aminotransferase
8. TP: Total Proteins
9. ALB: Albumin
10. A/G Ratio: Albumin and Globulin Ratio
11. Selector: field used to split the data into two sets (labeled by the experts)

In [None]:
names = ["Age", "Gender", "TB", "DB", "Alkphos", "Sgpt", "Sgot", "TP", "ALB", "A/G", "Selector"]

In [None]:
# Read Data into a DataFrame
# Support for reading a local file, website, etc
#frame = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00225/Indian%20Liver%20Patient%20Dataset%20(ILPD).csv")
frame = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00225/Indian%20Liver%20Patient%20Dataset%20(ILPD).csv", header=None, names=names)
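
The file has no header row, which is why the column names are passed in separately. As a sketch of how `header=None` and `names` work together, here is the same call on a tiny in-memory file. The two rows and the shortened `toy_names` list are made up for illustration and are NOT real ILPD records:

```python
from __future__ import print_function
import io
import pandas as pd

# Two made-up rows in a headerless layout; NOT real ILPD records
raw = u"30,Female,1.0,0.2\n40,Male,2.0,0.5\n"
toy_names = ["Age", "Gender", "TB", "DB"]

# header=None: the first line is data, not column names
toy = pd.read_csv(io.StringIO(raw), header=None, names=toy_names)
print(toy)
```

Without `header=None`, pandas would treat the first data row as the header and silently drop one record.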


In [None]:
#TODO: What other pandas read methods are there? 
# pd.read<TAB>
# How do they work ?
pd.read_gbq?

In [None]:
# Shape of the dataset, (Row, Columns)
frame.shape

In [None]:
# Determine the types of variables... object == str.
frame.dtypes

In [None]:
frame.tail(10) 

In [None]:
frame.describe?

In [None]:
# Summary Statistics of numerical variables...
frame.describe()

In [None]:
## Accessing only one column:
print frame["Age"].describe()

# or accessing two columns
print frame[["Age", "TP" ]].head()

In [None]:
# Gender - it is an object, so let's count occurrences of each unique value
frame["Gender"].value_counts()

In [None]:
type(frame["Gender"].value_counts())

## Data Cleansing

If you remember from homework 2, we dealt with some data cleansing tasks.
Dropping Nulls is easy

In [None]:
frame.isnull().sum()

In [None]:
frame.dropna().describe()

In [None]:
# To modify the dataframe in place, add inplace=True
frame.dropna(inplace=True)
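
To see what these calls do without touching the real data, here is a minimal sketch on a made-up three-row frame (the values in `toy` are invented for illustration):

```python
from __future__ import print_function
import numpy as np
import pandas as pd

# Made-up values; np.nan marks a missing measurement
toy = pd.DataFrame({"Age": [30, 40, 50], "A/G": [0.9, np.nan, 1.1]})

missing_per_column = toy.isnull().sum()  # count of nulls in each column
cleaned = toy.dropna()                   # returns a new frame; toy is unchanged

print(missing_per_column)
print(len(toy), len(cleaned))
```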

## Visualizing the Data

In [None]:
frame["Gender"].value_counts().plot?

In [None]:
frame["Gender"].value_counts().plot

In [None]:
# Histogram of Gender 
frame["Gender"].value_counts().plot(kind="bar")

In [None]:
# Histograms of all of the variables!
# frame.hist() returns an array of all of the plots.  We use the "_" to ignore the return value
_ = frame.hist( figsize=(15,15))

In [None]:
# TODO: Better bin sizes for all plots?



In [None]:
# TODO:  Plot a histogram of Two Columns
# TODO:  Experiment with different bins, what looks good ?

In [None]:
# Let's look at Ages again.  The variable is an integer...
frame["Age"].hist(bins=20, figsize=(12,5))

In [None]:
# Another way to plot a histogram, which works for integers
ages = frame["Age"].value_counts()
ages.plot(kind="bar", figsize=(12,5))

In [None]:
# A slightly better histogram: sort by age so the bars appear in order
ages = frame["Age"].value_counts().sort_index()
ages.plot(kind="bar", figsize=(12,5))

### One Last Fun Bin Technique for Histograms: Bayesian Blocks Binning
Use a dynamic programming algorithm to determine the bins that best describe the data set.  The bins do not have uniform width, but instead adapt to the distribution of the data.

* [Astronomical Paper](http://adsabs.harvard.edu/abs/2012arXiv1207.5578S)
* [Easier Introduction to the above](http://www.astroml.org/examples/algorithms/plot_bayesian_blocks.html)

In [None]:
from astroML.plotting import hist

In [None]:
f, ax = plt.subplots(figsize=(10, 5))
counts, edges, plot_objects = hist(frame["A/G"], bins="blocks", ax=ax)
_ = ax.set_xticks(edges)
_ = ax.set_xticklabels(edges, rotation=45)

In [None]:
print edges

In [None]:
for column in frame.columns:
    if column == "Gender" or column == "Selector":
        continue
    f, ax = plt.subplots(figsize=(10, 5))
    # Plot a regular histogram
    hist(frame[column], bins=20, histtype='stepfilled', ax=ax,  alpha=0.4,  color="blue", label='standard histogram')
    # Overlay the Bayesian Blocks outline
    counts, edges, plot_objects = hist(frame[column], bins="blocks",  color="black", ax=ax, linewidth=2, histtype='step', label="Bayesian blocks")
    # Toss on a legend, title, and have reasonable xlabels.
    ax.legend()
    ax.set_title(column, size=16)
    _ = ax.set_xticks(edges)
    _ = ax.set_xticklabels(edges, rotation=45)

### Outlier Removal

SGOT: Remove everything more than 2 standard deviations above the mean

In [None]:
frame["Sgot"].hist(bins=30)

In [None]:
sgot = frame["Sgot"].describe()
print sgot

In [None]:
# Upper cutoff: the mean plus two standard deviations
high_limit = sgot["mean"] + 2 * sgot["std"]

print "Sgot high limit is ", high_limit
boolean_series = frame["Sgot"] < high_limit
print boolean_series

In [None]:
frame = frame[boolean_series]
frame["Sgot"].describe()
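
The same cutoff-and-mask idea can be sketched with plain numpy on a handful of made-up numbers (the `values` array is invented for illustration):

```python
from __future__ import print_function
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 10, 200], dtype=float)  # made-up numbers

# ddof=1 gives the sample standard deviation, matching pandas' describe()
high_limit = values.mean() + 2 * values.std(ddof=1)

keep = values < high_limit  # boolean mask, one True/False per value
kept = values[keep]         # only the entries where the mask is True

print(high_limit, kept)
```

Note that numpy's `std()` defaults to `ddof=0`, so the cutoff would differ slightly from the pandas one without that argument.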

In [None]:
#TODO: Do the same for Sgpt


## Data Exploration, Visualization

### Scatter plots of all pairwise relationships

In [None]:
frame.describe()

In [None]:
sns.pairplot?

In [None]:
sns.pairplot(frame, hue="Selector", size=2.5)

# Could also plot with hue="Gender", or hue="Age", or any discrete features

In [None]:
# TODO: What variables look the most interesting?  Can you produce a pairwise plot with just those columns?
sns.pairplot(frame[["TB", "DB", "Age"]], hue="Age")

## Back To DataFrames
### Selecting Rows:

In [None]:
dudes = frame[frame["Gender"] == "Male"]

In [None]:
## Selecting Age Ranges
whipper_snappers = frame[(frame["Age"] < 30) & (frame["Age"] > 20)]
whipper_snappers.head()

In [None]:
whipper_snappers.describe()

In [None]:
#TODO:
# How to select all Females > 50 ?

### Randomly selecting rows
How do we create a training set and a test set?

In [None]:
import numpy.random 


In [None]:
num_rows = frame.shape[0]
training_indices = numpy.random.choice(frame.index, int(num_rows * .6), replace=False)  # the sample size must be an integer
test_indices = set(frame.index) - set(training_indices)

In [None]:
print "Training set is ", len(training_indices), "items, and is type", type(training_indices)
print "Test set is     ", len(test_indices), "items, and is type", type(test_indices)

In [None]:
training = frame.loc[training_indices]  # .loc selects rows by index label
training.head()

In [None]:
testing = frame.loc[sorted(test_indices)]  # .loc selects rows by index label; sorted() turns the set into a list
testing.head()

<b>Q: This is Python, there is a library that does this somewhere - shouldn't I be using that ?!</b>

<i>A: YES! (but you should also know how it works)</i>
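
As for how it works: one common way to do this kind of split is to shuffle the index labels and slice them. The helper below is a made-up sketch (`split_indices` is not a library function) and is not necessarily how the library implements it:

```python
from __future__ import print_function
import numpy as np

def split_indices(index, train_fraction, seed=0):
    """Shuffle the index labels and slice them into training and test parts."""
    rng = np.random.RandomState(seed)          # seeded so the split is repeatable
    shuffled = rng.permutation(np.asarray(list(index)))
    cut = int(len(shuffled) * train_fraction)  # slice positions must be integers
    return shuffled[:cut], shuffled[cut:]

train_idx, test_idx = split_indices(range(10), 0.6)
print(len(train_idx), len(test_idx))
```

Because every label lands in exactly one of the two slices, the training and test sets cannot overlap.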

In [None]:
# In scikit-learn 0.18 and later, this function lives in sklearn.model_selection
from sklearn.cross_validation import train_test_split

In [None]:
train_test_split??

In [None]:
training_data, testing_data = train_test_split(frame, test_size=.4)

In [None]:
training_data.shape

In [None]:
testing_data.shape

## Double Bonus, more plotting
### Correlation between variables

In [None]:
sns.corrplot??

In [None]:
f, ax = plt.subplots(figsize=(10, 10))
_ = sns.corrplot(frame, annot=True,  diag_names=False, ax=ax)
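
The numbers behind a correlation plot are Pearson correlation coefficients, which range from -1 to 1. Here is a small sketch with numpy on made-up columns (`x`, `y`, and `z` are invented for illustration):

```python
from __future__ import print_function
import numpy as np

# Made-up columns: y rises with x, z falls as x rises
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1
z = -x

# Each row is one variable; the result is a symmetric matrix of Pearson coefficients
corr = np.corrcoef([x, y, z])
print(corr.round(2))
```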

#### TODO: What does the above correlation graph tell you ?

### Closer look at a single pair of variables

In [None]:
sns.jointplot?

In [None]:
sns.jointplot(frame["TB"], frame["DB"], kind="reg")

### Multiple Linear Regressions

In [None]:
g = sns.lmplot("ALB", "A/G", hue="Selector", data=frame,
                size=6)

### Want More ? Look Here!
* [Python.org](http://www.python.org) Language Homepage. Where else?
* [pydata.org](http://www.pydata.org) One home of a large group of data scientists
 * [Seattle PyData Conference](http://seattle.pydata.org/) (coming this summer! July 24th - 26th)
* [David Beazley's Homepage](http://www.dabeaz.com/) The best teacher I've found for beginning to advanced Python.
    * Book Recommendation <u>Python Essential Reference (4th Edition)</u> (aka The Castle Book)
* [Pythonic Perambulations](https://jakevdp.github.io/)  Lengthy posts describing using Python for Science.