# Using Python and Jupyter to analyze data

This repo teaches you how to use Jupyter notebooks as an analysis and visualization framework.


## Getting started

There are a number of ways to [install Jupyter](http://jupyter.readthedocs.io/en/latest/install.html). The officially recommended route is [Anaconda](https://www.continuum.io/downloads), but I've never used it and find it easier to install with [pip](https://pypi.python.org/pypi/pip) (a package manager for Python; think of it as Python's `npm`), preferably in a virtual environment. I like to use [Virtualenvwrapper](https://virtualenvwrapper.readthedocs.io/en/latest/) to manage those:


```bash
$ mkvirtualenv jupyter-skillshare
(jupyter-skillshare) $
```

The parentheses indicate that you're inside your virtualenv. To exit the environment, type `deactivate`. To re-enter it, type `workon jupyter-skillshare` (or whatever you named your environment).
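
If you're ever unsure whether Python is actually running from the virtualenv, a quick check from inside Python can help. This is a minimal sketch; `in_virtualenv` is a helper name made up for this example, not something Virtualenvwrapper provides.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer virtualenv
    # releases leave sys.base_prefix pointing at the system-wide Python.
    base_prefix = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return base_prefix != sys.prefix


print("Inside a virtualenv:", in_virtualenv())
print("Interpreter prefix:", sys.prefix)
```

If this prints `False` while your prompt shows `(jupyter-skillshare)`, you are probably running a different Python than the one the environment installed.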


When you have your virtual env ready, clone the repo:

```bash
(jupyter-skillshare) $ git clone git@github.com:WPMedia/gfx-jupyter-skillshare.git jupyter-skillshare
```

Then move into the repo and use pip to install the requirements listed in `requirements.txt` (this may take a while):

```bash
(jupyter-skillshare) $ cd jupyter-skillshare
(jupyter-skillshare) $ pip install -r requirements.txt
```
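
To confirm the install worked, you can ask Python which packages it can find. The sketch below is only an illustration: `missing_packages` is a made-up helper, and the package list is an assumption, so swap in whatever `requirements.txt` actually names.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical list: replace with the top-level packages from requirements.txt.
expected = ["notebook", "pandas"]
missing = missing_packages(expected)
print("Missing packages:", missing or "none")
```

An empty result means every listed package is importable from the current environment.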

To boot up a notebook, type `jupyter notebook`:

```bash
(jupyter-skillshare) $ jupyter notebook
```

This used to open a browser window automatically, but there appears to be a bug with the latest version of macOS. If nothing opens, go to [localhost:8888](http://localhost:8888) to reach your notebook.

## Or, a quicker method

Ben Welsh has a notebook server set up for his ["First Python Notebook"](http://www.firstpythonnotebook.org/) MOOC through the California Civic Data Coalition, and has kindly offered it for this skillshare. If you don't feel like going through the setup and install process, you can start up a server at http://notebooks.californiacivicdata.org/ (you'll need to sign in and authorize GitHub to use the server).

If you're interested in Python/Jupyter, you should take a look at that course. It's much more in-depth and gives you a much better idea of what's possible with Jupyter notebooks. 

After running `jupyter notebook` and going to localhost:8888, you should see a page that looks similar to this:

![Jupyter notebook homepage](img/jupyter_homepage.png)

Click the `New` button to create a new Python notebook. You should see something like this, a mostly empty screen with one cell for text:

![Empty notebook](img/empty_notebook.png)

## Hello Python

You can execute any Python code in notebooks. The rendered output will appear below your current cell. If you need a starter on Python, [there's a notebook for that](https://notebooks.azure.com/Microsoft/libraries/samples/html/Introduction%20to%20Python.ipynb).

In [3]:
print("hello" + " world!")

hello world!


In [4]:
2 + 2


4

In [10]:
for i in range(0, 5):
    print(i)
    

0
1
2
3
4


Just as in the interpreter or a script, undefined variables and syntax errors will raise errors.

In [8]:
2 + undefined_variable

NameError: name 'undefined_variable' is not defined

In [None]:
for i in range(0,10)  # the missing colon makes this a SyntaxError
    print(i)
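
The two kinds of error behave slightly differently, which the sketch below tries to show. A `NameError` is raised while the code runs, so `try`/`except` can catch it. A `SyntaxError` is raised before anything runs, so this example uses the built-in `compile` on a string to trigger it safely. The variable names are made up for illustration.

```python
# A NameError is raised when the line executes, so try/except can catch it.
try:
    2 + undefined_variable
except NameError as err:
    name_error_message = str(err)
    print("NameError:", name_error_message)

# A SyntaxError is raised while Python parses the code, before any of it runs.
# Compiling the broken cell from a string lets us catch it here.
broken_cell = "for i in range(0, 10)\n    print(i)\n"
try:
    compile(broken_cell, "<cell>", "exec")
    syntax_error_caught = False
except SyntaxError:
    syntax_error_caught = True

print("SyntaxError caught:", syntax_error_caught)
```

In a notebook, either error stops only the current cell; variables defined by earlier cells stay available.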

## Hello Jupyter

You may notice that our code is broken up into little snippets, or "cells." Think of each cell as a single statement or a small group of related statements. Because cells are reorganizable (you can cut, paste, and shift them up and down), it's good for legibility and your sanity to keep them short.

You might also notice that we're mixing "code" cells with Markdown cells. This lets you keep detailed notes and documentation on your work as you go, in more depth than the inline comments you might normally use in your code.

Python not your thing? You can install other language kernels to use with the notebook app, including [IRKernel](https://irkernel.github.io/) for R, or [IJavaScript](https://github.com/n-riesco/ijavascript) for JavaScript. You can see the [full list of kernels](https://github.com/jupyter/jupyter/wiki/Jupyter-kernels) available. 

And here are some instructions for [installing and running](https://www.datacamp.com/community/blog/jupyter-notebook-r#gs.6ggEhYw) an R kernel. 

### Some handy shortcuts

`enter` - Enter cell edit mode

`escape` - Enter command mode (the single-letter shortcuts below only work in command mode)

`shift-enter` - Run the cell

`a` - Create cell above

`b` - Create cell below

`j` - Move to cell below

`k` - Move to cell above

`m` - Convert cell to Markdown

`y` - Convert cell to code 

`s` - Save notebook

## Hello Data 

Let's read in some data. We'll use data from MLB's [Statcast](https://baseballsavant.mlb.com/statcast_search). `data/nationals_pitch_data.csv` contains every pitch thrown by Washington Nationals pitchers this season. 

We'll read in the CSV as a [Pandas](http://pandas.pydata.org/pandas-docs/stable/) data frame. If you're familiar with data frames in R, you'll find that these work very similarly.

First, we need to import Pandas. We'll use a common alias so it's not so clunky to type. Usually this would happen at the top of a notebook, but we're doing it here to illustrate the process.

In [12]:
import pandas as pd

Note that a little asterisk briefly appears in the brackets next to the cell while the code is running, and then (hopefully) is replaced by a number once it finishes.

We can read in the data using `read_csv`. Note that we add `na_values="null"` because that is how missing values are written in the CSV. Older versions of Pandas would otherwise keep those cells as the string "null"; newer versions already treat "null" as missing by default, so there the argument just makes the intent explicit.

In [34]:
df = pd.read_csv('data/nationals_pitch_data.csv', na_values="null")
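
To see what `na_values` does, here is a small sketch that reads the same CSV text twice. It uses a made-up marker, `--`, and made-up rows rather than the real file, because whether a spelling like `null` is recognised automatically depends on your Pandas version.

```python
import io

import pandas as pd

# Made-up sample: "--" stands in for a missing release speed.
csv_text = "pitch_type,release_speed\nFF,93.1\nSL,--\nSI,91.1\n"

# Without na_values, "--" is kept as text, so the column cannot be numeric.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, "--" becomes NaN and the column is parsed as floats.
clean = pd.read_csv(io.StringIO(csv_text), na_values="--")

print(raw["release_speed"].tolist())
print(clean["release_speed"].dtype)
print(clean["release_speed"].isna().sum())
```

The practical difference shows up later: numeric summaries such as `describe` only give means and quartiles when the column was parsed as numbers.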

### Summarizing data

There are a number of data summary functions we can use. First, check out the data using `head`.

In [35]:
df.head()

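| | pitch_type | game_date | release_speed | release_pos_x | release_pos_z | player_name | batter | pitcher | events | description | ... | release_pos_y | estimated_ba_using_speedangle | estimated_woba_using_speedangle | woba_value | woba_denom | babip_value | iso_value | launch_speed_angle | at_bat_number | pitch_number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SL | 2017-06-25 | 79.8 | 2.6252 | 5.2493 | Oliver Perez | 458015 | 424144 | strikeout | swinging_strike | ... | 54.8085 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | NaN | 72 | 6 |
| 1 | SI | 2017-06-25 | 93.1 | 2.7662 | 5.3738 | Oliver Perez | 458015 | 424144 | NaN | ball | ... | 53.9725 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | 72 | 5 |
| 2 | SL | 2017-06-25 | 79.5 | 2.6793 | 5.1894 | Oliver Perez | 458015 | 424144 | NaN | swinging_strike | ... | 54.8107 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | 72 | 4 |
| 3 | FF | 2017-06-25 | 90.9 | 2.6775 | 5.4827 | Oliver Perez | 458015 | 424144 | NaN | ball | ... | 54.4177 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | 72 | 3 |
| 4 | SI | 2017-06-25 | 91.1 | 2.4475 | 5.3219 | Oliver Perez | 458015 | 424144 | NaN | called_strike | ... | 54.2785 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | 72 | 2 |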


You can also see how many rows and columns are in the data with `shape`. The data covers more than 11,000 pitches across 78 columns.

In [36]:
df.shape

(11320, 78)
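
Both tools take a little more control than shown above. The sketch below uses a tiny stand-in data frame (a few column names borrowed from the pitch data, with the last row made up) so it runs without the CSV.

```python
import pandas as pd

# Hypothetical stand-in for the pitch data; the final row is invented.
sample = pd.DataFrame({
    "pitch_type": ["SL", "SI", "SL", "FF", "SI", "FF"],
    "release_speed": [79.8, 93.1, 79.5, 90.9, 91.1, 94.0],
})

# head() shows five rows by default; pass a number to see more or fewer.
print(sample.head(3))

# shape is an attribute (no parentheses) holding a (rows, columns) tuple,
# so it can be unpacked into two variables.
n_rows, n_cols = sample.shape
print(f"{n_rows} rows, {n_cols} columns")
```

`tail` works the same way as `head` but starts from the end of the data frame.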

Another important attribute is `dtypes`, which lists the data type of each column. "Object" is basically a string.

In [37]:
df.dtypes

pitch_type                          object
game_date                           object
release_speed                      float64
release_pos_x                      float64
release_pos_z                      float64
player_name                         object
batter                               int64
pitcher                              int64
events                              object
description                         object
spin_dir                           float64
spin_rate_deprecated               float64
break_angle_deprecated             float64
break_length_deprecated            float64
zone                               float64
des                                 object
game_type                           object
stand                               object
p_throws                            object
home_team                           object
away_team                           object
type                                object
hit_location                       float64
bb_type    
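
One consequence of the listing above is that `game_date` is stored as text rather than as a date. If you need date arithmetic, one option is `pd.to_datetime`. This is only a sketch on a made-up two-row sample, not a step the rest of this walkthrough relies on.

```python
import pandas as pd

# Hypothetical two-row sample with column names taken from the pitch data.
sample = pd.DataFrame({
    "game_date": ["2017-06-25", "2017-06-24"],
    "release_speed": [79.8, 93.1],
})

# Dates read from a CSV arrive as plain text unless they are parsed.
print(sample.dtypes)

# pd.to_datetime converts the text into real timestamps.
sample["game_date"] = pd.to_datetime(sample["game_date"])
print(sample.dtypes)

# Timestamp columns gain date helpers under the .dt accessor.
weekdays = sample["game_date"].dt.day_name().tolist()
print(weekdays)
```

`read_csv` also accepts a `parse_dates` argument if you would rather convert the column while loading the file.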

You can also use `describe` to get an overview of data in a column. I find this easiest to read when summarising one or two columns at a time. 

In [38]:
df['release_speed'].describe()

count    11091.000000
mean        89.453485
std          5.930706
min         70.300000
25%         85.400000
50%         90.700000
75%         94.000000
max        101.500000
Name: release_speed, dtype: float64
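
Summarising two columns at once, and pulling a single number out of the result, might look like the sketch below. It runs on a small stand-in data frame whose values are for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the pitch data.
sample = pd.DataFrame({
    "release_speed": [79.8, 93.1, 79.5, 90.9, 91.1],
    "release_pos_x": [2.6252, 2.7662, 2.6793, 2.6775, 2.4475],
    "pitch_type": ["SL", "SI", "SL", "FF", "SI"],
})

# Double brackets select several columns; describe() then returns
# one column of statistics for each of them.
summary = sample[["release_speed", "release_pos_x"]].describe()
print(summary)

# Individual statistics can be looked up by row label and column name.
fastest = summary.loc["max", "release_speed"]
print(fastest)

# On a text column, describe() reports count, unique, top and freq instead.
text_summary = sample["pitch_type"].describe()
print(text_summary)
```

Because the result of `describe` is itself a data frame (or a series, for a single column), you can keep working with it like any other Pandas object.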

### Grouping 

## Hello Charts

## Hello Maps

## Useful links

- [California crop production wages analysis](https://github.com/datadesk/california-crop-production-wages-analysis): a good example of how to integrate notebooks into the reporting and analysis process.
- [Sample notebooks from Microsoft Azure](https://notebooks.azure.com/): includes overviews of using Python, R, etc.
- [28 Jupyter Notebook tips, tricks and shortcuts](https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/) 
- [10 Python notebook tutorials for data science and machine learning](http://www.kdnuggets.com/2016/04/top-10-ipython-nb-tutorials.html)