#  Data Transformation & Visualization in Python: numpy, pandas, matplotlib, and seaborn
## Topics

- Array data types with **numpy**
- More advanced data manipulations with **pandas**
- Python plotting basics with **matplotlib**
- **Seaborn** for easy statistical plots

## Introduction

Today we're covering libraries that are an important part of the core Python data stack: numpy, pandas, matplotlib, and seaborn.

## NumPy Basics
Numerical Python is a powerful library of functions, methods, and data types we can use to analyze our data.

The main object that numpy provides is the **ndarray**.  They differ in a few fundamental ways from regular Python lists:

1. **Arrays cannot be of mixed types.** They can be all integers, floats, strings, logical (or boolean) values, or other immutable values. But they cannot be some characters, some numbers, or any other olio of data types. They also cannot contain mutable types such as lists. So, we can have a list of lists, but not an array of lists. We can, however, have an array of arrays (sort of). Which brings us to:<br><br>
2. Arrays can be multidimensional, but they must be rectangular. You can have a list of lists, where the first interior list is 3 elements long, the second 5, and the third 12, but for your multidimensional arrays, every row must have the same number of columns.<br><br>
3. We can perform vector operations on them, which can be algebraic functions (like a dot product), or simple replacements of values in a slice of the array.<br><br>
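
The three points above can be sketched in a few lines. This is a minimal illustration with made-up values (the variable names are not from the lesson): mixed inputs are coerced to one common type, rectangular nested lists become a 2-d array, and arithmetic applies to every element.

```python
import numpy as np

# Mixed inputs are coerced to one common dtype rather than kept as-is.
ints_and_floats = np.array([1, 2, 3.5])   # every value becomes a float
numbers_and_text = np.array([1, "two"])   # every value becomes a string

print(ints_and_floats.dtype)         # float64
print(numbers_and_text.dtype.kind)   # U (unicode string)

# Rectangular nested lists become a 2-d array.
grid = np.array([[1, 2, 3], [4, 5, 6]])
print(grid.shape)                    # (2, 3)

# Vector operations act on every element at once.
doubled = grid * 2
print(doubled)
```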

## Initializing Arrays
Here's one way: start with a list and turn it into an array with the `np.array` function:

In [None]:
import numpy as np # numpy is usually abbreviated as 'np' on import

a = [0] * 40
print(type(a))
a = np.array(a)
print(type(a))
print(a)

You now have a one-dimensional array `a` holding 40 zeros. But there's a better way to get a vector of zeros:

In [None]:
a = np.zeros(40)
a

There's also `ones` (but no `twos` or `threes`...)

In [None]:
a = np.ones(20)
a

And here's how to declare something that's not all zeros:

In [None]:
a = np.arange(40)
print(a)

Like with range(), you can also give arange() more parameters:

In [None]:
np.arange(40, 50)  # Start and Stop

In [None]:
np.arange(40, 50, 2) # Start, Stop, and increment

In [None]:
np.arange(40,50,.25)

In [None]:
a = np.linspace(5, 6, 100) #from 5 to 6 with 100 items, linear spacing
a

In [None]:
a = np.logspace(0, 5, 20) # from 10^0 to 10^5, logarithmic spacing
a
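
One difference between `arange` and `linspace` is easy to miss: `arange` takes a step and leaves out the stop value, while `linspace` takes a count and includes it. A small sketch (the variable names are just for illustration):

```python
import numpy as np

by_step = np.arange(40, 50, 0.25)    # step of 0.25, stop value 50 is excluded
by_count = np.linspace(40, 50, 41)   # 41 evenly spaced values, stop value 50 is included

print(len(by_step), by_step[-1])     # 40 49.75
print(len(by_count), by_count[-1])   # 41 50.0
```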

As I said above, you can have arrays with more than one dimension

In [None]:
a = np.zeros(  (10, 10)   ) # Note the inner set of parentheses. (Rows, Columns)
a

In [None]:
# Or create using python list-of-lists
a = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [3, 6, 9, 12, 15]]
a = np.array(a)
a

### To recap: creating numpy arrays:
- `np.zeros`: array of 0.0 of specified size
- `np.ones` : array of 1.0 of specified size
- `np.array`: create numpy array from python list
- `np.linspace` : create a numpy array within a given interval with linear spacing

You can also load from text using [np.loadtxt](https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html) but pandas is really better for this sort of thing

## Accessing elements
### 1d arrays

In [None]:
#For 1d arrays, it's just the same as lists.
a = np.arange(10)
print('The whole array:',a)
print('The element at index 5 (the sixth element):',a[5])

In [None]:
# slicing conventions work
a[1:]

In [None]:
# However, can also index with a list, or with another numpy array
a = np.array([2, 4, 6, 8, 10, 12])
items = [1, 3, 5, 0] # items we'd like
a[items]


In [None]:
items = np.array(items) # Also works if 'items' is another numpy array
a[items]

In [None]:
# This DOES NOT work with normal python lists
# You'd have to write a comprehension or a loop

a = [2, 4, 6, 8, 10, 12]
items = [1, 3, 5, 0] # items we'd like

[a[i] for i in items]

You can also index into an array with boolean values
- must be the same length as the array (otherwise numpy raises an `IndexError`)
- doesn't work with regular python lists

In [None]:
a = np.array([2, 4, 6, 8, 10, 12])
b = np.array([True, True, False, False, False, True])
a[b]

This is very powerful, as we'll see later
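
As a quick preview of why this matters: the boolean array does not have to be typed by hand, since a comparison on the array produces one. A minimal sketch using the same values as above (`mask` and `large` are illustrative names):

```python
import numpy as np

a = np.array([2, 4, 6, 8, 10, 12])

mask = a > 6      # one True/False per element
large = a[mask]   # keeps only the elements where mask is True

print(mask)
print(large)
```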

### 2d arrays

For 2d arrays the indexing is [row, column]

In [None]:
a = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [3, 6, 9, 12, 15]]
a = np.array(a)
a

In [None]:
a[1, 2] # row 1, column 2 (zero-based indexing); the value there is 6

In [None]:
a[1, :] # entire row 1

In [None]:
a[:, 2] # entire column 2

Slicing and bools work as they did before, only on each axis
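
For instance, here is a small sketch on the same 3 x 5 array, with a slice on each axis and a boolean array choosing rows (`block`, `keep_rows`, and `picked` are illustrative names):

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5],
              [2, 4, 6, 8, 10],
              [3, 6, 9, 12, 15]])

block = a[0:2, 1:4]                         # rows 0-1, columns 1-3

keep_rows = np.array([True, False, True])   # one flag per row
picked = a[keep_rows, :]                    # rows 0 and 2, every column

print(block)
print(picked)
```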

## Numpy array properties

Numpy arrays have two important properties:

In [None]:
# shape/size
a = np.zeros((5, 10))
a

In [None]:
a.shape

In [None]:
a.size

In [None]:
len(a) # just gives you the length of the outer-most dimension

In [None]:
# numpy array type
a.dtype

Every value in a numpy array must be the same type (unlike normal python lists)
This is stored in the 'dtype' property

In [None]:
type(a[0, 0]) # type of a single entry

In [None]:
# arange produces integers by default (int64 on most platforms)
a = np.arange(20)
a.dtype

In [None]:
# can tell it to do otherwise
a = np.arange(20, dtype='float')
a.dtype

In [None]:
# or can cast from one to the other
a = np.arange(20).astype('float')
a.dtype

## Operating on numpy arrays

In [None]:
a = np.zeros((10, 10))

In [None]:
a[1, 3] = 7
a

In [None]:
a[1, 3] *= 2
a

So far, the coolest thing I've shown you isn't really that exciting: a range function that can have floats. The real power of arrays is the ability to have one statement affect a large chunk of an array:

In [None]:
a[1,:] = 1  # assign '1' to entire row 1
a

In [None]:
a[:,0] = 7 # assign 7 to entire column 0
a

In [None]:
a[a == 0] = -1 # Assign -1 everywhere array is zero (this works because (a == 0) produces a 2d boolean numpy array)
a

Let us pause for a moment and think about how we would do this with a for loop in lists:

In [None]:
# Create a list of lists of all zeros
LoL = [[0]*10 for i in range(10)] #LoL - List of Lists
 
# Set entries in row 1 to 1
for i, elem in enumerate(LoL[1]):
    LoL[1][i] = 1

# Set entries in column zero to 7
for L in LoL:
    L[0] = 7

We can also take slices of arrays, just as if they were lists.
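
One caveat worth a short sketch: a basic slice of a numpy array is a view onto the same data, whereas a slice of a list is an independent copy. The names below are illustrative only.

```python
import numpy as np

a = np.arange(10)
middle = a[3:6]        # a view onto elements 3, 4, 5 of `a`
middle[:] = -1         # writing through the view changes `a` as well
print(a)

lst = list(range(10))
lst_middle = lst[3:6]  # list slices are independent copies
lst_middle[0] = -1     # `lst` itself is unchanged
print(lst)
```

If an independent array is needed, `a[3:6].copy()` makes one.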

## Vector Math with Arrays
We can do math on many values at once with arrays, no for loop required.

In [None]:
a = np.arange(0, 100, 2)
b = np.arange(50)

a

In [None]:
b

In [None]:
b / 2.0

In [None]:
a * b # Pairwise multiplication

In [None]:
(a * b).sum()

In [None]:
np.dot(a, b) # or can take the dot product

In [None]:
# Other math functions which apply to every element
a = np.linspace(-np.pi, np.pi, 20)
np.sin(a)

Also:
- `np.exp` : exponential function
- `np.log` : logarithm
- `np.abs` : absolute value
- ... and many others
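
A short sketch of these element-wise functions on a small, made-up array:

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.5, 3.0])

magnitudes = np.abs(x)          # absolute value of every element
roundtrip = np.log(np.exp(x))   # log undoes exp, element by element

print(magnitudes)
print(roundtrip)
```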


## Useful numpy functions

NumPy is **huge**, with around 1200 pages of [reference documentation](http://docs.scipy.org/doc/numpy/reference/index.html), but all of you will, at some point, use some basic statistics to get a feel for your data. So let's make sure we hit some of those functions:

### Random distributions

In [None]:
a = np.random.uniform(0, 100, 10) # Low, High, Size of output
a

In [None]:
a = np.random.uniform(0, 100, (3,3)) # Can also give a shape for the third argument
a

In [None]:
a = np.random.normal(0, 1, 10) # Normal distribution with mean=0, std=1, 10 samples
a

### Summary Statistics

In [None]:
a = np.random.normal(5, 3, 1000)  # Draw 1000 numbers from a normal distribution with mean 5 and std 3
a.mean()

In [None]:
np.std(a) # Standard deviation

In [None]:
np.min(a)

In [None]:
np.max(a)

### Operating on 2d arrays
One of the areas where numpy really shines is its ability to quickly operate along an axis of a 2d array

In [None]:
a = np.ones((5,3))# 5 rows, 3 columns
a

In [None]:
a.sum()  # Sum over all elements

In [None]:
a.sum(axis=0)  # Sum over the rows (axis 0): one total per column

Rows are axis 0 and Columns are axis 1.  The order here makes sense because it's the same order that you use when indexing an array: rows first, then columns.

In [None]:
a.sum(axis=1) # Sum over the columns (axis 1): one total per row
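
Because an array of ones gives the same totals either way, a sketch with distinct values may make the axis argument easier to see. The array below is made up for illustration.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

column_totals = a.sum(axis=0)    # collapses the rows: one total per column
row_totals = a.sum(axis=1)       # collapses the columns: one total per row

print(column_totals)             # [3 5 7]
print(row_totals)                # [ 3 12]
```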

## Boolean Numpy Arrays for Selection and Filtering

In [None]:
a = np.zeros(10, dtype=bool)
a

In [None]:
# Slicing and mass-assignment still work
a[2:5] = True
a

In [None]:
# The ~ character inverts the boolean array
b = ~a
b

In [None]:
# Demonstrating "&" and "|"
a = np.array([True, False, True])
b = np.array([False, False, True])

print("A and B")
print(a & b)

print("A or B")
print(a | b)

Using boolean expressions, you can specifically read out or assign to pieces of the array based on the values in the array

In [None]:
data = np.random.randn(10)
data

In [None]:
data_less_than_zero = data < 0
data_less_than_zero

In [None]:
data[data_less_than_zero] = 0   # Replace all values less than zero, with zero
data

In [None]:
data = np.random.randn(10)
data[data < 0] = 0  # You could also do this without a temporary variable (data_less_than_zero)
data

In [None]:
data = np.random.rand(20,5)*10 # Random data from 0 to 10
data

In [None]:
# Show me the mean of each row
row_means = data.mean(axis=1)
row_means

In [None]:
# Give me a subset of the data matrix, containing only rows with a mean > 5
mean_greater_five = data.mean(axis=1) > 5
print("mean_greater_five: ", mean_greater_five)

In [None]:
new_matrix = data[mean_greater_five, :]
new_matrix

In [None]:
# OR, all in one line - without a temporary variable
new_matrix = data[ (data.mean(axis=1) > 5), : ]
new_matrix

### Why Numpy?
1. Avoid writing loops (don't re-invent the wheel)
2. Most other python data libraries work on top of numpy
3. **Efficient Computation**

Regarding the third point, numpy is useful because operations using it are many times faster than their pure Python implementations.  This is because numpy processes arrays using code written in 'low-level' languages like C or Fortran.  These languages are much more tedious to write programs in, but run much faster than a 'high-level' language like Python.  However, by using Python to call functions written by other people in low-level languages, you can get the best of both worlds.

**Quick performance comparison**

In [None]:
# Setup: create a 1000 x 1000 list of lists (2d matrix)
N_ROWS = 1000
N_COLS = 1000
python_matrix = [[1]*N_COLS for i in range(N_ROWS)]

In [None]:
%%timeit
# Add 1 to every entry in the matrix
for i in range(N_ROWS):
    for j in range(N_COLS):
        python_matrix[i][j] += 1

In [None]:
%%timeit
# List comprehensions help...a little
result = [[x+1 for x in row] for row in python_matrix]

In [None]:
%%timeit numpy_matrix = np.zeros((1000, 1000))
numpy_matrix += 1

Because numpy knows that every element is a float, it can apply optimizations that would not be possible if each element could, conceivably, be a different type. Furthermore, in the pure Python version a lot of the time is spent checking that i and j aren't too big or small for the size of the lists, while the numpy code just loads the size of the array once and never checks again.
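
The `%%timeit` cells only work inside IPython or Jupyter. As a rough sketch of the same comparison in a plain script, the standard-library `timeit` module can be used instead. The matrix size and the function names here are assumptions chosen to keep the run short; the exact timings will vary by machine.

```python
import timeit

import numpy as np

N = 200  # assumption: kept small so the sketch runs quickly
python_matrix = [[1.0] * N for _ in range(N)]
numpy_matrix = np.ones((N, N))


def add_one_python():
    # Pure Python: visit every entry with a nested comprehension
    return [[x + 1 for x in row] for row in python_matrix]


def add_one_numpy():
    # numpy: one vectorized statement
    return numpy_matrix + 1


t_python = timeit.timeit(add_one_python, number=20)
t_numpy = timeit.timeit(add_one_numpy, number=20)
print(f"pure Python: {t_python:.4f} s, numpy: {t_numpy:.4f} s")
```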

# Introduction to Pandas

Pandas is a great tool for working with data in Python.  The main object in Pandas you will use is the **DataFrame**.  It has several advantages over numpy ndarrays:

1. Allows mixed-types
2. Label-based row-column indices
3. Easy database-like operations (merge, join, groupby, sort, etc...)

**If pandas is so great, why'd we just learn about numpy?**
- many of the same things work on numpy arrays and pandas dataframes
- pandas actually uses numpy under the hood
- many libraries expect numpy arrays (but easy to cast from pandas to numpy)

In [None]:
import pandas as pd  # Pandas is usually abbreviated this way in python

To play around with Pandas, first let's read in some data from file.

In the nycflights13 folder, we have a set of files with data on all the flights that departed NYC airports in 2013.

In [None]:
# Read in data from a tab-delimited text-file
planes = pd.read_table("../data/nycflights13/planes.txt")

# Pandas also has read_excel, read_csv, read_json, read_sql and others

In [None]:
# What's this 'planes' variable have in it?
print(type(planes))

In [None]:
planes

In [None]:
# How big is it?
print(planes.shape)  # same as for numpy array

In [None]:
# What are the column labels?
print(planes.columns)

In [None]:
# What are the row labels?
print(planes.index)

There are three important types that are used by DataFrames:

- DataFrame
- Series
- Index

## Series

One-dimensional - represents a single column or row of data.  Only has one Index

## DataFrame

Two-dimensional.  Has both row and column labels (two Indexes)

## Index

This represents the row or column labels in Series and DataFrames

![DataFrame Vs Series](DataFrameVsSeries_.png)
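
To see the three types side by side, here is a minimal sketch on a hypothetical two-row table (`toy` and its values are made up and are not part of the nycflights13 data):

```python
import pandas as pd

# Hypothetical miniature table, for illustration only
toy = pd.DataFrame({"seats": [2, 150], "engines": [1, 2]}, index=["N1", "N2"])

print(type(toy))             # DataFrame: the whole table
print(type(toy["seats"]))    # Series: a single column
print(type(toy.loc["N1"]))   # Series: a single row
print(type(toy.index))       # Index: the row labels
print(type(toy.columns))     # Index: the column labels
```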

In [None]:
print(planes.columns)
print(type(planes.columns))

### DataFrame Indexing

You can grab a single column using
```
dataframe[column_name]
```

To grab a row, use:
```
dataframe.loc[row_name]
```

And to grab a specific element use:
```
dataframe.loc[row_name, column_name]
```

In [None]:
planes.head(10) # show just the first 10 rows

In [None]:
planes['manufacturer']

In [None]:
print(type(planes['manufacturer']))

When you grab a single column, you have a series

![ColumnIndex](ColumnIndex.png)

In [None]:
rowthree = planes.loc[3]  # We use 3 because the row index is just numbers right now
print(rowthree)
print(type(rowthree))

You'll notice the row is a 'Series', and it has its own index - the same as the columns of the data frame!

![Row Indexing](RowIndex.png)

### Dataframe index

So far the row-index has been numeric (just 0 through ~3300).  However, we might want to use labels here too.

To do this, we can select a column to be the dataframe's index

**Only do this if the column contains unique data**

In [None]:
planes.head(5) # Before

In [None]:
planes = planes.set_index('tailnum')

In [None]:
planes.head(5) # After

You can also set the index column when you read the file in:

```python
planes = pd.read_table('planes.txt', index_col=0) #Set the first column as the index
```

In [None]:
# Now we can grab a row by name:
planes.loc['N10156']

In [None]:
# Also use .loc to grab a single value

print(planes.loc['N10156', 'model'])

### But now how do I get a row by position since we changed the index to tail-numbers?

Here's where **iloc** comes into play.

Works like **loc** but uses integers

In [None]:
print(planes.iloc[3]) # Get the row at position 3 (the fourth row, counting from zero)

In [None]:
print(planes.iloc[:, 3]) # Get the column at position 3 (the fourth column, counting from zero)

### Indexing: In-summary

You can grab a single column using
```python
dataframe[column_name]
```

To grab a row, use:
```python
dataframe.loc[row_name]
```

And to grab a specific element use:
```python
dataframe.loc[row_name, column_name]
```

If you want to grab rows or column based on their position, use:
```python
dataframe.iloc[row_number or :, column_number or :]
```
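
The summary can be tried out on a small table. This is only a sketch: `toy` and its values are hypothetical, chosen so that label-based and position-based access are easy to compare.

```python
import pandas as pd

# Hypothetical miniature table, for illustration only
toy = pd.DataFrame(
    {"model": ["A320", "737", "E190"], "seats": [150, 140, 100]},
    index=["N1", "N2", "N3"],
)

by_label = toy.loc["N2", "seats"]   # row label, column label
by_position = toy.iloc[1, 1]        # row position 1, column position 1
first_two = toy.iloc[:2, :]         # positions 0 and 1, every column

print(by_label, by_position)
print(first_two)
```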

## Let's explore the 'flights' table

In [None]:
flights = pd.read_table("../data/nycflights13/flights.txt")

In [None]:
flights.shape

In [None]:
flights.head(5) # first 5 rows

In [None]:
flights.tail(5) # last 5 rows

In [None]:
flights.sample(5) # random 5 rows

### Perform functions along an axis

In [None]:
# Get the average air_time across all flights
flights['air_time'].mean()

In [None]:
subset = flights[['air_time', 'dep_delay', 'arr_delay']]  # Grab only these three columns
subset.mean(axis=0)  # Take mean across all rows

In addition to mean, there's also:
- min
- max
- median
- sum
- var (for variance)
- std (for standard deviation)

There's also `sort_values` to sort by one or more columns:

In [None]:
flights.sort_values("air_time").head(10)

# Shortest flights are only ~20 minutes from NYC to Philadelphia or Connecticut!

In [None]:
flights.sort_values(['year', 'month', 'day', 'hour', 'minute']).head(10)

# Sorts by year, then by month, then by day....and so on

**unique()** is useful for checking out the values in a column

In [None]:
flights['origin'].unique()  # Three departure airports in the NYC area in the data set

## Identifying and removing NAs in a dataset
How do you find missing values and remove observations for which there are NAs? 

In [None]:
# Example of some bad rows
flights.iloc[835:845, :]

In [None]:
#Are there any NAs in the flights dataframe?
flights.isnull().any()

In [None]:
#Selecting for flights where there is complete data, what are the dimensions?

print("Original Matrix Shape:", flights.shape)

null_rows = flights.isnull().any(axis=1) # Rows where any value is null
complete_rows = ~null_rows  # Invert the boolean series
flights_complete = flights.loc[complete_rows]

print("Complete-rows shape:", flights_complete.shape)

### Aside: Why does this work with loc?  

Earlier I showed .loc operating on row/column labels.

Well, it can also operate on boolean (true/false) lists (or numpy arrays, or **pandas Series**)

Above, what is null_rows?

In [None]:
print(type(null_rows))
null_rows

The great thing about Pandas is that if you pass in a Series, the order of the elements in it doesn't matter anymore.  It uses the index to align the Series to the row/column index of the dataframe.

This is very useful when creating a boolean index from one dataframe to be used to select rows in another!
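
A small sketch of that alignment, using hypothetical data: the boolean Series lists the labels in a different order from the dataframe, and `.loc` matches them up by label rather than by position.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({"delay": [5, -2, 30]}, index=["a", "b", "c"])

# Same labels as df, but in a different order: only 'c' is flagged True
mask = pd.Series([True, False, False], index=["c", "a", "b"])

selected = df.loc[mask]   # aligned on labels, so row 'c' is returned
print(selected)
```

If the selection were positional, the first row (`'a'`) would have come back instead.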

Alternately, with removing NA values there is a [dropna](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) function that can be used.
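
As a sketch of how `dropna` relates to the boolean-mask approach above, here both are applied to a hypothetical three-row table. The `subset` argument restricts the check to particular columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
toy = pd.DataFrame({
    "dep_delay": [5.0, np.nan, 12.0],
    "air_time": [30.0, 45.0, np.nan],
})

via_mask = toy.loc[~toy.isnull().any(axis=1)]    # the manual approach
via_dropna = toy.dropna()                        # same result in one call
subset_only = toy.dropna(subset=["dep_delay"])   # only require dep_delay to be present

print(via_dropna)
print(subset_only)
```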

Now...back to flights!

## Selecting specific rows

What if we wanted to find the average departure delay for each of the three airports?

A few ways we could do this:

In [None]:
ewr_delays = []
lga_delays = []
jfk_delays = []

for i in flights.sample(10000).index:  # Only running over a small part, this takes ~2 minutes over the whole thing!
    row = flights.loc[i]
    origin = row['origin']
    delay = row['dep_delay']
    
    if pd.isnull(delay): continue   #  Skip NaNs
        
    if origin == 'JFK':
        jfk_delays.append(delay)
    if origin == 'EWR':
        ewr_delays.append(delay)
    if origin == 'LGA':
        lga_delays.append(delay)
        
print('JFK Delay: ', sum(jfk_delays) / len(jfk_delays))
print('EWR Delay: ', sum(ewr_delays) / len(ewr_delays))
print('LGA Delay: ', sum(lga_delays) / len(lga_delays))
        

In [None]:
# A better way

lga_rows = (flights['origin'] == 'LGA')
print(lga_rows)

In [None]:
jfk_delays = flights.loc[flights['origin'] == 'JFK', 'dep_delay']
ewr_delays = flights.loc[flights['origin'] == 'EWR', 'dep_delay']
lga_delays = flights.loc[flights['origin'] == 'LGA', 'dep_delay']

print('JFK Delay: ', jfk_delays.mean())  # pandas mean ignores NaNs by default
print( 'EWR Delay: ', ewr_delays.mean())
print( 'LGA Delay: ', lga_delays.mean())

That's nice and all, but what if there were 100 origins?  

Wouldn't want to write 100 lines here!


### Using Groupby

In [None]:
# All in one statement
flights.groupby('origin')['dep_delay'].mean()

### What's happening here?

![GroupByExample](GroupBy.png)

In [None]:
# Could group by another variable - with more levels
flights.groupby('carrier')['dep_delay'].mean().sort_values()
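
Groupby is not limited to a single summary. As a sketch on hypothetical data, `agg` can compute several statistics per group in one pass:

```python
import pandas as pd

# Hypothetical data, for illustration only
toy = pd.DataFrame({
    "origin": ["JFK", "JFK", "EWR", "LGA", "EWR"],
    "dep_delay": [10.0, 20.0, 5.0, 0.0, 15.0],
})

# One row per origin, one column per statistic
summary = toy.groupby("origin")["dep_delay"].agg(["mean", "count"])
print(summary)
```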

## Merging tables 'vertically' // Subsetting and re-combining flights from different airlines
You will likely need to combine datasets at some point.  For simple acts of stitching two dataframes together, the pandas **concat** function is used.

Let's create a data frame with information on flights by United Airlines and American Airlines only, by creating two data frames via subsetting data about each airline one by one and then merging. 

The main requirement is that the columns must have the same names (may be in different order).

In [None]:
#Subsetting the dataset into two distinct data frames

flightsUA = flights.loc[flights.carrier == 'UA',]
flightsAA = flights.loc[flights.carrier == 'AA',]

print('UA rows:',flightsUA.shape[0])
print('AA rows:',flightsAA.shape[0])
print('Total rows:',flightsAA.shape[0] + flightsUA.shape[0])

In [None]:
# Combine the two data frames

flightsUAandAA = pd.concat([flightsUA,flightsAA], axis=0) # axis=1 would stitch them together horizontally
print('Combined rows: ',flightsUAandAA.shape[0])

Nothing special, just be sure the dataframes have the columns with the same names and types.



In [None]:
print('Binding 3 data frames and checking the number of rows')
allthree = pd.concat([flightsUA,flightsAA,flightsUAandAA])
allthree.shape[0]
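
Two details of `concat` may be worth a sketch: columns are matched by name, so their order can differ, and the original row labels are kept unless `ignore_index=True` is passed. The tiny frames below are hypothetical.

```python
import pandas as pd

# Hypothetical data, for illustration only
ua = pd.DataFrame({"carrier": ["UA", "UA"], "dep_delay": [3, 8]})
aa = pd.DataFrame({"dep_delay": [1], "carrier": ["AA"]})   # same columns, different order

combined = pd.concat([ua, aa], axis=0, ignore_index=True)  # renumber rows 0..n-1
print(combined)
```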

## Merging - by column

The `merge` function provides a way to combine two tables together based on the data in them

To demonstrate this, we'll look at combining the planes and flights tables

In [None]:
# Notice that planes has 'tailnum' as an index
planes.head(10)

In [None]:
# Flights has a column for tailnum - every flight corresponds to a row in planes
flights.head(10)

What if we want to know how many seats (total) were on flights that took off on February 1st?

In [None]:
# First, subset flights to just have rows for february first

feb1_flights = flights.loc[ (flights.month == 2) & (flights.day == 1)]

feb1_flights.head(10)

Next, we're going to merge the two tables together.  For every row in flights, we're going to add in columns from planes from the row that matches the flights 'tailnum'

In [None]:
feb1_flights_w_planes = feb1_flights.merge(planes, left_on='tailnum', right_index=True, how='left')
feb1_flights_w_planes.head(10) # Let's look at the resulting table

In this statement:

```python
feb1_flights.merge(planes, left_on='tailnum', right_index=True, how='left')
```

'left' refers to the first dataframe (feb1_flights), and 'right' refers to the second dataframe (planes)

`left_on='tailnum'` means:  Use the 'tailnum' column for the feb1_flights dataframe

We could also supply `right_on` to tell it what column to use in the planes dataframe, but since we want the index, we use `right_index=True` instead (you can't do `right_on='Index'` because what if a column was named Index?)

**Why did we use `how='left'`?**

In [None]:
len(set(planes.index) - set(feb1_flights.tailnum))

There are 2750 planes in the planes table that aren't in the feb1_flights table at all.

Here are the different arguments for how and what they'd do:

- 'left': use all rows from feb1_flights, and only rows from planes that match
    - feb1_flights rows with no corresponding plane row are filled with NaN for the plane columns
- 'right': use all rows for planes, and only rows from feb1_flights that match
    - plane rows with no corresponding feb1_flights row are filled with NaN for the feb1_flights columns
- 'inner': use only rows from feb1_flights and planes that match on tailnum
    - if a flight doesn't have an entry in the planes table, its row is dropped in the result
- 'outer': use all rows from both feb1_flights and planes
    - NaNs filled when they don't correspond
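
The four options can be compared by counting rows on a pair of tiny, hypothetical tables (one flight has no matching plane, and one plane has no matching flight). This is a sketch, not the real data:

```python
import pandas as pd

# Hypothetical data, for illustration only
toy_flights = pd.DataFrame({"flight": [1, 2, 3], "tailnum": ["N1", "N2", "N9"]})
toy_planes = pd.DataFrame({"seats": [100, 150, 200]}, index=["N1", "N2", "N3"])

row_counts = {}
for how in ["left", "right", "inner", "outer"]:
    merged = toy_flights.merge(toy_planes, left_on="tailnum", right_index=True, how=how)
    row_counts[how] = len(merged)

print(row_counts)   # {'left': 3, 'right': 3, 'inner': 2, 'outer': 4}
```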

In [None]:
# Finally, we can answer our question now by summing the 'seats' column, which came from the 'planes' table
feb1_flights_w_planes['seats'].sum()

That's a lot of seats!  

Just for flights leaving three airports in the NYC area on one day.

I'm sure all the flights weren't full, but I bet the real number of people departing is at least 80% of that figure.

## Another Merge example - what are the most common destination airports?
The `flights` dataset has destination airports coded as three-letter airport codes. What do they mean?

In [None]:
airports = pd.read_table('../data/nycflights13/airports.txt')
airports.head()

The `airports` table gives us a key! Let's merge the `flights` data with the `airports` data, using `dest` in `flights` and `faa` in `airports`.

In [None]:
print('Merging in pandas')
flights_readdest = flights_complete.merge(airports, left_on='dest', right_on = 'faa', how='left')
flights_readdest.head()

Well this merged dataset is nice, but do we really need all of this information?

In [None]:
flights_readdest.columns

In [None]:
flights_sm = flights_readdest[['origin', 'name', 'year', 'month', 'day', 'air_time']]
flights_sm.head()

Since each operation gives us back a dataframe, they are easily chained:

In [None]:
airtime = flights_complete.merge(airports, left_on='dest', right_on='faa', how='left') \
    .loc[:, ['origin', 'name', 'air_time']] \
    .groupby(['origin', 'name'])['air_time'] \
    .mean()

print(airtime.shape)
airtime

**Goal: What's the longest flight from each airport, on average?**

Here, 'airtime' is a little abnormal because its Index has two levels:

- First level is the 'origin'
- Second level is the name of the destination
    
This is because we grouped by two variables.

Now we need to group by 'origin' and apply the 'max' function.  Groupby can work for the levels of a multi-index too

In [None]:
airtime.groupby(level='origin').max()

In [None]:
# What if we want to know where the flight goes?

rows = airtime.groupby(level='origin').idxmax() # This returns the indices in airtime where the max was found
rows

In [None]:
airtime[rows] # Index by it to get the max rows

In [None]:
# Could also do it this way

airtime.reset_index() # moves the hierarchical index levels back into columns of a dataframe

In [None]:
airtime.reset_index().groupby('origin')['air_time'].max()
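
The level-based groupby can be sketched on a hypothetical miniature of the `airtime` Series (the airports and times below are made up):

```python
import pandas as pd

# Hypothetical data, for illustration only
toy = pd.DataFrame({
    "origin": ["JFK", "JFK", "EWR", "EWR"],
    "name": ["Honolulu", "Boston", "Honolulu", "Chicago"],
    "air_time": [620.0, 40.0, 610.0, 110.0],
})

# Grouping by two columns gives a Series with a two-level index
mean_airtime = toy.groupby(["origin", "name"])["air_time"].mean()

longest = mean_airtime.groupby(level="origin").max()     # the longest mean time per origin
where = mean_airtime.groupby(level="origin").idxmax()    # the (origin, name) label of that maximum

print(longest)
print(where)
```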

## Pivot Table // Average flight time from origin to destination

Let's put origins in rows and destinations in columns, and have `air_time` as values.

In [None]:
pvt_airtime = airtime.unstack() # Since airtime has a hierarchical index, we can use unstack
pvt_airtime

However, often you want to pivot just a regular dataframe.  I'll create one from airtime for an example:



In [None]:
airtime_df = airtime.reset_index()
airtime_df.head()

In [None]:
airtime_pv = airtime_df.pivot(index='origin', 
                columns='name',
                values='air_time')
airtime_pv
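
As a sketch of how the two routes relate, here `unstack` and `pivot` are applied to the same hypothetical data. Origin/destination pairs that never occur show up as NaN.

```python
import pandas as pd

# Hypothetical data, for illustration only
toy = pd.DataFrame({
    "origin": ["JFK", "JFK", "EWR", "EWR"],
    "name": ["Honolulu", "Boston", "Honolulu", "Chicago"],
    "air_time": [620.0, 40.0, 610.0, 110.0],
})
mean_airtime = toy.groupby(["origin", "name"])["air_time"].mean()

via_unstack = mean_airtime.unstack()   # last index level ('name') becomes the columns
via_pivot = mean_airtime.reset_index().pivot(
    index="origin", columns="name", values="air_time"
)

print(via_unstack)
print(via_pivot)
```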

## Multi-column merge // What's the weather like for departing flights?
Flights...get delayed. What's the first step if you want to know if the departing airport's weather is at all responsible for the delay? Luckily, we have a `weather` dataset for that.

Let's take a look.

In [None]:
weather = pd.read_table('../data/nycflights13/weather.txt')
weather.head()

In [None]:
print(flights_complete.columns.intersection(weather.columns)) # What columns do they share?

In [None]:
flights_weather = flights_complete.merge(weather, 
                         on=["year", "month","day","hour", "origin"])

print(flights_complete.shape)
print(flights_weather.shape) 


`flights_weather` has fewer rows.  The default behavior of merge is 'inner', which means there are some flight year/month/day/hour/origin combos where we don't have a weather entry
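
One way to find out which rows went unmatched is a left merge with `indicator=True`, which adds a `_merge` column saying where each row came from. The tables below are hypothetical stand-ins, not the real flights and weather data.

```python
import pandas as pd

# Hypothetical data, for illustration only
toy_flights = pd.DataFrame({
    "origin": ["JFK", "JFK", "EWR"],
    "hour": [6, 7, 6],
    "dep_delay": [5, 0, 12],
})
toy_weather = pd.DataFrame({
    "origin": ["JFK", "EWR"],
    "hour": [6, 6],
    "temp": [39.0, 41.0],
})

inner = toy_flights.merge(toy_weather, on=["origin", "hour"])   # default how='inner'

checked = toy_flights.merge(
    toy_weather, on=["origin", "hour"], how="left", indicator=True
)
unmatched = checked[checked["_merge"] == "left_only"]           # flights with no weather row

print(len(toy_flights), len(inner))
print(unmatched)
```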

In [None]:
# Let's grab flights+weather where the delay was greater than 200 minutes

flights_weather_posdelays = flights_weather.loc[flights_weather.dep_delay > 200]
flights_weather_posdelays.shape

In [None]:
# Anything unusual about these flights?
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure()
plt.hist(flights_weather.dropna().wind_gust, 30, range=(0, 50), density=True, label='normal', alpha=.7)
plt.hist(flights_weather_posdelays.dropna().wind_gust, 30, range=(0,50), density=True, label='delayed', alpha=.7)
plt.legend(loc='best')
plt.title('Wind Gust')

plt.figure()
plt.hist(flights_weather.dropna().pressure, 30,  density=True, label='normal', alpha=.7)
plt.hist(flights_weather_posdelays.dropna().pressure, 30,  density=True, label='delayed', alpha=.7)
plt.legend(loc='best')
plt.title('Pressure')

plt.figure()
plt.hist(flights_weather.dropna().hour, 30,  density=True, label='normal', alpha=.7)
plt.hist(flights_weather_posdelays.dropna().hour, 30,  density=True, label='delayed', alpha=.7)
plt.legend(loc='best')
plt.title('Hour')

## Some other tidying
## Capitalization issues.

In [None]:
flights_complete['dest'].str.lower().head() # For string columns, use .str to access string methods

In [None]:
flights_complete.dest.str.upper().head()

## Removing duplicates

In [None]:
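# Keep only the first row seen for each month
# (stored under a new name so that flights_complete is not overwritten)
first_per_month = flights_complete.drop_duplicates(subset='month', keep='first')
first_per_month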

## Writing to file

Pandas makes writing the results to a file very simple:

In [None]:
# Write to a CSV (comma-separated value) file

# top 20 rows
top = flights.head(20)
top.to_csv("flights_top.csv")

top.to_csv("flights_top.tsv", sep="\t")  # Use tab as a separator instead of comma


top.to_excel("flights_top.xlsx", sheet_name='FlightsTop')  # Write to an Excel workbook

# You might need to install the openpyxl module for Excel writing to work
# To do this, open a terminal and type in "conda install openpyxl", then restart the jupyter notebook by
# going to Kernel (at the top) and selecting 'Restart'.  You will have to re-run the earlier cells that load the data
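
By default `to_csv` also writes the row index as an extra first column. A minimal round-trip sketch, using an in-memory buffer instead of a real file and a hypothetical two-row table, shows how `index=False` leaves it out:

```python
import io

import pandas as pd

# Hypothetical data, for illustration only
toy = pd.DataFrame({"carrier": ["UA", "AA"], "dep_delay": [3, 8]})

buffer = io.StringIO()                      # stands in for a file on disk
toy.to_csv(buffer, sep="\t", index=False)   # tab-separated, no index column
text = buffer.getvalue()
print(text)

buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t")    # read it back with the same separator
print(restored)
```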

# Plotting in Python

- Matplotlib
- Seaborn
- Other Popular Libraries

# Matplotlib

In [None]:
import matplotlib.pyplot as plt

Kind of strange, but this is generally how matplotlib is imported.

In [None]:
# First plot
plt.figure()
x = [1, 2, 3, 4, 5]
y = [1, 6, 9, 6, 1]
plt.plot(x, y)
plt.show()

What is happening here:

- `plt.figure()`: Create a new figure
- `plt.plot(...)`: Add a plot to this figure
- `plt.show()`: Render the plot

In a terminal, `plt.show()` opens a new window.  Code execution halts until you close the window.

If you use **ipython** though, you can run `%matplotlib` in order to enable interactive plotting.

In jupyter notebook - use `%matplotlib inline` or `%matplotlib notebook`
- No need for `plt.show` then.  Any in-progress plots are shown at the end of a cell's execution automatically

In [None]:
%matplotlib inline
plt.figure()
plt.plot(x,y)

Why use `plt.figure`?

If you don't create a new figure explicitly, then plots are added to the existing figure.
- Unless there is no existing figure.  Then one is just created

In [None]:
plt.figure()
x = [1, 2, 3, 4, 5]
y = [1, 6, 9, 6, 1]
y2 = [5, 6, 7, 6, 5]
plt.plot(x,y)
plt.plot(x,y2)

Now, let's load some more interesting data

In [None]:
import pandas as pd
airlines = pd.read_table("../data/nycflights13/airlines.txt")
airports = pd.read_table("../data/nycflights13/airports.txt")
flights = pd.read_table("../data/nycflights13/flights.txt")
planes = pd.read_table("../data/nycflights13/planes.txt")
weather = pd.read_table("../data/nycflights13/weather.txt")

Let's look at the weather table - this table contains information on the weather at each of the three origin airports for every hour of 2013.

In [None]:
weather.head()

In [None]:
# Let's get the total precipitation for each month
monthly_precip = weather.groupby(['origin', 'month'])['precip'].sum().reset_index()
ewr_precip = monthly_precip.loc[monthly_precip.origin == 'EWR'].sort_values(['month']).precip.values
lga_precip = monthly_precip.loc[monthly_precip.origin == 'LGA'].sort_values(['month']).precip.values
jfk_precip = monthly_precip.loc[monthly_precip.origin == 'JFK'].sort_values(['month']).precip.values

print(ewr_precip)

In [None]:
# Let's add multiple line plots to the same axes
plt.figure()
plt.plot(lga_precip)
plt.plot(ewr_precip)
plt.plot(jfk_precip)

In [None]:
# Let's change the style of them
plt.figure()
plt.plot(lga_precip, 'o', markersize=10)
plt.plot(ewr_precip, 'v', markersize=10)
plt.plot(jfk_precip, '*', markersize=10)

### Line/Marker styles in Matplotlib

There are 2 ways to set the style for the line and for the markers drawn at the data points

1. Specify each property as its own argument in the plot function

    ```python
    plt.plot(x, y,
            linestyle='solid', linewidth=10, color='blue',
            marker='o', markersize=5, markerfacecolor='green', markeredgecolor='red'
            )
    ```

2. Use an abbreviation ([documented here](https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot))

    ```python
    plt.plot(x, y, '-ob')  # Solid line, 'o' marker, blue
    ```
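
To check that the two styles describe the same thing, here is a small sketch that draws one line each way and then reads the properties back. Selecting the `Agg` backend is an assumption made so the sketch also runs as a plain script without a display; in a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 6, 9, 6, 1]

plt.figure()
(short,) = plt.plot(x, y, "--or")                                    # abbreviation: dashed, 'o' marker, red
(explicit,) = plt.plot(x, y, linestyle="--", marker="o", color="r")  # same style, spelled out

print(short.get_linestyle(), short.get_marker(), short.get_color())
print(explicit.get_linestyle(), explicit.get_marker(), explicit.get_color())
```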

In [None]:
# Let's add a legend
plt.figure()
plt.plot(lga_precip, 'o', markersize=10, label='LGA')
plt.plot(ewr_precip, 'v', markersize=10, label='EWR')
plt.plot(jfk_precip, '*', markersize=10, label='JFK')
plt.legend()

Let's add a bit more information to the plot

In [None]:
plt.figure()
plt.plot(lga_precip, 'o', markersize=10, label='LGA')
plt.plot(ewr_precip, 'v', markersize=10, label='EWR')
plt.plot(jfk_precip, '*', markersize=10, label='JFK')
plt.legend()
plt.xlabel('Month')
plt.ylabel('Total Precipitation (inches)')
plt.title('Precipitation Over Time')

In [None]:
# It's easy to save the current figure in a variety of formats
plt.savefig('precipitation.pdf')
plt.savefig('precipitation.svg')
plt.savefig('precipitation.png')
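
`savefig` infers the file format from the extension, and it accepts a few options that are often useful. The sketch below is a minimal illustration with made-up data; the temporary directory, the file name, and the `dpi=150` value are arbitrary assumptions.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])  # made-up data

out_dir = tempfile.mkdtemp()  # hypothetical output location
png_path = os.path.join(out_dir, 'demo.png')

# dpi sets the resolution of raster formats such as PNG;
# bbox_inches='tight' trims empty margins around the plot
fig.savefig(png_path, dpi=150, bbox_inches='tight')
print(os.path.exists(png_path))
```

Calling `savefig` on the Figure object, rather than through `plt`, makes it explicit which figure is being written.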

### Scatter plots

When to use `plt.scatter` instead of `plt.plot`:
- Use a scatter plot when you want each point to have its own size or color
- Let's see an example

In [None]:
# Like this, it's not too different from plt.plot with markersize=5
plt.figure()
plt.scatter(weather.temp, weather.dewp, s=5)

In [None]:
plt.figure()
plt.scatter(weather['temp'], weather.dewp, c=weather.humid, s=5)

In [None]:
# Let's add an x=y line and a colorbar
plt.figure()
plt.scatter(weather['temp'], weather.dewp, c=weather.humid, s=5)
plt.plot([0, 100], [0, 100], '--', color='#aaaaaa', linewidth=.5)
plt.colorbar()

**Here we can learn something about temperature, dewpoint, and humidity**

- Dewpoint is the temperature at which water vapor in the air condenses onto a surface
- It rises roughly linearly with temperature, but the mapping is not 1-to-1
- By adding color, we can see that humidity accounts for the gap between temperature and dewpoint

### Multiple plots in a figure - using subplot

In [None]:
plt.figure() # Create a new figure
plt.subplot(1, 2, 1) # 1 row, 2 columns, plot #1

plt.plot(lga_precip, 'o', markersize=10, label='LGA')
plt.plot(ewr_precip, 'v', markersize=10, label='EWR')
plt.plot(jfk_precip, '*', markersize=10, label='JFK')
plt.legend()
plt.xlabel('Month')
plt.ylabel('Total Precipitation (inches)')
plt.title('Precipitation Over Time')

plt.subplot(1, 2, 2) # 1 row, 2 columns, plot #2

plt.scatter(weather.temp, weather.dewp, c=weather.humid, s=5)
plt.plot([0, 100], [0, 100], '--', color='#aaaaaa', linewidth=.5)
plt.colorbar()
plt.xlabel('Temperature (F)')
plt.ylabel('Dewpoint (F)')
plt.title('Temp vs Dewpoint')
plt.tight_layout()

### Adjusting subplots

You'll notice that while we have two plots here, they aren't positioned very nicely.

There are three ways to fix this:

1. Give the plots more space to start with
    - When you call `plt.figure`, you can specify a figure size in inches as `figsize=(width, height)`, for example `figsize=(10, 5)`<br><br>
    
2.  Manually specify different spacings using `plt.subplots_adjust`
    - left = 0.125
        - the left side of the subplots of the figure
    - right = 0.9
        - the right side of the subplots of the figure
    - bottom = 0.1
        - the bottom of the subplots of the figure
    - top = 0.9
        - the top of the subplots of the figure
    - wspace = 0.2
        - the amount of width reserved for blank space between subplots, expressed as a fraction of the average axis width
    - hspace = 0.2
        - the amount of height reserved for white space between subplots, expressed as a fraction of the average axis height<br><br>
3.  Call `plt.tight_layout()` and let matplotlib figure it out
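
The three options can be sketched as follows. This is a minimal illustration on empty axes; the 10x5 figure size and the `wspace=0.4` value are arbitrary choices, not recommendations.

```python
import matplotlib.pyplot as plt

# 1. Give the plots more room: figsize is (width, height) in inches
fig = plt.figure(figsize=(10, 5))
ax_left = plt.subplot(1, 2, 1)
ax_right = plt.subplot(1, 2, 2)

# 2. Set spacings by hand; wspace is a fraction of the average axes width
plt.subplots_adjust(wspace=0.4)

# 3. Or let matplotlib choose the spacing (this replaces the values from step 2)
plt.tight_layout()
```

In practice you would usually pick either step 2 or step 3, since `tight_layout` recomputes the spacings itself.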

In [None]:
plt.close('all') # Close all interactive plots currently open

### Histograms

Let's look at some histograms in matplotlib

To do this, we'll use some of the data in the 'flights' table

In [None]:
flights.head()

In [None]:
daily_departures = (flights.groupby(['origin', 'year', 'month', 'day'])
    .size()
    .reset_index())

ewr_departures = daily_departures.loc[daily_departures.origin == 'EWR'][0]  # .size() put its result in column 0
plt.figure()
plt.hist(ewr_departures, bins=30);

In [None]:
# Add multiple histograms - use 'alpha' to overlay

jfk_departures = daily_departures.loc[daily_departures.origin == 'JFK'][0]  # .size() put its result in column 0
lga_departures = daily_departures.loc[daily_departures.origin == 'LGA'][0]  # .size() put its result in column 0
plt.figure()
plt.hist(ewr_departures, bins=30, label='EWR', alpha=.6)
plt.hist(jfk_departures, bins=30, label='JFK', alpha=.6)
plt.hist(lga_departures, bins=30, label='LGA', alpha=.6)
plt.legend()

### Alternate plot styles

You can use the command `plt.style.use` to change the plotting style

Matplotlib comes with several options built-in

In [None]:
plt.style.available

In [None]:
plt.style.use('ggplot')
plt.figure()
plt.hist(ewr_departures, bins=30, label='EWR', alpha=.6)
plt.hist(jfk_departures, bins=30, label='JFK', alpha=.6)
plt.hist(lga_departures, bins=30, label='LGA', alpha=.6)
plt.legend(loc='best')
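
Note that `plt.style.use` changes the style for every plot that follows. If you only want a style for a single figure, matplotlib also provides `plt.style.context`, which can be used in a `with` block. A minimal sketch, with made-up values standing in for the departures data:

```python
import matplotlib.pyplot as plt

before = plt.rcParams['axes.facecolor']

# The style applies only inside the with-block
with plt.style.context('ggplot'):
    inside = plt.rcParams['axes.facecolor']
    fig, ax = plt.subplots()
    ax.hist([1, 2, 2, 3, 3, 3, 4, 4, 5], bins=5)  # hypothetical values

# Outside the block the previous settings are back
after = plt.rcParams['axes.facecolor']
print(before, inside, after)
```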

# Seaborn

Seaborn is a plotting library that is built on top of **matplotlib**

Anything seaborn draws can also be produced with plain matplotlib commands.  Seaborn just makes it much less tedious.

Seaborn is useful for:

- Heatmaps
- Statistical plots
- Dealing with Categorical variables

## Learning seaborn conventions with *jointplot*

`jointplot` is a handy function for plotting the joint distribution of two variables

In [None]:
import seaborn as sns  # This is how seaborn is usually abbreviated
plt.style.use('default')
tips = sns.load_dataset('tips')
tips.head()

In [None]:
# Pass x and y as keyword arguments (newer seaborn releases reject positional x and y)
# sns.jointplot(x=tips.total_bill, y=tips.tip)
sns.jointplot(x=tips.total_bill, y=tips.tip, kind='hex')

Here's our plot! Some things to notice:

- Seaborn has already filled in the x and y labels
- This is actually a matplotlib Figure.  The figure has three Axes (subplots)
- If we zoom in, the histograms zoom to follow

In [None]:
sns.jointplot(x=tips.total_bill, y=tips.tip, kind='kde')

### DataFrames and Seaborn

Seaborn makes it easy to just use dataframes directly.

Recall that our 'tips' dataframe has columns 'total_bill' and 'tip'

Most seaborn plotting functions can also be called with column names plus a `data` argument, so they read from the dataframe directly:

In [None]:
# Use dataframe columns directly
sns.jointplot(x='total_bill', y='tip', data=tips)


In [None]:
# Pairplot
sns.pairplot(vars=['total_bill', 'tip', 'size'], data=tips)

In [None]:
# Pairplot - add hue
sns.pairplot(vars=['total_bill', 'tip', 'size'], hue='time', data=tips)
plt.subplots_adjust(right=.85)

### Box/violin

These are all plots for showing distributions

Let's use a new dataset for this one

In [None]:
titanic = sns.load_dataset("titanic")
titanic.head()

In [None]:
plt.figure()
sns.boxplot(x='class', y='fare', data=titanic)
plt.ylim(0, 200);

In [None]:
plt.figure()
sns.boxplot(x='class', y='fare', hue='alive', data=titanic)
plt.ylim(0, 160)

In [None]:
plt.figure()
sns.swarmplot(x='class', y='fare', hue='alive', data=titanic)
plt.ylim(0, 200);
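
The heading above also mentions violin plots, which show a smoothed density estimate for each category instead of quartile boxes. Here is a minimal sketch using `sns.violinplot`; the `demo` dataframe and its `group` and `value` columns are hypothetical, generated only so the example runs without downloading a dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical data: two groups drawn from different normal distributions
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'group': ['a'] * 100 + ['b'] * 100,
    'value': np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 0.5, 100)]),
})

plt.figure()
# Same calling convention as sns.boxplot: column names plus data=
ax = sns.violinplot(x='group', y='value', data=demo)
```

The same call pattern should work on the titanic data, for example with `x='class'` and `y='fare'`.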

### Styles in Seaborn

One of seaborn's early uses was just to get prettier matplotlib plots.

Matplotlib defaults were ugly, but if you just imported seaborn, they'd be set to something that looked nicer.

Now, matplotlib (2.0 and up) has decent-looking defaults, but seaborn still has some nice options

- sns.set_style(*style_name*) can switch between plotting styles
    - *style_name* are 'white', 'dark', 'whitegrid', 'darkgrid', 'ticks'<br><br>
- sns.despine() removes the top and right axes spines<br><br>
- sns.set_context(*context_name*) will scale plot elements
    - *context_name* are 'talk', 'paper', 'notebook', 'poster'<br><br>

In [None]:
sns.set_context('talk') # Make text bigger
sns.set_style("dark")
plt.figure()
sns.boxplot(x='class', y='fare', hue='alive', data=titanic)
plt.ylim(0, 160)

In [None]:
sns.set_context('paper') # Make text smaller
sns.set_style("white")
sns.set_style("ticks")
plt.figure()
sns.boxplot(x='class', y='fare', hue='alive', data=titanic)
plt.ylim(0, 160)
plt.subplots_adjust(left=.2, bottom=.2)
sns.despine(offset=25)

In [None]:
sns.set_context('notebook')

### Using *FacetGrid* to create plots for every level of a categorical variable

Let's visualize the distribution of the number of departures per day, in a separate plot for each month


In [None]:
flights_per_day = flights.groupby(['origin', 'month', 'day']).size().reset_index()
flights_per_day.head()

In [None]:
plt.close('all')
g = sns.FacetGrid(data=flights_per_day, col='month', col_wrap=4)
g.map(plt.hist, 0)

In [None]:
# And we can add a hue (because why not?)
g = sns.FacetGrid(data=flights_per_day, col='month', col_wrap=4, hue='origin')
g.map(plt.hist, 0, range=(200, 400), alpha=.6)

In [None]:
# That's still a little hard to see, let's use box plots instead
g = sns.FacetGrid(data=flights_per_day, col='month', col_wrap=4)
g.map(sns.boxplot, "origin", 0)

### Heatmaps

Heatmaps are an area where seaborn really makes things a bit easier.

You can use 'plt.pcolormesh' to plot a grid of color in matplotlib.  However, you'll have to set up all the tick-labels manually.
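
To give a sense of the manual setup involved, here is a rough sketch of labeling a `pcolormesh` grid by hand. The 2x3 `values` array and the row and column names are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical 2x3 grid with row and column names
values = np.array([[1, 2, 3],
                   [4, 5, 6]])
row_names = ['row A', 'row B']
col_names = ['x', 'y', 'z']

fig, ax = plt.subplots()
mesh = ax.pcolormesh(values)
fig.colorbar(mesh, ax=ax)

# Cell (i, j) is drawn between j..j+1 and i..i+1, so ticks go at the centers
ax.set_xticks(np.arange(len(col_names)) + 0.5)
ax.set_xticklabels(col_names)
ax.set_yticks(np.arange(len(row_names)) + 0.5)
ax.set_yticklabels(row_names)
ax.invert_yaxis()  # put the first row at the top, as sns.heatmap does
```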

Here's an example plotting a heatmap with seaborn

In [None]:
# Number of departures per hour
# Rows - day of the week
# Columns - hour of the day
# Values - avg # of flights

def day_of_week_2013(month, day):
    """
    2013 was NOT a leap year and started on a Tuesday
    """
    days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    day_num = (month-1)*31 + day
    if month > 2: day_num -= 3
    if month > 4: day_num -= 1
    if month > 6: day_num -= 1
    if month > 9: day_num -= 1
    if month > 11: day_num -= 1
    
    return days[(day_num) % 7]

flights['Weekday'] = [day_of_week_2013(month, day) for month, day in zip(flights.month, flights.day)]

counts = flights.groupby(['Weekday', 'hour', 'day', 'month']).size().reset_index()
weekday_counts = counts.groupby(['Weekday', 'hour'])[0].mean()
weekday_counts = weekday_counts.reset_index()
weekday_counts = weekday_counts.pivot(index='Weekday', columns='hour', values=0)
weekday_counts = weekday_counts.fillna(0)
weekday_counts = weekday_counts.loc[['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']] # Sort the weekdays
weekday_counts
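
The hand-written `day_of_week_2013` function works because the year is fixed. As a cross-check, or for data spanning other years, pandas can compute weekday names from `year`, `month`, and `day` columns. A minimal sketch on a tiny hypothetical table (the `dates` dataframe is made up; the `flights` table is assumed to have the same three columns, as the earlier `groupby` suggests):

```python
import pandas as pd

# Hypothetical stand-in for the year/month/day columns of the flights table
dates = pd.DataFrame({
    'year': [2013, 2013, 2013],
    'month': [1, 3, 12],
    'day': [1, 1, 31],
})

# to_datetime assembles a date from columns named year, month and day;
# .dt.day_name() then returns the weekday name for each date
dates['Weekday'] = pd.to_datetime(dates[['year', 'month', 'day']]).dt.day_name()
print(dates)
```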

In [None]:
plt.figure()
sns.heatmap(weekday_counts)

That's....ok.  But still needs a little work.

In [None]:
plt.figure()
data2 = weekday_counts.drop([1, 5, 22, 23], axis='columns')
sns.heatmap(data2, cmap='YlGnBu', vmin=15, vmax=75)
plt.yticks(rotation=20)
plt.ylabel("")
plt.tight_layout()

In [None]:
plt.figure()
data2 = weekday_counts.drop([1, 5, 22, 23], axis='columns')
sns.heatmap(data2, cmap='YlGnBu', vmin=15, vmax=75, annot=True, square=True, cbar=False)
plt.yticks(rotation=20)
plt.ylabel("")
plt.tight_layout()

## Clustermap - Heatmap+Dendrogram

Let's see if we can cluster the days based on how they deviate from the average # of flights in the day

In [None]:
# Get an idea of how every day deviates from the average number of flights in that day of the week
counts = flights.groupby(['Weekday', 'hour', 'day', 'month']).size().reset_index()
counts['avg'] = [weekday_counts.loc[weekday, hour] for weekday, hour in zip(counts.Weekday, counts.hour)]
counts['deviation'] = counts[0] - counts['avg']
counts['date'] = ["{}/{}".format(month, day) for month, day in zip(counts.month, counts.day)]
counts.head()

In [None]:
data = counts.pivot(index='date', columns='hour', values='deviation').fillna(0)
data.head()

In [None]:
cm = sns.clustermap(data, col_cluster=False, vmin=-10, vmax=10)
plt.sca(cm.ax_heatmap) # More on this in a minute
plt.yticks(rotation=0);

# Advanced Matplotlib




## Pyplot vs Objects

There are two ways to interact with Matplotlib plots: through the pyplot functions, or by calling methods on Figure and Axes objects.

When we use `pyplot.plot`  (abbreviated as plt.plot) to plot a picture, we're using the pyplot state machine.

This was developed to give a MATLAB-like interface to the plotting system in matplotlib.

`plt.plot` creates a plot using the **current axes** in the **current figure**.

We could also call methods on these Figures and Axes directly.
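
A minimal sketch of the object-based approach, using `plt.subplots` to create the Figure and Axes up front. The data points are made up.

```python
import matplotlib.pyplot as plt

# plt.subplots creates a Figure and an Axes and returns both
fig, ax = plt.subplots()

# Object-oriented calls go to the Axes directly
ax.plot([1, 2, 3], [1, 4, 9])  # made-up data
ax.set_xlabel('x')

# pyplot calls act on the same object, because it is the current Axes
plt.ylabel('y')
print(plt.gca() is ax)
```

Holding on to `fig` and `ax` means later code can keep modifying this plot even after another figure has become the current one.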

### Figures and Axes

The image below is a single Figure object, with multiple Axes.

![Figures/Axes](subplots.png)

When we call `plt.plot`, it's actually calling the `plot` method of the current Axes object.


*Snippet from the matplotlib.pyplot source on Github*
![Pyplot_plot](pyplot_plot.png)

Focusing on the red-underlined parts, you can see the function really does two things:

1. Get the **current axes** by calling `gca()`
2. Call `ax.plot(...)` on that axes, passing the arguments through

*The rest has to do with the 'hold' state, which determines whether new plots are added to a figure or replace the current plot*

In [None]:
plt.figure()
ax = plt.gca() # 'gca - Get Current Axes'
ax.plot([1, 2, 3, 4, 5], [1, 3, 5, 3, 1])

In [None]:
fig = plt.gcf() # gcf - 'Get Current Figure'
repr(fig)

The Axes and Figure objects give us a handle for accessing and modifying the various parts of a plot

![Anatomy of a Figure](anatomy_of_a_figure.png)

In [None]:
fig = plt.figure(figsize=(8, 5)) # Create a new figure
ax1 = plt.subplot(1, 2, 1) # Create a subplot (returns an axes)
ax2 = plt.subplot(1, 2, 2) # Create the other subplot (returns an axes)

sns.boxplot(x='class', y='fare', hue='alive', data=titanic, ax=ax2) # Create our titanic plot, tell seaborn to use axes 2

ax1.plot(lga_precip, 'o', markersize=10)
ax1.plot(ewr_precip, 'v', markersize=10)
ax1.plot(jfk_precip, '*', markersize=10)

ax1.set_xlabel("Month") # same as plt.xlabel
ax1.set_ylabel("Precipitation") # same as plt.ylabel
ax1.set_title("Precipitation per month")

ax2.set_title("Titanic Survival")

fig.suptitle("Some plots!")

fig.subplots_adjust(wspace=.5)

This gives us a way to modify the plots that seaborn creates

In [None]:
jp = sns.jointplot(x='total_bill', y='tip', data=tips)
ax = jp.ax_joint # jointplot has 3 axes.  The main one is in this variable

rot = 0
for tick in ax.get_yticklabels(): # Rotate the Y tick labels
    tick.set_rotation(rot)
    rot += 45 
    

## Gridspec for more complicated figure layouts

So far we have only shown subplots whose axes divide the plot area evenly.

If you want a more complicated layout (like what Seaborn does in jointplot above), you can use GridSpec.

In [None]:
from matplotlib import gridspec
plt.figure()
gs = gridspec.GridSpec(2, 2,
                       width_ratios=[1,2],
                       height_ratios=[4,1]
                       )

ax1 = plt.subplot(gs[0])
ax2 = plt.subplot(gs[1], sharey=ax1)
ax3 = plt.subplot(gs[2], sharex=ax1)
ax4 = plt.subplot(gs[3], sharex=ax2, sharey=ax3)

for ax in [ax1, ax2, ax3, ax4]:
    ax.plot(lga_precip, 'o', markersize=10)
    ax.plot(ewr_precip, 'v', markersize=10)
    ax.plot(jfk_precip, '*', markersize=10)


In [None]:
# Or even crazier layouts

plt.figure()
gs = gridspec.GridSpec(3, 3)

ax1 = plt.subplot(gs[0, :])   # Use row 1, all columns
ax2 = plt.subplot(gs[1,:-1])  # Use row 2, all columns but the last one
ax3 = plt.subplot(gs[1:, -1]) # Use row 2&3, only the last column
ax4 = plt.subplot(gs[-1,0])   # Use the last row, first column
ax5 = plt.subplot(gs[-1,-2])  # Use the last row, second column

plt.subplots_adjust(wspace=.3, hspace=.3)

# Other plotting tools

- [Plotly](https://plot.ly/python/) (Interactive HTML/JS plots)
- [Bokeh](http://bokeh.pydata.org/en/latest/docs/gallery/les_mis.html) (More interactive HTML/JS plots)
- [ggpy](https://github.com/yhat/ggpy) (ggplot-style plotting in Python)

# Closing Remarks

Whew, we covered a whole lot here! 

I wouldn't expect anyone to be an expert in these tools after just one lesson, but hopefully this has given you an idea of the *kinds* of tasks where these tools are used, and how they can help you analyze data.