![DSB Logo](img/Dolan.jpg)
# The pandas Library
## pandas Series and DataFrames 
[The pandas Docs](http://pandas.pydata.org/pandas-docs/stable/index.html)

# Learning Objectives

## Theory / Be able to explain ...
- pandas vis-a-vis NumPy
- The Series and DataFrame data types

## Skills / Know how to  ...
- Create well-structured Series and DataFrames from lists, dicts, NumPy arrays, etc.
- Import and export data from/to various sources

# What's pandas?
## NumPy needs a little help sometimes ...

# From the docs ...
> pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with 'relational' or 'labeled' data both easy and intuitive.

- Sounds a lot like a database management system, right? 
- Why would we need that if we already have NumPy?

# Information is more than just Data
- NumPy is a very fast and efficient number crunching machine for structured data, which is great, but … 
   - NumPy arrays are not so good at preparing data for analysis
   - What if we only want a subset of the rows and columns? Is slicing enough?
   - What if our data tables have missing values? 
   - What if we want more than ints, floats, and strings?
- NumPy also does almost nothing to present data for human consumption

# pandas FTW
pandas is built on top of NumPy (for now) to provide ...
- More flexible data structures that can handle **more data types and indexing schemes**
- A wide variety of functions and methods for **slicing and dicing large data sets** into NumPy-friendly chunks
- Import and export facilities for **just about any data source** one might actually encounter

# Why Do We Need NumPy Then?
pandas’s flexibility and utility come at a performance cost: 
- **Speed:** NumPy, with its highly structured and optimized numerical routines, is usually faster than pandas for raw number crunching on homogeneous arrays.
- **Storage:** pandas’s greater expressiveness also makes it more verbose and less space efficient.  

Nonetheless, pandas makes some things possible that NumPy can’t do alone.

# Bridging an Architectural Gap
![NumPy Data Gap](img/L6_NumPy_gap1.png)

![NumPy Data Gap](img/L6_NumPy_gap2.png)

# pandas `dtype`s
Recall that NumPy is at its best with fixed-size types such as `int`, `float`, and fixed-width strings (in various formats)

pandas has a larger set of dtypes to choose from:
- Primitives: `int`, `float`, and `bool`
- Date/Time: `datetime64` and `timedelta64`
- Enumerations: `category`
- Serializables: `object` (e.g., string spec'ed as `'O'`) 

Note that in pandas strings are not considered a primitive but booleans are. 
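
One way to see these dtypes in action is to build a small Series of each kind and inspect its `.dtype`. This is only a minimal sketch: the values and variable names are made up for illustration.

```python
import pandas as pd

# One small Series per dtype family (all values are illustrative)
ints = pd.Series([1, 2, 3])                      # int64
floats = pd.Series([1.5, 2.5])                   # float64
bools = pd.Series([True, False])                 # bool
dates = pd.to_datetime(pd.Series(['2020-01-01', '2020-03-01']))  # datetime64
deltas = dates - dates.iloc[0]                   # timedelta64
cats = pd.Series(['rock', 'gas', 'rock'], dtype='category')
objs = pd.Series(['a', 1, None], dtype=object)   # arbitrary Python objects

for s in (ints, floats, bools, dates, deltas, cats, objs):
    print(s.dtype)
```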

# Missing Data (NaN)
Sometimes data tables are incomplete, with missing data. When this happens pandas stores `NaN` ('not a number') instead of a number.  
**This is a feature, not a bug!**
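
A short sketch of why this is handy, assuming a made-up list of readings with two gaps: pandas can flag the missing entries, skip them in summary statistics, or drop them.

```python
import numpy as np
import pandas as pd

# None and np.nan both become NaN in a float Series
readings = pd.Series([3.1, None, 8.9, np.nan])

print(readings.isna())     # boolean mask of the missing entries
print(readings.mean())     # NaN values are skipped by default
print(readings.dropna())   # keep only the observed values
```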

# Standard Imports
The remaining slides assume that we have already imported NumPy and pandas in the standard way.

In [1]:
import numpy as np
import pandas as pd

# Series and DataFrames
pandas includes two data structures:
- 1D **Series** acts like an `array` and a `dict` at the same time
   - Values stored in a given order and in the same `dtype`
   - An `index` consisting of 'labels' (keys), one per value
   - Compatible with the `ndarray` data type. In fact, you can use most NumPy methods and functions without converting anything.

- 2D **DataFrame** is a container for multiple, equal-length *columns* of data
   - Conceptually, a `dict` of Series, one key per column, though there are subtle differences between a one-column DataFrame and a single Series.
   - Designed to be equivalent to a database table, with functions and methods to match.
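
To make the two structures concrete, here is a minimal sketch (the `density`, `roots`, and `frame` names are illustrative). It applies a NumPy function directly to a Series and shows one of the differences between a Series and a one-column DataFrame.

```python
import numpy as np
import pandas as pd

density = pd.Series([3.1, 2.8, 8.9],
                    index=['apatite', 'calcite', 'copper'],
                    name='density')

# NumPy ufuncs apply elementwise and keep the labels
roots = np.sqrt(density)
print(roots)

# A Series is 1D; a one-column DataFrame is 2D
frame = density.to_frame()
print(density.shape)   # (3,)
print(frame.shape)     # (3, 1)

# Selecting a column from the DataFrame gives back a Series
print(type(frame['density']))
```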

# The Series Data Structure
## NumPy `ndarray`s with a few useful tweaks

# Series is a Container for 1D data
- A pandas series has a common data type (int, float, etc.) and optionally a **name** (like a column header). 
- The **values** in the series are **indexed** either by position (0,1,2,3,4,etc.) or by label ('a', 'b', 'c', etc.).
- Series are also NumPy compatible (implementing much of the same interface as `ndarray`)

# Series are `array`-like and `dict`-like

In [2]:
# A series with index labels
s = pd.Series([3.1, 2.8, 8.9], 
               index=['apatite', 'calcite', 'copper'])

In [3]:
# can slice like a NumPy array
s[:2]

apatite    3.1
calcite    2.8
dtype: float64

In [4]:
# can use keys like a dict 
s['apatite']

3.1
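
Because a Series answers to both positions and labels, it can help to say which one you mean. The sketch below uses the `.iloc` (position) and `.loc` (label) accessors on the same series.

```python
import pandas as pd

s = pd.Series([3.1, 2.8, 8.9],
              index=['apatite', 'calcite', 'copper'])

print(s.iloc[0])                     # by position
print(s.loc['copper'])               # by label
print('calcite' in s)                # membership tests the index, like dict keys
print(s.loc['apatite':'calcite'])    # label slices include the end label
print(s[s > 3])                      # boolean mask, as in NumPy
```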

# Creating Series
Ultimately, every series has the following elements:
- data: a sequence of items
- `dtype`: all items will be coerced to the same data type
- `index`: a sequence of hashable keys/labels (one per item), usually unique although pandas does not enforce it
- `name`: optional, more for us humans than anything else

While each of these things can be given at series creation, often pandas can infer them for us. 

This allows us to populate a series in lots of ways. For example ...

## with just data

In [5]:
s1 = pd.Series([3.1, 2.8, 8.9])
s1

0    3.1
1    2.8
2    8.9
dtype: float64

The `dtype` is inferred from the data.  
Integer indexes are generated automatically.

## with data + dtype 

In [6]:
s2 = pd.Series([3.1, 2.8, 8.9], dtype='float32')
s2

0    3.1
1    2.8
2    8.9
dtype: float32

pandas usually guesses the intended dtype pretty well. We can explicitly set it when needed, however. 
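
As a rough illustration of the coercion rule, the sketch below (with made-up values) shows what pandas infers for mixed input and how `astype()` converts an existing series.

```python
import pandas as pd

mixed = pd.Series([1, 2, 3.5])   # ints and floats -> float64
whole = pd.Series([1, 2, 3])     # all ints -> int64

# astype() returns a converted copy; the original is unchanged
as_float = whole.astype('float64')
as_small = whole.astype('int8')

print(mixed.dtype, whole.dtype, as_float.dtype, as_small.dtype)
```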

## with data + index

In [7]:
s3 = pd.Series([3.1, 2.8, 8.9], 
               index=['apatite', 'calcite', 'copper'])
s3

apatite    3.1
calcite    2.8
copper     8.9
dtype: float64

Here we have included labels as indexes.

## with data + index + name

In [8]:
s4 = pd.Series([3.1, 2.8, 8.9], 
               index=['apatite', 'calcite', 'copper'], 
               name='density')
s4

apatite    3.1
calcite    2.8
copper     8.9
Name: density, dtype: float64

The name is optional but does help with displaying multiple series at a time.

## … from a NumPy `ndarray`

In [9]:
s5 = pd.Series(np.array([3.1, 2.8, 8.9]),
               index=['apatite', 'calcite', 'copper'],
               name='density')
s5

apatite    3.1
calcite    2.8
copper     8.9
Name: density, dtype: float64

Not surprisingly, we can use an `ndarray` instead of a list.

## … from a `dict`

In [10]:
s6 = pd.Series({'apatite':3.1,'calcite':2.8,'copper':8.9},
               name='density')
s6

apatite    3.1
calcite    2.8
copper     8.9
Name: density, dtype: float64

The dictionary keys are used as the indexes, just as we'd expect. 
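
A `dict` can also be combined with an explicit `index`. In the sketch below the index selects and orders the entries, and a label with no matching key ends up as `NaN`. The `'quartz'` label is a made-up example of a key that is not in the dictionary.

```python
import pandas as pd

densities = {'apatite': 3.1, 'calcite': 2.8, 'copper': 8.9}

# The index picks and orders entries; 'quartz' is not in the dict
s = pd.Series(densities, index=['copper', 'quartz', 'apatite'],
              name='density')
print(s)
```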

# The DataFrame Data Structure
## The Workhorse of Data Science

# DataFrames are for 2D data
- Most similar to a database table:
    - organized into rows and columns
    - each column has a name and a data type
    - each row has an index (numbered or labeled)
- Convertible from/to NumPy Structured Arrays.
    - Even the attributes (names) from `rec.array` translate to column labels
- Have advanced indexing features to provide 'query-like' selections of data


# DataFrames as Series Containers
Internally, a DataFrame organizes 2D data into 1D columns:
- each column is a Series with a shared index and the same length 
- each column has a name, which works like a key  

In fact, we can construct a DataFrame from a `dict` of `pd.Series`

In [11]:
planets = ['Earth', 'Mercury', 'Venus'] # the shared index
solar_system = pd.DataFrame(
                   {   'diam': pd.Series([12756,4878,12104],index=planets), 
                       'spin': pd.Series([0.997, 59, 243], index=planets),
                       'orbit': pd.Series([365.25, 88, 0.9], index=planets)
                   }
             )
solar_system

| | diam | spin | orbit |
|---|---|---|---|
| Earth | 12756 | 0.997 | 365.25 |
| Mercury | 4878 | 59.0 | 88.0 |
| Venus | 12104 | 243.0 | 0.9 |
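
The "container of Series" idea can be checked directly. In this minimal sketch (a two-column version of the example above), pulling out a column returns a Series that carries the DataFrame's index and its own dtype.

```python
import pandas as pd

planets = ['Earth', 'Mercury', 'Venus']
solar_system = pd.DataFrame({
    'diam': pd.Series([12756, 4878, 12104], index=planets),
    'spin': pd.Series([0.997, 59, 243], index=planets),
})

diam = solar_system['diam']                    # one column, returned as a Series
print(type(diam).__name__, diam.name)
print(diam.index.equals(solar_system.index))   # same row labels
print(solar_system.dtypes)                     # one dtype per column
```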


# Creating DataFrames
Lots and lots and lots of options:
- From a `list` of `dict`s (row-wise)
- From a `dict` of `list`s or `np.arrays` (column-wise)
- From an `ndarray` or `rec.array` 
- From a `Series` (or `dict` of `Series`)
- Using the `pd.DataFrame.from_dict()` and `pd.DataFrame.from_records()` class methods (the older `from_items()` was removed in pandas 1.0)  
...
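
As a quick taste of the alternate constructors, here is a minimal sketch using `pd.DataFrame.from_dict()` and `pd.DataFrame.from_records()`. The `by_planet` and `by_record` names and the trimmed-down data are illustrative only.

```python
import pandas as pd

# from_dict with orient='index': the outer keys become row labels
by_planet = pd.DataFrame.from_dict(
    {'Mercury': {'diam': 4878, 'spin': 59},
     'Venus': {'diam': 12104, 'spin': 243}},
    orient='index')

# from_records: one tuple per row, column names given separately
by_record = pd.DataFrame.from_records(
    [('Mercury', 4878), ('Venus', 12104)],
    columns=['name', 'diam'])

print(by_planet)
print(by_record)
```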


## … From a `list` of `dict`s

In [12]:
planets_list_of_dicts = [{'name':'Mercury','diam':4878,'spin':59,'orbit':88,'grav':0.38},
 {'name':'Venus','diam':12104,'spin':243,'orbit':224,'grav':0.9},
 {'name':'Earth','diam':12756,'spin':0.997,'orbit':365.25,'grav':1.0},
 {'name':'Mars','diam':6794,'spin':1.025,'orbit':687,'grav':0.38},
 {'name':'Jupiter','diam':142984,'spin':0.413,'orbit':4329,'grav':2.64},         
 {'name':'Saturn','diam':120536,'spin':0.44375,'orbit':10592.25,'grav':1.16},
 {'name':'Uranus','diam':51118,'spin':0.71805,'orbit':30681,'grav':1.11},
 {'name':'Neptune','diam':49532,'spin':0.67153,'orbit':60193.2,'grav':1.21}]
planets1 = pd.DataFrame(planets_list_of_dicts)

In [13]:
planets1

| | diam | grav | name | orbit | spin |
|---|---|---|---|---|---|
| 0 | 4878 | 0.38 | Mercury | 88.0 | 59.0 |
| 1 | 12104 | 0.9 | Venus | 224.0 | 243.0 |
| 2 | 12756 | 1.0 | Earth | 365.25 | 0.997 |
| 3 | 6794 | 0.38 | Mars | 687.0 | 1.025 |
| 4 | 142984 | 2.64 | Jupiter | 4329.0 | 0.413 |
| 5 | 120536 | 1.16 | Saturn | 10592.25 | 0.44375 |
| 6 | 51118 | 1.11 | Uranus | 30681.0 | 0.71805 |
| 7 | 49532 | 1.21 | Neptune | 60193.2 | 0.67153 |


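Notice how the columns are sorted by name? The pandas release used here sorted the `dict` keys; recent releases instead keep the keys in the order they first appear.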
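
If the column order or the row labels matter, it may be safer to set them explicitly than to rely on the constructor's defaults. The sketch below uses a shortened, made-up variant of the list (the third row deliberately omits `'grav'` to show what a missing key does) and then promotes `name` to the index with `set_index()`.

```python
import pandas as pd

rows = [{'name': 'Mercury', 'diam': 4878, 'grav': 0.38},
        {'name': 'Venus', 'diam': 12104, 'grav': 0.9},
        {'name': 'Earth', 'diam': 12756}]   # no 'grav' key in this row

planets = pd.DataFrame(rows)

# Pick and order the columns explicitly, then use 'name' as the index
planets = planets[['name', 'diam', 'grav']].set_index('name')
print(planets)
```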

## … From a `dict` of `dict`s
One `dict` per column, with the names as keys.

In [14]:
planets_dict_of_dicts= {
    'diam':{'Mercury':4878,'Venus':12104,'Earth':12756},
    'spin':{'Mercury':59,'Venus':243,'Earth':0.997},
    'orbit':{'Mercury':88,'Venus':0.9,'Earth':365.25}
}
planets2=pd.DataFrame(planets_dict_of_dicts)

In [15]:
planets2

| | diam | spin | orbit |
|---|---|---|---|
| Earth | 12756 | 0.997 | 365.25 |
| Mercury | 4878 | 59.0 | 88.0 |
| Venus | 12104 | 243.0 | 0.9 |


The planet names (keys) become the index labels. Here the rows are listed in alphabetical order because the pandas release used for these slides sorted the keys; newer releases generally keep the insertion order of the `dict` instead, so it is best not to rely on either ordering.

## … From a dict of NumPy Arrays
One array per column. Specify index on DataFrame creation.

In [16]:
planets_dict_of_arrays = {
    'diam':np.array([4878,12104,12756]),
    'spin':np.array([59,243,0.997]),
    'orbit':np.array([88,0.9,365.25])
}
planets3=pd.DataFrame(planets_dict_of_arrays, index=['Mercury','Venus','Earth'])

In [17]:
planets3

| | diam | spin | orbit |
|---|---|---|---|
| Mercury | 4878 | 59.0 | 88.0 |
| Venus | 12104 | 243.0 | 0.9 |
| Earth | 12756 | 0.997 | 365.25 |


This time the rows are listed in the order given in the index.

## … From a 2D NumPy Array

In [18]:
planets_2d_array = \
  np.array([[4878,12104,12756],
            [59,243,0.997],
            [88,0.9,365.25]])
planets_2d_array

array([[4.8780e+03, 1.2104e+04, 1.2756e+04],
       [5.9000e+01, 2.4300e+02, 9.9700e-01],
       [8.8000e+01, 9.0000e-01, 3.6525e+02]])

Specify both row indexes and column names ...

In [19]:
planets4=pd.DataFrame(planets_2d_array,
                   columns=['Mercury','Venus','Earth'],
                   index=['diam','spin','orbit'])

In [20]:
planets4

| | Mercury | Venus | Earth |
|---|---|---|---|
| diam | 4878.0 | 12104.0 | 12756.0 |
| spin | 59.0 | 243.0 | 0.997 |
| orbit | 88.0 | 0.9 | 365.25 |


Now the data appears in the same order as given by `index` and `columns`.
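
In this layout each planet is a column. If one row per planet is preferred, the `.T` attribute transposes the table. A minimal sketch on a two-row version of the array (the `features` and `by_planet` names are illustrative):

```python
import numpy as np
import pandas as pd

features = pd.DataFrame(
    np.array([[4878, 12104, 12756],
              [59, 243, 0.997]]),
    columns=['Mercury', 'Venus', 'Earth'],
    index=['diam', 'spin'])

# .T swaps rows and columns, so each planet becomes a row
by_planet = features.T
print(by_planet)
```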

## … from a NumPy `rec.array`
A `rec.array` with `dtype` used to spec column names and data types:

In [21]:
planets_rec_array = \
    np.rec.array([('Mercury',4878,59,88),
                  ('Venus',12104,243,0.9),
                  ('Earth',12756,0.997,365.25)],
                  dtype=[('name','U10'),('diam',float),
                         ('spin',float),('orbit',float)])
planets_rec_array

rec.array([('Mercury',  4878.,  59.   ,  88.  ),
           ('Venus', 12104., 243.   ,   0.9 ),
           ('Earth', 12756.,   0.997, 365.25)],
          dtype=[('name', '<U10'), ('diam', '<f8'), ('spin', '<f8'), ('orbit', '<f8')])

In [22]:
planets5=pd.DataFrame(planets_rec_array)
planets5

| | name | diam | spin | orbit |
|---|---|---|---|---|
| 0 | Mercury | 4878.0 | 59.0 | 88.0 |
| 1 | Venus | 12104.0 | 243.0 | 0.9 |
| 2 | Earth | 12756.0 | 0.997 | 365.25 |


Note that `name` is a regular column, not the index. 

To make `name` the index you have to specify that on creation 

In [23]:
planets6 = pd.DataFrame(planets_rec_array,index=planets_rec_array.name)
planets6

| | name | diam | spin | orbit |
|---|---|---|---|---|
| Mercury | Mercury | 4878.0 | 59.0 | 88.0 |
| Venus | Venus | 12104.0 | 243.0 | 0.9 |
| Earth | Earth | 12756.0 | 0.997 | 365.25 |


... and then delete the extra column

In [24]:
del planets6['name']
planets6

| | diam | spin | orbit |
|---|---|---|---|
| Mercury | 4878.0 | 59.0 | 88.0 |
| Venus | 12104.0 | 243.0 | 0.9 |
| Earth | 12756.0 | 0.997 | 365.25 |
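
The two steps above can also be done in one call with `set_index()`, which moves a column into the index. A minimal sketch on a smaller, illustrative `rec.array` (the `rec` and `planets` names are not from the slides):

```python
import numpy as np
import pandas as pd

rec = np.rec.array([('Mercury', 4878, 59),
                    ('Venus', 12104, 243)],
                   dtype=[('name', 'U10'), ('diam', float), ('spin', float)])

# set_index moves 'name' out of the columns and into the index
planets = pd.DataFrame(rec).set_index('name')
print(planets)
```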


# Missing Data
Note that the 'pop' dictionary below only has data for Earth.

In [25]:
planets_dict_of_dicts_with_missing_data= {
    'diam':{'Mercury':4878,'Venus':12104,'Earth':12756},
    'spin':{'Mercury':59,'Venus':243,'Earth':0.997},
    'orbit':{'Mercury':88,'Venus':0.9,'Earth':365.25},
    'pop':{'Earth':7500000000}
}
planets7=pd.DataFrame(planets_dict_of_dicts_with_missing_data)
planets7

| | diam | spin | orbit | pop |
|---|---|---|---|---|
| Earth | 12756 | 0.997 | 365.25 | 7500000000.0 |
| Mercury | 4878 | 59.0 | 88.0 | NaN |
| Venus | 12104 | 243.0 | 0.9 | NaN |
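
Once the gaps are in the table, a few methods help find or handle them. The sketch below uses a two-column version of the data and shows `isna()`, `dropna()`, and `fillna()`. Filling with `0` is just an illustrative choice, not a recommendation.

```python
import pandas as pd

planets = pd.DataFrame({
    'diam': {'Mercury': 4878, 'Venus': 12104, 'Earth': 12756},
    'pop': {'Earth': 7500000000},
})

print(planets['pop'].isna())        # True where no value was given
print(planets.isna().sum())         # missing count per column
print(planets.dropna())             # only rows with no missing values
print(planets.fillna({'pop': 0}))   # fill per column; returns a new frame
```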


# Adding and Deleting Columns
Since a DataFrame acts like a `dict` of columns (Series), it's not surprising that **adding and deleting columns works just like a dictionary.**

In [26]:
# add the 'pop' column to planets2
planets2['pop'] = pd.Series({'Earth':7500000000},index=['Mercury','Venus','Earth'])
planets2

| | diam | spin | orbit | pop |
|---|---|---|---|---|
| Earth | 12756 | 0.997 | 365.25 | 7500000000.0 |
| Mercury | 4878 | 59.0 | 88.0 | NaN |
| Venus | 12104 | 243.0 | 0.9 | NaN |


In [27]:
# delete the 'pop' column from planets2
del planets2['pop']
planets2

| | diam | spin | orbit |
|---|---|---|---|
| Earth | 12756 | 0.997 | 365.25 |
| Mercury | 4878 | 59.0 | 88.0 |
| Venus | 12104 | 243.0 | 0.9 |
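
Besides the `dict`-style syntax, DataFrames also offer `assign()`, `drop()`, and `pop()`. A minimal sketch, where the `radius` column is a made-up derived value:

```python
import pandas as pd

planets = pd.DataFrame(
    {'diam': [4878, 12104, 12756], 'spin': [59, 243, 0.997]},
    index=['Mercury', 'Venus', 'Earth'])

# assign() and drop() return new DataFrames and leave the original alone
with_radius = planets.assign(radius=planets['diam'] / 2)
without_spin = with_radius.drop(columns='spin')

# pop() removes a column in place and returns it as a Series
radius = with_radius.pop('radius')

print(list(planets.columns))
print(list(without_spin.columns))
print(radius)
```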


# Input / Output
## HTML, CSV, Excel, SQL, JSON, Google BigQuery, etc.

![IO Tools](img/L6_IO_Tools.png)  
Ref: http://pandas.pydata.org/pandas-docs/stable/io.html#io-tools-text-csv-hdf5

# HTML Files
To write an HTML table to a file (or a string or a stream), just use the DataFrame's `to_html()` method:
```python
planets6.to_html("planets.html")
```
It really could not be simpler.  

Reading HTML files is also supported via the `read_html()` function. However, unless the HTML is very small and well-structured as a table, you will likely have more luck with a dedicated HTML parsing library like [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/).
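
When no file name is given, `to_html()` returns the markup as a string instead, which is handy for embedding a table in a larger page. A minimal sketch with a small illustrative DataFrame:

```python
import pandas as pd

planets = pd.DataFrame(
    {'diam': [4878, 12104], 'spin': [59, 243]},
    index=['Mercury', 'Venus'])

# With no file name, to_html() returns the markup as a string
html = planets.to_html()
print(html[:60])
```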

# CSV Files
```python
# write to a CSV file called "planets.csv"
planets6.to_csv("planets.csv")

# reading is also very straightforward
planets8 = pd.read_csv("planets.csv", index_col=0)

# To read a CSV file over the web just use a URL
planets9 = pd.read_csv("https://planets.org/data.csv")
```
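
To try a CSV round trip without touching the disk or the network, the text can be kept in memory. This sketch assumes a small illustrative DataFrame and uses `io.StringIO` as a stand-in for a file.

```python
import io
import pandas as pd

planets = pd.DataFrame(
    {'diam': [4878, 12104], 'spin': [59.0, 243.0]},
    index=['Mercury', 'Venus'])

# With no path, to_csv() returns the CSV text as a string
csv_text = planets.to_csv()
print(csv_text)

# read_csv accepts any file-like object; index_col=0 restores the row labels
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored)
```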

# And so forth ...
By providing a consistent interface for the I/O functions and methods, pandas makes it pretty easy to guess how to deal with new formats. 

There may be some optional arguments that vary according to format, but usually the defaults do pretty much what you expect them to do. 
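
JSON follows the same `to_*` / `read_*` pattern. A minimal sketch, again in memory and with illustrative data; `orient='index'` is one of several layouts and is chosen here so that the row labels become the outer JSON keys.

```python
import io
import pandas as pd

planets = pd.DataFrame(
    {'diam': [4878, 12104], 'spin': [59.0, 243.0]},
    index=['Mercury', 'Venus'])

# Same pattern as CSV: to_json() to write, read_json() to read
json_text = planets.to_json(orient='index')
restored = pd.read_json(io.StringIO(json_text), orient='index')
print(json_text)
print(restored)
```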

As always, [RTFM](http://pandas.pydata.org/pandas-docs/stable/io.html#io-tools-text-csv-hdf5) if you need something special. Heck, read the docs anyway. It's good for you. 

# Classwork (Start here in class)
- If time permits, start in on your homework 

# Homework (Do at home)
The following is due before class next week:
- Any remaining classwork from tonight

Please email chuntley@fairfield.edu if you have any problems or questions.