# Reference

- Learning the Pandas Library
- Python Tools for Data Munging, Data Analysis, and Visualization

# Pandas Introduction

- pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language
- pandas behaves much like an in-memory NoSQL database: it offers SQL-like constructs, basic statistical and analytic support, and graphing capability. Because it is built on NumPy, with performance-critical parts written in Cython, it typically uses less memory and runs faster than equivalent pure Python code.
- Many people are using pandas to replace Excel, perform ETL, process tabular data, load CSV or JSON files, and more.

# Data Structures
- One of the keys to understanding pandas is to understand the data model. At the core of pandas are three data structures:

| Data Structure | Dimensionality | Spreadsheet Analog |
|:-----------|:------------|:------------|
| Series | 1D | Column |
| DataFrame | 2D | Single Sheet |
| Panel | 3D | Multiple Sheets |

- The most widely used data structures are the Series and the DataFrame, which deal with array data and tabular data respectively. The Panel has been removed from recent pandas releases.
- An analogy with the spreadsheet world illustrates the basic differences between these types.
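
The following is a minimal sketch of the two common structures, meant only to make the dimensionality column of the table concrete. The column name `artist` and its values are illustrative assumptions, not part of the original notes.

```python
import pandas as pd

# A Series is one-dimensional: a single labeled column of values.
counts = pd.Series([145, 142, 38, 13], name='counts')

# A DataFrame is two-dimensional: one sheet made up of named columns.
# The 'artist' column is an illustrative assumption.
sheet = pd.DataFrame({
    'artist': ['Paul', 'John', 'George', 'Ringo'],
    'counts': [145, 142, 38, 13],
})

print(counts.ndim, counts.shape)   # 1 (4,)
print(sheet.ndim, sheet.shape)     # 2 (4, 2)

# Selecting a single column of a DataFrame gives back a Series.
column = sheet['counts']
print(type(column).__name__)       # Series
```

Selecting one column from the sheet returns a Series, which matches the "Column" versus "Single Sheet" analogy.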

# Series
- A Series is used to model one-dimensional data, similar to a list in Python.
The Series object also carries a few more pieces of data, including an index and a name.
A common idea throughout pandas is the notion of an axis.
Below is a table of counts of songs that artists composed:

|Artist | Data |
|:-------|:-------|
|0|145|
|1|142|
|2|38|
|3|13|

- To represent this data in pure Python, you could use a data structure similar to the one that follows. It is a dictionary that stores the list of data points under the 'data' key. Alongside that entry, there is an explicit entry for the corresponding index values, as well as an entry for the name of the data.

In [3]:
ser = {'index': [0, 1, 2, 3],
       'data': [145, 142, 38, 13],
       'name': 'songs'}
ser

{'data': [145, 142, 38, 13], 'index': [0, 1, 2, 3], 'name': 'songs'}

- pandas has a trick up its sleeve. Because the index is not limited to integer values, the data structure also supports other index types such as strings and dates, as well as arbitrarily ordered indices and even duplicate index values.
- Below is an example that has string values for the index:

In [4]:
songs = {
    'index': ['Paul', 'John', 'George', 'Ringo'],
    'data': [145, 142, 38, 13],
    'name': 'counts'
}

In [18]:
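def get(ser, idx):
    # Find the position of the label in the index,
    # then return the data value stored at that position.
    value_idx = ser['index'].index(idx)
    return ser['data'][value_idx]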

In [26]:
get(ser, 1), get(songs, 'John')

(142, 142)
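
One limit of this pure Python model shows up with duplicate index values: `list.index` returns only the first match. The sketch below illustrates that. The extra `'John'` entry with the value 7 is made up for the demonstration, and `get_all` is a hypothetical helper that does not appear in the original notes.

```python
# Dictionary model with a duplicated label.
# The trailing 'John' / 7 pair is made-up demo data.
songs = {
    'index': ['Paul', 'John', 'George', 'Ringo', 'John'],
    'data': [145, 142, 38, 13, 7],
    'name': 'counts',
}


def get(ser, idx):
    # list.index stops at the first matching label.
    value_idx = ser['index'].index(idx)
    return ser['data'][value_idx]


def get_all(ser, idx):
    """Hypothetical helper: return every value whose label equals idx."""
    return [value
            for label, value in zip(ser['index'], ser['data'])
            if label == idx]


print(get(songs, 'John'))       # 142
print(get_all(songs, 'John'))   # [142, 7]
```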

## The Pandas Series
- With that background in mind, let's look at how to create a Series in pandas.
- It is easy to create a Series object from a list:

In [28]:
import pandas as pd
songs = pd.Series([145, 142, 38, 13],
                  name='counts')
songs

0    145
1    142
2     38
3     13
Name: counts, dtype: int64

- When the interpreter prints our series, pandas makes a best effort to format it for the current terminal size.
- The leftmost column is the index column, which contains the entries for the index.
- The generic name for an index is an axis, and the values of the index (0, 1, 2, 3, ...) are called axis labels.
- The two-dimensional structure in pandas, the DataFrame, has two axes: one for the rows and another for the columns.
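
As a rough illustration of the axis terminology, the sketch below inspects the index of a Series and the two axes of a DataFrame. Building the DataFrame with `to_frame()` is just a convenient assumption for the demo.

```python
import pandas as pd

songs = pd.Series([145, 142, 38, 13], name='counts')

# With no explicit index, pandas supplies integer axis labels 0..n-1.
print(songs.index)         # RangeIndex(start=0, stop=4, step=1)
print(list(songs.index))   # [0, 1, 2, 3]

# A Series has a single axis.
print(len(songs.axes))     # 1

# A DataFrame has two axes: the row index and the column labels.
df = songs.to_frame()      # the Series name becomes the column name
print(len(df.axes))        # 2
print(list(df.index))      # [0, 1, 2, 3]
print(list(df.columns))    # ['counts']
```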

In [33]:
songs = pd.Series([145, 142, 38, 13],
                  index=['Paul', 'John', 'George', 'Ringo'],
                  name='counts')
songs

Paul      145
John      142
George     38
Ringo      13
Name: counts, dtype: int64
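
With a string index in place, values can be looked up by label much as the earlier `get` function did. The sketch below shows label-based and position-based access. The second Series, `dup`, uses made-up values purely to show that duplicate labels are accepted.

```python
import pandas as pd

songs = pd.Series([145, 142, 38, 13],
                  index=['Paul', 'John', 'George', 'Ringo'],
                  name='counts')

# Label-based lookup uses the axis labels.
print(songs['John'])       # 142
print(songs.loc['John'])   # 142

# Position-based lookup ignores the labels.
print(songs.iloc[1])       # 142

# Membership tests check the index, not the values.
print('John' in songs)     # True

# Duplicate labels are allowed; a lookup then returns every match.
# The values here are made up for the demonstration.
dup = pd.Series([142, 7], index=['John', 'John'], name='counts')
matches = dup.loc['John']
print(list(matches))       # [142, 7]
```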

- The actual data for a series does not have to be numeric or homogeneous. We can insert arbitrary Python objects into a series.
- In the example below, the dtype (data type) of the Series is object, meaning a Python object. This can be good or bad.
- The object data type is used for strings, but it is also used for values that have heterogeneous types. If you have numeric data, you would not want it to be stored as Python objects, but rather as int64 or float64, which allow you to do vectorized numeric operations.

In [35]:
class Foo:
    pass

ringo=pd.Series(
    ['Richard','Starkey',13,Foo()],
    name='ringo')
ringo

0                                 Richard
1                                 Starkey
2                                      13
3    <__main__.Foo object at 0x10e3a1b00>
Name: ringo, dtype: object
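
To see why the dtype matters, here is a small sketch contrasting a numeric Series with an object Series that accidentally mixes integers and a string. The mixed data is an assumption made for the demo. `pd.to_numeric` is one way to recover a numeric dtype, not necessarily the only or best one for a given data set.

```python
import pandas as pd

# Homogeneous integers get a numeric dtype and vectorized arithmetic.
nums = pd.Series([145, 142, 38, 13], name='counts')
print(nums.dtype)               # int64
print((nums * 2).tolist())      # [290, 284, 76, 26]

# One stray string forces the whole Series to dtype object.
mixed = pd.Series([145, '142', 38, 13], name='counts')
print(mixed.dtype)              # object

# Arithmetic now falls back to per-element Python semantics,
# so the string is repeated rather than doubled.
print((mixed * 2).tolist())     # [290, '142142', 76, 26]

# Converting back to a numeric dtype restores numeric behavior.
cleaned = pd.to_numeric(mixed)
print(cleaned.sum())            # 338
```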