The Python package `Pandas` is used for manipulating and transforming data. In what follows we examine some of its useful attributes.

In [20]:
# The following allows multiple outputs in a single output cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## But First, Need a Python Quick Review?

If you are less familiar with Python, you might want to review the materials in the [DeCART Boot Camp, Part 2: Introduction to Python](https://github.com/UUDeCART/decart_bootcamp_part2), in particular our [Python crash course](https://github.com/UUDeCART/decart_bootcamp_part2/blob/master/modules/module1/python_crash_course.ipynb).

## Here's [Pandas](http://pandas.pydata.org/)

Pandas is a Python package for working with tabular data that was developed in the finance community. Pandas will be our main framework for working with data, alongside standard Python packages for predictive analytics and machine learning such as [scikit-learn](http://scikit-learn.org/stable), [statsmodels](http://www.statsmodels.org/stable/index.html), [scipy](https://www.scipy.org), and [seaborn](https://seaborn.pydata.org/), all of which work natively with Pandas DataFrames and Series.

A rather elegant Pandas cheat sheet can be found here: [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)

# Let's Get Pandas

Here's the conventional way of importing Pandas into a Python session. Pandas relies on numpy in many ways, so we import it at the same time.

In [21]:
import pandas as pd
import numpy as np

# Data in Pandas

Pandas can read in, store, reorganize, and serialize (write out) data. Its main data "objects" are the Series and the DataFrame. Older releases also had a third object, the Panel, which was removed in pandas 0.25.
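As a rough illustration of the read, reorganize, and serialize cycle, here is a minimal sketch that writes a small table to CSV text and reads it back. The table contents and the names `table`, `csv_text`, and `restored` are made up for this example, and CSV is only one of several formats Pandas supports.

```python
import io

import pandas as pd

# A small, hypothetical table (DataFrames are introduced further down)
table = pd.DataFrame({"id": [3, 1, 2], "score": [1.0, 0.5, 0.75]})

# Reorganize: sort the rows by "id" and renumber the index
table = table.sort_values("id").reset_index(drop=True)

# Serialize: with no path argument, to_csv returns the CSV text as a string
csv_text = table.to_csv(index=False)

# Input: read the same text back into a new DataFrame
restored = pd.read_csv(io.StringIO(csv_text))

print(restored.equals(table))
```

Passing a file name instead of `io.StringIO(...)` reads from or writes to disk in the same way.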

A Series is essentially a vector with metadata. Each Series holds data of a single type, such as integer, float, or character (string). Pandas calls the latter the "object" type. Here's a Series:

In [22]:
mySeries=pd.Series([1,2,3,4,6])
mySeries
len(mySeries)
mySeries.index  # A Series has an index, which defaults to a sequence of ints
type(mySeries.values)  # The values are stored in a numpy array

0    1
1    2
2    3
3    4
4    6
dtype: int64

5

RangeIndex(start=0, stop=5, step=1)

numpy.ndarray
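The index does not have to be the default sequence of ints. The following sketch, using made-up labels and values, builds a Series with a string index and looks up one element by its label. The names `labeled` and `words` are hypothetical.

```python
import pandas as pd

# A Series with an explicit index of labels instead of the default RangeIndex
labeled = pd.Series([1.5, 2.5, 4.0], index=["x", "y", "z"])

print(labeled.dtype)   # float64
print(labeled["y"])    # 2.5

# A Series of strings; many pandas versions report its dtype as "object",
# while newer ones may report a dedicated string dtype instead
words = pd.Series(["time", "flies"])
print(words.dtype)
```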

A DataFrame is like a two-dimensional array with metadata. Like a Series, it has an index. It can have column names, and the columns can be of different data types. Here's an example:

In [23]:
myDataFrame=pd.DataFrame(
    {'nums':[10,30,40],
     'wordies':['Time flies like an arrow',
                'Fruit flies like a banana',
               'I\'m Groucho!']})

In [24]:
type(myDataFrame)
myDataFrame.shape
myDataFrame.columns
myDataFrame.index
type(myDataFrame.values)  #A np array is hiding inside!

pandas.core.frame.DataFrame

(3, 2)

Index(['nums', 'wordies'], dtype='object')

RangeIndex(start=0, stop=3, step=1)

numpy.ndarray
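To see that each column carries its own data type, and that a single column is itself a Series, consider this small sketch. The frame and its column names (`nums`, `ratio`) are invented for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"nums": [10, 30, 40], "ratio": [0.5, 1.5, 2.5]})

# dtypes reports one data type per column
print(frame.dtypes)

# Selecting one column by name returns a Series
col = frame["nums"]
print(type(col))

# The column shares the row index of the DataFrame it came from
print(col.index.equals(frame.index))
```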

### Let's Construct a Pandas DataFrame

In [25]:
myDF=pd.DataFrame({'a':[4,5,6,6],'b':[7,8,12.8,9],'c':[10,11,12,10]},
  index=[1,2,3,4],
  columns=['a','b','c'])
print(myDF)

   a     b   c
1  4   7.0  10
2  5   8.0  11
3  6  12.8  12
4  6   9.0  10


### If we let the Notebook evaluate a Pandas DataFrame (e.g. `myDF`), it will provide a nice HTML table

In [26]:
myDF

|   | a | b    | c  |
|---|---|------|----|
| 1 | 4 | 7.0  | 10 |
| 2 | 5 | 8.0  | 11 |
| 3 | 6 | 12.8 | 12 |
| 4 | 6 | 9.0  | 10 |


### Pandas DataFrame Object
 
Python is primarily an [object oriented programming language](https://en.wikipedia.org/wiki/Object-oriented_programming).

In an object oriented programming language, objects have attributes and **methods**. Methods are special functions that operate on the attributes of the object.

In our example object, `myDF`, the attributes are the data values (contained in the columns `a`, `b`, and `c` and rows `1`, `2`, `3`, and `4`).

If we want to learn what methods `myDF` has, we can use the `help` function.

In [27]:
help(myDF)

Help on DataFrame in module pandas.core.frame object:

class DataFrame(pandas.core.generic.NDFrame)
 |  Two-dimensional size-mutable, potentially heterogeneous tabular data
 |  structure with labeled axes (rows and columns). Arithmetic operations
 |  align on both row and column labels. Can be thought of as a dict-like
 |  container for Series objects. The primary pandas data structure.
 |  
 |  Parameters
 |  ----------
 |  data : numpy ndarray (structured or homogeneous), dict, or DataFrame
 |      Dict can contain Series, arrays, constants, or list-like objects
 |  
 |      .. versionchanged :: 0.23.0
 |         If data is a dict, argument order is maintained for Python 3.6
 |         and later.
 |  
 |  index : Index or array-like
 |      Index to use for resulting frame. Will default to RangeIndex if
 |      no indexing information part of input data and no index provided
 |  columns : Index or array-like
 |      Column labels to use for resulting frame. Will default to
 |      Ra
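The `help` output is long, so it can be handy to separate attributes from methods programmatically. The sketch below is one possible approach, not the only one: it reads an attribute, calls a method, and uses the standard-library `inspect` module to collect the names of the public methods defined on the `DataFrame` class. The names `df` and `public_methods` are made up for this example.

```python
import inspect

import pandas as pd

df = pd.DataFrame({"a": [4, 5, 6, 6], "b": [7, 8, 12.8, 9]})

# An attribute is read without parentheses
print(df.shape)   # (4, 2)

# A method is called with parentheses
print(df.sum())

# Collect the names of public functions defined on the DataFrame class
public_methods = [
    name
    for name, member in inspect.getmembers(pd.DataFrame, inspect.isfunction)
    if not name.startswith("_")
]
print(len(public_methods))
print(public_methods[:10])
```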

### Example DataFrame and Series methods

Here are some useful methods for summarizing data in Pandas:

* [`describe()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html)
    * A method for either DataFrames or Series
* [`value_counts()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)
    * A method for Series
* [`groupby()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html)
    * A method for either DataFrames or Series

In [28]:
myDF.describe()

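|       | a        | b        | c        |
|-------|----------|----------|----------|
| count | 4.0      | 4.0      | 4.0      |
| mean  | 5.25     | 9.2      | 10.75    |
| std   | 0.957427 | 2.535087 | 0.957427 |
| min   | 4.0      | 7.0      | 10.0     |
| 25%   | 4.75     | 7.75     | 10.0     |
| 50%   | 5.5      | 8.5      | 10.5     |
| 75%   | 6.0      | 9.95     | 11.25    |
| max   | 6.0      | 12.8     | 12.0     |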


In [29]:
myDF['a'].value_counts()

6    2
5    1
4    1
Name: a, dtype: int64
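The third method in the list, `groupby`, is not demonstrated above. Here is a minimal sketch that rebuilds the same example data under the hypothetical name `df`, groups the rows by the value in column `a`, and averages the remaining columns within each group. Taking the mean is just one of many possible aggregations.

```python
import pandas as pd

# Same data as the example DataFrame above
df = pd.DataFrame(
    {"a": [4, 5, 6, 6], "b": [7, 8, 12.8, 9], "c": [10, 11, 12, 10]},
    index=[1, 2, 3, 4],
)

# Rows with the same value of "a" form one group;
# mean() then averages columns "b" and "c" within each group
grouped_means = df.groupby("a").mean()
print(grouped_means)
```

The two rows with `a == 6` collapse into a single row of the result, so the output has three rows, indexed by the distinct values of `a`.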

In [40]:
# **MINI-EXERCISE**  (A stupid Pandas trick.)

#Create two Python dicts:

aDict={'a':[1,2,3,4],'b':[10,11,12,13]}

bDict={'a':[50,60,70,80],'c':[100,101,102,103]}

# Create a DataFrame from each of these dicts,
# like e.g. aDF=pd.DataFrame(aDict)

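# Then, append the two DataFrames, like pd.concat([aDF, bDF], ignore_index=True)
# (DataFrame.append was removed in pandas 2.0; on older versions,
#  aDF.append(bDF, ignore_index=True) does the same thing)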

#  What does the result look like?


## There's a Lot More To Learn About Pandas

There's a lot more to Pandas.  To get a relatively complete overview in the shortest time possible see:

<div style="text-align: center">
[10 Minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/10min.html)
</div>

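Forgot to mention this: a Pandas `Panel` is like a three-dimensional array with metadata. Think of a DataFrame with one more dimension. Note that `Panel` was deprecated and then removed in pandas 0.25, so it exists only in older releases; a DataFrame with a MultiIndex is a common substitute. (There's always something...)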