# The pandas library

The [pandas library](https://pandas.pydata.org/) was started by [Wes McKinney](http://wesmckinney.com/) in 2008. pandas provides **data structures** and **functions** for manipulating, processing, cleaning and crunching data. In the Python ecosystem, pandas is the state-of-the-art tool for working with tabular or spreadsheet-like data in which each column may be a different type (`string`, `numeric`, `date`, or otherwise). pandas provides sophisticated indexing functionality that makes it easy to reshape, slice and dice, aggregate, and select subsets of data. pandas builds on [NumPy](http://www.numpy.org/) and can optionally make use of [SciPy](https://scipy.org/scipylib/index.html). Furthermore, pandas integrates [matplotlib](https://matplotlib.org/) for plotting.

If you are new to pandas, we strongly recommend visiting the very well written [__pandas tutorials__](https://pandas.pydata.org/pandas-docs/stable/tutorials.html), which cover everything new users need to get started.


Once installed (for details refer to the [documentation](https://pandas.pydata.org/pandas-docs/stable/install.html)), pandas is imported by using the canonical alias `pd`.

In [None]:
import pandas as pd

In [None]:
import numpy as np

The pandas library has two workhorse data structures: __*Series*__ and __*DataFrame*__.

* one dimensional `pd.Series` object
* two dimensional `pd.DataFrame` object

***

## The `pd.Series` object

Data generation

In [None]:
# import the random module from numpy
from numpy import random 
# set seed for reproducibility
random.seed(123) 
# generate 26 random integers from -10 up to (but not including) 10
my_data = random.randint(low=-10, high=10, size=26)
# print the data
my_data

In [None]:
type(my_data)

A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its _index_. We create a `pd.Series` object by calling the `pd.Series()` constructor.

In [None]:
# Uncomment to look up the documentation

# docstring
#?pd.Series      

# source
#??pd.Series    

In [None]:
# create a pd.Series object
s = pd.Series(data=my_data)
s

In [None]:
type(s)
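
To make the "data plus index" structure described above concrete, here is a minimal sketch. The name `toy` and its three values are made up for illustration and are not part of the data set used in this notebook.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, unrelated to my_data above
toy = pd.Series(data=np.array([3, -1, 4]))

# the two building blocks of a Series
toy_values = toy.to_numpy()  # the underlying NumPy array
toy_index = toy.index        # default integer index: 0, 1, 2

print(toy_values)
print(list(toy_index))
```

If no index is passed, pandas falls back to a default integer index starting at 0.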

***

### `pd.Series` attributes

Python objects in general and the `pd.Series` in particular offer useful object-specific *attributes*.

* _attribute_ $\to$ `OBJECT.attribute` $\qquad$     _Note that the attribute is accessed without parentheses_

In [None]:
s.dtypes

In [None]:
s.index

We can use the `index` attribute to assign an index to a `pd.Series` object.

Consider the letters of the alphabet....

In [None]:
import string
letters = string.ascii_uppercase
letters

By providing an array-type object we assign a new index to the `pd.Series` object.

In [None]:
s.index = list(letters)

In [None]:
s.index

In [None]:
s
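
As an alternative to assigning the index afterwards, the labels can be passed directly to the constructor. The sketch below uses made-up names (`temps`, weekday labels) and also shows that a new index must have the same length as the data.

```python
import pandas as pd

# hypothetical example data: index passed at construction time
temps = pd.Series(data=[21.5, 19.0, 23.2], index=["Mon", "Tue", "Wed"])

# replacing the index afterwards, as done with the letters above
temps_relabelled = temps.copy()
temps_relabelled.index = ["a", "b", "c"]

# an index of the wrong length is rejected
try:
    temps_relabelled.index = ["a", "b"]
    length_error = False
except ValueError:
    length_error = True

print(temps)
print(temps_relabelled)
print(length_error)
```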

***
### `pd.Series` methods

Methods are functions attached to an object. They are called with the same dot notation as attributes: append a dot (`.`) to the Python object, followed by the name of the method and parentheses `()`, which enclose any arguments (`arg`).

* _method_  $\to$ `OBJECT.method_name(arg1, arg2, ...)`

In [None]:
s.sum()

In [None]:
s.mean()

In [None]:
s.max()

In [None]:
s.min()

In [None]:
s.median()

In [None]:
s.quantile(q=0.5)

In [None]:
s.quantile(q=[0.25, 0.5, 0.75])
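
The summary methods above can also be requested in one go with `describe()`. The following sketch uses a small hand-made Series (`obs`, an illustrative name) so that the numbers are easy to check by hand.

```python
import pandas as pd

# hypothetical observations, chosen so the quartiles are easy to verify
obs = pd.Series([1, 2, 3, 4, 10])

# the median is the 0.5 quantile
obs_median = obs.median()
obs_q50 = obs.quantile(q=0.5)

# passing a list of quantiles returns a Series indexed by the quantiles
obs_quartiles = obs.quantile(q=[0.25, 0.5, 0.75])

# count, mean, std, min, quartiles and max in a single call
obs_summary = obs.describe()

print(obs_median, obs_q50)
print(obs_quartiles)
print(obs_summary)
```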

***
### Element-wise arithmetic


A very useful feature of `pd.Series` objects is that we may apply arithmetic operations *element-wise*.

In [None]:
s+10
#s*0.1
#10/s
#s**2
#(2+s)*1**3
#s+s
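
A short sketch of element-wise arithmetic on two made-up Series (`a` and `b` are illustrative names). It also shows that, when two Series are combined, pandas matches the elements by index label rather than by position.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])

# scalar operations are applied to every element
a_doubled = a * 2

# Series-with-Series operations are aligned on the index labels:
# "x" is 1 + 30, "y" is 2 + 20, "z" is 3 + 10
a_plus_b = a + b

print(a_doubled)
print(a_plus_b)
```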

***
### Selection and Indexing

Another main data operation is indexing and selecting particular subsets of the data object. pandas comes with a very [rich set of methods](https://pandas.pydata.org/pandas-docs/stable/indexing.html) for these types of tasks.  

In its simplest form we index a Series NumPy-like, using the `[]` operator to select particular entries by position or by index label.

In [None]:
s

In [None]:
# select by position; plain s[3] on a label-based index is deprecated in pandas 2.x
s.iloc[3]

In [None]:
s[2:6]

In [None]:
s["C"]

In [None]:
s["C":"K"]
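
One detail of the slices above is easy to miss: positional slices exclude the end point, whereas label slices include it. A minimal sketch on a made-up Series (`abc` is an illustrative name):

```python
import pandas as pd

abc = pd.Series([10, 20, 30, 40, 50], index=list("ABCDE"))

# positional slice: the end position 3 is excluded
by_position = abc.iloc[1:3]

# label slice: the end label "D" is included
by_label = abc.loc["B":"D"]

print(by_position)
print(by_label)
```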

***

## The `pd.DataFrame` object

The primary pandas data structure is the `DataFrame`. It is a two-dimensional size-mutable, potentially heterogeneous tabular data structure with both row and column labels. Arithmetic operations align on both row and column labels. Basically, the `DataFrame` can be thought of as a `dictionary`-like container for Series objects. 
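
The "dictionary of Series" view can be taken literally. The sketch below builds a small `DataFrame` from a dictionary of Series; the names `people`, `height` and `age` and all values are made up for illustration.

```python
import pandas as pd

# each dictionary key becomes a column, each Series supplies the column values
people = pd.DataFrame({
    "height": pd.Series([1.70, 1.80], index=["ann", "bob"]),
    "age": pd.Series([30, 25], index=["ann", "bob"]),
})

# selecting a column returns the Series again
age_column = people["age"]

print(people)
print(type(age_column))
print(people.dtypes)
```

Each column keeps its own `dtype`, which is what makes the structure heterogeneous.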




**Generate a `DataFrame` object from scratch** 

pandas facilitates the import of different data types and sources. However, for the sake of this tutorial we generate a `DataFrame` object from scratch.

Source: http://duelingdata.blogspot.de/2016/01/the-beatles.html

In [None]:
df  = pd.DataFrame({"id" : range(1,5),
                    "Name" : ["John", "Paul", "George", "Ringo"],
                    "Last Name" : ["Lennon", "McCartney", "Harrison", "Starr"],
                    "dead" : [True, False, True, False],
                    "year_born" : [1940, 1942, 1943, 1940],
                    "no_of_songs" : [62, 58, 24, 3]
                   })
df

***
### `pd.DataFrame` attributes

In [None]:
df.dtypes

In [None]:
# column labels (axis 1)
df.columns

In [None]:
# row labels (axis 0)
df.index
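
The axis numbering is used throughout pandas, so a small sketch may help. It uses a made-up frame (`small` is an illustrative name): axis 0 runs along the row labels, axis 1 along the column labels.

```python
import pandas as pd

small = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# shape is (number of rows, number of columns)
n_rows, n_cols = small.shape

# reducing along axis 0 collapses the rows: one result per column
col_sums = small.sum(axis=0)

# reducing along axis 1 collapses the columns: one result per row
row_sums = small.sum(axis=1)

print(n_rows, n_cols)
print(col_sums)
print(row_sums)
```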

***
### `pd.DataFrame` methods


**Get a quick overview of the data set**

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.describe(include="all")

**Change index to the variable `id`**

In [None]:
df

In [None]:
df.set_index("id")

In [None]:
df

Note that nothing changed!

Like most pandas methods, `set_index` returns a new object and leaves the original `DataFrame` untouched. Hence, if we want to keep the change we have to assign/reassign the result to a variable:

    df = df.set_index("id") 
    
Alternatively, some methods accept an `inplace=True` argument:

    df.set_index("id", inplace=True)   

In [None]:
df = df.set_index("id")

In [None]:
df
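
The behaviour can be checked on a throwaway example. In this sketch `frame` and `indexed` are made-up names; the original frame keeps its columns, and only the returned object carries the new index.

```python
import pandas as pd

frame = pd.DataFrame({"key": [1, 2], "val": ["a", "b"]})

# set_index returns a new DataFrame ...
indexed = frame.set_index("key")

# ... while the original still has "key" as an ordinary column
print(list(frame.columns))
print(list(indexed.columns))
print(indexed.index.name)
```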

**Arithmetic methods**

In [None]:
df

In [None]:
df.sum(axis=0)

In [None]:
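# row sums only make sense for the numeric columns; without numeric_only=True
# recent pandas versions raise a TypeError when adding strings and numbers
df.sum(axis=1, numeric_only=True)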

#### `groupby` method
[Hadley Wickham 2011: The Split-Apply-Combine Strategy for Data Analysis, Journal of Statistical Software, 40(1)](https://www.jstatsoft.org/article/view/v040i01)

<img src="_img/split-apply-combine.svg" width="800">

Image source: [Jake VanderPlas 2016, Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)

In [None]:
df

In [None]:
df.groupby("dead")

In [None]:
df.groupby("dead").sum()

In [None]:
df.groupby("dead")["no_of_songs"].sum()

In [None]:
df.groupby("dead")["no_of_songs"].mean()

In [None]:
df.groupby("dead")["no_of_songs"].agg(["mean", "max", "min", "sum"])
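
To see what `groupby` does behind the scenes, the three steps of the split-apply-combine strategy can be spelled out by hand. This is only an illustrative sketch on a made-up frame (`sales`, `key` and `data` are hypothetical names), not how pandas implements it internally.

```python
import pandas as pd

sales = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                      "data": [1, 2, 3, 4, 5, 6]})

# split: one sub-frame per distinct key
parts = {key: group for key, group in sales.groupby("key")}

# apply: reduce each sub-frame to a single number
sums = {key: group["data"].sum() for key, group in parts.items()}

# combine: collect the results in a new Series
manual = pd.Series(sums, name="data")

# the same result in a single chained call
automatic = sales.groupby("key")["data"].sum()

print(manual)
print(automatic)
```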

#### Family of `apply`/`map` methods

* `apply` passes each column (`axis=0`, default) or each row (`axis=1`) of a `DataFrame` to the function
* `applymap` works __element-wise__ on a `DataFrame` (renamed to `DataFrame.map` in pandas 2.1)
* `map` works __element-wise__ on a `Series`.
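
The difference between the two `axis` settings is easiest to see on a tiny numeric frame. The sketch below uses made-up data (`grid` is an illustrative name).

```python
import pandas as pd

grid = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# axis=0 (default): the function receives one column at a time
col_totals = grid.apply(lambda col: col.sum())

# axis=1: the function receives one row at a time
row_totals = grid.apply(lambda row: row.sum(), axis=1)

# map on a Series: the function receives one element at a time
doubled = grid["a"].map(lambda value: value * 2)

print(col_totals)
print(row_totals)
print(doubled)
```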


In [None]:
df

In [None]:
# (axis=0, default)
df[["Last Name", "Name"]].apply(lambda x: x.sum())

In [None]:
# (axis=1)
df[["Last Name", "Name"]].apply(lambda x: x.sum(), axis=1)

_... maybe a more useful case ..._

In [None]:
df.apply(lambda x: " ".join(x[["Name", "Last Name"]]), axis=1)

***
### Selection and Indexing

**Column index**

In [None]:
df["Name"]

In [None]:
df[["Name", "Last Name"]]

In [None]:
df.dead

**Row index**

In addition to the `[]` operator pandas ships with other indexing operators such as `.loc[]` and `.iloc[]`, among others.

* `.loc[]` is primarily __label based__, but may also be used with a boolean array.
* `.iloc[]` is primarily __integer position based__ (from 0 to length-1 of the axis), but may also be used with a boolean array. 


In [None]:
df.head(2)

In [None]:
df.loc[1]

In [None]:
df.iloc[1]
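
Because the index of `df` starts at 1, `.loc[1]` and `.iloc[1]` return different rows. The same effect on a small made-up frame (`band` is an illustrative name):

```python
import pandas as pd

band = pd.DataFrame({"name": ["w", "x", "y"]}, index=[1, 2, 3])

# label based: the row whose index label is 1 (the first row)
by_label = band.loc[1, "name"]

# position based: the row at position 1 (the second row)
by_position = band.iloc[1, 0]

print(by_label)
print(by_position)
```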

**Row and column indices**

`df.loc[row, col]`

In [None]:
df.loc[1, "Last Name"]

In [None]:
df.loc[2:4, ["Name", "dead"]]

**Logical indexing**

In [None]:
df

In [None]:
df["no_of_songs"] > 50

In [None]:
df.loc[df["no_of_songs"] > 50]

In [None]:
df.loc[(df["no_of_songs"] > 50) & (df["year_born"] >= 1942)]

In [None]:
df.loc[(df["no_of_songs"] > 50) & (df["year_born"] >= 1942), ["Last Name", "Name"]]
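
Boolean masks are combined with `&` (and), `|` (or) and `~` (not); each comparison needs its own parentheses because these operators bind more tightly than `>` or `>=`. A minimal sketch on a made-up frame (`tbl`, `score` and `year` are illustrative names):

```python
import pandas as pd

tbl = pd.DataFrame({"score": [62, 58, 24, 3],
                    "year": [1940, 1942, 1943, 1940]})

both = (tbl["score"] > 50) & (tbl["year"] >= 1942)    # both conditions hold
either = (tbl["score"] > 50) | (tbl["year"] >= 1942)  # at least one holds
neither = ~either                                     # negation

print(both.tolist())
print(either.tolist())
print(tbl.loc[neither])
```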

***

### Manipulating columns, rows and particular entries

**Add a row to the data set**

In [None]:
from numpy import nan
df.loc[5] = ["Mickey", "Mouse", nan, 1928, nan]
df

In [None]:
df.dtypes

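_Note that the variable `dead` is no longer of `dtype` `bool`. A `bool` column cannot hold missing values, so pandas upcasts it: recent versions store `True`/`False`/`NaN` with `dtype` `object`, whereas older versions converted the values to `1.0`/`0.0` with `dtype` `float64`. For the same reason `no_of_songs` changes from `int64` to `float64`._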
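
The upcasting caused by missing values can be reproduced in isolation. The sketch below uses a made-up Series (`counts` is an illustrative name) and also shows the nullable `Int64` dtype as one possible way to keep integers next to missing values.

```python
import pandas as pd

counts = pd.Series([62, 58, 24], index=[1, 2, 3])
dtype_before = str(counts.dtype)

# asking for a label that does not exist introduces NaN,
# which forces the integer column to become float64
with_missing = counts.reindex([1, 2, 3, 4])
dtype_after = str(with_missing.dtype)

# the nullable integer dtype keeps integers and marks the gap as <NA>
nullable = pd.Series([62, 58, None], dtype="Int64")

print(dtype_before, dtype_after)
print(nullable)
```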

**Add a column to the data set**

In [None]:
# pd.datetime was removed from pandas; pd.Timestamp provides the same information
pd.Timestamp.today()

In [None]:
now = pd.Timestamp.today().year
now

In [None]:
df["age"] = now - df.year_born
df

**Change a particular entry**

In [None]:
df.loc[5, "Name"] = "Minnie" 

In [None]:
df

***
## Plotting

The plotting functionality in pandas is built on top of matplotlib. It is quite convenient to start the visualization process with basic pandas plotting and to switch to matplotlib to customize the pandas visualization.

### `plot` method

In [None]:
# this call causes the figures to be plotted below the code cells
%matplotlib inline

In [None]:
df

In [None]:
df[["no_of_songs", "age"]].plot()

In [None]:
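# convert to float first: after adding the row with NaN, "dead" may no longer be numeric
df["dead"].astype(float).plot.hist()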

In [None]:
df["age"].plot.bar()

## ...some notes on plotting with Python


Plotting is an essential component of data analysis. However, the Python visualization world can be a frustrating place. There are many different options and choosing the right one is a challenge. (If you dare, take a look at the [Python Visualization Landscape](https://github.com/rougier/python-visualization-landscape).)


[matplotlib](https://matplotlib.org/) is probably the best-known 2D plotting library for Python. It allows you to produce publication-quality figures in a variety of formats and interactive environments across platforms. However, matplotlib can be a source of frustration because of its complex syntax and because it offers two interfaces, a __MATLAB-like state-based interface__ and an __object-oriented interface__. Hence, __there is always more than one way to build a visualization__. Another source of confusion is that matplotlib is well integrated into other Python libraries, such as [pandas](http://pandas.pydata.org/index.html), [seaborn](http://seaborn.pydata.org/index.html) and [xarray](http://xarray.pydata.org/en/stable/), so it is not always clear when to use pure matplotlib and when to use a tool that is built on top of it.

We import the `matplotlib` library and matplotlib's `pyplot` module using the canonical commands

    import matplotlib as mpl
    import matplotlib.pyplot as plt

With respect to matplotlib terminology it is important to understand that the __`Figure`__ is the final image that may contain one or more axes, and that the __`Axes`__ represents an individual plot.

To create a `Figure` object we call

    fig = plt.figure()

However, a more convenient way to create a `Figure` object and an `Axes` object at once is to call

    fig, ax = plt.subplots() 

Then we can use the `Axes` object to add data for plotting.

In [None]:
import matplotlib.pyplot as plt

# create a Figure and Axes object
fig, ax = plt.subplots(figsize=(10,5)) 

# plot the data and reference the Axes object
df["age"].plot.bar(ax=ax)

# add some customization to the Axes object
ax.set_xticklabels(df["Name"], rotation=0)
ax.set_xlabel("")
ax.set_ylabel("Age", size=14)
ax.set_title("The Beatles and ... something else", size=18);
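
The same pattern extends to a `Figure` with several `Axes`. The following sketch places two pandas plots side by side; the frame `demo` and the names `fig2`, `left` and `right` are made up for illustration, and the song counts are copied from the table above.

```python
import matplotlib.pyplot as plt
import pandas as pd

demo = pd.DataFrame({"songs": [62, 58, 24, 3]},
                    index=["John", "Paul", "George", "Ringo"])

# one Figure holding two Axes in a single row
fig2, (left, right) = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# each pandas plot is directed to its own Axes via the ax argument
demo["songs"].plot.bar(ax=left)
left.set_title("Bar chart")

demo["songs"].plot.line(ax=right, marker="o")
right.set_title("Line chart")

# avoid overlapping labels between the two panels
fig2.tight_layout()
```

Passing `ax=` explicitly keeps control over where each plot ends up, which is the main advantage of the object-oriented interface.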

Note that we are only scratching the surface of the plotting capabilities with pandas. Refer to the pandas online documentation ([here](https://pandas.pydata.org/pandas-docs/stable/visualization.html)) for a comprehensive overview.


***