# Introduction to Pandas

Welcome to the Data Analysis course. Now that you (presumably) have a solid grasp of numerical computing with NumPy, we will move on to data management in Python. The most common way to organise data is in **tabular** format (i.e. in a table), as in a relational database. The most widely used library for in-memory, database-like data handling is **Pandas**. Pandas is well suited for:

* **Tabular** data with heterogeneously-typed columns, such as in an SQL database or Excel spreadsheet.
* Ordered and unordered **time-series** data.
* Arbitrary **matrix** data with row and column labels.

Some of the interesting features include:

* Handling missing data fluently
* Size mutability
* Easy-to-use *data alignment*
* Label-based *slicing*, *fancy indexing* and *subsetting*
* Intuitive *merging* and *joining* of datasets by label
* Hierarchical labelling of axes
* Decent IO tools for importing from an array of different formats
* Flexible reshaping and *pivoting* of tables

For advanced information and API, check out the [cookbook](https://pandas.pydata.org/pandas-docs/stable/cookbook.html) and the [website documentation](https://pandas.pydata.org/). 

In [None]:
import pandas as pd

`pandas` is broken down into two primary classes:

1. **Series**: think of this as a typed (templated), ordered one-dimensional array with an index. A generalized *NumPy array*.
2. **DataFrame**: think of this as a 2-D heterogeneous table with a *Series* for each column.

## Pandas.Series object

A series is a *one-dimensional* labeled array capable of holding **any** data type (integers, strings, floating points, Python objects, etc). The axis labels are collectively referred to as the **index**. The basic method to create a *Series* is to call:

In [None]:
counts = pd.Series(data=[644, 1276, 3554, 154])
counts

In [None]:
data = pd.Series([.25, .5, .75, 1.])
data

`data` can be many different things:

- a list
- a Python dict
- a `numpy.ndarray`
- a scalar value

We can also specify an **index**, which needs to be the same length as `data`. If we don't specify an index, a default sequence of integers starting at 0 (much like `np.arange()`) is assigned as the index. The values of the *Series* are held in a NumPy array, while the index is another *Pandas* object:

In [None]:
counts.values

In [None]:
counts.index
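
As a small sketch of the other inputs listed above (the variable names and values here are illustrative and not part of the course data), a `Series` can be built from a NumPy array with an explicit index, or from a scalar that is repeated for every label:

```python
import numpy as np
import pandas as pd

# From a NumPy array, with an explicit index of the same length
from_array = pd.Series(np.array([0.25, 0.5, 0.75]), index=["a", "b", "c"])

# From a scalar: the value is repeated once for every index label
from_scalar = pd.Series(5, index=["x", "y", "z"])

print(from_array)
print(from_scalar)
```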

As with a NumPy array, the elements of a `Series` can be accessed by their index via the familiar Python square-bracket notation:

In [None]:
counts[1]

In [None]:
data[1:3]

### `Series` as a generalized NumPy array

The essential difference between a NumPy array and `pd.Series` is the presence of an `.index` object: whilst the NumPy array has an *implicitly* defined integer index, `pd.Series` has an *explicitly* defined index associated with each value, and that index doesn't have to be numerical:

In [None]:
foods = pd.Series([644, 1276, 3554, 154], index=['Oranges', 'Apples', 'Melons', 'Pumpkins'])
foods

In [None]:
foods["Apples"]

### `Series` as a dictionary

In this way, a `pd.Series` can be viewed as a specialized Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values (key-value pairs). One key difference is that a `pd.Series` is typed: the values share a single data type, as do the index labels (falling back to the generic `object` type when they are mixed), because `pd.Series` is built on top of NumPy, whose typed arrays are implemented in compiled C code.

In [None]:
food_d = {
    'Oranges': 644,
    'Apples': 1276,
    'Melons': 3554,
    'Pumpkins': 154
}

food_s = pd.Series(food_d)
food_s

This can also be achieved via separate lists:

In [None]:
labels = ['Oranges', 'Apples', 'Melons', 'Pumpkins']
counts = [644, 1276, 3554, 154]
pd.Series(dict(zip(labels,counts)))

Unlike a dictionary, `pd.Series` supports array-style operations such as slicing:

In [None]:
food_s["Oranges":"Melons"]
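
The typing of a `Series` mentioned earlier in this section can be checked through its `dtype` attribute. A minimal sketch, using made-up values rather than the course data:

```python
import pandas as pd

ints = pd.Series([1, 2, 3])         # all integers
mixed = pd.Series([1, "two", 3.0])  # no common native type

print(ints.dtype)   # a native integer dtype
print(mixed.dtype)  # object: every element is stored as a Python object
```

Operations on the `object` version run at Python speed, so it is usually worth keeping columns homogeneous.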

## Pandas.DataFrame

A dataframe is a *2-dimensional* labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects. It is generally the most commonly used Pandas object. Like Series, DataFrame accepts different kinds of input:

- Dict of 1D `numpy.ndarray`s, lists, dicts, or Series
- 2-D `numpy.ndarray`
- A Series
- Another `DataFrame`

Along with the data you can optionally pass **index** and **columns** arguments. If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.
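
The effect of passing an explicit index can be sketched as follows. The two-state dictionaries are a cut-down, illustrative version of the data used in this section, and `"Alaska"` is deliberately a label with no data:

```python
import pandas as pd

data = {
    "population": pd.Series({"California": 38332521, "Texas": 26448193}),
    "area": pd.Series({"California": 423967, "Texas": 695662}),
}

# Only the labels passed as `index` are kept: "California" is discarded,
# and "Alaska" has no matching data, so its row is filled with NaN.
subset = pd.DataFrame(data, index=["Texas", "Alaska"])
print(subset)
```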

One of the really nice aspects of DataFrames, particularly in a Jupyter notebook, is the HTML table that is generated automatically when displaying them:

In [None]:
population = {'California': 38332521,'Texas': 26448193,
              'New York': 19651127,'Florida': 19552860,
              'Illinois': 12882135}
area = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
states = pd.DataFrame({"population":population, "area":area})

In [None]:
states

We can extract the column names as:

In [None]:
states.columns

Again the index is accessible:

In [None]:
states.index

Individual `pd.Series` can be returned using square-bracket notation to refer to the *columns* in a dataframe:

In [None]:
states["area"]

Other ways to construct a `pd.DataFrame`:

In [None]:
import numpy as np  # needed for np.random below
pd.DataFrame(np.random.rand(3, 2), columns=["foo", "bar"], index=["a", "b", "c"])

## Pandas.Index

Both `Series` and `DataFrame` contain an explicit index that lets you reference and modify the rows of the data. This `Index` object is an interesting structure in its own right, and can be thought of as an *immutable* array or an *ordered set*.

In [None]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

The `Index` operates like an array in many ways, for instance when indexing and slicing:

In [None]:
ind[1]

In [None]:
ind[::2]

Also familiar are the ubiquitous shape and dimension attributes common to NumPy arrays:

In [None]:
print(ind.shape, ind.size, ind.ndim, ind.dtype)

**IMPORTANT**: `Index` objects are *Immutable*:

In [None]:
ind[1] = 0  # raises TypeError because an Index is immutable

The `Index` object also supports **set operations**, which underpin tasks such as joins across datasets. They follow many conventions of Python's built-in `set` data structure. They are exposed as methods; since pandas 2.0 the operators `&`, `|` and `^` perform element-wise logical operations on an `Index` rather than set operations:

In [None]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [None]:
indA.intersection(indB)

In [None]:
indA.union(indB)

In [None]:
indA.symmetric_difference(indB)

Each of these methods returns a new `Index`; the inputs are left unchanged.

## Data Indexing and Selection

For `Series`, we can use plain bracket notation to select values by their index label:

In [None]:
data = pd.Series([.25, .5, .75, 1.], index=["a","b","c","d"])
data

In [None]:
data["b"]

We can use dictionary-like expressions and methods to treat the `Series` as a dictionary:

In [None]:
"a" in data

In [None]:
data.keys()

### Indexers: **loc**, **iloc** and **ix**

Because there are many slicing and indexing conventions, some based on explicit indices and others on implicit ones, **Pandas** provides a number of special indexer attributes that expose a particular indexing scheme.

The `loc` attribute allows indexing and slicing that always references the explicit index:

In [None]:
data.loc["a"]

In [None]:
data.loc["a":"c"]

`iloc` instead exposes the *implicit* positional Python-style indexing:

In [None]:
data.iloc[1]

In [None]:
data.iloc[:2]

The `ix` form of indexing was a hybrid of the two. It was deprecated and then removed in pandas 1.0, so it should not be used; we mention it only because it still appears in older code.
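
The difference between the explicit and implicit schemes matters most when the index itself consists of integers. A minimal sketch, using a made-up `Series` whose labels are deliberately not `0, 1, 2`:

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], index=[1, 3, 5])

# loc always uses the explicit labels; label slices include the end point
by_label = s.loc[1:3]      # labels 1 and 3

# iloc always uses the implicit position; the end point is excluded
by_position = s.iloc[1:3]  # positions 1 and 2, i.e. labels 3 and 5

print(by_label)
print(by_position)
```

Being explicit with `loc` and `iloc` avoids having to remember which convention plain square brackets follow in each situation.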

### Selection in DataFrame

Let's use the principles we've learnt to see how to select elements in a `pd.DataFrame`.

In [None]:
states.loc["California","population"]

In [None]:
states.iloc[0, :]

In [None]:
states.area

We can modify the object to add a column by selecting two columns and performing an operation, for instance:

In [None]:
states["density"] = states["population"] / states["area"]
states

This performs straightforward element-by-element arithmetic between `Series` objects.

We can also think of a DataFrame as a two-dimensional array, provided all the values share the same type:

In [None]:
states.values

From this, familiar array-like operations can be applied to the DataFrame, for example the transpose:

In [None]:
states.T

In [None]:
states.iloc[:3, :2]

In [None]:
states.loc[:"Illinois",:"population"]

We can combine familiar NumPy-style *masks* with these indexers:

In [None]:
states.loc[states.density > 100, ["population","density"]]

The indexers give you direct access to the underlying object, so values can be modified in place just as you would in a NumPy array:

In [None]:
states.iloc[0, 2] = 90
states

## Pandas Operations

Like NumPy, Pandas can perform fast element-wise operations, both with basic arithmetic and with more sophisticated operations (trigonometric, exponential, etc). `pandas` inherits this functionality from NumPy's universal functions (ufuncs).

Unary operations *preserve* the index and column labels in their output, and for binary operations pandas automatically aligns the indices of the objects passed to the *ufunc*. This means that keeping the context of data and combining data from different sources (both potentially error-prone tasks with raw NumPy arrays) become essentially foolproof with `pandas`.

Let's see some examples:

In [None]:
import numpy as np

In [None]:
ser = pd.Series(np.random.randint(0, 10, 4))
df = pd.DataFrame(np.random.randint(0, 10, (3, 4)), columns=["a","b","c","d"])
ser

If we apply a NumPy ufunc on either of these objects, the result is another Pandas object with *the indices preserved*:

In [None]:
np.exp(ser)

In [None]:
np.sin(df * np.pi / 4)

When a binary operation is applied to two objects whose indices do not fully overlap, the entries without a match are returned as `NaN`:

In [None]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

In [None]:
population / area

The resulting `Series` contains the *union* of the indices of the two inputs, which can be computed with the set operations on the indices:

In [None]:
area.index.union(population.index)

Any label that exists in only one of the inputs is marked with `NaN`, or *Not a Number*. This is Pandas' way of marking missing data.

In [None]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

To get around this behaviour, a fill value can be specified when calling the equivalent object method:

In [None]:
A.add(B, fill_value=0)

The same rules apply when using UFuncs in a DataFrame.
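
As a sketch of how this looks for two DataFrames, here with small made-up frames whose rows and columns only partly overlap, alignment happens on both the index and the columns:

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
right = pd.DataFrame({"b": [10, 20, 30], "c": [40, 50, 60]})

# Labels present in only one frame produce NaN
summed = left + right

# fill_value substitutes 0 for an entry missing from one frame;
# an entry missing from both frames (row 2, column "a") stays NaN
filled = left.add(right, fill_value=0)

print(summed)
print(filled)
```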

The following table lists Python operators and their equivalent Pandas object methods:

| Python Operator | Pandas Method(s) |
| ------------ | ---------------- |
| `+` | `add()` |
| `-` | `sub()`, `subtract()` |
| `*` | `mul()`, `multiply()` |
| `/` | `truediv()`, `div()`, `divide()` |
| `//` | `floordiv()` |
| `%` | `mod()` |
| `**` | `pow()` |

###  Operations between DataFrame and Series

When performing operations between a `DataFrame` and a `Series`, index and column alignment is similarly preserved, and the behaviour mirrors operations between a two-dimensional and a one-dimensional NumPy array. One common operation is taking the difference between a 2-D array and one of its rows:

In [None]:
A = np.random.randint(10, size=(3,4))
A

In [None]:
A - A[0]

Similarly in Pandas, the operation is applied row-wise by default:

In [None]:
Adf = pd.DataFrame(A, columns=list("QRST"))
Adf - Adf.iloc[0]

If you want to operate column-wise, use the object methods, specifying the `axis` keyword:

In [None]:
Adf.sub(Adf["Q"], axis=0)

## Missing Values

One of the big differences between tutorials and the real world is that real-world data is rarely **clean** and homogeneous. Most datasets will have *some amount of data missing*. To complicate matters further, different data sources often indicate missing data in different ways.

In Pandas, missing values are referred to as `NaN`, `NA` or `null`, depending on data type and other factors. Pandas uses two existing sentinels to represent them: the Pythonic `None` object, and the IEEE floating-point value `NaN`.

In [None]:
vals = np.array([1, None, 3, 4])
vals

### `None`: Pythonic missing data

This is the Python singleton object that is used for missing data in Python code. 

The `dtype=object` means that the best common representation NumPy can infer is that every element is a Python object. Operations on such an array are carried out at the Python level rather than in compiled C code, which makes them much slower:

In [None]:
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

`None` also breaks common aggregation functions like `sum` or `min` across an array, because arithmetic between an integer and `None` is undefined:

In [None]:
vals.sum()

### `NaN`: Missing numerical data

This is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:

In [None]:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype

Because NumPy has a native floating-point type for this array, it supports fast operations pushed into compiled code. Be warned though: `NaN` propagates through arithmetic, so any operation involving it results in another `NaN`:

In [None]:
1 + np.nan

In [None]:
100**5 * np.nan

In [None]:
vals2.sum()

NumPy provides special aggregations that ignore missing values:

In [None]:
np.nansum(vals2)

Pandas is built to handle the two interchangeably, converting where appropriate:

In [None]:
pd.Series([1, np.nan, 2, None])
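
A short sketch of what that conversion means in practice, reusing the same values: the `None` is converted to `NaN`, the `Series` is stored as floating point, and pandas aggregations skip the missing entries by default.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 2, None])

print(s.dtype)   # float64: None has been converted to NaN
print(s.isna())  # boolean mask marking the missing entries
print(s.sum())   # NaN is skipped by default, unlike the NumPy sum of vals2
```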

## Reading and Writing Files

The Pandas I/O API is a set of top-level reader functions, accessed like `pandas.read_csv()`, that generally return a Pandas object. The corresponding *writer* functions are object methods, accessed like `DataFrame.to_csv()`. Below is a table containing a sample of different readers and writers:

| Format Type | Data Description | Reader  | Writer |
| ----- | ---- | ------ | ----- | 
| text | CSV | `read_csv` | `to_csv` |
| text | JSON | `read_json` | `to_json` |
| text | HTML | `read_html` | `to_html` |
| binary | MS Excel | `read_excel` | `to_excel` |
| binary | HDF5 format | `read_hdf` | `to_hdf` |
| SQL | SQL | `read_sql` | `to_sql` |

Some important parameters to functions like `read_csv()` include:

- __filepath_or_buffer__: The path to the file, a URL, or an open file-like object
- __sep__: The delimiter to use (for instance .csv is comma-separated; another favourite is the tab character \t)
- __header__: The row number to use as the column names (and the start of the data)
- __index_col__: The column to use as row labels of the DataFrame
- __prefix__: A prefix added to the column numbers when there is no header (deprecated in pandas 1.4 and removed in 2.0)

In [None]:
titanic = pd.read_excel("datasets/titanic.xlsx")
titanic.head()

We can also read from CSV or any other delimited flat-file format. The delimiter is specified with the `sep` argument in a call to `read_csv` or `read_table`.
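
A minimal sketch of reading delimited text with `sep` and `index_col`. To keep it self-contained, the "file" is a made-up tab-delimited string wrapped in `io.StringIO` rather than a path on disk:

```python
import io
import pandas as pd

# Hypothetical tab-delimited content standing in for a file on disk
raw = "name\tage\nAlice\t30\nBob\t25\n"

# sep sets the delimiter; index_col picks the column used as row labels
people = pd.read_csv(io.StringIO(raw), sep="\t", index_col="name")
print(people)
```

Passing a real file path or URL in place of the `StringIO` object works the same way.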

## Basic Aggregation

The familiar NumPy aggregations are back in a similar form: max, min, mean, sum etc.

Below is a list of built-in Pandas aggregations:

| **Aggregation** | **Description** |
| -------------- | ----------------- |
| `count()` | Total number of non-NA items |
| `first()`, `last()` | First and last item |
| `mean()` | Arithmetic mean |
| `median()` | Middle value |
| `min()`, `max()` | Smallest, largest value |
| `std()`, `var()` | Standard deviation and variance |
| `mad()` | Mean absolute deviation (removed in pandas 2.0) |
| `prod()` | Product of all items |
| `sum()` | Summation of all items |

In [None]:
titanic.sum()

In [None]:
titanic.Age.mean()

In [None]:
titanic.describe()

We can apply more than one aggregation at once, generating a `pandas.DataFrame` whose index holds the names of the aggregations we asked for:

In [None]:
titanic.agg(['min','max'])
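
The shape of the result is easier to see on a tiny frame. The sketch below uses made-up numbers and hypothetical column names rather than the Titanic data:

```python
import pandas as pd

# Illustrative values only
df = pd.DataFrame({"age": [22, 38, 26], "fare": [7.25, 71.28, 7.92]})

# One row per aggregation, one column per original column
summary = df.agg(["min", "max"])
print(summary)

# Individual results are then ordinary label lookups
print(summary.loc["max", "age"])
```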

## Sorting

We can also sort the data in our DataFrames, either by the values themselves or by the index.

In [None]:
titanic.sort_values(by='Age', ascending=False).head(3)

In [None]:
titanic.sort_index(ascending=False).head(3)

In [None]:
titanic.sort_values(by=['n_parents','Fare'], ascending=[False,True]).head()

We can `rank()` each value relative to the others if desired:

In [None]:
titanic.Fare.rank().head()
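
How `rank()` treats ties can be sketched on a small made-up `Series` (the fares below are illustrative): by default, tied values share the average of the ranks they would otherwise occupy.

```python
import pandas as pd

fares = pd.Series([7.25, 71.28, 7.25, 8.05])

# Largest fare first; the original index labels travel with the values
ordered = fares.sort_values(ascending=False)

# The two 7.25 entries would take ranks 1 and 2, so both receive 1.5
ranks = fares.rank()

print(ordered)
print(ranks)
```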

## Counts

We can count how many times each unique value occurs in a column with `value_counts()`, which is incredibly useful!

In [None]:
titanic.Survived.value_counts()

In [None]:
titanic.Sex.value_counts()
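
A small self-contained sketch with made-up values shows the form of the result: the unique values become the index, and the counts (or proportions, with `normalize=True`) become the values.

```python
import pandas as pd

sex = pd.Series(["male", "female", "male", "male"])

counts = sex.value_counts()                # occurrences of each unique value
shares = sex.value_counts(normalize=True)  # the same, as proportions

print(counts)
print(shares)
```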

## Handling Complex String columns

We may wish to break down the `Name` column into title, first and last names.

In [None]:
titanic.Name.head()

In [None]:
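# raw string so the regex escapes pass through unchanged; \. matches the literal full stop after the title
complex_names = titanic.Name.str.extract(r"(?P<Surname>[a-zA-Z]+),\s(?P<Title>[a-zA-Z]+)\.\s(?P<Forename>[a-zA-Z]+)",
                         expand=True)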
complex_names.head()

In [None]:
# or alternatively, splitting a string by a common character, here a space
titanic.Name.str.split(" ", expand=True).head()

## Tasks

You'll be working with the **tips dataset**, which contains data regarding customers in a restaurant, how much they paid and tipped, and some characteristics about the customers such as whether they smoked or not. Most of the data is preprocessed for you already.

In [None]:
tips = pd.read_csv("datasets/tips.csv")
tips.head()

### Task 1

Select all of the customers that ate at dinner time, didn't smoke, and paid more than \$25 for their total bill **or** tipped more than \$4.

In [None]:
# your code here

### Task 2

Calculate the Pearson correlation between the total bill per customer and the tip.

In [None]:
# your code here

### Task 3

Calculate the mean total bill and tip per customer, by day and gender.

In [None]:
# your code here

### Task 4

Sort customers by the tips and by smokers, and select the top 10 tippers who smoke.

In [None]:
# your code here

## Solutions

**WARNING**: _Please attempt to solve the problems before fetching the solutions!_

See the solutions to all of the problems here:

In [None]:
%load solutions/01_solutions.py