# Introduction to Pandas

Welcome to the second part of this course. Now that you (presumably) have a solid grasp of the principles of numerical computing in NumPy, we move on to data management in Python. Data is most commonly managed in **tabular** format (i.e. in a table), as in a relational database. The most widely used Python library for in-memory, database-like data handling is **Pandas**. Pandas is well suited for:

* **Tabular** data with heterogeneously-typed columns, such as in an SQL database or Excel spreadsheet.
* Ordered and unordered **time-series** data.
* Arbitrary **matrix** data with row and column labels.

Some of the interesting features include:

* Handling missing data fluently
* Size mutability
* Easy-to-use *data alignment*
* Label-based *slicing*, *fancy indexing* and *subsetting*
* Intuitive *merging* and *joining* of datasets by label
* Hierarchical labelling of axes
* Decent IO tools for importing from an array of different formats
* Flexible reshaping and *pivoting* of tables

For advanced information and API, check out the [cookbook](https://pandas.pydata.org/pandas-docs/stable/cookbook.html) and the [website documentation](https://pandas.pydata.org/). 

In [1]:
import numpy as np
import pandas as pd

`pandas` is broken down into two primary classes:

1. **Series**: think of this as an ordered, one-dimensional array of any data type with an index. A generalized *NumPy array*.
2. **DataFrame**: think of this as a 2-D heterogeneous table with a *Series* for each column.

## Pandas.Series object

A series is a *one-dimensional* labeled array capable of holding **any** data type (integers, strings, floating points, Python objects, etc). The axis labels are collectively referred to as the **index**. The basic method to create a *Series* is to call:

In [2]:
counts = pd.Series(data=[644, 1276, 3554, 154])
counts

0     644
1    1276
2    3554
3     154
dtype: int64

In [58]:
data = pd.Series([.25, .5, .75, 1.])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

`data` can be many different things:

- a list
- a Python dict
- a `numpy.ndarray`
- a scalar value
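
As a minimal sketch of the four input types listed above (the variable names and values here are illustrative, not part of the course datasets), each can be passed straight to `pd.Series`:

```python
import numpy as np
import pandas as pd

# From a list: a default integer index 0..n-1 is assigned.
from_list = pd.Series([644, 1276, 3554, 154])

# From a dict: the keys become the index labels.
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From a NumPy array, with an explicit index of the same length.
from_array = pd.Series(np.array([0.25, 0.5, 0.75]), index=["x", "y", "z"])

# From a scalar: the value is repeated for every index label.
from_scalar = pd.Series(5, index=["p", "q", "r"])

print(from_scalar)
```

Note that a scalar only makes sense together with an index, since the index determines how many times the value is repeated.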

We can also specify an **index**, which needs to be the same length as `data`. If we don't specify an index, a default sequence of integers (a `RangeIndex`, analogous to `np.arange()`) is assigned. The values of the *Series* are stored as a NumPy array, while the index is another *Pandas* object:

In [3]:
counts.values

array([ 644, 1276, 3554,  154])

In [4]:
counts.index

RangeIndex(start=0, stop=4, step=1)

As with a NumPy array, elements of a `Series` can be accessed by their associated index via the familiar Python square-bracket notation:

In [60]:
counts[1]

1276

In [61]:
data[1:3]

1    0.50
2    0.75
dtype: float64

### `Series` as a generalized NumPy array

The essential difference between a NumPy array and a `pd.Series` is the presence of the `.index` object: whilst a NumPy array has an *implicitly* defined integer index, a `pd.Series` has an *explicitly* defined index associated with its values, and that index doesn't have to be numerical:

In [62]:
foods = pd.Series([644, 1276, 3554, 154], index=['Oranges', 'Apples', 'Melons', 'Pumpkins'])
foods

Oranges      644
Apples      1276
Melons      3554
Pumpkins     154
dtype: int64

In [64]:
foods["Apples"]

1276

### `Series` as a dictionary

In this way, a `pd.Series` can be viewed as a specialized Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values (key-value pairs). One key difference is that a `Series` is typed: all of its values share a single dtype (and likewise for its index), because `pd.Series` is built on top of NumPy arrays, which are implemented in C. This makes it much more efficient than a plain dictionary for many operations.

In [67]:
food_d = {
    'Oranges': 644,
    'Apples': 1276,
    'Melons': 3554,
    'Pumpkins': 154
}

food_s = pd.Series(food_d)
food_s

Oranges      644
Apples      1276
Melons      3554
Pumpkins     154
dtype: int64

This can also be achieved via separate lists:

In [68]:
labels = ['Oranges', 'Apples', 'Melons', 'Pumpkins']
counts = [644, 1276, 3554, 154]
pd.Series(dict(zip(labels,counts)))

Oranges      644
Apples      1276
Melons      3554
Pumpkins     154
dtype: int64

Unlike a dictionary, `pd.Series` supports array-style operations such as slicing:

In [69]:
food_s["Oranges":"Melons"]

Oranges     644
Apples     1276
Melons     3554
dtype: int64
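
One detail worth noting in the output above is that a label-based slice includes its end point, unlike an ordinary Python slice. The sketch below contrasts the two, using a small illustrative `fruit` series and the positional `iloc` indexer (covered later in this section):

```python
import pandas as pd

fruit = pd.Series({"Oranges": 644, "Apples": 1276, "Melons": 3554, "Pumpkins": 154})

# Label-based slice: both end points are included.
by_label = fruit["Oranges":"Melons"]

# Position-based slice: the stop position is excluded, as in plain Python.
by_position = fruit.iloc[0:2]

print(by_label)
print(by_position)
```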

## Pandas.DataFrame

A dataframe is a *2-dimensional* labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects. It is generally the most commonly used Pandas object. Like Series, DataFrame accepts different kinds of input:

- Dict of 1D `numpy.ndarray`s, lists, dicts, or Series
- 2-D `numpy.ndarray`
- A Series
- Another `DataFrame`

Along with the data you can optionally pass **index** and **columns** arguments. If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.
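
To illustrate the effect of passing **index** and **columns**, here is a minimal sketch using a made-up dict of two `Series` (the names `one`, `two` and `three` are purely illustrative):

```python
import pandas as pd

d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([10.0, 20.0, 30.0, 40.0], index=["a", "b", "c", "d"]),
}

# No index passed: the result uses the union of the Series indices,
# and "one" is NaN at label "d".
full = pd.DataFrame(d)

# Index passed: only the requested labels are kept, in the requested order.
subset = pd.DataFrame(d, index=["d", "b", "a"])

# Columns passed: "one" is discarded and the unknown "three" is filled with NaN.
picked = pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])

print(picked)
```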

One of the really nice aspects of DataFrames, particularly in a Jupyter notebook, is the automatic HTML rendering of tables:

In [76]:
population = {'California': 38332521,'Texas': 26448193,
              'New York': 19651127,'Florida': 19552860,
              'Illinois': 12882135}
area = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
states = pd.DataFrame({"population":population, "area":area})

In [77]:
states

Unnamed: 0,population,area
California,38332521,423967
Florida,19552860,170312
Illinois,12882135,149995
New York,19651127,141297
Texas,26448193,695662


We can extract the column names as:

In [78]:
states.columns

Index(['population', 'area'], dtype='object')

Again the index is accessible:

In [79]:
states.index

Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')

Individual `pd.Series` can be returned using square-bracket notation to refer to the *columns* in a dataframe:

In [80]:
states["area"]

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

Other ways to construct a `pd.DataFrame`:

In [81]:
pd.DataFrame(np.random.rand(3, 2),columns=["foo","bar"], index=["a","b","c"])

Unnamed: 0,foo,bar
a,0.271922,0.341065
b,0.604413,0.364944
c,0.270858,0.391063


## Pandas.Index

Both `Series` and `DataFrame` contain an explicit index that lets you reference and modify the rows of the data. This `Index` object is an interesting structure in its own right, and can be thought of as either an *immutable* array or an *ordered set*.

In [83]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

The `Index` operates like an array in many ways, for instance supporting standard indexing and slicing:

In [84]:
ind[1]

3

In [85]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

Also familiar are the shape, size and dimension attributes common to NumPy arrays:

In [86]:
print(ind.shape, ind.size, ind.ndim, ind.dtype)

(5,) 5 1 int64


**IMPORTANT**: `Index` objects are *immutable*:

In [87]:
ind[1] = 0

TypeError: Index does not support mutable operations

The `Index` object also supports **set operations**, which underpin tasks such as joins across datasets. It follows many conventions from Python's built-in `set` data structure:

In [88]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [89]:
indA.intersection(indB)

Int64Index([3, 5, 7], dtype='int64')

In [90]:
indA.union(indB)

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [91]:
indA.symmetric_difference(indB)

Int64Index([1, 2, 9, 11], dtype='int64')

Older `pandas` versions also accepted the operators `&`, `|` and `^` for these set operations, but that usage was deprecated and in recent versions the operators act element-wise, so the methods above are the safer choice.

## Data Indexing and Selection

For `Series`, we can use straight bracket-notation when selecting indices:

In [92]:
data = pd.Series([.25, .5, .75, 1.], index=["a","b","c","d"])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [93]:
data["b"]

0.5

We can use dictionary-like expressions and methods to treat the `Series` as a dictionary:

In [94]:
"a" in data

True

In [95]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

### Indexers: **loc**, **iloc** and **ix**

Because there are many slicing and indexing conventions, some based on explicit indices and others on implicit ones, **Pandas** provides a number of special indexer attributes that expose particular indexing schemes.

The `loc` attribute allows indexing and slicing that always references the explicit index:

In [97]:
data.loc["a"]

0.25

In [98]:
data.loc["a":"c"]

a    0.25
b    0.50
c    0.75
dtype: float64

`iloc` instead exposes the *implicit* positional Python-style indexing:

In [101]:
data.iloc[1]

0.5

In [102]:
data.iloc[:2]

a    0.25
b    0.50
dtype: float64

The `ix` form of indexing is a hybrid of the two. Its use is discouraged: it was deprecated in `pandas` 0.20 and removed in 1.0, but we mention it for educational purposes.
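
The distinction between `loc` and `iloc` matters most when the index itself consists of integers. The following sketch uses an illustrative series `s` (not one of the objects defined above) to show how the two indexers diverge:

```python
import pandas as pd

# An integer index that does not match the positions 0, 1, 2.
s = pd.Series(["a", "b", "c"], index=[1, 3, 5])

by_label = s.loc[1]        # label 1      -> "a"
by_position = s.iloc[1]    # position 1   -> "b"

label_slice = s.loc[1:3]       # labels 1 to 3, end point included
position_slice = s.iloc[1:3]   # positions 1 and 2, stop excluded

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```

Being explicit about which scheme is intended avoids the ambiguity that the hybrid `ix` indexer suffered from.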

### Selection in DataFrame

Let's use the principles we've learnt to see how to select elements in a `pd.DataFrame`.

In [104]:
states.loc["California","population"]

38332521

In [106]:
states.iloc[0, :]

population    38332521
area            423967
Name: California, dtype: int64

In [109]:
states.area

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

We can modify the object to add a column, for instance by combining two existing columns in an arithmetic operation:

In [112]:
states["density"] = states["population"] / states["area"]
states

Unnamed: 0,population,area,density
California,38332521,423967,90.413926
Florida,19552860,170312,114.806121
Illinois,12882135,149995,85.883763
New York,19651127,141297,139.076746
Texas,26448193,695662,38.01874


This performs straightforward element-by-element arithmetic between `Series` objects.

We can think of a DataFrame as a two-dimensional array, if all the values are the same type:

In [113]:
states.values

array([[  3.83325210e+07,   4.23967000e+05,   9.04139261e+01],
       [  1.95528600e+07,   1.70312000e+05,   1.14806121e+02],
       [  1.28821350e+07,   1.49995000e+05,   8.58837628e+01],
       [  1.96511270e+07,   1.41297000e+05,   1.39076746e+02],
       [  2.64481930e+07,   6.95662000e+05,   3.80187404e+01]])

With this in mind, many familiar array-like operations can be applied to the DataFrame, for example the transpose:

In [114]:
states.T

Unnamed: 0,California,Florida,Illinois,New York,Texas
population,38332520.0,19552860.0,12882140.0,19651130.0,26448190.0
area,423967.0,170312.0,149995.0,141297.0,695662.0
density,90.41393,114.8061,85.88376,139.0767,38.01874


In [115]:
states.iloc[:3, :2]

Unnamed: 0,population,area
California,38332521,423967
Florida,19552860,170312
Illinois,12882135,149995


In [116]:
states.loc[:"Illinois",:"population"]

Unnamed: 0,population
California,38332521
Florida,19552860
Illinois,12882135


We can combine familiar NumPy-style *masks* with these indexers:

In [118]:
states.loc[states.density > 100, ["population","density"]]

Unnamed: 0,population,density
Florida,19552860,114.806121
New York,19651127,139.076746


These indexers can also be used to set or modify values in place, as you would in a NumPy array:

In [119]:
states.iloc[0, 2] = 90
states

Unnamed: 0,population,area,density
California,38332521,423967,90.0
Florida,19552860,170312,114.806121
Illinois,12882135,149995,85.883763
New York,19651127,141297,139.076746
Texas,26448193,695662,38.01874


## Pandas Operations

Like NumPy, Pandas can perform fast element-wise operations, both basic arithmetic and more sophisticated operations (trigonometric, exponential, etc.). `pandas` inherits this functionality from NumPy's universal functions (ufuncs).

Pandas adds two useful twists. For unary operations, ufuncs *preserve* the index and column labels in the output; for binary operations, pandas automatically *aligns* indices when passing objects to the ufunc. This means that keeping the context of data and combining data from different sources (both potentially error-prone tasks with raw NumPy arrays) become essentially foolproof with `pandas`.

Let's see some examples:

In [120]:
import numpy as np

In [125]:
ser = pd.Series(np.random.randint(0, 10, 4))
df = pd.DataFrame(np.random.randint(0, 10, (3, 4)), columns=["a","b","c","d"])
ser

0    0
1    2
2    4
3    3
dtype: int64

If we apply a NumPy ufunc on either of these objects, the result is another Pandas object with *the indices preserved*:

In [126]:
np.exp(ser)

0     1.000000
1     7.389056
2    54.598150
3    20.085537
dtype: float64

In [127]:
np.sin(df * np.pi / 4)

Unnamed: 0,a,b,c,d
0,1.0,0.707107,-1.0,1.224647e-16
1,1.0,0.707107,0.7071068,1.224647e-16
2,-1.0,-0.707107,-2.449294e-16,0.7071068


When a binary operation is applied to two objects whose indices do not fully overlap, the entries without a match in both are returned as `NaN`:

In [128]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

In [129]:
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

The resulting `Series` contains the *union* of the indices of the two inputs, which can be computed with standard set operations on those indices:

In [130]:
area.index.union(population.index)

Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

Any label that exists in only one of the two inputs is marked with `NaN`, or *Not a Number*. This is Pandas' way of marking missing data.

In [131]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

To get around this behaviour, a fill value can be specified when calling the equivalent object method:

In [132]:
A.add(B, fill_value=0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64

The same rules apply when using UFuncs in a DataFrame.
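
As a hedged sketch of how this carries over to two dimensions, the example below adds two small made-up DataFrames (`left` and `right` are illustrative names) whose rows and columns only partially overlap:

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["a", "b"])
right = pd.DataFrame({"y": [10, 20], "z": [30, 40]}, index=["b", "c"])

# Plain addition aligns on both rows and columns; only the cell present
# in both inputs (row "b", column "y") has a value, the rest are NaN.
plain = left + right

# With fill_value=0, a cell missing from one input is treated as 0.
# Cells missing from BOTH inputs are still NaN.
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```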

The following table lists Python operators and their equivalent Pandas object methods:

| Python Operator | Pandas Method(s) |
| ------------ | ---------------- |
| `+` | `add()` |
| `-` | `sub()`, `subtract()` |
| `*` | `mul()`, `multiply()` |
| `/` | `truediv()`, `div()`, `divide()` |
| `//` | `floordiv()` |
| `%` | `mod()` |
| `**` | `pow()` |

### Operations between DataFrame and Series

When performing operations between a `DataFrame` and a `Series`, index and column alignment is similarly maintained, and the behaviour is analogous to operations between a 2-D and a 1-D NumPy array. One common operation is taking the difference between a 2-D array and one of its rows:

In [133]:
A = np.random.randint(10, size=(3,4))
A

array([[3, 7, 6, 5],
       [5, 3, 3, 3],
       [8, 6, 2, 2]])

In [134]:
A - A[0]

array([[ 0,  0,  0,  0],
       [ 2, -4, -3, -2],
       [ 5, -1, -4, -3]])

Similarly in Pandas, the convention operates row-wise:

In [135]:
Adf = pd.DataFrame(A, columns=list("QRST"))
Adf - Adf.iloc[0]

Unnamed: 0,Q,R,S,T
0,0,0,0,0
1,2,-4,-3,-2
2,5,-1,-4,-3


If you want to operate column-wise, use the object methods and specify the `axis` keyword:

In [137]:
Adf.sub(Adf["Q"], axis=0)

Unnamed: 0,Q,R,S,T
0,0,4,3,2
1,0,-2,-2,-2
2,0,-2,-6,-6


## Missing Values

One of the big differences between tutorials and the real world is that real-world data is rarely **clean** and homogeneous. Most datasets will have *some amount of data missing*. To complicate matters further, different data sources often indicate missing data in different ways.

Missing values are variously referred to as `NaN`, `NA` or `null`, depending on data type and other factors. Pandas uses two sentinel representations for them: the Python `None` object and the IEEE floating-point value `NaN`.

In [139]:
vals = np.array([1, None, 3, 4])
vals

array([1, None, 3, 4], dtype=object)

### `None`: Pythonic missing data

This is the Python singleton object that is used for missing data in Python code. 

The `dtype=object` means that the best common type NumPy can infer for the array contents is a generic Python object. Operations on such an array are carried out at the Python level rather than in compiled C code, which makes them much slower:

In [141]:
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

dtype = object
44.3 ms ± 937 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

dtype = int
1.15 ms ± 4.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)



`None` also breaks common aggregation functions such as `sum` or `min` when applied across an array:

In [140]:
vals.sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

### `NaN`: Missing numerical data

This is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:

In [143]:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype

dtype('float64')

Because NumPy has a native floating-point type for this array, it supports fast operations pushed into compiled code. Be warned though: `NaN` propagates through arithmetic, so any operation involving a `NaN` yields another `NaN`:

In [144]:
1 + np.nan

nan

In [146]:
100**5 * np.nan

nan

In [147]:
vals2.sum()

nan

NumPy provides special aggregations that ignore missing values:

In [148]:
np.nansum(vals2)

8.0

Pandas is built to handle the two interchangeably, converting where appropriate:

In [149]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64
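
Once missing values are stored as `NaN`, they can be detected, dropped or replaced. The sketch below assumes the standard `isna()`, `dropna()` and `fillna()` methods, which are not otherwise covered in this section, applied to the same small series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 2, None])

mask = s.isna()       # boolean mask marking the missing entries
dropped = s.dropna()  # only the non-missing entries, original labels kept
filled = s.fillna(0)  # missing entries replaced by a chosen value

# Unlike the raw NumPy sum, pandas aggregations skip NaN by default.
total = s.sum()

print(mask.tolist(), dropped.index.tolist(), filled.tolist(), total)
```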

## Reading and Writing Files

The Pandas I/O API is a set of top-level reader functions, such as `pandas.read_csv()`, that generally return a Pandas object. The corresponding *writer* functions are object methods, such as `DataFrame.to_csv()`. Below is a table containing a sample of the available readers and writers:

| Format Type | Data Description | Reader  | Writer |
| ----- | ---- | ------ | ----- | 
| text | CSV | `read_csv` | `to_csv` |
| text | JSON | `read_json` | `to_json` |
| text | HTML | `read_html` | `to_html` |
| binary | MS Excel | `read_excel` | `to_excel` |
| binary | HDF5 format | `read_hdf` | `to_hdf` |
| SQL | SQL | `read_sql` | `to_sql` |

Some important parameters to functions like `read_csv()` include:

- __filepath_or_buffer__: The path to the file or URL
- __sep__: The delimiter to use (a .csv file is comma-separated by default; another common choice is the tab character `\t`)
- __header__: The row number to use as the column names (and the start of the data)
- __index_col__: The column to use as the row labels of the DataFrame
- __prefix__: Allows a prefix to be added to the column names when there is no header (deprecated and removed in recent `pandas` versions)
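
A minimal sketch of these parameters in action follows. To keep it self-contained it reads from an in-memory string via `io.StringIO` rather than a file on disk, and the tab-separated content is made up for illustration:

```python
import io

import pandas as pd

# Tab-delimited text standing in for a file on disk.
raw = "id\tname\tscore\n1\tAda\t9.5\n2\tLin\t7.0\n"

df = pd.read_csv(
    io.StringIO(raw),  # a path or URL would normally go here
    sep="\t",          # fields are separated by tabs, not commas
    header=0,          # the first row holds the column names
    index_col="id",    # use the "id" column as the row labels
)

print(df)
```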

In [13]:
titanic = pd.read_excel("datasets/titanic.xlsx")
titanic.head()

Unnamed: 0,PassengerId,Age,Cabin,Port Embarked,Fare,Name,n_parents,Pclass,Sex,n_siblings,Survived,Ticket
0,1,22.0,,Southampton,7.25,"Braund, Mr. Owen Harris",0,3rd class,male,1,0,A/5 21171
1,2,38.0,C85,Cherbourg,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,1st class,female,1,1,PC 17599
2,3,26.0,,Southampton,7.925,"Heikkinen, Miss. Laina",0,3rd class,female,0,1,STON/O2. 3101282
3,4,35.0,C123,Southampton,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,1st class,female,1,1,113803
4,5,35.0,,Southampton,8.05,"Allen, Mr. William Henry",0,3rd class,male,0,0,373450


We can also read from CSV or any other delimited flat-file format. The delimiter is specified with the `sep` argument in a call to `read_csv` or `read_table`.

## Basic Aggregation

The familiar aggregations from NumPy are back in a similar form: `max`, `min`, `mean`, `sum`, etc.

Below is a list of built-in Pandas aggregations:

| **Aggregation** | **Description** |
| -------------- | ----------------- |
| `count()` | Total number of non-NA items |
| `first()`, `last()` | First and last item |
| `mean()` | Arithmetic mean |
| `median()` | Middle value |
| `min()`, `max()` | Smallest, largest value |
| `std()`, `var()` | Standard deviation and variance |
| `mad()` | Mean absolute deviation (removed in `pandas` 2.0) |
| `prod()` | Product of all items |
| `sum()` | Summation of all items |
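
Before applying these to a real dataset, here is a small self-contained sketch on a made-up `scores` table (the column names and values are illustrative). It also shows that missing values are skipped by default:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "maths": [60.0, 75.0, np.nan, 90.0],
    "art": [50.0, 70.0, 80.0, 100.0],
})

counts = scores.count()    # number of non-NA items per column
means = scores.mean()      # NaN is ignored when averaging
medians = scores.median()  # middle value per column
highest = scores.max()     # largest value per column

print(counts)
print(means)
```

Each aggregation returns a `Series` indexed by column name, with one value per column.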

In [32]:
titanic.sum()

Age                                                     31255.7
Fare                                                    43550.5
Name          Braund, Mr. Owen HarrisCumings, Mrs. John Brad...
n_parents                                                   504
Pclass        3rd class1st class3rd class1st class3rd class3...
Sex           malefemalefemalefemalemalemalemalemalefemalefe...
n_siblings                                                  653
Survived                                                    494
Ticket        A/5 21171PC 17599STON/O2. 31012821138033734503...
dtype: object

In [33]:
titanic.Age.mean()

29.881137667304014

In [34]:
titanic.describe()

Unnamed: 0,Age,Fare,n_parents,n_siblings,Survived
count,1046.0,1308.0,1309.0,1309.0,1309.0
mean,29.881138,33.295479,0.385027,0.498854,0.377387
std,14.413493,51.758668,0.86556,1.041658,0.484918
min,0.17,0.0,0.0,0.0,0.0
25%,21.0,7.8958,0.0,0.0,0.0
50%,28.0,14.4542,0.0,0.0,0.0
75%,39.0,31.275,0.0,1.0,1.0
max,80.0,512.3292,9.0,8.0,1.0


We can apply more than one aggregation at once, generating a `pandas.DataFrame` whose index labels are the names of the aggregations we requested:

In [37]:
titanic.agg(['min','max'])

Unnamed: 0,Age,Fare,Name,n_parents,Pclass,Sex,n_siblings,Survived,Ticket
min,0.17,0.0,"Abbing, Mr. Anthony",0,1st class,female,0,0,110152
max,80.0,512.3292,"van Melkebeke, Mr. Philemon",9,3rd class,male,8,1,WE/P 5735


## Sorting

We can also sort the data in our DataFrames, either by the values themselves or by the index.

In [42]:
titanic.sort_values(by='Age', ascending=False).head(3)

PassengerId,Age,Cabin,Port Embarked,Fare,Name,n_parents,Pclass,Sex,n_siblings,Survived,Ticket,Age_Fare_rat
631,80.0,A23,Southampton,30.0,"Barkworth, Mr. Algernon Henry Wilson",0,1st class,male,0,1,27042,2.666667
988,76.0,C46,Southampton,78.85,"Cavendish, Mrs. Tyrell William (Julia Florence...",0,1st class,female,1,1,19877,0.963855
852,74.0,,Southampton,7.775,"Svensson, Mr. Johan",0,3rd class,male,0,0,347060,9.517685


In [43]:
titanic.sort_index(ascending=False).head(3)

PassengerId,Age,Cabin,Port Embarked,Fare,Name,n_parents,Pclass,Sex,n_siblings,Survived,Ticket,Age_Fare_rat
1309,,,Cherbourg,22.3583,"Peter, Master. Michael J",1,3rd class,male,1,0,2668,
1308,,,Southampton,8.05,"Ware, Mr. Frederick",0,3rd class,male,0,0,359309,
1307,38.5,,Southampton,7.25,"Saether, Mr. Simon Sivertsen",0,3rd class,male,0,0,SOTON/O.Q. 3101262,5.310345


In [44]:
titanic.sort_values(by=['n_parents','Fare'], ascending=[False,True]).head()

PassengerId,Age,Cabin,Port Embarked,Fare,Name,n_parents,Pclass,Sex,n_siblings,Survived,Ticket,Age_Fare_rat
1234,,,Southampton,69.55,"Sage, Mr. John George",9,3rd class,male,1,0,CA. 2343,
1257,,,Southampton,69.55,"Sage, Mrs. John (Annie Bullen)",9,3rd class,female,1,1,CA. 2343,
679,43.0,,Southampton,46.9,"Goodwin, Mrs. Frederick (Augusta Tyler)",6,3rd class,female,1,0,CA 2144,0.916844
1031,40.0,,Southampton,46.9,"Goodwin, Mr. Charles Frederick",6,3rd class,male,1,0,CA 2144,0.852878
886,39.0,,Queenstown,29.125,"Rice, Mrs. William (Margaret Norton)",5,3rd class,female,0,0,382652,1.339056


We can `rank()` each value relative to the others if desired:

In [45]:
titanic.Fare.rank().head()

PassengerId
1     108.5
2    1155.5
3     349.0
4    1091.5
5     391.5
Name: Fare, dtype: float64
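
The fractional ranks above arise because tied values share the average of the ranks they span. A minimal sketch with a made-up `fares` series shows this, along with the `method` and `ascending` options of `rank()`:

```python
import pandas as pd

fares = pd.Series([7.25, 8.05, 7.25, 71.28])

# Default: ties receive the average of their ranks (here 1 and 2 -> 1.5).
average_rank = fares.rank()

# method="min": ties all receive the lowest rank in their group.
min_rank = fares.rank(method="min")

# ascending=False: the largest value receives rank 1.
descending_rank = fares.rank(ascending=False)

print(average_rank.tolist())
print(min_rank.tolist())
print(descending_rank.tolist())
```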

## Counts

We can count how many times each unique value occurs in a column with `value_counts()`, which is incredibly useful!

In [46]:
titanic.Survived.value_counts()

0    815
1    494
Name: Survived, dtype: int64

In [47]:
titanic.Sex.value_counts()

male      843
female    466
Name: Sex, dtype: int64
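
As a small sketch on made-up data, `value_counts()` can also return proportions via `normalize=True`, while `nunique()` gives the number of distinct values rather than their frequencies:

```python
import pandas as pd

sex = pd.Series(["male", "female", "male", "male"])

counts = sex.value_counts()                # occurrences of each value
shares = sex.value_counts(normalize=True)  # the same, as fractions of the total
n_unique = sex.nunique()                   # how many distinct values there are

print(counts)
print(shares)
print(n_unique)
```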

## Handling Complex String columns

We may wish to break down the `Name` column into title, forename and surname.

In [48]:
titanic.Name.head()

PassengerId
1                              Braund, Mr. Owen Harris
2    Cumings, Mrs. John Bradley (Florence Briggs Th...
3                               Heikkinen, Miss. Laina
4         Futrelle, Mrs. Jacques Heath (Lily May Peel)
5                             Allen, Mr. William Henry
Name: Name, dtype: object

In [49]:
complex_names = titanic.Name.str.extract(r"(?P<Surname>[a-zA-Z]+),\s(?P<Title>[a-zA-Z]+)\.\s(?P<Forename>[a-zA-Z]+)",
                         expand=True)
complex_names.head()

PassengerId,Surname,Title,Forename
1,Braund,Mr,Owen
2,Cumings,Mrs,John
3,Heikkinen,Miss,Laina
4,Futrelle,Mrs,Jacques
5,Allen,Mr,William


In [50]:
# or alternatively, split the string on a common character, such as a space
titanic.Name.str.split(" ", expand=True).head()

PassengerId,0,1,2,3,4,5,6,7,8,9,10,11,12,13
1,"Braund,",Mr.,Owen,Harris,,,,,,,,,,
2,"Cumings,",Mrs.,John,Bradley,(Florence,Briggs,Thayer),,,,,,,
3,"Heikkinen,",Miss.,Laina,,,,,,,,,,,
4,"Futrelle,",Mrs.,Jacques,Heath,(Lily,May,Peel),,,,,,,
5,"Allen,",Mr.,William,Henry,,,,,,,,,,
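
To try the extraction without the Titanic file, the sketch below applies the same idea to a few hand-written names. The pattern is written as a raw string with an escaped full stop after the title, and the third entry is a deliberately non-matching string added for illustration:

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "no match here",
])

# Named groups become the column names of the result.
pattern = r"(?P<Surname>[A-Za-z]+),\s(?P<Title>[A-Za-z]+)\.\s(?P<Forename>[A-Za-z]+)"
parts = names.str.extract(pattern, expand=True)

print(parts)
```

Rows that do not match the pattern come back as missing values in every column, so it is worth checking for them before relying on the extracted fields.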


## Tasks

You'll be working with the **tips dataset**, which contains data regarding customers in a restaurant, how much they paid and tipped, and some characteristics about the customers such as whether they smoked or not. Most of the data is preprocessed for you already.

In [52]:
tips = pd.read_csv("datasets/tips.csv")
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


### Task 1

Select all of the customers that ate at dinner time, didn't smoke, and paid more than \$25 for their total bill **or** tipped more than \$4.

In [53]:
tips.query("((total_bill > 25) | (tip > 4)) & (smoker == 'No') & (time == 'Dinner')")

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
5,25.29,4.71,Male,No,Sun,Dinner,4
7,26.88,3.12,Male,No,Sun,Dinner,4
11,35.26,5.0,Female,No,Sun,Dinner,4
20,17.92,4.08,Male,No,Sat,Dinner,2
23,39.42,7.58,Male,No,Sat,Dinner,4
28,21.7,4.3,Male,No,Sat,Dinner,2
39,31.27,5.0,Male,No,Sat,Dinner,3
44,30.4,5.6,Male,No,Sun,Dinner,4
46,22.23,5.0,Male,No,Sun,Dinner,2
47,32.4,6.0,Male,No,Sun,Dinner,4
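
The same selection can also be written with boolean masks instead of `query()`. The sketch below checks that the two forms agree on a tiny made-up table (`toy` is illustrative and is not the tips dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "total_bill": [30.0, 12.0, 18.0, 40.0],
    "tip": [3.0, 5.0, 2.0, 6.0],
    "smoker": ["No", "No", "No", "Yes"],
    "time": ["Dinner", "Dinner", "Lunch", "Dinner"],
})

# String expression evaluated against the column names.
via_query = toy.query(
    "((total_bill > 25) | (tip > 4)) & (smoker == 'No') & (time == 'Dinner')"
)

# Equivalent boolean mask; each comparison needs its own parentheses.
mask = (
    ((toy["total_bill"] > 25) | (toy["tip"] > 4))
    & (toy["smoker"] == "No")
    & (toy["time"] == "Dinner")
)
via_mask = toy[mask]

print(via_mask)
```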


### Task 2

Calculate the Pearson correlation between the total bill per customer and the tip.

In [54]:
tips[["total_bill","tip"]].corr()

Unnamed: 0,total_bill,tip
total_bill,1.0,0.675734
tip,0.675734,1.0


### Task 3

Calculate the mean total bill and tip per customer, by day and gender.

In [55]:
tips.groupby(["day","sex"]).agg({"total_bill":"mean", "tip":"mean"})

day,sex,total_bill,tip
Fri,Female,14.145556,2.781111
Fri,Male,19.857,2.693
Sat,Female,19.680357,2.801786
Sat,Male,20.802542,3.083898
Sun,Female,19.872222,3.367222
Sun,Male,21.887241,3.220345
Thur,Female,16.715312,2.575625
Thur,Male,18.714667,2.980333


### Task 4

Sort customers by the tips and by smokers, and select the top 10 tippers who smoke.

In [56]:
tips.sort_values(by=["smoker","tip"], ascending=[False,False]).iloc[:10,:]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
170,50.81,10.0,Male,Yes,Sat,Dinner,3
183,23.17,6.5,Male,Yes,Sun,Dinner,4
214,28.17,6.5,Female,Yes,Sat,Dinner,3
181,23.33,5.65,Male,Yes,Sun,Dinner,2
211,25.89,5.16,Male,Yes,Sat,Dinner,4
172,7.25,5.15,Male,Yes,Sun,Dinner,2
73,25.28,5.0,Female,Yes,Sat,Dinner,2
83,32.68,5.0,Male,Yes,Thur,Lunch,2
197,43.11,5.0,Female,Yes,Thur,Lunch,4
95,40.17,4.73,Male,Yes,Fri,Dinner,4


## Solutions

**WARNING**: _Please attempt to solve the problems before fetching the solutions!_

See the solutions to all of the problems here:

In [1]:
%load solutions/01_solutions.py