# Data Wrangling

A crash course in data generation, handling, manipulation, transformation and preparation. These steps can lead to further analysis, statistical or visual (which we will explore in workshop 3), or to training predictive models. It assumes the data is already cleaned and structured, not in raw form (which we will briefly explore in workshop 4).

Prerequisites: a basic understanding of programming; Python is a plus.

### Table of contents

- Numpy Array
  - Creation
  - Properties
    - Shape
    - Data Type
  - Generation
    - Randomness
  - Accessing
    - Iteration
  - Copying
  - Array Operations
    - Arithmetic
    - Logical
    - Reshaping
    - Broadcasting
    - Masking
  - Extension

- Pandas Dataframes
  - Creation
  - Properties
  - Accessing
  - Extension
  - Missing values
  - Deletion
  - Operations
    - Arithmetic
    - Logical
    - String Functions
    - Arbitrary Functions
    - Statistical
    - Ordering
  - Data Transformations
    - Group By
    - Pivot
    - Melt
    - Join
  - Time Series
  - Hierarchical Indices
  - Loading and Saving Dataframes



- Further Reading

## Numpy Arrays

Numpy offers powerful array objects. It is part of the de-facto [Python ecosystem](https://www.scipy.org) for mathematics, science and engineering, and sits at the foundation of most scientific computation libraries.

They are homogeneous containers (all elements have the same type), usually for numbers, that are indexed by integers. They are similar to lists but offer much more functionality and better performance.

In [1]:
import numpy as np  # import the package into our namespace, under the usual name `np`

### Array Creation

Create an array from a regular `list` object:

In [2]:
squares = np.array([0, 1, 4, 9, 16, 25, 36, 49])

Create a 2D array/a matrix:

In [3]:
m = np.array([
    [5, 2, 3],
    [4, 5, 1],
    [7, 1, 2],
    [6, 2, 9],
])

Arbitrarily many dimensions:

In [4]:
# pixels, which are 3-dimensional points:
R = [1, 0, 0]  # red
B = [0, 0, 1]  # blue
W = [1, 1, 1]  # white

In [5]:
image = np.array([
    [B, B, R, R],
    [B, B, W, W],
    [R, R, R, R],
    [W, W, W, W],
    [R, R, R, R],
    [W, W, W, W],
])

In [6]:
image

array([[[0, 0, 1],
        [0, 0, 1],
        [1, 0, 0],
        [1, 0, 0]],

       [[0, 0, 1],
        [0, 0, 1],
        [1, 1, 1],
        [1, 1, 1]],

       [[1, 0, 0],
        [1, 0, 0],
        [1, 0, 0],
        [1, 0, 0]],

       [[1, 1, 1],
        [1, 1, 1],
        [1, 1, 1],
        [1, 1, 1]],

       [[1, 0, 0],
        [1, 0, 0],
        [1, 0, 0],
        [1, 0, 0]],

       [[1, 1, 1],
        [1, 1, 1],
        [1, 1, 1],
        [1, 1, 1]]])

**💪 Exercise**: create a numpy array of three rows, two columns of arbitrary numbers from `0` to `5`:

In [7]:
np.array([
    [1, 3],
    [3, 1],
    [3, 2],
])

array([[1, 3],
       [3, 1],
       [3, 2]])

#### Array Shape

The _shape_ of an array is the size of each of its dimensions:

In [8]:
squares.shape  # 8 elements

(8,)

In [9]:
m.shape  # 4 rows, 3 columns

(4, 3)

In [10]:
image.shape  # 6 rows, 4 columns, each element containing 3 coordinates

(6, 4, 3)

An array's _rank_ is the number of dimensions.

---

Each dimension must have a consistent size, so a ragged input like the following does not produce a 2D array:

In [11]:
a = np.array([
    [1,2,3], 
    [1,2]
])
a

array([list([1, 2, 3]), list([1, 2])], dtype=object)

In [12]:
a.shape

(2,)

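Numpy just interpreted it as an array of two elements, each element being a list. (Recent Numpy versions, 1.24 and later, raise a `ValueError` for such ragged input unless `dtype=object` is passed explicitly.) More details in the next sub-section.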

#### Array Data Types

An array's data type is the type of the elements it holds.

In [13]:
squares.dtype

dtype('int64')

In [14]:
m.dtype

dtype('int64')

In [15]:
np.array([1.5, 2.3, 4.9]).dtype

dtype('float64')

In [16]:
np.array([True, True, False]).dtype

dtype('bool')

In [17]:
np.array(['abc', 'def', 'xy']).dtype  # unicode with 3 or fewer characters

dtype('<U3')

Compatible datatypes are "up-scaled":

In [18]:
a = np.array([True, 5])
a

array([1, 5])

In [19]:
a.dtype

dtype('int64')

In [20]:
np.array([1, 2.5]).dtype

dtype('float64')

In [21]:
np.array([True, 2, 3.5]).dtype

dtype('float64')

In [22]:
np.array([7, 'abc']).dtype  # calls `str` on them

dtype('<U21')
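The "up-scaling" rule can also be queried directly, without building an array. The sketch below uses `np.result_type`, which reports the dtype Numpy would pick when combining the given types. The explicit 64-bit dtypes and the variable names are assumptions, chosen so the result does not depend on the platform's default integer size.

```python
import numpy as np

# ask Numpy which dtype results from mixing two dtypes
bool_and_int = np.result_type(np.bool_, np.int64)
int_and_float = np.result_type(np.int64, np.float64)

print(bool_and_int)   # int64
print(int_and_float)  # float64
```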

Incompatible datatypes are put under the `object` umbrella:

In [23]:
s = {1, 2, 3}
np.array([s, 5]).dtype

dtype('O')

You can also call type conversion manually:

In [24]:
np.array([2, 4, 0]).astype(bool)

array([ True,  True, False])

In [25]:
np.array([2, 4, 0]).astype(float)

array([2., 4., 0.])

Or upon creation:

In [26]:
np.array([2, 4], dtype=float)

array([2., 4.])

**💪 Exercise**: create an array of three booleans, with dtype `str`:

In [27]:
np.array([True, True, False], dtype='str')

array(['True', 'True', 'False'], dtype='<U5')

#### Array Generation

Similar to the built-in `range`, generate an array of sequential numbers:

In [28]:
np.arange(7)

array([0, 1, 2, 3, 4, 5, 6])

**ℹ️ Tip**: it's called `a range` as in `an interval`, not `arrange` as in `align` — it confused me for the longest time.

A more powerful, non-integer counterpart:

In [29]:
np.linspace(start=0, stop=15, num=5)

array([ 0.  ,  3.75,  7.5 , 11.25, 15.  ])

Similarly, there is a logarithmic counterpart:

In [30]:
np.logspace(0, 3, num=4, base=10.)

array([   1.,   10.,  100., 1000.])
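As a small sanity check of how these generators relate, `np.logspace` with base 10 should match raising 10 to the points returned by `np.linspace`, and `np.arange` also accepts non-integer steps. The variable names below are only illustrative.

```python
import numpy as np

exponents = np.linspace(0, 3, num=4)         # 0, 1, 2, 3
powers = np.logspace(0, 3, num=4, base=10.)  # 10 raised to each exponent

# compare with a tolerance, since these are floating point numbers
same = np.allclose(powers, 10 ** exponents)

# `arange` takes (start, stop, step); the stop value is excluded
quarters = np.arange(0, 1, 0.25)
```

For non-integer steps, `np.linspace` is usually the safer choice, because the number of points is explicit rather than derived from floating point arithmetic.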

---

Generate an array of equal elements:

In [31]:
np.ones(4)

array([1., 1., 1., 1.])

In [32]:
np.zeros((3, 2))  # any shape

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

Specify shape based on another array's:

In [33]:
np.zeros_like(squares)  # same shape and dtype as `squares`

array([0, 0, 0, 0, 0, 0, 0, 0])

If you just want to instantiate an array and fill in the elements later, you can skip the filling step and create one with arbitrary (uninitialized) elements:

In [34]:
np.empty(50)  # the memory is not initialized, so you will likely see different values each time this is run

array([-0.00000000e+000, -0.00000000e+000,  1.33397724e-322,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000, -0.00000000e+000,
       -0.00000000e+000,  9.88131292e-323,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000, -0.00000000e+000, -0.00000000e+000,
        6.42285340e-323,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  4.57222660e-071, -0.00000000e+000,
       -0.00000000e+000,  4.82337433e+228,  6.14415221e-144,
        6.17582057e-322,

**💪 Exercise**: generate an array of 6 rows, 3 columns of ones:

In [35]:
np.ones((6, 3))

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

#### Array Copying

Assign a new "label" to the same array object (similar to `&` references in C-like languages):

In [36]:
a = np.ones(3)  # create an array, and "assign the label" `a` to it
a

array([1., 1., 1.])

In [37]:
b = a  # `b` is now another label for the same array

Modifications in any label affect the base object:

In [38]:
b[0] = 5
b

array([5., 1., 1.])

In [39]:
a  # modified indirectly

array([5., 1., 1.])

---

To "clone" the object, use `copy` instead:

In [40]:
a = np.ones(3)
a

array([1., 1., 1.])

In [41]:
b = np.copy(a)

In [42]:
b[0] = 5
b

array([5., 1., 1.])

In [43]:
a

array([1., 1., 1.])

**ℹ️ Tip**: this still fails if you store non-primitive data types:

In [44]:
a = np.array([
    {1, 2, 3},  # a set
    {5, 4},
])
a

array([{1, 2, 3}, {4, 5}], dtype=object)

In [45]:
b = np.copy(a)

In [46]:
b[0].remove(3)
b

array([{1, 2}, {4, 5}], dtype=object)

In [47]:
a  # still affected

array([{1, 2}, {4, 5}], dtype=object)

In this case, `deepcopy` would be useful, from the [copy built-in library](https://docs.python.org/3.7/library/copy.html).
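A minimal sketch of that approach, reusing the array of sets from above. `copy.deepcopy` clones the array together with the objects stored inside it:

```python
import copy
import numpy as np

a = np.array([{1, 2, 3}, {5, 4}])  # object array holding two sets
b = copy.deepcopy(a)               # clones the array and the sets inside it

b[0].remove(3)  # only modifies the set owned by `b`

print(a[0])  # {1, 2, 3}
print(b[0])  # {1, 2}
```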

#### Randomness

Random number generation sees most of its use in training predictive models, but it is also handy for creating example data and for some advanced data visualizations.

Uniformly distributed $V \sim U(0, 1)$:

In [48]:
np.random.rand(4, 2)

array([[0.63309625, 0.03335629],
       [0.66657849, 0.83209351],
       [0.90248498, 0.37580042],
       [0.71807952, 0.05045895]])

Standard normal distribution, with zero mean and unit standard deviation, $V \sim N(0, 1)$:

In [49]:
np.random.randn(3)

array([-0.30190248,  0.00974553, -0.63762446])

Uniformly distributed integers in the half-open interval $[a, b)$ (the upper bound is excluded), $V \sim U_{\mathbb{Z}}(a, b)$:

In [50]:
np.random.randint(0, 10, size=3)

array([7, 1, 3])

Sampling elements from a given set, with or without replacement (controlled by the `replace` argument, which defaults to `True`):

In [51]:
np.random.choice(['red', 'green', 'blue'], size=5)

array(['red', 'blue', 'red', 'red', 'red'], dtype='<U5')

Generating permutations $\sigma \in S_n$:

In [52]:
np.random.permutation([4, 2, 1])

array([2, 1, 4])

**ℹ️ Tip**: setting the random seed allows for reproducibility of results when randomness is involved. The results are still (pseudo-)random, but they are the same ones every time. Many scientific libraries draw from numpy's global random state by default, so `np.random.seed(123)` is often sufficient, though some libraries keep their own generator or expect an explicit seed argument. Read more about [random number generation](https://en.wikipedia.org/wiki/Pseudorandom_number_generator).
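The effect of seeding can be checked with a short sketch: two draws that start from the same seed are identical. The last two lines use `np.random.default_rng`, a self-contained generator that assumes a reasonably recent Numpy (1.17 or later); it is shown as an alternative, not as something this workshop relies on.

```python
import numpy as np

np.random.seed(123)
first = np.random.randint(0, 10, size=5)

np.random.seed(123)  # reset to the same seed
second = np.random.randint(0, 10, size=5)

# the two draws are identical because they start from the same seed
print((first == second).all())  # True

# a self-contained generator, independent of the global random state
rng = np.random.default_rng(123)
sample = rng.integers(0, 10, size=5)
```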

**💪 Exercise**: generate an array of 6 rows, 3 columns of random integers between `0` and `9`:

In [53]:
np.random.randint(0, 10, size=(6, 3))  # the upper bound is exclusive, so this yields 0 through 9

array([[1, 4, 7],
       [1, 2, 0],
       [6, 2, 2],
       [7, 1, 5],
       [2, 7, 8],
       [7, 5, 5]])

### Array Accessing

Index and slice access is similar to `list`s:

In [54]:
squares  # to remember what it contains

array([ 0,  1,  4,  9, 16, 25, 36, 49])

In [55]:
squares[2]  # remember, zero-indexed

4

In [56]:
squares[2:6]  # slices

array([ 4,  9, 16, 25])

---

It extends naturally to multi-dimensional arrays:

In [57]:
m  # to remember what it contains

array([[5, 2, 3],
       [4, 5, 1],
       [7, 1, 2],
       [6, 2, 9]])

In [58]:
m[:2]  # first two rows

array([[5, 2, 3],
       [4, 5, 1]])

In [59]:
m[:, :2]  # all rows, first two columns

array([[5, 2],
       [4, 5],
       [7, 1],
       [6, 2]])

In [60]:
m[:2, :2]  # first two rows of the first two columns

array([[5, 2],
       [4, 5]])

---

_Fancy_ indexing (this is the actual term) allows for accessing multiple elements at once:

In [61]:
indices = [4, 2, 2]  # can also repeat
squares[indices]

array([16,  4,  4])

In [62]:
row_indices = [0, 1]
col_indices = [0, 2]
m[row_indices, col_indices]

array([5, 1])
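Note that the two index lists above are paired up element by element, rather than forming a grid. The sketch below contrasts that with `np.ix_`, which selects the cross product of the given rows and columns; `m` is re-created here so the snippet is self-contained.

```python
import numpy as np

m = np.array([
    [5, 2, 3],
    [4, 5, 1],
    [7, 1, 2],
    [6, 2, 9],
])

# index lists are paired up: picks m[0, 0] and m[1, 2]
pairs = m[[0, 1], [0, 2]]

# `np.ix_` builds the cross product: rows 0 and 1, crossed with columns 0 and 2
block = m[np.ix_([0, 1], [0, 2])]

print(pairs)  # [5 1]
print(block)  # [[5 3]
              #  [4 1]]
```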

**💪 Exercise**: access all rows, columns 2 through 3 of `m`:

In [63]:
m[:, 1:3]

array([[2, 3],
       [5, 1],
       [1, 2],
       [2, 9]])

### Array Iteration

Iteration works the same:

In [64]:
for sq in squares:
    print(sq)

0
1
4
9
16
25
36
49


Enumeration has its n-dimensional counterpart:

In [65]:
for index, element in np.ndenumerate(m):
    print('index', index, 'element', element)

index (0, 0) element 5
index (0, 1) element 2
index (0, 2) element 3
index (1, 0) element 4
index (1, 1) element 5
index (1, 2) element 1
index (2, 0) element 7
index (2, 1) element 1
index (2, 2) element 2
index (3, 0) element 6
index (3, 1) element 2
index (3, 2) element 9


### Array Operations

Arithmetic operations are _vectorized_ — extended for array operations:

In [66]:
squares  # to remember what it contains

array([ 0,  1,  4,  9, 16, 25, 36, 49])

In [67]:
squares + 100  # add 100 to each element

array([100, 101, 104, 109, 116, 125, 136, 149])

In [68]:
squares ** .5  # raise every element to the power 0.5 (square root it)

array([0., 1., 2., 3., 4., 5., 6., 7.])

**ℹ️ Tip**: an array containing all `5`s is generated by `np.ones(dim) * 5`

---

Comparison operators are vectorized as well, and their result is a boolean array:

In [69]:
squares > 5

array([False, False, False,  True,  True,  True,  True,  True])

In [70]:
squares == 1

array([False,  True, False, False, False, False, False, False])

In [71]:
even = (squares % 2 == 0)
even

array([ True, False,  True, False,  True, False,  True, False])

---

There are also unary operators, such as negation:

In [72]:
~even

array([False,  True, False,  True, False,  True, False,  True])

In [73]:
-squares

array([  0,  -1,  -4,  -9, -16, -25, -36, -49])

---

Aggregations and other more complex operations are available as methods:

In [74]:
squares.sum()  # sum of all elements

140

In [75]:
sum(squares)  # same result with the built-in `sum`, but slower on large arrays

140

In [76]:
np.log(squares + 1)

array([0.        , 0.69314718, 1.60943791, 2.30258509, 2.83321334,
       3.25809654, 3.61091791, 3.91202301])

In [77]:
squares.mean()  # equivalent to sum/len

17.5

In [78]:
squares.std()  # standard deviation

16.680827317612277

In [79]:
squares.cumsum()  # cumulative sum

array([  0,   1,   5,  14,  30,  55,  91, 140])

---

In [80]:
a = np.array([2, 0, 4])

In [81]:
a.argmin()  # the index of the minimum element

1

In [82]:
a.argsort()  # the indices that would sort the array

array([1, 0, 2])

---

Operators naturally extend to multiple dimensions as well:

In [83]:
m * 10

array([[50, 20, 30],
       [40, 50, 10],
       [70, 10, 20],
       [60, 20, 90]])

In [84]:
m == 5

array([[ True, False, False],
       [False,  True, False],
       [False, False, False],
       [False, False, False]])

Element-wise application of operators between two arrays:

In [85]:
a

array([2, 0, 4])

In [86]:
b = np.array([9, 6, 6])

In [87]:
a + b

array([11,  6, 10])

In [88]:
a * b

array([18,  0, 24])

---

Logical (element-wise boolean) operations:

In [89]:
a = squares > 5
a

array([False, False, False,  True,  True,  True,  True,  True])

In [90]:
b = (squares % 2 == 0)
b

array([ True, False,  True, False,  True, False,  True, False])

In [91]:
a & b

array([False, False, False, False,  True, False,  True, False])

In [92]:
a | b

array([ True, False,  True,  True,  True,  True,  True,  True])

---

In [93]:
x = np.linspace(0, np.pi, num=5)
x

array([0.        , 0.78539816, 1.57079633, 2.35619449, 3.14159265])

In [94]:
np.sin(x).round(3)

array([0.   , 0.707, 1.   , 0.707, 0.   ])

---

Functions, when applied to multi-dimensional arrays, allow you to specify an axis. In a 2D matrix, that means either column-wise or row-wise:

In [95]:
m  # refresher

array([[5, 2, 3],
       [4, 5, 1],
       [7, 1, 2],
       [6, 2, 9]])

In [96]:
m.sum()  # overall sum of all elements, no axis specified

47

In [97]:
m.sum(axis=0)  # first axis, column wise — one result for each column

array([22, 10, 15])

In [98]:
m.sum(axis=1)  # per each row

array([10, 10, 10, 17])

---

The `*` operator gives the Hadamard product (element-wise multiplication) between matrices:

In [99]:
m * m

array([[25,  4,  9],
       [16, 25,  1],
       [49,  1,  4],
       [36,  4, 81]])

Matrix multiplication is done using the `@` operator (previously, using `a.dot(b)`):

In [100]:
m @ m.transpose()

array([[ 38,  33,  43,  61],
       [ 33,  42,  35,  43],
       [ 43,  35,  54,  62],
       [ 61,  43,  62, 121]])
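The shapes involved follow the usual matrix rule: a `(4, 3)` matrix times a `(3, 4)` matrix gives a `(4, 4)` result. A small sketch, re-creating `m` for self-containment, also checks that `@` and `.dot` agree here:

```python
import numpy as np

m = np.array([
    [5, 2, 3],
    [4, 5, 1],
    [7, 1, 2],
    [6, 2, 9],
])

product = m @ m.T  # (4, 3) @ (3, 4) -> (4, 4)
same = np.array_equal(product, m.dot(m.T))

print(product.shape)  # (4, 4)
print(same)           # True
```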

---

**💪 Exercise**: re-generate the `squares` array, but using numpy:

In [101]:
n = len(squares)
np.arange(n) ** 2

array([ 0,  1,  4,  9, 16, 25, 36, 49])

#### Reshaping

Arrays can be morphed into a different (compatible) shape:

In [102]:
squares  # original

array([ 0,  1,  4,  9, 16, 25, 36, 49])

In [103]:
squares.reshape(2, 4)  # 2 rows, 4 columns

array([[ 0,  1,  4,  9],
       [16, 25, 36, 49]])

In [104]:
squares.reshape(4, 2)  # 4 rows, 2 columns

array([[ 0,  1],
       [ 4,  9],
       [16, 25],
       [36, 49]])

Flatten an array of any shape with `.reshape(-1)`:

In [105]:
m.reshape(-1)

array([5, 2, 3, 4, 5, 1, 7, 1, 2, 6, 2, 9])

In [106]:
image.reshape(-1).shape

(72,)
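The `-1` is not specific to flattening: it asks Numpy to infer one dimension from the total number of elements. A short sketch, with `squares` re-created so the snippet runs on its own:

```python
import numpy as np

squares = np.arange(8) ** 2

# -1 asks Numpy to infer that dimension from the total size
two_rows = squares.reshape(2, -1)   # shape (2, 4)
four_rows = squares.reshape(-1, 2)  # shape (4, 2)

# `ravel` is another way to flatten
flat = two_rows.ravel()
```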

Transposition (axis inversion):

In [107]:
m.T

array([[5, 4, 7, 6],
       [2, 5, 1, 2],
       [3, 1, 2, 9]])

---

Generating new axes can be useful when certain functions require the data in a particular shape, even if the extra axis is degenerate (of size 1):

In [108]:
squares[:, np.newaxis]  # make each element be a list

array([[ 0],
       [ 1],
       [ 4],
       [ 9],
       [16],
       [25],
       [36],
       [49]])

In [109]:
squares[np.newaxis, :]  # wrap the array

array([[ 0,  1,  4,  9, 16, 25, 36, 49]])

In [110]:
squares.shape  # original shape

(8,)

In [111]:
squares[:, np.newaxis].shape

(8, 1)

In [112]:
squares[np.newaxis, :].shape

(1, 8)

**ℹ️ Tip**: shapes are simply "views" of the underlying data, which is stored the same way, regardless of assigned shape. Read more about how [data is stored internally](https://docs.scipy.org/doc/numpy-1.13.0/reference/internals.html).
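One practical consequence, sketched below under the assumption of a simple contiguous array: a reshaped array shares memory with the original, so writing through one is visible in the other.

```python
import numpy as np

a = np.arange(6)
view = a.reshape(2, 3)  # same underlying buffer, different shape

shared = np.shares_memory(a, view)  # True for this contiguous array

view[0, 0] = 100  # writing through the view ...
print(a[0])       # ... changes the original: 100
```

If an independent array is needed, call `.copy()` on the reshaped result.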

**💪 Exercise**: in how many ways can an array of size 12 be reshaped?

In [113]:
a = np.arange(12)
a

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [114]:
a.shape

(12,)

In [115]:
a.reshape(2, 6)

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11]])

In [116]:
a.reshape(3, 4)

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [117]:
a.reshape(4, 3)

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [118]:
a.reshape(6, 2)

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11]])

In [119]:
a.reshape(12, 1)

array([[ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10],
       [11]])

**👾 Trivia**: [this is why](https://www.youtube.com/watch?v=U6xJfP7-HCc) a base-12 number system would make arithmetic easier.

### Broadcasting

The same operator, `+`, is used both when adding a constant to each element and when performing element-wise addition. This concept extends to arbitrarily many dimensions: the operand with the smaller shape is _broadcast_ (virtually repeated) until it matches the shape of the other operand.

In [120]:
squares + 10

array([10, 11, 14, 19, 26, 35, 46, 59])

In [121]:
tens = [10] * 8  # eight elements, each equal to 10
squares + tens  # what `squares + 10` does behind the scenes: the scalar is broadcast to match the array's shape

array([10, 11, 14, 19, 26, 35, 46, 59])

It becomes less trivial in higher dimensions:

In [122]:
m  # content refresher

array([[5, 2, 3],
       [4, 5, 1],
       [7, 1, 2],
       [6, 2, 9]])

In [123]:
m + [100, 10, 0]  # add these values to each row
# for each row: +100 to the first element, +10 to the second, +0 to the third

array([[105,  12,   3],
       [104,  15,   1],
       [107,  11,   2],
       [106,  12,   9]])

In [124]:
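m + [[1000], [100], [10], [0]]  # add these values to each column
# for each column: +1000 to the first element, +100 to the second, +10 to the third, +0 to the fourth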

array([[1005, 1002, 1003],
       [ 104,  105,  101],
       [  17,   11,   12],
       [   6,    2,    9]])

Read more about [broadcasting](https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html).
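To make the mechanism more tangible, the sketch below spells out the stretching explicitly with `np.broadcast_to`, and shows a pair of shapes that cannot be broadcast. The names `row`, `stretched` and `compatible` are illustrative only.

```python
import numpy as np

m = np.array([
    [5, 2, 3],
    [4, 5, 1],
    [7, 1, 2],
    [6, 2, 9],
])
row = np.array([100, 10, 0])  # shape (3,)

# what broadcasting does conceptually: repeat `row` along the missing axis
stretched = np.broadcast_to(row, m.shape)  # shape (4, 3), no data is copied

same = np.array_equal(m + row, m + stretched)

# shapes are compared from the last dimension backwards;
# each pair of sizes must be equal, or one of them must be 1
try:
    m + np.array([1, 2, 3, 4])  # (4, 3) vs (4,): 3 != 4
    compatible = True
except ValueError:
    compatible = False
```

This is why the column-wise example needs the values wrapped as a `(4, 1)` array: only then does the trailing dimension have size 1 and stretch across the three columns.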

### Array Masking

Boolean indexing — access only those elements where the indexing array is `True`:

In [125]:
mask = (squares > 5)

In [126]:
mask

array([False, False, False,  True,  True,  True,  True,  True])

In [127]:
squares[mask]

array([ 9, 16, 25, 36, 49])

**💪 Exercise**: select only even `squares`:

In [128]:
squares[squares % 2 == 0]

array([ 0,  4, 16, 36])
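Masks can also be used on the left-hand side of an assignment, to overwrite only the selected elements. The sketch below caps large values this way and, as an alternative, uses `np.where` to choose between two values element by element. The threshold of 20 is an arbitrary choice for illustration.

```python
import numpy as np

squares = np.arange(8) ** 2

capped = squares.copy()   # work on a copy, leaving `squares` untouched
capped[capped > 20] = 20  # overwrite only the elements selected by the mask

# `np.where` picks from two alternatives, element by element
zeroed = np.where(squares > 5, squares, 0)

print(capped)  # [ 0  1  4  9 16 20 20 20]
print(zeroed)  # [ 0  0  0  9 16 25 36 49]
```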

### Array Extension

Since `+` is reserved for addition, array concatenation is done by function:

In [129]:
np.concatenate([squares, squares])

array([ 0,  1,  4,  9, 16, 25, 36, 49,  0,  1,  4,  9, 16, 25, 36, 49])

In the multi-dimensional case:

In [130]:
a = np.arange(6).reshape(3, 2)
a

array([[0, 1],
       [2, 3],
       [4, 5]])

In [131]:
b = np.ones((2, 2))
b

array([[1., 1.],
       [1., 1.]])

In [132]:
np.concatenate([a, b])

array([[0., 1.],
       [2., 3.],
       [4., 5.],
       [1., 1.],
       [1., 1.]])

---

In [133]:
c = np.zeros((2, 2))

In [134]:
np.vstack([b, c])  # on top of each other

array([[1., 1.],
       [1., 1.],
       [0., 0.],
       [0., 0.]])

In [135]:
np.hstack([b, c])  # next to each other

array([[1., 1., 0., 0.],
       [1., 1., 0., 0.]])
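Both helpers can be expressed through `np.concatenate` and its `axis` argument, which defaults to `0`. A brief sketch for the 2D case:

```python
import numpy as np

b = np.ones((2, 2))
c = np.zeros((2, 2))

stacked_rows = np.concatenate([b, c], axis=0)  # shape (4, 2), like vstack
stacked_cols = np.concatenate([b, c], axis=1)  # shape (2, 4), like hstack
```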

## Pandas Dataframes

Dataframes are easy-to-use, feature-rich data structures for data handling and analysis. They can be thought of as non-homogeneous matrices for row/column labeled data, which also offer a lot of extra functionality.

In [136]:
import pandas as pd  # the usual abbreviation

### Creation

Instantiate from a 2D array:

In [137]:
pd.DataFrame(m)

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 5 | 2 | 3 |
| 1 | 4 | 5 | 1 |
| 2 | 7 | 1 | 2 |
| 3 | 6 | 2 | 9 |


When working with labeled data, provide a dict of `column_name : value_for_each_row`:

In [138]:
students = pd.DataFrame({
    'height':    np.random.randint(150, 200, size=5),
    'weight':    np.random.randint(50,  100, size=5),
    'graduated': np.random.random(size=5) > .5,  # similar to np.random.randint(0, 2, size=5).astype(bool)
})

Each row is an observation (i.e.: a student), each column is a variable (i.e.: a measurement):

In [139]:
students

|   | height | weight | graduated |
|---|---|---|---|
| 0 | 155 | 88 | False |
| 1 | 187 | 67 | True |
| 2 | 155 | 75 | False |
| 3 | 160 | 64 | True |
| 4 | 185 | 98 | True |


_Note_: since randomness is involved, when you run this notebook, you'll likely see different results.

Dataframes are implicitly indexed by integers, but rows can be assigned more descriptive indices:

In [140]:
students.index = list('abcde')

In [141]:
students

|   | height | weight | graduated |
|---|---|---|---|
| a | 155 | 88 | False |
| b | 187 | 67 | True |
| c | 155 | 75 | False |
| d | 160 | 64 | True |
| e | 185 | 98 | True |


**💪 Exercise**: create a new dataframe, `food_stats`, which contains ratings, on a 1-3 scale for `tasty`, `healthy` and whether you had it recently (`had_recently`) for the following types of food: `pizza`, `carrot`, `chocolate`, `banana`:

In [142]:
food_stats = pd.DataFrame({
    'pizza':     (2, 2, True),
    'carrot':    (1, 3, True),
    'chocolate': (3, 2, False),
    'banana':    (3, 3, True),
}, index=['tasty', 'healthy', 'had_recently']).T


food_stats

|   | tasty | healthy | had_recently |
|---|---|---|---|
| pizza | 2 | 2 | True |
| carrot | 1 | 3 | True |
| chocolate | 3 | 2 | False |
| banana | 3 | 3 | True |


In [143]:
food_stats = pd.DataFrame({
    'tasty':        [2, 1, 3, 2],
    'healthy':      [1, 3, 2, 3],
    'had_recently': [True, True, False, True]
}, index=['pizza', 'carrot', 'chocolate', 'banana'])

food_stats

|   | tasty | healthy | had_recently |
|---|---|---|---|
| pizza | 2 | 1 | True |
| carrot | 1 | 3 | True |
| chocolate | 3 | 2 | False |
| banana | 2 | 3 | True |


### Properties

Shape and data type are extended to dataframes as well:

In [144]:
len(students)  # size of first dimension, i.e.: number of rows

5

In [145]:
students.shape

(5, 3)

In [146]:
students.dtypes

height       int64
weight       int64
graduated     bool
dtype: object

In [147]:
students.height.astype(float)

a    155.0
b    187.0
c    155.0
d    160.0
e    185.0
Name: height, dtype: float64

**💪 Exercise**: check the `shape` and `dtypes` of your `food_stats`:

In [148]:
food_stats.shape

(4, 3)

In [149]:
food_stats.dtypes

tasty           int64
healthy         int64
had_recently     bool
dtype: object

### Accessing

Access a row by its index label:

In [150]:
students.loc['a']  # the observations for student A

height         155
weight          88
graduated    False
Name: a, dtype: object

Access rows by their position, regardless of index name:

In [151]:
students.iloc[0]

height         155
weight          88
graduated    False
Name: a, dtype: object
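The distinction matters most when the index labels are themselves integers. The sketch below uses a small hypothetical dataframe, `scores`, whose labels do not start at zero:

```python
import pandas as pd

# hypothetical data: the index labels are integers that do not start at 0
scores = pd.DataFrame({'score': [7, 8, 9]}, index=[10, 20, 30])

by_label = scores.loc[10, 'score']  # the row labeled 10
by_position = scores.iloc[2, 0]     # third row, first column, whatever their labels

print(by_label)     # 7
print(by_position)  # 9
```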

Slices are extended row-wise:

In [152]:
students[:3]

|   | height | weight | graduated |
|---|---|---|---|
| a | 155 | 88 | False |
| b | 187 | 67 | True |
| c | 155 | 75 | False |


---

Access a specific column:

In [153]:
students.graduated  # we can also access column-wise

a    False
b     True
c    False
d     True
e     True
Name: graduated, dtype: bool

Access multiple columns at once:

In [154]:
columns = ['weight', 'height']
students[columns]

|   | weight | height |
|---|---|---|
| a | 88 | 155 |
| b | 67 | 187 |
| c | 75 | 155 |
| d | 64 | 160 |
| e | 98 | 185 |


Masking is performed row-wise by providing a boolean array with the same length as the number of rows:

In [155]:
students[students.graduated]

|   | height | weight | graduated |
|---|---|---|---|
| b | 187 | 67 | True |
| d | 160 | 64 | True |
| e | 185 | 98 | True |


---

Randomly sample rows:

In [156]:
students.sample(3)

|   | height | weight | graduated |
|---|---|---|---|
| d | 160 | 64 | True |
| c | 155 | 75 | False |
| a | 155 | 88 | False |


In [157]:
students.sample(frac=.5)  # half of all rows

|   | height | weight | graduated |
|---|---|---|---|
| b | 187 | 67 | True |
| a | 155 | 88 | False |


In [158]:
len(students)

5

---

Iterating over the dataframe defaults to going over its columns:

In [159]:
for column in students:
    print(column)

height
weight
graduated


`iterrows()` is used to iterate over each row:

In [160]:
for student, row in students.iterrows():
    print(student, row.height, row.weight)

a 155 88
b 187 67
c 155 75
d 160 64
e 185 98


**💪 Exercise**: access the `tasty` column, rows 2 through 3 of `food_stats`:

In [161]:
food_stats[1:3].tasty

carrot       1
chocolate    3
Name: tasty, dtype: int64

**💪 Exercise**: select rows for students yet to graduate:

In [162]:
students[~students.graduated]

|   | height | weight | graduated |
|---|---|---|---|
| a | 155 | 88 | False |
| c | 155 | 75 | False |


### Extension

To add a new row, assign the value for each column to a new index label with `loc` (`iloc` only addresses existing positions and cannot enlarge the dataframe):

In [163]:
students.loc['x'] = (170, 70, True)

In [164]:
students

|   | height | weight | graduated |
|---|---|---|---|
| a | 155 | 88 | False |
| b | 187 | 67 | True |
| c | 155 | 75 | False |
| d | 160 | 64 | True |
| e | 185 | 98 | True |
| x | 170 | 70 | True |


**ℹ️ Tip**: even though `df.loc[len(df)] = ...` can be used to append at the end of a dataframe that has the default integer index, this is not recommended, since growing a dataframe one row at a time is slow. If you wish to create a dataframe iteratively, instead of appending each element, store them in a different container and convert the data to a dataframe at the end. This is also why the `DataFrame.append` method was deprecated and then removed in pandas 2.0, in favor of `pd.concat`.
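A minimal sketch of that pattern, with made-up records: the rows are collected in a plain list of dicts and converted once at the end. The names `records` and `collected` are illustrative.

```python
import pandas as pd

# hypothetical records, collected one at a time (e.g. in a loop)
records = []
for name, height, weight in [('a', 155, 88), ('b', 187, 67), ('c', 155, 75)]:
    records.append({'name': name, 'height': height, 'weight': weight})

# a single conversion at the end, instead of growing a dataframe row by row
collected = pd.DataFrame(records).set_index('name')
```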

To add a new column, assign it directly to the dataframe and specify a value for each row:

In [165]:
students['age'] = np.random.randint(18, 24, size=len(students))

In [166]:
students

|   | height | weight | graduated | age |
|---|---|---|---|---|
| a | 155 | 88 | False | 21 |
| b | 187 | 67 | True | 19 |
| c | 155 | 75 | False | 18 |
| d | 160 | 64 | True | 20 |
| e | 185 | 98 | True | 20 |
| x | 170 | 70 | True | 18 |


**ℹ️ Tip**: the bracket syntax (`df['column'] = ...`) must be used for column creation. The attribute syntax (`df.column`) only works for accessing existing columns.

Create a column based on another:

In [167]:
students['can_ride'] = (students.height > 170)  # "you must be this tall to ride the roller coaster"

In [168]:
students

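|   | height | weight | graduated | age | can_ride |
|---|---|---|---|---|---|
| a | 155 | 88 | False | 21 | False |
| b | 187 | 67 | True | 19 | True |
| c | 155 | 75 | False | 18 | False |
| d | 160 | 64 | True | 20 | False |
| e | 185 | 98 | True | 20 | True |
| x | 170 | 70 | True | 18 | False |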


---

Add multiple new rows, from another dataframe:

In [169]:
new_students = pd.DataFrame({
    'height': [160, 180],
    'weight': [ 60,  80],
})

new_students

|   | height | weight |
|---|---|---|
| 0 | 160 | 60 |
| 1 | 180 | 80 |


In [170]:
students = pd.concat([students, new_students], sort=False)
students

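|   | height | weight | graduated | age | can_ride |
|---|---|---|---|---|---|
| a | 155 | 88 | False | 21.0 | False |
| b | 187 | 67 | True | 19.0 | True |
| c | 155 | 75 | False | 18.0 | False |
| d | 160 | 64 | True | 20.0 | False |
| e | 185 | 98 | True | 20.0 | True |
| x | 170 | 70 | True | 18.0 | False |
| 0 | 160 | 60 | NaN | NaN | NaN |
| 1 | 180 | 80 | NaN | NaN | NaN |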


_Note_: it's `pd.concat`, but `np.concatenate`.

Add multiple new columns, from another dataframe:

In [171]:
n_students = len(students)

new_info = pd.DataFrame({
    'fav_number':   np.random.randint(0, 100, size=n_students),
    'fav_icecream': np.random.choice(['vanilla', 'chocolate', 'strawberry'], size=n_students),
})

In [172]:
new_info.index = students.index  # set the same index, to make merging (and viewing) easier
new_info

|   | fav_number | fav_icecream |
|---|---|---|
| a | 61 | strawberry |
| b | 36 | vanilla |
| c | 73 | strawberry |
| d | 38 | vanilla |
| e | 1 | chocolate |
| x | 73 | strawberry |
| 0 | 7 | vanilla |
| 1 | 29 | strawberry |


In [173]:
students = students.merge(new_info, left_index=True, right_index=True)
students

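|   | height | weight | graduated | age | can_ride | fav_number | fav_icecream |
|---|---|---|---|---|---|---|---|
| a | 155 | 88 | False | 21.0 | False | 61 | strawberry |
| b | 187 | 67 | True | 19.0 | True | 36 | vanilla |
| c | 155 | 75 | False | 18.0 | False | 73 | strawberry |
| d | 160 | 64 | True | 20.0 | False | 38 | vanilla |
| e | 185 | 98 | True | 20.0 | True | 1 | chocolate |
| x | 170 | 70 | True | 18.0 | False | 73 | strawberry |
| 0 | 160 | 60 | NaN | NaN | NaN | 7 | vanilla |
| 1 | 180 | 80 | NaN | NaN | NaN | 29 | strawberry |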


**💪 Exercise**: add a new entry for `hamburger` in the `food_stats` dataframe:

In [174]:
food_stats.loc['hamburger'] = (3, 1, False)
food_stats

|   | tasty | healthy | had_recently |
|---|---|---|---|
| pizza | 2 | 1 | True |
| carrot | 1 | 3 | True |
| chocolate | 3 | 2 | False |
| banana | 2 | 3 | True |
| hamburger | 3 | 1 | False |


### Missing Values

Since we only provided the `height` and `weight` measurements for the new students, `NaN` (not a number) is placed by default in the other columns:

In [175]:
students

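|   | height | weight | graduated | age | can_ride | fav_number | fav_icecream |
|---|---|---|---|---|---|---|---|
| a | 155 | 88 | False | 21.0 | False | 61 | strawberry |
| b | 187 | 67 | True | 19.0 | True | 36 | vanilla |
| c | 155 | 75 | False | 18.0 | False | 73 | strawberry |
| d | 160 | 64 | True | 20.0 | False | 38 | vanilla |
| e | 185 | 98 | True | 20.0 | True | 1 | chocolate |
| x | 170 | 70 | True | 18.0 | False | 73 | strawberry |
| 0 | 160 | 60 | NaN | NaN | NaN | 7 | vanilla |
| 1 | 180 | 80 | NaN | NaN | NaN | 29 | strawberry |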


Missing value detection:

In [176]:
pd.isna(students)

|   | height | weight | graduated | age | can_ride | fav_number | fav_icecream |
|---|---|---|---|---|---|---|---|
| a | False | False | False | False | False | False | False |
| b | False | False | False | False | False | False | False |
| c | False | False | False | False | False | False | False |
| d | False | False | False | False | False | False | False |
| e | False | False | False | False | False | False | False |
| x | False | False | False | False | False | False | False |
| 0 | False | False | True | True | True | False | False |
| 1 | False | False | True | True | True | False | False |


Column-wise:

In [177]:
pd.isna(students.age)

a    False
b    False
c    False
d    False
e    False
x    False
0     True
1     True
Name: age, dtype: bool

The easiest method for handling missing data is dropping the observation altogether:

In [178]:
students.dropna()

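|   | height | weight | graduated | age | can_ride | fav_number | fav_icecream |
|---|---|---|---|---|---|---|---|
| a | 155 | 88 | False | 21.0 | False | 61 | strawberry |
| b | 187 | 67 | True | 19.0 | True | 36 | vanilla |
| c | 155 | 75 | False | 18.0 | False | 73 | strawberry |
| d | 160 | 64 | True | 20.0 | False | 38 | vanilla |
| e | 185 | 98 | True | 20.0 | True | 1 | chocolate |
| x | 170 | 70 | True | 18.0 | False | 73 | strawberry |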


Another method is filling it with some default value:

In [179]:
students.fillna({
    'graduated': False,
    'can_ride':  False,
    'age':       20,
})

|   | height | weight | graduated | age | can_ride | fav_number | fav_icecream |
|---|---|---|---|---|---|---|---|
| a | 155 | 88 | False | 21.0 | False | 61 | strawberry |
| b | 187 | 67 | True | 19.0 | True | 36 | vanilla |
| c | 155 | 75 | False | 18.0 | False | 73 | strawberry |
| d | 160 | 64 | True | 20.0 | False | 38 | vanilla |
| e | 185 | 98 | True | 20.0 | True | 1 | chocolate |
| x | 170 | 70 | True | 18.0 | False | 73 | strawberry |
| 0 | 160 | 60 | False | 20.0 | False | 7 | vanilla |
| 1 | 180 | 80 | False | 20.0 | False | 29 | strawberry |


More methods for handling missing values will be explored in workshop 4.

**ℹ️ Tip**: we must use `pd.isna` to identify NaNs, instead of `== np.nan`, because `np.nan` is a special value which, by design, is not equal to anything, not even to itself. Read more about [ternary logic](https://en.wikipedia.org/wiki/Three-valued_logic) and non-[finite numpy numbers](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.isfinite.html#numpy.isfinite).
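This behavior is easy to verify on a small, made-up series of ages: the equality check finds nothing, while `pd.isna` flags the missing entry.

```python
import numpy as np
import pandas as pd

ages = pd.Series([21.0, np.nan, 18.0])  # hypothetical values, one of them missing

naive = (ages == np.nan)  # all False: NaN never compares equal
proper = pd.isna(ages)    # True exactly where a value is missing

print(np.nan == np.nan)  # False
print(naive.tolist())    # [False, False, False]
print(proper.tolist())   # [False, True, False]
```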

### Deleting

Delete some columns:

In [180]:
students.drop(['weight', 'age'], axis=1)

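|   | height | graduated | can_ride | fav_number | fav_icecream |
|---|---|---|---|---|---|
| a | 155 | False | False | 61 | strawberry |
| b | 187 | True | True | 36 | vanilla |
| c | 155 | False | False | 73 | strawberry |
| d | 160 | True | False | 38 | vanilla |
| e | 185 | True | True | 1 | chocolate |
| x | 170 | True | False | 73 | strawberry |
| 0 | 160 | NaN | NaN | 7 | vanilla |
| 1 | 180 | NaN | NaN | 29 | strawberry |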


Delete some rows, index-wise:

In [181]:
students.drop(['a', 'x'], axis=0)

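|   | height | weight | graduated | age | can_ride | fav_number | fav_icecream |
|---|---|---|---|---|---|---|---|
| b | 187 | 67 | True | 19.0 | True | 36 | vanilla |
| c | 155 | 75 | False | 18.0 | False | 73 | strawberry |
| d | 160 | 64 | True | 20.0 | False | 38 | vanilla |
| e | 185 | 98 | True | 20.0 | True | 1 | chocolate |
| 0 | 160 | 60 | NaN | NaN | NaN | 7 | vanilla |
| 1 | 180 | 80 | NaN | NaN | NaN | 29 | strawberry |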


---

Deleting rows based on a boolean filtering is done by masking:

In [182]:
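mask = (students.age > 21)  # keep only the students older than 21 (none in this sample, so the result is empty)
students[mask]

|   | height | weight | graduated | age | can_ride | fav_number | fav_icecream |
|---|---|---|---|---|---|---|---|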


Dropping duplicate rows can be done using `drop_duplicates`, but since our dataframe contains no such rows, we will restrict it to duplicates on just the `can_ride` and `fav_icecream` columns:

In [183]:
students.drop_duplicates(subset=['can_ride', 'fav_icecream'], keep='first')

Unnamed: 0,height,weight,graduated,age,can_ride,fav_number,fav_icecream
a,155,88,False,21.0,False,61,strawberry
b,187,67,True,19.0,True,36,vanilla
d,160,64,True,20.0,False,38,vanilla
e,185,98,True,20.0,True,1,chocolate
0,160,60,,,,7,vanilla
1,180,80,,,,29,strawberry


---

Deleting, much like any operation in the next subsection, does not operate _in place_. This means the `drop` function returns a new dataframe object, which is created by (deep) copying the original one and applying the operation on it. Writing your own functions in such a way helps prevent unexpected and hard-to-trace side effects. It also allows for method chaining, e.g. `df.transpose().mean().round()`. Read more about [functional programming](https://hackernoon.com/learn-functional-python-in-10-minutes-to-2d1651dece6f) and [immutability in Python](https://www.pythonforthelab.com/blog/mutable-and-immutable-objects/).
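
A minimal sketch of what "not in place" means, using a tiny hypothetical dataframe `df` rather than `students`:

```python
import pandas as pd

df = pd.DataFrame({'height': [155, 187], 'weight': [88, 67]}, index=['a', 'b'])

lighter = df.drop('weight', axis=1)  # returns a new dataframe

print(df.columns.tolist())       # ['height', 'weight']: the original is untouched
print(lighter.columns.tolist())  # ['height']

# because each call returns a new object, calls can be chained
summary = df.drop('weight', axis=1).mean().round()
print(summary)
```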

If you do wish to "update" the same object, assign the result to the same variable:

In [184]:
students = students.dropna()

Some functions also offer the `inplace` option:

In [185]:
students.dropna(inplace=True)  # equivalent to above

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


**💪 Exercise**: Drop the `graduated` column from `students` (not in place!):

In [186]:
students.drop('graduated', axis=1)

Unnamed: 0,height,weight,age,can_ride,fav_number,fav_icecream
a,155,88,21.0,False,61,strawberry
b,187,67,19.0,True,36,vanilla
c,155,75,18.0,False,73,strawberry
d,160,64,20.0,False,38,vanilla
e,185,98,20.0,True,1,chocolate
x,170,70,18.0,False,73,strawberry


### Dataframe Operations

Array-wise functions and operations extend naturally to dataframes.

**ℹ️ Tip**: a dataframe is composed of multiple `pd.Series`. Each column can be viewed as a series, and so can each row. A series is a "labeled list", where each `value` has an `index`.
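
A short sketch of this, on a hypothetical two-row dataframe `df`: selecting a column or a row both yield a `pd.Series`, only the index differs.

```python
import pandas as pd

df = pd.DataFrame(
    {'height': [155, 187], 'weight': [88, 67]},
    index=['a', 'b'],
)

column = df['height']  # a column is a Series, indexed by the row labels
row = df.loc['a']      # a row is a Series, indexed by the column names

print(type(column).__name__, column.index.tolist())  # Series ['a', 'b']
print(type(row).__name__, row.index.tolist())        # Series ['height', 'weight']
```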

In [187]:
students  # contents refresher

Unnamed: 0,height,weight,graduated,age,can_ride,fav_number,fav_icecream
a,155,88,False,21.0,False,61,strawberry
b,187,67,True,19.0,True,36,vanilla
c,155,75,False,18.0,False,73,strawberry
d,160,64,True,20.0,False,38,vanilla
e,185,98,True,20.0,True,1,chocolate
x,170,70,True,18.0,False,73,strawberry


#### Arithmetic

In [188]:
students.weight - 10  # if only losing weight was this easy 😅

a    78
b    57
c    65
d    54
e    88
x    60
Name: weight, dtype: int64

In [189]:
students.height + students.weight  # note that the resulting series still has the same indices

a    243
b    254
c    230
d    224
e    283
x    240
dtype: int64

In [190]:
students.sum()

height                                                       1012
weight                                                        462
graduated                                                       4
age                                                           116
can_ride                                                        2
fav_number                                                    282
fav_icecream    strawberryvanillastrawberryvanillachocolatestr...
dtype: object

In [191]:
students.mean()

height        168.666667
weight         77.000000
graduated       0.666667
age            19.333333
can_ride        0.333333
fav_number     47.000000
dtype: float64

**ℹ️ Tip**: The sum of the boolean series `graduated` is the number of people that graduated. The mean of `graduated` is the sum divided by the total number of students, which is precisely the fraction of students that graduated (multiply by 100 for the percentage).
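
The same idea on a standalone boolean series (a sketch; the values mirror the `graduated` column shown above):

```python
import pandas as pd

graduated = pd.Series([False, True, False, True, True, True])

n_graduated = graduated.sum()       # True counts as 1, False as 0
share_graduated = graduated.mean()  # fraction of True values

print(n_graduated)                # 4
print(round(share_graduated, 3))  # 0.667
print(f'{share_graduated:.0%}')   # 67%
```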

Most operations accept an `axis` argument, which can either be `0` (column-wise, default), or `1` (row-wise):

In [192]:
students.mean(axis=1)  # the average, for each student, of their numeric values (height, weight, age, favorite number), which doesn't make much sense

a    81.25
b    77.25
c    80.25
d    70.50
e    76.00
x    82.75
dtype: float64
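
To make the meaning of `axis` concrete, here is a minimal sketch on a hypothetical `scores` dataframe where both directions are meaningful:

```python
import pandas as pd

scores = pd.DataFrame(
    {'math': [4, 2], 'art': [2, 4], 'pe': [3, 6]},
    index=['alice', 'bob'],
)

per_column = scores.sum(axis=0)  # one value per column (collapses the rows)
per_row = scores.sum(axis=1)     # one value per row (collapses the columns)

print(per_column.tolist())  # [6, 6, 9]
print(per_row.tolist())     # [9, 12]
```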

**💪 Exercise**: get the sums of the `tasty` and `healthy` ratings in `food_stats`:

In [193]:
food_stats.tasty + food_stats.healthy

pizza        3
carrot       4
chocolate    5
banana       5
hamburger    4
dtype: int64

#### Logical

In [194]:
students.age > 21

a    False
b    False
c    False
d    False
e    False
x    False
Name: age, dtype: bool

In [195]:
students.graduated & ~students.can_ride  # students who graduated but cannot ride

a    False
b    False
c    False
d     True
e    False
x     True
dtype: bool

In [196]:
students.graduated.any()

True

In [197]:
students.graduated.all()

False

In [198]:
students == 21

Unnamed: 0,height,weight,graduated,age,can_ride,fav_number,fav_icecream
a,False,False,False,True,False,False,False
b,False,False,False,False,False,False,False
c,False,False,False,False,False,False,False
d,False,False,False,False,False,False,False
e,False,False,False,False,False,False,False
x,False,False,False,False,False,False,False


**💪 Exercise**: which tasty ($\ge 2$) food items have you had recently?

In [200]:
(food_stats.tasty >= 2) & food_stats.had_recently

pizza         True
carrot       False
chocolate    False
banana        True
hamburger    False
dtype: bool

#### String Functions

String methods for textual columns are available through the `.str` accessor:

In [201]:
students.fav_icecream

a    strawberry
b       vanilla
c    strawberry
d       vanilla
e     chocolate
x    strawberry
Name: fav_icecream, dtype: object

In [202]:
students.fav_icecream.str.title()

a    Strawberry
b       Vanilla
c    Strawberry
d       Vanilla
e     Chocolate
x    Strawberry
Name: fav_icecream, dtype: object

In [203]:
students.fav_icecream.str.replace('straw', 'banned-')

a    banned-berry
b         vanilla
c    banned-berry
d         vanilla
e       chocolate
x    banned-berry
Name: fav_icecream, dtype: object

In [204]:
students.fav_icecream.str.contains('e')

a     True
b    False
c     True
d    False
e     True
x     True
Name: fav_icecream, dtype: bool

**💪 Exercise**: make the name (the `index`) of the food items in `food_stats` uppercase:

In [205]:
food_stats.index.str.upper()

Index(['PIZZA', 'CARROT', 'CHOCOLATE', 'BANANA', 'HAMBURGER'], dtype='object')

#### Arbitrary Functions

In [206]:
students.height.apply(lambda h: (h // 10) * 10)  # apply to each element of a column: round down to the nearest 10

a    150
b    180
c    150
d    160
e    180
x    170
Name: height, dtype: int64

In [207]:
students.apply(lambda row: row.height + row.weight, axis=1)  # apply row-wise

a    243
b    254
c    230
d    224
e    283
x    240
dtype: int64

In [208]:
def relabel_boolean(x):
    # if the argument is not a boolean, leave it as it is
    if type(x) is not bool:
        return x
    return 'yes' if x is True else 'no'

In [209]:
students.applymap(relabel_boolean)  # apply element-wise

Unnamed: 0,height,weight,graduated,age,can_ride,fav_number,fav_icecream
a,155,88,no,21.0,no,61,strawberry
b,187,67,yes,19.0,yes,36,vanilla
c,155,75,no,18.0,no,73,strawberry
d,160,64,yes,20.0,no,38,vanilla
e,185,98,yes,20.0,yes,1,chocolate
x,170,70,yes,18.0,no,73,strawberry
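
In recent pandas releases (2.1 and later), `applymap` is deprecated in favor of the equivalent `DataFrame.map`. The sketch below picks whichever is available; `df` is a small hypothetical dataframe, and the function is a slight variation of `relabel_boolean` that also accepts numpy booleans.

```python
import numpy as np
import pandas as pd

def relabel_boolean(x):
    # leave non-boolean values unchanged
    if not isinstance(x, (bool, np.bool_)):
        return x
    return 'yes' if x else 'no'

df = pd.DataFrame(
    {'height': [155, 187], 'graduated': [False, True]},
    index=['a', 'b'],
)

# use `DataFrame.map` when it exists, fall back to `applymap` otherwise
elementwise = df.map if hasattr(df, 'map') else df.applymap
relabeled = elementwise(relabel_boolean)

print(relabeled)
```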


#### Statistical

One of the most useful shortcuts is `describe`, which quickly provides a list of descriptive statistics about each numeric column:

In [210]:
students.describe().round(2)

Unnamed: 0,height,weight,age,fav_number
count,6.0,6.0,6.0,6.0
mean,168.67,77.0,19.33,47.0
std,14.51,13.3,1.21,27.81
min,155.0,64.0,18.0,1.0
25%,156.25,67.75,18.25,36.5
50%,165.0,72.5,19.5,49.5
75%,181.25,84.75,20.0,70.0
max,187.0,98.0,21.0,73.0


90% of the values are lower than the 90th percentile. This can be used for outlier detection, which will be explored in workshop 4.

In [211]:
students.describe(percentiles=[.1, .5, .9, .95]).round(3)

Unnamed: 0,height,weight,age,fav_number
count,6.0,6.0,6.0,6.0
mean,168.667,77.0,19.333,47.0
std,14.514,13.297,1.211,27.806
min,155.0,64.0,18.0,1.0
10%,155.0,65.5,18.0,18.5
50%,165.0,72.5,19.5,49.5
90%,186.0,93.0,20.5,73.0
95%,186.5,95.5,20.75,73.0
max,187.0,98.0,21.0,73.0
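
Individual percentiles are also available through `quantile`. As a rough sketch of how they could be used to flag extreme values (the 10%/90% thresholds are an arbitrary choice for illustration, not a recommendation), using the `fav_number` values from above:

```python
import pandas as pd

fav_number = pd.Series([61, 36, 73, 38, 1, 73], index=list('abcdex'))

p10 = fav_number.quantile(0.10)  # 10% of the values lie below this
p90 = fav_number.quantile(0.90)  # 90% of the values lie below this

# flag values outside the [10th, 90th] percentile range
outside = fav_number[(fav_number < p10) | (fav_number > p90)]

print(p10, p90)  # 18.5 73.0
print(outside)
```

Only student `e`, with a favorite number of 1, falls outside the range here.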


---

Aggregations for categorical variables:

In [212]:
students.fav_icecream.value_counts()

strawberry    3
vanilla       2
chocolate     1
Name: fav_icecream, dtype: int64

In [213]:
students.fav_icecream.unique()

array(['strawberry', 'vanilla', 'chocolate'], dtype=object)

In [214]:
students.fav_icecream.nunique()  # when you need just the number of unique items, not their actual values

3

---

Two variables have high _pair-wise correlation_ (Pearson) when there is a strong linear relationship between them. Values range from -1 (perfectly decreasing) to 1 (perfectly increasing):

In [215]:
students.corr()

Unnamed: 0,height,weight,age,fav_number
height,1.0,0.145084,-0.083439,-0.687326
weight,0.145084,1.0,0.484382,-0.443024
age,-0.083439,0.484382,1.0,-0.469186
fav_number,-0.687326,-0.443024,-0.469186,1.0


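Two variables have high _covariance_ when they tend to show similar behavior: greater values in one correspond to greater values in the other. Unlike correlation, covariance is not normalized, so its magnitude depends on the scale of the variables: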

In [216]:
students.cov()

Unnamed: 0,height,weight,age,fav_number
height,210.666667,28.0,-1.466667,-277.4
weight,28.0,176.8,7.8,-163.8
age,-1.466667,7.8,1.466667,-15.8
fav_number,-277.4,-163.8,-15.8,773.2
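
Correlation and covariance are directly related: the Pearson correlation is the covariance divided by the product of the two standard deviations, $\rho_{XY} = \mathrm{cov}(X, Y) / (\sigma_X \sigma_Y)$. A minimal sketch verifying this on the `height` and `weight` values from above:

```python
import pandas as pd

df = pd.DataFrame({
    'height': [155, 187, 155, 160, 185, 170],
    'weight': [88, 67, 75, 64, 98, 70],
})

cov = df.height.cov(df.weight)  # sample covariance
corr_manual = cov / (df.height.std() * df.weight.std())  # divide out both scales
corr_pandas = df.height.corr(df.weight)  # Pearson correlation

print(round(cov, 3))                                 # 28.0
print(round(corr_manual, 6), round(corr_pandas, 6))  # 0.145084 0.145084
```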


Positive _skewness_ indicates a distribution with a longer tail on the right, meaning most of the values are concentrated on the left:

In [217]:
students.age.skew()

0.07506571125862004

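Positive _kurtosis_ (pandas reports excess kurtosis, which is 0 for a normal distribution) indicates heavier tails than a normal distribution; a negative value, as here, indicates lighter tails: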

In [218]:
students.age.kurt()

-1.5495867768595044
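
To get a feel for the sign of the skewness, here is a small sketch on three made-up samples (the values are purely illustrative):

```python
import pandas as pd

right_tailed = pd.Series([1, 1, 2, 2, 3, 10])  # a few large values stretch the right tail
left_tailed = pd.Series([1, 8, 9, 9, 10, 10])  # a few small values stretch the left tail
symmetric = pd.Series([1, 2, 3, 4, 5])

print(right_tailed.skew() > 0)  # True
print(left_tailed.skew() < 0)   # True
print(symmetric.skew())         # zero, up to floating-point error
```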

Read more about [distribution measures](https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/basic-statistics/inference/supporting-topics/data-concepts/how-skewness-and-kurtosis-affect-your-distribution/).

**💪 Exercise**: `describe` your `food_stats`:

In [219]:
food_stats.describe().round(3)

Unnamed: 0,tasty,healthy
count,5.0,5.0
mean,2.2,2.0
std,0.837,1.0
min,1.0,1.0
25%,2.0,1.0
50%,2.0,2.0
75%,3.0,3.0
max,3.0,3.0


#### Ordering

Sort by the index:

In [220]:
students.sort_index()

Unnamed: 0,height,weight,graduated,age,can_ride,fav_number,fav_icecream
a,155,88,False,21.0,False,61,strawberry
b,187,67,True,19.0,True,36,vanilla
c,155,75,False,18.0,False,73,strawberry
d,160,64,True,20.0,False,38,vanilla
e,185,98,True,20.0,True,1,chocolate
x,170,70,True,18.0,False,73,strawberry


Sort by (a combination of) column(s):

In [221]:
students.sort_values(by=['fav_number', 'age'], ascending=True)

Unnamed: 0,height,weight,graduated,age,can_ride,fav_number,fav_icecream
e,185,98,True,20.0,True,1,chocolate
b,187,67,True,19.0,True,36,vanilla
d,160,64,True,20.0,False,38,vanilla
a,155,88,False,21.0,False,61,strawberry
c,155,75,False,18.0,False,73,strawberry
x,170,70,True,18.0,False,73,strawberry


Compute each row's rank (average ranks in case of equality):

In [222]:
students.fav_number.rank()

a    4.0
b    2.0
c    5.5
d    3.0
e    1.0
x    5.5
Name: fav_number, dtype: float64

**💪 Exercise**: sort your `food_stats` by `tasty`est first:

In [223]:
food_stats.sort_values('tasty', ascending=False)

Unnamed: 0,tasty,healthy,had_recently
chocolate,3,2,False
hamburger,3,1,False
pizza,2,1,True
banana,2,3,True
carrot,1,3,True


### Data Transformations

Restructuring operations: the same data, but in a different view, better suited to the downstream task.

Transpose rows and columns (keeping labels):

In [224]:
students.T

Unnamed: 0,a,b,c,d,e,x
height,155,187,155,160,185,170
weight,88,67,75,64,98,70
graduated,False,True,False,True,True,True
age,21,19,18,20,20,18
can_ride,False,True,False,False,True,False
fav_number,61,36,73,38,1,73
fav_icecream,strawberry,vanilla,strawberry,vanilla,chocolate,strawberry


Transform a categorical variable into dummy variables:

In [225]:
pd.get_dummies(students.fav_icecream)

Unnamed: 0,chocolate,strawberry,vanilla
a,0,1,0
b,0,0,1
c,0,1,0
d,0,0,1
e,1,0,0
x,0,1,0


#### Group By

_Grouping_ puts together rows according to the values for a certain variable:

In [226]:
# exemplify on a new dataframe
performance = pd.DataFrame([
    ('Alice', 'CS 101', 4.0),
    ('Alice', 'CS 102', 3.0),
    ('Alice', 'EE 201', 4.0),
    ('Bob',   'CS 101', 3.0),
    ('Bob',   'EE 201', 4.0),
], columns=['student', 'class', 'grade'])

In [227]:
performance

Unnamed: 0,student,class,grade
0,Alice,CS 101,4.0
1,Alice,CS 102,3.0
2,Alice,EE 201,4.0
3,Bob,CS 101,3.0
4,Bob,EE 201,4.0


In order to see the effects of grouping, we apply an aggregation on all the rows for each student:

In [228]:
performance.groupby('student').grade.mean()

student
Alice    3.666667
Bob      3.500000
Name: grade, dtype: float64
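
Several aggregations can be computed in one pass with `agg`. A minimal sketch using named aggregation on the same `performance` data (the output column names `n_classes`, `mean_grade` and `best_grade` are made up for this example):

```python
import pandas as pd

performance = pd.DataFrame([
    ('Alice', 'CS 101', 4.0),
    ('Alice', 'CS 102', 3.0),
    ('Alice', 'EE 201', 4.0),
    ('Bob',   'CS 101', 3.0),
    ('Bob',   'EE 201', 4.0),
], columns=['student', 'class', 'grade'])

# named aggregation: output column name = (input column, aggregation)
summary = performance.groupby('student').agg(
    n_classes=('class', 'count'),
    mean_grade=('grade', 'mean'),
    best_grade=('grade', 'max'),
)

print(summary.round(2))
```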

---

Group by multiple variables:

In [229]:
students.groupby(['graduated', 'fav_icecream']).age.mean()

graduated  fav_icecream
False      strawberry      19.5
True       chocolate       20.0
           strawberry      18.0
           vanilla         19.5
Name: age, dtype: float64

---

Iterating over the `groupby` object yields pairs of the group value (the student name here) and the rows for that value:

In [230]:
for student, classes in performance.groupby('student'):
    print(student, 'took', len(classes), 'classes, with an average of', classes.grade.mean().round(2))

Alice took 3 classes, with an average of 3.67
Bob took 2 classes, with an average of 3.5


**💪 Exercise**: get the maximum grade of each student in `performance`:

In [231]:
performance.groupby('student').grade.max()

student
Alice    4.0
Bob      4.0
Name: grade, dtype: float64

#### Pivot

_Pivoting_ "flips" the data according to chosen variables and applies a function. Select a discrete variable for the columns and one for the rows, and you get the unique values of each. Each observation in the original dataframe falls into one such value intersection. Pick an aggregation to apply to the set of observations at each intersection.

For example, if we want to know the average `height` and `weight` for those students that `graduated` and those that did not:

In [232]:
pd.pivot_table(
    students,
    index='graduated',
    values=['height', 'weight'],
    aggfunc='mean',
)

Unnamed: 0_level_0,height,weight
graduated,Unnamed: 1_level_1,Unnamed: 2_level_1
False,155.0,81.5
True,175.5,74.75


Or the maximum `height` and `weight` instead:

In [233]:
pd.pivot_table(
    students,
    index='graduated',
    values=['height', 'weight'],
    aggfunc='max',
)

Unnamed: 0_level_0,height,weight
graduated,Unnamed: 1_level_1,Unnamed: 2_level_1
False,155,88
True,187,98
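
The examples above only set `index`. A sketch that also uses `columns`, so that each cell is a row/column intersection, on the `performance` data from the Group By section:

```python
import pandas as pd

performance = pd.DataFrame([
    ('Alice', 'CS 101', 4.0),
    ('Alice', 'CS 102', 3.0),
    ('Alice', 'EE 201', 4.0),
    ('Bob',   'CS 101', 3.0),
    ('Bob',   'EE 201', 4.0),
], columns=['student', 'class', 'grade'])

grades = pd.pivot_table(
    performance,
    index='student',  # one row per student
    columns='class',  # one column per class
    values='grade',
    aggfunc='mean',   # applied to the observations at each intersection
)

print(grades)
```

Intersections with no observations (Bob did not take CS 102) are filled with `NaN`.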


---

A special case of pivoting is _cross tabulation_, which returns the counts at each feature value intersection:

In [234]:
pd.crosstab(students.fav_icecream, students.graduated, margins=True)

graduated,False,True,All
fav_icecream,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
chocolate,0,1,1
strawberry,2,1,3
vanilla,0,2,2
All,2,4,6


**💪 Exercise**: get the `mean` of `age` and `height` for students who `can_ride` and those who can't:

In [235]:
pd.pivot_table(
    students,
    index='can_ride',
    values=['age', 'height'],
    aggfunc='mean',
)

Unnamed: 0_level_0,age,height
can_ride,Unnamed: 1_level_1,Unnamed: 2_level_1
False,19.25,160
True,19.5,186


#### Melt

_Melting_ can be thought of as the reverse of pivoting.

In [236]:
# exemplify on a new dataframe
height_evolution = pd.DataFrame({
    'Alice': np.linspace(160, 190, num=5),
    'Bob':   np.linspace(170, 180, num=5),
    'year':  range(2000, 2005),
})

In [237]:
height_evolution

Unnamed: 0,Alice,Bob,year
0,160.0,170.0,2000
1,167.5,172.5,2001
2,175.0,175.0,2002
3,182.5,177.5,2003
4,190.0,180.0,2004


Currently, each student has their own column, even though all of these columns hold the same type of information. So, we can melt them into a single `student` column:

In [238]:
melted = height_evolution.melt(
    id_vars='year',
    value_vars=['Alice', 'Bob'],

    var_name='student',
    value_name='height',
)
melted

Unnamed: 0,year,student,height
0,2000,Alice,160.0
1,2001,Alice,167.5
2,2002,Alice,175.0
3,2003,Alice,182.5
4,2004,Alice,190.0
5,2000,Bob,170.0
6,2001,Bob,172.5
7,2002,Bob,175.0
8,2003,Bob,177.5
9,2004,Bob,180.0
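
To illustrate that melting and pivoting undo each other, this sketch melts the same data and then pivots it back into one column per student (`wide` is a name chosen for this example):

```python
import numpy as np
import pandas as pd

height_evolution = pd.DataFrame({
    'Alice': np.linspace(160, 190, num=5),
    'Bob':   np.linspace(170, 180, num=5),
    'year':  range(2000, 2005),
})

melted = height_evolution.melt(id_vars='year', var_name='student', value_name='height')

# pivot back: one row per year, one column per student
wide = melted.pivot(index='year', columns='student', values='height')

print(wide)
```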


#### Join

_Joining_ combines two tables, based on a common variable:

In [239]:
height_stats = height_evolution.describe().T[['min', 'max']].rename(columns=lambda c: c + '_height')
height_stats

Unnamed: 0,min_height,max_height
Alice,160.0,190.0
Bob,170.0,180.0
year,2000.0,2004.0


_Note_: since these are column-wise statistics, we also get the smallest and largest values for `year`.

The `performance` table has a `student` column, while `height_stats` is indexed by student name; that shared key is what we join on. For every row where `student` is `"Alice"`, combine the information from the `performance` dataframe (`class` and `grade`) with the information from the other dataframe (`min_height` and `max_height`).

In [240]:
performance.join(height_stats, on='student')

Unnamed: 0,student,class,grade,min_height,max_height
0,Alice,CS 101,4.0,160.0,190.0
1,Alice,CS 102,3.0,160.0,190.0
2,Alice,EE 201,4.0,160.0,190.0
3,Bob,CS 101,3.0,170.0,180.0
4,Bob,EE 201,4.0,170.0,180.0


_Note_: since `year` is not among `performance.student` values, the default left join ignores that entry. Learn more about [join types](http://www.sql-join.com/sql-join-types/).
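
The join type is controlled by the `how` parameter. A minimal sketch contrasting a left and an inner join, on simplified versions of the two tables (the student `Chris` is a hypothetical addition with no height statistics, included to make the difference visible):

```python
import pandas as pd

performance = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Chris'],
    'grade':   [4.0, 3.0, 3.5],
})
height_stats = pd.DataFrame(
    {'max_height': [190.0, 180.0, 2004.0]},
    index=['Alice', 'Bob', 'year'],
)

# left join (the default for `join`): keep every row of `performance`
left = performance.join(height_stats, on='student', how='left')

# inner join: keep only the students present in both tables
inner = performance.join(height_stats, on='student', how='inner')

print(left)
print(inner)
```

In the left join, Chris is kept with a missing `max_height`; in the inner join, Chris is dropped. In both cases the `year` entry never appears, since it matches no student.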

### Time Series

In [241]:
from datetime import datetime

In [242]:
# number of pages read for the first week of classes
pages_read = pd.DataFrame({
    'Alice': np.random.randint(0, 50, size=7),
    'Bob':   np.random.randint(0, 20, size=7),
    'date': pd.date_range('7 Jan 2019', periods=7)
})

pages_read

Unnamed: 0,Alice,Bob,date
0,20,17,2019-01-07
1,28,6,2019-01-08
2,4,11,2019-01-09
3,13,0,2019-01-10
4,1,12,2019-01-11
5,16,15,2019-01-12
6,24,16,2019-01-13


Comparison operations can be done against a `datetime`-compatible object:

In [243]:
late_start = datetime(year=2019, month=1, day=10)  # it's not fair to start counting that early

In [244]:
pages_read[pages_read.date > late_start]

Unnamed: 0,Alice,Bob,date
4,1,12,2019-01-11
5,16,15,2019-01-12
6,24,16,2019-01-13


While dates represent specific timepoints (of various granularity), the difference between two such objects is a _time delta_: a duration, not a date:

In [245]:
late_start - pages_read.date.iloc[0]

Timedelta('3 days 00:00:00')

A time delta can also be instantiated by parsing a human-readable string:

In [246]:
pd.Timedelta('7 days 5 hours 3 minutes')

Timedelta('7 days 05:03:00')

Timedeltas can be used to offset date objects:

In [247]:
pages_read.date + pd.Timedelta(7, 'd')  # much better, a whole week later

0   2019-01-14
1   2019-01-15
2   2019-01-16
3   2019-01-17
4   2019-01-18
5   2019-01-19
6   2019-01-20
Name: date, dtype: datetime64[ns]
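
Subtracting a date from a whole column yields a series of time deltas, whose components are reachable through the `.dt` accessor. A small sketch on the same date range (the `start` variable is introduced for this example):

```python
import pandas as pd

dates = pd.Series(pd.date_range('7 Jan 2019', periods=7))
start = pd.Timestamp('2019-01-07')

elapsed = dates - start         # a series of time deltas
days_elapsed = elapsed.dt.days  # the whole-day component, as integers

print(days_elapsed.tolist())             # [0, 1, 2, 3, 4, 5, 6]
print(dates.dt.day_name().tolist()[:2])  # ['Monday', 'Tuesday']
```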

### Hierarchical Indices

In [248]:
enrollment = pd.DataFrame({
    'level':    np.random.choice(['grad', 'undergrad', 'phd'], size=20),
    'school':   np.random.choice(['Viterbi', 'Price', 'Marshall', 'Dornsife'], size=20),
    'students': np.random.randint(200, 5_000, size=20),
    'faculty':  np.random.randint(50,  500,   size=20),
}).drop_duplicates(subset=['level', 'school'])

enrollment

Unnamed: 0,level,school,students,faculty
0,phd,Viterbi,3593,224
1,phd,Marshall,3585,248
2,undergrad,Viterbi,2241,101
4,undergrad,Marshall,1963,337
5,grad,Viterbi,507,50
6,grad,Marshall,4364,353
8,phd,Dornsife,909,432
10,grad,Price,1428,235
11,undergrad,Price,1358,55


In [249]:
enrollment.set_index(['school', 'level']).sort_index()

Unnamed: 0_level_0,Unnamed: 1_level_0,students,faculty
school,level,Unnamed: 2_level_1,Unnamed: 3_level_1
Dornsife,phd,909,432
Marshall,grad,4364,353
Marshall,phd,3585,248
Marshall,undergrad,1963,337
Price,grad,1428,235
Price,undergrad,1358,55
Viterbi,grad,507,50
Viterbi,phd,3593,224
Viterbi,undergrad,2241,101
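
With a hierarchical row index, rows can be selected by the outer level alone or by a full tuple of levels. A minimal sketch on a fixed subset of the (randomly generated) enrollment numbers above:

```python
import pandas as pd

enrollment = pd.DataFrame({
    'school':   ['Marshall', 'Marshall', 'Viterbi', 'Viterbi'],
    'level':    ['grad', 'phd', 'grad', 'phd'],
    'students': [4364, 3585, 507, 3593],
}).set_index(['school', 'level']).sort_index()

# select all rows for one value of the outer level
marshall = enrollment.loc['Marshall']

# select a single cell by giving one value per level, plus the column
viterbi_phd = enrollment.loc[('Viterbi', 'phd'), 'students']

# aggregate over the inner level
per_school = enrollment.groupby(level='school')['students'].sum()

print(marshall)
print(viterbi_phd)          # 3593
print(per_school.tolist())  # [7949, 4100]
```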


---

In [250]:
locations = pd.DataFrame({
    'Alice': ('San Francisco', 'CA', 'Los Angeles', 'CA'),
    'Bob':   ('Rochester', 'NY', 'Los Angeles', 'CA'),
    'Chris': ('Las Vegas', 'NV', 'Pennsylvania', 'PA'),
}).T

locations.columns = pd.MultiIndex.from_product([
    ['home', 'school'],
    ['city', 'state']
], names=['purpose', 'address'])

locations

purpose,home,home,school,school
address,city,state,city,state
Alice,San Francisco,CA,Los Angeles,CA
Bob,Rochester,NY,Los Angeles,CA
Chris,Las Vegas,NV,Pennsylvania,PA
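
Hierarchical columns are accessed in a similar way. A sketch on a two-person version of `locations`:

```python
import pandas as pd

locations = pd.DataFrame({
    'Alice': ('San Francisco', 'CA', 'Los Angeles', 'CA'),
    'Bob':   ('Rochester', 'NY', 'Los Angeles', 'CA'),
}).T
locations.columns = pd.MultiIndex.from_product(
    [['home', 'school'], ['city', 'state']],
    names=['purpose', 'address'],
)

home = locations['home']                     # all columns under 'home'
school_city = locations[('school', 'city')]  # one column, addressed by both levels
states = locations.xs('state', axis=1, level='address')  # cross-section on the inner level

print(home.columns.tolist())    # ['city', 'state']
print(school_city.tolist())     # ['Los Angeles', 'Los Angeles']
print(states.columns.tolist())  # ['home', 'school']
```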


Read more about [advanced indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html).

### Loading and Saving Dataframes

Data will almost always be loaded from an external source.

Load from JSON (open the file in the file browser to check out the source):

In [251]:
pd.read_json('example_files/objects.json')

Unnamed: 0,grade,name,year
0,3.9,Alice,2
1,3.8,Bob,3
2,3.85,Chris,1


Load from CSV (again, the file is in the `example_files` folder):

In [252]:
pd.read_csv('example_files/tabular.csv')

Unnamed: 0,grade,name,year
0,3.9,Alice,2
1,3.8,Bob,3
2,3.85,Chris,1


**ℹ️ Tip**: the CSV format is extremely common. There is a huge number of options available for loading such files. Read more about them [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).

Load directly from a URL, letting Pandas do the downloading (paste that link into your browser to check the source):

In [253]:
pd.read_csv('https://raw.githubusercontent.com/stefan-niculae/viz-workshop/master/example_files/tabular.csv')

Unnamed: 0,grade,name,year
0,3.9,Alice,2
1,3.8,Bob,3
2,3.85,Chris,1


Load directly from an archive, letting Pandas do the uncompressing (find the file and extract it in your local file browser to check the source):

In [254]:
pd.read_csv('example_files/archived.csv.zip')

Unnamed: 0,grade,name,year
0,3.9,Alice,2
1,3.8,Bob,3
2,3.85,Chris,1


---

Saving data (after running the cell, check the result in the file browser):

In [255]:
performance.to_csv('students_performance.csv')

**ℹ️ Tip**: if the index is meaningless (e.g. just the default sequential one), avoid wasting space and slightly encumbering the reading process by omitting it with `index=False`.
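
A minimal sketch of the difference, using an in-memory string instead of a file so that nothing is written to disk (the two-row `performance` frame is a simplified stand-in):

```python
import io
import pandas as pd

performance = pd.DataFrame({'student': ['Alice', 'Bob'], 'grade': [4.0, 3.5]})

with_index = performance.to_csv()                # no path given: returns the CSV text
without_index = performance.to_csv(index=False)  # omit the default 0, 1, ... index

print(with_index)
print(without_index)

# reading the index-free text back reproduces the original columns
restored = pd.read_csv(io.StringIO(without_index))
print(restored)
```

With the index included, the header line starts with an empty field, which `read_csv` would load back as an extra `Unnamed: 0` column.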

## Further Reading
 - Numpy: 
   - [cheatsheet](https://www.dataquest.io/blog/large_files/numpy-cheat-sheet.pdf)
   - [official quickstart guide](https://docs.scipy.org/doc/numpy-1.15.0/user/quickstart.html)
   - [official reference](https://docs.scipy.org/doc/numpy/reference/index.html#reference)
 - Scipy: [tutorial](https://docs.scipy.org/doc/scipy/reference/tutorial/index.html)
 - Pandas:
   - [visual cheatsheet](http://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
   - [cookbook](https://chrisalbon.com/#python)
   - [gotchas](https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#gotchas)
 - Python/Numpy/Scipy/Matplotlib: [quick tutorial](http://cs231n.github.io/python-numpy-tutorial/)
 
Links to more details about particular concepts are placed at the end of their respective (sub)sections.