In [1]:
import numpy as np
import pandas as pd

# NumPy
* "Fundamental package for scientific computing in Python"
* "At the core of the NumPy package, is the `ndarray` object"
* Provides:
   * multidimensional array object
   * various derived objects:
     * masked arrays
     * matrices
     * ...
   * routines for fast operations on arrays:
     * mathematical
     * logical
     * shape manipulation
     * sorting
     * selecting
     * I/O
     * basic linear algebra
     * basic statistical operations
     * random simulation
     * ...

https://docs.scipy.org/doc/numpy-1.13.0/user/whatisnumpy.html


## `ndarray`
* "Multidimensional container of items of the **same type and size**"
* "Usually fixed-size"
* Shape of the array ("tuple of N positive integers that specify the sizes of each dimension") defines:
  * Number of dimensions
  * Number of items
* `dtype` specifies the data type of the items in the array
* Supports indexing and slicing

https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.ndarray.html

## `ndarray` vs standard Python list

| -                    | `ndarray`       | Python list                       |
|:---------------------|:----------------|:----------------------------------|
| size                 | fixed [1]       | dynamic                           |
| type                 | homogeneous [2] | heterogeneous (Python `object`s)  |
| adv. math operations | available [3]   | write yourself                    |
| implementation       | C array [4]     | array of object references        |


* [1] "Changing the size of an ndarray will create a new array and delete the original."
* [2] All elements will have the **"same size in memory"** (except in case of an array of objects).
* [3] "Typically, such operations are executed **more efficiently** and with less code than is possible using Python’s built-in sequences."
* [4] "NumPy is implemented in C and offers near C-speed"

In addition, many scientific Python packages use and return NumPy arrays.

https://docs.scipy.org/doc/numpy-1.13.0/user/whatisnumpy.html
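The comparison in the table can be made concrete with a minimal sketch. The variable names (`prices`, `mixed`, ...) are illustrative and not taken from the NumPy documentation.

```python
import numpy as np

prices = [1.5, 2.0, 3.25]            # plain Python list
prices_arr = np.array(prices)        # ndarray with a single dtype (float64)

# With a list, element-wise math has to be written out explicitly
doubled_list = [p * 2 for p in prices]

# With an ndarray, the operation is applied to every element at once
doubled_arr = prices_arr * 2

# A list may mix types; converting it forces one common dtype
mixed = np.array([1, 2.5, 3])        # the integers are upcast to float64
print(doubled_list, doubled_arr, mixed.dtype)
```

Both approaches give the same numbers here; the difference is how much code is needed and where the loop runs.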

## NumPy Routines
All functions and methods are documented at https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.html#routines
* [Array creation](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.array-creation.html)
* [Array manipulation](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.array-manipulation.html)
* [Data type routines](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.dtype.html)
* [Indexing](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.indexing.html)
* [Linear algebra](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.linalg.html)
* [Logic functions](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.logic.html)
* [Mathematical functions](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.math.html)
* [The matrix library](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.matlib.html)
* [Random sampling](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.random.html)
* [Sorting, searching, and counting](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.sort.html)
* [Statistics](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.statistics.html)
* [Window functions](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.window.html)
* ...

## Create NumPy Arrays

"There are 5 general mechanisms for creating arrays:
1. Conversion from other Python structures (e.g., lists, tuples)
2. Intrinsic numpy array creation objects (e.g., arange, ones, zeros, etc.)
3. Reading arrays from disk, either from standard or custom formats
4. Creating arrays from raw bytes through the use of strings or buffers
5. Use of special library functions (e.g., random)"

https://docs.scipy.org/doc/numpy-1.13.0/user/basics.creation.html

See https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.array-creation.html for an overview of array creation routines. A few important routines are outlined in the following.

### Using the ```array()``` function
```numpy.array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)```

"**Parameters:**
* **object:** *array_like*. An array, any object exposing the array interface, an object whose ```__array__``` method returns an array, or any (nested) sequence.
* **dtype:** *data-type, optional*. The desired data-type for the array. If not given, then the type will be determined as the minimum type required to hold the objects in the sequence. This argument can only be used to ‘upcast’ the array. For downcasting, use the ```.astype(t)``` method."
* Check the documentation for the other more advanced parameters.


"**Returns: out:** *ndarray*. An array object satisfying the specified requirements."

https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.array.html

The following examples are from the link above.

In [2]:
# create a NumPy array from a list
a = np.array([1, 2, 3])
a

array([1, 2, 3])

In [3]:
a.dtype

dtype('int64')

In [4]:
# mixing integers and a float upcasts the whole array to float
# note the dtype of all values in the array
np.array([1, 2, 3.])

array([1., 2., 3.])

In [5]:
# create an array with more than one dimension
np.array([[0, 1, 2], [3, 4, 5]])

array([[0, 1, 2],
       [3, 4, 5]])
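The parameter description above mentions `.astype(t)` for downcasting but does not show it. A small sketch, with illustrative variable names, of converting a float array to a smaller integer type:

```python
import numpy as np

floats = np.array([1.7, 2.2, 3.9])   # dtype is inferred as float64
ints = floats.astype(np.int32)       # downcast: the fractional part is truncated

# astype returns a new array; the original keeps its dtype
print(floats.dtype, ints.dtype, ints)
```

Note that the conversion truncates toward zero rather than rounding, so `1.7` becomes `1`.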

### Using the ```arange()``` function

```numpy.arange([start, ]stop, [step, ]dtype=None)```

"Return evenly spaced values within a given interval.

Values are generated within the half-open interval ```[start, stop)``` (in other words, the interval including start but excluding stop). For integer arguments the function is equivalent to the Python built-in ```range``` function, but returns an ```ndarray``` rather than a list.

When using a non-integer step, such as ```0.1```, the results will often not be consistent. It is better to use ```linspace``` for these cases."

**Parameters:**
* **start:** *number, optional*. Start of interval. The interval includes this value. The default start value is ```0```.
* **stop:** *number*. End of interval. The interval does not include this value, except in some cases where *step* is not an integer and floating point round-off affects the length of out.
* **step:** *number, optional*. Spacing between values. For any output *out*, this is the distance between two adjacent values, ```out[i+1] - out[i]```. The default step size is ```1```. If *step* is specified, *start* must also be given.
* **dtype:** *dtype*. The type of the output array. If ```dtype``` is not given, infer the data type from the other input arguments.

**Returns: arange:** *ndarray*. Array of evenly spaced values. For floating point arguments, the length of the result is ```ceil((stop - start)/step)```. Because of floating point overflow, this rule may result in the last element of out being greater than *stop*.

https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.arange.html

In [6]:
# create an array with stop=3 (and default start and step values)
np.arange(3)

array([0, 1, 2])

In [7]:
# create the same array with float data types
np.arange(3.)

array([0., 1., 2.])

In [8]:
# create an array with start=3 and stop=7 (exclusive)
np.arange(3, 7)

array([3, 4, 5, 6])

In [11]:
# create an array with start=3, stop=7 (exclusive), and step=2
np.arange(3, 7, 2)

array([3, 5])
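The documentation quoted above recommends `linspace` for non-integer steps. The following sketch contrasts the two functions, assuming a simple interval from 0 to 1; the variable names are illustrative.

```python
import numpy as np

# arange: you choose the step; the number of elements follows from it
by_step = np.arange(0, 1, 0.25)

# linspace: you choose the number of elements; the step follows from it.
# endpoint=False excludes stop, mirroring the half-open interval of arange.
by_count = np.linspace(0, 1, num=4, endpoint=False)

# By default linspace includes the stop value
with_stop = np.linspace(0, 1, num=5)

print(by_step, by_count, with_stop)
```

Because `linspace` fixes the number of elements up front, the length of the result does not depend on floating point round-off.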

### Creating arrays with or without placeholder values

Some functions for creating NumPy arrays with or without placeholder values are described in the following.

https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.array-creation.html#ones-and-zeros

#### ```ones()```
```numpy.ones(shape, dtype=None, order='C')```
"Return a new array of given shape and type, filled with ones.

**Parameters:**
* **shape:** *int or sequence of ints*. Shape of the new array, e.g., ```(2, 3)``` or ```2```.
* **dtype:** *data-type, optional*. The desired data-type for the array, e.g., ```numpy.int8```. Default is ```numpy.float64```.
* **order:** *{‘C’, ‘F’}, optional*. Whether to store multidimensional data in C- or Fortran-contiguous (row- or column-wise) order in memory.

**Returns: out:** *ndarray*. Array of ones with the given shape, dtype, and order."

https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.ones.html

In [12]:
# create an array of ones with 5 values and the default dtype
np.ones(5)

array([1., 1., 1., 1., 1.])

In [13]:
# create the same array with int as dtype
np.ones((5,), dtype=int)  # np.int was removed in NumPy 1.24; the built-in int works in all versions

array([1, 1, 1, 1, 1])

In [14]:
# create an array of ones with two rows and one column
np.ones((2, 1))

array([[1.],
       [1.]])

#### ```zeros()```
```numpy.zeros(shape, dtype=float, order='C')```

"Return a new array of given shape and type, filled with zeros."

Parameters and return value are the same as for ```numpy.ones()```.

https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.zeros.html

In [15]:
# create an array of zeros with 5 values and the default dtype
np.zeros(5)

array([0., 0., 0., 0., 0.])

In [16]:
# create the same array with int as dtype
np.zeros((5,), dtype=int)  # np.int was removed in NumPy 1.24; the built-in int works in all versions

array([0, 0, 0, 0, 0])

In [17]:
# create an array of zeros with two rows and one column
np.zeros((2, 1))

array([[0.],
       [0.]])

#### ```full()```
```numpy.full(shape, fill_value, dtype=None, order='C')```

"Return a new array of given shape and type, filled with fill_value.

**Parameters:**
* **shape:** *int or sequence of ints*. Shape of the new array, e.g., ```(2, 3)``` or ```2```.
* **fill_value:** *scalar*. Fill value.
* **dtype:** *data-type, optional*. The desired data-type for the array. The default, ```None```, means ```np.array(fill_value).dtype```.
* **order:** *{‘C’, ‘F’}, optional*. Whether to store multidimensional data in C- or Fortran-contiguous (row- or column-wise) order in memory.

**Returns: out:** *ndarray*. Array of *fill_value* with the given shape, dtype, and order."

https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.full.html

In [18]:
# create an array with 2 rows and 2 columns with infinity as value
np.full((2, 2), np.inf)

array([[inf, inf],
       [inf, inf]])

In [19]:
# create an array with 2 rows and 2 columns with 10 as value
np.full((2, 2), 10)

array([[10, 10],
       [10, 10]])

#### ```empty()```
```numpy.empty(shape, dtype=float, order='C')```
"Return a new array of given shape and type, without initializing entries.

```empty```, unlike ```zeros```, does not set the array values to zero, and may therefore be marginally faster. On the other hand, it requires the user to manually set all the values in the array, and should be used with caution."

https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.empty.html

In [20]:
# create an array with 2 rows and 2 columns and the default data type;
# the values are uninitialized (whatever happened to be in memory), not random
np.empty([2, 2])

array([[ 0.00000000e+000,  4.65205137e-310],
       [-3.56769341e+299,  6.94487065e-310]])

In [21]:
# create an array with 2 rows and 2 columns of uninitialized int values
np.empty([2, 2], dtype=int)

array([[                  0,      94158568030208],
       [-134813131366783034,     140565746928048]])
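Since `empty` leaves the contents arbitrary, every element should be assigned before it is read. A minimal sketch of that pattern, with an illustrative array name:

```python
import numpy as np

squares = np.empty(5, dtype=np.int64)   # contents are arbitrary at this point

# Assign every element before reading any of them
for i in range(squares.size):
    squares[i] = i ** 2

print(squares)
```

If not every element is guaranteed to be written, `zeros` or `full` is the safer choice.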

## ```ndarray``` Attributes

"The most important attributes of an ```ndarray``` object are:"
https://docs.scipy.org/doc/numpy-1.13.0/user/quickstart.html#the-basics

In [22]:
# create a 2d array
x = np.array([[1, 2, 3], [4, 5, 6]], np.int32)
x

array([[1, 2, 3],
       [4, 5, 6]], dtype=int32)

In [23]:
# get the type of x
type(x)

numpy.ndarray

In [24]:
# get the shape of the array
x.shape

(2, 3)

In [25]:
# get the data type of the values in the array
x.dtype

dtype('int32')

In [26]:
# get the number of dimensions (axes) in the array
x.ndim

2

In [27]:
# get the total number of elements in the array
x.size

6

In [28]:
# get the size in bytes of each element of the array
x.itemsize

4

In [29]:
# indexing (zero-based) for getting a single element of the array
x[1, 2]

6
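The attributes shown above are related to each other. The following sketch checks those relationships for the same 2x3 `int32` array; `nbytes` is an additional attribute not used in the cells above.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]], np.int32)

assert x.ndim == len(x.shape)               # one shape entry per axis
assert x.size == x.shape[0] * x.shape[1]    # 2 * 3 elements
assert x.nbytes == x.size * x.itemsize      # 6 elements * 4 bytes each
print(x.nbytes)
```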

## Indexing, Slicing, and Iterating

* "**One-dimensional** arrays can be indexed, sliced and iterated over, much like lists and other Python sequences."
* "**Multidimensional** arrays can have one index per axis."
* "The basic **slice syntax** is ```i:j:k``` where *i* is the starting index, *j* is the stopping index, and *k* is the step ($k\neq0$). (...) If *k* is not given it defaults to 1."
* "When fewer indices are provided than the number of axes, the missing indices are considered complete slices ```:```"
* "The expression within brackets in ```b[i]``` is treated as an ```i``` followed by as many instances of ```:``` as needed to represent the remaining axes. NumPy also allows you to write this using dots as ```b[i,...]```."

https://docs.scipy.org/doc/numpy-1.13.0/user/quickstart.html#indexing-slicing-and-iterating and https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#basic-slicing-and-indexing

Examples are based on https://docs.scipy.org/doc/numpy-1.13.0/user/quickstart.html#indexing-slicing-and-iterating.

In [30]:
# creating a 1d array
a = np.arange(10)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [31]:
# select a single element
a[3]

3

In [32]:
# slice the array and get elements at index 1 to 5 (exclusive end)
a[1:5]

array([1, 2, 3, 4])

In [33]:
# select all elements starting from index 5
a[5:]

array([5, 6, 7, 8, 9])

In [34]:
# select every second element starting from 1 and stopping at 7 (exclusive)
a[1:7:2]

array([1, 3, 5])

In [35]:
# equivalent to a[0:6:2]
a[:6:2]

array([0, 2, 4])

In [40]:
# reverse the array
a[::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

In [None]:
# iterate the 1d array
for i in a:
    print(i*2)

In [50]:
# create a 2d array
def f(x, y):
    return 10*x+y

b = np.fromfunction(f, (5,4), dtype=int)
b

array([[ 0,  1,  2,  3],
       [10, 11, 12, 13],
       [20, 21, 22, 23],
       [30, 31, 32, 33],
       [40, 41, 42, 43]])

In [42]:
# indexing (zero-based) for getting a single element of the array
b[2, 3]

23

In [43]:
# get all elements of the second column
b[:, 1]

array([ 1, 11, 21, 31, 41])

In [44]:
# get all columns of the 2nd and 3rd row
b[1:3, :]

array([[10, 11, 12, 13],
       [20, 21, 22, 23]])

In [45]:
# get the last row of the array
b[-1]

array([40, 41, 42, 43])

In [46]:
# iterate over 2d array
for row in b:
    print(row)

[0 1 2 3]
[10 11 12 13]
[20 21 22 23]
[30 31 32 33]
[40 41 42 43]


In [47]:
# iterate over flattened 2d array
for element in b.flat:
    print(element)

0
1
2
3
10
11
12
13
20
21
22
23
30
31
32
33
40
41
42
43
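The rules about missing indices and the dots notation are easiest to see with more than two axes. A small sketch, assuming an illustrative 3D array `c3`:

```python
import numpy as np

c3 = np.arange(24).reshape(2, 3, 4)   # 3 axes: 2 blocks, 3 rows, 4 columns

first_block = c3[0]                   # missing indices are treated as ':'
same_block = c3[0, :, :]
also_same = c3[0, ...]

last_column = c3[..., -1]             # the dots expand to ':' for the leading axes

print(first_block.shape, last_column.shape)
```

All three ways of selecting the first block return the same 3x4 sub-array.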


## Shape Manipulation
An introduction to the manipulation of the shape of arrays is provided at https://docs.scipy.org/doc/numpy-1.13.0/user/quickstart.html#shape-manipulation

### Change the shape
* [```reshape()```](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.reshape.html) "gives a new shape to an array without changing its data."
* [```resize()```](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.ndarray.resize.html) changes the "shape and size of the array in-place."
* [```ravel()```](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.ravel.html) returns "a contiguous flattened array." The result is a view of the original data whenever possible; a copy is made only if needed.
* [```flatten()```](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.ndarray.flatten.html) returns "a copy of the array collapsed into one dimension."
* [```T```](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.T.html) transposes the array.

```reshape()```, ```ravel()```, ```flatten()```, and ```T``` "return a modified array, but do not change the original array", whereas ```resize()``` changes the shape of the original array.

In [51]:
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [54]:
# reshape the array into 5 rows and 2 columns
c = a.reshape((5, 2))
c

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

In [55]:
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [56]:
b

array([[ 0,  1,  2,  3],
       [10, 11, 12, 13],
       [20, 21, 22, 23],
       [30, 31, 32, 33],
       [40, 41, 42, 43]])

In [57]:
# flatten the array (returns a view of the data whenever possible)
b.ravel()

array([ 0,  1,  2,  3, 10, 11, 12, 13, 20, 21, 22, 23, 30, 31, 32, 33, 40,
       41, 42, 43])

In [58]:
# flatten the array (a copy will be returned)
b.flatten()

array([ 0,  1,  2,  3, 10, 11, 12, 13, 20, 21, 22, 23, 30, 31, 32, 33, 40,
       41, 42, 43])

In [59]:
b.shape

(5, 4)

In [60]:
# change the shape of the original array
b.resize((2, 10))
b

array([[ 0,  1,  2,  3, 10, 11, 12, 13, 20, 21],
       [22, 23, 30, 31, 32, 33, 40, 41, 42, 43]])

In [63]:
# return the array transposed
b.T

array([[ 0, 22],
       [ 1, 23],
       [ 2, 30],
       [ 3, 31],
       [10, 32],
       [11, 33],
       [12, 40],
       [13, 41],
       [20, 42],
       [21, 43]])

In [62]:
b.transpose()

array([[ 0, 22],
       [ 1, 23],
       [ 2, 30],
       [ 3, 31],
       [10, 32],
       [11, 33],
       [12, 40],
       [13, 41],
       [20, 42],
       [21, 43]])

### Stacking
"Several arrays can be stacked together along different axes." (https://docs.scipy.org/doc/numpy-1.13.0/user/quickstart.html#stacking-together-different-arrays). The functions ```hstack```, ```vstack```, ```row_stack```, and ```column_stack``` are still supported, but ```stack``` or ```concatenate``` should be used for joining arrays.



* [```stack(arrays, axis=0)```](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.stack.html) joins "a sequence of arrays along a new axis."
* [```concatenate(arrays, axis=0)```](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.concatenate.html) joins "a sequence of arrays along an existing axis."

In [64]:
# create 3 numpy arrays
a = np.array((1, 2, 3))
b = np.array((4, 5, 6))
c = np.array((7, 8, 9))

In [65]:
b

array([4, 5, 6])

In [66]:
# stack arrays along the first axis (row wise)
np.stack((a, b, c))

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [67]:
# stack arrays along the last axis (column wise)
np.stack((a, b, c), axis=1)

array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]])

In [68]:
# concatenate the arrays
np.concatenate((a, b, c))

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [69]:
# stack arrays a, b, and c to create a 3x3 matrix
abc = np.stack((a, b, c))
abc

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [70]:
abc.shape

(3, 3)

In [77]:
# create a new array d
d = np.array([[10, 11, 12]])
d

array([[10, 11, 12]])

In [78]:
d.shape

(1, 3)

In [73]:
# concat abc and d along axis 0 (row-wise)
np.concatenate((abc, d), axis=0)

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [74]:
# concat abc and d along axis 1 (column-wise)
np.concatenate((abc, d.T), axis=1)

array([[ 1,  2,  3, 10],
       [ 4,  5,  6, 11],
       [ 7,  8,  9, 12]])
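The difference between joining along a new axis and along an existing axis shows up in the shape of the result. A minimal sketch with two illustrative 1D arrays:

```python
import numpy as np

first = np.array((1, 2, 3))
second = np.array((4, 5, 6))

stacked = np.stack((first, second))        # new axis: two 1D arrays become one 2D array
joined = np.concatenate((first, second))   # existing axis: the result stays 1D

print(stacked.shape)   # (2, 3)
print(joined.shape)    # (6,)
```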

### Splitting

https://docs.scipy.org/doc/numpy-1.13.0/user/quickstart.html#splitting-one-array-into-several-smaller-ones
* [```split(a, indices_or_sections, axis=0)```](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.split.html) splits "an array into multiple sub-arrays."
* [```hsplit(a, indices_or_sections)```](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.hsplit.html) splits "an array into multiple sub-arrays horizontally (column-wise)." Equivalent "to ```split``` with ```axis=1```."
* [```vsplit(a, indices_or_sections)```](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.vsplit.html) splits "an array into multiple sub-arrays vertically (row-wise)." Equivalent "to ```split``` with ```axis=0```."

In [79]:
a = np.arange(9)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

In [80]:
# split an array into 3 equal arrays
np.split(a, 3)

[array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8])]

In [89]:
# split an array at the given indices (an index past the end yields an empty array)
np.split(a, [3, 5, 6, 10])

[array([0, 1, 2]),
 array([3, 4]),
 array([5]),
 array([6, 7, 8]),
 array([], dtype=int64)]
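The cells above only use `split` on a 1D array. The following sketch, using an illustrative 3x4 array `m`, shows `hsplit` and `vsplit` on 2D data and the equivalent `split` call.

```python
import numpy as np

m = np.arange(12).reshape(3, 4)

left, right = np.hsplit(m, 2)     # two column blocks of shape (3, 2)
top, rest = np.vsplit(m, [1])     # split before row 1: shapes (1, 4) and (2, 4)

# hsplit corresponds to split along axis 1
same_left, same_right = np.split(m, 2, axis=1)

print(left.shape, right.shape, top.shape, rest.shape)
```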

## Lists vs NumPy array

In [90]:
# speed comparison for 2D arrays

# for loop with lists
def for_loop(dim):
    n = []
    # first dimension
    for i in range(dim):
        n.append([])
        # second dimension
        for j in range(dim):
            n[i].append(j)
    return n

# list comprehension
def list_comp(dim):
    # double list comprehension
    n = [[j for j in range(dim)] for i in range(dim)]
    return n

# numpy array
def numpy_array(dim):
    return np.array([np.arange(dim, dtype=np.int32) for i in range(dim)])

In [91]:
for_loop(5)

[[0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4]]

In [92]:
list_comp(5)

[[0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4]]

In [93]:
numpy_array(5)

array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4]], dtype=int32)

In [94]:
dim = 100

%timeit for_loop(dim)
%timeit list_comp(dim)
%timeit numpy_array(dim)

640 µs ± 29.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
203 µs ± 2.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
109 µs ± 2.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
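The `numpy_array` function above still builds its rows in a Python-level list comprehension. As a sketch of an alternative that stays inside NumPy, the hypothetical helper `numpy_tile` below repeats one row with `np.tile`. No timings are claimed here, since they depend on the machine and NumPy version; `%timeit numpy_tile(dim)` can be used to compare.

```python
import numpy as np

def numpy_tile(dim):
    # repeat the row [0, 1, ..., dim-1] dim times without a Python-level loop
    return np.tile(np.arange(dim, dtype=np.int32), (dim, 1))

result = numpy_tile(5)
print(result)
```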


# NumPy Summary
* ``ndarray``: a fast, memory-efficient, fixed-size array whose elements all share one type, with advanced built-in mathematical operations
* Basis for many (all?) popular ML libraries
* Use ``ndarray`` (and learn how to use it correctly)! Read the documentation at https://docs.scipy.org/doc/numpy-1.13.0/index.html

# NumPy Exercises
## 1. What is the ```dtype``` of the following arrays?

In [95]:
a = np.array([1, 2, 3])
a.dtype

dtype('int64')

In [96]:
a = np.array([1, 2, 3.0])
a.dtype

dtype('float64')

In [97]:
a = np.array([1, 2, 3.0], dtype=int)
a.dtype

dtype('int64')

In [98]:
a = np.array([1, 2, 3.0, 's'])
a.dtype

dtype('<U32')

In [99]:
a = np.array(['a', 'b', 'c'])
a.dtype

dtype('<U1')

In [100]:
a = np.array([True, True, False])
a.dtype

dtype('bool')

In [101]:
a = np.array([True, True, 'False'])
a.dtype

dtype('<U5')

In [102]:
a = np.array([True, 1., 'False'])
a.dtype

dtype('<U32')

In [103]:
a = np.array([{'name': 'Peter', 'age': 32}, 
              {'name': 'Marie', 'age': 31}])
a.dtype

dtype('O')

## 2. How do you create a NumPy array with the following specifications?
* ```array([0, 1, 2, 3])```
* ```array([[ True,  True], [ True,  True], [ True,  True]])```
* ```array([[0., 0., 0.], [0., 0., 0.]])```
* ```array([[0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2]])```

## 3. Flattening arrays: ```flatten()``` vs ```ravel()```

* What are the differences between the ```flatten()``` and the ```ravel()``` methods?
* Why does changing a value after flattening the array using ```flatten()``` not work? How can you use ```flatten()``` in a way that changes the data?
* Does modifying values when using ```ravel()``` always change the original data?
* Bonus question: What is the equivalent method to the ```ravel()``` method?

In [None]:
a = np.arange(9).reshape(3,3)
a

In [None]:
# flatten the array with the flatten method
a.flatten()

In [None]:
# flatten the array with the ravel method
a.ravel()

In [None]:
a.flatten()[3] = -1

In [None]:
a = a.flatten()
a[3] = -1
a.reshape(3,3)

In [None]:
a.ravel()[3] = -1 
a

In [None]:
a

## 4. Stacking arrays
* What is the difference between the functions ```stack()``` and ```concatenate()```?
* Check the cells below. How can you recreate the array a from the arrays b and c?
* How can you add c as a column to b?
* Why does adding d to b throw a ValueError? How can you change d so it can be concatenated to b?

In [None]:
a = np.arange(6).reshape(3,2)
a

In [None]:
# split the original array into two arrays: b as 2x2
b, c = np.split(a, [2], axis=0)
b

In [None]:
c

In [None]:
# recreating array a using b and c

# your code here...
np.concatenate((b, c))

# the result should be:
# array([[0, 1],
#        [2, 3],
#        [4, 5]])

In [None]:
# adding c as a column to b

# your code here...
np.concatenate((b, c.T), axis=1)

# the result should be:
# array([[0, 1, 4],
#        [2, 3, 5]])

In [None]:
d = np.array([6,7]).reshape(2,1)
d

In [None]:
d.shape

In [None]:
# why does concatenating b and d not work?
# what is the difference between d and c?

# run this cell to see the error:
np.concatenate((b, d))

In [None]:
# how can you change d so the concatenation with b works?

# your code here...
np.concatenate((b, d.T))

# the result should be:
# array([[0, 1],
#        [2, 3],
#        [6, 7]])

# NumPy Exercise Answers
1. You can check the data type of a numpy array with the ```dtype``` attribute. If you do not specify the ```dtype``` when creating an array, it "will be determined as the minimum type required to hold the objects in the sequence" (https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.array.html). Documentation about data types can be found at https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html. For this exercise it is important to understand:
  * up- and downcasting when the values have different data types or when specifying a dtype manually.
  * how NumPy efficiently assigns memory to the values in the array and how this can be seen in the dtype attribute (e.g. '<U1' or 'int64').
2. The following statements create the specified arrays:
  * ```np.arange(4)``` creates ```array([0, 1, 2, 3])```
  * ```np.full((3, 2), True)``` creates ```array([[ True,  True], [ True,  True], [ True,  True]])```
  * ```np.zeros((2, 3))``` creates ```array([[0., 0., 0.], [0., 0., 0.]])```
  * ```np.array([np.arange(3) for i in range(4)])``` creates ```array([[0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2]])```
3. Flattening arrays using ```flatten()``` and ```ravel()```:
  * ```flatten()``` always returns a 1D copy of the array, whereas ```ravel()``` returns a 1D view of the original data whenever possible and a copy only when needed.
  * ```a.flatten()[3] = -1``` modifies a temporary copy that is discarded immediately, so the original values do not change. In this example you can use ```flatten()``` to change the data by overwriting the original array (```a = a.flatten()```), changing the value (```a[3] = -1```), and reshaping the array to its original 3x3 structure (```a.reshape(3,3)```).
  * Modifying values when using ```ravel()``` does not always change the original data. A copy is created when the elements cannot be exposed as a contiguous 1D view, for example for a transposed array or a non-contiguous slice such as ```a[:, ::2]```.
  * Bonus question: ```a.reshape(-1)``` is equivalent to ```ravel()```.
  * Check https://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel.html, https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.flatten.html, and https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.reshape.html for the documentation.
4. Stacking:
  * ```stack()``` requires each input array to have the same shape (https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.stack.html) while ```concatenate()``` allows a different shape "in the dimension corresponding to the *axis*" (https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.concatenate.html)
  * Recreating the array a can be done with the statement ```np.concatenate((b, c))```
  * Adding c as a column to b can be done with the statement ```np.concatenate((b, c.T), axis=1)```
  * Trying to add d to b throws a ValueError stating that all input array dimensions except for the concatenation axis must match exactly. The issue is that d has the shape ```(2, 1)``` while b has the shape ```(2, 2)```: both are 2D, but when concatenating along axis 0 their sizes along axis 1 differ (1 vs. 2). The ```concatenate()``` function requires the arrays to "have the same shape, except in the dimension corresponding to axis" (https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.concatenate.html). d has to have the shape ```(1, 2)```, like c, to be concatenated to b.
  * For successfully adding d to b, we have to change the shape of d to ```(1, 2)```, either by transposing it (```np.concatenate((b, d.T))```) or by using the ```reshape()``` method (```np.concatenate((b, d.reshape(1,2)))```).
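For answer 3, one way to check whether a flattened result is a view or a copy is `np.shares_memory`. This is a sketch for illustration, with illustrative variable names; it is not part of the original exercise.

```python
import numpy as np

a = np.arange(9).reshape(3, 3)

ravel_is_view = np.shares_memory(a, a.ravel())        # contiguous array: ravel can return a view
ravel_of_t_is_view = np.shares_memory(a, a.T.ravel()) # transposed array: ravel has to copy
flatten_is_view = np.shares_memory(a, a.flatten())    # flatten always copies

print(ravel_is_view, ravel_of_t_is_view, flatten_is_view)
```

Only the first call reports shared memory, so writing into `a.ravel()` changes `a`, while writing into the other two results does not.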


# pandas

* "pandas is a Python package **providing fast, flexible, and expressive data structures** designed to make working with 'relational' or 'labeled' data both easy and intuitive."
* "built on top of NumPy"
* "It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python."
* "pandas is well suited for many different kinds of data:
  * **Tabular data with heterogeneously-typed columns**, as in an SQL table or Excel spreadsheet
  * Ordered and unordered (not necessarily fixed-frequency) **time series data**.
  * Arbitrary **matrix data** (homogeneously typed or heterogeneous) with row and column labels
  * Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure"
* Primary **data structures**:
  * **Series** (1-dimensional)
  * **DataFrame** (2-dimensional)

https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html




# Series

`class pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)`[[source](https://github.com/pandas-dev/pandas/blob/v0.24.2/pandas/core/series.py#L102-L4383)]

* "One-dimensional `ndarray` with axis labels (including time series)"
* Like a column in a spreadsheet
* "The axis labels are collectively referred to as the **index**"
* "capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.)"
* All elements in a `Series` have the same data type

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#series

In [119]:
import pandas as pd
# create a pandas Series
pd.Series([1, 3, 5, 2, 6, 8])

0    1
1    3
2    5
3    2
4    6
5    8
dtype: int64

In [105]:
pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])

a    0
b    1
c    2
d    3
e    4
dtype: int64

In [106]:
# creating a Series using a dict
d = {'a': 0., 'b': 1., 'c': 2.}
pd.Series(d)

a    0.0
b    1.0
c    2.0
dtype: float64

In [107]:
# creating a Series from a dict but in a specified order
pd.Series(d, index=['b', 'c', 'd', 'a'])

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

"NaN (not a number) is the standard missing data marker used in pandas"
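A short sketch of how such missing values can be detected and handled, reusing the dict from the cells above; the names `s_missing`, `mask`, `dropped`, and `filled` are illustrative.

```python
import pandas as pd

d = {'a': 0., 'b': 1., 'c': 2.}
s_missing = pd.Series(d, index=['b', 'c', 'd', 'a'])

mask = s_missing.isna()          # True where the value is missing
dropped = s_missing.dropna()     # remove the missing entries
filled = s_missing.fillna(0.)    # or replace them with a chosen value

print(mask.tolist(), dropped.index.tolist(), filled['d'])
```

Whether dropping or filling is appropriate depends on what the missing value means in the data at hand.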

In [108]:
# creating a Series from a scalar value: the scalar is repeated
# to match the length of the index
pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

In [109]:
np.random.seed(23)
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a    0.666988
b    0.025813
c   -0.777619
d    0.948634
e    0.701672
dtype: float64

## Series is ndarray-like
https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#series-is-ndarray-like

In [110]:
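# select value at position (.iloc is explicitly position-based;
# plain s[0] on a labeled index is deprecated in recent pandas versions)
s.iloc[0]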

0.6669880563534684

In [120]:
# slice Series
s[:3]

a    0.666988
b    0.025813
c   -0.777619
dtype: float64

In [121]:
# select elements that are larger than the median
s[s > s.median()]

d    0.948634
e    0.701672
dtype: float64

In [122]:
# select elements with a list of positional numbers (array-based indexing)
s.iloc[[4, 3, 1]]

e    0.701672
d    0.948634
b    0.025813
dtype: float64

In [123]:
# converting the Series to a pandas array (ExtensionArray) without the index
s.array

<PandasArray>
[ 0.6669880563534684, 0.02581308106627382, -0.7776194131918178,
  0.9486338224949431,   0.701671794647513]
Length: 5, dtype: float64

In [124]:
# converting a Series to a numpy ndarray
s.to_numpy()

array([ 0.66698806,  0.02581308, -0.77761941,  0.94863382,  0.70167179])

## Series is dict-like
https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#series-is-dict-like

In [125]:
s

a    0.666988
b    0.025813
c   -0.777619
d    0.948634
e    0.701672
dtype: float64

In [126]:
# select value at index (label)
s['a']

0.6669880563534684

In [127]:
# change value at index
s['e'] = 12.
s

a     0.666988
b     0.025813
c    -0.777619
d     0.948634
e    12.000000
dtype: float64

In [128]:
# check if a label is in the index
'e' in s

True

In [129]:
'f' in s

False

In [131]:
# s['f'] raises a KeyError exception if 'f' is not in s
#s['f']

In [132]:
# get returns None (or specified default) if the index is missing
print(s.get('f'))

None


In [133]:
# return user-defined value for missing index
s.get('f', 0)

0

## Vectorized operations and label alignment with Series
https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#vectorized-operations-and-label-alignment-with-series

In [134]:
s

a     0.666988
b     0.025813
c    -0.777619
d     0.948634
e    12.000000
dtype: float64

In [135]:
# addition
s+s

a     1.333976
b     0.051626
c    -1.555239
d     1.897268
e    24.000000
dtype: float64

In [136]:
# multiplication
s*3

a     2.000964
b     0.077439
c    -2.332858
d     2.845901
e    36.000000
dtype: float64

In [137]:
np.exp(s)

a         1.948360
b         1.026149
c         0.459499
d         2.582180
e    162754.791419
dtype: float64

In [138]:
s[1:]

b     0.025813
c    -0.777619
d     0.948634
e    12.000000
dtype: float64

In [139]:
s[:-1]

a    0.666988
b    0.025813
c   -0.777619
d    0.948634
dtype: float64

In [140]:
# operations between Series automatically align the data based on index (label)
s[1:] + s[:-1]

a         NaN
b    0.051626
c   -1.555239
d    1.897268
e         NaN
dtype: float64
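
Labels that appear in only one operand yield `NaN` after alignment. The sketch below uses a small, made-up Series (not the random `s` from above) to show two common ways of handling this: `Series.add` with `fill_value`, and `dropna`.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

# plain addition: labels present in only one operand become NaN
plain = s.iloc[1:] + s.iloc[:-1]

# Series.add with fill_value treats a missing label as 0 instead
filled = s.iloc[1:].add(s.iloc[:-1], fill_value=0)

# alternatively, drop the labels that could not be aligned
dropped = plain.dropna()

print(plain)
print(filled)
print(dropped)
```

Which option is appropriate depends on whether a missing label should count as "no contribution" (`fill_value=0`) or as "no result" (`dropna`).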

## Time Series

"pandas contains extensive capabilities and features for working with time series data for all domains. Using the NumPy *datetime64* and *timedelta64* dtypes, pandas has consolidated a large number of features from other Python libraries like *scikits.timeseries* as well as created a tremendous amount of new functionality for manipulating time series data."


https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries

"pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the Time Series section."

https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#time-series


### Time Concepts

"pandas captures 4 general time related concepts:
* Date times: A specific date and time with timezone support. Similar to ```datetime.datetime``` from the standard library.
* Time deltas: An absolute time duration. Similar to ```datetime.timedelta``` from the standard library.
* Time spans: A span of time defined by a point in time and its associated frequency.
* Date offsets: A relative time duration that respects calendar arithmetic. Similar to ```dateutil.relativedelta.relativedelta``` from the dateutil package."


| Concept | Scalar Class | Array Class | pandas Data Type | Primary Creation Method |
|:--------|:-------------|:------------|:-----------------|:------------------------|
| Date times | ```Timestamp``` | ```DatetimeIndex``` | ```datetime64[ns]``` or ```datetime64[ns, tz]``` | ```to_datetime``` or ```date_range``` |
| Time deltas | ```Timedelta``` | ```TimedeltaIndex``` | ```timedelta64[ns]``` | ```to_timedelta``` or ```timedelta_range``` |
| Time spans | ```Period``` | ```PeriodIndex``` | ```period[freq]``` | ```Period``` or ```period_range``` |
| Date offsets | ```DateOffset``` | ```None``` | ```None``` | ```DateOffset``` |


https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#overview
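
As a quick orientation, the following sketch creates one scalar and one array object for each concept in the table, using the creation methods listed there. The concrete dates and frequencies are arbitrary illustration values.

```python
import pandas as pd

# Date times: scalar Timestamp, array DatetimeIndex
ts = pd.to_datetime('2019-06-12')
dti = pd.date_range('2019-06-12', periods=3, freq='D')

# Time deltas: scalar Timedelta, array TimedeltaIndex
td = pd.to_timedelta('1 day')
tdi = pd.timedelta_range(start='1 day', periods=3, freq='D')

# Time spans: scalar Period, array PeriodIndex
p = pd.Period('2019-06', freq='M')
pi = pd.period_range('2019-06', periods=3, freq='M')

# Date offsets: scalar DateOffset, no dedicated array class
off = pd.DateOffset(months=1)

for obj in (ts, dti, td, tdi, p, pi, off):
    print(type(obj).__name__)
```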

#### Timestamp
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timestamps-vs-time-spans

In [141]:
# create a timestamp using the datetime module
import datetime
pd.Timestamp(datetime.datetime(2019, 6, 12))

Timestamp('2019-06-12 00:00:00')

In [142]:
# create a timestamp from year, month and day integers
pd.Timestamp(2019, 6, 12)

Timestamp('2019-06-12 00:00:00')

In [145]:
# create a timestamp from a datetime-like string
t = pd.Timestamp('2019-06-25')
t

Timestamp('2019-06-25 00:00:00')

In [146]:
# get the day name for Timestamp t
t.day_name()

'Tuesday'

#### Timedelta
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects

In [147]:
# add a time delta to a timestamp
t + pd.Timedelta('3 day')

Timestamp('2019-06-28 00:00:00')

#### Period
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timestamps-vs-time-spans

In [148]:
# create a period with an inferred frequency
pd.Period('2011-01')

Period('2011-01', 'M')

In [149]:
# create a period with a defined frequency
pd.Period('2012-05', freq='D')

Period('2012-05-01', 'D')

#### Offset

https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects

In [150]:
t + pd.DateOffset(days=1)

Timestamp('2019-06-26 00:00:00')
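
For a single day, `Timedelta` and `DateOffset` give the same result. The difference described under Time Concepts (absolute duration vs. calendar arithmetic) shows up with larger units. A minimal sketch with an arbitrarily chosen end-of-month date:

```python
import pandas as pd

t = pd.Timestamp('2019-01-31')

# Timedelta is an absolute duration: exactly 30 * 24 hours later
plus_30_days = t + pd.Timedelta(days=30)

# DateOffset follows the calendar: one month later, clipped to the
# last valid day of February
plus_one_month = t + pd.DateOffset(months=1)

print(plus_30_days)
print(plus_one_month)
```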

### Ranges of Timestamps
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#generating-ranges-of-timestamps

In [151]:
# create a range of dates
rng = pd.date_range('1/1/2019', periods=20, freq='S')
rng

DatetimeIndex(['2019-01-01 00:00:00', '2019-01-01 00:00:01',
               '2019-01-01 00:00:02', '2019-01-01 00:00:03',
               '2019-01-01 00:00:04', '2019-01-01 00:00:05',
               '2019-01-01 00:00:06', '2019-01-01 00:00:07',
               '2019-01-01 00:00:08', '2019-01-01 00:00:09',
               '2019-01-01 00:00:10', '2019-01-01 00:00:11',
               '2019-01-01 00:00:12', '2019-01-01 00:00:13',
               '2019-01-01 00:00:14', '2019-01-01 00:00:15',
               '2019-01-01 00:00:16', '2019-01-01 00:00:17',
               '2019-01-01 00:00:18', '2019-01-01 00:00:19'],
              dtype='datetime64[ns]', freq='S')

In [152]:
# create a time series with random values
np.random.seed(23)
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts

2019-01-01 00:00:00     83
2019-01-01 00:00:01    230
2019-01-01 00:00:02     40
2019-01-01 00:00:03    457
2019-01-01 00:00:04    438
2019-01-01 00:00:05    488
2019-01-01 00:00:06     31
2019-01-01 00:00:07    237
2019-01-01 00:00:08    460
2019-01-01 00:00:09    347
2019-01-01 00:00:10     39
2019-01-01 00:00:11     90
2019-01-01 00:00:12    153
2019-01-01 00:00:13    435
2019-01-01 00:00:14      6
2019-01-01 00:00:15    379
2019-01-01 00:00:16    429
2019-01-01 00:00:17     12
2019-01-01 00:00:18     49
2019-01-01 00:00:19    194
Freq: S, dtype: int64

In [153]:
# resample time series (change frequency)
ts.resample('2S').sum()

2019-01-01 00:00:00    313
2019-01-01 00:00:02    497
2019-01-01 00:00:04    926
2019-01-01 00:00:06    268
2019-01-01 00:00:08    807
2019-01-01 00:00:10    129
2019-01-01 00:00:12    588
2019-01-01 00:00:14    385
2019-01-01 00:00:16    441
2019-01-01 00:00:18    243
Freq: 2S, dtype: int64
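
`resample` only groups the data into bins; the aggregation that follows decides what each bin contains. The sketch below uses a small deterministic series with daily frequency (an assumption made for readability, not the secondly data from above) and applies several aggregations to the same bins.

```python
import pandas as pd

rng = pd.date_range('2019-01-01', periods=6, freq='D')
ts = pd.Series([1, 2, 3, 4, 5, 6], index=rng)

# group into 2-day bins once, then aggregate in different ways
two_day = ts.resample('2D')
sums = two_day.sum()
means = two_day.mean()
summary = two_day.agg(['min', 'max'])  # one column per aggregation

print(sums)
print(means)
print(summary)
```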

### Time Zones
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-zone-handling

In [154]:
# create a range of dates for the index
rng = pd.date_range('2012-03-06 00:00', periods=5, freq='D')
rng

DatetimeIndex(['2012-03-06', '2012-03-07', '2012-03-08', '2012-03-09',
               '2012-03-10'],
              dtype='datetime64[ns]', freq='D')

In [155]:
# create a time series
ts = pd.Series(np.random.randn(len(rng)), rng)
ts

2012-03-06   -0.347459
2012-03-07    0.670140
2012-03-08    0.322272
2012-03-09    0.060343
2012-03-10   -1.043450
Freq: D, dtype: float64

In [156]:
# localize to UTC time zone
ts_utc = ts.tz_localize('UTC')
ts_utc

2012-03-06 00:00:00+00:00   -0.347459
2012-03-07 00:00:00+00:00    0.670140
2012-03-08 00:00:00+00:00    0.322272
2012-03-09 00:00:00+00:00    0.060343
2012-03-10 00:00:00+00:00   -1.043450
Freq: D, dtype: float64

In [157]:
# convert time series to different time zone
ts_utc.tz_convert('US/Eastern')

2012-03-05 19:00:00-05:00   -0.347459
2012-03-06 19:00:00-05:00    0.670140
2012-03-07 19:00:00-05:00    0.322272
2012-03-08 19:00:00-05:00    0.060343
2012-03-09 19:00:00-05:00   -1.043450
Freq: D, dtype: float64
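
The two methods do different things: `tz_localize` attaches a time zone to naive timestamps without changing the wall-clock time, while `tz_convert` keeps the instant and changes the wall-clock time. A small sketch on a single `Timestamp` (the zone name `America/New_York` is chosen here for illustration; it refers to the same zone as `US/Eastern`):

```python
import pandas as pd

naive = pd.Timestamp('2012-03-06 12:00')

# attach a time zone: wall-clock time stays 12:00
utc = naive.tz_localize('UTC')

# same instant, expressed in another zone: wall-clock time shifts
eastern = utc.tz_convert('America/New_York')

print(naive, utc, eastern, sep='\n')
print(utc == eastern)  # both describe the same moment
```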

### Converting between representations
"Timestamped data can be converted to PeriodIndex-ed data using ```to_period``` and vice-versa using ```to_timestamp```."

https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#converting-between-representations

In [158]:
# create a range of dates
rng = pd.date_range('1/1/2012', periods=5, freq='M')
rng

DatetimeIndex(['2012-01-31', '2012-02-29', '2012-03-31', '2012-04-30',
               '2012-05-31'],
              dtype='datetime64[ns]', freq='M')

In [159]:
# create a time series from index
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

2012-01-31   -1.009942
2012-02-29    0.441736
2012-03-31    1.128877
2012-04-30   -1.838068
2012-05-31   -0.938769
Freq: M, dtype: float64

In [160]:
# change frequency of time series (convert to period)
ps = ts.to_period()
ps

2012-01   -1.009942
2012-02    0.441736
2012-03    1.128877
2012-04   -1.838068
2012-05   -0.938769
Freq: M, dtype: float64

In [161]:
# "Cast to datetimeindex of timestamps, at *beginning* of period."
# note the difference to the original Series ts, whose timestamps were month ends
ps.to_timestamp()

2012-01-01   -1.009942
2012-02-01    0.441736
2012-03-01    1.128877
2012-04-01   -1.838068
2012-05-01   -0.938769
Freq: MS, dtype: float64
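
The round trip through `to_period` therefore does not restore month-end timestamps automatically. As a hedged sketch, the `how` parameter of `to_timestamp` selects which end of the period is used; the values below are placeholders, not the random data from above.

```python
import pandas as pd

ps = pd.Series([1.0, 2.0, 3.0],
               index=pd.period_range('2012-01', periods=3, freq='M'))

# default: timestamps at the beginning of each period
start = ps.to_timestamp()

# how='end': timestamps at the end of each period
end = ps.to_timestamp(how='end')

print(start.index)
print(end.index)
```

In recent pandas versions the `how='end'` timestamps point at the last instant of the period rather than midnight of the last day; `end.index.normalize()` reduces them to plain dates if needed.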

### Indexing

"One of the main uses for ```DatetimeIndex``` is as an index for pandas objects. The ```DatetimeIndex``` class contains many time series related optimizations:
* A large range of dates for various offsets are pre-computed and cached under the hood in order to make generating subsequent date ranges very fast (just have to grab a slice).
* Fast shifting using the ```shift``` and ```tshift``` method on pandas objects.
* Unioning of overlapping ```DatetimeIndex``` objects with the same frequency is very fast (important for fast data alignment).
* Quick access to date fields via properties such as ```year```, ```month```, etc.
* Regularization functions like ```snap``` and very fast ```asof``` logic."

https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#indexing


In [162]:
# create a range of dates (inclusive end!)
rng = pd.date_range('2019-06-01', '2019-06-07')
rng

DatetimeIndex(['2019-06-01', '2019-06-02', '2019-06-03', '2019-06-04',
               '2019-06-05', '2019-06-06', '2019-06-07'],
              dtype='datetime64[ns]', freq='D')

In [163]:
# create a time series
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

2019-06-01   -0.201841
2019-06-02    1.045371
2019-06-03    0.538162
2019-06-04    0.812119
2019-06-05    0.241106
2019-06-06   -0.952510
2019-06-07   -0.136267
Freq: D, dtype: float64

In [164]:
# get the first 5 timestamps
ts[:5]

2019-06-01   -0.201841
2019-06-02    1.045371
2019-06-03    0.538162
2019-06-04    0.812119
2019-06-05    0.241106
Freq: D, dtype: float64

In [165]:
# get every second timestamp
ts[::2]

2019-06-01   -0.201841
2019-06-03    0.538162
2019-06-05    0.241106
2019-06-07   -0.136267
Freq: 2D, dtype: float64
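
Besides positional slicing, a Series with a ```DatetimeIndex``` can be indexed with date strings. The sketch below rebuilds the same date range with deterministic values (an assumption, to keep the output predictable) and selects by a full date, a date range and a month.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2019-06-01', '2019-06-07')
ts = pd.Series(np.arange(len(rng)), index=rng)

# a single day
one_day = ts.loc['2019-06-03']

# a range of days (both ends are included, as with label slicing)
window = ts.loc['2019-06-03':'2019-06-05']

# a partial string selects everything within that month
month = ts.loc['2019-06']

print(one_day)
print(window)
print(len(month))
```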

## Series Exercises



# DataFrame

`class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)` [[source](http://github.com/pandas-dev/pandas/blob/v0.24.2/pandas/core/frame.py#L290-L7950)]

* Primary data structure in pandas
* 2D tabular data structure
* Labeled rows and columns (axes)
* Potentially heterogeneous data (different data types)
* Mutable size
* "dict-like container for Series objects" (similar to a spreadsheet)

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html and https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe

## Create DataFrame from dict of Series or dicts
https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#from-dict-of-series-or-dicts

In [166]:
# data dictionary
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
# creating a DataFrame from a dictionary
pd.DataFrame(d)

Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0


In [167]:
# create a DataFrame with user-specified index order
pd.DataFrame(d, index=['d', 'b', 'a'])

Unnamed: 0,one,two
d,,4.0
b,2.0,2.0
a,1.0,1.0


In [168]:
# create a DataFrame with user-specified columns and index order
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

Unnamed: 0,two,three
d,4.0,
b,2.0,
a,1.0,


## Create DataFrame from ndarrays or dict of ndarrays / lists
https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#from-dict-of-ndarrays-lists

In [169]:
# create a DataFrame from a numpy ndarray
pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
             columns=['a', 'b', 'c'])

Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6
2,7,8,9


In [170]:
# create a dataframe from a dict of lists
d = {'one': [1., 2., 3., 4.],
     'two': [4., 3., 2., 1.]}
pd.DataFrame(d)

Unnamed: 0,one,two
0,1.0,4.0
1,2.0,3.0
2,3.0,2.0
3,4.0,1.0


In [171]:
# create a DataFrame from a dict of lists with user-defined indexes
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

Unnamed: 0,one,two
a,1.0,4.0
b,2.0,3.0
c,3.0,2.0
d,4.0,1.0


## From a list of dicts
https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#from-a-list-of-dicts

In [172]:
# create a DataFrame from a list of dicts
d = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(d)

Unnamed: 0,a,b,c
0,1,2,
1,5,10,20.0


In [173]:
# create a DataFrame from a list of dicts with user-defined index
pd.DataFrame(d, index=['first', 'second'])

Unnamed: 0,a,b,c
first,1,2,
second,5,10,20.0


In [174]:
# create a DataFrame from a list of dicts with user-defined columns 
pd.DataFrame(d, columns=['a', 'b'])

Unnamed: 0,a,b
0,1,2
1,5,10


## From a dict of tuples
https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#from-a-dict-of-tuples

In [175]:
# create a DataFrame from a dict of tuples
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
              ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
              ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
              ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
              ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

Unnamed: 0_level_0,Unnamed: 1_level_0,a,a,a,b,b
Unnamed: 0_level_1,Unnamed: 1_level_1,b,a,c,a,b
A,B,1.0,4.0,5.0,8.0,10.0
A,C,2.0,3.0,6.0,7.0,
A,D,,,,,9.0


## Creating a DataFrame by passing a dict of objects
https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#object-creation

In [202]:
# create a DataFrame from a dict of objects
df = pd.DataFrame({'A': 1.,
                   'B': pd.Timestamp('20130102'),
                   'C': pd.Series(1, index=['a', 'b', 'c', 'd'], dtype='float32'),
                   'D': np.array([3] * 4, dtype='int32'),
                   'E': pd.Categorical(["test", "train", "test", "train"]),
                   'F': 'foo'}, 
                  index=['a', 'b', 'c', 'd'])
df

Unnamed: 0,A,B,C,D,E,F
a,1.0,2013-01-02,1.0,3,test,foo
b,1.0,2013-01-02,1.0,3,train,foo
c,1.0,2013-01-02,1.0,3,test,foo
d,1.0,2013-01-02,1.0,3,train,foo


## Creating an empty DataFrame

In [177]:
# create an empty DataFrame
pd.DataFrame()

In [178]:
# use the .empty attribute to check if a DataFrame is empty
pd.DataFrame().empty

True

In [179]:
# create a DataFrame with 5 rows and 3 columns that contains only missing values (NaN)
pd.DataFrame(index=np.arange(5), columns=['A', 'B', 'C'])

Unnamed: 0,A,B,C
0,,,
1,,,
2,,,
3,,,
4,,,


## Basic methods and attributes on a DataFrame
https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#minutes-to-pandas

In [181]:
df

Unnamed: 0,A,B,C,D,E,F
a,1.0,2013-01-02,1.0,3,test,foo
b,1.0,2013-01-02,1.0,3,train,foo
c,1.0,2013-01-02,1.0,3,test,foo
d,1.0,2013-01-02,1.0,3,train,foo


In [180]:
# view the dimensionality of the DataFrame: (rows, columns)
df.shape

(4, 6)

In [183]:
# get the data types of the columns (Series) in the DataFrame
df.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

In [184]:
# get the first x rows of the DataFrame
df.head(2)

Unnamed: 0,A,B,C,D,E,F
a,1.0,2013-01-02,1.0,3,test,foo
b,1.0,2013-01-02,1.0,3,train,foo


In [185]:
# get the last x rows of the DataFrame
df.tail(1)

Unnamed: 0,A,B,C,D,E,F
d,1.0,2013-01-02,1.0,3,train,foo


In [186]:
# get some basic statistics of the DataFrame
df.describe()

Unnamed: 0,A,C,D
count,4.0,4.0,4.0
mean,1.0,1.0,3.0
std,0.0,0.0,0.0
min,1.0,1.0,3.0
25%,1.0,1.0,3.0
50%,1.0,1.0,3.0
75%,1.0,1.0,3.0
max,1.0,1.0,3.0


In [203]:
# transpose the DataFrame
df.T

Unnamed: 0,a,b,c,d
A,1,1,1,1
B,2013-01-02 00:00:00,2013-01-02 00:00:00,2013-01-02 00:00:00,2013-01-02 00:00:00
C,1,1,1,1
D,3,3,3,3
E,test,train,test,train
F,foo,foo,foo,foo


In [204]:
# get a NumPy representation of the DataFrame
df.values

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
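
The result above has ```dtype=object``` because a single ndarray must hold all columns, and the columns have different types. A minimal sketch, assuming a reduced version of the DataFrame above, shows that selecting only the numeric columns gives a numeric array instead:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 1.,
                   'C': pd.Series(1, index=['a', 'b'], dtype='float32'),
                   'D': np.array([3] * 2, dtype='int32'),
                   'F': 'foo'},
                  index=['a', 'b'])

# mixed column types: the common type is the generic Python object
mixed = df.to_numpy()

# numeric columns only: the values are upcast to a common numeric type
numeric = df[['A', 'C', 'D']].to_numpy()

print(mixed.dtype)
print(numeric.dtype)
```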

## Cells
Access single values with row/column pairs using the ```at``` and ```iat``` attributes.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.at.html and 
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iat.html

In [205]:
df

Unnamed: 0,A,B,C,D,E,F
a,1.0,2013-01-02,1.0,3,test,foo
b,1.0,2013-01-02,1.0,3,train,foo
c,1.0,2013-01-02,1.0,3,test,foo
d,1.0,2013-01-02,1.0,3,train,foo


In [206]:
# select a single value using the DataFrame labels
# select value in row b and column D
df.at['b', 'D']

3

In [207]:
# select a single value by integer position
# select value in row 0 ('a') and column 5 ('F') (zero-based)
df.iat[0, 5]

'foo'

In [208]:
df

Unnamed: 0,A,B,C,D,E,F
a,1.0,2013-01-02,1.0,3,test,foo
b,1.0,2013-01-02,1.0,3,train,foo
c,1.0,2013-01-02,1.0,3,test,foo
d,1.0,2013-01-02,1.0,3,train,foo


## Rows and Columns

In [209]:
# create a DataFrame with 3 columns ('one', 'two' and 'three') and
# 4 rows ('a' - 'd')
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
     'three': pd.Series([1., 2., 4.], index=['a', 'b', 'd'])}

df = pd.DataFrame(d)
df

Unnamed: 0,one,two,three
a,1.0,1.0,1.0
b,2.0,2.0,2.0
c,3.0,3.0,
d,,4.0,4.0


### The ```index``` and ```columns``` attributes

In [210]:
# get the row labels (index) of the DataFrame
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [211]:
df.columns

Index(['one', 'two', 'three'], dtype='object')

### Slice DataFrame rows using ```[]```

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#slicing-ranges

In [212]:
# slice rows (exclusive end)
df[0:1]

Unnamed: 0,one,two,three
a,1.0,1.0,1.0


In [213]:
# slice rows default (select all)
df[::]

Unnamed: 0,one,two,three
a,1.0,1.0,1.0
b,2.0,2.0,2.0
c,3.0,3.0,
d,,4.0,4.0


In [214]:
# reverse order
df[::-1]

Unnamed: 0,one,two,three
d,,4.0,4.0
c,3.0,3.0,
b,2.0,2.0,2.0
a,1.0,1.0,1.0


In [215]:
# select every second row
df[::2]

Unnamed: 0,one,two,three
a,1.0,1.0,1.0
c,3.0,3.0,


### Select DataFrame columns
#### using ```[]```

In [216]:
df

Unnamed: 0,one,two,three
a,1.0,1.0,1.0
b,2.0,2.0,2.0
c,3.0,3.0,
d,,4.0,4.0


In [217]:
# select a Series (single column) of a DataFrame
df['one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [218]:
# select multiple columns of a DataFrame
df[['one', 'two']]

Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0


#### as attribute

In [219]:
# access a column directly as an attribute
# https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#attribute-access
df.two

a    1.0
b    2.0
c    3.0
d    4.0
Name: two, dtype: float64

#### with the ```get()``` method

In [220]:
# select a Series using the built-in get method (returns a default value if the column is missing)
# https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#dictionary-like-get-method
df.get('two')

a    1.0
b    2.0
c    3.0
d    4.0
Name: two, dtype: float64

### Select rows and columns using the ```df.loc['label']``` and the ```df.iloc[loc]``` methods

https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#indexing-selection and https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#selection-by-position

In [221]:
df

Unnamed: 0,one,two,three
a,1.0,1.0,1.0
b,2.0,2.0,2.0
c,3.0,3.0,
d,,4.0,4.0


#### Select rows and columns using label based indexing

```DataFrame.loc```

"Access a group of rows and columns by label(s) or a boolean array.

```loc[]``` is primarily label based, but may also be used with a boolean array."

"Integers are valid labels, but they refer to the label and not the position".

"The following are valid inputs:

* A single label, e.g. 5 or 'a' (Note that 5 is interpreted as a label of the index. This use is not an integer position along the index.).
* A list or array of labels ['a', 'b', 'c'].
* A slice object with labels 'a':'f' (Note that contrary to usual python slices, both the **start and the stop are included**, when present in the index! See [Slicing with labels](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-slicing-with-labels).).
* A boolean array.
* A callable, see [Selection By Callable](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-callable)."

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html and https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#selection-by-label
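
The cells below cover single labels and slices. As a complement, this sketch runs through the other input types from the list above (list of labels, boolean array, callable) on a small stand-in DataFrame; the data and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1., 2., 3., np.nan],
                   'two': [1., 2., 3., 4.]},
                  index=['a', 'b', 'c', 'd'])

# list of labels: rows are returned in the order given
by_list = df.loc[['d', 'a']]

# label slice: start and stop are both included
by_slice = df.loc['b':'c']

# boolean array: keep rows where the condition is True
by_mask = df.loc[df['two'] > 2]

# callable: receives the DataFrame and returns a valid indexer
by_callable = df.loc[lambda frame: frame['one'].notna(), 'two']

print(by_list, by_slice, by_mask, by_callable, sep='\n\n')
```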

##### Select rows using ```loc```

In [222]:
df

Unnamed: 0,one,two,three
a,1.0,1.0,1.0
b,2.0,2.0,2.0
c,3.0,3.0,
d,,4.0,4.0


In [223]:
# select row by index label
df.loc['b']

one      2.0
two      2.0
three    2.0
Name: b, dtype: float64

In [224]:
# select rows by starting index label
df.loc['c':]

Unnamed: 0,one,two,three
c,3.0,3.0,
d,,4.0,4.0


In [225]:
# select rows by ending index label (inclusive end!)
df.loc[:'b']

Unnamed: 0,one,two,three
a,1.0,1.0,1.0
b,2.0,2.0,2.0


In [226]:
# select rows by starting and ending index label (inclusive end!)
df.loc['b':'c']

Unnamed: 0,one,two,three
b,2.0,2.0,2.0
c,3.0,3.0,


##### Select columns using ```loc```

In [227]:
# select columns until 'two' (inclusive end!)
df.loc[:, :'two']

Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0


In [228]:
# select a single column
df.loc[:, 'one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

##### Select rows and columns using ```loc```

In [229]:
# select rows (until 'c' inclusive!) and 
# columns (starting from 'two')
df.loc[:'c', 'two':]

Unnamed: 0,two,three
a,1.0,1.0
b,2.0,2.0
c,3.0,


In [230]:
# select single row and single column
df.loc['c', 'one']

3.0

#### Select rows and columns by position

```DataFrame.iloc```

"**Purely integer-location based indexing** for selection by position.

```iloc[]``` is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array."

* "The semantics follow closely Python and NumPy slicing" (0-based indexing)
* "When slicing, the **start bounds is *included*, while the upper bound is *excluded***"
* "Trying to use a non-integer, even a valid label will raise an ```IndexError```"

"The ```iloc``` attribute is the primary access method. The following are valid inputs:
* An integer e.g. 5.
* A list or array of integers [4, 3, 0].
* A slice object with ints 1:7.
* A boolean array.
* A callable, see [Selection By Callable](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-callable)."

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html and
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#selection-by-position
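
A compact sketch of the input types listed above, again on a small stand-in DataFrame with made-up values. One assumption worth noting: the boolean mask is converted to a plain NumPy array first, because ```iloc``` does not accept a boolean Series that carries its own index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1., 2., 3., np.nan],
                   'two': [1., 2., 3., 4.]},
                  index=['a', 'b', 'c', 'd'])

# integer: a single row, returned as a Series
by_int = df.iloc[1]

# list of integers: rows are returned in the order given
by_list = df.iloc[[3, 0]]

# integer slice: the upper bound is excluded
by_slice = df.iloc[1:3]

# boolean array (plain ndarray, not an indexed Series)
by_mask = df.iloc[(df['two'] > 2).to_numpy()]

# callable: receives the DataFrame and returns a valid indexer
by_callable = df.iloc[lambda frame: [0, 2]]

print(by_int, by_list, by_slice, by_mask, by_callable, sep='\n\n')
```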

In [231]:
df

Unnamed: 0,one,two,three
a,1.0,1.0,1.0
b,2.0,2.0,2.0
c,3.0,3.0,
d,,4.0,4.0


##### Select rows using ```iloc```

In [232]:
# select row by integer location of the index
df.iloc[2]

one      3.0
two      3.0
three    NaN
Name: c, dtype: float64

In [233]:
# select rows by starting index integer
df.iloc[2:]

Unnamed: 0,one,two,three
c,3.0,3.0,
d,,4.0,4.0


In [234]:
# select rows by ending index integer (exclusive)
df.iloc[:2]

Unnamed: 0,one,two,three
a,1.0,1.0,1.0
b,2.0,2.0,2.0


##### Select columns using ```iloc```

In [235]:
# select second column (zero-based)
df.iloc[:, 1]

a    1.0
b    2.0
c    3.0
d    4.0
Name: two, dtype: float64

In [236]:
# select columns starting from the second column (position 1)
df.iloc[:, 1:]

Unnamed: 0,two,three
a,1.0,1.0
b,2.0,2.0
c,3.0,
d,4.0,4.0


In [237]:
# select columns up to, but not including, the third column (position 2)
df.iloc[:, :2]

Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0


##### Select rows and columns using ```iloc```

In [239]:
df

Unnamed: 0,one,two,three
a,1.0,1.0,1.0
b,2.0,2.0,2.0
c,3.0,3.0,
d,,4.0,4.0


In [238]:
# select single value from the second row ('b') and 
# the third column ('three', note: zero-based index selection)
df.iloc[1, 2]

2.0

In [240]:
# select rows and columns by position via integer slicing
# select up to the first five rows (only four exist) and the second and third column (exclusive end)
df.iloc[0:5, 1:3]

Unnamed: 0,two,three
a,1.0,1.0
b,2.0,2.0
c,3.0,
d,4.0,4.0


In [241]:
# mixing ranges and single selections is possible:
# select the second and third row and the third column (zero-based)
df.iloc[1:3, 2]

b    2.0
c    NaN
Name: three, dtype: float64

### ```[]``` vs. ```.loc[]``` vs. ```.iloc[]```
For a DataFrame with uppercase letters as column labels ('A', 'B', 'C') and lowercase letters as row labels ('a', 'b', 'c', 'd') the following operations can be applied for selecting or slicing rows or columns (this table shows when exchanging the ```[]``` method with ```loc``` or ```iloc``` returns the same result):

| Operation                          | ```[]``` method      | ```loc``` method | ```iloc``` method |
|:-----------------------------------|:---------------------|:-----------------|:------------------|
| Select a single column by label    | ```df['A']```        | ```df.loc[:, 'A']```        | -      |
| Select list of columns by label    | ```df[['A', 'C']]``` | ```df.loc[:, ['A', 'C']]``` | -      |
| Slice columns by label             | -                    | ```df.loc[:, 'A':'C']```*   | -      |
| Select a single column by position | -                    | -                           | ```df.iloc[:, 1]``` |
| Select list of columns by position | -                    | -                           | ```df.iloc[:, [0, 2]]``` |
| Slice columns by position          | -                    | -                           | ```df.iloc[:, 0:2]``` |
| Select a single row by label       | -                    | ```df.loc['b']```           | - |
| Select a list of rows by label     | -                    | ```df.loc[['b', 'd']]```    | - |
| Slice rows by label                | ```df['b':'d']```*   | ```df.loc['b':'d']```*      | - | 
| Select a single row by position    | -                    | -                           | ```df.iloc[1]```|
| Select a list of rows by position  | -                    | -                           | ```df.iloc[[1, 3]]``` |
| Slice rows by position             | ```df[1:4]```        | -                           | ```df.iloc[1:4]``` | 


\* inclusive end of the selection


Note that you could also combine the selection of rows and columns (for the ```loc``` and ```iloc``` methods but not the ```[]``` method).

Pay attention when assigning values to selections. **Always use ```loc``` or ```iloc``` to be guaranteed to modify the original DataFrame**.

https://stackoverflow.com/a/48411543/6270819 and https://stackoverflow.com/a/47098873/6270819
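
To illustrate the warning above: chained indexing such as ```df[mask]['one'] = 0``` first creates an intermediate object and then assigns to it, so the original DataFrame may stay unchanged. A single ```loc``` call with a row selector and a column selector avoids the intermediate step. The sketch uses a small stand-in DataFrame with made-up values.

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 2., 3., 4.],
                   'two': [1., 2., 3., 4.]},
                  index=['a', 'b', 'c', 'd'])

mask = df['two'] > 2

# not recommended (chained indexing, may assign to a temporary copy):
# df[mask]['one'] = 0.0

# recommended: one loc call selects rows and column and assigns in place
df.loc[mask, 'one'] = 0.0

print(df)
```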

### Select random samples

```DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)```

"Return a random sample of items from an axis of object. You can use *random_state* for reproducibility.

**Parameters:**
* **n:** *int, optional*. Number of items from axis to return. Cannot be used with *frac*. Default ```= 1``` if ```frac = None```.
* **frac:** *float, optional*. Fraction of axis items to return. Cannot be used with *n*.
* **replace:** *bool, default ```False```*. Sample with or without replacement.
* **weights:** *str or ndarray-like, optional*. Default ```None``` results in equal probability weighting. If passed a Series, will align with target object on index. Index values in weights not found in sampled object will be ignored and index values in sampled object not in weights will be assigned weights of zero. If called on a DataFrame, will accept the name of a column when ```axis = 0```. Unless weights are a Series, weights must be same length as axis being sampled. If weights do not sum to 1, they will be normalized to sum to 1. Missing values in the weights column will be treated as zero. Infinite values not allowed.
* **random_state:** *int or ```numpy.random.RandomState```, optional*. Seed for the random number generator (if int), or numpy RandomState object.
* **axis:** *int or string, optional*. Axis to sample. Accepts axis number or name. Default is stat axis for given data type (0 for Series and DataFrames, 1 for Panels).

**Returns: Series or DataFrame:** A new object of same type as caller containing *n* items randomly sampled from the caller object."


https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html
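
A short sketch of two of the parameters described above, on a small stand-in DataFrame: a fixed ```random_state``` makes the sample reproducible, and ```weights``` of zero exclude rows from being drawn. The seed value 23 and the weights are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])

# the same seed gives the same sample
first = df.sample(n=2, random_state=23)
second = df.sample(n=2, random_state=23)

# rows 'a' and 'b' have weight 0, so only 'c' and 'd' can be drawn
weighted = df.sample(n=2, weights=[0, 0, 1, 1], random_state=23)

print(first)
print(second)
print(weighted)
```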

In [242]:
df

Unnamed: 0,one,two,three
a,1.0,1.0,1.0
b,2.0,2.0,2.0
c,3.0,3.0,
d,,4.0,4.0


In [243]:
df.sample(n=3)

Unnamed: 0,one,two,three
b,2.0,2.0,2.0
a,1.0,1.0,1.0
d,,4.0,4.0


In [244]:
df.sample(frac=.5)

Unnamed: 0,one,two,three
a,1.0,1.0,1.0
d,,4.0,4.0


### Modify DataFrame values

#### Modify single values with ```at```
```at``` "provides label-based lookups" for a single value (similar to ```loc```). "Use ```at``` if you only need to get or set a single value in a DataFrame or Series."

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.at.html

In [245]:
df

Unnamed: 0,one,two,three
a,1.0,1.0,1.0
b,2.0,2.0,2.0
c,3.0,3.0,
d,,4.0,4.0


In [246]:
# get a value from a row/column pair
df.at['b', 'two']

2.0

In [247]:
# set a value at a row/column pair
df.at['b', 'two'] = 10
df

Unnamed: 0,one,two,three
a,1.0,1.0,1.0
b,2.0,10.0,2.0
c,3.0,3.0,
d,,4.0,4.0


#### Modify single values with ```iat```
```iat``` "provides integer-based lookups" ("similar to ```iloc```") "if you only need to get or set a single value in a DataFrame or Series.".

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iat.html

In [248]:
# get a value from a row/column pair (zero-based)
df.iat[1, 1]

10.0

In [249]:
df.iat[1, 1] = 20
df

Unnamed: 0,one,two,three
a,1.0,1.0,1.0
b,2.0,20.0,2.0
c,3.0,3.0,
d,,4.0,4.0


#### Modify multiple values with ```loc```
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html

In [250]:
df

Unnamed: 0,one,two,three
a,1.0,1.0,1.0
b,2.0,20.0,2.0
c,3.0,3.0,
d,,4.0,4.0


In [251]:
# Set value for all items matching the list of labels
df.loc[['a', 'b'], ['three']] = 3
df

Unnamed: 0,one,two,three
a,1.0,1.0,3.0
b,2.0,20.0,3.0
c,3.0,3.0,
d,,4.0,4.0


In [252]:
# Set value for an entire row
df.loc['d'] = 0
df

Unnamed: 0,one,two,three
a,1.0,1.0,3.0
b,2.0,20.0,3.0
c,3.0,3.0,
d,0.0,0.0,0.0


In [253]:
# Set value for an entire column
df.loc[:, 'three'] = 1
df

Unnamed: 0,one,two,three
a,1.0,1.0,1
b,2.0,20.0,1
c,3.0,3.0,1
d,0.0,0.0,1


In [254]:
# Set value for rows matching a boolean condition
df.loc[df['two'] < 2] = 5
df

Unnamed: 0,one,two,three
a,5.0,5.0,5
b,2.0,20.0,1
c,3.0,3.0,1
d,5.0,5.0,5


#### Modify multiple values with ```iloc```
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html

The official documentation only mentions the ```loc``` method for setting values. In my experience the ```iloc``` method works just as well, but use ```loc``` when in doubt.

In [255]:
# Set value for an entire row
df.iloc[3] = 1
df

Unnamed: 0,one,two,three
a,5.0,5.0,5
b,2.0,20.0,1
c,3.0,3.0,1
d,1.0,1.0,1


In [256]:
# Set value for an entire column
df.iloc[:, 0] = 0
df

Unnamed: 0,one,two,three
a,0,5.0,5
b,0,20.0,1
c,0,3.0,1
d,0,1.0,1


#### The ```replace``` method

```DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')```

"Replace values given in *to_replace* with *value*.

Values of the DataFrame are replaced with other values dynamically. This differs from updating with ```.loc``` or ```.iloc```, which require you to specify a location to update with some value.

* **to_replace**: str, regex, list, dict, Series, int, float, or None
* **value**: scalar, dict, list, str, regex, default None"

The ```replace``` method returns the "DataFrame object after replacement". This differs from the ```loc```, ```iloc```, ```at```, and ```iat``` indexers, which edit the DataFrame in place when used for assignment.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html
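
Because ```replace``` returns a new object, the result has to be assigned to a name to keep it. A minimal sketch, assuming a hypothetical frame named `demo`:

```python
import pandas as pd

# Hypothetical frame for illustration only.
demo = pd.DataFrame({'one': [0, 0], 'two': [5, 1]}, index=['a', 'b'])

replaced = demo.replace(5, 10)           # new DataFrame with 5 replaced by 10
untouched_value = demo.loc['a', 'two']   # the original still holds 5

demo = demo.replace(5, 10)               # reassign to keep the change
print(untouched_value, replaced.loc['a', 'two'], demo.loc['a', 'two'])
```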

In [257]:
df

Unnamed: 0,one,two,three
a,0,5.0,5
b,0,20.0,1
c,0,3.0,1
d,0,1.0,1


In [258]:
# scalar to_replace and value
df.replace(5, 10)

Unnamed: 0,one,two,three
a,0,10.0,10
b,0,20.0,1
c,0,3.0,1
d,0,1.0,1


In [259]:
# List-like to_replace and scalar value
df.replace([0, 1, 2, 3], 4)

Unnamed: 0,one,two,three
a,4,5.0,5
b,4,20.0,4
c,4,4.0,4
d,4,4.0,4


In [260]:
# List-like to_replace and value
df.replace([0, 1, 2, 3], [1, 2, 3, 4])

Unnamed: 0,one,two,three
a,1,5.0,5
b,1,20.0,2
c,1,4.0,2
d,1,2.0,2


In [261]:
df

Unnamed: 0,one,two,three
a,0,5.0,5
b,0,20.0,1
c,0,3.0,1
d,0,1.0,1


In [262]:
# dict-like to_replace
df.replace({0: 10, 1: 100})

Unnamed: 0,one,two,three
a,10,5.0,5
b,10,20.0,100
c,10,3.0,100
d,10,100.0,100


In [263]:
df

Unnamed: 0,one,two,three
a,0,5.0,5
b,0,20.0,1
c,0,3.0,1
d,0,1.0,1


In [264]:
# dict-like to_replace and scalar value
df.replace({'one': 0, 'two': 5}, 100)

Unnamed: 0,one,two,three
a,100,100.0,5
b,100,20.0,1
c,100,3.0,1
d,100,1.0,1


In [265]:
# dict-like to_replace
df.replace({'two': {1: 100, 3: 300}})

Unnamed: 0,one,two,three
a,0,5.0,5
b,0,20.0,1
c,0,300.0,1
d,0,100.0,1


#### The ```update``` method

```DataFrame.update(other, join='left', overwrite=True, filter_func=None, errors='ignore')```

"Modify in place using non-NA values from another DataFrame.

Aligns on indices. There is no return value.

* **other**: DataFrame, or object coercible into a DataFrame. Should have at least one matching index/column label with the original DataFrame. If a Series is passed, its name attribute must be set, and that will be used as the column name to align with the original DataFrame.
* **join**: {‘left’}, default ‘left’. Only left join is implemented, keeping the index and columns of the original object."

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
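
The ```overwrite``` parameter from the signature above is not covered by the cells below. The following sketch, which assumes two small hypothetical frames `left` and `right`, shows that ```overwrite=False``` only fills missing values in the caller and leaves existing values alone.

```python
import numpy as np
import pandas as pd

# Hypothetical frames for illustration only.
left = pd.DataFrame({'A': [1.0, np.nan, 3.0]})
right = pd.DataFrame({'A': [10.0, 20.0, 30.0]})

# update() modifies `left` in place and returns None.
result = left.update(right, overwrite=False)

# Only the NaN in row 1 was filled; rows 0 and 2 keep their values.
print(left['A'].tolist())
```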

In [266]:
# left joining two DataFrames using the update method
df_left = pd.DataFrame({'A': [1, 2, 3],
                        'B': [400, 500, 600]})
df_right = pd.DataFrame({'B': [4, 5, 6],
                         'C': [7, 8, 9]})
df_left

Unnamed: 0,A,B
0,1,400
1,2,500
2,3,600


In [267]:
df_right

Unnamed: 0,B,C
0,4,7
1,5,8
2,6,9


In [268]:
df_left.update(df_right)
df_left

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6


In [269]:
# NaN values in other are ignored: the original values are kept
df_right = pd.DataFrame({'B': [-1, np.nan, -3]})
df_left.update(df_right)
df_left

Unnamed: 0,A,B
0,1,-1.0
1,2,5.0
2,3,-3.0


### Append Rows to DataFrame
"concatenate along axis=0, namely the index"

```DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=None)```

"Append rows of *other* to the end of caller, returning a new object.

Columns in *other* that are not in the caller are added as new columns."

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html and https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#concatenating-using-append
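
Note that ```DataFrame.append``` was deprecated in pandas 1.4 and removed in pandas 2.0. On newer versions the same result can be obtained with ```pd.concat```. A minimal sketch, assuming a hypothetical frame named `demo`:

```python
import pandas as pd

# Hypothetical frame for illustration only.
demo = pd.DataFrame({'one': [0, 0], 'two': [5.0, 20.0]}, index=['a', 'b'])

# Equivalent of demo.append(demo[0:1]): returns a new object.
appended = pd.concat([demo, demo[0:1]])

# ignore_index=True discards the old labels and renumbers the rows.
appended_reset = pd.concat([demo, demo[0:1]], ignore_index=True)

print(list(appended.index), list(appended_reset.index))
```

As with ```append```, the original frame is left unchanged.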

In [270]:
df

Unnamed: 0,one,two,three
a,0,5.0,5
b,0,20.0,1
c,0,3.0,1
d,0,1.0,1


In [274]:
# append rows along axis=0 (with an outer join)
df.append(df[0:1])

Unnamed: 0,one,two,three
a,0,5.0,5
b,0,20.0,1
c,0,3.0,1
d,0,1.0,1
a,0,5.0,5


### Inserting Columns

In [275]:
# insert column with a scalar value (will be propagated for filling the entire column)
df['foo'] = 'bar'
df

Unnamed: 0,one,two,three,foo
a,0,5.0,5,bar
b,0,20.0,1,bar
c,0,3.0,1,bar
d,0,1.0,1,bar


In [276]:
# insert a column using a boolean statement
df['flag'] = df['two'] > 3
df

Unnamed: 0,one,two,three,foo,flag
a,0,5.0,5,bar,True
b,0,20.0,1,bar,True
c,0,3.0,1,bar,False
d,0,1.0,1,bar,False


In [277]:
# inserting a truncated Series (will be conformed to the index of the DataFrame)
df['one_trunc'] = df['one'][:2]
df

Unnamed: 0,one,two,three,foo,flag,one_trunc
a,0,5.0,5,bar,True,0.0
b,0,20.0,1,bar,True,0.0
c,0,3.0,1,bar,False,
d,0,1.0,1,bar,False,


In [278]:
# adding a new column with the result of the multiplication of two columns
# DataFrames are dict-like
# https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#column-selection-addition-deletion
df['four'] = df['two'] * df['three']
df

Unnamed: 0,one,two,three,foo,flag,one_trunc,four
a,0,5.0,5,bar,True,0.0,25.0
b,0,20.0,1,bar,True,0.0,20.0
c,0,3.0,1,bar,False,,3.0
d,0,1.0,1,bar,False,,1.0


### Concat DataFrames
Gives the full flexibility of joining DataFrames.

```pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None,  levels=None, names=None, verify_integrity=False, copy=True)```

"Concatenate pandas objects along a particular axis with optional set logic along the other axes.

Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.

* **objs**: *a sequence or mapping of Series, DataFrame, or Panel objects*. If a dict is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised

Returns **concatenated**: *object, type of objs*. When concatenating all Series along the index (axis=0), a Series is returned. When objs contains at least one DataFrame, a DataFrame is returned. When concatenating along the columns (axis=1), a DataFrame is returned."

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat and 
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#concatenating-objects
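
The "layer of hierarchical indexing" mentioned in the quote can be sketched with the ```keys``` argument. The frames `top` and `bottom` are hypothetical and deliberately share the same row labels.

```python
import pandas as pd

# Hypothetical frames with overlapping labels, for illustration only.
top = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
bottom = pd.DataFrame({'x': [3, 4]}, index=['a', 'b'])

# keys adds an outer index level, so the duplicated labels stay distinguishable.
stacked = pd.concat([top, bottom], keys=['top', 'bottom'])
print(stacked.loc['bottom'])

# With axis=1 the pieces are placed side by side and the keys label the columns.
side_by_side = pd.concat([top, bottom], axis=1, keys=['top', 'bottom'])
print(side_by_side)
```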


In [279]:
df

Unnamed: 0,one,two,three,foo,flag,one_trunc,four
a,0,5.0,5,bar,True,0.0,25.0
b,0,20.0,1,bar,True,0.0,20.0
c,0,3.0,1,bar,False,,3.0
d,0,1.0,1,bar,False,,1.0


In [280]:
# recap: slice a DataFrame
df[2:]

Unnamed: 0,one,two,three,foo,flag,one_trunc,four
c,0,3.0,1,bar,False,,3.0
d,0,1.0,1,bar,False,,1.0


In [281]:
# slice DataFrame into pieces
pieces = [df[2:], df[1:2], df[:1]]

# concat pieces of DataFrame
pd.concat(pieces)

Unnamed: 0,one,two,three,foo,flag,one_trunc,four
c,0,3.0,1,bar,False,,3.0
d,0,1.0,1,bar,False,,1.0
b,0,20.0,1,bar,True,0.0,20.0
a,0,5.0,5,bar,True,0.0,25.0


### Join DataFrames
Joining DataFrames works similarly to SQL joins and uses the ```merge()``` function:

```pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)```

"Merge DataFrame or named Series objects with a database-style join.

The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on."

Returns a DataFrame "of the two merged objects".

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html and
https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#join and https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#merging-join
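
The ```how``` and ```indicator``` parameters from the signature above control which keys survive the join and where each row came from. A minimal sketch with two hypothetical frames `left` and `right` that share only one key:

```python
import pandas as pd

# Hypothetical frames for illustration only.
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

# Default inner join: only keys present in both frames are kept.
inner = pd.merge(left, right, on='key')

# Outer join keeps all keys; indicator=True adds a '_merge' column
# with the values 'both', 'left_only', or 'right_only'.
outer = pd.merge(left, right, on='key', how='outer', indicator=True)

print(inner)
print(outer)
```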

In [282]:
# create a DataFrame
df_left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
df_left

Unnamed: 0,key,lval
0,foo,1
1,foo,2


In [283]:
# create another DataFrame
df_right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
df_right

Unnamed: 0,key,rval
0,foo,4
1,foo,5


In [284]:
# join the DataFrames on the key values
pd.merge(df_left, df_right, on='key')

Unnamed: 0,key,lval,rval
0,foo,1,4
1,foo,1,5
2,foo,2,4
3,foo,2,5


### Deleting Columns

In [285]:
df

Unnamed: 0,one,two,three,foo,flag,one_trunc,four
a,0,5.0,5,bar,True,0.0,25.0
b,0,20.0,1,bar,True,0.0,20.0
c,0,3.0,1,bar,False,,3.0
d,0,1.0,1,bar,False,,1.0


In [286]:
# delete column using the del statement
del df['one_trunc']
df

Unnamed: 0,one,two,three,foo,flag,four
a,0,5.0,5,bar,True,25.0
b,0,20.0,1,bar,True,20.0
c,0,3.0,1,bar,False,3.0
d,0,1.0,1,bar,False,1.0


In [287]:
# delete column using the pop method
four = df.pop('four')
four

a    25.0
b    20.0
c     3.0
d     1.0
Name: four, dtype: float64

In [288]:
df

Unnamed: 0,one,two,three,foo,flag
a,0,5.0,5,bar,True
b,0,20.0,1,bar,True
c,0,3.0,1,bar,False
d,0,1.0,1,bar,False


### Delete Rows

In [289]:
# if you want to delete the first or last x rows, you can use iloc or loc to slice the rows
df.iloc[2:]

Unnamed: 0,one,two,three,foo,flag
c,0,3.0,1,bar,False
d,0,1.0,1,bar,False


In [290]:
# delete rows that do not fulfill a requirement using the [] statement
df[df.two > 3]

Unnamed: 0,one,two,three,foo,flag
a,0,5.0,5,bar,True
b,0,20.0,1,bar,True


### Delete Rows and/or Columns with the ```drop``` method

```DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')```

"Drop specified labels from rows or columns.

Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.

**Parameters:**
* **labels**: *single label or list-like.* Index or column labels to drop.
* **axis**: *{0 or ‘index’, 1 or ‘columns’}, default 0*. Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
* **index, columns**: *single label or list-like.* Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
* **level**: *int or level name, optional.* For MultiIndex, level from which the labels will be removed.
* **inplace**: *bool, default False.* If True, do operation inplace and return None.
* **errors**: *{‘ignore’, ‘raise’}, default ‘raise’.* If ‘ignore’, suppress error and only existing labels are dropped.

**Returns:** **dropped**: *pandas.DataFrame*"
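
The ```inplace``` and ```errors``` parameters change what ```drop``` returns and how it handles missing labels. A minimal sketch, assuming a hypothetical frame named `demo`:

```python
import pandas as pd

# Hypothetical frame for illustration only.
demo = pd.DataFrame({'one': [0, 0, 0], 'two': [5.0, 20.0, 3.0]},
                    index=['a', 'b', 'c'])

# Default: a new DataFrame is returned and demo keeps all three rows.
result = demo.drop(index='b')

# inplace=True modifies demo itself and returns None.
returned = demo.drop(index='c', inplace=True)

# errors='ignore' silently skips labels that do not exist.
unchanged = demo.drop(index='missing', errors='ignore')

print(list(result.index), returned, list(demo.index))
```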

In [291]:
df

Unnamed: 0,one,two,three,foo,flag
a,0,5.0,5,bar,True
b,0,20.0,1,bar,True
c,0,3.0,1,bar,False
d,0,1.0,1,bar,False


In [292]:
# drop single row
df.drop('b')

Unnamed: 0,one,two,three,foo,flag
a,0,5.0,5,bar,True
c,0,3.0,1,bar,False
d,0,1.0,1,bar,False


In [293]:
# drop single row using 'index' argument
df.drop(index='d')

Unnamed: 0,one,two,three,foo,flag
a,0,5.0,5,bar,True
b,0,20.0,1,bar,True
c,0,3.0,1,bar,False


In [294]:
# drop multiple rows
df.drop(['b', 'c'])

Unnamed: 0,one,two,three,foo,flag
a,0,5.0,5,bar,True
d,0,1.0,1,bar,False


In [295]:
# drop single column
df.drop('two', axis=1)

Unnamed: 0,one,three,foo,flag
a,0,5,bar,True
b,0,1,bar,True
c,0,1,bar,False
d,0,1,bar,False


In [296]:
# drop single column using the 'columns' argument
df.drop(columns='two')

Unnamed: 0,one,three,foo,flag
a,0,5,bar,True
b,0,1,bar,True
c,0,1,bar,False
d,0,1,bar,False


In [297]:
# drop multiple columns
df.drop(columns=['one', 'two'])

Unnamed: 0,three,foo,flag
a,5,bar,True
b,1,bar,True
c,1,bar,False
d,1,bar,False


In [298]:
# drop rows and columns
df.drop(index='d', columns='foo')

Unnamed: 0,one,two,three,flag
a,0,5.0,5,True
b,0,20.0,1,True
c,0,3.0,1,False


### Rename rows or columns

```DataFrame.rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None)```

"Alter axes labels.

Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.

**Parameters:**
* **mapper, index, columns:** *dict-like or function, optional*. dict-like or functions transformations to apply to that axis’ values. Use either ```mapper``` and ```axis``` to specify the axis to target with ```mapper```, or ```index``` and ```columns```.
* **axis:** *int or str, optional*. Axis to target with ```mapper```. Can be either the axis name (‘index’, ‘columns’) or number (0, 1). The default is ‘index’.
* **copy:** *boolean, default ```True```*. Also copy underlying data
* **inplace:** *boolean, default ```False```*. Whether to return a new DataFrame. If ```True``` then value of copy is ignored.
* **level:** *int or level name, default ```None```*. In case of a ```MultiIndex```, only rename labels in the specified level.

**Returns: renamed:** DataFrame.

```DataFrame.rename``` supports two calling conventions
* ```(index=index_mapper, columns=columns_mapper, ...)```
* ```(mapper, axis={'index', 'columns'}, ...)```

We *highly* recommend using keyword arguments to clarify your intent."

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html#pandas.DataFrame.rename

In [299]:
df

Unnamed: 0,one,two,three,foo,flag
a,0,5.0,5,bar,True
b,0,20.0,1,bar,True
c,0,3.0,1,bar,False
d,0,1.0,1,bar,False


In [300]:
# rename columns and make index uppercase using the index and columns mappers
df.rename(index=str.upper, columns={"one": 1, "two": 2, "three": 3, "foo": "F"})

Unnamed: 0,1,2,3,F,flag
A,0,5.0,5,bar,True
B,0,20.0,1,bar,True
C,0,3.0,1,bar,False
D,0,1.0,1,bar,False


In [304]:
# rename columns using the mapper function and the targeted axis
df.rename(str.upper, axis=1)

Unnamed: 0,ONE,TWO,THREE,FOO,FLAG
a,0,5.0,5,bar,True
b,0,20.0,1,bar,True
c,0,3.0,1,bar,False
d,0,1.0,1,bar,False


## Boolean Indexing
Use "boolean vectors to filter the data" and combine multiple conditional expressions with the logical operators ```&``` for **and**, ```|``` for **or**, and ```~``` for **not** (the expressions "**must** be grouped using parentheses").

https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#boolean-indexing and http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing 
and https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#indexing-selection
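
The three operators can be sketched on a small hypothetical frame named `demo`. Each comparison is wrapped in parentheses because ```&```, ```|```, and ```~``` bind more tightly than ```>``` or ```<```.

```python
import pandas as pd

# Hypothetical frame for illustration only.
demo = pd.DataFrame({'two': [5.0, 20.0, 3.0, 1.0],
                     'flag': [True, True, False, False]},
                    index=['a', 'b', 'c', 'd'])

either = demo[(demo['two'] > 10) | (demo['two'] < 2)]   # or: rows b and d
not_flagged = demo[~demo['flag']]                       # not: rows c and d
both = demo[(demo['two'] >= 3) & ~demo['flag']]         # and: row c

print(list(either.index), list(not_flagged.index), list(both.index))
```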



In [305]:
df

Unnamed: 0,one,two,three,foo,flag
a,0,5.0,5,bar,True
b,0,20.0,1,bar,True
c,0,3.0,1,bar,False
d,0,1.0,1,bar,False


In [306]:
# select data using the data of a column
df[df.two > 3]

Unnamed: 0,one,two,three,foo,flag
a,0,5.0,5,bar,True
b,0,20.0,1,bar,True


In [307]:
# use the boolean values in the flag column to select rows
df[df['flag']]

Unnamed: 0,one,two,three,foo,flag
a,0,5.0,5,bar,True
b,0,20.0,1,bar,True


In [308]:
# use a list of booleans to access rows
df[[False, True, True, False]]

Unnamed: 0,one,two,three,foo,flag
b,0,20.0,1,bar,True
c,0,3.0,1,bar,False


In [309]:
# selecting rows that match a more complex criterion
criterion = df['two'].map(lambda x: x >= 3)
df[criterion]

Unnamed: 0,one,two,three,foo,flag
a,0,5.0,5,bar,True
b,0,20.0,1,bar,True
c,0,3.0,1,bar,False


In [310]:
# multiple criteria
df[criterion & (df['flag'] == False)]

Unnamed: 0,one,two,three,foo,flag
c,0,3.0,1,bar,False


# pandas Summary
* Select rows using the ```df.loc[label]``` and ```df.iloc[loc]``` methods and slice rows using the indices ```df[i:]```.
* Select columns by indexing with ```df['colname']```, using the column name directly as an attribute ```df.colname```, or using the ```df.get(colname)``` method.
* Use ```df.loc[]``` when possible!
* For modifying a single value, use the ```df.at[]``` or ```df.iat[]```  attribute and for multiple values, use the ```df.loc[]``` or ```df.iloc[]``` attributes.
* For modifying a specific value across multiple rows and columns, use the ```df.replace()``` method.
* For deleting rows/columns use the ```df.drop(index=i_labels, columns=c_labels)``` method.
* It's crucial to know which methods return a new DataFrame and which modify the original DataFrame.

# pandas Exercises

## 1. Slicing Rows of a DataFrame
* Why do the following methods return a different result?

In [311]:
# create the DataFrame
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
df

Unnamed: 0,0,1,2
0,1,2,3
1,4,5,6
2,7,8,9


In [312]:
# slice DataFrame rows
df[1:2]

Unnamed: 0,0,1,2
1,4,5,6


In [317]:
# select rows of DataFrame using the label-based .loc method
df.loc[1:2]

Unnamed: 0,0,1,2
1,4,5,6
2,7,8,9


* What do you expect ```df.iloc[1:2]``` to return?

In [None]:
#df.iloc[1:2]

## 2. Swapping Columns
* Why does assignment with ```[]``` swap the columns while assignment with ```.loc``` does not?
* What would be the correct way to swap column values using the ```.loc``` method?

In [318]:
# creating a DataFrame with a date as index and two Columns A and B with random values
np.random.seed(23)

dates = pd.date_range('1/1/2000', periods=8)

df = pd.DataFrame(np.random.randn(8, 2),
                  index=dates, columns=['A', 'B'])
df

Unnamed: 0,A,B
2000-01-01,0.666988,0.025813
2000-01-02,-0.777619,0.948634
2000-01-03,0.701672,-1.051082
2000-01-04,-0.367548,-1.13746
2000-01-05,-1.322148,1.772258
2000-01-06,-0.347459,0.67014
2000-01-07,0.322272,0.060343
2000-01-08,-1.04345,-1.009942


In [319]:
# swapping column values of column A and B
df[['B', 'A']] = df[['A', 'B']]
df

Unnamed: 0,A,B
2000-01-01,0.025813,0.666988
2000-01-02,0.948634,-0.777619
2000-01-03,-1.051082,0.701672
2000-01-04,-1.13746,-0.367548
2000-01-05,1.772258,-1.322148
2000-01-06,0.67014,-0.347459
2000-01-07,0.060343,0.322272
2000-01-08,-1.009942,-1.04345


In [330]:
# swapping columns using the .loc method does not work. why?
df.loc[:, ['B', 'A']] = df[['A', 'B']]
df

Unnamed: 0,A,B
2000-01-01,0.666988,0.025813
2000-01-02,-0.777619,0.948634
2000-01-03,0.701672,-1.051082
2000-01-04,-0.367548,-1.13746
2000-01-05,-1.322148,1.772258
2000-01-06,-0.347459,0.67014
2000-01-07,0.322272,0.060343
2000-01-08,-1.04345,-1.009942


## 3. Setting values
* Why could the following method raise a warning? (Which warning is it? Note that the warning might not always be raised.)
* What happens if you assign a new value?
* How should you rewrite the statement to correctly set a value?
* Are there other statements to correctly set a value? (name them)

In [331]:
# example from
# http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#evaluation-order-matters
dfb = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                    'c': np.arange(7)})

# setting a value?
dfb['c'][dfb.a.str.startswith('o')] = 42

dfb

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys


Unnamed: 0,a,c
0,one,42
1,one,42
2,two,2
3,three,3
4,two,4
5,one,42
6,six,6


## 4. Edit original or return new DataFrame?
* Which of the following methods or attributes are modifying the original DataFrame and which are returning a new DataFrame?
  * ```append```
  * ```at``` and ```iat```
  * ```loc``` and ```iloc```
  * ```replace```
  * ```T``` (transpose)
  * ```update```

# pandas Exercise Answers
1. When slicing a DataFrame by label with the ```.loc``` method (e.g. ```df.loc[1:2]```), **both** the start and stop labels are included in the result. This is contrary to usual Python slices, which ```df[1:2]``` and ```df.iloc[1:2]``` follow (the stop position is excluded). See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html for the documentation.
2. pandas aligns all axes when setting a Series or DataFrame through ```.loc``` and ```.iloc```: **column alignment happens before value assignment**. In ```df.loc[:, ['B', 'A']] = df[['A', 'B']]``` the right-hand side is therefore matched to the target by column label, so each column is assigned its own values and nothing changes. See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#basics for the documentation. The correct way to swap columns using the ```.loc``` method is to assign the raw values (a NumPy array carries no column labels, so no alignment takes place): ```df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()```
3. The statement ```dfb['c'][dfb.a.str.startswith('o')] = 42``` can raise the ```SettingWithCopyWarning``` because it uses chained indexing: ```dfb['c']``` may return either a view or a copy of the data, and the value is then assigned to that intermediate object. The assignment may reach the original DataFrame (as it did in the output above), but there is no guarantee, since it could just as well edit a copy. A correct way to write the assignment is ```dfb.loc[dfb.a.str.startswith('o'), 'c'] = 42```. You can also use the ```.iloc``` method or a loop with the ```.at``` or ```.iat``` methods (other ways are possible as well).
4. Editing the original or returning a new DataFrame:
   * Editing the original DataFrame:
     * ```at``` and ```iat``` attributes
     * ```loc``` and ```iloc``` attributes
     * ```update``` method
   * Returning a new DataFrame:
     * ```append``` method
     * ```replace``` method
     * ```T``` (transpose) attribute (calls the ```transpose()``` method)
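
Answers 1 to 3 can be checked with a short sketch. The names `grid` and `swap` are hypothetical stand-ins for the exercise frames, and `dfb` is rebuilt from the exercise so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Answer 1: .loc includes the stop label, [] and .iloc exclude the stop position.
grid = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
row_counts = (len(grid[1:2]), len(grid.loc[1:2]), len(grid.iloc[1:2]))
print(row_counts)

# Answer 2: assigning raw values skips column alignment, so the swap works.
swap = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})
swap.loc[:, ['B', 'A']] = swap[['A', 'B']].to_numpy()
print(swap)

# Answer 3: a single .loc call sets the values on the original DataFrame.
dfb = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                    'c': np.arange(7)})
dfb.loc[dfb['a'].str.startswith('o'), 'c'] = 42
print(dfb['c'].tolist())
```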


