# Preparing, manipulating and visualizing data in Python
This notebook contains an introduction to using NumPy, Pandas and Matplotlib for machine learning purposes.

## Imports
Let's begin by importing the modules we need.

In [1]:
import time
import numpy as np
import pandas as pd

## NumPy
This module primarily concerns creating arrays of various dimensions and performing calculations and other operations on them in an efficient manner. There is also a submodule for linear algebra algorithms, as well as some simple statistical functions such as mean, median and sum.

### Timing some functions
We can create a simple decorator function that measures how long other functions take to run. We apply it by prepending a function definition with `@timer`, which is really just syntactic sugar for `our_function = timer(our_function)`, so that every call to `our_function` goes through the timing wrapper.

In [2]:
def timer(func):
    def do_timing(*args, **kwargs):
        start = time.time()
        func_ret = func(*args, **kwargs)
        end = time.time()
        print("{} took {:.3f}s to run".format(func.__name__, end-start))
        return func_ret
    return do_timing
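To make the "syntactic sugar" point concrete, here is a minimal sketch that wraps a function by hand and then with the `@` syntax. The `timer` decorator is repeated so the snippet runs on its own; `add` and `add_decorated` are hypothetical example functions, not part of the notebook.

```python
import time

def timer(func):
    # Same decorator as in the cell above, repeated so this sketch is self-contained.
    def do_timing(*args, **kwargs):
        start = time.time()
        func_ret = func(*args, **kwargs)
        end = time.time()
        print("{} took {:.3f}s to run".format(func.__name__, end - start))
        return func_ret
    return do_timing

def add(a, b):  # hypothetical example function
    return a + b

# Decorating by hand: wrap the function once, then call the wrapper.
timed_add = timer(add)
result_manual = timed_add(1, 2)

# The @ syntax performs the same wrapping when the function is defined.
@timer
def add_decorated(a, b):
    return a + b

result_sugar = add_decorated(1, 2)
```

Both calls print a timing line and return the wrapped function's result unchanged.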

Using the timer decorator we just created, we can examine how efficient NumPy really is compared to vanilla Python.

In [3]:
@timer
def sum_trad(upper):
    X = range(upper)
    Y = range(upper)
    Z = []
    for i in range(len(X)):
        Z.append(X[i] + Y[i])

@timer
def sum_compr(upper):
    X = range(upper)
    Y = range(upper)
    Z = [x + y for (x, y) in zip(X, Y)]

@timer
def sum_np(upper):
    X = np.arange(upper)
    Y = np.arange(upper)
    Z = X + Y

In [29]:
upper = 10000000
sum_trad(upper)
sum_compr(upper)
sum_np(upper)

sum_trad took 1.631s to run
sum_compr took 0.592s to run
sum_np took 0.048s to run
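A single wall-clock measurement like the one above can be noisy. As a rough alternative (an addition, not something the notebook itself uses), the standard library's `timeit` module can repeat a measurement several times. The function names and the smaller `upper` value below are assumptions chosen to keep the sketch quick.

```python
import timeit
import numpy as np

upper = 100_000  # assumed smaller size so that repeated runs stay fast

def sum_compr_plain():
    X = range(upper)
    Y = range(upper)
    return [x + y for (x, y) in zip(X, Y)]

def sum_np_plain():
    X = np.arange(upper)
    Y = np.arange(upper)
    return X + Y

# Each measurement runs the function 10 times; we take the best of 3 measurements.
t_compr = min(timeit.repeat(sum_compr_plain, number=10, repeat=3))
t_np = min(timeit.repeat(sum_np_plain, number=10, repeat=3))
print(f"comprehension: {t_compr:.4f}s, NumPy: {t_np:.4f}s (10 runs each)")
```

The exact numbers depend on the machine, so no expected output is shown here.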


### Creating arrays in NumPy
Unlike Python lists, NumPy arrays hold elements of a single, specified type. While a Python list can happily store strings and numbers together, a NumPy array converts all of its elements to one common type.

In [5]:
arr = np.array([1, 2, 3, 4], float)

print(arr)
print(type(arr))

[1. 2. 3. 4.]
<class 'numpy.ndarray'>
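To see the single-type rule in action, the following small sketch inspects the `dtype` attribute of a few arrays. The variable names are illustrative only.

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = np.array([1, 2.5, 3])     # one float is enough to make every element a float
mixed = np.array([1, "two", 3.0])  # numbers and strings are all converted to strings

print(ints.dtype)    # an integer type (the exact width depends on the platform)
print(floats.dtype)  # float64
print(mixed.dtype)   # a Unicode string type
print(mixed)
```

Passing a type explicitly, as in `np.array([1, 2, 3, 4], float)` above, simply chooses the common type up front.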


These arrays can be quite simply transformed into normal lists.

In [6]:
# arr_list = list(arr)
arr_list = arr.tolist()

print(arr_list)
print(type(arr_list))

[1.0, 2.0, 3.0, 4.0]
<class 'list'>


Assigning an array to another variable does not create a new array, but rather creates a new reference to the same object in memory.

In [7]:
arr1 = np.array([1, 2, 3, 4])
arr2 = arr1

arr2[0] = 0

print(arr1)
print(arr2)

[0 2 3 4]
[0 2 3 4]


To create a new copy of an array, we have to use the `copy` method.

In [8]:
arr1 = np.array([1, 2, 3, 4])
arr2 = arr1.copy()

arr2[0] = 0

print(arr1)
print(arr2)

[1 2 3 4]
[0 2 3 4]


NumPy also provides a few convenience functions and methods that allow us to easily generate certain kinds of arrays and matrices. Some of these include:
- Filling an array with one given value
- Generating arrays with random data
- Generating identity matrices
- Generating arrays or matrices with all ones or all zeros
- Combining arrays vertically to create a kind of row matrix

In [9]:
print("Filling an array with one given value")
arr = np.array([1, 2, 3, 4], float)
arr.fill(1)
print(arr)

print("\nGenerating arrays with random data")
print(np.random.permutation(4))
print(np.random.normal(0, 1, 4))
print(np.random.random(4))

print("\nGenerating identity matrices")
print(np.identity(4))
print(np.eye(3, 4, 1))

print("\nGenerating arrays or matrices with all ones or all zeros")
print(np.zeros([2, 3]))
print(np.ones(4))

print("\nCombining arrays vertically to create a kind of row matrix")
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
print(np.vstack([arr1, arr2]))

Filling an array with one given value
[1. 1. 1. 1.]

Generating arrays with random data
[3 0 1 2]
[-0.61208123  0.37838094 -0.2240822   1.03692881]
[0.21595169 0.0851869  0.99490586 0.90059817]

Generating identity matrices
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
[[0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]

Generating arrays or matrices with all ones or all zeros
[[0. 0. 0.]
 [0. 0. 0.]]
[1. 1. 1. 1.]

Combining arrays vertically to create a kind of row matrix
[[1 2 3]
 [4 5 6]]


### Manipulating arrays
Getting to the core of data science (except not quite, because we still haven't gotten to Pandas yet).

Some noteworthy manipulations we can perform include:
- Slicing
- Sorting and arg sorting
- Shuffling
- Testing for equality

In [13]:
arr = np.array([2., 6., 5., 6.])

print("Slicing")
print(f"[1:3] {arr[1:3]}")
print(f"[1:] {arr[1:]}")
print(f"[:3] {arr[:3]}")
print(f"[:-1] {arr[:-1]}")
print(f"[::-1] (step backwards from end to start) {arr[::-1]}")

print("\nSorting and arg sorting")
print(f"Sort: {np.sort(arr)}")
print(f"Arg sort (indices that would sort the array): {np.argsort(arr)}")

print("\nShuffle")
np.random.shuffle(arr)
print(arr)

print("\nTesting for equality")
print(f"arr == [1., 2., 3.]: {np.array_equal(arr, np.array([1., 2., 3.]))}")

Slicing
[1:3] [6. 5.]
[1:] [6. 5. 6.]
[:3] [2. 6. 5.]
[:-1] [2. 6. 5.]
[::-1] (step backwards from end to start) [6. 5. 6. 2.]

Sorting and arg sorting
Sort: [2. 5. 6. 6.]
Arg sort (indices that would sort the array): [0 2 1 3]

Shuffle
[6. 6. 2. 5.]

Testing for equality
arr == [1., 2., 3.]: False
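As a small follow-up sketch, the indices returned by `np.argsort` can be used to index the original array, which gives the same result as `np.sort`. It also shows that the `==` operator compares element by element, whereas `np.array_equal` returns a single boolean. The comparison array is made up for illustration.

```python
import numpy as np

arr = np.array([2., 6., 5., 6.])

# Indexing with the argsort result reorders the array into sorted order.
order = np.argsort(arr)
sorted_via_index = arr[order]
print(sorted_via_index)  # [2. 5. 6. 6.]

# array_equal collapses the comparison into one True/False value ...
same = np.array_equal(sorted_via_index, np.sort(arr))
print(same)  # True

# ... while == yields one boolean per element.
elementwise = arr == np.array([2., 0., 5., 0.])
print(elementwise)  # [ True False  True False]
```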


### Matrices
We can also create matrices by nesting several lists into one larger list when using `np.array()`. These are later indexed using two indices as follows: `matrix[row, col]`. We can also use slicing on matrices, using `:` to represent the entire row or column.

In [28]:
matrix = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print("Matrix:")
print(matrix)

print("\nUse slice indexing to get an entire row or column: ")
print(matrix[0, :])

print("\nFlatten a matrix back to a one-dimensional array:")
print(matrix.flatten())

Matrix:
[[1 2 3 4]
 [5 6 7 8]]

Use slice indexing to get an entire row or column: 
[1 2 3 4]

Flatten a matrix back to a one-dimensional array:
[1 2 3 4 5 6 7 8]
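The cell above only selects a row. For completeness, here is a short sketch of selecting a single element, a whole column and a block of columns from the same matrix.

```python
import numpy as np

matrix = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

element = matrix[1, 2]   # row 1, column 2
column = matrix[:, 1]    # every row, column 1
block = matrix[:, 1:3]   # every row, columns 1 and 2

print(element)  # 7
print(column)   # [2 6]
print(block)
```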


Some useful operations on matrices (and also arrays) include:
- Getting the shape
- Reshaping
- Getting the transpose
- Concatenation

In [56]:
matrix = np.array([[1, 2, 3], [4, 5, 6]])

print(f"Shape of the matrix: {matrix.shape}")

print("\nReshaping a matrix")
print(matrix.reshape((6,1)))

print("\nGetting the transpose")
print(matrix.transpose())

print("\nConcatenating matrices (this works for arrays as well)")

arr1 = np.array([[11, 12], [13, 14]])
arr2 = np.array([[21, 22], [23, 24]])

print(np.concatenate((arr1, arr2), axis=0))
print(np.concatenate((arr1, arr2), axis=1))

Shape of the matrix: (2, 3)

Reshaping a matrix
[[1]
 [2]
 [3]
 [4]
 [5]
 [6]]

Getting the transpose
[[1 4]
 [2 5]
 [3 6]]

Concatenating matrices (this works for arrays as well)
[[11 12]
 [13 14]
 [21 22]
 [23 24]]
[[11 12 21 22]
 [13 14 23 24]]


### Arithmetic operations on NumPy arrays
Common operations such as addition, subtraction, multiplication and division are all supported in an **element-wise** manner with NumPy arrays. This means adding two arrays will yield a new array containing the sums of the corresponding elements of the two original arrays:

`[1, 2, 3] + [4, 5, 6] = [5, 7, 9]`

**Note:** this applies to matrices as well, which means that `*` multiplies two matrices element by element rather than performing matrix multiplication.
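The difference can be seen in a minimal sketch with two small, made-up matrices: `*` multiplies element by element, while the `@` operator (or `np.dot`) performs matrix multiplication.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

elementwise = A * B  # each element multiplied with its counterpart
matmul = A @ B       # rows of A combined with columns of B

print(elementwise)
# [[ 5 12]
#  [21 32]]
print(matmul)
# [[19 22]
#  [43 50]]
```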

If the arrays are not the same size, the smaller one can be *"broadcast"* onto the larger one. This essentially means the smaller one is replicated to fill the size of the larger one. The axis of this replication can be explicitly stated using slices and `np.newaxis`.

In [61]:
print("Broadcasting example")
arr1 = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
arr2 = np.array([10, 11])

print(arr1 + arr2)

Broadcasting example
[[11 13]
 [13 15]
 [15 17]
 [17 19]]
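The example above replicates `arr2` along the rows. As a sketch of the `np.newaxis` variant mentioned earlier, the following adds one value per row instead. The array `col` is made up for illustration.

```python
import numpy as np

arr1 = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])  # shape (4, 2)
col = np.array([100, 200, 300, 400])               # shape (4,)

# Shapes (4, 2) and (4,) do not line up, so broadcasting fails here.
try:
    arr1 + col
    failed = False
except ValueError:
    failed = True

# Adding a new axis turns col into shape (4, 1), which is replicated across columns.
result = arr1 + col[:, np.newaxis]
print(result)
# [[101 102]
#  [203 204]
#  [305 306]
#  [407 408]]
```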

### Boolean masking and indexing
NumPy arrays can be indexed by boolean masks: arrays of `True` and `False` values that determine which elements to pick and which to ignore. We can use this to "query" our arrays. We can compose queries of several boolean expressions using functions like `np.logical_and`.

We can also index into arrays using lists or arrays of integer indices. For example, `arr[[1, 0, 0, 2, 1]]` returns a new array with the 2nd, 1st, 1st, 3rd and 2nd elements of `arr`.

In [67]:
matrix = np.array([[1, 2], [3, 4]])

print("Example of creating a mask by querying")
print(matrix > 2)

print("\nThe mask can then be used to index the array")
print(matrix[matrix > 2])

print("\nCreating a mask with several queries")
print(np.logical_and(matrix > 2, matrix < 4))

Example of creating a mask by querying
[[False False]
 [ True  True]]

The mask can then be used to index the array
[3 4]

Creating a mask with several queries
[[False False]
 [ True False]]
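Integer-list indexing is described above but not run, so here is a short sketch of it using a made-up array. It also shows the `&` operator as an alternative way of combining masks; note the parentheses around each comparison.

```python
import numpy as np

arr = np.array([10, 20, 30])
picked = arr[[1, 0, 0, 2, 1]]  # 2nd, 1st, 1st, 3rd and 2nd elements
print(picked)  # [20 10 10 30 20]

matrix = np.array([[1, 2], [3, 4]])
mask = (matrix > 2) & (matrix < 4)  # same result as np.logical_and
print(matrix[mask])  # [3]
```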


Two more useful functions to keep in mind here are `take` and `put`, which allow us to index into arrays and to modify the content at specific indices of an array, respectively. Notice that `put` repeats the inserted values when there are more indices than values. Try replacing the list of indices with something like `range(len(arr1))` and see what happens!

In [79]:
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([10, 20, 30])

print("Example of take")
print(arr1.take([0, 0, 1, 2]))

print("\nExample of put")
arr1.put([0, 2, 4], arr2)
print(arr1)

Example of take
[1 1 2 3]

Example of put
[10  2 20  4 30]
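For reference, this is roughly what the suggested experiment looks like: with five indices and only three values, `put` cycles through the values again.

```python
import numpy as np

arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([10, 20, 30])

# Five indices but only three values: the values are repeated as needed.
arr1.put(list(range(len(arr1))), arr2)
print(arr1)  # [10 20 30 10 20]
```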


### Linear algebra operations
We've already seen the transpose in action, but NumPy actually supports even more linear algebra operations (thankfully, this means we don't have to implement them ourselves). Some of these include:
- Dot products
- Inner and outer products
- Cross products

In NumPy's `linalg` submodule we find even more goodies:
- Determinants
- Inverse matrices
- Eigenvalues and eigenvectors

In [107]:
X = np.arange(9).reshape((3, 3))

print("Example using dot product with the transpose")
print(np.dot(X, X.T))

print("\nSome vector operations")
vec1 = np.array([1, 2, 3])
vec2 = np.array([2, 3, 4])

print(f"Inner product: {np.inner(vec1, vec2)}")
print(f"Dot product: {np.dot(vec1, vec2)}")
print("Outer product:")
print(np.outer(vec1, vec2))
print("Cross product:")
print(np.cross(vec1, vec2))

print("\nSome functions from the linalg module")
Y = np.array([[74, 22, 10], [92, 31, 17], [21, 22, 12]], float)

print(f"Determinant: {np.linalg.det(Y)}")
print("Inverse matrix:")
print(np.linalg.inv(Y))

vals, vecs = np.linalg.eig(Y)
print(f"Eigenvalues: {vals}")
print("Eigenvectors:")
print(vecs)


Example using dot product with the transpose
[[  5  14  23]
 [ 14  50  86]
 [ 23  86 149]]

Some vector operations
Inner product: 20
Dot product: 20
Outer product:
[[ 2  3  4]
 [ 4  6  8]
 [ 6  9 12]]
Cross product:
[-1  2 -1]

Some functions from the linalg module
Determinant: -2852.000000000003
Inverse matrix:
[[ 0.00070126  0.01542777 -0.02244039]
 [ 0.26192146 -0.23772791  0.11851332]
 [-0.48141655  0.4088359  -0.09467041]]
Eigenvalues: [107.99587441  11.33411853  -2.32999294]
Eigenvectors:
[[-0.57891525 -0.21517959  0.06319955]
 [-0.75804695  0.17632618 -0.58635713]
 [-0.30036971  0.96052424  0.80758352]]
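A quick way to sanity-check these results is to plug them back into their definitions. The sketch below reuses the matrix `Y` and compares with `np.allclose`, because floating-point results are rarely exactly equal.

```python
import numpy as np

Y = np.array([[74, 22, 10], [92, 31, 17], [21, 22, 12]], float)

# A matrix times its inverse should give the identity matrix.
Y_inv = np.linalg.inv(Y)
inverse_ok = np.allclose(Y @ Y_inv, np.eye(3))

# Each column of vecs is an eigenvector v with Y @ v = lambda * v.
vals, vecs = np.linalg.eig(Y)
eigen_ok = np.allclose(Y @ vecs, vecs * vals)

print(inverse_ok, eigen_ok)  # True True
```

Multiplying `vecs * vals` relies on broadcasting: each column of `vecs` is scaled by its own eigenvalue.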


### Statistics
NumPy also provides us with a few convenient mathematical functions useful for statistics such as the mean, median and sum of an array.

In [119]:
arr = np.random.rand(8, 4)

print(f"Sum: {np.sum(arr)}")
print(f"Mean: {np.mean(arr)}")
print(f"Median: {np.median(arr)}")
print(f"Max: {np.max(arr)}")
print(f"Argmax (index): {np.argmax(arr)}")
print(f"Min: {np.min(arr)}")
print(f"Argmin (index): {np.argmin(arr)}")

Sum: 15.485175087972028
Mean: 0.48391172149912587
Median: 0.5110197995161547
Max: 0.9665984052061142
Argmax (index): 11
Min: 0.01620373415308829
Argmin (index): 20
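Two details are worth a short sketch here. Most of these functions accept an `axis` argument to work per column or per row, and `argmax`/`argmin` return an index into the flattened array, which `np.unravel_index` can convert back to a row and column. The small fixed array below is made up so the results are predictable.

```python
import numpy as np

arr = np.array([[1., 5., 2.],
                [7., 3., 4.]])

col_sums = np.sum(arr, axis=0)   # one sum per column
row_means = np.mean(arr, axis=1) # one mean per row
print(col_sums)  # [8. 8. 6.]

flat_index = np.argmax(arr)  # position in the flattened array
row, col = np.unravel_index(flat_index, arr.shape)
print(flat_index, row, col)  # 3 1 0
```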


## Pandas
Building on the NumPy library, Pandas gives us a higher-level approach to manipulating data in the form of the `DataFrame`. This data structure behaves quite a bit like your typical spreadsheet, containing rows of data with fields represented as columns. Pandas lets us create, import, export, manipulate and perform calculations on these DataFrames with ease.