# NumPy Basics

NumPy, short for Numerical Python, is one of the essential foundational packages for numerical computing in Python. One of NumPy's primary uses in data analysis applications is as a container for data to be passed between algorithms and libraries. NumPy stores and manipulates numerical data more efficiently than the built-in Python data structures because it is designed for large arrays of data. The main reasons are:

+ NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. It operates on this memory without per-element type checking or other overhead. NumPy arrays also use much less memory than built-in Python sequences.
+ NumPy operations perform complex computations on entire arrays without the need for Python `for` loops.

The example below illustrates the performance difference between the same computation on a NumPy array and on a Python list:

In [14]:
import numpy as np

my_arr = np.arange(1_000_000)    # numpy array
my_list = list(range(1_000_000)) # Python list

# compute and display processing time for comparison
%time for _ in range(10): my_arr2 = my_arr*2
# time < 20 ms
    
%time for _ in range(10): my_list2 = [x * 2 for x in my_list]
# time 400+ ms

CPU times: user 12.5 ms, sys: 0 ns, total: 12.5 ms
Wall time: 12.5 ms
CPU times: user 355 ms, sys: 86.1 ms, total: 441 ms
Wall time: 441 ms
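
The memory claim made earlier can be checked in a similar way. The following is a minimal sketch rather than a rigorous benchmark: it assumes that summing `sys.getsizeof` over the list and its integer objects is a reasonable estimate of the list's footprint, and the exact byte counts depend on the platform and Python version.

```python
import sys
import numpy as np

n = 1_000_000
my_arr = np.arange(n)
my_list = list(range(n))

# Bytes used by the array's data buffer: one fixed-size item per element
arr_bytes = my_arr.nbytes

# A list stores pointers to separate int objects, so count the pointer
# table plus the objects it refers to (an approximation)
list_bytes = sys.getsizeof(my_list) + sum(sys.getsizeof(x) for x in my_list)

print(my_arr.dtype, my_arr.itemsize)   # element type and bytes per element
print(arr_bytes, list_bytes)           # exact numbers vary by platform
print(my_arr.flags['C_CONTIGUOUS'])    # True: the data sits in one contiguous block
```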


## The NumPy *ndarray*

A key feature of NumPy is its N-dimensional array object, or *ndarray*: a fast, flexible container for large datasets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using syntax similar to the equivalent operations between scalar elements. Below are some examples of computations on NumPy arrays:

In [3]:
data = np.random.randn(2,3) # generate 2x3 array of random numbers
print(data)
print("")

# multiply all elements by 10 in the array
print(data * 10)
print("")

# add each array element to itself, i.e. double the elements
print(data + data)

[[ 0.70205072  1.33375788  0.38221368]
 [ 1.55293436 -0.86618872  0.77683595]]

[[ 7.02050718 13.33757884  3.82213677]
 [15.52934365 -8.66188718  7.76835946]]

[[ 1.40410144  2.66751577  0.76442735]
 [ 3.10586873 -1.73237744  1.55367189]]
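
To make the "whole block" idea concrete, the sketch below compares the vectorized expressions with an explicit element-by-element loop. The small fixed array and the name `scaled_loop` are chosen here for illustration only, so that the result is reproducible.

```python
import numpy as np

data = np.array([[1.5, -2.0, 3.0],
                 [0.0,  4.5, -1.0]])

# Vectorized: one expression is applied to every element
scaled = data * 10
doubled = data + data

# Equivalent explicit loop over rows and columns
scaled_loop = np.empty_like(data)
for i in range(data.shape[0]):
    for j in range(data.shape[1]):
        scaled_loop[i, j] = data[i, j] * 10

print(np.array_equal(scaled, scaled_loop))  # True
print(np.array_equal(doubled, data * 2))    # True
```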


### Creating *ndarrays*

Use the `array` function from the `numpy` module to create an array from a sequence-like object such as a list. For example:

In [6]:
data1 = [6, 7.5, 8, 0, 1]  # a list of numbers
print(data1)

arr1 = np.array(data1) # convert a list to numpy array
print(arr1)
print("")

data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]  # a nested list of equal-length lists of numbers
print(data2)

arr2 = np.array(data2) # convert nested lists to numpy array
print(arr2)

[6, 7.5, 8, 0, 1]
[6.  7.5 8.  0.  1. ]

[[1, 2, 3, 4], [5, 6, 7, 8]]
[[1 2 3 4]
 [5 6 7 8]]


Inspecting dimensions of arrays:

In [7]:
# checking dimensions of ndarray
print(arr2.ndim)  # number of dimensions (axes): 2 for an array with rows and columns

print(arr2.shape) # get n rows and m columns

2
(2, 4)
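
The relationship between `ndim`, `shape` and the total element count is easier to see on a three-dimensional array. A small sketch, with `arr3` as a name made up for this example:

```python
import numpy as np

arr3 = np.arange(24).reshape((2, 3, 4))

print(arr3.ndim)   # 3, the number of axes
print(arr3.shape)  # (2, 3, 4), the length of each axis
print(arr3.size)   # 24, the product of the axis lengths
print(len(arr3))   # 2, the length of the first axis only
```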


Unless a data type is given explicitly, `np.array` attempts to infer an appropriate one for the array it creates:

In [8]:
print(arr1.dtype)  # float64
print(arr2.dtype)  # int64

float64
int64
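
The sketch below shows a few more cases of this inference. It checks the dtype `kind` for integers and strings rather than an exact name, on the assumption that the default integer width and string length may differ between platforms and inputs.

```python
import numpy as np

ints = np.array([1, 2, 3])        # all integers
mixed = np.array([1, 2.5, 3])     # one float promotes the whole array
flags = np.array([True, False])   # booleans
words = np.array(['a', 'bc'])     # Python strings

print(ints.dtype.kind)   # 'i' (signed integer; width is platform dependent)
print(mixed.dtype)       # float64
print(flags.dtype)       # bool
print(words.dtype.kind)  # 'U' (fixed-length Unicode)
```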


In addition to `np.array`, there are other functions for creating new arrays. For example, `zeros` and `ones` create arrays of 0s and 1s, respectively, with a given length or shape. `empty` creates an array without initialising its values. To create a higher-dimensional array with these functions, pass a tuple for the shape:

In [9]:
# create array of 10 zeroes
print(np.zeros(10))
print("")

# create 3x6 array of zeroes
print(np.zeros((3, 6)))
print("")

# create empty array of 2x3x2
print(np.empty((2, 3, 2)))
print("")

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]

[[[4.65023297e-310 0.00000000e+000]
  [0.00000000e+000 0.00000000e+000]
  [0.00000000e+000 0.00000000e+000]]

 [[0.00000000e+000 0.00000000e+000]
  [0.00000000e+000 0.00000000e+000]
  [0.00000000e+000 0.00000000e+000]]]



**Note**: *`np.empty` does not set array values to zero; it may return uninitialised "garbage" values. It is marginally faster than `np.zeros`, but every element must be assigned before it is read, so it should be used with caution.*
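
A few related creation functions, sketched under the assumption that a pre-filled array is usually the safer choice and that `np.empty` is reserved for buffers that are fully overwritten straight away. The variable names are illustrative.

```python
import numpy as np

ones = np.ones((2, 3))        # 2x3 array of 1.0
sevens = np.full((2, 2), 7)   # 2x2 array filled with the value 7
like = np.zeros_like(ones)    # zeros with the same shape and dtype as `ones`

# np.empty only allocates memory; assign every element before reading it
buf = np.empty((2, 3))
buf[:] = ones * 2
print(buf)
```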

### Data Types for *ndarrays*

`dtypes` are a source of NumPy's flexibility for interacting with data coming from other systems. A dtype tells the ndarray how to interpret a chunk of memory as a particular type of data. We can set the `dtype` when creating arrays, as below:

In [10]:
arr1 = np.array([1, 2, 3], dtype=np.float64)  # generate array to store float values
arr2 = np.array([1, 2, 3], dtype=np.int32)    # generate array to store integer values

print(arr1.dtype)
print(arr2.dtype)

float64
int32


In most cases, *dtypes* map directly onto an underlying disk or memory representation, which makes it easy to read and write binary streams of data to disk and to connect to code written in a low-level language like C or Fortran. A standard double-precision floating-point value takes up 8 bytes, or 64 bits. Thus, this type is known in NumPy as `float64`. Below is a listing of NumPy's supported data types:

**Type** | **Type code** | **Description**
--- | --- | ---
int8, uint8 | i1, u1 | Signed and unsigned 8-bit (1 byte) integer types
int16, uint16 | i2, u2 | Signed and unsigned 16-bit integer types
int32, uint32 | i4, u4 | Signed and unsigned 32-bit integer types
int64, uint64 | i8, u8 | Signed and unsigned 64-bit integer types
float16 | f2 | Half-precision floating point
float32 | f4 or f | Standard single-precision floating point; compatible with C float
float64 | f8 or d | Standard double-precision floating point; compatible with C double and Python float object
float128 | f16 or g | Extended-precision floating point (not available on every platform)
complex64 | c8 | Complex numbers represented by two 32-bit floats
complex128 | c16 | Complex numbers represented by two 64-bit floats
complex256 | c32 | Complex numbers represented by two 128-bit floats
bool | ? | Boolean type storing `True` and `False` values
object | O | Python object type; a value can be any Python object
string_ | S | Fixed-length ASCII type (1 byte per character); named `bytes_` in NumPy 2.0 and later
unicode_ | U | Fixed-length Unicode type (number of bytes platform specific); named `str_` in NumPy 2.0 and later
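
The type codes in the table can be passed as strings wherever a dtype is expected. A short sketch that checks a few of them against the named types:

```python
import numpy as np

# Type codes and named types refer to the same dtypes
print(np.dtype('i4') == np.int32)        # True
print(np.dtype('f8') == np.float64)      # True
print(np.dtype('c16') == np.complex128)  # True

# itemsize is the number of bytes per element
print(np.dtype('u1').itemsize)  # 1
print(np.dtype('f8').itemsize)  # 8

# A type code can be used directly when creating an array
small = np.array([1, 2, 3], dtype='u2')
print(small.dtype, small.nbytes)  # uint16 6
```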


You can explicitly convert or cast an array from one dtype to another using the `astype` method:

In [11]:
arr = np.array([1, 2, 3, 4, 5])
arr.dtype   # dtype('int64')

float_arr = arr.astype(np.float64)  # cast array to float64 type

float_arr.dtype  # dtype('float64')


# casting from float to integer truncates the decimal part of the values
arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
arr.dtype  # dtype('float64')

arr.astype(np.int32)  # array([3, -1, -2, 0, 12, 10])

array([ 3, -1, -2,  0, 12, 10], dtype=int32)

**Note**: *By default, calling `astype` creates a new array (a copy of the data), even if the new dtype is the same as the old one.*
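
Two further uses of `astype`, as a hedged sketch: converting an array of numeric strings to floats, and confirming the copy behaviour with `np.shares_memory`. The sample values are made up for illustration.

```python
import numpy as np

# Strings that represent numbers can be cast to a numeric dtype
numeric_strings = np.array(['1.25', '-9.6', '42'])
as_float = numeric_strings.astype(np.float64)
print(as_float)

# Float-to-integer casts truncate toward zero
floats = np.array([3.7, -2.6])
print(floats.astype(np.int32))  # [ 3 -2]

# Casting to the same dtype still returns a separate copy by default
same_type_copy = floats.astype(np.float64)
print(np.shares_memory(floats, same_type_copy))  # False
```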

## Basic Indexing and Slicing

Like Python lists, NumPy arrays also support indexing and slicing. For example:

In [12]:
arr = np.arange(10) # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

print(arr[5])  # retrieve the 6th element

print(arr[5:8])  # retrieve the 6th to 8th elements of the array

arr[5:8] = 12  # assign 12 to every element in the slice

print(arr[5:8])

5
[5 6 7]
[12 12 12]


Accessing elements of a two-dimensional array:

In [13]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr2d[2])  # [7, 8, 9]

print(arr2d[0, 2]) # 3

[7 8 9]
3


**Note:** Unlike slices of a Python list, array slices are *views* on the original array. Thus any modification made to the slice is reflected in the source array.

In [21]:
arr_slice = arr[5:8]  # obtain a view of the 6th to 8th elements of the array
arr_slice[1] = 12345  # modify the 2nd item in the slice

print(arr_slice)

# changes to the slice are reflected in the original array
print(arr)


# the bare slice [:] assigns to every value in the slice
arr_slice[:] = 12345

print(arr)

[   12 12345    12]
[    0     1     2     3     4    12 12345    12     8     9]
[    0     1     2     3     4 12345 12345 12345     8     9]
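
The difference from list slicing can be checked directly. This is a small sketch using the `base` attribute and `np.shares_memory` to tell views from copies; the variable names are chosen for illustration.

```python
import numpy as np

# Slicing a Python list copies the selected items
py_list = list(range(10))
list_slice = py_list[5:8]
list_slice[0] = 99
print(py_list[5])  # 5, the original list is unchanged

# Slicing a NumPy array returns a view onto the same memory
arr = np.arange(10)
arr_view = arr[5:8]
arr_copy = arr[5:8].copy()

print(arr_view.base is arr)             # True, the view refers back to arr
print(np.shares_memory(arr, arr_copy))  # False, the copy has its own data

arr_copy[0] = 99
print(arr[5])  # 5, writing to the copy leaves arr alone
arr_view[0] = 99
print(arr[5])  # 99, writing to the view changes arr
```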


To avoid modifying the original array while working on array slices, call the `copy()` method on the slice:

In [30]:
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

old_values = arr3d[0].copy()  # copy out the slice instead of getting just the "view"

arr3d[0] = 42  # set every value in the first 2x3 block (index 0 on the first axis) to 42

print(arr3d)
print("\n")

arr3d[0] = old_values  # reset back to old values

print(arr3d)

[[[42 42 42]
  [42 42 42]]

 [[ 7  8  9]
  [10 11 12]]]


[[[ 1  2  3]
  [ 4  5  6]]

 [[ 7  8  9]
  [10 11 12]]]


Illustration of two-dimensional array slicing:

<img src="https://www.oreilly.com/library/view/python-for-data/9781449323592/httpatomoreillycomsourceoreillyimages2172114.png" width="300" />

### Boolean Indexing

Besides integer indices and slices, we can also select data from an array with an array of boolean values, as in the examples below:

In [28]:
# generate NumPy array from a list of names
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

# generate a random data array with one row per name (len(names) is 7)
data = np.random.randn(7, 4)
print(data)
print("\n")

# the element-wise comparison returns a boolean array with the same shape as names
names == 'Bob'
# array([ True, False, False,  True, False, False, False])


# select the rows of data where the condition is True
print(data[names == 'Bob'])
print("\n")

# select the rows where the condition is False by negating it with ~
print(data[~(names == 'Bob')])
print("")

# does the same as the previous one
print(data[names != 'Bob'])

[[ 0.70488258 -0.46513964  0.36526903  0.82231441]
 [ 0.66292426  0.13961025 -1.71985617  0.01384986]
 [ 1.35260495 -0.84656874  1.03487997 -1.0964815 ]
 [-0.51368383  0.45329443 -0.77310113 -1.22828866]
 [ 0.19652006 -1.92183668 -0.27064452  1.16711278]
 [ 0.70684217  0.19542478 -0.32417086  0.02631058]
 [ 1.33816171 -1.53155546  0.59946467  0.25472763]]


[[ 0.70488258 -0.46513964  0.36526903  0.82231441]
 [-0.51368383  0.45329443 -0.77310113 -1.22828866]]


[[ 0.66292426  0.13961025 -1.71985617  0.01384986]
 [ 1.35260495 -0.84656874  1.03487997 -1.0964815 ]
 [ 0.19652006 -1.92183668 -0.27064452  1.16711278]
 [ 0.70684217  0.19542478 -0.32417086  0.02631058]
 [ 1.33816171 -1.53155546  0.59946467  0.25472763]]

[[ 0.66292426  0.13961025 -1.71985617  0.01384986]
 [ 1.35260495 -0.84656874  1.03487997 -1.0964815 ]
 [ 0.19652006 -1.92183668 -0.27064452  1.16711278]
 [ 0.70684217  0.19542478 -0.32417086  0.02631058]
 [ 1.33816171 -1.53155546  0.59946467  0.25472763]]
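
A boolean mask can also be combined with a slice or an integer on the other axis. The sketch below uses `np.arange` data instead of random numbers, an assumption made only so the selected values are predictable; the mask must have the same length as the axis it indexes.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.arange(28).reshape((7, 4))  # row i holds 4*i .. 4*i + 3

is_bob = names == 'Bob'
print(is_bob.sum())  # 2, the number of True entries

bob_rows = data[is_bob]          # rows 0 and 3
bob_tail = data[is_bob, 2:]      # same rows, columns 2 onward
bob_last = data[is_bob, 3]       # same rows, last column only
print(bob_rows)
print(bob_tail)
print(bob_last)
```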


Combining multiple boolean conditions:

In [29]:
mask = (names == 'Bob') | (names == 'Will')  # True where the name is either 'Bob' or 'Will' (1-d boolean array)
# array([ True, False,  True,  True,  True, False, False])

# return the rows of data at the positions where mask is True
print(data[mask])
print("\n")

[[ 0.70488258 -0.46513964  0.36526903  0.82231441]
 [ 1.35260495 -0.84656874  1.03487997 -1.0964815 ]
 [-0.51368383  0.45329443 -0.77310113 -1.22828866]
 [ 0.19652006 -1.92183668 -0.27064452  1.16711278]]
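
When combining conditions, the element-wise operators `&`, `|` and `~` are needed; the Python keywords `and` and `or` raise an error on boolean arrays. A short sketch, which also uses `np.isin` as one possible alternative for membership tests (the `raised` flag is just for illustration):

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

either = (names == 'Bob') | (names == 'Will')
same = np.isin(names, ['Bob', 'Will'])  # membership test, same result
print(np.array_equal(either, same))     # True

neither = ~either                       # element-wise negation
print(names[neither])                   # only the 'Joe' entries remain

# The keyword `or` asks for a single truth value and fails on arrays
raised = False
try:
    (names == 'Bob') or (names == 'Will')
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```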



In [31]:
# setting values less than zero in the array to zero
data[data < 0] = 0
print(data)
print("\n")

# set every value in the rows whose name is not 'Joe' to 7
data[names != 'Joe'] = 7
print(data)

[[0.70488258 0.         0.36526903 0.82231441]
 [0.66292426 0.13961025 0.         0.01384986]
 [1.35260495 0.         1.03487997 0.        ]
 [0.         0.45329443 0.         0.        ]
 [0.19652006 0.         0.         1.16711278]
 [0.70684217 0.19542478 0.         0.02631058]
 [1.33816171 0.         0.59946467 0.25472763]]


[[7.         7.         7.         7.        ]
 [0.66292426 0.13961025 0.         0.01384986]
 [7.         7.         7.         7.        ]
 [7.         7.         7.         7.        ]
 [7.         7.         7.         7.        ]
 [0.70684217 0.19542478 0.         0.02631058]
 [1.33816171 0.         0.59946467 0.25472763]]
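
Boolean assignment changes the array in place. If the original values are still needed, one option (a sketch, not the only approach) is `np.where`, which builds a new array from a condition and leaves its input untouched:

```python
import numpy as np

data = np.array([[ 0.5, -1.2],
                 [-0.3,  2.0]])

# Take 0 where the condition holds, otherwise the original value
clipped = np.where(data < 0, 0, data)

print(clipped)  # negatives replaced by zero
print(data)     # the original array is unchanged
```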


### Fancy Indexing

Fancy Indexing refers to indexing using integer arrays:

In [55]:
arr = np.arange(32).reshape((8, 4))   # generate series of 0-31 numbers and shape them into 8x4 array
print(arr)
print("\n")

# pass the list [4, 3, 0, 6] as a single index
# get 5th, 4th, 1st and 7th rows of arr
print(arr[[4, 3, 0, 6]])
print("\n")


# contrast with the code below (one fewer pair of square brackets), which returns the item in the 5th row, 4th column
print(arr[4, 3])
print("\n")

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]
 [20 21 22 23]
 [24 25 26 27]
 [28 29 30 31]]


[[16 17 18 19]
 [12 13 14 15]
 [ 0  1  2  3]
 [24 25 26 27]]


19




In the 2nd example above, we pass `[4, 3, 0, 6]` as a single index, which returns a new array comprising the 5th, 4th, 1st and 7th rows in that order (recall that indices start from zero in Python, not 1). In the 3rd example, we pass `4` as the first index and `3` as the second to retrieve the element in the 5th row, 4th column of the array.

Also note that fancy indexing *always copies* the data into a new array, unlike slicing.
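
The copy behaviour can be confirmed with `np.shares_memory`. The sketch below also uses negative indices, which select rows counting from the end; the name `picked` is illustrative.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))

# Negative indices count from the end: rows 5, 3 and 1
picked = arr[[-3, -5, -7]]
print(picked)

print(np.shares_memory(arr, picked))    # False, fancy indexing copied the data
print(np.shares_memory(arr, arr[2:5]))  # True, a plain slice is a view
```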

More complex indexing examples below:

In [56]:
# row and column indices are paired up: retrieve elements (4, 0), (3, 3), (0, 1) and (6, 2)
print(arr[[4, 3, 0, 6], [0, 3, 1, 2]])
print("\n")

# retrieve the 4th column from rows 4, 3, 0 and 6
print(arr[[4, 3, 0, 6], 3])
print("\n")

# modifying the values in array returned by fancy indexing doesn't impact the original values
x = arr[[4, 3, 0, 6], 3]
x[0] = 1
print(arr)

[16 15  1 26]


[19 15  3 27]


[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]
 [20 21 22 23]
 [24 25 26 27]
 [28 29 30 31]]


In [49]:
x = arr[[4, 3, 0, 6], 3]
x[0] = 1
x

array([ 1, 15,  3, 27])

In [50]:
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])
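
Two points that the examples above leave implicit, sketched here with illustrative index lists: passing two index lists selects element pairs rather than a rectangular block, and assigning through a fancy index (as opposed to modifying the returned copy) does write to the original array. `np.ix_` is shown as one way to get the rectangular selection.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))

# Paired indices: elements (1, 0), (5, 3), (7, 1), (2, 2)
pairs = arr[[1, 5, 7, 2], [0, 3, 1, 2]]
print(pairs)  # values 4, 23, 29, 10

# Rectangular block: pick the rows first, then reorder the columns
block = arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]
print(block)

# np.ix_ builds the same rectangular selection in one step
block_ix = arr[np.ix_([1, 5, 7, 2], [0, 3, 1, 2])]
print(np.array_equal(block, block_ix))  # True

# Assignment through a fancy index modifies the original array
arr[[4, 3, 0, 6], 3] = -1
print(arr[:, 3])  # rows 0, 3, 4 and 6 now hold -1 in the last column
```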

## Transposing Array, Swapping Axes

Transposing an array flips it over its diagonal, so that rows and columns are swapped. Transposing returns a *view* of the data rather than a copy, so modifying the transposed array also modifies the original data. Arrays have the `transpose` method and the special `T` attribute for this:

In [78]:
# transpose matrix/array
arr = np.random.randn(6, 3)
print(arr)
print("\n---transposed form---")
print(arr.T)

# this does the same
print("\n---another way to transpose---")
print(arr.transpose(1, 0))   # list the axes in their new order: axis 1 first, then axis 0

[[-0.49701991  0.26270355  0.20002807]
 [ 1.03656432  0.81495315 -0.56119859]
 [-0.42753169  0.01924542  0.67068316]
 [ 0.98083051 -0.61227188  0.89249475]
 [ 0.93769243 -0.84431899 -0.07931656]
 [ 0.40612198 -0.60846418  0.18548966]]

---transposed form---
[[-0.49701991  1.03656432 -0.42753169  0.98083051  0.93769243  0.40612198]
 [ 0.26270355  0.81495315  0.01924542 -0.61227188 -0.84431899 -0.60846418]
 [ 0.20002807 -0.56119859  0.67068316  0.89249475 -0.07931656  0.18548966]]

---another way to transpose---
[[-0.49701991  1.03656432 -0.42753169  0.98083051  0.93769243  0.40612198]
 [ 0.26270355  0.81495315  0.01924542 -0.61227188 -0.84431899 -0.60846418]
 [ 0.20002807 -0.56119859  0.67068316  0.89249475 -0.07931656  0.18548966]]
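
The view behaviour of the transpose can be demonstrated on a small array with fixed values. A minimal sketch; `t` and `independent` are names introduced for this example.

```python
import numpy as np

arr = np.arange(6).reshape((2, 3))
t = arr.T

print(t.shape)                   # (3, 2)
print(np.shares_memory(arr, t))  # True, no data was copied

# Writing through the transpose changes the original array
t[2, 0] = 100
print(arr[0, 2])                 # 100

# Take a copy when an independent transposed array is needed
independent = arr.T.copy()
independent[0, 0] = -1
print(arr[0, 0])                 # 0, unchanged
```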


You can compute matrix products using `np.dot`, for example the inner product of an array's transpose with the array itself:

In [62]:
# computation using np.dot
np.dot(arr.T, arr)

array([[ 5.00754243, -0.29651983,  1.03924436],
       [-0.29651983,  3.23207076, -2.88173777],
       [ 1.03924436, -2.88173777, 10.65693623]])
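
For a result that can be verified by hand, the sketch below repeats the computation on a small fixed array and compares `np.dot` with the `@` operator, which performs the same matrix multiplication for two-dimensional arrays.

```python
import numpy as np

arr = np.arange(6, dtype=np.float64).reshape((3, 2))  # [[0, 1], [2, 3], [4, 5]]

gram = np.dot(arr.T, arr)  # (2x3) times (3x2) gives a 2x2 result
print(gram)                # [[20, 26], [26, 35]] as floats

print(np.array_equal(gram, arr.T @ arr))  # True
print(np.array_equal(gram, gram.T))       # True, the product is symmetric
```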

### More complex array swap/transpose

For higher-dimensional arrays, the `transpose` method accepts a tuple of axis numbers that specifies how to permute the axes:

In [99]:
arr = np.arange(16).reshape((2, 2, 4))  # reshape into a 3-dimensional array
print(arr)

print("\n---1st axis, 1st item---")
print(arr[0])

print("\n---2nd axis, 1st item---")
print(arr[:,0])

print("\n---3rd axis, 1st item---")
print(arr[:,:,0])
print("\n")

# reorder axis: swap 2nd axis with the 1st
print(arr.transpose((1, 0, 2)))

# this does the same thing
arr.swapaxes(1, 0)

[[[ 0  1  2  3]
  [ 4  5  6  7]]

 [[ 8  9 10 11]
  [12 13 14 15]]]

---1st axis, 1st item---
[[0 1 2 3]
 [4 5 6 7]]

---2nd axis, 1st item---
[[ 0  1  2  3]
 [ 8  9 10 11]]

---3rd axis, 1st item---
[[ 0  4]
 [ 8 12]]


[[[ 0  1  2  3]
  [ 8  9 10 11]]

 [[ 4  5  6  7]
  [12 13 14 15]]]


array([[[ 0,  1,  2,  3],
        [ 8,  9, 10, 11]],

       [[ 4,  5,  6,  7],
        [12, 13, 14, 15]]])

In the above example, the `swapaxes` method takes a pair of axis numbers and switches the indicated axes to rearrange the data:
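
A sketch of the same idea on an array whose axes have different lengths, an assumption made here so that the effect on the shape is visible. It also checks that swapping the first two axes with `swapaxes` matches `transpose((1, 0, 2))`, and that the result shares memory with the original.

```python
import numpy as np

arr = np.arange(24).reshape((2, 3, 4))

t = arr.transpose((1, 0, 2))  # new axis order: old axis 1, old axis 0, old axis 2
s = arr.swapaxes(0, 1)        # exchange axes 0 and 1

print(arr.shape, t.shape)          # (2, 3, 4) (3, 2, 4)
print(np.array_equal(t, s))        # True
print(t[2, 1, 3] == arr[1, 2, 3])  # True, the first two indices trade places
print(np.shares_memory(arr, s))    # True, no data was copied
```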