# Module 2: Introduction to Numpy and Pandas

The following tutorial contains **examples of using the numpy and pandas library modules**. Read the step-by-step instructions below carefully. To execute the code, click on the cell and press the `SHIFT-ENTER` keys simultaneously.

## 2.1 Introduction to Numpy

Numpy, short for Numerical Python, is a Python library for numerical computation. Its basic data structure is a multi-dimensional array object called `ndarray`, and it provides a suite of functions that efficiently manipulate the elements of an ndarray. For more details, visit <https://numpy.org/doc/stable/user/basics.html>. If you prefer Korean, visit <http://aikorea.org/cs231n/python-numpy-tutorial/#numpy-arrays>, a translation of a tutorial from Stanford.

To use the `numpy` package, import it first.

In [1]:
import numpy as np

By convention, almost all Python users import `numpy` under the alias `np`.

### 2.1.1 Creating ndarray

- An ndarray can be created from a list or a tuple object as shown in the examples below.
- A Numpy `ndarray`, which stands for N-dimensional array object, is similar to a Python `list`. However, a list may hold elements of different types, whereas all elements of a Numpy ndarray must have the same type.
- Both 1-dimensional and multi-dimensional arrays can be created from lists as well as tuples.
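
As a minimal sketch of the same-type rule described above, the snippet below builds an array from a list that mixes integers and floats and checks the resulting `dtype`. The variable names are illustrative only.

```python
import numpy as np

# A Python list can hold values of different types side by side.
mixed_list = [1, 2.5, 3]
print([type(v).__name__ for v in mixed_list])   # ['int', 'float', 'int']

# np.array converts every element to one common type (here float64).
mixed_arr = np.array(mixed_list)
print(mixed_arr, mixed_arr.dtype)

# Nested tuples work the same way as nested lists.
from_tuple = np.array(((1, 2), (3, 4)))
print(from_tuple.shape)                         # (2, 2)
```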


In [2]:
oneDim = np.array([1,2,3,4,5])   # a 1-dimensional array (vector)
print(oneDim)
print("Object type =", type(oneDim))
print("Dimension =", oneDim.shape)
print("Array type =", oneDim.dtype, '\n')

twoDim = np.array([[1,2],[3,4],[5,6],[7,8]])  # a two-dimensional array (matrix)
print(twoDim)
print("Object type =", type(twoDim))
print("Dimension =", twoDim.shape)
print("Array type =", twoDim.dtype, '\n')

[1 2 3 4 5]
Object type = <class 'numpy.ndarray'>
Dimension = (5,)
Array type = int64 

[[1 2]
 [3 4]
 [5 6]
 [7 8]]
Object type = <class 'numpy.ndarray'>
Dimension = (4, 2)
Array type = int64 



In [3]:
# You can define the type of array values, e.g., float, int, ...
a = np.array([1,2,3,4], float)
print(a)
print(type(a[0]))
print(a.dtype)

b = np.array([1,2,3,4], int)
print(b)
print(type(b[0]))
print(b.dtype)

[1. 2. 3. 4.]
<class 'numpy.float64'>
float64
[1 2 3 4]
<class 'numpy.int64'>
int64


Numpy also provides built-in functions for creating ndarrays.

In [4]:
print('A 2 x 3 matrix of zeros')
print(np.zeros((2,3)))        # a matrix of zeros

print('\nA 3 x 2 matrix of ones')
print(np.ones((3,2)))         # a matrix of ones

print('\nA 3 x 3 identity matrix')
print(np.eye(3))              # a 3 x 3 identity matrix

A 2 x 3 matrix of zeros
[[0. 0. 0.]
 [0. 0. 0.]]

A 3 x 2 matrix of ones
[[1. 1.]
 [1. 1.]
 [1. 1.]]

A 3 x 3 identity matrix
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]


In [5]:
print('Arrays of integers created with np.arange')
print(np.arange(10))    # similar to range(10), but returns ndarray instead of list
print(np.arange(-10,10))    # similar to range(-10,10), but returns ndarray instead of list
print(np.arange(-10,10,2))    # similar to range(-10,10,2), but returns ndarray instead of list

print('\n2-dimensional array of integers from 0 to 11')
print(np.arange(12).reshape(3,4))  # reshape to a matrix

Arrays of integers created with np.arange
[0 1 2 3 4 5 6 7 8 9]
[-10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7
   8   9]
[-10  -8  -6  -4  -2   0   2   4   6   8]

2-dimensional array of integers from 0 to 11
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]


In [6]:
print('Array of random numbers from a uniform distribution')
print(np.random.rand(5))      # random numbers from a uniform distribution over [0, 1)

print('\nArray of random numbers from a normal distribution')
print(np.random.randn(5))     # random numbers from a normal distribution

print('\nArray of values between 0 and 1, split into 10 equally spaced values')
print(np.linspace(0,1,10))    # split interval [0,1] into 10 equally separated values

print('\nArray of values from 10^-3 to 10^3')
print(np.logspace(-3,3,7))    # create ndarray with values from 10^-3 to 10^3

Array of random numbers from a uniform distribution
[8.28416466e-02 5.68213501e-04 9.08608674e-01 4.84450052e-01
 4.12946848e-01]

Array of random numbers from a normal distribution
[-0.36508366  1.1539366   0.54829392 -1.33776362 -0.48516697]

Array of values between 0 and 1, split into 10 equally spaced values
[0.         0.11111111 0.22222222 0.33333333 0.44444444 0.55555556
 0.66666667 0.77777778 0.88888889 1.        ]

Array of values from 10^-3 to 10^3
[1.e-03 1.e-02 1.e-01 1.e+00 1.e+01 1.e+02 1.e+03]
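
The `linspace` output above steps by 0.111 rather than 0.1 because both endpoints are included, so 10 points leave 9 gaps. The short sketch below checks this and relates `logspace` to `linspace`; the variable names are illustrative.

```python
import numpy as np

pts = np.linspace(0, 1, 10)
print(pts[1] - pts[0])        # spacing is 1/9, about 0.111

# endpoint=False drops the right endpoint, giving a spacing of 1/10.
open_pts = np.linspace(0, 1, 10, endpoint=False)
print(open_pts)

# logspace(a, b, n) is 10 raised to the values of linspace(a, b, n).
same = np.allclose(np.logspace(-3, 3, 7), 10.0 ** np.linspace(-3, 3, 7))
print(same)                   # True
```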


You can reshape arrays.

In [7]:
a = np.arange(12)
print(a.reshape(3, 4)) # total number of elements cannot change
print(a.reshape(6, -1)) # use -1 to infer axis shape

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]]
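
A small sketch of the two reshape rules noted in the comments above: `-1` can stand in for any one axis, and a shape with a different total number of elements is rejected.

```python
import numpy as np

a = np.arange(12)

# -1 may appear in any one position; NumPy infers that axis from the total size.
print(a.reshape(-1, 4).shape)     # (3, 4)
print(a.reshape(2, 3, -1).shape)  # (2, 3, 2)

# A shape whose product is not 12 raises a ValueError.
try:
    a.reshape(5, 3)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print('ValueError:', err)
```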


You can concatenate multiple arrays into an array.

In [8]:
print('Concatenating multiple arrays')
a = np.ones((4, 1))
b = np.zeros((4, 2))
print(np.concatenate([a, b], axis=1))

print('\nCannot combine arrays of different shape')
c = np.zeros((3, 8))
print(np.concatenate([a, c], axis=1))

Concatenating multiple arrays
[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]

Cannot combine arrays of different shape


ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 4 and the array at index 1 has size 3
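
The error above arises because concatenating along `axis=1` requires the same number of rows in every array. The sketch below shows one valid combination per axis; the array `d` is a hypothetical addition that is not part of the original example.

```python
import numpy as np

a = np.ones((4, 1))
b = np.zeros((4, 2))
d = np.zeros((2, 1))   # hypothetical array with the same column count as a

side_by_side = np.concatenate([a, b], axis=1)   # row counts match (4 and 4)
stacked = np.concatenate([a, d], axis=0)        # column counts match (1 and 1)
print(side_by_side.shape)   # (4, 3)
print(stacked.shape)        # (6, 1)

# np.hstack and np.vstack are shorthands for these two cases on 2-D arrays.
print(np.array_equal(np.hstack([a, b]), side_by_side))   # True
print(np.array_equal(np.vstack([a, d]), stacked))        # True
```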

### 2.1.2 Indexing and Slicing
You can access or modify values in NumPy arrays by indexing and slicing as in lists.

In [9]:
a = np.array([1,2,3,4])
print(a)
print(a[0], a[1]) # access the first and second elements
a[0] = 5   # modify the first element to 5
print(a)

[1 2 3 4]
1 2
[5 2 3 4]


In [10]:
b = np.array([[1,2,3],[4,5,6]])
print(b)
print(b[0,0]) # the (1, 1) component of the matrix
print(b[0]) # the first row
print(b[0,:]) # the first row
print(b[:,2]) # the third column
print(b[:,-1]) # the last column, which is the third column
print(b[0][2]) # the value in the first row and third column of the array (i.e. the (1,3) component of the matrix)
print(b[0,2])  # same as above
b[0,2] = 7
print(b)

[[1 2 3]
 [4 5 6]]
1
[1 2 3]
[1 2 3]
[3 6]
[3 6]
3
3
[[1 2 7]
 [4 5 6]]


There are various ways to select a subset of elements within a numpy array. 
- Assigning a numpy array to another variable, or taking a slice of it, does not copy the values: the new variable refers to the same underlying data. 
- To make a copy of an ndarray, you need to explicitly call its `.copy()` method.

In [11]:
x = np.arange(-5,5)
print('Before: x =', x)

y = x[3:5]     # y is a slice, i.e., pointer to a subarray in x
print('        y =', y)
y[:] = 1000    # modifying the value of y will change x
print('After : y =', y)
print('        x =', x, '\n')

z = x[3:5].copy()   # makes a copy of the subarray
print('Before: x =', x)
print('        z =', z)
z[:] = 500          # modifying the value of z will not affect x
print('After : z =', z)
print('        x =', x)

Before: x = [-5 -4 -3 -2 -1  0  1  2  3  4]
        y = [-2 -1]
After : y = [1000 1000]
        x = [  -5   -4   -3 1000 1000    0    1    2    3    4] 

Before: x = [  -5   -4   -3 1000 1000    0    1    2    3    4]
        z = [1000 1000]
After : z = [500 500]
        x = [  -5   -4   -3 1000 1000    0    1    2    3    4]
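
One way to check whether two arrays refer to the same data, rather than inferring it from side effects as above, is `np.shares_memory`. This is a minimal sketch; the names `alias`, `view`, and `dup` are illustrative.

```python
import numpy as np

x = np.arange(-5, 5)
alias = x             # plain assignment: another name for the same array
view = x[3:5]         # basic slice: a view onto x's data
dup = x[3:5].copy()   # explicit copy: independent data

print(alias is x)                   # True
print(np.shares_memory(x, view))    # True
print(np.shares_memory(x, dup))     # False
print(view.base is x)               # True, the view records its parent array
```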


There are many ways to access elements of an ndarray. The following example illustrates the difference between indexing elements of a list and elements of ndarray. 

In [12]:
my2dlist = [[1,2,3,4],[5,6,7,8],[9,10,11,12]]  # a 2-dim list
print('my2dlist =', my2dlist)
print('my2dlist[2] =', my2dlist[2])            # access the third sublist
print('my2dlist[:][2] =', my2dlist[:][2])      # can't access third element of each sublist
# print('my2dlist[:,2] =', my2dlist[:,2])      # invalid way to index a list, raises TypeError

my2darr = np.array(my2dlist)
print('\nmy2darr =\n', my2darr)

print('my2darr[2][:] =', my2darr[2][:])      # access the third row
print('my2darr[2,:] =', my2darr[2,:])        # access the third row
print('my2darr[:][2] =', my2darr[:][2])      # access the third row (similar to 2d list)
print('my2darr[:,2] =', my2darr[:,2])        # access the third column
print('my2darr[:2,2:] =\n', my2darr[:2,2:])     # access the first two rows & last two columns

my2dlist = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
my2dlist[2] = [9, 10, 11, 12]
my2dlist[:][2] = [9, 10, 11, 12]

my2darr =
 [[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
my2darr[2][:] = [ 9 10 11 12]
my2darr[2,:] = [ 9 10 11 12]
my2darr[:][2] = [ 9 10 11 12]
my2darr[:,2] = [ 3  7 11]
my2darr[:2,2:] =
 [[3 4]
 [7 8]]


Numpy arrays also support boolean indexing.

In [13]:
my2darr = np.arange(1,13).reshape(3,4)
print('my2darr =\n', my2darr)

divBy3 = my2darr[my2darr % 3 == 0]
print('\nmy2darr[my2darr % 3 == 0] =', divBy3)            # returns all the elements divisible by 3 in an ndarray

divBy3LastRow = my2darr[2:, my2darr[2,:] % 3 == 0]
print('my2darr[2:, my2darr[2,:] % 3 == 0] =', divBy3LastRow)    # returns elements in the last row divisible by 3

my2darr =
 [[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]

my2darr[my2darr % 3 == 0] = [ 3  6  9 12]
my2darr[2:, my2darr[2,:] % 3 == 0] = [[ 9 12]]
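
Boolean masks can also be combined and used on the left-hand side of an assignment. The following is a short sketch under the same setup as above, with `arr` as an illustrative name.

```python
import numpy as np

arr = np.arange(1, 13).reshape(3, 4)

mask = arr % 3 == 0           # boolean array with the same shape as arr
print(mask)

# Combine conditions with & (and) and | (or); the parentheses are required.
even_and_large = arr[(arr % 2 == 0) & (arr > 6)]
print(even_and_large)         # [ 8 10 12]

# A mask on the left-hand side modifies the selected elements in place.
arr[mask] = 0
print(arr)
```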


More indexing examples.

In [14]:
my2darr = np.arange(1,13,1).reshape(4,3)
print('my2darr =\n', my2darr)

indices = [2,1,0,3]    # selected row indices
print('indices =', indices, '\n')
print('my2darr[indices,:] =\n', my2darr[indices,:])  # this will reorder the rows of my2darr according to indices

rowIndex = [0,0,1,2,3]     # row index into my2darr
print('\nrowIndex =', rowIndex)
columnIndex = [0,2,0,1,2]  # column index into my2darr
print('columnIndex =', columnIndex, '\n')
print('my2darr[rowIndex,columnIndex] =', my2darr[rowIndex,columnIndex])

my2darr =
 [[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
indices = [2, 1, 0, 3] 

my2darr[indices,:] =
 [[ 7  8  9]
 [ 4  5  6]
 [ 1  2  3]
 [10 11 12]]

rowIndex = [0, 0, 1, 2, 3]
columnIndex = [0, 2, 0, 1, 2] 

my2darr[rowIndex,columnIndex] = [ 1  3  4  8 12]
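
The paired row and column lists above select one element per `(row, column)` pair. The sketch below spells that out with an explicit loop and also shows that, unlike a basic slice, the result is a copy.

```python
import numpy as np

arr = np.arange(1, 13).reshape(4, 3)

rows = [0, 0, 1, 2, 3]
cols = [0, 2, 0, 1, 2]

# Paired indexing is equivalent to picking arr[r, c] for each (r, c) pair.
paired = arr[rows, cols]
manual = np.array([arr[r, c] for r, c in zip(rows, cols)])
print(np.array_equal(paired, manual))   # True

# Indexing with integer lists returns a copy: changing it leaves arr intact.
paired[:] = -1
print(arr[0, 0])                        # 1
```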


### 2.1.3 Element-wise Operations

You can apply standard operators such as addition and multiplication on each element of the ndarray.

In [15]:
x = np.array([1,2,3,4,5])

print('x =', x)
print('x + 1 =', x + 1)      # addition
print('x - 1 =', x - 1)      # subtraction
print('x * 2 =', x * 2)      # multiplication
print('x // 2 =', x // 2)     # floor (integer) division
print('x ** 2 =', x ** 2)     # square
print('x % 2 =', x % 2)      # modulo  
print('1 / x =', 1 / x)      # division

x = [1 2 3 4 5]
x + 1 = [2 3 4 5 6]
x - 1 = [0 1 2 3 4]
x * 2 = [ 2  4  6  8 10]
x // 2 = [0 1 1 2 2]
x ** 2 = [ 1  4  9 16 25]
x % 2 = [1 0 1 0 1]
1 / x = [1.         0.5        0.33333333 0.25       0.2       ]


In [16]:
x = np.array([2,4,6,8,10])
y = np.array([1,2,3,4,5])

print('x =', x)
print('y =', y)
print('x + y =', x + y)      # element-wise addition
print('x - y =', x - y)      # element-wise subtraction
print('x * y =', x * y)      # element-wise multiplication 
print('x / y =', x / y)      # element-wise division
print('x // y =', x // y)    # element-wise integer division 
print('x ** y =', x ** y)    # element-wise exponentiation

x = [ 2  4  6  8 10]
y = [1 2 3 4 5]
x + y = [ 3  6  9 12 15]
x - y = [1 2 3 4 5]
x * y = [ 2  8 18 32 50]
x / y = [2. 2. 2. 2. 2.]
x // y = [2 2 2 2 2]
x ** y = [     2     16    216   4096 100000]


Numpy provides many built-in mathematical functions for manipulating the elements of an ndarray.

In [17]:
print('np.maximum(x,y) =', np.maximum(x, y))        # element-wise maximum        max(x,y)
print('np.minimum(x,y) =', np.minimum(x, y))        # element-wise minimum        min(x,y)

np.maximum(x,y) = [ 2  4  6  8 10]
np.minimum(x,y) = [1 2 3 4 5]


In [18]:
y = np.array([-1.4, 0.4, -3.2, 2.5, 3.4])
print('y =', y, '\n')

print('np.abs(y) =', np.abs(y))                # convert to absolute values
print('np.floor(y) = ', np.floor(y))           # round down (floor of the input)
print('np.ceil(y) = ', np.ceil(y))             # round up (ceiling of the input)
print('np.rint(y) = ', np.rint(y))             # round to the nearest integer
print('np.sqrt(abs(y)) =', np.sqrt(abs(y)))    # apply square root to each element
print('np.sign(y) =', np.sign(y))              # get the sign of each element
print('np.exp(y) =', np.exp(y))                # apply exponentiation
print('np.sort(y) =', np.sort(y))              # sort array
# statistics
print("Min =", np.min(y))             # min 
print("Max =", np.max(y))             # max 
print('Argmin = ', np.argmin(y))      # argmin
print('Argmax = ', np.argmax(y))      # argmax
print("Average =", np.mean(y))        # mean/average
print("Std deviation =", np.std(y))   # standard deviation
print("Sum =", np.sum(y))             # sum 
# Logical operations
print('y > 0', y > 0)

y = [-1.4  0.4 -3.2  2.5  3.4] 

np.abs(y) = [1.4 0.4 3.2 2.5 3.4]
np.floor(y) =  [-2.  0. -4.  2.  3.]
np.ceil(y) =  [-1.  1. -3.  3.  4.]
np.rint(y) =  [-1.  0. -3.  2.  3.]
np.sqrt(abs(y)) = [1.18321596 0.63245553 1.78885438 1.58113883 1.84390889]
np.sign(y) = [-1.  1. -1.  1.  1.]
np.exp(y) = [ 0.24659696  1.4918247   0.0407622  12.18249396 29.96410005]
np.sort(y) = [-3.2 -1.4  0.4  2.5  3.4]
Min = -3.2
Max = 3.4
Argmin =  2
Argmax =  4
Average = 0.33999999999999997
Std deviation = 2.432776191925595
Sum = 1.6999999999999997
y > 0 [False  True False  True  True]
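
The statistics functions above reduce a whole array to one number. On a 2-dimensional array they also accept an `axis` argument that reduces along rows or columns only. A brief sketch, using an illustrative matrix `m`:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(np.sum(m))              # 21, over all elements
print(np.sum(m, axis=0))      # [5 7 9], one sum per column
print(np.sum(m, axis=1))      # [ 6 15], one sum per row
print(np.argmax(m, axis=1))   # [2 2], column index of each row's maximum
```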


#### Broadcasting
If the given arrays have compatible shapes, numpy applies the operation using "broadcasting", i.e., the smaller array is "broadcast" across the larger array so that the two shapes match.

When operating on multiple arrays, the following broadcasting rules are used.
- Shapes are compared dimension by dimension, from right to left. 
- Two dimensions are compatible if they have the same size or if one of them has size 1. 
- A dimension of size 1 is broadcast (as if the value were repeated) to match the other array. 
- If one array has fewer dimensions, extra dimensions of size 1 are added to its left as needed.
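
The rules above can be sketched as a small function that computes the broadcast shape of two arrays. `broadcast_shape` is a hypothetical helper written for illustration, not a NumPy function; the last line compares it with what NumPy itself reports through `np.broadcast`.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Hypothetical helper: apply the broadcasting rules to two shapes."""
    ndim = max(len(shape_a), len(shape_b))
    # Pad the shorter shape with 1s on the left.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for da, db in zip(a, b):
        if da == db or db == 1:
            result.append(da)      # equal sizes, or b is stretched
        elif da == 1:
            result.append(db)      # a is stretched
        else:
            raise ValueError(f'incompatible sizes {da} and {db}')
    return tuple(result)

print(broadcast_shape((4, 3), (3,)))      # (4, 3)
print(broadcast_shape((4, 1), (1, 3)))    # (4, 3)
print(np.broadcast(np.empty((4, 3)), np.empty(3)).shape)   # (4, 3)
```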

![](https://numpy.org/doc/stable/_images/broadcasting_2.png)

In [19]:
a = np.array([[ 0.0,  0.0,  0.0],
              [10.0, 10.0, 10.0],
              [20.0, 20.0, 20.0],
              [30.0, 30.0, 30.0]])
b = np.array([1.0, 2.0, 3.0])
print(a + b)

[[ 1.  2.  3.]
 [11. 12. 13.]
 [21. 22. 23.]
 [31. 32. 33.]]


In [20]:
# if the two arrays have incompatible shapes, you will encounter a ValueError
b = np.array([1.0, 2.0, 3.0, 4.0])
print(a + b)

ValueError: operands could not be broadcast together with shapes (4,3) (4,) 
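
If the intent was to add one value of `b` to each row of `a`, one possible fix is to give `b` an explicit second axis so that its shape becomes `(4, 1)`. This is a sketch of that assumption, not the only way to resolve the error.

```python
import numpy as np

a = np.array([[ 0.0,  0.0,  0.0],
              [10.0, 10.0, 10.0],
              [20.0, 20.0, 20.0],
              [30.0, 30.0, 30.0]])
b = np.array([1.0, 2.0, 3.0, 4.0])   # shape (4,)

# Shapes (4, 3) and (4,) clash on the trailing axis (3 vs 4).
# Reshaping b to (4, 1) lets it broadcast across the columns instead.
b_col = b[:, np.newaxis]
print(b_col.shape)    # (4, 1)
result = a + b_col
print(result)
```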

Illustrative example of array broadcasting

![](./figs/broadcasting.png)

### 2.1.4 Array iteration
You can iterate over array rows or components.

In [21]:
a = np.array([1,2,3])
for x in a:
    print(x)

b = np.array([[1,2],[3,4],[5,6]])
for x in b:
    print(x)

for (x,y) in b:
    print(x * y)

1
2
3
[1 2]
[3 4]
[5 6]
2
12
30
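
Explicit loops over arrays work but are often unnecessary. As a sketch, the last loop above can be written as a single element-wise expression, and `np.ndenumerate` can be used when the index of each element is needed.

```python
import numpy as np

b = np.array([[1, 2], [3, 4], [5, 6]])

# Vectorized form of the last loop: multiply the two columns element-wise.
col_product = b[:, 0] * b[:, 1]
print(col_product)          # [ 2 12 30]
print(b.prod(axis=1))       # same result, product along each row

# np.ndenumerate yields (index, value) pairs in row-major order.
for index, value in np.ndenumerate(b):
    print(index, value)
```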


### 2.1.5 Numpy linear algebra

Numpy provides many functions to support linear algebra operations.

In [22]:
X = np.random.randn(2,3)                         # create a 2 x 3 random matrix
print('X =\n', X, '\n')
print('Transpose of X, X.T =\n', X.T, '\n')      # matrix transpose operation X^T

y = np.random.randn(3) # random vector 
print('y =', y, '\n')

print('Matrix-vector multiplication')
print('X.dot(y) =\n', X.dot(y), '\n')            # matrix-vector multiplication  X * y

print('Matrix-matrix product')
print('X.dot(X.T) =', X.dot(X.T))        # matrix-matrix multiplication  X * X^T
print('\nX.T.dot(X) =\n', X.T.dot(X))      # matrix-matrix multiplication  X^T * X

X =
 [[ 0.48621806  0.71608885 -0.60074337]
 [-0.32227417 -1.1488435  -0.68736787]] 

Transpose of X, X.T =
 [[ 0.48621806 -0.32227417]
 [ 0.71608885 -1.1488435 ]
 [-0.60074337 -0.68736787]] 

y = [ 0.39363111  1.50113806 -1.33721614] 

Matrix-vector multiplication
X.dot(y) =
 [ 2.06966252 -0.93227044] 

Matrix-matrix product
X.dot(X.T) = [[ 1.11008384 -0.56643786]
 [-0.56643786  1.89617663]]

X.T.dot(X) =
 [[ 0.34026864  0.71841792 -0.07057136]
 [ 0.71841792  1.83262465  0.35949248]
 [-0.07057136  0.35949248  0.83336719]]
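
The `@` operator is an alternative spelling of the matrix products above. The sketch below checks that it agrees with `.dot` and contrasts it with `*`, which is element-wise. A seeded generator from `np.random.default_rng` is used here, instead of `np.random.randn`, only to make the run repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator so the run is repeatable
X = rng.standard_normal((2, 3))
y = rng.standard_normal(3)

print(np.allclose(X @ y, X.dot(y)))        # True
print(np.allclose(X @ X.T, X.dot(X.T)))    # True
print((X @ X.T).shape, (X.T @ X).shape)    # (2, 2) (3, 3)

# Note that * is element-wise, not a matrix product.
print((X * X).shape)                       # (2, 3)
```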


In [23]:
X = np.random.randn(5,3)
print('X =\n', X, '\n')

C = X.T.dot(X)               # C = X^T * X is a square matrix
print('C = X.T.dot(X) =\n', C, '\n')

invC = np.linalg.inv(C)      # inverse of a square matrix
print('Inverse of C = np.linalg.inv(C)\n', invC, '\n')

detC = np.linalg.det(C)      # determinant of a square matrix
print('Determinant of C = np.linalg.det(C) =', detC)

S, U = np.linalg.eig(C)      # eigenvalues S and eigenvectors U (one per column) of a square matrix
print('Eigenvalues of C =\n', S)
print('Eigenvectors of C =\n', U)

X =
 [[ 0.75509637  1.21031778 -0.27673027]
 [-0.59423527 -0.38830071  0.98806093]
 [ 0.30594205  0.99951386  0.62389756]
 [-0.56217128  1.41717456  0.40134906]
 [ 0.39379507 -1.07963493 -0.34755409]] 

C = X.T.dot(X) =
 [[ 1.48799773  0.22859211 -0.96771418]
 [ 0.22859211  5.78866983  0.84901115]
 [-0.96771418  0.84901115  1.72396713]] 

Inverse of C = np.linalg.inv(C)
 [[ 1.17343265 -0.15407472  0.73456074]
 [-0.15407472  0.20643098 -0.18814889]
 [ 0.73456074 -0.18814889  1.085047  ]] 

Determinant of C = np.linalg.det(C) = 7.890232613193009
Eigenvalues of C =
 [0.52645665 2.51502847 5.95914957]
Eigenvectors of C =
 [[-0.71873286  0.69524048  0.00798445]
 [ 0.14106707  0.13456969  0.98081144]
 [-0.68082535 -0.70606775  0.19479519]]
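
The results above can be sanity-checked numerically. The sketch below rebuilds a matrix `C` the same way, using a seeded generator as an assumption for repeatability, and verifies the defining properties of the inverse, the eigenpairs, and the determinant.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator so the run is repeatable
X = rng.standard_normal((5, 3))
C = X.T @ X                      # 3 x 3 symmetric matrix

invC = np.linalg.inv(C)
print(np.allclose(C @ invC, np.eye(3)))           # True

S, U = np.linalg.eig(C)
# Each column U[:, i] is an eigenvector for eigenvalue S[i]: C u = s u.
print(np.allclose(C @ U, U * S))                  # True
# The determinant equals the product of the eigenvalues.
print(np.isclose(np.linalg.det(C), np.prod(S)))   # True
```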
