# Lecture 7: Introduction to NumPy

[NumPy (*Numerical Python*)](https://numpy.org/) provides the building blocks for much of the data science ecosystem in Python: it is the standard tool for storing and manipulating numerical data efficiently, and it is [friendly to Matlab users](https://numpy.org/doc/stable/user/numpy-for-matlab-users.html).

Unfortunately, NumPy itself does not support GPU operations. For arrays on the GPU, popular substitutes include tensors in [TensorFlow](https://www.tensorflow.org/) and arrays in [JAX](https://github.com/google/jax#quickstart-colab-in-the-cloud) (both by Google), tensors in [PyTorch](https://pytorch.org/) (by Facebook, now Meta), and arrays in [CuPy](https://cupy.dev/) (originally developed by Preferred Networks). All of them are closely related to NumPy or offer a similar interface, so learning the basic concepts of NumPy is crucial for doing data science with Python.

In [6]:
import numpy as np

my_arr = np.arange(1000000)
my_list = list(range(1000000))

In [2]:
%time for _ in range(10): my_arr2 = my_arr * 2 

CPU times: user 18.4 ms, sys: 10.1 ms, total: 28.6 ms
Wall time: 29.4 ms


In [3]:
%time for _ in range(10): my_list2 = [x * 2 for x in my_list]

CPU times: user 968 ms, sys: 283 ms, total: 1.25 s
Wall time: 1.34 s
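
The `%time` magic used above only works inside IPython/Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library `timeit` module can be used. The array size `n` is reduced here (an assumption, chosen only to keep the run short), and the measured times will vary from machine to machine.

```python
import timeit

import numpy as np

n = 100_000  # smaller than the 1,000,000 used above, only to keep the run short
my_arr = np.arange(n)
my_list = list(range(n))

# Time 10 repetitions of each approach, mirroring the %time cells above.
t_numpy = timeit.timeit(lambda: my_arr * 2, number=10)
t_list = timeit.timeit(lambda: [x * 2 for x in my_list], number=10)
print(f"NumPy: {t_numpy:.4f} s, list comprehension: {t_list:.4f} s")

# Both approaches compute the same values.
same_values = (my_arr * 2).tolist() == [x * 2 for x in my_list]
print(same_values)
```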


## Difference between ndarray and list: Data Memory Perspective

[Intuitively speaking](https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html), the built-in list object in Python can be viewed as an "address book" that stores pointers to (possibly heterogeneous) Python objects as its elements. A NumPy array, on the other hand, stores a pointer to one contiguous memory block (the data buffer) managed in C. That is why the elements of a NumPy array must share a fixed type, and why array operations are more efficient than the equivalent list operations.
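
The fixed-type constraint described above can be inspected directly. The following is a minimal sketch using the `dtype`, `itemsize` and `nbytes` attributes; the variable names are only illustrative, and the integer dtype that is printed depends on the platform.

```python
import numpy as np

l = [1, "two", 3.0]          # a list may mix types: each slot points to a Python object
a = np.array([1, 2, 3, 4])   # an array has one dtype shared by all elements

print(a.dtype)      # platform dependent, typically int64
print(a.itemsize)   # bytes used by one element
print(a.nbytes)     # total bytes in the data buffer = size * itemsize

# When the inputs have different numeric types, NumPy picks one common dtype.
mixed = np.array([1, 2.5])
print(mixed.dtype)  # float64
```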

In [7]:
a = np.array([1,2,3,4]) #numpy 1-d array, initialization with list
l = [1,2,3,4]  # python built-in list

Slicing a NumPy array creates a *view* instead of a *copy*. The view shares the same data buffer with the original array.

In [8]:
b = a[0:2] # creating view by slicing

In [9]:
print(b)
b.base # view has the base object because its memory is from some other object.

[1 2]


array([1, 2, 3, 4])

We can also check the `flags` to see whether the array has its "own data".

In [10]:
b.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

In [11]:
a.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

This mechanism may cause unexpected outcomes for beginners.

In [12]:
b[0] = 1000 # change the first element of b (which is the slice of a -- view) 
a

array([1000,    2,    3,    4])

This is very different from the Python built-in list, where slicing creates a new list.

In [13]:
c = l[0: 2] #slicing in list
c[0] = 100
l

[1, 2, 3, 4]

Many other methods/functions in NumPy create a **view** instead of a **copy**, because a view avoids allocating and filling a new data buffer.

For example, `reshape` returns a view whenever possible, which is the case for most reshapes of contiguous arrays.

In [14]:
a_mat = a.reshape(2,2)

In [15]:
a_mat.base

array([1000,    2,    3,    4])

In [16]:
a_mat[0,0] = 2000 # same as a_mat[0][0]
a

array([2000,    2,    3,    4])
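
Reshape returns a view only "whenever possible", so it helps to see a case where it cannot. The sketch below uses `np.shares_memory` to test whether two arrays use the same buffer. It works on a fresh array, so it does not touch the `a` and `a_mat` of the lecture cells.

```python
import numpy as np

base = np.arange(6)
mat = base.reshape(2, 3)              # contiguous data: reshape returns a view
print(np.shares_memory(base, mat))    # True

mat_t = mat.T                         # the transpose is a view with swapped strides
flat = mat_t.reshape(-1)              # this element order cannot be expressed as a view, so NumPy copies
print(np.shares_memory(base, flat))   # False

flat[0] = 100                         # modifying the copy leaves the original untouched
print(base[0])                        # 0
```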

Transpose also creates the **view**.

In [17]:
a_t = a_mat.T # attribute
a_tt = a_mat.transpose() # method

In [18]:
a_t.base

array([2000,    2,    3,    4])

In [19]:
a_t[0,0] = 0 # change the view -- change the data buffer -- the base a is also changed!
a

array([0, 2, 3, 4])

Conversely, once the "base" is changed, **all** the associated "view" objects are changed!

In [20]:
a_mat # reshape of a -- view, changed!

array([[0, 2],
       [3, 4]])

In [21]:
b # slicing of a -- view, changed!

array([0, 2])

Use the `copy` method to create a new data buffer.

In [23]:
a_copy = a.copy()
print(a_copy.base)

None


In [24]:
a_copy.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

In [25]:
a_mat_copy = a_mat.copy()

In [26]:
a_mat_copy.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

## Numpy ndarray as object

As an object created by NumPy, an ndarray has an identity, a type, a value, attributes and methods.

In [27]:
type(a) 

numpy.ndarray

In [28]:
dir(a)

['T',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_finalize__',
 '__array_function__',
 '__array_interface__',
 '__array_prepare__',
 '__array_priority__',
 '__array_struct__',
 '__array_ufunc__',
 '__array_wrap__',
 '__bool__',
 '__class__',
 '__complex__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__ilshift__',
 '__imatmul__',
 '__imod__',
 '__imul__',
 '__index__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__irshift__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lshift__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__

In [29]:
help(a)

Help on ndarray object:

class ndarray(builtins.object)
 |  ndarray(shape, dtype=float, buffer=None, offset=0,
 |          strides=None, order=None)
 |  
 |  An array object represents a multidimensional, homogeneous array
 |  of fixed-size items.  An associated data-type object describes the
 |  format of each element in the array (its byte-order, how many bytes it
 |  occupies in memory, whether it is an integer, a floating point number,
 |  or something else, etc.)
 |  
 |  Arrays should be constructed using `array`, `zeros` or `empty` (refer
 |  to the See Also section below).  The parameters given here refer to
 |  a low-level method (`ndarray(...)`) for instantiating an array.
 |  
 |  For more information, refer to the `numpy` module and examine the
 |  methods and attributes of an array.
 |  
 |  Parameters
 |  ----------
 |  (for the __new__ method; see Notes below)
 |  
 |  shape : tuple of ints
 |      Shape of created array.
 |  dtype : data-type, optional
 |      Any objec

In [30]:
a = np.arange(4)
a.shape # 1-d array with length 4, which is different from a 4x1 2-d array!

(4,)

In [33]:
b = a.reshape(-1,1)
b.shape

(4, 1)

In [34]:
a_mat.shape 

(2, 2)

In [35]:
a_mat.tolist()

[[0, 2], [3, 4]]

In [36]:
a.mean()

1.5

In [37]:
help(a.mean)

Help on built-in function mean:

mean(...) method of numpy.ndarray instance
    a.mean(axis=None, dtype=None, out=None, keepdims=False)
    
    Returns the average of the array elements along given axis.
    
    Refer to `numpy.mean` for full documentation.
    
    See Also
    --------
    numpy.mean : equivalent function



In [38]:
np.mean(a) 

1.5

In [39]:
help(a.reshape)

Help on built-in function reshape:

reshape(...) method of numpy.ndarray instance
    a.reshape(shape, order='C')
    
    Returns an array containing the same data with a new shape.
    
    Refer to `numpy.reshape` for full documentation.
    
    See Also
    --------
    numpy.reshape : equivalent function
    
    Notes
    -----
    Unlike the free function `numpy.reshape`, this method on `ndarray` allows
    the elements of the shape parameter to be passed in as separate arguments.
    For example, ``a.reshape(10, 11)`` is equivalent to
    ``a.reshape((10, 11))``.



## Dimension and Axis of ndarray

NumPy uses the terms *dimension* and *axis* (indexed from 0) to describe the directions along which an array extends; the number of axes is given by the attribute `ndim`. [See the illustrations here.](https://www.cs.ubc.ca/~pcarter/cs189/cs189_ch7s3.html)

In [40]:
a = np.arange(24).reshape(2,3,4) # 3-d array, or tensor  
a

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

In the `reshape` method, you can pass -1 for at most one dimension and let NumPy infer its length from the total number of elements.

In [41]:
np.arange(24).reshape(2,-1,4)

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

In [42]:
help(np.arange) # note the difference from range()

Help on built-in function arange in module numpy:

arange(...)
    arange([start,] stop[, step,], dtype=None)
    
    Return evenly spaced values within a given interval.
    
    Values are generated within the half-open interval ``[start, stop)``
    (in other words, the interval including `start` but excluding `stop`).
    For integer arguments the function is equivalent to the Python built-in
    `range` function, but returns an ndarray rather than a list.
    
    When using a non-integer step, such as 0.1, the results will often not
    be consistent.  It is better to use `numpy.linspace` for these cases.
    
    Parameters
    ----------
    start : number, optional
        Start of interval.  The interval includes this value.  The default
        start value is 0.
    stop : number
        End of interval.  The interval does not include this value, except
        in some cases where `step` is not an integer and floating point
        round-off affects the length of `out`.
   

In [43]:
print(a.T)
a.T.shape

[[[ 0 12]
  [ 4 16]
  [ 8 20]]

 [[ 1 13]
  [ 5 17]
  [ 9 21]]

 [[ 2 14]
  [ 6 18]
  [10 22]]

 [[ 3 15]
  [ 7 19]
  [11 23]]]


(4, 3, 2)

In [44]:
a_1d = np.array([1,2,3,4])
a_1d.shape

(4,)

In [45]:
a_1d.T.shape # the transpose is still a 1-d array! This is very different from Matlab!

(4,)

In [46]:
a_2d = a_1d[:,np.newaxis] # increase dimension
a_2d.shape 

(4, 1)

In [47]:
a_2d

array([[1],
       [2],
       [3],
       [4]])

In [48]:
a_1d

array([1, 2, 3, 4])

In [49]:
print(a_1d.ndim)
print(a_2d.ndim)

1
2
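
Because transposing a 1-d array does nothing, an explicit column or row vector has to be built by adding an axis. The following sketch (with illustrative names) shows a few equivalent ways of doing so, and that the transpose behaves as expected once the array is 2-d.

```python
import numpy as np

v = np.array([1, 2, 3, 4])

col = v[:, np.newaxis]     # shape (4, 1), same as v.reshape(-1, 1)
row = v[np.newaxis, :]     # shape (1, 4), same as v.reshape(1, -1)
print(col.shape, row.shape)

print(v[:, None].shape)    # None is an alias for np.newaxis: (4, 1)

print(col.T.shape)         # a 2-d column transposes to a 2-d row: (1, 4)

# Adding an axis does not copy the data: col is a view of v.
print(np.shares_memory(v, col))   # True
```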


To flatten a multi-dimensional array into a 1-d array, in addition to `reshape` (returns a view when possible), we can also use `ravel` (returns a view when possible) or `flatten` (always returns a copy).

In [50]:
a_mat = np.zeros((2,2)) # the shape is passed as a tuple, hence the double parentheses
a_mat_reshape = a_mat.reshape(-1) # -1 lets NumPy infer the length; this returns a view
a_mat_ravel =  a_mat.ravel()
a_mat_flatten = a_mat.flatten()

In [51]:
a_mat_reshape

array([0., 0., 0., 0.])

In [52]:
a_mat_ravel.base

array([[0., 0.],
       [0., 0.]])

In [53]:
a_mat_flatten.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
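
Instead of reading `base` and `flags`, the difference between `ravel` and `flatten` can also be checked with `np.shares_memory`. This is a small sketch on a fresh array `m` (an illustrative name); it also shows that `ravel` falls back to a copy when the data is not contiguous in the requested order.

```python
import numpy as np

m = np.arange(6).reshape(2, 3)

r = m.ravel()      # a view, because the data is contiguous
f = m.flatten()    # always a copy
print(np.shares_memory(m, r))   # True
print(np.shares_memory(m, f))   # False

# On a transposed array, ravel has to copy as well.
rt = m.T.ravel()
print(np.shares_memory(m, rt))  # False
```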

## Indexing of ndarray

**1. Slicing: Similar to list indexing**

Always remember that slicing creates a view instead of a copy!

In [55]:
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
b = a[:2, 1:3] # create the view instead of copy
print(a[0, 1])   
b[0, 0] = 77
print(a[0, 1])   

2
77


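Be cautious with the difference between simple indexing (one integer index) and slicing: an integer index removes the corresponding axis, while a slice keeps it, even if the slice has length 1.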

In [56]:
a[:,0] # 1-d array

array([1, 5, 9])

In [57]:
a[:,0:1] # 2-d array

array([[1],
       [5],
       [9]])

In [5]:
a[0:1,:] # 2-d array

array([[ 1, 77,  3,  4]])

For more practice, see Figure 4-2 in [this material](https://www.oreilly.com/library/view/python-for-data/9781449323592/ch04.html).

**2. Boolean Indexing**

In [58]:
a[a<5] = 0 # In NumPy terms, a<5 creates the "mask" containing True or False values

In [59]:
a

array([[ 0, 77,  0,  0],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [60]:
b = a[a>2]
b

array([77,  5,  6,  7,  8,  9, 10, 11, 12])

Boolean indexing returns a new ndarray (a copy) instead of a view.

In [61]:
x = np.arange(10)
y = x[(x>4) & (x<8)] # use & (element-wise and) here, not the keyword "and"

In [62]:
y.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
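
The copy semantics apply when *reading* through a mask. Assigning through a mask, as in the earlier `a[a<5] = 0` cell, writes into the original array. A short sketch of both cases, using the same `x` as above:

```python
import numpy as np

x = np.arange(10)
mask = (x > 4) & (x < 8)   # element-wise "and"; the keyword `and` would raise a ValueError

y = x[mask]                # reading through a mask returns a copy
y[0] = -1
print(x)                   # x is unchanged: [0 1 2 3 4 5 6 7 8 9]

x[mask] = 0                # assigning through a mask modifies x itself
print(x)                   # [0 1 2 3 4 0 0 0 8 9]
```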

**3. Integer Array Indexing (Fancy Indexing)**

General rule: `arr[[ind1,ind2]]` just means `np.array([arr[ind1],arr[ind2]])`

In [63]:
ind = np.array([1,0,2]) # a plain list [1,0,2] works as well
x = np.arange(10)
x[ind] # equivalently, x[[1,0,2]]

array([1, 0, 2])

In [64]:
a = np.arange(12).reshape(3,4)
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [13]:
a[[1,0,2],:]

array([[ 4,  5,  6,  7],
       [ 0,  1,  2,  3],
       [ 8,  9, 10, 11]])

In [14]:
a[2,[1,0,2]]

array([ 9,  8, 10])
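
The general rule stated above can be checked directly. The sketch below also covers one case not shown in the cells: when two index lists are given, they are paired element by element rather than combined as a grid.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# The general rule: indexing with a list of rows stacks the selected rows.
rows = a[[1, 0, 2]]
print(np.array_equal(rows, np.array([a[1], a[0], a[2]])))   # True

# Two index lists are paired element by element: this picks a[0, 1] and a[2, 3].
pairs = a[[0, 2], [1, 3]]
print(pairs)   # [ 1 11]

# Like boolean indexing, fancy indexing returns a copy, not a view.
print(np.shares_memory(a, rows))   # False
```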

## Numpy Universal Functions (ufuncs) and Aggregate Functions

As in Matlab, explicit loops in Python can be very slow for large-scale problems. To address this, NumPy relies on [vectorization](https://numpy.org/doc/stable/glossary.html#term-vectorization): the loops are implemented in optimized C code and exposed through NumPy universal functions (ufuncs).

NumPy ufuncs operate on ndarrays element by element. You can find all the ufuncs in the [documentation](https://numpy.org/doc/stable/reference/ufuncs.html).

In [66]:
x = np.arange(1000000)
np.log(1+x)

array([ 0.        ,  0.69314718,  1.09861229, ..., 13.81550856,
       13.81550956, 13.81551056])
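
To make "element by element" concrete, here is a minimal sketch comparing an explicit Python loop with the corresponding ufunc call on a small array. It also shows that arithmetic operators on arrays are themselves ufunc calls. The names `loop_result` and `ufunc_result` are illustrative only.

```python
import math

import numpy as np

x = np.arange(5)

# Explicit Python loop: one math.log call per element.
loop_result = np.array([math.log(1 + v) for v in x])

# Ufunc: the same loop runs in compiled code.
ufunc_result = np.log(1 + x)
print(np.allclose(loop_result, ufunc_result))   # True

# Arithmetic operators on arrays dispatch to ufuncs: x + x is np.add(x, x).
print(np.array_equal(x + x, np.add(x, x)))      # True

# np.log1p computes log(1 + x) in one step.
print(np.allclose(np.log1p(x), ufunc_result))   # True
```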

We can also iterate over the elements of a NumPy array just as with a Python built-in list (or, of course, by iterating over indices), although this is not recommended for large-scale problems.

In [16]:
a = np.arange(6)
for elem in a:
    print(elem, end =" " )

0 1 2 3 4 5 

In [17]:
a = a.reshape(2,-1)
for row in a:
    print(row, end =" " )  

[0 1 2] [3 4 5] 

In [18]:
for row in a:
    for elem in row:
        print(elem, end =" " ) 

0 1 2 3 4 5 

In [19]:
for elem in np.nditer(a):
    print(elem, end =" " )

0 1 2 3 4 5 

In [20]:
for (idx, elem) in np.ndenumerate(a):
    print([idx, elem])

[(0, 0), 0]
[(0, 1), 1]
[(0, 2), 2]
[(1, 0), 3]
[(1, 1), 4]
[(1, 2), 5]


Numpy also provides some useful aggregate functions.

In [67]:
a = np.arange(6).reshape(2,3)
a 

array([[0, 1, 2],
       [3, 4, 5]])

In [22]:
a.sum(axis=0)

array([3, 5, 7])

In [23]:
a.sum(axis=1)

array([ 3, 12])

In [68]:
a.sum()

15

In [69]:
a.min(axis=1)

array([0, 3])

In [70]:
b = np.arange(24).reshape(2,3,-1)
b

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

In [26]:
b.sum(axis=1)

array([[12, 15, 18, 21],
       [48, 51, 54, 57]])

In [71]:
b.max(axis=0)

array([[12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])
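
A useful way to read the `axis` argument in the cells above: aggregating along an axis removes that axis from the shape. The sketch below checks this on the same 2x3x4 array, and also shows the `keepdims` parameter (listed in the `a.mean` help text earlier) and a tuple of axes.

```python
import numpy as np

b = np.arange(24).reshape(2, 3, 4)

# Aggregating along an axis removes that axis from the shape.
print(b.sum(axis=0).shape)   # (3, 4)
print(b.sum(axis=1).shape)   # (2, 4)
print(b.sum(axis=2).shape)   # (2, 3)

# A tuple of axes aggregates over several axes at once.
print(b.sum(axis=(0, 2)))    # [ 60  92 124]

# keepdims=True keeps the reduced axis with length 1.
print(b.sum(axis=1, keepdims=True).shape)   # (2, 1, 4)
```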

## Numpy Linear Algebra Functions

See the reference [here](https://numpy.org/doc/stable/reference/routines.linalg.html?highlight=linear%20algebra#matrix-and-vector-products) and [compare it with Matlab](https://numpy.org/doc/stable/user/numpy-for-matlab-users.html). Be cautious with the operators `*` (element-wise product) and `@` (matrix product, available since Python 3.5), and with the functions/methods `dot`, `vdot` and `matmul`.
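
As a quick sketch of how these operators and functions differ, the example below applies them to two small matrices and a vector. The arrays `A`, `B`, `v` and `z` are made up for illustration.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
v = np.array([1, 2])

print(A * B)   # element-wise product: [[0 2] [3 0]]
print(A @ B)   # matrix product: [[2 1] [4 3]]
print(np.array_equal(A @ B, np.matmul(A, B)))   # True
print(np.array_equal(A @ B, np.dot(A, B)))      # True for 2-d inputs

print(A @ v)         # matrix-vector product, the result is 1-d: [ 5 11]
print(np.dot(v, v))  # inner product of two 1-d arrays: 5

# vdot takes the complex conjugate of its first argument; dot does not.
z = np.array([1j, 2])
print(np.dot(z, z))    # 1j*1j + 2*2 = 3
print(np.vdot(z, z))   # conj(1j)*1j + 2*2 = 5
```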

In [72]:
help(np.dot)

Help on function dot in module numpy:

dot(...)
    dot(a, b, out=None)
    
    Dot product of two arrays. Specifically,
    
    - If both `a` and `b` are 1-D arrays, it is inner product of vectors
      (without complex conjugation).
    
    - If both `a` and `b` are 2-D arrays, it is matrix multiplication,
      but using :func:`matmul` or ``a @ b`` is preferred.
    
    - If either `a` or `b` is 0-D (scalar), it is equivalent to :func:`multiply`
      and using ``numpy.multiply(a, b)`` or ``a * b`` is preferred.
    
    - If `a` is an N-D array and `b` is a 1-D array, it is a sum product over
      the last axis of `a` and `b`.
    
    - If `a` is an N-D array and `b` is an M-D array (where ``M>=2``), it is a
      sum product over the last axis of `a` and the second-to-last axis of `b`::
    
        dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])
    
    Parameters
    ----------
    a : array_like
        First argument.
    b : array_like
        Second argument.
    out : 

In [73]:
help(np.vdot)

Help on function vdot in module numpy:

vdot(...)
    vdot(a, b)
    
    Return the dot product of two vectors.
    
    The vdot(`a`, `b`) function handles complex numbers differently than
    dot(`a`, `b`).  If the first argument is complex the complex conjugate
    of the first argument is used for the calculation of the dot product.
    
    Note that `vdot` handles multidimensional arrays differently than `dot`:
    it does *not* perform a matrix product, but flattens input arguments
    to 1-D vectors first. Consequently, it should only be used for vectors.
    
    Parameters
    ----------
    a : array_like
        If `a` is complex the complex conjugate is taken before calculation
        of the dot product.
    b : array_like
        Second argument to the dot product.
    
    Returns
    -------
    output : ndarray
        Dot product of `a` and `b`.  Can be an int, float, or
        complex depending on the types of `a` and `b`.
    
    See Also
    --------
    dot : 

In [74]:
help(np.matmul)

Help on ufunc object:

matmul = class ufunc(builtins.object)
 |  Functions that operate element by element on whole arrays.
 |  
 |  To see the documentation for a specific ufunc, use `info`.  For
 |  example, ``np.info(np.sin)``.  Because ufuncs are written in C
 |  (for speed) and linked into Python with NumPy's ufunc facility,
 |  Python's help() function finds this page whenever help() is called
 |  on a ufunc.
 |  
 |  A detailed explanation of ufuncs can be found in the docs for :ref:`ufuncs`.
 |  
 |  Calling ufuncs:
 |  
 |  op(*x[, out], where=True, **kwargs)
 |  Apply `op` to the arguments `*x` elementwise, broadcasting the arguments.
 |  
 |  The broadcasting rules are:
 |  
 |  * Dimensions of length 1 may be prepended to either array.
 |  * Arrays may be repeated along dimensions of length 1.
 |  
 |  Parameters
 |  ----------
 |  *x : array_like
 |      Input arrays.
 |  out : ndarray, None, or tuple of ndarray and None, optional
 |      Alternate array object(s) in which