# Introduction to NumPy

-----

As we discussed previously, the Python programming language provides a rich set of data structures such as the list, tuple, dictionary, and string, which can greatly simplify common programming tasks. Of these structures, all but the string are heterogeneous, which means they can hold data of different types. This flexibility comes at a cost, however, as it is more expensive in both computational time and storage to maintain an arbitrary collection of data than to hold a pre-defined set of data.

For example, the Python list is implemented as a variable-length array that contains pointers to the objects held in the array. While flexible, it takes time to create, resize, and iterate over, even if the data contained in the list is homogeneous. Also, even though you can create multi-dimensional lists, creating and working with them is neither simple nor intuitive. Yet many applications require a multi-dimensional representation, for example, a location on the surface of the Earth or pixel properties in an image.

Thus, these data structures are not designed or optimized for data-intensive computing. Scientific and engineering computing applications have a long history of using optimized data structures and libraries, including codes written in C, Fortran, or MATLAB. These applications expect to have vector and matrix data structures and optimized algorithms that can leverage these structures seamlessly. Fortunately, since many data science applications, including statistical and machine learning, are built on this academic legacy, a community of open source developers has built the [Numerical Python (NumPy)](http://numpy.org) library, a fundamental numerical package to facilitate scientific computing in Python.

## Table of Contents

[NumPy Array](#NumPy-Array)  

[Indexing Arrays](#Indexing-Arrays)  

[NumPy Basic Operations](#NumPy-Basic-Operations)  

---
[[Back to TOC]](#Table-of-Contents)


## NumPy Array

NumPy is built around a new, n-dimensional array (`ndarray`) data structure that provides fast support for numerical computations. The data type of the objects stored in the array can be specified at creation time, but the array is homogeneous: every element has the same type. This array can be used to represent a vector (a one-dimensional set of numerical values) or a matrix (a two-dimensional set of values, which can be viewed as a collection of vectors). Furthermore, NumPy provides additional benefits built on top of the `array` object, including masked arrays, universal functions, sampling from random distributions, and support for user-defined, arbitrary data types that allow the `array` to become an efficient, multi-dimensional generic data container.

-----

### Is NumPy Worth Learning?

Despite the discussion in the previous section, you might be curious if the benefits of learning NumPy are worth the effort of learning to effectively use a new Python data structure, especially one that is not part of the standard Python distribution. In the end, you will have to make this decision, but there are several definite benefits to using NumPy:

1. For numerical work, NumPy array operations are typically much faster than equivalent loops over a `list`.
2. NumPy is generally more intuitive than using a `list` for numerical data.
3. NumPy is used by many other libraries like SciPy, Matplotlib, and pandas.
4. NumPy is part of the standard **data science** Python distribution.
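
The speed claim in the first item can be checked with a rough timing sketch. The array length `n` and the repetition count are arbitrary choices for illustration, and the measured times depend on the machine, so no specific speedup is implied.

```python
import timeit

import numpy as np

n = 100_000  # arbitrary size chosen for illustration
values_list = list(range(n))
values_array = np.arange(n, dtype=np.int64)

# Square every element: a list comprehension versus one vectorized expression
squares_list = [v * v for v in values_list]
squares_array = values_array * values_array

# Time ten repetitions of each approach; results vary from machine to machine
list_time = timeit.timeit(lambda: [v * v for v in values_list], number=10)
array_time = timeit.timeit(lambda: values_array * values_array, number=10)

print(f"list comprehension: {list_time:.4f} s")
print(f"NumPy array:        {array_time:.4f} s")
```

Both approaches compute the same values; only the time taken differs.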


-----


### Creating an Array

[NumPy arrays][i], which are instances of the `ndarray` class, are statically typed, homogeneous data structures that can be created in a number of [different ways][1]. You can create an array from an existing Python `list` or `tuple`, or use one of the many built-in NumPy convenience functions:

- `empty`: Creates a new array whose elements are uninitialized.
- `zeros`: Creates a new array whose elements are initialized to zero.
- `ones`: Creates a new array whose elements are initialized to one.
- `empty_like`: Creates a new array whose shape and data type match the input array and whose values are uninitialized.
- `zeros_like`: Creates a new array whose shape and data type match the input array and whose values are initialized to zero.
- `ones_like`: Creates a new array whose shape and data type match the input array and whose values are initialized to one.

-----
[i]: http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html
[1]: http://docs.scipy.org/doc/numpy/user/basics.creation.html

In [1]:
import numpy as np
# Make and print out simple NumPy arrays

print(np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))

print("\n", np.empty(10))
print("\n", np.zeros(10))
print("\n", np.ones(10))
print("\n", np.ones_like(np.arange(10)))

[0 1 2 3 4 5 6 7 8 9]

 [0.0e+000 4.9e-324 9.9e-324 1.5e-323 2.0e-323 2.5e-323 3.0e-323 3.5e-323
 4.0e-323 4.4e-323]

 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]

 [1 1 1 1 1 1 1 1 1 1]
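
The `_like` functions are easier to appreciate with a two-dimensional template. The following is a small sketch; the `template` array and its `int32` data type are made up for illustration.

```python
import numpy as np

# A hypothetical 2 x 3 template array with an explicit data type
template = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

zeros = np.zeros_like(template)
ones = np.ones_like(template)
scratch = np.empty_like(template)  # contents are arbitrary until assigned

# Each new array copies the shape and data type of the template
print(zeros.shape, zeros.dtype)
print(ones)
print(scratch.shape, scratch.dtype)
```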


-----

We can also create NumPy arrays that have specific initialization patterns. For example, the `arange(start, stop, step)` function works like the built-in Python `range` function, creating an array whose elements begin with the `start` parameter. Subsequent elements are incremented successively by the `step` parameter, stopping when the `stop` parameter would either be reached or surpassed. As was the case with `range`, the `start` and `step` parameters are optional, defaulting to zero and one, respectively, and the `stop` value is **not** included in the array.

-----

In [2]:
# Demonstrate the np.arange method

print(np.arange(10))
print(np.arange(3, 10, 2))

[0 1 2 3 4 5 6 7 8 9]
[3 5 7 9]
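
Unlike the built-in `range`, `arange` also accepts a non-integer step, and like `range` it accepts a negative step. A brief sketch follows, with values chosen so that the results are exact in floating point.

```python
import numpy as np

# A fractional step; the end value 1 is excluded
quarters = np.arange(0, 1, 0.25)

# A negative step counts down; the end value 0 is excluded
countdown = np.arange(10, 0, -3)

print(quarters.tolist())   # [0.0, 0.25, 0.5, 0.75]
print(countdown.tolist())  # [10, 7, 4, 1]
```

With fractional steps that are not exactly representable in floating point, the number of elements can be surprising, which is one reason `linspace` is often preferred for non-integer spacing.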


-----

<font color='red' size = '5'> Student Exercise </font>

In the empty **code** cell below, create a one-dimensional NumPy array of the integers zero to ten, inclusive, by using the `arange` function. Next, create an array of the same shape by using the `zeros_like` function.

-----

-----

NumPy also provides a convenient function for creating an array whose elements are evenly spaced. The `linspace(start, stop, num)` function creates `num` elements and assigns values that are linearly spaced between `start` and `stop`, inclusive. The default value of `num` is 50, so if it is not provided, the function creates 50 evenly spaced data points.

-----

In [3]:
# Demonstrate linear array creation.

print(f"Linear space bins [0, 10] = {np.linspace(0, 10, 4)}\n")

print(f"Default linspace bins = {len(np.linspace(0,10))}\n")


Linear space bins [0, 10] = [ 0.          3.33333333  6.66666667 10.        ]

Default linspace bins = 50
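
Two optional `linspace` arguments are worth knowing about: `endpoint=False` excludes the end value, and `retstep=True` also returns the spacing between samples. A minimal sketch:

```python
import numpy as np

# Five points including both ends
closed = np.linspace(0, 10, 5)

# Five points with the end value excluded
half_open = np.linspace(0, 10, 5, endpoint=False)

# Ask for the step size along with the samples
samples, step = np.linspace(0, 1, 5, retstep=True)

print(closed.tolist())     # [0.0, 2.5, 5.0, 7.5, 10.0]
print(half_open.tolist())  # [0.0, 2.0, 4.0, 6.0, 8.0]
print(step)                # 0.25
```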



-----

<font color='red' size = '5'> Student Exercise </font>

In the empty **code** cell below, create a one-dimensional NumPy array by using the `linspace` function that contains twenty-five values between one and ten.

-----

-----

### Array Attributes

Each NumPy array has several attributes that describe the general features of the array. These attributes include the following:

- `ndim`: Number of dimensions of the array (the previous examples were all one-dimensional)
- `shape`: The dimensions of the array, so a matrix with `n` rows and `m` columns has `shape` equal to `(n, m)`
- `size`: The total number of elements in the array. For a matrix with n rows and m columns, the `size` is equal to the product of $n \times m$
- `dtype`: The data type of each element in the array
- `itemsize`: The size in bytes of each element in the array
- `data`: The buffer that holds the array data
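
As a quick illustration of these attributes, the sketch below inspects a small two-dimensional array. The 3 x 4 shape is arbitrary, and `nbytes` (total bytes used by the elements) is an additional attribute not listed above.

```python
import numpy as np

# A hypothetical matrix with 3 rows and 4 columns of 64-bit floats
m = np.zeros((3, 4), dtype=np.float64)

print(m.ndim)      # 2
print(m.shape)     # (3, 4)
print(m.size)      # 12
print(m.dtype)     # float64
print(m.itemsize)  # 8
print(m.nbytes)    # 96, which is size * itemsize
```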

-----

### Array Data Types

NumPy arrays are statically typed, thus their [data type][1] is fixed when they are created. Creation functions such as `zeros` and `ones` default to a floating-point type, but the type can be specified in several ways. First, if you use an existing `list` (as we did in a previous code cell) or `array` to initialize the new `array`, the data type is inferred from the existing data. If a heterogeneous Python `list` of numbers is used, the most general data type among the values is chosen, guaranteeing that all values can be safely stored in the new `array`. If using a NumPy function to create the new `array`, the data type can be specified explicitly by using the `dtype` argument, which can either be one of the predefined built-in data types or a user-defined custom data type.

The full list of built-in data types can be obtained from the `np.sctypeDict.keys()` method, but for brevity, we list some of the more commonly used built-in data types below, along with their maximum size in bits, which constrains the maximum allowed value that may be stored in the new `array`:

- Integer: `int8`, `int16`, `int32`, and `int64` 
- Unsigned Integer: `uint8`, `uint16`, `uint32`, and `uint64` 
- Floating Point: `float16`, `float32`, `float64`, and `float128` (the last is available only on some platforms)

Other data types include complex numbers, byte arrays, character arrays, and date/time arrays. 

To check the type of an array, access the array's `dtype` attribute. A `ValueError` exception is raised if you try to assign an incompatible value, such as a non-numeric string, to an element of a numeric `array`.

-----
[1]: http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html

In [4]:
a = np.linspace(0, 10, 4)
a.dtype

dtype('float64')
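
The rules for choosing a data type can be sketched as follows. The variable names are made up for illustration; `astype` and `np.iinfo` are standard NumPy facilities that are not otherwise covered in this notebook.

```python
import numpy as np

# Mixing integers and a float promotes every element to float64
mixed = np.array([1, 2.5, 3])

# The dtype argument sets the type explicitly
explicit = np.zeros(4, dtype=np.int16)

# astype returns a converted copy; float to int truncates toward zero
converted = mixed.astype(np.int32)

print(mixed.dtype)         # float64
print(explicit.dtype)      # int16
print(converted.tolist())  # [1, 2, 3]

# The bit width limits the range of values that can be stored
limits = np.iinfo(np.int8)
print(limits.min, limits.max)  # -128 127
```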

-----

As noted above, trying to assign a value of an incompatible type to an element of a NumPy array will raise a `ValueError`, as shown in the following figure (note: you can repeat this on your own in a new code cell).

![NumPy ValueException](images/numpy-except.png)
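
A minimal sketch of this behavior, assuming an integer array and a non-numeric string (both chosen only for illustration), is shown below. Note that not every mismatched assignment fails: a float assigned into an integer array is silently truncated.

```python
import numpy as np

a = np.arange(5)
message = None

# A string that cannot be parsed as an integer is rejected
try:
    a[0] = "not a number"
except ValueError as err:
    message = str(err)
    print("ValueError:", message)

# A float, by contrast, is accepted and truncated to fit the integer type
a[1] = 7.9
print(a.tolist())  # [0, 7, 2, 3, 4]
```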


-----

-----

<font color='red' size = '5'> Student Exercise </font>

In the empty **code** cell below, create a one-dimensional NumPy array of the integers twenty to twenty-five, inclusive, by using the `arange` function. When creating this array, specify that the array should be created using a data type of `int64`. Check the function docstring with `np.arange?` if you don't know how to specify the data type. Print out the array's actual data type to verify you created the array correctly.

-----

### Multidimensional Arrays

NumPy supports multi-dimensional arrays, although for simplicity we will rarely discuss anything other than two- or three-dimensional arrays. We defer the discussion of working with multi-dimensional NumPy arrays to a subsequent lesson in this course.


---
[[Back to TOC]](#Table-of-Contents)


## Indexing Arrays

NumPy supports many different ways to [access elements][1] in an array. Elements can be indexed or sliced in the same way a Python list or tuple can be indexed or sliced, as demonstrated in the following code cell.

------
[1]: http://docs.scipy.org/doc/numpy/user/basics.indexing.html

In [5]:
a = np.arange(9)
print("Original Array = ", a)

a[1] = 3
a[3:5] = 4
a[0:6:2] *= -1

print("\nNew Array = ", a)

Original Array =  [0 1 2 3 4 5 6 7 8]

New Array =  [ 0  3 -2  4 -4  5  6  7  8]
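
One difference from Python lists is worth a short sketch: slicing a list makes a copy, whereas slicing a NumPy array returns a view onto the same data, so changes made through the slice appear in the original array. The variable names below are illustrative.

```python
import numpy as np

a = np.arange(9)

# Negative indices and steps work as they do for lists
print(a[-1])             # 8
print(a[::-1].tolist())  # [8, 7, 6, 5, 4, 3, 2, 1, 0]

# A slice is a view: modifying it modifies the original array
view = a[2:5]
view[0] = 99
print(a[2])  # 99

# Use copy() when an independent array is needed
independent = a[2:5].copy()
independent[0] = -1
print(a[2])  # still 99
```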


-----

### Special Indexing

NumPy also provides several other _special_ indexing techniques. The first such technique is the use of an [index array][1], where you use an array to specify the elements to be selected. The second technique is a [Boolean mask array][2]. In this case, the Boolean array is the same size as the primary NumPy array, and if the element in the mask array is `True`, the corresponding element in the primary array is selected, and vice-versa for a `False` mask array element. These two special indexing techniques are demonstrated in the following two code cells.

-----
[1]: http://docs.scipy.org/doc/numpy/user/basics.indexing.html#index-arrays
[2]: http://docs.scipy.org/doc/numpy/user/basics.indexing.html#boolean-or-mask-index-arrays

In [6]:
# Demonstration of an index array

a = np.arange(10)

print("\nStarting array:\n", a)
print("\nIndex Access: ", a[np.array([1, 3, 5, 7])])


Starting array:
 [0 1 2 3 4 5 6 7 8 9]

Index Access:  [1 3 5 7]


In [7]:
# Demonstrate Boolean mask access

# Simple case

a = np.arange(10)
print("Original Array:", a)

print("\nMask Array: ", a > 4)

# Now change the values by using the mask

a[a > 4] = -1.0
print("\nNew Array: ", a)

Original Array: [0 1 2 3 4 5 6 7 8 9]

Mask Array:  [False False False False False  True  True  True  True  True]

New Array:  [ 0  1  2  3  4 -1 -1 -1 -1 -1]
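
Boolean masks can also be combined. The sketch below uses the element-wise operators `&` (and), `|` (or), and `~` (not); the parentheses around each comparison are required because these operators bind more tightly than the comparisons. `np.where` is shown as one way to build a modified array without changing the original.

```python
import numpy as np

a = np.arange(10)

middle = a[(a > 2) & (a < 7)]  # both conditions hold
edges = a[(a < 2) | (a > 7)]   # either condition holds
odd = a[~(a % 2 == 0)]         # condition does not hold

# np.where picks -1 where the mask is True and the original value elsewhere
replaced = np.where(a > 4, -1, a)

print(middle.tolist())    # [3, 4, 5, 6]
print(edges.tolist())     # [0, 1, 8, 9]
print(odd.tolist())       # [1, 3, 5, 7, 9]
print(replaced.tolist())  # [0, 1, 2, 3, 4, -1, -1, -1, -1, -1]
```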


---
[[Back to TOC]](#Table-of-Contents)


## NumPy Basic Operations

NumPy arrays naturally support basic mathematical operations, including addition, subtraction, multiplication, division, integer division, and remainder, allowing you to easily combine a scalar (a single number) with a vector (a one-dimensional array) or a matrix (a multi-dimensional array). In the next code block, we first create a one-dimensional array, and subsequently operate on this array to demonstrate how to combine a scalar with a vector.

-----

In [8]:
# Create and use a vector
a = np.arange(10)

print(a)
print("\n", (2.0 * a + 1)/3)
# Remainder after division by 2
print("\n", a%2)
# Integer division by 2
print("\n", a//2)

[0 1 2 3 4 5 6 7 8 9]

 [0.33333333 1.         1.66666667 2.33333333 3.         3.66666667
 4.33333333 5.         5.66666667 6.33333333]

 [0 1 0 1 0 1 0 1 0 1]

 [0 0 1 1 2 2 3 3 4 4]
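
The same operators also work between two arrays of the same shape, in which case the operation is applied element by element. A small sketch with made-up values:

```python
import numpy as np

a = np.arange(5)
b = np.array([10, 20, 30, 40, 50])

total = a + b       # element-wise sum
difference = b - a  # element-wise difference
squares = a ** 2    # each element raised to the second power
quarters = b / 4    # true division always produces floats

print(total.tolist())       # [10, 21, 32, 43, 54]
print(difference.tolist())  # [10, 19, 28, 37, 46]
print(squares.tolist())     # [0, 1, 4, 9, 16]
print(quarters.tolist())    # [2.5, 5.0, 7.5, 10.0, 12.5]
```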


-----

<font color='red' size = '5'> Student Exercise </font>

In the empty **code** cell below, first create a one-dimensional NumPy array of the integers zero to ten, inclusive, by using the `arange` function. Next, perform the following basic math operations on this new array and display only the final result. First, add two to every element. Second, multiply the result by five. Finally, subtract seven. Next, create a second array of eleven randomly selected integers between zero and ten, and multiply this new array by the original array, displaying the result.

-----

-----

### Summary Functions

NumPy provides convenience functions that can quickly summarize the values of an array, which can be very useful for specific data processing tasks (note that we will cover descriptive statistics in a subsequent lesson). These functions include basic [statistical measures][1] (`mean`, `median`, `var`, `std`, `min`, and `max`), the total sum or product of all elements in the array (`sum`, `prod`), as well as running sums or products for all elements in the array (`cumsum`, `cumprod`). The last two functions actually produce arrays that are of the same size as the input array, where each element is replaced by the respective running sum/product up to and including the current element. Another function, `trace`, calculates the trace of an array, which simply sums up the diagonal elements in the multi-dimensional array.


-----

[1]: http://docs.scipy.org/doc/numpy/reference/routines.statistics.html

In [9]:
# Demonstrate data processing convenience functions

# Make an array = [1, 2, 3, 4, 5]
a = np.arange(1, 6)

print(f"Mean value = {np.mean(a)}")
print(f"Median value = {np.median(a)}")
print(f"Variance = {np.var(a)}")
print(f"Std. Deviation = {np.std(a)}\n")

print(f"Minimum value = {np.min(a)}")
print(f"Maximum value = {np.max(a)}\n")

print(f"Sum of all values = {np.sum(a)}")
print(f"Running cumulative sum of all values = {np.cumsum(a)}\n")

print(f"Product of all values = {np.prod(a)}")
print(f"Running product of all values = {np.cumprod(a)}\n")

Mean value = 3.0
Median value = 3.0
Variance = 2.0
Std. Deviation = 1.4142135623730951

Minimum value = 1
Maximum value = 5

Sum of all values = 15
Running cumulative sum of all values = [ 1  3  6 10 15]

Product of all values = 120
Running product of all values = [  1   2   6  24 120]
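
The `trace` function is not used in the cell above, so here is a brief sketch. It builds a 3 x 3 matrix with `reshape`, which is not otherwise covered in this notebook. The sketch also shows the optional `ddof` argument of `var`, which switches from the population variance (the default) to the sample variance.

```python
import numpy as np

# Reshape the integers 0..8 into a 3 x 3 matrix
m = np.arange(9).reshape(3, 3)

# The diagonal elements are 0, 4, and 8
diagonal_sum = np.trace(m)
print(diagonal_sum)  # 12

a = np.arange(1, 6)
population_var = np.var(a)      # divides by n
sample_var = np.var(a, ddof=1)  # divides by n - 1
print(population_var, sample_var)  # 2.0 2.5
```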



-----

### Universal Functions

NumPy also includes universal functions, or [__ufuncs__][1], which are vectorized and thus operate on each element in the array without the need for an explicit loop. You have already used some of these functions implicitly in this notebook, since the standard math operators applied to NumPy arrays call the corresponding ufuncs. Almost all of these functions include an optional `out` parameter that allows a pre-allocated NumPy array to hold the results of the calculation, which can often speed up processing by eliminating the creation and destruction of temporary arrays. These functions still return the final array, even if the `out` parameter is used.

NumPy includes over sixty _ufuncs_ that come in several different categories:

- Math operations, which can be called explicitly or implicitly when the standard math operators are used on NumPy arrays. Example functions in this category include `add`, `divide`, `power`, `sqrt`, `log`, and `exp`.
- Trigonometric functions, which assume angles measured in radians. Example functions include the `sin`, `cos`, `arctan`, `sinh`, and `deg2rad` functions.
- Bit-twiddling functions, which manipulate integer arrays as if they are bit patterns. Example functions include the `bitwise_and`, `bitwise_or`, `invert`, and `right_shift`.
- Comparison functions, which can be called explicitly or implicitly when using standard comparison operators that compare two arrays, element-by-element, returning a new array of the same dimension. Example functions include `greater`, `equal`, `logical_and`, and `maximum`.
- Floating functions, which compute floating point tests or operations, element-by-element. Example functions include `isreal`, `isnan`, `signbit`, and `fmod`.

Look at the official [NumPy _ufunc_][1] reference guide for more information on any of these functions, for example, the [isnan][2] function, since the user guide has a full breakdown of each function and sample code demonstrating how to use the function. 

In the following code blocks, we demonstrate several of these _ufuncs_.

-----
[1]: http://docs.scipy.org/doc/numpy/reference/ufuncs.html
[2]: http://docs.scipy.org/doc/numpy/reference/generated/numpy.isnan.html#numpy.isnan

In [10]:
b = np.arange(1, 10)

print('original array:\n', b)

c = np.sin(b)

print('\nnp.sin : \n', c)

print('\nnp.log10 and np.abs : \n', np.log10(np.abs(c)))

print('\nnp.mod : \n', np.mod(b, 2))

print('\nnp.logical_and : \n', np.logical_and(np.mod(b, 2), True))



original array:
 [1 2 3 4 5 6 7 8 9]

np.sin : 
 [ 0.84147098  0.90929743  0.14112001 -0.7568025  -0.95892427 -0.2794155
  0.6569866   0.98935825  0.41211849]

np.log10 and np.abs : 
 [-0.07496085 -0.04129404 -0.85041141 -0.12101744 -0.01821569 -0.55374951
 -0.18244349 -0.00464642 -0.38497791]

np.mod : 
 [1 0 1 0 1 0 1 0 1]

np.logical_and : 
 [ True False  True False  True False  True False  True]


In [11]:
# Demonstrate Boolean tests with operators

d = np.arange(9)

print("Greater Than or Equal Test: \n", d >= 5)

# Now combine two tests to form a single Boolean array

np.logical_and(d > 3, d % 2)

Greater Than or Equal Test: 
 [False False False False False  True  True  True  True]


array([False, False, False, False, False,  True, False,  True, False])
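
The `out` parameter described earlier can be sketched as follows. The array names are illustrative; the point is that the result is written into an existing array rather than a newly allocated one, and that the ufunc still returns that same array.

```python
import numpy as np

a = np.arange(5, dtype=np.float64)

# Pre-allocate an array to receive the results
result = np.empty_like(a)

# Write 2 * a into result; the return value is the same array object
returned = np.multiply(a, 2.0, out=result)

# A ufunc can also update an array in place by using it as both input and output
np.add(result, 1.0, out=result)

print(result.tolist())     # [1.0, 3.0, 5.0, 7.0, 9.0]
print(returned is result)  # True
```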

-----

<font color='red' size = '5'> Student Exercise </font>

In the empty **code** cell below, first create a one-dimensional NumPy array of the even integers from 20 to 42, inclusive. Next, apply a ufunc to compute the fourth power of each element in this array. Divide each element in this array by 11, and print out the resulting elements, one element per line.

-----

### Random Data

NumPy has rich support for [random number][1] generation, which can be used to create and populate an array of a given shape and size. NumPy provides support for sampling random values from over 30 different distributions, including the `normal`, `binomial`, and `poisson` distributions. There are also special convenience functions to simplify the sampling of random data from the uniform or normal distributions. These techniques are demonstrated in the following code cell.


-----
[1]: http://docs.scipy.org/doc/numpy/reference/routines.random.html

In [12]:
# Create arrays of random data from different distributions
# (no seed is set, so the values differ on every run)

print(f"Uniform sampling, integers [0, 10): {np.random.randint(0, 10, 5)}")
print(f"Uniform sampling [0, 1): {np.random.rand(5)}")
print(f"Normal sampling (0, 1) : {np.random.randn(5)}")


If you execute the above code cell again, you will see different random data generated. This makes sense, since our purpose is to generate **random** values. Sometimes, however, we want to reproduce the same **random** values. NumPy has a `RandomState` class for this purpose. You pass a seed, which can be any non-negative integer, to initialize a `RandomState` object, and then use that object to generate **fixed random** values. This is demonstrated in the following code cell.

In [13]:
#Generate fixed random values, use seed 23
rs = np.random.RandomState(23)
print(f"Fixed values, uniform sampling, integers [0, 10): {rs.randint(0, 10, 5)}")
print(f"Fixed values, uniform sampling [0, 1): {rs.rand(5)}")
print(f"Fixed values, Normal sampling (0, 1) : {rs.randn(5)}")

Fixed values, uniform sampling, integers [0, 10): [3 6 8 9 6]
Fixed values, uniform sampling [0, 1): [0.78167951 0.57390838 0.44191368 0.18399514 0.09982012]
Fixed values, Normal sampling (0, 1) : [ 0.56857098 -0.47818389  1.53194062 -0.44070334 -0.06272943]
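
To confirm that a seed makes the values repeatable, the sketch below draws from two `RandomState` objects created with the same seed and compares the results. It also shows `np.random.default_rng`, the generator interface that newer NumPy releases recommend; this assumes NumPy 1.17 or later, and its values differ from those produced by `RandomState` with the same seed.

```python
import numpy as np

# Two generators with the same seed produce identical sequences
first = np.random.RandomState(23).randint(0, 10, 5)
second = np.random.RandomState(23).randint(0, 10, 5)
same_legacy = np.array_equal(first, second)
print(same_legacy)  # True

# The newer Generator interface (assumes NumPy >= 1.17) behaves the same way
rng_a = np.random.default_rng(23)
rng_b = np.random.default_rng(23)
draw_a = rng_a.integers(0, 10, 5)
draw_b = rng_b.integers(0, 10, 5)
same_new = np.array_equal(draw_a, draw_b)
print(same_new)  # True
```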


-----

<font color='red' size = '5'> Student Exercise </font>

In the empty **code** cell below, create a one-dimensional NumPy array that contains 10 random integers drawn from a uniform distribution between 5 and 10. 

-----

-----

## Ancillary Information

The following links are to additional documentation that you might find helpful in learning this material. Reading these web-accessible documents is completely optional.

1. [NumPy tutorial][1]
2. [NumPy cheat sheet][2]
3. [NumPy lecture notes][3]
4. [NumPy notebook demo][4]
5. The [NumPy chapter][pdc] from the book _Python Data Science Handbook_ by Jake VanderPlas

-----

[1]: http://docs.scipy.org/doc/numpy/user/index.html
[2]: http://pages.physics.cornell.edu/~myers/teaching/ComputationalMethods/python/arrays.html
[3]: http://www.scipy-lectures.org/intro/numpy/index.html
[4]: http://nbviewer.ipython.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-2-Numpy.ipynb
[pdc]: http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.00-Introduction-to-NumPy.ipynb

**&copy; 2019: Gies College of Business at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode