__Author__: Christian Camilo Urcuqui López

__Date__: 2 October 2018

<img src="../Utilities/NumPy_logo.png" width="500">

NumPy is a useful tool for numerical tasks: it provides mechanisms for storing data and operating on it that stay efficient as arrays grow larger. It is one of the most important foundational packages for numerical computing in Python, and many scientific packages are built on it.

NumPy provides several useful tools, among them:

+ ndarray, an efficient multidimensional array providing fast array-oriented arithmetic operations.
+ Mathematical functions for fast operations on entire arrays of data without having to write loops.
+ Tools for reading/writing array data to disk and working with memory-mapped files.
+ Linear algebra, random number generation, and Fourier transform capabilities.
+ A C API for connecting NumPy with libraries written in C, C++, or FORTRAN.


The NumPy user guide is available at https://docs.scipy.org/doc/numpy/user/index.html

NumPy is so important for numerical computations in Python because it is designed for efficiency on large arrays of data. There are a number of reasons for this:
+ NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. NumPy's library of algorithms written in the C language can operate on this memory without any type checking or other overhead.
+ NumPy operations perform complex computations on entire arrays without the need for Python for loops.
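
The first point can be checked from Python itself. The following is a minimal sketch, assuming a reasonably recent NumPy; the variable name `values` is illustrative and does not come from the notebook. It uses the `flags`, `itemsize`, and `nbytes` attributes to show that the elements sit in one contiguous buffer of fixed-size items.

```python
import numpy as np

# Illustrative array; the name `values` is an assumption, not part of the notebook.
values = np.arange(10, dtype=np.int64)

# The elements live in a single C-contiguous buffer.
print(values.flags['C_CONTIGUOUS'])

# Every element occupies the same number of bytes (8 for int64),
# so the buffer size is itemsize * number of elements.
print(values.itemsize, values.size, values.nbytes)
```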


In [1]:
# import NumPy and check the installed version

import numpy
numpy.__version__

'1.14.3'

By convention, most people in the data science (SciPy/PyData) world import numpy under the alias np.

In [2]:
import numpy as np

In [12]:
# Let's compare performance using a NumPy array of one million integers
# and the equivalent Python list:

import numpy as np

my_arr = np.arange(1000000)

my_list = list(range(1000000))


In [17]:
%time for _ in range(10): my_arr2 = my_arr * 2

Wall time: 20.1 ms


In [16]:
%time for _ in range(10): my_list2 = [x * 2 for x in my_list]

Wall time: 793 ms
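
The `%time` magic only works inside IPython or Jupyter. As a rough sketch of the same comparison in plain Python, the standard `timeit` module can be used instead. The array size `n` below is an assumption chosen to keep the run short, and the measured times will vary from machine to machine.

```python
import timeit

import numpy as np

n = 100_000  # smaller than the one million used above, only to keep the run short
my_arr = np.arange(n)
my_list = list(range(n))

# Run each operation 10 times, mirroring the range(10) loops above.
numpy_time = timeit.timeit(lambda: my_arr * 2, number=10)
list_time = timeit.timeit(lambda: [x * 2 for x in my_list], number=10)

print(f"NumPy: {numpy_time:.4f} s, list: {list_time:.4f} s")
```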


## The NumPy ndarray: A Multidimensional Array Object

One of the most important features of NumPy is its ndarray, a fast, flexible container for large datasets in Python. This kind of object allows us to perform mathematical operations on whole blocks of data using syntax similar to the equivalent operations between scalar elements.

To see this NumPy functionality in action, we are going to start with a small sample of random data.

In [18]:
import numpy as np

# random.randn draws random samples from a standard normal distribution; here, a 2x3 matrix
data = np.random.randn(2,3)
data

array([[-1.08899403, -0.39095681, -0.5079292 ],
       [ 0.39371492, -0.59242131, -0.12346692]])

We are going to apply some mathematical operations to multidimensional array objects

In [19]:
data * 10

array([[-10.88994026,  -3.90956805,  -5.07929203],
       [  3.9371492 ,  -5.92421313,  -1.23466916]])

In [20]:
data + data

array([[-2.17798805, -0.78191361, -1.01585841],
       [ 0.78742984, -1.18484263, -0.24693383]])

An _ndarray_ is a generic multidimensional container for homogeneous data (all of the elements must be the same type). Every array has a _shape_, a tuple indicating the size of each dimension, and a _dtype_, an object describing the _data type_ of the array:

In [22]:
data.shape

(2, 3)

In [23]:
data.dtype

dtype('float64')
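
As a small sketch of these attributes on a fixed array (the names `grid` and `mixed` are illustrative), the example below also shows what homogeneity means in practice: mixing integers and floats in one array promotes every element to a single common type.

```python
import numpy as np

# Illustrative 2x3 array with fixed values so the results are reproducible.
grid = np.array([[1., 2., 3.], [4., 5., 6.]])

print(grid.ndim)   # number of dimensions -> 2
print(grid.shape)  # size along each dimension -> (2, 3)
print(grid.size)   # total number of elements -> 6
print(grid.dtype)  # element type shared by all entries -> float64

# An integer and a float in the same array are stored with one common dtype.
mixed = np.array([1, 2.5])
print(mixed.dtype)  # float64
```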

## Creating ndarrays

The function _array_ is one of the ways to make an array. It accepts any sequence-like object and produces a new NumPy array containing the passed data.

In [27]:
data1 = [6, 7.5, 8, 0, 1]

arr1 = np.array(data1)

print(arr1.dtype)

arr1

float64


array([6. , 7.5, 8. , 0. , 1. ])

In [29]:
data2 = [[1,2,3,4],[5,6,7,8]]

arr2 = np.array(data2)
print(arr2.dtype)
arr2

int32


array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

In [34]:
print("shape: "+str(arr2.shape))
print("dimension: "+str(arr2.ndim))

shape: (2, 4)
dimension: 2


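We can create other kinds of arrays with dedicated NumPy functions. For example, given a length or a shape, the _zeros_ and _ones_ functions create arrays filled with 0s or 1s, respectively. The _empty_ function creates an array without initializing its values, so its contents are arbitrary (the zeros shown in the output below are not guaranteed). _arange_ is an array-valued version of Python's built-in range, and _eye_ creates an identity matrix.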

In [35]:
np.zeros(10) # we are defining the length

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [36]:
np.zeros((3,5)) # we are defining the shape

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [37]:
np.empty((2, 3, 2))

array([[[0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.]]])

In [38]:
np.arange(1,20)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19])

In [41]:
np.eye(5,5) # the identity matrix

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])
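
A few related creation functions follow the same pattern. The sketch below uses `np.ones`, `np.full`, and `np.zeros_like`, all of which exist in NumPy; the variable names are illustrative.

```python
import numpy as np

ones = np.ones((2, 3))        # 2x3 array filled with 1.0
sevens = np.full((2, 2), 7)   # 2x2 array filled with the given value

# zeros_like copies the shape and dtype of an existing array.
template = np.array([[1, 2, 3], [4, 5, 6]])
blank = np.zeros_like(template)

print(ones)
print(sevens)
print(blank.shape, blank.dtype == template.dtype)
```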

## NumPy Data Types

NumPy dtypes let us understand and control the kind of data contained in an _ndarray_. Earlier, the _dtype_ attribute told us the type of the elements stored in the array. In some projects it is important to manage data types carefully, because each type occupies a fixed number of bytes in memory; when resources are limited, choosing smaller types is one way to reduce memory use.

The following URL documents the available data types:

https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html

In [23]:
arr =  np.array([1, 2, 3, 4, 5])
arr.dtype

dtype('int32')

In [43]:
# we can convert the array to another dtype (astype returns a new array)
float_arr =  arr.astype(np.float64)
float_arr.dtype

dtype('float64')

In [3]:
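# we can define an ndarray of strings that represent numbers and then convert them to another type
# (np.bytes_ is the type that np.string_ aliased; np.string_ was removed in NumPy 2.0)
numeric_strings = np.array(['0.22', '0.270', '0.357', '0.44', '0.50'], dtype=np.bytes_)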

numeric_strings.astype(float)


array([0.22 , 0.27 , 0.357, 0.44 , 0.5  ])
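
To make the memory argument concrete, here is a minimal sketch that compares the `nbytes` of the same values stored with two different integer types, and shows that casting floats to integers truncates the fractional part. The array names are illustrative.

```python
import numpy as np

# The same 1000 values stored with 8 bytes and with 2 bytes per element.
big = np.arange(1000, dtype=np.int64)
small = big.astype(np.int16)  # the values 0..999 fit in int16

print(big.nbytes)    # 8000
print(small.nbytes)  # 2000

# Casting floating-point numbers to an integer dtype drops the decimal part.
floats = np.array([3.7, -1.2, 0.5])
print(floats.astype(np.int32))  # [ 3 -1  0]
```

The saving only makes sense when every value fits in the smaller type; values outside its range would not be represented correctly.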

## Arithmetic with NumPy 

As we saw, one of the most important features of NumPy is the ability to apply mathematical operations to whole arrays without writing loops, and we saw that doing so in NumPy is much faster than the equivalent Python loop.

In [5]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])

arr

array([[1., 2., 3.],
       [4., 5., 6.]])

In [6]:
arr * arr

array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

In [7]:
arr - arr

array([[0., 0., 0.],
       [0., 0., 0.]])

Arithmetic operations with scalars propagate the scalar argument to each element in the array.

In [8]:
1 / arr

array([[1.        , 0.5       , 0.33333333],
       [0.25      , 0.2       , 0.16666667]])

In [9]:
arr * 0.5

array([[0.5, 1. , 1.5],
       [2. , 2.5, 3. ]])

In [11]:
arr2 =  np.array([[0., 4., 1.],[7., 2., 12.]])
arr2

array([[ 0.,  4.,  1.],
       [ 7.,  2., 12.]])

In [12]:
arr2 > arr

array([[False,  True, False],
       [ True, False,  True]])
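
To see what "without loops" means, the sketch below computes the same result twice: once with a single vectorized expression and once with explicit nested Python loops. The loop version is there only for comparison; the names `vectorized` and `looped` are illustrative.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

# Vectorized: one expression applied to every element.
vectorized = arr * 2 + 1

# Equivalent explicit loops over rows and columns.
looped = np.empty_like(arr)
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        looped[i, j] = arr[i, j] * 2 + 1

print(np.array_equal(vectorized, looped))  # True
```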

## Indexing and slicing

There are several ways to index the data in an ndarray. Let's start with one-dimensional arrays.

In [25]:
arr =  np.arange(10)

arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [15]:
# let's get the element at index 5; remember that indexing in
# Python (and in NumPy arrays) starts at 0
arr[5]

5

In [16]:
# we are going to select the values from index 5 up to, but not including, index 8
arr[5:8]
# note that the value at index 8 is not part of the result

array([5, 6, 7])

In [4]:
# let's see that we can propagate a scalar value to the entire selection
arr[5:8] = 12

arr

array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

In [5]:
# we are going to take a slice of the array; arr_slice is a view on arr, not a new copy of the data
arr_slice = arr[5:8]

arr_slice

array([12, 12, 12])

Pay attention to the next lines of code. Array slices are _views_ on the original array: the data is not copied, so any modification to the view is reflected in the source array.

In [6]:
arr_slice[1] = 12345

In [7]:
arr

array([    0,     1,     2,     3,     4,    12, 12345,    12,     8,
           9])

Assigning to the _"bare"_ slice __[:]__ sets all the values in an array (here, all the values of the view)

In [9]:
arr_slice[:] = 64
arr

array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])

If we want to copy a slice of an ndarray instead of a view, we must use the method _copy()_ from ndarray, like this:

```
arr[5:8].copy()
```
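
One way to check whether two arrays are backed by the same data is `np.shares_memory`. The following minimal sketch contrasts a view with a copy; the variable names are illustrative.

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]            # a view: no data is copied
copied = arr[5:8].copy()   # an independent copy

print(np.shares_memory(arr, view))    # True
print(np.shares_memory(arr, copied))  # False

view[0] = -1     # also changes arr[5]
copied[1] = -2   # leaves arr untouched
print(arr)       # [ 0  1  2  3  4 -1  6  7  8  9]
```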

We have more options with higher-dimensional arrays. For example, in a two-dimensional array the elements at each index are no longer scalars but one-dimensional arrays

In [6]:
arr2d = np.array([[1.,2.,3.], [4.,5.,6.], [7.,8.,9]])

In [3]:
arr2d[2]

array([7., 8., 9.])

In [5]:
arr2d[0][2]

3.0

In [6]:
# this is another way to get the specific value from a row and a column
arr2d[0,2]

3.0

In [3]:
# the next example is an ndarray of shape 2 x 2 x 3
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
arr3d.shape

(2, 2, 3)

In [4]:
arr3d    

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [10]:
arr3d[0]

array([[1, 2, 3],
       [4, 5, 6]])

In [11]:
arr3d[0,1]

array([4, 5, 6])

In [14]:
arr3d[0,1,2]

6

In [15]:
old_values =  arr3d[0].copy()

In [16]:
arr3d[0] = 42

In [20]:
# note that every value in the first 2x3 block (arr3d[0]) has been replaced with 42
print(arr3d)
print(arr3d.shape)

[[[42 42 42]
  [42 42 42]]

 [[ 7  8  9]
  [10 11 12]]]
(2, 2, 3)


In [21]:
arr3d[0] = old_values

arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

### Indexing with slices

Let's look at slicing a one-dimensional array

In [26]:
print(arr)

print(arr[1:6])

[0 1 2 3 4 5 6 7 8 9]
[1 2 3 4 5]


Now let's look at slicing a two-dimensional array

In [4]:
arr2d

array([[1., 2., 3.],
       [4., 5., 6.],
       [7., 8., 9.]])

In [8]:
print(arr2d[:2])
print("shape:")
print(arr2d.shape)

[[1. 2. 3.]
 [4. 5. 6.]]
shape:
(3, 3)


As you can see, slicing works a little differently here: the slice is taken along the first axis, so we selected the first two rows of arr2d.

In [14]:
# for this case, we are going to get a view of the first two rows and
# the last two columns
arr2d[:2,1:]

array([[2., 3.],
       [5., 6.]])

In [10]:
# in this case we take the second row and, within it, the columns from the third one onward
arr2d[1, 2:]

array([6.])

In [19]:
# we are going to take the second and the third rows with their columns
arr2d[1:]

array([[4., 5., 6.],
       [7., 8., 9.]])

In [7]:
# we are going to select the third column but only the first two rows
arr2d[:2, 2]

array([3., 6.])

In [8]:
# a colon by itself means take the entire axis; here, all rows and only the first column
arr2d[:, :1]

array([[1.],
       [4.],
       [7.]])

In [9]:
arr2d[:2,1:] = 0
arr2d

array([[1., 0., 0.],
       [4., 0., 0.],
       [7., 8., 9.]])

<img src="https://www.oreilly.com/library/view/python-for-data/9781449323592/httpatomoreillycomsourceoreillyimages1346882.png" height=50 width=250>
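
One detail that the slicing examples above rely on is that an integer index removes its axis from the result, whereas a slice keeps it. A minimal sketch on the original values of arr2d (the names `row` and `block` are illustrative):

```python
import numpy as np

arr2d = np.array([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])

row = arr2d[1, :2]      # integer index on the first axis: one-dimensional result
block = arr2d[1:2, :2]  # slice on the first axis: two-dimensional result with one row

print(row, row.shape)      # [4. 5.] (2,)
print(block, block.shape)  # [[4. 5.]] (1, 2)
```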

## Boolean Indexing

Let's look at an example in which the names array contains duplicate values

In [11]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

data = np.random.randn(7, 4)

names


array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')

In [12]:
data

array([[ 1.2194719 ,  1.90229479,  0.13427477,  2.75439561],
       [-0.13822246,  0.26719519,  0.08527973, -0.27490515],
       [ 1.46010908,  0.08292961,  0.36703866, -0.19062384],
       [-0.28412293, -0.91159991,  0.86456737,  0.07496006],
       [-0.27704824,  1.24407004, -0.57918374, -0.50286833],
       [-0.50304865,  0.19842506, -0.60033939, -1.6698228 ],
       [ 1.01061853, -0.23435338, -1.141846  , -0.35119606]])
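
With these two arrays, a comparison on names produces a boolean array that can be used to select rows of data. The following is a minimal sketch, assuming each name corresponds to one row of data; it uses a seeded `RandomState` (an assumption made here only for reproducibility), so the random values differ from those printed above.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

# Fixed seed so the example is reproducible; not part of the original notebook.
rng = np.random.RandomState(0)
data = rng.randn(7, 4)

# Comparing the array with a string yields one boolean per name.
mask = names == 'Bob'
print(mask)  # [ True False False  True False False False]

# The boolean array selects the rows where the mask is True.
bob_rows = data[mask]
print(bob_rows.shape)  # (2, 4)

# ~ negates the mask, selecting every row that is not 'Bob'.
other_rows = data[~mask]
print(other_rows.shape)  # (5, 4)
```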

# References

+ McKinney, W. (2012). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc.