
<img src="https://raw.githubusercontent.com/daitan-innovation/daitan-ml-course-resources/main/daitan-header.jpg" width="700">

<br />

<span style="font-size: 10px; font-style: italic;">
Privileged and confidential. If this content has been received in error, please delete it immediately.
</span>

# Machine Learning Track

- **Module.** Python for Data Science
- **Instructors:**
  - Alisson Hayasi da Costa
  - Lucas Silveira de Moura



# Introduction to NumPy

The NumPy library (short for "Numerical Python") is a tool for efficiently storing and manipulating large arrays of data. For this reason, NumPy is a fundamental package for scientific computing in Python.

> **Why NumPy?**
>
> Datasets come in different shapes and formats. They can be collections of documents, images, sounds, numerical measurements, etc. Although datasets are often heterogeneous (i.e. they mix various types of data), it is useful to think of them as arrays of numbers, since most algorithms used in machine learning and data science take numeric values as input.
>
> For example, images can be expressed as a matrix of pixels. Sounds can be expressed as one-dimensional arrays of amplitude over time. Texts can be converted into various numerical representations, such as counts of how often each word of a given vocabulary occurs (bag-of-words).
>
> Therefore, efficient storage and manipulation of numeric arrays is absolutely necessary.



In [None]:
import numpy as np

## Building NumPy Arrays

NumPy as a whole is built around the `ndarray` object.

The `ndarray` (also known as a NumPy array) is a data structure for efficient storage of and operations on numeric data (typically much more efficient than Python's built-in lists).

More specifically, it is an object that represents multidimensional and homogeneous arrays whose elements have a *fixed size*. Also, the *data type defined for the array describes the format of each element*, such as byte order, number of bytes occupied in memory, etc.

(For more information, see the documentation [_ndarray_](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html)).
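
As a small illustration of the points above, the sketch below builds one array and inspects its data type and the number of bytes each element occupies. The variable name `temperatures` and its values are only illustrative.

```python
import numpy as np

# Every element shares one dtype, so each occupies the same number of bytes.
temperatures = np.array([21.5, 22.0, 19.8], dtype=np.float64)

print(temperatures.dtype)      # float64
print(temperatures.itemsize)   # 8 bytes per element
print(temperatures.nbytes)     # 24 bytes in total (3 elements * 8 bytes)
```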


### From Existing Data

NumPy arrays can be constructed from any sequence of elements. To do this, the `np.array` function is used.

* **`np.array(<array>, dtype=<dtype>, [...])`**
    * `<array>`. An object that can be interpreted as a sequence (e.g. list, tuple, nested lists, etc.).
    * `<dtype>`. Data type of array elements. If not specified, then the type will be determined as the minimum type needed to keep the objects accurately represented.
    * `[...]`. Parameters not described. For more information, see the documentation [np.array](https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html#numpy.array) 

In [None]:
# Creating an array from a python list

np.array([1, 2, 3, 4])

array([1, 2, 3, 4])

In [None]:
# Since numpy arrays are homogeneous (i.e. restricted to the same data type),
# if there is more than one data type in the array,
# they will be converted to the minimum type needed to keep the objects accurately represented.

np.array([1, 2, 3.14])

array([1.  , 2.  , 3.14])

In [None]:
np.array([1.0, 2.0, 3])

array([1., 2., 3.])

In [None]:
# We can define the data type by passing the desired type through the `dtype` argument.

np.array([1, 2, 3], dtype='float32')

array([1., 2., 3.], dtype=float32)

In [None]:
# To convert the array from one type to another, we use the astype(<dtype>) method.

arr_float = np.array([1, 2, 3], dtype='float32')
arr_float

array([1., 2., 3.], dtype=float32)

In [None]:
arr_int = arr_float.astype(int)
arr_int

array([1, 2, 3])

In [None]:
# If the sequence informed is a list of lists,
# the object created is a multidimensional ndarray.

np.array([[1, 2, 3], [4, 5, 6]], dtype='float32')

array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)

In [None]:
np.array([range(i, i + 3) for i in [2, 4, 6]])

array([[2, 3, 4],
       [4, 5, 6],
       [6, 7, 8]])

### From Scratch

If you don't have an existing object to pass but still want to create an ndarray, use the `np.empty` routine.

* **`np.empty(<shape>, dtype=<dtype>)`**. Returns a new array of shape `<shape>` and data type `<dtype>`, without initializing its entries (the values are whatever happens to be in memory).
    * `<shape>`. Array dimension as tuple or integer.
    * `<dtype>`. Desired data type for array. If not specified, the default is 'float64'.

Some other commonly used built-in routines are:
    
* **`np.zeros(<shape>, dtype=<dtype>)`**. Returns an array with shape `<shape>` and data type `<dtype>` filled with zeros.


* **`np.ones(<shape>, dtype=<dtype>)`**. Returns an array with shape `<shape>` and data type `<dtype>` filled with ones.


* **`np.full(<shape>, <fill_value>, dtype=<dtype>)`**. Returns an array with shape `<shape>`, data type `<dtype>` and filled with `<fill_value>`.


* **`np.arange(<start>,<stop>,<step>,dtype=<dtype>)`**. Returns an array of values within the range $[\textrm{start},\textrm{stop})$, with consecutive values spaced `<step>` apart.

* **`np.linspace(<start>, <stop>, num=<num>, dtype=None, [...])`**. Returns an array with `<num>` evenly spaced samples, calculated over the interval [start, stop].

In [None]:
# Creating an uninitialized 1x10 (two-dimensional) array; the values are arbitrary

np.empty((1, 10))

array([[4.63755533e-310, 2.14321575e-312, 2.33419537e-312,
        2.31297541e-312, 6.79038654e-313, 2.35541533e-312,
        6.79038654e-313, 2.35541533e-312, 2.18565567e-312,
        0.00000000e+000]])

In [None]:
# Creating a 10x1 (two-dimensional) array of zeros

np.zeros((10,1))

array([[0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.]])

In [None]:
# Creating a 2x5 array of ones with dtype int

np.ones((2,5), dtype='int')

array([[1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]])

In [None]:
# Creating a 3x2 array of 3.14s

np.full((3,2), 3.14)

array([[3.14, 3.14],
       [3.14, 3.14],
       [3.14, 3.14]])

In [None]:
# Creating an array from 0 to 9, dtype int

np.arange(0, 10, dtype='int')

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
# Creating an array of evens from 0 to 10

np.arange(0, 11, 2, dtype='int')

array([ 0,  2,  4,  6,  8, 10])

In [None]:
# Creating an array of 10 evenly spaced samples from 0 to 10

np.linspace(0, 10, 10)

array([ 0.        ,  1.11111111,  2.22222222,  3.33333333,  4.44444444,
        5.55555556,  6.66666667,  7.77777778,  8.88888889, 10.        ])


> **Attention!**
>
> It is very common to confuse the shapes of arrays, especially when they are one-dimensional or two-dimensional with a single row or column.
>
> Note that `shape=10` indicates a one-dimensional array.
> `shape=(1, 10)` or `(10, 1)` (or even `(1, 1)`) indicates a two-dimensional array with a single row (or column).
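
The difference between these shapes can be checked directly. The following is a minimal sketch; the names `flat`, `row` and `column` are chosen only for this illustration.

```python
import numpy as np

flat = np.zeros(10)          # one-dimensional, 10 elements
row = np.zeros((1, 10))      # two-dimensional, 1 row and 10 columns
column = np.zeros((10, 1))   # two-dimensional, 10 rows and 1 column

print(flat.shape, flat.ndim)      # (10,) 1
print(row.shape, row.ndim)        # (1, 10) 2
print(column.shape, column.ndim)  # (10, 1) 2
```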

### From Randomness

To generate random arrays, use the routines in `np.random`.

* **`np.random.rand(d0, d1, ..., dn)`**. Returns an array with the dimensions passed as arguments, filled with random samples from a uniform distribution over the range $[0, 1)$. If no dimension is provided, a single random value is returned.

* **`np.random.randn(d0, d1, ..., dn)`**. Returns an array according to the dimensions passed as argument and fills it with random samples from a normal distribution (mean 0 and standard deviation 1). If no dimension is provided, a single random value is returned.


* **`np.random.randint(<low>, high=<high>, size=<shape>, dtype=<dtype>)`**. Returns an array of shape `<shape>` whose elements are random integers in the range [`<low>`, `<high>`).

* **`np.random.choice(<array>, size=<size>, replace=<replace>, p=<p>)`**. Returns a random sample drawn from the array passed as argument.
    * `<size>`. Number of samples to be returned.
    * `<replace>`. If `True`, the same element can be chosen more than once. If `False`, each element can be chosen at most once.
    * `<p>`. Probability of each element of `<array>` being selected.


* **`np.random.random_sample(size=<shape>)`**. Returns random numbers in the range $[0,1)$. If no `<shape>` is specified, a single random number is returned (i.e., a `float`).
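
The `p` argument of `np.random.choice` can be sketched as follows. The labels and weights are hypothetical, and `np.random.seed` is used here only to make repeated runs return the same samples.

```python
import numpy as np

np.random.seed(0)  # fix the seed so repeated runs give the same samples

labels = np.array(["red", "green", "blue"])   # hypothetical categories
weights = [0.7, 0.2, 0.1]                     # probabilities, must sum to 1

# Draw 1000 samples with replacement; "red" should appear most often.
samples = np.random.choice(labels, size=1000, replace=True, p=weights)
values, counts = np.unique(samples, return_counts=True)
print(dict(zip(values, counts)))

# With a shape, random_sample returns an array of floats in [0, 1).
uniform = np.random.random_sample(size=(2, 3))
print(uniform.shape)  # (2, 3)
```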
 

In [None]:
# 2x2x2 array of random numbers

np.random.rand(2, 2, 2)

array([[[0.85889319, 0.38325886],
        [0.47483739, 0.02325571]],

       [[0.21351637, 0.14932934],
        [0.96879965, 0.40300003]]])

In [None]:
# normal distribution

np.random.randn(10)

array([-0.83664575,  0.72675042, -0.38996157, -0.55442921, -1.44610934,
        1.39267663,  0.40015279,  0.63928359, -0.45952131, -0.89861893])

In [None]:
# 3x3 array with integer values from 0 to 9 (the upper bound 10 is exclusive)

np.random.randint(0, 10, (3,3))

array([[3, 3, 4],
       [3, 7, 1],
       [0, 4, 4]])

In [None]:
# 5 element sampling without replacement 

arr = np.arange(0, 10)
np.random.choice(arr, 5, replace=False)

array([9, 6, 4, 8, 1])

### Further Reading 

There are many ways to create numpy arrays from built-in routines. For more ways and information, read the documentation: 

* [Array creation routines](https://docs.scipy.org/doc/numpy/reference/routines.array-creation.html)
* [Random sampling](https://docs.scipy.org/doc/numpy-1.16.0/reference/routines.random.html)


## Handling Data & Arrays

Numpy provides several ways to manipulate data from its arrays. The most important categories are:

- **Attributes**. Identify the size, shape (dimension), memory consumption and data type of arrays.
- **Indexing**. Read and write the value of a specific element of the array.
- **Slicing**. Read and write small subarrays of a larger array.
- **Reshaping**. Change the shape of a NumPy array.
- **Joining / Splitting**. Combining multiple arrays into one or splitting an array into multiple.

### Attributes

Every `ndarray` has several attributes associated with it. The most important are:

- `T`. The transposed array.
- `ndim`. Number of dimensions of the `ndarray`.
- `shape`. Size of the `ndarray` along each dimension (a tuple).
- `size`. Total number of elements in array.
- `dtype`. Type of data stored in the `ndarray`.
- `itemsize`. Size (in bytes) of the data type stored in the array.
- `nbytes`. Total array size (in bytes)

> To see all, access the [documentation](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html)

In [None]:
arr1 = np.random.randint(10, size=6)
arr2 = np.random.randint(10, size=(3, 4))

In [None]:
arr1, arr1.T

(array([6, 3, 0, 6, 9, 5]), array([6, 3, 0, 6, 9, 5]))

In [None]:
arr2, arr2.T

(array([[8, 8, 2, 4],
        [9, 1, 9, 4],
        [7, 0, 0, 1]]), array([[8, 9, 7],
        [8, 1, 0],
        [2, 9, 0],
        [4, 4, 1]]))

In [None]:
print('Shape: ', arr1.shape)
print('Number of Dimensions: ', arr1.ndim)
print('Number of elements: ', arr1.size)
print('Data type: ', arr1.dtype)
print('Element size (bytes): ', arr1.itemsize)
print('Array total size (bytes): ', arr1.nbytes)

Shape:  (6,)
Number of Dimensions:  1
Number of elements:  6
Data type:  int64
Element size (bytes):  8
Array total size (bytes):  48


In [None]:
print('Shape: ', arr2.shape)
print('Number of Dimensions: ', arr2.ndim)
print('Number of elements: ', arr2.size)
print('Data type: ', arr2.dtype)
print('Element size (bytes): ', arr2.itemsize)
print('Array total size (bytes): ', arr2.nbytes)

Shape:  (3, 4)
Number of Dimensions:  2
Number of elements:  12
Data type:  int64
Element size (bytes):  8
Array total size (bytes):  96


### Indexing

The indexing mechanism of numpy arrays is very similar to that used in python built-in data structures. The only difference is in the use of comma in square brackets to access elements of multidimensional arrays. For example:

- `<array>[i,j]`

In [None]:
arr1 = np.arange(10)
arr1

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
# Accessing the first and last element of the array

arr1[0], arr1[-1]

(0, 9)

In [None]:
arr2 = np.random.randint(10, size=(3, 4))
arr2

array([[9, 1, 1, 3],
       [7, 9, 4, 3],
       [5, 1, 5, 4]])

In [None]:
# Accessing the first and last row of the matrix

arr2[0], arr2[-1]

(array([9, 1, 1, 3]), array([5, 1, 5, 4]))

In [None]:
# Accessing the first and last elements of the matrix

arr2[0,0], arr2[-1,-1]

(9, 4)

In [None]:
arr1

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
arr1[0] = 3.14

In [None]:
arr1

# Wait, what? Where is 3.14? arr1 has an integer dtype, so the value was truncated to 3.

array([3, 1, 2, 3, 4, 5, 6, 7, 8, 9])
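
The truncation seen here follows from the fixed data type of the array. As a minimal sketch (the names `ints` and `floats` are illustrative), converting to a floating-point dtype with `astype` before assigning keeps the fractional part:

```python
import numpy as np

ints = np.arange(5)            # integer dtype
ints[0] = 3.14                 # the fractional part is dropped on assignment
print(ints)                    # [3 1 2 3 4]

floats = ints.astype(float)    # astype returns a new array with a float dtype
floats[0] = 3.14
print(floats[0])               # 3.14
```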

### Slicing

Like Python sequences, we can use the slicing mechanism to access subarrays from a larger array. The syntax is:
- `<array>[start:end:step]`

In multidimensional arrays the syntax is the same. Just merge the indexing syntax with the slicing.
- `<array>[<row_start>:<row_end>:<step>,<col_start>:<col_end>:<step>]`

> **Attention!**
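> When `step` is a negative integer, the slice runs backwards: `start` must be the higher index and `end` the lower one, i.e. `<array>[high:low:step]`, and the defaults for `start` and `end` are swapped accordingly.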
>
> This behavior is a convenient way to invert arrays or work inversely.
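
A short sketch of slicing with a negative step, using an illustrative array named `digits`:

```python
import numpy as np

digits = np.arange(10)

print(digits[7:2:-1])   # [7 6 5 4 3]  start at index 7, stop before index 2
print(digits[2:7:-1])   # []           start is already below end, nothing is selected
print(digits[::-2])     # [9 7 5 3 1]  defaults run from the last element to the first
```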

In [None]:
arr = np.arange(0, 10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
# Accessing the first 5 elements

arr[:5]

array([0, 1, 2, 3, 4])

In [None]:
# Accessing all elements from index 5

arr[5:]

array([5, 6, 7, 8, 9])

In [None]:
# Accessing all elements, two by two

arr[::2]

array([0, 2, 4, 6, 8])

In [None]:
# Accessing all elements in reverse

arr[::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

In [None]:
arr = np.random.randint(10, 100, (5,5))
arr

array([[33, 40, 52, 45, 69],
       [95, 68, 25, 42, 66],
       [74, 16, 98, 62, 63],
       [93, 98, 36, 89, 80],
       [56, 87, 90, 47, 17]])

In [None]:
# Accessing the first 3 rows and first 2 columns

arr[:3, :2]

array([[33, 40],
       [95, 68],
       [74, 16]])

In [None]:
# Accessing all rows and third column

arr[:, 2]

array([52, 25, 98, 36, 90])

In [None]:
# Reversing the elements of each row (i.e. reversing the column order)

arr[:, ::-1]

array([[69, 45, 52, 40, 33],
       [66, 42, 25, 68, 95],
       [63, 62, 98, 16, 74],
       [80, 89, 36, 98, 93],
       [17, 47, 90, 87, 56]])

In [None]:
# Reversing rows and columns

arr[::-1,::-1]

array([[17, 47, 90, 87, 56],
       [80, 89, 36, 98, 93],
       [63, 62, 98, 16, 74],
       [66, 42, 25, 68, 95],
       [69, 45, 52, 40, 33]])

It's important to note that, unlike slicing a built-in Python list, slicing a NumPy array returns a view of the original data and not a copy. So if you modify the subarray, the original array is also modified.

The good thing about this approach is that, when working with large datasets, you can operate on parts of them without copying and therefore without using extra memory.

To create a copy of the array, use `copy`. Both forms return a new, independent array:
* `<array>.copy()` (method)
* `np.copy(<array>)` (function)


In [None]:
arr

array([[33, 40, 52, 45, 69],
       [95, 68, 25, 42, 66],
       [74, 16, 98, 62, 63],
       [93, 98, 36, 89, 80],
       [56, 87, 90, 47, 17]])

In [None]:
arr_sliced = arr[:2,:2]
arr_sliced

array([[33, 40],
       [95, 68]])

In [None]:
arr_sliced[0,0] = 0
arr_sliced

array([[ 0, 40],
       [95, 68]])

In [None]:
arr

array([[ 0, 40, 52, 45, 69],
       [95, 68, 25, 42, 66],
       [74, 16, 98, 62, 63],
       [93, 98, 36, 89, 80],
       [56, 87, 90, 47, 17]])

In [None]:
arr[:, 0] = 0
arr

array([[ 0, 40, 52, 45, 69],
       [ 0, 68, 25, 42, 66],
       [ 0, 16, 98, 62, 63],
       [ 0, 98, 36, 89, 80],
       [ 0, 87, 90, 47, 17]])

In [None]:
# Creating a copy of the ndarray

arr_cpy = arr.copy()

In [None]:
arr_cpy

array([[ 0, 44, 50, 51, 49],
       [29, 36, 15, 30, 79],
       [65, 14, 46, 40, 33],
       [82, 85, 36, 55, 74],
       [90, 99, 52, 36, 15]])

In [None]:
arr_cpy[0,0] = 0
arr_cpy

array([[ 0, 44, 50, 51, 49],
       [29, 36, 15, 30, 79],
       [65, 14, 46, 40, 33],
       [82, 85, 36, 55, 74],
       [90, 99, 52, 36, 15]])

In [None]:
arr

array([[ 0, 44, 50, 51, 49],
       [29, 36, 15, 30, 79],
       [65, 14, 46, 40, 33],
       [82, 85, 36, 55, 74],
       [90, 99, 52, 36, 15]])
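
The difference between a slice and a copy can also be checked explicitly. The sketch below uses `np.shares_memory`, which is not used elsewhere in this notebook, and illustrative names (`base`, `view`, `independent`).

```python
import numpy as np

base = np.arange(6).reshape(2, 3)

view = base[:, :2]                  # basic slicing returns a view on the same data
independent = base[:, :2].copy()    # an independent copy

print(np.shares_memory(base, view))          # True
print(np.shares_memory(base, independent))   # False

view[0, 0] = 99          # visible through base
independent[0, 1] = -1   # not visible through base
print(base)
# [[99  1  2]
#  [ 3  4  5]]
```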

### Reshaping

Reshaping rearranges the elements of an array into a different shape, for example turning a one-dimensional array into a matrix. We can reshape an ndarray by using the `reshape()` method.

- **`<array>.reshape(<shape>)`**. Returns the data of `<array>` arranged in the shape passed as argument.

Any ndarray can be reshaped, as long as the number of elements is preserved (i.e., the `size` of the original and of the reshaped array must be equal).

In [None]:
arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
arr.reshape(5, 2)

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

In [None]:
arr.reshape(2, 5)

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
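
The size constraint can be seen in a short sketch. It also uses `-1` as a dimension, a `reshape` feature not covered above that asks NumPy to infer that dimension from the number of elements. The name `values` is illustrative.

```python
import numpy as np

values = np.arange(12)

# -1 asks NumPy to infer that dimension from the total number of elements.
print(values.reshape(3, -1).shape)   # (3, 4)
print(values.reshape(-1, 6).shape)   # (2, 6)

# 12 elements cannot fill a 5x3 array, so reshape raises ValueError.
try:
    values.reshape(5, 3)
except ValueError as err:
    print("reshape failed:", err)
```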

### Joining

There are several strategies for joining arrays. The most commonly used routines are `np.concatenate`, `np.vstack` and `np.hstack`.

* **`np.concatenate((a1, a2, ...), axis=<axis>)`**. Joins a sequence of arrays along an existing axis.
    * `(a1, a2, ...)`. Sequence of ndarrays (the arrays must have the same shape, except in the dimension corresponding to `<axis>`).
    * `<axis>`. The axis along which the arrays will be joined. If `<axis>` is `None`, the arrays are flattened before being joined. Default is 0.


* **`np.vstack((a1, a2, ...))`**. Joins a sequence of arrays vertically (row-wise).

* **`np.hstack((a1, a2, ...))`**. Joins a sequence of arrays horizontally (column-wise).

* **`np.stack((a1, a2, ...), axis=<axis>, out=<out>)`**. Joins a sequence of arrays of the same shape along a new axis inserted at position `<axis>`.

In [None]:
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

In [None]:
x

array([1, 2, 3])

In [None]:
y

array([4, 5, 6])

In [None]:
np.concatenate((x, y), axis=None)

array([1, 2, 3, 4, 5, 6])

In [None]:
np.concatenate((x, y), axis=1)

# Why? x and y are one-dimensional, so axis 1 does not exist for them.

AxisError: ignored
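
The error comes from the number of dimensions of the inputs. As a minimal sketch, giving each array an explicit row dimension (the names `x_row` and `y_row` are illustrative) makes `axis=1` valid:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# x and y have a single axis (axis 0), so axis=1 is out of bounds for them.
print(x.ndim, y.ndim)   # 1 1

# Giving each array an explicit row dimension makes axis=1 valid.
x_row = x.reshape(1, 3)
y_row = y.reshape(1, 3)
print(np.concatenate((x_row, y_row), axis=1))   # [[1 2 3 4 5 6]]
```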

In [None]:
a = np.zeros((3,3))
b = np.ones((3,3))

In [None]:
a

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [None]:
b

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

In [None]:
# Joining a sequence of arrays along rows (vertically)

np.concatenate((a,b), axis=0)

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

In [None]:
# Joining a sequence of arrays along columns (horizontally)

np.concatenate((a,b), axis=1)

array([[0., 0., 0., 1., 1., 1.],
       [0., 0., 0., 1., 1., 1.],
       [0., 0., 0., 1., 1., 1.]])

To join arrays with different numbers of dimensions, we can use `np.vstack` or `np.hstack`.

In [None]:
a.shape, x.shape, y.shape

((3, 3), (3,), (3,))

In [None]:
np.vstack((a,x,y))

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [1., 2., 3.],
       [4., 5., 6.]])
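
For completeness, here is a short sketch of `np.hstack` and `np.stack` applied to the same two one-dimensional arrays used above:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

print(np.hstack((x, y)))          # [1 2 3 4 5 6]

# np.stack creates a new axis, so two arrays of shape (3,) give a 2-D result.
print(np.stack((x, y), axis=0))
# [[1 2 3]
#  [4 5 6]]
print(np.stack((x, y), axis=1))
# [[1 4]
#  [2 5]
#  [3 6]]
```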

### Splitting

Just as it is possible to concatenate arrays, it is possible to split them. Splitting can be done in several ways. The most basic is slicing.

For splitting at several positions in a single call, NumPy provides the functions `np.split`, `np.hsplit` and `np.vsplit`.

* **`np.split(<array>, <idx_or_sec>, axis=<axis>)`**. Splits an array into subarrays according to the indicated indices (or number of sections) along the specified axis, keeping the number of dimensions.
    * `<idx_or_sec>`. If `<idx_or_sec>` is an integer N, the array is divided into N equal parts along `<axis>`; if such a split is not possible, an error is raised. If `<idx_or_sec>` is a 1-D array of sorted integers, the entries indicate where along `<axis>` the array is split. If an index exceeds the size of the array along `<axis>`, an empty subarray is returned for it.


* **`np.vsplit(<array>, <idx_or_sec>)`**. Splits an array into subarrays vertically (row-wise). It takes no `axis` argument.

* **`np.hsplit(<array>, <idx_or_sec>)`**. Splits an array into subarrays horizontally (column-wise). It takes no `axis` argument.


In [None]:
arr = np.random.randint(0, 10, (3,3))
arr

array([[8, 5, 8],
       [7, 3, 7],
       [9, 7, 5]])

In [None]:
# Splitting arr into 3 equal parts along the rows

np.split(arr, 3, axis=0)

[array([[8, 5, 8]]), array([[7, 3, 7]]), array([[9, 7, 5]])]

In [None]:
# Splitting arr into 3 equal parts along the columns

np.split(arr, 3, axis=1)

[array([[8],
        [7],
        [9]]), array([[5],
        [3],
        [7]]), array([[8],
        [7],
        [5]])]

In [None]:
arr = np.random.rand(4, 4)
arr

array([[0.63751387, 0.47328574, 0.26270842, 0.77084785],
       [0.46343442, 0.66605687, 0.6461159 , 0.94678493],
       [0.88387736, 0.58880533, 0.5588965 , 0.72248956],
       [0.99050997, 0.07665214, 0.12670353, 0.9211732 ]])

In [None]:
# Splitting arr at row indices 2 and 3

np.split(arr, [2,3], axis=0)

[array([[0.63751387, 0.47328574, 0.26270842, 0.77084785],
        [0.46343442, 0.66605687, 0.6461159 , 0.94678493]]),
 array([[0.88387736, 0.58880533, 0.5588965 , 0.72248956]]),
 array([[0.99050997, 0.07665214, 0.12670353, 0.9211732 ]])]

In [None]:
# Splitting arr horizontally into 2 parts

left, right = np.hsplit(arr, 2)

In [None]:
left, right

(array([[0.63751387, 0.47328574],
        [0.46343442, 0.66605687],
        [0.88387736, 0.58880533],
        [0.99050997, 0.07665214]]), array([[0.26270842, 0.77084785],
        [0.6461159 , 0.94678493],
        [0.5588965 , 0.72248956],
        [0.12670353, 0.9211732 ]]))
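
The remaining cases can be sketched on a small illustrative array named `grid`: a vertical split, an uneven split at given indices, and an integer split that is not possible.

```python
import numpy as np

grid = np.arange(16).reshape(4, 4)

top, bottom = np.vsplit(grid, 2)        # same as np.split(grid, 2, axis=0)
print(top.shape, bottom.shape)          # (2, 4) (2, 4)

# A list of indices gives uneven pieces: columns [0:1], [1:3] and [3:].
first, middle, last = np.hsplit(grid, [1, 3])
print(first.shape, middle.shape, last.shape)   # (4, 1) (4, 2) (4, 1)

# An integer that does not divide the axis evenly raises ValueError.
try:
    np.split(grid, 3, axis=0)
except ValueError as err:
    print("split failed:", err)
```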

## Universal Functions

The term *vectorization* refers to the use of optimized, pre-compiled code written in C (often together with SIMD instruction sets) to perform mathematical operations over an ndarray. Consequently, operations run much faster than an equivalent Python loop.

Universal functions (ufuncs) are built-in vectorized functions that operate on ndarrays. In other words, a ufunc performs the same operation on multiple elements (much like Python's native `map()`), but far more efficiently.

The operations covered here fall into two types: element-oriented (element-wise) and aggregation.

### Element-wise

Element-wise operations are those applied to each element of the ndarray.

In [None]:
arr = np.arange(0, 5)
arr

array([0, 1, 2, 3, 4])

In [None]:
# Adding 2 to every element of the array

arr + 2

array([2, 3, 4, 5, 6])

In [None]:
# Multiplying every element of the array by 2

arr * 2

array([0, 2, 4, 6, 8])

In [None]:
arr ** 2

array([ 0,  1,  4,  9, 16])

In [None]:
# How efficient is it?

In [None]:
N = int(1e7)

elements = list(range(0, N))

In [None]:
%%timeit

[element + 2 for element in elements]

1 loop, best of 5: 953 ms per loop


In [None]:
arr_elements = np.array(elements)


In [None]:
%%timeit

arr_elements + 2

In [None]:
a = np.array([0, 1, 2])
b = np.array([5, 5, 5])

In [None]:
a + b

In [None]:
a ** b
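
The two cells above are shown without output. As a sketch of what they compute, the block below repeats them and also checks that the arithmetic operators match the corresponding ufuncs `np.add` and `np.power`.

```python
import numpy as np

a = np.array([0, 1, 2])
b = np.array([5, 5, 5])

print(a + b)    # [5 6 7]
print(a ** b)   # [ 0  1 32]

# Arithmetic operators are shorthand for the corresponding ufuncs.
print(np.array_equal(a + b, np.add(a, b)))      # True
print(np.array_equal(a ** b, np.power(a, b)))   # True
```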

### Aggregation

Aggregation operations are those in which all elements are combined to produce a single value.

Some examples are:

- `sum()`
- `mean()`
- `std()`
- `min()`
- `max()`

In [None]:
arr = np.arange(0, 20)
arr

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [None]:
arr.sum()

190

In [None]:
arr.mean()

9.5

In [None]:
arr.min()

0

In [None]:
# How much more efficient?

In [None]:
N

10000000

In [None]:
elements = list(range(0, N))

In [None]:
%%timeit

soma = 0
for element in elements:
    soma += element

1 loop, best of 5: 606 ms per loop


In [None]:
arr = np.arange(0, N)

In [None]:
%%timeit

arr.sum()

100 loops, best of 5: 11.1 ms per loop


### Broadcasting

_Broadcasting_ is a set of rules for applying binary _ufuncs_ (addition, subtraction, multiplication, etc.) to arrays of different shapes. The three rules that determine how two arrays interact are:

* **Rule 1.** If the two arrays have a different number of dimensions, the _shape_ of the array with fewer dimensions is padded with ones on its leading (left) side.
* **Rule 2.** If the shapes of the two arrays do not match in some dimension, the array whose size is 1 in that dimension is stretched to match the other array.
* **Rule 3.** If the sizes differ in some dimension and neither of them is 1, the two arrays are incompatible and an error is raised.

In [None]:
M = np.ones((3, 3))
M + a

array([[1., 2., 3.],
       [1., 2., 3.],
       [1., 2., 3.]])
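
The three rules can be traced step by step on small arrays. This is an illustrative sketch; the names `col` and `row` are not part of the original examples.

```python
import numpy as np

col = np.arange(3).reshape(3, 1)   # shape (3, 1)
row = np.arange(3)                 # shape (3,)

# Rule 1: row's shape (3,) is padded to (1, 3).
# Rule 2: (3, 1) and (1, 3) are both stretched to (3, 3).
print(col + row)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]]

# Rule 3: (3, 2) and (3,) become (3, 2) and (1, 3); 2 != 3 and neither is 1.
try:
    np.ones((3, 2)) + row
except ValueError as err:
    print("broadcast failed:", err)
```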

In [None]:
arr = np.arange(1, 6)
arr

array([1, 2, 3, 4, 5])

In [None]:
np.add.reduce(arr)

15

In [None]:
np.add.accumulate(arr)

array([ 1,  3,  6, 10, 15])

In [None]:
np.min(arr), arr.min()

(1, 1)

In [None]:
np.max(arr), arr.max()

(5, 5)

In [None]:
arr_mean = arr.mean(0)
arr_mean

3.0

In [None]:
arr_centered = arr - arr_mean
arr_centered

array([-2., -1.,  0.,  1.,  2.])

In [None]:
arr_centered.mean(0)

0.0

In [None]:
arr = np.random.randint(10, size=(3, 4))
arr, arr.shape

(array([[5, 1, 6, 0],
        [7, 6, 0, 4],
        [2, 1, 1, 8]]), (3, 4))

In [None]:
arr.sum(), arr.sum(axis=0), arr.sum(axis=1)

(41, array([14,  8,  7, 12]), array([12, 17, 12]))

### Boolean Operations

#### Comparison Operators

To create boolean masks, we use the comparison operators (note that these operators are also ufuncs). The result of a comparison is always an array with a boolean data type.

All six standard comparison operations are available.

| Operator | Equivalent ufunc |
|----------|------------------|
| ==       | np.equal         |
| !=       | np.not_equal     |
| <        | np.less          |
| <=       | np.less_equal    |
| >        | np.greater       |
| >=       | np.greater_equal |


In [None]:
arr = np.random.randint(0, 10, (3,4))
arr

array([[3, 2, 1, 7],
       [3, 5, 8, 5],
       [8, 5, 8, 5]])

In [None]:
arr == 6

array([[False, False, False, False],
       [False, False, False, False],
       [False, False, False, False]])

In [None]:
arr != 6

array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]])

In [None]:
np.less(arr, 6)

array([[ True,  True,  True, False],
       [ True,  True, False,  True],
       [False,  True, False,  True]])

In [None]:
np.greater(arr, 6)

array([[False, False, False,  True],
       [False, False,  True, False],
       [ True, False,  True, False]])

#### Logical Operators

We can combine the comparison operators with logical operators to create even more powerful boolean masks.

| Operator | Equivalent ufunc |
|----------|------------------|
| &        | np.bitwise_and   |
| \|       | np.bitwise_or    |
| ^        | np.bitwise_xor   |
| ~        | np.bitwise_not   |


In [None]:
arr

array([[3, 2, 1, 7],
       [3, 5, 8, 5],
       [8, 5, 8, 5]])

In [None]:
(arr % 2 !=0) & (arr > 5)

array([[False, False, False,  True],
       [False, False, False, False],
       [False, False, False, False]])
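
The other operators in the table work the same way. A brief sketch with `|` and `~`, using an illustrative array named `scores`:

```python
import numpy as np

scores = np.array([3, 2, 1, 7, 8, 5])   # illustrative values

# The parentheses are required: & and | bind more tightly than comparisons.
low_or_high = (scores < 2) | (scores > 6)
print(low_or_high)     # [False False  True  True  True False]

# ~ negates a boolean mask element by element.
print(~low_or_high)    # [ True  True False False False  True]
```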

#### Boolean Masks

Using Boolean masks is a useful strategy for extracting, modifying, counting or manipulating array values based on some criteria. In NumPy, boolean masking is often the most efficient way to accomplish these types of tasks.


In [None]:
arr

array([[3, 2, 1, 7],
       [3, 5, 8, 5],
       [8, 5, 8, 5]])

In [None]:
mascara = (arr % 2 != 0) & (arr > 5)
mascara

array([[False, False, False,  True],
       [False, False, False, False],
       [False, False, False, False]])

In [None]:
arr[mascara]

array([7])
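
Besides extracting values, masks can be used for counting and modifying, as mentioned above. A minimal sketch, assuming the same values as the array shown in this section (the names `data`, `mask` and `clipped` are illustrative):

```python
import numpy as np

data = np.array([[3, 2, 1, 7],
                 [3, 5, 8, 5],
                 [8, 5, 8, 5]])

mask = data > 5

print(np.count_nonzero(mask))   # 4 values are greater than 5
print(mask.sum(axis=1))         # [1 1 2] matches per row

# Assigning through a mask modifies only the selected elements.
clipped = data.copy()
clipped[mask] = 5
print(clipped.max())            # 5
```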

### Appendix

#### Row-wise vs Column-wise

![axis](https://i.stack.imgur.com/OuyFd.png)
<p style="font-size: 14px;">
    Source:
    <a href="https://stackoverflow.com/questions/64428980/why-numpy-apply-along-axis-give-the-same-result-whatever-was-the-axis" target="_blank">
     StackOverflow $-$ Why numpy.apply_along_axis give the same result whatever was the axis?
    </a>
</p>

## Reading and Writing Files with NumPy

The two main functions for efficiently saving and loading array data to disk are `np.save` and `np.load`. By default arrays are saved in an uncompressed raw binary format with a `.npy` file extension (if the file path does not end in `.npy`, the extension will be appended).

* **`np.save(<path>, <arr>, [...])`**. Saves an ndarray `<arr>` to path `<path>` in binary format with extension `.npy`.
 
* **`np.load(<path>, [...])`**. Loads a `.npy` file from `<path>`.


On the other hand, if the goal is to save or load data in text formats such as *CSV*, we can use the `np.savetxt` and `np.loadtxt` functions.

* **`np.savetxt(fname, X, fmt='%.18e', delimiter=' ', header='', [...])`**. Saves a ndarray in text format.
    * `fname`. File name (with extension) where the data is to be saved.
    * `X`. ndarray (restricted to 1D-arrays and 2D-arrays).
    * `delimiter`. Reference character for column separation.


* **`np.loadtxt(fname, dtype=float, delimiter=None, [...])`**. Loads data from the text file `fname`.

In [None]:
arr = np.random.randint(0, 100, size=(5, 5))
arr

array([[44, 24, 22, 11, 30],
       [22, 56, 11, 37, 57],
       [68, 50, 76, 56, 26],
       [11,  9, 95, 75, 15],
       [64, 26,  3, 23, 17]])

In [None]:
np.save("array_de_inteiros_5x5", arr)

In [None]:
narr = np.load("array_de_inteiros_5x5.npy")
narr

array([[44, 24, 22, 11, 30],
       [22, 56, 11, 37, 57],
       [68, 50, 76, 56, 26],
       [11,  9, 95, 75, 15],
       [64, 26,  3, 23, 17]])

In [None]:
np.savetxt("array_de_inteiros_5x5.csv", arr, delimiter=",")

In [None]:
np.loadtxt("array_de_inteiros_5x5.csv", delimiter=",")

array([[44., 24., 22., 11., 30.],
       [22., 56., 11., 37., 57.],
       [68., 50., 76., 56., 26.],
       [11.,  9., 95., 75., 15.],
       [64., 26.,  3., 23., 17.]])

In [None]:
# Tip: use the fmt argument to control the number format and reduce the file size

np.savetxt("array_de_inteiros_5x5.csv", arr, delimiter=",", fmt="%.3f")
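
A round trip through both formats can be sketched as follows. The file names are hypothetical and are written to a temporary directory. For integer data, `fmt="%d"` and `dtype=int` are one possible choice to keep the values as integers in the text file.

```python
import os
import tempfile

import numpy as np

table = np.arange(6).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    npy_path = os.path.join(tmp_dir, "example.npy")   # hypothetical file names
    csv_path = os.path.join(tmp_dir, "example.csv")

    # Binary format: dtype and shape are preserved exactly.
    np.save(npy_path, table)
    from_npy = np.load(npy_path)

    # Text format: "%d" writes plain integers; dtype=int reads them back as integers.
    np.savetxt(csv_path, table, delimiter=",", fmt="%d")
    from_csv = np.loadtxt(csv_path, delimiter=",", dtype=int)

print(from_npy.dtype == table.dtype, np.array_equal(from_npy, table))   # True True
print(np.array_equal(from_csv, table))                                  # True
```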

## References

- [Python Data Science Handbook (2016) by Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/)
- [Python for Data Analysis (2017) by Wes McKinney](https://learning.oreilly.com/library/view/python-for-data/9781491957653/)
