<center>

# <b style="font-family: 'LUISS', 'Lato'">Data Processing</b>

<h2 style="font-family: 'LUISS', 'Lato'">Python and R for Data Science</h2>
<h3 style="font-family: 'LUISS', 'Lato'">Management and Data Science</h3>
<img src="https://ercoppa.github.io/labds/dist/img/cliente-luiss.png">
<br><br><br>

</center>

# Package `numpy`

## Why `numpy`?

In science, we often want to work on multidimensional data:
- one dimension: list or *array* (*an efficient list with fixed size*)
- two dimensions: matrix, also dubbed 2d-array
- three dimensions: 3d-array
- n dimensions: nd-array

In theory, we could just use lists and nest them, e.g.:


In [2]:
matrix = [[1, 2, 3], [4, 5, 6]] # 2x3 matrix
print(matrix[1][0]) # first element of second row

4


However, lists are not very efficient: when we have thousands of data values, they can be quite slow and memory hungry. This is why we may want to use `numpy`.
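To make the memory claim concrete, here is a rough sketch that compares the footprint of a list and of an array holding the same 1000 integers. The element count and the explicit `np.int64` type are choices made for this illustration, and the list figure is only an estimate that depends on the Python version.

```python
import sys
import numpy as np

n = 1000
values = list(range(n))               # plain Python list of ints
arr = np.arange(n, dtype=np.int64)    # numpy array, 8 bytes per element

# A list stores references to separate int objects, so we count both
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
array_bytes = arr.nbytes              # size of the raw data buffer only

print("List (approx.) bytes:", list_bytes)
print("Array data bytes:", array_bytes)
```

The array needs exactly 8 bytes per value, while the list pays for a reference plus a full Python object per value.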

## What is `numpy`?

<center>

<img src="img/numpylogo.svg">

</center>

`numpy` (Numerical Python) contains **multidimensional array data structures**, such as the homogeneous, N-dimensional nd-array, and a large library of **functions that operate efficiently** on them.

Main differences from Python lists:
- a `numpy` nd-array must contain values of the same numerical type
- a `numpy` nd-array has a fixed size
- a `numpy` nd-array should have a shape that is *rectangular*, i.e., different rows must have the same number of columns
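The first two differences listed above can be observed directly. The following is a small sketch with made-up values, not an exhaustive description of how `numpy` converts types.

```python
import numpy as np

# Same type: mixing ints and floats upcasts every element to float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)                 # float64

# Writing a float into an integer array silently drops the fractional part
ints = np.array([1, 2, 3])
ints[0] = 2.7
print(ints)                        # [2 2 3]

# Fixed size: np.append does not grow the array in place, it returns a new one
longer = np.append(ints, 4)
print(ints.shape, longer.shape)    # (3,) (4,)
```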

## `numpy`: install and import

Install `numpy` with `pip`:

In [3]:
! pip3 install numpy

Defaulting to user installation because normal site-packages is not writeable


By convention, `numpy` is imported with the alias `np`:

In [4]:
import numpy as np
# now we can use numpy using np

## `numpy` array: construction

We can build an array by passing a list:

In [5]:
a1 = np.array([1, 2, 3])              # 1D array
print(a1)
a2 = np.array([[1, 2, 3], [4, 5, 6]]) # 2D array
print(a2)

[1 2 3]
[[1 2 3]
 [4 5 6]]


For our goals, we will often receive `numpy` arrays already built by other packages, e.g., as the result of some complex data processing.

## `numpy` array: construction (cont'd)

`numpy` comes with many useful constructors:

In [11]:
a3 = np.zeros((2, 3)) # 2x3 array of zeros
print("Zeros:", a3)
a4 = np.ones((2, 3))  # 2x3 array of ones
print("Ones:", a4)
a5 = np.random.random((2, 3)) # 2x3 array of random numbers
print("Random:", a5)
a6 = np.empty((2, 3)) # 2x3 uninitialized array (whatever is already in memory)
print("Garbage:", a6) # this is used when we do not care
                      # for the values of the array;
                      # they are arbitrary, NOT guaranteed to be random!
a7 = np.arange(0, 10, 2) # array of numbers from 0 up to 10 (excluded) with step 2
print("Range:", a7)

Zeros: [[0. 0. 0.]
 [0. 0. 0.]]
Ones: [[1. 1. 1.]
 [1. 1. 1.]]
Random: [[0.67071932 0.16095177 0.78488374]
 [0.17997293 0.30958598 0.80442035]]
Garbage: [[0.67071932 0.16095177 0.78488374]
 [0.17997293 0.30958598 0.80442035]]
Range: [0 2 4 6 8]
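One detail of `np.arange` is easy to miss: the stop value is excluded. The sketch below contrasts it with `np.linspace` and `np.full`, two other `numpy` constructors that are not among those used in the cell above and are shown here only for comparison.

```python
import numpy as np

evens = np.arange(0, 10, 2)       # stop value 10 is excluded
print(evens)                      # [0 2 4 6 8]

points = np.linspace(0, 10, 6)    # 6 evenly spaced values, stop value included
print(points)

sevens = np.full((2, 3), 7)       # 2x3 array filled with a chosen value
print(sevens)
```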


## `numpy` array: data type

By default, constructors such as `np.zeros()` and `np.ones()` use floating-point values, while `np.array()` infers the type from the data it receives. However, we can set the `numpy` data type during construction:

In [16]:
a = np.ones(2, dtype=np.dtype('int')) # 1D array of ones with int type
print(a)

[1 1]


Since `numpy` is used to efficiently perform scientific computations, it comes with a wide range of `numpy` data types that go beyond the Python types `int`, `float`, `bool`, and `str`. For instance, it supports complex numbers and *smaller* integers (which require less memory but can only represent a smaller range of values).

When using `numpy`, for our goals, we can stick with `np.dtype('int')` and `np.dtype('float')` types.
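As a quick illustration of the memory and range trade-off between integer types, consider the sketch below. The types `np.int8` and `np.int64` are picked only as examples.

```python
import numpy as np

small = np.array([1, 2, 3], dtype=np.int8)    # 1 byte per element
large = np.array([1, 2, 3], dtype=np.int64)   # 8 bytes per element
print(small.nbytes, large.nbytes)             # 3 24

# Smaller integer types cover a smaller range of values
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)   # -128 127

# astype returns a converted copy; the original array is unchanged
as_float = small.astype(float)
print(as_float.dtype)                         # float64
```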

## `numpy` array is similar to a list

Similarly to lists, we can:

In [27]:
a = np.array([1, 2, 3, 4, 5, 6])
print("array:", a)
print("element:", a[0])     # access an element
print("slice:", a[1:3])     # access a slice
print("A slice is still a numpy array:", type(a[1:3]))

a[0] = 10                   # modify an element
print("Updated array:", a)

a = np.array([[1, 2, 3], [4, 5, 6]])            # 2x3 array
print("Accessing 2D array:", a[1, 0], "same as", a[1][0])  # first element of second row
a[0] = 10                                       # set every element of the first row to 10
print("Updated 2D array:", a)

array: [1 2 3 4 5 6]
element: 1
slice: [2 3]
A slice is still a numpy array: <class 'numpy.ndarray'>
Updated array: [10  2  3  4  5  6]
Accessing 2D array: 4 same as 4
Updated 2D array: [[10 10 10]
 [ 4  5  6]]
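One place where the similarity with lists breaks down is worth a short sketch: slicing a `numpy` array gives a view that shares memory with the original, whereas slicing a list makes a copy. The variable names below are only illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])
view = a[1:3]                 # a slice shares memory with the original array
view[0] = 99
print(a)                      # [ 1 99  3  4  5  6]

independent = a[1:3].copy()   # an explicit copy has its own memory
independent[0] = -1
print(a)                      # still [ 1 99  3  4  5  6]

lst = [1, 2, 3, 4, 5, 6]
lst_slice = lst[1:3]          # slicing a list always copies
lst_slice[0] = 99
print(lst)                    # [1, 2, 3, 4, 5, 6]
```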


## `numpy` array: attributes

A few useful attributes:

In [31]:
a = np.random.random((2, 3))            # random 2x3 array
print("Data type:", a.dtype)            # data type of the array
print("Shape:", a.shape)                # shape of the array
print("Size:", a.size)                  # number of elements in the array
print("Number of dimensions:", a.ndim)  # number of dimensions

Data type: float64
Shape: (2, 3)
Size: 6
Number of dimensions: 2
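The attributes are related to each other: `size` is the product of the entries of `shape`, and `ndim` is the length of `shape`. A minimal sketch, using `reshape()` (not covered above) to build a small example array:

```python
import numpy as np

flat = np.arange(6)          # [0 1 2 3 4 5]
grid = flat.reshape(2, 3)    # same 6 values arranged as 2 rows x 3 columns

print(grid.shape, grid.ndim, grid.size)                # (2, 3) 2 6
print(grid.size == grid.shape[0] * grid.shape[1])      # True
print(grid.ndim == len(grid.shape))                    # True
```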


## `numpy` array: operations

Given `a1 = np.array([[1, 2], [4, 5]])` and `a2 = np.array([[6, 7], [8, 9]])`:

| Operator | Semantics | Example | Example Result |
| -------- | ------- | :-------: | :-------: |  
| `+` | element-wise sum |`a1 + a2` | `[[ 7  9] [12 14]]`| 
| `-` | element-wise difference |`a1 - a2` | `[[-5 -5] [-4 -4]]`| 
| `*` | element-wise product |`a1 * a2` | `[[ 6 14] [32 45]]`| 
| `/` | element-wise division |`a1 / a2` | `[[0.17 0.29] [0.5 0.56]]`|
| `%` | element-wise remainder |`a2 % a1` | `[[0 1] [0 4]]`| 
| `**` | element-wise power |`a1**2` | `[[ 1  4] [16 25]]`| 
| `np.dot()` | matrix product |`np.dot(a1, a2)` | `[[22 25] [64 73]]`| 
| `a1.min()` | minimum value |`a1.min()` | `1`|
| `a1.max()` | maximum value |`a1.max()` | `5`|
| `a1.sum()` | sum of values |`a1.sum()` | `12`| 

## `numpy` array: try them!

In [44]:
a1 = np.array([[1, 2], [4, 5]])
a2 = np.array([[6, 7], [8, 9]])
print("Sum:", a1 + a2)          # element-wise sum
print("Difference:", a1 - a2)   # element-wise difference
print("Product:", a1 * a2)      # element-wise product
print("Division:", a1 / a2)     # element-wise division
print("Module:", a2 % a1)       # element-wise remainder (modulo)
print("Power:", a1 ** 2)        # element-wise power
print("Matrix product:", np.dot(a1, a2)) # matrix product
print("Matrix product:", a1.dot(a2))     # matrix product
print("Min:", a1.min())         # minimum element
print("Max:", a1.max())         # maximum element
print("Sum:", a1.sum())         # sum of all elements

Sum: [[ 7  9]
 [12 14]]
Difference: [[-5 -5]
 [-4 -4]]
Product: [[ 6 14]
 [32 45]]
Division: [[0.16666667 0.28571429]
 [0.5        0.55555556]]
Module: [[0 1]
 [0 4]]
Power: [[ 1  4]
 [16 25]]
Matrix product: [[22 25]
 [64 73]]
Matrix product: [[22 25]
 [64 73]]
Min: 1
Max: 5
Sum: 12
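A few variations on these operations may also be handy. This is a sketch reusing the same two arrays; the `@` operator and the `axis` argument are standard `numpy` features that the cell above does not use.

```python
import numpy as np

a1 = np.array([[1, 2], [4, 5]])
a2 = np.array([[6, 7], [8, 9]])

print(a1 @ a2)          # matrix product, same result as np.dot(a1, a2)
print(a1 + 10)          # a scalar is applied to every element
print(a1.sum(axis=0))   # column sums: [5 7]
print(a1.sum(axis=1))   # row sums: [3 9]
```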


## `numpy` array: filtering

We can easily filter values:

In [55]:
a = np.array([[1 , 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print("Array:", a)
print("Filtered array (less than 6):", a[a < 6])        # elements less than 6
print("Filtered array (even):", a[a % 2 == 0])          # even elements
print("Filtered array (and):", a[(a > 2) & (a < 11)])   # conjunction of two conditions
print("Filtered array (or):", a[(a < 2) | (a > 10)])    # disjunction of two conditions
print("Check conditions:", (a > 5) | (a == 5))          # boolean mask: True where a >= 5

Array: [[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
Filtered array (less than 6): [1 2 3 4 5]
Filtered array (even): [ 2  4  6  8 10 12]
Filtered array (and): [ 3  4  5  6  7  8  9 10]
Filtered array (or): [ 1 11 12]
Check conditions: [[False False False False]
 [ True  True  True  True]
 [ True  True  True  True]]
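The expression inside the brackets is itself a boolean array (a *mask*), which can be stored, counted, and used for assignment. Note that conditions are combined with `&` and `|` and wrapped in parentheses, since the Python keywords `and` and `or` do not work element-wise. A small sketch, with the threshold values chosen only for illustration:

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

mask = (a > 2) & (a < 11)     # boolean array with the same shape as a
print(mask.sum())             # number of matching elements: 8

clipped = a.copy()            # work on a copy to keep a intact
clipped[clipped > 10] = 10    # assign through a mask
print(clipped[2])             # [ 9 10 10 10]
```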


## `numpy` array: `nonzero()`

A convenient function is `nonzero()`, which:
- takes an array
- returns a tuple with one array per dimension, containing the indexes of the elements that are different from zero

In [64]:
a = np.array([[-1, -2, -3, -4], [5, 6, 7, 8], [9, 10, 11, 12]])
nonzero_a_idx = np.nonzero(a) # default condition: a != 0
print("Indexes of non-zero elements:", np.nonzero(a))
print("Non-zero elements:", a[nonzero_a_idx])


Indexes of non-zero elements: (array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]), array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]))
Non-zero elements: [-1 -2 -3 -4  5  6  7  8  9 10 11 12]


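However, instead of checking for the nonzero condition, we can pass an arbitrary condition: the condition evaluates to a boolean array, and `nonzero()` returns the indexes where it is `True`: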

In [65]:

g10_a_idx = np.nonzero(a > 10) # condition: a > 10
print("Indexes of condition > 10:", g10_a_idx)
print("> 10 elements:", a[g10_a_idx])


Indexes of condition > 10: (array([2, 2]), array([2, 3]))
> 10 elements: [11 12]
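Since `nonzero()` returns one index array per dimension, it can be convenient to pair them up into (row, column) coordinates. The sketch below does this by hand and then with `np.argwhere()`, a related `numpy` function not used above.

```python
import numpy as np

a = np.array([[-1, -2, -3, -4], [5, 6, 7, 8], [9, 10, 11, 12]])
rows, cols = np.nonzero(a > 10)    # one index array per dimension

# Pair up the two index arrays to get (row, column) coordinates
coords = [(int(r), int(c)) for r, c in zip(rows, cols)]
print(coords)                      # [(2, 2), (2, 3)]

# np.argwhere returns the same coordinates as rows of a 2D array
pairs = np.argwhere(a > 10)
print(pairs)
```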


## `numpy`: documentation

There is ***way more*** to say about `numpy`. Check its documentation:

- [Getting started](https://numpy.org/doc/stable/user/index.html#user)
- [API reference](https://numpy.org/doc/stable/reference/index.html#reference)

We will cover other bits of it when needed.