<a href="https://colab.research.google.com/github/battistabiggio/ai4dev/blob/main/AI4Dev_01_intro_numpy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to libraries and basic operations

## Basic Operations with arrays

In Python, we can use the `numpy` library to represent data in a structured way. It is also preferable to generic Python lists, as:
* arrays are easier to index
* the APIs are richer and targeted to numerical applications (including ML!)
* computational efficiency: the interface is in Python, but the operations run in efficient compiled code (mostly C)

And these are just a few aspects; there are many more that we will not cover here.
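
As a small illustration of how the two behave differently (a sketch with made-up values), the `+` operator concatenates Python lists, while it adds numpy arrays element by element:

```python
import numpy as np

values_list = [1, 2, 3]
values_array = np.array([1, 2, 3])

# with lists, + concatenates the two sequences
print(values_list + values_list)    # [1, 2, 3, 1, 2, 3]

# with arrays, + is applied element-wise
print(values_array + values_array)  # [2 4 6]

# the same holds for other arithmetic operators, e.g., multiplication by a scalar
print(values_array * 2)             # [2 4 6]
```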

We first introduce the array creation operations to either wrap existing data structures into numpy arrays, or to generate arrays with known properties (e.g., a matrix of zeros).

In [1]:
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)  # prints the dimensions of the array
print(a.dtype)  # prints the data type of the elements

a = a.astype(np.float64)  # casting operation (returns a new, converted array)
print(a.dtype)

n_rows, n_cols = 2, 4
# creates a matrix of zero-valued elements with the given shape
a = np.zeros(shape=(n_rows, n_cols))

print(a)

(2, 3)
int64
float64
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]]

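A minimal sketch of type conversion, assuming small example arrays that are not part of the original notebook: `astype` returns a new array whose values are converted to the requested type, and the `dtype` argument lets us request a type directly at creation time.

```python
import numpy as np

ints = np.array([[1, 2, 3], [4, 5, 6]])

# astype returns a new array with the values converted to the requested type
floats = ints.astype(np.float64)
print(floats.dtype)  # float64
print(floats)

# the data type can also be requested directly at creation time
zeros_int = np.zeros(shape=(2, 4), dtype=np.int32)
print(zeros_int.dtype)  # int32
```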

## More methods to create arrays

Here we see other APIs for creating arrays.
These are useful to sample from distributions with known properties, or to build auxiliary structures used to transform other arrays (e.g., for masking and indexing operations).

In [2]:
a = np.ones(shape=(n_rows, n_cols))  # creates matrix of ones
print(a)

a = np.eye(n_rows, n_cols)  # ones on the main diagonal, zeros elsewhere (identity matrix if square)
print(a)

# random numbers from Normal distribution
# with zero mean and unit variance
a = np.random.randn(n_rows, n_cols)
print(a)

# random numbers from Uniform distribution in [0, 1)
a = np.random.rand(n_rows, n_cols)
print(a)

a = np.random.randint(0, 5, [n_rows, n_cols])  # random integers in [0, 5)
print(a)


[[1. 1. 1. 1.]
 [1. 1. 1. 1.]]
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]]
[[ 0.90996221  0.27430254  0.03586409 -0.05597693]
 [ 0.26379571  0.20626228 -2.57862825  0.7875208 ]]
[[0.581358   0.62739482 0.47861371 0.46201351]
 [0.50141633 0.15785272 0.21654061 0.94829879]]
[[2 2 3 1]
 [2 0 2 3]]

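The random outputs above change at every run. When reproducible numbers are needed, one option is numpy's `Generator` interface, sketched below; the seed value `42` is an arbitrary choice made for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # the seed value is arbitrary

normal = rng.standard_normal(size=(2, 4))   # zero mean, unit variance
uniform = rng.random(size=(2, 4))           # uniform in [0, 1)
integers = rng.integers(0, 5, size=(2, 4))  # integers in [0, 5)

# re-creating the generator with the same seed reproduces the same numbers
rng_again = np.random.default_rng(seed=42)
print(np.array_equal(normal, rng_again.standard_normal(size=(2, 4))))  # True
```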

## Array Indexing

Sometimes we are interested in extracting certain elements from the arrays. We use indexing operations for this.

With Python lists, we can index individual elements or ranges of elements. The same works with numpy arrays, which additionally support richer forms of indexing.

Specifically, to index elements in multi-dimensional arrays, we can use a more compact and intuitive notation.

To extract one element, it is sufficient to list the indices consecutively, one per dimension. For example, in a 2D array, the element in row 0, column 1 is selected with the compact notation:

```python
a[0, 1]
```

More generally, we can select subsets of the elements along a dimension by using the notation

```
<start>:<stop>:<step>
```

where any parameter, if omitted, defaults to:
* start $\rightarrow$ the beginning of the array (the first index)
* stop $\rightarrow$ the end of the array (the last element is included); when given explicitly, the element at index stop is excluded
* step $\rightarrow$ one, i.e., take all elements without skipping any
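
The defaults listed above can be checked on a small 1D array (a sketch; the array `v` is introduced only for this example):

```python
import numpy as np

v = np.arange(10)  # the integers from 0 to 9

print(v[2:7])    # [2 3 4 5 6]   start and stop given, step defaults to 1
print(v[:3])     # [0 1 2]       start omitted
print(v[7:])     # [7 8 9]       stop omitted
print(v[::3])    # [0 3 6 9]     only the step given
print(v[2:7:2])  # [2 4 6]       all three given
```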

Then, we can extract submatrices from the arrays by specifying the indices for the slices. For example, if we start from a 2D array, we can extract submatrices by using:

```python
a[0:2, 0:2]  # extracts the submatrix of rows 0-1 and columns 0-1 (stop is excluded)
```

where each slice applies to one dimension of the array (thus this instruction returns the 2x2 block made of rows 0 and 1 and columns 0 and 1, since the stop index is excluded).

We can also select entire rows or columns (or, more generally, entire dimensions) with the colon operator (`:`), i.e., a slice that omits all three parameters.

For example, in the previous array, we can select row 0 by using the colon operator in the second dimension:

```python
a[0, :]
```

and column 0 by using the colon operator in the first dimension:

```python
a[:, 0]
```


In [3]:
a = np.eye(3)
print(a)

element = a[0, 0]  # picks the first element (returns float)
print(element)

submatrix = a[0:2, 0:2]  # selects submatrix with slicing operators
print(submatrix)

row = a[0, :]  # picks the first row (returns flat array)
print(row)

column = a[:, 1]  # picks the second column (returns flat array)
print(column)

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
1.0
[[1. 0.]
 [0. 1.]]
[1. 0. 0.]
[0. 1. 0.]

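Indexing a dimension with a single integer drops that dimension, which is why rows and columns come back as flat arrays. As a small sketch, a slice of length one keeps the dimension instead:

```python
import numpy as np

a = np.eye(3)

row_flat = a[0, :]   # integer index: the dimension is dropped
row_2d = a[0:1, :]   # slice of length one: the dimension is kept

print(row_flat.shape)  # (3,)
print(row_2d.shape)    # (1, 3)
```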

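We can also index arrays with other arrays: integer arrays pick elements by position (possibly repeating them), while boolean arrays act as masks that keep only the elements marked as true.
Boolean arrays are typically the result of a comparison in numpy.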

In [4]:
a = np.eye(3)
b = np.array([1, 1, 0])
indexed_a = a[b, :]  # integer indexing: picks row 1 twice, then row 0 once
print(indexed_a)

# boolean mask obtained from a comparison
indexed_a = a[b == 1, :]  # in this case it will pick the rows where b is == 1
print(indexed_a)


[[0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]]
[[1. 0. 0.]
 [0. 1. 0.]]

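Boolean masks can also be computed on the array itself and combined. The following is a minimal sketch on an array with made-up values:

```python
import numpy as np

a = np.array([[1, -2, 3], [-4, 5, -6]])

mask = a > 0           # boolean array with the same shape as a
print(mask)
print(a[mask])         # [1 3 5]  (selected elements, returned as a flat array)

# conditions are combined with & (and), | (or), ~ (not); parentheses are required
print(a[(a > 0) & (a < 5)])  # [1 3]
```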

The complete reference on array indexing can be found in the docs:
- https://numpy.org/doc/stable/reference/arrays.indexing.html

## Other operations on arrays

There are other operations that can be used to transform arrays.

For example, we can transpose arrays, stack them vertically or horizontally, and perform standard operations (e.g., sums, multiplications, etc).

In [5]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.array([[1, 2, 3], [4, 5, 6]])

transpose = c.T  # swaps rows and columns
print(transpose)

vertical_stack = np.vstack((a, b))  # stack rows (vertical stacking)
print(vertical_stack)

horizontal_stack = np.hstack((a, b))  # horizontal stacking (1D arrays are concatenated)
print(horizontal_stack)

# element-wise operations
array_sum = a + b
print(array_sum)

array_product = a * b
print(array_product)

scalar_product = a.dot(b)  # scalar (dot) product
print(scalar_product)

# inner dimensions must match for matrix operations
# a has 3 elements and c.T is 3x2 --> the result has 2 elements
scalar_product_with_matrix = a.dot(c.T)
print(scalar_product_with_matrix)

[[1 4]
 [2 5]
 [3 6]]
[[1 2 3]
 [4 5 6]]
[1 2 3 4 5 6]
[5 7 9]
[ 4 10 18]
32
[14 32]

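The same products can be written with the `@` operator. The sketch below also shows what happens when the inner dimensions do not match (the arrays are the same as in the cell above, redefined here so the example runs on its own):

```python
import numpy as np

a = np.array([1, 2, 3])
c = np.array([[1, 2, 3], [4, 5, 6]])

print(c @ a)            # [14 32]  (2x3 matrix times a vector of length 3)
print((c @ c.T).shape)  # (2, 2)   (2x3 times 3x2)

# mismatched inner dimensions raise an error
try:
    c @ c               # 2x3 times 2x3 is not defined
except ValueError as error:
    print("shape mismatch:", error)
```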

## Exercise

Define a function `extract_subset(x, y, y0)` that takes as input:

- a 2D matrix `x` and an array `y` with one value for each row of `x`
- a target value `y0` (e.g., `y0=0`)

and returns the matrix containing only the rows of `x` where `y` is equal to `y0`.

In [6]:
x = np.array([
        [ 0.33990211,  0.94182274,  0.66611658,  0.72773846],
        [ 0.20281557,  0.24280422,  0.3627702,   0.80495032],
        [ 0.5016927,   0.29465024,  0.61690932,  0.25302243],
        [ 0.01744464,  0.82521145,  0.82226041,  0.89858553],
        [ 0.33772606,  0.17433791,  0.7705529,   0.11211808]
    ])

y = np.array([0, 0, 1, 1, 1])  # one value for each row of x

def extract_subset(x, y, y0):
    return x[y == y0, :]

result = extract_subset(x, y, y0=0)
print(result)


[[0.33990211 0.94182274 0.66611658 0.72773846]
 [0.20281557 0.24280422 0.3627702  0.80495032]]


## Handling data with Pandas

Pandas provides two types of classes for handling data:

* **Series**: a one-dimensional labeled array holding data of any type such as integers, strings, Python objects etc.
* **DataFrame**: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.

Creating a Series by passing a list of values makes pandas create a default RangeIndex to index the series. This is a sequence of integer labels from 0 to the length of the series minus one.


In [7]:
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

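A Series can also carry explicit labels instead of the default RangeIndex. In this sketch the labels `"x"`, `"y"`, `"z"` are hypothetical and chosen only for the example:

```python
import pandas as pd

default_index = pd.Series([1, 3, 5])
print(default_index.index)  # RangeIndex(start=0, stop=3, step=1)

# hypothetical labels, chosen only for this example
labeled = pd.Series([1.0, 3.0, 5.0], index=["x", "y", "z"])
print(labeled["y"])  # 3.0
```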

We can also create a DataFrame by passing a NumPy array. Optionally, we can explicitly specify the index to use; for example, we can create a datetime index using `date_range()`. Additionally, we can create named columns by passing the labels during creation:

In [8]:
dates = pd.date_range("20130101", periods=20)  # explicit index with 20 elements
data = np.random.randn(20, 4)  # 20 rows with 4 values each
df = pd.DataFrame(data, index=dates, columns=["a", "b", "c", "d"])

print(df.head())  # prints only the first rows (five by default)

print(df.tail(3))  # prints only the last three rows

                   a         b         c         d
2013-01-01 -0.036050 -0.389737 -1.773208  1.034039
2013-01-02  0.170702 -0.422041 -0.724043 -1.200224
2013-01-03 -1.028236 -0.283168  0.855429  0.836579
2013-01-04 -1.418137  0.329680  0.147871 -0.744977
2013-01-05  1.536930 -1.708008  1.895684 -0.127222
                   a         b         c         d
2013-01-18 -0.779606  0.553652 -0.030407 -0.413704
2013-01-19 -1.253986 -0.113520  1.923814  0.598770
2013-01-20 -0.078383 -1.512828  0.599198  0.107753


Then, we can retrieve the index and the columns:

In [9]:
print(df.index)
print(df.columns)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
               '2013-01-09', '2013-01-10', '2013-01-11', '2013-01-12',
               '2013-01-13', '2013-01-14', '2013-01-15', '2013-01-16',
               '2013-01-17', '2013-01-18', '2013-01-19', '2013-01-20'],
              dtype='datetime64[ns]', freq='D')
Index(['a', 'b', 'c', 'd'], dtype='object')


To select elements from the dataframe, we can use the getitem operator (`[]`) like we do with numpy, but we can now also index columns by name.

In [10]:
column_a = df["a"]  # extract the Series "a" from the df
print(column_a)

some_rows = df[1:3]  # extracts a sub-frame by row position (stop excluded)
print(some_rows)

sliced_by_index = df["20130102":"20130104"]  # extracts a sub-frame by index labels (both endpoints included)
print(sliced_by_index)

2013-01-01   -0.036050
2013-01-02    0.170702
2013-01-03   -1.028236
2013-01-04   -1.418137
2013-01-05    1.536930
2013-01-06    0.301234
2013-01-07   -0.056693
2013-01-08   -1.284757
2013-01-09    0.228056
2013-01-10   -1.807216
2013-01-11   -0.617031
2013-01-12    0.038732
2013-01-13    0.453570
2013-01-14    0.278195
2013-01-15   -0.763572
2013-01-16   -0.885871
2013-01-17    1.932232
2013-01-18   -0.779606
2013-01-19   -1.253986
2013-01-20   -0.078383
Freq: D, Name: a, dtype: float64
                   a         b         c         d
2013-01-02  0.170702 -0.422041 -0.724043 -1.200224
2013-01-03 -1.028236 -0.283168  0.855429  0.836579
                   a         b         c         d
2013-01-02  0.170702 -0.422041 -0.724043 -1.200224
2013-01-03 -1.028236 -0.283168  0.855429  0.836579
2013-01-04 -1.418137  0.329680  0.147871 -0.744977

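Note the different treatment of the end of the slice in the two outputs above. A small sketch on a toy dataframe (its values are made up for the example) makes the difference explicit:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=5)
df = pd.DataFrame(np.arange(10).reshape(5, 2), index=dates, columns=["a", "b"])

by_position = df[1:3]                 # positions 1 and 2 (stop excluded)
by_label = df["20130102":"20130104"]  # Jan 2, 3 and 4 (end label included)

print(len(by_position))  # 2
print(len(by_label))     # 3
```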

We can also make the selection more explicit by using the dedicated accessors, which select elements by label (`df.loc`) or by integer position (`df.iloc`).

In [11]:
by_label = df.loc[dates[0], "a"]
print(by_label)

by_position = df.iloc[3:4, 0:2]
print(by_position)

-0.03605024709893531
                   a        b
2013-01-04 -1.418137  0.32968

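Both accessors also accept slices and lists on each axis. The sketch below, on a toy dataframe with made-up values, selects the same cells once by label and once by position:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=5)
df = pd.DataFrame(np.arange(20).reshape(5, 4), index=dates, columns=["a", "b", "c", "d"])

# loc: labels on both axes; the end label of a slice is included
by_label = df.loc[dates[1]:dates[2], ["a", "c"]]
print(by_label)

# iloc: integer positions on both axes; the stop position is excluded
by_position = df.iloc[1:3, [0, 2]]
print(by_position)

# both expressions select the same cells
print(by_label.equals(by_position))  # True
```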

Finally, we can select rows through boolean arrays/series (like we did with numpy).

In [12]:
boolean_indexed = df[df["a"] > 0]
print(boolean_indexed)

                   a         b         c         d
2013-01-02  0.170702 -0.422041 -0.724043 -1.200224
2013-01-05  1.536930 -1.708008  1.895684 -0.127222
2013-01-06  0.301234  0.416774  1.197323 -0.416041
2013-01-09  0.228056  0.657547 -2.062970 -0.471200
2013-01-12  0.038732  0.274945 -0.998196 -0.127441
2013-01-13  0.453570 -0.212440 -0.035703  0.720582
2013-01-14  0.278195 -0.397982  0.626447 -0.740126
2013-01-17  1.932232 -0.340469 -0.965420  0.253758

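As with numpy, several conditions can be combined with `&`, `|` and `~`, wrapping each comparison in parentheses. A minimal sketch on a toy dataframe with made-up values:

```python
import numpy as np
import pandas as pd

values = np.array([[1.0, -1.0], [-1.0, -1.0], [2.0, 1.0], [-2.0, 1.0]])
df = pd.DataFrame(values, columns=["a", "b"])

positive_a = df["a"] > 0    # boolean Series, one value per row
print(positive_a.tolist())  # [True, False, True, False]

both = df[(df["a"] > 0) & (df["b"] < 0)]
print(both)                 # only the first row satisfies both conditions
```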

Beyond this, most of the operations performed with numpy are also applicable to dataframes. There are more specific operations that we will not cover in this course, but you can find good examples in the documentation of the library.
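
As a brief sketch of this interoperability (the toy dataframe below is made up for the example), arithmetic operators and numpy functions work element-wise on a dataframe, and `to_numpy()` converts it back to a plain array:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1.0, -2.0], [3.0, -4.0]]), columns=["a", "b"])

print(df + 1)          # element-wise arithmetic, as with arrays
print(np.abs(df))      # numpy functions can be applied to a DataFrame
print(df.sum(axis=0))  # column-wise sum, labeled by column name
print(df.to_numpy())   # back to a plain numpy array
```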

Reference for Pandas tutorial:
* https://pandas.pydata.org/docs/user_guide/10min.html