[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ciri/iese-dsfb/blob/main/notebooks/010-Numpy.ipynb)

# NumPy: Numerical Python

## Why learn NumPy?

In today's data-driven world, everything can be quantified and represented as numbers—from user behavior on websites to patterns in medical data. Understanding these numerical representations is essential for machine learning, data analysis, and scientific computing. NumPy serves as the foundational package for numerical operations in Python, making it easier for you to delve into these domains.


## NumPy Arrays

This tutorial introduces you to NumPy's data containers, offering a quick look at how to manage data sets in Python. In the realm of mathematics, **vectors** and **matrices** are fundamental units represented as sequences and grids of numbers, respectively. They are integral to linear algebra, a crucial area in computational tasks underlying modern Machine Learning and Artificial Intelligence. In NumPy, these structures are simplified as one-dimensional (1D) and two-dimensional (2D) arrays, and the library even supports higher-dimensional arrays with ease.

To get started, import NumPy like this:

In [4]:
import numpy as np

A 1D array can be created from a list with the NumPy function `np.array`. If the items of the list have different types, they are converted to a common type when creating the array. A simple example follows.

In [5]:
mylist = [2, 7, 14, 5, 9]

In [7]:
type(mylist)

list

In [10]:
mylist + [1]

[2, 7, 14, 5, 9, 1]

In [13]:
['hello',12344,844.5,True]

['hello', 12344, 844.5, True]

In [11]:
myarray = np.array(mylist)
myarray

array([ 2,  7, 14,  5,  9])

In [26]:
myarray[-1]

9

In [22]:
my2darray = np.array([
    [1,2,3,4,5],
    [6,7,8,9,10],
    [6,7,8,9,10],
    [6,7,8,9,10],
    [6,7,8,9,10],
    [6,7,8,9,10],
    [6,7,8,9,10],
    [6,7,8,9,10],
    [6,7,8,9,10],
    [6,7,8,9,10],
    [6,7,8,9,10],
    [6,7,8,9,10],
    [6,7,8,9,10],
    [6,7,8,9,10]
])
my2darray.shape

(14, 5)

In [27]:
my2darray[0,1]

2

In [3]:
type(mylist)

list

In [24]:
arr1 = np.array([1,2,3,4])
arr1

array([1, 2, 3, 4])

This looks the same, but it's a very different beast! The constraint that everything has the same type is a useful one as it allows us to operate on the array more naturally. Try these for comparison:

In [25]:
arr1 + 1
mylist + [1]
arr1 * 2
mylist * 2

[2, 7, 14, 5, 9, 2, 7, 14, 5, 9]
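A notebook cell displays only the value of its last expression, so the cell above shows just `mylist * 2`. A minimal sketch that prints all four results, re-creating `mylist` and `arr1` so that it runs on its own:

```python
import numpy as np

mylist = [2, 7, 14, 5, 9]
arr1 = np.array([1, 2, 3, 4])

# Arrays: arithmetic is applied to every element
print(arr1 + 1)      # [2 3 4 5]
print(arr1 * 2)      # [2 4 6 8]

# Lists: + concatenates and * repeats
print(mylist + [1])  # [2, 7, 14, 5, 9, 1]
print(mylist * 2)    # [2, 7, 14, 5, 9, 2, 7, 14, 5, 9]
```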

In case of mixed types (never do this!) you will get unexpected results:

In [26]:
arr1 = np.array([1,'a',3,4])
arr1

array(['1', 'a', '3', '4'], dtype='<U21')

There are two types involved with arrays. The type of the array itself is `numpy.ndarray`; its elements also have a type, which can be checked using `myarray.dtype`:

In [27]:
type(arr1)

numpy.ndarray

In [28]:
arr1.dtype

dtype('<U21')
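The conversion to a common type also happens with purely numeric data. The sketch below (the variable names are only illustrative) shows integers mixed with a float becoming floats, and anything mixed with a string becoming strings:

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])        # int and float: everything becomes a float
strings = np.array([1, 'a', 3.0])    # a string is present: everything becomes a string

print(ints.dtype)          # an integer dtype (int64 on most platforms)
print(mixed.dtype)         # float64
print(strings.dtype.kind)  # U, i.e. a Unicode string dtype
```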

A 2D array can be directly created from a list of lists of equal length. The terms are entered row-by-row:

In [29]:
my_list_of_lists = [
    [0, 7, 2, 3], 
    [3, 9, -5, 1]
]
my_list_of_lists

[[0, 7, 2, 3], [3, 9, -5, 1]]

In [30]:
arr2 = np.array(my_list_of_lists)
arr2

array([[ 0,  7,  2,  3],
       [ 3,  9, -5,  1]])

Although we visualize a vector as a column (or as a row) and a matrix as a rectangular arrangement, with rows and columns, it is not so in the computer. The 1d array is just a sequence of elements of the same type, neither horizontal nor vertical. It has one **axis**, which is the 0-axis.

In a similar way, a 2d array is a sequence of 1d arrays of the same length and type. It has two axes. When we visualize it as rows and columns, `axis=0` runs down the rows (so an operation along `axis=0` gives one result per column), while `axis=1` runs across the columns (one result per row).
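A small sketch of what the two axes mean in practice, using the same values as `arr2` and summing along each axis:

```python
import numpy as np

arr2 = np.array([[0, 7, 2, 3],
                 [3, 9, -5, 1]])

col_sums = arr2.sum(axis=0)  # collapse the rows: one sum per column
row_sums = arr2.sum(axis=1)  # collapse the columns: one sum per row

print(col_sums)  # [ 3 16 -3  4]
print(row_sums)  # [12  8]
```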

The number of terms stored along an axis is the **dimension** of that axis. The dimensions are collected in the attribute `shape`:

In [31]:
arr1.shape

(4,)

In [32]:
arr2.shape

(2, 4)

In [33]:
arr3 = np.random.randn(2,3,4)
arr3.shape

(2, 3, 4)

The shape is a tuple, meaning that you can read one of its elements, but you cannot reassign it.

In [34]:
print(arr2.shape[0])

2


In [35]:
arr2.shape[0] = 123

TypeError: 'tuple' object does not support item assignment
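If you do need a different shape, one option is the `reshape` method, which gives you an array with the requested shape and leaves the original untouched. A minimal sketch, assuming the new shape holds the same number of elements:

```python
import numpy as np

arr2 = np.array([[0, 7, 2, 3],
                 [3, 9, -5, 1]])

# Same 8 elements, read row by row into 4 rows of 2
reshaped = arr2.reshape(4, 2)

print(reshaped.shape)  # (4, 2)
print(arr2.shape)      # (2, 4), the original keeps its shape
```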

**You try it:**

1. Create a prediction function `price` that predicts the price of an apartment based on its surface:

    Price = 100,000 + 5,000 × (surface in m²)

2. Calculate the price of an apartment of 130 m².
3. Calculate the price over a range of surfaces 50, 60, ..., 130.

In [None]:
def price(surface):
    return 100000 + 5000 * surface

price(130)

surfaces = np.arange(50,131,10)

price(surfaces)

array([350000, 400000, 450000, 500000, 550000, 600000, 650000, 700000,
       750000])
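The call `price(surfaces)` works because `+` and `*` act on every element of an array. As a sanity check, this sketch compares it with calling `price` once per element in a list comprehension:

```python
import numpy as np

def price(surface):
    return 100000 + 5000 * surface

surfaces = np.arange(50, 131, 10)  # 50, 60, ..., 130 (the stop value 131 is excluded)

vectorized = price(surfaces)                     # one call on the whole array
looped = np.array([price(s) for s in surfaces])  # one call per element

print(np.array_equal(vectorized, looped))  # True
```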

## NumPy functions

NumPy incorporates vectorized forms of the **mathematical functions** of the package `math`. A **vectorized function** is one that, when applied to an array, returns an array with the same shape, whose terms are the values of the function on the corresponding terms of the original array. For instance, the NumPy square root function `np.sqrt` takes the square root of every term of a numeric array. Other functions, such as `np.max`, are aggregations: they reduce the array to a single value, or to one value per column or row. Here `np.max` returns the maximum of each column:

In [64]:
# Heights in cm [Female, Male]
heights = np.array([[160, 175],  
                    [155, 180],  
                    [165, 170],  
                    [162, 178],  
                    [158, 172]])
np.max(heights, axis=0)

array([165, 180])

In [65]:
heights.shape

(5, 2)

You can tell `numpy` to do this calculation along the rows or columns using the `axis` parameter, as `axis=0` did above. Let's try calculating the mean for females vs. males using the `np.mean()` function.

In [39]:
np.mean(heights, axis=0)

array([160., 175.])

NumPy also provides common mathematical and statistical functions, such as `median, max, sum, sqrt, std, quantile` etc.
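A short sketch contrasting an elementwise function (`np.sqrt`) with aggregations (`np.median`, `np.sum`, `np.max`) on the same height data:

```python
import numpy as np

# Heights in cm [Female, Male]
heights = np.array([[160, 175],
                    [155, 180],
                    [165, 170],
                    [162, 178],
                    [158, 172]])

# Elementwise: the result has the same shape as the input
roots = np.sqrt(heights)
print(roots.shape)                 # (5, 2)

# Aggregations: one value per column with axis=0
print(np.median(heights, axis=0))  # [160. 175.]
print(np.sum(heights, axis=0))     # [800 875]

# Without axis, the aggregation runs over all elements
print(np.max(heights))             # 180
```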


Functions that are defined in terms of vectorized functions are automatically vectorized. Let's try this with an exercise:

**You try it:**

Given the heights and weights of Females and Males, calculate the BMI of the whole population. Remember that the formula for BMI is:

$$
    \text{BMI} = \frac{\text{weight}}{\text{height (in meter)}^2}
$$

In [None]:
# Heights in cm [Female, Male]
heights = np.array([[160, 175],
                    [155, 180],
                    [165, 170],
                    [162, 178],
                    [158, 172]])

# Weights in kg [Female, Male]
weights = np.array([[55, 70],
                    [52, 77],
                    [58, 68],
                    [54, 75],
                    [53, 72]])

def bmi(h, w):
    # Convert height from cm to m, then apply weight / height^2 elementwise
    return w / (h / 100) ** 2

all_bmis_by_gender = bmi(heights, weights)
all_bmis_by_gender

array([[21.484375  , 22.85714286],
       [21.64412071, 23.7654321 ],
       [21.30394858, 23.52941176],
       [20.57613169, 23.67125363],
       [21.23057202, 24.33747972]])

In [75]:
np.mean(all_bmis_by_gender, axis=0)

array([21.2478296 , 23.63214401])

## Subsetting arrays

Indexing and **slicing** a 1D array work the same way as for a list:

In [42]:
arr1 = np.array([5,4,2,41])
arr1[0]

5

In [43]:
arr1[:3]

array([5, 4, 2])

The same applies to two-dimensional arrays, but we need two indexes within the square brackets. The first index selects the rows (`axis=0`), and the second index the columns (`axis=1`). A slice that runs past the end, such as `:3` on an array with two columns, is simply clipped:

In [44]:
heights[[1,2], :3]

array([[155, 180],
       [165, 170]])
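A few more row and column selections on the same data, as a minimal sketch (the array is re-created so the block runs on its own):

```python
import numpy as np

# Heights in cm [Female, Male]
heights = np.array([[160, 175],
                    [155, 180],
                    [165, 170],
                    [162, 178],
                    [158, 172]])

second_row = heights[1]         # a single row: [155 180]
females = heights[:, 0]         # every row, first column: [160 155 165 162 158]
some_males = heights[1:3, 1]    # rows 1 and 2, second column: [180 170]
one_value = heights[0, 1]       # row 0, column 1: 175

print(second_row, females, some_males, one_value)
```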

**Step intervals** can be used if we need to select only every n-th element of the array:

In [45]:
arr1, arr1[::2]

(array([ 5,  4,  2, 41]), array([5, 2]))

One special case of this which is used quite often is to use steps of -1, which is equivalent to reversing the array.

In [46]:
arr1, arr1[::-1]

(array([ 5,  4,  2, 41]), array([41,  2,  4,  5]))
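The general form of a slice is `start:stop:step`, where `stop` is excluded. A small sketch on a longer array (`arr` is just an illustrative name):

```python
import numpy as np

arr = np.arange(10)           # 0, 1, ..., 9

every_third = arr[1:8:3]      # start at 1, stop before 8, step 3: [1 4 7]
reversed_by_two = arr[::-2]   # from the end, every second element: [9 7 5 3 1]
backwards = arr[7:2:-1]       # from index 7 down to index 3: [7 6 5 4 3]

print(every_third, reversed_by_two, backwards)
```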

Here's an overview of common slicing operations from McKinney (2017):

<center>
    <img src='https://raw.githubusercontent.com/ciri/iese-dsfb/main/images/nparray-indexing.png' width='60%'>
</center>

**Filtering** Subsets of an array can also be extracted by means of expressions which act as filters. Any comparison involving an array is evaluated element by element and returns a Boolean array (called a **Boolean mask**):

In [92]:
f_tall           = heights >= 160 
f_not_super_tall = heights < 180
f_weight         = weights > 60

heights[(f_tall & f_not_super_tall) | f_weight]

array([160, 175, 180, 165, 170, 162, 178, 172])

In [104]:
(f_tall & f_not_super_tall) | f_weight

array([[ True,  True],
       [False,  True],
       [ True,  True],
       [ True,  True],
       [False,  True]])

In [94]:
heights[ ((heights >= 160) & (heights < 180)) | (weights > 60) ]

array([160, 175, 180, 165, 170, 162, 178, 172])

In [87]:
np.mean(
    all_bmis_by_gender[f_tall & f_not_super_tall]
)

22.537106176303787

In [None]:
female_heights = heights[:, 0]
female_bmis    = all_bmis_by_gender[:, 0]

# Illustrative mask; any Boolean array of the same length works here
female_bmis[female_heights >= 160]

In [49]:
filter_tall = heights > 160
filter_tall

array([[False,  True],
       [False,  True],
       [ True,  True],
       [ True,  True],
       [False,  True]])

We can use this as an index, and it will keep only the people taller than 160 cm. Note that the result is a flat 1D array:

In [50]:
heights[filter_tall]

array([175, 180, 165, 170, 162, 178, 172])

These can be combined into complex expressions as suited for the application:

In [51]:
filter_tall   = heights > 160
filter_short  = heights < 180
heights[filter_tall & filter_short ]

array([175, 165, 170, 162, 178, 172])
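When combining masks, use the elementwise operators `&` (and), `|` (or) and `~` (not) rather than Python's `and`, `or` and `not`, which raise an error on arrays. Each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` or `<`. A minimal sketch:

```python
import numpy as np

# Heights in cm [Female, Male]
heights = np.array([[160, 175],
                    [155, 180],
                    [165, 170],
                    [162, 178],
                    [158, 172]])

between = heights[(heights > 160) & (heights < 180)]    # [175 165 170 162 178 172]
outside = heights[(heights <= 160) | (heights >= 180)]  # [160 155 180 158]
not_tall = heights[~(heights > 160)]                    # [160 155 158]

print(between, outside, not_tall)
```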

Since `heights` and `weights` have the same shape, a mask built from one can be used to filter the other:

In [52]:
weights[filter_tall]

array([70, 77, 58, 68, 54, 75, 72])

### Extra exercises for home

**You try it**

What is the average BMI of people taller than 170 cm?

**You try it**

Try to change the BMI function so that instead of giving you your BMI, it returns a classification (string). The BMI classifications are:

* Below 18.5: Underweight
* 18.5 – 24.9: Healthy Weight
* 25.0 – 29.9: Overweight
* 30.0 and Above: Obesity
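One possible approach to the classification, as a sketch rather than the intended solution, uses `np.digitize` to turn each BMI into a bin number and then looks up the label. The function name `bmi_class` and the constants `LABELS` and `EDGES` are hypothetical, and the sketch assumes the category boundaries sit at exactly 18.5, 25.0 and 30.0, since the list above leaves small gaps such as 24.9 to 25.0:

```python
import numpy as np

LABELS = np.array(["Underweight", "Healthy Weight", "Overweight", "Obesity"])
EDGES = [18.5, 25.0, 30.0]  # assumed boundaries between the four categories

def bmi_class(h, w):
    """Hypothetical helper: BMI category for heights in cm and weights in kg."""
    values = w / (h / 100) ** 2
    # np.digitize gives 0 below 18.5, 1 for [18.5, 25), 2 for [25, 30), 3 from 30 up
    return LABELS[np.digitize(values, EDGES)]

sample = bmi_class(np.array([160, 175, 170, 180]), np.array([55, 70, 95, 55]))
print(sample)
```

Because the lookup is done with array indexing, the function accepts arrays of any shape, including the 2D `heights` and `weights` arrays used earlier.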