<a href="https://colab.research.google.com/github/dymiyata/intro-to-ml-and-ai-2025-2026/blob/main/intro_to_numpy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NumPy

NumPy (Numerical Python) is a Python library for working with numerical data.

Let's import it.

In [124]:
import numpy as np

The fundamental data structure that NumPy uses to store and manipulate data is a NumPy array. We can create one from a Python list by using `np.array(some_list)`.

Unlike Python lists, NumPy arrays are *true* arrays in the computer science sense. Specifically:
- All elements of the array must have the same data type (e.g. `int64`, `float64`, `bool_`, `str_`, etc.)
  - NumPy will try to automatically detect the data type for elements of a given array
  - You can also manually specify it and NumPy will try its best to convert to the specified data type.
  - To specify the data type, add the `dtype` argument to the `np.array` function
  - To check the data type, you can run `array.dtype`
- The array is stored in a contiguous block of memory with only enough space to store the array itself.
  - This means if you want to add elements to the array (i.e. make the array bigger), you need to create a completely new array.
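
Here is a minimal sketch of the points above. The variable names (`int_arr`, `float_arr`, `converted`, `bigger`) are just made up for illustration, and `np.append` is only one of several ways to build a larger array.

```python
import numpy as np

# NumPy infers the data type from the contents of the list.
int_arr = np.array([1, 2, 3])
float_arr = np.array([1, 2.5, 3])   # one float makes every element a float
print(int_arr.dtype)     # an integer type (int64 on most platforms)
print(float_arr.dtype)   # float64

# Manually specify the data type with the dtype argument.
converted = np.array([1, 2, 3], dtype=np.float64)
print(converted)         # [1. 2. 3.]

# "Adding" an element builds a brand-new array; the original is unchanged.
bigger = np.append(int_arr, 4)
print(int_arr.shape, bigger.shape)   # (3,) (4,)
```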

### Multidimensional Arrays

Sometimes, we want to organize our data in multiple dimensions (like in a table or a matrix). NumPy arrays can be multidimensional!  For now, we will just focus on 1D and 2D arrays, but you can make higher dimensional arrays too.

To create a 2D array, you can again use the `np.array()` function, but this time, input a list of lists (the "rows" of your array).

To check the dimensions of an array, we can use the attribute `array.shape`.  For a 2D array, it is a pair where the first coordinate is the number of rows and the second coordinate is the number of columns.
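
As a small sketch (the name `table` and its values are made up for illustration), a 2D array with 2 rows and 3 columns could be created and inspected like this:

```python
import numpy as np

# Each inner list becomes one row of the 2D array.
table = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
print(table)
print(table.shape)   # (2, 3): 2 rows, 3 columns
```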

### Array Creation Helper Functions

There are some nice helper functions for creating arrays.

We can use `np.zeros(shape)` to create an array of all zeros with the specified shape.

Similarly, we can use `np.ones()` to create an array of all ones.
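
For example, a quick sketch of both helpers (the shapes and variable names here are arbitrary choices for illustration):

```python
import numpy as np

zeros_1d = np.zeros(4)        # 1D array with 4 zeros
zeros_2d = np.zeros((2, 3))   # 2 rows, 3 columns of zeros
ones_2d = np.ones((3, 2))     # 3 rows, 2 columns of ones

print(zeros_1d)
print(zeros_2d)
print(ones_2d)
print(ones_2d.dtype)          # float64, the default for both functions
```

Note that a 2D shape is passed as a single tuple such as `(2, 3)`, not as two separate arguments.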

For other values, we can use `np.full()` in a similar way.  However, we need to specify the fill value as a second argument.

Let's try to make a $4 \times 5$ array where all the entries are 9.
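
One possible way to do this (the name `nines` is just for illustration):

```python
import numpy as np

# First argument: the shape (4 rows, 5 columns). Second argument: the fill value.
nines = np.full((4, 5), 9)
print(nines)
print(nines.shape)   # (4, 5)
```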

With `np.full`, you can even use a list as the specified value to get an array where all of the rows are the same.
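
A short sketch of this, assuming the list has as many entries as the array has columns (the values 7, 8, 9 are arbitrary):

```python
import numpy as np

# The list has 3 entries, so it is used as every row of the 2 x 3 array.
same_rows = np.full((2, 3), [7, 8, 9])
print(same_rows)
# [[7 8 9]
#  [7 8 9]]
```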

We can use `np.arange()` (read as "a-range" not "arrange") to get an array of evenly spaced values.  It behaves much like the Python `range()` function.
- Here we specify a starting value, a stopping value, and a step size.  
- `np.arange()` will create a 1D array starting at the starting value and increasing in steps equal to the step size.  It will stop at the last value which is *strictly smaller* than the stopping value, so the stopping value itself is not included.
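
A few illustrative calls (the particular numbers and names are arbitrary):

```python
import numpy as np

evens = np.arange(0, 10, 2)
print(evens)    # [0 2 4 6 8]  (10 is not included)

# Unlike range(), the step size does not have to be an integer.
halves = np.arange(1, 3, 0.5)
print(halves)   # the values 1.0, 1.5, 2.0, 2.5

# With a single argument, it counts up from 0 in steps of 1, just like range().
count = np.arange(5)
print(count)    # [0 1 2 3 4]
```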

Sometimes, we want evenly spaced values like `arange` gives us but rather than knowing the step size, we instead know *how many* elements we want.  In this case, it's easier to use `np.linspace` instead.  Again, we specify a starting value and a stopping value.  However, now the third argument isn't the step size; it's the number of elements we want.  Also, unlike `arange`, `linspace` includes the stopping value by default.
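
For instance, here is a small sketch that asks for 5 evenly spaced values from 0 to 1 (the name `points` is just for illustration):

```python
import numpy as np

# Start at 0, stop at 1, and return exactly 5 values.
points = np.linspace(0, 1, 5)
print(points)        # [0.   0.25 0.5  0.75 1.  ]
print(len(points))   # 5
```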

### Random elements

Sometimes, we don't want to use real world data because it may not be available or it may be inconvenient to gather.  Instead we might want to simulate some data by randomizing values.  The `np.random` module has many functions to help us do exactly this.

To get a random array of `float` values in the interval $[0, 1)$ from a uniform distribution, we can use `np.random.random()` with the shape of our desired array as an argument.
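
As a sketch, a $2 \times 3$ array of uniform random values could be generated like this. The values will differ on every run, so no output is shown.

```python
import numpy as np

# The shape is passed as a tuple: 2 rows, 3 columns.
uniform_vals = np.random.random((2, 3))
print(uniform_vals)
print(uniform_vals.shape)   # (2, 3)
```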

To get a random array of *normally distributed* values with a given mean and standard deviation, you can use `np.random.normal(mean, st_dev, shape)`.  This will also give an array of floats.
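
Here is a sketch that simulates 1000 values with mean 170 and standard deviation 10. These numbers, and the name `heights`, are made up purely for illustration.

```python
import numpy as np

# 1000 normally distributed values with mean 170 and standard deviation 10.
heights = np.random.normal(170, 10, (1000,))
print(heights[:5])      # the first five simulated values
print(heights.mean())   # should be close to 170
print(heights.std())    # should be close to 10
```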

To get a random array of integers that are uniformly distributed, you can use `np.random.randint(a, b, shape)`.  This will give you integers in the interval $[a, b)$.
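
For example, we could simulate rolling a six-sided die (a made-up scenario for illustration). Since the upper bound is excluded, we pass 7 to allow the value 6.

```python
import numpy as np

# Integers from 1 up to (but not including) 7, in a 3 x 4 array.
rolls = np.random.randint(1, 7, (3, 4))
print(rolls)
print(rolls.min(), rolls.max())   # always between 1 and 6
```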

Sometimes, when randomizing values, you want to *reuse* the same values later.  Or maybe you want someone else to be able to reproduce the same results with the *same* random values that you used.

This can be done with NumPy.
- It turns out that the random values given by the `np.random` module are not truly random; they are *pseudo-random*.  This means they appear random, but they are actually computed by a deterministic algorithm, so anyone who knows the generator's internal state can predict the values.
- If you don't provide a seed, NumPy uses sources of randomness from the operating system to choose one. The **seed** sets the starting state of the *random number generator* (RNG), which then generates your values.
- Two RNGs with the same seed will always generate the same values if you run the exact same commands in the exact same order.
- We can manually create an RNG with a specified seed if we want our values to be reproducible.

In [None]:
rng = np.random.default_rng(2025) # 2025 is the chosen seed

Now let's run our random commands using `rng.<command>` instead of `np.random.<command>`.

In [None]:
print(rng.random((2,3)))
print(rng.random((2,3)))

Watch what happens if we reset the seed and run the same command again.

In [None]:
rng = np.random.default_rng(2025)
print(rng.random((2,3)))
print(rng.random((2,3)))

Note that if you keep running the same command without resetting the seed, you will get different values each time.  After all, we still want our random commands to appear random.

However, if you reset the seed and then run all those commands in exactly the same order, you will get the same values as before.

### Accessing values/slices of an array

Now that we know how to create arrays, how can we access the data from them?

Indexing a 1D array works basically the same as indexing a Python list.

In [None]:
arr = np.array([1,2,3,4])
print(arr[0])
print(arr[2])
print(arr[-1])

If we have a 2D array, then we can access values using two indices.  For example, `array[i,j]` will give the element in row `i` and column `j` (both counted starting from 0).

In [None]:
arr2 = np.array([
    [1,2,3,4],
    [5,6,7,8],
    [9,10,11,12]
])

In [None]:
print(arr2[0,0])
print(arr2[1,3])

As you know, with Python lists you can use `:` to get slices of the list.  This works essentially the same with NumPy arrays. With 2D arrays, you can take slices of each coordinate individually.

In [None]:
arr3 = np.array([
    [2,3,4,5],
    [8,7,6,5],
    [1,3,9,2],
    [0,0,2,4]
])
print(arr3[2,:]) # index 2 row as a 1D array
print(arr3[2:3,:]) # index 2 row as a 2D array
print(arr3[:,1]) # index 1 column as a 1D array
print(arr3[:,1:2]) # index 1 column as a 2D array

In [None]:
print(arr3[:2, :])

In [None]:
print(arr3[1:4, :])

In [None]:
print(arr3[:2, :2])

In [None]:
print(arr3[:,1:3])

# Math with NumPy

One of the main reasons we use NumPy arrays is that they can perform numerical operations **fast**.

Suppose we have two lists of integers and we want to add each entry of the first list to its corresponding entry in the second list and then get a resulting list.

In [None]:
list1 = list(range(10000))
list2 = list(range(0,20000,2))

Let's do it with Python lists.

In [None]:
import time
start = time.time()

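# Add corresponding entries one pair at a time using a list comprehension.
list_sum = [a + b for a, b in zip(list1, list2)]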

end = time.time()
print(end - start)




Now let's do it the faster way with NumPy arrays.

NumPy does this much faster because it *vectorizes* the operation.  Essentially, the loop over the elements runs in optimized, compiled code on a contiguous block of same-typed values, rather than in the Python interpreter one pair at a time. (Vectorized code can use extra memory for intermediate arrays, but more often than not, speed is the limiting factor, not memory.)
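
Here is one way the NumPy version could look. This sketch rebuilds the two lists so that it runs on its own, and the names `array1`, `array2`, and `array_sum` are just for illustration. The exact timing will vary from machine to machine.

```python
import time
import numpy as np

list1 = list(range(10000))
list2 = list(range(0, 20000, 2))

# Convert the lists to arrays once, outside the timed region.
array1 = np.array(list1)
array2 = np.array(list2)

start = time.time()
array_sum = array1 + array2   # element-wise addition, no Python loop needed
end = time.time()

print(end - start)
print(array_sum[:5])   # [ 0  3  6  9 12]
```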

### Broadcasting

Sometimes you may try to add or multiply arrays with different shapes.  If the shapes are compatible, NumPy will use something called *broadcasting* to essentially stretch one array to have the same shape as the other by repeating the existing values. Then it performs the operation as if the arrays had the same shape.

In [None]:
b1 = np.array([5])
b2 = np.array([1,2,3,4])
b3 = np.array([
    [3,4,5,5],
    [9,8,7,6],
    [0,0,0,0]
])

In [None]:
b1 + b2

In [None]:
b1 * b2

In [None]:
b1 * b3

In [None]:
b3 + b2
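
To see what the "stretching" means, here is a sketch that writes out the repetition explicitly with `np.tile` and compares it to the broadcast result. This only illustrates the idea; NumPy does not actually need to build the stretched copy. The arrays are redefined so the example runs on its own.

```python
import numpy as np

b2 = np.array([1, 2, 3, 4])
b3 = np.array([
    [3, 4, 5, 5],
    [9, 8, 7, 6],
    [0, 0, 0, 0]
])

# Broadcasting: b2 (shape (4,)) is treated as if it were repeated for each row of b3.
broadcast_sum = b3 + b2

# The same result, with the repetition written out explicitly.
stretched_b2 = np.tile(b2, (3, 1))   # shape (3, 4): three copies of b2 as rows
explicit_sum = b3 + stretched_b2

print(stretched_b2)
print(np.array_equal(broadcast_sum, explicit_sum))   # True

# Shapes that cannot be stretched to match raise an error.
try:
    b3 + np.array([1, 2, 3])   # 3 values cannot line up with 4 columns
except ValueError as error:
    print("Incompatible shapes:", error)
```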

### Vectors

For multiple linear regression, we had the notion of vectors and the dot product of vectors.  We can think of vectors as 1D NumPy arrays.

In [None]:
vector1 = np.array([3,3,4,4])
vector2 = np.array([2,3,0,1])

The dot product can be performed using the `np.dot()` function and giving the two vectors as arguments.
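
For example, with the two vectors defined above (repeated here so the sketch runs on its own):

```python
import numpy as np

vector1 = np.array([3, 3, 4, 4])
vector2 = np.array([2, 3, 0, 1])

# 3*2 + 3*3 + 4*0 + 4*1 = 19
dot_product = np.dot(vector1, vector2)
print(dot_product)   # 19

# The same value, computed as an element-wise product followed by a sum.
print(np.sum(vector1 * vector2))   # 19
```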

Given a weight vector $\vec{w}$, a feature vector $\vec{x}$, and a bias $b$, we know our prediction function is:
$$\hat{y} = \vec{w} \cdot \vec{x} + b$$

Thus, we can code this as follows:
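
A minimal sketch, assuming the weights and features are 1D arrays of the same length. The function name `predict` and the example numbers are made up for illustration.

```python
import numpy as np

def predict(w, x, b):
    """Return the prediction y_hat = w . x + b for one feature vector x."""
    return np.dot(w, x) + b

# Hypothetical weights, features, and bias.
w = np.array([0.5, -1.0, 2.0])
x = np.array([4.0, 3.0, 1.0])
b = 1.5

# 0.5*4.0 + (-1.0)*3.0 + 2.0*1.0 + 1.5 = 2.5
y_hat = predict(w, x, b)
print(y_hat)   # 2.5
```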