There are three types of cells in JupyterLab.

- Markdown
  - Contain text formatted using the Markdown language. When a Markdown cell is run, it is rendered as rich text, useful for documentation and explanation within the notebook.
- Raw
  - Contain plain text that is ignored by the notebook kernel and is meant for specialized conversion purposes. 
- Code
  - Contain code in the notebook's current programming language.

Let's do some coding

Let's start with a simple exercise. Let us compute the estimated mean of a given array. Suppose you are given ```X = [20, 3, 1 , 3, 6, 4]```. First, in the shell, start Python and type the following.

In [6]:
X = [20, 3, 1 , 3, 6, 4]
total = 0  # named `total` so Python's built-in sum() is not shadowed

for x_i in X:
    total += x_i

mean = total / len(X)
mean

6.166666666666667

Okay, now let's try to abstract it by creating a function.


In [7]:
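def mean(X):
    """Return the arithmetic mean of the values in X."""
    total = 0  # running sum; named `total` so the built-in sum() is not shadowed
    for x_i in X:
        total += x_i
    average = total / len(X)
    return average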

X = [20, 3, 1 , 3, 6, 4]
mean(X)

6.166666666666667

Okay, now observe that for the semester, we are going to work with vectors and matrices. After a while it will become harder to manipulate them with loops.
So we are going to use tools to make our lives easier. First we will use NumPy.


Let's install numpy

```pip install numpy```

Type this in your command line, and install numpy.

In [9]:
!pip install numpy

Note: you may need to restart the kernel to use updated packages.


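Inside a notebook cell, you usually need ```!``` before a shell command.

However, in some JupyterLab setups, the ```%automagic``` setting is ON by default. This allows you to run "line magic" commands (like ```%pip``` or ```%ls```) without typing the leading percent sign. The ```!``` prefix is a separate mechanism that passes the rest of the line to the system shell.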

In [14]:
import numpy as np
X = np.array([20, 3, 1 , 3, 6, 4])
type(X)
x_hat = X.sum()/X.size
x_hat
print(x_hat)

6.166666666666667


Or we can just use the mean function directly.



In [15]:
X.mean()

np.float64(6.166666666666667)

A nice thing about NumPy is that it has built-in functions for descriptive statistics.



In [16]:
X.max()

np.int64(20)

In [17]:
X.min()

np.int64(1)

In [18]:
X.std() # computes the population standard deviation

np.float64(6.361778227997438)

OKAY! We learnt a lot about matrices in the last few lectures. Let's try to see whether we can use numpy to handle matrices. Let's create a matrix of 6 rows and 4 columns of some random data to play with.

https://numpy.org/doc/stable/reference/random/index.html


In [32]:
rng = np.random.default_rng()
print(rng)
D = rng.random((6,4))
D

Generator(PCG64)


array([[0.5349226 , 0.69309516, 0.87171868, 0.78779846],
       [0.14570116, 0.74805901, 0.25814859, 0.1518063 ],
       [0.74635542, 0.43197683, 0.71462576, 0.83674622],
       [0.59261204, 0.34057936, 0.74391014, 0.39248078],
       [0.78545328, 0.09369396, 0.66516347, 0.20415571],
       [0.62277495, 0.33512718, 0.03862922, 0.8275298 ]])

We can pass a seed to reproduce the random values.

In [43]:
rng = np.random.default_rng(seed=42)
D = rng.random((6,4))
D

array([[0.77395605, 0.43887844, 0.85859792, 0.69736803],
       [0.09417735, 0.97562235, 0.7611397 , 0.78606431],
       [0.12811363, 0.45038594, 0.37079802, 0.92676499],
       [0.64386512, 0.82276161, 0.4434142 , 0.22723872],
       [0.55458479, 0.06381726, 0.82763117, 0.6316644 ],
       [0.75808774, 0.35452597, 0.97069802, 0.89312112]])
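To make the role of the seed concrete, here is a minimal sketch. It assumes two independent generators (the names `rng_a` and `rng_b` are only for illustration) created with the same seed, and checks that they produce the same values.

```python
import numpy as np

# Two separate generators built from the same seed.
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

A = rng_a.random((6, 4))
B = rng_b.random((6, 4))

# Same seed, same first draw.
print(np.array_equal(A, B))

# A second draw from the same generator continues the stream,
# so it differs from the first draw.
C = rng_a.random((6, 4))
print(np.array_equal(A, C))
```

In other words, to reproduce a result you need to recreate the generator with the seed, not just call `random` again.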

We can draw random values from different distributions as well. 

For example, here we draw samples from a binomial distribution where the number of trials is **n** and the probability of success is **p**.

In [44]:
rng = np.random.default_rng()
n, p = 10, .5
D_Bi = rng.binomial(n, p, (6,5))
print(D_Bi)

[[5 5 7 5 3]
 [5 5 5 5 6]
 [6 1 2 6 6]
 [4 4 4 6 4]
 [5 4 6 1 4]
 [7 3 6 3 4]]


Here we use the normal distribution.

In [45]:
rng = np.random.default_rng()
mu, sigma = 10, .68
D_normal = rng.normal(mu, sigma, (6,5))
print(D_normal)

[[ 9.90781586 10.62971072  9.60384361  9.40352844  9.192217  ]
 [ 9.70818582 10.64025881  9.46715852 11.96206     9.49264573]
 [10.17678743 11.1711988  10.67861067 10.50468871  9.10374335]
 [ 9.76235691 10.12311106  9.08056973 10.81053199  9.29543543]
 [10.78089186  9.44047067 10.39926923  9.86358791 10.05861908]
 [ 9.0622119   8.67245264 10.71092621 10.3596665  10.33367097]]
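As a rough sanity check on these two distributions, the sketch below draws a large sample from each and compares the sample statistics with the parameters we passed in. The seed and the sample size of 100,000 are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for repeatability

n, p = 10, 0.5        # binomial parameters used above
mu, sigma = 10, 0.68  # normal parameters used above

binom = rng.binomial(n, p, 100_000)
normal = rng.normal(mu, sigma, 100_000)

# For a binomial distribution the expected value is n * p.
print(binom.mean(), n * p)

# For a normal distribution the sample mean and standard deviation
# should be close to mu and sigma.
print(normal.mean(), mu)
print(normal.std(), sigma)
```

The printed pairs will not match exactly, but with this many samples they should agree to a couple of decimal places.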


Let's return to NumPy arrays.

It is often useful to create a NumPy matrix with hard-coded values, especially for testing and debugging.

Here is how to create a NumPy ndarray from nested lists:

In [47]:
# Nested array example
N = np.array([[1,2,3], [4,5,6], [7,8,9]])
print(type(N))
N

<class 'numpy.ndarray'>


array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

Let us look at the shape of the data.

In [48]:
N.shape

(3, 3)

What if we try passing different size arrays to ```np.array```?

In [49]:
N_1 = np.array([[1, 2, 3, 9],
                 [4, 5, 6],
                 [7, 8, 9]])
N_1

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.

### How the Validation Works
When you call ```np.array()```, NumPy performs the following internal checks:
- Recursive Depth Check: It scans the nested structure to determine the "depth" of your lists. For [[1,2,3], [4,5,6]], it identifies two levels of depth (a 2D structure).
- Shape Consistency: It iterates through each nested sequence to verify that all sub-lists at the same level have the same number of elements. If you pass [[1, 2], [3, 4, 5]], NumPy immediately detects that the first row has length 2 and the second has length 3.

Let us try to access a value:

```D[0,1]```

Note that ```D``` is 0 indexed.

First, let's print ```D``` again.


In [50]:
D

array([[0.77395605, 0.43887844, 0.85859792, 0.69736803],
       [0.09417735, 0.97562235, 0.7611397 , 0.78606431],
       [0.12811363, 0.45038594, 0.37079802, 0.92676499],
       [0.64386512, 0.82276161, 0.4434142 , 0.22723872],
       [0.55458479, 0.06381726, 0.82763117, 0.6316644 ],
       [0.75808774, 0.35452597, 0.97069802, 0.89312112]])

In [55]:
D[0,1]
print(D[0,1])

0.4388784397520523


In [56]:
D[4,2]

np.float64(0.8276311719925821)

In [57]:
D[6,1]

IndexError: index 6 is out of bounds for axis 0 with size 6

Index 6 is out of bounds because ```D``` has only 6 rows, so the valid row indices are 0 through 5.

Variable ```D``` stores the resulting 2D array.

Okay, now let's move on to slicing! This is a very powerful technique, and we can do a lot of fun stuff with
[slicing and dicing (_**See User Docs**_)](https://numpy.org/doc/stable/user/quickstart.html#indexing-slicing-and-iterating),
but today we will just do some basics.

First, let us try to extract the first column of ```D``` by looping. Try this by yourself!


In [59]:
print(D)
column_0_loop = []
for row in D:
    column_0_loop.append(row[0])

print(column_0_loop)

[[0.77395605 0.43887844 0.85859792 0.69736803]
 [0.09417735 0.97562235 0.7611397  0.78606431]
 [0.12811363 0.45038594 0.37079802 0.92676499]
 [0.64386512 0.82276161 0.4434142  0.22723872]
 [0.55458479 0.06381726 0.82763117 0.6316644 ]
 [0.75808774 0.35452597 0.97069802 0.89312112]]
[np.float64(0.7739560485559633), np.float64(0.09417734788764953), np.float64(0.12811363267554587), np.float64(0.6438651200806645), np.float64(0.5545847870158348), np.float64(0.7580877400853738)]


In [60]:
column_0 = [row[0] for row in D]
column_0

[np.float64(0.7739560485559633),
 np.float64(0.09417734788764953),
 np.float64(0.12811363267554587),
 np.float64(0.6438651200806645),
 np.float64(0.5545847870158348),
 np.float64(0.7580877400853738)]

Both of these approaches are either cumbersome or hard to read.

Let's use numpy to access the first column.

In [61]:
D

array([[0.77395605, 0.43887844, 0.85859792, 0.69736803],
       [0.09417735, 0.97562235, 0.7611397 , 0.78606431],
       [0.12811363, 0.45038594, 0.37079802, 0.92676499],
       [0.64386512, 0.82276161, 0.4434142 , 0.22723872],
       [0.55458479, 0.06381726, 0.82763117, 0.6316644 ],
       [0.75808774, 0.35452597, 0.97069802, 0.89312112]])

In [62]:
X0 = D[:, 0]
X0

array([0.77395605, 0.09417735, 0.12811363, 0.64386512, 0.55458479,
       0.75808774])

In [63]:
X1 = D[:, 1]
X1

array([0.43887844, 0.97562235, 0.45038594, 0.82276161, 0.06381726,
       0.35452597])

In [64]:
X2 = D[:, 2]
X2

array([0.85859792, 0.7611397 , 0.37079802, 0.4434142 , 0.82763117,
       0.97069802])

Key points in this expression:
- ```D``` is the NumPy array, a 2-dimensional structure (like a matrix) whose elements can be accessed using row and column indices.
- ```:``` is the slicing operator; in the first position it means "select all rows".
- ```0``` is the column index; here it selects column 0, the first column.
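The following sketch checks that the slice gives the same values as the loop version. It rebuilds `D` from the same seed used earlier so that it runs on its own, and it also shows one detail that is easy to miss: indexing with a single integer drops that axis, while a slice of width one keeps it.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
D = rng.random((6, 4))

col_loop = np.array([row[0] for row in D])  # loop / comprehension version
col_slice = D[:, 0]                         # slicing version

print(np.array_equal(col_loop, col_slice))

print(col_slice.shape)   # (6,)   a 1-D vector
print(D[:, 0:1].shape)   # (6, 1) still a 2-D column
```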

What if we want to select multiple columns?

In [66]:
print(D)
X02 = D[:, [2,0]]
X02

[[0.77395605 0.43887844 0.85859792 0.69736803]
 [0.09417735 0.97562235 0.7611397  0.78606431]
 [0.12811363 0.45038594 0.37079802 0.92676499]
 [0.64386512 0.82276161 0.4434142  0.22723872]
 [0.55458479 0.06381726 0.82763117 0.6316644 ]
 [0.75808774 0.35452597 0.97069802 0.89312112]]


array([[0.85859792, 0.77395605],
       [0.7611397 , 0.09417735],
       [0.37079802, 0.12811363],
       [0.4434142 , 0.64386512],
       [0.82763117, 0.55458479],
       [0.97069802, 0.75808774]])

In [67]:
D

array([[0.77395605, 0.43887844, 0.85859792, 0.69736803],
       [0.09417735, 0.97562235, 0.7611397 , 0.78606431],
       [0.12811363, 0.45038594, 0.37079802, 0.92676499],
       [0.64386512, 0.82276161, 0.4434142 , 0.22723872],
       [0.55458479, 0.06381726, 0.82763117, 0.6316644 ],
       [0.75808774, 0.35452597, 0.97069802, 0.89312112]])

Note that ```X02 = D[:, [2,0]]``` is interpreted by numpy as follows:


- ```:``` means slice all rows (basic slicing).
- ```[2, 0]``` is a list of column indices (advanced indexing).
- NumPy interprets this as: "For all rows, select columns 2 and 0, in that order."
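To see that the order of the index list controls the order of the columns in the result, here is a small sketch on a hand-made 3x3 array (chosen instead of `D` so the values are easy to read).

```python
import numpy as np

N = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Column 2 comes first in the result, then column 0.
picked = N[:, [2, 0]]
print(picked)
# [[3 1]
#  [6 4]
#  [9 7]]
```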

How about taking a subset of rows with all the columns?

In [69]:
print(D)
D[[0,1,2], :]

[[0.77395605 0.43887844 0.85859792 0.69736803]
 [0.09417735 0.97562235 0.7611397  0.78606431]
 [0.12811363 0.45038594 0.37079802 0.92676499]
 [0.64386512 0.82276161 0.4434142  0.22723872]
 [0.55458479 0.06381726 0.82763117 0.6316644 ]
 [0.75808774 0.35452597 0.97069802 0.89312112]]


array([[0.77395605, 0.43887844, 0.85859792, 0.69736803],
       [0.09417735, 0.97562235, 0.7611397 , 0.78606431],
       [0.12811363, 0.45038594, 0.37079802, 0.92676499]])

What about a subset of rows together with a subset of columns, for example rows 0, 1, 2 and columns 1, 2?
You **cannot** use something like:

```D[[0,1,2], [1,2]]```

In [70]:
D[[0,1,2], [1,2]] # Numpy uses something called advanced indexing, A.K.A. Fancy indexing.

IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,) 

```D[[0,1,2], [1,2]]```

This is interpreted as pairing the two index lists element by element.

```(0, 1) → ``` Element at row 0, column 1

```(1, 2) → ``` Element at row 1, column 2

```(2, ?) → ``` Error: there is no third column index to pair with row 2, so the shapes (3,) and (2,) do not match

But you can do something like the following:


In [73]:
print(D)
D[[0,1],[3,2]]

[[0.77395605 0.43887844 0.85859792 0.69736803]
 [0.09417735 0.97562235 0.7611397  0.78606431]
 [0.12811363 0.45038594 0.37079802 0.92676499]
 [0.64386512 0.82276161 0.4434142  0.22723872]
 [0.55458479 0.06381726 0.82763117 0.6316644 ]
 [0.75808774 0.35452597 0.97069802 0.89312112]]


array([0.69736803, 0.7611397 ])

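This returns the elements at indices ```(0,3)``` and ```(1,2)```: the two lists are paired element by element.

This is a bit surprising if you expected a 2-by-2 block.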
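The pairing rule can be checked directly on a small hand-made array (again a stand-in for `D`, used only so the numbers are easy to follow).

```python
import numpy as np

N = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Two index lists of equal length are paired: (0, 2) and (1, 1).
paired = N[[0, 1], [2, 1]]
print(paired)  # [3 5]

# The same result, written out one element at a time.
manual = np.array([N[0, 2], N[1, 1]])
print(np.array_equal(paired, manual))
```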

In [50]:
D

array([[0.77395605, 0.43887844, 0.85859792, 0.69736803],
       [0.09417735, 0.97562235, 0.7611397 , 0.78606431],
       [0.12811363, 0.45038594, 0.37079802, 0.92676499],
       [0.64386512, 0.82276161, 0.4434142 , 0.22723872],
       [0.55458479, 0.06381726, 0.82763117, 0.6316644 ],
       [0.75808774, 0.35452597, 0.97069802, 0.89312112]])

In [51]:
D[:, [1,2]]

array([[0.43887844, 0.85859792],
       [0.97562235, 0.7611397 ],
       [0.45038594, 0.37079802],
       [0.82276161, 0.4434142 ],
       [0.06381726, 0.82763117],
       [0.35452597, 0.97069802]])

But when it comes to selecting a subset of rows and a subset of columns at the same time, it becomes a bit tricky: you cannot use the same trick that we used earlier.

Let's look at how we do this...

First define the lists of rows and columns that need to be in the result, and then use ```np.ix_```

In [52]:
rows = [0, 2, 3]  # Rows to select
cols = [1, 3]  # Columns to select

print(D[np.ix_(rows, cols)])
print("--------------------")
print(D)

[[0.43887844 0.69736803]
 [0.45038594 0.92676499]
 [0.82276161 0.22723872]]
--------------------
[[0.77395605 0.43887844 0.85859792 0.69736803]
 [0.09417735 0.97562235 0.7611397  0.78606431]
 [0.12811363 0.45038594 0.37079802 0.92676499]
 [0.64386512 0.82276161 0.4434142  0.22723872]
 [0.55458479 0.06381726 0.82763117 0.6316644 ]
 [0.75808774 0.35452597 0.97069802 0.89312112]]
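One way to build intuition for ```np.ix_``` is to compare it with two other ways of getting the same block. The sketch below uses a small integer array as a stand-in for `D`; the two alternatives (indexing in two steps, and reshaping the row indices into a column) are shown for comparison only.

```python
import numpy as np

N = np.arange(1, 25).reshape(6, 4)  # 6x4 array holding 1..24

rows = [0, 2, 3]
cols = [1, 3]

block = N[np.ix_(rows, cols)]

# Alternative 1: pick the rows first, then the columns.
two_step = N[rows][:, cols]

# Alternative 2: make the row indices a (3, 1) column so that
# broadcasting pairs every row with every column.
manual = N[np.array(rows)[:, None], np.array(cols)]

print(block)
# [[ 2  4]
#  [10 12]
#  [14 16]]
```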


Now, let us look at the dot product.

In [53]:
np.dot(X0, X2)

np.float64(2.264068319011254)

We can even get values for the different norms that we talked about.  

The [np.linalg package](https://numpy.org/doc/stable/reference/routines.linalg.html)
has tons of functionality, but we will just use the norm for now.


In [54]:
d1 = np.linalg.norm(X2-X0, 1)
d2 = np.linalg.norm(X2-X0, 2)
print(d1, d2)

1.680396207554768 0.8190462595339381
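To connect ```np.linalg.norm``` with the definitions, the sketch below computes the 1-norm and 2-norm by hand for a small example vector (the vector `u` is made up for illustration) and compares them with the library results.

```python
import numpy as np

u = np.array([3.0, -4.0])

# 1-norm: sum of absolute values.
l1_manual = np.abs(u).sum()

# 2-norm: square root of the dot product of the vector with itself.
l2_manual = np.sqrt(np.dot(u, u))

print(l1_manual, np.linalg.norm(u, 1))  # 7.0 7.0
print(l2_manual, np.linalg.norm(u, 2))  # 5.0 5.0
```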


And, we can also scale by a number!  So to scale our vector X1 by a factor of 4, we can multiply
by 4!

In [55]:
4*X1

array([1.75551376, 3.90248941, 1.80154375, 3.29104645, 0.25526902,
       1.41810387])

So, let's normalize our vector.  That is, make our vector length 1 (in the
2-norm).

In [56]:
two_norm_of_X0 = np.linalg.norm(X0, 2)
normalized_X0 = X0 / two_norm_of_X0
print(normalized_X0)

[0.55839269 0.06794694 0.09243124 0.46453488 0.40012103 0.54694405]


The above code is an example of what NumPy (and now other libraries in the
Python world) call [broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html).
The idea behind broadcasting is that the system "stretches" objects to higher
dimensional objects so that operations can occur in a meaningful way. Here the scalar ```two_norm_of_X0``` is stretched to the length of ```X0```, so every entry is divided by the same number.
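As a minimal sketch of this "stretching", the example below normalizes a small made-up vector twice: once relying on broadcasting of the scalar, and once with the scalar copied into a full array by hand. Both give the same result.

```python
import numpy as np

v = np.array([3.0, 4.0])
length = np.linalg.norm(v)  # 5.0

# Broadcasting: the scalar is applied to every entry.
unit = v / length

# The same thing with the stretching done explicitly.
explicit = v / np.full(v.shape, length)

print(unit)                  # [0.6 0.8]
print(np.linalg.norm(unit))  # 1.0 up to rounding
```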

An easy way to create a matrix of random integers:

In [57]:
# Generator.integers(low, high, size, dtype)
rng = np.random.default_rng(seed = 42)
I = rng.integers(-2, 20, (5,7), dtype=np.int32)
print(I)

[[-1 15 12  7  7 16 -1]
 [13  2  0  9 19 14 14]
 [13 15  9  0 16  7  9]
 [ 6  2 18 15 12  6 16]
 [ 9  7  7  2  0 10 17]]


We could subtract a vector from a matrix

```D - np.array([1,2,3])```

or add a vector to a matrix

```D + np.array([1,2,3])```

but only when the shapes are compatible. Try the following and see what happens.


In [58]:
a = np.array([1,2,3])
I - a
I + a

ValueError: operands could not be broadcast together with shapes (5,7) (3,) 

In [59]:
D.shape

(6, 4)

However, if we try something like:

In [60]:
b = np.array([[1,2,3,4,5,6,7]])
I - b

array([[-2, 13,  9,  3,  2, 10, -8],
       [12,  0, -3,  5, 14,  8,  7],
       [12, 13,  6, -4, 11,  1,  2],
       [ 5,  0, 15, 11,  7,  0,  9],
       [ 8,  5,  4, -2, -5,  4, 10]])

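Here ```b``` has shape (1, 7): its 7 columns match the 7 columns of ```I```, and its single row is broadcast across all 5 rows. The earlier vector of length 3 failed because 3 does not match 7.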
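The sketch below collects the three cases on a stand-in 5x7 array (built with `np.arange` rather than random values): a vector that matches the columns, a column vector that matches the rows, and a vector that matches neither.

```python
import numpy as np

I = np.arange(35).reshape(5, 7)   # stand-in 5x7 matrix

row_vec = np.arange(7)            # shape (7,): one value per column
col_vec = np.arange(5)[:, None]   # shape (5, 1): one value per row

print((I - row_vec).shape)  # (5, 7), row_vec is reused for every row
print((I - col_vec).shape)  # (5, 7), col_vec is reused for every column

# A length-3 vector matches neither 5 nor 7, so broadcasting fails.
try:
    I - np.array([1, 2, 3])
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", err)
```

So a vector can be broadcast along either axis, as long as it is shaped so that NumPy can tell which axis it lines up with.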

Let's look at the mean function again. Try the following and see what they mean.


In [61]:
I

array([[-1, 15, 12,  7,  7, 16, -1],
       [13,  2,  0,  9, 19, 14, 14],
       [13, 15,  9,  0, 16,  7,  9],
       [ 6,  2, 18, 15, 12,  6, 16],
       [ 9,  7,  7,  2,  0, 10, 17]], dtype=int32)

In [62]:
I.mean()

np.float64(9.2)

In [63]:
I.mean(axis=0) # collapses the rows (axis 0): one mean per column

array([ 8. ,  8.2,  9.2,  6.6, 10.8, 10.6, 11. ])

In [64]:
I.mean(axis=1) # collapses the columns (axis 1): one mean per row

array([ 7.85714286, 10.14285714,  9.85714286, 10.71428571,  7.42857143])

In [65]:
I

array([[-1, 15, 12,  7,  7, 16, -1],
       [13,  2,  0,  9, 19, 14, 14],
       [13, 15,  9,  0, 16,  7,  9],
       [ 6,  2, 18, 15, 12,  6, 16],
       [ 9,  7,  7,  2,  0, 10, 17]], dtype=int32)

We can use broadcasting for all sorts of applications.  Use the `mean` function
and broadcasting to mean center your data:


In [66]:
D-D.mean(axis=0)

array([[ 0.28182527, -0.07878682,  0.15321808,  0.00366443],
       [-0.39795343,  0.45795709,  0.05575986,  0.09236071],
       [-0.36401715, -0.06727932, -0.33458182,  0.23306139],
       [ 0.15173434,  0.30509635, -0.26196564, -0.46646487],
       [ 0.06245401, -0.45384801,  0.12225133, -0.0620392 ],
       [ 0.26595696, -0.16313929,  0.26531818,  0.19941753]])

You can take the transpose of your matrix as follows:

In [67]:
rng = np.random.default_rng()
P = rng.integers(0, 20, (4,3), dtype=np.int32)
print(P)
print("-------------")
print(P.transpose())

[[11 13 18]
 [15  9 10]
 [11 19 15]
 [14  4  4]]
-------------
[[11 15 11 14]
 [13  9 19  4]
 [18 10 15  4]]


While I don't have a great reason why you would do this, it illustrates a
useful function: write some code to mean center along the rows.  Check out the docs
for [mean](https://numpy.org/doc/stable/reference/generated/numpy.mean.html) and
[transpose](https://numpy.org/doc/stable/reference/generated/numpy.transpose.html)
for some help.

In [68]:
(P.transpose()-P.mean(axis=1)).transpose()

array([[-3.        , -1.        ,  4.        ],
       [ 3.66666667, -2.33333333, -1.33333333],
       [-4.        ,  4.        ,  0.        ],
       [ 6.66666667, -3.33333333, -3.33333333]])
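An alternative to the double transpose, offered here as a sketch rather than as the intended solution, is the `keepdims=True` argument of `mean`. It keeps the row means as a column of shape (4, 1), which broadcasts across the columns directly. The matrix below copies the values of `P` printed above so the example runs on its own.

```python
import numpy as np

P = np.array([[11, 13, 18],
              [15,  9, 10],
              [11, 19, 15],
              [14,  4,  4]])

# Version from the cell above: transpose, subtract, transpose back.
via_transpose = (P.transpose() - P.mean(axis=1)).transpose()

# Version with keepdims: the means have shape (4, 1) instead of (4,).
via_keepdims = P - P.mean(axis=1, keepdims=True)

print(np.allclose(via_transpose, via_keepdims))
print(via_keepdims.sum(axis=1))  # each centered row sums to about zero
```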

Okay, now let us move to some statistics. Let us compute the variance of ```X0```.

In [69]:
var_x0 = np.sum((X0-X0.mean())**2) / (X0.size-1)
print(var_x0)

0.09359156344855504


Okay, what if we use ```np.var(X0)```?

In [70]:
np.var(X0)

np.float64(0.07799296954046253)

What's going on here? Did we make a mistake? Let's check the docs.

[np.var](https://numpy.org/doc/stable/reference/generated/numpy.var.html)

There is an optional argument called ddof. By default **ddof** is **zero**, which means ```np.var``` divides by n and computes the population variance. But the formula that we learnt in class divides by n - 1, which gives an unbiased estimator of the variance.
So, let's use that.

In [71]:
np.var(X0, ddof =1)

np.float64(0.09359156344855504)
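The two variances differ only in the divisor, so one can be converted into the other by a factor of n / (n - 1). The sketch below checks this on the small integer array from the start of the notebook, which is used here instead of `X0` so the example runs on its own.

```python
import numpy as np

X = np.array([20, 3, 1, 3, 6, 4])
n = X.size

pop_var = np.var(X)             # ddof=0: divide by n
sample_var = np.var(X, ddof=1)  # ddof=1: divide by n - 1

# Sum of squared deviations from the mean, shared by both formulas.
ss = ((X - X.mean()) ** 2).sum()

print(pop_var, ss / n)
print(sample_var, ss / (n - 1))
print(np.isclose(sample_var, pop_var * n / (n - 1)))
```

The same `ddof` argument is accepted by `np.std` and by the `std` and `var` array methods.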

Let's do an exercise.
1.  Create a matrix M of 10 rows and 8 columns and fill it with random values (either integers or floats; pick any distribution you like to generate the random values).
2.  Calculate the multidimensional mean (represent it as a vector).
3.  Calculate the min, max, variance, and standard deviation of each attribute.
4.  Create matrix R with range-normalized data.
5.  Create matrix Z with z-scores.