# Python workshop - 2025

**Last update**: 2025-05-19  
**Author**: El-Amine Mimouni  
**Affiliation**: Québec Centre for Biodiversity Science

**Overview**: In this notebook, we will see how to work with NumPy arrays.

---

# NumPy

NumPy (Numerical Python) is the core library for numerical and scientific computing in Python. It is the computational workhorse behind many of the computations done in Python, providing multi-dimensional arrays and a wide range of mathematical operations on them.

If you want to learn more about it, visit [https://numpy.org/](https://numpy.org/).

In [159]:
# Import numpy
import numpy as np

# Creating Arrays

A NumPy array is similar to a mathematical vector or a matrix. It provides a powerful way to work with numerical data in Python. NumPy arrays are highly efficient for performing mathematical operations and offer more flexibility compared to regular Python lists. Arrays in NumPy can be constructed in many ways.
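
As a small illustration of the "many ways" mentioned above, the sketch below builds arrays without typing every value by hand. The variable names are only illustrative; `np.arange()`, `np.linspace()` and `np.full()` are standard NumPy constructors.

```python
import numpy as np

# Evenly spaced integers from 0 up to (but not including) 10, in steps of 2
evens = np.arange(0, 10, 2)
print(evens)

# Five evenly spaced values between 0 and 1, both ends included
grid = np.linspace(0.0, 1.0, num=5)
print(grid)

# A 2x3 array filled with a constant value
sevens = np.full((2, 3), 7)
print(sevens.shape)
```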

In [28]:
# Create a 1-dimensional array (i.e. a vector)
arr_1d = np.array(object=[1, 2, 3, 4, 5])
print(arr_1d)
print(type(arr_1d))
#
print("-" * 50)
# Important attributes
print(arr_1d.shape)
print(arr_1d.dtype)

[1 2 3 4 5]
<class 'numpy.ndarray'>
--------------------------------------------------
(5,)
int64


In [29]:
# Create a 2-dimensional array (i.e. a matrix)
# Note the fact that you are giving each row as a list in a list [[row1], [row2]].
arr_2d = np.array(object=[[1.2, 2.5, 3.1], [4.8, 5.1, 6.5]])
print(arr_2d)
print(type(arr_2d))

#
print("-" * 50)
# Important attributes
print(arr_2d.shape)
print(arr_2d.dtype)

[[1.2 2.5 3.1]
 [4.8 5.1 6.5]]
<class 'numpy.ndarray'>
--------------------------------------------------
(2, 3)
float64


In [59]:
# Unlike lists, NumPy arrays apply the basic arithmetic operators element by element
print("The result of multiplying every value of arr_1d by 2:")
print(arr_1d * 2)
#
print("-" * 20)
#
print("\nThe result of adding 5.8 to every value of arr_2d:")
print(arr_2d + 5.8)

The result of multiplying every value of arr_1d by 2:
[ 2  4  6  8 10]
--------------------

The result of adding 5.8 to every value of arr_2d:
[[ 7.   8.3  8.9]
 [10.6 10.9 12.3]]


In [31]:
# If you have doubts, create the list-equivalent of arr_1d
list_1d = [1, 2, 3, 4, 5]
#
print(list_1d)
print(arr_1d)
#
print("-" * 50)

# Both define their own __add__ method, which is why + behaves differently for each
print(list_1d.__add__)
print(arr_1d.__add__)

[1, 2, 3, 4, 5]
[1 2 3 4 5]
--------------------------------------------------
<method-wrapper '__add__' of list object at 0x000002BA1BF24640>
<method-wrapper '__add__' of numpy.ndarray object at 0x000002BA1BEE7450>
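
To make the difference concrete, here is a minimal sketch comparing what `*` and `+` actually do on a list and on an array holding the same values. The variable names for the results are illustrative only.

```python
import numpy as np

list_1d = [1, 2, 3, 4, 5]
arr_1d = np.array(list_1d)

# For lists, * repeats the list and + concatenates two lists
list_times_2 = list_1d * 2
list_plus_list = list_1d + list_1d

# For arrays, both operators act element by element
arr_times_2 = arr_1d * 2
arr_plus_arr = arr_1d + arr_1d

print(list_times_2)
print(arr_times_2)
print(list_plus_list)
print(arr_plus_arr)
```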


# Slicing and accessing elements

Slicing works as with conventional Python lists, with one index (or slice) per dimension, separated by commas.
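
The short sketch below puts a list and an array side by side to show the shared syntax, and the one place where they differ (the comma for 2-D arrays). The names `values`, `nested` and `small_2d` are made up for this example.

```python
import numpy as np

values = [10, 20, 30, 40, 50]
arr = np.array(values)

# Same syntax, same elements selected
print(values[1:4], arr[1:4])
print(values[-1], arr[-1])

# With a nested list, the row and the column each need their own brackets;
# a 2-D array takes both indices in one bracket, separated by a comma
nested = [[1, 2, 3], [4, 5, 6]]
small_2d = np.array(nested)
print(nested[1][2], small_2d[1, 2])
```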

In [80]:
# The most general form
# Select everything
print(arr_2d[:, :])

[[1.2 2.5 3.1]
 [4.8 5.1 6.5]]


In [32]:
# Select a single row (all columns)
# or a single column (all rows)
print("The first row of arr_2d:")
print(arr_2d[0, :])

#
print("-" * 50)
print("The third column of arr_2d:")
print(arr_2d[:, 2])


The first row of arr_2d:
[1.2 2.5 3.1]
--------------------------------------------------
The third column of arr_2d:
[3.1 6.5]


In [63]:
# For rows, the trailing : and even the comma can be omitted,
# but I recommend keeping them for clarity,
# since they clearly show the number of dimensions of your array
print("The first row of arr_2d:")
print(arr_2d[0, :])
print("\n")
print(arr_2d[0,])
print("\n")
print(arr_2d[0])

The first row of arr_2d:
[1.2 2.5 3.1]


[1.2 2.5 3.1]


[1.2 2.5 3.1]


In [64]:
# If you want to select a range of rows or columns, use the colon :
# The stop index is excluded, so 0:2 selects rows 1 and 2
print("Rows 1 to 2 of arr_2d:")
print(arr_2d[0:2, :])
#
print("\n")
#
print("Columns 2 to 3 of arr_2d:")
print(arr_2d[:, 1:3])

Rows 1 to 2 of arr_2d:
[[1.2 2.5 3.1]
 [4.8 5.1 6.5]]


Columns 2 to 3 of arr_2d:
[[2.5 3.1]
 [5.1 6.5]]


In [34]:
# If you want particular rows or columns, you can give their indices as a list:
print("Rows 1 to 2, and columns 1 and 3 of arr_2d:")
print(arr_2d[0:2, [0, 2]])


Rows 1 to 2, and columns 1 and 3 of arr_2d:
[[1.2 3.1]
 [4.8 6.5]]


# Important methods

In [9]:
# Each array has the usual summary methods,
# such as .mean(), .min(), .max()

# Note: axis=None is the default, so it can be omitted.
print("\nGrand mean of arr_2d:")
print(arr_2d.mean(axis=None))
#
print("\nColumn means of arr_2d:")
print(arr_2d.mean(axis=0))
#
print("\nRow means of arr_2d:")
print(arr_2d.mean(axis=1))

Grand mean of arr_2d:
3.866666666666667
Column means of arr_2d:
[3.  3.8 4.8]
Row means of arr_2d:
[2.26666667 5.46666667]


In [70]:
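# These methods can also be called in the usual procedural (function) style
print("Column means of arr_2d with the .mean() method:")
print(arr_2d.mean(axis=0))

print("\nColumn means of arr_2d with the np.mean() function:")
print(np.mean(a=arr_2d, axis=0))

Column means of arr_2d with the .mean() method:
[3.  3.8 4.8]

Column means of arr_2d with the np.mean() function:
[3.  3.8 4.8]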


In [73]:
# Special attention is needed regarding the variance/standard deviation
# You can calculate it by hand as shown below:

# Get sample size
n = arr_2d.shape[0]

# Center arr_2d
arr_2d_c = arr_2d - arr_2d.mean(axis=0)

# Sum the squared deviations down each column and divide by (n - 1)
arr_2d_var = (arr_2d_c ** 2).sum(axis=0) / (n - 1)

# See the results
print("Variance of variables in arr_2d:")
print(arr_2d_var)

#((arr_2d - arr_2d.mean(axis=0)) ** 2).sum(axis=0)

Variance of variables in arr_2d:
[6.48 3.38 5.78]


In [75]:
# However, the .var() method gives you a different result:
print("Variance of variables in arr_2d.var():")
print(arr_2d.var(axis=0))

#np.var(arr_2d, axis=0)

Variance of variables in arr_2d.var():
[3.24 1.69 2.89]


In [None]:
# The reason for this difference is that NumPy computes the MLE estimate of the variance by default (ddof=0).
# Therefore, for a sample of N observations, the sum of squares is divided by N rather than (N - 1).
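
A minimal sketch of how to get the (N - 1) version directly: `.var()` and `.std()` accept a `ddof` ("delta degrees of freedom") argument, and the divisor used is N - ddof. The array is re-created here so the block runs on its own.

```python
import numpy as np

arr_2d = np.array([[1.2, 2.5, 3.1], [4.8, 5.1, 6.5]])

# Default: ddof=0, i.e. divide by N
var_n = arr_2d.var(axis=0)

# ddof=1: divide by (N - 1), matching the calculation by hand
var_n_minus_1 = arr_2d.var(axis=0, ddof=1)

print(var_n)
print(var_n_minus_1)

# The same argument exists for the standard deviation
print(arr_2d.std(axis=0, ddof=1))
```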

In [57]:
# In contrast, np.cov() divides by (N - 1) by default
print("Result of np.cov() with default parameters:")
print(np.cov(arr_2d))

# The default is rowvar=True, so each row is treated as a variable
# and the covariances between rows are computed
print("\nResult of np.cov() with rowvar=False:")
print(np.cov(arr_2d, rowvar=False))

# Thankfully for most analyses, bias=False (division by N - 1) is the default
print("\nResult of np.cov() with rowvar=False and bias=False:")
print(np.cov(arr_2d, rowvar=False, bias=False))

Result of np.cov() with default parameters:
[[0.94333333 0.74333333]
 [0.74333333 0.82333333]]

Result of np.cov() with rowvar=False:
[[6.48 4.68 6.12]
 [4.68 3.38 4.42]
 [6.12 4.42 5.78]]

Result of np.cov() with rowvar=False and bias=False:
[[6.48 4.68 6.12]
 [4.68 3.38 4.42]
 [6.12 4.42 5.78]]


In [7]:
# .flatten() returns a 1-dimensional copy of the array, read row by row
print(arr_2d)
#
print("*" * 20)
#
print(arr_2d.flatten())

[[1.2 2.5 3.1]
 [4.8 5.1 6.5]]
********************
[1.2 2.5 3.1 4.8 5.1 6.5]


# Special matrices

In [80]:
## Special arrays can be built for linear algebra

# Zeros
zeros = np.zeros((2, 2))
print("A 2x2 matrix of 0's:")
print(zeros)
print("-" * 50)

# Ones
ones = np.ones((4,1))
print("A 4x1 matrix of 1's:")
print(ones)
print("-" * 50)

# Identity matrix
print("A 3x3 identity matrix:")
print(np.eye(N=3))
print("-" * 50)

# Random numbers
# Note: these functions live in the np.random submodule
print("A 2x4 matrix of random normal variates:")
randoms = np.random.randn(2, 4)
print(randoms)

A 2x2 matrix of 0's:
[[0. 0.]
 [0. 0.]]
--------------------------------------------------
A 4x1 matrix of 1's:
[[1.]
 [1.]
 [1.]
 [1.]]
--------------------------------------------------
A 3x3 identity matrix:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
--------------------------------------------------
A 2x4 matrix of random normal variates:
[[ 0.62892756 -0.45052636  1.21577943  1.36060433]
 [ 0.48921381  0.04673573 -1.95165626 -0.08850641]]


In [58]:
# Create two matrices
x1 = np.random.randn(2, 4)
x2 = np.random.randn(4, 3)

# See what they look like
print(x1)
print("\n")
print(x2)

[[-0.03598025  0.97322097  0.12797194  1.16878278]
 [-0.5541337  -0.60531605  0.04688605  0.14748479]]


[[ 0.68710469 -0.58939942 -1.15198543]
 [ 0.75335405  1.17656137 -0.00395283]
 [ 0.70600594  0.22964846  0.10519212]
 [ 0.64225739 -0.26829781 -0.80356156]]


In [25]:
# Compute their matrix product (2x4 times 4x3 gives 2x3)
x1.dot(x2)

array([[ 0.61796906, -1.22930429, -0.68144788],
       [-0.49698702,  0.35794264,  0.6902604 ]])

In [26]:
# Compute their matrix product using the @ operator
x1 @ x2

array([[ 0.61796906, -1.22930429, -0.68144788],
       [-0.49698702,  0.35794264,  0.6902604 ]])
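
Since `*` and `@` are easy to confuse, here is a small sketch on two made-up 2x2 matrices (`a` and `b` are illustrative names) showing that `*` multiplies entry by entry while `@` performs the matrix product.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[0, 1],
              [1, 0]])

# Element-wise product: multiplies matching entries
elementwise = a * b

# Matrix product: rows of a times columns of b
matrix_prod = a @ b

print(elementwise)
print(matrix_prod)
```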

In [88]:
# np.hstack() (horizontal stack) concatenates matrices side by side,
# i.e. it adds columns.
# This can be useful for linear regression, for example (adding a column of ones).
np.hstack([ones, x2])

# There is also np.vstack() (vertical stack), which concatenates matrices on top
# of each other, i.e. it adds rows (observations).

# If you know R, these are similar to cbind() and rbind(), respectively.

array([[ 1.        ,  0.68710469, -0.58939942, -1.15198543],
       [ 1.        ,  0.75335405,  1.17656137, -0.00395283],
       [ 1.        ,  0.70600594,  0.22964846,  0.10519212],
       [ 1.        ,  0.64225739, -0.26829781, -0.80356156]])
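
A quick way to keep the two stacking functions apart is to look at the resulting shapes. The sketch below uses two small made-up matrices (`a` and `b`).

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

# Side by side: the number of columns grows
side_by_side = np.hstack([a, b])
print(side_by_side.shape)

# On top of each other: the number of rows grows
stacked = np.vstack([a, b])
print(stacked.shape)
```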

# Creating and reading your own

In [161]:
# You can build your own arrays by typing the values in directly
ex_array = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])

# See what it looks like
print(ex_array)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [162]:
# np.genfromtxt() reads delimited text files such as .csv, not only .txt, BTW
tricho = np.genfromtxt(fname="../data/trichoptera.csv", skip_header=1, delimiter=",")

# See what it looks like
print(tricho)
print(tricho.shape)
print(tricho.dtype)

[[0. 0. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 ...
 [0. 0. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
(220, 56)
float64


# Indexing, Slicing, and Modifying

In [163]:
# Use the axis parameter to define which axis you want to add along

# Total sum
print("Total sum of the trichoptera dataset:")
print(tricho.sum(axis=None))

# Column sums (axis=0 adds down the rows)
print("\nColumn sums of the trichoptera dataset:")
print(tricho.sum(axis=0))

# Row sums (axis=1 adds across the columns)
print("\nRow sums of the trichoptera dataset:")
print(tricho.sum(axis=1))

Total sum of the trichoptera dataset:
1651.0

Column sums of the trichoptera dataset:
[ 95.  59. 127. 148.  64. 121.  73.  60.  54.  81.  98.  65.  52.  62.
  37.  71.  34.  57.  39.  28.  16.  26.  37.  26.   8.  11.   4.   7.
   9.   9.  13.   8.   5.   3.   5.   6.   1.   3.   3.   3.   1.   5.
   2.   2.   2.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.]

Row sums of the trichoptera dataset:
[ 1.  1.  2.  2.  1.  1.  2.  1.  3.  2.  2.  1.  1.  3.  2.  2.  2.  1.
  0.  0.  1.  0.  7.  6. 12.  9. 11.  8.  7.  6.  9.  4.  7.  7.  8. 17.
  8. 11.  9.  5.  1.  8. 11.  2. 14.  9. 16. 19. 10. 11. 13. 12. 14. 11.
 14. 15. 12. 12. 13. 11. 12.  2.  0.  4. 12.  3. 14. 11. 17. 15. 14.  9.
 15. 12. 10. 13. 13. 14. 12. 14. 16. 15. 10.  4.  2.  7. 10.  7. 19.  7.
 19. 16. 13. 10. 12. 14. 17. 10. 11. 16. 15. 14. 13. 16. 10.  3.  0.  4.
 15.  9. 14.  6. 15. 12. 10. 11.  8. 17. 12.  8.  6. 12. 12. 10. 14. 13.
  6.  5.  0.  5. 14.  8.  9.  3. 12.  9. 12.  9.  8.  6.  7.  5.  5.  7.
  9.  8

# Array manipulation

In [None]:
# I actually didn't tell you the whole story. The 220 points refer to the SAME 22
# points sampled repeatedly over 10 periods.
# (it's more complicated but let's stick to that)

In [164]:
# First flatten the array
# This gives you a 10*22*56=12320 long 1D array.
#
# Then reshape it into a 10x22x56 3D array
tricho_3d = tricho.flatten().reshape(10, 22, 56)
#tricho_3d = tricho.reshape(10, 22, 56)
#
print(tricho_3d)

[[[0. 0. 1. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 1. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]

 [[1. 0. 1. ... 0. 0. 0.]
  [0. 0. 1. ... 0. 0. 0.]
  [1. 1. 1. ... 0. 0. 0.]
  ...
  [0. 0. 1. ... 0. 0. 0.]
  [0. 0. 1. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]

 [[1. 1. 1. ... 0. 0. 0.]
  [1. 1. 1. ... 0. 0. 0.]
  [1. 1. 1. ... 0. 0. 0.]
  ...
  [0. 0. 1. ... 0. 0. 0.]
  [0. 0. 1. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]

 ...

 [[1. 0. 1. ... 0. 0. 0.]
  [0. 0. 1. ... 0. 0. 0.]
  [1. 0. 1. ... 0. 0. 0.]
  ...
  [0. 0. 1. ... 0. 0. 0.]
  [1. 0. 1. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]

 [[0. 0. 1. ... 0. 0. 0.]
  [0. 0. 1. ... 0. 0. 0.]
  [0. 0. 1. ... 0. 0. 0.]
  ...
  [0. 0. 1. ... 0. 0. 0.]
  [0. 0. 1. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]

 [[0. 0. 1. ... 0. 0. 0.]
  [0. 0. 1. ... 0. 0. 0.]
  [0. 0. 1. ... 0. 0. 0.]
  ...
  [0. 0. 1. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]]


In [165]:
# See information about it
print(tricho_3d.shape)

(10, 22, 56)
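
To see what this reshape does to the rows, here is a sketch on a tiny made-up table. It assumes, as in the dataset above, that the rows are ordered period by period (all sites of period 1, then all sites of period 2, and so on); `toy` and `toy_3d` are hypothetical names.

```python
import numpy as np

# Hypothetical toy data: 3 periods x 2 sites stacked as 6 rows, 2 species
toy = np.arange(12).reshape(6, 2)

# Reshape into (period, site, species)
toy_3d = toy.reshape(3, 2, 2)

# The first block holds the first 2 rows (period 1) of the original table
print(np.array_equal(toy_3d[0], toy[0:2]))

# The last block holds the last 2 rows (period 3)
print(np.array_equal(toy_3d[2], toy[4:6]))
```

If the rows were ordered site by site instead, the same call would run without error but would group the rows incorrectly, so the row order is worth checking before reshaping.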


In [None]:
# To make it more manageable,
# transpose it into a 22x56x10 (sites, species, time) array
tricho_3d = tricho_3d.transpose((1, 2, 0))

In [171]:
tricho_3d.shape

(22, 56, 10)

In [None]:
# Now you can extract particular parts

# First 10 sites for the first time period
# (the stop index is excluded, so 0:10 selects indices 0 to 9)
tricho_3d[0:10, :, 0]

array([[0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0

In [None]:
# Use the axis parameter to define which axis you want to add along
# Note: the shapes below correspond to the (time, sites, species) layout,
# i.e. the 10x22x56 array as it was before the transpose
print("\nSums along the first axis (time) of the 3d array:")
print(tricho_3d.sum(axis=0).shape)
#
print("\nSums along the second axis (sites) of the 3d array:")
print(tricho_3d.sum(axis=1).shape)
#
print("\nSums along the third axis (species) of the 3d array:")
print(tricho_3d.sum(axis=2).shape)



Sums along the first axis (time) of the 3d array:
(22, 56)

Sums along the second axis (sites) of the 3d array:
(10, 56)

Sums along the third axis (species) of the 3d array:
(10, 22)
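
The rule behind these shapes is that summing along an axis removes that axis from the result. The sketch below checks this on a placeholder array of ones with the same 10x22x56 shape (`toy_3d` is a stand-in, not the real dataset).

```python
import numpy as np

# Placeholder array with the same shape as the (time, sites, species) data
toy_3d = np.ones((10, 22, 56))

# Summing along an axis drops that axis from the shape
shapes = [toy_3d.sum(axis=axis).shape for axis in range(3)]
for axis, shape in enumerate(shapes):
    print("axis =", axis, "->", shape)

# Each entry of the axis=0 sum adds up the 10 values along the first axis
print(toy_3d.sum(axis=0)[0, 0])
```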


In [54]:
tricho_3d.sum(axis=2)

array([[ 1.,  1.,  2.,  2.,  1.,  1.,  2.,  1.,  3.,  2.,  2.,  1.,  1.,
         3.,  2.,  2.,  2.,  1.,  0.,  0.,  1.,  0.],
       [ 7.,  6., 12.,  9., 11.,  8.,  7.,  6.,  9.,  4.,  7.,  7.,  8.,
        17.,  8., 11.,  9.,  5.,  1.,  8., 11.,  2.],
       [14.,  9., 16., 19., 10., 11., 13., 12., 14., 11., 14., 15., 12.,
        12., 13., 11., 12.,  2.,  0.,  4., 12.,  3.],
       [14., 11., 17., 15., 14.,  9., 15., 12., 10., 13., 13., 14., 12.,
        14., 16., 15., 10.,  4.,  2.,  7., 10.,  7.],
       [19.,  7., 19., 16., 13., 10., 12., 14., 17., 10., 11., 16., 15.,
        14., 13., 16., 10.,  3.,  0.,  4., 15.,  9.],
       [14.,  6., 15., 12., 10., 11.,  8., 17., 12.,  8.,  6., 12., 12.,
        10., 14., 13.,  6.,  5.,  0.,  5., 14.,  8.],
       [ 9.,  3., 12.,  9., 12.,  9.,  8.,  6.,  7.,  5.,  5.,  7.,  9.,
         8.,  8., 11.,  5.,  2.,  0.,  5., 14.,  4.],
       [10.,  7., 14.,  9., 13.,  8., 11.,  9.,  8.,  3.,  4.,  4.,  7.,
         6.,  8.,  8.,  5.,  3.,  0., 

In [42]:
# Giving it nothing will give you the sum of the entire dataset
print("Total sum of the 3-dimensional array:")
print(tricho_3d.sum(axis=None))

# Use the axis parameter to define which axis you want to add along
print("\nSums along the first axis (time) of the 3d array:")
print(tricho_3d.sum(axis=0))
#
print("\nSums along the second axis (sites) of the 3d array:")
print(tricho_3d.sum(axis=1))
#
print("\nSums along the third axis (species) of the 3d array:")
print(tricho_3d.sum(axis=2))


Total sum of the 3-dimensional array:
1651.0

Sums along the first axis (time) of the 3d array:
[[ 6.  3. 10. ...  0.  0.  0.]
 [ 3.  2.  9. ...  0.  0.  0.]
 [ 7.  4. 10. ...  0.  0.  0.]
 ...
 [ 0.  1.  8. ...  0.  0.  0.]
 [ 4.  3.  8. ...  0.  0.  0.]
 [ 2.  2.  0. ...  0.  0.  0.]]

Sums along the second axis (sites) of the 3d array:
[[ 0.  0.  5.  0.  0.  4.  0.  0.  0.  7.  0.  0.  0.  0.  0.  1.  0.  0.
   0.  0. 12.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  1.
   0.  0.]
 [ 9.  1. 13. 18.  1. 15.  3. 11. 22. 18. 10.  6.  1.  0.  0. 12.  0. 10.
   0.  1.  1.  1.  0.  6.  0.  3.  0.  2.  1.  0.  2.  3.  0.  0.  0.  0.
   0.  0.  0.  0.  1.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   1.  0.]
 [15. 15. 14. 16.  0. 13. 16. 12. 10.  7. 18. 10.  0.  4. 11. 13.  6. 14.
   4.  9.  2.  0.  7.  3.  0.  1.  0.  1.  4.  0.  6.  0.  0.  0.  2.  0.
   0.  1.  1.  1.  0.  0.  0.  0.  1.  0.  

In [96]:
# You can also sum over a particular slice
# (here, in the time x sites x species layout: first two periods, all sites, third species)
tricho_3d[0:2, :, 2].sum()

np.float64(18.0)

# Mini-matrix primer

In [32]:
import numpy as np

mat1 = np.array([[2, 4],
                 [1, 6],
                 [5, 3]])

print(mat1)
print("-" * 50)
print(mat1.shape)

[[2 4]
 [1 6]
 [5 3]]
--------------------------------------------------
(3, 2)


In [33]:
# The transpose of a matrix is the same matrix
# with its rows and columns swapped
# It is available as the attribute .T
print(mat1.T)
print("-" * 50)
print(mat1.T.shape)

[[2 1 5]
 [4 6 3]]
--------------------------------------------------
(2, 3)


Vectors

In [49]:
# A vector is a matrix in which one of the two dimensions has length 1
# It can be a 1xp row vector or a px1 column vector
# When entering values into NumPy, mind the [[]] notation

# With the double brackets, NumPy builds a 2-dimensional 1xp row vector
# (a single [] would give a 1-dimensional array of shape (p,))

vec1 = np.array(object=[[3, 2]])

print(vec1)
print("-" * 50)
print(vec1.shape)
#
print("+" * 50)
#
print(vec1.T)
print("-" * 50)
print(vec1.T.shape)

[[3 2]]
--------------------------------------------------
(1, 2)
++++++++++++++++++++++++++++++++++++++++++++++++++
[[3]
 [2]]
--------------------------------------------------
(2, 1)


In [51]:
print(mat1)
print("-" * 50)
print(vec1.T)
print("-" * 50)
print(mat1.dot(b=vec1.T))
print("-" * 50)
print(mat1 @ vec1.T)

[[2 4]
 [1 6]
 [5 3]]
--------------------------------------------------
[[3]
 [2]]
--------------------------------------------------
[[14]
 [15]
 [21]]
--------------------------------------------------
[[14]
 [15]
 [21]]


Dispersion

In [52]:
# Variance-covariance matrix
#(mat1 - mat1.mean(axis=0)).sum(axis=0)

n = mat1.shape[0]
S = (mat1 - mat1.mean(axis=0)).T @ (mat1 - mat1.mean(axis=0))

print(n)
print("-" * 50)
print(S)
print("-" * 50)
print(1.0 / (n - 1.0) * S)

3
--------------------------------------------------
[[ 8.66666667 -5.66666667]
 [-5.66666667  4.66666667]]
--------------------------------------------------
[[ 4.33333333 -2.83333333]
 [-2.83333333  2.33333333]]


In [34]:
# This is what is obtained by the np.cov() function.
# But now you know exactly how it works and can do it
# by hand.
np.cov(mat1, rowvar=False)

array([[ 4.33333333, -2.83333333],
       [-2.83333333,  2.33333333]])
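
Rather than comparing the printed numbers by eye, the agreement can be checked with `np.allclose()`. This is a minimal sketch that re-creates `mat1` so it runs on its own; `centered` and `cov_by_hand` are illustrative names.

```python
import numpy as np

mat1 = np.array([[2, 4],
                 [1, 6],
                 [5, 3]])

n = mat1.shape[0]

# Center each column, then form the cross-product divided by (n - 1)
centered = mat1 - mat1.mean(axis=0)
cov_by_hand = centered.T @ centered / (n - 1)

# Compare with np.cov(), treating columns as variables
print(np.allclose(cov_by_hand, np.cov(mat1, rowvar=False)))

# The diagonal holds the variances computed with ddof=1
print(np.allclose(np.diag(cov_by_hand), mat1.var(axis=0, ddof=1)))
```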

In [16]:
import numpy as np

mat1 = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(mat1)
print("-" * 50)
print(mat1.shape)

[[1 2 3]
 [4 5 6]]
--------------------------------------------------
(2, 3)


In [17]:
# The transpose of a matrix is the same matrix
# with its rows and columns swapped
# It is available as the attribute .T
print(mat1.T)
print("-" * 50)
print(mat1.T.shape)

[[1 4]
 [2 5]
 [3 6]]
--------------------------------------------------
(3, 2)


In [14]:
# A vector is a matrix in which one of the two dimensions has length 1
# It can be a 1xp row vector or a px1 column vector
# When entering values into NumPy, mind the [[]] notation

# With the double brackets, NumPy builds a 2-dimensional 1xp row vector
# (a single [] would give a 1-dimensional array of shape (p,))

vec1 = np.array(object=[[1, 2, 5]])

print(vec1)
print("-" * 50)
print(vec1.shape)
#
print("+" * 50)
#
print(vec1.T)
print("-" * 50)
print(vec1.T.shape)

[[1 2 5]]
--------------------------------------------------
(1, 3)
++++++++++++++++++++++++++++++++++++++++++++++++++
[[1]
 [2]
 [5]]
--------------------------------------------------
(3, 1)


In [24]:

print(mat1)
print("-" * 50)
print(vec1.T)
print("-" * 50)
print(mat1.dot(vec1.T))
print("-" * 50)
print(mat1 @ vec1.T)

[[1 2 3]
 [4 5 6]]
--------------------------------------------------
[[1]
 [2]
 [5]]
--------------------------------------------------
[[20]
 [44]]
--------------------------------------------------
[[20]
 [44]]


In [31]:
# Variance-covariance matrix
#(mat1 - mat1.mean(axis=0)).sum(axis=0)

(mat1 - mat1.mean(axis=0)).T @ (mat1 - mat1.mean(axis=0))

array([[4.5, 4.5, 4.5],
       [4.5, 4.5, 4.5],
       [4.5, 4.5, 4.5]])

In [30]:
np.cov(mat1, rowvar=False)

array([[4.5, 4.5, 4.5],
       [4.5, 4.5, 4.5],
       [4.5, 4.5, 4.5]])

# Linear algebra

In [90]:
# Create a square matrix that could be a covariance matrix
# between two variables
S = np.array([[1.0, 0.8],
              [0.8, 1.0]])

In [93]:
# Compute its determinant
print("The determinant of S:")
print(np.linalg.det(a=S))
print(type(np.linalg.det(a=S)))

The determinant of S:
0.3599999999999999
<class 'numpy.float64'>


In [94]:
# Get the inverse of the S matrix
Sm1 = np.linalg.inv(a=S)

print("The inverse of S:")
print(Sm1)
print(type(Sm1))

The inverse of S:
[[ 2.77777778 -2.22222222]
 [-2.22222222  2.77777778]]
<class 'numpy.ndarray'>


In [95]:
print("\nThe result of Sm1 x S:")
print(Sm1 @ S)
print(type(Sm1 @ S))

print("\nThe result of S x Sm1:")
print(S @ Sm1)
print(type(S @ Sm1))


The result of Sm1 x S:
[[1.00000000e+00 0.00000000e+00]
 [2.12175956e-16 1.00000000e+00]]
<class 'numpy.ndarray'>

The result of S x Sm1:
[[1.00000000e+00 2.12175956e-16]
 [0.00000000e+00 1.00000000e+00]]
<class 'numpy.ndarray'>


In [99]:
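# Invert a singular matrix S2 (hypothetical example: two identical rows)
# Uncomment at your own risk
# (There ain't no risk, it's mathematically impossible: NumPy raises a LinAlgError)
#S2 = np.array([[1.0, 1.0], [1.0, 1.0]])
#np.linalg.inv(S2)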
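
For those who would rather see the failure without breaking the notebook, here is a sketch that catches the error. The matrix `S2` below is a made-up singular matrix (its two rows are identical), not one defined elsewhere in this notebook.

```python
import numpy as np

# Hypothetical singular matrix: the second row duplicates the first
S2 = np.array([[1.0, 1.0],
               [1.0, 1.0]])

# A singular matrix has a determinant of zero
print(np.linalg.det(S2))

# Trying to invert it raises np.linalg.LinAlgError
error_message = None
try:
    np.linalg.inv(S2)
except np.linalg.LinAlgError as err:
    error_message = str(err)

print("Inversion failed:", error_message)
```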

# The return of list unpacking

In [5]:
# Other useful functions in np.linalg:
#np.linalg.cholesky
#np.linalg.eig
#np.linalg.qr
#np.linalg.svd
#np.linalg.inv

Sm1 = np.linalg.inv(S)

print("The inverse of S:")
print(Sm1)

print("\nThe result of Sm1 S:")
print(Sm1 @ S)

print("\nThe result of S Sm1:")
print(S @ Sm1)

The inverse of S:
[[ 2.77777778 -2.22222222]
 [-2.22222222  2.77777778]]

The result of Sm1 S:
[[1.00000000e+00 0.00000000e+00]
 [2.12175956e-16 1.00000000e+00]]

The result of S Sm1:
[[1.00000000e+00 2.12175956e-16]
 [0.00000000e+00 1.00000000e+00]]


In [100]:
# Perform eigenanalysis of S
print(np.linalg.eig(a=S))

EigResult(eigenvalues=array([1.8, 0.2]), eigenvectors=array([[ 0.70710678, -0.70710678],
       [ 0.70710678,  0.70710678]]))
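
Because `np.linalg.eig()` returns two objects, they can be unpacked into two names in one line. The sketch below does so and checks the defining property of each eigenvector (stored as a column); the names `eigenvalues` and `eigenvectors` are my own choice.

```python
import numpy as np

S = np.array([[1.0, 0.8],
              [0.8, 1.0]])

# Tuple unpacking: first the eigenvalues, then the matrix of eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(S)

# Each COLUMN of eigenvectors is one eigenvector v, satisfying S v = lambda v
for k in range(2):
    v = eigenvectors[:, k]
    print(np.allclose(S @ v, eigenvalues[k] * v))
```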


In [None]:
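def pca(X):
    n = X.shape[0]
    # Center each variable (column) on its mean
    X = X - X.mean(axis=0)
    # Covariance matrix of the variables
    S = 1/(n - 1.0) * X.T @ X
    # Tuple unpacking: discard the eigenvalues, keep the eigenvectors
    _, U = np.linalg.eig(S)
    # Project the centered data onto the eigenvectors
    F = X @ U
    return F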

In [97]:
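def pca(X):
    n = X.shape[0]
    # Center each variable (column) on its mean
    X = X - X.mean(axis=0)
    S = 1/(n - 1.0) * X.T @ X
    # Same as above, but indexing the result instead of unpacking it
    eig_out = np.linalg.eig(S)
    U = eig_out[1]
    F = X @ U
    return F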
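
One thing `np.linalg.eig()` does not guarantee is the order of the eigenvalues. The sketch below is a possible variant, under the assumption that the goal is to center each variable (column) and to get the axes sorted from largest to smallest variance. It swaps in `np.linalg.eigh()`, which is meant for symmetric matrices such as a covariance matrix; the function name `pca_sorted` and the random test data are hypothetical.

```python
import numpy as np

def pca_sorted(X):
    """Hypothetical PCA variant: returns scores and eigenvalues, largest first."""
    n = X.shape[0]
    # Center each variable (column) on its mean
    Xc = X - X.mean(axis=0)
    # Covariance matrix of the variables
    S = Xc.T @ Xc / (n - 1)
    # eigh() is for symmetric matrices and returns eigenvalues in ascending order
    eigenvalues, U = np.linalg.eigh(S)
    # Reorder from largest to smallest eigenvalue
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, U = eigenvalues[order], U[:, order]
    return Xc @ U, eigenvalues

# Made-up data: 50 observations of 3 variables
rng = np.random.default_rng(seed=42)
X = rng.normal(size=(50, 3))

F, eigenvalues = pca_sorted(X)
print(F.shape)

# The variance of each column of scores equals the matching eigenvalue
print(np.allclose(F.var(axis=0, ddof=1), eigenvalues))
```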

# Masked matrices

In [117]:
# Generate a 5x5 matrix with values drawn from 0, 1, 2, 3 and -999 (the missing-value code)
ex_array = np.random.choice(a=[0.0, 1.0, 2.0, 3.0, -999.0], size=(5, 5))

# See the values in my array
print("Array with some values as -999")
print(ex_array)

# Create a copy of the array and
# Replace values that are equal to -999 with np.nan
ex_nan = ex_array.copy()
ex_nan[ex_nan == -999] = np.nan

# See the values in the new array
print("\nArray with -999 coded as np.nan:")
print(ex_nan)

Array with some values as -999
[[   2. -999.    2.    1.    3.]
 [-999.    2.    2.    3.    3.]
 [   1.    1.    2.    1.    1.]
 [   3. -999. -999.    2.    1.]
 [   2.    3.    2.    1.    3.]]

Array with -999 coded as np.nan:
[[ 2. nan  2.  1.  3.]
 [nan  2.  2.  3.  3.]
 [ 1.  1.  2.  1.  1.]
 [ 3. nan nan  2.  1.]
 [ 2.  3.  2.  1.  3.]]


In [122]:
# Determine a boolean mask defined by whether or not
# values are equal to -999
mymask = ex_array == -999

# See the values in the mask
print("Boolean mask:")
print(mymask)

# Create a masked array from this mask
print("\nMasked array:")
ex_mask = np.ma.masked_array(ex_array, mask=mymask)
print(ex_mask)

Boolean mask:
[[False  True False False False]
 [ True False False False False]
 [False False False False False]
 [False  True  True False False]
 [False False False False False]]

Masked array:
[[2.0 -- 2.0 1.0 3.0]
 [-- 2.0 2.0 3.0 3.0]
 [1.0 1.0 2.0 1.0 1.0]
 [3.0 -- -- 2.0 1.0]
 [2.0 3.0 2.0 1.0 3.0]]


In [126]:
# Print out the mean of these arrays
print("The mean of ex_array is:", ex_array.mean())
print("The mean of ex_nan is:", ex_nan.mean())
print("The nanmean of ex_nan is:", np.nanmean(ex_nan))
print("The mean of ex_mask is:", ex_mask.mean())

The mean of ex_array is: -158.2
The mean of ex_nan is: nan
The nanmean of ex_nan is: 1.9523809523809523
The mean of ex_mask is: 1.9523809523809523
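
The same two approaches also work along an axis. Here is a sketch on a small made-up array (`data`), using `np.ma.masked_equal()` as a shortcut for building the mask and `np.where()` for the NaN version.

```python
import numpy as np

# Hypothetical small array using -999 as the missing-value code
data = np.array([[1.0, -999.0, 3.0],
                 [2.0, 4.0, -999.0]])

# Masked-array route: mask every entry equal to -999
masked = np.ma.masked_equal(data, -999.0)
col_means_masked = masked.mean(axis=0)

# NaN route: replace -999 with np.nan, then use the nan-aware mean
with_nan = np.where(data == -999.0, np.nan, data)
col_means_nan = np.nanmean(with_nan, axis=0)

print(col_means_masked)
print(col_means_nan)
```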


# Views and copies

In [136]:
# Create a vector, a view of it (a slice), and a copy of that slice
vec_1 = np.array([1, 2, 3, 4, 5])
vec_2 = vec_1[2:]
vec_3 = vec_1[2:].copy()

# Look at them!
print(vec_1)
print(vec_2)
print(vec_3)

[1 2 3 4 5]
[3 4 5]
[3 4 5]


In [151]:
print("ID of vec_1:", id(vec_1))
print("ID of vec_2:", id(vec_2))
print("ID of vec_3:", id(vec_3))

ID of vec_1: 2998782129232
ID of vec_2: 2998782128368
ID of vec_3: 2998782074512


In [155]:
print("Does vec_1 share memory with vec_2?")
print(np.shares_memory(vec_1, vec_2))

print("\nDoes vec_1 share memory with vec_3?")
print(np.shares_memory(vec_1, vec_3))

print("\nDoes vec_2 share memory with vec_3?")
print(np.shares_memory(vec_2, vec_3))

Does vec_1 share memory with vec_2?
True

Does vec_1 share memory with vec_3?
False

Does vec_2 share memory with vec_3?
False


In [152]:
# Change a value in vec_2
vec_2[1] = 9999

# Look at them!
print("Values of vec_1:")
print(vec_1)

print("\nValues of vec_2:")
print(vec_2)

print("\nValues of vec_3:")
print(vec_3)

Values of vec_1:
[   1    2    3 9999    5]

Values of vec_2:
[   3 9999    5]

Values of vec_3:
[3 4 5]


In [None]:
# So ask yourself when subsetting:

# Will I do some analyses on this part and then go back to the original data?

# - If YES: Consider a .copy() of the data so you don't alter it unintentionally.
# - If NO: You can stick with a view, which is more memory-efficient since no data is duplicated (you weren't going to use the original anyway).
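
If you are unsure whether you are holding a view or a copy, the `.base` attribute is one way to check. This is a minimal sketch with made-up names (`original`, `view`, `copied`).

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5])
view = original[2:]
copied = original[2:].copy()

# A view keeps a reference to the array that owns the data
print(view.base is original)

# A copy owns its own data, so it has no base
print(copied.base is None)

# Writing through the view changes the original; the copy is unaffected
view[0] = -1
print(original)
print(copied)
```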