## 0. Importing NumPy
To get started using NumPy, the first step is to import it.

The most common way (and the one you should use) is to import NumPy under the abbreviation np.

If you see the letters np used anywhere in machine learning or data science, it's probably referring to the NumPy library.

In [1]:
import numpy as np

# Check the version
print(np.__version__)

1.26.4



## 1. DataTypes and attributes

    Note: The main type in NumPy is ndarray. Even seemingly different kinds of arrays are still ndarrays, so an operation that works on one array will generally work on another.



In [2]:
# 1-dimensional array, also referred to as a vector
a1 = np.array([1, 2, 3])

# 2-dimensional array, also referred to as matrix
a2 = np.array([[1, 2.0, 3.3],
               [4, 5, 6.5]])

# 3-dimensional array, also referred to as a matrix
a3 = np.array([
    [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ],
    [
        [10, 11, 12],
        [13, 14, 15],
        [16, 17, 18]
    ]
])


In [3]:
a1.shape

(3,)

In [4]:
a1.ndim, a1.size, a1.dtype, type(a1)

(1, 3, dtype('int64'), numpy.ndarray)

In [5]:
a2.shape, a2.ndim, a2.dtype, a2.size, type(a2)

((2, 3), 2, dtype('float64'), 6, numpy.ndarray)

In [6]:
a3.shape, a3.ndim, a3.dtype, a3.size, type(a3)

((2, 3, 3), 3, dtype('int64'), 18, numpy.ndarray)

In [7]:
a1

array([1, 2, 3])

In [8]:
a2

array([[1. , 2. , 3.3],
       [4. , 5. , 6.5]])

In [9]:
a3

array([[[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9]],

       [[10, 11, 12],
        [13, 14, 15],
        [16, 17, 18]]])

# Anatomy of an array

![](anatomy_ndarrays.png)

Key terms:

    Array - A list of numbers, can be multi-dimensional.
    Scalar - A single number (e.g. 7).
    Vector - A list of numbers with 1-dimension (e.g. np.array([1, 2, 3])).
    Matrix - A (usually) multi-dimensional list of numbers (e.g. np.array([[1, 2, 3], [4, 5, 6]])).
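
As a quick check of these terms, the sketch below builds one of each and prints its number of dimensions and shape. The variable names are illustrative only and are not used elsewhere in this notebook.

```python
import numpy as np

scalar = np.array(7)                       # 0 dimensions: a single number
vector = np.array([1, 2, 3])               # 1 dimension
matrix = np.array([[1, 2, 3], [4, 5, 6]])  # 2 dimensions

for name, arr in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    print(name, "ndim:", arr.ndim, "shape:", arr.shape)
```

All three are still ndarrays, which is why the same attributes work on each.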



pandas DataFrame out of NumPy arrays

This exemplifies how NumPy is the backbone of many other libraries.


In [10]:
import pandas as pd

# np.random.randint(10, size=(5, 3)) generates a 2D array (matrix) of random integers.
# 10: exclusive upper limit for the random integers (so values range from 0 to 9).
# size=(5, 3): shape of the array, 5 rows and 3 columns.
df = pd.DataFrame(np.random.randint(10, size=(5, 3)), columns=['a', 'b', 'c'])

df

   a  b  c
0  2  9  5
1  3  5  8
2  1  5  5
3  1  7  7
4  5  3  1


In [11]:
a2

array([[1. , 2. , 3.3],
       [4. , 5. , 6.5]])

In [12]:
df2= pd.DataFrame(a2)
df2

     0    1    2
0  1.0  2.0  3.3
1  4.0  5.0  6.5



## 2. Creating arrays

    np.array()
    np.ones()
    np.zeros()
    np.random.rand(5, 3)
    np.random.randint(10, size=5)
    np.random.seed() - pseudo random numbers
    Searching the documentation example (finding np.unique() and using it)



In [13]:
# Create a simple array
simple_array = np.array([1, 2, 3])
simple_array

array([1, 2, 3])

In [14]:
simple_array = np.array((1, 2, 3))
simple_array, simple_array.dtype

(array([1, 2, 3]), dtype('int64'))

In [15]:
ones= np.ones((3,2), dtype="int64")

In [16]:
ones

array([[1, 1],
       [1, 1],
       [1, 1]])

In [17]:
# np.ones() defaults to 'float64'; this array is 'int64' because dtype was set explicitly above
ones.dtype

dtype('int64')

In [18]:
# You can change the datatype with .astype()
ones.astype(float)

array([[1., 1.],
       [1., 1.],
       [1., 1.]])

In [19]:
# Create an array of zeros
zeros=np.zeros((5,3,3))

In [20]:
zeros

array([[[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]])

In [21]:
zeros.dtype

dtype('float64')

In [22]:
# create an array within a range of values
range_array= np.arange(3,50,5)
range_array

array([ 3,  8, 13, 18, 23, 28, 33, 38, 43, 48])

In [23]:
# random array
random_array= np.random.randint(5,20,size=(3,3,3))
random_array

array([[[ 5, 11, 16],
        [13, 13, 11],
        [ 9, 19,  7]],

       [[ 9, 10, 17],
        [ 8, 11, 11],
        [12, 11,  7]],

       [[ 5, 11,  8],
        [13, 13,  5],
        [13, 13,  8]]])

In [24]:
# Random array of floats (between 0 & 1)
np.random.random((5, 3))

array([[2.56287398e-01, 7.01548585e-01, 7.43745336e-01],
       [7.04386622e-01, 2.86148960e-01, 6.78597052e-01],
       [7.76990188e-04, 8.87810628e-01, 8.76112235e-01],
       [5.86317860e-03, 5.53251821e-01, 8.86870258e-01],
       [5.91833110e-01, 8.07094343e-01, 6.85277251e-01]])

In [25]:
# Random 5x3 array of floats (between 0 & 1), similar to above
np.random.rand(5, 3)

array([[0.39063958, 0.64670294, 0.48687858],
       [0.87938583, 0.50902764, 0.83549314],
       [0.33859472, 0.30120082, 0.55720078],
       [0.13297358, 0.27732775, 0.34556415],
       [0.31162206, 0.77131865, 0.13986418]])

In [26]:
np.random.rand(5, 3)

array([[0.20148626, 0.7114547 , 0.85834889],
       [0.61837214, 0.37192736, 0.72504357],
       [0.57804083, 0.60571329, 0.90151207],
       [0.37069446, 0.14537552, 0.11709757],
       [0.00833704, 0.14369118, 0.93286981]])



NumPy generates pseudo-random numbers: they look random, but they are produced by a deterministic algorithm, so the same starting point (the seed) always yields the same sequence.

For reproducibility, you might want the random numbers you generate to stay the same across experiments.

To do this, you can use np.random.seed().

This tells NumPy, "Hey, I want you to create random numbers, but start from this seed so they come out the same each time."

Let's see it.


In [27]:
# Set random seed to 0
np.random.seed(0)

# Make 'random' numbers
np.random.randint(10, size=(5, 3))

array([[5, 0, 3],
       [3, 7, 9],
       [3, 5, 2],
       [4, 7, 6],
       [8, 8, 1]])



With np.random.seed() set, every time you run the cell above, the same random numbers will be generated.

What if np.random.seed() wasn't set?

Every time you run the cell below, a new set of numbers will appear.


In [28]:
# Make more random numbers
np.random.randint(10, size=(5, 3))

array([[6, 7, 7],
       [8, 1, 5],
       [9, 8, 9],
       [4, 3, 0],
       [3, 5, 0]])





In [29]:
# Make more random numbers
np.random.randint(10, size=(5, 3))

array([[2, 3, 8],
       [1, 3, 3],
       [3, 7, 0],
       [1, 9, 9],
       [0, 4, 7]])



Let's see it in action again. We'll stay consistent and set the random seed to 0.


In [30]:
# Set random seed to same number as above
np.random.seed(0)

# The same random numbers come out
np.random.randint(10, size=(5, 3))

array([[5, 0, 3],
       [3, 7, 9],
       [3, 5, 2],
       [4, 7, 6],
       [8, 8, 1]])



Because np.random.seed() is set to 0 again, the random numbers match those from the earlier cell that also used a seed of 0.

Setting np.random.seed() is not strictly necessary, but it helps keep numbers the same throughout your experiments.

For example, say you wanted to split your data randomly into training and test sets.

Every time you randomly split, you might get different rows in each set.

If you shared your work with someone else, they'd get different rows in each set too.

Setting np.random.seed() still gives you random-looking numbers; it just makes the randomness repeatable. Hence 'pseudo-random' numbers.
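
To make the train/test example concrete, here is a minimal sketch of a repeatable split. The helper `split_indices`, its parameters, and the stand-in `data` array are hypothetical names invented for this illustration; they are not part of NumPy.

```python
import numpy as np

data = np.arange(10)  # stand-in for 10 rows of data (hypothetical)

def split_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices reproducibly and split them into train/test."""
    np.random.seed(seed)                      # same seed -> same shuffle
    shuffled = np.random.permutation(n_rows)  # random ordering of 0..n_rows-1
    n_test = int(n_rows * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train_a, test_a = split_indices(len(data))
train_b, test_b = split_indices(len(data))

# Both calls used seed 0, so the two splits are identical
print(np.array_equal(test_a, test_b))  # True
```

Anyone who runs this with the same seed should get the same rows in each set. Newer NumPy code often uses a generator created with np.random.default_rng(seed) for the same purpose.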


In [31]:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(10, size=(5, 3)))
df

   0  1  2
0  5  0  3
1  3  7  9
2  3  5  2
3  4  7  6
4  8  8  1



## What unique values are in the array a3?

Now that you've seen a few different ways to create arrays, as an exercise, try to find out which NumPy function you could use to find the unique values in the a3 array.

You might want to search for something like, "how to find the unique values in a numpy array".


In [32]:
a3

array([[[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9]],

       [[10, 11, 12],
        [13, 14, 15],
        [16, 17, 18]]])

In [33]:
np.unique(a3,return_counts=True)

(array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18]),
 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]))

In [34]:
np.unique(a3)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18])

## 3. Viewing arrays and matrices (indexing)
Remember, because arrays and matrices are both ndarrays, they can be viewed (indexed) in similar ways.

Let's check out our 3 arrays again.

In [35]:
a1

array([1, 2, 3])

In [36]:
a2

array([[1. , 2. , 3.3],
       [4. , 5. , 6.5]])

In [37]:
a3

array([[[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9]],

       [[10, 11, 12],
        [13, 14, 15],
        [16, 17, 18]]])

In [38]:
a1[0]

1

In [39]:
a2[0]

array([1. , 2. , 3.3])

In [40]:
a3[0]

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [41]:
# Get 2nd row (index 1) of a2
a2[1]

array([4. , 5. , 6.5])

In [42]:
# Get the first 2 values of the first 2 rows of both arrays
a3[:2,:2,:2]

array([[[ 1,  2],
        [ 4,  5]],

       [[10, 11],
        [13, 14]]])

In [43]:
a4= np.random.randint(10,size=(2,2,3,4,5))
a4

array([[[[[6, 7, 7, 8, 1],
          [5, 9, 8, 9, 4],
          [3, 0, 3, 5, 0],
          [2, 3, 8, 1, 3]],

         [[3, 3, 7, 0, 1],
          [9, 9, 0, 4, 7],
          [3, 2, 7, 2, 0],
          [0, 4, 5, 5, 6]],

         [[8, 4, 1, 4, 9],
          [8, 1, 1, 7, 9],
          [9, 3, 6, 7, 2],
          [0, 3, 5, 9, 4]]],


        [[[4, 6, 4, 4, 3],
          [4, 4, 8, 4, 3],
          [7, 5, 5, 0, 1],
          [5, 9, 3, 0, 5]],

         [[0, 1, 2, 4, 2],
          [0, 3, 2, 0, 7],
          [5, 9, 0, 2, 7],
          [2, 9, 2, 3, 3]],

         [[2, 3, 4, 1, 2],
          [9, 1, 4, 6, 8],
          [2, 3, 0, 0, 6],
          [0, 6, 3, 3, 8]]]],



       [[[[8, 8, 2, 3, 2],
          [0, 8, 8, 3, 8],
          [2, 8, 4, 3, 0],
          [4, 3, 6, 9, 8]],

         [[0, 8, 5, 9, 0],
          [9, 6, 5, 3, 1],
          [8, 0, 4, 9, 6],
          [5, 7, 8, 8, 9]],

         [[2, 8, 6, 6, 9],
          [1, 6, 8, 8, 3],
          [2, 3, 6, 3, 6],
          [5, 7, 0, 8, 4]]],


  

In [44]:
a4.shape

(2, 2, 3, 4, 5)

In [45]:
# Get only the first 3 numbers of each single vector
a4[:,:,:,:,:3]

array([[[[[6, 7, 7],
          [5, 9, 8],
          [3, 0, 3],
          [2, 3, 8]],

         [[3, 3, 7],
          [9, 9, 0],
          [3, 2, 7],
          [0, 4, 5]],

         [[8, 4, 1],
          [8, 1, 1],
          [9, 3, 6],
          [0, 3, 5]]],


        [[[4, 6, 4],
          [4, 4, 8],
          [7, 5, 5],
          [5, 9, 3]],

         [[0, 1, 2],
          [0, 3, 2],
          [5, 9, 0],
          [2, 9, 2]],

         [[2, 3, 4],
          [9, 1, 4],
          [2, 3, 0],
          [0, 6, 3]]]],



       [[[[8, 8, 2],
          [0, 8, 8],
          [2, 8, 4],
          [4, 3, 6]],

         [[0, 8, 5],
          [9, 6, 5],
          [8, 0, 4],
          [5, 7, 8]],

         [[2, 8, 6],
          [1, 6, 8],
          [2, 3, 6],
          [5, 7, 0]]],


        [[[6, 5, 8],
          [9, 7, 5],
          [5, 3, 3],
          [9, 9, 7]],

         [[3, 9, 7],
          [1, 2, 2],
          [5, 8, 4],
          [5, 5, 0]],

         [[1, 0, 3],
          [4, 4, 0],
    



a4's shape is (2, 2, 3, 4, 5), which means it gets displayed like so:

    Innermost array = size 5
    Next array = size 4
    Next array = size 3
    Next array = size 2
    Outermost array = size 2
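
One way to confirm the nesting order is to walk from the outermost level inward, recording the length at each level. This sketch uses an array of zeros with the same shape as a4, since only the shape matters here.

```python
import numpy as np

a4 = np.zeros((2, 2, 3, 4, 5))  # same shape as a4 above, filled with zeros

# Peel off one level of nesting at a time by indexing the first element
lengths = []
level = a4
while level.ndim > 0:
    lengths.append(len(level))
    level = level[0]

print(lengths)  # [2, 2, 3, 4, 5] (outermost to innermost)
```

The lengths read left to right match a4.shape, so the last number in the shape is always the size of the innermost arrays.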




## 4. Manipulating and comparing arrays



    Arithmetic
        +, -, *, /, //, **, %
        np.exp()
        np.log()
        Dot product - np.dot()
        Broadcasting
    Aggregation
        np.sum() - faster than Python's built-in sum() for NumPy arrays
        np.mean()
        np.std()
        np.var()
        np.min()
        np.max()
        np.argmin() - find index of minimum value
        np.argmax() - find index of maximum value
        These work on all ndarrays
            a4.min(axis=0) -- you can use axis as well
    Reshaping
        np.reshape()
    Transposing
        a3.T
    Comparison operators
        >
        <
        <=
        >=
        x != 3
        x == 3
        np.sum(x > 3)
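
As a preview of a few items in this list, the sketch below applies them to a small stand-in array `x` (a hypothetical name, chosen to match the `x` used in the list above).

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]])

col_min = x.min(axis=0)   # minimum of each column -> [1 2 3]
row_max = x.max(axis=1)   # maximum of each row    -> [3 6]
reshaped = x.reshape(3, 2)  # same 6 values, new shape (3, 2)
transposed = x.T            # rows and columns swapped, shape (3, 2)

mask = x > 3              # element-wise comparison gives a boolean array
count = np.sum(mask)      # True counts as 1, so this counts values > 3

print(col_min, row_max, reshaped.shape, transposed.shape, count)
```

Note that reshape and transpose both produce a (3, 2) array here, but the values end up in different positions.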

In [46]:
a1


array([1, 2, 3])

In [47]:
ones = np.ones(3)
ones

array([1., 1., 1.])

In [48]:
# Add two arrays
a1 + ones

array([2., 3., 4.])

In [49]:
# Subtract two arrays
a1 - ones

array([0., 1., 2.])

In [50]:
# Multiply two arrays
a1 * ones

array([1., 2., 3.])

In [51]:
a2

array([[1. , 2. , 3.3],
       [4. , 5. , 6.5]])

In [52]:
# Multiply two arrays of different shapes (a1 is broadcast across each row of a2)
a1 * a2

array([[ 1. ,  4. ,  9.9],
       [ 4. , 10. , 19.5]])

In [53]:
a1.shape, a2.shape

((3,), (2, 3))

In [54]:
# This will error because shapes (2, 3) and (2, 3, 3) are not broadcast-compatible (the 2 in a2 lines up with a 3 in a3)
a2 * a3


ValueError: operands could not be broadcast together with shapes (2,3) (2,3,3) 

In [55]:
a3

array([[[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9]],

       [[10, 11, 12],
        [13, 14, 15],
        [16, 17, 18]]])

## Broadcasting

### What is broadcasting?

    Broadcasting is a feature of NumPy which performs an operation across multiple dimensions of data without replicating the data. This saves time and space. For example, if you have a 3x3 array (A) and want to add a 1x3 array (B), NumPy will add the row of (B) to every row of (A).



### Rules of Broadcasting

    If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
    If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
    If in any dimension the sizes disagree and neither is equal to 1, an error is raised.



The broadcasting rule: In order to broadcast, the size of the trailing axes for both arrays in an operation must be either the same size or one of them must be one.
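
The rules above can be checked on shapes alone with np.broadcast_shapes (available in NumPy 1.20 and later). In this sketch, a2 and a3 are stand-ins filled with ones that only share their shapes with the arrays defined earlier.

```python
import numpy as np

a2 = np.ones((2, 3))
a3 = np.ones((2, 3, 3))

# (3,) is padded to (1, 3), then stretched to (2, 3)
print(np.broadcast_shapes((3,), (2, 3)))  # (2, 3)

# (2, 3) is padded to (1, 2, 3); the middle axis is then 2 vs. 3 -> error
try:
    np.broadcast_shapes((2, 3), (2, 3, 3))
except ValueError as err:
    print("cannot broadcast:", err)

# Adding a trailing axis of size 1 makes the shapes compatible:
# (2, 3, 1) vs. (2, 3, 3) -> (2, 3, 3)
result = a2[:, :, np.newaxis] * a3
print(result.shape)  # (2, 3, 3)
```

Whether adding an axis like this is the right fix depends on which values you actually want lined up; the point here is only that the shapes must be compatible axis by axis, starting from the right.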

In [56]:
a1

array([1, 2, 3])

In [57]:
a1.shape

(3,)

In [58]:
a2.shape

(2, 3)

In [59]:
a1 + a2

array([[2. , 4. , 6.3],
       [5. , 7. , 9.5]])

In [60]:
a2

array([[1. , 2. , 3.3],
       [4. , 5. , 6.5]])

In [61]:
a2 + 2

array([[3. , 4. , 5.3],
       [6. , 7. , 8.5]])

In [62]:
# Raises an error because there's a shape mismatch (2, 3) vs. (2, 3, 3)
a2 + a3

ValueError: operands could not be broadcast together with shapes (2,3) (2,3,3) 

In [63]:
# Divide two arrays
a1 / ones

array([1., 2., 3.])

In [64]:
# Divide using floor division
a2 // a1


array([[1., 1., 1.],
       [4., 2., 2.]])

In [65]:
# Take an array to a power
a1 ** 2

array([1, 4, 9])

In [66]:
# You can also use np.square()
np.square(a1)

array([1, 4, 9])

In [67]:
# Modulus divide (what's the remainder)
a1 % 2

array([1, 0, 1])

You can also find the log or exponential of an array using np.log() and np.exp().

In [68]:
# Find the log of an array
np.log(a1)

array([0.        , 0.69314718, 1.09861229])

In [69]:
# Find the exponential of an array
np.exp(a1)

array([ 2.71828183,  7.3890561 , 20.08553692])

## Aggregation

Aggregation means bringing things together: performing the same operation across a number of values to produce a summary (e.g. a sum or a mean).

In [70]:
sum(a1)

6

In [72]:
np.sum(a1)

6

# Tip:
Use NumPy's `np.sum()` on NumPy arrays and Python's `sum()` on Python lists. 

In [73]:
massive_array = np.random.random(10000)

In [74]:
massive_array.size, type(massive_array)

(10000, numpy.ndarray)

In [76]:
massive_array[:100]

array([2.18749374e-01, 5.69573535e-01, 4.52109035e-01, 9.70236683e-01,
       6.80544691e-01, 8.52955659e-02, 5.64183327e-02, 4.87837704e-01,
       8.81004562e-01, 9.76404387e-01, 6.17657916e-01, 5.42498775e-01,
       8.54613580e-01, 7.43834545e-01, 4.78596326e-01, 6.77081574e-01,
       6.07045061e-01, 7.14696936e-01, 4.69497183e-01, 4.56014623e-01,
       9.06418087e-01, 1.37220420e-01, 2.29219323e-01, 8.81585399e-01,
       9.04424976e-01, 6.45784599e-01, 3.24682972e-01, 5.19711194e-01,
       5.53568650e-05, 3.11860221e-01, 4.25451538e-01, 8.85337660e-01,
       6.79879456e-01, 4.56129772e-01, 4.83408617e-01, 7.88739428e-01,
       2.29441834e-01, 8.80297603e-01, 3.13692393e-01, 9.57450856e-01,
       4.71751571e-01, 7.11583817e-01, 1.53694305e-01, 7.30442177e-01,
       6.46264437e-01, 2.14880737e-01, 1.86458219e-01, 8.07580269e-01,
       7.47079470e-01, 6.74847346e-01, 2.76893751e-01, 1.74908874e-01,
       7.04474258e-01, 4.63150200e-01, 8.40428533e-01, 2.04865762e-01,
      

In [77]:
%timeit sum(massive_array) # python sum()
%timeit np.sum(massive_array) # Numpy np.sum()

921 μs ± 31.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
8.29 μs ± 208 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)




On a NumPy array, np.sum() is much faster than Python's sum() (about 8 μs vs. 921 μs in the run above). Python's sum() is the better fit for plain Python lists.
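
Outside a notebook, where %timeit is unavailable, a similar comparison can be sketched with the standard library's timeit module. The helper `time_call` and the name `massive_list` are hypothetical, and the exact timings will vary by machine, so no expected numbers are shown.

```python
import timeit
import numpy as np

massive_array = np.random.random(10_000)
massive_list = massive_array.tolist()  # the same values as a plain Python list

def time_call(fn, repeats=100):
    """Return the average number of seconds per call over `repeats` calls."""
    return timeit.timeit(fn, number=repeats) / repeats

print("sum(array):   ", time_call(lambda: sum(massive_array)))
print("np.sum(array):", time_call(lambda: np.sum(massive_array)))
print("sum(list):    ", time_call(lambda: sum(massive_list)))
print("np.sum(list): ", time_call(lambda: np.sum(massive_list)))
```

Timing all four combinations shows how each function behaves on the data type it was designed for.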


In [78]:
a2


array([[1. , 2. , 3.3],
       [4. , 5. , 6.5]])

In [79]:
# Find the mean
np.mean(a2)

3.6333333333333333

In [80]:
# Find the max
np.max(a2)

6.5

In [81]:
# Find the min
np.min(a2)

1.0

In [82]:
# Find the standard deviation
np.std(a2)

1.8226964152656422

In [83]:
# Find the variance
np.var(a2)

3.3222222222222224

In [84]:
# The standard deviation is the square root of the variance
np.sqrt(np.var(a2))

1.8226964152656422



### What's the mean?

The mean is the same as the average. You can find the average of a set of numbers by adding them up and dividing by how many there are.

### What's standard deviation?

Standard deviation is a measure of how spread out numbers are.

### What's variance?

The variance is the average of the squared differences from the mean.

To work it out, you:

    1. Work out the mean
    2. For each number, subtract the mean and square the result
    3. Find the average of the squared differences
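
The three steps can be followed by hand and compared against np.var() and np.std(). This is a minimal sketch using a small array of illustrative values.

```python
import numpy as np

values = np.array([2, 4, 6, 8, 10])

# Step 1: work out the mean
mean = values.sum() / values.size             # 6.0

# Step 2: subtract the mean from each number and square the result
squared_diffs = (values - mean) ** 2          # [16.  4.  0.  4. 16.]

# Step 3: find the average of the squared differences
variance = squared_diffs.sum() / values.size  # 8.0

# The standard deviation is the square root of the variance
std = variance ** 0.5

print(variance, np.var(values))
print(std, np.std(values))
```

Both lines should print matching pairs, since np.var() and np.std() follow the same steps by default (dividing by the number of values).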



In [85]:
# Demo of variance
high_var_array = np.array([1, 100, 200, 300, 4000, 5000])
low_var_array = np.array([2, 4, 6, 8, 10])

In [86]:
np.var(high_var_array), np.var(low_var_array)

(4296133.472222221, 8.0)

In [87]:
np.std(high_var_array), np.std(low_var_array)

(2072.711623024829, 2.8284271247461903)

In [88]:
# The standard deviation is the square root of the variance
np.sqrt(np.var(high_var_array))

2072.711623024829

In [89]:
%matplotlib inline
import matplotlib.pyplot as plt

# Plot a histogram of the high-variance values (run the cells above first so high_var_array is defined)
plt.hist(high_var_array)
plt.show()