# DATA SCIENCE INTENSIVE :: Intro to ML in Python
### An Intensive Python ML Course
## Week 01: NumPy

[&larr; Back to course webpage](http://datakolektiv.com/app_direct/introdsnontech/)

![](../img/IntroMLPython_Head.png)

Feedback should be sent to [goran.milovanovic@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com).

These notebooks accompany the DATA SCIENCE INTENSIVE SERIES :: Introduction to ML in Python DataKolektiv course.

### Goran S. Milovanović, PhD
<b>DataKolektiv, Chief Scientist & Owner</b>

### Aleksandar Cvetković, PhD
<b>DataKolektiv, Consultant</b>

![](../img/DK_Logo_100.png)

## Intro to NumPy

- NumPy arrays
- Element-wise operations
- Matrices
- Subsetting vectors, matrices, and multidimensional arrays
- Some algebraic operations
- Broadcasting
- More repeating of things
- Copying things
- Find elements based on conditions
- Basic Statistics
- The treatment of missing values in NumPy

In [1]:
# - libs
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# - set RNG seed
np.random.seed(777) 

### Lists and NumPy arrays

In [2]:
a = [1, 2, 3]
b = [2, 2, 2]
# - a*b raises an error
# - TypeError: can't multiply sequence by non-int of type 'list'

# - Numpy
a = np.array([1, 2, 3])
b = np.array([2, 2, 2])
a*b

array([2, 4, 6])
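
The contrast between lists and arrays can be made explicit. The following is a minimal sketch (the variable names are illustrative and not part of the course code): `*` with an integer repeats a Python list, while the same operator on a NumPy array works element by element.

```python
import numpy as np

# A plain Python list: `*` with an integer repeats the list
py_list = [1, 2, 3]
repeated = py_list * 2

# A NumPy array: `*` multiplies element by element
np_arr = np.array([1, 2, 3])
doubled = np_arr * 2
squared = np_arr * np_arr

print(repeated)          # [1, 2, 3, 1, 2, 3]
print(doubled.tolist())  # [2, 4, 6]
print(squared.tolist())  # [1, 4, 9]
```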

### Element-wise operations

In [3]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
a+b

array([5, 7, 9])

In [4]:
a = np.array([10, 10, 10])
b = np.array([2, 3, 4])
a**b

array([  100,  1000, 10000])

In [5]:
a = np.repeat(10, repeats=3)
b = np.array([2, 3, 4])
a**b

array([  100,  1000, 10000])

In [6]:
np.repeat(10, repeats=3)

array([10, 10, 10])

In [7]:
a = np.repeat(10, repeats=3)
type(a.tolist())
print(a.tolist())

[10, 10, 10]


### Matrices

In [8]:
mat = np.array([[1, 2, 3], 
                [4, 5, 6], 
                [7, 8, 9]])
print(mat)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


Shape

In [9]:
mat.shape

(3, 3)

### Subsetting vectors, matrices, and multidimensional arrays

Subsetting Numpy arrays

In [10]:
mat[0, 0]

1

In [11]:
mat[0, 1]

2

Rows

In [12]:
mat[0, :]

array([1, 2, 3])

In [13]:
mat[1, :]

array([4, 5, 6])

Columns

In [14]:
mat[:, 1]

array([2, 5, 8])

In [15]:
mat[:, 2]

array([3, 6, 9])

Use a list to subset a NumPy array

In [99]:
v = np.linspace(1, 10, 10, dtype="int")
print(v)

[ 1  2  3  4  5  6  7  8  9 10]


In [101]:
v[[0, 2, 4]]

array([1, 3, 5])

Shape

In [16]:
a = np.array([1, 2, 3])
a.shape

(3,)

Dimension

In [17]:
a.ndim

1

In [18]:
multiarray = np.array([
                        [[1, 2, 3], [4, 5, 6], [7, 8, 9]], 
                        [[10, 11, 12], [13, 14, 15], [16, 17, 18]],
                        [[19, 20, 21], [22, 23, 24], [25, 26, 27]]
                    ])
print(multiarray)

[[[ 1  2  3]
  [ 4  5  6]
  [ 7  8  9]]

 [[10 11 12]
  [13 14 15]
  [16 17 18]]

 [[19 20 21]
  [22 23 24]
  [25 26 27]]]


Dimension

In [19]:
multiarray.ndim

3

Easier:

In [20]:
multiarray = np.linspace(1,27, 27, dtype=int)
# - shape passed positionally (the `newshape` keyword is deprecated in recent NumPy releases)
multiarray = np.reshape(multiarray, (3, 3, 3))
print(multiarray)

[[[ 1  2  3]
  [ 4  5  6]
  [ 7  8  9]]

 [[10 11 12]
  [13 14 15]
  [16 17 18]]

 [[19 20 21]
  [22 23 24]
  [25 26 27]]]


In [21]:
multiarray.shape

(3, 3, 3)

`np.linspace()`

In [22]:
np.linspace(-10, 10, 100)

array([-10.        ,  -9.7979798 ,  -9.5959596 ,  -9.39393939,
        -9.19191919,  -8.98989899,  -8.78787879,  -8.58585859,
        -8.38383838,  -8.18181818,  -7.97979798,  -7.77777778,
        -7.57575758,  -7.37373737,  -7.17171717,  -6.96969697,
        -6.76767677,  -6.56565657,  -6.36363636,  -6.16161616,
        -5.95959596,  -5.75757576,  -5.55555556,  -5.35353535,
        -5.15151515,  -4.94949495,  -4.74747475,  -4.54545455,
        -4.34343434,  -4.14141414,  -3.93939394,  -3.73737374,
        -3.53535354,  -3.33333333,  -3.13131313,  -2.92929293,
        -2.72727273,  -2.52525253,  -2.32323232,  -2.12121212,
        -1.91919192,  -1.71717172,  -1.51515152,  -1.31313131,
        -1.11111111,  -0.90909091,  -0.70707071,  -0.50505051,
        -0.3030303 ,  -0.1010101 ,   0.1010101 ,   0.3030303 ,
         0.50505051,   0.70707071,   0.90909091,   1.11111111,
         1.31313131,   1.51515152,   1.71717172,   1.91919192,
         2.12121212,   2.32323232,   2.52525253,   2.72
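
A smaller call makes it easier to see what `np.linspace()` does. This is a minimal sketch with illustrative values: both endpoints are included by default, the spacing is `(stop - start) / (num - 1)`, and `np.arange()` is shown for comparison because it takes a step rather than a count and excludes the stop value.

```python
import numpy as np

# 5 evenly spaced points from 0 to 1, both endpoints included
pts = np.linspace(0, 1, 5)
print(pts.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]

# retstep=True also returns the spacing: (stop - start) / (num - 1)
pts2, step = np.linspace(0, 1, 5, retstep=True)
print(step)  # 0.25

# np.arange takes a step instead of a count and excludes the stop value
rng = np.arange(0, 1, 0.25)
print(rng.tolist())  # [0.0, 0.25, 0.5, 0.75]
```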

Subsetting: work outside in

In [23]:
multiarray[0, :, :]

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [24]:
multiarray[1, :, :]

array([[10, 11, 12],
       [13, 14, 15],
       [16, 17, 18]])

In [25]:
multiarray[2, :, :]

array([[19, 20, 21],
       [22, 23, 24],
       [25, 26, 27]])

Second rows from all layers

In [26]:
multiarray[:, 1, :]

array([[ 4,  5,  6],
       [13, 14, 15],
       [22, 23, 24]])

Second columns from all layers

In [27]:
multiarray[:, :, 1]

array([[ 2,  5,  8],
       [11, 14, 17],
       [20, 23, 26]])

In [28]:
print(mat)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


Pick a single element

In [29]:
mat

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [30]:
mat[1, 1]

5

In [31]:
mat[1, -2]

5

Stepsize

In [32]:
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
a[0:10:2]

[1, 3, 5, 7, 9]

In [33]:
a[1:10:2]

[2, 4, 6, 8, 10]
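
The `start:stop:step` notation shown for a list works the same way on NumPy arrays, and on each axis separately. A short sketch, with array names chosen only for illustration:

```python
import numpy as np

arr = np.arange(1, 11)      # 1, 2, ..., 10

evens = arr[1::2]           # start at index 1, take every second element
backwards = arr[::-1]       # a negative step walks the array in reverse

grid = np.arange(1, 10).reshape(3, 3)
corners = grid[::2, ::2]    # every second row and every second column

print(evens.tolist())       # [2, 4, 6, 8, 10]
print(backwards.tolist())   # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
print(corners.tolist())     # [[1, 3], [7, 9]]
```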

Set the value of an element

In [34]:
print(mat)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [35]:
mat[0, 1] = 17
print(mat)

[[ 1 17  3]
 [ 4  5  6]
 [ 7  8  9]]


Change whole row

In [36]:
mat[0, :] = [8, 9, 11]
print(mat)

[[ 8  9 11]
 [ 4  5  6]
 [ 7  8  9]]


Stacking arrays

In [92]:
v1 = np.array([1, 1, 1, 1])
v2 = np.array([2, 2, 2, 2])
vstacked = np.vstack([v1, v2])
print(vstacked)

[[1 1 1 1]
 [2 2 2 2]]


In [93]:
v1 = np.array([1, 1, 1, 1])
v2 = np.array([2, 2, 2, 2])
hstacked = np.hstack([v1, v2])
print(hstacked)

[1 1 1 1 2 2 2 2]
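
Two related functions are worth knowing alongside `np.vstack()` and `np.hstack()`. The sketch below reuses the values of `v1` and `v2`; the other names are illustrative. `np.column_stack()` turns 1-D vectors into the columns of a matrix, and `np.concatenate()` joins arrays along an existing axis.

```python
import numpy as np

v1 = np.array([1, 1, 1, 1])
v2 = np.array([2, 2, 2, 2])

# Each 1-D vector becomes one column: the result has shape (4, 2)
as_columns = np.column_stack([v1, v2])
print(as_columns.shape)  # (4, 2)

# The same matrix is obtained by transposing the vertical stack
via_transpose = np.vstack([v1, v2]).T

# For 1-D inputs, concatenate gives the same result as hstack
joined = np.concatenate([v1, v2])
print(joined.tolist())  # [1, 1, 1, 1, 2, 2, 2, 2]
```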


### Some algebraic operations

Transpose

In [37]:
mat.T

array([[ 8,  4,  7],
       [ 9,  5,  8],
       [11,  6,  9]])

Multiply matrix by a scalar constant, elementwise

In [38]:
c = 3
c * mat

array([[24, 27, 33],
       [12, 15, 18],
       [21, 24, 27]])

Matrix times matrix, elementwise

In [39]:
mat1 = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
print(mat1)
print(mat)
print("Element-wise product is:")
mat1 * mat

[[1 1 1]
 [2 2 2]
 [3 3 3]]
[[ 8  9 11]
 [ 4  5  6]
 [ 7  8  9]]
Element-wise product is:


array([[ 8,  9, 11],
       [ 8, 10, 12],
       [21, 24, 27]])

the same as:

In [40]:
np.multiply(mat1, mat)

array([[ 8,  9, 11],
       [ 8, 10, 12],
       [21, 24, 27]])

Algebraic operations: the dot product

In [41]:
v1 = np.array([1, 2, 3])
v2 = np.array([5, 6, 7])
np.dot(v1, v2)

38

In [42]:
np.dot(v2, v1)

38

Do not forget that the dot product is not commutative for matrices:

In [43]:
print(mat1)
print(mat)
print("Dot product: np.dot(mat1, mat)")
np.dot(mat1, mat)

[[1 1 1]
 [2 2 2]
 [3 3 3]]
[[ 8  9 11]
 [ 4  5  6]
 [ 7  8  9]]
Dot product: np.dot(mat1, mat)


array([[19, 22, 26],
       [38, 44, 52],
       [57, 66, 78]])

In [44]:
print("Dot product: np.dot(mat, mat1)")
np.dot(mat, mat1)

Dot product: np.dot(mat, mat1)


array([[59, 59, 59],
       [32, 32, 32],
       [50, 50, 50]])

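Using `@` is preferred for matrix multiplication: for 2-D arrays it gives the same result as `np.dot()`, and it is what the NumPy documentation recommends for this case: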

In [45]:
mat @ mat1

array([[59, 59, 59],
       [32, 32, 32],
       [50, 50, 50]])

In [46]:
mat1 @ mat

array([[19, 22, 26],
       [38, 44, 52],
       [57, 66, 78]])
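
A quick check, using two small matrices invented for this sketch, confirms that `@`, `np.matmul()` and `np.dot()` agree for 2-D arrays, and that swapping the operands changes the result:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

# For 2-D arrays the three spellings give the same matrix product
same_as_dot = np.array_equal(A @ B, np.dot(A, B))
same_as_matmul = np.array_equal(A @ B, np.matmul(A, B))

# Matrix multiplication is not commutative in general
commutes = np.array_equal(A @ B, B @ A)

print((A @ B).tolist())  # [[2, 1], [4, 3]]
print((B @ A).tolist())  # [[3, 4], [1, 2]]
print(same_as_dot, same_as_matmul, commutes)  # True True False
```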

Dot product: vector times matrix

In [47]:
a = np.array([1, 2, 3])
print(a)
print(mat)

[1 2 3]
[[ 8  9 11]
 [ 4  5  6]
 [ 7  8  9]]


In [48]:
np.dot(a, mat)

array([37, 43, 50])

In [49]:
np.dot(mat, a)

array([59, 32, 50])

Type

In [50]:
a.dtype

dtype('int64')

Size (the total number of elements)

In [51]:
a.size

3

Float array

In [52]:
a = np.array([[1.1, 2, 3.14], [2, 2.22, 1.41]])
a.dtype

dtype('float64')

Outer product

In [53]:
print(v1)
print(v2)
np.outer(v1, v2)

[1 2 3]
[5 6 7]


array([[ 5,  6,  7],
       [10, 12, 14],
       [15, 18, 21]])

In [54]:
np.outer(v2, v1)

array([[ 5, 10, 15],
       [ 6, 12, 18],
       [ 7, 14, 21]])

In [55]:
v2 = np.array([4, 5, 6, 7])
print(v1)
print(v2)
np.outer(v1, v2)

[1 2 3]
[4 5 6 7]


array([[ 4,  5,  6,  7],
       [ 8, 10, 12, 14],
       [12, 15, 18, 21]])

In [56]:
np.outer(mat, mat1)

array([[ 8,  8,  8, 16, 16, 16, 24, 24, 24],
       [ 9,  9,  9, 18, 18, 18, 27, 27, 27],
       [11, 11, 11, 22, 22, 22, 33, 33, 33],
       [ 4,  4,  4,  8,  8,  8, 12, 12, 12],
       [ 5,  5,  5, 10, 10, 10, 15, 15, 15],
       [ 6,  6,  6, 12, 12, 12, 18, 18, 18],
       [ 7,  7,  7, 14, 14, 14, 21, 21, 21],
       [ 8,  8,  8, 16, 16, 16, 24, 24, 24],
       [ 9,  9,  9, 18, 18, 18, 27, 27, 27]])

In [57]:
np.outer(mat1, mat)

array([[ 8,  9, 11,  4,  5,  6,  7,  8,  9],
       [ 8,  9, 11,  4,  5,  6,  7,  8,  9],
       [ 8,  9, 11,  4,  5,  6,  7,  8,  9],
       [16, 18, 22,  8, 10, 12, 14, 16, 18],
       [16, 18, 22,  8, 10, 12, 14, 16, 18],
       [16, 18, 22,  8, 10, 12, 14, 16, 18],
       [24, 27, 33, 12, 15, 18, 21, 24, 27],
       [24, 27, 33, 12, 15, 18, 21, 24, 27],
       [24, 27, 33, 12, 15, 18, 21, 24, 27]])
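
The 9 x 9 results arise because `np.outer()` flattens any input that is not already 1-D before computing the product. A minimal sketch with two small, made-up matrices:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[10, 20],
              [30, 40]])

# Both 2 x 2 inputs are flattened to length-4 vectors first
outer_2d = np.outer(A, B)
outer_flat = np.outer(A.ravel(), B.ravel())

print(outer_2d.shape)                        # (4, 4)
print(np.array_equal(outer_2d, outer_flat))  # True
```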

### Broadcasting

In [58]:
a = np.array([1, 2, 3])
b = np.array([[1, 2, 3], [4, 5, 6]])
a+b

array([[2, 4, 6],
       [5, 7, 9]])

In [59]:
a*b

array([[ 1,  4,  9],
       [ 4, 10, 18]])
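
Broadcasting compares shapes from the rightmost dimension: two dimensions are compatible when they are equal or when one of them is 1. The sketch below (array names are illustrative) shows one compatible pair and one incompatible pair:

```python
import numpy as np

col = np.array([[10], [20]])   # shape (2, 1)
row = np.array([1, 2, 3])      # shape (3,)

# (2, 1) and (3,) are compatible: the result has shape (2, 3)
total = col + row
print(total.shape)     # (2, 3)
print(total.tolist())  # [[11, 12, 13], [21, 22, 23]]

# (3,) and (2,) are not compatible: the sizes differ and neither is 1
try:
    row + np.array([1, 2])
    broadcast_ok = True
except ValueError:
    broadcast_ok = False
print(broadcast_ok)    # False
```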

Example from NumPy documentation

In [60]:
observation = np.array([111.0, 188.0])
codes = np.array([[102.0, 203.0],
    [132.0, 193.0],
    [45.0, 155.0],
    [57.0, 173.0]])
diff = codes - observation
print(diff)
# - Euclidean distances
dist = np.sqrt(np.sum(diff**2,axis=-1))
print(dist)
# - index of the minimum
w_min = np.argmin(dist)
print(w_min)
# - minimal distance
print(dist[w_min])

[[ -9.  15.]
 [ 21.   5.]
 [-66. -33.]
 [-54. -15.]]
[17.49285568 21.58703314 73.79024326 56.04462508]
0
17.4928556845359


### More repeating of things

In [61]:
np.ones(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [62]:
np.ones(10, dtype=int)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [63]:
np.zeros(4)

array([0., 0., 0., 0.])

In [64]:
np.full((2,2), 10)

array([[10, 10],
       [10, 10]])

In [65]:
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a)

[[1 2 3]
 [4 5 6]]


In [66]:
np.repeat(a, repeats=2, axis=0)

array([[1, 2, 3],
       [1, 2, 3],
       [4, 5, 6],
       [4, 5, 6]])

In [67]:
np.repeat(a, repeats=2, axis=1)

array([[1, 1, 2, 2, 3, 3],
       [4, 4, 5, 5, 6, 6]])

In [68]:
a = np.array([[1, 2, 3]])
print(a.ndim)
np.repeat(a, repeats=2, axis=0)

2


array([[1, 2, 3],
       [1, 2, 3]])

In [69]:
a = np.array([[1, 2, 3]])
print(a.ndim)
np.repeat(a, repeats=2, axis=1)

2


array([[1, 1, 2, 2, 3, 3]])

### Copying things

In [70]:
z = np.array([1, 2, 3])
y = z
y[0] = 7
print(z)

[7 2 3]


In [71]:
z = np.array([1, 2, 3])
y = z.copy()
y[0] = 7
print(y)
print(z)

[7 2 3]
[1 2 3]
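
Plain assignment is not the only way to end up sharing data. In this sketch (names are illustrative), a basic slice is a view on the original array, while `.copy()` and indexing with a list of integers produce independent arrays. `np.shares_memory()` reports which is which.

```python
import numpy as np

z = np.arange(5)             # 0, 1, 2, 3, 4

view = z[1:4]                # basic slicing returns a view
view[0] = 99                 # this also changes z[1]

independent = z[1:4].copy()  # an explicit copy owns its data
independent[0] = -1          # z is not affected

picked = z[[0, 1]]           # indexing with a list returns a copy
picked[0] = 500              # z is not affected

print(z.tolist())                        # [0, 99, 2, 3, 4]
print(np.shares_memory(z, view))         # True
print(np.shares_memory(z, independent))  # False
print(np.shares_memory(z, picked))       # False
```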


### Find elements based on conditions

In [72]:
v1 = np.linspace(1, 100, 100)
print(v1)

[  1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.  14.
  15.  16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.  28.
  29.  30.  31.  32.  33.  34.  35.  36.  37.  38.  39.  40.  41.  42.
  43.  44.  45.  46.  47.  48.  49.  50.  51.  52.  53.  54.  55.  56.
  57.  58.  59.  60.  61.  62.  63.  64.  65.  66.  67.  68.  69.  70.
  71.  72.  73.  74.  75.  76.  77.  78.  79.  80.  81.  82.  83.  84.
  85.  86.  87.  88.  89.  90.  91.  92.  93.  94.  95.  96.  97.  98.
  99. 100.]


In [73]:
cond = v1 > 50
print(cond)

[False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True]


In [74]:
v1[cond]

array([ 51.,  52.,  53.,  54.,  55.,  56.,  57.,  58.,  59.,  60.,  61.,
        62.,  63.,  64.,  65.,  66.,  67.,  68.,  69.,  70.,  71.,  72.,
        73.,  74.,  75.,  76.,  77.,  78.,  79.,  80.,  81.,  82.,  83.,
        84.,  85.,  86.,  87.,  88.,  89.,  90.,  91.,  92.,  93.,  94.,
        95.,  96.,  97.,  98.,  99., 100.])

In [75]:
v1[v1 < 50]

array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13.,
       14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25., 26.,
       27., 28., 29., 30., 31., 32., 33., 34., 35., 36., 37., 38., 39.,
       40., 41., 42., 43., 44., 45., 46., 47., 48., 49.])

In [76]:
v1[(v1 < 50) & (v1 > 10)]

array([11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23.,
       24., 25., 26., 27., 28., 29., 30., 31., 32., 33., 34., 35., 36.,
       37., 38., 39., 40., 41., 42., 43., 44., 45., 46., 47., 48., 49.])

In [77]:
print(mat)

[[ 8  9 11]
 [ 4  5  6]
 [ 7  8  9]]


In [78]:
mat[mat > 5]

array([ 8,  9, 11,  6,  7,  8,  9])
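
Boolean masks can also report where the matching elements are and can be used to change them. A minimal sketch on a matrix with the same values as `mat` (the name `m` and the threshold are chosen for illustration):

```python
import numpy as np

m = np.array([[8, 9, 11],
              [4, 5, 6],
              [7, 8, 9]])

# With one argument, np.where returns the row and column indices of the matches
rows, cols = np.where(m > 8)
print(rows.tolist(), cols.tolist())  # [0, 0, 2] [1, 2, 2]

# With three arguments it picks values: 8 where the condition holds, m elsewhere
capped = np.where(m > 8, 8, m)
print(capped.tolist())  # [[8, 8, 8], [4, 5, 6], [7, 8, 8]]

# A mask on the left-hand side assigns in place (done on a copy here)
zeroed = m.copy()
zeroed[zeroed > 8] = 0
print(zeroed.tolist())  # [[8, 0, 0], [4, 5, 6], [7, 8, 0]]
```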

`np.any()`

In [109]:
my_matrix = np.array([
        [1, 2, 3],
        [4, 5, 6],
        [7, 8 , 9]
    ])
print(my_matrix)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [111]:
np.any(my_matrix>7)


True

`np.any()` on columns:

In [112]:
np.any(my_matrix>7, axis=0)

array([False,  True,  True])

`np.any()` on rows:

In [113]:
np.any(my_matrix>7, axis=1)

array([False, False,  True])

Confused about axes? (e.g. `axis=0` for columns, `axis=1` for rows)

> Axes are defined for arrays with more than one dimension. A 2-dimensional array has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1). [my emphasis]
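
One way to remember this: the `axis` argument names the axis that is collapsed by the operation. A small sketch with a 2 x 3 matrix made up for the purpose:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])    # shape (2, 3)

# axis=0 collapses the rows: one result per column
col_sums = m.sum(axis=0)
print(col_sums.tolist(), col_sums.shape)  # [5, 7, 9] (3,)

# axis=1 collapses the columns: one result per row
row_sums = m.sum(axis=1)
print(row_sums.tolist(), row_sums.shape)  # [6, 15] (2,)
```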

### Basic Statistics

In [79]:
a = np.random.random_sample(100)*10
a

array([1.52663735, 3.0235661 , 0.62036415, 4.59860342, 8.35253384,
       9.26997048, 7.26988984, 7.68496222, 2.69205066, 6.44029292,
       0.93373257, 0.79685886, 5.89613753, 3.43340538, 9.88876149,
       6.26473206, 6.8177928 , 5.52256814, 2.68860058, 3.73259386,
       2.22928099, 1.864426  , 3.90648093, 1.93162406, 6.10910931,
       8.82808447, 6.22338824, 2.53118944, 1.79930307, 8.1640447 ,
       2.25371621, 5.1685714 , 5.18495819, 6.00374936, 5.32620483,
       0.1331005 , 5.24097262, 8.95884714, 7.69901294, 1.22851696,
       2.95872694, 6.12023579, 7.26138122, 4.63497471, 7.69110367,
       1.91631031, 5.57866722, 5.50778157, 4.72225491, 7.91884961,
       1.15249678, 6.81303898, 3.62333611, 3.44208894, 4.4951875 ,
       0.2694226 , 4.1524769 , 9.22231703, 0.91205571, 3.1512178 ,
       5.28022244, 3.28062031, 4.48915544, 0.16334415, 0.97026903,
       6.92588574, 8.3594341 , 4.2432199 , 8.48774304, 5.46791211,
       3.54103458, 7.27249682, 0.93851678, 8.92858796, 3.36258

In [80]:
a.mean()

4.856717913159926

In [81]:
a.var()

7.949586450429896

**Note.** `numpy.var()` uses `ddof=0` by default, i.e. it divides the sum of squared deviations by `n`. For the unbiased estimate of the variance, which divides by `n - 1`, set `ddof=1`:

In [82]:
a.var(ddof=1)

8.02988530346454

The same holds for `numpy.std()` 

In [83]:
a.std()

2.819501099561746

In [84]:
np.sqrt(a.var(ddof=1))

2.8337052252244836

In [85]:
a.std(ddof=1)

2.8337052252244836
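
The effect of `ddof` can be verified by hand. The sketch below uses a short, made-up sample (not the random array `a`) and computes both versions of the variance from the sum of squared deviations:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = x.size

# Sum of squared deviations from the mean (the mean is 5.0, the sum is 32.0)
ss = np.sum((x - x.mean()) ** 2)

pop_var = ss / n           # divisor n: matches x.var(), i.e. ddof=0
sample_var = ss / (n - 1)  # divisor n - 1: matches x.var(ddof=1)

print(np.isclose(pop_var, x.var()))                    # True
print(np.isclose(sample_var, x.var(ddof=1)))           # True
print(np.isclose(np.sqrt(sample_var), x.std(ddof=1)))  # True
```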

In [86]:
np.median(a)

4.687852775613138

### The treatment of missing values in NumPy

In [87]:
v = np.array([1, 2, 3, 4, np.nan, 6, 7, np.nan, 8, 9])
print(v)

[ 1.  2.  3.  4. nan  6.  7. nan  8.  9.]


In [88]:
v.mean()

nan

In [89]:
s = np.sum(v[np.logical_not(np.isnan(v))])
print(s)
n = v[np.logical_not(np.isnan(v))].size
print(n)
s/n

40.0
8


5.0

You can also do:

In [90]:
v = np.array([1, 2, 3, 4, np.nan, 6, 7, np.nan, 8, 9]) 
v1 = v[~np.isnan(v)]
v1.mean()

5.0

In [91]:
v[~np.isnan(v)].mean()

5.0
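
NumPy also provides NaN-aware versions of the common reductions, which skip missing values without explicit masking. A minimal sketch on the same values as `v`:

```python
import numpy as np

v = np.array([1, 2, 3, 4, np.nan, 6, 7, np.nan, 8, 9])

nan_count = int(np.isnan(v).sum())  # number of missing values
total = np.nansum(v)                # sum that ignores NaN
average = np.nanmean(v)             # mean that ignores NaN

print(nan_count)  # 2
print(total)      # 40.0
print(average)    # 5.0

# NaN never compares equal to anything, itself included,
# which is why np.isnan() is used instead of `==`
print(np.nan == np.nan)  # False
```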

Reading `.txt` data from the local filesystem

In [96]:
path = "../_data/just_numbers.txt"
data = np.genfromtxt(path, delimiter=",")
print(data)

[[ 1.  2.  3.  4.  5.]
 [ 6.  7.  8.  9. 10.]
 [11. 12. 13. 14. 15.]]


`astype()`

In [97]:
data.astype("int32")

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]], dtype=int32)
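
Two details of `astype()` are easy to miss: it returns a new array and leaves the original unchanged, and converting floats to integers drops the fractional part instead of rounding. A short sketch with made-up values:

```python
import numpy as np

x = np.array([1.9, -1.9, 2.5])

truncated = x.astype(int)          # fractional part dropped: 1, -1, 2
rounded = np.round(x).astype(int)  # round first (halves go to the even neighbour): 2, -2, 2

print(truncated.tolist())  # [1, -1, 2]
print(rounded.tolist())    # [2, -2, 2]
print(x.dtype)             # float64, the original array is unchanged
```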

### Readings and Videos

- [Python NumPy Tutorial for Beginners from freeCodeCamp.org](https://www.youtube.com/watch?v=QUT1VHiLmmI)
- [Broadcasting from numpy.org](https://numpy.org/doc/stable/user/basics.broadcasting.html)

<hr>

Goran S. Milovanović & Aleksandar Cvetković

DataKolektiv, 2022/23.

[hello@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com)

![](../img/DK_Logo_100.png)

<font size=1>License: <a href="https://www.gnu.org/licenses/gpl-3.0.txt">GPLv3</a> This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see <a href="http://www.gnu.org/licenses/">http://www.gnu.org/licenses/</a>.</font>