In the big data era, we often need to process thousands of data points or more. To make these computations faster and more efficient, we can take advantage of vectorization, which applies an operation to whole arrays at once (often using parallel hardware instructions) instead of iterating through Python for-loops. The demos below compare the efficiency of the vectorized approach and the for-loop approach.

In [1]:
import numpy as np
a = np.array([1,2,3,4])
print(f"a: {a}")
print(f"a.shape: {a.shape}")

a: [1 2 3 4]
a.shape: (4,)


In [2]:
a = np.random.rand(1000000)
b = np.random.rand(1000000)
print('a:{} ;\nb:{}\n'.format(a,b))
print('a.shape:{} ;\nb.shape:{}'.format(a.shape,b.shape))

a:[0.64765647 0.66400807 0.88634352 ... 0.02563275 0.85729959 0.75305928] ;
b:[0.77827806 0.65743669 0.9024418  ... 0.61962603 0.78409965 0.47896914]

a.shape:(1000000,) ;
b.shape:(1000000,)


In [3]:
import time
a = np.random.rand(1000000)
b = np.random.rand(1000000)

# vectorized version
c = 0
tic = time.time()
c = np.dot(a,b)
toc = time.time()
print(f"c = {c}")
print("Vectorized version: {} ms\n".format(1000*(toc-tic)))
vecV= 1000*(toc-tic)

# for-loop version
c = 0
tic = time.time()
for i in range(1000000):
    c += a[i] * b[i]  # c = c + a[i]*b[i]
toc = time.time()
print(f"c = {c}")
print("For-loop version: {} ms\n".format(1000*(toc-tic)))
forV = 1000*(toc-tic)

# comparison
print(f"The vectorized version is {round(forV/vecV)} times faster than the for-loop version")

c = 249613.00502631514
Vectorized version: 3.012418746948242 ms

c = 249613.00502631374
For-loop version: 943.0043697357178 ms

The vectorized version is 313 times faster than the for-loop version
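
A single `time.time()` measurement is noisy, so the exact speed-up factor changes from run to run and from machine to machine. As a rough sketch (the helper names `dot_loop` and `best_ms` and the smaller array size are assumptions made for this illustration, not part of the demo above), the comparison can be repeated with the standard-library `timeit` module, keeping the best of several runs:

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)   # seeded generator so the inputs are reproducible
n = 100_000                      # smaller than the demo above to keep the loop quick
a = rng.random(n)
b = rng.random(n)


def dot_loop(x, y):
    """Dot product computed one element at a time in pure Python."""
    total = 0.0
    for i in range(len(x)):
        total += x[i] * y[i]
    return total


def best_ms(func, repeat=3):
    """Best wall-clock time in milliseconds over `repeat` single calls."""
    return 1000 * min(timeit.repeat(func, number=1, repeat=repeat))


vec_ms = best_ms(lambda: np.dot(a, b))
loop_ms = best_ms(lambda: dot_loop(a, b))

# Both methods compute the same value up to floating-point rounding.
assert np.isclose(np.dot(a, b), dot_loop(a, b))
print(f"vectorized: {vec_ms:.3f} ms, for-loop: {loop_ms:.3f} ms")
```

Taking the minimum of several runs reduces the influence of other processes on the measurement; the absolute numbers will still differ between machines.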


In [4]:
u=np.zeros((5,1))
print(f"u: \n{u}")
v = np.random.rand(5)
print(f"v: \n{v}")

u: 
[[0.]
 [0.]
 [0.]
 [0.]
 [0.]]
v: 
[0.49068565 0.8785064  0.67276652 0.60107853 0.28136746]


In [5]:
import math
n=1000000
u=np.zeros((n))
v = np.random.rand(n)

# for-loops
tic = time.time()
for i in range(n):
    u[i] = math.exp(v[i])
toc = time.time()
print(f"u: \n{u}")
print("For-loop version: {} ms\n".format(1000*(toc-tic)))

# NumPy built-in vectorized method
tic = time.time()
u = np.exp(v)
toc = time.time()
print(f"u: \n{u}")
print("NumPy vectorized version: {} ms\n".format(1000*(toc-tic)))

u: 
[1.70935453 2.27257708 1.46078209 ... 2.32710143 1.09417619 2.58486672]
For-loop version: 648.2234001159668 ms

u: 
[1.70935453 2.27257708 1.46078209 ... 2.32710143 1.09417619 2.58486672]
NumPy vectorized version: 16.544342041015625 ms



## Broadcasting in Python

In [6]:
A = np.array([[56.0, 0.0, 4.4, 68.0],
             [1.2, 104.0, 52.0, 8.0],
             [1.8, 135.0, 99.0, 0.9]])
print(A)

[[ 56.    0.    4.4  68. ]
 [  1.2 104.   52.    8. ]
 [  1.8 135.   99.    0.9]]


In [7]:
cal = A.sum(axis=0)
print(cal)

[ 59.  239.  155.4  76.9]


In [8]:
cal.reshape(1,4)

array([[ 59. , 239. , 155.4,  76.9]])

In [9]:
percentage = 100*A/cal.reshape(1,4)
print(percentage)

[[94.91525424  0.          2.83140283 88.42652796]
 [ 2.03389831 43.51464435 33.46203346 10.40312094]
 [ 3.05084746 56.48535565 63.70656371  1.17035111]]
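
Dividing the (3, 4) matrix `A` by the (1, 4) row of column totals works because NumPy broadcasts that row across all three rows of `A`. The sketch below reuses the same numbers to check two things: the `reshape(1, 4)` call is optional in this case, since an array of shape `(4,)` broadcasts against `(3, 4)` in the same way, and every column of the result sums to 100. The name `percentage_no_reshape` is introduced only for this illustration.

```python
import numpy as np

A = np.array([[56.0, 0.0, 4.4, 68.0],
              [1.2, 104.0, 52.0, 8.0],
              [1.8, 135.0, 99.0, 0.9]])

cal = A.sum(axis=0)                       # column totals, shape (4,)

percentage = 100 * A / cal.reshape(1, 4)  # (3, 4) / (1, 4) -> (3, 4)
percentage_no_reshape = 100 * A / cal     # (3, 4) / (4,)   -> (3, 4)

# Both forms give the same matrix, and each column adds up to 100 percent.
print(np.allclose(percentage, percentage_no_reshape))
print(np.allclose(percentage.sum(axis=0), 100.0))
```

Keeping the explicit `reshape` is still a reasonable habit, because it documents the intended shape and fails early if the array does not have four elements.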


## NumPy Vectors

In [10]:
import numpy as np
a = np.random.randn(5) # five Gaussian random numbers stored in array a
print(a)

[-0.21570704  0.56618351 -0.20783644 -1.5515172  -1.12910407]


In [11]:
print(a.shape) # rank-1 array in NumPy
print("""
Array 'a' is a rank-1 array in NumPy;
it is neither a row vector nor a column vector!
This has some slightly non-intuitive effects.""")

(5,)

Array 'a' is a rank-1 array in NumPy;
it is neither a row vector nor a column vector!
This has some slightly non-intuitive effects.


In [12]:
print(a.T)
print(a.T.shape)

[-0.21570704  0.56618351 -0.20783644 -1.5515172  -1.12910407]
(5,)


In [13]:
print(np.dot(a,a.T))

4.0923708867409925
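
The scalar result above follows from the fact that transposing a rank-1 array does nothing, so `np.dot(a, a.T)` is simply the inner product of `a` with itself. A minimal sketch of that behaviour, with `col` as a name chosen only for this example:

```python
import numpy as np

a = np.random.randn(5)           # rank-1 array, shape (5,)

# Transposing a rank-1 array returns an array of the same shape.
assert a.T.shape == (5,)

# np.dot of two rank-1 arrays is the inner product: here, the sum of squares.
inner = np.dot(a, a.T)
assert np.isclose(inner, np.sum(a ** 2))

# Reshaping gives an explicit column vector whose transpose is a row vector.
col = a.reshape(5, 1)
print(col.shape, col.T.shape)    # (5, 1) (1, 5)
```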


Passing both dimensions explicitly, as in randn(5,1), returns a 5-by-1 column vector.

In [14]:
a = np.random.randn(5,1)
print(a)
print(a.shape)

[[-0.68315156]
 [ 0.17948474]
 [-0.14198841]
 [-0.51533806]
 [-1.06276184]]
(5, 1)


In [15]:
print(a.T)
print(a.T.shape)

[[-0.68315156  0.17948474 -0.14198841 -0.51533806 -1.06276184]]
(1, 5)


The matrix product of a 5-by-1 matrix with a 1-by-5 matrix is a 5-by-5 matrix (the outer product of the two vectors).

In [16]:
print(np.dot(a,a.T))

[[ 0.46669605 -0.12261528  0.0969996   0.352054    0.72602741]
 [-0.12261528  0.03221477 -0.02548475 -0.09249532 -0.19074953]
 [ 0.0969996  -0.02548475  0.02016071  0.07317203  0.15089986]
 [ 0.352054   -0.09249532  0.07317203  0.26557332  0.54768163]
 [ 0.72602741 -0.19074953  0.15089986  0.54768163  1.12946273]]


In [17]:
print(np.dot(a,a.T).shape)

(5, 5)
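
With explicit two-dimensional shapes, the order of the operands decides the result: a column times a row gives a 5-by-5 matrix, while a row times a column gives a 1-by-1 matrix. The sketch below checks this, using `outer` and `inner` as illustrative names:

```python
import numpy as np

a = np.random.randn(5, 1)

outer = np.dot(a, a.T)                  # (5, 1) times (1, 5) -> (5, 5)
inner = np.dot(a.T, a)                  # (1, 5) times (5, 1) -> (1, 1)

# Entry (i, j) of the outer product is a[i] * a[j], so the matrix is symmetric.
assert np.allclose(outer, outer.T)
# np.outer flattens its inputs and computes the same 5-by-5 matrix.
assert np.allclose(outer, np.outer(a, a))

print(outer.shape, inner.shape)         # (5, 5) (1, 1)
```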


In [18]:
assert(a.shape==(5,1))

In [19]:
assert(a.shape==(1,5))

AssertionError: 

In [20]:
a = np.random.randn(3, 3)
b = np.random.randn(3, 1)
c = a*b
c.shape

(3, 3)
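
The `*` operator is element-wise, so the (3, 1) array `b` is broadcast across the three columns of `a` and the result keeps the shape (3, 3). This is different from the matrix product `np.dot(a, b)`, which has shape (3, 1). A small sketch of the difference (the name `d` is an assumption for this example):

```python
import numpy as np

a = np.random.randn(3, 3)
b = np.random.randn(3, 1)

c = a * b                      # element-wise; b is broadcast across the columns
d = np.dot(a, b)               # matrix product

# Row i of c is row i of a scaled by the single value b[i, 0].
for i in range(3):
    assert np.allclose(c[i], a[i] * b[i, 0])

print(c.shape, d.shape)        # (3, 3) (3, 1)
```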

In [2]:
import numpy as np
A= np.random.randn(4,3)
B = np.sum(A, axis = 1, keepdims = True)

In [4]:
A.shape

(4, 3)

In [5]:
A

array([[-0.91665361, -0.26577029,  0.9095057 ],
       [-2.37451795, -0.76792386,  1.14053363],
       [-1.0487089 ,  1.09541977,  0.35821027],
       [ 0.09690215, -1.19163622, -0.08646674]])

In [6]:
B

array([[-0.2729182 ],
       [-2.00190818],
       [ 0.40492114],
       [-1.18120081]])

In [3]:
B.shape

(4, 1)
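
`keepdims=True` keeps the summed axis as a dimension of length 1, which is why `B` has shape (4, 1) rather than (4,). That shape broadcasts directly against `A`. The sketch below contrasts the two; `B_flat` and `centered` are names introduced only for this illustration.

```python
import numpy as np

A = np.random.randn(4, 3)

B = np.sum(A, axis=1, keepdims=True)   # shape (4, 1)
B_flat = np.sum(A, axis=1)             # shape (4,)

print(B.shape, B_flat.shape)           # (4, 1) (4,)

# (4, 3) - (4, 1) broadcasts: each row has its own sum subtracted.
centered = A - B
assert centered.shape == (4, 3)
assert np.allclose(centered[:, 0], A[:, 0] - B_flat)

# A - B_flat would raise a ValueError, because shapes (4, 3) and (4,)
# cannot be broadcast together.
```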