Hey, in this tutorial we will learn more about **NumPy**, a library for efficient numerical computation in Python. 🚀🚀🚀

[NumPy](https://numpy.org/doc/stable/index.html) is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

In [None]:
# numpy installation
!pip install numpy

In [None]:
# the classic alias for numpy is np
import numpy as np

# Arrays in numpy

The core functionality of NumPy is its "ndarray", for n-dimensional array, data structure. These arrays are strided views on memory. In contrast to Python's built-in list data structure, these arrays are homogeneously typed: all elements of a single array must be of the same type.
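
A minimal sketch of what homogeneous typing and strides look like in practice. The variable names are illustrative and not part of the tutorial. Mixing integers and floats in one array upcasts every element to a single dtype, and a transposed array is a view that reuses the same memory with swapped strides.

```python
import numpy as np

# Mixing ints and a float yields one common dtype for the whole array
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)  # float64

# shape, dtype and strides describe how the array maps onto memory
grid = np.arange(6, dtype=np.int64).reshape(2, 3)
print(grid.shape)    # (2, 3)
print(grid.strides)  # (24, 8): bytes to step to the next row / next column

# Transposing creates a view with swapped strides; no data is copied
grid_t = grid.T
print(grid_t.strides)                  # (8, 24)
print(np.shares_memory(grid, grid_t))  # True
```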

In [None]:
# classic 1d list in python
arr1d = [1, 2, 3, 4, 5]
arr1d

In [None]:
# 1d ndarray in numpy
nparr1d = np.array(arr1d)
nparr1d

In [None]:
# classic 2d list in python
arr2d = [
    [1, 2, 3, 4, 5],
    [10, 20, 30, 40, 50],
    [100, 200, 300, 400, 500]
]
arr2d

In [None]:
# 2d ndarray in numpy
nparr2d = np.array(arr2d)
nparr2d

In [None]:
# classic 3d list in python
arr_nd = [[
    [11, 21, 31, 41, 51],
    [110, 210, 310, 410, 510],
    [1100, 2100, 3100, 4100, 5100]
], [
    [12, 22, 32, 42, 52],
    [120, 220, 320, 420, 520],
    [1200, 2200, 3200, 4200, 5200]
]]
arr_nd

In [None]:
# nd ndarray in numpy
nparr_nd = np.array(arr_nd)
nparr_nd

In [None]:
# generate random array
random_arr = np.random.random((10, 10))
random_arr

# Basic operations on arrays

In [None]:
arr1 = np.array([[1, 1, 1],
                 [2, 2, 2],
                 [3, 3, 3]])
arr2 = np.array([[10, 20, 30],
                 [10, 20, 30],
                 [10, 20, 30]])

All classical arithmetic operations are performed element-wise, i.e. on all elements of the arrays at once.

In [None]:
# Vector addition
arr1 + arr2

In [None]:
# Vector subtraction
arr2 - arr1

In [None]:
# element-wise multiplication (each element of arr1 times the matching element of arr2)
arr1 * arr2

In [None]:
# matrix multiplication (rows of arr1 times columns of arr2)
arr1 @ arr2
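
The difference between `*` and `@` is easy to check by hand on a small example. The 2x2 matrices below are made up for illustration.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[10, 20],
              [30, 40]])

# Element-wise product: each entry is a[i, j] * b[i, j]
elementwise = a * b
print(elementwise)
# [[ 10  40]
#  [ 90 160]]

# Matrix product: each entry is a row of a dotted with a column of b
matmul = a @ b
print(matmul)
# [[ 70 100]
#  [150 220]]
```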

# Vector operations

Vectorized calculations are often tens to hundreds of times faster than the same calculations written as explicit Python loops. Let's demonstrate this with the example of matrix multiplication.

In [None]:
def my_matrix_mult(X, Y):
    # the result has as many rows as X and as many columns as Y
    result = np.zeros((X.shape[0], Y.shape[1]))
    # iterate through rows of X
    for i in range(len(X)):
        # iterate through columns of Y
        for j in range(len(Y[0])):
            # iterate through rows of Y
            for k in range(len(Y)):
                result[i, j] += X[i, k] * Y[k, j]
    return result

In [None]:
%timeit my_matrix_mult(np.random.random((10 ** 2, 10 ** 2)), np.random.random((10 ** 2, 10 ** 2)))

To make the difference in speed even clearer, the NumPy version is timed on arrays that are 10 times larger in each dimension.

In [None]:
%timeit np.random.random((10 ** 3, 10 ** 3)) @ np.random.random((10 ** 3, 10 ** 3))
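
For a like-for-like comparison, the sketch below times a triple loop and `@` on the same inputs and checks that both give the same result. The helper `loop_matmul` and the 60x60 size are assumptions chosen to keep the run short. Actual timings depend on the machine, so none are quoted here.

```python
import time
import numpy as np

def loop_matmul(X, Y):
    """Naive triple-loop matrix product (hypothetical helper for this sketch)."""
    n, m, p = X.shape[0], X.shape[1], Y.shape[1]
    result = np.zeros((n, p))
    for i in range(n):
        for j in range(p):
            for k in range(m):
                result[i, j] += X[i, k] * Y[k, j]
    return result

rng = np.random.default_rng(0)
A = rng.random((60, 60))
B = rng.random((60, 60))

start = time.perf_counter()
slow = loop_matmul(A, B)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = A @ B
numpy_seconds = time.perf_counter() - start

# Both approaches agree up to floating-point rounding
print(np.allclose(slow, fast))
print(f"loop: {loop_seconds:.4f} s, numpy: {numpy_seconds:.6f} s")
```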

# Solve some problems yourself! (1 point)

As practice, implement the following functions using NumPy tools.
To find suitable functions, read the [documentation](https://numpy.org/doc/stable/index.html).

In [None]:
# calculate the sum in each column of a two-dimensional table
def sum_by_columns(arr):
    result_arr = [0] * len(arr[0])
    for j in range(len(arr[0])):
        for i in range(len(arr)):
            result_arr[j] += arr[i][j]
    return result_arr

In [None]:
def np_sum_by_columns(arr):
    """write your own function using numpy"""
    ...

In [None]:
for _ in range(10):
    test_arr = np.random.random((np.random.randint(1, 100), np.random.randint(1, 100)))
    assert np.allclose(np_sum_by_columns(test_arr), sum_by_columns(test_arr))
print("Good job!")

In [None]:
# transpose the matrix
def transposition(arr):
    result_arr = np.zeros((len(arr[0]), len(arr)))
    for i in range(len(arr)):
        for j in range(len(arr[0])):
            result_arr[j][i] = arr[i][j]
    return result_arr

In [None]:
def np_transposition(arr):
    """write your own function using numpy"""
    ...

In [None]:
for _ in range(10):
    test_arr = np.random.random((np.random.randint(1, 100), np.random.randint(1, 100)))
    assert (transposition(test_arr) == np_transposition(test_arr)).all()
print("Nice!")

In [None]:
# calculating the arithmetic mean
def m_mean(arr):
    m = 0
    for i in arr:
        for j in i:
            m += j
    return m / arr.size

In [None]:
def np_mean(arr):
    """write your own function using numpy"""
    ...

In [None]:
for _ in range(10):
    test_arr = np.random.random((np.random.randint(1, 100), np.random.randint(1, 100)))
    assert abs(m_mean(test_arr) - np_mean(test_arr)) < 1e-5
print("Well done!")

In [None]:
# find unique items
def get_uniq(arr):
    uniq = set()
    for row in arr:
        for elem in row:
            uniq.add(elem)
    return sorted(list(uniq))

In [None]:
def np_get_uniq(arr):
    """write your own function using numpy"""
    ...

In [None]:
for _ in range(10):
    test_arr = np.random.randint(1, 50, (np.random.randint(1, 10), np.random.randint(1, 10)))
    assert all(get_uniq(test_arr) == np_get_uniq(test_arr))
print('Task completed')

# Complete the tasks (1 point)


Extract all the contiguous 3x3 blocks from a random 10x10 matrix

In [None]:
from numpy.lib import stride_tricks

Considering a 10x3 matrix, extract rows with unequal values (e.g. [2,2,3])

In [None]:
X = np.random.randint(0,5,(10,3))

Consider a large vector X, compute X to the power of 3 using 3 different methods

In [None]:
X = np.random.random(100)

# Make your own clustering algorithm using numpy (2 points)

Machine learning is one domain that can frequently take advantage of vectorization and broadcasting. Let’s say that you have the vertices of a triangle (each row is an x, y coordinate):


In [None]:
import matplotlib.pyplot as plt

In [None]:
tri = [[1, 1], [3, 1], [2, 3]]  # create an arbitrary triangle for demonstration
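
As a small illustration of broadcasting on data like this, the sketch below shifts every vertex by the same offset without a loop. The `offset` vector is a made-up value for demonstration.

```python
import numpy as np

tri_np = np.array([[1, 1], [3, 1], [2, 3]])  # same vertices as above, as an ndarray
offset = np.array([10, -1])                  # hypothetical (dx, dy) shift

# The shape-(2,) offset is broadcast across all three rows of the (3, 2) array
shifted = tri_np + offset
print(shifted)
# [[11  0]
#  [13  0]
#  [12  2]]
```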

In [None]:
def find_centroid(arr):
    result_arr = [0] * len(arr[0])
    for row in arr:
        for j, elem in enumerate(row):
            result_arr[j] += elem
    for i, elem in enumerate(result_arr):
        result_arr[i] /= len(arr)
    return result_arr

The centroid of this “cluster” is an (x, y) coordinate that is the arithmetic mean of each column:

In [None]:
centroid = find_centroid(tri)  # find the centroid

It’s helpful to visualize this:

In [None]:
trishape = plt.Polygon(tri, edgecolor='r', alpha=0.2, lw=5)
_, ax = plt.subplots(figsize=(4, 4))
ax.add_patch(trishape)
ax.set_ylim([.5, 3.5])
ax.set_xlim([.5, 3.5])
ax.scatter(*centroid, color='g', marker='D', s=70)
ax.scatter(*transposition(tri), color='b', s=70)

In [None]:
# the function of finding the distance between two points in the Euclidean metric
def euclidean_dist(point1, point2):
    return sum((a - b) ** 2 for a, b in zip(point1, point2)) ** 0.5

In [None]:
# The distance from each vertex of the triangle to the centroid
[euclidean_dist(vertex, centroid) for vertex in tri]
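
For a single pair of points, the same Euclidean distance can be obtained with `np.linalg.norm` applied to the difference of the two coordinate arrays. This is a minimal sketch with made-up points.

```python
import numpy as np

p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])

# Norm of the difference vector equals the Euclidean distance
dist = np.linalg.norm(p - q)
print(dist)  # 5.0
```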

Finally, let’s take this one step further: let’s say that you have a 2d array X and a 2d array of multiple (x, y) “proposed” centroids. Algorithms such as K-Means clustering work by randomly assigning initial “proposed” centroids, then reassigning each data point to its closest centroid. From there, new centroids are computed, with the algorithm converging on a solution once the re-generated labels (an encoding of the centroids) are unchanged between iterations. A part of this iterative process requires computing the Euclidean distance of each point from each centroid:

![picture](https://drive.google.com/uc?export=view&id=1s81eiR8xu_cHuz9JXaKrGO9XVS1N88BF/view?usp=sharing)

In [None]:
import random

In [None]:
# Let's create a set of points, which we will then cluster
X = [[3 + random.random() * 4, 3 + random.random() * 4] for _ in range(5)] + [
    [8 + random.random() * 4, 8 + random.random() * 4] for _ in range(5)]

In [None]:
# let's set the coordinates of the centroids
centroids = [[5, 5], [10, 10]]

In [None]:
# calculate the distance from each point to each centroid
distances = [[euclidean_dist(x, centroid) for x in X] for centroid in centroids]
distances

Next, we want the label (index number) of each closest centroid, finding the minimum distance on the 0th axis from the array above:

In [None]:
# let's determine which of the centroids each point is closer to
labels = [distances[0][i] > distances[1][i] for i in range(len(distances[0]))]
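
With the distances stored as an array of shape (number of centroids, number of points), `np.argmin` along axis 0 gives the index of the nearest centroid for every point. The distance table below is hand-written for illustration and is not the tutorial's data.

```python
import numpy as np

# Rows: centroids, columns: points (toy values)
toy_distances = np.array([[1.0, 0.5, 6.0, 7.0],
                          [6.5, 7.0, 0.8, 1.2]])

# Index of the smallest distance in each column
toy_labels = np.argmin(toy_distances, axis=0)
print(toy_labels)  # [0 0 1 1]
```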

Let’s inspect this visually, plotting both clusters and their assigned labels with a color mapping:

In [None]:
# Let's display the results
c1, c2 = ['#bc13fe', '#be0119']  # https://xkcd.com/color/rgb/
llim, ulim = 2, 13

_, ax = plt.subplots(figsize=(5, 5))
ax.scatter(*transposition(X), c=np.where(labels, c2, c1), alpha=0.4, s=80)
ax.scatter(*transposition(centroids), c=[c1, c2], marker='s', s=95,
           edgecolor='yellow')
ax.set_ylim([llim, ulim])
ax.set_xlim([llim, ulim])
ax.set_title('One K-Means Iteration: Predicted Classes')


*Plan*

* Generate a series of points, which will then be clustered

* Set the centroid points. In this exercise, they will be known in advance

* Calculate the distance from each point to each centroid

* For each point, select the centroid with the minimum distance

* Display points and centroids colored in the appropriate cluster colors


The implementation above does not use numpy. Following the plan, write your own version using numpy.

You can read more about k-means clustering [here](https://en.wikipedia.org/wiki/K-means_clustering).

Good luck! 🤡