# Welcome to NumPy - Numeric Computing Library in Python

In [1]:
print('Introduction to Numpy and python for data science in general')

Introduction to Numpy and python for data science in general


In [2]:
print('Numpy is a Python Numeric computing library')


Numpy is a Python Numeric computing library


## Install Numpy

In [3]:
!pip install numpy

Defaulting to user installation because normal site-packages is not writeable


In [4]:
import sys
import numpy as np

In [5]:
data = np.array([89, 33, 45, 90])
data
print(type(data))

<class 'numpy.ndarray'>


## NumPy arrays

The NumPy array - an n-dimensional data structure - is the central object of the NumPy package.
A one-dimensional NumPy array can be thought of as a vector, a two-dimensional array as a matrix (i.e., a set of vectors), 
and a three-dimensional array as a tensor (i.e., a set of matrices).
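As a minimal sketch of the vector / matrix / tensor picture above, the snippet below builds one array of each kind and prints its number of dimensions and shape. The array names and values are illustrative assumptions, not part of the dataset used later.

```python
import numpy as np

# Illustrative arrays (names and values are assumptions for this sketch)
vector = np.array([1, 2, 3])                              # 1-D: shape (3,)
matrix = np.array([[1, 2, 3], [4, 5, 6]])                 # 2-D: shape (2, 3)
tensor = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])   # 3-D: shape (2, 2, 2)

for name, arr in [("vector", vector), ("matrix", matrix), ("tensor", tensor)]:
    # ndim is the number of axes, shape is the length of each axis
    print(name, "ndim =", arr.ndim, "shape =", arr.shape)
```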

## Array data types
An array can consist of integers, floating-point numbers, or strings. Within an array, the data type must be consistent (e.g., all integers or all floats).
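The consistency rule can be seen by mixing types in one list: NumPy picks a single data type that can hold every element. The sketch below uses made-up values; it is also a plausible reason why an array built from CSV text rows ends up with a string data type.

```python
import numpy as np

ints = np.array([1, 2, 3])                     # all integers
mixed_numbers = np.array([1, 2.5, 3])          # integers and a float
mixed_with_text = np.array([1, 2.5, "three"])  # numbers and a string

print(ints.dtype.kind)             # 'i' (a signed integer type)
print(mixed_numbers.dtype)         # float64: the integers were converted to floats
print(mixed_with_text.dtype.kind)  # 'U': every element was converted to a string
print(mixed_with_text)
```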

In [6]:
# Array definition

np.array([
    [23, 45, 56],
    [90, 34, 78]
])

vector: np.ndarray = np.arange(1, 5)  # np.arange returns an ndarray, not a list
vector

zeros_array = np.zeros((1, 10))
zeros_array

ones_array = np.ones((2, 10))
ones_array

full_array = np.full((5, 10), 100)
full_array

array([[100, 100, 100, 100, 100, 100, 100, 100, 100, 100],
       [100, 100, 100, 100, 100, 100, 100, 100, 100, 100],
       [100, 100, 100, 100, 100, 100, 100, 100, 100, 100],
       [100, 100, 100, 100, 100, 100, 100, 100, 100, 100],
       [100, 100, 100, 100, 100, 100, 100, 100, 100, 100]])

In [7]:
# Shape

shape = ones_array.shape
shape
full_array.dtype.type
my_array = np.array([90, 78, 23, 34, 56], dtype='int8')
my_array
my_array.dtype.type

numpy.int8

In [8]:
# Access the elements from the array
my_array[0]
my_array[:2]
my_array[2:4]
my_array[-1]

56

In [9]:
# Load data
import csv

data = []
with open('Datasets/MER_T07_02A.csv', 'r', newline='') as csv_file:  # newline='' is recommended by the csv module
    file_reader = csv.reader(csv_file, delimiter=',')
    for row in file_reader:
        data.append(row)
data = np.array(data)
data
data.shape
data.dtype.type

np.save('Datasets/my_data.npy', data)  # passing the path lets NumPy open and close the file itself
print('data saved as binary')

data saved as binary


In [10]:
data[:8]

array([['MSN', 'YYYYMM', 'Value', 'Column_Order', 'Description', 'Unit'],
       ['CLETPUS', '194913', '135451.32', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours'],
       ['CLETPUS', '195013', '154519.994', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours'],
       ['CLETPUS', '195113', '185203.657', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours'],
       ['CLETPUS', '195213', '195436.666', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours'],
       ['CLETPUS', '195313', '218846.325', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours'],
       ['CLETPUS', '195413', '239145.966', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours'],
       ['CLETPUS', '195513', '301362.698', '1',
        'Electricity 

In [11]:
data[:, 2:4]

array([['Value', 'Column_Order'],
       ['135451.32', '1'],
       ['154519.994', '1'],
       ...,
       ['4243136.159', '13'],
       ['347437.124', '13'],
       ['310200.688', '13']], dtype='<U80')

In [12]:
subset = data[1:6, [2, 3]]
subset

array([['135451.32', '1'],
       ['154519.994', '1'],
       ['185203.657', '1'],
       ['195436.666', '1'],
       ['218846.325', '1']], dtype='<U80')

A mask array, also known as a logical array, contains boolean elements (i.e. True or False).

In [13]:
mask_array = np.array([False, True, False, True, True])
subset[mask_array]

array([['154519.994', '1'],
       ['195436.666', '1'],
       ['218846.325', '1']], dtype='<U80')

In [14]:
new_mask_array = np.array([True, False, False, False, True])
subset[new_mask_array]

array([['135451.32', '1'],
       ['218846.325', '1']], dtype='<U80')

The mask array retained the rows corresponding to True and excluded the ones corresponding to False. It is worth noting that a similar approach is used for indexing pandas DataFrames.

Masking is a powerful tool that allows us to index elements based on logical expressions. 
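A mask does not have to be typed out by hand; a comparison on an array produces one directly. The sketch below uses hypothetical numeric values, loosely modelled on the `Value` column shown earlier, since the loaded `data` array holds strings.

```python
import numpy as np

# Hypothetical numeric values (an assumption for this sketch)
values = np.array([135451.32, 154519.994, 185203.657, 195436.666, 218846.325])

mask = values > 180000    # element-wise comparison yields a boolean array
print(mask)               # [False False  True  True  True]
print(values[mask])       # only the elements where the mask is True

# Conditions are combined with & (and) or | (or); the parentheses are required
in_range = (values > 150000) & (values < 200000)
print(values[in_range])
```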

## Concatenating

In [15]:
start_3_rows_array = data[:3, :]
start_3_rows_array

array([['MSN', 'YYYYMM', 'Value', 'Column_Order', 'Description', 'Unit'],
       ['CLETPUS', '194913', '135451.32', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours'],
       ['CLETPUS', '195013', '154519.994', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours']], dtype='<U80')

In [16]:
end_3_rows_array = data[-3:, :]
end_3_rows_array

array([['ELETPUS', '202213', '4243136.159', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202301', '347437.124', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202302', '310200.688', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours']], dtype='<U80')

To concatenate these arrays we can use `np.vstack()`, where the v denotes vertical, or row-wise, stacking of the sub-arrays. The horizontal counterpart of `np.vstack()` is `np.hstack()`, which combines sub-arrays column-wise. For higher-dimensional joins, the most common function is `np.concatenate()`. Its syntax is similar to the 2D versions, with an additional `axis` argument (default 0) that specifies the axis along which concatenation is performed.

For 2D arrays such as these, calling `np.concatenate((start_3_rows_array, end_3_rows_array), axis=0)` generates the same output as `np.vstack()`, and `axis=1` generates the same output as `np.hstack()`.
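For arrays with more than two dimensions, the `axis` argument decides which dimension grows. The following is a small sketch with made-up 3D arrays; only the shapes matter here.

```python
import numpy as np

# Two illustrative 3-D arrays of identical shape (2, 3, 4)
a = np.zeros((2, 3, 4))
b = np.ones((2, 3, 4))

print(np.concatenate((a, b), axis=0).shape)  # (4, 3, 4)
print(np.concatenate((a, b), axis=1).shape)  # (2, 6, 4)
print(np.concatenate((a, b), axis=2).shape)  # (2, 3, 8)

# For 2-D inputs, axis=0 and axis=1 match vstack and hstack respectively
m1 = np.arange(6).reshape(2, 3)
m2 = np.arange(6, 12).reshape(2, 3)
same_as_vstack = np.array_equal(np.concatenate((m1, m2), axis=0), np.vstack((m1, m2)))
same_as_hstack = np.array_equal(np.concatenate((m1, m2), axis=1), np.hstack((m1, m2)))
print(same_as_vstack, same_as_hstack)        # True True
```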

In [17]:
joined = np.vstack((start_3_rows_array, end_3_rows_array))
joined

array([['MSN', 'YYYYMM', 'Value', 'Column_Order', 'Description', 'Unit'],
       ['CLETPUS', '194913', '135451.32', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours'],
       ['CLETPUS', '195013', '154519.994', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202213', '4243136.159', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202301', '347437.124', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202302', '310200.688', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours']], dtype='<U80')

In [18]:
joined_columnwise = np.hstack((start_3_rows_array, end_3_rows_array))
joined_columnwise

array([['MSN', 'YYYYMM', 'Value', 'Column_Order', 'Description', 'Unit',
        'ELETPUS', '202213', '4243136.159', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['CLETPUS', '194913', '135451.32', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours', 'ELETPUS', '202301', '347437.124', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['CLETPUS', '195013', '154519.994', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours', 'ELETPUS', '202302', '310200.688', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours']], dtype='<U80')

In [19]:
concat_vertical = np.concatenate((start_3_rows_array, end_3_rows_array), axis=0)
concat_vertical

array([['MSN', 'YYYYMM', 'Value', 'Column_Order', 'Description', 'Unit'],
       ['CLETPUS', '194913', '135451.32', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours'],
       ['CLETPUS', '195013', '154519.994', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202213', '4243136.159', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202301', '347437.124', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202302', '310200.688', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours']], dtype='<U80')

In [20]:
concat_horizontal = np.concatenate((start_3_rows_array, end_3_rows_array), axis=1)
concat_horizontal

array([['MSN', 'YYYYMM', 'Value', 'Column_Order', 'Description', 'Unit',
        'ELETPUS', '202213', '4243136.159', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['CLETPUS', '194913', '135451.32', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours', 'ELETPUS', '202301', '347437.124', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['CLETPUS', '195013', '154519.994', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours', 'ELETPUS', '202302', '310200.688', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours']], dtype='<U80')

## Splitting

The opposite of concatenating (i.e., joining) arrays is splitting them. To split an array, NumPy provides the following commands:

hsplit: splits an array horizontally, i.e. column-wise (along axis 1)
vsplit: splits an array vertically, i.e. row-wise (along axis 0)
dsplit: splits an array along the 3rd axis (depth)
array_split: splits along an axis of your choice and allows sections of unequal size
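The four functions listed above can be tried on small numeric arrays, where the resulting shapes are easy to check. The arrays `grid` and `cube` below are assumptions made for this sketch; `dsplit` needs at least three dimensions, which is why it gets its own array.

```python
import numpy as np

grid = np.arange(24).reshape(4, 6)       # 4 rows, 6 columns

# Indices (2, 3) cut the columns into [0:2], [2:3] and [3:6]
left, middle, right = np.hsplit(grid, (2, 3))
print(left.shape, middle.shape, right.shape)   # (4, 2) (4, 1) (4, 3)

# An integer asks for that many equal sections
top, bottom = np.vsplit(grid, 2)
print(top.shape, bottom.shape)                 # (2, 6) (2, 6)

# array_split tolerates a count that does not divide the axis evenly
parts = np.array_split(grid, 4, axis=1)
print([p.shape for p in parts])                # [(4, 2), (4, 2), (4, 1), (4, 1)]

cube = np.arange(24).reshape(2, 3, 4)    # a 3-D array for dsplit
front, back = np.dsplit(cube, 2)
print(front.shape, back.shape)                 # (2, 3, 2) (2, 3, 2)
```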

In [21]:
h_split = np.hsplit(data, (2, 3))
h_split
v_split = np.vsplit(data, (1, 4))
v_split
# dsplit needs an array with at least 3 dimensions, so it does not apply to the 2-D `data`
# d_split = np.dsplit(data, (2, 5))
# d_split
split_array = np.array_split(data, (1, 6), axis=1)
split_array

[array([['MSN'],
        ['CLETPUS'],
        ['CLETPUS'],
        ...,
        ['ELETPUS'],
        ['ELETPUS'],
        ['ELETPUS']], dtype='<U80'),
 array([['YYYYMM', 'Value', 'Column_Order', 'Description', 'Unit'],
        ['194913', '135451.32', '1',
         'Electricity Net Generation From Coal, All Sectors',
         'Million Kilowatthours'],
        ['195013', '154519.994', '1',
         'Electricity Net Generation From Coal, All Sectors',
         'Million Kilowatthours'],
        ...,
        ['202213', '4243136.159', '13',
         'Electricity Net Generation Total (including from sources not shown), All Sectors',
         'Million Kilowatthours'],
        ['202301', '347437.124', '13',
         'Electricity Net Generation Total (including from sources not shown), All Sectors',
         'Million Kilowatthours'],
        ['202302', '310200.688', '13',
         'Electricity Net Generation Total (including from sources not shown), All Sectors',
         'Million Kilowatthours'

## Adding/Removing Elements

NumPy provides several functions for adding or deleting data from an array:

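resize: Returns a new array with the specified shape. `np.resize()` fills any new cells with repeated copies of the original data, whereas the in-place method `ndarray.resize()` fills them with zeros.
append: Adds values to the end of an array
insert: Adds values at a given index of an array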
delete: Returns a new array with given data removed
unique: Finds only the unique values of an array
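The functions above return new arrays, and most of them flatten their input unless an `axis` is given. The sketch below shows both behaviours on a small made-up array called `table`, along with the difference between `np.resize()` and the `ndarray.resize()` method.

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])

flat = np.append(table, [7, 8, 9])                  # no axis: result is flattened
with_row = np.append(table, [[7, 8, 9]], axis=0)    # axis=0: adds a third row
inserted = np.insert(table, 1, [0, 0, 0], axis=0)   # new row placed at index 1
without_col = np.delete(table, 1, axis=1)           # removes the middle column
distinct = np.unique(np.array([3, 1, 3, 2, 1]))     # sorted unique values

repeated = np.resize(table, (3, 3))      # new cells are filled by repeating the data
padded = table.copy()
padded.resize((3, 3), refcheck=False)    # in-place method: new cells are zeros

print(flat.shape, with_row.shape)        # (9,) (3, 3)
print(inserted)
print(without_col)
print(distinct)                          # [1 2 3]
print(repeated)
print(padded)
```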


In [22]:
resized = np.resize(data, (10, 7090))
resized

# Without an `axis` argument, these functions work on a flattened copy of `subset`
np.append(subset, 'duke lester')
np.insert(subset, 2, 'Lester dlester is here')
np.delete(subset, 2)
np.unique(subset)

array(['1', '135451.32', '154519.994', '185203.657', '195436.666',
       '218846.325'], dtype='<U80')

## Sorting

There are several useful functions for sorting array elements. 
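The available sorting algorithms include quicksort, heapsort, mergesort, and timsort. In `np.sort()` the algorithm is chosen with the `kind` argument, which accepts `'quicksort'`, `'heapsort'`, `'mergesort'`, and `'stable'`.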
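Besides the algorithm, the `axis` argument controls the direction of the sort. The sketch below uses a small made-up `scores` array to contrast sorting within rows and within columns, and adds `np.argsort()`, which returns the ordering indices rather than the sorted values.

```python
import numpy as np

scores = np.array([[34, 89, 12],
                   [90, 34, 67]])

by_row = np.sort(scores, axis=1)                 # sorts the values inside each row
by_col = np.sort(scores, axis=0)                 # sorts the values inside each column
stable = np.sort(scores, axis=1, kind="stable")  # same result, stable algorithm
order = np.argsort(scores[0])                    # indices that would sort the first row

print(by_row)   # [[12 34 89] [34 67 90]]
print(by_col)   # [[34 34 12] [90 89 67]]
print(order)    # [2 0 1]
```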

In [23]:
marks = np.array([[34, 89, 12, 70, 35, 80], [90, 34, 67, 23, 98, 36]])
column_sort = np.sort(marks, axis=1, kind='mergesort') # axis=1 sorts the values within each row
column_sort
np.sort(marks, axis=1, kind='heapsort')
np.sort(marks, axis=1, kind='quicksort')

array([[12, 34, 35, 70, 80, 89],
       [23, 34, 36, 67, 90, 98]])

In [24]:
# copy

my_data = np.copy(data)
my_data

array([['MSN', 'YYYYMM', 'Value', 'Column_Order', 'Description', 'Unit'],
       ['CLETPUS', '194913', '135451.32', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours'],
       ['CLETPUS', '195013', '154519.994', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours'],
       ...,
       ['ELETPUS', '202213', '4243136.159', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202301', '347437.124', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202302', '310200.688', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours']], dtype='<U80')

## Array Broadcasting 

Broadcasting lets NumPy apply arithmetic between an array and a scalar, or between arrays of compatible shapes, element by element (addition, subtraction, division and multiplication).
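Broadcasting also works between arrays of different shapes, as long as each pair of trailing dimensions is equal or one of them is 1. The sketch below is a small illustration with made-up values; the names `bonus_per_subject` and `bonus_per_student` are assumptions.

```python
import numpy as np

marks = np.array([[34, 89, 12],
                  [90, 34, 67]])              # shape (2, 3)

bonus_per_subject = np.array([1, 2, 3])       # shape (3,): reused for every row
bonus_per_student = np.array([[10], [20]])    # shape (2, 1): reused for every column

by_subject = marks + bonus_per_subject
by_student = marks + bonus_per_student
print(by_subject)   # [[35 91 15] [91 36 70]]
print(by_student)   # [[ 44  99  22] [110  54  87]]

# Shapes (2, 3) and (2,) are not compatible, so NumPy raises a ValueError
try:
    marks + np.array([1, 2])
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print("Incompatible shapes:", err)
```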


In [25]:
# Broadcasting => array operations and calculations (scalar addition, subtraction, division and multiplication, element-wise products)

marks + 30 - 10
marks * 3
marks / 2
marks + marks
marks * marks

array([[1156, 7921,  144, 4900, 1225, 6400],
       [8100, 1156, 4489,  529, 9604, 1296]])

## Vectorization
Vectorization is the process of modifying code to utilize array operation methods. Array operations can be computed internally by NumPy using a lower-level language, which leads to many benefits:

1. Vectorized code tends to execute much faster than equivalent code that uses explicit loops (such as for-loops and while-loops), often by an order of magnitude or more. Vectorization is therefore very important for machine learning, where we often work with large datasets.
2. Vectorized code is often more compact. Having fewer lines of code to write can speed up the code-writing process, make code more readable, and reduce the risk of errors.
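The `%timeit` magic is only available inside IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library `timeit` module can be used instead. The array sizes, repetition count, and function names below are arbitrary assumptions, and the measured times will differ from machine to machine.

```python
import timeit

import numpy as np

a = np.arange(1, 10_001)
b = np.arange(10_001, 20_001)

def loop_product(x, y):
    # Element-by-element multiplication with an explicit Python loop
    return [x[i] * y[i] for i in range(len(x))]

def vectorized_product(x, y):
    # The same computation as a single array operation
    return x * y

# Both approaches should give identical values
results_match = np.array_equal(loop_product(a, b), vectorized_product(a, b))

loop_seconds = timeit.timeit(lambda: loop_product(a, b), number=20)
vectorized_seconds = timeit.timeit(lambda: vectorized_product(a, b), number=20)

print("Results match:", results_match)
print(f"Loop version:       {loop_seconds:.4f} s for 20 runs")
print(f"Vectorized version: {vectorized_seconds:.4f} s for 20 runs")
```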

In [26]:
# Example
arr1 = np.arange(1, 51)
arr2 = np.arange(51, 101)
print(f'my first array is {arr1}')
print(f'my second array is {arr2}')

my first array is [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 49 50]
my second array is [ 51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86
  87  88  89  90  91  92  93  94  95  96  97  98  99 100]


In [27]:
def non_vectorized_output(array1, array2):
    output = []
    for i in range(len(array1)):
        output.append(array1[i] * array2[i])
    return output


In [28]:
print(non_vectorized_output(arr1, arr2))

[51, 104, 159, 216, 275, 336, 399, 464, 531, 600, 671, 744, 819, 896, 975, 1056, 1139, 1224, 1311, 1400, 1491, 1584, 1679, 1776, 1875, 1976, 2079, 2184, 2291, 2400, 2511, 2624, 2739, 2856, 2975, 3096, 3219, 3344, 3471, 3600, 3731, 3864, 3999, 4136, 4275, 4416, 4559, 4704, 4851, 5000]


In [29]:
nv_time = %timeit -o  non_vectorized_output(arr1, arr2)
nv_time

15.4 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


<TimeitResult : 15.4 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)>

In [30]:
def vectorized_output(array1, array2):
    return array1 * array2

In [31]:
print(vectorized_output(arr1, arr2))

[  51  104  159  216  275  336  399  464  531  600  671  744  819  896
  975 1056 1139 1224 1311 1400 1491 1584 1679 1776 1875 1976 2079 2184
 2291 2400 2511 2624 2739 2856 2975 3096 3219 3344 3471 3600 3731 3864
 3999 4136 4275 4416 4559 4704 4851 5000]


In [32]:
vectorized_time = %timeit -o vectorized_output(arr1, arr2)
vectorized_time

742 ns ± 1.68 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


<TimeitResult : 742 ns ± 1.68 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)>

In [33]:
print('Non-vectorized version:', f'{1E6 * nv_time.average:0.2f}', 'microseconds per execution, average')

print('Vectorized version:', f'{1E6 * vectorized_time.average:0.2f}', 'microseconds per execution, average')

print('Computation was', "%.0f" % (nv_time.average / vectorized_time.average), 'times faster using vectorization')

Non-vectorized version: 15.44 microseconds per execution, average
Vectorized version: 0.74 microseconds per execution, average
Computation was 21 times faster using vectorization


In [44]:
my_data_matrix = np.arange(1, 41).reshape(10, 4)
print(my_data_matrix)
ones = np.full((10, 1), 1)  # The bias term
updated_data_matrix = np.hstack((ones, my_data_matrix)) # horizontally stacked in front of the array
print(updated_data_matrix)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]
 [13 14 15 16]
 [17 18 19 20]
 [21 22 23 24]
 [25 26 27 28]
 [29 30 31 32]
 [33 34 35 36]
 [37 38 39 40]]
[[ 1  1  2  3  4]
 [ 1  5  6  7  8]
 [ 1  9 10 11 12]
 [ 1 13 14 15 16]
 [ 1 17 18 19 20]
 [ 1 21 22 23 24]
 [ 1 25 26 27 28]
 [ 1 29 30 31 32]
 [ 1 33 34 35 36]
 [ 1 37 38 39 40]]


In [42]:
theta = np.arange(1, 6).reshape(5, 1) # Vector
print(f' My theta is {theta}')

 My theta is [[1]
 [2]
 [3]
 [4]
 [5]]


In [48]:
def non_vectorized_output(data, theta):
    h = []
    for x in range(data.shape[0]):
        total = 0
        for y in range(data.shape[1]):
            total = total + data[x, y] * theta[y, 0]
        h.append(total)
    return h

In [53]:
result = non_vectorized_output(updated_data_matrix, theta)
print(f'My new weights {result}')

My new weights [41, 97, 153, 209, 265, 321, 377, 433, 489, 545]


In [50]:
#Vectorized version

def vectorized_output(data, theta):
    result = np.matmul(data, theta)  # NumPy's matrix multiplication function
    return result

In [51]:
my_result = vectorized_output(updated_data_matrix, theta)
print(f'My vectorized result is {my_result}')

My vectorized result is [[ 41]
 [ 97]
 [153]
 [209]
 [265]
 [321]
 [377]
 [433]
 [489]
 [545]]


In [52]:
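# Time the loop version on the numeric matrix. Passing `data` here fails, because
# `data` still refers to the string array loaded from the CSV file:
# UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U3'), dtype('int64')) -> None
non_vectorized_time = %timeit -o non_vectorized_output(updated_data_matrix, theta)
non_vectorized_time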