# NumPy Basics: Arrays and Vectorized Computation

### Learning Objectives

- Learn about NumPy, a package for numerical computing in Python
- Use NumPy for array-based data: operations, algorithms



## NumPy: Array-based operations

* NumPy stands for Numerical Python and is pronounced /ˈnʌmpaɪ/. It is a Python library for numerical computation.

* NumPy is fast because its core array routines are implemented in C.

* NumPy is built around linear algebra: it represents vectors and matrices as arrays and performs mathematical calculations on them.





The key concept in NumPy is the NumPy array data type.






![Image of an array of length 10, with the first index, the 8th element, and the indices labeled](https://media.geeksforgeeks.org/wp-content/uploads/CommonArticleDesign1-min.png)




A NumPy array may have one or more dimensions:

- One-dimensional arrays (1D) represent vectors.
- Two-dimensional arrays (2D) represent matrices.
- Higher-dimensional arrays represent tensors.

![image.png](attachment:image.png)

In [None]:
import numpy as np

a = np.array([7,2,9,10])

In [None]:
a.shape

> Unlike the built-in list type, which can hold elements of different types, a NumPy array allows only one data type for all of its elements. We therefore say that a NumPy array requires homogeneous data values.

> For example, a NumPy array can store integers or floating-point numbers, but not a mix of both; if you pass mixed values, NumPy converts them to a single common type. This restriction lets NumPy store the data compactly and speed up its calculations.
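To make the homogeneity rule concrete, here is a minimal sketch (the variable names are illustrative and not part of the original notebook). It shows that a list keeps each element's own type, while `np.array` ends up with a single dtype for mixed integer and float input.

```python
import numpy as np

# A Python list keeps each element's own type.
mixed_list = [1, 2.5, 3]
list_types = [type(v).__name__ for v in mixed_list]
print(list_types)       # ['int', 'float', 'int']

# np.array stores one dtype for all elements, so the integers become floats.
mixed_arr = np.array(mixed_list)
print(mixed_arr.dtype)  # float64
```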

![image.png](attachment:image.png)

In [None]:
a = [1,2,3]
type(a)

In [None]:
a = np.array([1,2,3])
type(a)

In [None]:
import numpy as np
arr = np.array([[1.5, -0.1, 3], [0, -3, 6.5]])

> Understanding NumPy arrays and array-oriented computing will help you use tools with array computing semantics, like pandas and PyTorch, much more effectively.


> While NumPy provides a computational foundation for general numerical data processing, many readers will want to use pandas as the basis for most kinds of statistics or analytics, especially on tabular data. Also, pandas provides some more domain-specific functionality like time series manipulation, which is not present in NumPy.



![image.png](attachment:image.png)

### NumPy vs pandas



- NumPy is a package for scientific computing that provides a powerful N-dimensional array object.
- pandas is a package for data manipulation and analysis, designed to work with tabular and heterogeneous data.

- NumPy is a low-level library, while pandas is a high-level library built on top of NumPy.
- NumPy is a dependency of pandas: you can't use pandas without NumPy.
- NumPy is better suited to homogeneous numerical data and matrix operations, while pandas is better suited to tabular data and data analysis.
- NumPy is generally faster than pandas for raw array computations.
- NumPy focuses on basic mathematical operations on arrays, while pandas provides more built-in functions for data manipulation.


## Why NumPy is important

One of the reasons NumPy is so important for numerical computations in Python is that it is designed for efficiency on large arrays of data. There are a number of reasons for this:

* NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. NumPy's library of algorithms written in the C language can operate on this memory without any type checking or other overhead. NumPy arrays also use much less memory than built-in Python sequences.

* NumPy operations perform complex computations on entire arrays without the need for Python for loops, which can be slow for large sequences. NumPy is faster than regular Python code because its C-based algorithms avoid overhead present with regular interpreted Python code.
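The memory claim above can be checked roughly with the sketch below. It is only an approximation: `sys.getsizeof` measures the list object plus each integer object it references, and the names `array_bytes` and `list_bytes` are illustrative. The dtype is set explicitly so the array size does not depend on the platform default.

```python
import sys
import numpy as np

n = 100_000
arr = np.arange(n, dtype=np.int64)  # explicit dtype: 8 bytes per element
lst = list(range(n))

# Bytes used by the array's contiguous data buffer.
array_bytes = arr.nbytes

# Rough size of the list: the list object plus every int object it points to.
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(v) for v in lst)

print(array_bytes)  # 800000
print(list_bytes)   # larger; the exact value depends on the Python build
```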



![image.png](attachment:image.png)

### Performance comparison: NumPy and Python
Let us compare the performance of a NumPy array of one million integers with the equivalent Python list:

In [None]:
import numpy as np

my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))

Now let's multiply each sequence by 2:



In [None]:
%timeit my_arr2 = my_arr * 2

In [None]:
%timeit my_list2 = [x * 2 for x in my_list]

> NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python counterparts and use significantly less memory.
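The `%timeit` magic only works inside IPython or Jupyter. As a hedged alternative for a plain Python script, the same comparison can be sketched with the standard-library `timeit` module; the measured times depend on your machine, so no particular speedup is assumed here.

```python
import timeit
import numpy as np

my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))

# Total seconds for 10 repetitions of each approach.
numpy_seconds = timeit.timeit(lambda: my_arr * 2, number=10)
list_seconds = timeit.timeit(lambda: [x * 2 for x in my_list], number=10)

print(f"NumPy: {numpy_seconds:.4f} s, list comprehension: {list_seconds:.4f} s")
```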



## The NumPy ndarray: A Multidimensional Array Object

- One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. 

- Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements.



In [46]:
import numpy as np

In [47]:
a = np.array([1,2,3])

In [67]:
import numpy as np

data = np.array([[1.5, -0.1, 3], 
                  [0, -3, 6.5]])

In [68]:
data

array([[ 1.5, -0.1,  3. ],
       [ 0. , -3. ,  6.5]])

In [69]:
data.shape

(2, 3)

Let us perform mathematical operations such as multiplication and addition:

In [38]:
data * 10

array([[ 15.,  -1.,  30.],
       [  0., -30.,  65.]])

In [39]:
data + data

array([[ 3. , -0.2,  6. ],
       [ 0. , -6. , 13. ]])

> In the first example, all of the elements have been multiplied by 10. In the second, the corresponding values in each "cell" in the array have been added to each other.



> Note: In this chapter and throughout the book, I use the standard NumPy convention of always using import numpy as np. It would be possible to put from numpy import * in your code to avoid having to write np., but I advise against making a habit of this. The numpy namespace is large and contains a number of functions whose names conflict with built-in Python functions (like min and max). Following standard conventions like these is almost always a good idea.



So, what is ndarray?


> An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array:



In [52]:
data

array([[ 1.5, -0.1,  3. ],
       [ 0. , -3. ,  6.5]])

In [62]:
data.shape

(2, 3)

In [73]:
data.ndim

2

In [51]:
a = [1,2,3]
type(a)

list

In [53]:
type(data)

numpy.ndarray

In [74]:
data.dtype # attributes

dtype('float64')

In [75]:
data.size

6

## Creating ndarrays

> The easiest way to create an array is to use the `numpy.array` function, which accepts any sequence-like object:

In [55]:
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1

array([6. , 7.5, 8. , 0. , 1. ])

> Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array:



In [58]:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

In [59]:
#Since data2 was a list of lists, the NumPy array arr2 has two dimensions, 
#with shape inferred from the data. We can confirm this by inspecting 
#the ndim and shape attributes:
arr2.ndim

2

In [61]:
a = np.array([1,2,3])
a.ndim

1

In [None]:
arr2.shape

Unless explicitly specified, numpy.array tries to infer a good data type for the array that it creates. The data type is stored in a special dtype metadata object; for example, in the previous two examples we have:

In [None]:
arr1.dtype

In [None]:
arr2.dtype

Other functions for creating arrays

#### Special array creation

* `numpy.zeros` creates an array of zeros with a given length or shape
* `numpy.ones` creates an array of ones with a given length or shape
* `numpy.empty` creates an array without initialized values
* `numpy.arange` creates a range
* Pass a tuple for the shape to create a higher dimensional array
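As a brief sketch of the creation functions listed above (variable names are illustrative), the example below also shows the `start, stop, step` form of `numpy.arange` and an explicit `dtype`.

```python
import numpy as np

ones_1d = np.ones(4)                          # length 4, float64 by default
ones_2d = np.ones((2, 3))                     # shape passed as a tuple
evens = np.arange(0, 10, 2)                   # start, stop (exclusive), step
zeros_int = np.zeros((2, 2), dtype=np.int32)  # dtype set explicitly

print(ones_2d.shape)  # (2, 3)
print(evens)          # [0 2 4 6 8]
```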

In [76]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [81]:
np.zeros((3, 6))

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

In [82]:
np.empty((2, 3, 2))
# It’s not safe to assume that numpy.empty will return an array of all zeros.
# This function returns uninitialized memory and thus may contain nonzero "garbage" values.
# You should use this function only if you intend to populate the new array with data.

array([[[0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.]]])

`numpy.arange` is an array-valued version of the built-in Python range function:



In [86]:
np.arange(15)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

![image.png](attachment:image.png)

## Data Types for ndarrays

The data type or dtype is a special object containing the information (or metadata, data about data) the ndarray needs to interpret a chunk of memory as a particular type of data:





* Unless explicitly specified, `numpy.array` tries to infer a good data type for the arrays it creates.
* The data type is stored in a special `dtype` metadata object.
* The data type can be set explicitly or converted (cast).
* It is important to care about the general kind of data you’re dealing with.

In [89]:
arr1 = np.array([1, 2, 3], dtype=np.float64)

In [90]:
arr1.dtype

dtype('float64')

In [91]:
arr2 = np.array([1, 2, 3], dtype=np.int32)

In [92]:
arr2.dtype

dtype('int32')

> When you need more control over how data is stored in memory and on disk, especially large datasets, it is good to know that you have control over the storage type.
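To illustrate how the storage type affects memory use, here is a small sketch using the `itemsize` (bytes per element) and `nbytes` (total bytes in the data buffer) attributes. The array names are illustrative.

```python
import numpy as np

values = np.arange(1_000)

as_float64 = values.astype(np.float64)
as_float32 = values.astype(np.float32)
as_int8 = np.zeros(1_000, dtype=np.int8)

print(as_float64.itemsize, as_float64.nbytes)  # 8 8000
print(as_float32.itemsize, as_float32.nbytes)  # 4 4000
print(as_int8.itemsize, as_int8.nbytes)        # 1 1000
```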

![image.png](attachment:image.png)

#### Conversion of ndarray

You can explicitly convert or cast an array from one data type to another using ndarray’s astype method:



In [93]:
arr = np.array([1, 2, 3, 4, 5])
arr.dtype

dtype('int64')

In [94]:
float_arr = arr.astype(np.float64)
float_arr

array([1., 2., 3., 4., 5.])

In [95]:
float_arr.dtype

dtype('float64')

> In this example, integers were cast to floating point. If I cast some floating-point numbers to be of integer data type, the decimal part will be truncated:

In [96]:
arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
arr

array([ 3.7, -1.2, -2.6,  0.5, 12.9, 10.1])

In [97]:
arr.astype(np.int32)

array([ 3, -1, -2,  0, 12, 10], dtype=int32)

> If you have an array of strings representing numbers, you can use astype to convert them to numeric form:



In [98]:
numeric_strings = np.array(["1.25", "-9.6", "42"], dtype=np.bytes_)

In [99]:
numeric_strings.astype(float)

# Be cautious when using the numpy.bytes_ type
# (numpy.string_ was an alias for it and was removed in NumPy 2.0),
# as string data in NumPy is fixed size and may truncate input without warning.
# pandas has more intuitive out-of-the-box behavior on non-numeric data.

array([ 1.25, -9.6 , 42.  ])
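The truncation warning can be seen with a minimal sketch. It uses a Unicode string array rather than a bytes array, and the variable name is illustrative; the fixed width is inferred from the longest string present when the array is created.

```python
import numpy as np

names = np.array(["abc", "de"])  # inferred as a 3-character Unicode dtype
names[1] = "longer"              # silently truncated to fit the fixed width

print(names.dtype)
print(names[1])  # lon
```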

You can also use another array’s dtype attribute:



In [104]:
int_array = np.arange(10)
int_array

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [101]:
int_array.dtype

dtype('int64')

In [102]:
calibers = np.array([.22, .270, .357, .380, .44, .50], dtype=np.float64)

In [103]:
int_array.astype(calibers.dtype)

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

> There are shorthand type code strings you can also use to refer to a dtype:



In [None]:
zeros_uint32 = np.zeros(8, dtype="u4")
zeros_uint32

> Calling astype always creates a new array (a copy of the data), even if the new data type is the same as the old data type.
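A quick way to confirm the copy behaviour, sketched with `numpy.shares_memory` (variable names are illustrative):

```python
import numpy as np

arr = np.array([1.0, 2.0, 3.0])
same_type = arr.astype(arr.dtype)  # same dtype, but still a new array

same_type[0] = 99.0                # modify the copy only

print(np.shares_memory(arr, same_type))  # False
print(arr[0])                            # 1.0
```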



## Arithmetic with NumPy Arrays (Vectorization)


- Batch operations on data without `for` loops

- NumPy users call this **vectorization**. 





In [105]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr

array([[1., 2., 3.],
       [4., 5., 6.]])

In [106]:
arr * arr

array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

In [107]:
arr - arr

array([[0., 0., 0.],
       [0., 0., 0.]])

> Arithmetic operations with scalars propagate the scalar argument to each element in the array:



In [108]:
1 / arr

array([[1.        , 0.5       , 0.33333333],
       [0.25      , 0.2       , 0.16666667]])

### Arithmetic operations with scalars
A scalar operand, such as an exponent, is applied to each element in the array:


In [109]:
arr

array([[1., 2., 3.],
       [4., 5., 6.]])

In [110]:
arr ** 2

array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

Comparisons between arrays of the same size yield Boolean arrays:



In [111]:
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
arr2

array([[ 0.,  4.,  1.],
       [ 7.,  2., 12.]])

#### Comparisons between arrays


In [112]:
arr2 > arr

array([[False,  True, False],
       [ True, False,  True]])

> Evaluating operations between differently sized arrays is called broadcasting.
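As a small, hedged illustration of broadcasting (the arrays here are made up for the example), a one-dimensional array of length 3 and a column of shape `(2, 1)` can each be combined with a `(2, 3)` array:

```python
import numpy as np

arr = np.array([[1., 2., 3.],
                [4., 5., 6.]])    # shape (2, 3)
row = np.array([10., 20., 30.])   # shape (3,)
col = np.array([[100.], [200.]])  # shape (2, 1)

plus_row = arr + row  # row is added to every row of arr
plus_col = arr + col  # col is added to every column of arr

print(plus_row)
print(plus_col)
```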

## Basic Indexing and Slicing

NumPy array indexing is a deep topic, as there are many ways you may want to select a subset of your data or individual elements. One-dimensional arrays are simple; on the surface they act similarly to Python lists:



> Select a subset of your data or individual elements


In [113]:
import numpy as np
arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Select individual elements by their zero-based index:

In [120]:
a = np.array([5,6,7,9,1])

In [122]:
a[0]

5

In [119]:
arr[1]

1

In [123]:
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [126]:
arr[5:8]

array([5, 6, 7])

Assign a scalar to a slice (broadcasting):

In [127]:
arr[5:8] = 12

In [128]:
arr
# As you can see, if you assign a scalar value to a slice, as in arr[5:8] = 12, the value is propagated (or broadcast henceforth) to the entire selection.

array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

> An important first distinction from Python's built-in lists is that array slices are views on the original array. This means that the data is not copied, and any modifications to the view will be reflected in the source array.



In [129]:
arr[5:8]

array([12, 12, 12])

In [130]:
arr_slice = arr[5:8]
arr_slice

array([12, 12, 12])

Now, when I change values in arr_slice, the mutations are reflected in the original array arr:



In [134]:
arr_slice[1] = 12345
arr_slice



array([   12, 12345,    12])

In [133]:
arr

array([    0,     1,     2,     3,     4,    12, 12345,    12,     8,
           9])

The "bare" slice [:] will assign to all values in an array:



In [135]:
arr_slice[:] = 64

In [136]:
arr

array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])

> If you are new to NumPy, you might be surprised by this, especially if you have used other array programming languages that copy data more eagerly. As NumPy has been designed to be able to work with very large arrays, you could imagine performance and memory problems if NumPy insisted on always copying data.



> If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy the array—for example, `arr[5:8].copy()`. As you will see, pandas works this way, too.



In [137]:
arr_slice_copy = arr[5:8].copy()
 
arr_slice_copy[:] = 333333

In [138]:
arr

array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])
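The difference between a view and a copy can also be checked directly. The sketch below uses `numpy.shares_memory` and the `base` attribute; the variable names are illustrative.

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]                # a slice is a view onto arr's data
independent = arr[5:8].copy()  # an explicit, independent copy

print(np.shares_memory(arr, view))         # True
print(np.shares_memory(arr, independent))  # False
print(view.base is arr)                    # True: the view refers back to arr

view[:] = -1         # also changes arr[5:8]
independent[:] = 99  # leaves arr untouched
print(arr)
```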

### Higher Dimensional Arrays

> With higher dimensional arrays, you have many more options. In a two-dimensional array, the elements at each index are no longer scalars but rather one-dimensional arrays:



In [None]:
arr2d = np.array([[1, 2, 3], 
                  [4, 5, 6], 
                  [7, 8, 9]])

In [None]:
arr2d[2]

Individual elements can be accessed by passing a comma-separated list of indices:

In [None]:
arr2d[0, 2]

In [None]:
arr2d[0][2]

> An illustration of indexing on a two-dimensional array. I find it helpful to think of axis 0 as the "rows" of the array and axis 1 as the "columns."

![image.png](attachment:image.png)
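The rows-and-columns picture can be sketched in code as follows. The `[:, 1]` form, which selects a whole column, is an addition for illustration and is not used elsewhere in this notebook.

```python
import numpy as np

arr2d = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

second_row = arr2d[1]       # index along axis 0 -> [4 5 6]
second_col = arr2d[:, 1]    # every row, index 1 along axis 1 -> [2 5 8]
element = arr2d[0, 2]       # row 0, column 2 -> 3
same_element = arr2d[0][2]  # equivalent, but indexes in two steps

print(second_row, second_col, element, same_element)
```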

## Pseudorandom Number Generation





> The `numpy.random` module supplements the built-in Python random module with functions for efficiently generating whole arrays of sample values from many kinds of probability distributions.


* Much faster than Python's built-in `random` module for generating very large samples
* For example, you can get a 4 × 4 array of samples from the standard normal distribution using

In [152]:
samples = np.random.standard_normal(size=(4, 4))
samples


array([[ 0.2608348 ,  0.89080833, -0.37105702,  0.57959148],
       [-0.98259207, -0.18606728,  0.12734231, -0.34724384],
       [ 0.87889153, -0.33755182,  1.25498076, -0.39682047],
       [ 1.48742241,  0.9115685 , -0.37025513, -0.36594254]])

You can also use an explicit generator:


> These random numbers are not truly random (rather, pseudorandom) but instead are generated by a configurable random number generator that determines deterministically what values are created. Functions like numpy.random.standard_normal use the numpy.random module's default random number generator, but your code can be configured to use an explicit generator:



In [167]:
# `seed` determines the initial state of the generator

rng = np.random.default_rng(seed=12345)
data = rng.standard_normal((2, 3))
data


array([[-1.42382504,  1.26372846, -0.87066174],
       [-0.25917323, -0.07534331, -0.74088465]])
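Because the generator is deterministic, two generators created with the same seed produce the same values. A minimal sketch (the names `rng_a` and `rng_b` are illustrative):

```python
import numpy as np

rng_a = np.random.default_rng(seed=12345)
rng_b = np.random.default_rng(seed=12345)

draw_a = rng_a.standard_normal((2, 3))
draw_b = rng_b.standard_normal((2, 3))

# Same seed and same sequence of calls -> identical values.
print(np.array_equal(draw_a, draw_b))  # True

# Each generator keeps its own state, so a further draw gives new values.
next_a = rng_a.standard_normal((2, 3))
print(np.array_equal(draw_a, next_a))  # False
```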

![image.png](attachment:image.png)

## Universal Functions: Fast Element-Wise Array Functions

* A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays.

* Many ufuncs are simple element-wise transformations such as `numpy.sqrt` or `numpy.exp`:

### Unary

In [168]:
arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [169]:
np.sqrt(arr)

array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])

In [170]:
np.exp(arr)

array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
       5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
       2.98095799e+03, 8.10308393e+03])

These are referred to as unary ufuncs. Others, such as numpy.add or numpy.maximum, take two arrays (thus, binary ufuncs) and return a single array as the result:



### Binary

In [171]:
x = rng.standard_normal(8)
y = rng.standard_normal(8)


In [172]:
x

array([-1.3677927 ,  0.6488928 ,  0.36105811, -1.95286306,  2.34740965,
        0.96849691, -0.75938718,  0.90219827])

In [173]:
y

array([-0.46695317, -0.06068952,  0.78884434, -1.25666813,  0.57585751,
        1.39897899,  1.32229806, -0.29969852])

In [174]:
np.maximum(x, y)
#In this example, numpy.maximum computed the element-wise maximum of the elements in x and y.


array([-0.46695317,  0.6488928 ,  0.78884434, -1.25666813,  2.34740965,
        1.39897899,  1.32229806,  0.90219827])

> Ufuncs accept an optional out argument that allows them to assign their results into an existing array rather than create a new one:



In [None]:
np.add(arr, 1)

In [None]:
out = np.zeros_like(arr)  # preallocate an array with the same shape and dtype as arr
np.add(arr, 1, out=out)

In [None]:
out

![image.png](attachment:image.png)

## Expressing Conditional Logic as Array Operations


The `numpy.where` function is a vectorized version of the ternary expression `x if condition else y`.

* The second and third arguments to `numpy.where` can also be scalars.
* You can also combine scalars and arrays.

In [176]:
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

Take a value from `xarr` whenever the corresponding value in `cond` is `True`, and otherwise take the value from `yarr`:


#### Using `x if condition else y`

In [175]:
result = [(x if c else y)
  for x, y, c in zip(xarr, yarr, cond)]

result


#### Using NumPy

In [None]:
result = np.where(cond, xarr, yarr)
result

You can also do this with scalars, or combine arrays and scalars:


In [None]:
arr = rng.standard_normal((4,4))
arr

In [None]:
np.where(arr > 0, 2, -2)

In [None]:
# Set only the positive values to 2
np.where(arr > 0, 2, arr)
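Since the cells above use random data, here is a deterministic sketch of the same two `numpy.where` patterns on a small hand-written array (the values are made up for illustration):

```python
import numpy as np

arr = np.array([[1.5, -0.5],
                [-2.0, 3.0]])

# Both branches are scalars: 2 where positive, -2 elsewhere.
signs = np.where(arr > 0, 2, -2)

# Scalar and array mixed: 2 where positive, the original value elsewhere.
capped = np.where(arr > 0, 2, arr)

print(signs)   # [[ 2 -2]
               #  [-2  2]]
print(capped)  # [[ 2.  -0.5]
               #  [-2.   2. ]]
```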

## Mathematical and Statistical Methods

Use "aggregations" like `sum`, `mean`, and `std`:



In [179]:
arr = rng.standard_normal((5, 4))
arr

array([[-0.41485376,  0.0977165 , -1.64041784, -0.85725882],
       [ 0.68828179, -1.15452958,  0.65045239, -1.38835995],
       [-0.90738246, -1.09542531,  0.00714569,  0.5343599 ],
       [-1.06580785, -0.18147274,  1.6219518 , -0.31739195],
       [-0.81581497,  0.38657902, -0.22363893, -0.70169081]])

In [180]:
arr.mean()

-0.33887789328041784

In [None]:
np.mean(arr)

> When you use the NumPy function, like numpy.sum, you have to pass the array you want to aggregate as the first argument.

In [181]:
arr.sum()

-6.777557865608356

In [None]:
np.sum(arr)

> Functions like mean and sum take an optional axis argument that computes the statistic over the given axis, resulting in an array with one less dimension:



In [None]:
# Use `axis` to specify the axis along which to compute the statistic

arr

In [182]:
arr.mean(axis=1)

array([-0.70370348, -0.30103884, -0.36532554,  0.01431982, -0.33864142])

> Here, arr.mean(axis=1) means "compute mean across the columns," whereas arr.mean(axis=0) means "compute mean down the rows."



In [None]:
arr.mean(axis=0)
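The effect of `axis` is easier to follow on a small array with known values. This sketch uses a 2 × 3 array built with `arange` and `reshape`, chosen only for illustration:

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

col_means = arr.mean(axis=0)  # one value per column -> shape (3,)
row_means = arr.mean(axis=1)  # one value per row    -> shape (2,)
col_sums = arr.sum(axis=0)

print(col_means)  # [1.5 2.5 3.5]
print(row_means)  # [1. 4.]
print(col_sums)   # [3 5 7]
```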

### Methods for Boolean Arrays

* Boolean values are coerced to 1 (`True`) and 0 (`False`) in the preceding methods. 
* Thus, sum is often used as a means of counting True values in a boolean array:



In [183]:
arr

array([[-0.41485376,  0.0977165 , -1.64041784, -0.85725882],
       [ 0.68828179, -1.15452958,  0.65045239, -1.38835995],
       [-0.90738246, -1.09542531,  0.00714569,  0.5343599 ],
       [-1.06580785, -0.18147274,  1.6219518 , -0.31739195],
       [-0.81581497,  0.38657902, -0.22363893, -0.70169081]])

In [184]:
(arr > 0)

array([[False,  True, False, False],
       [ True, False,  True, False],
       [False, False,  True,  True],
       [False, False,  True, False],
       [False,  True, False, False]])

In [185]:
(arr > 0).sum() # Number of positive values

7

`any` tests whether one or more values in an array is True, while `all` checks if every value is True:


In [186]:
bools = np.array([False, False, True, False])
bools

array([False, False,  True, False])

In [187]:
bools.any()

True
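For completeness, a short sketch of `all` next to `any`, plus the common pattern of applying them to a comparison result (the `values` array is illustrative):

```python
import numpy as np

bools = np.array([False, False, True, False])
print(bools.any())  # True: at least one value is True
print(bools.all())  # False: not every value is True

values = np.array([0.5, 1.2, 3.4])
all_positive = (values > 0).all()  # every element is greater than 0
count_above_one = (values > 1).sum()  # counts the True values
print(all_positive, count_above_one)  # True 2
```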

## Sorting

NumPy arrays can be sorted in place with the `sort` method:


In [188]:
arr = rng.standard_normal(6)
arr

array([-1.79571318e+00,  8.18325622e-01, -5.71032902e-01,  7.85525063e-04,
       -1.06364272e+00,  1.30171450e+00])

In [189]:
arr.sort()
arr

array([-1.79571318e+00, -1.06364272e+00, -5.71032902e-01,  7.85525063e-04,
        8.18325622e-01,  1.30171450e+00])

> You can sort each one-dimensional section of values in a multidimensional array in place along an axis by passing the axis number to `sort`.
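A minimal sketch of sorting along an axis, using a small hand-written array instead of random data. It also shows `numpy.sort`, which returns a sorted copy rather than sorting in place.

```python
import numpy as np

arr = np.array([[3, 1, 2],
                [9, 7, 8],
                [6, 4, 5]])

# np.sort returns a sorted copy (here, within each row); arr is unchanged.
sorted_rows = np.sort(arr, axis=1)
print(sorted_rows)  # [[1 2 3]
                    #  [7 8 9]
                    #  [4 5 6]]

# The sort method works in place (here, within each column).
arr.sort(axis=0)
print(arr)          # [[3 1 2]
                    #  [6 4 5]
                    #  [9 7 8]]
```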
