# NumPy
NumPy, short for Numerical Python, is one of the most important foundational packages for numerical computing in Python.
Much of the knowledge about NumPy that I cover is transferable to pandas as well.

For most data analysis applications, the main areas of functionality I’ll focus on are:

- Fast array-based operations for data munging and cleaning, subsetting and filtering, transformation, and any other kind of computation
- Common array algorithms like sorting, unique, and set operations
- Efficient descriptive statistics and aggregating/summarizing data
- Data alignment and relational data manipulations for merging and joining heterogeneous datasets
- Expressing conditional logic as array expressions instead of loops with if-elif-else branches
- Group-wise data manipulations (aggregation, transformation, and function application)

One reason NumPy is so important for numerical computation in Python is that it is designed for **efficiency** on large arrays of data. There are several reasons for this:
- NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. NumPy’s library of algorithms written in the C language can operate on this memory without any type checking or other overhead.
- NumPy operations perform complex computations on entire arrays without the need for Python for loops, which can be slow for large sequences. NumPy is faster than regular Python code because its C-based algorithms avoid overhead present with regular interpreted Python code.

In [479]:
import numpy as np
np.random.seed(12345)

import matplotlib.pyplot as plt
plt.rc("figure", figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)

In [481]:
import numpy as np

my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))

In [None]:
%timeit my_arr2 = my_arr * 2
%timeit my_list2 = [x * 2 for x in my_list]

NumPy-based algorithms are generally 10 to 100 times faster (or more) than their
pure Python counterparts and use significantly less memory.

## The NumPy ndarray: A Multidimensional Array Object

One of the key features of NumPy is its N-dimensional array object, or **ndarray**, 
which is a fast, flexible container for large datasets in Python. Arrays enable you to
perform mathematical operations on whole blocks of data using similar syntax to the
equivalent operations between scalar elements.

An **ndarray** is a generic multidimensional container for homogeneous data; that is, all
of the elements must be the same type. Every array has a **shape**, a tuple indicating the
size of each dimension, and a **dtype**, an object describing the data type of the array.

In [482]:
import numpy as np
data = np.array([[1.5, -0.1, 3], [0, -3, 6.5]])
data

array([[ 1.5, -0.1,  3. ],
       [ 0. , -3. ,  6.5]])

In [None]:
data * 10
data + data

In [None]:
data.shape
data.dtype

### Creating ndarrays

In [None]:
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1

In [None]:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2

In [None]:
arr2.ndim
arr2.shape

In [None]:
arr1.dtype
arr2.dtype

numpy.zeros and numpy.ones create arrays of 0s or 1s,
respectively, with a given length or shape. 

numpy.empty creates an array without
initializing its values to any particular value.

It’s **not** safe to assume that numpy.empty will return an array of all zeros. This function returns uninitialized memory and thus may
contain nonzero “garbage” values. You should use this function only if you intend to populate the new array with data.

In [None]:
np.zeros(10)
np.zeros((3, 6))
np.empty((2, 3, 2))

In [None]:
np.arange(15)

#### Some important NumPy array creation functions
- **array** Convert input data (list, tuple, array, or other sequence type) to an ndarray, either by inferring a data type or by explicitly specifying one; copies the input data by default
- **asarray** Convert input to ndarray, but do not copy if the input is already an ndarray
- **arange** Like the built-in range but returns an ndarray instead of a range object
- **ones, ones_like** Produce an array of all 1s with the given shape and data type; ones_like takes another array and produces a ones array of the same shape and data type
- **zeros, zeros_like** Like ones and ones_like but producing arrays of 0s instead
- **empty, empty_like** Create new arrays by allocating new memory, but do not populate them with any values as ones and zeros do
- **full, full_like** Produce an array of the given shape and data type with all values set to the indicated “fill value”; full_like takes another array and produces a filled array of the same shape and data type
- **eye, identity** Create a square N × N identity matrix (1s on the diagonal and 0s elsewhere)
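
A few of the functions in this list can be compared side by side. The following is only a small sketch; the variable names (base, same, copied, and so on) are made up for the illustration.

```python
import numpy as np

base = np.array([[1, 2, 3], [4, 5, 6]])

# asarray does not copy when the input is already an ndarray
same = np.asarray(base)
# array copies the input data by default
copied = np.array(base)

# full and full_like set every element to the given fill value
filled = np.full((2, 3), 7.5)
filled_like = np.full_like(base, 9)   # same shape and dtype as base

# eye builds a square identity matrix
ident = np.eye(3)

print(same is base)                      # True
print(np.shares_memory(copied, base))    # False
print(filled_like.dtype == base.dtype)   # True
```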

### Data Types for ndarrays
The data type or dtype is a special object containing the information (or metadata,
data about data) the ndarray needs to interpret a chunk of memory as a particular
type of data.

Data types are a source of NumPy’s flexibility for interacting with data coming from
other systems. In most cases they provide a mapping directly onto an underlying
disk or memory representation, which makes it possible to read and write binary
streams of data to disk and to connect to code written in a low-level language like
C or FORTRAN. The numerical data types are named the same way: a type name,
like float or int, followed by a number indicating the number of bits per element.
A standard double-precision floating-point value (what’s used under the hood in
Python’s float object) takes up 8 bytes or 64 bits. Thus, this type is known in NumPy
as float64. 
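
The relationship between a type's name and its size can be checked directly. This is a minimal sketch using the dtype attributes itemsize and name; the variable names are illustrative.

```python
import numpy as np

f64 = np.dtype(np.float64)
i32 = np.dtype(np.int32)

# itemsize is the number of bytes per element (bits / 8)
print(f64.name, f64.itemsize)   # float64 8
print(i32.name, i32.itemsize)   # int32 4

# The size of an array's data buffer is itemsize times the element count
arr = np.zeros(10, dtype=np.float64)
print(arr.nbytes)               # 80
```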

**NumPy data types**
<img src="Img/np_data_types.png" alt="NumPy data types" title="NumPy data types" />


In [None]:
arr1 = np.array([1, 2, 3], dtype=np.float64)
arr2 = np.array([1, 2, 3], dtype=np.int32)
arr1.dtype
arr2.dtype

In [None]:
#  cast an array from one data type to another
arr = np.array([1, 2, 3, 4, 5])
arr.dtype
float_arr = arr.astype(np.float64)
float_arr
float_arr.dtype

In [None]:
arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
arr
arr.astype(np.int32)

In [None]:
#`np.string_` was removed in the NumPy 2.0 release. Use `np.bytes_` instead.
numeric_strings = np.array(["1.25", "-9.6", "42"], dtype=np.bytes_)
numeric_strings.astype(float)

In [None]:
int_array = np.arange(10)
calibers = np.array([.22, .270, .357, .380, .44, .50], dtype=np.float64)
int_array.astype(calibers.dtype)

Calling astype always creates a **new array** (a copy of the data), even
if the new data type is the same as the old data type.
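
One way to confirm this copying behavior, sketched here with np.shares_memory and made-up variable names, is to cast to the same dtype and then modify the result:

```python
import numpy as np

arr = np.arange(5, dtype=np.int64)
same_type = arr.astype(np.int64)   # same dtype, still a new array

same_type[0] = 99                  # modify the result only
print(arr[0])                      # 0, the original is untouched
print(np.shares_memory(arr, same_type))   # False
```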

### Arithmetic with NumPy Arrays
Arrays are important because they enable you to express batch operations on data
without writing any for loops. NumPy users call this **vectorization**. Any arithmetic
operations between equal-size arrays apply the operation element-wise.

In [None]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr
arr * arr
arr - arr

In [None]:
# Arithmetic operations with scalars
1 / arr
arr ** 2

In [None]:
# Comparisons between arrays of the same size
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
arr2
arr2 > arr

### Basic Indexing and Slicing

NumPy array indexing is a deep topic, as there are many ways you may want to select
a subset of your data or individual elements. One-dimensional arrays are simple; on
the surface they act similarly to Python lists.

If you assign a scalar value to a slice, as in arr[5:8] = 12, the value is
propagated (or broadcast henceforth) to the entire selection.

Evaluating operations between differently sized arrays is called **broadcasting**.
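
As a small illustration of broadcasting beyond the scalar case, the sketch below combines a two-dimensional array with a row and with a column. The array names are made up for the example.

```python
import numpy as np

table = np.arange(6).reshape((2, 3))   # shape (2, 3)
row = np.array([10, 20, 30])           # shape (3,)
col = np.array([[100], [200]])         # shape (2, 1)

# row is stretched across both rows of table
by_row = table + row
# col is stretched across all three columns of table
by_col = table + col

print(by_row.tolist())   # [[10, 21, 32], [13, 24, 35]]
print(by_col.tolist())   # [[100, 101, 102], [203, 204, 205]]
```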

An important first distinction from Python’s built-in lists is that
array slices are **views** on the original array. This means that the data
is **not** copied, and any modifications to the view will be reflected in
the source array.
As NumPy has been designed to be able to work with very large arrays, you could imagine performance and memory problems if NumPy insisted on always copying data.
If you want a copy of a slice of an ndarray instead of a
view, you will need to explicitly copy the array—for example,
arr[5:8].copy(). As you will see, pandas works this way, too.
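
The difference between a view, an explicit copy, and a slice of a built-in list can be sketched as follows. The variable names are illustrative, and np.shares_memory is used only as a convenient check.

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]            # a view: shares memory with arr
copied = arr[5:8].copy()   # an independent copy

view[0] = -1               # shows up in arr
copied[1] = -2             # does not show up in arr
print(arr.tolist())        # [0, 1, 2, 3, 4, -1, 6, 7, 8, 9]
print(np.shares_memory(arr, view))     # True
print(np.shares_memory(arr, copied))   # False

# A slice of a built-in list, by contrast, is already a copy
lst = list(range(10))
lst_slice = lst[5:8]
lst_slice[0] = -1
print(lst[5])              # 5
```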

In [503]:
arr = np.arange(10)
arr


array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [504]:
arr[5]


np.int64(5)

In [505]:
arr[5:8] = 12


In [508]:
old = arr[:5].copy()


In [509]:
arr[5:8] = 10

In [510]:
arr

array([ 0,  1,  2,  3,  4, 10, 10, 10,  8,  9])

In [511]:
old

array([0, 1, 2, 3, 4])

In [None]:
arr_slice = arr[5:8]
arr_slice

In [None]:
# mutations are reflected in the original array arr
arr_slice[1] = 12345
arr

In [None]:
# : will assign to all values in an array
arr_slice[:] = 64
arr

In [520]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d.shape
arr2d[:,0]

array([1, 4, 7])

In [517]:
arr2d[0][2]
arr2d[0, 2]

np.int64(3)

In multidimensional arrays, if you omit later indices, the returned object will be a
lower dimensional ndarray consisting of all the data along the higher dimensions.

In [None]:
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
arr3d

In [None]:
arr3d[0]

In [None]:
# Both scalar values and arrays can be assigned to arr3d[0]
old_values = arr3d[0].copy()
arr3d[0] = 42
arr3d
arr3d[0] = old_values
arr3d

In [None]:
# Similarly, arr3d[1, 0] gives you all of the values whose indices start with (1, 0),
# forming a one-dimensional array
arr3d[1, 0]

In [None]:
x = arr3d[1]
x
x[0]

This multidimensional indexing syntax for NumPy arrays will not
work with regular Python objects, such as lists of lists.
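
To see the difference, compare an array with the nested list it was built from. This is a minimal sketch; nested and tuple_index_ok are names invented for the example.

```python
import numpy as np

nested = [[1, 2, 3], [4, 5, 6]]
arr2d = np.array(nested)

print(arr2d[0, 2])    # 3
print(nested[0][2])   # 3, chained indexing works for lists

try:
    nested[0, 2]      # a tuple index is not valid for a list
    tuple_index_ok = True
except TypeError:
    tuple_index_ok = False
print(tuple_index_ok)  # False
```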

In [None]:
arr
arr[1:6]

In [None]:
# A two-dimensional array is sliced along axis 0, the first axis,
# so arr2d[:2] reads as “select the first two rows of arr2d.”
arr2d
arr2d[:2]

In [None]:
arr2d[:2, 1:]

In [None]:
lower_dim_slice = arr2d[1, :2]

In [None]:
lower_dim_slice.shape

In [None]:
arr2d[:2, 2]

In [None]:
arr2d[:, :1]

In [None]:
arr2d[:2, 1:] = 0
arr2d

<img src="Img/np_2d.png" alt="Indexing elements in a NumPy array" title="Indexing elements in a NumPy array" />

<img src="Img/np_slicing.png" alt="Two-dimensional array slicing" title="Two-dimensional array slicing" />


### Boolean Indexing

Selecting data from an array by Boolean indexing and assigning the result to a new
variable always creates a copy of the data, even if the returned array is unchanged.

The Python keywords and and or do not work with Boolean arrays.
Use & (and) and | (or) instead.
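
A short sketch of what happens in each case, with made-up arrays a and b: the operators work element-wise, while the keyword asks the whole array for a single truth value and fails.

```python
import numpy as np

a = np.array([True, False, True])
b = np.array([True, True, False])

print((a & b).tolist())   # [True, False, False]
print((a | b).tolist())   # [True, True, True]

try:
    a and b               # needs one truth value for the whole array
    keyword_ok = True
except ValueError:
    keyword_ok = False
print(keyword_ok)         # False
```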

In [521]:
names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
data = np.array([[4, 7], [0, 2], [-5, 6], [0, 0], [1, 2],
                 [-12, -4], [3, 4]])
names


array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')

In [522]:
data

array([[  4,   7],
       [  0,   2],
       [ -5,   6],
       [  0,   0],
       [  1,   2],
       [-12,  -4],
       [  3,   4]])

In [523]:
names == "Bob"

array([ True, False, False,  True, False, False, False])

In [524]:
# The Boolean array must be of the same length as the array axis it’s indexing.
data[names == "Bob"]

array([[4, 7],
       [0, 0]])

In [None]:
# I select from the rows where names == "Bob" and index the columns
data[names == "Bob", 1:]
data[names == "Bob", 0]

array([4, 0])

In [527]:
~(names == "Bob")

array([False,  True,  True, False,  True,  True,  True])

In [528]:
names == "Bob"

array([ True, False, False,  True, False, False, False])

In [529]:
names != "Bob"
~(names == "Bob")
data[~(names == "Bob")]

array([[  0,   2],
       [ -5,   6],
       [  1,   2],
       [-12,  -4],
       [  3,   4]])

In [None]:
# The ~ operator can be useful when you want to invert a Boolean array referenced by a variable
cond = names == "Bob"
data[~cond]

In [None]:
# multiple Boolean conditions, use Boolean arithmetic operators like & (and) and | (or)
mask = (names == "Bob") | (names == "Will")
mask
data[mask]

In [None]:
data[data < 0] = 0
data

In [None]:
data[names != "Joe"] = 7
data

### Fancy Indexing
Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays.

Keep in mind that fancy indexing, unlike slicing, always **copies** the data into a new
array when assigning the result to a new variable.
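
The contrast with slicing can be checked on a small array. The following sketch selects the same two rows both ways; the names picked and sliced are illustrative.

```python
import numpy as np

arr = np.arange(12).reshape((3, 4))

picked = arr[[0, 2]]     # fancy indexing: a new array
sliced = arr[0:3:2]      # slicing the same rows: a view

picked[:] = -1           # does not touch arr
print(arr[0].tolist())   # [0, 1, 2, 3]
print(np.shares_memory(arr, picked))   # False
print(np.shares_memory(arr, sliced))   # True
```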

In [None]:
arr = np.zeros((8, 4))
for i in range(8):
    arr[i] = i
arr

In [None]:
# To select a subset of the rows in a particular order, you can simply pass a list 
# or ndarray of integers specifying the desired order
arr[[4, 3, 0, 6]]

In [None]:
# Using negative indices selects rows from the end
arr[[-3, -5, -7]]

In [None]:
# it selects a one-dimensional array of elements corresponding to each tuple of indices
# the elements (1, 0), (5, 3), (7, 1), and (2, 2) were selected
arr = np.arange(32).reshape((8, 4))
arr
arr[[1, 5, 7, 2], [0, 3, 1, 2]]

In [None]:
arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]

In [None]:
arr[[1, 5, 7, 2], [0, 3, 1, 2]]
arr[[1, 5, 7, 2], [0, 3, 1, 2]] = 0
arr

### Transposing Arrays and Swapping Axes

Transposing is a special form of reshaping that similarly returns a view on the
underlying data without copying anything. Arrays have the transpose method and
the special T attribute.

Simple transposing with .T is a special case of swapping axes. ndarray has the method
swapaxes, which takes a pair of axis numbers and switches the indicated axes to
rearrange the data.

swapaxes similarly returns a view on the data without making a copy.
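
As a rough illustration on a three-dimensional array (the names here are made up), swapaxes changes the shape, agrees with .T in the two-dimensional case, and returns a view:

```python
import numpy as np

arr = np.arange(24).reshape((2, 3, 4))

swapped = arr.swapaxes(0, 2)   # exchange the first and last axes
print(swapped.shape)           # (4, 3, 2)

# For a two-dimensional array, swapping axes 0 and 1 matches .T
mat = np.arange(6).reshape((2, 3))
print(np.array_equal(mat.swapaxes(0, 1), mat.T))   # True

# The result is a view: writing through it changes arr
swapped[0, 0, 0] = -1
print(arr[0, 0, 0])            # -1
```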

In [None]:
arr = np.arange(15).reshape((3, 5))
arr
arr.T

In [None]:
arr = np.arange(15)
arr

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [532]:
arr.reshape((3, 5))

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [None]:
arr = np.array([[0, 1, 0], [1, 2, -2], [6, 3, 2], [-1, 0, -1], [1, 0, 1]])
arr
np.dot(arr.T, arr)

In [None]:
# The @ operator is another way to do matrix multiplication (it's preferred)
arr.T @ arr

In [None]:
arr
arr.swapaxes(0, 1)

## Pseudorandom Number Generation
The numpy.random module supplements the built-in Python random module with
functions for efficiently generating whole arrays of sample values from many kinds of
probability distributions.

These random numbers are not truly random (rather, pseudorandom) but instead
are generated by a configurable random number generator that deterministically
determines what values are created. Functions like numpy.random.standard_normal use
the numpy.random module’s default random number generator, but your code can be
configured to use an explicit generator.

```python
rng = np.random.default_rng(seed=12345)
data = rng.standard_normal((2, 3))
```
The seed argument is what determines the initial state of the generator, and the state
changes each time the rng object is used to generate data.
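
Both points can be checked with a short sketch: two generators created from the same seed produce the same first draw, and a second draw from the same generator differs because its state has advanced. The names rng_a and rng_b are illustrative.

```python
import numpy as np

rng_a = np.random.default_rng(seed=12345)
rng_b = np.random.default_rng(seed=12345)

first_a = rng_a.standard_normal(3)
first_b = rng_b.standard_normal(3)
print(np.array_equal(first_a, first_b))    # True: same seed, same draws

second_a = rng_a.standard_normal(3)
print(np.array_equal(first_a, second_a))   # False: the state has advanced
```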

NumPy random number generator methods
- **permutation** Return a random permutation of a sequence, or return a permuted range
- **shuffle** Randomly permute a sequence in place
- **uniform** Draw samples from a uniform distribution (over [0, 1) by default)
- **integers** Draw random integers from a given low-to-high range
- **standard_normal** Draw samples from a normal distribution with mean 0 and standard deviation 1
- **binomial** Draw samples from a binomial distribution
- **normal** Draw samples from a normal (Gaussian) distribution
- **beta** Draw samples from a beta distribution
- **chisquare** Draw samples from a chi-square distribution
- **gamma** Draw samples from a gamma distribution

In [534]:
samples = np.random.standard_normal(size=(4, 4))
samples

array([[-0.2047,  0.4789, -0.5194, -0.5557],
       [ 1.9658,  1.3934,  0.0929,  0.2817],
       [ 0.769 ,  1.2464,  1.0072, -1.2962],
       [ 0.275 ,  0.2289,  1.3529,  0.8864]])

In [535]:
# Python’s built-in random module, by contrast, samples only one value at a time.
from random import normalvariate

N = 1_000_000
%timeit samples = [normalvariate(0, 1) for _ in range(N)]
%timeit np.random.standard_normal(N)

272 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
15.2 ms ± 165 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [None]:
rng = np.random.default_rng(seed=12345)
data = rng.standard_normal((2, 3))

In [None]:
type(rng)

In [None]:
arr

np.random.shuffle(arr)
arr

## Universal Functions: Fast Element-Wise Array Functions

A universal function, or ufunc, is a function that performs element-wise operations
on data in ndarrays. You can think of them as fast vectorized wrappers for simple
functions that take one or more scalar values and produce one or more scalar results.
Many ufuncs are simple element-wise transformations, like numpy.sqrt or
numpy.exp. These are referred to as **unary ufuncs**. Others, such as numpy.add or numpy.maximum,
take two arrays (thus, **binary ufuncs**) and return a single array as the result.

Ufuncs accept an optional **out** argument that allows them to assign their results into
an existing array rather than create a new one.

In [None]:
arr = np.arange(10)
arr


array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [538]:
np.sqrt(arr)


array([0.    , 1.    , 1.4142, 1.7321, 2.    , 2.2361, 2.4495, 2.6458,
       2.8284, 3.    ])

In [539]:
np.exp(arr)

array([   1.    ,    2.7183,    7.3891,   20.0855,   54.5982,  148.4132,
        403.4288, 1096.6332, 2980.958 , 8103.0839])

In [540]:
x = rng.standard_normal(8)
y = rng.standard_normal(8)
x
y
np.maximum(x, y)

array([ 1.5737,  0.767 , -0.6215, -0.5368,  0.6677,  0.3589, -0.3973,
       -0.1022])

In [543]:
[max(xx, yy) for xx, yy in zip(x, y)]

[np.float64(1.573650217139163),
 np.float64(0.7670415712955189),
 np.float64(-0.6215066357878022),
 np.float64(-0.5367674842527366),
 np.float64(0.6677241819767613),
 np.float64(0.3589164304858842),
 np.float64(-0.397343136386101),
 np.float64(-0.10221420428307952)]

In [546]:
np.random.shuffle(arr)
arr

array([2, 7, 4, 0, 3, 5, 6, 9, 8, 1])

In [None]:
# While not common, a ufunc can return multiple arrays. numpy.modf is one example:
# a vectorized version of the built-in Python math.modf, it returns the fractional and
# integral parts of a floating-point array
arr = rng.standard_normal(7) * 5
arr
remainder, whole_part = np.modf(arr)
remainder
whole_part

In [None]:
arr
out = np.zeros_like(arr)
np.add(arr, 1)
np.add(arr, 1, out=out)
out

<img src="Img/np_unary_binary_ufunc.png" alt="Some unary and binary universal functions" title="Some unary and binary universal functions" />


## Array-Oriented Programming with Arrays

Using NumPy arrays enables you to express many kinds of data processing tasks as
concise array expressions that might otherwise require writing loops. This practice
of replacing explicit loops with array expressions is referred to by some people
as **vectorization**.

As a simple example, suppose we wished to evaluate the function sqrt(x^2 + y^2)
across a regular grid of values. The numpy.meshgrid function takes two
one-dimensional arrays and produces two two-dimensional matrices corresponding to all
pairs of (x, y) in the two arrays.
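
On a tiny input the structure of the two output matrices is easy to see. This sketch uses two short made-up arrays, x and y, with the default settings of numpy.meshgrid.

```python
import numpy as np

x = np.array([0, 1, 2])
y = np.array([10, 20])

xs, ys = np.meshgrid(x, y)
print(xs.shape, ys.shape)   # (2, 3) (2, 3)
print(xs.tolist())          # [[0, 1, 2], [0, 1, 2]]
print(ys.tolist())          # [[10, 10, 10], [20, 20, 20]]
```

Each position (i, j) pairs xs[i, j] with ys[i, j], so together the two matrices enumerate every (x, y) combination.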

In [None]:
points = np.arange(-5, 5, 0.01) # 1000 equally spaced points
xs, ys = np.meshgrid(points, points)
ys

In [None]:
z = np.sqrt(xs ** 2 + ys ** 2)
z

In [None]:
import matplotlib.pyplot as plt
plt.imshow(z, cmap=plt.cm.gray, extent=[-5, 5, -5, 5])
plt.colorbar()
plt.title(r"Image plot of $\sqrt{x^2 + y^2}$ for a grid of values")


In [None]:
plt.draw()

In [None]:
plt.close("all")

### Expressing Conditional Logic as Array Operations

The numpy.where function is a vectorized version of the ternary expression x if
condition else y.

In [547]:
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

In [548]:
result = [(x if c else y)
          for x, y, c in zip(xarr, yarr, cond)]
result

[np.float64(1.1),
 np.float64(2.2),
 np.float64(1.3),
 np.float64(1.4),
 np.float64(2.5)]

This has multiple problems. First, it will not be very fast for large arrays (because all
the work is being done in interpreted Python code). Second, it will not work with
multidimensional arrays.

In [549]:
result = np.where(cond, xarr, yarr)
result

array([1.1, 2.2, 1.3, 1.4, 2.5])

The second and third arguments to numpy.where don’t need to be arrays; one or
both of them can be scalars. A typical use of where in data analysis is to produce a
new array of values based on another array. Suppose you had a matrix of randomly
generated data and you wanted to replace all positive values with 2 and all negative
values with –2. 

In [None]:
arr = rng.standard_normal((4, 4))
arr
arr > 0
np.where(arr > 0, 2, -2)

In [None]:
np.where(arr > 0, 2, arr) # set only positive values to 2

### Mathematical and Statistical Methods
A set of mathematical functions that compute statistics about an entire array or
about the data along an axis are accessible as methods of the array class. You can
use aggregations (sometimes called reductions) like sum, mean, and std (standard
deviation) either by calling the array instance method or using the top-level NumPy
function. When you use the NumPy function, like numpy.sum, you have to pass the
array you want to aggregate as the first argument.

In [550]:
arr = rng.standard_normal((5, 4))
arr
arr.mean()


np.float64(-0.2435435099546431)

In [551]:
np.mean(arr)

np.float64(-0.2435435099546431)

In [552]:
arr.sum()

np.float64(-4.870870199092862)

In [None]:
# arr.mean(axis=1) means “compute mean across the columns,” where
# arr.sum(axis=0) means “compute sum down the rows.”
arr.mean(axis=1)
arr.sum(axis=0)

In [555]:
arr.sum(axis=0), arr.sum(axis=1)

(array([-2.4404, -1.6844, -0.6148, -0.1312]),
 array([-2.0768, -0.8474, -0.2201, -0.389 , -1.3376]))

In [556]:
arr, arr.shape

(array([[-0.5769, -0.1792, -0.5935, -0.7271],
        [-1.2896,  0.5939, -0.1789,  0.0272],
        [-1.3335, -0.3033,  0.6182,  0.7985],
        [ 1.1271, -1.9557,  0.4048,  0.0348],
        [-0.3676,  0.1599, -0.8654, -0.2645]]),
 (5, 4))

In [557]:
# Other methods like cumsum and cumprod do not aggregate, instead producing an array
# of the intermediate results
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7])
arr.cumsum()

array([ 0,  1,  3,  6, 10, 15, 21, 28])

In [None]:
# In multidimensional arrays, accumulation functions like cumsum return an array of
# the same size but with the partial aggregates computed along the indicated axis
# according to each lower dimensional slice
arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
arr

In [None]:
# arr.cumsum(axis=0) computes the cumulative sum along the rows,
# arr.cumsum(axis=1) computes the sums along the columns
arr.cumsum(axis=0)
arr.cumsum(axis=1)

<img src="Img/np_statistical.png" alt="Basic array statistical methods" title="Basic array statistical methods" />


### Methods for Boolean Arrays

In [558]:
# The parentheses here in the expression (arr > 0).sum() are necessary to be able to
# call sum() on the temporary result of arr > 0.

arr = rng.standard_normal(100)
(arr > 0).sum() # Number of positive values
(arr <= 0).sum() # Number of non-positive values

np.int64(46)

In [559]:
(arr > 0).sum()

np.int64(54)

In [565]:
arr > 0, int(True),  int(False)

(array([ True,  True,  True, False, False, False,  True,  True,  True,
        False, False, False,  True,  True, False,  True, False, False,
        False,  True,  True, False,  True, False, False,  True, False,
        False,  True, False, False,  True,  True,  True, False, False,
        False, False, False,  True,  True, False,  True,  True,  True,
        False,  True, False,  True,  True,  True,  True,  True,  True,
        False,  True,  True,  True,  True,  True, False, False,  True,
        False,  True,  True,  True,  True,  True,  True, False, False,
        False,  True, False, False, False, False,  True,  True,  True,
         True, False, False,  True,  True, False,  True, False, False,
        False, False, False, False,  True,  True,  True,  True,  True,
        False]),
 1,
 0)

In [562]:
arr[arr>0].shape

(54,)

Two additional methods, any and all, are especially useful for Boolean arrays. any
tests whether one or more values in an array are True, while all checks whether every
value is True.

These methods also work with non-Boolean arrays, where nonzero elements are
treated as True.
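
A brief sketch of the non-Boolean case, using a made-up integer array, plus the per-axis form on a small two-dimensional array:

```python
import numpy as np

counts = np.array([0, 3, 0, 5])
print(counts.any())   # True: at least one element is nonzero
print(counts.all())   # False: some elements are zero

# With an axis argument the test is applied along that axis
grid = np.array([[1, 0], [2, 3]])
print(grid.all(axis=1).tolist())   # [False, True]
```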

In [567]:
bools = np.array([False, False, True, False])
bools

array([False, False,  True, False])

In [568]:
bools.any()

np.True_

In [569]:
bools.all()

np.False_

### Sorting

In [573]:
arr = rng.standard_normal(6)
arr
arr.sort()
arr

array([-1.2712, -0.6129, -0.5657,  0.0922,  0.8743,  1.7179])

In [574]:
arr = rng.standard_normal((5, 3))
arr

array([[-0.6991, -0.9853,  1.463 ],
       [ 0.4245,  0.8361,  0.3739],
       [-0.7121, -0.6905,  1.3521],
       [-0.0533, -0.0959,  0.5996],
       [-0.0225,  0.1615,  0.0186]])

In [575]:
# arr.sort(axis=0) sorts the values within each column
# arr.sort(axis=1) sorts across each row

arr.sort(axis=0)
arr


array([[-0.7121, -0.9853,  0.0186],
       [-0.6991, -0.6905,  0.3739],
       [-0.0533, -0.0959,  0.5996],
       [-0.0225,  0.1615,  1.3521],
       [ 0.4245,  0.8361,  1.463 ]])

In [None]:
arr.sort(axis=1)
arr

array([[-0.9853, -0.7121,  0.0186],
       [-0.6991, -0.6905,  0.3739],
       [-0.0959, -0.0533,  0.5996],
       [-0.0225,  0.1615,  1.3521],
       [ 0.4245,  0.8361,  1.463 ]])

The top-level function numpy.sort returns a sorted copy of an array (like the Python
built-in function sorted) instead of modifying the array in place.

In [None]:
arr2 = np.array([5, -10, 7, 1, 0, -3])
sorted_arr2 = np.sort(arr2)
sorted_arr2

### Unique and Other Set Logic
- **unique(x)** Compute the sorted, unique elements in x
- **intersect1d(x, y)** Compute the sorted, common elements in x and y
- **union1d(x, y)** Compute the sorted union of elements
- **isin(x, y)** Compute a Boolean array indicating whether each element of x is contained in y
- **setdiff1d(x, y)** Set difference, elements in x that are not in y
- **setxor1d(x, y)** Set symmetric differences; elements that are in either of the arrays, but not both
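
The set functions listed above can be tried on two small integer arrays. The arrays x and y below are made up for the illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 4])
y = np.array([3, 4, 5, 6])

print(np.intersect1d(x, y).tolist())   # [3, 4]
print(np.union1d(x, y).tolist())       # [1, 2, 3, 4, 5, 6]
print(np.setdiff1d(x, y).tolist())     # [1, 2]
print(np.setxor1d(x, y).tolist())      # [1, 2, 5, 6]
print(np.isin(x, y).tolist())          # [False, False, True, True, True]
```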

In [577]:
names = np.array(["Bob", "Will", "Joe", "Bob", "Will", "Joe", "Joe"])
np.unique(names)


array(['Bob', 'Joe', 'Will'], dtype='<U4')

In [578]:
ints = np.array([3, 3, 3, 2, 2, 1, 1, 4, 4])
np.unique(ints)

array([1, 2, 3, 4])

In [579]:
values, counts = np.unique(names, return_counts=True)
values, counts

(array(['Bob', 'Joe', 'Will'], dtype='<U4'), array([2, 3, 2]))

In [None]:
#Contrast numpy.unique with the pure Python alternative
sorted(set(names))

In [581]:
values = np.array([6, 0, 0, 3, 2, 5, 6])
values[np.isin(values, [2, 3, 6])]

array([6, 3, 2, 6])

## File Input and Output with Arrays
NumPy is able to save and load data to and from disk in some text or binary formats.

numpy.save and numpy.load are the two workhorse functions for efficiently saving
and loading array data on disk. Arrays are saved by default in an uncompressed raw
binary format with file extension .npy.

In [None]:
arr = np.arange(10)
np.save("some_array", arr)

In [None]:
np.load("some_array.npy")

In [None]:
np.savez("array_archive.npz", a=arr, b=arr)

In [None]:
arch = np.load("array_archive.npz")
arch["b"]

In [None]:
np.savez_compressed("arrays_compressed.npz", a=arr, b=arr)

In [None]:
!rm some_array.npy
!rm array_archive.npz
!rm arrays_compressed.npz

## Linear Algebra
Linear algebra operations, like matrix multiplication, decompositions, determinants,
and other square matrix math, are an important part of many array libraries.
Multiplying two two-dimensional arrays with * is an element-wise product, while matrix multiplication requires either the dot function, available both as an array method and as a function in the numpy namespace, or the @ infix operator.
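
The difference between the two kinds of product is easiest to see on a small example. The matrices a and b here are made up for the sketch.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[0, 1], [1, 0]])

print((a * b).tolist())   # [[0, 2], [3, 0]]  element-wise product
print((a @ b).tolist())   # [[2, 1], [4, 3]]  matrix product
print(np.array_equal(a @ b, a.dot(b)))   # True
```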

<img src="Img/np_linalg.png" alt="Commonly used numpy.linalg functions" title="Commonly used numpy.linalg functions" />


In [None]:
x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[6., 23.], [-1, 7], [8, 9]])
x
y
x.dot(y)

In [None]:
# x.dot(y) is equivalent to np.dot(x, y)
np.dot(x, y)

In [None]:
x @ np.ones(3)

In [None]:
# numpy.linalg has a standard set of matrix decompositions and things like inverse and determinant
from numpy.linalg import inv, qr

X = rng.standard_normal((5, 5))
mat = X.T @ X
inv(mat)
mat @ inv(mat)

## Example: Random Walks

The simulation of random walks provides an illustrative application of utilizing array
operations. Let’s first consider a simple random walk starting at 0 with steps of 1 and
–1 occurring with equal probability.

In [None]:
import random
position = 0
walk = [position]
nsteps = 1000
for _ in range(nsteps):
    step = 1 if random.randint(0, 1) else -1
    position += step
    walk.append(position)


In [None]:
plt.figure()

In [None]:
plt.plot(walk[:100])

You might make the observation that walk is the cumulative sum of the random steps
and could be evaluated as an array expression. Thus, I use the numpy.random module
to draw 1,000 coin flips at once, set these to 1 and –1, and compute the cumulative
sum.

In [None]:
nsteps = 1000
rng = np.random.default_rng(seed=12345)  # fresh random generator
draws = rng.integers(0, 2, size=nsteps)
steps = np.where(draws == 0, 1, -1)
walk = steps.cumsum()

In [None]:
walk.min()
walk.max()

A more complicated statistic is the first crossing time, the step at which the random
walk reaches a particular value. Here we might want to know how long it took the
random walk to get at least 10 steps away from the origin 0 in either direction.
np.abs(walk) >= 10 gives us a Boolean array indicating where the walk has reached
or exceeded 10, but we want the index of the first 10 or –10. Turns out, we can
compute this using argmax, which returns the first index of the maximum value in
the Boolean array (True is the maximum value).

Note that using argmax here is not always efficient because it always makes a full
scan of the array. In this special case, once a True is observed we know it to be the
maximum value.
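
One caveat worth keeping in mind is that argmax returns 0 when the Boolean array contains no True at all, so the result only means "first crossing" if a crossing exists. A minimal sketch with a short, hand-written walk (not the simulated one):

```python
import numpy as np

walk = np.array([1, 2, 1, 0, -1, -2, -3, -2])

crossed = np.abs(walk) >= 3
print(crossed.argmax())   # 6, the first index where |walk| >= 3

never = np.abs(walk) >= 10
print(never.argmax())     # 0, even though no element is True
print(never.any())        # False, so the 0 above is not a crossing
```

Checking any first, or selecting only the walks that do cross, avoids misreading that 0 as a crossing time.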

In [None]:
(np.abs(walk) >= 10).argmax()

### Simulating Many Random Walks at Once
If your goal is to simulate many random walks, say five thousand of them, you can
generate all of them with minor modifications to the preceding code. If
passed a 2-tuple, the numpy.random functions will generate a two-dimensional array
of draws, and we can compute the cumulative sum for each row to compute all five
thousand random walks in one shot.

In [None]:
nwalks = 5000
nsteps = 1000
draws = rng.integers(0, 2, size=(nwalks, nsteps)) # 0 or 1
steps = np.where(draws > 0, 1, -1)
walks = steps.cumsum(axis=1)
walks

In [None]:
# Now, we can compute the maximum and minimum values obtained over all of the walks
walks.max()
walks.min()

In [None]:
# Out of these walks, let’s compute the minimum crossing time to 30 or –30. 
# This is slightly tricky because not all 5,000 of them reach 30. 
# We can check this using the any method

hits30 = (np.abs(walks) >= 30).any(axis=1)
hits30
hits30.sum() # Number that hit 30 or -30

In [None]:
# We can use this Boolean array to select the rows of walks that actually cross the
# absolute 30 level, and call argmax across axis 1 to get the crossing times
crossing_times = (np.abs(walks[hits30]) >= 30).argmax(axis=1)
crossing_times

In [None]:
crossing_times.mean()

In [None]:
draws = 0.25 * rng.standard_normal((nwalks, nsteps))

In [583]:
# Count how many elements are equal to the maximum value
arr = np.array([1,2,3,4,5,5,5,5])
arr

array([1, 2, 3, 4, 5, 5, 5, 5])

In [585]:
arr.max()

np.int64(5)

In [587]:
(arr == arr.max()).sum()

np.int64(4)

In [589]:
values, counts = np.unique(arr, return_counts=True)
values, counts

(array([1, 2, 3, 4, 5]), array([1, 1, 1, 1, 4]))

In [591]:
counts[values.argmax()]

np.int64(4)