# Chapter 4: NumPy Basics: Arrays and Vectorized Computation

NumPy, short for Numerical Python, is one of the most important foundational packages for numerical computing in Python. Here are some of the things you will find in NumPy:
* *ndarray*, an efficient, multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities.
* Mathematical functions for fast operations on entire arrays of data without having to write loops.
* Tools for reading/writing array data to disk and working with memory-mapped files.
* Linear algebra, random number generation, and Fourier transform capabilities.
* A C API for connecting NumPy with libraries written in C, C++, or FORTRAN.

While NumPy by itself does not provide modeling or scientific functionality, having an understanding of NumPy arrays and array-oriented computing will help you use tools with array-oriented semantics, like pandas, more effectively. For most data analysis applications, the main areas of functionality to focus on are:
* Fast vectorized array operations for data munging and cleaning, subsetting and filtering, transformation, and any other kinds of computations.
* Common array algorithms like sorting, unique, and set operations.
* Efficient descriptive statistics and aggregating/summarizing data.
* Data alignment and relational data manipulations for merging and joining together heterogeneous datasets.
* Group-wise data manipulations (aggregation, transformation, function application).

One of the reasons NumPy is so important for numerical computations is that it is designed for efficiency on large arrays of data. The main reason for this efficiency is:
* NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects.
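
As a rough illustration of the point above, the sketch below puts a Python list next to an equivalent array. The variable names are made up for this example, and it checks only memory layout and results, not speed.

```python
import numpy as np

# Hypothetical sample data: one million integers as an array and as a list.
my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))

# The array holds its elements in one typed, contiguous buffer.
is_contiguous = my_arr.flags['C_CONTIGUOUS']
bytes_per_element = my_arr.itemsize
total_bytes = my_arr.nbytes  # itemsize * number of elements

# One vectorized expression replaces an explicit Python loop.
doubled_arr = my_arr * 2
doubled_list = [x * 2 for x in my_list]
```

Both approaches produce the same values; the array version expresses the whole operation without a Python-level loop.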

In [1]:
import numpy as np

## 4.1 The NumPy ndarray: A Multidimensional Array Object
One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. Arrays enable you to do mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements.

Let's use NumPy to generate a small array of random data:

In [2]:
data = np.random.randn(2,3)
data

array([[-0.55809293,  0.19450723, -0.05263509],
       [ 0.01015906, -1.22043449, -0.00906651]])

In [3]:
data * 10

array([[ -5.58092928,   1.94507234,  -0.52635086],
       [  0.10159063, -12.20434489,  -0.09066512]])

In [4]:
data + data

array([[-1.11618586,  0.38901447, -0.10527017],
       [ 0.02031813, -2.44086898, -0.01813302]])

In [5]:
data.shape

(2, 3)

In [6]:
data.dtype

dtype('float64')

### Creating ndarrays
The *array* function accepts a sequence-like object (including other arrays) and produces a new NumPy array containing the passed data.
* Nested sequences will be converted to multidimensional arrays.
* Unless explicitly specified, np.array tries to infer a good data type for the array that it creates.
* There are also a number of other functions for creating arrays with a given length or shape.

In [7]:
data1 = [6, 7.5, 8, 1, 0]
arr1 = np.array(data1)
arr1

array([6. , 7.5, 8. , 1. , 0. ])

In [8]:
data2 = [[1,2,3,4], [5,6,7,8]]
arr2 = np.array(data2)
arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

In [9]:
arr2.ndim

2

In [10]:
arr2.shape

(2, 4)

In [11]:
arr2.dtype

dtype('int32')

In [12]:
arr1.dtype

dtype('float64')

In [13]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [14]:
np.zeros((3,6))

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

In [15]:
np.empty((2,3,4))

array([[[1.12606700e-311, 7.75683064e-322, 0.00000000e+000,
         0.00000000e+000],
        [9.34611826e-307, 1.15998412e-028, 2.44171989e+232,
         8.00801729e+159],
        [3.35733962e-090, 9.34288112e-067, 8.54289848e-072,
         8.98701831e-096]],

       [[1.24582593e-047, 4.07356341e+223, 8.94213159e+130,
         1.79453709e-052],
        [3.45374415e-086, 3.89113082e-033, 3.11773483e-033,
         1.95360733e-109],
        [2.86752281e+161, 2.78225500e+296, 9.80058441e+252,
         1.23971686e+224]]])

In [16]:
np.arange(15)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

| function | description |
| -------- | :----------- |
| array | converts input data to an ndarray by inferring or explicitly specifying a data type; copies the input data by default |
| asarray | convert input to ndarray, but do not copy if the input is already an ndarray |
| arange | like the built-in *range* but returns an ndarray instead of a list |
| ones, ones_like | Produce an array of all 1s with the given shape and data type; ones_like takes another array and produces a ones array with the same shape and data type|
| zeros, zeros_like | Like *ones* and *ones_like* but producing arrays of 0s instead |
| empty, empty_like | Create new arrays by allocating new memory, but do not populate any values, unlike *ones* and *zeros* |
| full, full_like | Produce an array of the given shape and data type with all values set to the indicated 'fill value'. *full_like* takes another array and produces a filled array of the same shape and dtype |
| eye, identity | creates a square NxN identity matrix (1s on the diagonal and 0s elsewhere ) | 
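
Several functions in the table are not exercised in the cells above. The following is a small sketch of a few of them; the array names are chosen only for illustration.

```python
import numpy as np

base = np.array([[1, 2, 3], [4, 5, 6]])

same = np.asarray(base)      # already an ndarray, so no copy is made
copied = np.array(base)      # np.array copies by default

ones = np.ones_like(base)    # same shape and dtype as base, filled with 1s
filled = np.full((2, 2), 7)  # 2x2 array where every value is 7
ident = np.eye(3)            # 3x3 identity matrix of floats

base[0, 0] = 99              # visible through `same`, not through `copied`
```

After the last line, `same[0, 0]` is 99 while `copied[0, 0]` is still 1, which is the practical difference between *asarray* and *array* here.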

### Data types for ndarrays
The *data type* or dtype is a special object containing the information (or *metadata*, data about the data) the ndarray needs to interpret a chunk of memory as a particular type of data. 
* The numerical data types are named in a consistent way: a type name, like int or float, followed by a number indicating the number of bits per element (for example, int32 or float64).
* You can explicitly convert or *cast* an array from one dtype to another using the *astype* method. By default this creates a new array (a copy of the data), even if the new dtype is the same as the old.

In [17]:
arr1 = np.array([1,2,3], dtype=np.float64)
arr1.dtype

dtype('float64')

In [18]:
arr2 = np.array([1,2,3], dtype=np.int32)
arr2.dtype

dtype('int32')

In [19]:
float_arr = arr2.astype(np.float64)
float_arr

array([1., 2., 3.])

In [20]:
float_arr.dtype

dtype('float64')
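
The cells above only cast integers to floats. As a further sketch with made-up sample values, casting in the other direction drops the fractional part, and an array of numeric strings can be cast to numbers.

```python
import numpy as np

floats = np.array([3.7, -1.2, -2.6, 0.5, 12.9])
ints = floats.astype(np.int32)           # fractional part is truncated toward zero

numeric_strings = np.array(['1.25', '-9.6', '42'])
parsed = numeric_strings.astype(np.float64)  # strings are parsed as numbers

same_dtype = floats.astype(np.float64)   # same dtype, but still a separate array
```

Here `ints` holds 3, -1, -2, 0, 12, and `same_dtype` does not share memory with `floats`.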

### Arithmetic with NumPy arrays
Arrays are important because they allow you to perform batch operations on data without writing any loops. 
* This is called *vectorization*.
* Any arithmetic operation between equal-size arrays applies the operation element-wise.
* Arithmetic operations with scalars propagate the scalar argument to each element in the array.
* Evaluating operations between differently sized arrays is called *broadcasting* and will be discussed more later.

In [21]:
arr = np.array([[1,2,3], [1,2,3]])
arr

array([[1, 2, 3],
       [1, 2, 3]])

In [22]:
arr * arr

array([[1, 4, 9],
       [1, 4, 9]])

In [23]:
arr - arr

array([[0, 0, 0],
       [0, 0, 0]])

In [24]:
1/arr

array([[1.        , 0.5       , 0.33333333],
       [1.        , 0.5       , 0.33333333]])

In [25]:
arr ** 0.5

array([[1.        , 1.41421356, 1.73205081],
       [1.        , 1.41421356, 1.73205081]])

In [26]:
arr2 = np.array([[3,1,0],[0,5,4]])

In [27]:
arr < arr2

array([[ True, False, False],
       [False,  True,  True]])
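
Broadcasting is only named in this section, so here is a minimal sketch of the simplest cases: combining a 2D array with a single row and with a single column. The shapes and values are chosen purely for illustration.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])             # shape (3,)
col = np.array([[100], [200]])           # shape (2, 1)

plus_row = arr + row   # row is added to every row of arr
plus_col = arr + col   # col is added to every column of arr
```

Both results have shape (2, 3); the smaller operand is stretched along the axis where it has length 1 or is missing.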

### Basic indexing and slicing
There are many ways to select subsets of your data or individual elements. 
* Slicing syntax is comparable to slicing built-in Python lists, although an important first distinction is that array slices are views on the original array. Assigning a value to a slice will modify the source array.
* With higher dimensional arrays, you have many more options. 



In [29]:
onedarr = np.arange(10)
onedarr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [30]:
onedarr[5]

5

In [31]:
onedarr[5:8]

array([5, 6, 7])

In [32]:
onedarr[5:8] = 12
onedarr

array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

In [33]:
arr2d = np.array([[1,2,3], [4,5,6], [7,8,9]])
arr2d[2]

array([7, 8, 9])

In [34]:
arr2d[0][2]

3
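
To make the view behaviour and the extra options for higher-dimensional arrays more concrete, here is a short sketch. The names `view`, `independent`, and `block` are illustrative only.

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]                 # a view: shares memory with arr
view[0] = 100                   # also changes arr[5]
independent = arr[5:8].copy()   # an explicit copy
independent[1] = -1             # arr is unaffected

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
element = arr2d[0, 2]           # equivalent to arr2d[0][2]
block = arr2d[:2, 1:]           # first two rows, last two columns
```

Calling `.copy()` on a slice is the usual way to get data that can be modified without touching the source array.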

### Boolean Indexing
Like arithmetic operations, comparisons with arrays are also vectorized.
* The boolean array must be the same length as the array axis it's indexing.
* Selecting data from an array by boolean indexing *always* creates a copy of the data, even if the returned array is unchanged.

In [35]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.random.randn(7,4)

In [36]:
names == 'Bob'

array([ True, False, False,  True, False, False, False])

In [37]:
data[names == 'Bob']

array([[ 0.33551721, -0.37397136,  1.44849202,  1.6623771 ],
       [-0.11551538, -0.2697117 ,  1.06146113, -0.36914901]])

In [39]:
data[data > 0] = 0

In [40]:
data

array([[ 0.        , -0.37397136,  0.        ,  0.        ],
       [-0.76180095, -0.52371648, -0.59598434,  0.        ],
       [ 0.        ,  0.        , -0.39194445,  0.        ],
       [-0.11551538, -0.2697117 ,  0.        , -0.36914901],
       [ 0.        , -0.67043723, -0.91106401,  0.        ],
       [ 0.        ,  0.        , -0.69453861,  0.        ],
       [-0.90410839,  0.        ,  0.        ,  0.        ]])
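
Boolean conditions can also be negated and combined. The sketch below uses a small integer array (`values`, a stand-in for the random data) so the selected rows are easy to follow.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
values = np.arange(28).reshape((7, 4))   # stand-in for the random data

not_bob = values[~(names == 'Bob')]      # ~ negates a boolean array
mask = (names == 'Bob') | (names == 'Will')   # use | and &, not `or` / `and`
bob_or_will = values[mask]

selected = values[names == 'Bob']        # a copy, not a view
selected[:] = -1                         # leaves `values` untouched
```

The Python keywords `and` and `or` do not work element-wise on boolean arrays, which is why the operators `&` and `|` are used instead.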

### Transposing Arrays and Swapping Axes
Transposing is a special form of reshaping that similarly returns a view on the underlying data without copying anything.
* Arrays have the *transpose* method and also the *T* attribute.

In [42]:
arr = np.arange(15).reshape((5,3))
arr

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [43]:
arr.T

array([[ 0,  3,  6,  9, 12],
       [ 1,  4,  7, 10, 13],
       [ 2,  5,  8, 11, 14]])

In [44]:
np.dot(arr.T, arr)

array([[270, 300, 330],
       [300, 335, 370],
       [330, 370, 410]])
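
For arrays with more than two dimensions, *transpose* accepts a tuple of axis numbers, and *swapaxes* exchanges a pair of axes. A minimal sketch, using an arbitrary 2x3x4 array:

```python
import numpy as np

arr = np.arange(24).reshape((2, 3, 4))

transposed = arr.transpose((1, 0, 2))    # reorder axes: shape becomes (3, 2, 4)
swapped = arr.swapaxes(1, 2)             # exchange the last two axes: shape (2, 4, 3)

shares = np.shares_memory(arr, swapped)  # swapaxes returns a view, not a copy
```

Element `arr[i, j, k]` ends up at `transposed[j, i, k]` and at `swapped[i, k, j]`.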

## 4.2 Universal Functions: Fast element-wise array functions
A universal function, or *ufunc*, is a function that performs element-wise operations on data in ndarrays.
* Functions that take a single array are referred to as *unary* ufuncs (sqrt, exp).
* Functions that take two arrays are referred to as *binary* ufuncs (add, maximum).

In [47]:
arr = np.arange(12)

In [48]:
np.sqrt(arr)

array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ,
       3.16227766, 3.31662479])

In [49]:
x = np.random.randn(8)
y = np.random.randn(8)
np.maximum(x,y)

array([ 0.47443294,  1.289722  , -0.05433521,  1.21137837, -0.24599187,
        1.34289987, -0.05087455,  1.33434785])
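
With random inputs the results are hard to check by eye, so here is the same idea with small fixed values (chosen only for this sketch), plus one ufunc that returns more than one array.

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0])
y = np.array([2.0, 3.0, 10.0])

roots = np.sqrt(x)          # unary: one input array
total = np.add(x, y)        # binary: same result as x + y
larger = np.maximum(x, y)   # binary: element-wise maximum

# Some ufuncs return several arrays; modf splits fractional and integer parts.
frac, whole = np.modf(np.array([1.5, -2.25]))
```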

## 4.3 Array-Oriented Programming with Arrays
The practice of replacing explicit loops with array expressions is commonly referred to as *vectorization*.

### Expressing Conditional Logic as Array Operations
The numpy.where function is a vectorized version of the ternary expression x if condition else y.
* The first argument is the condition, a boolean array that is evaluated element by element.
* Where the condition is True, the function takes the value from the second argument.
* Where the condition is False, the function takes the value from the third argument.
* The second and third arguments can be arrays or scalars. They must be passed together or omitted together; with only the condition, np.where returns the indices of the True elements instead.

In [50]:
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

In [51]:
result = np.where(cond, xarr, yarr)
result

array([1.1, 2.2, 1.3, 1.4, 2.5])

In [52]:
arr = np.random.randn(4,4)

In [53]:
arr

array([[-0.97379885, -2.46365129,  0.95051259, -1.6289119 ],
       [ 0.08309189,  1.15548822,  1.47540914, -0.17528575],
       [-0.26640127,  0.73801689,  1.40329064, -0.41607266],
       [-0.15336706, -0.27777545, -0.32554683,  0.76789008]])

In [54]:
np.where(arr > 0, 2, -2)

array([[-2, -2,  2, -2],
       [ 2,  2,  2, -2],
       [-2,  2,  2, -2],
       [-2, -2, -2,  2]])

In [55]:
np.where(arr > 0, 2, arr)

array([[-0.97379885, -2.46365129,  2.        , -1.6289119 ],
       [ 2.        ,  2.        ,  2.        , -0.17528575],
       [-0.26640127,  2.        ,  2.        , -0.41607266],
       [-0.15336706, -0.27777545, -0.32554683,  2.        ]])
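
When np.where is called with only a condition, it returns index arrays rather than selected values. A short sketch with hand-picked inputs:

```python
import numpy as np

cond = np.array([True, False, True, True, False])

# With only a condition, np.where returns a tuple of index arrays,
# one per dimension of the input.
(true_positions,) = np.where(cond)

arr = np.array([[1, -2], [-3, 4]])
rows, cols = np.where(arr > 0)        # row and column indices of positive entries
clipped = np.where(arr > 0, arr, 0)   # keep positives, replace the rest with 0
```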

### Mathematical and Statistical Methods
A set of mathematical functions that compute statistics about an entire array or about the data along an axis are accessible as methods of the array class.
* axis=1 represents 'compute across the columns', giving one result per row.
* axis=0 represents 'compute down the rows', giving one result per column.

In [60]:
arr = np.random.randn(5,4)
arr

array([[ 0.77104192,  0.03353886,  0.02466797, -0.2412257 ],
       [-0.10680479,  0.44385722,  0.23237657, -0.30099352],
       [ 1.80900152, -0.50680531, -0.02402521, -0.88452803],
       [-1.31480543, -1.11673003, -1.60290104,  0.84154146],
       [ 1.26874001,  0.38983986, -1.6867289 , -0.09090534]])

In [57]:
arr.mean()

-0.06086667144376621

In [58]:
arr.sum()

-1.2173334288753241

In [59]:
arr.mean(axis=1)

array([-0.33266198,  0.20519771, -0.30070557,  0.22151406, -0.09767757])
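
The axis argument is easier to follow on a small array with known values. The sketch below is illustrative and also shows *cumsum*, which keeps the shape of the input instead of reducing it.

```python
import numpy as np

arr = np.array([[0, 1, 2], [3, 4, 5]])   # shape (2, 3)

col_sums = arr.sum(axis=0)     # collapses the rows: one value per column
row_means = arr.mean(axis=1)   # collapses the columns: one value per row
running = arr.cumsum(axis=1)   # cumulative sum along each row, same shape as arr
```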

### Sorting
Like Python's built-in list type, NumPy arrays can be sorted in-place with the *sort* method.
* You can sort each one-dimensional section of values in a multidimensional array in-place along an axis by passing the axis number to *sort*.

In [61]:
arr = np.random.randn(6)
arr

array([ 0.62428355, -0.19718102, -1.15280505,  0.04549361, -0.77347836,
        1.0649557 ])

In [63]:
arr.sort()
arr

array([-1.15280505, -0.77347836, -0.19718102,  0.04549361,  0.62428355,
        1.0649557 ])

In [64]:
arr = np.random.randn(5,3)
arr

array([[ 1.58955117, -0.03527903, -0.10682715],
       [ 0.49895531, -0.86895158, -0.26852445],
       [ 0.7003458 , -1.57731834, -0.33392661],
       [-2.16472765,  0.41140265,  0.11540004],
       [-0.17193159, -0.87061169, -0.25553669]])

In [66]:
arr.sort(1)
arr

array([[-0.10682715, -0.03527903,  1.58955117],
       [-0.86895158, -0.26852445,  0.49895531],
       [-1.57731834, -0.33392661,  0.7003458 ],
       [-2.16472765,  0.11540004,  0.41140265],
       [-0.87061169, -0.25553669, -0.17193159]])
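
The *sort* method modifies the array it is called on. If the original order must be kept, the top-level function np.sort returns a sorted copy instead. The sketch below contrasts the two and also shows np.unique, one of the array algorithms listed at the start of the chapter; the sample values are made up.

```python
import numpy as np

arr = np.array([[3, 1, 2], [0, 5, 4]])

sorted_copy = np.sort(arr, axis=0)   # sorted copy along axis 0; arr is unchanged
arr.sort(axis=1)                     # sorts each row of arr in place

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will'])
unique_names = np.unique(names)      # sorted unique values
```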

## 4.4 File Input and Output with Arrays
NumPy is able to save and load data to and from disk either in text or binary format.
* np.save and np.load are the two workhorse functions for efficiently saving and loading array data on disk.
* Arrays are saved by default in an uncompressed raw binary format with file extension *.npy*.
* You can save multiple arrays in an uncompressed archive using np.savez.

In [67]:
arr = np.arange(10)
np.save('some_array', arr)

In [68]:
np.load('some_array.npy')

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
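
A minimal sketch of np.savez follows. The archive name and the keys `a` and `b` are hypothetical, and a temporary directory is used so the example leaves no files behind.

```python
import os
import tempfile

import numpy as np

a = np.arange(5)
b = np.ones((2, 2))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'array_archive.npz')   # hypothetical filename
    np.savez(path, a=a, b=b)   # keyword names become the keys in the archive
    with np.load(path) as archive:
        # Arrays are read from disk when accessed by key.
        loaded_a = archive['a']
        loaded_b = archive['b']
```

Loading an *.npz* file gives a dictionary-like object, so each array is retrieved by the name it was saved under.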