# Numpy Library

In [1]:
# import numpy as np
import numpy as np

## Numpy General Info
- Numpy arrays must contain all the same datatype
    - if a list contains floats and ints, all will be converted to floats if possible
    - use Pandas if wanting a dataframe with multiple data types
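
A minimal sketch of the single-datatype rule above. The array names `mixed` and `with_str` are made up for illustration.

```python
import numpy as np

# Mixed ints and floats: every element is upcast to float64
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)          # float64

# Mixing numbers and strings: every element becomes a string
with_str = np.array([1, 2.5, "a"])
print(with_str.dtype.kind)  # U (unicode string)
```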

## Importing Flat Files
- Consider using Pandas for this purpose
    - **especially when data are of different types**
- Flat files are row/col files like csv and txt files
- `array = np.loadtxt(filename, delimiter[, skiprows][, usecols][, dtype])`
    - usually store this in a var (numpy array)
    - default delimiter is whitespace, so usually need to specify
        - ',' for csv
        - '\t' for tab delimited
    - if the first row is a header, pass `skiprows=1`; otherwise the header text can't be converted to float and a `ValueError` is raised
    - to only read certain columns `usecols=[0, 2]` only 1st and 3rd col
    - can set `dtype` if importing `str` for example
        - no quotes around the dtype `dtype=str`
- `array = np.genfromtxt(filename, delimiter[, names][, usecols][, dtype])`
    - creates a "structured array" that can include multiple data types
        - this is a 1d array, each element is a row from the file
    - can use with files that include different data types
        - need to pass `dtype=None` for it to figure this out
    - `names=True` means that there is a header
    - row access
        - `data = np.genfromtxt(...)`
        - `data[i]` access the "ith" row
    - column access
        - `data['col_name']` access col_name column
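- `array = np.recfromcsv(filename[, delimiter][, names][, dtype])`
    - behaves like `np.genfromtxt()` with defaults `delimiter=','`, `names=True`, and `dtype=None`
    - removed from the main namespace in NumPy 2.0; on newer versions call `np.genfromtxt()` with those arguments instead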
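
A hedged sketch of `np.loadtxt` and `np.genfromtxt` on the same data. The CSV text and its column names (`id`, `height`, `name`) are invented for the example, and `io.StringIO` stands in for a real file on disk.

```python
import io
import numpy as np

# Hypothetical CSV content; io.StringIO stands in for a real file
csv_text = "id,height,name\n1,1.5,ann\n2,1.7,bob\n3,1.6,cat\n"

# loadtxt: skip the header row and read only the two numeric columns
nums = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1, usecols=[0, 1])
print(nums.shape)       # (3, 2)

# genfromtxt: the header becomes field names, dtype is inferred per column
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True,
                     dtype=None, encoding="utf-8")
print(data[0])          # first row of the file
print(data["height"])   # whole column, accessed by name
```

The structured array returned by `np.genfromtxt` is 1d, so `data.shape` is `(3,)` here even though the file has three columns.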

## Creating Numpy Arrays
- `np.array(list)`
    - can supply the list as a variable or as a list of values in `[]`
    - can specify type of array `dtype='type'` for 'int', 'float32', etc.

#### Numpy Data Types
- set using `dtype='datatype'` or `dtype=np.datatype` in args
    - `bool_` True or False stored as a byte
    - `int_` default integer type, normally either int64 or int32
    - `intc` identical to C int (int32 or int64)
    - `intp` int used for indexing
    - `int8` byte (-128 to 127)
    - `int16` integer (-32768 to 32767)
    - `int32` integer (-2147483648 to 2147483647)
    - `int64` integer (-9223372036854775808 to 9223372036854775807)
    - `uint8` unsigned int (0 to 255)
    - `uint16` unsigned int (0 to 65535)
    - `uint32` unsigned int (0 to 4294967295)
    - `uint64` unsigned int (0 to 18446744073709551615)
    - `float_` shorthand for float64 (removed in NumPy 2.0, use `float64` there)
    - `float16` half-precision float: sign bit, 5 bits exponent, 10 bits mantissa
    - `float32` single-precision float: sign bit, 8 bits exponent, 23 bits mantissa
    - `float64` double-precision float: sign bit, 11 bits exponent, 52 bits mantissa
    - `complex_` short for complex128 (removed in NumPy 2.0, use `complex128` there)
    - `complex64` complex number, with 32-bit floats
    - `complex128` complex number, with 64-bit floats
- more advanced type specification possible at [numpy documentation](http://numpy.org)
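
The ranges listed above can be checked with `np.iinfo`, as in this small sketch. It also shows that the string and `np.` forms of `dtype` are interchangeable. The arrays `a` and `b` are illustrative only.

```python
import numpy as np

# Integer limits match the ranges in the list above
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)   # -128 127
print(np.iinfo(np.uint16).max)                         # 65535

# dtype can be given as a string or as a NumPy type object
a = np.array([1, 2, 3], dtype="int16")
b = np.array([1, 2, 3], dtype=np.int16)
print(a.dtype == b.dtype)   # True
print(a.itemsize)           # 2 (bytes per element)
```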

#### Create Arrays from Lists

In [2]:
# create an array from a list using np.array()
np.array([1, 2, 3, 4, 5, 6], dtype='float32')

array([1., 2., 3., 4., 5., 6.], dtype=float32)

In [3]:
# create an array from a list stored in a variable using np.asarray()
mylist = [2, 4, 6, 8, 10]
np.asarray(mylist, dtype='int')

array([ 2,  4,  6,  8, 10])

#### Explicit Multidimensional Array

In [4]:
# creates three rows [2, 4, 6] as starting i values, with ranges +3 over i
# row 1 range(2, 5), row 2 range(4, 7), row 3 range(6, 9)
np.array([range(i, i + 3) for i in [2, 4, 6]])

array([[2, 3, 4],
       [4, 5, 6],
       [6, 7, 8]])

#### Creating Arrays from Scratch
- This section contains info on generating random arrays too
- Can specify a *seed* when using the `np.random` package
    - `np.random.seed(num)` where "num" is an integer you supply
    - after seeding, the same sequence of `np.random` calls with the same parameters produces the same random numbers
    - this could be useful for reproducibility in some cases
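
A short sketch of the seeding behavior described above. The seed value 42 is an arbitrary choice.

```python
import numpy as np

np.random.seed(42)                     # 42 is arbitrary
first = np.random.randint(0, 10, 5)

np.random.seed(42)                     # reset to the same seed
second = np.random.randint(0, 10, 5)

# Same seed and same call, so the same values come back
print(np.array_equal(first, second))   # True
```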

In [5]:
# length-10 int array filled with zeros using np.zeros(num, dtype)
# instead of 10, use (rows, columns) for multi-dimensional array
np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [6]:
# 3x5 arrays of floats filled with ones using np.ones
np.ones((3, 5), dtype=float)

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [7]:
# create an array where you specify the value to fill using np.full
np.full((4, 6), 3.14, dtype=float)

array([[3.14, 3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14, 3.14]])

In [8]:
# create an array specifying a start, stop, and step using np.arange()
# similar to the range() function
# args are (start_val, stop_not_included, step)
np.arange(0, 20, 2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [9]:
# create an array from start to stop with evenly spaced steps between using np.linspace()
# args are (first_val, last_val, num_of_vals)
np.linspace(0, 1, 5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [10]:
# create array of uniformly dist random values between 0-1 using np.random.random()
# can set the "seed" using np.random.seed(value)
np.random.random((4, 3))

array([[0.35502258, 0.31347343, 0.72983525],
       [0.77792465, 0.09908551, 0.62816528],
       [0.94468415, 0.37639235, 0.51269002],
       [0.90799355, 0.89604291, 0.12381585]])

In [11]:
# create array of normally dist random values mean of 0 and stdev of 1
# can set the "seed" using np.random.seed(value)
# args are (mu, sigma, shape_or_number_of_vals)
# mu = mean, sigma = stdev
np.random.normal(0, 1, (3, 3))

array([[ 0.77641072, -2.27744594,  1.55008662],
       [ 1.42989637, -0.21757475, -0.46897291],
       [-1.15795166,  0.8040532 , -1.7214144 ]])

In [12]:
# create array of random ints using np.random.randint()
# can set the "seed" using np.random.seed(value)
# args are (start, stop_not_included, shape_or_num_of_vals)
np.random.randint(0, 10, (5, 5))

array([[2, 1, 3, 6, 3],
       [8, 8, 3, 5, 9],
       [8, 5, 5, 8, 8],
       [9, 7, 1, 3, 4],
       [4, 9, 1, 9, 3]])

In [13]:
# create identity matrix (1's on diagonal, all else 0's) using np.eye()
# square when given a single size argument
# dtype is float by default
np.eye(4)

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

## Array Attributes
- This attributes section will use the following x1, x2, and x3 arrays as examples:
    - these are 1d, 2d, and 3d arrays respectively
- Attributes
    - access using `object.attribute`
    - `ndim` number of dimensions
    - `shape` size of each dimension
    - `size` total size (number of items)
    - `dtype` data type of array
    - `itemsize` size in bytes of each array element
    - `nbytes` total size in bytes of the array
        - `nbytes` should equal `itemsize` times `size`

In [14]:
# seed for reproducibility with np.random
np.random.seed(0)

x1 = np.random.randint(10, size=6) # 1d array len=6
x2 = np.random.randint(10, size=(3, 4)) # 2d array 3x4
x3 = np.random.randint(10, size=(3, 4, 5)) # 3d array 3x4x5

In [15]:
# print number of dimensions
print("x1 ndim: ", x1.ndim)
print("x2 ndim: ", x2.ndim)
print("x3 ndim: ", x3.ndim)

x1 ndim:  1
x2 ndim:  2
x3 ndim:  3


In [16]:
# print total size using nbytes
print("x1 nbytes: ", x1.nbytes)
print("x2 nbytes: ", x2.nbytes)
print("x3 nbytes: ", x3.nbytes)

x1 nbytes:  24
x2 nbytes:  48
x3 nbytes:  240
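
The remaining attributes can be checked the same way. This sketch uses a made-up array `arr` with an explicit 8-byte integer type so the sizes do not depend on the platform. The 24 bytes reported for the six-element `x1` above suggest a 4-byte default integer on the machine that produced that output.

```python
import numpy as np

arr = np.zeros((3, 4), dtype=np.int64)   # hypothetical example array

print(arr.ndim)      # 2
print(arr.shape)     # (3, 4)
print(arr.size)      # 12
print(arr.dtype)     # int64
print(arr.itemsize)  # 8
print(arr.nbytes)    # 96, which is itemsize * size
```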


## Accessing Array Elements
#### Access a single value (uses 0 indexing)
- `array[index]` for 1d
- `array[row_index, column_index]` for 2d
- `array[box_index, row_index, column_index]` for 3d

In [17]:
# show x3 values
x3

array([[[8, 1, 5, 9, 8],
        [9, 4, 3, 0, 3],
        [5, 0, 2, 3, 8],
        [1, 3, 3, 3, 7]],

       [[0, 1, 9, 9, 0],
        [4, 7, 3, 2, 7],
        [2, 0, 0, 4, 5],
        [5, 6, 8, 4, 1]],

       [[4, 9, 8, 1, 1],
        [7, 9, 9, 3, 6],
        [7, 2, 0, 3, 5],
        [9, 4, 4, 6, 4]]])

In [18]:
# select the middle box (index of 1), then first row first column
x3[1, 0, 0]

0

#### Array Slicing
- a slice is a *view* of the original array, not a copy, even when it is assigned to a variable
    - **modifying the slice through that variable will in turn modify the original array too!**
    - to create a copy of an array that won't modify the original use `copy()`
        - `subarray_copy = array[rows, cols].copy()` 
- 1d arrays
    - `array[start:stop:step]`
    - `array[:5]` first 5 elements (index 0 through 4)
    - `array[5:]` elements starting with index 5 until the end
    - `array[4:7]` elements from index 4 through index 6
    - `array[::2]` every other element
    - `array[1::2]` every other element starting at index 1
    - `array[::-1]` all elements reversed
    - `array[5::-2]` every other element in reverse starting at index 5 (indexes 5, 3, 1)

In [19]:
x1

array([5, 0, 3, 3, 7, 9])

In [20]:
x1[1:4]

array([0, 3, 3])

In [21]:
x1[::-1]

array([9, 7, 3, 3, 0, 5])

- multidimensional arrays
    - creating subarrays
        - `array[:rows, :cols]` accesses the first `rows` rows and the first `cols` columns
            - `array[:2, :3]` two rows, three columns
        - `array[:rows, ::step]` the value after `::` acts as a step
            - `array[:3, ::2]` three rows, every other column
        - `array[::-1, ::-1]` reverses the rows and reverses the columns
    - accessing rows and columns
        - `array[:, col]` accesses all rows in column at specified index
        - `array[row, :]` accesses entire row at specified index
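
The slicing patterns above, sketched on a small made-up array `demo` so the results are easy to verify by eye.

```python
import numpy as np

demo = np.arange(12).reshape(3, 4)   # hypothetical 3x4 example array
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

print(demo[:2, :3])      # first two rows, first three columns
print(demo[:3, ::2])     # three rows, every other column
print(demo[::-1, ::-1])  # rows and columns both reversed
print(demo[:, 0])        # first column -> [0 4 8]
print(demo[0, :])        # first row    -> [0 1 2 3]
```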

In [22]:
# view x2
x2

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 7]])

In [23]:
# assign 3rd row to a subarray variable
# modifying this subarray will change the original
x2sub = x2[2, :]
print(x2sub)

[1 6 7 7]


In [24]:
# modifying the subarray modifies the original array
x2sub[0] = 99
print(x2)

[[ 3  5  2  4]
 [ 7  6  8  8]
 [99  6  7  7]]


In [25]:
# copies can be modified without altering the original
x2subcopy = x2[2, :].copy()
print(x2subcopy)
x2subcopy[0] = 1
print(x2subcopy)
print()
print(x2)

[99  6  7  7]
[1 6 7 7]

[[ 3  5  2  4]
 [ 7  6  8  8]
 [99  6  7  7]]


## Reshaping Arrays
- Most flexible is the `reshape()` method
    - the size of your original array must match the new array
- Another way is to convert a 1d array to a 2d array (a row or column matrix) by indexing with `np.newaxis`
    - you can use this to change a row to a column and vice versa

In [26]:
# reshaping an array requires new and old to be the correct size
nums = np.arange(1, 10, 1) # create 1d array 1-9 step of 1
nums2 = nums.reshape((3, 3))
print(nums2)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [27]:
# reshape using np.newaxis to convert a row to a column
# the original array is not modified
print(nums)
print(nums[:, np.newaxis])
print(nums)

[1 2 3 4 5 6 7 8 9]
[[1]
 [2]
 [3]
 [4]
 [5]
 [6]
 [7]
 [8]
 [9]]
[1 2 3 4 5 6 7 8 9]


## Array Concatenating and Splitting
#### Concatenating
- `np.concatenate([array_list][, axis])`, `np.vstack([array_list])`, `np.hstack([array_list])`
- `np.concatenate([array_list][, axis])` will stitch multiple arrays together into one
    - `axis=0` by default and is 0 indexed
        - can supply this to change how multi-d arrays are appended
    - 1d arrays are appended, resulting in a longer 1d array
    - 2d arrays are appended the same by default, so 3x3 + 2x3 will yield 5x3
        - number of columns stays the same, additional rows are added by default
        - change to `axis=1` to add columns rather than rows
    - 3d arrays or arrays of mixed dimensions, use np.vstack (vertical) or np.hstack (horizontal)
        - your number of cols or rows must match for vstack or hstack respectively
- `np.dstack([array_list])` will stack along the third axis

In [28]:
# np.concatenate() example with default axis
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
np.concatenate([grid, grid])

array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])

In [29]:
# np.concatenate() with axis=1
np.concatenate([grid, grid], axis=1)

array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])

In [30]:
# np.vstack() example
# number of columns must match
grid2 = np.array([[ 7,  8,  9],
                  [10, 11, 12],
                  [13, 14, 15]])
np.vstack([grid, grid2])

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12],
       [13, 14, 15]])

In [31]:
# np.hstack() example
# number of rows must match
grid3 = np.array([[4, 3],
                  [7, 9]])
np.hstack([grid, grid3])

array([[1, 2, 3, 4, 3],
       [4, 5, 6, 7, 9]])

In [32]:
# np.dstack() example
# stacks along the third axis (axis=2), the innermost dimension
# two (2, 2, 3) arrays give a (2, 2, 6) result: each row [1, 2, 3] is extended by the matching row of the second array
grid3d = np.array([[[1, 2, 3],
                    [4, 5, 6]],
                  
                   [[1, 2, 3],
                    [4, 5, 6]]])
np.dstack([grid3d, grid3d])

array([[[1, 2, 3, 1, 2, 3],
        [4, 5, 6, 4, 5, 6]],

       [[1, 2, 3, 1, 2, 3],
        [4, 5, 6, 4, 5, 6]]])

#### Splitting
- `np.split(array, index(es))`, `np.hsplit(array, index(es))`, `np.vsplit(array, index(es))`
    - supply an index or a list of indexes
    - the number of arrays returned is (number of indexes supplied + 1)
    - each index supplied is the beginning of a new array
- `np.split()` would work well with 1d arrays
- `np.hsplit()` for horizontal splits and `np.vsplit()` for vertical splits
- `np.dsplit()` can split along the 3rd axis

In [33]:
# np.split example
# 2nd array starts at index 3
# 3rd array starts at index 5
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)

[1 2 3] [4 5] [6 7 8 9]


In [34]:
# np.vsplit example
# index provides the row index for splitting
# 2nd array starts at the row of the first index provided and so on
grid = np.arange(16).reshape((4, 4))
upper, lower = np.vsplit(grid, [2])
print(grid)
print(upper)
print(lower)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
[[0 1 2 3]
 [4 5 6 7]]
[[ 8  9 10 11]
 [12 13 14 15]]


In [35]:
# np.hsplit example
# index provides the col index for splitting
# 2nd array starts at col of first index provided and so on
left, right = np.hsplit(grid, [2])
print(grid)
print(left)
print(right)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
[[ 0  1]
 [ 4  5]
 [ 8  9]
 [12 13]]
[[ 2  3]
 [ 6  7]
 [10 11]
 [14 15]]


## Universal Functions (UFuncs)
- Python looping is inherently slow because Python is flexible and needs to type check every iteration
    - Use vectorized operations (built-in ufuncs are such) whenever possible to improve performance
    - The key is to **avoid for loops** with operations on large arrays
        - use ufuncs or perform operations directly with the arrays
        - this allows the code to be crunched in optimized C code without checking data types   
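
A minimal sketch contrasting an explicit loop with the equivalent vectorized expression. The reciprocal operation and the names `values`, `loop_result`, and `vec_result` are chosen only for illustration.

```python
import numpy as np

values = np.arange(1, 6)

# Loop version: Python handles each element one at a time
loop_result = np.empty(len(values))
for i in range(len(values)):
    loop_result[i] = 1.0 / values[i]

# Vectorized version: one expression, the loop runs in compiled code
vec_result = 1.0 / values

print(np.allclose(loop_result, vec_result))  # True
```

Both versions give the same values. The speed difference only becomes noticeable on large arrays.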

#### Arithmetic
- Many operations have numpy equivalents to use
    - if a standard operation is listed below, it is vectorized and good to use
        - even if it has a numpy equivalent
- Use the `out=` argument of a ufunc (np. version) to specify where the output should go
    - this avoids the creation of temporary arrays and saves resources
    - can specify just an object `out=y`
    - can specify a slice of an array, e.g. `out=y[::2]` writes to every other element of `y`
        - this can be used to write results into a single row or column of an existing array
- Use standard Python arithmetic operators, they do have equivalent ufunc and behave the same
    - perform these operations on scalars, arrays, or combinations of the two
    - `+` equiv `np.add`
    - `-` equiv `np.subtract`
    - `-` (unary minus, as in `-x`) equiv `np.negative`
    - `*` equiv `np.multiply`
    - `/` equiv `np.divide`
    - `//` equiv `np.floor_divide` divide and round down
    - `**` equiv `np.power`
    - `%` equiv `np.mod`
- Absolute value
    - `abs(object)` equiv to `np.absolute()` or `np.abs()`
        - returns the magnitude in the case of complex data
- Trig functions
    - work on values as well as arrays
    - results that should be exactly 0 can come out as tiny nonzero floats (e.g. `np.sin(np.pi)` is about 1.2e-16), so round if necessary
    - `np.pi` returns the value of pi
    - `np.sin(angle)`
    - `np.cos(angle)`
    - `np.tan(angle)`
    - `np.arcsin(opp/hyp)`
    - `np.arccos(adj/hyp)`
    - `np.arctan(opp/adj)`
- Exponents and logs
    - work on values as well as arrays
    - `np.exp(num)` raises "e" to the "num" power
    - `np.exp2(num)` raises "2" to the "num" power
    - `np.power(base, num)` raises "base" to the "num" power
    - `np.log(num)` is the "ln(num)" (base e)
    - `np.log2(num)` is the log base 2 of "num"
    - `np.log10(num)` is the log base 10 of "num"
    - when the input is a very small number, use these for higher precision
        - `np.expm1(num)` "exp(num) -1"
        - `np.log1p(num)` "log(1 + num)"
- For more ufuncs
    - visit [numpy documentation](https://docs.scipy.org/doc/)
    - check out scipy.special `from scipy import special`
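
A hedged sketch of three points from the list above: the `out=` argument, the operator/ufunc equivalence, and the precision gain from `np.log1p`. The arrays `x` and `y` and the value `tiny` are made up for the example.

```python
import numpy as np

x = np.arange(5)
y = np.zeros(10)

# Write 2**x into every other element of y, with no temporary array
np.power(2, x, out=y[::2])
print(y)   # [ 1.  0.  2.  0.  4.  0.  8.  0. 16.  0.]

# An operator and its ufunc equivalent give the same result
print(np.array_equal(x + 2, np.add(x, 2)))   # True

# For tiny inputs, log1p keeps precision that log(1 + x) loses
tiny = 1e-20
print(np.log(1 + tiny))   # 0.0
print(np.log1p(tiny))     # 1e-20
```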

#### Aggregations
- `.reduce(object)` method
    - will run the function specified until only one result remains
- `.accumulate(object)` method
    - will store the running totals in an array and return that array
- Examples below are simplified
    - in fact there are better ways to do these examples using dedicated np functions
    - see the "summary stats" next section

In [36]:
# .reduce() examples
x = np.arange(1, 6)
print(np.add.reduce(x)) # adds all the values of x together
print(np.multiply.reduce(x)) # multiplies all the values of x together

15
120


In [37]:
# .accumulate() examples
print(np.add.accumulate(x)) # running total of sum of values in x
print(np.multiply.accumulate(x)) # running total of x value products

[ 1  3  6 10 15]
[  1   2   6  24 120]


#### Aggregation Function for Summary Stats
- As functions (with NaN-safe versions)
    - `np.sum(array)` `np.nansum()` sums all of the values
    - `np.min(array)` `np.nanmin()` returns the min value
    - `np.max(array)` `np.nanmax()` returns the max value
- As methods (the NaN-safe versions exist only as functions, there is no `array.nansum()`)
    - `array.sum()`
    - `array.min()`
    - `array.max()`
- Default action to aggregate over the entire array (all values)
    - **change the `axis` arg to modify this**
    - set `axis=0` to return an aggregate value for each column
    - set `axis=1` to return an aggregate value for each row
    - when used on a 2d array, the result is a 1d array
- Additional functions (which may have associated methods)
    - also listed is the NaN-safe version that ignores missing values
    - `np.prod(array)` `np.nanprod()` compute product of elements
    - `np.mean(array)` `np.nanmean()` mean
    - `np.std(array)` `np.nanstd()` standard deviation
    - `np.var(array)` `np.nanvar()` variance (std**2)
    - `np.argmin(array)` `np.nanargmin()` index of min value
    - `np.argmax(array)` `np.nanargmax()` index of max value
    - `np.median(array)` `np.nanmedian()` median value
    - `np.percentile(array, q)` `np.nanpercentile()` rank-based stats of elements, where `q` is the percentile (0 to 100)
    - `np.any(array)` evaluate whether any elements are True
        - returns single Boolean unless specified axis
    - `np.all(array)` evaluate whether all elements are True
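
A small sketch of the `axis` argument and the NaN-safe functions described above. The array `m` and its single missing value are hypothetical.

```python
import numpy as np

m = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0]])   # hypothetical data with one missing value

print(np.sum(m))                 # nan, a single NaN propagates through the sum
print(np.nansum(m))              # 16.0
print(np.nansum(m, axis=0))      # one value per column -> [5. 2. 9.]
print(np.nansum(m, axis=1))      # one value per row    -> [ 6. 10.]
print(np.nanpercentile(m, 50))   # 3.0, the median of the non-NaN values
```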