# Learn NumPy from the Beginning

## Install NumPy

Install NumPy using:  

    pip install numpy

or using:  

    conda install numpy

If you use Anaconda, NumPy is probably already included in your installation.  
At the core of NumPy is the class `ndarray`. This array type is memory-efficient and fast in calculation. **All items** inside an array must have **exactly the same type**, which NumPy calls a `dtype`.  
After installing, import the package to verify that the installation succeeded.

In [116]:
import numpy as np
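To make the single-`dtype` rule concrete, here is a small sketch (the variable names `mixed` and `small` are illustrative only) of how NumPy settles on one common type when the input mixes integers and floats:

```python
import numpy as np

# Mixed int/float input is upcast to one common dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)                   # float64

# The dtype can also be requested explicitly.
small = np.array([1, 2, 3], dtype=np.float32)
print(small.dtype, small.itemsize)   # float32 4
```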

## Create Array

A Python **list or tuple** can be used to create an array. (A set or dict is not a sequence, so `np.array()` does not convert it item by item.)
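As a quick illustration of the set case (a sketch with made-up variable names), NumPy wraps the whole set in a 0-dimensional object array rather than iterating over it, so converting to a sequence first is the usual workaround:

```python
import numpy as np

s = {3, 1, 2}
wrapped = np.array(s)                # the set is stored as one Python object
print(wrapped.ndim, wrapped.dtype)   # 0 object

# Convert to a sequence first to get a real 1-D numeric array.
converted = np.array(sorted(s))
print(converted.tolist())            # [1, 2, 3]
```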

In [117]:
data = [i for i in range(20)]
data

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

In [118]:
a = np.array(data)
a

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [119]:
type(a)

numpy.ndarray

After the array is created, the **array data is independent of the original list**.

In [120]:
id(a), id(data)

(1312934712592, 1312933542592)

In [121]:
data[0]=100
data

[100, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

In [122]:
a

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])
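The same independence applies when the source is already an ndarray, because `np.array()` copies by default. The sketch below (variable names are illustrative) contrasts it with `np.asarray()`, which reuses an existing array when no conversion is needed:

```python
import numpy as np

original = np.arange(5)
copied = np.array(original)      # np.array() makes a copy by default
same = np.asarray(original)      # asarray() returns the input when it is already a suitable ndarray

original[0] = 100
print(copied[0])                 # 0
print(same[0])                   # 100
print(same is original)          # True
```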

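The following attributes of `ndarray` give quick access to important information about an array. The default integer `dtype` depends on the platform and NumPy version, so `a.dtype` may show `int64` instead of `int32`.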

In [123]:
a.dtype, a.ndim, a.shape, a.size

(dtype('int32'), 1, (20,), 20)

The shape can be rearranged with `reshape()`.

In [124]:
a.reshape(4,5)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
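One detail worth knowing, shown here as a small sketch: `reshape()` leaves the shape of `a` itself unchanged, and for a contiguous array like this one the result shares memory with the original. The total number of items must also stay the same.

```python
import numpy as np

a = np.arange(20)
r = a.reshape(4, 5)

print(a.shape, r.shape)          # (20,) (4, 5)
print(np.shares_memory(a, r))    # True

r[0, 0] = -1                     # writing through the reshaped array...
print(a[0])                      # -1  ...also changes the original

try:
    a.reshape(3, 7)              # 21 slots cannot hold 20 items
except ValueError as err:
    print("reshape failed:", err)
```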

If the iterable's items are themselves iterables, `np.array()` creates a multidimensional ndarray. Note that all inner items must have the same length.

In [125]:
data = [(1.4, 2, 3), (4,5,6), (7,8,9)]
a = np.array(data)
a

array([[1.4, 2. , 3. ],
       [4. , 5. , 6. ],
       [7. , 8. , 9. ]])

In [126]:
a.dtype, a.ndim,  a.shape, a.size, 

(dtype('float64'), 2, (3, 3), 9)

In [127]:
type(data[0][0]), type(data[0][1])

(float, int)

In [128]:
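data = [(1,), (2,3), (3,4,5)]
# ragged input needs an explicit dtype=object (NumPy 1.24+ raises ValueError without it)
a = np.array(data, dtype=object)
a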

array([(1,), (2, 3), (3, 4, 5)], dtype=object)

Inner items of different lengths result in an array of Python objects. It is not multidimensional but one-dimensional, and its items have the `object` dtype.

In [129]:
a.dtype, a.ndim, 

(dtype('O'), 1)

If the inner items all have the same size, a multidimensional array is created.

The other less often used attributes are: itemsize and data. 

In [130]:
a.itemsize, a.data

(8, <memory at 0x00000131B0F03700>)

They give the size in bytes for each item, and the buffer containing the actual elements. 
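As a brief sketch of how `itemsize` relates to memory use (the arrays `x` and `y` are illustrative), the total buffer size is the number of items times the item size, which NumPy also exposes as `nbytes`:

```python
import numpy as np

x = np.zeros((3, 4), dtype=np.float64)
print(x.itemsize)                # 8
print(x.size * x.itemsize)       # 96
print(x.nbytes)                  # 96

# A smaller dtype halves the memory for the same shape.
y = x.astype(np.float32)
print(y.nbytes)                  # 48
```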

### NumPy Functions for Array Creation

Arrays can also be created with NumPy functions such as `zeros()`, `ones()`, `arange()`, `linspace()`, `fromfile()`, `load()`, and `loadtxt()`.

In [131]:
a = np.zeros((4,4))
a

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [132]:
a = np.ones((3,3))
a

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

In [133]:
a = np.arange(50)
a

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])

In [134]:
a = np.arange(36).reshape((3,-1))
a

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35]])

`reshape()` accepts -1 for at most one dimension, which tells NumPy to calculate the size of that dimension automatically.  
`arange()` accepts up to 3 arguments: the start, the end, and the step, much like the built-in function `range()`.
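A short sketch of both points, using illustrative variable names: `arange()` with all three arguments, followed by a `reshape()` that lets NumPy work out the row count.

```python
import numpy as np

# start=2, end=20 (excluded), step=3
stepped = np.arange(2, 20, 3)
print(stepped.tolist())          # [2, 5, 8, 11, 14, 17]

# -1 asks NumPy to infer this dimension: 6 items / 2 columns = 3 rows
pairs = stepped.reshape(-1, 2)
print(pairs.shape)               # (3, 2)
```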

In [135]:
a = np.linspace(0,10,21)
a

array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ,
        5.5,  6. ,  6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5, 10. ])

The `linspace()` function is similar to `arange()`. The difference is that the third argument specifies how many numbers to generate. Also, the end number is included by default, while `arange()` excludes it.
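For reference, `linspace()` can also report the spacing it used and can leave out the end point. The following is a minimal sketch with the `retstep` and `endpoint` parameters:

```python
import numpy as np

# retstep=True also returns the spacing between consecutive numbers.
values, step = np.linspace(0, 10, 21, retstep=True)
print(len(values), step)         # 21 0.5
print(values[-1])                # 10.0

# endpoint=False excludes the end number, similar to arange().
open_end = np.linspace(0, 1, 5, endpoint=False)
print(open_end)
```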

The NumPy `random()` function creates an array of random numbers, each in the half-open interval [0, 1).

In [136]:
a = np.random.random((3,4))
a

array([[0.89962966, 0.99510811, 0.79419219, 0.09866666],
       [0.65913773, 0.31464606, 0.13354417, 0.3394832 ],
       [0.07068826, 0.84449773, 0.79635988, 0.11005607]])
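If reproducible random numbers are needed, one option is NumPy's `Generator` interface with a fixed seed. This is a sketch of an alternative to the call above, not a replacement for it; the seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
sample = rng.random((3, 4))      # same [0, 1) range as np.random.random()

# A second generator with the same seed produces the same numbers.
rng_again = np.random.default_rng(seed=42)
repeat = rng_again.random((3, 4))

print(np.array_equal(sample, repeat))                # True
print(sample.min() >= 0.0 and sample.max() < 1.0)    # True
```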

## Using File to Create Array

To save array data and share it with others, use the `save()` and `load()` functions. `save()` writes the data to a .npy file that can be shared. The format is platform independent, and the data restored with `load()` is the same as the array that was passed to `save()`.  
`loadtxt()` and `genfromtxt()` read data from plain-text files.

In [137]:
a = np.random.random((3,4))
a

array([[0.8949576 , 0.19516976, 0.84556457, 0.98178804],
       [0.26837714, 0.40540776, 0.8205073 , 0.43764975],
       [0.21062304, 0.01247217, 0.7044077 , 0.1210847 ]])

In [138]:
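import os
import tempfile

# mkstemp() returns (file descriptor, path); close the descriptor because only the path is used
fd, file_name = tempfile.mkstemp()
os.close(fd)

# save() appends '.npy' when the file name does not already end with it
np.save(file_name,a)

b = np.load(file_name+'.npy')
b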

array([[0.8949576 , 0.19516976, 0.84556457, 0.98178804],
       [0.26837714, 0.40540776, 0.8205073 , 0.43764975],
       [0.21062304, 0.01247217, 0.7044077 , 0.1210847 ]])
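When several arrays belong together, `np.savez()` stores them in a single .npz archive. The sketch below uses a temporary directory; the file name `arrays.npz` and the keys `first` and `second` are arbitrary choices for this example.

```python
import os
import tempfile
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.linspace(0.0, 1.0, 5)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "arrays.npz")
    np.savez(path, first=a, second=b)        # keyword names become the keys
    with np.load(path) as archive:           # closes the file when done
        a_restored = archive["first"]
        b_restored = archive["second"]

print(np.array_equal(a, a_restored))         # True
print(np.array_equal(b, b_restored))         # True
```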

In [139]:
np.savetxt('save1.txt',b,delimiter=',')

In [140]:
c = np.loadtxt('sample.txt', dtype=np.float64, delimiter=',', ndmin=2, skiprows=1)
c

array([[0., 1., 2., 3., 4.],
       [5., 6., 7., 8., 9.],
       [1., 2., 3., 4., 5.]])

In [141]:
c.size, c.ndim, c.itemsize

(15, 2, 8)

`genfromtxt()` can read data even when some values are missing.

In [142]:
d = np.genfromtxt('sample.txt', dtype=np.float64, delimiter=',',
                  skip_header=0, names=True, autostrip=True, filling_values=np.nan)
d

array([(0., 1., 2., 3., 4.), (5., 6., 7., 8., 9.), (1., 2., 3., 4., 5.)],
      dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<f8'), ('D', '<f8'), ('E', '<f8')])

In [143]:
d.size, d.ndim, d.itemsize

(3, 1, 40)
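The handling of missing values can be tried without a file on disk. In this sketch the CSV text is made up for illustration and passed in through `io.StringIO`; the empty field in the second data row comes back as `nan`.

```python
import io
import numpy as np

# Hypothetical CSV content: a header row, then two data rows, one with a gap.
text = io.StringIO("A,B,C\n1,2,3\n4,,6\n")
table = np.genfromtxt(text, delimiter=",", names=True)

print(table.dtype.names)         # ('A', 'B', 'C')
print(table.shape)               # (2,)
print(np.isnan(table["B"][1]))   # True
```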

Each item of the array created by `genfromtxt()` has a composite dtype. NumPy calls this a structured array (closely related to the record array, `np.recarray`), and each item is a record.  
It is much like a table: each row is a record, and each field of the records forms a column.  
Just as with a table, NumPy lets us give each column a name.

In [144]:
d['A']

array([0., 5., 1.])

When names are present, a column can be accessed by its name.

In [145]:
d1=d.astype([('AA',np.float64),('BB',np.int64),('CC',np.float32),('DD',np.int32),('EE',np.float64)])
d1

array([(0., 1, 2., 3, 4.), (5., 6, 7., 8, 9.), (1., 2, 3., 4, 5.)],
      dtype=[('AA', '<f8'), ('BB', '<i8'), ('CC', '<f4'), ('DD', '<i4'), ('EE', '<f8')])

As you can see, `astype()` can change both the field names and the field dtypes of a recarray.

## Recarray (Record, Composite Dtype with Many Fields)

Continuing with the recarray, you can directly create a customized composite dtype as a combination of any number of existing NumPy dtypes.  
If you are familiar with the C programming language, think of a record as a struct in C.

In [146]:
# np.str_ replaces np.unicode_, which was removed in NumPy 2.0
dt = np.dtype([('myint',np.int64),('myfloat',np.float64),('mystring', np.str_, 10)])
xl = np.zeros((3,2),dtype=dt)
xl

array([[(0, 0., ''), (0, 0., '')],
       [(0, 0., ''), (0, 0., '')],
       [(0, 0., ''), (0, 0., '')]],
      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '<U10')])

In [147]:
xl[1] = (3,3,u'XYZ')
xl

array([[(0, 0., ''), (0, 0., '')],
       [(3, 3., 'XYZ'), (3, 3., 'XYZ')],
       [(0, 0., ''), (0, 0., '')]],
      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '<U10')])

In [148]:
xl[2, 1] = (4,4,u'ABC')
xl

array([[(0, 0., ''), (0, 0., '')],
       [(3, 3., 'XYZ'), (3, 3., 'XYZ')],
       [(0, 0., ''), (4, 4., 'ABC')]],
      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '<U10')])

In [149]:
xl.itemsize, xl.shape

(56, (3, 2))

In [150]:
xl[1,0]

(3, 3., 'XYZ')
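Fields can also be read and assigned as whole columns. The sketch below reuses the field names from above on a 1-dimensional array (the variable names `records` and `rec` are illustrative) and shows the `np.recarray` view, which allows attribute-style access:

```python
import numpy as np

dt = np.dtype([("myint", np.int64), ("myfloat", np.float64), ("mystring", "U10")])
records = np.zeros(3, dtype=dt)

records[0] = (1, 1.5, "abc")          # assign one whole record
records["myint"] = [10, 20, 30]       # assign one whole column

print(records["myint"].tolist())      # [10, 20, 30]
print(records[0]["mystring"])         # abc
print(dt.itemsize)                    # 56  (8 + 8 + 10 characters * 4 bytes)

# Viewing as np.recarray lets fields be read as attributes.
rec = records.view(np.recarray)
print(rec.myfloat.tolist())           # [1.5, 0.0, 0.0]
```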

## Indexing and Slicing

For each dimension, the index starts at 0 and ends at the size of that dimension minus 1.

In [151]:
a = np.random.random((3,4))
a

array([[0.71094424, 0.60734672, 0.6793375 , 0.42595081],
       [0.53500716, 0.77361689, 0.33243767, 0.94331506],
       [0.94985031, 0.1828273 , 0.04201963, 0.23769902]])

In [152]:
a[0]

array([0.71094424, 0.60734672, 0.6793375 , 0.42595081])

In [153]:
a[0][0]

0.7109442407064875

There are 2 ways to specify an index. The one used here is the same as the normal Python way of indexing a list inside a list. The other, covered further on, puts all indices in one pair of square brackets.  
Negative indices can also be used.

In [154]:
a[0][-1]

0.42595081380157085

Slicing, however, behaves differently.

In [155]:
a[0:2]

array([[0.71094424, 0.60734672, 0.6793375 , 0.42595081],
       [0.53500716, 0.77361689, 0.33243767, 0.94331506]])

In [156]:
a[0:2][0:2]

array([[0.71094424, 0.60734672, 0.6793375 , 0.42595081],
       [0.53500716, 0.77361689, 0.33243767, 0.94331506]])

As you can see, the second subscript operator still works on the first dimension, not the second. This is because the first slice did not remove the first dimension: the result is still a 2-dimensional array. Only when the first subscript is a single index, so that the dimension is dropped, does the second subscript operator work on the next dimension.  
This rule is the same as for Python lists.

In [157]:
ll = [[1,2,3],[4,5,6,'a'],[7,8,9.,10.]]
ll[:2][:2]

[[1, 2, 3], [4, 5, 6, 'a']]

In [158]:
ll[1][:2]

[4, 5]

NumPy supports another way of giving indices: all indices go in one pair of square brackets, separated by `,`.

In [159]:
a[0,0]

0.7109442407064875

In [160]:
a[0,-1]

0.42595081380157085

In [161]:
a[0:2, 0:2]

array([[0.71094424, 0.60734672],
       [0.53500716, 0.77361689]])

This is the correct way to slice along several dimensions.  
So to make sure indexing is always correct, always use one pair of square brackets for an ndarray and put all indices and slices inside, separated by `,`.

***
⚠️**NOTE**  
Use **ONE pair of square brackets**: put the indices for all dimensions inside the brackets, separated by commas.  
NumPy slicing works like standard Python slicing and **excludes the item at the stop index**.

***
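The difference between chained and combined slicing is easiest to see with fixed values instead of random ones. This is a small sketch; note also that a basic slice is a view, so writing to it changes the original array.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

chained = a[0:2][0:2]        # both slices act on the first dimension
combined = a[0:2, 0:2]       # one slice per dimension

print(chained.shape)         # (2, 4)
print(combined.shape)        # (2, 2)
print(combined.tolist())     # [[0, 1], [4, 5]]

combined[0, 0] = 99          # basic slices are views of the original data
print(a[0, 0])               # 99
```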

A list can also be used to select specific rows or columns.

In [162]:
a[[1,2]] # get some rows

array([[0.53500716, 0.77361689, 0.33243767, 0.94331506],
       [0.94985031, 0.1828273 , 0.04201963, 0.23769902]])

In [163]:
a[:,[1,3]] # get some columns

array([[0.60734672, 0.42595081],
       [0.77361689, 0.94331506],
       [0.1828273 , 0.23769902]])

In [164]:
a[[1,2],[1,3]] # get 2 items [1,1] and [2,3]

array([0.77361689, 0.23769902])

To get a submatrix instead, that is, all items in rows 1 and 2 and also in columns 1 and 3 (4 items forming a $2 \times 2$ matrix), the index arrays must be shaped according to the broadcast rules.  
The index for the first dimension needs the shape (2, 1), and the index for the second dimension needs the shape (2,). The 2 indices then broadcast to select a $2 \times 2$ block.

In [165]:
a[[[1],[2]], [1,3]]

array([[0.77361689, 0.94331506],
       [0.1828273 , 0.23769902]])
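NumPy also provides `np.ix_()`, which builds index arrays of the right shapes from plain row and column lists. The sketch below, on a small integer array chosen for readability, checks that it gives the same submatrix as the manually shaped index:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
rows = [1, 2]
cols = [1, 3]

manual = a[[[1], [2]], [1, 3]]       # row index shaped (2, 1), column index shaped (2,)
with_ix = a[np.ix_(rows, cols)]      # np.ix_ reshapes the lists for us

print(with_ix.tolist())                      # [[5, 7], [9, 11]]
print(np.array_equal(manual, with_ix))       # True
```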

## Using boolean array to make selection

In [166]:
a

array([[0.71094424, 0.60734672, 0.6793375 , 0.42595081],
       [0.53500716, 0.77361689, 0.33243767, 0.94331506],
       [0.94985031, 0.1828273 , 0.04201963, 0.23769902]])

In [167]:
a<0.5

array([[False, False, False,  True],
       [False, False,  True, False],
       [False,  True,  True,  True]])

In [168]:
a[a<0.5]

array([0.42595081, 0.33243767, 0.1828273 , 0.04201963, 0.23769902])

To keep the same shape, use:

In [169]:
(a<0.5).astype(np.int32)*a

array([[0.        , 0.        , 0.        , 0.42595081],
       [0.        , 0.        , 0.33243767, 0.        ],
       [0.        , 0.1828273 , 0.04201963, 0.23769902]])

`astype()` converts the boolean matrix to an integer matrix in which every False item becomes 0 and every True item becomes 1. The multiplication then filters matrix a.
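An alternative that states the intent more directly is `np.where()`, which picks from one of two sources depending on the condition. This is a sketch on a small made-up array, offered as another option and not as the only way:

```python
import numpy as np

a = np.array([[0.7, 0.4], [0.3, 0.9]])

masked_mul = (a < 0.5).astype(np.int32) * a
masked_where = np.where(a < 0.5, a, 0.0)     # keep a where True, else 0.0

print(masked_where.tolist())                         # [[0.0, 0.4], [0.3, 0.0]]
print(np.array_equal(masked_mul, masked_where))      # True
```

Unlike the multiplication, `np.where()` never computes `0 * inf`, which would yield `nan` if the array contained infinite values.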

## Array stack and split

NumPy has many functions to manipulate arrays (or matrices / tensors if multidimensional).

In [170]:
a = np.arange(15).reshape(3,5)
a

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [171]:
b = np.arange(30,45).reshape(-1,5)
b

array([[30, 31, 32, 33, 34],
       [35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44]])

In [172]:
c = np.arange(80,89).reshape(3,-1)
c

array([[80, 81, 82],
       [83, 84, 85],
       [86, 87, 88]])

In [173]:
e = np.hstack((a,c))
e

array([[ 0,  1,  2,  3,  4, 80, 81, 82],
       [ 5,  6,  7,  8,  9, 83, 84, 85],
       [10, 11, 12, 13, 14, 86, 87, 88]])

In [174]:
[f,g,h,i] = np.hsplit(e,4)
f

array([[ 0,  1],
       [ 5,  6],
       [10, 11]])

In [175]:
h

array([[ 4, 80],
       [ 9, 83],
       [14, 86]])

In [176]:
e = np.vstack((a, b))
e

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [30, 31, 32, 33, 34],
       [35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44]])

In [177]:
[f, g] = np.vsplit(e, 2)
f

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [178]:
g

array([[30, 31, 32, 33, 34],
       [35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44]])

There are more functions, such as `concatenate()`, `split()`, and `stack()`. See the NumPy documentation for details.
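As a brief sketch of those three functions (the variable names are illustrative): `concatenate()` joins along an existing axis, `stack()` joins along a new axis, and `split()` can cut at explicit positions instead of into equal parts.

```python
import numpy as np

a = np.arange(15).reshape(3, 5)
b = np.arange(30, 45).reshape(3, 5)

rows_joined = np.concatenate((a, b), axis=0)   # like vstack
cols_joined = np.concatenate((a, b), axis=1)   # like hstack
stacked = np.stack((a, b))                     # adds a new leading axis
print(rows_joined.shape, cols_joined.shape, stacked.shape)   # (6, 5) (3, 10) (2, 3, 5)

# Split the columns at positions 2 and 5: columns 0-1, 2-4, and 5-9.
left, middle, right = np.split(cols_joined, [2, 5], axis=1)
print(left.shape, middle.shape, right.shape)   # (3, 2) (3, 3) (3, 5)
```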

## Array arithmetic calculation

Arithmetic operations **broadcast** the 2 operands so that they have the **same shape**.  
Dimension broadcast rules:
- Compare the shapes of the 2 operands, starting from the rightmost dimension.
- For each pair of dimension sizes: they are compatible if they are the same or if either one is 1. If they differ and neither is 1, the shapes are not compatible and NumPy raises a `ValueError`.
- For compatible sizes, the larger one is used in the result. An operand whose size is 1 behaves as if its data were repeated to match; an operand that already has the full size is used as is.
- If one operand has more dimensions, the other is treated as if it had extra leading dimensions of size 1, which are then expanded in the same way.
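The rules above can be sketched as a few lines of plain Python. The helper `broadcast_shape` is hypothetical, written only for this illustration and not part of NumPy; its result is compared with the shape NumPy itself reports through `np.broadcast()`.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Hypothetical helper: apply the broadcast rules to two shape tuples."""
    result = []
    # Walk both shapes from the rightmost dimension; a missing dimension counts as 1.
    for i in range(1, max(len(shape_a), len(shape_b)) + 1):
        dim_a = shape_a[-i] if i <= len(shape_a) else 1
        dim_b = shape_b[-i] if i <= len(shape_b) else 1
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            raise ValueError(f"shapes {shape_a} and {shape_b} are not compatible")
        # A size of 1 stretches to match the other operand.
        result.append(dim_b if dim_a == 1 else dim_a)
    return tuple(reversed(result))

print(broadcast_shape((3, 3), (3,)))         # (3, 3)
print(broadcast_shape((3,), (3, 1)))         # (3, 3)
print(broadcast_shape((4, 1, 5), (3, 1)))    # (4, 3, 5)

# Compare with the shape NumPy computes for real arrays.
check = np.broadcast(np.empty((4, 1, 5)), np.empty((3, 1))).shape
print(check)                                 # (4, 3, 5)
```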

In [179]:
a = np.array([[1,2,3],[4,5,6],[7,8,9]])
a

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [180]:
c = a * 3
c

array([[ 3,  6,  9],
       [12, 15, 18],
       [21, 24, 27]])

Here the scalar value `3` behaves like a (3,3) array in which every item is 3.

In [181]:
c = a * np.array([[3,3,3],[3,3,3],[3,3,3]])
c

array([[ 3,  6,  9],
       [12, 15, 18],
       [21, 24, 27]])

In [182]:
b = np.array((2,3,4))
b.shape

(3,)

In [184]:
d = np.array([(5,),(6,),(7,)])
d.shape

(3, 1)

In [185]:
c = a*b
c

array([[ 2,  6, 12],
       [ 8, 15, 24],
       [14, 24, 36]])

In [186]:
c = a*d
c

array([[ 5, 10, 15],
       [24, 30, 36],
       [49, 56, 63]])

In [187]:
c = b*d
c

array([[10, 15, 20],
       [12, 18, 24],
       [14, 21, 28]])

## Math Functions (ufunc)

NumPy has many math functions that accept arrays as arguments. They perform the calculation on each item and return an array of the same shape containing the results.

In [188]:
e = np.sin(a)
e

array([[ 0.84147098,  0.90929743,  0.14112001],
       [-0.7568025 , -0.95892427, -0.2794155 ],
       [ 0.6569866 ,  0.98935825,  0.41211849]])

In [189]:
e = np.sqrt(a)
e

array([[1.        , 1.41421356, 1.73205081],
       [2.        , 2.23606798, 2.44948974],
       [2.64575131, 2.82842712, 3.        ]])

In [190]:
e = e**2
e

array([[1., 2., 3.],
       [4., 5., 6.],
       [7., 8., 9.]])

In [191]:
e = np.log(a)
e

array([[0.        , 0.69314718, 1.09861229],
       [1.38629436, 1.60943791, 1.79175947],
       [1.94591015, 2.07944154, 2.19722458]])

In [192]:
e = np.exp(e)
e

array([[1., 2., 3.],
       [4., 5., 6.],
       [7., 8., 9.]])

Functions that perform their calculation item by item are called universal functions (ufunc).  
The math functions in the NumPy package are ufuncs, but the functions in the Python math module are not.
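A quick way to see the difference, shown as a sketch: `np.sqrt` is an instance of `np.ufunc` and works on a whole array, while `math.sqrt` expects a single number and raises a `TypeError` for an array with several items.

```python
import math
import numpy as np

a = np.array([1.0, 4.0, 9.0])

print(isinstance(np.sqrt, np.ufunc))   # True
print(np.sqrt(a).tolist())             # [1.0, 2.0, 3.0]

# math.sqrt() cannot convert a multi-item array to one Python float.
try:
    math.sqrt(a)
    math_accepts_array = True
except TypeError:
    math_accepts_array = False
print(math_accepts_array)              # False
```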

## Aggregate Functions

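Aggregate functions are functions that produce a single result from a set of values. With the `axis` argument, the aggregation runs along that axis and returns one result per remaining position.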

In [193]:
e

array([[1., 2., 3.],
       [4., 5., 6.],
       [7., 8., 9.]])

In [194]:
e.mean()

5.0

In [195]:
e.mean(axis=0)

array([4., 5., 6.])

In [196]:
e.mean(axis=1)

array([2., 5., 8.])

Other aggregate functions include `max()`, `min()`, `std()`, and `sum()`.
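The sketch below applies a few of these to the same 3 by 3 values and also shows the `keepdims` parameter, which keeps the reduced axis with size 1 so the result broadcasts back against the original array:

```python
import numpy as np

e = np.arange(1, 10, dtype=np.float64).reshape(3, 3)

print(e.sum())                   # 45.0
print(e.sum(axis=0).tolist())    # [12.0, 15.0, 18.0]  (one value per column)
print(e.max(axis=1).tolist())    # [3.0, 6.0, 9.0]     (one value per row)

# keepdims=True keeps the shape (3, 1) instead of (3,).
row_means = e.mean(axis=1, keepdims=True)
print(row_means.shape)           # (3, 1)

# Subtract each row's mean from that row.
centered = e - row_means
print(centered.tolist())         # [[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]]
```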

## Special values: nan and inf

When a value is not a number, NumPy uses `np.nan` as its value.  
When a calculation goes to infinity, NumPy uses `np.inf` for positive infinity and `-np.inf` for negative infinity (the alias `np.NINF` was removed in NumPy 2.0).

In [197]:
np.nan*2

nan

In [198]:
np.inf-2

inf

In [199]:
-np.inf * 2  # np.NINF was removed in NumPy 2.0; -np.inf is equivalent

-inf

In [200]:
-np.inf +3

-inf
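Because `nan` never compares equal to anything, including itself, special values are detected with dedicated functions. A short sketch on a made-up array:

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0, np.inf])

print(np.nan == np.nan)               # False
print(np.isnan(values).tolist())      # [False, True, False, False]
print(np.isinf(values).tolist())      # [False, False, False, True]
print(np.isfinite(values).tolist())   # [True, False, True, False]

# nanmean() ignores nan items; a plain mean() would return nan here.
print(np.nanmean(values[:3]))         # 2.0
```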

## Matrix Calculation

There are many other tensor, matrix and vector calculations supported.

In [201]:
a = np.arange(1,13).reshape((3,4))
a

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [202]:
b = np.arange(1,13).reshape((4,3))
b

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [203]:
c = np.dot(a,b)
c

array([[ 70,  80,  90],
       [158, 184, 210],
       [246, 288, 330]])

The matrix product is the `np.dot()` function. For $a$ as an $M \times L$ matrix and $b$ as an $L \times N$ matrix, the matrix product $c$ is an $M \times N$ matrix whose items are: $c[i, j] = \sum_{k=0}^{L-1} a[i, k] \, b[k, j]$  
So for two matrices to have a matrix product, the size of the first matrix's second dimension must be the same as the size of the second matrix's first dimension.  
`np.dot(a,b)` can also be written as `a.dot(b)`. They are the same.  
Note that the operands of a matrix product cannot be swapped: `np.dot(a,b)` is not the same as `np.dot(b,a)`.

For the `tensordot()` function, `tensordot(a,b,1)` is the same as the matrix product `dot(a,b)`.
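These equivalences can be checked directly. The sketch below also uses the `@` operator, which for 2-dimensional arrays gives the same matrix product, and recomputes one item of the product as an explicit sum over the shared dimension:

```python
import numpy as np

a = np.arange(1, 13).reshape(3, 4)
b = np.arange(1, 13).reshape(4, 3)

c = np.dot(a, b)
print(np.array_equal(c, a @ b))                     # True
print(np.array_equal(c, np.tensordot(a, b, 1)))     # True

# Swapping the operands gives a different result, here even a different shape.
print(c.shape, np.dot(b, a).shape)                  # (3, 3) (4, 4)

# One item of the product as an explicit sum over k.
c_01 = sum(a[0, k] * b[k, 1] for k in range(a.shape[1]))
print(c_01)                                         # 80
```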