# The Data Scientist's Toolbox

<a id='menu'></a>

# 1- NumPy     
* short for Numerical Python, one of the most important and widely used Python packages for scientific and mathematical applications.   
* it comes with a plethora of mathematical methods and operations.  
* enables the creation of multidimensional arrays (ndarrays).   
* **ndarray** (n-dimensional array) is a generic container for homogeneous data, so all the elements must be of the same type.   
* it is indispensable in practice because it works in conjunction with other packages such as `pandas` and `scikit-learn`.    
* it has tools for reading and writing array data and tab-delimited files.

[1.1 Basic Numpy methods](#1.1)  
<a href='#1.1'>
[1.2 Casting ndarrays](#1.2)     
<a href='#1.2'>
[1.3 Arithmetic Operators](#1.3)    
<a href='#1.3'>
[1.4 Indexing, slicing and sorting  ndarrays](#1.4)     
<a href='#1.4'>
[1.5 Matrix operations](#1.5)       
<a href='#1.5'>
[1.6 Additional numpy methods](#1.6)     
<a href='#1.6'>
[1.7 Iterating over arrays](#1.7)      
<a href='#1.7'>
[1.8 special values:  `numpy.nan` and `numpy.inf`](#1.8)     
<a href='#1.8'>
[1.9 numpy I/O](#1.9)     
<a href='#1.9'>
[2.0 In-class exercise](#2.0)     
<a href='#2.0'>

<a id='1.1'></a>
&nbsp;

&nbsp;


## 1.1 Basic NumPy methods and operations

* NumPy is very precise about identifying datatypes   
* NumPy is more precise than Python: while Python has just one type of integer and one type of floating-point number, NumPy has several, distinguished by the number of bits each value occupies in memory.
* the pairs below are `type` : `type-abbreviation`. Both can be used to assign dtype for an array. 

    - integers:     
    `numpy.int8` : `i1`    
    `numpy.int16` :`i2`     
    **`numpy.int32` : `i4`**     
    `numpy.int64` :`i8`     
    
    - floats:      
    `numpy.float16` : `f2`    
    `numpy.float32` : `f4` or `f`    
    **`numpy.float64`** : **`f8`** or **`d`**     
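    `numpy.float128` : `f16` or `g` (extended precision, only available on some platforms)          
    
    - complex:     
    `numpy.complex64` : `c8`    
    `numpy.complex128` : `c16`    
    `numpy.complex256` : `c32` (only available on platforms that provide `numpy.float128`)      
    
    - string: 
    `S` (bytes) | `U1` | `U2` | `U4` (the integer refers to the maximum number of characters allowed)   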
    
    - bool: **`?`**
           
    
* when an array is created, NumPy infers the dtype from the data, but the dtype can be set explicitly or changed later.  
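
To make the dtype list above concrete, here is a minimal sketch of dtype inference and of the two equivalent ways to name a type. The variable names are illustrative only, and the default integer size depends on the platform.

```python
import numpy as np

# NumPy infers the dtype from the values unless one is given explicitly.
ints = np.array([1, 2, 3])                  # platform default integer
floats = np.array([1, 2.5, 3])              # a single float promotes everything to float64
small = np.array([1, 2, 3], dtype='i1')     # abbreviation form
same = np.array([1, 2, 3], dtype=np.int8)   # full type name, equivalent to 'i1'

print(ints.dtype, floats.dtype, small.dtype, same.dtype)
print(small.itemsize, floats.itemsize)      # bytes per element
```

`itemsize` reports how many bytes each element takes, which is the practical consequence of choosing a smaller type.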

In [3]:
import numpy as np
import random 

* NumPy n-dimensional arrays can generally be created using two methods:  
    `numpy.array(object, dtype=None, order = {'C','F'})` 
    `numpy.ndarray(shape = (n,m), dtype=None, order = {'C','F'})`  
    
    
    
* argument `order = {'C','F'}` specifies the memory layout of the array: `C` for row-major (default) or `F` for column-major.   
* dtype can be assigned in the creation call itself or changed after the ndarray is created using `astype()`.  
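
As a rough illustration of the two arguments above, the sketch below builds the same small array in both memory layouts and then casts it with `astype()`. The array contents and names are made up for the example.

```python
import numpy as np

c_arr = np.array([[1, 2, 3], [4, 5, 6]], order='C')   # row-major
f_arr = np.array([[1, 2, 3], [4, 5, 6]], order='F')   # column-major

# Same values and shape; only the memory layout differs.
print(c_arr.flags['C_CONTIGUOUS'], f_arr.flags['F_CONTIGUOUS'])
print(np.array_equal(c_arr, f_arr))

# astype() returns a new array and leaves the original untouched.
as_float = c_arr.astype('f4')
print(c_arr.dtype, as_float.dtype)
```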

In [116]:
data = np.array([[3,5,7,2],[2,3,9,1],[7,1,1,1]])
data

array([[3, 5, 7, 2],
       [2, 3, 9, 1],
       [7, 1, 1, 1]])

In [117]:
data = np.array([[3,5,7,2,2,3,9,1,7,1,1,1]])
data

array([[3, 5, 7, 2, 2, 3, 9, 1, 7, 1, 1, 1]])

* `numpy.ndarray()` allocates an n-by-m array without initializing it, so the values shown are whatever happened to be in memory

In [178]:
data_1 = np.ndarray(shape=(3,4), dtype = np.int8)
data_1

array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int8)

&nbsp;

`numpy.ndarray()` is not a random generator: it returns uninitialized memory, so its contents cannot be controlled with a seed.  
`np.random.seed(n)` sets the seed for all the methods under `np.random` such as `np.random.randint()`  

In [181]:
np.random.seed(33)
data_1 = np.ndarray(shape=(3,4), dtype = np.int8)
data_1

array([[1, 0, 0, 0],
       [0, 0, 0, 0],
       [1, 0, 0, 0]], dtype=int8)

&nbsp;

to generate a reproducible array, set `np.random.seed()` and draw the values with `np.random.rand()` (wrapping the result in `np.array()` is optional), as follows:  


In [201]:
np.random.seed(23)
array_obj = np.random.rand(12)
l1 = np.array(array_obj, dtype = 'f8').reshape((3,4))
l1

array([[0.51729788, 0.9469626 , 0.76545976, 0.28239584],
       [0.22104536, 0.68622209, 0.1671392 , 0.39244247],
       [0.61805235, 0.41193009, 0.00246488, 0.88403218]])
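
A quick way to convince yourself that the seed controls the draw is to reset it and compare. This is only a sketch; the second half uses `np.random.default_rng`, which is not used elsewhere in this notebook and is shown as an alternative that keeps its own state.

```python
import numpy as np

np.random.seed(23)
first = np.random.rand(3, 4)

np.random.seed(23)                 # same seed again
second = np.random.rand(3, 4)

print(np.array_equal(first, second))    # the two draws are identical

# a Generator object keeps its own state instead of the global one
rng_a = np.random.default_rng(23)
rng_b = np.random.default_rng(23)
same_rng = np.array_equal(rng_a.random((3, 4)), rng_b.random((3, 4)))
print(same_rng)
```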

In [9]:
# inspect the number of dimensions of the array
data_1.ndim

2

In [None]:
# inspect the data-type
data_1.dtype

In [None]:
# inspect the shape
data_1.shape

In [None]:
data_2 = np.array([[3.1,5.6,7.0,2.0],[2.2,3.5,9,1.8],[7.9,1,1.7,1]], dtype=np.float64)
data_2

In [None]:
# a single-row list can be reshaped with .reshape(n,k); dtype=np.int8 truncates the decimals
data_3 = np.array([[3.1,5.6,7.0,2.0,2.2,3.5,9,1.8,7.9,1,1.7,1]], dtype=np.int8).reshape(3,4)
data_3

* `array.reshape(n,k)` can be used to reshape an existing ndarray

In [None]:
data_3.reshape(2,6)

&nbsp; 

*  3d and 4d arrays

In [None]:
data3d = np.array([[[3,5,7,2],[2,3,9,1],[7,1,1,1]], [[4,8,1,1],[2,7,7,0],[0,2,1,1]]])
data3d

In [None]:
data3d.ndim, data3d.dtype, data3d.shape

In [None]:
data4d = np.array([[[[3,5,7,2],[2,3,9,1],[7,1,1,1]], [[4,8,1,1],[2,7,7,0],[0,2,1,1]]],
                   [[[3,5,7,2],[2,3,9,1],[7,1,1,1]], [[4,8,1,1],[2,7,7,0],[0,2,1,1]]]])
data4d

In [None]:
data4d.ndim, data4d.dtype, data4d.shape

In [None]:
# using the same list from data4d without any nesting
data4d = np.array([3,5,7,2,2,3,9,1,7,1,1,1,4,8,1,1,2,7,7,0,0,2,1,1,
                  3,5,7,2,2,3,9,1,7,1,1,1,4,8,1,1,2,7,7,0,0,2,1,1]).reshape(2,2,3,4)
data4d

In [None]:
data3d_1 = np.array(list(range(5,25))).reshape(4,5)
data3d_1

In [6]:
# more on numpy.arange() shortly
data4d_2 = np.arange(45,5,-1).reshape(2,2,2,5)
data4d_2

array([[[[45, 44, 43, 42, 41],
         [40, 39, 38, 37, 36]],

        [[35, 34, 33, 32, 31],
         [30, 29, 28, 27, 26]]],


       [[[25, 24, 23, 22, 21],
         [20, 19, 18, 17, 16]],

        [[15, 14, 13, 12, 11],
         [10,  9,  8,  7,  6]]]])

In [None]:
data4d_3 = np.ndarray(shape = (2,2,2,5), dtype=np.int16)
data4d_3

&nbsp;

* useful methods that generate n-by-m (or k-by-n-by-m) ndarrays of zeros or ones:  
  - `zeros(shape=(n,m))`, `ones(shape=(n,m))`   
  - `empty(shape=(n,m))`, unlike `zeros()`, does not initialize the array; it returns whatever values are in memory, which are often tiny numbers but are not guaranteed to be zeros.  
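
The difference between the three constructors can be sketched as follows. The names are illustrative; the point is that an `empty()` array should be filled before it is read.

```python
import numpy as np

z = np.zeros((2, 3))               # float64 zeros by default
o = np.ones((2, 3), dtype='i4')    # int32 ones
e = np.empty((2, 3))               # uninitialized: contents are arbitrary

e[:] = 7.5                         # assign values before using the array
print(z, o, e, sep='\n')
```

`empty()` is slightly faster than `zeros()` because it skips the initialization, which only pays off when every element is overwritten anyway.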

In [None]:
zero_arr = np.zeros((5,3), dtype='d') # d and f8 are abbreviation for float64
zero_arr

In [None]:
zero_arr.dtype

In [4]:
ones_arr = np.ones((3,4), dtype=np.int32)
ones_arr

array([[1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]])

In [203]:
empty_arr = np.empty([4,4], dtype=np.float32)
empty_arr

array([[0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00],
       [0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00],
       [0.00e+00, 0.00e+00, 2.08e-42, 0.00e+00],
       [0.00e+00, 6.53e-43, 0.00e+00, 0.00e+00]], dtype=float32)

&nbsp;

* `numpy.random.random(size=(k,n,m))`              
generates a `k` by `n` by `m` array sampled from a continuous standard uniform distribution over the half-open interval `[0,1)`     

In [206]:
np.random.seed(23)
r_unif = np.random.random((2,3,2))
r_unif

array([[[0.51729788, 0.9469626 ],
        [0.76545976, 0.28239584],
        [0.22104536, 0.68622209]],

       [[0.1671392 , 0.39244247],
        [0.61805235, 0.41193009],
        [0.00246488, 0.88403218]]])

In [207]:
# for Uniform(a, b): mean = (a+b)/2, sd = (b-a)/sqrt(12)
np.mean(r_unif), np.std(r_unif)

(0.4912870595645145, 0.2841006450984657)
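
With only 12 values the sample mean and sd are noisy. The sketch below repeats the check on a larger sample (the sample size and seed are arbitrary choices) and compares against the theoretical values for a uniform distribution on `[a, b)`.

```python
import numpy as np

np.random.seed(0)
sample = np.random.random(100_000)

a, b = 0.0, 1.0
theory_mean = (a + b) / 2
theory_sd = (b - a) / np.sqrt(12)

print(sample.mean(), theory_mean)
print(sample.std(), theory_sd)
```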

* `numpy.random.randn(k,n,m)`    
generates a `k` by `n` by `m` array sampled from a Gaussian distribution with mean 0 and sd 1; the dimensions are passed as separate positional arguments rather than as a `size` tuple. 

In [208]:
np.random.seed(56)
r_norm1 = np.random.randn(4,4)
print(r_norm1, end='\n\n')
print(np.mean(r_norm1), np.std(r_norm1))

[[-1.03764318  0.59365816  1.10268062 -0.51217773]
 [-0.26541986 -1.61700601 -0.27151449  0.94555425]
 [-0.62699279 -0.26594728  0.68729358  2.04845691]
 [-0.46155895  0.68784742  1.00812262 -0.15650898]]

0.11617776802742645 0.9084234206778787


In [211]:
r_norm2 = np.random.randn(2,3,4)
r_norm2

array([[[ 0.22768782, -2.51261673,  2.70271251,  0.58365058],
        [ 0.33878175,  0.68900141,  1.70034868, -0.44126724],
        [ 0.3144938 ,  0.63629759,  0.08677138, -0.11779163]],

       [[-1.27587343,  0.90675596,  0.00606262, -0.85210364],
        [ 0.49958744, -1.51280927,  2.00429063,  1.25882569],
        [ 1.04015487,  1.0513821 ,  0.97157016, -1.58639114]]])

&nbsp;

`numpy.random.randint(low, high, size=(k,n,m))`          
generates a `k` by `n` by `m` array of integers drawn at random between `low` and `high`.   
Keep in mind that the range includes `low` but excludes `high`. 

In [213]:
np.random.seed(56)
r_int = np.random.randint(-7,11,size=(2,4,5))
print(r_int)

[[[-3  8 -7 -5  4]
  [ 7  2 -3  8 10]
  [ 5 -5  4  4  7]
  [ 9 -1 -6  3 -2]]

 [[-5  8 -2  2 -5]
  [ 3 -6  9  6 -1]
  [ 5 -7  1 -3  7]
  [ 8 -7 10  4 10]]]


* passing `'?'` or `numpy.bool_` as the dtype generates a boolean array (with `np.ndarray()` the values are uninitialized).

In [214]:
np.ndarray(shape = (2,5), dtype = np.bool_)

array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]])

[back to top](#menu)
<a href='#menu'></a>

<a id='1.2'></a>

&nbsp;

## 1.2 ***casting***   ndarrays

* it is possible to *cast* an array from one dtype to another.

In [None]:
# we can convert the int32 to float.
data = data.astype(np.float64)
data

In [None]:
data.dtype

In [231]:
np.random.rand(16)*10

array([9.85611582, 4.96744922, 3.52318926, 8.67084994, 3.96881746,
       6.43657219, 0.20453313, 8.08084543, 4.29613302, 5.58546523,
       7.7911129 , 4.02557124, 9.07994918, 6.49969611, 7.27272227,
       7.7956104 ])

In [234]:
np.random.seed(55)
data_flt = np.array(np.random.rand(16)*100).reshape((4,4))
data_flt

array([[ 9.31082867, 97.165592  , 48.38599805, 24.25227015],
       [53.11238298, 28.55442355, 86.26303771,  4.11001535],
       [10.8347734 , 76.71600454,  5.14287089, 77.57165386],
       [ 0.91389426, 61.83121135, 81.87093327, 89.8585723 ]])

In [236]:
data_flt.dtype

dtype('float64')

In [235]:
# casting a float array into an int truncates the decimal part (no rounding). 
data_i4 = data_flt.astype('i4') # using an abbreviation. 
data_i4

array([[ 9, 97, 48, 24],
       [53, 28, 86,  4],
       [10, 76,  5, 77],
       [ 0, 61, 81, 89]])

In [237]:
data_i4.dtype

dtype('int32')
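
Since `astype()` truncates toward zero, rounding has to be requested explicitly. A small sketch with made-up values, using `np.rint` (round to nearest, halves to even) before the cast:

```python
import numpy as np

vals = np.array([2.7, -2.7, 0.5, 9.99])

truncated = vals.astype('i4')            # drops the fractional part
rounded = np.rint(vals).astype('i4')     # rounds first, then casts

print(truncated)   # [ 2 -2  0  9]
print(rounded)     # [ 3 -3  0 10]
```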

* a numeric array (float or int) can be converted into a string array as long as a suitable dtype is given.   
* use `U2` to `U10` for the dtype; the number is the maximum length of each string, and longer values are truncated.  

In [238]:
# created earlier
r_int

array([[[-3,  8, -7, -5,  4],
        [ 7,  2, -3,  8, 10],
        [ 5, -5,  4,  4,  7],
        [ 9, -1, -6,  3, -2]],

       [[-5,  8, -2,  2, -5],
        [ 3, -6,  9,  6, -1],
        [ 5, -7,  1, -3,  7],
        [ 8, -7, 10,  4, 10]]])

In [239]:
r_int_str = r_int.astype('U4')
r_int_str

array([[['-3', '8', '-7', '-5', '4'],
        ['7', '2', '-3', '8', '10'],
        ['5', '-5', '4', '4', '7'],
        ['9', '-1', '-6', '3', '-2']],

       [['-5', '8', '-2', '2', '-5'],
        ['3', '-6', '9', '6', '-1'],
        ['5', '-7', '1', '-3', '7'],
        ['8', '-7', '10', '4', '10']]], dtype='<U4')

In [240]:
# created earlier
r_norm1

array([[-1.03764318,  0.59365816,  1.10268062, -0.51217773],
       [-0.26541986, -1.61700601, -0.27151449,  0.94555425],
       [-0.62699279, -0.26594728,  0.68729358,  2.04845691],
       [-0.46155895,  0.68784742,  1.00812262, -0.15650898]])

In [241]:
# U6 keeps at most 6 characters (sign and decimal point included), i.e. 3 to 4 decimal places
r_norm_str = r_norm1.astype('U6')
r_norm_str

array([['-1.037', '0.5936', '1.1026', '-0.512'],
       ['-0.265', '-1.617', '-0.271', '0.9455'],
       ['-0.626', '-0.265', '0.6872', '2.0484'],
       ['-0.461', '0.6878', '1.0081', '-0.156']], dtype='<U6')

In [242]:
# U10 keeps at most 10 characters (sign and decimal point included), i.e. 7 to 8 decimal places
r_norm_str = r_norm1.astype('U10')
r_norm_str

array([['-1.0376431', '0.59365815', '1.10268062', '-0.5121777'],
       ['-0.2654198', '-1.6170060', '-0.2715144', '0.94555425'],
       ['-0.6269927', '-0.2659472', '0.68729357', '2.04845690'],
       ['-0.4615589', '0.68784742', '1.00812262', '-0.1565089']],
      dtype='<U10')

&nbsp;

* it is possible to cast a numeric array into a complex array. 

In [None]:
# the np.complex alias was removed in NumPy 1.24; use np.complex128 instead
randn_cplx = r_norm1.astype(np.complex128)
randn_cplx

&nbsp;

* arrays can also contain letters and words.  
* an alpha array cannot be converted into a numeric array. 

In [243]:
r_str = np.array([['foo_','bar_','scam_._','elon','litt'],['spam_','glee_._','more_','elon','beak_._'],
                     ['bar_','elon_._','glue_','elon_','elon___'],['sap_.','elon','brew','may_','litt']])
r_str

array([['foo_', 'bar_', 'scam_._', 'elon', 'litt'],
       ['spam_', 'glee_._', 'more_', 'elon', 'beak_._'],
       ['bar_', 'elon_._', 'glue_', 'elon_', 'elon___'],
       ['sap_.', 'elon', 'brew', 'may_', 'litt']], dtype='<U7')

* casting to a shorter string dtype truncates the strings.   

In [244]:
r_str = r_str.astype('U4')
r_str

array([['foo_', 'bar_', 'scam', 'elon', 'litt'],
       ['spam', 'glee', 'more', 'elon', 'beak'],
       ['bar_', 'elon', 'glue', 'elon', 'elon'],
       ['sap_', 'elon', 'brew', 'may_', 'litt']], dtype='<U4')

assigning into a string array with dtype `Ux` truncates the new value to length x.

In [245]:
r_str[0,0] = 'foolish'
r_str

array([['fool', 'bar_', 'scam', 'elon', 'litt'],
       ['spam', 'glee', 'more', 'elon', 'beak'],
       ['bar_', 'elon', 'glue', 'elon', 'elon'],
       ['sap_', 'elon', 'brew', 'may_', 'litt']], dtype='<U4')

&nbsp;

* `numpy.random.binomial(n,p,size)` draws samples from a binomial distribution. 

In [246]:
np.random.seed(32)
binom = np.random.binomial(1,0.5,20).reshape((5,4))
binom

array([[1, 0, 1, 1],
       [1, 1, 0, 1],
       [1, 1, 0, 0],
       [1, 0, 1, 1],
       [1, 0, 1, 1]])

In [247]:
r_bool = binom.astype('?')
r_bool

array([[ True, False,  True,  True],
       [ True,  True, False,  True],
       [ True,  True, False, False],
       [ True, False,  True,  True],
       [ True, False,  True,  True]])

[back to top](#menu)
<a href='#menu'></a>

<a id='1.3'></a>

&nbsp;

## 1.3 Arithmetic operations   
* numeric arrays can be treated like single-value variables, so `+ - * / // % abs()` are applicable. 
* arithmetic operations can be applied between arrays of equal shape (or shapes that are compatible under broadcasting).  
* arithmetic operations are carried out element-wise. 
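
Before the larger random arrays below, a tiny sketch with hand-checkable numbers may help. The arrays and names are made up; the last two lines show a scalar and a single row being broadcast across the array.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[10, 20, 30], [40, 50, 60]])

total = a + b          # element-wise sum
product = a * b        # element-wise product, not a matrix product
quotient = b // a      # element-wise floor division
doubled = a * 2        # the scalar is applied to every element
shifted = a + np.array([100, 200, 300])   # the row is added to each row of a

print(total, product, quotient, doubled, shifted, sep='\n\n')
```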

In [314]:
np.random.seed(18)
r_int1 = np.random.randint(-7,10, (4,5), dtype=np.int32)
print('r_int1\n',r_int1, end='\n\n')
r_int2 = np.random.randint(34,81, (4,5), dtype=np.int64)
print('r_int2\n', r_int2, end='\n\n')
r_flt = np.random.randn(4,5)
print('r_flt\n',r_flt)

r_int1
 [[ 3 -2  7 -5  1]
 [-5 -2  8  3  3]
 [ 4 -3 -3  4 -6]
 [-4  2 -2  7 -1]]

r_int2
 [[55 47 42 52 34]
 [63 43 47 44 47]
 [49 37 62 37 65]
 [55 45 76 68 59]]

r_flt
 [[ 1.54639052e+00 -7.17570870e-02  8.12733109e-01  4.74184318e-01
  -2.10215798e+00]
 [ 7.03544762e-01  1.35491591e+00  2.70537365e-01 -6.74582648e-01
   4.94452181e-01]
 [-8.93111515e-02  1.10100059e+00 -3.84033307e-04  5.79753330e-01
   3.45074254e-01]
 [ 1.58549456e+00 -2.94118086e+00 -1.45989279e+00 -5.16562957e-01
  -1.57771814e-01]]


In [249]:
r_int1 + r_flt

array([[ 4.54639052, -2.07175709,  7.81273311, -4.52581568, -1.10215798],
       [-4.29645524, -0.64508409,  8.27053737,  2.32541735,  3.49445218],
       [ 3.91068885, -1.89899941, -3.00038403,  4.57975333, -5.65492575],
       [-2.41450544, -0.94118086, -3.45989279,  6.48343704, -1.15777181]])

In [250]:
r_int1 * r_flt

array([[ 4.63917157e+00,  1.43514174e-01,  5.68913176e+00,
        -2.37092159e+00, -2.10215798e+00],
       [-3.51772381e+00, -2.70983182e+00,  2.16429892e+00,
        -2.02374794e+00,  1.48335654e+00],
       [-3.57244606e-01, -3.30300176e+00,  1.15209992e-03,
         2.31901332e+00, -2.07044552e+00],
       [-6.34197825e+00, -5.88236173e+00,  2.91978558e+00,
        -3.61594070e+00,  1.57771814e-01]])

In [251]:
r_int2 / r_flt

array([[ 3.55666950e+01, -6.54987569e+02,  5.16774813e+01,
         1.09661999e+02, -1.61738558e+01],
       [ 8.95465412e+01,  3.17362868e+01,  1.73728313e+02,
        -6.52255140e+01,  9.50546924e+01],
       [-5.48643693e+02,  3.36057950e+01, -1.61444330e+05,
         6.38202458e+01,  1.88365256e+02],
       [ 3.46894914e+01, -1.52999771e+01, -5.20586173e+01,
        -1.31639327e+02, -3.73957796e+02]])

In [252]:
r_int2 % r_int1

array([[ 1, -1,  0, -3,  0],
       [-2, -1,  7,  2,  2],
       [ 1, -2, -1,  1, -1],
       [-1,  1,  0,  5,  0]], dtype=int64)

In [253]:
abs(r_int2 % r_int1)

array([[1, 1, 0, 3, 0],
       [2, 1, 7, 2, 2],
       [1, 2, 1, 1, 1],
       [1, 1, 0, 5, 0]], dtype=int64)

&nbsp;

* numpy also has a number of statistical methods that can be applied to arrays or basic Python containers   
* `numpy.amin()` and `numpy.amax()` return the min and max along an axis.   
* `numpy.nanmean()` and `numpy.nanstd()` return the mean and std along an axis, ignoring `NaN` values. 

In [254]:
np.amax(r_int1, axis = 1)

array([7, 8, 4, 7])

In [255]:
np.amin(r_int1, axis = 0)

array([-5, -3, -3, -5, -6])

In [263]:
np.mean(r_int1, axis= 0), np.var(r_int1, axis = 0), np.std(r_int1, axis = 0)

(array([-0.5 , -1.25,  2.5 ,  2.25, -0.75]),
 array([16.25  ,  3.6875, 25.25  , 19.6875, 11.1875]),
 array([4.03112887, 1.92028644, 5.02493781, 4.43705984, 3.34477204]))

&nbsp;
`numpy.nanmean()` and `numpy.nanstd()` with an `axis` argument are similar to using `apply()` with `mean` or `sd` in **R**

In [257]:
# compute the mean along an axis.
np.nanmean(r_int1, axis = 0), np.nanmean(r_int1, axis = 1)

(array([-0.5 , -1.25,  2.5 ,  2.25, -0.75]), array([ 0.8,  1.4, -0.8,  0.4]))

In [258]:
# compute the standard deviation along an axis. 
np.nanstd(r_int1, axis = 0), np.nanstd(r_int1, axis = 1)

(array([4.03112887, 1.92028644, 5.02493781, 4.43705984, 3.34477204]),
 array([4.11825206, 4.49888875, 4.06939799, 3.82622529]))
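
`r_int1` contains no missing values, so `np.mean()` and `np.nanmean()` agree on it. The difference only shows up once a `NaN` is present, as in this small made-up example:

```python
import numpy as np

x = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan]])

plain = np.mean(x, axis=1)       # any NaN in a row makes the result NaN
skip = np.nanmean(x, axis=1)     # NaN entries are ignored
col = np.nanmean(x, axis=0)

print(plain)   # [nan nan]
print(skip)    # [2.  4.5]
print(col)     # [2.5 5.  3. ]
```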

[back to top](#menu)
<a href='#menu'></a>

<a id='1.4'></a>

&nbsp;

## 1.4 Indexing, slicing and sorting  ndarrays

* `numpy.arange(start,stop,step,dtype)` is similar in construct to `range()`; however, the resulting object is a numpy array. 
* unlike `range()`, which is a memory-efficient lazy object, `numpy.arange()` stores every element of the array in memory. 
* `numpy.arange()` objects are iterable.  
* the `step` argument of `numpy.arange()` can be a fraction (`range()` only accepts integer steps).
* for two-dimensional arrays in numpy, rows run along axis `0` and columns along axis `1`; this is generally true for Pandas dataframes as well, with exceptions in some methods.     
* for n-dimensional arrays, each additional dimension adds another axis: `2`, `3`, and so on.  
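
The points above can be sketched in a few lines. The values are arbitrary; note that `range()` with a fractional step would raise a `TypeError`, so only `np.arange()` is used for that case.

```python
import numpy as np

r = range(2, 9)                  # lazy Python object
a = np.arange(2, 9)              # array holding all 7 values
half = np.arange(2, 5, 0.5)      # fractional step is allowed

grid = np.arange(6).reshape(2, 3)
col_sums = grid.sum(axis=0)      # collapses the rows: one value per column
row_sums = grid.sum(axis=1)      # collapses the columns: one value per row

print(list(r), a, half, sep='\n')
print(col_sums, row_sums)        # [3 5 7] [ 3 12]
```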

In [264]:
vec1 = np.arange(2,9.5,.5).reshape((3,5))
vec1

array([[2. , 2.5, 3. , 3.5, 4. ],
       [4.5, 5. , 5.5, 6. , 6.5],
       [7. , 7.5, 8. , 8.5, 9. ]])

In [265]:
vec2 = vec1[1]
vec2

array([4.5, 5. , 5.5, 6. , 6.5])

<span style="color:blue">why would you use `np.arange()` over python native `range()`</span>   
try the sequence above using `range()`   

In [None]:
#skipped code 

* `vec2` is a slice of `vec1` and therefore a view onto the same data, not an independent copy.  
* any changes made to `vec2` are reflected in `vec1`. 

In [268]:
vec2[:] = 64
vec2

array([64., 64., 64., 64., 64.])

In [269]:
vec1

array([[ 2. ,  2.5,  3. ,  3.5,  4. ],
       [64. , 64. , 64. , 64. , 64. ],
       [ 7. ,  7.5,  8. ,  8.5,  9. ]])

In [270]:
vec2[3] = 223
vec1

array([[  2. ,   2.5,   3. ,   3.5,   4. ],
       [ 64. ,  64. ,  64. , 223. ,  64. ],
       [  7. ,   7.5,   8. ,   8.5,   9. ]])

&nbsp;

* **NumPy slices yield views onto the same piece of memory, modifying a slice (even one that is assigned to a name) results in modifying the original array.** 
* in order to create a slice that is a deep copy of the original object use `.copy()`

In [271]:
vec2 = vec1[1].copy()
vec2[3] = 91
vec2

array([64., 64., 64., 91., 64.])

In [272]:
# vec1[1] remains unchanged
vec1

array([[  2. ,   2.5,   3. ,   3.5,   4. ],
       [ 64. ,  64. ,  64. , 223. ,  64. ],
       [  7. ,   7.5,   8. ,   8.5,   9. ]])
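
One way to check whether two names refer to the same memory is `np.shares_memory()`. A minimal sketch with a made-up array:

```python
import numpy as np

base = np.arange(10)
view = base[2:5]             # slice: a view onto base
copy = base[2:5].copy()      # independent copy

view[0] = 99                 # also changes base[2]
copy[1] = -1                 # base is not affected

print(base)
print(np.shares_memory(base, view), np.shares_memory(base, copy))
```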

* subset elements from an array based on a condition   
* the resulting subset does not retain the shape of the array. 

In [273]:
# created earlier
r_int1

array([[ 3, -2,  7, -5,  1],
       [-5, -2,  8,  3,  3],
       [ 4, -3, -3,  4, -6],
       [-4,  2, -2,  7, -1]])

&nbsp;

* slices can be created based on a boolean condition

In [274]:
r_int1_subset = r_int1[r_int1 <= 4].copy()
r_int1_subset

array([ 3, -2, -5,  1, -5, -2,  3,  3,  4, -3, -3,  4, -6, -4,  2, -2, -1])

In [275]:
r_int1_subset = r_int1[r_int1 == 3].copy()
r_int1_subset

array([3, 3, 3])

* slicing two dimensional arrays follows the same slicing rules for Python lists separating `n` and `m` with a comma.   

    `array[n,m]`

In [276]:
# first row
r_int1[0,:]

array([ 3, -2,  7, -5,  1])

In [278]:
# first column
r_int1[:,0]

array([ 3, -5,  4, -4])

In [279]:
r_int1[:2,:2]

array([[ 3, -2],
       [-5, -2]])

In [280]:
r_int1[:,:3]

array([[ 3, -2,  7],
       [-5, -2,  8],
       [ 4, -3, -3],
       [-4,  2, -2]])

In [281]:
# a slice excludes its stop index, so to include the last column (index 4) the slice stops at 5
r_int1[:,3:5]

array([[-5,  1],
       [ 3,  3],
       [ 4, -6],
       [ 7, -1]])

In [282]:
r_int1[:3,1:4]

array([[-2,  7, -5],
       [-2,  8,  3],
       [-3, -3,  4]])

In [283]:
r_int1[1:3,:]

array([[-5, -2,  8,  3,  3],
       [ 4, -3, -3,  4, -6]])

* boolean indexing: slicing an array according to a boolean array.

In [284]:
# created earlier
r_int1

array([[ 3, -2,  7, -5,  1],
       [-5, -2,  8,  3,  3],
       [ 4, -3, -3,  4, -6],
       [-4,  2, -2,  7, -1]])

In [285]:
# created earlier
r_bool

array([[ True, False,  True,  True],
       [ True,  True, False,  True],
       [ True,  True, False, False],
       [ True, False,  True,  True],
       [ True, False,  True,  True]])

In [None]:
# third row of r_bool
r_bool[2,:]

In [None]:
r_int1[r_bool[2,:],:]

above the slice will return **rows** of `r_int1` according to the map of `array([ True,  True, False, False])`

&nbsp;


In [286]:
# third column of r_bool
r_bool[:,2]

array([ True, False, False,  True,  True])

In [287]:
r_int1[:,r_bool[:,2]]

array([[ 3, -5,  1],
       [-5,  3,  3],
       [ 4,  4, -6],
       [-4,  7, -1]])

above the slice will return **columns** of `r_int1` according to the map of `array([ True, False, False,  True,  True])`

the only restriction in boolean slicing is that the length of the boolean vector passed in the row or column position must match the number of rows (n) or columns (m) of the array being sliced.
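
The length restriction can be sketched with a small made-up array: masks of the right length select rows or columns, while a mask of the wrong length raises an `IndexError`.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)
row_mask = np.array([True, False, True])           # length 3 = number of rows
col_mask = np.array([False, True, True, False])    # length 4 = number of columns

rows = arr[row_mask, :]      # keeps rows 0 and 2
cols = arr[:, col_mask]      # keeps columns 1 and 2

mismatch_raised = False
try:
    arr[:, row_mask]         # length-3 mask applied to 4 columns
except IndexError:
    mismatch_raised = True

print(rows, cols, mismatch_raised, sep='\n')
```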

&nbsp;

In [288]:
# from rows 1 and 2 of r_int1, keep the columns where the last row (index 3) of r_int2 is < 60.
r_int1[1:3, r_int2[3,:] < 60]

array([[-5, -2,  3],
       [ 4, -3, -6]])

&nbsp;

* there are many ways to slice an array

In [289]:
#created earlier
r_str

array([['fool', 'bar_', 'scam', 'elon', 'litt'],
       ['spam', 'glee', 'more', 'elon', 'beak'],
       ['bar_', 'elon', 'glue', 'elon', 'elon'],
       ['sap_', 'elon', 'brew', 'may_', 'litt']], dtype='<U4')

In [290]:
r_str[2,:]=='elon'

array([False,  True, False,  True,  True])

In [291]:
# returns only the columns where row 2 equals 'elon'
r_str[:,r_str[2,:]=='elon']

array([['bar_', 'elon', 'litt'],
       ['glee', 'elon', 'beak'],
       ['elon', 'elon', 'elon'],
       ['elon', 'may_', 'litt']], dtype='<U4')

In [292]:
r_int2[2:4, r_str[2,:]=='elon']

array([[37, 37, 65],
       [45, 68, 59]], dtype=int64)

&nbsp;

* `numpy.where()` can be used in two different ways:     
 * 1 `numpy.where(array == value)` for index lookup.     
 * 2 `numpy.where(cond, x, y)` is a vectorized version of the expression `x if cond else y`
 
    in both cases `np.where()` returns new arrays; the original array is not modified. 
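
Both forms can be sketched on a small made-up array of scores. The threshold and labels are arbitrary.

```python
import numpy as np

scores = np.array([[55, 80],
                   [91, 42]])

# form 1: index lookup, one array of indices per axis
rows, cols = np.where(scores > 60)

# form 2: vectorized "x if cond else y"
labels = np.where(scores > 60, 'pass', 'fail')

print(rows, cols)    # [0 1] [1 0]
print(labels)
```

The pairs `(rows[i], cols[i])` give the positions of the matching elements, here `(0, 1)` and `(1, 0)`.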

In [295]:
r_str

array([['fool', 'bar_', 'scam', 'elon', 'litt'],
       ['spam', 'glee', 'more', 'elon', 'beak'],
       ['bar_', 'elon', 'glue', 'elon', 'elon'],
       ['sap_', 'elon', 'brew', 'may_', 'litt']], dtype='<U4')

In [293]:
np.where(r_str == 'elon')

(array([0, 1, 2, 2, 2, 3], dtype=int64),
 array([3, 3, 1, 3, 4, 1], dtype=int64))

In [297]:
r_bool

array([[ True, False,  True,  True],
       [ True,  True, False,  True],
       [ True,  True, False, False],
       [ True, False,  True,  True],
       [ True, False,  True,  True]])

In [294]:
np.where(r_bool == False)

(array([0, 1, 2, 2, 3, 4], dtype=int64),
 array([1, 2, 2, 3, 1, 1], dtype=int64))

the index lookup returns a tuple of two arrays: the first holds all the row indices and the second all the column indices

In [9]:
#created earlier
r_int1

In [10]:

np.arange(0,5)

array([0, 1, 2, 3, 4])

&nbsp;

`np.where()` can be used as a vectorized conditional statement   
the statement below replaces values that are > -3 with .9999 and all other values with `np.nan`

In [299]:
np.where(r_int1 > -3, .9999, np.nan)

array([[0.9999, 0.9999, 0.9999,    nan, 0.9999],
       [   nan, 0.9999, 0.9999, 0.9999, 0.9999],
       [0.9999,    nan,    nan, 0.9999,    nan],
       [   nan, 0.9999, 0.9999, 0.9999, 0.9999]])

<span style="color:blue">replace every instance of `'elon'` in the table <U>r_str</U> with the word `'musk'`, keeping all other values of r_str</span>

In [None]:
#skipped code

&nbsp;

* **fancy indexing** refers to passing an array (or list) of integer indices to access multiple elements of an array at once.    
* the indices have to be given as a list or an array.    
* fancy indexing allows reordering and repeating of rows and elements   
* unlike slicing, fancy indexing always copies the data into a new array.
* fancy indexing can be combined with regular slices, which allows selective slicing of rows and columns    

`array[:3, [1,5,1]]`  

the code above selects rows 0, 1 and 2 but only columns 1, 5 and 1 again, in the order given.   
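
A minimal sketch of these rules on a small made-up array (deliberately not `r_int3`, which is used for the exercises below):

```python
import numpy as np

arr = np.arange(20).reshape(4, 5)

picked_rows = arr[[3, 0]]        # rows 3 and 0, in that order
block = arr[:3, [1, 4, 1]]       # rows 0-2, columns 1, 4 and 1 again
pairs = arr[[0, 2], [1, 4]]      # the elements at (0,1) and (2,4)

picked_rows[0, 0] = -1           # the result is a copy: arr is unchanged

print(picked_rows, block, pairs, sep='\n\n')
```

Passing two index lists of equal length, as in `pairs`, picks individual elements rather than a rectangular block.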

In [302]:
r_int3 = np.arange(45).reshape((9,5))
r_int3

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24],
       [25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34],
       [35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44]])

&nbsp;

<span style="color:blue">slice r_int3 selecting rows 3,1 and 4 only</span>

In [None]:
#skipped code

&nbsp;

<span style="color:blue">slice r_int3 selecting rows 3,1 and 4 and columns 4,4 and 2 only</span>

In [None]:
#skipped code

&nbsp;

<span style="color:blue">slice r_int3 selecting rows 1,5,7,2</span>

In [None]:
#skipped code

&nbsp;

<span style="color:blue">slice r_int3 selecting rows 1,5,7,2 and out of the resulting array all rows for columns 0,3,1,2</span>

hint: fancy slices can be stacked row wise   

In [None]:
#skipped code

&nbsp;

<span style="color:blue">slice r_int3 selecting rows 2,2,3,3 and out of the resulting array all rows for columns 0,4,4,4</span>

In [None]:
#skipped code

&nbsp;
* `numpy.sort(array, axis = -1)` returns a sorted copy of the array and leaves the original unchanged.   
* `ndarray.sort(axis = -1)` sorts the array in place and returns `None`.   
* argument `axis = -1` (default) sorts along the last axis (within each row of a 2-d array), whereas `axis = 0` sorts down each column.    
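
The two sort methods and the two axes can be sketched on a small numeric array (made up for the example) before looking at the string array below:

```python
import numpy as np

m = np.array([[9, 1, 5],
              [3, 8, 2]])

by_row = np.sort(m, axis=-1)     # each row sorted; m itself is unchanged
by_col = np.sort(m, axis=0)      # each column sorted
untouched = m.copy()

result = m.sort(axis=-1)         # in place; the method returns None

print(by_row, by_col, m, result, sep='\n\n')
```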

In [308]:
# reminder
r_str

array([['fool', 'bar_', 'scam', 'elon', 'litt'],
       ['spam', 'glee', 'more', 'elon', 'beak'],
       ['bar_', 'elon', 'glue', 'elon', 'elon'],
       ['sap_', 'elon', 'brew', 'may_', 'litt']], dtype='<U4')

In [309]:
np.sort(r_str, axis = -1)

array([['bar_', 'elon', 'fool', 'litt', 'scam'],
       ['beak', 'elon', 'glee', 'more', 'spam'],
       ['bar_', 'elon', 'elon', 'elon', 'glue'],
       ['brew', 'elon', 'litt', 'may_', 'sap_']], dtype='<U4')

In [310]:
np.sort(r_str, axis = 0)

array([['bar_', 'bar_', 'brew', 'elon', 'beak'],
       ['fool', 'elon', 'glue', 'elon', 'elon'],
       ['sap_', 'elon', 'more', 'elon', 'litt'],
       ['spam', 'glee', 'scam', 'may_', 'litt']], dtype='<U4')

In [316]:
r_str

array([['fool', 'bar_', 'scam', 'elon', 'litt'],
       ['spam', 'glee', 'more', 'elon', 'beak'],
       ['bar_', 'elon', 'glue', 'elon', 'elon'],
       ['sap_', 'elon', 'brew', 'may_', 'litt']], dtype='<U4')

&nbsp;

on the other hand, `ndarray.sort()` sorts in place, so the change is permanent

In [317]:
r_str.sort(axis = -1)
r_str

array([['bar_', 'elon', 'fool', 'litt', 'scam'],
       ['beak', 'elon', 'glee', 'more', 'spam'],
       ['bar_', 'elon', 'elon', 'elon', 'glue'],
       ['brew', 'elon', 'litt', 'may_', 'sap_']], dtype='<U4')

[back to top](#menu)
<a href='#menu'></a>

<a id='1.5'></a>

&nbsp; 

## 1.5 Matrix operations
&nbsp;

* this includes matrix transpose, dot products, and inverting a matrix. 


* array **transpose** is done using the instance attribute `ndarray.T` or the method `ndarray.transpose()` applied to the array

In [318]:
r_int2

array([[55, 47, 42, 52, 34],
       [63, 43, 47, 44, 47],
       [49, 37, 62, 37, 65],
       [55, 45, 76, 68, 59]], dtype=int64)

In [319]:
r_int2.T

array([[55, 63, 49, 55],
       [47, 43, 37, 45],
       [42, 47, 62, 76],
       [52, 44, 37, 68],
       [34, 47, 65, 59]], dtype=int64)

&nbsp;

* `numpy.dot(ndarray1, ndarray2)`      
returns the dot product of `ndarray1` and `ndarray2`   

$$ \begin{bmatrix} v_1 & v_2 & v_3 \\ v_4 & v_5 & v_6 \end{bmatrix}  * \begin{bmatrix} a & d & g \\ b & e & h \\ c & f & i \end{bmatrix} = \begin{bmatrix} v_1a + v_2b + v_3c & v_1d + v_2e + v_3f & v_1g + v_2h + v_3i\\ v_4a + v_5b + v_6c & v_4d + v_5e + v_6f & v_4g + v_5h + v_6i \end{bmatrix}$$

In [320]:
mat1 = np.array([[.2,5],[8,5],[.6,3.7],[4,11],[2,1]])
mat1

array([[ 0.2,  5. ],
       [ 8. ,  5. ],
       [ 0.6,  3.7],
       [ 4. , 11. ],
       [ 2. ,  1. ]])

In [321]:
np.dot(r_int2, mat1)

array([[ 688.2, 1271.4],
       [ 654.8, 1234.9],
       [ 621. , 1131.4],
       [ 806.6, 1588.2]])
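
The shape rule behind `np.dot()` is that an (n, k) array times a (k, m) array gives an (n, m) array. A small hand-checkable sketch with made-up matrices, which also shows the equivalent `@` operator and what happens when the inner dimensions disagree:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)
B = np.array([[1, 0],
              [0, 1],
              [2, 2]])             # shape (3, 2)

prod = np.dot(A, B)                # shape (2, 2)
same = A @ B                       # operator form of the same product

shape_error = False
try:
    np.dot(A, A)                   # (2, 3) times (2, 3): inner sizes differ
except ValueError:
    shape_error = True

print(prod)    # [[ 7  8]
               #  [16 17]]
```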

&nbsp; 

* `numpy.outer(ndarray1, ndarray2)`  
returns the outer product of two vectors   

$$ \begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}  * \begin{bmatrix} a & b & c & d \end{bmatrix} = \begin{bmatrix} v_1a & v_1b & v_1c & v_1d \\ v_2a & v_2b & v_2c & v_2d \\ v_3a & v_3b & v_3c & v_3d \end{bmatrix}$$

In [322]:
mat2 = np.array([2,3,4])
mat3 = np.array([1,1,1,1,1,1,1,1,1])

In [323]:
np.outer(mat2, mat3)

array([[2, 2, 2, 2, 2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4, 4, 4, 4, 4]])

In [324]:
np.outer(mat3, mat2)

array([[2, 3, 4],
       [2, 3, 4],
       [2, 3, 4],
       [2, 3, 4],
       [2, 3, 4],
       [2, 3, 4],
       [2, 3, 4],
       [2, 3, 4],
       [2, 3, 4]])

&nbsp;

* `numpy.linalg.det()`          
returns the determinant of a **square** matrix.   
               
               
the determinant of a 2x2 matrix: $$ |\ A| = \begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc $$

the determinant of a 3x3 matrix is more elaborate: $$|\ A| = \begin{vmatrix} a & b & c \\ d & e & f \\ g & h & i \end{vmatrix} = 
a \times \begin{vmatrix} e & f \\ h & i \end{vmatrix} - 
b \times \begin{vmatrix} d & f \\ g & i \end{vmatrix} + 
c \times \begin{vmatrix} d & e \\ g & h \end{vmatrix}  = aei + bfg + cdh - ceg - bdi - afh$$     
      
      

In [325]:
mat4 = np.array([[3,5],[1,4]])
print(mat4,'\n\n',np.linalg.det(mat4))

[[3 5]
 [1 4]] 

 7.000000000000001


In [326]:
mat5 = np.array([[1,6,1],[2,1,8],[5,3,5]])
print(mat5,'\n\n',np.linalg.det(mat5))

[[1 6 1]
 [2 1 8]
 [5 3 5]] 

 161.99999999999994
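
The small deviations from 7 and 162 above are floating-point rounding. As a sanity check, the 3x3 cofactor expansion can be written out by hand and compared with `np.linalg.det()`. The helper `det3` is a hypothetical name introduced only for this sketch.

```python
import numpy as np

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

mat = [[1, 6, 1],
       [2, 1, 8],
       [5, 3, 5]]

manual = det3(mat)                       # exact integer arithmetic
lib = np.linalg.det(np.array(mat))       # floating-point result

print(manual, lib, np.isclose(manual, lib))
```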


&nbsp;

* `numpy.linalg.inv()`             
returns the inverse matrix of a **square** matrix.    
If the determinant of a matrix is zero, the matrix is singular and not invertible.    
           
for a 2x2 matrix:   

$$ \begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \times \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$$

In [None]:
np.linalg.inv(mat4)

In [None]:
np.linalg.inv(mat5)

In [None]:
# a non-invertible (singular) matrix: raises LinAlgError
np.linalg.inv(np.array([[3,2],[3,2]]))

* the dot product of a matrix and its inverse returns the identity matrix (up to floating-point rounding). 

In [None]:
np.dot(mat4, np.linalg.inv(mat4))

In [None]:
np.dot(mat5, np.linalg.inv(mat5))
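
Because of rounding, the products above are close to, but not exactly, the identity matrix, so `np.allclose()` is the safer comparison. The sketch below also writes out the 2x2 inverse by hand; `inv2` is a hypothetical helper, not a NumPy function.

```python
import numpy as np

def inv2(m):
    """Inverse of a 2x2 matrix: swap the diagonal, negate the off-diagonal, divide by det."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return np.array([[d, -b], [-c, a]]) / det

m = np.array([[3, 5],
              [1, 4]])
manual = inv2(m)
lib = np.linalg.inv(m)

print(np.allclose(manual, lib))
print(np.allclose(m @ lib, np.eye(2)))

singular_raised = False
try:
    np.linalg.inv(np.array([[3, 2], [3, 2]]))   # determinant is 0
except np.linalg.LinAlgError:
    singular_raised = True
print(singular_raised)
```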

&nbsp;

* `numpy.linalg.solve(a,b)`   
solves a system of linear equations where `a` is the matrix of coefficients and `b` is the vector of right-hand-side values. 
     
$3x_0 + 2x_1 - x_2 = 1$   
$x_0 + x_1 - 4x_2 = 11$        
$2x_0 + 5x_1 - x_2 = 0$            

$$\begin{bmatrix} 3 & 2 & -1 \\ 1 & 1 & -4 \\ 2 & 5 & -1\end{bmatrix} \times \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 11 \\0 \end{bmatrix} $$

In [327]:
a = np.array([[3,2,-1], [1,1,-4], [2,5,-1]])
b = np.array([1,11,0])
theta = np.linalg.solve(a, b)
theta

array([-0.35, -0.45, -2.95])
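
One way to confirm the solution is to substitute it back into the system. The sketch below repeats the system from the cell above; the comparison with the explicit inverse is included only to show that both routes agree.

```python
import numpy as np

a = np.array([[3, 2, -1], [1, 1, -4], [2, 5, -1]])
b = np.array([1, 11, 0])

x = np.linalg.solve(a, b)
print(np.allclose(a @ x, b))              # substituting x back reproduces b

# Mathematically equivalent, but generally slower and less accurate than solve():
x_via_inverse = np.linalg.inv(a) @ b
print(np.allclose(x, x_via_inverse))
```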

&nbsp;

eigenvalues and eigenvectors can be retrieved using `np.linalg.eig()`, which returns them as a pair; the eigenvectors are the columns of the second array

In [335]:
a

array([[ 3,  2, -1],
       [ 1,  1, -4],
       [ 2,  5, -1]])

In [336]:
eig = np.linalg.eig(a)
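
The returned pair can be unpacked directly. The following sketch, which reuses the same matrix, checks the defining relation $A v = \lambda v$ for each eigenpair; the variable names are chosen for this example only.

```python
import numpy as np

a = np.array([[3, 2, -1], [1, 1, -4], [2, 5, -1]])
eigenvalues, eigenvectors = np.linalg.eig(a)

# Column eigenvectors[:, k] pairs with eigenvalues[k] and satisfies A v = lambda v.
for k in range(len(eigenvalues)):
    v = eigenvectors[:, k]
    print(eigenvalues[k], np.allclose(a @ v, eigenvalues[k] * v))
```

For a non-symmetric matrix such as this one, the eigenvalues may come back as complex numbers.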

[back to top](#menu)
<a href='#menu'></a>

<a id='1.6'></a>

&nbsp;

## 1.6 Additional numpy methods

* `numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)`    
produces an array of `num` evenly spaced elements between the `start` and `stop` points.  
`endpoint = True` makes this an inclusive end range by default.  
`retstep = True` additionally returns the size of the step.   


In [None]:
# returns an array of 5 elements between 1 and 20 including 20. 
np.linspace(1,20,5)

* the argument `retstep = True` (return step) makes the call return a tuple holding the array and the increment. 

In [None]:
np.linspace(1,20,13, retstep=True)

In [None]:
np.linspace(22,33,40, retstep=True)
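
Since `retstep=True` returns a tuple, the array and the step can be unpacked into two names. A short sketch, using the same range as the first example above:

```python
import numpy as np

values, step = np.linspace(1, 20, 5, retstep=True)
print(values)   # 5 evenly spaced points, both ends included
print(step)     # (stop - start) / (num - 1) = 19 / 4

# With endpoint=False the stop value is excluded and the step becomes (stop - start) / num.
values_open, step_open = np.linspace(1, 20, 5, endpoint=False, retstep=True)
print(values_open, step_open)
```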

&nbsp;

* `numpy.hstack((array1, array2))` stacks arrays along the horizontal axis, so the number of columns increases: **apply glue to the left/right of an array**.                 
* `numpy.vstack((array1, array2))` stacks arrays along the vertical axis, so the number of rows increases: **apply glue to the top/bottom of an array**.                  

In [None]:
arr_1 = np.ones((4,2))
arr_1

In [None]:
arr_2 = np.zeros((4,3))
arr_2

`numpy.round(ndarray, d)` can be used to round an array element-wise to `d` decimals.

In [None]:
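np.random.seed(56)
# seeded random integers (np.ndarray((4,3)) would only expose uninitialized memory and ignore the seed)
arr_3 = np.random.randint(0, 100, size=(4,3), dtype=np.int8)
arr_3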

In [None]:
# hstack: same number of rows is required. 
arr13 = np.hstack((arr_1, arr_3))
arr13

In [None]:
# vstack: same number of columns is required
arr23 = np.vstack((arr_2, arr_3))
arr23

&nbsp;

* `numpy.hsplit(array, index or section)` splits along the horizontal axis, **knife cuts between columns**; the number of columns in each piece decreases.  
* `numpy.vsplit(array, index or section)` splits along the vertical axis, **knife cuts between rows**; the number of rows in each piece decreases.   
    - using `section` is less flexible. The method attempts to split the array into equal arrays (with the same number of rows/columns). This requires that the number of rows/columns be divisible by the number specified in the argument.

In [None]:
# the number of columns (3) is prime, so hsplit with a section count can only divide into 3 single columns. 
hsplit = np.hsplit(arr23, 3)
print(hsplit[0],'\n\n', hsplit[1],'\n\n', hsplit[2]) 

In [None]:
vsplit = np.vsplit(arr23, 4)

print(vsplit[0],'\n\n', vsplit[1],'\n\n', vsplit[2],'\n\n', vsplit[3]) 

In [None]:
arr_ = np.array([[0,0,0,0,0,0,0,0],
                 [1,1,1,1,1,1,1,1],
                 [2,2,2,2,2,2,2,2],
                 [3,3,3,3,3,3,3,3],
                 [4,4,4,4,4,4,4,4],
                 [5,5,5,5,5,5,5,5]])
arr_

In [None]:
arr_v = np.vsplit(arr_, [1,3])

arr_v[0]

In [None]:
arr_v[1]

In [None]:
arr_v[2]

&nbsp;


In [None]:
# transpose the matrix for learning convenience. 
arr_t = arr_.T
arr_t

In [None]:
arr_h = np.hsplit(arr_t, [1,2,5])

arr_h
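
The difference between passing an index list and a section count can be sketched on a small array. The array below is made up for illustration; `np.array_split()` is shown as one option when the axis length is not divisible by the section count.

```python
import numpy as np

arr = np.arange(24).reshape(6, 4)

# Index list: cut before rows 1 and 3 -> blocks of 1, 2 and 3 rows.
parts = np.vsplit(arr, [1, 3])
print([p.shape for p in parts])

# Section count: 6 rows cannot be divided into 4 equal blocks.
try:
    np.vsplit(arr, 4)
except ValueError as err:
    print("vsplit failed:", err)

# np.array_split tolerates uneven sections.
uneven = np.array_split(arr, 4, axis=0)
print([p.shape for p in uneven])
```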

&nbsp;

* `np.apply_over_axes(func, array, axes)` applies a function over the given axes of an array. `func` must accept an array and an axis, i.e. be callable as `func(array, axis)`, such as `np.sum` or `np.mean`.



* `np.apply_along_axis(func1d, axis, array)` is more general: it passes each one-dimensional slice along `axis` to `func1d`. It accepts functions such as `np.sum` and `np.mean`, but unlike `np.apply_over_axes` it can also take a user-defined function (UDF) that applies a logic or a condition to the elements of each row or each column of an n-d array.    


for a function that reduces a slice to a single value, `np.apply_along_axis` drops the reduced axis (a 2-d input gives a 1-d result), whereas `np.apply_over_axes` keeps that axis with length 1. 

In [None]:
r_int

In [None]:
r_int_a = r_int[0,:,:].copy()
r_int_b = r_int[1,:,:].copy()

In [None]:
np.apply_over_axes(np.mean, r_int_a, 0)

In [None]:
np.apply_over_axes(np.mean, r_int_a, 1)
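
To make the shape difference between the two helpers concrete, here is a small sketch on a made-up 3x4 array (it does not rely on `r_int`):

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

over = np.apply_over_axes(np.mean, arr, 0)     # func is called as func(arr, axis)
along = np.apply_along_axis(np.mean, 0, arr)   # func receives one 1-D slice at a time

print(over.shape)    # (1, 4): the reduced axis is kept with length 1
print(along.shape)   # (4,): the reduced axis is dropped
print(np.allclose(over.ravel(), along))
print(np.allclose(along, arr.mean(axis=0)))    # the built-in axis argument gives the same numbers
```

For plain reductions such as the mean, the `axis` argument of the reduction itself is usually the simpler choice.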

`np.apply_over_axes` is very similar to `apply()` in the R programming language

&nbsp;

In [None]:
r_int_b

<span style="color:blue"> what does this method do ?</span>

In [None]:
def alternate_add(vec):
    ll = len(vec)
    emty = np.empty(ll)
    emty[::2] = 1
    emty[1::2] = -1
    
    return(np.sum(emty * vec))

In [None]:
np.apply_along_axis(alternate_add, 1,r_int_b)

In [None]:
r_int_b

&nbsp;

&nbsp;

In [None]:
def sq_odds(vec):
    """squares the odd values in a vector permanently"""
    vec[np.where(vec % 2 != 0)] = vec[np.where(vec % 2 != 0)]**2

In [None]:
np.apply_along_axis(sq_odds, 0,r_int_b)

notice that this function is not well suited to `np.apply_along_axis()`: it modifies the matrix in place and returns nothing, so the call has no meaningful values to collect into its output array. 

Notice that the array elements are modified 

In [None]:
r_int_b
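
A variant that returns a new vector instead of modifying its input works more naturally with `np.apply_along_axis()`. This is only a sketch: `sq_odds_copy` is a hypothetical name and the input array is made up.

```python
import numpy as np

def sq_odds_copy(vec):
    """Return a new vector with the odd values squared; the input is left untouched."""
    return np.where(vec % 2 != 0, vec ** 2, vec)

arr = np.array([[1, 2, 3], [4, 5, 6]])
result = np.apply_along_axis(sq_odds_copy, 0, arr)

print(result)   # odd entries squared, even entries unchanged
print(arr)      # the original array is not modified
```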

&nbsp;

finally, in addition to `hstack` and `vstack`, the method `np.append(array_1, array_2, axis)` allows a user to append two ndarrays to each other, given that all dimensions other than the selected axis match     

In [None]:
np.append(r_int_a, r_int_b, axis = 0)

In [None]:
np.append(r_int_a, r_int_b, axis = 1)
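
The relationship between `np.append()` and the stacking helpers can be sketched on two small made-up arrays. Note what happens when `axis` is omitted:

```python
import numpy as np

top = np.ones((2, 3))
bottom = np.zeros((2, 3))

rows = np.append(top, bottom, axis=0)   # shape (4, 3), same result as np.vstack
cols = np.append(top, bottom, axis=1)   # shape (2, 6), same result as np.hstack
flat = np.append(top, bottom)           # no axis: both inputs are flattened first

print(rows.shape, cols.shape, flat.shape)
print(np.array_equal(rows, np.vstack((top, bottom))))
print(np.array_equal(cols, np.hstack((top, bottom))))
```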

[back to top](#menu)
<a href='#menu'></a>

<a id='1.7'></a>

&nbsp;

## 1.7 Iterating over arrays
* iterating over every element in an array, and optionally modifying it, is carried out using the method `numpy.nditer()` 

In [None]:
arr = np.array(np.arange(1,21)).reshape(4,5)

print('arr', '\n', arr, '\n\n', 'arr transpose', '\n', arr.T)

In [None]:
for x in np.nditer(arr):
    print(x, end = ' ')

* iterating through an array follows the order that matches the memory layout. 
* a transpose is only a view on the same memory, so unless it is copied into a new object the iterator follows the order in `arr`.
* this is where the argument `order = {'C', 'F'}` comes into play when creating a numpy array.  

In [None]:
for x in np.nditer(arr.transpose()):
    print(x, end = ' ')

* copying the transpose of **arr** (a deep copy) lays the transposed array out in new memory, so the iterator then follows the transposed order. 

In [None]:
for x in np.nditer(arr.transpose().copy()):
    print(x, end = ' ')

* it is possible to control the order of the iterator by using the argument `order = 'C'` for the `C` language layout, or `'F'` for the Fortran layout.
* `C` iterates row-wise, whereas `F` iterates column-wise. 

In [None]:
for x in np.nditer(arr, order='F'):
    print(x, end = ' ')

In [None]:
for x in np.nditer(arr.T.copy(), order='F'):
    print(x, end = ' ')

* it is possible to iterate through an array and perform an operation on every element; however, this does not change the values in the original array.

In [None]:
for x in np.nditer(arr):
    print(x**2, end = ' ')

* `numpy.nditer()` supports overwriting the values within an array by providing the argument `op_flags = ['readwrite']`.

In [None]:
for x in np.nditer(arr, op_flags=['readwrite']):
    if x % 2 == 0:
        x[...] = 640

In [None]:
arr
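
For a simple element-wise rule like this one, a boolean mask gives the same result without an explicit loop. The sketch below compares both approaches on copies of the same array; the names `looped` and `masked` are chosen for this example.

```python
import numpy as np

arr = np.arange(1, 21).reshape(4, 5)

looped = arr.copy()
with np.nditer(looped, op_flags=['readwrite']) as it:
    for x in it:
        if x % 2 == 0:
            x[...] = 640

masked = arr.copy()
masked[masked % 2 == 0] = 640     # boolean mask selects the even entries

print(np.array_equal(looped, masked))
```

Using `nditer` as a context manager is optional here, but it makes sure any pending writes are flushed back when the block ends.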

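* similar to `enumerate()`, `numpy.nditer()` supports tracking the index during iteration by supplying the argument `flags = ['f_index']` (the flat index in Fortran order; `'c_index'` gives the C-order one). 
* for this construct to work the `numpy.nditer()` object is assigned to a variable, so that its `index` attribute can be read inside the loop.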

In [None]:
arr_1 = np.arange(20).reshape(4,5)
arr_1

In [None]:
it = np.nditer(arr_1, flags = ['f_index'])

In [None]:
type(it)

In [None]:
for i in it:
    print((i, it.index))
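
When the row and column position is needed rather than a flat index, the `multi_index` flag can be used in the same way. A minimal sketch on a small made-up array, with `np.ndenumerate()` shown as a shorter alternative:

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)

it = np.nditer(arr, flags=['multi_index'])
pairs = []
for value in it:
    pairs.append((it.multi_index, int(value)))

print(pairs)

# np.ndenumerate yields the same (index, value) pairs with less ceremony.
print(list(np.ndenumerate(arr))[:3])
```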

[back to top](#menu)
<a href='#menu'></a>

<a id='1.8'></a>

&nbsp;

## 1.8 special values:  `numpy.nan` and `numpy.inf`

* two special floating-point values in numpy are `numpy.nan` and `numpy.inf`.  
* `numpy.nan` (not a number) fills in for missing or undefined values. 
* `.nan` and `.inf` are of type float.   
* `np.inf` appears within a numpy array as a result of dividing a non-zero number by zero; `0/0` yields `np.nan`.   
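
A short sketch of how these values arise from array division. The `np.errstate` context is used here only to silence the runtime warnings that numpy would otherwise emit:

```python
import numpy as np

numerator = np.array([1.0, -1.0, 0.0])
with np.errstate(divide='ignore', invalid='ignore'):
    result = numerator / 0.0

print(result)                 # inf, -inf and nan respectively
print(type(np.nan), type(np.inf))
print(np.nan == np.nan)       # False: nan is not equal to anything, itself included
```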

In [None]:
a , b = np.nan, np.inf
a,b

In [None]:
type(a), type(b)

In [None]:
arr_nan = np.array([1,6,4,np.inf,np.nan,7,7,np.nan,np.inf,3,2,2]).reshape((3,4))
arr_nan

* `numpy.where()` and regular boolean comparison with `==` do not find `np.nan`, because `np.nan` is not equal to anything, including itself.  
* instead, pass the object to the two specialized methods `numpy.isnan()` and `numpy.isinf()`.    

In [None]:
# this always returns False.
a == np.nan

In [None]:
# this does not work: it compares the array with the function object np.isnan 
np.where(arr_nan == np.isnan)

In [None]:
# a direct boolean comparison always returns False. 
arr_nan[1,0] == np.nan

In [None]:
np.where(np.isnan(arr_nan))

In [None]:
np.isnan(arr_nan)

In [None]:
np.isinf(arr_nan)

In [None]:
np.isinf(a), np.isnan(a)

In [None]:
np.isinf(b), np.isnan(b)

In [None]:
for obj in np.nditer(arr_nan, op_flags=['readwrite']):
    if np.isnan(obj):
        obj[...] = 0.09
    elif np.isinf(obj):
        obj[...] = 0.05

In [None]:
arr_nan
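
The same replacement can also be written with boolean masks instead of an explicit `nditer` loop. This is a sketch on a copy of the same values; `np.nanmean()` is included as an example of a reduction that skips `nan` rather than propagating it.

```python
import numpy as np

arr = np.array([1, 6, 4, np.inf, np.nan, 7, 7, np.nan, np.inf, 3, 2, 2]).reshape((3, 4))

cleaned = arr.copy()
cleaned[np.isnan(cleaned)] = 0.09
cleaned[np.isinf(cleaned)] = 0.05

print(cleaned)
print(np.isfinite(cleaned).all())

# nan-aware reductions skip NaN instead of propagating it.
print(np.nanmean(np.array([1.0, np.nan, 3.0])))
```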

In [None]:
np.arange(4,-1,-1)

[back to top](#menu)
<a href='#menu'></a>

_______________________________________________

<a id='1.9'></a>

## 1.9 numpy I/O

### 1.9.1 reading data using numpy

* numpy has a number of methods that support reading tabulated data such as `genfromtxt()` and `loadtxt()`.

* any dataset that is read using a numpy method is read into an ndarray.    

* since ndarrays are homogeneous containers these methods are best used when the dataset is already clean and is comprised of a single data type.    
* data that is read using `genfromtxt()` or other `numpy` methods often requires one or multiple steps of transformation so that it can be sliced properly as an $n \times m$ dataset

In [None]:
np_read_1 = np.genfromtxt(fname = 'data/airquality.txt', delimiter = '\t', dtype = None, skip_header = True)

setting the argument `dtype` = **<span style='color:green'>None</span>** implies instructing numpy to try and figure out the datatypes on its own.    

In [None]:
type(np_read_1)

In [None]:
np_read_1.shape

this looks like a one-dimensional array 

In [None]:
np_read_1[:20]

it is a structured ndarray: a one-dimensional sequence of records (displayed as tuples), each record holding 6 fields that correspond to the 6 columns


let's try to access a column of this array by position (this fails, because the array has only one dimension) 

In [None]:
np_read_1[:,1]
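
To illustrate what a structured array looks like and one possible way to work with it, here is a sketch on a tiny in-memory table. The column names and values are made up, and `structured_to_unstructured` from `numpy.lib.recfunctions` is just one of several ways to obtain a regular 2-D array.

```python
import io
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical tab-delimited sample; the text entries in column c make it non-numeric.
text = "a\tb\tc\n1\t2.5\tNA\n3\t4.5\tx\n"
rec = np.genfromtxt(io.StringIO(text), delimiter='\t', dtype=None,
                    names=True, encoding='utf-8')

print(rec.shape)          # (2,): one record per row
print(rec.dtype.names)    # ('a', 'b', 'c')
print(rec['b'])           # a column is selected by field name, not by [:, 1]

# The numeric fields can be gathered into an ordinary 2-D array.
numeric = rfn.structured_to_unstructured(rec[['a', 'b']])
print(numeric.shape)      # (2, 2)
```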

the data needs to be converted to the following format to be sliced as an ndarray:
&nbsp;

_________________________

In [None]:
r_int_a

In [None]:
r_int_a.shape

In [None]:
r_int_a[:,0]

__________________________

&nbsp;

<span style='color:blue'> attempt to convert `np_read_1` to the desired format </span>     
the resulting object `np_read_2` should have a shape of `(153,6)`     
extract the first column    

In [None]:
#skipped code
#np_read_2

In [None]:
np_read_2[:,0]

In [None]:
np_read_2[:10]

In [None]:
np_read_2.dtype

&nbsp;

if the dataset is not homogeneous this will require additional processing steps such as slicing off the index columns

In [None]:
np_read_3 = np.genfromtxt(fname = 'data/mtcars.csv', delimiter = ',', names = True, dtype = None)

In [None]:
np_read_3

In [None]:
np_read_3.shape

<span style='color:blue'>convert this array to one with a proper shape then inspect the shape</span>

In [None]:
#skipped code

In [None]:
np_read_4_index = np_read_4[:,0].copy()
np_read_4_set = np_read_4[:,1:].copy()

In [None]:
np_read_4_set[:10]

notice that after converting the original ndarray from a single dimension to two dimensions everything is coerced to string

In [None]:
np_read_4_set = np_read_4[:,1:].astype('float32').copy()
np_read_4_set[:10]

&nbsp;

&nbsp;

### 1.9.2 using `.npy` extension
* to write ndarrays to `.txt` or `.csv` files we can use the regular python input/output methods; however, for enhanced reading and writing speeds numpy offers a faster binary file format, `.npy`   


* two methods, `np.save()` and `np.load()`, are provided to write and read `.npy` files, but first let us multiply the length of np_read_4_set 100 times to create a set that is relatively large


In [None]:
from sys import getsizeof

In [None]:
getsizeof(np_read_4_set)

<span style='color:blue'>use `np.append()` to self append <U>np_read_4_set</U> 100 times</span> 

In [None]:
#skipped code

In [None]:
large_set.shape

In [None]:
getsizeof(large_set)

In [None]:
np.save('data/large_set.npy',large_set)

In [None]:
reset

In [None]:
import numpy as np

In [None]:
large_set = np.load('data/large_set.npy')

In [None]:
large_set[:10]

In [None]:
large_set.shape
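
The save and load round trip can be sketched on synthetic data. The array contents and the file name `example.npy` below are placeholders, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile
import numpy as np

data = np.random.default_rng(0).random((1000, 10)).astype('float32')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.npy')   # hypothetical file name
    np.save(path, data)
    restored = np.load(path)
    print(os.path.getsize(path))              # raw bytes plus a small header

print(restored.shape, restored.dtype)         # shape and dtype survive the round trip
print(np.array_equal(data, restored))
print(data.nbytes)                            # size of the array buffer in bytes
```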

[back to top](#menu)
<a href='#menu'></a>

_______________________________________________

<a id='2.0'></a>

&nbsp;

&nbsp;

## 2.0 In-class exercise

the file <U>character_deaths.csv</U> contains information about characters in Game of Thrones: when they died, in what book and in which chapter.    



<span style="color:blue">read the file <U>character_deaths.csv</U> using np.genfromtxt()</span>   
print the first few lines and check how it is formatted    

In [None]:
#skipped code

can you print the first column ?

what are the dimensions of the table ?    

In [None]:
#skipped code


In [None]:
char_got.shape

&nbsp;

<span style="color:blue">convert the table into a proper numpy array then display the first 15 names in the first column</span>   

In [None]:
#skipped code

<span style="color:blue">split the dataset into two sets, the first comprised of the first column, and the second table with the rest of the columns</span> 

In [None]:
#skipped code

In [None]:
char_got_idx[:10]

In [None]:
char_got_val

In [None]:
char_got_val.shape

<span style="color:blue">the task is to plot the number of deaths by book</span>  


first convert all the values that are -1 in the table <U>char_got_val</U> to `np.nan`    

note the datatype of the table     

In [None]:
#skipped code


In [None]:
char_got_val

In [None]:
char_got.dtype

&nbsp;

### Note:

`[i**2 for i in vector if i > 2]` is the proper way to write a list comprehension when there is a single
condition.    


For multiple conditions including `else` the syntax needs to be modified as such:  

`[i**2 if i > 2 else i**.5 for i in vector]`    

in the case of multiple conditions, the conditional statement in its entirety precedes the iterator `for`
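
The two forms behave differently: the trailing `if` filters elements out, while the `if ... else` expression keeps every element and only chooses how to transform it. A small sketch with a made-up `vector`, plus the `np.where()` counterpart for arrays:

```python
import numpy as np

vector = [1, 2, 3, 4, 9]

# Filter: the `if` comes after the `for` and drops elements that fail the test.
filtered = [i**2 for i in vector if i > 2]

# Conditional expression: `if ... else` comes before the `for` and keeps every element.
branched = [i**2 if i > 2 else i**.5 for i in vector]

print(filtered)                       # [9, 16, 81]
print(len(branched) == len(vector))   # True

# For ndarrays, np.where expresses the same branching without a Python loop.
arr = np.array(vector)
print(np.where(arr > 2, arr**2, arr**0.5))
```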

&nbsp;

<span style="color:blue"> create a new binary column where 0 is dead and 1 is alive</span>

you can base it on the second column of the table, `Book_of_death`, using a list comprehension with a conditional 

In [None]:
#skipped code


&nbsp;

<span style="color:blue">append the new vector to <U>char_got_val</U></span>

In [None]:
#skipped code

In [None]:
#skipped code
char_got_val

&nbsp;

<span style="color:blue">finally extract the sum of deaths occurring by book</span>

In [None]:
#skipped code

In [None]:
#skipped code

In [None]:
death_count_by_book

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
plt.plot(death_count_by_book[0], death_count_by_book[1],'ro')