![](http://thecads.org/wp-content/uploads/2017/02/adax_logo.jpg)
# Module 3: Handling Large Data with NumPy

### Contents:

* [Numpy](#Numpy)
* [Broadcasting](#Broadcasting)
* [Manipulating Arrays](#Manipulating-Arrays)
* [Linear Algebra](#Linear-Algebra)
* [Data Processing](#Data-Processing)

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Numpy

Datasets can include collections of documents, images, sound clips, numerical measurements, or really anything else. Despite this heterogeneity, it helps to think of all data fundamentally as arrays of numbers.

| Data type	    | Arrays of Numbers? |
|---------------|-------------|
|Images | Pixel brightness across different channels|
|Videos | Pixel brightness across different channels for each frame | 
|Sound | Intensity over time |
|Numbers | No need for transformation | 
|Tables | Mapping from strings to numbers |


Therefore, the efficient storage and manipulation of large arrays of numbers is really fundamental to the process of doing data science. Numpy and pandas are the libraries within the SciPy stack that specialize in handling numerical arrays and data tables. 

[Numpy](http://www.numpy.org/) is short for _numerical python_. It provides functions, such as matrix multiplication, that are especially useful when you work with large arrays and matrices of numeric data.  

The array object class is the foundation of Numpy. Numpy arrays are like lists in Python, except that everything inside an array must be of the same type, like int or float. As a result, arrays provide much more efficient storage and data operations, especially as they grow larger. In most other ways NumPy arrays are very similar to Python's built-in list type; the main exception is vectorization, covered below.
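
As a small illustration of the same-type rule, the sketch below builds an array from a list that mixes integers and a float. The variable names are only for illustration, and the exact integer type reported at the end depends on the platform.

```python
import numpy as np

# A Python list can hold values of different types side by side
mixed_list = [1, 2, 3.5]

# NumPy converts every element to one common type (here a float type)
mixed_array = np.array(mixed_list)
print(mixed_array)        # every element is now a float
print(mixed_array.dtype)  # float64

# An all-integer list produces an integer array instead
int_array = np.array([1, 2, 3])
print(int_array.dtype)    # an integer type such as int64 or int32
```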

### Creating arrays

In [2]:
# Create array from lists:
lis = [[1,2,3,4,5],[6,7,8,9,10]]

ary = np.array(lis)
print(lis)
print(ary, type(ary))

[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]] <class 'numpy.ndarray'>


### Using array-generating functions

For larger arrays it is impractical to initialize the data manually using explicit Python lists. Instead, we can use one of the many functions in NumPy that generate arrays of different forms. Some of the more common are:


### zeros and ones

In [3]:
# We use these when the elements of the 
# array are originally unknown but its size is known.

np.zeros((3,4))

array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

In [4]:
np.ones((2,3,4), dtype = np.int16)

array([[[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]],

       [[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]]], dtype=int16)

In [5]:
np.empty( (2,3) )   

array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [6]:
# Create a 3x5 array filled with 3.14
np.full((3, 5), 3.14)

array([[ 3.14,  3.14,  3.14,  3.14,  3.14],
       [ 3.14,  3.14,  3.14,  3.14,  3.14],
       [ 3.14,  3.14,  3.14,  3.14,  3.14]])

<font color="#ec1c24">There are so many new functions!<br> 
**TIP**: For a quick reference to any of these functions, place your keyboard cursor on the function name and press Shift-Tab. A pop-up description of the function will appear. </font>

### arange

In [7]:
# Large operations work too, and quickly
np.arange(10000)

array([   0,    1,    2, ..., 9997, 9998, 9999])

In [10]:
# reshape a 1-D range of 25 values into a 5x5 array
np.arange(25).reshape(5,5)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

### random data

In [11]:
# Create a 3x3 array of uniformly distributed
# random values between 0 and 1
np.random.random((3, 3))

array([[ 0.56599058,  0.6464604 ,  0.11883304],
       [ 0.26836242,  0.09870388,  0.71460893],
       [ 0.92372262,  0.02335478,  0.74086905]])

In [13]:
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))

array([[-0.66416136, -0.46736368, -1.62090967],
       [ 0.46895317, -0.42070322,  0.52184843],
       [-0.5406307 , -0.15985984,  1.16406887]])

In [12]:
# Create a 3x3 array of random integers in the interval [0, 10) (10 itself is excluded)
np.random.randint(0, 10, (3, 3))

array([[4, 2, 0],
       [5, 7, 8],
       [9, 8, 9]])

In [14]:
# Create a 3x3 identity matrix
np.eye(3)

array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

In [15]:
# Create an uninitialized array of three floats (the default dtype)
# The values will be whatever happens to already exist at that memory location
np.empty(3)

array([ 1.,  1.,  1.])

### linspace, logspace

In [30]:
# Make several equally spaced points in linear space
# linspace(start, end, number of points); both endpoints are included
np.linspace(0,np.pi,5)

array([ 0.        ,  0.78539816,  1.57079633,  2.35619449,  3.14159265])

In [31]:
np.logspace(0, 10, 10, base=np.e)

array([  1.00000000e+00,   3.03773178e+00,   9.22781435e+00,
         2.80316249e+01,   8.51525577e+01,   2.58670631e+02,
         7.85771994e+02,   2.38696456e+03,   7.25095809e+03,
         2.20264658e+04])

### diag

In [32]:
# a diagonal matrix
np.diag([1,2,3,])

array([[1, 0, 0],
       [0, 2, 0],
       [0, 0, 3]])

In [None]:
# diagonal with offset from the main diagonal
np.diag([1,2,3], k=1)

### Vectorization

In [16]:
lis = [1,2,3,4,5]

In [17]:
lis + lis

[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

In [18]:
# See the difference???
np_array = np.array(lis)
np_array + np_array

array([ 2,  4,  6,  8, 10])

What happened? When we add two NumPy arrays of the same size together, it really does what it should do: ADD them up, element by element. Think of it as adding together two vectors or matrices that are compatible in size. 

In [19]:
# Doing the same using normal lists requires a loop or list comprehension (which still loops) !
print([x+x for x in lis])
print([x**2 for x in lis])

[2, 4, 6, 8, 10]
[1, 4, 9, 16, 25]


So we call operations on numpy arrays **vectorized**.  For almost all data intensive computing, we use numpy because of this feature, and because the whole scientific and numerical python stack is based on numpy.  

To explain it another way, in a spreadsheet you would add an entire column to another one by writing a formula in the first cell and autofilling the rest of the column.  Numpy allows you to do such commands in one go.  
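
The spreadsheet analogy can be sketched as follows. The two "columns" here are made-up values, and the names `price`, `quantity` and `total` are only for illustration.

```python
import numpy as np

# Two hypothetical spreadsheet columns
price = np.array([2.5, 4.0, 1.0, 3.0])
quantity = np.array([4, 1, 10, 2])

# One expression computes the whole "total" column, with no explicit loop
total = price * quantity
print(total)        # [10.  4. 10.  6.]
print(total.sum())  # 30.0
```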





In [21]:
array = np.array([1, 4, 5, 8], float)
print(array)
print("")
array = np.array([[1, 2, 3], [4, 5, 6]], float)  # a 2D array/Matrix
print(array)

[ 1.  4.  5.  8.]

[[ 1.  2.  3.]
 [ 4.  5.  6.]]


NumPy implements its core functionality in _compiled_ C code, which is much faster than the equivalent Python loops. This is only possible because all of the items in a NumPy array have the same data type. (Python is dynamically typed whereas C is not; this gives Python extra flexibility and simplicity, but makes it slower.) 
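
One visible consequence of the fixed data type is sketched below: a value assigned into an array is converted to the array's type rather than changing it. This is a minimal illustration with made-up values, not a complete account of NumPy's casting rules.

```python
import numpy as np

int_array = np.array([1, 2, 3])   # integer dtype inferred from the data
int_array[0] = 3.7                # the float is converted to the array's integer type
print(int_array)                  # [3 2 3]: the fractional part is dropped

float_array = np.array([1, 2, 3], dtype=float)
float_array[0] = 3.7              # a float array keeps the value 3.7
print(float_array)
```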

In [22]:
big_array = np.random.rand(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)

131 ms ± 1.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
883 µs ± 8.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Look at the amazing speed-up achieved by functions that operate on the numpy array!

You can index, slice, and loop over a NumPy ***array*** much like you would a Python list. 

Python has certain common ways of doing things. Let's call one of them *listiness*: behaving in a list-like way. Listiness works on lists, dictionaries, files, and anything else that follows the general notion of an iterator.

NumPy arrays and Python lists feel similar because they both support **the iterator protocol**, which is what lets an object behave in a list-like way. 
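
A minimal sketch of this shared behaviour, using illustrative names: the same `for` statement works on a list and on an array, and looping over a 2D array yields one row at a time.

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

# Both objects can be looped over with the same for-statement
for item in values:
    print(item)
for item in arr:
    print(item)

# Iterating over a 2D array yields one row (a 1D array) at a time
matrix = np.array([[1, 2], [3, 4]])
rows = [row.tolist() for row in matrix]
print(rows)  # [[1, 2], [3, 4]]
```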

# Broadcasting

It is also possible to do operations on arrays of different sizes if numpy can transform these arrays so that they all have
the same size -- this conversion is called <font color="#ec1c24">Broadcasting</font>.

Broadcasting is simply a set of rules for applying binary universal functions (e.g., addition, subtraction, multiplication, etc.) on arrays of different sizes.
<img src="http://www.scipy-lectures.org/_images/numpy_broadcasting.png" width="600" />

In [None]:
M = np.ones((3, 3))
M

In [None]:
M + 5

This might seem strange, but it is correct: NumPy performs broadcasting! Strictly speaking, it is not possible to add a single scalar number to a matrix, but broadcasting automatically replicates (or pads) the scalar to match the shape of the matrix, making addition possible. It's the same as doing...

In [None]:
M + np.full((3, 3), 5)

Let's try another example...

In [None]:
a = np.arange(3)
b = np.arange(3)[:, np.newaxis]

print(a)
print(b)

In [None]:
a + b

## Rules of Broadcasting

Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays:

- **Rule 1:** If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is *padded* with ones on its leading (left) side.
- **Rule 2:** If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
- **Rule 3:** If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
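
The three rules can be sketched as a small pure-Python function that works on shape tuples. The helper `broadcast_shape` is hypothetical and written only for illustration; it is not part of NumPy, and the last line compares its answer with the shape NumPy actually produces.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Apply the three broadcasting rules to two shape tuples."""
    # Rule 1: pad the shorter shape with ones on its left side
    ndim = max(len(shape_a), len(shape_b))
    shape_a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    shape_b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)

    result = []
    for size_a, size_b in zip(shape_a, shape_b):
        if size_a == size_b or size_b == 1:
            result.append(size_a)   # sizes match, or Rule 2 stretches b
        elif size_a == 1:
            result.append(size_b)   # Rule 2 stretches a
        else:
            # Rule 3: sizes disagree and neither is 1
            raise ValueError(f"cannot broadcast shapes {shape_a} and {shape_b}")
    return tuple(result)

print(broadcast_shape((2, 3), (3,)))   # (2, 3)
print(broadcast_shape((3, 1), (3,)))   # (3, 3)

# Compare with the shape NumPy itself produces
print((np.ones((3, 1)) + np.arange(3)).shape)  # (3, 3)
```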

To make these rules clear, let's consider a few examples in detail.

In [2]:
# Rule one
M = np.ones((2, 3))
a = np.arange(3)
M + a

array([[ 1.,  2.,  3.],
       [ 1.,  2.,  3.]])

In [3]:
# Rule two
a = np.arange(3).reshape((3, 1))
b = np.arange(3)
print(a,b)

[[0]
 [1]
 [2]] [0 1 2]


In [None]:
a + b

In [4]:
# Rule three
M = np.ones((3, 2))
a = np.arange(3)
print(M, a)
M + a

[[ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]] [0 1 2]


ValueError: operands could not be broadcast together with shapes (3,2) (3,) 

In [13]:
# 'a' is a 1-D array with shape (3,), so Rule 1 pads it to (1, 3), which does not match (3, 2).
# To get around the problem, give 'a' an explicit second axis of length 1:
print(a)
print(a.shape)
print()
print(a[:, np.newaxis])
print(a[:, np.newaxis].shape)
print()


[0 1 2]
(3,)

[[0]
 [1]
 [2]]
(3, 1)


In [5]:
# This will work
M + a[:, np.newaxis]

array([[ 1.,  1.],
       [ 2.,  2.],
       [ 3.,  3.]])

In [17]:
np.logaddexp(M, a[:, np.newaxis])

array([[ 1.31326169,  1.31326169],
       [ 1.69314718,  1.69314718],
       [ 2.31326169,  2.31326169]])

# Manipulating Arrays
 
### Indexing
We can index elements in an array using square brackets and indices:

In [16]:
# a vector: the argument to the array function is a Python list
v = np.array([1,2,3,4])
v[0]

1

In [15]:
M = np.random.random([3,3])
print(M)
# M is a matrix, or a 2 dimensional array, taking two indices 
M[2,1]

[[ 0.92616374  0.10755167  0.27427444]
 [ 0.34379403  0.0777403   0.50471123]
 [ 0.02288281  0.26242716  0.44630122]]


0.26242716431858615

This notation works like coordinates in a 2-D matrix: the first value is the row index and the second value is the column index.

### Array Slicing: Accessing Subarrays

Just as we can use square brackets to access individual array elements, we can also use them to access subarrays with the *slice* notation, marked by the colon (``:``) character.
The NumPy slicing syntax follows that of the standard Python list; to access a slice of an array ``x``, use this:
``` python
x[start:stop:step]
```
If any of these are unspecified, they default to the values ``start=0``, ``stop=``*``size of dimension``*, ``step=1``.
We'll take a look at accessing sub-arrays in one dimension and in multiple dimensions.

Source: _Python Data Science Handbook_
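
A quick sketch of the `start:stop:step` defaults on a small one-dimensional array (the array `x` here is just an illustrative range):

```python
import numpy as np

x = np.arange(10)   # [0 1 2 3 4 5 6 7 8 9]

print(x[2:7])    # [2 3 4 5 6]
print(x[:4])     # [0 1 2 3]
print(x[6:])     # [6 7 8 9]
print(x[1:8:3])  # [1 4 7]
print(x[::-1])   # [9 8 7 6 5 4 3 2 1 0]
```

Note that with a negative step the defaults for `start` and `stop` are swapped, which is why `x[::-1]` walks the array from the last element to the first.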

If we omit an index of a multidimensional array, it returns the whole row (or, in general, an (N-1)-dimensional array):

In [18]:
M

array([[ 0.92616374,  0.10755167,  0.27427444],
       [ 0.34379403,  0.0777403 ,  0.50471123],
       [ 0.02288281,  0.26242716,  0.44630122]])

In [19]:
M[1]

array([ 0.34379403,  0.0777403 ,  0.50471123])

The same thing can be achieved by using `:` instead of an index:

In [20]:
M[1,:] #row 1

array([ 0.34379403,  0.0777403 ,  0.50471123])

In [21]:
M[:,1] #column 1

array([ 0.10755167,  0.0777403 ,  0.26242716])

We can assign new values to elements in an array using indexing:

In [22]:
M[0,0] = 1

In [23]:
M

array([[ 1.        ,  0.10755167,  0.27427444],
       [ 0.34379403,  0.0777403 ,  0.50471123],
       [ 0.02288281,  0.26242716,  0.44630122]])

In [24]:
# assignment can also work for rows and columns. This is really powerful and fast!
M[1,:] = 0
M

array([[ 1.        ,  0.10755167,  0.27427444],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.02288281,  0.26242716,  0.44630122]])

In [25]:
M[:,2] = -1
M

array([[ 1.        ,  0.10755167, -1.        ],
       [ 0.        ,  0.        , -1.        ],
       [ 0.02288281,  0.26242716, -1.        ]])

### Index Slicing
Index slicing is the technical name for the syntax M[lower:upper:step] to extract part of an array:

In [26]:
A = np.array([1,2,3,4,5])
A

array([1, 2, 3, 4, 5])

In [27]:
A[1:3]

array([2, 3])

Array slices are mutable: if they are assigned a new value the original array from which the slice was extracted is modified:

In [28]:
A[1:3] = [-2,-3]
A

array([ 1, -2, -3,  4,  5])
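
The reason is that a slice is a *view* onto the original data rather than a copy. The sketch below, with illustrative variable names, contrasts a view with an explicit copy made with `.copy()`.

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5])

view = original[1:3]        # a slice is a view onto the same data
view[0] = 99
print(original)             # [ 1 99  3  4  5]

independent = original[1:3].copy()   # an explicit copy owns its data
independent[0] = -1
print(original)             # unchanged: [ 1 99  3  4  5]
```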

We can omit any of the three parameters in M[lower:upper:step]:

In [29]:
A[::] # lower, upper, step all take the default values

array([ 1, -2, -3,  4,  5])

In [30]:
A[::2] # step is 2, lower and upper default to the beginning and end of the array

array([ 1, -3,  5])

In [31]:
A[:3] # first three elements

array([ 1, -2, -3])

In [32]:
A[3:] # elements from index 3

array([4, 5])

Index slicing works exactly the same way for multidimensional arrays:


In [33]:
A = np.random.randint(0, 10, (5, 5))
A

array([[1, 3, 4, 1, 8],
       [9, 5, 5, 2, 6],
       [9, 7, 7, 6, 5],
       [7, 7, 4, 6, 3],
       [8, 3, 7, 9, 1]])

In [37]:
# slice a block from the original array
A[1:4, 1:4]

array([[5, 5, 2],
       [7, 7, 6],
       [7, 4, 6]])

In [38]:
# slice with different strides
A[::2, ::2]

array([[1, 4, 8],
       [9, 7, 5],
       [8, 7, 1]])

### Fancy indexing
Fancy indexing is the name for when an array or list is used in place of an index:

In [47]:
row_indices = [1, 2, 3,4]
A
#A[row_indices]

array([[1, 3, 4, 1, 8],
       [9, 5, 5, 2, 6],
       [9, 7, 7, 6, 5],
       [7, 7, 4, 6, 3],
       [8, 3, 7, 9, 1]])

In [48]:
col_indices = [3, 2, 1, 0] # paired element-wise with row_indices: (1,3), (2,2), (3,1), (4,0)
A[row_indices, col_indices]      # Try figure this out!

array([2, 7, 7, 8])

We can also use index masks: if the index mask is a NumPy array of data type bool, then an element is selected (True) or not (False) depending on the value of the index mask at the position of each element:

In [49]:
B = np.array([n for n in range(5)])    # notice how list comprehensions can be used as well
B

array([0, 1, 2, 3, 4])

In [50]:
row_mask = np.array([True, False, True, False, False])    # boolean mask can be used to select elements
B[row_mask]

array([0, 2])

This feature is very useful to conditionally select elements from an array, using for example comparison operators:

In [51]:
x = np.arange(0, 10, 0.5)
x

array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ,
        5.5,  6. ,  6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5])

In [52]:
mask = (5 < x) & (x < 7.5)    # a mask is built by combining conditions element-wise with &
mask

array([False, False, False, False, False, False, False, False, False,
       False, False,  True,  True,  True,  True, False, False, False,
       False, False], dtype=bool)

In [53]:
x[mask]

array([ 5.5,  6. ,  6.5,  7. ])

### Using arrays in conditions

When using arrays in conditions, for example ```if``` statements and other boolean expressions, one needs to use ```any``` or ```all```, which test whether any or all elements in the array evaluate to ```True```:

In [54]:
M = np.array([[ 1,  4],[ 9, 16]])
M

array([[ 1,  4],
       [ 9, 16]])

In [55]:
#any
if (M > 5).any():
    print("at least one element in M is larger than 5")
else:
    print("no element in M is larger than 5")

at least one element in M is larger than 5


In [56]:
#all
if (M > 5).all():
    print("all elements in M are larger than 5")
else:
    print("not all elements in M are larger than 5")

not all elements in M are larger than 5


## Functions for extracting data from arrays and creating arrays

**where**

The index mask can be converted to position indices using the `where` function:

In [57]:
print(mask)
indices = np.where(mask)

indices

[False False False False False False False False False False False  True
  True  True  True False False False False False]


(array([11, 12, 13, 14], dtype=int64),)

In [58]:
x[indices] # this indexing is equivalent to the fancy indexing x[mask]

array([ 5.5,  6. ,  6.5,  7. ])

**diag**

With the diag function we can also extract the diagonal and subdiagonals of an array:

In [59]:
print(A)
np.diag(A)

[[1 3 4 1 8]
 [9 5 5 2 6]
 [9 7 7 6 5]
 [7 7 4 6 3]
 [8 3 7 9 1]]


array([1, 5, 7, 6, 1])

In [66]:
np.diag(A, -3)

array([7, 3])

**take**

The take function is similar to fancy indexing described above:

In [67]:
v2 = np.arange(-3,3)
v2

array([-3, -2, -1,  0,  1,  2])

In [68]:
row_indices = [1, 3, 5]
v2[row_indices] # fancy indexing

array([-2,  0,  2])

In [69]:
v2.take(row_indices)

array([-2,  0,  2])

But take also works on lists and other objects:


In [70]:
np.take([-3, -2, -1,  0,  1,  2], row_indices)

array([-2,  0,  2])

**choose**

Constructs an array by picking elements from several arrays:

In [125]:
which = [1, 0, 1, 0]
choices = [[-2,-2,-2,-2], [5,5,5,5]]

np.choose(which, choices)

array([ 5, -2,  5, -2])

# Linear Algebra

Vectorizing code is the key to writing efficient numerical calculation with Python/Numpy. That means that as much as possible of a program should be formulated in terms of matrix and vector operations, like matrix-matrix multiplication.

### Scalar-array operations
We can use the usual arithmetic operators to multiply, add, subtract, and divide arrays with scalar numbers.

In [73]:
v1 = np.arange(5)
v1

array([0, 1, 2, 3, 4])

In [74]:
v1 * 2

array([0, 2, 4, 6, 8])

In [75]:
v1 + 2

array([2, 3, 4, 5, 6])

In [76]:
A * 2, A + 2

(array([[ 2,  6,  8,  2, 16],
        [18, 10, 10,  4, 12],
        [18, 14, 14, 12, 10],
        [14, 14,  8, 12,  6],
        [16,  6, 14, 18,  2]]), array([[ 3,  5,  6,  3, 10],
        [11,  7,  7,  4,  8],
        [11,  9,  9,  8,  7],
        [ 9,  9,  6,  8,  5],
        [10,  5,  9, 11,  3]]))

### Element-wise array-array operations

When we add, subtract, multiply and divide arrays with each other, the default behaviour is element-wise operations:

In [126]:
A * A # element-wise multiplication

array([[ 1,  9, 16,  1, 64],
       [81, 25, 25,  4, 36],
       [81, 49, 49, 36, 25],
       [49, 49, 16, 36,  9],
       [64,  9, 49, 81,  1]])

In [127]:
v1 * v1

array([ 0,  1,  4,  9, 16])

If we multiply arrays with compatible shapes, we get an element-wise multiplication of each row:


In [79]:
A.shape, v1.shape

((5, 5), (5,))

In [81]:
print(A)
A * v1

[[1 3 4 1 8]
 [9 5 5 2 6]
 [9 7 7 6 5]
 [7 7 4 6 3]
 [8 3 7 9 1]]


array([[ 0,  3,  8,  3, 32],
       [ 0,  5, 10,  6, 24],
       [ 0,  7, 14, 18, 20],
       [ 0,  7,  8, 18, 12],
       [ 0,  3, 14, 27,  4]])

### Matrix algebra

What about matrix multiplication? There are two ways. We can either use the dot function, which applies a matrix-matrix, matrix-vector, or inner vector multiplication to its two arguments:

In [82]:
np.dot(A,A)

array([[135,  77, 107, 109,  57],
       [161, 119, 146, 115, 139],
       [217, 168, 179, 146, 172],
       [172, 135, 136, 108, 139],
       [169, 154, 139, 119, 145]])

In [85]:
print(A,v1)
print(np.dot(A, v1))
print(np.dot(v1,A))

[[1 3 4 1 8]
 [9 5 5 2 6]
 [9 7 7 6 5]
 [7 7 4 6 3]
 [8 3 7 9 1]] [0 1 2 3 4]
[46 45 59 45 48]
[80 52 59 68 29]


In [87]:
np.dot(v1,v1)

30
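
On regular arrays, Python's `@` operator gives the same products as `np.dot` for the 1-D and 2-D cases shown here. The sketch below uses a small made-up matrix `P` and vector `u` (illustrative names) so the results are easy to check by hand.

```python
import numpy as np

P = np.array([[1, 2], [3, 4]])
u = np.array([1, 0])

print(np.dot(P, P))  # matrix-matrix product
print(P @ P)         # same result with the @ operator
print(P @ u)         # matrix-vector product: [1 3]
print(u @ u)         # inner product: 1
```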

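Alternatively, we can cast the array objects to the type `matrix`. This changes the behavior of the standard arithmetic operators, most notably `*`, to use matrix algebra instead of element-wise operations. Note that the NumPy documentation no longer recommends the `matrix` class; regular arrays with `np.dot` or the `@` operator are preferred in new code.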

In [91]:
M = np.matrix(A)
v = np.matrix(v1).T # make it a column vector by doing a transpose


In [92]:
M

matrix([[1, 3, 4, 1, 8],
        [9, 5, 5, 2, 6],
        [9, 7, 7, 6, 5],
        [7, 7, 4, 6, 3],
        [8, 3, 7, 9, 1]])

In [93]:
v

matrix([[0],
        [1],
        [2],
        [3],
        [4]])

In [94]:
M * M

matrix([[135,  77, 107, 109,  57],
        [161, 119, 146, 115, 139],
        [217, 168, 179, 146, 172],
        [172, 135, 136, 108, 139],
        [169, 154, 139, 119, 145]])

In [95]:
M * v

matrix([[46],
        [45],
        [59],
        [45],
        [48]])

If we try to add, subtract or multiply objects with incompatible shapes we get an error:


In [97]:
v = np.matrix([1,2,3,4,5,6]).T
v

matrix([[1],
        [2],
        [3],
        [4],
        [5],
        [6]])

In [98]:
np.shape(M), np.shape(v)

((5, 5), (6, 1))

In [99]:
M * v #error due to different dimension

ValueError: shapes (5,5) and (6,1) not aligned: 5 (dim 1) != 6 (dim 0)

## NumPy Standard Data Types

NumPy arrays contain values of a single type, so it is important to have detailed knowledge of those types and their limitations.
Because NumPy is built in C, the types will be familiar to users of C, Fortran, and other related languages.

The standard NumPy data types are listed in the following table.
Note that when constructing an array, they can be specified using a string:

```python
np.zeros(10, dtype='int16')
```

Or using the associated NumPy object:

```python
np.zeros(10, dtype=np.int16)
```

| Data type	    | Description |
|---------------|-------------|
| ``bool_``     | Boolean (True or False) stored as a byte |
| ``int_``      | Default integer type (same as C ``long``; normally either ``int64`` or ``int32``)| 
| ``intc``      | Identical to C ``int`` (normally ``int32`` or ``int64``)| 
| ``intp``      | Integer used for indexing (same as C ``ssize_t``; normally either ``int32`` or ``int64``)| 
| ``int8``      | Byte (-128 to 127)| 
| ``int16``     | Integer (-32768 to 32767)|
| ``int32``     | Integer (-2147483648 to 2147483647)|
| ``int64``     | Integer (-9223372036854775808 to 9223372036854775807)| 
| ``uint8``     | Unsigned integer (0 to 255)| 
| ``uint16``    | Unsigned integer (0 to 65535)| 
| ``uint32``    | Unsigned integer (0 to 4294967295)| 
| ``uint64``    | Unsigned integer (0 to 18446744073709551615)| 
| ``float_``    | Shorthand for ``float64``.| 
| ``float16``   | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa| 
| ``float32``   | Single precision float: sign bit, 8 bits exponent, 23 bits mantissa| 
| ``float64``   | Double precision float: sign bit, 11 bits exponent, 52 bits mantissa| 
| ``complex_``  | Shorthand for ``complex128``.| 
| ``complex64`` | Complex number, represented by two 32-bit floats| 
| ``complex128``| Complex number, represented by two 64-bit floats| 

More advanced type specification is possible, such as specifying big or little endian numbers; for more information, refer to the [NumPy documentation](http://numpy.org/).
NumPy also supports compound data types, which will be covered in [Structured Data: NumPy's Structured Arrays](02.09-Structured-Data-NumPy.ipynb).

Source: Jake VanderPlas's _Python Data Science Handbook_
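
The limits listed in the table can be checked directly with `np.iinfo` (integer types) and `np.finfo` (floating-point types). This is a small sketch for a few of the types, not an exhaustive check.

```python
import numpy as np

# Integer limits for a few fixed-width types
for dtype in (np.int8, np.int16, np.uint8):
    info = np.iinfo(dtype)
    print(dtype.__name__, info.min, info.max)

# Floating-point types report their largest value and machine epsilon
print(np.finfo(np.float32).max)
print(np.finfo(np.float64).eps)

# The chosen dtype fixes the memory used per element
print(np.zeros(10, dtype='int16').itemsize)     # 2 bytes
print(np.zeros(10, dtype=np.float64).itemsize)  # 8 bytes
```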

### Attributes of Numpy Arrays

In [100]:
# Create a ranged array: 
# arange = array range
a = np.arange(15)
a

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

### Reshaping, resizing and stacking arrays

The shape of a Numpy array can be modified without copying the underlying data, which makes it a fast operation even for large arrays.

In [101]:
# reshape it
a.reshape(3,5)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
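
That no data is copied can be checked with `np.shares_memory`. In this sketch (illustrative names `flat` and `grid`), a change made through the reshaped array shows up in the original as well.

```python
import numpy as np

flat = np.arange(15)
grid = flat.reshape(3, 5)

# Both arrays refer to the same underlying buffer
print(np.shares_memory(flat, grid))  # True

# so a change made through one is visible through the other
grid[0, 0] = 100
print(flat[0])  # 100
```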

In [102]:
# You can specify the type of an array:
c = np.array([[1,2],[3,4]], dtype=complex) # complex numbers!
c

array([[ 1.+0.j,  2.+0.j],
       [ 3.+0.j,  4.+0.j]])

In [103]:
ndarray = np.array([[1,2,3],[4,5,6]])
type(ndarray), ndarray

(numpy.ndarray, array([[1, 2, 3],
        [4, 5, 6]]))

In [104]:
# Number of axes or dimensions of the array
ndarray.ndim

2

In [105]:
# Dimensions of the array:
# For a matrix with n rows and m columns, 
# shape will be (n,m).
ndarray.shape

(2, 3)

In [106]:
# Type of elements in the array
ndarray.dtype

dtype('int32')

In [107]:
# Size in bytes of each element of the array
# int32 (the dtype here) has itemsize 4
# int64 has itemsize 8
print("itemsize:", ndarray.itemsize, "bytes")
print("nbytes:", ndarray.nbytes, "bytes")

itemsize: 4 bytes
nbytes: 24 bytes


### Adding a new dimension: newaxis

With newaxis, we can insert new dimensions in an array, for example converting a vector to a column or row matrix:

In [108]:
v = np.array([1,2,3])

In [109]:
np.shape(v)

(3,)

In [110]:
# make a column matrix of the vector v
v[:, np.newaxis]

array([[1],
       [2],
       [3]])

In [111]:
# column matrix
v[:,np.newaxis].shape

(3, 1)

In [112]:
v[np.newaxis,:].shape

(1, 3)

### Array Concatenation and splitting

Write some code to test the following functions and figure out what they do!

In [None]:
# Try the following
# np.concatenate (try both axis=0 and axis=1)
# np.split
# np.hstack
# np.vstack
# np.dstack
# np.floor
# np.hsplit
# np.vsplit
# np.dsplit
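
As a possible starting point for the experiments, here is a minimal sketch using `np.concatenate` and `np.split` on two small made-up arrays. The remaining functions in the list are left for you to explore.

```python
import numpy as np

top = np.array([[1, 2], [3, 4]])
bottom = np.array([[5, 6], [7, 8]])

# Join along rows (axis=0) or along columns (axis=1)
stacked_rows = np.concatenate([top, bottom], axis=0)  # shape (4, 2)
stacked_cols = np.concatenate([top, bottom], axis=1)  # shape (2, 4)
print(stacked_rows.shape, stacked_cols.shape)

# np.split cuts an array into equal pieces along an axis
first, second = np.split(stacked_rows, 2, axis=0)
print(np.array_equal(first, top), np.array_equal(second, bottom))  # True True
```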

## **Exercises:**


1. Create a 3x3 matrix with values ranging from 0 to 8
2. Create a 10x10 array with random values and find the minimum and maximum values
3. Create an 8x8 matrix and fill it with a checkerboard pattern 
4. Create a random vector of size 10 and replace the maximum value with 0
5. Create a $4 \times 4$ identity matrix.
6. Generate the 2D array
7. Generate a random $4 \times 4 \times 4$ array of normally distributed (Gaussian) numbers.   
8. Generate `n` evenly spaced intervals between 0. and 1.  
9. Create a vector and then reverse the vector (first element becomes last)

Looking for more? [Here](http://www.loria.fr/~rougier/teaching/numpy.100/) are more questions (with sample answers) categorized into three levels of difficulty: Apprentice, Novice and Neophyte.

In [124]:
ex1 = np.arange(0,8)
ex1 = np.matrix(ex1)
ex1.shape


(1, 8)

# Data Processing

### Comma-separated values (CSV)

A very common file format for data files is comma-separated values (CSV), or related formats such as TSV (tab-separated values). To read data from such files into Numpy arrays we can use the numpy.genfromtxt function. For example,


In [128]:
%pwd

'C:\\Users\\firdaus afifi\\Documents\\Python'

In [129]:
!head stockholm_td_adj.dat # print the first 10 lines (head is a Unix command, so this fails in the Windows command prompt)

'head' is not recognized as an internal or external command,
operable program or batch file.


In [130]:
#store data from dat format to 'data' variable
data = np.genfromtxt('stockholm_td_adj.dat') 

OSError: stockholm_td_adj.dat not found.
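
If the data file is not in the working directory, as in the error above, the behaviour of `np.genfromtxt` can still be tried on an in-memory sample. The sketch below uses three made-up rows (not taken from the real dataset) laid out in the same seven-column order as the file: year, month, day, daily average, low, high, location.

```python
import io
import numpy as np

# Three made-up rows: year, month, day, daily average, low, high, location
sample_text = """2000  4  1   5.0   2.0   8.0 1
2000  4  2   6.5   3.0  10.0 1
2000  4  3   4.0   1.5   6.5 1
"""

# genfromtxt accepts a file name or a file-like object;
# by default any run of whitespace separates the columns
sample = np.genfromtxt(io.StringIO(sample_text))
print(sample.shape)   # (3, 7)
print(sample[:, 3])   # the daily average temperature column
```

For a true comma-separated file, pass `delimiter=','` to `np.genfromtxt`.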

In [None]:
#77431 rows and 7 columns
data.shape

In [None]:
#visualize the data
fig, ax = plt.subplots(figsize=(14,4))
ax.plot(data[:,0]+data[:,1]/12.0+data[:,2]/365, data[:,5])
ax.axis('tight')
ax.set_title('temperatures in Stockholm')
ax.set_xlabel('year')
ax.set_ylabel('temperature (C)');

Often it is useful to store datasets in Numpy arrays. Numpy provides a number of functions to calculate statistics of datasets in arrays.

For example, let's calculate some properties from the Stockholm temperature dataset used above.

### mean

In [None]:
# the temperature data is in column 3
np.mean(data[:,3])

The daily mean temperature in Stockholm over the last 200 years has been about 6.2 C.

### standard deviations and variance

In [None]:
np.std(data[:,3]), np.var(data[:,3])

### min and max

In [None]:
# lowest daily average temperature
np.min(data[:,3])

In [None]:
# highest daily average temperature
np.max(data[:,3])

## Computations on subsets of arrays

We can compute with subsets of the data in an array using indexing, fancy indexing, and the other methods of extracting data from an array (described above).

For example, let's go back to the temperature dataset:

In [None]:
!head -n 3 'stockholm_td_adj.dat'

The data format is: year, month, day, daily average temperature, low, high, location.

If we are interested in the average temperature only in a particular month, say April, then we can create an index mask and use it to select only the data for that month:

In [None]:
np.unique(data[:,1]) # the month column takes values from 1 to 12

In [None]:
mask_april = data[:,1] == 4

In [None]:
# the temperature data is in column 3
np.mean(data[mask_april,3])

With these tools we have very powerful data processing capabilities at our disposal. For example, extracting the average temperature for each month of the year takes only a few lines of code:

In [None]:
months = np.arange(1,13)
monthly_mean = [np.mean(data[data[:,1] == month, 3]) for month in months] # the power of list comprehension!

fig, ax = plt.subplots()
ax.bar(months, monthly_mean)
ax.set_xlabel("Month")
ax.set_ylabel("Monthly avg. temp.");

## Specifying the axis on higher-dimensional data

When functions such as min, max, etc. are applied to multidimensional arrays, it is sometimes useful to apply the calculation to the entire array, and sometimes only on a row or column basis. Using the axis argument we can specify how these functions should behave:

In [None]:
m = np.random.rand(3,3)
m

In [None]:
# global max
m.max()

In [None]:
# max in each column
m.max(axis=0)

In [None]:
# max in each row
m.max(axis=1)
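
Because `m` above is random, its output changes on every run. The same three calls on a small fixed array (a made-up `table`, used only for illustration) make the effect of `axis` easier to verify by eye.

```python
import numpy as np

table = np.array([[1, 8, 3],
                  [7, 2, 9]])

print(table.max())        # 9, the largest value overall
print(table.max(axis=0))  # [7 8 9], one value per column
print(table.max(axis=1))  # [8 9], one value per row
```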

### Some other things to try (more advanced level)
(from Python Data Science Handbook)

In [None]:
from scipy import special
# Gamma functions (generalized factorials) and related functions
x = [1, 5, 10]
print("gamma(x)     =", special.gamma(x))
print("ln|gamma(x)| =", special.gammaln(x))
print("beta(x, 2)   =", special.beta(x, 2))

In [None]:
# Error function (integral of Gaussian)
# its complement, and its inverse
x = np.array([0, 0.3, 0.7, 1.0])
print("erf(x)  =", special.erf(x))
print("erfc(x) =", special.erfc(x))
print("erfinv(x) =", special.erfinv(x))

In [None]:
# the 'out' argument (which skips the step of assigning the result to a temporary array)
x = np.arange(5)
y = np.empty(5)
np.multiply(x, 10, out=y)
print(y)

In [None]:
x = np.arange(1, 6)
print(np.add.reduce(x))
print(np.multiply.reduce(x))
print(np.add.accumulate(x))
print(np.multiply.accumulate(x))

In [None]:
# Outer Products
x = np.arange(1, 6)
np.multiply.outer(x, x)

In [None]:
# operations across axes 
M = np.random.random((3, 4))
print(M)
# Axis here refers to the axis that will be collapsed!
print(M.min(axis = 0))
print(M.min(axis = 1))

### Other aggregation functions

NumPy provides many other aggregation functions, but we won't discuss them in detail here.
Additionally, most aggregates have a [``NaN``](https://en.wikipedia.org/wiki/NaN)-safe counterpart that computes the result while ignoring missing values, which are marked by the special IEEE floating-point ``NaN`` value (for a fuller discussion of missing data, see [Handling Missing Data](03.04-Missing-Values.ipynb)).
Some of these ``NaN``-safe functions were not added until NumPy 1.8, so they will not be available in older NumPy versions.

The following table provides a list of useful aggregation functions available in NumPy:

|Function Name      |   NaN-safe Version  | Description                                   |
|-------------------|---------------------|-----------------------------------------------|
| ``np.sum``        | ``np.nansum``       | Compute sum of elements                       |
| ``np.prod``       | ``np.nanprod``      | Compute product of elements                   |
| ``np.mean``       | ``np.nanmean``      | Compute mean of elements                      |
| ``np.std``        | ``np.nanstd``       | Compute standard deviation                    |
| ``np.var``        | ``np.nanvar``       | Compute variance                              |
| ``np.min``        | ``np.nanmin``       | Find minimum value                            |
| ``np.max``        | ``np.nanmax``       | Find maximum value                            |
| ``np.argmin``     | ``np.nanargmin``    | Find index of minimum value                   |
| ``np.argmax``     | ``np.nanargmax``    | Find index of maximum value                   |
| ``np.median``     | ``np.nanmedian``    | Compute median of elements                    |
| ``np.percentile`` | ``np.nanpercentile``| Compute rank-based statistics of elements     |
| ``np.any``        | N/A                 | Evaluate whether any elements are true        |
| ``np.all``        | N/A                 | Evaluate whether all elements are true        |

Source: Python Data Science Handbook
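
A short sketch of the difference between a regular aggregate and its NaN-safe counterpart, using a made-up array of readings with one missing value:

```python
import numpy as np

readings = np.array([1.0, np.nan, 3.0, 5.0])

print(np.mean(readings))     # nan: a single NaN propagates through the result
print(np.nanmean(readings))  # 3.0: the NaN entry is ignored
print(np.nanmax(readings))   # 5.0
```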

# Resources:  
- [Numpy Quickstart Guide](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)
- [Official Numpy Reference](https://docs.scipy.org/doc/numpy/reference/index.html)
- [Rahul Dave's CS109 lab1 content at Harvard](https://github.com/cs109/2015lab1)  
- [The Data Incubator](https://www.thedataincubator.com)  
- [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook)