<img src="http://www.contribute.geeksforgeeks.org/wp-content/uploads/numpy-logo1.jpg" align="left" alt="Drawing" style="width: 80px;"/>
# NumPy 
 Numerical Python

Arrays
- all elements have the same data type
- occupy a contiguous segment of memory (as opposed to lists, which hold pointers to separate objects scattered across memory)
- constant-time access by index (lists also index in constant time, but every element is a separate Python object that has to be looked up and type-checked, so element-wise work over a large `list` is much slower than over an array)
- but inserting or appending elements is inefficient, because the whole array has to be copied to a new block of memory -- hence always preallocate, if you can
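
The properties listed above can be checked directly. The snippet below is a minimal sketch; the variable names (`a`, `mixed`, `squares`, `grown`) are made up for illustration, and the exact integer width of `a.dtype` depends on your platform.

```python
import numpy as np

# One dtype for the whole array, stored in a single memory buffer.
a = np.array([1, 2, 3])
print(a.dtype)                      # integer dtype (width is platform-dependent)
print(a.flags['C_CONTIGUOUS'])      # True: elements sit next to each other in memory

# Mixing types makes NumPy choose one common dtype for every element.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)                  # float64

# Preallocate the final size and fill in place instead of growing step by step.
n = 5
squares = np.empty(n, dtype=float)  # uninitialized buffer of the final size
for i in range(n):
    squares[i] = i ** 2

# np.append does not grow the array in place: it returns a new, copied array.
grown = np.append(a, 4)
print(grown is a)                   # False
print(a)                            # the original array is unchanged
```

Because every `np.append` call copies the data, calling it inside a loop is usually much slower than filling a preallocated array.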

In [1]:
import numpy as np

In [2]:
x = np.array([1,2,3])
print(x)

[1 2 3]


In [3]:
x = np.zeros(3)
print(x)

[ 0.  0.  0.]


In [5]:
x = np.zeros(10, dtype='int')
print(x)
print("ndim: ", x.ndim)
print("shape:", x.shape)
print("size: ", x.size)
print("dtype:", x.dtype)

[0 0 0 0 0 0 0 0 0 0]
ndim:  1
shape: (10,)
size:  10
dtype: int32
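
The same attributes work for arrays with more than one dimension. As a small illustrative sketch (the array `m` is made up for this example), here they are for a 3x4 array of 64-bit floats, together with `itemsize` and `nbytes`, which report the memory used per element and in total:

```python
import numpy as np

m = np.zeros((3, 4), dtype=np.float64)

print("ndim:    ", m.ndim)      # 2 dimensions
print("shape:   ", m.shape)     # (3, 4): 3 rows, 4 columns
print("size:    ", m.size)      # 12 elements in total
print("dtype:   ", m.dtype)     # float64
print("itemsize:", m.itemsize)  # 8 bytes per element
print("nbytes:  ", m.nbytes)    # 96 bytes = size * itemsize
```

Note that `dtype='int'` maps to a platform-dependent integer width (for example `int32` or `int64`); pass an explicit type such as `np.int64` if the width matters.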


In [6]:
x = np.arange(20)
x

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [9]:
x = np.arange(10,30)
x

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, 29])

In [10]:
x[5]

15

In [11]:
x[3]

13

In [12]:
# in Python, indexing starts at 0
x[0]

10

In [13]:
# indexing from the end
x[-1]

29

In [14]:
x[-5]

25

In [21]:
# multidimensional indexing
x2 = np.random.randint(10,size=(3,5))
x2

array([[6, 2, 8, 9, 0],
       [4, 8, 8, 4, 1],
       [7, 9, 5, 5, 3]])

In [22]:
x2[1,4]

1

In [23]:
x2[1,-1]

1

In [24]:
x3 = np.random.randint(10,size=(3,5,7))
x3

array([[[0, 3, 0, 2, 1, 0, 9],
        [4, 9, 9, 8, 8, 2, 5],
        [3, 4, 6, 1, 9, 1, 1],
        [2, 0, 5, 4, 7, 2, 6],
        [5, 9, 8, 3, 1, 4, 5]],

       [[0, 2, 7, 8, 6, 5, 5],
        [8, 0, 8, 1, 4, 5, 9],
        [7, 9, 2, 7, 6, 8, 8],
        [0, 4, 2, 3, 1, 6, 5],
        [5, 4, 3, 7, 8, 6, 6]],

       [[5, 8, 9, 8, 5, 0, 7],
        [5, 8, 4, 0, 7, 6, 8],
        [8, 1, 3, 7, 0, 9, 0],
        [8, 3, 5, 7, 0, 2, 0],
        [8, 7, 9, 6, 5, 6, 3]]])

In [25]:
x3[1,3,4]

1

In [26]:
x2

array([[6, 2, 8, 9, 0],
       [4, 8, 8, 4, 1],
       [7, 9, 5, 5, 3]])

In [27]:
# modifying items
x2[0,0] = 101
x2

array([[101,   2,   8,   9,   0],
       [  4,   8,   8,   4,   1],
       [  7,   9,   5,   5,   3]])

Keep in mind that NumPy arrays have a fixed type, and they will not "upcast" automatically! Assigning a float to an element of an integer array silently truncates the value:

In [28]:
x2[0,0] = 3.1415
x2

array([[3, 2, 8, 9, 0],
       [4, 8, 8, 4, 1],
       [7, 9, 5, 5, 3]])
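
If you do need to store fractional values, one option is to convert the array to a floating-point type first. The sketch below uses `astype`, which returns a new array with the requested type; the names `ints` and `floats` are only for illustration.

```python
import numpy as np

ints = np.array([1, 2, 3])
ints[0] = 3.1415               # the fractional part is silently dropped
print(ints)                    # first element is now 3

floats = ints.astype(float)    # new array with dtype float64; `ints` is untouched
floats[0] = 3.1415
print(floats)                  # first element is now 3.1415
print(floats.dtype)            # float64
```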

## Array slicing
Using *:* within brackets we can access slices of the array with the following pattern:
        
    x[start:stop:step]
    
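Any of these can be omitted. The defaults are `start=0`, `stop=`*size of dimension*, `step=1`. For a negative `step`, the defaults for `start` and `stop` are swapped, so the slice runs from the last element to the first.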

In [35]:
x = np.random.rand(50)
x

array([ 0.29321397,  0.64620599,  0.81297395,  0.38714208,  0.94854791,
        0.94731726,  0.44091809,  0.93350442,  0.76437026,  0.23981492,
        0.70126879,  0.82433103,  0.3051342 ,  0.04662624,  0.19796374,
        0.16194123,  0.08374339,  0.32051101,  0.4834996 ,  0.56243106,
        0.07155169,  0.28387709,  0.39108488,  0.35176688,  0.03825869,
        0.49987322,  0.26915081,  0.5909597 ,  0.48866361,  0.77231087,
        0.84282155,  0.75819683,  0.6441539 ,  0.70216505,  0.94717495,
        0.28269038,  0.82150238,  0.23740336,  0.67533479,  0.84545469,
        0.33172509,  0.00334157,  0.43351097,  0.51702103,  0.27719912,
        0.72409194,  0.11893006,  0.78762225,  0.60021991,  0.1532463 ])

**Note**: this is not a 2-dimensional array. It is 1-dimensional and is only wrapped onto several lines for display. Here is what a 2-dimensional array with the same number of elements looks like; note that each row has its own brackets (it is essentially an array of arrays):

In [37]:
np.random.rand(10,5)

array([[ 0.86576998,  0.53425312,  0.23133677,  0.13195801,  0.98185796],
       [ 0.60188193,  0.61929176,  0.76995121,  0.9666304 ,  0.17754879],
       [ 0.52340889,  0.19142667,  0.54531866,  0.84033475,  0.52400748],
       [ 0.66486732,  0.24953655,  0.7812089 ,  0.16308436,  0.65610586],
       [ 0.15279213,  0.93430499,  0.48696475,  0.8661386 ,  0.00432703],
       [ 0.34346393,  0.17684623,  0.26061919,  0.59420374,  0.61735461],
       [ 0.90334767,  0.17588474,  0.62023914,  0.04278716,  0.20944617],
       [ 0.49858368,  0.04903326,  0.88833496,  0.64771028,  0.07340003],
       [ 0.26422479,  0.97501566,  0.53971943,  0.29902486,  0.53515982],
       [ 0.32080885,  0.43276821,  0.5309956 ,  0.45565513,  0.53532185]])

In [38]:
x = np.random.rand(50)
x

array([ 0.56521065,  0.7490577 ,  0.56090977,  0.24330419,  0.91901016,
        0.47649473,  0.70010838,  0.91975602,  0.01264104,  0.00718346,
        0.21153331,  0.55812251,  0.9060424 ,  0.13196302,  0.96441105,
        0.3779054 ,  0.78419461,  0.67951171,  0.99086808,  0.76915495,
        0.11953132,  0.11429784,  0.81736927,  0.89809105,  0.34834365,
        0.56853237,  0.30090528,  0.84773626,  0.95580238,  0.34881011,
        0.9746324 ,  0.61583875,  0.22735463,  0.91629119,  0.34711097,
        0.10889407,  0.19309785,  0.58833867,  0.24784295,  0.85030467,
        0.87833185,  0.72782677,  0.05862603,  0.06383694,  0.21594808,
        0.01794569,  0.56641526,  0.27368431,  0.00785444,  0.00489954])

In [39]:
# first 5 elements
x[:5]

array([ 0.56521065,  0.7490577 ,  0.56090977,  0.24330419,  0.91901016])

In [40]:
# from 5th to 10th element
x[4:10]

array([ 0.91901016,  0.47649473,  0.70010838,  0.91975602,  0.01264104,
        0.00718346])

In [41]:
# from 11th element until the end
x[10:]

array([ 0.21153331,  0.55812251,  0.9060424 ,  0.13196302,  0.96441105,
        0.3779054 ,  0.78419461,  0.67951171,  0.99086808,  0.76915495,
        0.11953132,  0.11429784,  0.81736927,  0.89809105,  0.34834365,
        0.56853237,  0.30090528,  0.84773626,  0.95580238,  0.34881011,
        0.9746324 ,  0.61583875,  0.22735463,  0.91629119,  0.34711097,
        0.10889407,  0.19309785,  0.58833867,  0.24784295,  0.85030467,
        0.87833185,  0.72782677,  0.05862603,  0.06383694,  0.21594808,
        0.01794569,  0.56641526,  0.27368431,  0.00785444,  0.00489954])

In [42]:
# every second element
x[::2]

array([ 0.56521065,  0.56090977,  0.91901016,  0.70010838,  0.01264104,
        0.21153331,  0.9060424 ,  0.96441105,  0.78419461,  0.99086808,
        0.11953132,  0.81736927,  0.34834365,  0.30090528,  0.95580238,
        0.9746324 ,  0.22735463,  0.34711097,  0.19309785,  0.24784295,
        0.87833185,  0.05862603,  0.21594808,  0.56641526,  0.00785444])

In [43]:
# every third element
x[::3]

array([ 0.56521065,  0.24330419,  0.70010838,  0.00718346,  0.9060424 ,
        0.3779054 ,  0.99086808,  0.11429784,  0.34834365,  0.84773626,
        0.9746324 ,  0.91629119,  0.19309785,  0.85030467,  0.05862603,
        0.01794569,  0.00785444])

In [44]:
# reversed array
x = np.arange(10)
print(x)
print(x[::-1])

[0 1 2 3 4 5 6 7 8 9]
[9 8 7 6 5 4 3 2 1 0]


In [47]:
x2

array([[3, 2, 8, 9, 0],
       [4, 8, 8, 4, 1],
       [7, 9, 5, 5, 3]])

In [49]:
# access third column
x2[:,2]

array([8, 8, 5])
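
The `start:stop:step` pattern applies to each dimension separately, with the dimensions separated by commas. The following sketch uses a small made-up array (`grid`) with predictable values so the results are easy to verify by eye:

```python
import numpy as np

grid = np.arange(15).reshape(3, 5)
# [[ 0  1  2  3  4]
#  [ 5  6  7  8  9]
#  [10 11 12 13 14]]

second_row = grid[1, :]          # [5 6 7 8 9]
third_col = grid[:, 2]           # [ 2  7 12]
block = grid[:2, 1:4]            # rows 0-1, columns 1-3: [[1 2 3], [6 7 8]]
every_other_col = grid[:, ::2]   # columns 0, 2 and 4
rows_reversed = grid[::-1, :]    # same rows, in reverse order

print(block)
print(every_other_col)
print(rows_reversed)
```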

# Distinction between memory views and copies

When you access sub-arrays by slicing, keep in mind that you get a view of the array, not a copy of it! This means that by default the new array is not a separate entity: it accesses the same memory as the original array. Here is a simple example:

In [50]:
x2

array([[3, 2, 8, 9, 0],
       [4, 8, 8, 4, 1],
       [7, 9, 5, 5, 3]])

In [51]:
# get the first two rows and the first two columns
x2_sub = x2[:2,:2]
x2_sub

array([[3, 2],
       [4, 8]])

In [52]:
# modify an element in the new array
x2_sub[1,1] = 99
x2_sub

array([[ 3,  2],
       [ 4, 99]])

In [53]:
# see that the original array also got modified
x2

array([[ 3,  2,  8,  9,  0],
       [ 4, 99,  8,  4,  1],
       [ 7,  9,  5,  5,  3]])

This behavior is very useful for working with datasets, because it saves you a lot of memory. If you need to make a copy of the array, you must do it explicitly:

In [55]:
# make a copy
x2_sub = x2[:2,:2].copy()

In [57]:
# modify a copy and verify that the original array is intact
x2_sub[1,1] = -666
print(x2_sub)
print()
print(x2)

[[   3    2]
 [   4 -666]]

[[ 3  2  8  9  0]
 [ 4 99  8  4  1]
 [ 7  9  5  5  3]]
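
If you are ever unsure whether two arrays share memory, `np.shares_memory` can tell you. The sketch below uses made-up names (`data`, `view`, `copy`, `fancy`); it also shows that indexing with a list of indices (so-called fancy indexing) returns a copy, unlike slicing.

```python
import numpy as np

data = np.arange(12).reshape(3, 4)

view = data[:2, :2]            # slicing: a view on the same memory
copy = data[:2, :2].copy()     # explicit copy: separate memory
fancy = data[[0, 1]]           # indexing with a list of row indices: also a copy

print(np.shares_memory(data, view))    # True
print(np.shares_memory(data, copy))    # False
print(np.shares_memory(data, fancy))   # False

# Writing through the view changes the original; writing to the copy does not.
view[0, 0] = 100
copy[1, 1] = -1
print(data[0, 0], data[1, 1])          # 100 5
```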


# Operations on arrays

We use arrays to speed up computations. The key point is that Python is dynamically typed: when you loop over the elements of an array or a list, the interpreter has to check the type of each element (int, float, string, etc.) at run time to decide which compiled routine to apply to it. Repeating this check for every element is slow. Consider the following piece of code.

In [59]:
def compute_reciprocals(values):
    # preallocate array for reciprocals
    output = np.empty(len(values))
    
    # compute reciprocal of each element
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    
    return output

# example use
values = np.arange(1, 10)
print(values)
compute_reciprocals(values)

[1 2 3 4 5 6 7 8 9]


array([ 1.        ,  0.5       ,  0.33333333,  0.25      ,  0.2       ,
        0.16666667,  0.14285714,  0.125     ,  0.11111111])

Now let's see how much time it takes to run this function on an array of 1 million integers. We use the IPython magic `%timeit`, which runs the statement that follows it several times and reports the time it takes:

In [60]:
big_array = np.random.randint(1, 100, size=1000000)
%timeit compute_reciprocals(big_array)

1 loop, best of 3: 2.87 s per loop


Now we do the same thing, but this time we simply divide `1` by `big_array`. `numpy` automatically applies the division to each element of the array, and it does so in compiled code, which is far more efficient than a Python loop:

In [62]:
%timeit (1/big_array)

100 loops, best of 3: 6.17 ms per loop


Results will vary with your hardware and the current load on your computer, but here is what I got: `2.87 s` for the `compute_reciprocals` function and `6.17 ms` for `(1/big_array)`. That is a roughly 465-fold difference in run time! Imagine that you had a script using `numpy` which ran for 10 seconds. If you had written it the wrong way, using loops, it would take about 1.3 hours to run!
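
`%timeit` only works inside IPython and Jupyter. In a plain Python script, a similar comparison can be made with the standard-library `timeit` module. The snippet below is a rough sketch: it uses a smaller array (100,000 elements) and only three repetitions to keep the run short, so the absolute numbers will differ from the ones above.

```python
import timeit
import numpy as np

def compute_reciprocals(values):
    # Same loop-based function as above: one Python-level division per element.
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

big_array = np.random.randint(1, 100, size=100_000)

repeats = 3
# timeit.timeit returns the total time for `number` calls, so divide to get per-call time.
loop_time = timeit.timeit(lambda: compute_reciprocals(big_array), number=repeats) / repeats
vec_time = timeit.timeit(lambda: 1 / big_array, number=repeats) / repeats

print(f"loop:       {loop_time:.4f} s per call")
print(f"vectorized: {vec_time:.6f} s per call")

# Both approaches produce the same values.
print(np.allclose(compute_reciprocals(big_array), 1 / big_array))   # True
```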

>**Pro-tip**: although we used `/` to compute `1/big_array`, this `/` is actually a shortcut for the function `np.divide`:

In [65]:
np.divide(1,big_array)

array([ 0.02222222,  0.02222222,  0.0212766 , ...,  0.01075269,
        0.01265823,  0.14285714])

>All of NumPy's arithmetic operators have a function associated with them:

| Operator      | Equivalent ufunc    | Description                           |
|---------------|---------------------|---------------------------------------|
|``+``          |``np.add``           |Addition (e.g., ``1 + 1 = 2``)         |
|``-``          |``np.subtract``      |Subtraction (e.g., ``3 - 2 = 1``)      |
|``-``          |``np.negative``      |Unary negation (e.g., ``-2``)          |
|``*``          |``np.multiply``      |Multiplication (e.g., ``2 * 3 = 6``)   |
|``/``          |``np.divide``        |Division (e.g., ``3 / 2 = 1.5``)       |
|``//``         |``np.floor_divide``  |Floor division (e.g., ``3 // 2 = 1``)  |
|``**``         |``np.power``         |Exponentiation (e.g., ``2 ** 3 = 8``)  |
|``%``          |``np.mod``           |Modulus/remainder (e.g., ``9 % 4 = 1``)|

> Why would you want to use the full function notation instead of an operator? There are some circumstances in which the full notation gives you more flexibility. If you do a lot of crunching on very large numerical datasets and run into problems with speed and/or memory, take a look at the [NumPy](http://www.numpy.org) documentation (especially the [ufunc](https://docs.scipy.org/doc/numpy-1.10.0/reference/ufuncs.html) section) and the [SciPy](http://www.scipy.org) documentation.
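
To give a flavor of that extra flexibility, here is a short sketch of a few things the function notation can do that the operators cannot. It is not an exhaustive list, and the variable names are made up for illustration.

```python
import numpy as np

x = np.arange(1, 6)                    # [1 2 3 4 5]

# `out=` writes the result into an existing array instead of allocating a new one,
# which can save memory when the arrays are large.
result = np.empty(5)
np.divide(1, x, out=result)            # result now holds 1/x

# `where=` applies the operation only where the mask is True;
# other positions of `out` keep the value they already had.
doubled_even = x.copy()
np.multiply(x, 2, out=doubled_even, where=(x % 2 == 0))   # [1 4 3 8 5]

# Every ufunc also has methods such as `reduce` and `outer`.
total = np.add.reduce(x)               # 1 + 2 + 3 + 4 + 5 = 15
table = np.multiply.outer(x, x)        # 5x5 multiplication table

print(result)
print(doubled_even)
print(total)
print(table)
```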

# Aggregator functions

Functions that reduce an array (or one dimension of an array) to a single value are called aggregator functions. Some of the most useful are `sum`, `min`, `max`, `mean`, `median` and `std`. Python has built-in versions of some of these functions, but the NumPy versions are much faster on arrays, so prefer them whenever you work with arrays:

In [66]:
big_array = np.random.rand(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)

10 loops, best of 3: 157 ms per loop
1000 loops, best of 3: 949 µs per loop


Most aggregator functions have a sister function with a `nan` prefix, which does the same thing but ignores `NaN` (*Not a Number*) elements. `NaN` is commonly used as a placeholder for missing data, so these functions are very useful when working with real data. We will revisit this in a future lesson.
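
As a quick illustration of the difference, the sketch below uses a small made-up array (`readings`) with one missing value:

```python
import numpy as np

readings = np.array([2.0, np.nan, 4.0, 6.0])

# A single NaN "poisons" the regular aggregates.
print(np.sum(readings))       # nan
print(np.mean(readings))      # nan

# The nan-prefixed versions skip the missing value.
print(np.nansum(readings))    # 12.0
print(np.nanmean(readings))   # 4.0 (mean of the three valid values)
print(np.nanmax(readings))    # 6.0
```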

The following table provides a list of useful aggregation functions available in NumPy:

|Function name      |   NaN-ignoring version  | Description                                   |
|-------------------|---------------------|-----------------------------------------------|
| ``np.sum``        | ``np.nansum``       | Compute sum of elements                       |
| ``np.prod``       | ``np.nanprod``      | Compute product of elements                   |
| ``np.mean``       | ``np.nanmean``      | Compute mean of elements                      |
| ``np.std``        | ``np.nanstd``       | Compute standard deviation                    |
| ``np.var``        | ``np.nanvar``       | Compute variance                              |
| ``np.min``        | ``np.nanmin``       | Find minimum value                            |
| ``np.max``        | ``np.nanmax``       | Find maximum value                            |
| ``np.argmin``     | ``np.nanargmin``    | Find index of minimum value                   |
| ``np.argmax``     | ``np.nanargmax``    | Find index of maximum value                   |
| ``np.median``     | ``np.nanmedian``    | Compute median of elements                    |
| ``np.percentile`` | ``np.nanpercentile``| Compute rank-based statistics of elements     |
| ``np.any``        | N/A                 | Evaluate whether any elements are True        |
| ``np.all``        | N/A                 | Evaluate whether all elements are True        |

We won't discuss each in detail, but feel free to try them for yourself.

Two things are worth mentioning about aggregates.

**First**, some of them (`sum`, `min`, `max` and some others) can be accessed via method notation, like so:

In [67]:
# print min, max and sum of the array
print('Min:', big_array.min())
print('Max:', big_array.max())
print('Sum:', big_array.sum())

Min: 2.08059462026e-06
Max: 0.99999898494
Sum: 499962.810488


And **second**, for multidimensional arrays you can pass the `axis` parameter to aggregate along a specific axis only. By default, these functions aggregate over the whole array:

In [68]:
multi_dim_array = np.random.randint(100, size=(5,10))
multi_dim_array

array([[92, 47, 97, 77, 27, 21, 99, 43, 80, 71],
       [10, 31,  1, 30, 96, 93, 73, 33, 92, 32],
       [44, 18, 49,  9, 91, 71, 39, 99, 93,  8],
       [35, 88, 66, 91, 54, 30, 17, 72,  4, 67],
       [57, 56, 37, 55,  0, 63, 53, 59, 72, 50]])

In [69]:
# default behavior gives maximum of all elements of the array 
np.max(multi_dim_array)

99

In [70]:
# specifying axis gives you control over which dimension is aggregated;
# in this particular case, the function will give max of every column 
np.max(multi_dim_array, axis=0)

array([92, 88, 97, 91, 96, 93, 99, 99, 93, 71])
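
A useful way to remember this: the axis you pass is the one that gets collapsed. With `axis=0` the rows are collapsed, leaving one value per column; with `axis=1` the columns are collapsed, leaving one value per row. The sketch below uses a small made-up array (`scores`) so the results can be checked by hand:

```python
import numpy as np

scores = np.array([[1, 5, 3],
                   [7, 2, 9]])

col_max = np.max(scores, axis=0)      # max of every column: [7 5 9]
row_max = np.max(scores, axis=1)      # max of every row:    [5 9]
row_sum = scores.sum(axis=1)          # method notation works too: [ 9 18]
row_argmax = scores.argmax(axis=1)    # column index of each row's max: [1 2]

# keepdims=True keeps the collapsed axis with length 1, so the shape is (2, 1)
row_max_2d = scores.max(axis=1, keepdims=True)

print(col_max, row_max, row_sum, row_argmax)
print(row_max_2d.shape)
```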