### NumPy performance is a lot better because it is an API to algorithms written in C

In [1]:
import numpy as np


In [2]:
my_arr = np.arange(1000000)

In [3]:
my_list = list(range(1000000))

In [4]:
%time for _ in range(10): my_arr * 2

CPU times: user 31.2 ms, sys: 15.6 ms, total: 46.9 ms
Wall time: 21.1 ms


In [5]:
%time for _ in range(10): my_list2 = [x * 2 for x in my_list]

CPU times: user 234 ms, sys: 250 ms, total: 484 ms
Wall time: 487 ms
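The same comparison can be repeated outside IPython with the standard-library `timeit` module. This is a minimal sketch: the smaller size `n` and the variable names `numpy_seconds`, `list_seconds` and `same_result` are my own choices, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

n = 100_000  # smaller than the 1,000,000 used above, only to keep the run short
my_arr = np.arange(n)
my_list = list(range(n))

# Time 10 repetitions of each approach, like the %time cells above
numpy_seconds = timeit.timeit(lambda: my_arr * 2, number=10)
list_seconds = timeit.timeit(lambda: [x * 2 for x in my_list], number=10)

# Both approaches produce the same values
same_result = (my_arr * 2).tolist() == [x * 2 for x in my_list]

print(f"NumPy: {numpy_seconds:.4f} s, list comprehension: {list_seconds:.4f} s")
print(same_result)
```

On most machines the NumPy line should report a clearly smaller time, but the exact ratio depends on the hardware and the NumPy build.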


### ndarray - a multidimensional array object

<p>It is a fast and flexible container for large datasets in Python.</p>

<p>Arrays enable you to perform math operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements.</p>

In [6]:
# example 
# first generate sample data

data = np.random.randn(2,3)

In [7]:
data

array([[ 0.50131196,  1.05539851, -0.60223058],
       [ 0.24812627,  3.01095585, -2.43963855]])

In [8]:
# I then write mathematical operations with data

data * 10

array([[  5.01311956,  10.55398514,  -6.02230578],
       [  2.48126274,  30.10955849, -24.39638551]])

In [9]:
data + data

array([[ 1.00262391,  2.11079703, -1.20446116],
       [ 0.49625255,  6.0219117 , -4.8792771 ]])

<p>An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array.</p>

In [10]:
data.shape

(2, 3)

In [11]:
data.dtype

dtype('float64')
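To see what "homogeneous" means in practice, here is a small sketch (the names `mixed` and `grid` are made up for the illustration). When the input mixes integers and floats, NumPy picks one dtype that can hold all of the values.

```python
import numpy as np

mixed = np.array([1, 2.5, 3])        # two ints and one float
print(mixed.dtype)                   # float64: the ints were upcast
print(mixed.ndim, mixed.shape)       # 1 (3,)

grid = np.array([[1, 2, 3], [4, 5, 6]])
print(grid.ndim, grid.shape, grid.size)   # 2 (2, 3) 6
```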

### Creating ndarrays

In [15]:
data1 = [1,5,7,3,9]
arr1 = np.array(data1)
arr1

array([1, 5, 7, 3, 9])

Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array

In [16]:
data2 = [[1,3,6,9],[2,5,8,0]]
arr2 = np.array(data2)
arr2

array([[1, 3, 6, 9],
       [2, 5, 8, 0]])

Since data2 was a list of lists, the NumPy array arr2 has two dimensions, with the shape inferred from the data. We can inspect the shape like this

In [18]:
arr2.ndim # number of dimensions

2

In [20]:
arr2.shape # shape of data (2 rows with 4 entries each)

(2, 4)

In [27]:
arr2.dtype # NumPy will attempt to infer a suitable data type from the data

dtype('int64')

There are a number of other ways to create arrays. Here are a few

In [21]:
np.zeros(10) # creates an array of 10 zeros

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [22]:
np.zeros((3,6)) # creates 3 rows with 6 columns

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

In [28]:
np.ones(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

> Take notice of the type shown in the output: array. This is a NumPy type and is not a list/set/etc., even though it looks like one

In [23]:
np.empty((2,3,2)) # allocates 2 tables with 3 rows and 2 columns without initializing them, so the values are arbitrary

array([[[1.96331648e-316, 0.00000000e+000],
        [1.01855798e-312, 9.54898106e-313],
        [1.16709769e-312, 1.01855798e-312]],

       [[1.23075756e-312, 1.06099790e-312],
        [1.12465777e-312, 9.76118064e-313],
        [1.16709769e-312, 1.90979621e-312]]])

In [24]:
np.arange(15)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
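A few more creation functions, as a short sketch. `np.full`, `np.zeros_like` and `np.eye` are not used elsewhere in these notes, and the variable names are only for illustration.

```python
import numpy as np

sevens = np.full((2, 3), 7)       # 2x3 array where every element is 7
like = np.zeros_like(sevens)      # zeros with the same shape and dtype as sevens
identity = np.eye(3)              # 3x3 identity matrix (float64)
evens = np.arange(0, 10, 2)       # start, stop (exclusive), step

print(sevens.shape, like.shape)   # (2, 3) (2, 3)
print(evens)                      # [0 2 4 6 8]
```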

## Data types for ndarrays

The data type or dtype is a special object containing the information (metadata) the ndarray needs to interpret a chunk of memory as a particular type of data.

dtypes are a source of NumPy's flexibility for interacting with data coming from other sources/systems

In [32]:
arr1 = np.array([1,2,3], dtype=np.float64) # setting the type to 64-bit float
arr2 = np.array([1,2,3], dtype=np.int32) # setting the type to 32-bit integer

In [39]:
arr1.dtype

dtype('float64')

In [40]:
arr2.dtype

dtype('int32')
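Because the dtype tells NumPy how to interpret the underlying memory, it also fixes how many bytes each element takes. A small sketch using the standard `itemsize` and `nbytes` attributes (the names `as_float` and `as_int` are illustrative):

```python
import numpy as np

as_float = np.array([1, 2, 3], dtype=np.float64)
as_int = np.array([1, 2, 3], dtype=np.int32)

print(as_float.itemsize, as_float.nbytes)   # 8 24  (8 bytes per element, 3 elements)
print(as_int.itemsize, as_int.nbytes)       # 4 12  (4 bytes per element, 3 elements)
```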

You can cast an array from one dtype to another using ndarray's ```astype``` method

In [41]:

arr = np.array([1,2,3,4,5])
arr.dtype

dtype('int64')

In [43]:
float_arr = arr.astype(np.float64)
float_arr.dtype

dtype('float64')

In [52]:
numeric_strings = np.array(['1.25', '9.6', '42'], dtype=np.bytes_) # np.string_ was removed in NumPy 2.0; np.bytes_ is the same type

In [54]:
numeric_strings.astype(dtype=float)

array([ 1.25,  9.6 , 42.  ])

> PS! Calling ```astype``` always creates a new array (a copy), even if the new dtype is the same as the old one
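Two things worth checking about ```astype```, sketched below with made-up names: casting floats to integers drops the fractional part, and the result is a separate array, so changing it leaves the original alone.

```python
import numpy as np

floats = np.array([3.7, -1.2, 0.5, 10.9])
ints = floats.astype(np.int32)         # the fractional part is discarded
print(ints)                            # [ 3 -1  0 10]

same_type = floats.astype(np.float64)  # same dtype, still a new array
same_type[0] = 99.0
print(floats[0])                       # 3.7
print(np.shares_memory(floats, same_type))   # False
```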

### Arithmetic with numpy arrays

Arrays are important because they enable you to express batch operations on data without writing any for loops.
<p>NumPy users call this <i>vectorization</i>.</p>

In [58]:
arr = np.array([[1.,2.,3.],[4.,5.,6.]])
arr

array([[1., 2., 3.],
       [4., 5., 6.]])

In [59]:
arr * arr

array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

In [60]:
arr - arr

array([[0., 0., 0.],
       [0., 0., 0.]])

In [61]:
1 / arr

array([[1.        , 0.5       , 0.33333333],
       [0.25      , 0.2       , 0.16666667]])

In [62]:
arr ** 0.5

array([[1.        , 1.41421356, 1.73205081],
       [2.        , 2.23606798, 2.44948974]])

In [64]:
# Comparisons between arrays of the same size yield boolean arrays

arr2 = np.array([[0.,4.,1.],[7.,2.,12]])
arr2

array([[ 0.,  4.,  1.],
       [ 7.,  2., 12.]])

In [65]:
arr2 > arr

array([[False,  True, False],
       [ True, False,  True]])
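To make the idea of vectorization concrete, this sketch writes the element-wise multiplication once with explicit Python loops and once as a single array expression, and checks that both give the same result. The names `looped` and `vectorized` are only for the illustration.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

# Explicit loops over rows and columns
looped = np.empty_like(arr)
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        looped[i, j] = arr[i, j] * arr[i, j]

# The same computation as one vectorized expression
vectorized = arr * arr

print(np.array_equal(looped, vectorized))   # True
```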

## Basic indexing and slicing

Array indexing is a rich topic. There are many ways you may want to select a ***subset*** or ***individual elements*** of your data.

In [66]:
arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [67]:
arr[5]

5

In [68]:
arr[5:8]

array([5, 6, 7])

In [69]:
arr[5:8] = 12
arr

array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

<p>The first important distinction between a NumPy array and a Python list is that array slices are <b>views</b> on the original array. This means that the data is <b>not copied</b> and any modifications to the view will be reflected in the source array.</p>

In [70]:
arr_slice = arr[5:8]
arr_slice

array([12, 12, 12])

In [71]:
arr_slice[1] = 12345
arr

array([    0,     1,     2,     3,     4,    12, 12345,    12,     8,
           9])

In [72]:
# The bare slice [:] will assign to all values in an array

arr_slice[:] = 64
arr

array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])
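If you want a copy of a slice instead of a view, you have to ask for it with ```.copy()```. The sketch below compares the two and also shows that a Python list slice is always a copy. `np.shares_memory` is used here only as a convenient way to check; the variable names are illustrative.

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]             # a view: shares memory with arr
copy = arr[5:8].copy()      # an independent copy

copy[:] = -1                # does not touch arr
view[:] = 64                # writes through to arr

print(arr)                            # [ 0  1  2  3  4 64 64 64  8  9]
print(np.shares_memory(arr, view))    # True
print(np.shares_memory(arr, copy))    # False

# Python lists behave differently: slicing a list copies the elements
lst = list(range(10))
lst_slice = lst[5:8]
lst_slice[0] = 64
print(lst[5])                         # 5
```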

#### Higher dimensional arrays have many more options
<p>In a two-dimensional array the elements at each index are no longer scalars but rather one-dimensional arrays</p>

In [73]:
arr2d = np.array([[1,2,3],[4,5,6],[7,8,9]])

In [74]:
arr2d[2]

array([7, 8, 9])

You can pass a comma-separated list of indices to select individual elements. The following two expressions are equivalent

In [75]:
arr2d[0][2]

3

In [76]:
arr2d[0,2]

3

> For indexing on a two-dimensional array it is useful to think of axis 0 as rows and axis 1 as columns

In [81]:
arr3d = np.array([[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]]])

In [82]:
arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [84]:
arr3d[0] # note that this is now a 2x3 array

array([[1, 2, 3],
       [4, 5, 6]])

In [85]:
# Both scalar values and arrays can be assigned to arr3d[0]

old_values = arr3d[0].copy()

In [86]:
arr3d[0] = 42

In [87]:
arr3d

array([[[42, 42, 42],
        [42, 42, 42]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [88]:
arr3d[0] = old_values

In [89]:
arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [93]:
# similarly, arr3d[1,0] gives you all of the values whose indices start with (1,0), forming a 1-dim array
arr3d[1,0]

array([7, 8, 9])

In [95]:
# it is the same as doing it in two operations
x = arr3d[1]
x[0]

array([7, 8, 9])
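A quick way to follow what each extra index does is to look at the shape after every step. Each integer index removes one dimension. This is only a sketch that reuses the same 2 x 2 x 3 values as above.

```python
import numpy as np

arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

print(arr3d.shape)           # (2, 2, 3)
print(arr3d[1].shape)        # (2, 3)
print(arr3d[1, 0].shape)     # (3,)
print(arr3d[1, 0, 2])        # 9

# arr3d[1, 0] and arr3d[1][0] select the same elements
print(np.array_equal(arr3d[1, 0], arr3d[1][0]))   # True
```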

#### Indexing with slices


In [97]:
# Slicing one-dim arrays works just like slicing Python lists
arr[1:6]

array([ 1,  2,  3,  4, 64])

In [98]:
# Two-dimensional arrays are a bit different. As you can see, the slice selects along axis 0 (the first axis).

arr2d[0:2]

array([[1, 2, 3],
       [4, 5, 6]])

In [99]:
# You can pass multiple slices (here selecting the first two rows and the last two columns)

arr2d[:2, 1:]

array([[2, 3],
       [5, 6]])

> By mixing integer indexes and slices you get lower-dimensional slices

In [100]:
# For example, I can select the second row but only the first two columns like so:

arr2d[1, :2]

array([4, 5])

In [103]:
# Similarly, I can select the third column but only the first two rows like so: 

arr2d[:2,2]

array([3, 6])

Above we select rows 0 and 1 with the [:2] slice and then only the third column with the index 2, giving us 3 and 6

In [108]:
arr2d[:, :1] # select all rows, but only the first column

array([[1],
       [4],
       [7]])

In [110]:
arr2d[:2, 1:] = 0 # assigning to a slice expression assigns to the whole selection
arr2d

array([[1, 0, 0],
       [4, 0, 0],
       [7, 8, 9]])
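The difference between an integer index and a slice of length one is easy to miss, so here is a small sketch. An integer index drops the axis, while a slice keeps it. The names `row_by_index` and `row_by_slice` are made up for the example.

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

row_by_index = arr2d[1, :2]      # integer index on axis 0: that axis disappears
row_by_slice = arr2d[1:2, :2]    # slice on axis 0: that axis is kept with length 1

print(row_by_index, row_by_index.shape)   # [4 5] (2,)
print(row_by_slice, row_by_slice.shape)   # [[4 5]] (1, 2)
```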

### Boolean Indexing
Let's consider two arrays: the first contains names (with duplicates) and the second contains floating-point numbers

In [111]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

In [112]:
data = np.random.randn(7,4)

In [113]:
names

array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')

In [114]:
data

array([[-0.45842291, -0.66935973,  1.82306529,  2.50555609],
       [-0.51420392, -0.07446451,  0.21598591, -1.12409716],
       [ 1.05399223,  1.60989212,  0.40741886,  0.22565729],
       [ 0.39924425, -0.3701473 ,  0.92880454, -0.08545965],
       [ 2.0296943 , -0.47769433,  0.80240461, -1.14444456],
       [ 0.2114909 ,  0.62040077,  1.187415  ,  0.85313084],
       [-0.9857196 ,  1.12338668,  1.38131975, -1.27044535]])

Suppose each name corresponds to a row in the data array and we want to select all the rows with the corresponding name 'Bob'.
Like arithmetic operations, comparisons (such as ==) with arrays are also vectorized. Thus comparing names with the string 'Bob' yields a boolean array

In [118]:
names == 'Bob'

array([ True, False, False,  True, False, False, False])

In [119]:
# This boolean array can be passed when indexing the array
data[names == 'Bob']

array([[-0.45842291, -0.66935973,  1.82306529,  2.50555609],
       [ 0.39924425, -0.3701473 ,  0.92880454, -0.08545965]])

In [120]:
# In these examples I select from the rows where names == 'Bob' AND index the columns too;
data[names == 'Bob', 2:]

array([[ 1.82306529,  2.50555609],
       [ 0.92880454, -0.08545965]])

In [122]:
data[names == 'Bob', 3]

array([ 2.50555609, -0.08545965])

In [121]:
# to select everything except 'Bob' you can use !=
data[names != 'Bob', 2:]

array([[ 0.21598591, -1.12409716],
       [ 0.40741886,  0.22565729],
       [ 0.80240461, -1.14444456],
       [ 1.187415  ,  0.85313084],
       [ 1.38131975, -1.27044535]])

To select two of the three names, combine multiple boolean conditions with boolean arithmetic operators like & (and) and | (or)

In [123]:
mask = (names == 'Bob') | (names == 'Will')
mask

array([ True, False,  True,  True,  True, False, False])

In [124]:
data[mask]

array([[-0.45842291, -0.66935973,  1.82306529,  2.50555609],
       [ 1.05399223,  1.60989212,  0.40741886,  0.22565729],
       [ 0.39924425, -0.3701473 ,  0.92880454, -0.08545965],
       [ 2.0296943 , -0.47769433,  0.80240461, -1.14444456]])

> Selecting data with boolean indexing always creates a copy of the data, even if the returned array is unchanged

> The Python keywords ```and``` and ```or``` do not work with boolean arrays. Use & and | instead.
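Both notes can be checked with a short sketch. The integer `data` array and the names `raised`, `both`, `not_bob` and `selected` are my own choices for the illustration. Note the parentheses around each comparison: & and | bind more tightly than ==, so they are needed.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.arange(28).reshape((7, 4))

# `and` tries to reduce each array to a single True/False, which is ambiguous
try:
    (names == 'Bob') and (names == 'Will')
    raised = False
except ValueError:
    raised = True
print(raised)                   # True

both = (names == 'Bob') | (names == 'Will')
not_bob = ~(names == 'Bob')     # ~ negates a boolean array, same as names != 'Bob'
print(both.tolist())            # [True, False, True, True, True, False, False]

# The selection is a copy: changing it leaves data untouched
selected = data[names == 'Bob']
selected[:] = -1
print(data[0].tolist())         # [0, 1, 2, 3]
```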

In [125]:
# Setting values with boolean arrays works in a common-sense way:
data[data < 0] = 0
data

array([[0.        , 0.        , 1.82306529, 2.50555609],
       [0.        , 0.        , 0.21598591, 0.        ],
       [1.05399223, 1.60989212, 0.40741886, 0.22565729],
       [0.39924425, 0.        , 0.92880454, 0.        ],
       [2.0296943 , 0.        , 0.80240461, 0.        ],
       [0.2114909 , 0.62040077, 1.187415  , 0.85313084],
       [0.        , 1.12338668, 1.38131975, 0.        ]])

In [126]:
# Setting whole rows or columns using a one-dim boolean array is also easy
data[names != 'Joe'] = 7
data

array([[7.        , 7.        , 7.        , 7.        ],
       [0.        , 0.        , 0.21598591, 0.        ],
       [7.        , 7.        , 7.        , 7.        ],
       [7.        , 7.        , 7.        , 7.        ],
       [7.        , 7.        , 7.        , 7.        ],
       [0.2114909 , 0.62040077, 1.187415  , 0.85313084],
       [0.        , 1.12338668, 1.38131975, 0.        ]])

### Fancy indexing

Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays

In [127]:
# suppose we have an 8 x 4 array

arr = np.empty((8,4))


In [128]:
for i in range(8):
    arr[i] = i

In [129]:
arr

array([[0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [2., 2., 2., 2.],
       [3., 3., 3., 3.],
       [4., 4., 4., 4.],
       [5., 5., 5., 5.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]])

In [130]:
# to select a subset of the rows in a particular order we can pass 
# in a list or ndarray of integers that specify the order

arr[[4,3,0,6]]

array([[4., 4., 4., 4.],
       [3., 3., 3., 3.],
       [0., 0., 0., 0.],
       [6., 6., 6., 6.]])

In [131]:
# using negative indices selects rows from the end

arr[[-3,-5,-7]]

array([[5., 5., 5., 5.],
       [3., 3., 3., 3.],
       [1., 1., 1., 1.]])

In [132]:
# passing multiple index arrays does something slightly different.
# it selects a one-dim array of elements corresponding to each tuple of indices.

arr = np.arange(32).reshape((8,4))
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])

In [133]:
arr[[1,5,7,2],[0,3,1,2]] # row 1 col 0, row 5 col 3, etc. In tuple form it would be (1,0), (5,3), (7,1), (2,2)

array([ 4, 23, 29, 10])
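If what you actually want is the rectangular block formed by those rows and columns, one way is to index in two steps: pick the rows first, then reorder the columns. The sketch below compares this with the pairwise selection; the names `pairs` and `block` are illustrative. Like boolean indexing, fancy indexing copies the data.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))

pairs = arr[[1, 5, 7, 2], [0, 3, 1, 2]]       # one element per (row, col) pair
block = arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]    # select rows, then reorder columns

print(pairs)    # [ 4 23 29 10]
print(block)
# [[ 4  7  5  6]
#  [20 23 21 22]
#  [28 31 29 30]
#  [ 8 11  9 10]]

print(np.shares_memory(arr, block))           # False: the result is a copy
```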

## Transposing arrays and swapping Axes

Transposing is a special form of reshaping that similarly returns a view on the underlying data without copying anything. Arrays have the ```transpose``` method and also the special ```T``` attribute. 

In [134]:
arr = np.arange(15).reshape((3,5))
arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [135]:
arr.T

array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])

In [136]:
arr = np.random.randn(6,3)
arr

array([[-0.59544336,  0.81471705,  0.04850572],
       [ 1.6171196 ,  1.40502565,  0.05916545],
       [-1.70272032,  1.39216506, -1.07644825],
       [-0.28560356, -0.85587397, -0.45892962],
       [-0.27074132,  0.05056587,  0.83200664],
       [-0.74785268, -0.44768124,  0.80896206]])

In [138]:
np.dot(arr.T, arr)

# np.dot computes the matrix product (from linear algebra):
# arr.T (3 x 6) times arr (6 x 3) gives a 3 x 3 matrix

array([[ 6.58303895, -0.01794109,  1.20051443],
       [-0.01794109,  5.51148016, -1.30324631],
       [ 1.20051443, -1.30324631,  2.72186526]])
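To check what ```np.dot``` does here, this sketch compares it with the ```@``` operator and with one entry computed by hand. The seeded generator is only there so the sketch is repeatable (the cell above uses `np.random.randn`), and the names `gram` and `manual_01` are made up.

```python
import numpy as np

rng = np.random.default_rng(0)       # seeded only for repeatability
arr = rng.standard_normal((6, 3))

gram = np.dot(arr.T, arr)            # (3, 6) times (6, 3) gives (3, 3)
print(gram.shape)                    # (3, 3)
print(np.allclose(gram, arr.T @ arr))   # True: @ is the matrix product operator

# Entry [0, 1] is the dot product of column 0 and column 1 of arr
manual_01 = sum(arr[k, 0] * arr[k, 1] for k in range(6))
print(np.isclose(gram[0, 1], manual_01))   # True
```

This also explains why the output above is symmetric: entry [i, j] and entry [j, i] are the dot product of the same two columns.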

In [142]:
# For higher dimensional arrays transpose will accept a tuple of axis numbers to permute the axes

arr = np.arange(16).reshape((2,2,4))
arr

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],

       [[ 8,  9, 10, 11],
        [12, 13, 14, 15]]])

In [143]:
arr.transpose((1,0,2))

array([[[ 0,  1,  2,  3],
        [ 8,  9, 10, 11]],

       [[ 4,  5,  6,  7],
        [12, 13, 14, 15]]])
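With the tuple (1, 0, 2) the first two axes trade places and the last axis stays put, so element [i, j, k] of the result is element [j, i, k] of the original. A small sketch to check that; the second array `other` has three different axis lengths only to make the change in shape visible.

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))
swapped = arr.transpose((1, 0, 2))

print(swapped[0, 1, 2], arr[1, 0, 2])   # 10 10

other = np.arange(24).reshape((2, 3, 4))
print(other.transpose((1, 0, 2)).shape)  # (3, 2, 4)
```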

Simple transposing with .T is a special case of swapping axes.
ndarray has the method ```swapaxes```, which takes a pair of axis numbers and switches the indicated axes to rearrange the data

In [145]:
arr

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],

       [[ 8,  9, 10, 11],
        [12, 13, 14, 15]]])

In [146]:
arr.swapaxes(1,2) 

array([[[ 0,  4],
        [ 1,  5],
        [ 2,  6],
        [ 3,  7]],

       [[ 8, 12],
        [ 9, 13],
        [10, 14],
        [11, 15]]])

> ```swapaxes``` returns a view on the data without making a copy
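The view behaviour can be verified with a short sketch: writing through the swapped array changes the original, and for a two-dimensional array ```.T``` gives the same result as swapping axes 0 and 1. The names `swapped` and `m` are illustrative.

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))
swapped = arr.swapaxes(1, 2)

print(swapped.shape)                    # (2, 4, 2)
print(np.shares_memory(arr, swapped))   # True

swapped[0, 0, 0] = -1                   # write through the view
print(arr[0, 0, 0])                     # -1

# For a 2-D array, .T is the same as swapping axes 0 and 1
m = np.arange(15).reshape((3, 5))
print(np.array_equal(m.T, m.swapaxes(0, 1)))   # True
```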