## Making ranges

In [3]:
type(range(15))

range

In [4]:
x = list(range(15))
x

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

In [5]:
L = [3,5,7,8]
for ind in range(len(L)):
    print(ind)

0
1
2
3


In [6]:


for ind in range(len(L)):
    print(L[ind])

3
5
7
8


In [7]:
list(range(1,5))

[1, 2, 3, 4]

In [1]:
list(range(5,10))

[5, 6, 7, 8, 9]

In [9]:
list(range(5,100,5))

[5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]

In [10]:
list(range(10,0,-1))

[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

In [11]:
list(range(10,-1,-1))

[10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

In [12]:
type(x)

list

In [13]:
L = list(range(10,-1,-1))
print(L)
L[2:8]

[10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]


[8, 7, 6, 5, 4, 3]

Ranges are like slices: both take a start, a stop (exclusive), and an optional step.

In [14]:
L[8:2:-1]

[2, 3, 4, 5, 6, 7]

In [15]:
L

[10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

In [16]:
L[::-1]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
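
To make the parallel between ranges and slices concrete, here is a minimal sketch (the names `by_slice` and `by_range` are just for illustration). A slice `L[start:stop:step]` visits the same index positions that `range(start, stop, step)` generates.

```python
L = list(range(10, -1, -1))   # [10, 9, 8, ..., 0]

# Take elements with a slice ...
by_slice = L[8:2:-1]

# ... and take the same elements by indexing with each number in the matching range.
by_range = [L[i] for i in range(8, 2, -1)]

print(by_slice)
print(by_range)
```

Both lines print the same list, because the slice and the range step through the indices 8, 7, 6, 5, 4, 3.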

## 2.  Numpy arrays

In [69]:
import numpy as np
a = np.arange(15)
print(a)
a

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]


array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

This is called an **array**, and for now you can think of it as a sequence whose elements all have one type.  You can do various sequence-like things with arrays, such as indexing by number and taking slices.

But we haven't yet seen what arrays really can do.   In addition to these sequence-like arrays there are also **2D arrays**, which are **tables of numbers**.  Here both row and column structure matter.

In [18]:
LL_list = list(range(15))
print(LL_list)
LL = np.array([LL_list[0:5],LL_list[5:10],LL_list[10:15]])
print(LL)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]


In [19]:
a[13]

13

In [20]:
LL[2,3]

13

In [21]:
LL[2,:]

array([10, 11, 12, 13, 14])

In [22]:
LL[2,0:2]

array([10, 11])

In [23]:
print(a)
a[2]

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]


2

Here is a simpler way to define `LL`.  Use `np.arange`, which is like the built-in Python `range`, except that it produces an array.  More on `reshape` below.

In [24]:
LL2 = np.arange(15).reshape((3,5))
print(LL2)

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]


`a_1d` is a 1D array.  `a` is a 2D array, meaning it has rows and columns.

In [25]:
a_1d = np.arange(15)
print(a_1d)
a = a_1d.reshape(3, 5)
print(a)
b = np.arange(15).reshape(5, 3)
a = np.arange(15).reshape(3, 5)
print(b)
print(b.transpose())

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]
 [12 13 14]]
[[ 0  3  6  9 12]
 [ 1  4  7 10 13]
 [ 2  5  8 11 14]]


The need for 2D arrays is obvious if you've taken a linear algebra class.  They correspond to the mathematical object called a **matrix**.  One application for matrices is in solving systems of equations, but this really only scratches the surface.  A more fundamental application for a data scientist is that a matrix can represent all that is known about a single data set.  Generally each row represents an item (an individual or event in the data), and the entry in each column is its value for a particular attribute. For example, let's say each room in a hotel has lamps, tables, chairs, and beds, but in varying numbers.  We might represent the inventory of items in a 5-room hotel with a 5x4 matrix (5 rows, 4 columns), as follows:

```
6  3 4 1
5  2 3 2
8  3 6 2
5  1 3 1
10 4 7 2
```

So the first row represents a room with 6 lamps, 3 tables, 4 chairs, and 1 bed.
Now suppose we represent the cost of each item as a 1D cost array (or **vector**, to use the mathematical term),

```
40 175 90 450
```

where the costs are ordered in the same way as our columns above: lamp cost, table cost, chair cost, and bed cost.  Then we can compute the cost of the first room as follows:

In [26]:
6*40 + 3*175 + 4*90 + 1*450

1575

Now the computation above, the cost of the items in the first room, can also be done as the "dot product" (or "dot") of the first row of `room_matrix` and the cost vector.  The dot product of two 1D arrays is just the sum of the products of the corresponding terms in the two arrays (they need to be the same length).  That is,

```
6*40 + 3*175 + 4*90 + 1*450
```

In [27]:
room_matrix = \
np.array(
[[6,  3, 4, 1],
[5,  2, 3, 2],
[8,  3, 6, 2],
[5,  1, 3, 1],
[10, 4, 7, 2]])

cost_vector = np.array([40, 175, 90, 450])

In [28]:
cost_vector.dot(room_matrix[0,:])

1575

If we ask for the dot product of an M x N array A with a 1D N-array B, the result is a 1D array containing the dot product of the M rows of A with B.  Applied to our example, the dot product of the room matrix with the cost vector yields the costs of the 5 rooms.

In [29]:
print(room_matrix)
print(cost_vector)
print(room_matrix.dot(cost_vector))

[[ 6  3  4  1]
 [ 5  2  3  2]
 [ 8  3  6  2]
 [ 5  1  3  1]
 [10  4  7  2]]
[ 40 175  90 450]
[1575 1720 2285 1095 2630]


The mathematical name for the `dot` method, which computes the per room cost here, is **matrix multiplication**.
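
As a check on that description, the sketch below computes the per-room costs twice: once by hand, one row at a time, and once with `dot`. The names `by_hand` and `by_dot` are illustrative only. The `@` operator is another spelling of the same operation.

```python
import numpy as np

room_matrix = np.array(
    [[6,  3, 4, 1],
     [5,  2, 3, 2],
     [8,  3, 6, 2],
     [5,  1, 3, 1],
     [10, 4, 7, 2]])
cost_vector = np.array([40, 175, 90, 450])

# One dot product per row, written out as "multiply corresponding terms, then add".
by_hand = [int(sum(n * c for n, c in zip(row, cost_vector)))
           for row in room_matrix]

# The same five numbers in a single call.
by_dot = room_matrix.dot(cost_vector)

print(by_hand)
print(by_dot)
print(room_matrix @ cost_vector)   # @ is an operator form of the same product
```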

Arrays have many attributes.

In [30]:
print(a.shape)
print(a.ndim)          # a is a 2D array
print(a.dtype.name)    # np.int64 maxint = 2**63 - 1
print(a.itemsize)      # Such an int takes up 8 bytes of memory
print(a.size)          # Number of elements
print(type(a))
print(np.ndarray)

(3, 5)
2
int64
8
15
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


The Python interpreter and the `print` function display arrays differently. (Any Python type can define separate forms for the two, but most types look the same either way.)  In both cases, arrays look different from lists.

In [31]:
b = np.array([6, 7, 8])
print(b)
print(type(b))
b

[6 7 8]
<class 'numpy.ndarray'>


array([6, 7, 8])

A sequence of sequences can be used to define an array, but since the inner sequences are rows, they must all be the same length.

In [32]:
b = np.array( [ (1.5,2,3), (4,5,6) ] )
b

array([[1.5, 2. , 3. ],
       [4. , 5. , 6. ]])

A very convenient way to fill an array is to start with an array containing all 0's or 1's and then update the contents:

In [35]:
X = np.zeros( (3,4) )
X

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

## Reshaping of an array 

```
A.reshape((m,n))
```

means recasting the data in `A` into
m rows and n columns.  This means `A` has to have m x n
cells.  Reshaping can be done on any array that has the right number
of cells.  For example, a 2D 3x4 array can be reshaped into a 6x2
array.

In [36]:
Y = np.arange(12).reshape((3,4))
print(Y)
Y[2,3]

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]


11
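
A small sketch of the cell-count rule: the 3x4 array below has 12 cells, so it can be reshaped to 6x2 but not to 5x2. The name `W` is illustrative.

```python
import numpy as np

Y = np.arange(12).reshape((3, 4))   # 12 cells

W = Y.reshape((6, 2))               # 6 x 2 = 12 cells, so this works
print(W.shape)

try:
    Y.reshape((5, 2))               # 5 x 2 = 10 cells, so this cannot work
except ValueError as err:
    print("reshape failed:", err)
```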

## Indexing, slicing, updating

Indexing with 1D arrays works exactly as it does with ordinary Python sequences.

In [14]:
import numpy as np
a = np.array(range(12))
print(a)
a[2]

[ 0  1  2  3  4  5  6  7  8  9 10 11]


2

Indexing with 2D arrays works similarly, but now there are two dimensions to worry about.

In [7]:
a2d = a.reshape((3,4))
print(a2d)
a2d[0,3]

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]


3

In [9]:
a2d[1,2]

6

In [10]:
a2d[2,2]

10

Lower right corner

In [11]:
a2d[2,3]

11

### example 2 (with slices & assignment)

In [26]:
b2d = np.zeros((3,4)) + np.arange(4) + np.arange(3).reshape((3,1))
b2d

array([[0., 1., 2., 3.],
       [1., 2., 3., 4.],
       [2., 3., 4., 5.]])

In [27]:
b2d[0,0]  # first row, first col

0.0

In [28]:
b2d[1,2]  # second row, third col

3.0

In [29]:
print(b2d)
b2d[2,0]  # third row, first col

[[0. 1. 2. 3.]
 [1. 2. 3. 4.]
 [2. 3. 4. 5.]]


2.0

A slice from the first row:

In [43]:
print(b2d[0,1:3])
b2d[0,1:3].shape

[1. 2.]


(2,)

Slices can also be 2D.  A 2x2 slice:

In [38]:
print(b2d[1:3,1:3])
b2d[1:3,1:3].shape

[[2. 3.]
 [3. 4.]]


(2, 2)

Now let's change the value of the last-row, first-column element `b2d[2,0]`.

In [41]:
b2d = np.zeros((3,4)) + np.arange(4) + np.arange(3).reshape((3,1))
print(b2d)
b2d[2,0] = 3
print()
print(b2d)

[[0. 1. 2. 3.]
 [1. 2. 3. 4.]
 [2. 3. 4. 5.]]

[[0. 1. 2. 3.]
 [1. 2. 3. 4.]
 [3. 3. 4. 5.]]


Now let's make all of that first column be 3.

In [32]:
b2d[:,0]  = [3,3,3]

In [33]:
b2d

array([[3., 1., 2., 3.],
       [3., 2., 3., 4.],
       [3., 3., 4., 5.]])

Next we'll update a small square block of the array:

In [46]:
spl = 5*np.ones((2,2))
print(spl)
print()
b2d = np.zeros((3,4)) + np.arange(4) + np.arange(3).reshape((3,1))
print(b2d)
print()
print(b2d[1:3,1:3])
b2d[1:3,1:3] = spl
print()
print(b2d)

[[5. 5.]
 [5. 5.]]

[[0. 1. 2. 3.]
 [1. 2. 3. 4.]
 [2. 3. 4. 5.]]

[[2. 3.]
 [3. 4.]]

[[0. 1. 2. 3.]
 [1. 5. 5. 4.]
 [2. 5. 5. 5.]]


### reshaped arrays are views

Reshaping an array creates a new array object with a different shape.  The original array can still only be indexed according to its original shape:

In [15]:
Y = np.arange(12).reshape((3,4))
Z = Y.reshape((4,3))
print(Z)
print(Y)
print(Y[2,3])

[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
11


However, the two arrays `Z` and `Y` still share the same **data**.  What that means is that changing one also changes the other. 

We call `Z` and `Y` different views of the same data.

In [47]:
print('Y:')
print(Y)
print('Z:')
print(Z)
print('    ==>')
Y[1,0] = 14
print('Y:')
print(Y)
print('Z:')
print(Z)

Y:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
Z:
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
    ==>
Y:
[[ 0  1  2  3]
 [14  5  6  7]
 [ 8  9 10 11]]
Z:
[[ 0  1  2]
 [ 3 14  5]
 [ 6  7  8]
 [ 9 10 11]]
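
One way to check whether two arrays are views of the same data is `np.shares_memory`. The sketch below also shows `.copy()`, which gives an independent array when sharing is not wanted. The name `C` is illustrative.

```python
import numpy as np

Y = np.arange(12).reshape((3, 4))
Z = Y.reshape((4, 3))            # a view: same data, different shape
C = Y.reshape((4, 3)).copy()     # an independent copy of the data

print(np.shares_memory(Y, Z))    # True
print(np.shares_memory(Y, C))    # False

Y[1, 0] = 14                     # change the shared data through Y
print(Z[1, 1], C[1, 1])          # Z sees the change, C does not
```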


## Assignment (extended examples)

Basic operation.

In [99]:
X = np.zeros((3,4))
X[1,2] = 3
X

array([[0., 0., 0., 0.],
       [0., 0., 3., 0.],
       [0., 0., 0., 0.]])

Filling an entire array with data in a loop.

In [100]:
X = np.zeros( (3,4), dtype=int )
ctr = 0
(rows,cols) = X.shape
for i in range(rows):
    for j in range(cols):
        ctr += 1
        X[i,j] += ctr*10
X

array([[ 10,  20,  30,  40],
       [ 50,  60,  70,  80],
       [ 90, 100, 110, 120]])

A big difference between an array and a list is that arrays are more flexible about assignments.  The value assigned to a slice can be a scalar, in which case the assignment is "broadcast" to each of the cells in the slice (we discuss broadcasting in more detail in another `numpy` notebook).

In the next cell we demonstrate assignment to a slice.  In normal Python the value that is assigned to a slice of a list has to be a sequence.  But in numpy the assignment is treated as an elementwise operation, so in the next example we assign the value -1000 to each of the positions in the slice.

In [48]:
a = np.arange(15)
print(a)
a[:6:2] = -1000    # equivalent to a[0:6:2] = -1000; from start to position 6, exclusive, set every 2nd element to -1000
print(a)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
[-1000     1 -1000     3 -1000     5     6     7     8     9    10    11
    12    13    14]


This is based on the idea of **broadcasting**.

In [50]:
a = np.arange(15)
print(a)
a + 5

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]


array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [53]:
b = a.reshape((3,5))
print('b')
print(b)
c = np.arange(5)
print()
print('c')
print(c)
print()
print('b + c')
print(b+c)

b
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]

c
[0 1 2 3 4]

b + c
[[ 0  2  4  6  8]
 [ 5  7  9 11 13]
 [10 12 14 16 18]]


First col incremented by 0, second by 1, third by 2, and so on.

### transposition

Array transposition is sometimes a nice way to get to the 2D array you really want.  The transpose of an array `M` is called `M.T`, and the definition is that

```
M.T[i,j] = M[j,i]
```

So if `M` is an `m` x `n` array, then `M.T` is an `n` x `m` array.   Look at `X.T` and verify these definitions.  The `i`th row of `X` becomes the `i`th column of `X.T`. The `j`th column of `X` becomes the `j`th row of `X.T`.

In [108]:
print(X)
print()
print(X.T)
print(X[1,2], X.T[2,1])

[[ 10  20  30  40]
 [ 50  60  70  80]
 [ 90 100 110 120]]

[[ 10  50  90]
 [ 20  60 100]
 [ 30  70 110]
 [ 40  80 120]]
70 70
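
The definition can also be checked mechanically. The sketch below uses a small illustrative array `M` and tests every pair of indices, then compares one row of `M` with the matching column of `M.T`.

```python
import numpy as np

M = np.arange(12).reshape((3, 4))
rows, cols = M.shape
print(M.T.shape)                     # (4, 3)

# M.T[j, i] should equal M[i, j] for every valid i and j.
matches = all(M.T[j, i] == M[i, j]
              for i in range(rows) for j in range(cols))
print(matches)

# Row 1 of M is column 1 of M.T.
row_is_col = np.array_equal(M[1, :], M.T[:, 1])
print(row_is_col)
```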


## 3 (and more) D arrays [not used on the midterm]

There are also 3D arrays, which have a third dimension; each position along the third dimension defines a 2D array.

In [None]:
A =np.arange(24).reshape((2,3,4))
A

In a 2D array, a single index slice is a row or column (a 1D array).  In a 3D array, a single index slice is a 2D array.

In [None]:
A[1,:,:]

In [None]:
A[:,:,3]

In [None]:
A[:,2,:]
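
The shapes of those three slices make the point: fixing one index of a 2x3x4 array leaves a 2D array whose shape is what remains of `(2, 3, 4)`. A minimal sketch:

```python
import numpy as np

A = np.arange(24).reshape((2, 3, 4))

print(A[1, :, :].shape)   # (3, 4): first index fixed
print(A[:, :, 3].shape)   # (2, 3): last index fixed
print(A[:, 2, :].shape)   # (2, 4): middle index fixed
print(A[1, 2, 3])         # 23: all three indices fixed gives a single number
```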

We will not be using 3D arrays much, and they will not be featured on quizzes.  But it's nice to know they're out there. 

An easy example of where you might want a 3D array is a color image. Specifying each pixel position in the image takes two numbers, and in a black and white image we can just have the cell at that position contain a single number representing the grayscale magnitude.  But for a color image we need three numbers, and it is very convenient for many purposes to use a third dimension for those three numbers; for example, using the third dimension for the color, and the first "layer" in the 3rd dimension for red, there is an easily retrievable 2D array representing all the "red" values for the image `I`:

```
I[:,:, 0]
```

In machine learning applications, especially in deep learning applications, 3D arrays and higher are not at all uncommon.  One reason is that it is very convenient (and efficient) to use one dimension for the batch number.  Another is that words in the input are often represented as "vectors" (1D arrays of floating point numbers).  Then

```
D[22, 12,:]
```

retrieves the word vector for the 13th word in the 23rd batch.

## 3.  Elementwise arithmetic operations

Basic idea on which broadcasting is based:

In [None]:
import numpy as np
a = np.array( [20,30,40,50] ).reshape((2,2))
b = np.arange( 4 ).reshape((2,2))
print(a)
print(b)

[[20 30]
 [40 50]]
[[0 1]
 [2 3]]


In [None]:
a + b

array([[20, 31],
       [42, 53]])

In [None]:
a * b

array([[  0,  30],
       [ 80, 150]])

In [None]:
a - b

array([[20, 29],
       [38, 47]])

In the next example we briefly illustrate **broadcasting**.  Elementwise operations are extended to apply between arrays and objects that are not arrays.  So for example, `2 * a` (`a` is the array above) returns an array that contains all the elements of `a` multiplied by 2.

In [None]:
print(a)
print()
print(2 * a)

[[20 30]
 [40 50]]

[[ 40  60]
 [ 80 100]]


This works by "broadcasting" 2 into an array the same size as `a` and then doing elementwise multiplication on the two arrays.  We will say more about broadcasting in a subsequent notebook.

A later example, under universal functions, applies two operations to `a`: first `sin`, then multiplication by 10.

You can multiply two arrays of the same shape together as we did with `a` and `b`
above, but broadcasting cannot reconcile arbitrary mismatched shapes. In general, multiplying arrays whose shapes are incompatible will fail:

In [None]:
c = np.array([3,2,1,5,4])
print(a)
c * a

But as we saw above, this will work if the 1D array has a size that matches the number
of columns in the 2D array (that is, the length of each row).

In [55]:
a = np.array( [20,30,40,50] ).reshape((2,2))
print(a)
d = np.array([3,2])
print()
print(d)
print()
print("a + d")
print(a + d)

[[20 30]
 [40 50]]

[3 2]

a + d
[[23 32]
 [43 52]]


We add `d` to each row of `a`.

Put another way: we add 3 to the first column of `a` and 2 to the second column of `a`.
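
Because `a` is square, the example above cannot show whether the 1D array is matched against rows or columns. The sketch below uses an illustrative non-square array `m`, where the difference is visible: a length-3 array lines up with the 3 columns, while a length-2 array has to be reshaped into a 2x1 column before it can be added.

```python
import numpy as np

m = np.arange(6).reshape((2, 3))     # 2 rows, 3 columns
row = np.array([10, 20, 30])         # length 3 == number of columns
col = np.array([100, 200])           # length 2 == number of rows

plus_row = m + row                   # row is added to each row of m
print(plus_row)

try:
    m + col                          # length 2 does not match 3 columns
except ValueError as err:
    print("cannot broadcast:", err)

plus_col = m + col.reshape((2, 1))   # as a 2x1 column, col is added to each column of m
print(plus_col)
```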

### universal functions

The next example applies a function of floats to an nd-array.

The function is applied to every element of the nd-array.

In [57]:
print(a)
print(np.sin(a))

[[20 30]
 [40 50]]
[[ 0.91294525 -0.98803162]
 [ 0.74511316 -0.26237485]]


We can combine this with the arithmetic capabilities illustrated above.

Essentially we treat arrays just like numbers.

In [None]:
10*np.sin(a)

The same idea works with Boolean tests, a point that is important on the homework assignment.

In [58]:
print(a)
Y = a < 35
print(Y)

[[20 30]
 [40 50]]
[[ True  True]
 [False False]]


We apply the Boolean test to each element of the array, producing
a new array.

The resulting Boolean array is sometimes called a **mask**.

We see why below.

### Using a Boolean array as a mask

One of the most important functions of a Boolean array
constructed as a condition on the elements of
a matrix `a` is to serve as a **mask** that can be used to index
the elements of `a` for which the condition is **true**.

In [59]:
print(a)
Y = a < 50
print(Y)
a[Y]

[[20 30]
 [40 50]]
[[ True  True]
 [ True False]]


array([20, 30, 40])

The last line is a 1D array that hands us the elements of `a`  less than 50.

In [60]:
a = np.array( [20,30,40,50] ).reshape((2,2))
b = np.arange( 4 ).reshape((2,2))
print(a)
print(b)
a == b

[[20 30]
 [40 50]]
[[0 1]
 [2 3]]


array([[False, False],
       [False, False]])

In [None]:
if a == b:
    print("Hi!")
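
The `if a == b:` cell above raises a `ValueError`, because `a == b` is a whole array of truth values and `if` needs a single one. A minimal sketch of the usual ways to collapse the array into one Boolean follows. The names `all_equal`, `any_equal` and `same_array` are illustrative.

```python
import numpy as np

a = np.array([20, 30, 40, 50]).reshape((2, 2))
b = np.arange(4).reshape((2, 2))

try:
    if a == b:
        print("Hi!")
except ValueError as err:
    print("ValueError:", err)

all_equal = (a == b).all()                 # True only if every position matches
any_equal = (a == b).any()                 # True if at least one position matches
same_array = np.array_equal(a, a.copy())   # same shape and same elements
print(all_equal, any_equal, same_array)
```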

As with any Boolean test on arrays, the result takes the shape of `a` and `b`.

Hence `a` and `b` must be the same shape or broadcasting must work.

In [64]:
A = np.array( [[1,1],
            [0,1]] )
B = np.array( 
            [0,4] )
print(A)
print()
print(B)
print()
print(A==B)

[[1 1]
 [0 1]]

[0 4]

[[False False]
 [ True False]]


The result on my machine was that using `np.fromfunction`  is a little under 2 orders of magnitude faster.  Try it on yours.

Note a key point here.  The differences between these two ways of array filling are dependent on the sizes of the arrays being filled.  The same code run with small (for example, 4x5) arrays is a virtual tie.
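
The timing cell those remarks refer to is not reproduced here. As a hedged sketch, assuming the two ways of filling being compared are a Python double loop and `np.fromfunction`, the code below builds the same small array both ways. The sizes and the fill formula are illustrative choices, not the original benchmark.

```python
import numpy as np

rows, cols = 3, 4

# Way 1: fill cell by cell with a Python double loop.
X_loop = np.zeros((rows, cols), dtype=int)
for i in range(rows):
    for j in range(cols):
        X_loop[i, j] = 10 * (i * cols + j + 1)

# Way 2: np.fromfunction calls the function once, passing whole arrays of
# row indices i and column indices j, so the arithmetic is elementwise.
X_func = np.fromfunction(lambda i, j: 10 * (i * cols + j + 1),
                         (rows, cols), dtype=int)

print(X_func)
print(np.array_equal(X_loop, X_func))
```

To time the two on larger arrays, wrap each way in a function and pass it to `timeit.timeit`.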

## Elementwise Boolean operations, Masking examples

A very important extension of elementwise operations is the extension to Boolean tests.

Applying a Boolean test to an array returns an array of truth-values: Just as

```
3 + X
```

adds 3 to every element of array `X`, so

```
X > 2
```

returns an array of truth-values which tells us which elements of `X` are greater than 2.

The property that makes Boolean masking work is this:

Any Boolean array can be used to index another array,
provided the two have the same shape.  So in the simplest case:

In [66]:
X= np.arange(8)
print(X)
print(X[np.array([False,  True, False,  True,  True,  True,  True, False])])

[0 1 2 3 4 5 6 7]
[1 3 4 5 6]


But now I can take advantage of the fact that `X>=3` has the same shape as `X`:

In [None]:
print(X)
print(X>=3)
print(X[X>=3])

This returns exactly the members of X that are greater than or equal to 3.

We could do the same with a list comprehension, but the array computation above is much faster:

In [None]:
np.array([x for x in X if x >= 3])

Boolean arrays can also be used to **count** the number of elements in an array that satisfy some constraint. Although `len` works on arrays and could be used (as in `len(X[X>=3])`), summing the Boolean array directly skips building the filtered array.  The Boolean `True` is treated as 1 and the Boolean `False` as 0, so summing a Boolean array counts
the number of trues.  The following expression correctly counts the number of elements in `X` that are greater than or equal to 3.

In [None]:
sum(X>=3)

5
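
For comparison, here is a sketch of three ways to get that count for `X = np.arange(8)`. The array's own `sum` method and `np.count_nonzero` do the counting inside NumPy rather than in a Python-level loop.

```python
import numpy as np

X = np.arange(8)
mask = X >= 3

n_len = len(X[mask])            # build the filtered array, then measure it
n_sum = mask.sum()              # True counts as 1, False as 0
n_cnt = np.count_nonzero(mask)  # counts the True entries directly

print(n_len, n_sum, n_cnt)      # 5 5 5
```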

More on array operations and efficiency below.  It's a tricky subject.

The reason that `X[X>=3]` works has to do with a basic fact of array indexing we haven't
really made explicit yet.  Any Boolean array `Y` can be used to index a 1D array `X` if `X` and `Y` are the same length.

A hokey but essential example.  Suppose we have an array of length 8 and we want to access the first, third, and seventh elements.  We can of course do it with a Boolean array:

In [None]:
X = np.arange(8) + 1
Y = np.array([True,False,True,False,False,False,True,False])
print(X)
X[Y]

But there's a better way!

Cook up a sequence consisting of the **indices** of the first, third, and seventh elements, namely:

```
[0,2,6]
```

Now use that sequence as a **fancy** index on X:

In [None]:
X[[0,2,6]]

Note the need for **double** square brackets.

Passing three different indices to the 1D array `X` is an error.

In [None]:
X[0,2,6]

Fancy indexing also works on 2D arrays.

Study this example.

In [67]:
Y = np.arange(12).reshape((3,4))
print(Y)
print(Y[[0,2],[1,2]])

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
[ 1 10]


This is one array containing `Y[0,1]` and `Y[2,2]`.

Now suppose we have an array of data that has 4 columns, but we only want to keep the
first, second and fourth columns.  Here's a simple way to do that using
a Boolean array:

In [None]:
X = np.arange(8).reshape((2,4)) + 1
print(X)
print(X[:,np.array([True,True,False,True])])

Back to Boolean operations.  Boolean operations work more or less as expected on 2D arrays:
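
A minimal sketch on a 2x4 array follows. The elementwise "and" operator `&` and the `axis` argument to `sum` are not introduced elsewhere in this notebook, so treat them as additions for illustration.

```python
import numpy as np

X = np.arange(8).reshape((2, 4)) + 1    # [[1 2 3 4], [5 6 7 8]]

mask = X > 2                            # a 2x4 array of truth values
print(mask)

selected = X[mask]                      # masking a 2D array gives a 1D result
print(selected)                         # [3 4 5 6 7 8]

both = (X > 2) & (X % 2 == 0)           # elementwise "and" of two tests
print(X[both])                          # [4 6 8]

per_column = mask.sum(axis=0)           # count the Trues in each column
print(per_column)                       # [1 1 2 2]
```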

## Data in the form of arrays

Many python modules that provide data do so in array form.  As an example,
we load the famous **iris data set** due to Ronald Fisher, familiar to many who've had a statistics class, but also to many who've had a computer science or machine learning class in which data analysis plays a role.

In [109]:
from sklearn.datasets import load_iris
data = load_iris()
features = data['data']
target = data['target']

The name `data` has been set to a dictionary.

In [110]:
list(data.keys())

['data',
 'target',
 'frame',
 'target_names',
 'DESCR',
 'feature_names',
 'filename']

In the cell loading the data, `features` is set to `data['data']`, a `numpy` array:

In [None]:
print(features.shape)
print(features.ndim)
print(features.dtype.name)
print(features.size)
print(type(features))
print(features[:10])

(150, 4)
2
float64
600
<class 'numpy.ndarray'>
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]


The last thing printed is the first 10 rows of `features`.  This array represents data
about irises.  Each row represents a different iris and gives 4 measurements for that 
exemplar.  So there are 150 iris exemplars; with 4 measurements for each, that's 600 items of
data in the array (`features.size`).  The data is used for classification studies.

Let's apply what we've just learnt to pick a certain subset of the data. 
The first number
in each row represents the **sepal length** of that particular iris exemplar (to be demonstrated below).

Find the flowers whose sepal length is exactly 5.

In [111]:
# 1D array: first column
first_col = features[:,0]
print(first_col == 5.0)
print((first_col == 5.0).shape)

[False False False False  True False False  True False False False False
 False False False False False False False False False False False False
 False  True  True False False False False False False False False  True
 False False False False  True False False  True False False False False
 False  True False False False False False False False False False False
  True False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False  True False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False]
(150,)


Note:  The Boolean array is a 1D array of length 150, one Boolean for each row in the data.  Any Boolean array of this size can be used to index the rows, combined with a `:` to indicate we want all columns.

In [None]:
features[first_col == 5.0,:]

array([[5. , 3.6, 1.4, 0.2],
       [5. , 3.4, 1.5, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5. , 3.2, 1.2, 0.2],
       [5. , 3.5, 1.3, 0.3],
       [5. , 3.5, 1.6, 0.6],
       [5. , 3.3, 1.4, 0.2],
       [5. , 2. , 3.5, 1. ],
       [5. , 2.3, 3.3, 1. ]])

So there are only 10 plants out of 150 that have sepal lengths of exactly 5.

Of course we can do this all in one step, with the same result:

In [None]:
features[features[:,0] == 5.0,:]

array([[5. , 3.6, 1.4, 0.2],
       [5. , 3.4, 1.5, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5. , 3.2, 1.2, 0.2],
       [5. , 3.5, 1.3, 0.3],
       [5. , 3.5, 1.6, 0.6],
       [5. , 3.3, 1.4, 0.2],
       [5. , 2. , 3.5, 1. ],
       [5. , 2.3, 3.3, 1. ]])

Since we're indexing rows here, the `:` can also be left out:

In [None]:
features[features[:,0] == 5.0]

array([[5. , 3.6, 1.4, 0.2],
       [5. , 3.4, 1.5, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5. , 3.2, 1.2, 0.2],
       [5. , 3.5, 1.3, 0.3],
       [5. , 3.5, 1.6, 0.6],
       [5. , 3.3, 1.4, 0.2],
       [5. , 2. , 3.5, 1. ],
       [5. , 2.3, 3.3, 1. ]])

A more interesting use of Boolean array indexing is to find all the irises of a particular class, using the variable `target` defined when we loaded the data; `target` is an array containing the class of each iris in the data set:

In [None]:
print(type(target))
print(target.shape)
print(set(target))
# Grab the first rows of both arrays
print(features[0], target[0])
# Grab row indexed 90 from both arrays
print(features[90], target[90])

As the printouts indicate, `target` is a 1D array (a **vector**) of length 150, containing only the values 0, 1, and 2.  These are the three classes to which an iris can belong.  There are exactly as many entries in the `target` array as there are rows in the `features` array. For any iris, its row index in the `features` array is the index of its class in `target`.  Above we printed the features and target class for irises 0 and 90.

We can take advantage of this structure to find all the irises of class 1 very efficiently.
Note that `target == 1` is a Boolean array of length 150.

In [None]:
target==1

To find the irises of class 1, we simply use that Boolean array to specify the rows
we want in the `features` array:

In [None]:
features[target==1,:].shape

There are 50 rows whose class is `1`; you can verify for yourselves that there are also 50 rows each for classes 0 and 2.  This is a balanced data set.

Going back to the dictionary we got when we loaded the dataset, we can
find the meanings of our 4 columns:

In [None]:
data['feature_names']

So our 4 columns have numerical data on these attributes. The column indexed 0 contains the sepal length measurement for each flower in our sample.  The column indexed 1 contains sepal width measurements. And so on.

We know there are three classes (0, 1, and 2). We can look up their names under the `'target_names'` key of the dictionary we got when we loaded the data.  

In [None]:
data['target_names']

So class 0 represents the iris species `setosa`, and so on.

Here's what the entire target array looks like:

In [None]:
data['target']

Find the irises whose species is 1 and whose sepal length is at least 5.

In [None]:
# New smaller data table consisting of those flowers of species 1.
species1 = features[target == 1,:]
print(len(species1))
# Rows of species1 whose sepal length (column 0) is at least 5.0.
long_sepal = species1[species1[:,0]>=5.0,:]
print(len(long_sepal))

In [None]:
print(species1[:,0]>=5.0)
sum(species1[:,0]>=5.0)

Now suppose we wanted to just see the **sepal lengths** (column 0) of the target class 1 irises, and compare them to the sepal lengths of the target class 0 irises, to see if sepal lengths provided a good way of telling the two classes apart.  We could do:

In [None]:
print(features[target==1,0])
print(features[target==0,0])

And though there is some overlap, we see that there is a tendency for the column-0
value to be higher in target class 1 than it is in target class 0.

## Cosine

One way of measuring the similarity between two arrays `x` and `y` is to take their cosine.  The name for this similarity measure is quite appropriate: the cosine of `x` and `y` is the geometric cosine of the angle between them.  It is 1 when that angle is 0 (i.e., `x` and `y` are identical in direction), -1 when the vectors point in opposite directions, and 0 when the vectors are orthogonal (they do not share a component in any direction).  It is computed by taking the dot product of the unit vectors pointing in the same directions as `x` and `y`.  To do this we divide `x` and `y` by their Euclidean lengths (`LA.norm` in the code below) and take the dot product of the results.

In [None]:
import numpy as np
import numpy.linalg as LA
x = np.array([3,4])
y = np.array([4,3])

def find_unit_vector(x):
    return x/LA.norm(x,ord=2)

def cosine (x,y):
    return (find_unit_vector(x)).dot(find_unit_vector(y))

The maximum possible value cosine can have for two vectors is `1.0`, and it is achieved when you compare a vector `x` with itself.  So nothing can be more similar to `x` than `x` is to itself.  Whew.  That's reassuring.

Check my claim about how similar things are to themselves.

In [None]:
cosine(x,x)

1.0

Let's multiply `x` by -1 to get a vector pointing in exactly the opposite direction from `x`, and take the cosine of the two vectors.

In [None]:
cosine(x,-x)

-1.0

`x` and `y` are different but not **that** different. What value does `cosine` assign to their similarity?

In [None]:
cosine(x,y)

0.96

What does `find_unit_vector` really do?  It finds a vector pointing in exactly the same direction as `x` but having length 1.  How similar will `x` be to its unit vector?  To find that out for yourself, create a Code cell and execute:

```
cosine(x, find_unit_vector(x))
```

Before doing that, look at the definition of cosine again and
try to predict what the value will be.

Does a unit vector really have length 1?

In [None]:
LA.norm(find_unit_vector(x))

1.0
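
The orthogonal case mentioned earlier is not exercised by the cells above. The sketch below fills that in, using an equivalent one-line form of the cosine (dot product divided by the product of the lengths) and an illustrative vector `z` at right angles to `x`. It also shows that rescaling a vector does not change its cosine with anything.

```python
import numpy as np
import numpy.linalg as LA

def cosine(x, y):
    # Dot product divided by the product of the Euclidean lengths:
    # the same value as the dot product of the two unit vectors.
    return x.dot(y) / (LA.norm(x) * LA.norm(y))

x = np.array([3, 4])
z = np.array([-4, 3])        # x rotated by 90 degrees

print(cosine(x, z))          # 0.0: orthogonal vectors
print(cosine(x, 10 * x))     # same direction, different length
```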

## Why use arrays?

Arrays provide all kinds of indexing convenience for accessing data that
is genuinely **tabular**.  Tabular data is data that has good reason for being represented in
a table;  that is, there is good reason to think of the rows as a unit, and good reason
to think of the columns as units.  And therefore, there is good reason to think we might want
to apply some computational operations to columns and rows.  We'll explore those ideas a bit more
in a future assignment.  

All kinds of data is fundamentally tabular, as we'll see.  But although that's very important,
it's probably not the main reason why so many scientists, engineers, and data analysts use `numpy`
and other programming tools like `matlab` and `R` that provide tables, or **matrices** (the mathematical
name for a similar concept), as a fundamental data structure.  The main reason is computational efficiency.

Having a single data structure with predictable types and elementwise basic arithmetic operations makes
it possible to write very efficient implementations of more complex operations.  The code snippet below
demonstrates this with a timing comparison between two ways of computing the sum of the
squares of the integers from 0 to 999 inclusive.  The first, timed in the variable `py_sec`, is the ordinary Python way of computing such a sum:  cook up a container of those integers, square each of them, and sum the result.
The second, timed in the variable `np_sec`, uses numpy arrays.  There is a very important
mathematical operation, the **dot product**, that multiplies the corresponding elements of two vectors (elementwise) and
sums the result.  Using this precompiled fast math operation speeds the code up by well over an order of magnitude
(a factor of roughly 57 in the run shown below); your results on your various home machines will be different, but in the same ballpark.
 

In [None]:
import timeit
import numpy as np

py_sec = timeit.timeit('sum(x*x for x in range(1000))',
                              number=10000)

np_sec = timeit.timeit('na.dot(na)',
                            setup="import numpy as np; na=np.arange(1000)",
                            number=10000)

print(f"Normal Python: {py_sec} sec")

print(f"NumPy: {np_sec} sec")

Normal Python: 0.8127167950005969 sec
NumPy: 0.01427987700299127 sec


Note that numpy arrays in and of themselves don't help.  In the code snippet below, we multiply the `na` array by itself and sum the result with Python's builtin `sum`.  This code is actually *slower* than the basic Python code: it pays the overhead of creating a `numpy` array for the product but then sums it with the builtin `sum`, which loops over the elements one at a time in Python instead of using an optimized array operation.  The lesson here is that coding so as to take maximum advantage of the efficiencies of `numpy` can be tricky, but if you have a good reason to be efficient, it's definitely worth some experimentation. That starts with knowing about the kinds of builtin array-specific functions `numpy` offers.

In [None]:
naive_np_sec = timeit.timeit('sum(na*na)',
                             setup="import numpy as np; na=np.arange(1000)",
                             number=10000)
print(f"Naive NumPy: {naive_np_sec:f} sec")

Naive NumPy: 1.982286 sec
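
For reference, the sketch below computes the same total three ways and confirms they agree. Only the first uses Python's builtin `sum`. The other two keep the summation inside NumPy, and they are the variants worth timing with `timeit` on your own machine. No timings are claimed here.

```python
import numpy as np

na = np.arange(1000)

builtin_total = sum(na * na)     # builtin sum: a Python-level loop over the array
numpy_total = (na * na).sum()    # the array's own sum method
dot_total = na.dot(na)           # multiply and add in one optimized call

print(builtin_total, numpy_total, dot_total)
```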
