### Drill 10.5.1 - NumPy drills

First, let's import the <code>sys</code> module into a script and determine the version of Python we're using.  What do you see?

In [2]:
#answer
import sys
print(sys.version)

3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 12:04:33) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]


<font color='red'>Import</font> numpy as np and confirm the version.

In [2]:
import numpy as np
print(np.__version__)

1.14.3


Create an array using NumPy.  Assign the output to a variable called <code>a</code> and print it.

In [4]:
import numpy as np
a = np.arange(10)
print(a)

[0 1 2 3 4 5 6 7 8 9]


<font color='red'>Descriptive statistics</font> (the number of data points, mean, min, max, standard deviation, variance) are a first step in working with data.  Let's compute them using NumPy and the array <code>a</code>.

In [19]:
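# [Answer]
# note: the output below is from a run made after `a` was redefined as np.arange(17) in a later cell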
print("mean:", a.mean())
print("min :", a.min())
print("max :", a.max())
print("std :", a.std())
print("var :", a.var())

mean: 8.0
min : 0
max : 16
std : 4.898979485566356
var : 24.0
25
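
As a minimal sketch of the same summary, the cell below rebuilds a 17-element array so that it does not depend on notebook execution order. The name `data` is illustrative and not part of the original drill; the count and the median are added here only for completeness.

```python
import numpy as np

data = np.arange(17)  # illustrative array: 0, 1, ..., 16

print("count :", data.size)        # number of data points
print("mean  :", data.mean())
print("median:", np.median(data))  # median is a NumPy function, not an ndarray method
print("min   :", data.min())
print("max   :", data.max())
print("std   :", data.std())       # population standard deviation (ddof=0 by default)
print("var   :", data.var())       # population variance (ddof=0 by default)
```

Passing `ddof=1` to `std` or `var` gives the sample versions instead of the population versions.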


<h4>Returning boolean expressions when checking a NumPy array</h4>
Let's check whether an array contains desired values.

In [6]:
import numpy as np
#create a range of values
a = np.arange(17)
print(a)
# which values in this array are multiples of 3?
print(a[a % 3 == 0])

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16]
[ 0  3  6  9 12 15]


In [8]:
# yup!  So let's see the boolean mask that produced that selection ...
a % 3 == 0

array([ True, False, False,  True, False, False,  True, False, False,
        True, False, False,  True, False, False,  True, False])
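
A boolean mask like the one above can also be reduced to a single answer. The sketch below assumes the same 17-element range; the variable names are illustrative and not part of the drill.

```python
import numpy as np

values = np.arange(17)
mask = values % 3 == 0           # element-wise test, one boolean per element

has_multiple_of_3 = mask.any()   # True if at least one element passes
all_multiples = mask.all()       # True only if every element passes
n_matches = int(mask.sum())      # True counts as 1, so sum() counts the matches

# membership test: which elements appear in a given list of candidates?
contains_5_or_20 = np.isin(values, [5, 20])

print(has_multiple_of_3, all_multiples, n_matches)
print(values[contains_5_or_20])
```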

<h4>NumPy runs a lot faster</h4>  <blockquote>A little optional FYI:  NumPy can delegate linear algebra operations to an optimized BLAS/LAPACK library such as ATLAS (see http://math-atlas.sourceforge.net/).  NumPy arrays are densely packed arrays of a homogeneous type. Python lists, by contrast, are arrays of pointers to objects, even when all of the objects are of the same type. So, with NumPy, you get the benefits of <font color='red'>locality of reference</font>.  When summing integers, the compiled code may also be able to use specialized CPU vector instructions (https://superuser.com/questions/1170062/whats-the-difference-between-a-superscalar-and-a-vector-processor).

Also, many NumPy operations are implemented in C, avoiding the general cost of loops in Python, pointer indirection and per-element dynamic type checking. The speed boost depends on which operations you're performing, but a few orders of magnitude isn't uncommon in number-crunching programs.</blockquote>

In [9]:
# using IPython's "%timeit" magic ...
demoNp = np.arange(25000)
%timeit [x for x in demoNp if x % 2 == 0]

8.26 ms ± 30.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [10]:
# note: %time measures a single run, so it is noisier than %timeit
%time demoNp[demoNp % 2 == 0]

CPU times: user 745 µs, sys: 282 µs, total: 1.03 ms
Wall time: 477 µs


array([    0,     2,     4, ..., 24994, 24996, 24998])
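
Outside IPython, where the `%timeit` magic is not available, the same comparison can be sketched with the standard-library `timeit` module. The function names and the repeat count of 20 are arbitrary choices for illustration, and the absolute timings will differ from machine to machine.

```python
import timeit
import numpy as np

demo = np.arange(25000)

def with_loop():
    # Python-level loop: one modulo and one comparison per element
    return [x for x in demo if x % 2 == 0]

def with_mask():
    # vectorized: the modulo and the comparison run in compiled code
    return demo[demo % 2 == 0]

# both approaches must select the same values
assert np.array_equal(np.array(with_loop()), with_mask())

runs = 20
loop_s = timeit.timeit(with_loop, number=runs) / runs
mask_s = timeit.timeit(with_mask, number=runs) / runs
print("loop: {:.3f} ms per call".format(loop_s * 1e3))
print("mask: {:.3f} ms per call".format(mask_s * 1e3))
```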

<h4>Get to know your np int types ... </h4><blockquote>See https://en.wikipedia.org/wiki/Integer_(computer_science) for more</blockquote>

In [14]:
# using our demoNp
print(type(demoNp))
# and the int type...
print(demoNp.dtype)
# create a new 25-element array, then convert the Python int 25 to a float64 scalar ...
npa = np.arange(25)
np.float64(25)

<class 'numpy.ndarray'>
int64


25.0

In [16]:
# converting an array to a number type: 32-bit floats first
# (npa.astype(np.float32) is the more common spelling)
print(np.float32(npa))
# and 64?
print(np.float64(npa))

[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24.]
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24.]
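
The two printouts look identical, so here is a small sketch of how the underlying types differ. It uses `astype`, `itemsize`, and `np.iinfo`; the variable names are illustrative.

```python
import numpy as np

npa = np.arange(25)

# astype returns a converted copy; the original array is left unchanged
as_f32 = npa.astype(np.float32)
as_i8 = npa.astype(np.int8)

print(as_f32.dtype, as_f32.itemsize)  # itemsize is the number of bytes per element
print(as_i8.dtype, as_i8.itemsize)

# iinfo reports the representable range of an integer type
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)
print(np.iinfo(np.int64).max)
```

Choosing a smaller type saves memory, but values outside the range reported by `np.iinfo` cannot be represented in it.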


In [18]:
# Mixing types?  (only the last expression in a cell is displayed)
np.array(range(20)).dtype  # all ints: an integer dtype
np.array([1.0, 0, 2, 3]).dtype  # a single float promotes every element to float64
# mixing more types can yield interesting results - here we add a boolean to floats and ints ...
np.array([True, True, 0.1, 1, 2.5, 7]).dtype

dtype('float64')
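
Because the cell above displays only its last expression, the sketch below prints each case separately. It also uses `np.result_type` to query the promotion rule directly; the integer results depend on the platform's default integer type, so no specific width is assumed here.

```python
import numpy as np

ints_only = np.array([1, 2, 3])
int_and_float = np.array([1.0, 0, 2, 3])
bool_float_int = np.array([True, True, 0.1, 1, 2.5, 7])
bool_and_int = np.array([True, False, 2])

print(ints_only.dtype)       # platform default integer type
print(int_and_float.dtype)   # one float promotes everything to float64
print(bool_float_int.dtype)  # bool and int are both promoted to float64
print(bool_and_int.dtype)    # bool is promoted to the default integer type

# the promotion rule can be queried without building an array
print(np.result_type(np.int8, np.float32))
```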

In [32]:
# let's explore some properties of arrays ... 
# recalling the demoNp - what's the # of elements in the array?
print(demoNp.size)
# careful: np.size is a NumPy function, not our array, so this prints the function object;
# the element count of the array from above is npa.size
print(np.size)
# and the shape of it?
print(npa.shape)
# where are the min/max?  argmin/argmax return the index, not the value
print(npa.argmin()) # index of the min
print(npa.argmax()) # index of the max
print("-"*60, "\nlinspace with endpoint:")
# let's explore linspace [https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html#numpy-linspace]
print(np.linspace(0, 5, 10))
# and we can exclude the right end point:
print("-"*60, "\nlinspace without endpoint:")
print(np.linspace(0, 5, 10, endpoint=False))
# finally, let's set the type to int (astype(int) truncates toward zero rather than rounding):
print("-"*60, "\nsame data changed to ints:")
print(np.linspace(0, 5, 10).astype(int))

25000
<function size at 0x106aa5ea0>
(25,)
0
24
------------------------------------------------------------ 
linspace with endpoint:
[0.         0.55555556 1.11111111 1.66666667 2.22222222 2.77777778
 3.33333333 3.88888889 4.44444444 5.        ]
------------------------------------------------------------ 
linspace without endpoint:
[0.  0.5 1.  1.5 2.  2.5 3.  3.5 4.  4.5]
------------------------------------------------------------ 
same data changed to ints:
[0 0 1 1 2 2 3 3 4 5]
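
To see why the two `linspace` calls produce different spacing, the following sketch asks `linspace` to return the step as well (via its `retstep` parameter) and contrasts truncation with rounding. The variable names are illustrative.

```python
import numpy as np

# retstep=True also returns the spacing between samples
with_end, step_with = np.linspace(0, 5, 10, retstep=True)
no_end, step_without = np.linspace(0, 5, 10, endpoint=False, retstep=True)

print(step_with)     # 5 / 9, because 10 points span 9 intervals
print(step_without)  # 5 / 10, because the right endpoint is left out

# astype(int) truncates toward zero; rounding first gives different values
truncated = with_end.astype(int)
rounded = np.rint(with_end).astype(int)
print(truncated)
print(rounded)
```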


<h4>Whenever we clean data</h4> or perform work on our data, we should make sure the data are there!  We want to check for <font color='red'>not-a-number</font>, or <b>nan</b>, values.

In [38]:
# say we're importing data and we want to check for data being "not a number":
# for the demo, we need a nan value.  Note that nan is a float, so it cannot be stored in an int array.
# note: python 3.6 and 3.7 are likely to throw a runtime warning ... 
ap = np.linspace(0, 10, 11)
ap[0] = np.nan
# okay, now let's see how we can track this nan...
print(ap.min())
print(ap.max())
print(ap.min())
print(ap.mean())

nan
nan
nan
nan


  return umr_minimum(a, axis, None, out, keepdims)
  return umr_maximum(a, axis, None, out, keepdims)
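
Since ordinary reductions simply propagate `nan`, a more direct check is usually needed. The sketch below rebuilds the same array and uses `np.isnan` together with NumPy's nan-aware reductions; the names `nan_mask` and `clean` are illustrative.

```python
import numpy as np

ap = np.linspace(0, 10, 11)
ap[0] = np.nan

nan_mask = np.isnan(ap)  # True wherever a value is missing
print("any nan?  ", nan_mask.any())
print("nan count:", int(nan_mask.sum()))
print("positions:", np.flatnonzero(nan_mask))

# nan-aware reductions skip missing values instead of propagating them
print(np.nanmin(ap), np.nanmax(ap), np.nanmean(ap))

# alternatively, drop the missing values first and use the usual methods
clean = ap[~nan_mask]
print(clean.mean())
```

Whether to skip or drop missing values depends on the analysis; either way, it helps to report how many were found.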
