Introduction to Python: 3. NumPy and SciPy basics
============
***

NumPy and SciPy are crucial libraries for data analysis.

NumPy provides the foundation for all data-analytical frameworks as it defines the data types and functions that allow Python to process data quickly. NumPy makes extensive use of C code under the hood.

SciPy is a large package of scientific computing tools, including statistical tools and machine learning tools. I've barely started to skim the surface of what SciPy can do, so you should explore for yourselves. Also, if you installed Python manually and not via a pre-packaged distribution like EPD or Anaconda, SciPy is terribly hard to install, because it entails lots of C and Fortran libraries that need to be manually compiled. For this reason, you should stick with pre-packaged versions of Python whenever you move to a new system, unless you really know what you're doing.

For both NumPy and SciPy, I will simply cover what I think is useful - both libraries have a wealth of features that I've never used.

## NumPy

NumPy is a library built for speeding up numerical computation. You usually need to install it as a separate module as it doesn't come packaged with Python, but the EPD package comes with it. Also, when IPython is started in pylab mode, it pre-imports NumPy via something like <code>from numpy import *</code> (this is a guess, I'm not sure if it imports every function), so you rarely need to manually import it or reference functions via <code>numpy.SOMEFUNCTION</code>. In plain Python or plain IPython you still need the import, which is why the cells below do it explicitly.

The basic workhorse data structure of NumPy is the array. Arrays are like lists in some senses: they contain an ordered, 0-indexed sequence of objects of the same type, and they can be indexed in exactly the same ways as a list can.

However, the crucial difference between arrays and lists is that arrays are fixed-length. I believe they are built on C-style arrays under the hood. What that means is that you should never append things to an array - there is a function to do so (<code>np.append</code>), but what it does is basically copy the entire array into a new array one item longer, which is terribly inefficient.
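
To make the copying concrete, here is a minimal sketch (the variable names are made up for illustration). It shows that <code>np.append</code> hands back a new array instead of growing the old one, and one common workaround: collect items in a plain list and convert to an array once at the end.

```python
import numpy as np

arr = np.arange(5)
longer = np.append(arr, 99)   # builds a brand-new array; arr itself is untouched

print(arr.size, longer.size)  # the sizes differ because arr was copied, not extended

# A common workaround when the final length is unknown:
# grow a plain list, then convert it to an array once.
squares = []
for i in range(5):
    squares.append(i * i)
squares_arr = np.array(squares)
print(squares_arr)
```

If the final length is known up front, another option is to create the array at full size first (for example with <code>np.zeros</code>) and fill it in by index.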

Like in the case of tuples, changing between arrays and lists is trivial.

In [1]:
import numpy as np

ls = range(10)
print(ls)

arr = np.array(ls)
print(arr)

print(type(arr))
print(arr[3])
print(arr[::-1])

range(0, 10)
[0 1 2 3 4 5 6 7 8 9]
<class 'numpy.ndarray'>
3
[9 8 7 6 5 4 3 2 1 0]


(You can also create multidimensional arrays, i.e. matrices, but I won't be doing so, as we will rarely have a use for them, at least for now. If you are interested, you can look into <code>ndarray</code>.)
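
For the curious, a small sketch of what a 2-dimensional array looks like. The name <code>grid</code> and the 3-by-4 shape are arbitrary choices for illustration.

```python
import numpy as np

# Reshape the numbers 0..11 into 3 rows and 4 columns.
grid = np.arange(12).reshape(3, 4)

print(grid.shape)    # (rows, columns)
print(grid[1, 2])    # a single element: row 1, column 2
print(grid[0])       # the whole first row
print(grid[:, 1])    # the whole second column
```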

A big advantage of NumPy arrays is that you can easily apply column-wise (or row-wise, depending on how you look at it) functions. For example, suppose you want to multiply every element in a list by two. With lists, you would need a loop, or at least a list comprehension. With NumPy, you can simply treat the array as if it were a number and perform a scalar operation on it.

In [2]:
import numpy as np

ls = range(10)
arr = np.array(ls)
print(arr)
print(arr*2)
print(arr+1)

[0 1 2 3 4 5 6 7 8 9]
[ 0  2  4  6  8 10 12 14 16 18]
[ 1  2  3  4  5  6  7  8  9 10]


In general only numerical operations and NumPy functions can be used as scalar operations in this manner. All other functions will probably break. Wherever possible, use column-wise operations, because they are much, much faster.
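
As a rough illustration of what "break" means here, the sketch below compares <code>np.sqrt</code> with the standard library's <code>math.sqrt</code>, which expects a single number. The list comprehension at the end is one possible fallback for scalar-only functions, not the only one.

```python
import math
import numpy as np

arr = np.arange(1, 6)

print(np.sqrt(arr))        # NumPy function: applied to every element at once

try:
    math.sqrt(arr)         # plain-Python function: expects a single number
except TypeError as err:
    print("math.sqrt failed:", err)

# Fallback for functions that only accept scalars: loop explicitly (slower).
roots = np.array([math.sqrt(x) for x in arr])
print(roots)
```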

Another big benefit of NumPy is the ability to exploit Boolean arrays. As you might guess, we are able to create an array of Trues and Falses by directly running the whole array against some (scalar) condition/expression.

In [3]:
import numpy as np

arr = np.arange(10)   # short-cut for array of range
is_even = arr%2 == 0
print(is_even)

[ True False  True False  True False  True False  True False]


What's really brilliant is that you can index an array with a boolean array, and that filters out all the corresponding "False" elements.

In [4]:
import numpy as np

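arr = np.arange(10)   # short-cut for array of range
is_multiple_of_three = arr%3 == 0                # True where the element is divisible by 3
multiples_of_three = arr[is_multiple_of_three]   # keep only the "True" positions
print(multiples_of_three)
print(is_multiple_of_three)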

[0 3 6 9]
[ True False False  True False False  True False False  True]


This will prove extremely useful in filtering data. Again, not all conditions can be directly applied column-wise. There are also functions such as <code>logical_or</code> and <code>logical_and</code> that perform column-wise logical operations on the arrays.
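
A short sketch of how these two functions combine with boolean indexing. The conditions (<code>is_even</code>, <code>is_big</code>) are invented for the example.

```python
import numpy as np

arr = np.arange(20)
is_even = arr % 2 == 0
is_big = arr > 10

both = np.logical_and(is_even, is_big)     # True where both conditions hold
either = np.logical_or(is_even, is_big)    # True where at least one holds

print(arr[both])
print(arr[either])

# For boolean arrays, the & and | operators give the same result.
# The parentheses matter because & and | bind tighter than comparisons.
same = (arr % 2 == 0) & (arr > 10)
print(np.array_equal(both, same))
```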

## Exercises

1. <b><u>A smarter logical_and/or function</u></b>: I'm not sure why this doesn't exist in NumPy's library, considering how often one would use it. The problem with <code>logical_or</code> and <code>logical_and</code> is that each only takes in 2 NumPy arrays at a time. But what if you want to <code>and</code> across 10 boolean arrays at once? Create 2 functions that take in a list of boolean arrays and run the <code>or</code> (or <code>and</code>) operation across all of them.
2. <b><u>2-dimensional arrays</u></b>: So far, we have only worked with 1-dimensional arrays. NumPy can also handle n-dimensional arrays, though I tend to steer away from them since I prefer to use Pandas' DataFrames (another data structure we will learn about soon). That said, some functions require work with n-dimensional arrays, particularly those involving matrices. We'll just work with 2 dimensions. Write a function that takes in 2 arguments $X$ and $Y$, and returns an $X$-by-$Y$ multiplication table in a 2d-array. Experiment with indexing within the 2d-array.
3. There is a function <code>hist2d</code> in matplotlib that generates a heat map. Add an optional argument to your solution to exercise 2 to also generate a heat map.
4. We will now have a more efficient approach to finding prime numbers. Look up http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes and implement the same algorithm. Write a function that takes an argument <code>n</code>, and returns an array of all prime numbers up to <code>n</code>.
5. There is a function <code>numpy.random.randn</code> that takes 2 arguments $X$ and $Y$ and gives an $X$-by-$Y$ matrix of random numbers. Make a function that generates a random $X$-by-$Y$ heat map.

In [5]:
import numpy as np

arr = np.arange(30)
is_two = arr%2 == 0
is_three = arr%3 == 0
is_five = arr%5 == 0

mlist = [is_two, is_three, is_five]

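def bor(ls):
    # Element-wise OR across every boolean array in ls.
    n = len(ls[0])                        # all arrays are assumed to share this length
    condition = np.zeros(n, dtype=int)    # start with all 0 (False)
    for i in range(n):
        for carr in ls:
            if carr[i]:
                condition[i] = 1
    return condition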

print(is_two*1)
print(is_three*1)
print(is_five*1)
print(bor(mlist))
                
                


[1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0]
[1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0]
[1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0]
[1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 0 1 0 1 1 1 0 1 1 1 1 1 0]


In [24]:
import numpy as np

arr = np.arange(60)
is_two = arr%2 == 0
is_three = arr%3 == 0
is_five = arr%5 == 0

mlist = [is_two, is_three, is_five]

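def band(ls):
    # Element-wise AND across every boolean array in ls.
    n = len(ls[0])                       # all arrays are assumed to share this length
    condition = np.ones(n, dtype=int)    # start with all 1 (True)
    for i in range(n):
        for carr in ls:
            if not carr[i]:
                condition[i] = 0
    return condition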

print(is_two*1)
print(is_three*1)
print(is_five*1)
print(band(mlist))
                

[1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0]
[1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1
 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0]
[1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0
 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
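
The loop-based <code>bor</code> and <code>band</code> above work, but they visit every element in Python. As a possible alternative, NumPy's universal functions have a <code>reduce</code> method that folds the operation across a whole stack of arrays. The sketch below assumes all arrays have the same length; the names <code>or_all</code> and <code>and_all</code> are made up for illustration.

```python
import numpy as np

def or_all(bool_arrays):
    """Element-wise OR across a list of equal-length boolean arrays."""
    return np.logical_or.reduce(bool_arrays)

def and_all(bool_arrays):
    """Element-wise AND across a list of equal-length boolean arrays."""
    return np.logical_and.reduce(bool_arrays)

arr = np.arange(60)
conditions = [arr % 2 == 0, arr % 3 == 0, arr % 5 == 0]

print(arr[and_all(conditions)])   # numbers divisible by 2, 3 and 5
print(or_all(conditions) * 1)     # 1 where at least one condition holds
```

Unlike the loop versions, these return boolean arrays, so the result can be used directly as an index.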


In [13]:
#Exercise 2

import numpy as np

marray = np.zeros((4, 8))   # 4-by-8 float array; np.ndarray([4,8]) would leave the memory uninitialized

for i in range(marray.shape[0]):
    for j in range(marray.shape[1]):
        marray[i,j]=(i+1)*(j+1)

print(marray)

[[  1.   2.   3.   4.   5.   6.   7.   8.]
 [  2.   4.   6.   8.  10.  12.  14.  16.]
 [  3.   6.   9.  12.  15.  18.  21.  24.]
 [  4.   8.  12.  16.  20.  24.  28.  32.]]
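
The nested loops above fill the table one cell at a time. Since the exercise asks for a function, here is one possible sketch that builds the same table with <code>np.outer</code>, followed by a few indexing experiments. The function name <code>multiplication_table</code> is my own choice.

```python
import numpy as np

def multiplication_table(x, y):
    """Return an x-by-y array whose entry [i, j] is (i + 1) * (j + 1)."""
    rows = np.arange(1, x + 1)
    cols = np.arange(1, y + 1)
    return np.outer(rows, cols)   # every row value times every column value

table = multiplication_table(4, 8)
print(table)
print(table[2, 3])    # row 2, column 3, i.e. 3 * 4
print(table[:, 0])    # the first column
print(table[-1])      # the last row
```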


In [14]:
#Exercise 3

from matplotlib.colors import LogNorm
import matplotlib.pyplot as plt
import numpy as np

# normal distributions centered at x=0 and y=5
# (randn draws from a normal distribution; rand would give uniform numbers in [0, 1))
x = np.random.randn(100000)
y = np.random.randn(100000) + 5

plt.hist2d(x, y, bins=40, norm=LogNorm())
plt.colorbar()
plt.show()

In [72]:
#Exercise 4

import numpy as np

arr = np.arange(2,1000)   # candidates: 2 to 999
primes = []               # collect primes in a list; np.append would copy the array every time

while arr.size != 0:
    primes.append(arr[0])            # the smallest remaining candidate is prime
    is_coprime = arr%arr[0] != 0     # False for every multiple of that prime
    arr = arr[is_coprime]

primes = np.array(primes)
print(primes)

[  2   3   5   7  11  13  17  19  23  29  31  37  41  43  47  53  59  61
  67  71  73  79  83  89  97 101 103 107 109 113 127 131 137 139 149 151
 157 163 167 173 179 181 191 193 197 199 211 223 227 229 233 239 241 251
 257 263 269 271 277 281 283 293 307 311 313 317 331 337 347 349 353 359
 367 373 379 383 389 397 401 409 419 421 431 433 439 443 449 457 461 463
 467 479 487 491 499 503 509 521 523 541 547 557 563 569 571 577 587 593
 599 601 607 613 617 619 631 641 643 647 653 659 661 673 677 683 691 701
 709 719 727 733 739 743 751 757 761 769 773 787 797 809 811 821 823 827
 829 839 853 857 859 863 877 881 883 887 907 911 919 929 937 941 947 953
 967 971 977 983 991 997]
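
The solution above repeatedly filters the candidates with a modulo test. For comparison, here is a minimal sketch that is closer to the sieve as described on the Wikipedia page: keep one boolean flag per number and cross out multiples by slicing. The function name <code>primes_up_to</code> is an assumption of mine, and this is one way to write it rather than the definitive one.

```python
import numpy as np

def primes_up_to(n):
    """Return an array of all primes <= n using a boolean sieve."""
    if n < 2:
        return np.array([], dtype=int)
    is_prime = np.ones(n + 1, dtype=bool)     # one flag per number 0..n
    is_prime[:2] = False                      # 0 and 1 are not prime
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            is_prime[p * p::p] = False        # cross out multiples of p, starting at p*p
    return np.flatnonzero(is_prime)           # indices that are still flagged

print(primes_up_to(50))
```

The outer loop only needs to run up to the square root of <code>n</code>, because any composite number up to <code>n</code> has a prime factor no larger than that.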


In [11]:
import numpy as np

print(np.random.randn(100))

[-0.27631239  0.08536108  1.6901549   0.88861965 -0.03634583  1.06556724
  1.3652261  -1.49673773 -0.48553124 -0.79772738 -0.76025134  2.64103459
 -1.79407916  1.57244193  2.09373457  1.41938966  0.27255908  1.75575293
 -0.43953305 -1.69637116  0.90155888  0.27646154 -0.78171695  1.45757156
  1.59027922  0.60919107  1.72896884  0.57794291 -0.8668053   0.67252599
  1.70213761 -0.27575088  0.16542461 -1.29085528 -1.33141528  0.51227722
 -0.95795231  1.04797354  1.60070886 -0.10302578  1.08124661 -0.34744638
 -1.72399643 -0.03399855  0.15719831 -1.33365313 -0.01294738  0.16905265
 -0.6530314   2.21648997  2.14785722 -0.22134651  0.53973688 -0.87367505
  1.87807135  0.6522814   0.65630554 -0.67302292  1.63615009  0.73284346
  0.13382904  1.85915159 -1.37622445  0.97762738  0.2262762  -0.39973755
 -1.14974746 -1.49233684 -0.7465381   1.46663167  1.13855129  0.7216774
 -0.54961656 -0.37238401 -0.34403215  1.28434845  0.63561709 -0.52899821
 -0.35809656  0.75199035  0.67694122 -0.27193564  0.

In [1]:
import numpy as np

np.random.randn(7,3)

array([[-1.8037604 ,  1.9797056 , -0.11104845],
       [-2.30568632, -0.33805288, -0.17024923],
       [ 1.44654806,  0.16460886, -0.6833319 ],
       [-0.47898826, -1.08868133,  0.37095317],
       [ 0.73271976,  0.88327063,  0.6428354 ],
       [ 0.9462552 , -0.92551678,  0.15696999],
       [ 0.12011212,  0.35512276,  0.72902347]])
