## Python Libraries for Data Analysis I


# <font color='DB2464' >Numpy</font>

NumPy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, tools for working with these arrays, and useful linear algebra and random number capabilities. This tutorial is based on NumPy's official quickstart guide:
https://docs.scipy.org/doc/numpy-dev/user/quickstart.html

### <font color='A31A48'> 1. Numpy Arrays </font>

NumPy’s main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy, dimensions are called axes, and the number of axes is the rank.

* [1, 2] is an array of rank 1, because it has one axis (dimension). That axis has a length of 2.
* [ [1,2,3], [1,2,3] ] is an array of rank 2, because it has two axes (dimensions). The first axis has a length of 2 and the second has a length of 3.
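
As a quick check of the rank-2 example, the sketch below builds that array and reads its `ndim` and `shape` attributes. The variable name `nested` is only illustrative.

```python
import numpy as np

# The rank-2 example from the list above: 2 rows, 3 columns
nested = np.array([[1, 2, 3], [1, 2, 3]])

print(nested.ndim)   # number of axes
print(nested.shape)  # length of each axis, outermost first
```

`shape` lists one length per axis, so this array reports two axes with lengths 2 and 3.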

In [3]:
#first import numpy
import numpy as np

#create NumPy arrays from Python lists
npArray1 = np.array([1,2,3])
npArray2 = np.array([[1,2],[1,2]])

In [7]:
type(npArray1)

numpy.ndarray

In [9]:
# see the shape of the array (dimensions and their lengths)
npArray1.shape

(3,)

In [14]:
#number of dimensions
npArray1.ndim

1

In [17]:
#total number of elements in the array
npArray1.size

3

In [18]:
#create an array of consecutive integers from 0 up to (but not including) the given number
np.arange(15)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [23]:
#create an array within the given range [start, stop) using the given step
np.arange(10,40,5)

array([10, 15, 20, 25, 30, 35])

In [24]:
#array of zeros
np.zeros(10)

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [25]:
#array of ones
np.ones(10)

array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

In [52]:
#reshape the array into specific dimensions of given length
np.arange(15).reshape(3,5)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

If an array is too large to be printed, NumPy automatically skips the central part of the array and only prints the corners. To disable this behaviour and force NumPy to print the entire array, you can change the printing options using set_printoptions.



In [53]:
#to print all the array without truncating
np.set_printoptions(threshold=np.inf)
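
The setting above is global. If the full printout is only needed in one place, the option can also be changed temporarily with the `np.printoptions` context manager. The following is a minimal sketch, assuming a fresh session where the print options are still at their defaults; the variable names are illustrative.

```python
import sys
import numpy as np

big = np.arange(2000)          # larger than the default summarization threshold

summarized = repr(big)         # the middle of the array is replaced by '...'

# Temporarily lift the threshold; the previous options are restored on exit
with np.printoptions(threshold=sys.maxsize):
    full = repr(big)

print("..." in summarized)
print("..." in full)
```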

In [38]:
np.arange(24).reshape(2,3,4)

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

In [37]:
np.ones(80).reshape(2,2,20)

array([[[ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
          1.,  1.,  1.,  1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
          1.,  1.,  1.,  1.,  1.,  1.,  1.]],

       [[ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
          1.,  1.,  1.,  1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
          1.,  1.,  1.,  1.,  1.,  1.,  1.]]])

In [39]:
# -1 lets NumPy infer the length of that axis from the total size (20 / 4 = 5)
np.arange(20).reshape(4,-1)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)
Return evenly spaced numbers over a specified interval.

In [68]:
np.linspace(0,20,10)

array([  0.        ,   2.22222222,   4.44444444,   6.66666667,
         8.88888889,  11.11111111,  13.33333333,  15.55555556,
        17.77777778,  20.        ])

In [67]:
np.linspace(start=0,stop=20,endpoint=False,num=10)

array([  0.,   2.,   4.,   6.,   8.,  10.,  12.,  14.,  16.,  18.])
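
The `endpoint` and `retstep` parameters from the signature above are easier to see on a small interval. This sketch uses 5 points on [0, 1] purely for illustration.

```python
import numpy as np

# endpoint=True (default): 5 points, both ends included
closed = np.linspace(0, 1, 5)

# endpoint=False: 5 points, stop excluded; retstep=True also returns the spacing
half_open, step = np.linspace(0, 1, 5, endpoint=False, retstep=True)

print(closed)
print(half_open, step)
```

With the endpoint included, the spacing is (stop - start) / (num - 1); with it excluded, the spacing is (stop - start) / num.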

### <font color='4C49A2'>Exercise </font>

1. Create an array of the first 9 numbers that are divisible by 3
2. Turn that array into a 2-dimensional array where each dimension has equal length
3. Turn that 2-dimensional array into a 3-dimensional array
4. Change the dimension lengths and create a new array
5. Draw this array on paper: np.ones(12).reshape(2,2,3,1)
6. Create 20 equal intervals between 0 and 100


### <font color='A31A48'> 2. Basic Operations </font>

Arithmetic operators on arrays apply elementwise. A new array is created and filled with the result.


In [55]:
a = np.array( [20,30,40,50] )
b = np.arange( 4 )
b


array([0, 1, 2, 3])

In [56]:
c = a-b
c


array([20, 29, 38, 47])

In [58]:
b**2

array([0, 1, 4, 9])

In [59]:
10*np.sin(a)

array([ 9.12945251, -9.88031624,  7.4511316 , -2.62374854])

In [60]:
a<35

array([ True,  True, False, False], dtype=bool)

In [71]:
d=np.ones(4).reshape(2,2)
d+=5

In [72]:
d

array([[ 6.,  6.],
       [ 6.,  6.]])

Unlike in many matrix languages, the product operator * operates elementwise in NumPy arrays. The matrix product can be performed using the dot function or method:


In [73]:
A = np.array( [[1,1],
            [0,1]] )

B = np.array( [[2,0],
            [3,4]] )


In [74]:
A*B                         # elementwise product

array([[2, 0],
       [0, 4]])

In [75]:
A.dot(B)                    # matrix product

array([[5, 4],
       [3, 4]])

In [76]:
np.dot(A, B)                # another matrix product

array([[5, 4],
       [3, 4]])
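
In Python 3.5 and later, the same matrix product can also be written with the `@` operator. A minimal sketch, reusing the two matrices from the cells above:

```python
import numpy as np

A = np.array([[1, 1],
              [0, 1]])
B = np.array([[2, 0],
              [3, 4]])

product = A @ B   # matrix product via the @ operator
print(product)
print(np.array_equal(product, A.dot(B)))
```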

In [90]:
a = np.arange(1,10,2)
a

array([1, 3, 5, 7, 9])

In [89]:
# get the sum of array elements
a.sum()

25

In [91]:
#get the maximum
a.max()

9

In [92]:
#get the minimum
a.min()

1

In [95]:
b = np.array([9,4,12,1,-1,0,5])

In [98]:
#sort the array in place
b.sort()
b

array([-1,  0,  1,  4,  5,  9, 12])

In [99]:
b = np.arange(12).reshape(3,4)
b

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [100]:
b.sum(axis=0)                            # sum of each column   

array([12, 15, 18, 21])

In [101]:
b.min(axis=1)                            # min of each row

array([0, 4, 8])

In [102]:
#cumulative sum
b.cumsum(axis=1) 

array([[ 0,  1,  3,  6],
       [ 4,  9, 15, 22],
       [ 8, 17, 27, 38]])
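
In the cells above, the `axis` argument names the axis that the operation runs along, so that axis disappears from the result of a reduction such as `sum`. A small sketch on the same 3x4 array (the variable names are illustrative):

```python
import numpy as np

b = np.arange(12).reshape(3, 4)

col_sums = b.sum(axis=0)   # collapse the rows: one sum per column, shape (4,)
row_sums = b.sum(axis=1)   # collapse the columns: one sum per row, shape (3,)
total = b.sum()            # no axis: sum over every element

print(col_sums, row_sums, total)
```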

In [199]:
#flatten the array
b.ravel()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [6]:
x = np.array([1,0,2,0,3,0,4,5,6,7,8])
np.where(x == 0)[0]    # indices where x equals 0

array([1, 3, 5])

In [7]:
x > 5

array([False, False, False, False, False, False, False, False,  True,
        True,  True], dtype=bool)

In [8]:
x[x>5]

array([6, 7, 8])

In [23]:
a = np.random.randint(0,10,10)

In [24]:
a

array([9, 6, 5, 1, 9, 4, 1, 9, 5, 7])

In [25]:
np.where( (a>6) & (a<8))[0]    # indices where 6 < a < 8

array([9])
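
Because `a` above is random, the result changes from run to run. The sketch below repeats the idea on a fixed array (named `values` here for illustration) and shows both ways of using a combined condition: `np.where` for the positions and the mask itself for the values.

```python
import numpy as np

values = np.array([9, 6, 5, 1, 9, 4, 1, 9, 5, 7])

# Each comparison needs its own parentheses because & binds tighter than > and <
mask = (values > 4) & (values < 8)

indices = np.where(mask)[0]   # positions where the mask is True
selected = values[mask]       # the matching values themselves

print(indices)
print(selected)
```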

### <font color='4C49A2'>Exercise </font>

1. Create a matrix of (3,2) and another matrix of (2,3)
2. Multiply those two matrices
3. Divide the final matrix by 10
4. Do you see the decimal points of the matrix elements? If not, get them!
5. Create an array of (10,10) and find the minimum in each column
6. Create an array of (4,2) and find the cumulative sum of the columns
7. Find the indices of the values that are divisible by five in a list of 100 values from 0 to 100

### <font color='A31A48'> 3. Random Number Generation</font>

NumPy has a rich set of functions for generating random numbers efficiently: https://docs.scipy.org/doc/numpy/reference/routines.random.html

In [105]:
#Create an array of the given shape and populate it with random samples from a uniform distribution over [0, 1).
np.random.rand(3,2)

array([[ 0.52533355,  0.48592118],
       [ 0.76096692,  0.90684981],
       [ 0.33070514,  0.83957619]])

The dimensions of the returned array should all be non-negative. If no argument is given, a single Python float is returned.

In [135]:
np.random.seed(12321)

In [137]:
np.random.rand()

0.7969142046245936
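
The call to `np.random.seed` above fixes the starting state of the generator, which makes the sequence of "random" numbers repeatable. A minimal sketch of that behaviour, reusing the same seed value:

```python
import numpy as np

np.random.seed(12321)
first = np.random.rand(3)

np.random.seed(12321)      # reset the generator to the same state
second = np.random.rand(3)

print(np.array_equal(first, second))
```

Seeding is useful when results need to be reproduced exactly, for example when checking exercise solutions.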

In [108]:
#get a python list with 10 random float numbers in the interval of [0,1)
list(np.random.rand(10))

[0.73953894073896076,
 0.57084754031971385,
 0.76709386829445159,
 0.14870524696179843,
 0.38068776235985191,
 0.077025809092930886,
 0.63779377297756812,
 0.49263018439736705,
 0.35644055672724839,
 0.43318734275380366]

Generate random floats from a uniform distribution over the given half-open interval [low, high):

In [187]:
np.random.uniform(-1,10,20)

array([ 0.80300991,  0.47484882,  2.93133191,  5.20035611,  7.45020006,
        3.45995925,  2.11113123,  5.23937344,  8.81823493,  7.38534248,
        9.83390344,  0.81435432,  2.0807445 ,  1.40879874,  5.29916245,
        7.90826454,  2.27042796,  0.86472326,  4.95909645,  1.52208122])

 <b>randn</b> generates an array of shape (d0, d1, ..., dn), filled with random floats sampled from a univariate “normal” (Gaussian) distribution of mean 0 and variance 1 

In [110]:
#generate 12 random numbers from a normal distribution
np.random.randn(12)

array([-1.06225622,  0.21732287,  3.56351961,  0.53778708,  1.41255994,
        0.28597198, -0.31727111, -1.45191324,  0.91633374,  1.32990586,
       -0.6289516 ,  1.57884778])

For random samples from $N(\mu, \sigma^2)$, use:

$\sigma$ * np.random.randn(...) + $\mu$

In [111]:
#Two-by-four array of samples from N(3, 6.25):
2.5 * np.random.randn(2, 4) + 3

array([[ 4.2521598 ,  3.06326605,  2.90575311,  3.57326858],
       [ 2.6939324 ,  4.11118864, -0.04485233,  3.04668178]])
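
One way to convince yourself that the scaling formula works is to draw a large sample and compare its mean and standard deviation with the targets. This is only a sketch: the seed and the sample size are arbitrary choices.

```python
import numpy as np

np.random.seed(0)                 # arbitrary seed, only for repeatability
mu, sigma = 3.0, 2.5              # target mean and standard deviation

samples = sigma * np.random.randn(100000) + mu

print(samples.mean())   # should be close to mu
print(samples.std())    # should be close to sigma
```

Note that the multiplier is the standard deviation (2.5), not the variance (6.25).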

**Generate random integers in the half-open interval given:**

np.random.randint(low, high=None, size=None)

In [118]:
#generate only 1 integer in the given interval [0,10)
np.random.randint(0,10)

9

In [115]:
#generate 5 random integers between 0 and 10 (10 is not included!)
np.random.randint(0,10,5)

array([6, 7, 2, 0, 3])

In [119]:
#generate an array of (2,3) with random integers between 0 and 10 (10 is not included!)
np.random.randint(0,10,(2,3))

array([[1, 4, 5],
       [8, 8, 5]])

In [120]:
np.random.randint(0,10,6).reshape(2,-1)

array([[2, 8, 0],
       [9, 0, 4]])

**Generates a random sample from a given 1-D array**

numpy.random.choice(a, size=None, replace=True, p=None)

In [121]:
np.random.choice(5, 3)
#This is equivalent to np.random.randint(0,5,3)

array([1, 3, 0])

In [123]:
#choose 2 random elements from the list
np.random.choice([1,2,3,4,5,6],2)

array([3, 4])

In [129]:
#choose 6 random elements (sampling with replacement is the default)
np.random.choice([1,2,3,4,5,6],6)

array([1, 5, 4, 1, 1, 3])

In [128]:
#sample without replacement: each element is chosen at most once
np.random.choice([1,2,3,4,5,6],6,replace=False)

array([4, 3, 5, 6, 1, 2])

In [134]:
#assign a probability to each element and sample according to these probabilities
np.random.choice([1,2,3,4,5,6],4,p=[0.6,0.1,0,0.1,0.1,0.1])

array([1, 1, 2, 5])

**Draw samples from a binomial distribution.**

Samples are drawn from a binomial distribution with specified parameters, n trials and p probability of success, where n is an integer >= 0 and p is in the interval [0,1]. (n may be input as a float, but it is truncated to an integer in use.)

In [180]:
n, p = 10, .5  # number of trials, probability of success in each trial
s = np.random.binomial(n, p, 20)


In [181]:
s # each entry is the number of successes in n trials, where every trial succeeds with probability p

array([5, 7, 5, 4, 4, 4, 2, 3, 6, 7, 3, 5, 3, 5, 5, 5, 7, 6, 7, 4])
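
With many more draws, the average number of successes should settle near n * p. A rough sketch, with an arbitrary seed and sample size:

```python
import numpy as np

np.random.seed(0)            # arbitrary seed, only for repeatability
n, p = 10, 0.5

draws = np.random.binomial(n, p, 100000)

print(draws.mean())              # expected value is n * p = 5
print(draws.min(), draws.max())  # every draw lies between 0 and n
```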

### <font color='4C49A2'>Exercise </font>

1. Create 20 random float numbers between 0 and 10 and put the result into a (4,5) matrix
2. Create 100 random integer numbers smaller than 100
3. 60% of people like football, 20% basketball, 10% tennis, 10% other sports. Select 20 sport choices from the list of [football,basketball,tennis,other] and show that the number of sport choices in your list fit the given probabilities
4. A company drills 9 wild-cat oil exploration wells, each with an estimated probability of success of 0.1. All nine wells fail. What is the probability of that happening?



In [197]:
#(np.random.binomial(9,0.1,20000)  == 0).sum() / 20000.0

### <font color='A31A48'> 4. Indexing, Slicing and Iterating</font>

One-dimensional arrays can be indexed, sliced and iterated over, much like lists and other Python sequences.

In [201]:
a = np.arange(10)**3
a

array([  0,   1,   8,  27,  64, 125, 216, 343, 512, 729])

In [202]:
a[2:5]

array([ 8, 27, 64])

In [203]:
a[:6:2] = -1000    # equivalent to a[0:6:2] = -1000; from start to position 6, exclusive, set every 2nd element to -1000

In [204]:
a

array([-1000,     1, -1000,    27, -1000,   125,   216,   343,   512,   729])

In [205]:
a[ : :-1]                                 # reversed a

array([  729,   512,   343,   216,   125, -1000,    27, -1000,     1, -1000])

Multidimensional arrays can have one index per axis. These indices are given as a comma-separated tuple:

In [206]:
b = np.random.randint(1,10,(4,5))

In [207]:
b

array([[2, 9, 1, 4, 9],
       [1, 5, 8, 3, 4],
       [9, 2, 7, 1, 2],
       [4, 5, 6, 5, 1]])

In [211]:
b[3,2]

6

In [213]:
b[0:5, 1] # each row in the second column of b

array([9, 5, 2, 5])

In [214]:
b[ : ,1]                               # equivalent to the previous example

array([9, 5, 2, 5])

In [215]:
b[1:3, : ]                             # each column in the second and third row of b

array([[1, 5, 8, 3, 4],
       [9, 2, 7, 1, 2]])

In [216]:
b[-1]                                  # the last row. Equivalent to b[-1,:]

array([4, 5, 6, 5, 1])

Iterating over multidimensional arrays is done with respect to the first axis:



In [219]:
for row in b:
    print(row)

[2 9 1 4 9]
[1 5 8 3 4]
[9 2 7 1 2]
[4 5 6 5 1]


In [220]:
for row in b:
    print([element for element in row])

[2, 9, 1, 4, 9]
[1, 5, 8, 3, 4]
[9, 2, 7, 1, 2]
[4, 5, 6, 5, 1]


In [221]:
for element in b.flat:
    print(element)


2
9
1
4
9
1
5
8
3
4
9
2
7
1
2
4
5
6
5
1


Boolean array indexing: Boolean array indexing lets you pick out arbitrary elements of an array. Frequently this type of indexing is used to select the elements of an array that satisfy some condition.

In [222]:
a = np.array([[1,2], [3, 4], [5, 6]])

bool_idx = (a > 2)  # Find the elements of a that are bigger than 2;
                    # this returns a numpy array of Booleans of the same
                    # shape as a, where each slot of bool_idx tells
                    # whether that element of a is > 2.

print(bool_idx)

[[False False]
 [ True  True]
 [ True  True]]


In [223]:
# We use boolean array indexing to construct a rank 1 array
# consisting of the elements of a corresponding to the True values
# of bool_idx
print(a[bool_idx])


[3 4 5 6]


In [224]:

# We can do all of the above in a single concise statement:
print(a[a > 2])

[3 4 5 6]
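
A boolean mask can also appear on the left-hand side of an assignment, which overwrites only the selected elements. The sketch below works on a copy (named `clipped` here for illustration) so that `a` itself is left untouched.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

clipped = a.copy()          # work on a copy so a stays unchanged
clipped[clipped > 2] = 0    # overwrite every element that satisfies the condition

print(clipped)
```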
