# Programming for Data Science (Python)

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

<p style="font-family: Arial; font-size:3.75em;color:purple; font-style:bold"><br>
Introduction to numpy:
</p><br>

<p style="font-family: Arial; font-size:1.25em;color:#2462C0; font-style:bold"><br>
Package for scientific computing with Python
</p><br>

Numerical Python, or "Numpy" for short, is a foundational package on which many of the most common data science packages are built. Numpy provides high-performance multidimensional arrays, which we can use as vectors or matrices.

The key features of numpy are:

- ndarrays: n-dimensional arrays of the same data type which are fast and space-efficient.  There are a number of built-in methods for ndarrays which allow for rapid processing of data without using loops (e.g., compute the mean).
- Vectorization: applies numeric operations to whole ndarrays at once, element by element, without explicit Python loops.
- Broadcasting: a useful tool which defines implicit behavior between multi-dimensional arrays of different sizes.
- Input/Output: simplifies reading and writing of data from/to file.
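
As a rough illustration of the first two features, the sketch below uses a built-in ndarray method and a vectorized expression instead of a loop. The temperature values and the names `temps_c` and `temps_f` are made up for this example.

```python
import numpy as np

# Hypothetical sample data, used only for illustration
temps_c = np.array([18.5, 21.0, 19.5, 23.0])

# Built-in ndarray method: computes the mean without an explicit loop
mean_temp = temps_c.mean()

# Vectorization: the arithmetic is applied to every element at once
temps_f = temps_c * 9 / 5 + 32

print(mean_temp)
print(temps_f)
```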

<b>Additional Recommended Resources:</b><br>
<a href="https://docs.scipy.org/doc/numpy/reference/">Numpy Documentation</a><br>
<i>Python for Data Analysis</i> by Wes McKinney<br>
<i>Python Data Science Handbook</i> by Jake VanderPlas


<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>

Getting started with ndarray<br><br></p>

**ndarrays** are time- and space-efficient multidimensional arrays at the core of numpy. As with the data structures in Week 2, let's start by creating ndarrays using the numpy package.

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

How to create Rank 1 numpy arrays:
</p>

In [3]:
#Rank 1 means 1 dimension
import numpy as np
an_array = np.array([3,33,333])
an_array

type(an_array )


numpy.ndarray

In [5]:
# test the shape of the array we just created, it should have just one dimension (Rank 1)


#check the dimension of the array
an_array.shape

# because this is a rank 1 array, we need only one index to access each element
# access elements just like we do in a list

an_array[0]

# ndarrays are mutable, here we change an element of the array
an_array[0] = 88

an_array

array([ 88,  33, 333])

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

How to create a Rank 2 numpy array:</p>

A rank 2 **ndarray** is one with two dimensions. Notice the format below of [ [row] , [row] ]. Two-dimensional arrays are well suited to representing matrices, which come up often in data science.

In [17]:
# Create a rank 2 array
another = np.array([[11,12,13],[21,22,23]])

another

print(another.shape)

# wrapping the list in an extra pair of brackets gives a 2 dimensional (1 x 3) array
another2 = np.array([[3,33,333]])

another2

another2.shape

(2, 3)


(1, 3)
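
The difference between the rank 1 and rank 2 versions of the same values can also be checked with the `ndim` attribute. This is a small sketch; the names `rank1`, `rank2`, and `reshaped` are chosen only for this example.

```python
import numpy as np

rank1 = np.array([3, 33, 333])      # one pair of brackets: shape (3,)
rank2 = np.array([[3, 33, 333]])    # two pairs of brackets: shape (1, 3)

print(rank1.ndim, rank1.shape)
print(rank2.ndim, rank2.shape)

# reshape is another way to turn the rank 1 array into a rank 2 array
reshaped = rank1.reshape(1, 3)
print(reshaped.shape)
```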

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

There are many ways to create numpy arrays:
</p>

Here we create a number of arrays with different shapes and different pre-filled values. numpy has a number of built-in functions which help us quickly and easily create multidimensional arrays.

In [5]:
# create a 2x2 array of zeros

ex1 = np.zeros((2,2))
print(ex1)
# create a 2x2 array filled with 9.0
ex2 = np.full((2,2), 9.0)
print(ex2)

# create an array of ones
ex3 = np.ones((1,2))
print(ex3)

[[0. 0.]
 [0. 0.]]
[[9. 9.]
 [9. 9.]]
[[1. 1.]]


In [8]:
# notice that the above ndarray (ex3) is actually rank 2, it is a 1x2 array
print(ex3.shape)

# which means we need to use two indexes to access an element
ex3[0,0]

(1, 2)


1.0

In [10]:
# create an array of random floats between 0 and 1
ex4 = np.random.random((2,2))
print(ex4)  # values are drawn uniformly from the half-open interval [0, 1)

[[0.1490699  0.4258907 ]
 [0.64376609 0.54443006]]
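
A few other creation routines are worth knowing. The sketch below is not exhaustive, and the variable names are illustrative; `np.random.default_rng` is the newer random number interface and is shown here only as an alternative to `np.random.random`.

```python
import numpy as np

identity = np.eye(3)               # 3x3 identity matrix
evens = np.arange(0, 10, 2)        # start, stop (exclusive), step
grid = np.linspace(0.0, 1.0, 5)    # 5 evenly spaced values, endpoints included

# Uniform random floats in [0, 1) from a seeded generator
rng = np.random.default_rng(0)
samples = rng.random((2, 2))

print(identity)
print(evens)
print(grid)
print(samples.shape)
```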


<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>

Datatypes
<br><br></p>


In [11]:
# numpy infers the data type from the values
ex1.dtype

dtype('float64')

In [13]:
# numpy infers the data type from the values (the default integer size depends on the platform)
ex5 = np.array([1,2,3])
ex5.dtype

dtype('int32')

In [15]:
# You can also tell numpy the data type explicitly
ex5 = np.array([1,2,3], dtype = np.int64)
print(ex5)
print(ex5.dtype)

[1 2 3]
int64


In [16]:
# you can use this to force floats into integers (the fractional part is dropped, i.e. truncation toward zero)
ex6 = np.array([1.5,2.2,3.8], dtype = np.int64)
print(ex6)

[1 2 3]
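
Truncation and the floor function agree for positive values but differ for negative ones. The sketch below shows the difference with `astype`; the negative value is added here only to make the distinction visible.

```python
import numpy as np

values = np.array([1.5, 2.2, 3.8, -1.5])

# Casting to an integer type drops the fractional part (truncation toward zero)
truncated = values.astype(np.int64)

# np.floor rounds down toward negative infinity before the cast
floored = np.floor(values).astype(np.int64)

print(truncated)
print(floored)
```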


In [None]:
# you can use this to force integers into floats if you anticipate
# the values may change to floats later
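
A minimal sketch of what this cell could contain, assuming the intent is the reverse of the previous cell. The names `ex7` and `int_array` are hypothetical and do not appear in the original notebook.

```python
import numpy as np

# Integer inputs stored as floats because the dtype is given explicitly
ex7 = np.array([1, 2, 3], dtype=np.float64)
print(ex7, ex7.dtype)

# A float array keeps fractional values assigned later
ex7[0] = 1.5

# An integer array silently truncates the same assignment
int_array = np.array([1, 2, 3])
int_array[0] = 1.5

print(ex7[0], int_array[0])
```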


<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>

Array Indexing
<br><br></p>

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>
Slice indexing:
</p>

Similar to the use of slice indexing with lists and strings, we can use slice indexing to pull out sub-regions of ndarrays.

In [29]:
# Rank 2 array of shape (3, 4)

an_array1 = np.array([[11,12,13,14],[21,22,23,24],[31,32,33,34]])
print(an_array1)

#Use array slicing to get a subarray consisting of 2 rows x 2 columns.
a_slice = an_array1[:2,1:3]
print(a_slice)
#When you modify a slice, you actually modify the underlying array.

print("before: ", an_array1[0,1])
a_slice[0,0] = 1000

print("after: ", an_array1[0,1])
# To avoid that, make an explicit copy of the slice, for example with the np.array() function.
another_slice = np.array(an_array1[:2,1:3])
print(another_slice)

[[11 12 13 14]
 [21 22 23 24]
 [31 32 33 34]]
[[12 13]
 [22 23]]
before:  12
after:  1000
[[1000   13]
 [  22   23]]
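
Whether a slice shares memory with the original array can be checked directly. The sketch below uses `np.shares_memory` and the `.copy()` method as an alternative to `np.array()`; the names `base`, `view`, and `independent` are illustrative.

```python
import numpy as np

base = np.array([[11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34]])

view = base[:2, 1:3]                  # a slice is a view onto base
independent = base[:2, 1:3].copy()    # an explicit copy owns its own data

print(np.shares_memory(base, view))
print(np.shares_memory(base, independent))

# Changing the copy leaves the original untouched
independent[0, 0] = -1
print(base[0, 1])
```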


In [31]:
# You may generate an array of lower rank
  
row_rank1 = an_array1[1,:] #this will give the second row
print(row_rank1.shape)
# Or an array of the same rank as an_array1
row_rank2 = an_array1[:2,:]
print(row_rank2.shape)


#We can do the same thing for columns of an array:



(4,)
(2, 4)
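
The column case follows the same pattern as the rows. This is a sketch on a fresh array with the same starting values; the names `matrix`, `col_rank1`, and `col_rank2` are chosen for this example.

```python
import numpy as np

matrix = np.array([[11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34]])

# An integer index drops the column dimension: rank 1, shape (3,)
col_rank1 = matrix[:, 1]

# A slice keeps the column dimension: rank 2, shape (3, 1)
col_rank2 = matrix[:, 1:2]

print(col_rank1, col_rank1.shape)
print(col_rank2, col_rank2.shape)
```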


<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

Fancy indexing: array of indices
</p>

Sometimes it's useful to use an array of indexes to access or change elements.

In [32]:
# Create a new array

array2=np.array([[11,12,13],[21,22,23],[31,32,33],[41,42,43]])
print(array2)
print(array2.shape)

[[11 12 13]
 [21 22 23]
 [31 32 33]
 [41 42 43]]
(4, 3)


In [33]:
# Create an array of indices
col_indices = np.array([0,1,2,0])
print(col_indices)
row_indices=np.arange(4)
print(row_indices)

[0 1 2 0]
[0 1 2 3]


In [34]:
# Examine the pairings of row_indices and col_indices.  These are the elements we'll change next.
for row,col in zip(row_indices,col_indices):
    print(row,',', col)

0 , 0
1 , 1
2 , 2
3 , 0


In [35]:
# Select one element from each row
print(array2[row_indices,col_indices])
# for each row in row_indices, the matching entry of col_indices picks the column



[11 22 33 41]


In [36]:
# Change one element from each row using the indices selected
array2[row_indices,col_indices]+=1000
print(array2)

[[1011   12   13]
 [  21 1022   23]
 [  31   32 1033]
 [1041   42   43]]


<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>
Boolean Indexing

<br><br></p>
<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>
</p>

In [None]:
# the cells below reuse array2, the 4x3 array from the fancy indexing section


In [42]:
# create a filter which will be boolean values for whether each element meets this condition
filter = (array2 > 15)
filter

array([[ True,  True, False],
       [ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])

Notice that the filter is an ndarray of the same shape as array2. It holds True wherever the corresponding element of array2 is greater than 15 and False wherever it is 15 or less.

In [43]:
# we can now select just those elements which meet that criteria
print(array2[filter])

[1011  112   21 1122   23   31  132 1033 1041  142   43]
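
Conditions can also be combined. The sketch below runs on a small fresh array (named `data` here for illustration) and uses `&` with parentheses around each condition, which is needed because `&` binds more tightly than the comparisons.

```python
import numpy as np

data = np.array([[11, 12, 13], [21, 22, 23], [31, 32, 33]])

# The mask has the same shape as the array it was built from
mask = data > 15
print(mask.shape == data.shape)

# Boolean indexing returns the selected elements as a rank 1 array
selected = data[mask]
print(selected)

# Elements that are greater than 15 AND even
combined = data[(data > 15) & (data % 2 == 0)]
print(combined)
```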


In [None]:
# The condition can also be written inline, without a separate filter array. Here we select the even values.
array2[array2%2 == 0]

What is particularly useful is that we can actually change elements in the array applying a similar logical filter.  Let's add 100 to all the even values.

In [40]:
array2[array2%2 == 0]+=100
print(array2)

[[1011  112   13]
 [  21 1122   23]
 [  31  132 1033]
 [1041  142   43]]


<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

Arithmetic Array Operations:

</p>

In [46]:
x = np.array([[1,2],[12,22]], dtype=np.int64)  # np.int was removed from recent numpy versions
y = np.array([[21.1,22.1],[1.1,2.1]], dtype=np.float64)


In [47]:
#plus
print(x+y)
print(np.add(x,y))#corresponding items get added

[[22.1 24.1]
 [13.1 24.1]]
[[22.1 24.1]
 [13.1 24.1]]


In [48]:
# subtract
print(x-y)
print(np.subtract(x,y))

[[-20.1 -20.1]
 [ 10.9  19.9]]
[[-20.1 -20.1]
 [ 10.9  19.9]]


In [49]:
# multiply

print(x*y)
print(np.multiply(x,y))

[[21.1 44.2]
 [13.2 46.2]]
[[21.1 44.2]
 [13.2 46.2]]


In [50]:
# divide
print(x/y)
print(np.divide(x,y))

[[ 0.04739336  0.09049774]
 [10.90909091 10.47619048]]
[[ 0.04739336  0.09049774]
 [10.90909091 10.47619048]]


In [51]:
# square root

print(np.sqrt(x))  # a second positional argument would be treated as the output array and overwrite y

[[1.         1.41421356]
 [3.46410162 4.69041576]]


In [52]:
# exponent (e ** x)

print(np.exp(x))  # a second positional argument would be treated as the output array and overwrite y

[[2.71828183e+00 7.38905610e+00]
 [1.62754791e+05 3.58491285e+09]]


<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

Let's explore the efficiency of universal functions

</p>

In [55]:
# Use a loop to compute the reciprocal of each element of an array
# (this is the version without universal functions)
np.random.seed(0)

def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0/values[i]
    # return only after every element has been filled in
    return output

rarray = np.random.randint(1,10, size = 5)
compute_reciprocals(rarray)


array([0.16666667, 1.        , 0.25      , 0.25      , 0.125     ])

In [56]:
# time the loop-based version on a large array;
# looping over one million elements in Python is slow, typically on the order of seconds

big_array = np.random.randint(1,100,size = 1000000)
%time compute_reciprocals(big_array)


array([0.1       , 0.01190476, 0.04545455, ..., 0.01428571, 0.01098901,
       0.01149425])

In [57]:
# the same computation with a universal function (vectorized division)
%time (1/big_array)

Wall time: 9.02 ms


array([0.1       , 0.01190476, 0.04545455, ..., 0.01428571, 0.01098901,
       0.01149425])
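
Before comparing speed, it is worth confirming that the two approaches give the same answer. The sketch below checks this on a smaller array; the function name `compute_reciprocals_loop` and the use of `np.random.default_rng` are choices made for this example.

```python
import numpy as np

def compute_reciprocals_loop(values):
    """Loop-based reciprocal, one element at a time."""
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

rng = np.random.default_rng(0)
sample = rng.integers(1, 100, size=1000)   # integers from 1 to 99

loop_result = compute_reciprocals_loop(sample)
ufunc_result = 1.0 / sample                # vectorized division

print(np.allclose(loop_result, ufunc_result))
```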

<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>

Aggregation functions <br><br>
</p>

In [66]:
# setup a random 2 x 4 matrix

array3 = np.random.randn(2,4)
array3

array([[ 0.91052841,  1.36337822,  0.03155953, -0.46966792],
       [ 1.0923219 , -0.55155027,  0.78930459, -0.73209118]])

In [64]:
# read a list of numbers and count how many are even and how many are odd
def evenodd():
    entered = input("Enter the list of numbers you want to know are even or odd: ")
    # input() returns a string, so split it and convert each piece to an int
    numbers = [int(item) for item in entered.split(",")]
    print(numbers)
    even = 0
    odd = 0
    for number in numbers:
        if number % 2 == 0:
            even += 1
        else:
            odd += 1
    # even and odd are already counts, so no len() is needed
    print("There are " + str(even) + " even numbers and " + str(odd) + " odd numbers")


evenodd()

Enter the list of numbers you want to know are even or odd: 1,2,3,4,5,6,7,8,9
[1, 2, 3, 4, 5, 6, 7, 8, 9]
There are 4 even numbers and 5 odd numbers

In [62]:
numbers = [1,2,3,4,5,45,67,69]
count_even = 0
count_odd = 0
for i in numbers:
    if i%2 == 0:
        count_even+=1
    else:
        count_odd+=1
print(count_even)
print(count_odd)


2
6
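
The same count can be written without a loop by aggregating over a boolean array. This is a sketch of one possible approach, using `np.count_nonzero` on the result of the even test.

```python
import numpy as np

numbers = np.array([1, 2, 3, 4, 5, 45, 67, 69])

# numbers % 2 == 0 is a boolean array; count_nonzero counts the True entries
count_even = np.count_nonzero(numbers % 2 == 0)
count_odd = numbers.size - count_even

print(count_even)
print(count_odd)
```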


In [67]:
# compute the mean for all elements
array3.mean()

0.30422290864076057

In [71]:
# compute the means by row
# axis = 0 collapses the rows (one result per column); axis = 1 collapses the columns (one result per row)

array3.mean(axis = 1)  # averages across the columns of each row, giving one mean per row

array([0.45894956, 0.14949626])

In [73]:
# compute the means by column
array3.mean(axis = 0)

array([ 1.00142515,  0.40591398,  0.41043206, -0.60087955])
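
The axis argument is easier to follow on a small array with known values. In this sketch the array `m` is made up for illustration; the result shapes show which dimension was collapsed.

```python
import numpy as np

m = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0]])   # shape (2, 4)

col_means = m.mean(axis=0)   # collapses the 2 rows: one mean per column, shape (4,)
row_means = m.mean(axis=1)   # collapses the 4 columns: one mean per row, shape (2,)
overall = m.mean()           # no axis: mean of all elements

print(col_means)
print(row_means)
print(overall)
```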

In [74]:
# sum all the elements
array3.sum()

2.4337832691260846

In [76]:
# compute the medians
np.median(array3, axis = 1) #median by row

array([0.47104397, 0.11887716])

In [77]:
#sorting
# create a 10 element array of randoms
array4 = np.random.randn(10)
print(array4)
array4.sort()
print(array4)


[-0.81585044 -0.09495964 -0.11991924 -1.27360037 -0.20817263 -1.00912111
 -1.60382836 -1.22308313  0.42019376  0.49824839]
[-1.60382836 -1.27360037 -1.22308313 -1.00912111 -0.81585044 -0.20817263
 -0.11991924 -0.09495964  0.42019376  0.49824839]


In [78]:
#Find unique elements
array5 = np.array([1,2,3,45,3,4,2,1])
np.unique(array5)

array([ 1,  2,  3,  4, 45])

<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>

Broadcasting:
<br><br>
</p>

Introduction to broadcasting. <br>
For more details, please see: <br>
https://docs.scipy.org/doc/numpy-1.10.1/user/basics.broadcasting.html

The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations.

Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.

Two dimensions are compatible when:

- they are equal, or
- one of them is 1

Rules:

- Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
- Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
- Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
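
The three rules can be traced on small arrays. The sketch below is illustrative: the names `a`, `b`, `c`, and `d` are arbitrary, and the comments show how each shape is padded or stretched.

```python
import numpy as np

a = np.zeros((4, 3))

b = np.array([1, 0, 2])          # (3,)   -> rule 1: (1, 3) -> rule 2: (4, 3)
c = np.arange(4).reshape(4, 1)   # (4, 1) -> rule 2: (4, 3)
d = np.array([1, 2, 3, 4])       # (4,)   -> rule 1: (1, 4) -> rule 3: 3 vs 4, error

print((a + b).shape)
print((a + c).shape)

# Rule 3: the trailing sizes are 3 and 4, and neither is 1
failed = False
try:
    a + d
except ValueError as err:
    failed = True
    print("cannot broadcast:", err)
```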


In [79]:
#Create a 4X3 array
start = np.zeros((4,3))
start

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [80]:
# create a rank 1 ndarray with 3 values
add_rows = np.array([1,0,2])
print(add_rows.shape)

(3,)


In [81]:
#Add together
y = start + add_rows
print(y)

[[1. 0. 2.]
 [1. 0. 2.]
 [1. 0. 2.]
 [1. 0. 2.]]


In [82]:
# create an ndarray which is 4 x 1 to broadcast across columns
add_cols = np.array([[0,1,2,3]])
add_cols = add_cols.transpose()
print(add_cols)

[[0]
 [1]
 [2]
 [3]]


In [83]:
# add to each column of 'start' using broadcasting
y = start+add_cols
y

array([[0., 0., 0.],
       [1., 1., 1.],
       [2., 2., 2.],
       [3., 3., 3.]])

In [91]:
# this will just broadcast in both dimensions
add_scalar = np.array([1])
print(start+add_scalar)

[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]


In [87]:
# create our 3x4 matrix
start1 = np.zeros((3,4))

In [84]:
# create our (4,) array 
addA = np.array([1,2,3,4])

In [89]:
# add the two together using broadcasting
b = start1 + addA
print(b)

[[1. 2. 3. 4.]
 [1. 2. 3. 4.]
 [1. 2. 3. 4.]]


In [90]:
#Application of broadcasting - centering an array

# centering a variable means subtracting its mean from it

X = np.random.random((10,3))
Xmean = X.mean(axis = 0)
print(Xmean)
X_centered = X-Xmean
print(X_centered)

[0.6888451  0.42462478 0.60378153]
[[ 0.30220871 -0.14324592 -0.58612966]
 [-0.0328593   0.19302938 -0.4240501 ]
 [ 0.15865028  0.04357788  0.37080955]
 [-0.08581491 -0.11252546  0.02627851]
 [ 0.25108006 -0.05571031  0.26354846]
 [-0.23775539  0.25351349  0.2185732 ]
 [-0.47146868 -0.21004024  0.15309102]
 [ 0.2031229   0.49359209  0.25567847]
 [-0.18165762 -0.31801883  0.03716178]
 [ 0.09449395 -0.14417208 -0.31496123]]
