# Introduction to Data Science with Python

### The limitations of the built-in Python list

Lists are flexible, dynamic Python objects that do their job quite well. But they do not support some mathematical operations in an intuitive way. Consider the summation of two lists, $l_1$ and $l_2$. The + operator concatenates them rather than adding them element by element:

In [3]:
l_1 = [1,2,3] # A basic list of ints
l_2 = [3,4,5] # Another basic list of ints
print('Sum of Lists: %s' % (l_1 + l_2))

Sum of Lists: [1, 2, 3, 3, 4, 5]


In fact, taking the difference of two lists raises an error stating that the operation isn't supported at all:

In [4]:
print('Difference of Lists: %s' % (l_1 - l_2))

TypeError: unsupported operand type(s) for -: 'list' and 'list'

If we wanted to sum lists elementwise, we could write our own function that does the job entirely in pure Python:

In [5]:
def add_lists(list_1, list_2):
    assert(len(list_1) == len(list_2)) # Lists must be the same length for this to make sense
    length = len(list_1) # can be either
    
    # Main loop
    ret = []
    for i in range(length):
        ret.append(list_1[i] + list_2[i])
    return ret

In [6]:
ans = add_lists(l_1, l_2)
print('Sum of Lists: %s' % ans)

Sum of Lists: [4, 6, 8]
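The same idea can be generalized so that one function covers several operators. The following is a minimal sketch, not part of the original notebook; the helper name `elementwise` is hypothetical, and it relies only on the standard library's `operator` module.

```python
import operator

def elementwise(op, list_1, list_2):
    """Apply the binary function `op` to corresponding elements of two lists."""
    if len(list_1) != len(list_2):
        raise ValueError('Lists must be the same length')
    # zip pairs up the i-th elements, so no explicit indexing is needed
    return [op(a, b) for a, b in zip(list_1, list_2)]

sums = elementwise(operator.add, [1, 2, 3], [3, 4, 5])
diffs = elementwise(operator.sub, [1, 2, 3], [3, 4, 5])
print('Sum of Lists: %s' % sums)
print('Difference of Lists: %s' % diffs)
```

Even with this shortcut, every call still loops in interpreted Python, and nested lists would need further special handling.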


We would have to write a similar function for every operator that we might want in list arithmetic. This is time consuming and inefficient. Moreover, once the lists in question become nested, mimicking the behavior of true matrices, the problem gets worse. Complicated indexing is necessary just to allow for the most basic matrix operations common throughout science and engineering. Imagine writing a matrix multiplication function using Python syntax in a general way, such that it returns a matrix-matrix or matrix-vector product:

\begin{align}
(n \times x) \times (x \times m) \rightarrow (n \times m)
\end{align}

\begin{align}
\begin{bmatrix}
c_{0,0} & ... & c_{0,m} \\
\vdots & \ddots & \vdots \\
c_{n,0} & ... & c_{n,m}
\end{bmatrix}
=
\begin{bmatrix}
a_{0,0} & ... & a_{0,x} \\
\vdots & \ddots & \vdots \\
a_{n,0} & ... & a_{n,x}
\end{bmatrix}
\begin{bmatrix}
b_{0,0} & ... & b_{0,m} \\
\vdots & \ddots & \vdots \\
b_{x,0} & ... & b_{x,m}
\end{bmatrix}
\end{align}

Let us instantiate a matrix $\mathcal{M}$ and a vector $\vec{v}$ and write a function that does the multiplication ourselves.


In [7]:
def matrix_multiply(A, B):
    ret = [[0 for _ in range(len(B[0]))] for _ in range(len(A))] # zero-filled result: len(A) rows by len(B[0]) columns
    
    inner_dim = len(A[0])
    n_dim = len(ret)
    m_dim = len(ret[0])
    
    
    for i in range(n_dim):
        for j in range(m_dim):
            element = 0
            for x in range(inner_dim):
                element += A[i][x] * B[x][j]
            ret[i][j] = element
    
    return ret
M = [[0,1,0],[0,2,0],[0,3,0]]
v = [[1],[2],[3]]

In [8]:
print(matrix_multiply(M, v))

[[2], [4], [6]]


We needed a list comprehension just to allocate the result, plus a nontrivial nested list structure, to perform the necessary computation. Moreover, the function contains a triple for-loop that runs entirely in interpreted Python, so this type of code doesn't scale well to large matrix products. Lastly, writing functions like this for every operation in linear algebra, especially when matrices become tensors, is impractical. NumPy fills this gap with a large collection of fast, vectorized functions.
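For comparison, here is a brief sketch of the same matrix-vector product using NumPy's `np.dot`, with the values of `M` and `v` taken from the cell above. It is meant only as a preview of the array class introduced in the next section.

```python
import numpy as np

M = np.array([[0, 1, 0], [0, 2, 0], [0, 3, 0]])
v = np.array([[1], [2], [3]])

# (3 x 3) times (3 x 1) gives a (3 x 1) result
product = np.dot(M, v)
print(product.tolist())
```

The result matches the output of the hand-written `matrix_multiply`, without any explicit loops.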

### NumPy and the NDArray

In [9]:
import numpy as np # By convention, this is how NumPy is used

NumPy is based around a class called the $\textit{NDArray}$ (numpy.ndarray in code), which is a flexible vector / matrix class that implements the intuitive matrix and vector arithmetic lacking in basic Python. Let's start by creating some NDArrays:

In [10]:
l_1 = np.array(l_1) # Cast a list explicitly to a NumPy array
l_2 = np.array(l_2)
print('Here is l_1 now: %s' % l_1)
print('Here is the type: %s' % type(l_1))

Here is l_1 now: [1 2 3]
Here is the type: <type 'numpy.ndarray'>


In [11]:
# We can do some intuitive operations now
ans = l_1 + l_2
print('%s + %s = %s' % (l_1, l_2, ans))
ans = l_1 - l_2
print('%s - %s = %s' % (l_1, l_2, ans))

print('\n')

# They support the 'broadcasting' of scalars
print('l_1 + 1 = %s' % (l_1 + 1))

[1 2 3] + [3 4 5] = [4 6 8]
[1 2 3] - [3 4 5] = [-2 -2 -2]


l_1 + 1 = [2 3 4]
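Multiplication follows the same elementwise convention, which is worth contrasting with the inner product. A small sketch, using fresh names `a` and `b` (not from the original notebook) that hold the same values as the arrays above:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([3, 4, 5])

elementwise_product = a * b    # multiplies matching elements
inner_product = np.dot(a, b)   # sums the elementwise products into one scalar
scaled = 2 * a                 # the scalar 2 is broadcast to every element

print(elementwise_product, inner_product, scaled)
```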


### Multidimensional Arrays

NumPy seamlessly supports multidimensional arrays, matrices, and tensors of arbitrary dimension without nesting NDArrays. NDArrays themselves are flexible and extensible and may be defined with such dimensions, with a rich API of common functions to facilitate their use. Let's start by building a two-dimensional 3x3 matrix by conversion from a nested group of core Python lists $M = [l_0, l_1, l_2]$:

In [12]:
l_0 = [0,1,2]
l_1 = [3,4,5]
l_2 = [6,7,8]
M = [l_0, l_1, l_2]

print('Nested List Structure: %s' % M)

M = np.array(M) # cast directly. Dimensions inferred

print('Nested List Structure after conversion: \n %s' % M)
print('And its type: %s' % type(M))

Nested List Structure: [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
Nested List Structure after conversion: 
 [[0 1 2]
 [3 4 5]
 [6 7 8]]
And its type: <type 'numpy.ndarray'>


Now, we can use the more intuitive, MATLAB-like indexing syntax to assign and access elements of multidimensional NDArrays:

In [13]:
print('Middle element %s' % M[1,1])

M[1,2] = 1.5 # A float

print('After Assignment: \n %s' % M)

Middle element 4
After Assignment: 
 [[0 1 2]
 [3 4 1]
 [6 7 8]]


Notice how we ended up with a 1 in the target element's place. This is a data type issue: M holds integers, so the 1.5 was truncated to fit. The .dtype attribute is available on all NDArrays, as is the .astype() method for casting between data types:

In [14]:
t = M.dtype
print('Data type of M: %s' % t)

M = M.astype(np.float64)

t = M.dtype
print('Data type of M: %s' % t)

M[1,2] = 1.5 # A float

print('After Assignment: \n %s' % M)

Data type of M: int64
Data type of M: float64
After Assignment: 
 [[ 0.   1.   2. ]
 [ 3.   4.   1.5]
 [ 6.   7.   8. ]]
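The data type can also be fixed when the array is created, which avoids the cast afterwards. The sketch below uses the `dtype` argument of `np.array`; the variable names `ints` and `floats` are illustrative only.

```python
import numpy as np

ints = np.array([[0, 1, 2], [3, 4, 5]])                       # integer dtype inferred from the data
floats = np.array([[0, 1, 2], [3, 4, 5]], dtype=np.float64)   # dtype requested explicitly

ints[1, 2] = 1.5     # fractional part is dropped to fit the integer dtype
floats[1, 2] = 1.5   # stored as given

print(ints.dtype, ints[1, 2])
print(floats.dtype, floats[1, 2])
```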


### The Shape

The behavior and properties of an NDArray often depend sensitively on the $\textit{shape}$ of the NDArray itself. The shape of an array can be found by reading its .shape attribute, which is a tuple containing the array's dimensions:

In [15]:
print('Our 2D matrix M: \n %s' % M)
print('Shape:')
print(M.shape)

# As long as the number of elements remains fixed, we can reshape NDArrays at will:
print('Reshaped:')
print(M.reshape( (9,1) ))

# Notice that (9,1) is not the same as (1,9) !!!

print('Alternatively...')
print(M.reshape( (1,9) ))

Our 2D matrix M: 
 [[ 0.   1.   2. ]
 [ 3.   4.   1.5]
 [ 6.   7.   8. ]]
Shape:
(3, 3)
Reshaped:
[[ 0. ]
 [ 1. ]
 [ 2. ]
 [ 3. ]
 [ 4. ]
 [ 1.5]
 [ 6. ]
 [ 7. ]
 [ 8. ]]
Alternatively...
[[ 0.   1.   2.   3.   4.   1.5  6.   7.   8. ]]
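Two details of `reshape` are worth a quick sketch: one dimension may be given as -1 so that NumPy infers it, and a shape whose element count does not match raises a `ValueError`. The flag `reshape_failed` below is just an illustrative name.

```python
import numpy as np

M = np.arange(9)

column = M.reshape((9, 1))
square = M.reshape((3, -1))   # -1 asks NumPy to infer this dimension (here, 3)

try:
    M.reshape((2, 5))         # 10 slots cannot hold 9 elements
    reshape_failed = False
except ValueError:
    reshape_failed = True

print(column.shape, square.shape, reshape_failed)
```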


With the concept of shape firmly in mind, let's go over some useful functions within the NDArray API.

In [16]:
M = np.zeros(9)
print('A linear array of zeros')
print(M)

M = np.zeros((3,3))
print('A square array of zeros')
print(M)

print('A 4x4 array of consecutive integers')

M = np.arange(16)
print('M - linear')
print(M)
M = np.arange(16).reshape((4,4)) # Can build this array rapidly
print('M - reshaped')
print(M)

A linear array of zeros
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.]
A square array of zeros
[[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]]
A 4x4 array of consecutive integers
M - linear
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
M - reshaped
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]


Multidimensional NDArrays support MATLAB-like indexing, with which you can read from or assign to portions of the multidimensional NDArray very easily. 

In [17]:
M = np.zeros((4,4))
print('Before assignment:')
print(M)

M[:,0] = 1
print('After assignment:')
print(M)

print('After assignment again:')
M[2,:] = 5
print(M)

M = M*0 # Reset the matrix 

# We can even do more complicated stuff

M[:, 1:3] = 2
print('A more complex assignment:')
print(M)

x = np.arange(4)
M = M*0

M[:,0] = x
M[2,:] = x-1

print('Assignment using other NDArrays:')
print(M)

Before assignment:
[[ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]]
After assignment:
[[ 1.  0.  0.  0.]
 [ 1.  0.  0.  0.]
 [ 1.  0.  0.  0.]
 [ 1.  0.  0.  0.]]
After assignment again:
[[ 1.  0.  0.  0.]
 [ 1.  0.  0.  0.]
 [ 5.  5.  5.  5.]
 [ 1.  0.  0.  0.]]
A more complex assignment:
[[ 0.  2.  2.  0.]
 [ 0.  2.  2.  0.]
 [ 0.  2.  2.  0.]
 [ 0.  2.  2.  0.]]
Assignment using other NDArrays:
[[ 0.  0.  0.  0.]
 [ 1.  0.  0.  0.]
 [-1.  0.  1.  2.]
 [ 3.  0.  0.  0.]]
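Slices can be read as well as assigned to, and a basic slice is a view onto the original data rather than a copy. A short sketch (the names `block` and `independent` are illustrative):

```python
import numpy as np

M = np.arange(16).reshape((4, 4))

block = M[1:3, 1:3]                  # a view onto the central 2x2 block of M
independent = M[1:3, 1:3].copy()     # an independent copy of the same block

block[0, 0] = -1         # writes through to M[1, 1]
independent[1, 1] = 99   # leaves M untouched

print(M)
```

Calling `.copy()` is the usual way to avoid modifying the parent array by accident.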


As long as the shapes of NDArrays are 'compatible', they can be multiplied elementwise, broadcast against one another, used in inner products, and more. 'Compatible' in this context can mean compatible in the linear algebraic sense, i.e. for inner products and other matrix multiplication, or simply sharing a dimension in such a manner that broadcasting 'makes sense'. Here is an example in which a (4,1) column is broadcast across the columns of a (4,4) matrix:

In [18]:
M = M*0 # re-initialize
v = np.arange(4).reshape((4,1))

print('M Before:')
print(M)
print('v Before:')
print(v)

M = M + v
print('M After:')
print(M)

M Before:
[[ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [-0.  0.  0.  0.]
 [ 0.  0.  0.  0.]]
v Before:
[[0]
 [1]
 [2]
 [3]]
M After:
[[ 0.  0.  0.  0.]
 [ 1.  1.  1.  1.]
 [ 2.  2.  2.  2.]
 [ 3.  3.  3.  3.]]
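Both senses of 'compatible' can be sketched briefly. Under NumPy's broadcasting rules, shapes are compared from the trailing dimension backwards, and each pair of dimensions must be equal or one of them must be 1. The variable names below are illustrative.

```python
import numpy as np

col = np.arange(4).reshape((4, 1))   # shape (4, 1)
row = np.arange(4)                   # shape (4,)

table = col + row                    # (4, 1) and (4,) broadcast to (4, 4)

try:
    np.zeros((4, 4)) + np.arange(3)  # trailing dimensions 4 and 3 do not match
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

product = np.dot(np.ones((4, 4)), col)   # (4 x 4) times (4 x 1) gives (4 x 1)

print(table)
print(broadcast_failed, product.shape)
```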


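Also of note is np.where. Given only a condition, it returns a tuple of index arrays, one per dimension, locating the elements that satisfy it: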

In [19]:
np.where(M == 1) # Find the (row, column) indices of elements that satisfy a condition

(array([1, 1, 1, 1]), array([0, 1, 2, 3]))

In [20]:
M[np.where(M > 0)] = 1.5
print(M)

[[ 0.   0.   0.   0. ]
 [ 1.5  1.5  1.5  1.5]
 [ 1.5  1.5  1.5  1.5]
 [ 1.5  1.5  1.5  1.5]]
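The same assignment can be written with a boolean mask directly, and `np.where` also has a three-argument form that chooses between two values per element. A sketch, rebuilding a matrix with the same values as `M` had before the assignment:

```python
import numpy as np

M = np.zeros((4, 4)) + np.arange(4).reshape((4, 1))

mask = M > 0                       # boolean array with the same shape as M
M[mask] = 1.5                      # equivalent to M[np.where(M > 0)] = 1.5

labels = np.where(M > 0, 1, -1)    # 1 where the condition holds, -1 elsewhere

print(M)
print(labels)
```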


# The Pandas Data Analysis Library

Pandas is a flexible data analysis library built on top of NumPy that is excellent for working with tabular data. It is currently the de facto standard for Python-based data analysis, and fluency in Pandas will do wonders for your productivity and frankly your resume. It is one of the fastest ways to get from raw data to an answer.

### Tabular data structures

The central object of study in Pandas is the DataFrame, which is a tabular data structure with rows and columns like an Excel spreadsheet. The first point of discussion is the creation of DataFrames, both from native Python dictionaries and from text files through the Pandas I/O system.

In [21]:
import pandas as pd

In [32]:
names = ['Alice',
         'Bob',
         'James',
         'Beth', 
         'John', 
         'Sally',
         'Richard', 
         'Lauren',
         'Brandon', 
         'Sabrina']

ages = np.random.randint(18,35, len(names)) # Random ages from 18 to 34 (the upper bound, 35, is exclusive)
my_dict = {'names':names, 'ages':ages}
print(my_dict)

{'ages': array([31, 32, 33, 29, 25, 34, 25, 33, 27, 34]), 'names': ['Alice', 'Bob', 'James', 'Beth', 'John', 'Sally', 'Richard', 'Lauren', 'Brandon', 'Sabrina']}


Let's convert our not-so-useful-for-analysis dict into a Pandas DataFrame. We can use the from_dict function to do this easily:

In [47]:
df = pd.DataFrame.from_dict(my_dict)
print('Resulting type: %s' % type(df))
df.head(10) # Displays the first 10 rows of a dataframe

Resulting type: <class 'pandas.core.frame.DataFrame'>


   ages    names
0    31    Alice
1    32      Bob
2    33    James
3    29     Beth
4    25     John
5    34    Sally
6    25  Richard
7    33   Lauren
8    27  Brandon
9    34  Sabrina
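DataFrames can also be built from text through `pd.read_csv`. The sketch below is an illustration only: the CSV contents are made up, and `io.StringIO` stands in for a real file on disk so that the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical file contents; io.StringIO lets read_csv treat a string like a file
csv_text = """names,ages
Alice,31
Bob,32
James,33
"""

df_from_text = pd.read_csv(io.StringIO(csv_text))
print(df_from_text.shape)
print(list(df_from_text.columns))
```

With a real file, the path would be passed to `pd.read_csv` in place of the `StringIO` object.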


In [43]:
# The dataframe has a shape property, just like a NumPy matrix. 

print('Dataframe shape:')
print(df.shape)

# It also has an overall length property corresponding to the number of rows.

print('Dataframe length: %s' % len(df))

Dataframe shape:
(10, 2)
Dataframe length: 10


You can directly select a column of a dataframe just like you would a key of a dict. The result is a Pandas 'Series' object:

In [46]:
print('Type of a column: %s' % type(df['ages']))
print(df['ages'])

# A Series wraps a NumPy array and can be indexed by label; with the default integer index, label 8 is the ninth row

print(df['ages'][8])

Type of a column: <class 'pandas.core.series.Series'>
0    31
1    32
2    33
3    29
4    25
5    34
6    25
7    33
8    27
9    34
Name: ages, dtype: int64
27


Rows of a Pandas DataFrame can be pulled out individually as well. You will notice an additional leftmost column in the DataFrame output above: this is the $\textit{index}$. It is automatically generated as a row number, but can be replaced by a column of your choice using the DataFrame.set_index(colname) method, which returns a new DataFrame. We can use the index with .loc to access particular Pandas $\textit{rows}$, which are also Series objects:

In [58]:
myrow = df.loc[0]
print('Type of myrow: %s' % type(myrow))

print(myrow)

Type of myrow: <class 'pandas.core.series.Series'>
ages        31
names    Alice
Name: 0, dtype: object


In [59]:
# A series can support dict-like features when it's used as a DataFrame row. 
myrow.keys()

Index([u'ages', u'names'], dtype='object')

In [61]:
print(myrow['ages'])

31
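A brief sketch of `set_index` followed by label-based lookup with `.loc`. The small `people` DataFrame is a stand-in built for this example, not the `df` from the cells above.

```python
import pandas as pd

people = pd.DataFrame({'names': ['Alice', 'Bob', 'James'],
                       'ages': [31, 32, 33]})

by_name = people.set_index('names')    # returns a new DataFrame; people is unchanged

bob = by_name.loc['Bob']               # the row labelled 'Bob', as a Series
bob_age = by_name.loc['Bob', 'ages']   # a single cell: row label, then column label

print(bob)
print(bob_age)
```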


That said, if you use slicing to grab a section of a dataframe, you'll end up with another DataFrame:

In [62]:
df[2:5]

   ages  names
2    33  James
3    29   Beth
4    25   John


The above syntax will work, but be aware that passing a single int will be interpreted as a column key, not a row position:

In [None]:
df[2] # Raises a KeyError, because 2 is looked up as a column label
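To select rows by position without ambiguity, `.iloc` is one option. A minimal sketch on a stand-in DataFrame named `people` (not the `df` above):

```python
import pandas as pd

people = pd.DataFrame({'names': ['Alice', 'Bob', 'James', 'Beth'],
                       'ages': [31, 32, 33, 29]})

third_row = people.iloc[2]     # a single row by position, returned as a Series
middle = people.iloc[1:3]      # rows at positions 1 and 2, returned as a DataFrame

try:
    people[2]                  # looked up as a column label, which does not exist
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(third_row['names'], len(middle), raised_key_error)
```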

It is possible to spend a whole week simply exploring the built-in functions supported by DataFrames in Pandas. Here, however, we will highlight just a few that might be useful, to give you an idea of what's possible out of the box with Pandas:

In [66]:
df.describe() # Collect summary statistics in one line

        ages
count   10.0
mean    30.3
std     3.560587
min     25.0
25%     27.5
50%     31.5
75%     33.0
max     34.0
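Each of these statistics is also available individually as a Series method. The sketch below rebuilds the ages column from the values printed earlier, so that it does not depend on the random draw:

```python
import pandas as pd

# Ages copied from the DataFrame output shown earlier in this section
ages = pd.Series([31, 32, 33, 29, 25, 34, 25, 33, 27, 34], name='ages')

mean_age = ages.mean()
median_age = ages.median()            # the 50% row of describe()
age_range = ages.max() - ages.min()

print(mean_age, median_age, age_range)
```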


In [67]:
df.sort_values('ages') # Sort rows by age; older Pandas versions spelled this df.sort('ages'), which has since been removed


   ages    names
4    25     John
6    25  Richard
8    27  Brandon
3    29     Beth
0    31    Alice
1    32      Bob
2    33    James
7    33   Lauren
5    34    Sally
9    34  Sabrina
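Sorting can also be descending and can break ties with a second column. A small sketch using `sort_values` on a stand-in DataFrame named `people`:

```python
import pandas as pd

people = pd.DataFrame({'names': ['Alice', 'Bob', 'James', 'Lauren'],
                       'ages': [31, 32, 33, 33]})

# Oldest first; equal ages are ordered alphabetically by name
oldest_first = people.sort_values(['ages', 'names'], ascending=[False, True])

print(oldest_first)
```

Like most DataFrame methods, `sort_values` returns a new DataFrame and leaves the original order untouched.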


In [68]:
df['ages'] > 29 # Boolean Series: True wherever the condition holds

0     True
1     True
2     True
3    False
4    False
5     True
6    False
7     True
8    False
9     True
Name: ages, dtype: bool

In [69]:
df[df['ages'] > 29]

   ages    names
0    31    Alice
1    32      Bob
2    33    James
5    34    Sally
7    33   Lauren
9    34  Sabrina
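Boolean Series can be combined with `&` (and) and `|` (or) to express compound filters; each condition needs its own parentheses. The sketch below rebuilds the DataFrame from the values printed earlier so that it does not depend on the random draw.

```python
import pandas as pd

# Names and ages copied from the DataFrame output shown earlier
df = pd.DataFrame({
    'names': ['Alice', 'Bob', 'James', 'Beth', 'John',
              'Sally', 'Richard', 'Lauren', 'Brandon', 'Sabrina'],
    'ages': [31, 32, 33, 29, 25, 34, 25, 33, 27, 34],
})

older_s = df[(df['ages'] > 29) & (df['names'].str.startswith('S'))]
young_or_old = df[(df['ages'] < 26) | (df['ages'] > 33)]
names_only = df.loc[df['ages'] > 29, 'names']   # filter rows and pick one column

print(older_s)
print(young_or_old)
print(names_only.tolist())
```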
